[Python] NLP: TreebankWordTokenizer, POS Tag
DS-9VM
2022. 9. 17. 06:38
Using the nltk module, tokenize words more accurately with TreebankWordTokenizer and tag parts of speech with the pos_tag function.
import nltk
from nltk.tag import pos_tag
from nltk.tokenize import TreebankWordTokenizer
# Download the data required for POS tagging.
# Newer NLTK releases may ask for 'averaged_perceptron_tagger_eng' instead.
nltk.download('averaged_perceptron_tagger')
text = "some teacher don't know how to teach it in then way that students understand it. \n\r that causes students to fail and they may repeat the class. The X.G Compay"
tb_worktokenizer = TreebankWordTokenizer()
tokenized = tb_worktokenizer.tokenize(text)  # split the text into word tokens
tokenized_postag = pos_tag(tokenized)  # pair each token with a Penn Treebank tag
print("Word tokens:", tokenized)
print("POS tags:", tokenized_postag)
# Example output:
'''
Word tokens: ['some', 'teacher', 'do', "n't", 'know', 'how', 'to', 'teach', 'it', 'in', 'then', 'way', 'that', 'students', 'understand', 'it.', 'that', 'causes', 'students', 'to', 'fail', 'and', 'they', 'may', 'repeat', 'the', 'class.', 'The', 'X.G', 'Compay']
POS tags: [('some', 'DT'), ('teacher', 'NN'), ('do', 'VBP'), ("n't", 'RB'), ('know', 'VB'), ('how', 'WRB'), ('to', 'TO'), ('teach', 'VB'), ('it', 'PRP'), ('in', 'IN'), ('then', 'RB'), ('way', 'NN'), ('that', 'IN'), ('students', 'NNS'), ('understand', 'VBP'), ('it.', 'NN'), ('that', 'WDT'), ('causes', 'VBZ'), ('students', 'NNS'), ('to', 'TO'), ('fail', 'VB'), ('and', 'CC'), ('they', 'PRP'), ('may', 'MD'), ('repeat', 'VB'), ('the', 'DT'), ('class.', 'NN'), ('The', 'DT'), ('X.G', 'NNP'), ('Compay', 'NNP')]
'''
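In the output above, 'it.' and 'class.' keep their trailing period and are both tagged NN. TreebankWordTokenizer only separates a period at the very end of the string it receives, so it is usually run on one sentence at a time. The sketch below shows one possible workaround. The helper name tokenize_per_sentence and its regex-based sentence split are assumptions made for illustration, not part of nltk; nltk's own sent_tokenize is the more common choice but needs extra data downloaded.

```python
import re
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

def tokenize_per_sentence(text):
    """Hypothetical helper: naive sentence split, then Treebank tokenization."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    # Abbreviations such as "Mr. Kim" would be split incorrectly by this rule.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tokens = []
    for sentence in sentences:
        # Each sentence is tokenized on its own, so its final period
        # sits at the end of the input string and gets separated.
        tokens.extend(tokenizer.tokenize(sentence))
    return tokens

sample = "It is fine. I agree."
print(tokenizer.tokenize(sample))     # whole text at once
print(tokenize_per_sentence(sample))  # sentence by sentence
```

Tokenizing the whole text at once should leave 'fine.' as a single token, while the per-sentence version separates every sentence-final period. A token like 'X.G' is left untouched because its period is not followed by whitespace.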