
[Python] NLP: TreebankWordTokenizer, POS Tag

DS-9VM 2022. 9. 17. 06:38

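Using the nltk module, this post tokenizes text into words with TreebankWordTokenizer, which handles cases such as contractions better than plain whitespace splitting, and then tags each token's part of speech with the pos_tag function.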

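import nltk
from nltk.tag import pos_tag
from nltk.tokenize import TreebankWordTokenizer

# pos_tag needs the tagger model data, so download it first.
# Note: newer NLTK releases may ask for 'averaged_perceptron_tagger_eng' instead.
nltk.download('averaged_perceptron_tagger')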

text = "some teacher don't know how to teach it in then way that students understand it. \n\r that causes students to fail and they may repeat the class. The X.G Compay"

tb_wordtokenizer = TreebankWordTokenizer()
tokenized = tb_wordtokenizer.tokenize(text)
tokenized_postag = pos_tag(tokenized)

print("Word tokens:", tokenized)
print("POS tags:", tokenized_postag)

# Example output:
'''
Word tokens: ['some', 'teacher', 'do', "n't", 'know', 'how', 'to', 'teach', 'it', 'in', 'then', 'way', 'that', 'students', 'understand', 'it.', 'that', 'causes', 'students', 'to', 'fail', 'and', 'they', 'may', 'repeat', 'the', 'class.', 'The', 'X.G', 'Compay']
POS tags: [('some', 'DT'), ('teacher', 'NN'), ('do', 'VBP'), ("n't", 'RB'), ('know', 'VB'), ('how', 'WRB'), ('to', 'TO'), ('teach', 'VB'), ('it', 'PRP'), ('in', 'IN'), ('then', 'RB'), ('way', 'NN'), ('that', 'IN'), ('students', 'NNS'), ('understand', 'VBP'), ('it.', 'NN'), ('that', 'WDT'), ('causes', 'VBZ'), ('students', 'NNS'), ('to', 'TO'), ('fail', 'VB'), ('and', 'CC'), ('they', 'PRP'), ('may', 'MD'), ('repeat', 'VB'), ('the', 'DT'), ('class.', 'NN'), ('The', 'DT'), ('X.G', 'NNP'), ('Compay', 'NNP')]
'''
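
In the output above, 'it.' and 'class.' keep their periods and are tagged as nouns. TreebankWordTokenizer assumes its input has already been split into sentences, so it only detaches a period at the very end of the string it receives. The sketch below illustrates that behavior using only the standard library. It is a rough imitation of two Treebank conventions, not NLTK's implementation, and the function names are hypothetical; in practice you would probably split sentences first with nltk.tokenize.sent_tokenize.

```python
import re


def simple_treebank_like_tokenize(sentence):
    """Hypothetical, heavily simplified imitation of two Treebank conventions."""
    # Split the "n't" contraction off its verb: "don't" -> "do n't".
    sentence = re.sub(r"(?i)(\w)(n't)\b", r"\1 \2", sentence)
    # Detach a period only when it ends the whole string.
    sentence = re.sub(r"\.\s*$", " .", sentence)
    return sentence.split()


def tokenize_by_sentence(text):
    """Naively split on whitespace that follows a period, then tokenize each piece."""
    sentences = re.split(r"(?<=\.)\s+", text.strip())
    return [simple_treebank_like_tokenize(s) for s in sentences if s]


# Tokenizing the whole text at once leaves the mid-text period attached.
whole = simple_treebank_like_tokenize("some teacher don't know it. that fails")
print(whole)

# Tokenizing sentence by sentence detaches every sentence-final period.
per_sentence = tokenize_by_sentence("some teacher don't know it. that fails.")
print(per_sentence)
```

Splitting sentences first means each period becomes its own token, so a tagger sees 'it' and '.' rather than a single 'it.' token.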

 
