Using the nltk module: improved word tokenization with TreebankWordTokenizer, and part-of-speech (POS) tagging with the pos_tag function
import nltk
from nltk.tag import pos_tag
from nltk.tokenize import TreebankWordTokenizer

# pos_tag needs the tagger model to be downloaded first
# (recent NLTK releases may ask for 'averaged_perceptron_tagger_eng' instead)
nltk.download('averaged_perceptron_tagger')
text = "some teacher don't know how to teach it in then way that students understand it. \n\r that causes students to fail and they may repeat the class. The X.G Compay"
tb_wordtokenizer = TreebankWordTokenizer()
tokenized = tb_wordtokenizer.tokenize(text)  # list of word tokens
tokenized_postag = pos_tag(tokenized)  # list of (token, tag) pairs
print("Word tokens:", tokenized)
print("POS tags:", tokenized_postag)
# The output looks like this:
'''
Word tokens: ['some', 'teacher', 'do', "n't", 'know', 'how', 'to', 'teach', 'it', 'in', 'then', 'way', 'that', 'students', 'understand', 'it.', 'that', 'causes', 'students', 'to', 'fail', 'and', 'they', 'may', 'repeat', 'the', 'class.', 'The', 'X.G', 'Compay']
POS tags: [('some', 'DT'), ('teacher', 'NN'), ('do', 'VBP'), ("n't", 'RB'), ('know', 'VB'), ('how', 'WRB'), ('to', 'TO'), ('teach', 'VB'), ('it', 'PRP'), ('in', 'IN'), ('then', 'RB'), ('way', 'NN'), ('that', 'IN'), ('students', 'NNS'), ('understand', 'VBP'), ('it.', 'NN'), ('that', 'WDT'), ('causes', 'VBZ'), ('students', 'NNS'), ('to', 'TO'), ('fail', 'VB'), ('and', 'CC'), ('they', 'PRP'), ('may', 'MD'), ('repeat', 'VB'), ('the', 'DT'), ('class.', 'NN'), ('The', 'DT'), ('X.G', 'NNP'), ('Compay', 'NNP')]
'''
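Two things stand out in the output above: the contraction "don't" is split into 'do' and "n't", while 'it.' and 'class.' keep their periods because they sit in the middle of the input string. The Treebank tokenizer detaches a period only at the very end of its input, so it is meant to be fed one sentence at a time. The sketch below imitates just those two rules using only the standard library. The function name `toy_treebank_tokenize` is hypothetical, and this is a rough illustration under those assumptions, not NLTK's actual implementation.

```python
import re


def toy_treebank_tokenize(sentence):
    """Toy imitation of two Treebank conventions (hypothetical helper, not NLTK code)."""
    # Rule 1: split the "n't" contraction off its verb, e.g. "don't" -> "do n't".
    sentence = re.sub(r"\b(\w+)(n't)\b", r"\1 \2", sentence, flags=re.IGNORECASE)
    # Rule 2: detach a period only when it ends the whole input string.
    sentence = re.sub(r"([^.])\.\s*$", r"\1 .", sentence)
    # Everything else is split on whitespace only.
    return sentence.split()


tokens = toy_treebank_tokenize("some teacher don't know it. they repeat the class.")
print(tokens)
```

In this toy version, only the final period becomes its own token; 'it.' stays glued together, just like in the NLTK output. With NLTK itself, a common remedy is to split the text into sentences first (for example with `nltk.tokenize.sent_tokenize`) and then tokenize each sentence separately.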