Morphological analysis with Janome, and formatting the morphemes for readability
Setting up MeCab was a hassle, so I used Janome instead.
Install
# https://github.com/mocobeta/janome
pip install janome
example_janome.py
# The raw output is hard to read
from janome.tokenizer import Tokenizer
t = Tokenizer()
for token in t.tokenize(u'すもももももももものうち'):
    print(token)
    print(type(token))
    # List every public attribute of the token with its value and type
    for attr_name in dir(token):
        if attr_name.startswith("_"):
            continue
        attr = getattr(token, attr_name)
        print(attr_name, attr, type(attr))
    # Only inspect the first token
    break
すもも 名詞,一般,*,*,*,*,すもも,スモモ,スモモ
<class 'janome.tokenizer.Token'>
base_form すもも <class 'str'>
infl_form * <class 'str'>
infl_type * <class 'str'>
node_type SYS_DICT <class 'str'>
part_of_speech 名詞,一般,*,* <class 'str'>
phonetic スモモ <class 'str'>
reading スモモ <class 'str'>
surface すもも <class 'str'>
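The loop above discovers attributes with dir(), so the set and order of names depend on whatever the Token object happens to expose. As a minimal sketch, the eight names printed above can instead be listed explicitly and collected into one dict per token. TOKEN_FIELDS and tokens_to_rows are hypothetical names introduced here, and a SimpleNamespace stands in for a real Token so the sketch runs without Janome installed.

```python
from types import SimpleNamespace

# Attribute names copied from the printed output above; listing them
# explicitly (an assumption of this sketch) fixes the column order.
TOKEN_FIELDS = (
    "surface", "part_of_speech", "infl_type", "infl_form",
    "base_form", "reading", "phonetic", "node_type",
)


def tokens_to_rows(tokens, fields=TOKEN_FIELDS):
    """Return one dict per token, with keys in the order given by `fields`."""
    return [{name: getattr(token, name) for name in fields} for token in tokens]


# Stand-in object carrying the same values as the first token printed above.
sample = SimpleNamespace(
    surface="すもも", part_of_speech="名詞,一般,*,*", infl_type="*", infl_form="*",
    base_form="すもも", reading="スモモ", phonetic="スモモ", node_type="SYS_DICT",
)

rows = tokens_to_rows([sample])
print(rows[0]["surface"], rows[0]["part_of_speech"])
```

With Janome installed, passing Tokenizer().tokenize(text) to tokens_to_rows should work the same way, provided the tokens expose these attribute names.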
janome_token_dataframe.py
import pandas as pd
from janome.tokenizer import Tokenizer
# One row per token, one column per public Token attribute
pd.DataFrame(
    {attr_name: getattr(token, attr_name) for attr_name in dir(token) if not attr_name.startswith("_")}
    for token in Tokenizer().tokenize(u'すもももももももものうち')
)
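In the DataFrame above, part_of_speech stays a single comma-separated string such as 名詞,一般,*,*. If separate columns are easier to read, one possible approach is to split that string into fixed fields. The sketch below is an assumption-laden helper, not part of Janome: split_part_of_speech and the pos1..pos4 keys are hypothetical names, and the four-level layout is inferred only from the output shown earlier.

```python
def split_part_of_speech(pos, levels=4):
    """Split a comma-separated part_of_speech string into `levels` fields."""
    parts = pos.split(",")
    # Pad with "*" (the placeholder seen in the output above), then truncate
    parts = (parts + ["*"] * levels)[:levels]
    # pos1..posN are hypothetical column names chosen for this sketch
    return {f"pos{i}": part for i, part in enumerate(parts, start=1)}


print(split_part_of_speech("名詞,一般,*,*"))
```

Each returned dict can be merged into a token's row before the DataFrame is built, which gives one column per part-of-speech level.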