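# Binning and creating dummy variables

- Until now, I built bins by creating a dict and applying it with `map`.
- `pd.cut` lets you choose whether the right edge of each interval is open or closed.
- Passing strings that describe the numeric ranges as labels gives a map-like result.
- Dummy variables are handy because they keep the SQL from getting complicated.
- `factorize`, shown in the bonus section, looks useful for libraries that can only handle numeric values.
- However, its handling of NaN apparently differs slightly from numpy.

```python3
import numpy as np
import pandas as pd
```

## References

- http://pandas.pydata.org/pandas-docs/stable/generated/pandas.cut.html
- http://pandas.pydata.org/pandas-docs/stable/reshaping.html#computing-indicator-dummy-variables

## Data

```python3
np.random.seed(0)
# The upper bound of randint is exclusive, so ages range from 1 to 98
df_for_cut = pd.DataFrame(np.random.randint(1, 99, 1000), columns=["age"])
df_for_cut.tail()
```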

|     | age |
|----:|----:|
| 995 | 36 |
| 996 | 89 |
| 997 | 50 |
| 998 | 80 |
| 999 | 85 |

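
For comparison, here is one guess at what the dict-and-map approach mentioned in the introduction might have looked like, next to the `pd.cut` equivalent. The names `decade_of`, `by_map`, and `by_cut` and the small `ages` sample are made up for illustration; the original dict is not shown in this note.

```python
import pandas as pd

ages = pd.Series([5, 36, 50, 89], name="age")

# dict + map: enumerate every possible value and the class it belongs to
# (hypothetical reconstruction of the "old" approach)
decade_of = {a: f"{a // 10 * 10} - {a // 10 * 10 + 9}" for a in range(0, 100)}
by_map = ages.map(decade_of)

# pd.cut: describe only the bin edges and their labels.
# right=False makes the bins [0, 10), [10, 20), ... so they match the labels.
bins = list(range(0, 101, 10))
labels = [f"{b} - {b + 9}" for b in bins[:-1]]
by_cut = pd.cut(ages, bins=bins, labels=labels, right=False)

print(by_map.tolist())
print(by_cut.astype(str).tolist())
```

Both produce the same labels here; the difference is that `pd.cut` only needs the edges, while the dict has to list every value that can occur.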
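## Creating the bins

```python3
bins = list(range(0, 100+1, 10))
bins

[0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
```

## Bin labels

```python3
bins_labels = [str(b) + " - " + str(b + 10 - 1) for b in bins[:-1]]
bins_labels

['0 - 9', '10 - 19', '20 - 29', '30 - 39', '40 - 49', '50 - 59', '60 - 69', '70 - 79', '80 - 89', '90 - 99']
```

```python3
# Default (right=True): intervals are open on the left, closed on the right
df_for_cut["age_group"] = pd.cut(df_for_cut.age, bins=bins)
# right=False: intervals are closed on the left, open on the right
df_for_cut["age_group_right"] = pd.cut(df_for_cut.age, bins=bins, right=False)
# labels=False: return the integer index of the bin instead of the interval
df_for_cut["age_group_label_F"] = pd.cut(df_for_cut.age, bins=bins, labels=False)
# Caution: with the default right=True the bins are (0, 10], (10, 20], ...,
# so the "0 - 9" style labels are off at the edges (age 50 gets "40 - 49")
df_for_cut["age_group_labels"] = pd.cut(df_for_cut.age, bins=bins, labels=bins_labels)
df_for_cut.tail()
```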
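
|     | age | age_group | age_group_right | age_group_label_F | age_group_labels |
|----:|----:|:----------|:----------------|------------------:|:-----------------|
| 995 | 36 | (30, 40] | [30, 40) | 3 | 30 - 39 |
| 996 | 89 | (80, 90] | [80, 90) | 8 | 80 - 89 |
| 997 | 50 | (40, 50] | [50, 60) | 4 | 40 - 49 |
| 998 | 80 | (70, 80] | [80, 90) | 7 | 70 - 79 |
| 999 | 85 | (80, 90] | [80, 90) | 8 | 80 - 89 |
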
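
In the table above, ages 50 and 80 land in `(40, 50]` and `(70, 80]` and are therefore labelled `40 - 49` and `70 - 79`. The sketch below, using only the five ages from the table, suggests one way to make the labels agree with the bins: combine the labels with `right=False`. The variable names `right_closed`, `left_closed`, and `compare` are illustrative.

```python
import pandas as pd

ages = pd.Series([36, 89, 50, 80, 85], name="age")
bins = list(range(0, 101, 10))
labels = [f"{b} - {b + 9}" for b in bins[:-1]]

# Default right=True: bins are (0, 10], (10, 20], ... so 50 is labelled "40 - 49"
right_closed = pd.cut(ages, bins=bins, labels=labels)

# right=False: bins are [0, 10), [10, 20), ... which is what the labels describe
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

compare = pd.DataFrame(
    {"age": ages, "right_closed": right_closed, "left_closed": left_closed}
)
print(compare)
```

Only the rows that sit exactly on a bin edge (50 and 80) change between the two columns.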
```python3
df_for_cut.age_group.unique()

[(40, 50], (60, 70], (0, 10], (80, 90], (20, 30], (30, 40], (70, 80], (10, 20], (50, 60], (90, 100]]
Categories (10, object): [(0, 10] < (10, 20] < (20, 30] < (30, 40] ... (60, 70] < (70, 80] < (80, 90] < (90, 100]]
```

```python3
df_for_cut.age_group_label_F.unique()

array([4, 6, 0, 8, 2, 3, 7, 1, 5, 9])
```

There is also `pd.qcut` (quantile cut), which takes either the number of quantiles or a list of quantile boundaries.

```python3
# q as an integer: split into that many equal-sized quantile groups
qcuted_4 = pd.qcut(df_for_cut["age"], q=4)
qcuted_4.tail()

# q as a list: give the quantile boundaries explicitly
q = [0, .25, .5, .75, 1]
qcuted_list = pd.qcut(df_for_cut["age"], q=q)
qcuted_list.tail()
```

## Dummy variables

```python3
dummies = pd.get_dummies(df_for_cut['age_group'], prefix='age_group')
df_for_cut_with_dummies = pd.concat([df_for_cut, dummies], axis=1)
df_for_cut_with_dummies.tail()
```
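
|     | age | age_group | age_group_right | age_group_(0, 10] | age_group_(10, 20] | age_group_(20, 30] | age_group_(30, 40] | age_group_(40, 50] | age_group_(50, 60] | age_group_(60, 70] | age_group_(70, 80] | age_group_(80, 90] | age_group_(90, 100] |
|----:|----:|:----------|:----------------|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| 995 | 36 | (30, 40] | [30, 40) | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 996 | 89 | (80, 90] | [80, 90) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 997 | 50 | (40, 50] | [50, 60) | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 998 | 80 | (70, 80] | [80, 90) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 999 | 85 | (80, 90] | [80, 90) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
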
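
The output above shows 0/1 columns. As far as I know, pandas 2.0 and later return boolean columns from `pd.get_dummies` by default, so the minimal sketch below passes `dtype=int` to get integers regardless of version. The three-bin `groups` series is a made-up example, not the data from this note.

```python
import pandas as pd

ages = pd.Series([36, 89, 50], name="age")
groups = pd.cut(ages, bins=[0, 30, 60, 90])

# dtype=int requests 0/1 integers instead of the version-dependent default
dummies = pd.get_dummies(groups, prefix="age_group", dtype=int)
print(dummies)
```

Because `groups` is categorical, the empty `(0, 30]` bin still gets its own all-zero column, which keeps the column layout stable across different samples.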
```python3
# DataFrame: one prefix per column
pd.get_dummies(pd.DataFrame({"a": list("AB"), "b": list("CD")}), prefix=list("ab"))

# Series: str.get_dummies has no prefix argument, but it saves you from
# post-processing the result of split + expand
pd.Series(["a|b|c", "e|fg"]).str.get_dummies()
pd.Series(["a|b|c", "e|fg"]).str.split("|", expand=True)
```
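
Output of the `pd.get_dummies` call on the DataFrame:

|   | a_A | a_B | b_C | b_D |
|--:|----:|----:|----:|----:|
| 0 | 1 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 | 1 |
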
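
To make the difference between the two Series calls concrete, the sketch below runs both on the same input. The variable names `tags`, `indicator`, and `split` are illustrative.

```python
import pandas as pd

tags = pd.Series(["a|b|c", "e|fg"])

# One indicator column per distinct token; "|" is the default separator
indicator = tags.str.get_dummies(sep="|")

# split + expand only spreads the tokens across positional columns,
# padding shorter rows with missing values
split = tags.str.split("|", expand=True)

print(indicator)
print(split)
```

`indicator` has one column per token (`a`, `b`, `c`, `e`, `fg`), whereas `split` has three positional columns that would still need reshaping before they could serve as dummy variables.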
## Bonus

```python3
factors = pd.Series(["B", np.nan, "a", np.nan, 123, 0.4, np.inf])
factors

0      B
1    NaN
2      a
3    NaN
4    123
5    0.4
6    inf
dtype: object
```

```python3
# NaN is encoded as -1 and left out of the unique values
factors.factorize()

(array([ 0, -1,  1, -1,  2,  3,  4]), Index(['B', 'a', 123, 0.4, inf], dtype='object'))
```
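
The introduction notes that NaN handling apparently differs from numpy. The sketch below is one possible illustration of that remark, comparing `pd.factorize` with `np.unique(..., return_inverse=True)` on a small float array of my own choosing; whether this is the difference the note had in mind is an assumption. The exact shape of the `np.unique` result can also depend on the NumPy version.

```python
import numpy as np
import pandas as pd

values = np.array([1.0, np.nan, 2.0, np.nan, 1.0])

# pandas: NaN gets the sentinel code -1 and does not appear in the uniques
codes, uniques = pd.factorize(values)

# numpy: NaN is kept as a regular entry and receives an ordinary index
np_uniques, np_inverse = np.unique(values, return_inverse=True)

print(codes, uniques)
print(np_uniques, np_inverse)
```

A library that expects non-negative integer codes would therefore need the `-1` entries from `factorize` handled separately, for example by dropping or imputing the missing rows first.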