new_pivot_unstack_melt_MultiIndex¶
In [1]:
import numpy as np
import pandas as pd
データ¶
- 4人のデータ
- 性別, 年齢
- ある年の体重
- データ入力がめんどうなので年しか変えていない
In [2]:
import io
df = pd.read_csv(io.StringIO("""
id sex age year weight
a f 10 2016 49
b m 10 2016 59
c f 10 2016 54
d m 10 2016 64
a f 10 2017 49
b m 10 2017 59
c f 10 2017 54
d m 10 2017 64
"""), sep=" ")
df
Out[2]:
id | sex | age | year | weight | |
---|---|---|---|---|---|
0 | a | f | 10 | 2016 | 49 |
1 | b | m | 10 | 2016 | 59 |
2 | c | f | 10 | 2016 | 54 |
3 | d | m | 10 | 2016 | 64 |
4 | a | f | 10 | 2017 | 49 |
5 | b | m | 10 | 2017 | 59 |
6 | c | f | 10 | 2017 | 54 |
7 | d | m | 10 | 2017 | 64 |
年ごとの体重¶
In [3]:
df.pivot(index="id", columns="year", values="weight")
Out[3]:
year | 2016 | 2017 |
---|---|---|
id | ||
a | 49 | 49 |
b | 59 | 59 |
c | 54 | 54 |
d | 64 | 64 |
pivotの限界¶
各引数を複数指定できない
In [4]:
try:
df.pivot(index=["id", "sex"], columns="year", values="weight")
except:
print("err multi index")
try:
df.pivot(index="id", columns=["year", "sex"], values="weight")
except:
print("err multi columns")
try:
df.pivot(index="id", columns="year", values=["weight", "age"])
except:
print("err multi values")
err multi index
err multi columns
err multi values
メソッドのみで対応する¶
index と unstack¶
In [5]:
indexed_unstacked = df.set_index(["id", "sex", "year"]).unstack(["year", "sex"])
indexed_unstacked
Out[5]:
age | weight | |||||||
---|---|---|---|---|---|---|---|---|
year | 2016 | 2017 | 2016 | 2017 | ||||
sex | f | m | f | m | f | m | f | m |
id | ||||||||
a | 10.0 | NaN | 10.0 | NaN | 49.0 | NaN | 49.0 | NaN |
b | NaN | 10.0 | NaN | 10.0 | NaN | 59.0 | NaN | 59.0 |
c | 10.0 | NaN | 10.0 | NaN | 54.0 | NaN | 54.0 | NaN |
d | NaN | 10.0 | NaN | 10.0 | NaN | 64.0 | NaN | 64.0 |
MultiIndexを1次元のリストにする¶
- MultiIndexはtupleのlistとして扱える
- 文字列結合して対応する
In [6]:
indexed_unstacked.columns
Out[6]:
MultiIndex(levels=[['age', 'weight'], [2016, 2017], ['f', 'm']],
labels=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 1, 1, 0, 0, 1, 1], [0, 1, 0, 1, 0, 1, 0, 1]],
names=[None, 'year', 'sex'])
In [7]:
list(indexed_unstacked.columns)
Out[7]:
[('age', 2016, 'f'),
('age', 2016, 'm'),
('age', 2017, 'f'),
('age', 2017, 'm'),
('weight', 2016, 'f'),
('weight', 2016, 'm'),
('weight', 2017, 'f'),
('weight', 2017, 'm')]
In [8]:
indexed_unstacked.columns = ["_".join(map(str, c)) for c in indexed_unstacked.columns]
indexed_unstacked
Out[8]:
age_2016_f | age_2016_m | age_2017_f | age_2017_m | weight_2016_f | weight_2016_m | weight_2017_f | weight_2017_m | |
---|---|---|---|---|---|---|---|---|
id | ||||||||
a | 10.0 | NaN | 10.0 | NaN | 49.0 | NaN | 49.0 | NaN |
b | NaN | 10.0 | NaN | 10.0 | NaN | 59.0 | NaN | 59.0 |
c | 10.0 | NaN | 10.0 | NaN | 54.0 | NaN | 54.0 | NaN |
d | NaN | 10.0 | NaN | 10.0 | NaN | 64.0 | NaN | 64.0 |
カラムを結合して対応する¶
Rのtidyrではuniteという関数を使って対応しているので、それの真似
In [9]:
df_unite = df.copy()
df_unite["year_sex"] = df_unite["year"].apply(str) + "_" + df_unite["sex"]
df_unite
Out[9]:
id | sex | age | year | weight | year_sex | |
---|---|---|---|---|---|---|
0 | a | f | 10 | 2016 | 49 | 2016_f |
1 | b | m | 10 | 2016 | 59 | 2016_m |
2 | c | f | 10 | 2016 | 54 | 2016_f |
3 | d | m | 10 | 2016 | 64 | 2016_m |
4 | a | f | 10 | 2017 | 49 | 2017_f |
5 | b | m | 10 | 2017 | 59 | 2017_m |
6 | c | f | 10 | 2017 | 54 | 2017_f |
7 | d | m | 10 | 2017 | 64 | 2017_m |
In [10]:
df_unite.pivot(index="id", columns="year_sex", values="weight")
Out[10]:
year_sex | 2016_f | 2016_m | 2017_f | 2017_m |
---|---|---|---|---|
id | ||||
a | 49.0 | NaN | 49.0 | NaN |
b | NaN | 59.0 | NaN | 59.0 |
c | 54.0 | NaN | 54.0 | NaN |
d | NaN | 64.0 | NaN | 64.0 |
おまけ:元のデータフレームに戻す¶
In [11]:
# startswith(tuple) で複数対応できる, builtinも/endswithも
all_columns = indexed_unstacked.columns[
indexed_unstacked.columns.str.startswith(("age", "weight",))]
all_columns
Out[11]:
Index(['age_2016_f', 'age_2016_m', 'age_2017_f', 'age_2017_m', 'weight_2016_f',
'weight_2016_m', 'weight_2017_f', 'weight_2017_m'],
dtype='object')
In [12]:
melted = pd.melt(indexed_unstacked.reset_index(), id_vars='id', value_vars=list(all_columns))
melted.head()
Out[12]:
id | variable | value | |
---|---|---|---|
0 | a | age_2016_f | 10.0 |
1 | b | age_2016_f | NaN |
2 | c | age_2016_f | 10.0 |
3 | d | age_2016_f | NaN |
4 | a | age_2016_m | NaN |
In [13]:
melted = melted.dropna().reset_index(drop=True)
melted = (melted
# join: index join
.join(melted.variable.str.split("_", expand=True).reset_index(drop=True))
.drop("variable", axis=1)
.rename(columns={0: "column", 1: "year", 2: "sex"})
)
pd.concat([melted.head(), melted.tail()])
Out[13]:
id | value | column | year | sex | |
---|---|---|---|---|---|
0 | a | 10.0 | age | 2016 | f |
1 | c | 10.0 | age | 2016 | f |
2 | b | 10.0 | age | 2016 | m |
3 | d | 10.0 | age | 2016 | m |
4 | a | 10.0 | age | 2017 | f |
11 | d | 64.0 | weight | 2016 | m |
12 | a | 49.0 | weight | 2017 | f |
13 | c | 54.0 | weight | 2017 | f |
14 | b | 59.0 | weight | 2017 | m |
15 | d | 64.0 | weight | 2017 | m |
In [14]:
pd.concat(
[
melted[melted.column == "age"].drop("column", axis=1).rename(columns={"value": "age"}),
melted[melted.column == "weight"]["value"].rename("weight").reset_index(drop=True),
], axis=1)
Out[14]:
id | age | year | sex | weight | |
---|---|---|---|---|---|
0 | a | 10.0 | 2016 | f | 49.0 |
1 | c | 10.0 | 2016 | f | 54.0 |
2 | b | 10.0 | 2016 | m | 59.0 |
3 | d | 10.0 | 2016 | m | 64.0 |
4 | a | 10.0 | 2017 | f | 49.0 |
5 | c | 10.0 | 2017 | f | 54.0 |
6 | b | 10.0 | 2017 | m | 59.0 |
7 | d | 10.0 | 2017 | m | 64.0 |
ピボットテーブルでの対応¶
- おすすめしない
- index/columnsの組合せによってvaluesがユニークにならない場合正しくないため
- pivot/unstackではユニークにならないときにErrorとなるため安全
- ただし、一番記述しやすい
In [15]:
pivot_table = df.pivot_table(index="id", columns=["year", "sex"], values="weight", aggfunc=np.sum)
pivot_table
Out[15]:
year | 2016 | 2017 | ||
---|---|---|---|---|
sex | f | m | f | m |
id | ||||
a | 49.0 | NaN | 49.0 | NaN |
b | NaN | 59.0 | NaN | 59.0 |
c | 54.0 | NaN | 54.0 | NaN |
d | NaN | 64.0 | NaN | 64.0 |
おまけ: MultiIndexの扱い¶
- MutiIndexを持つDataFrameからのカラム選択
- Index階層が減ることによりIndexに変化する
- MultiIndexカラムの追加
In [16]:
diff = pivot_table.loc[:, 2017] - pivot_table.loc[:, 2016]
diff
Out[16]:
sex | f | m |
---|---|---|
id | ||
a | 0.0 | NaN |
b | NaN | 0.0 |
c | 0.0 | NaN |
d | NaN | 0.0 |
In [17]:
diff.columns
Out[17]:
Index(['f', 'm'], dtype='object', name='sex')
In [18]:
diff.columns = pd.MultiIndex.from_product([("diff",), ("f", "m")])
diff
Out[18]:
diff | ||
---|---|---|
f | m | |
id | ||
a | 0.0 | NaN |
b | NaN | 0.0 |
c | 0.0 | NaN |
d | NaN | 0.0 |
In [19]:
pd.concat([pivot_table, diff], axis=1)
Out[19]:
year | 2016 | 2017 | diff | |||
---|---|---|---|---|---|---|
sex | f | m | f | m | f | m |
id | ||||||
a | 49.0 | NaN | 49.0 | NaN | 0.0 | NaN |
b | NaN | 59.0 | NaN | 59.0 | NaN | 0.0 |
c | 54.0 | NaN | 54.0 | NaN | 0.0 | NaN |
d | NaN | 64.0 | NaN | 64.0 | NaN | 0.0 |
In [ ]: