new_pivot_unstack_melt_MultiIndex

In [1]:
import numpy as np
import pandas as pd

データ

  • 4人のデータ
  • 性別, 年齢
  • ある年の体重
  • データ入力がめんどうなので年しか変えていない
In [2]:
import io
df = pd.read_csv(io.StringIO("""
id sex age year weight
a f 10 2016 49
b m 10 2016 59
c f 10 2016 54
d m 10 2016 64
a f 10 2017 49
b m 10 2017 59
c f 10 2017 54
d m 10 2017 64
"""), sep=" ")
df
Out[2]:
id sex age year weight
0 a f 10 2016 49
1 b m 10 2016 59
2 c f 10 2016 54
3 d m 10 2016 64
4 a f 10 2017 49
5 b m 10 2017 59
6 c f 10 2017 54
7 d m 10 2017 64

年ごとの体重

In [3]:
df.pivot(index="id", columns="year", values="weight")
Out[3]:
year 2016 2017
id
a 49 49
b 59 59
c 54 54
d 64 64

pivotの限界

各引数を複数指定できない

In [4]:
try:
    df.pivot(index=["id", "sex"], columns="year", values="weight")
except:
    print("err multi index")

try:
    df.pivot(index="id", columns=["year", "sex"], values="weight")
except:
    print("err multi columns")

try:
    df.pivot(index="id", columns="year", values=["weight", "age"])
except:
    print("err multi values")
err multi index
err multi columns
err multi values

メソッドのみで対応する

index と unstack

In [5]:
indexed_unstacked = df.set_index(["id", "sex", "year"]).unstack(["year", "sex"])
indexed_unstacked
Out[5]:
age weight
year 2016 2017 2016 2017
sex f m f m f m f m
id
a 10.0 NaN 10.0 NaN 49.0 NaN 49.0 NaN
b NaN 10.0 NaN 10.0 NaN 59.0 NaN 59.0
c 10.0 NaN 10.0 NaN 54.0 NaN 54.0 NaN
d NaN 10.0 NaN 10.0 NaN 64.0 NaN 64.0

MultiIndexを1次元のリストにする

  • MultiIndexはtupleのlistとして扱える
  • 文字列結合して対応する
In [6]:
indexed_unstacked.columns
Out[6]:
MultiIndex(levels=[['age', 'weight'], [2016, 2017], ['f', 'm']],
           labels=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 1, 1, 0, 0, 1, 1], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=[None, 'year', 'sex'])
In [7]:
list(indexed_unstacked.columns)
Out[7]:
[('age', 2016, 'f'),
 ('age', 2016, 'm'),
 ('age', 2017, 'f'),
 ('age', 2017, 'm'),
 ('weight', 2016, 'f'),
 ('weight', 2016, 'm'),
 ('weight', 2017, 'f'),
 ('weight', 2017, 'm')]
In [8]:
indexed_unstacked.columns = ["_".join(map(str, c)) for c in indexed_unstacked.columns]
indexed_unstacked
Out[8]:
age_2016_f age_2016_m age_2017_f age_2017_m weight_2016_f weight_2016_m weight_2017_f weight_2017_m
id
a 10.0 NaN 10.0 NaN 49.0 NaN 49.0 NaN
b NaN 10.0 NaN 10.0 NaN 59.0 NaN 59.0
c 10.0 NaN 10.0 NaN 54.0 NaN 54.0 NaN
d NaN 10.0 NaN 10.0 NaN 64.0 NaN 64.0

カラムを結合して対応する

Rのtidyrではuniteという関数を使って対応しているので、それの真似

In [9]:
df_unite = df.copy()
df_unite["year_sex"] = df_unite["year"].apply(str) + "_" + df_unite["sex"]
df_unite
Out[9]:
id sex age year weight year_sex
0 a f 10 2016 49 2016_f
1 b m 10 2016 59 2016_m
2 c f 10 2016 54 2016_f
3 d m 10 2016 64 2016_m
4 a f 10 2017 49 2017_f
5 b m 10 2017 59 2017_m
6 c f 10 2017 54 2017_f
7 d m 10 2017 64 2017_m
In [10]:
df_unite.pivot(index="id", columns="year_sex", values="weight")
Out[10]:
year_sex 2016_f 2016_m 2017_f 2017_m
id
a 49.0 NaN 49.0 NaN
b NaN 59.0 NaN 59.0
c 54.0 NaN 54.0 NaN
d NaN 64.0 NaN 64.0

おまけ:元のデータフレームに戻す

In [11]:
# startswith(tuple) で複数対応できる, builtinも/endswithも
all_columns = indexed_unstacked.columns[
    indexed_unstacked.columns.str.startswith(("age", "weight",))]
all_columns
Out[11]:
Index(['age_2016_f', 'age_2016_m', 'age_2017_f', 'age_2017_m', 'weight_2016_f',
       'weight_2016_m', 'weight_2017_f', 'weight_2017_m'],
      dtype='object')
In [12]:
melted = pd.melt(indexed_unstacked.reset_index(), id_vars='id', value_vars=list(all_columns))
melted.head()
Out[12]:
id variable value
0 a age_2016_f 10.0
1 b age_2016_f NaN
2 c age_2016_f 10.0
3 d age_2016_f NaN
4 a age_2016_m NaN
In [13]:
melted = melted.dropna().reset_index(drop=True)
melted = (melted
    # join: index join
    .join(melted.variable.str.split("_", expand=True).reset_index(drop=True))
    .drop("variable", axis=1)
    .rename(columns={0: "column", 1: "year", 2: "sex"})
)
pd.concat([melted.head(), melted.tail()])
Out[13]:
id value column year sex
0 a 10.0 age 2016 f
1 c 10.0 age 2016 f
2 b 10.0 age 2016 m
3 d 10.0 age 2016 m
4 a 10.0 age 2017 f
11 d 64.0 weight 2016 m
12 a 49.0 weight 2017 f
13 c 54.0 weight 2017 f
14 b 59.0 weight 2017 m
15 d 64.0 weight 2017 m
In [14]:
pd.concat(
    [
        melted[melted.column == "age"].drop("column", axis=1).rename(columns={"value": "age"}),
        melted[melted.column == "weight"]["value"].rename("weight").reset_index(drop=True),
    ], axis=1)

Out[14]:
id age year sex weight
0 a 10.0 2016 f 49.0
1 c 10.0 2016 f 54.0
2 b 10.0 2016 m 59.0
3 d 10.0 2016 m 64.0
4 a 10.0 2017 f 49.0
5 c 10.0 2017 f 54.0
6 b 10.0 2017 m 59.0
7 d 10.0 2017 m 64.0

ピボットテーブルでの対応

  • おすすめしない
  • index/columnsの組合せによってvaluesがユニークにならない場合正しくないため
  • pivot/unstackではユニークにならないときにErrorとなるため安全
  • ただし、一番記述しやすい
In [15]:
pivot_table = df.pivot_table(index="id", columns=["year", "sex"], values="weight", aggfunc=np.sum)
pivot_table
Out[15]:
year 2016 2017
sex f m f m
id
a 49.0 NaN 49.0 NaN
b NaN 59.0 NaN 59.0
c 54.0 NaN 54.0 NaN
d NaN 64.0 NaN 64.0

おまけ: MultiIndexの扱い

  • MutiIndexを持つDataFrameからのカラム選択
    • Index階層が減ることによりIndexに変化する
  • MultiIndexカラムの追加
In [16]:
diff = pivot_table.loc[:, 2017] - pivot_table.loc[:, 2016]
diff
Out[16]:
sex f m
id
a 0.0 NaN
b NaN 0.0
c 0.0 NaN
d NaN 0.0
In [17]:
diff.columns
Out[17]:
Index(['f', 'm'], dtype='object', name='sex')
In [18]:
diff.columns = pd.MultiIndex.from_product([("diff",), ("f", "m")])
diff
Out[18]:
diff
f m
id
a 0.0 NaN
b NaN 0.0
c 0.0 NaN
d NaN 0.0
In [19]:
pd.concat([pivot_table, diff], axis=1)
Out[19]:
year 2016 2017 diff
sex f m f m f m
id
a 49.0 NaN 49.0 NaN 0.0 NaN
b NaN 59.0 NaN 59.0 NaN 0.0
c 54.0 NaN 54.0 NaN 0.0 NaN
d NaN 64.0 NaN 64.0 NaN 0.0
In [ ]: