Checking that DataFrames are equal

Motivation

I had occasion to compare two DataFrames to confirm that a result was correct.

Setup

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

Create a DataFrame containing NA

np.random.seed(0)
# randint's upper bound is exclusive, so (1, 5) draws integers from 1 to 4
df = pd.DataFrame(np.random.randint(1, 5, size=(3, 4)), columns=list("abcd"))
df["c"] = np.nan
other = df.copy()
df

	a	b	c	d
0	1	4	NaN	1
1	4	4	NaN	4
2	2	4	NaN	3

Check whether each element is equal and whether the DataFrames are equal

df == other

	a	b	c	d
0	True	True	False	True
1	True	True	False	True
2	True	True	False	True

np.nan == np.nan, np.nan != np.nan

(False, True)

df.equals(other)

True

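NA values do not compare equal to each other (like NULL in SQL), but `equals` treats NAs in the same position as equal, so the DataFrames as a whole are equal.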
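The difference between `==` and `equals` can be bridged with a NaN-aware element-wise mask. The following is a minimal sketch, not part of the original notebook; the frames `left` and `right` and the name `nan_aware_eq` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small illustrative frames (hypothetical data, not the notebook's df/other).
left = pd.DataFrame({"a": [1.0, 2.0], "c": [np.nan, np.nan]})
right = left.copy()

# Plain == reports False wherever a NaN is involved.
plain_eq = left == right

# Count a position as equal if the values match OR both sides are NaN.
nan_aware_eq = plain_eq | (left.isna() & right.isna())

print(plain_eq)
print(nan_aware_eq)
```

With this mask, the element-wise result agrees with what `equals` reports for the whole frame, without replacing NaN by a sentinel value.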

When they are not equal, find where they differ

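Replace NA with a specific string so that those elements compare equal in an element-wise comparison.
Then modify `other` so that the two DataFrames are no longer equal.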

df = df.fillna("NA String")
other = other.fillna("NA String")
other["a"] = 4
other.iloc[0, 1] = 100
other

	a	b	c	d
0	4	100	NA String	1
1	4	4	NA String	4
2	4	4	NA String	3

# method version of ==
eq = df.eq(other)
eq

	a	b	c	d
0	False	False	True	True
1	True	True	True	True
2	False	True	True	True

df.equals(other)

False

The elements that were NA now compare equal.
Because `other` was modified, the DataFrames as a whole are no longer equal.
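As an alternative way to see the differing values side by side, pandas also provides `DataFrame.compare` (available since pandas 1.1, and requiring identically labeled frames). This is a sketch under those assumptions, using hypothetical frames `left` and `right` that mirror the changes made above.

```python
import pandas as pd

# Hypothetical frames mirroring the modifications made to `other` above.
left = pd.DataFrame({"a": [1, 4, 2], "b": [4, 4, 4]})
right = left.copy()
right["a"] = 4
right.iloc[0, 1] = 100

# Keeps only rows and columns with at least one difference;
# "self" holds the value from `left`, "other" the value from `right`.
diff = left.compare(right)
print(diff)
```

Positions that match inside a kept row or column show up as NaN in the result, so only the actual differences carry values.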

Identify the unequal columns and index labels, and how equal they are

Apply `all` to the element-wise comparison result along both the column and index directions to identify which rows and columns contain a mismatch.

pd.DataFrame(eq.all(axis=1))

	0
0	False
1	True
2	False

pd.DataFrame(eq.all()).T

	a	b	c	d
0	False	False	True	True
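The boolean results of `all` can also be turned into explicit lists of labels. The sketch below is an illustration only: it rebuilds a small comparison frame by hand (the notebook's `eq` would work the same way), and the names `bad_rows`, `bad_cols`, and `mismatches` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hand-built comparison result shaped like `eq` (hypothetical stand-in).
eq = pd.DataFrame(
    {
        "a": [False, True, False],
        "b": [False, True, True],
        "c": [True, True, True],
    }
)

# Labels of rows / columns that contain at least one mismatch.
row_ok = eq.all(axis=1)
col_ok = eq.all()
bad_rows = row_ok[~row_ok].index.tolist()
bad_cols = col_ok[~col_ok].index.tolist()

# Every (row label, column label) pair where the comparison is False.
rows, cols = np.where(~eq.to_numpy())
mismatches = [(eq.index[r], eq.columns[c]) for r, c in zip(rows, cols)]

print(bad_rows, bad_cols, mismatches)
```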

# Row 0: number of matching elements per column; row 1: matching fraction
matches = pd.DataFrame(eq.sum()).T
pd.concat([matches, matches / len(df)], ignore_index=True)

	a	b	c	d
0	1.000000	2.000000	3	3
1	0.333333	0.666667	1	1

print(pd.options.display.float_format)
with pd.option_context("display.float_format", "{:.2f}%".format):
    print(pd.DataFrame(eq.sum()).T / len(df) * 100)

None
       a      b       c       d
0 33.33% 66.67% 100.00% 100.00%
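Since the mean of a boolean column is the fraction of `True` values, the per-column match rate can also be computed with `mean` instead of `sum() / len(df)`. The helper `match_rate` below is a hypothetical name introduced for this sketch. Note that `eq` counts NaN as unequal, so NA values would need to be handled beforehand, as done above with `fillna`.

```python
import pandas as pd


def match_rate(left, right):
    """Fraction of equal elements per column (hypothetical helper)."""
    return left.eq(right).mean()


# Hypothetical frames mirroring columns a and b of df and other.
left = pd.DataFrame({"a": [1, 4, 2], "b": [4, 4, 4]})
right = pd.DataFrame({"a": [4, 4, 4], "b": [100, 4, 4]})

rate = match_rate(left, right)
print((rate * 100).round(2))
```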