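import numpy as np
import pandas as pd

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data'
# The file has no header row; the first column (party) becomes the index.
df = pd.read_csv(url, header=None, index_col=0)
# Replace the vote markers in place: '?' = missing, 'y' = 1.0, 'n' = 0.0.
df[df.eq('?')] = np.nan
df[df.eq('y')] = 1.0
df[df.eq('n')] = 0.0
# Move the party labels back into an ordinary column.
df = df.reset_index()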
Result:
In [67]: df
Out[67]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 republican 0 1 0 1 1 1 0 0 0 1 NaN 1 1 1 0 1
1 republican 0 1 0 1 1 1 0 0 0 0 0 1 1 1 0 NaN
2 democrat NaN 1 1 NaN 1 1 0 0 0 0 1 0 1 1 0 0
3 democrat 0 1 1 0 NaN 1 0 0 0 0 1 0 1 0 0 1
4 democrat 1 1 1 0 1 1 0 0 0 0 1 NaN 1 1 1 1
5 democrat 0 1 1 0 1 1 0 0 0 0 0 0 1 1 1 1
6 democrat 0 1 0 1 1 1 0 0 0 0 0 0 NaN 1 1 1
7 republican 0 1 0 1 1 1 0 0 0 0 0 0 1 1 NaN 1
8 republican 0 1 0 1 1 1 0 0 0 0 0 1 1 1 0 1
9 democrat 1 1 1 0 0 0 1 1 1 0 0 0 0 0 NaN NaN
.. ... ... ... ... ... ... .. ... ... ... ... ... ... ... ... ... ...
425 democrat 0 0 1 0 0 0 1 1 0 1 1 0 0 0 1 NaN
426 democrat 1 0 1 0 0 0 1 1 1 1 0 0 0 0 1 1
427 republican 0 0 0 1 1 1 1 1 0 1 0 1 1 1 0 1
428 democrat NaN NaN NaN 0 0 0 1 1 1 1 0 0 1 0 1 1
429 democrat 1 0 1 0 NaN 0 1 1 1 1 0 1 0 NaN 1 1
430 republican 0 0 1 1 1 1 0 0 1 1 0 1 1 1 0 1
431 democrat 0 0 1 0 0 0 1 1 1 1 0 0 0 0 0 1
432 republican 0 NaN 0 1 1 1 0 0 0 0 1 1 1 1 0 1
433 republican 0 0 0 1 1 1 NaN NaN NaN NaN 0 1 1 1 0 1
434 republican 0 1 0 1 1 1 0 0 0 1 0 1 1 1 NaN 0
[435 rows x 17 columns]
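I don't need to reset the index either. This has to do with the columns holding object values instead of all being float64. I found a workaround, but I still don't understand why everything above comes out as object. – a1letterword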
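On the object dtype raised in the comment: the vote columns are read in as strings, and the masked assignments only swap individual values without changing each column's dtype. The commenter's workaround is not stated, so the following is only one possible approach, a minimal sketch that uses a small inline sample (hypothetical rows in the same layout as the file) instead of the real URL. It treats '?' as missing at read time and maps 'y'/'n' to floats so the vote columns end up as float64.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same layout as house-votes-84.data.
sample = io.StringIO(
    "republican,n,y,?\n"
    "democrat,?,y,y\n"
    "democrat,y,n,n\n"
)

# na_values='?' turns the missing-vote marker into NaN while parsing.
votes = pd.read_csv(sample, header=None, index_col=0, na_values='?')

# Map each vote column to floats; NaN entries stay NaN.
numeric = votes.apply(lambda col: col.map({'y': 1.0, 'n': 0.0})).astype('float64')

# Optional: bring the party labels back as a regular column.
result = numeric.reset_index()

print(numeric.dtypes)
```

With the real data, the same calls should apply after replacing `sample` with the URL from the answer.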