Pandas DataFrame (from CSV) with multiple header rows throughout the data

I am working with a DataFrame created from a CSV file. The data contains header rows throughout; each header row identifies something about the rows beneath it, up to the next header row.

The data looks like this:

2001|  |colour |Price | Quantity sold
Shoes|
Blank | High heal Shoes| red |£22|44
Blank | Low heal Shoes|red |£22|44
Slippers|
Blank | High heal Slippers| red |£22|44
Blank | High heal Slippers| blue |£22|44
Blank | Low heal Slippers| red |£22|44
2002| |colour |Price | Quantity sold
Shoes|
Blank | High heal Shoes| red |£22|44
Blank | Low heal Shoes|red |£22|44
Slippers|
Blank | High heal Slippers| red |£22|44
Blank | High heal Slippers| blue |£22|44
Blank | Low heal Slippers| red |£22|44

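What is this kind of structure called?

I need to go through this DataFrame and get all the data for a specific item (for example, Slippers) for every year, where the year comes from the header rows (2001, 2002, and so on). It would also help to add the corresponding year alongside each data row.

How can I do this?

Source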

2017-11-13 SANM2009

Use:

import pandas as pd

df = pd.read_csv('test.csv')

# name of the first column (here '2001', the first year header)
col = df.columns[0]

# forward fill the first column so every data row carries its category label
df[col] = df[col].ffill()
# convert the first column to numeric; non-year labels become NaN
num = pd.to_numeric(df[col], errors='coerce')
# forward fill the years; the first group has no year row above it,
# so fill it with the name of the first column
df['Year'] = num.ffill().fillna(col)
# rename the columns
df = df.rename(columns={col:'Shoes', 'Unnamed: 1':'Names'})
# drop the year header rows and the category-only rows
df = df[num.isnull() & df['colour'].notnull()].reset_index(drop=True)

print (df)
           Shoes       Names  colour  price  Quantity sold  Year
0   Type A shoes  Sub type A     red     22              5  2001
1   Type A shoes  Sub type A   green     11              5  2001
2   Type A shoes  Sub type A  yellow     44              5  2001
3   Type A shoes  Sub type B     red     33              5  2001
4   Type A shoes  Sub type B   green     66              5  2001
5   Type A shoes  Sub type B  yellow     22              5  2001
6   Type B shoes  Sub type A     red     11              5  2001
7   Type B shoes  Sub type A   green     44              5  2001
8   Type B shoes  Sub type A  yellow     33              5  2001
9   Type B shoes  Sub type B     red     66              5  2001
10  Type B shoes  Sub type B   green     21              5  2001
11  Type B shoes  Sub type B  yellow     22              5  2001
12  Type A shoes  Sub type A     red     22              5  2002
13  Type A shoes  Sub type A   green     11              5  2002
14  Type A shoes  Sub type A  yellow     44              5  2002
15  Type A shoes  Sub type B     red     33              5  2002
16  Type A shoes  Sub type B   green     66              5  2002
17  Type A shoes  Sub type B  yellow     22              5  2002
18  Type B shoes  Sub type A     red     11              5  2002
19  Type B shoes  Sub type A   green     44              5  2002
20  Type B shoes  Sub type A  yellow     33              5  2002
21  Type B shoes  Sub type B     red     66              5  2002
22  Type B shoes  Sub type B   green     21              5  2002
23  Type B shoes  Sub type B  yellow     22              5  2002
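
To come back to the original question (all Slippers rows, with their year), the same steps can be followed by a plain boolean filter. The following is only a minimal sketch and not part of the original answer. The inline CSV is an assumption modelled on the sample in the question: comma-separated, with truly empty cells where the question shows Blank. The column names Category and Name are illustrative. Unlike the snippet above, it fills the first year as a number rather than a string, so that Year ends up with a single integer dtype.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file, modelled on the question's sample.
raw = """2001,,colour,Price,Quantity sold
Shoes,,,,
,High heel Shoes,red,£22,44
,Low heel Shoes,red,£22,44
Slippers,,,,
,High heel Slippers,red,£22,44
,High heel Slippers,blue,£22,44
,Low heel Slippers,red,£22,44
2002,,colour,Price,Quantity sold
Shoes,,,,
,High heel Shoes,red,£22,44
,Low heel Shoes,red,£22,44
Slippers,,,,
,High heel Slippers,red,£22,44
,High heel Slippers,blue,£22,44
,Low heel Slippers,red,£22,44
"""

df = pd.read_csv(io.StringIO(raw))

# The first column is named after the first year header ('2001').
col = df.columns[0]

# Carry each category label (and each year header) down to the rows below it.
df[col] = df[col].ffill()

# Year headers parse as numbers; category labels become NaN.
num = pd.to_numeric(df[col], errors="coerce")

# Rows above the first in-data year header belong to the year in the column name.
df["Year"] = num.ffill().fillna(int(col)).astype(int)

df = df.rename(columns={col: "Category", "Unnamed: 1": "Name"})

# Keep only real data rows: not a year header, and with a colour value.
tidy = df[num.isna() & df["colour"].notna()].reset_index(drop=True)

# All Slippers rows, each tagged with its year.
slippers = tidy[tidy["Category"] == "Slippers"]
print(slippers)
```

If the real file contains the literal text Blank instead of empty cells, those values would first need to be turned into NaN, for example with replace, before the forward fill.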

EDIT (in response to the comment):

df = pd.read_csv('testV2.csv', sep='\t')
#print (df)

# name of the first column (here '2001', the first year header)
col = df.columns[0]

# forward fill the first column so every data row carries its category label
df[col] = df[col].ffill()
# convert the first column to numeric; non-year labels become NaN
num = pd.to_numeric(df[col], errors='coerce')
# forward fill the years; the first group has no year row above it,
# so fill it with the name of the first column
df['Year'] = num.ffill().fillna(col)
# rename the columns
df = df.rename(columns={col:'Top Category', 'Unnamed: 1':'Names'})
# drop the year header rows and the rows that only contain the 'Top Category' label
df = df[num.isnull() & (df['Top Category'] != 'Top Category')].reset_index(drop=True)

print (df)

    Top Category   Names  Colour  Price  Sold  Year
0         Item 1  Type 1       -      2   NaN  2001
1         Item 2  Type 1       -      2   NaN  2001
2         Item 3  Type 1     red      2     5  2001
3         Item 3  Type 2    blue      2     5  2001
4         Item 3  Type 3   green      2     5  2001
5         item 4  Type 1     red      2     5  2001
6         item 4  Type 2    blue      3   NaN  2001
7         item 4  Type 3   green      3   NaN  2001
8         Item 1  Type 1       -      3   NaN  2002
9         Item 2  Type 1       -      3   NaN  2002
10        Item 3  Type 1     red      3     5  2002
11        Item 3  Type 2    blue      3     5  2002
12        Item 3  Type 3   green      3     5  2002
13         Item4  Type 1     red      3   NaN  2002
14         Item4  Type 2    blue      3   NaN  2002
15         Item4  Type 3   green      3   NaN  2002
16        Item 1  Type 1       -      3   NaN  2003
17        Item 2  Type 1       -      3   NaN  2003
18        Item 3  Type 1     red      3     5  2003
19        Item 3  Type 2    blue      3     5  2003
20        Item 3  Type 3   green      3     5  2003
21         Item4  Type 1     red      3   NaN  2003
22         Item4  Type 2    blue      3   NaN  2003
23         Item4  Type 3   green      3   NaN  2003
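
The year-detection step shared by both snippets may be easier to follow in isolation. The sketch below is an illustration only: the Series values and the starting year 2001 are made up to mimic a forward-filled first column, and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical first column after forward filling: labels plus one year header.
first_col = pd.Series(["Shoes", "Shoes", "Slippers", "2002", "Shoes"])

# errors="coerce" turns anything that is not a number into NaN,
# so only the year header row keeps a value.
num = pd.to_numeric(first_col, errors="coerce")

# True exactly on the year header rows (the rows to drop later).
is_year_row = num.notna()

# Spread each year downwards; rows before the first header get the
# year taken from the column name (assumed here to be 2001).
year = num.ffill().fillna(2001).astype(int)

print(is_year_row.tolist())  # [False, False, False, True, False]
print(year.tolist())         # [2001, 2001, 2001, 2002, 2002]
```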

Source

2017-11-13 10:25:38 jezrael

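Thanks. I don't understand what some of the lines are doing, so I hope you don't mind a few questions. What does this line do: df[col] = df[col].str.strip().replace('Blank', np.nan).ffill()? And in particular, what does forward fill do? – SANM2009

No problem. But if my solution does not work, the problem may be the actual format of the file. Is it possible to share a sample file with the real separator and the real blank values? – jezrael

ffill() replaces each NaN with the last known non-NaN value. So for 1, 2, NaN, NaN, 4, 7, NaN it returns 1, 2, 2, 2, 4, 7, 7. – jezrael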
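
The forward fill described in the comment above, and the chained line asked about earlier, can be tried on small Series. This is a minimal sketch: the label values are invented for illustration, and it assumes the file holds the literal text Blank where a cell is meant to be empty.

```python
import numpy as np
import pandas as pd

# The numeric example from the comment.
s = pd.Series([1, 2, np.nan, np.nan, 4, 7, np.nan])
filled = s.ffill()
print(filled.tolist())  # [1.0, 2.0, 2.0, 2.0, 4.0, 7.0, 7.0]

# The chained line, step by step, on made-up labels:
labels = pd.Series([" Shoes ", "Blank", "Blank", "Slippers", "Blank"])
cleaned = (
    labels.str.strip()              # remove surrounding whitespace
    .replace("Blank", np.nan)       # treat the literal 'Blank' as missing
    .ffill()                        # copy the last real label downwards
)
print(cleaned.tolist())  # ['Shoes', 'Shoes', 'Shoes', 'Slippers', 'Slippers']
```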
