2017-06-07 6 views
read_csv when the whitespace separation is not uniform

I'm trying to load this text file (philadelphia.txt) into a pandas DataFrame:

STATION   STATION_NAME          DATE  TAVG  TMAX  TMIN  
----------------- -------------------------------------------------- -------- -------- -------- -------- 
GHCND:USW00094732   PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970605 -9999 74  47  
GHCND:USW00094732   PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970606 -9999 68  50  
GHCND:USW00094732   PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970608 -9999 72  50  
GHCND:USW00094732   PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970609 -9999 83  47  
GHCND:USW00094732   PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970610 -9999 86  55  
GHCND:USW00094732   PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970611 -9999 88  61  
GHCND:USW00094732   PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970612 -9999 83  70  
GHCND:USW00094732   PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970613 -9999 80  66  
GHCND:USW00094732   PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970614 -9999 80  64  
GHCND:USW00094732   PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970615 -9999 77  55  
GHCND:USW00094732   PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970616 -9999 79  49 

However, when I use

data = pd.read_csv('philadelphia.txt', sep=r"\s+", header=0)

the header is parsed correctly, but the station name data gets split. I want the whole name to sit under the "STATION_NAME" column, but sep="\s+" splits it on every run of spaces, which produces this error:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 6 fields in line 3, saw 11 

How can I split the data into six columns without splitting the station name into individual words?

I may also pass in other text files with different station names (such as yellowknife.txt):

STATION   STATION_NAME          DATE  TMAX  TMIN  
----------------- -------------------------------------------------- -------- -------- -------- 
GHCND:CA002204101         YELLOWKNIFE A CA 20130117 -21  -35  
GHCND:CA002204101         YELLOWKNIFE A CA 20130118 -15  -21  
GHCND:CA002204101         YELLOWKNIFE A CA 20130119 -17  -29  
GHCND:CA002204101         YELLOWKNIFE A CA 20130120 -18  -28  
GHCND:CA002204101         YELLOWKNIFE A CA 20130121 -21  -34  
GHCND:CA002204101         YELLOWKNIFE A CA 20130122 -16  -30  
GHCND:CA002204101         YELLOWKNIFE A CA 2013-17  -28  
GHCND:CA002204101         YELLOWKNIFE A CA 20130124 -5  -17  

Answer

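Use the read_fwf() method, which reads fixed-width columns instead of splitting on a delimiter. The .drop(0) call removes the row of dashes that follows the header: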

In [7]: df = pd.read_fwf(r'/path/to/file.csv').drop(0) 

In [8]: df 
Out[8]: 
       STATION        STATION_NAME  DATE TAVG TMAX TMIN 
1 GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970605 -9999 74 47 
2 GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970606 -9999 68 50 
3 GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970608 -9999 72 50 
4 GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970609 -9999 83 47 
5 GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970610 -9999 86 55 
6 GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970611 -9999 88 61 
7 GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970612 -9999 83 70 
8 GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970613 -9999 80 66 
9 GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970614 -9999 80 64 
10 GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970615 -9999 77 55 
11 GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970616 -9999 79 49 

Columns:

In [9]: df.columns.tolist() 
Out[9]: ['STATION', 'STATION_NAME', 'DATE', 'TAVG', 'TMAX', 'TMIN'] 
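
If the automatic column inference of read_fwf() ever guesses wrong on another station file, one option is to take the column boundaries from the row of dashes, since each run of dashes spans one column. The following is only a sketch under assumptions: the sample text is built in memory with made-up column widths (17, 30, 8, 8, 8), and the real files may be laid out differently.

```python
import io
import re

import pandas as pd

# Hypothetical sample built in memory so the column widths are exact;
# a real file would be read from disk instead.
widths = [17, 30, 8, 8, 8]
rows = [
    ("STATION", "STATION_NAME", "DATE", "TMAX", "TMIN"),
    tuple("-" * w for w in widths),
    ("GHCND:CA002204101", "YELLOWKNIFE A CA", "20130117", "-21", "-35"),
    ("GHCND:CA002204101", "YELLOWKNIFE A CA", "20130118", "-15", "-21"),
]
text = "\n".join(
    " ".join(f"{value:<{w}}" for value, w in zip(row, widths)) for row in rows
)

# Each run of dashes on the second line marks the extent of one column.
dash_line = text.splitlines()[1]
colspecs = [m.span() for m in re.finditer(r"-+", dash_line)]

# skiprows=[1] skips the dash line; the first line still supplies the header.
df = pd.read_fwf(io.StringIO(text), colspecs=colspecs, skiprows=[1])
print(df)
```

Because the boundaries come from the file itself, the same code should cope with files that have a different number of columns (for example with or without TAVG), as long as each file carries its own row of dashes.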