2017-07-13 10 views
1

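I am trying to read a 1.3 GB csv file with 2 columns and 19,333 rows into a pandas DataFrame in Python using 'pd.read_csv', but it keeps raising the error 'CParserError: Error tokenizing data. C error: out of memory'. I have tried many of the suggestions posted online, such as 'chunksize', but none of them seem to work, and I also get a 'kernel restarted' message. The output from running 'pd.read_csv' is shown below. How can I actually read a 1.3 GB csv file of text data into a pandas object in Python?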

import pandas as pd 
import numpy as np 
import os 

os.chdir("/home/swhan/Downloads") 

CORPUS = pd.read_csv('10k_2005_2008_file.csv') 
Traceback (most recent call last): 

    File "<ipython-input-1-8136c4f0354a>", line 7, in <module> 
    CORPUS = pd.read_csv('10k_2005_2008_file.csv') 

    File "/home/swhan/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 646, in parser_f 
    return _read(filepath_or_buffer, kwds) 

    File "/home/swhan/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 401, in _read 
    data = parser.read() 

    File "/home/swhan/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 939, in read 
    ret = self._engine.read(nrows) 

    File "/home/swhan/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 1508, in read 
    data = self._reader.read(nrows) 

    File "pandas/parser.pyx", line 848, in pandas.parser.TextReader.read (pandas/parser.c:10415) 

    File "pandas/parser.pyx", line 870, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10691) 

    File "pandas/parser.pyx", line 924, in pandas.parser.TextReader._read_rows (pandas/parser.c:11437) 

    File "pandas/parser.pyx", line 911, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:11308) 

    File "pandas/parser.pyx", line 2024, in pandas.parser.raise_parser_error (pandas/parser.c:27037) 

CParserError: Error tokenizing data. C error: out of memory 

The csv file consists of two columns: one for the ID and another for the long text associated with each ID. A subset of it looks like this:

id text 
12 python pandas read data of the form ... 
13 how to remove file does not exist error ... 
41 pandas unable to find files ... 
99 issue with python is not a simple problem ... 


Is there any way to read this file into a pandas DataFrame object? By the way, my desktop has 32 GB of RAM. Thanks in advance!

With 'chunksize':

df = pd.DataFrame() 
reader = pd.read_csv("10k_2005_2008_file.csv", chunksize=10**3) 
for chunk in reader: 
    df = pd.concat([df, chunk], ignore_index=True) 

df 
Out[6]: 
      ID            text 
0  255618 ['ITEM1.BUSINESSIn this annual report onForm10... 
1  94740 ['Item 1. Business.GeneralCommunity CapitalCor... 
2  145200 ['ITEM 1.BUSINESSGeneralCommunityBank Shares o... 
3  145201 ['ITEM 1. BUSINESSGeneralCommunity Bank Share... 
4  145202 ['Item 1. BusinessGeneralCommunity Bank Shares... 
5  145203 ['Item1.BusinessGeneralCommunityBank Shares of... 
6  221548 ['Item1.BusinessOverviewTravelzoo Inc. (the Co... 
7  121633 ['Item1. BusinessGeneralSterling Financial Cor... 
8  172796 ['Item 1. BusinessGeneralWe are a Maryland cor... 
9  172797 ['Item 1. BusinessGeneralWe are a Maryland cor... 
10  121632 ['Item 1.BusinessGeneralCompanyGrowthProfitabi... 
11  28995 ['ITEM 1. Business.(Dollars in millions)We res... 
12  28994 ['ITEM 1. Business.GeneralAt December31, 2004,... 
13  28997 ['Item1.Business.GeneralService Corporation In... 
14  28996 ['ITEM 1. Business.GeneralAt December31, 2004,... 
15  118636 ['Item1.BusinessWe are a broadcast company pri... 
16  28993 ['ITEM 1. Business.GeneralAt December31, 2004,... 
17  101760 ['ITEM1.BUSINESSCorporateProfileCognex Corpora... 
18  145752 ['Item 1: Election of Directors; Nomineesfor D... 
19  94744 ['ITEM1.BUSINESS.GeneralCommunityCapital Corpo... 
20  28999 ['Item1.Business.GeneralService Corporation In... 
21  28998 ['Item1.Business.GeneralService Corporation In... 
22  1868 ['ITEM1.BUSINESSCompany OverviewWe are a world... 
23  269745 ['Item1"BusinessThe CompanyThe 2004 Reorganiza... 
24  181343 ['ITEM 1. BUSINESSMKS Instruments, Inc. ("the... 
25  220768 ['ITEM1. BUSINESS General The Company Sierr... 
26  181345 ['Item1.BusinessMKS Instruments, Inc. (the Com... 
27  145750 ['Item1. Business BurlingtonNorthern Santa F... 
28  181346 ['Item1.BusinessMKS Instruments, Inc. (the Com... 
29  145751 ['Item 1: Election of Directors; Nominees for ... 
     ...            ... 
19303 26477 ['ITEM1.BUSINESS Precision Castparts Corp. (P... 
19304 256145 ['Item1 Business,Item1A Risk Factors, and Item... 
19305 222814 ['Item1. Business. General Our company, Rock... 
19306 73641 ['ITEM 1. BUSINESSGENERALTexas Regional Bancsh... 
19307 66997 ['ITEM 1. BUSINESSOur CompanyWe are a leading ... 
19308 66996 ['ITEM 1. BUSINESSOur CompanyWe are a leading ... 
19309 66994 ['ITEM1. BUSINESS Our Company We are a leadi... 
19310 66993 ['ITEM 1. BUSINESS Our CompanyWe are a leadi... 
19311 7929 ['Item1. Business(a)General development of bus... 
19312 114251 ['Item1.BusinessGeneralTerra Nitrogen Company,... 
19313 114250 ['Item1 BusinessGeneralTerra Nitrogen Company,... 
19314 198077 ['Item1. BusinessGeneral DescriptionTeam Finan... 
19315 162197 ["ITEM 1. BUSINESSWintrust Financial Corporati... 
19316 25524 ['Item 1. BusinessEnvironmental. Contamination... 
19317 190015 ['Item 1. Description of Business.GeneralEVCI ... 
19318 5634 ['Item 1.BusinessGeneral CDI Corp. (the Compa... 
19319 5635 ['Item 1.BusinessGeneral CDI Corp. (the Compa... 
19320 190932 ['ITEM 1. BUSINESSORGANIZATION AND GENERAL B... 
19321 190933 ['ITEM 1. BUSINESSORGANIZATION AND GENERAL B... 
19322 5632 ['Item 1.BusinessGeneral CDI Corp., (the Comp... 
19323 5633 ['Item 1.BusinessGeneral CDI Corp. (the Compa... 
19324 38349 ['Item 1. BusinessThe CompanyNatures SunshineP... 
19325 222816 ['Item1 above.Weoperate on a 52/53 week fiscal... 
19326 222815 ['Item1. Business.GeneralOur company, Rockwell... 
19327 213793 ['Item1.BusinessTvia,Inc. is a fabless semicon... 
19328 8489 ['ITEM1.BusinessCrown Crafts, Inc. (the Compan... 
19329 224247 ['Item1.Business GENERAL We are asolutions... 
19330 198076 ['Item1. BusinessGeneral DescriptionTeam Finan... 
19331 34149 ['Item1. BusinessVF Corporation, organized in ... 
19332 34148 ['Item1 in PartI, Items 5, 6, 7, 7A, 8 and 9A ... 

[19333 rows x 2 columns] 
+1

Can you show the output you get when you use 'chunksize'? – BallpointBen

+0

I have edited the post above to include the Python code with 'chunksize'. Thanks! – krcoder

+0

Use a smaller chunk size. Maybe 1000? – BallpointBen

Answers

0

Try the alternative that the pandas docs describe in your Python code. The docs say:

Note It is worth noting however, that concat (and therefore append) makes a full copy of the data, and that constantly reusing this function can create a significant performance hit. If you need to use the operation over several datasets, use a list comprehension.

frames = [ process_your_file(f) for f in files ] 
result = pd.concat(frames) 

So try it this way:

reader = pd.read_csv("10k_2005_2008_file.csv", chunksize=10**3) 
df = pd.concat([x for x in reader], ignore_index=True) 
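
The same idea can be tried on a small scale before running it against the real file. The following is only a sketch: the in-memory CSV, its contents, and the chunk size of 4 are made-up stand-ins for illustration, not the actual 10-K data. It collects the chunks in a list and calls pd.concat once at the end, rather than concatenating inside the loop.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: a tiny in-memory CSV with the
# same two-column layout (ID, text) as in the question.
rows = ["{},filing text {}\n".format(i, i) for i in range(10)]
csv_text = "ID,text\n" + "".join(rows)

# chunksize=4 is an arbitrary value for this example; with chunksize set,
# read_csv returns an iterator that yields DataFrames of at most 4 rows.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)

# Gather the chunks in a list and concatenate once, so the accumulated
# data is not copied again on every iteration of the loop.
chunks = []
for chunk in reader:
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)
print(df.shape)
```

For the real file, the path would go in place of the io.StringIO object. Note that the final DataFrame still has to fit in memory as a whole; reading in chunks only changes how the file is parsed along the way.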