openpyxl readonly use_iterators

I have some huge Excel files, but I am getting bogged down even by the "modest" ones (50 MB). I need to skip the first two rows, but I don't think that is what causes the slowdown. Can anyone think of something else?

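import time
from itertools import islice

import numpy as np
from openpyxl import load_workbook

MyFile = 'MyFile.xlsx'  # placeholder path to the workbook

wb = load_workbook(MyFile, read_only=True)
ws = wb.active

NDepth = ws.max_row - 2     # data rows (the first two rows are skipped)
NTime = ws.max_column - 1   # value columns (the first column holds the depth)

Local_Depth = np.zeros((NDepth,))
Local_Temp = np.zeros((NDepth, NTime))

# skip the first two rows
iterlist = islice(ws.iter_rows(), 2, None)

start = time.time()

for i, row in enumerate(iterlist):
    Local_Depth[i] = row[0].value
    for j, col in enumerate(row[1:]):
        Local_Temp[i, j] = col.value

print("Done %s" % (time.time() - start))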

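Loading a file can take more than 7 minutes on an M4700 Dell Precision. That is for about 8000 rows and 800 columns. Surely I must be doing something wrong? Is there some other tweak I need in my Python 2.7 setup?

Thanks, John

Source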

2016-03-26 Tunneller

I would use the pandas module for that: it is very easy to use and relies on very nice, fast algorithms internally. – MaxU

7 * 60 / (8000 * 800) ~= 0.066 milliseconds/cell. That doesn't seem too bad. –
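The per-cell estimate in the comment above can be checked with a few lines of arithmetic. The figures are the approximate ones quoted in the question (7 minutes, 8000 rows, 800 columns); the variable names are only illustrative.

```python
# Rough per-cell cost, using the approximate figures from the question.
rows, cols = 8000, 800
load_seconds = 7 * 60

# seconds per cell, converted to milliseconds
ms_per_cell = 1000.0 * load_seconds / (rows * cols)

print("%.3f ms per cell" % ms_per_cell)
```

That works out to roughly 66 microseconds per cell, which is the number the comment is reasoning about.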

Have you tried running the code without iterators, i.e. with 'use_iterators = False'? –

I would try pandas for this job. It is very easy to use and gives you a lot of power. (Run from IPython, it took about 2 minutes on my home laptop.)

import time 
import numpy as np 
import pandas as pd 

# let's generate some sample data (8000 rows, 800 columns) 
data = np.random.randint(0, 100, (8000, 800)) 

# let's generate column names from 'col001' to 'col800' 
cols = ['col{0:03d}'.format(i) for i in range(1,801)] 

# generating Pandas data frame from numpy array 
df = pd.DataFrame(data, columns=cols) 

# write generated DF (Data Frame) to Excel file 
df.to_excel(r'd:/temp/sample.xlsx', index=False) 
# we are done with sample data 

##################################################################### 
# 
# interesting part starts here ... 
# 
##################################################################### 

start = time.time() 

# read up the Excel file (skipping first two rows) 
df = pd.read_excel(r'd:/temp/sample.xlsx', skiprows=2) 

print("Done %s" % (time.time() - start))

# print the shape of our DF 
print(df.shape)

Output of this small example:

In [24]: %paste 
start = time.time() 

# read up the Excel file 
df = pd.read_excel(r'd:/temp/sample.xlsx', skiprows=2) 

print "Done", time.time()-start 
## -- End pasted text -- 
Done 124.375999928 

In [25]: 

In [25]: df.shape 
Out[25]: (7998, 800) 

In [26]: # print the shape of out DF 

In [27]: print(df.shape) 
(7998, 800)

Now you have all the data in memory as a DataFrame (DF) and can work with it very comfortably using all the power of pandas.

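P.S. You will need to install the following Python modules:

numpy
pandas
openpyxl or XlsxWriter (or both)
xlrd

Source

2016-03-26 15:46:14 MaxU

The performance when reading large Excel files is very similar for openpyxl and xlrd (which is what pandas uses internally for reading). In both cases the data is stored row by row, so pandas has to convert rows into columns before it can do anything with them. For simple operations, therefore, this will not run any faster. Of course, pandas is faster for aggregate operations. –

-1

Given that the Excel file is a rectangular sheet, I get a huge speed-up by going directly to the zip archive. Attached is a piece of code for a generator that returns a series of rows. Each row then has to be parsed to find the actual values between <v> and </v>. I would rather not go down this road, though: it looks very un-Pythonic.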

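import zipfile


def rowList():
    """Yield the sheet's dimension string, then the raw XML of each row."""
    # bytes literals keep the comparisons valid on both Python 2.7 and 3
    with zipfile.ZipFile('MyFile.xlsx', mode='r') as z:
        with z.open('xl/worksheets/sheet1.xml', 'r') as f:
            # read chunks until the start of <sheetData> shows up
            irow = -1
            while irow < 0:
                hstring = f.read(50000)
                if hstring == b"":
                    break
                irow = hstring.find(b"sheetData")
            if irow < 0:
                return
            # the text after '<dimension ref="' holds the cell range
            ist = hstring.find(b"<dimension")
            string = hstring[ist+16:ist+50]
            itl = string.find(b"/>")
            yield string[:itl-1]
            # skip past 'sheetData>' and start splitting on </row>
            string = hstring[irow+10:]
            while True:
                irow = string.find(b"</row>")
                while irow < 0:
                    # the row is incomplete: append another chunk
                    hstring = f.read(50000)
                    if hstring == b"":
                        break
                    string += hstring
                    irow = string.find(b"</row>")
                if irow < 0:
                    return
                irow += 6
                # yield everything from the first <c up to </row>
                ist = string.find(b"<c")
                yield string[ist:irow-6]
                string = string[irow:]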

This cuts the computation time from minutes to seconds.

I wonder whether openpyxl has something similar: a parameter that tells it you are opening a very flat file.

Source

2016-03-27 06:08:02 Tunneller

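xlsx files are always zip archives, but this approach is very inflexible and error-prone. For example, the dimension tag is quite likely to be missing. That is why you should use some kind of XML library to read the source, and as it turns out iterparse is very fast. I suspect the speed difference you are observing is caused by something else entirely. –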
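As a rough illustration of the iterparse suggestion in the comment above, here is a minimal sketch that streams the worksheet XML out of the zip archive with the standard library. The function name `iter_row_values` is hypothetical, and the sheet path `xl/worksheets/sheet1.xml` is taken from the hand-written parser in this thread. It assumes every cell stores its value inline in a `<v>` element; for cells that use the shared-strings table, the `<v>` text is an index rather than the value itself.

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespace used by SpreadsheetML worksheet parts.
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def iter_row_values(xlsx_path, sheet_path="xl/worksheets/sheet1.xml"):
    """Yield one list of raw <v> strings per <row> element (hypothetical helper)."""
    with zipfile.ZipFile(xlsx_path) as archive:
        with archive.open(sheet_path) as stream:
            # iterparse reports each element once its closing tag has been read
            for _event, elem in ET.iterparse(stream, events=("end",)):
                if elem.tag == NS + "row":
                    yield [v.text for v in elem.iter(NS + "v")]
                    elem.clear()  # drop the finished row's cells to save memory
```

Each yielded list could then be converted with `float` and written into a NumPy row, as the other snippets in this thread do. Unlike the string-slicing approach, this does not depend on the dimension tag or on chunk boundaries.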

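Throwing in the column parsing ... I get 105 seconds for a complete data read. That is not much faster than the pandas example. I would prefer not to go with the hack above, but it does help to clarify where the bottleneck is. openpyxl was very fast at finding the dimensions. I was curious where it found them, which is why I left the <dimension search> in the pseudocode above. – Tunneller

If you run a profiler you will see that the time goes into converting the string value of each cell into the appropriate Python value (number, boolean, datetime, string or formula). There is also overhead associated with instantiating Cell objects, which a custom parser can avoid. But you should get the biggest boost by using PyPy. –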
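For readers who want to try the profiling suggestion above, here is a small sketch using the standard-library `cProfile` and `pstats` modules. `profile_call` and `convert_cells` are hypothetical names; `convert_cells` is only a stand-in workload (string-to-float conversion), not openpyxl's actual conversion code. To profile a real load, pass your own loading function instead.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    # show the ten entries with the largest cumulative time
    stats.sort_stats("cumulative").print_stats(10)
    return result, report.getvalue()


def convert_cells(raw_values):
    # stand-in workload: turn cell strings into numbers
    return [float(text) for text in raw_values]


values, report = profile_call(convert_cells, ["1.5"] * 100000)
print(report)
```

Sorting by cumulative time puts the call chain that dominates the run at the top, which is usually enough to tell whether the time goes into XML parsing, value conversion or object creation.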

I got it down to 22 seconds:

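import time
import zipfile

import numpy as np
from openpyxl import load_workbook

MyFile = 'MyFile.xlsx'  # placeholder path to the workbook


def rowList(fullfilename):
    """Yield the raw XML of each row, read straight from the zip archive."""
    # bytes literals keep the comparisons valid on both Python 2.7 and 3
    with zipfile.ZipFile(fullfilename, mode='r') as z:
        with z.open('xl/worksheets/sheet1.xml', 'r') as f:
            # read chunks until the start of <sheetData> shows up
            irow = -1
            while irow < 0:
                hstring = f.read(50000)
                if hstring == b"":
                    break
                irow = hstring.find(b"sheetData")
            if irow < 0:
                return
            string = hstring[irow+10:]
            while True:
                irow = string.find(b"</row>")
                while irow < 0:
                    # the row is incomplete: append another chunk
                    hstring = f.read(50000)
                    if hstring == b"":
                        break
                    string += hstring
                    irow = string.find(b"</row>")
                if irow < 0:
                    return
                irow += 6
                ist = string.find(b"<c")
                yield string[ist:irow-6]
                string = string[irow:]


def splitRow(func, row):
    """Yield func(text) for the text between each <v> and </v> in row."""
    c1 = row.find(b"<v")
    c2 = 0
    while c1 > 0:
        c1 += c2 + 3                        # absolute index just past '<v>'
        c2 = c1 + row[c1:].find(b"</v")     # absolute index of '</v'
        yield func(row[c1:c2])
        c2 += 3
        c1 = row[c2:].find(b"<v")           # offset of the next '<v', or -1


start = time.time()

# openpyxl is only used to look up the sheet dimensions.
# use_iterators=True is the older spelling of read_only=True and is
# not accepted by newer openpyxl releases, so it is left out here.
wb = load_workbook(MyFile, read_only=True)
ws = wb.active
NDepth = ws.max_row - 2
NTime = ws.max_column - 1
wb._archive.close()

Local_Store = np.empty((NDepth, NTime+1))
Local_Time = np.empty((NTime,))

print("%d %d" % (NDepth, NTime))
print("Data Accessed via Iterators %s" % (time.time() - start))

start = time.time()

print("About to call RowList")

i = -2
for row in rowList(MyFile):
    if i == -1:
        # second row: the time axis
        Local_Time[:] = list(splitRow(float, row))
    elif i >= 0:
        # data rows; the first row (i == -2) is skipped
        Local_Store[i, :] = list(splitRow(float, row))
    i += 1

print("%d Rows Parsed %s" % (i, time.time() - start))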
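The index arithmetic in splitRow can also be expressed with a regular expression. The following is only a sketch of an alternative and has not been timed against the version above; `split_row_re` and `VALUE_RE` are hypothetical names. It assumes each value sits in a plain `<v>...</v>` element with no attributes.

```python
import re

# Non-greedy match of the text between <v> and </v>.
VALUE_RE = re.compile(r"<v>(.*?)</v>")


def split_row_re(func, row):
    """Return func(text) for every <v>...</v> value in one row fragment."""
    if isinstance(row, bytes):
        # fragments read from the zip archive arrive as bytes on Python 3
        row = row.decode("utf-8")
    return [func(text) for text in VALUE_RE.findall(row)]


row = '<c r="A3"><v>1.5</v></c><c r="B3"><v>-2</v></c>'
print(split_row_re(float, row))
```

It returns a list rather than a generator, so the result can be assigned to a NumPy row directly.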

It depends on what you are going to do with the data read from this Excel file... Personally, I would use the pandas module for that.

Source

2016-03-29 21:32:46 Tunneller

To parse a row, I tried root = ET.fromstring(row); mylist = root.itertext(); floaty = map(float, mylist), but that came back slower than a dog again. The call to openpyxl to get the dimensions took only a hundredth of a second. – Tunneller
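For reference, the ElementTree variant described in the comment above might look like the sketch below. `parse_row_fragment` is a hypothetical name. The row fragments produced by rowList contain several sibling `<c>` elements, so the sketch wraps them in a single root element before parsing; that wrapper is an assumption about how the fragment was fed to `ET.fromstring`, since the comment does not say.

```python
import xml.etree.ElementTree as ET


def parse_row_fragment(fragment, convert=float):
    """Parse one row fragment with ElementTree and convert its cell values."""
    if isinstance(fragment, bytes):
        fragment = fragment.decode("utf-8")
    # a well-formed document needs exactly one root element
    root = ET.fromstring("<row>" + fragment + "</row>")
    # itertext() walks every text node, i.e. the contents of each <v>
    return [convert(text) for text in root.itertext()]


print(parse_row_fragment('<c r="A3"><v>1.5</v></c><c r="B3"><v>2</v></c>'))
```

This builds a full element tree for every row, which is consistent with the comment's report that it was much slower than plain string searching.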
