Efficiently extracting a subset of a large file in Python

I have a large file with several million lines. I want to extract a smaller subset (250,000 lines) from it, uniformly at random. I ran the following code, but it is surprisingly slow, in fact unusably slow. What can I do to speed it up?

import numpy as np

def get_shorter_subset(fname, new_len):
    """Extract a random shorter subset of length new_len from a given file"""
    out_lines = []
    with open(fname + "short.out", 'w') as out_file:
        with open(fname, 'r') as in_file:
            all_lines = in_file.readlines()
            total = len(all_lines)
            print("Total lines:", total)
            for i in range(new_len):
                line = np.random.choice(all_lines)
                out_lines.append(line.rstrip('\t\r\n'))
                #out_file.write(line.rstrip('\t\r\n'))
                print("Done with", i, "lines")
                all_lines.remove(line)
        out_file.write("\n".join(out_lines))

Source

2017-11-07 user3079275

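So, the problem:

all_lines = in_file.readlines() reads every line into memory, which is probably not the best way to do this... but if you are going to do it that way, then definitely don't do this: all_lines.remove(line). That is an O(N) operation, and you do it inside a loop, which gives you quadratic complexity.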

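I suspect you would get a huge performance improvement by simply doing something to the effect of:

# total = len(all_lines); out_file is the open output file, as in the question
idx = np.arange(total, dtype=np.int32)
idx = np.random.choice(idx, size=new_len, replace=False)  # new_len distinct indices
for i in idx:
    out_file.write(all_lines[i])

Source

2017-11-07 22:58:05

You could also try using mmap:

https://docs.python.org/3.6/library/mmap.html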
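The answer only links to the mmap documentation, so here is one possible way it could be applied; this is a rough sketch under assumptions, not the answerer's code. The function name `sample_lines_mmap`, its parameters, and the idea of building an index of line-start offsets are all hypothetical choices made for illustration. It maps the file read-only, records where each line starts, samples line numbers without replacement, and copies only the chosen lines.

```python
import mmap
import random


def sample_lines_mmap(fname, out_name, new_len, seed=None):
    """Write new_len distinct, uniformly chosen lines of fname to out_name.

    Hypothetical helper: returns the number of lines found in fname.
    Note that mmap cannot map an empty file, so fname must be non-empty.
    """
    rng = random.Random(seed)
    with open(fname, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        size = len(mm)
        # Record the byte offset at which each line starts.
        offsets = [0]
        pos = mm.find(b"\n")
        while pos != -1:
            offsets.append(pos + 1)
            pos = mm.find(b"\n", pos + 1)
        if offsets[-1] == size:
            offsets.pop()      # file ends with a newline: no line starts at EOF
        offsets.append(size)   # sentinel marking the end of the last line
        n_lines = len(offsets) - 1

        # Sample distinct line numbers; sorting keeps the reads sequential.
        chosen = sorted(rng.sample(range(n_lines), new_len))
        with open(out_name, "wb") as out:
            for i in chosen:
                line = mm[offsets[i]:offsets[i + 1]]
                # Make sure every written line is newline-terminated.
                out.write(line if line.endswith(b"\n") else line + b"\n")
    return n_lines
```

The offset index costs one integer per line rather than one string per line, which is the main reason to prefer this over readlines() when the file is large. The chosen lines are written in file order; shuffle `chosen` instead of sorting it if a random output order matters.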

出典

2017-11-07 22:59:30 mamcx

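You read all of the lines in, hold them in memory, and then perform 250K expensive operations on the resulting list. Every time you remove a line from the list, Python has to search for it and shift all of the remaining entries, which is O(N).

Instead, simply take a random sample. For example, if you have 5 million lines, you want 5% of the file. Read the file one line at a time. Roll a random floating-point number. If it is <= 0.05, write that line to the output.

With a sample this large, the output will end up very close to the desired size, though not necessarily exactly 250,000 lines.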
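The per-line coin flip described above might be sketched as follows. This is only an illustration: the function name `sample_lines_bernoulli` and its parameters are made up here, and the number of lines kept is random, so it will only be approximately fraction times the line count.

```python
import random


def sample_lines_bernoulli(in_name, out_name, fraction, seed=None):
    """Keep each line independently with probability `fraction`.

    Hypothetical helper: returns how many lines were written.
    """
    rng = random.Random(seed)
    kept = 0
    with open(in_name, "r") as in_file, open(out_name, "w") as out_file:
        # Iterating over the file object reads one line at a time,
        # so memory use stays constant regardless of file size.
        for line in in_file:
            if rng.random() < fraction:
                out_file.write(line)
                kept += 1
    return kept
```

For the numbers in the question, a call such as sample_lines_bernoulli('input.txt', 'input.txt.short', 250000 / 5000000) would keep roughly 250,000 of 5 million lines in a single pass.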

出典

2017-11-07 22:59:41 Prune

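Make use of the Python numpy library. The numpy.random.choice() function provides the functionality you need. It draws a sample of lines of the required size in a single call. So your function would look like this: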

import numpy as np

def get_shorter_subset(fname, new_len):
    """Extract a random shorter subset of length new_len from a given file"""

    with open(fname + " short.out", 'w') as out_file, open(fname, 'r') as in_file:
        # replace=False samples without replacement, so no line is picked twice
        out_file.write(''.join(np.random.choice(list(in_file), new_len, replace=False)))

get_shorter_subset('input.txt', 250000)

Source

2017-11-07 23:01:24

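Thanks for the answers. I went with a solution that generates a random number for each index and, based on it, either picks or discards each element (with a probability corresponding to new_size/full_size). So the code is: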

import numpy as np

def get_shorter_subset(fname, new_len):
    """Extract a random shorter subset of length new_len from a given
    file"""
    out_lines = []
    with open(fname + "short.out", 'w') as out_file:
        with open(fname, 'r') as in_file:
            all_lines = in_file.readlines()
            total = len(all_lines)

            # Keep each line with probability 1/freq (integer division)
            freq = total // new_len + 1
            print("Total lines:", total, "new freq:", freq)
            for i, line in enumerate(all_lines):
                t = np.random.randint(1, freq + 1)
                if t == 1:
                    out_lines.append(line.rstrip('\t\r\n'))
                #out_file.write(line.rstrip('\t\r\n'))
                if i % 10000 == 0:
                    print("Done with", i, "lines")

        out_file.write("\n".join(out_lines))
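Picking each line independently yields only approximately new_len lines. If exactly new_len lines are needed, one variation is selection sampling, which is a different technique from the code in this post: each line is kept with probability (lines still needed) / (lines still remaining). The sketch below is a minimal illustration under that assumption; the name `sample_lines_exact` and its parameters are hypothetical, and it reads the file twice (once to count lines, once to sample).

```python
import random


def sample_lines_exact(in_name, out_name, new_len, seed=None):
    """Write exactly new_len uniformly chosen lines, in file order.

    Hypothetical helper using selection sampling; returns the total
    number of lines in in_name.
    """
    rng = random.Random(seed)
    # First pass: count the lines without holding them in memory.
    with open(in_name, "r") as in_file:
        total = sum(1 for _ in in_file)
    if new_len > total:
        raise ValueError("new_len exceeds the number of lines in the file")

    needed = new_len
    with open(in_name, "r") as in_file, open(out_name, "w") as out_file:
        for seen, line in enumerate(in_file):
            if needed == 0:
                break  # sample is complete; skip the rest of the file
            remaining = total - seen
            # Keep this line with probability needed / remaining.
            if rng.random() * remaining < needed:
                out_file.write(line)
                needed -= 1
    return total
```

When the number of lines remaining equals the number still needed, the condition is always true, which is what guarantees the output has exactly new_len lines.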

Source

2017-11-08 19:03:47 user3079275
