Optimization

I have the following function, which I am trying to optimize. Using a profiler, I found that most of the CPU time appears to be spent in the zip() call:

import json
import os
import re

def filetxt(): 
    word_freq = {} 
    lvl1  = [] 
    lvl2  = [] 
    total_t = 0 
    users  = 0 
    text  = [] 

    for l in range(0,500): 
     # Open File 
     if os.path.exists("C:/Twitter/json/user_" + str(l) + ".json") == True: 
      with open("C:/Twitter/json/user_" + str(l) + ".json", "r") as f: 
       text_f = json.load(f) 
       users = users + 1 
       for i in range(len(text_f)): 
        text.append(text_f[str(i)]['text']) 
        total_t = total_t + 1 
     else: 
      pass 

    # Filter 
    occ = 0 
    import string 
    for i in range(len(text)): 
     s = text[i] # Sample string 
     a = re.findall(r'(RT)',s) 
     b = re.findall(r'(@)',s) 
     occ = len(a) + len(b) + occ 
     s = s.encode('utf-8') 
     out = s.translate(string.maketrans("",""), string.punctuation) 


     # Create Wordlist/Dictionary 
     word_list = text[i].lower().split(None) 

     for word in word_list: 
      word_freq[word] = word_freq.get(word, 0) + 1 

     keys = word_freq.keys() 

     numbo = range(1,len(keys)+1) 
     WList = ', '.join(keys) 
     NList = str(numbo).strip('[]') 
     WList = WList.split(", ") 
     NList = NList.split(", ") 
     W2N = dict(zip(WList, NList)) 

     for k in range (0,len(word_list)): 
      word_list[k] = W2N[word_list[k]] 
     for i in range (0,len(word_list)-1): 
      lvl1.append(word_list[i]) 
      lvl2.append(word_list[i+1])

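That zip() call, together with the join and split calls around it, is where the time seems to go. I was wondering whether I have overlooked something that would let me clean up and optimize this code, particularly around zip(). Any help is appreciated, thanks!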

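P.S. The basic purpose of this function is to load files that each contain 20 or so tweets. Somewhere around 20k to 50k files will be sent through this function. The output is a list of all the distinct words in the tweets, followed by which word links to which, e.g.: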

1 "love" 
2 "pasa" 
3 "mirar" 
4 "ants" 
5 "kers" 
6 "morir" 
7 "dreaming" 
8 "tan" 
9 "rapido" 
10 "one" 
11 "much" 
12 "la" 
... 
10 1 
13 12 
1 7 
12 2 
7 3 
2 4 
3 11 
4 8 
11 6 
8 9 
6 5 
9 20 
5 8 
20 25 
8 18 
25 9 
18 17 
9 2 
...

Source

2011-01-10 eWizardII

Hope this helps:

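import json
import string
from collections import defaultdict

# Python 2 code: use the lazy xrange where it exists, else fall back to range.
try:
    rng = xrange
except NameError:
    rng = range

def filetxt():
    users = 0
    total_t = 0
    occ = 0

    wordcount = defaultdict(int)
    # wordpairs[prev][word] counts how often `word` directly follows `prev`
    wordpairs = defaultdict(lambda: defaultdict(int))
    for filenum in rng(500):
        try:
            with open("C:/Twitter/json/user_" + str(filenum) + ".json", 'r') as inf:
                users += 1
                tweets = json.load(inf)
                total_t += len(tweets)

                # each file maps "0", "1", ... to a tweet record (as in the question)
                for txt in (r['text'] for r in tweets.values()):
                    occ += txt.count('RT') + txt.count('@')
                    prev = None  # None marks the start of a tweet
                    # str.translate(None, chars) is the Python 2 way to delete characters
                    for word in txt.encode('utf-8').translate(None, string.punctuation).lower().split():
                        wordcount[word] += 1
                        wordpairs[prev][word] += 1
                        prev = word
        except IOError:
            # missing file: skip it
            pass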
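
For readers on Python 3, where str.translate(None, ...) no longer works, the same counting idea can be sketched roughly as follows. This is only an illustration: the function name count_words_and_pairs and the sample strings are made up here, and it takes tweet texts from memory instead of reading the JSON files.

```python
import string
from collections import defaultdict

# Translation table that deletes ASCII punctuation (Python 3 idiom).
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def count_words_and_pairs(tweets):
    """Count words and (previous word -> word) pairs over an iterable of strings."""
    wordcount = defaultdict(int)
    wordpairs = defaultdict(lambda: defaultdict(int))
    for txt in tweets:
        prev = None  # None marks the start of a tweet
        for word in txt.translate(_STRIP_PUNCT).lower().split():
            wordcount[word] += 1
            wordpairs[prev][word] += 1
            prev = word
    return wordcount, wordpairs

# Hypothetical sample data, not taken from the question's files.
sample = ["I love dreaming!", "I love ants."]
counts, pairs = count_words_and_pairs(sample)
```

Here pairs[None] collects the words that start a tweet, which mirrors the prev = None convention in the answer.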

Source

2011-01-10 06:44:01

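Thanks. I basically took a bit from both you and @milkypostman, but for anyone who is looking for a short answer for whatever reason, I am marking this one as the answer. In addition, the biggest mistake I was making was calling W2N = dict(zip(WList, NList)) inside the loop: it kept rebuilding a large dictionary on every iteration and wasted CPU time. The solution was to move it outside the loop. 1000 files used to take about 5 minutes; they now take .59 seconds. – eWizardII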
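
The fix described in this comment can be sketched as below. This is a minimal Python 3 illustration and not the asker's final code: the helper names and the sample tweets are hypothetical, and the numbering relies on dictionaries keeping insertion order (guaranteed from Python 3.7).

```python
def build_mapping(word_freq):
    """Number the known words 1..n, like W2N in the question."""
    return dict(zip(word_freq, range(1, len(word_freq) + 1)))

def number_words_in_loop(tweets):
    """Costly pattern: the whole mapping is rebuilt after every tweet."""
    word_freq, rebuilds, w2n = {}, 0, {}
    for tweet in tweets:
        for word in tweet.lower().split():
            word_freq[word] = word_freq.get(word, 0) + 1
        w2n = build_mapping(word_freq)  # full pass over all words seen so far
        rebuilds += 1
    return w2n, rebuilds

def number_words_once(tweets):
    """Hoisted pattern: count first, then build the mapping a single time."""
    word_freq = {}
    for tweet in tweets:
        for word in tweet.lower().split():
            word_freq[word] = word_freq.get(word, 0) + 1
    return build_mapping(word_freq), 1

tweets = ["love one much", "one love la"]  # hypothetical sample
slow_map, slow_builds = number_words_in_loop(tweets)
fast_map, fast_builds = number_words_once(tweets)

# With the mapping built once, the tweets are numbered in a second pass.
encoded = [[fast_map[w] for w in t.lower().split()] for t in tweets]
```

Both versions end with the same mapping, but the first one does work proportional to the total vocabulary for every tweet, which is what grows painful over thousands of files.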

A few things. These lines look strange to me when put together:

WList = ', '.join(keys) 
<snip> 
WList = WList.split(", ")

That should simply be WList = list(keys).
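
As a small illustration of the difference (the dictionary contents are hypothetical): joining the keys into one string and splitting it again is extra work, and it would also break if a key ever contained ", ". In the question the words come from split(), so they contain no spaces and the round trip happens to be lossless there.

```python
word_freq = {"love": 2, "one": 1, "hello, world": 1}  # hypothetical counts

# Round trip through a string: splits any key that contains ", ".
round_trip = ", ".join(word_freq.keys()).split(", ")

# Direct copy of the keys: no string building, no parsing.
direct = list(word_freq.keys())
```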

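Are you sure you need to optimize this? Is it really so slow that it is worth your time? Finally, a description of what the script is supposed to do would be great, instead of making us decipher it from the code :)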

Source

2011-01-10 03:55:26 orlp

Thanks, I will add an explanation. The problem is that I am calling this function across something like 20k to 50k files, and each file contains about 20 strings, i.e. 20 tweets. With 1000 files it takes about 5 minutes to run. – eWizardII

I took the liberty of changing your code into something closer to what I would write. I think you want something like this:

import json
import re
import string
from itertools import izip  # Python 2; on Python 3 the built-in zip is already lazy

def filetxt():
    # keeps track of word count for each word.
    word_freq = {}
    # list of words which we've found
    word_list = []
    # mapping from word -> index in word_list
    word_map = {}
    lvl1 = []
    lvl2 = []
    total_t = 0
    users = 0
    text = []

    ####### You should replace this with a glob (see: glob module)
    for l in range(0, 500):
        # Open File
        try:
            with open("C:/Twitter/json/user_" + str(l) + ".json", "r") as f:
                text_f = json.load(f)
                users = users + 1
                # in this file there are multiple tweets so add the text
                # for each one.
                for t in text_f.itervalues():
                    # CHECK THIS: assumes each value is a record with a
                    # 'text' key, as in the question's text_f[str(i)]['text']
                    text.append(t['text'])
        except IOError:
            pass

    total_t = len(text)
    # Filter
    occ = 0
    for s in text:
        a = re.findall(r'(RT)', s)
        b = re.findall(r'(@)', s)
        occ += len(a) + len(b)
        s = s.encode('utf-8')
        # NOTE: as in the question, `out` is computed but never used below
        out = s.translate(string.maketrans("", ""), string.punctuation)

        # make a list of words that are in the text s
        words = s.lower().split(None)

        for word in words:
            # try/except is quicker when we expect not to miss
            # and it will be rare for us not to have
            # a word in our list already.
            try:
                word_freq[word] += 1
            except KeyError:
                # we've never seen this word before so add it to our list
                word_freq[word] = 1
                word_map[word] = len(word_list)
                word_list.append(word)

        # little trick to get each word and the word that follows
        for curword, nextword in izip(words, words[1:]):
            lvl1.append(word_map[curword])
            lvl2.append(word_map[nextword])
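
The comment in the code about using the glob module could look roughly like this in Python 3. It is a sketch under assumptions: the helper name load_tweet_texts is made up, the files are assumed to be named user_*.json, and each file is assumed to map "0", "1", ... to a record with a 'text' key, as the question's code implies.

```python
import glob
import json
import os

def load_tweet_texts(directory):
    """Return (number_of_files, list_of_tweet_texts) for user_*.json in directory."""
    pattern = os.path.join(directory, "user_*.json")
    paths = sorted(glob.glob(pattern))  # only files that actually exist
    texts = []
    for path in paths:
        with open(path, "r", encoding="utf-8") as f:
            tweets = json.load(f)  # assumed shape: {"0": {"text": ...}, ...}
        texts.extend(record["text"] for record in tweets.values())
    return len(paths), texts
```

Because glob only returns existing files, there is no need to probe 500 candidate names or to catch IOError for the missing ones.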

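Here is how the rewritten function works. lvl1 gives you a list of numbers that correspond to words in word_list, so word_list[lvl1[0]] is the first word of the first tweet you processed. lvl2[0] is the index of the word that follows lvl1[0], so word_list[lvl2[0]] is the word that follows word_list[lvl1[0]]. The code basically builds word_map, word_list and word_freq.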
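
To make the relationship between word_list, word_map, lvl1 and lvl2 concrete, here is a stripped-down Python 3 sketch that works on in-memory strings. The function name build_pairs and the sample tweets are hypothetical; file loading, punctuation handling and the frequency count are left out.

```python
def build_pairs(tweets):
    word_list = []   # index -> word
    word_map = {}    # word -> index in word_list
    lvl1, lvl2 = [], []
    for tweet in tweets:
        words = tweet.lower().split()
        for word in words:
            if word not in word_map:
                # first time we see this word: give it the next index
                word_map[word] = len(word_list)
                word_list.append(word)
        # each word paired with the word that follows it in the same tweet
        for cur, nxt in zip(words, words[1:]):
            lvl1.append(word_map[cur])
            lvl2.append(word_map[nxt])
    return word_list, word_map, lvl1, lvl2

word_list, word_map, lvl1, lvl2 = build_pairs(["one love much", "love la"])
first_pair = (word_list[lvl1[0]], word_list[lvl2[0]])
```

Looking up lvl1[k] and lvl2[k] in word_list recovers the k-th pair of adjacent words.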

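Please note that the way you were doing this before, specifically the way you were creating W2N, will not work properly. Dictionaries do not maintain order. Ordered dictionaries are coming in 3.1, but forget about that for now. Basically, the result of word_freq.keys() changed every time you added a new word, so the numbering was not consistent. See this example: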

>>> x = dict() 
>>> x[5] = 2 
>>> x 
{5: 2} 
>>> x[1] = 24 
>>> x 
{1: 24, 5: 2} 
>>> x[10] = 14 
>>> x 
{1: 24, 10: 14, 5: 2} 
>>>

So 5 used to come first, but now it is third.
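
One way to avoid depending on key order at all is to hand out a number the first time a word is seen and never change it afterwards. The sketch below is Python 3 and the helper name assign_id is made up. As a side note, dictionaries keep insertion order from Python 3.7 on, so the session above reflects older interpreters; fixed ids are still the safer choice because they do not depend on that.

```python
def assign_id(word_ids, word):
    """Return the id for word, assigning the next free one on first sight."""
    # len(word_ids) is evaluated before the insert, so ids run 0, 1, 2, ...
    return word_ids.setdefault(word, len(word_ids))

word_ids = {}
first = [assign_id(word_ids, w) for w in ["tan", "rapido", "tan"]]
later = [assign_id(word_ids, w) for w in ["la", "tan", "rapido"]]
```

Words that were numbered in the first batch keep the same ids in the second batch, no matter how many new words arrive in between.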

I also updated it to use 0-based indexes instead of 1-based indexes. I am not sure why you were using range(1, len(...)+1) rather than range(len(...)).

You should stop thinking about for loops in the traditional C/C++/Java sense, where you loop over numbers. Unless you actually need the index number, assume you do not need it.

Rule of thumb: if you need an index, you probably also need the element at that index, so you should be using enumerate anyway.
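
A short illustration of that rule of thumb, using a few words from the question's sample output (the variable names are just for this sketch):

```python
words = ["love", "pasa", "mirar"]

# Index-based loop in the C style.
c_style = []
for i in range(len(words)):
    c_style.append((i, words[i]))

# enumerate yields the index and the element together.
pythonic = list(enumerate(words))

# When only neighbouring elements are needed, no index is required at all.
pairs = list(zip(words, words[1:]))
```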

Source

2011-01-10 05:26:56 milkypostman

After reading the discussion about dictionaries, I realized that the worst mistake I was making was calling 'W2N = dict(zip(WList, NList))' multiple times inside the loop. – eWizardII
