Python multiprocessing - using workers on demand

I want to parse a website that spans multiple pages.

I don't know how many pages there are. This is the original code:

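import requests
from bs4 import BeautifulSoup

# soup already holds the parsed first page
next_button = soup.find_all('a', {'class': "btn-page_nav right"})
while next_button:
    link = next_button[0]['href']
    resp = requests.get('webpage' + link)  # 'webpage' is a placeholder, presumably the site's base URL
    soup = BeautifulSoup(resp.content, 'html.parser')
    table = soup.find('table', {'class': 'js-searchresults'})
    body = table.find('tbody')
    rows = body.find_all('tr')
    function(rows)
    next_button = soup.find_all('a', {'class': "btn-page_nav right"})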

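It works fine; function(rows) is a function that parses part of each page.

What I want is to use multiprocessing to parse these pages. I thought of using a pool of 3 workers so that 3 pages can be processed at once, but I can't figure out how to implement it. One solution is to collect the rows from every page first: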

rows_list = []
next_button = soup.find_all('a', {'class': "btn-page_nav right"})
while next_button:
    link = next_button[0]['href']
    resp = requests.get('webpage' + link)  # 'webpage' is a placeholder, presumably the site's base URL
    soup = BeautifulSoup(resp.content, 'html.parser')
    table = soup.find('table', {'class': 'js-searchresults'})
    body = table.find('tbody')
    rows = body.find_all('tr')
    rows_list.append(rows)
    next_button = soup.find_all('a', {'class': "btn-page_nav right"})

and then, once the program has looped through all the pages, hand the whole list to a pool:

import multiprocessing

pool = multiprocessing.Pool(processes=4)
pool.map(function, rows_list)

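But I don't think this would improve performance much. I want the main process to loop through the pages and, as soon as it opens a page, send the processing to a worker. How can I do that? A dummy example: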

pool = multiprocessing.Pool(processes=4)

next_button = soup.find_all('a', {'class': "btn-page_nav right"})
while next_button:
    link = next_button[0]['href']
    resp = requests.get('webpage' + link)  # 'webpage' is a placeholder, presumably the site's base URL
    soup = BeautifulSoup(resp.content, 'html.parser')
    table = soup.find('table', {'class': 'js-searchresults'})
    body = table.find('tbody')
    rows = body.find_all('tr')

    # hypothetical call, not a real Pool method: this is what I am looking for
    pool.send_to_idle_worker(rows)

    next_button = soup.find_all('a', {'class': "btn-page_nav right"})

2017-10-19 Mike

You can use the concurrent.futures package instead of multiprocessing. For example:

import concurrent.futures

with concurrent.futures.ProcessPoolExecutor() as executor:
    while next_button:
        rows = ...
        executor.submit(function, rows)
        next_button = ...

You can instantiate the executor with any number of workers, e.g. executor = ProcessPoolExecutor(max_workers=10); if max_workers is not given, it defaults to the number of cores on your machine. Further details in the Python docs.
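If the return value of function(rows) is needed, the futures returned by submit() can be kept and read afterwards. The following is only a minimal sketch of that idea: parse_rows, iter_pages and process_pages are hypothetical names, the pages are made-up lists of strings instead of scraped rows, and max_workers=3 simply mirrors the three workers mentioned in the question.

```python
import concurrent.futures


def parse_rows(rows):
    """Hypothetical stand-in for function(rows): upper-case every cell."""
    return [cell.upper() for cell in rows]


def iter_pages():
    """Hypothetical stand-in for the scraping loop: yields one page's rows at a time."""
    # In the real script each iteration would download and parse the next page.
    yield from (["a", "b"], ["c"], ["d", "e", "f"])


def process_pages(executor, pages):
    """Submit each page as soon as it is available, then collect results in page order."""
    futures = [executor.submit(parse_rows, rows) for rows in pages]
    # result() blocks until that page is done and re-raises any worker exception.
    return [future.result() for future in futures]


if __name__ == "__main__":
    # The guard matters on platforms that start workers by re-importing this module.
    with concurrent.futures.ProcessPoolExecutor(max_workers=3) as executor:
        print(process_pages(executor, iter_pages()))
```

Arguments passed to submit() are pickled on their way to the worker process, so it may be safer to pass plain strings (or the raw HTML) than BeautifulSoup objects.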

2017-10-19 10:14:17 hoefling

Could you use Pool.apply_async() instead of Pool.map()? apply_async does not block, so the main program can carry on processing more rows. It also does not require the main program to have all the data ready for mapping. Just pass one chunk as a parameter to apply_async().
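As a rough illustration of that suggestion, the sketch below hands one page at a time to the pool and only waits at the end. parse_rows and dispatch_pages are hypothetical names, and the pages are made-up lists standing in for the scraped rows; in the real loop the apply_async call would sit where function(rows) is called.

```python
import multiprocessing


def parse_rows(rows):
    """Hypothetical stand-in for function(rows): count the rows on one page."""
    return len(rows)


def dispatch_pages(pool, pages):
    """Hand each page to the pool without blocking, then wait for all results."""
    pending = []
    for rows in pages:
        # apply_async returns an AsyncResult immediately, so the loop can
        # go on fetching the next page while workers parse this one.
        pending.append(pool.apply_async(parse_rows, (rows,)))
    # get() blocks until the corresponding task has finished.
    return [result.get() for result in pending]


if __name__ == "__main__":
    # Made-up rows standing in for the <tr> elements of three pages.
    pages = [["r1", "r2"], ["r3"], ["r4", "r5", "r6"]]
    with multiprocessing.Pool(processes=3) as pool:
        print(dispatch_pages(pool, pages))
```

All results are fetched with get() before the with block ends, because leaving the block terminates the pool.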

2017-10-19 10:14:38 Hannu
