Beautifulsoup parse thousand pages

I have a script that parses a list of several thousand URLs. My problem is that working through that list takes ages.

Each URL request takes about 4 seconds before the page has loaded and can be parsed. Is there a way to parse a really large number of URLs quickly?

from bs4 import BeautifulSoup
import requests

# read the URL list
with open('urls.txt') as f:
    content = f.readlines()
# remove whitespace characters
content = [line.strip('\n') for line in content]

# loop through the URL list and get information
for i in range(5):
    try:
        for url in content:

            # get information
            link = requests.get(url)
            data = link.text
            soup = BeautifulSoup(data, "html5lib")

            # just example scraping
            name = soup.find_all('h1', {'class': 'name'})
    except requests.RequestException as exc:
        # the snippet was cut off before its except clause; this handler is an assumption
        print(exc)

EDIT: How to handle asynchronous requests with hooks in this example

Would my code look something like this? I tried the following, as described on this site in "Asynchronous Requests with Python requests":

from bs4 import BeautifulSoup 
import grequests 

def parser(response):
    for url in urls:

        # get information
        link = requests.get(response)
        data = link.text
        soup = BeautifulSoup(data, "html5lib")

        # just example scraping
        name = soup.find_all('h1', {'class': 'name'})

#read urls.txt and store in list variable 
with open('urls.txt') as f: 
    urls= f.readlines() 
# you may also want to remove whitespace characters 
urls = [line.strip('\n') for line in urls] 

# A list to hold our things to do via async 
async_list = [] 

for u in urls: 
    # The "hooks={...}" part is where you define what you want to do
    #
    # Note the lack of parentheses following parser, this is
    # because the response will be used as the first argument automatically
    rs = grequests.get(u, hooks={'response': parser})

    # Add the task to our list of things to do via async 
    async_list.append(rs) 

# Do our list of things to do via async 
grequests.map(async_list, size=5)

This doesn't work for me. I get no errors in the console, and it runs for a long time until it stops.
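
A possible explanation for the silent failure, offered as a sketch rather than a confirmed diagnosis: requests calls a response hook as hook(response, **kwargs), so a hook declared as parser(response) raises a TypeError, and grequests.map stores exceptions instead of printing them unless it is given an exception_handler. The hook also already receives the downloaded page, so it should not need to loop over urls or call requests.get again. The version below uses "html.parser" in place of "html5lib" only to avoid the extra dependency, and scraped and report_failure are hypothetical names:

```python
from bs4 import BeautifulSoup

# Hypothetical store for the scraped values, keyed by URL.
scraped = {}


def parser(response, *args, **kwargs):
    """Response hook: gets the finished response plus extra keyword arguments."""
    # "html.parser" is an assumption here; the question uses "html5lib".
    soup = BeautifulSoup(response.text, "html.parser")
    headings = soup.find_all('h1', {'class': 'name'})
    scraped[response.url] = [h.get_text(strip=True) for h in headings]
    # Returning None leaves the response object unchanged.


def report_failure(request, exception):
    """Handler for grequests.map, which calls it with (request, exception)."""
    print("failed:", request.url, exception)
```

To wire it up, keep hooks={'response': parser} on each grequests.get(...) call as in the code above, and call grequests.map(async_list, size=5, exception_handler=report_failure) so that failed requests are printed instead of being dropped quietly.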

Source

2017-09-08 kratze

The documentation is your friend: http://docs.python-requests.org/ja/v0.10.6/user/advanced/#asynchronous-requests – Tomalak

I'd suggest breaking up your URL list and putting a time delay between requests, exactly what @Tomalak suggests – chad
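
The suggestion above (split the list, pause between chunks) could look roughly like this. It is a sketch under assumptions: crawl_in_batches, chunked, and the fetch callable are hypothetical names, the batch size, worker count, and pause are arbitrary, and a thread pool from the standard library stands in for an async library.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def crawl_in_batches(urls, fetch, batch_size=50, workers=5, pause=1.0,
                     sleep=time.sleep):
    """Fetch each batch concurrently, pausing between batches.

    `fetch` is any callable that takes a URL and returns a result
    (hypothetical; supplied by the caller). An exception raised by
    `fetch` propagates out of this function.
    """
    results = {}
    batches = list(chunked(urls, batch_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for index, batch in enumerate(batches):
            # pool.map runs fetch in worker threads and keeps input order.
            for url, result in zip(batch, pool.map(fetch, batch)):
                results[url] = result
            # Pause between batches, but not after the last one.
            if index < len(batches) - 1:
                sleep(pause)
    return results
```

With requests, fetch could be something like lambda url: requests.get(url, timeout=10).text, and the BeautifulSoup parsing can then run on the returned text.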

@Tomalak You should post an answer that solves the user's problem in the first place. –

In case anyone is interested in this question: I decided to redo my project from scratch and use scrapy instead of beautifulsoup.

Scrapy is a full web-scraping framework with built-in support for handling 1000 requests at once. It also lets you scrape the destination site in a "friendlier" way.
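
For reference, the concurrency and politeness knobs mentioned above are ordinary Scrapy settings. A fragment of a project's settings.py might look like the following; the setting names are Scrapy's, but the values are illustrative assumptions, not the ones used in this project:

```python
# settings.py (fragment); every value below is an illustrative assumption

# Upper bound on requests Scrapy keeps in flight at the same time.
CONCURRENT_REQUESTS = 32

# Upper bound on simultaneous requests to any single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Seconds to wait between consecutive requests to the same site.
DOWNLOAD_DELAY = 0.5

# Let Scrapy adjust the delay automatically based on server response times.
AUTOTHROTTLE_ENABLED = True

# Respect the target site's robots.txt rules.
ROBOTSTXT_OBEY = True
```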

I hope this helps someone. For me, it was the better choice for this project.

Source

2017-09-22 10:39:42 kratze
