How to crawl multiple domains using a single crawler


How can I crawl data from multiple domains using a single crawler? I have crawled individual sites with Beautiful Soup, but I could not figure out how to write a generic crawler.

Source

2017-03-04 Puja

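Well, this question is flawed. For one crawler to handle them, the sites you want to scrape need to have something in common. For example, every page has links: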

from urllib import request

from bs4 import BeautifulSoup

for counter in range(0, 10):
    # Ask for the full URL of the site (Python 3; on Python 2 use raw_input())
    site = input("Type the URL of your website: ")
    # Fetch the raw HTML of the page stored in the `site` variable
    html = request.urlopen(site).read()
    # Parse the HTML with BeautifulSoup, here using the built-in html.parser
    soup = BeautifulSoup(html, "html.parser")
    # Loop over every link on the page, skipping <a> tags without an href
    for link in soup.find_all("a", href=True):
        print(link["href"])

Source

2017-03-05 12:19:05

As mentioned above, each site has its own set of selectors. A single generic crawler cannot simply visit a URL and work out by itself what to scrape.
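One way to keep a single crawler while still respecting per-site differences is to look up the selector by domain. The following is only a minimal sketch: the domains, the CSS selectors, and the function name extract_titles are hypothetical placeholders, not taken from any real site.

```python
from urllib.parse import urlparse

from bs4 import BeautifulSoup

# Hypothetical per-domain CSS selectors; real values depend on each site's markup.
SELECTORS = {
    "example.com": "h1.headline",
    "example.org": "div.post > h2",
}


def extract_titles(url, html):
    """Return the text of every element matched by the selector for url's domain."""
    domain = urlparse(url).netloc
    selector = SELECTORS.get(domain)
    if selector is None:
        # No configuration for this domain, so we cannot know what to scrape
        raise KeyError(f"No selector configured for {domain}")
    soup = BeautifulSoup(html, "html.parser")
    return [element.get_text(strip=True) for element in soup.select(selector)]
```

The fetching step is left out on purpose, so the function can be fed HTML obtained with urllib.request.urlopen or any other client. Supporting a new site then means adding one entry to SELECTORS rather than writing a new crawler.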

BeautifulSoup may not be the best fit for this kind of task. Scrapy is another web crawling library that is somewhat more robust than BS4.

A similar question here on Stack Overflow: Scrapy approach to scraping multiple URLs

Scrapy documentation: https://doc.scrapy.org/en/latest/intro/tutorial.html

Source

2017-03-12 17:52:17 pdel5
