yield scrapy.Request does not return the title

I am new to Scrapy and am trying to use it to practice crawling websites. However, even though I followed the code provided in the tutorial, no results are returned. It looks like yield scrapy.Request is not working. My code is as follows:

import scrapy
from bs4 import BeautifulSoup
from apple.items import AppleItem


class Apple1Spider(scrapy.Spider):
    name = 'apple'
    allowed_domains = ['appledaily.com']
    start_urls = ['http://www.appledaily.com.tw/realtimenews/section/new/']

    def parse(self, response):
        # Follow the link of every news entry on the listing page.
        domain = "http://www.appledaily.com.tw"
        res = BeautifulSoup(response.body, 'html.parser')
        for news in res.select('.rtddt'):
            yield scrapy.Request(domain + news.select('a')[0]['href'], callback=self.parse_detail)

    def parse_detail(self, response):
        # Extract title, content and time from an article page.
        res = BeautifulSoup(response.body, 'html.parser')
        appleitem = AppleItem()
        appleitem['title'] = res.select('h1')[0].text
        appleitem['content'] = res.select('.trans')[0].text
        appleitem['time'] = res.select('.gggs time')[0].text
        return appleitem

The output shows that the spider opens and closes, but nothing is returned. The Python version is 3.6. Can anyone help? Thanks.

EDIT I

The crawl log can be found here.

EDIT II

Maybe the problem becomes clearer if I change the code as follows:

import scrapy
from bs4 import BeautifulSoup


class Apple1Spider(scrapy.Spider):
    name = 'apple'
    allowed_domains = ['appledaily.com']
    start_urls = ['http://www.appledaily.com.tw/realtimenews/section/new/']

    def parse(self, response):
        domain = "http://www.appledaily.com.tw"
        res = BeautifulSoup(response.body)
        for news in res.select('.rtddt'):
            yield scrapy.Request(domain + news.select('a')[0]['href'], callback=self.parse_detail)

    def parse_detail(self, response):
        res = BeautifulSoup(response.body)
        print(res.select('#h1')[0].text)

The code should print out the URL and the title separately, but it returns nothing.

Source

2017-07-10 tzu

Could you post the crawl log? You can do this with 'scrapy crawl spider --logfile output.log' or with 'scrapy crawl spider 2>&1 | tee output.log' (the latter writes the output to both the screen and the file). – Granitosaurus

@Granitosaurus, I have just added a link to the log file. Thanks. – tzu

Your log states:

2017-07-10 19:12:47 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.appledaily.com.tw': <GET http://www.appledaily.com.tw/realtimenews/article/life/20170710/1158177/oBike%E7%A6%81%E5%81%9C%E6%A9%9F%E8%BB%8A%E6%A0%BC%E3%80%80%E6%96%B0%E5%8C%97%E7%81%AB%E9%80%9F%E5%86%8D%E5%85%AC%E5%91%8A6%E5%8D%80%E7%A6%81%E5%81%9C>

Your spider is set to:

allowed_domains = ['appledaily.com']

So it should probably be:

allowed_domains = ['appledaily.com.tw']
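
To illustrate why the original setting drops every article request, here is a rough stdlib-only sketch of a host-suffix check. It is a simplified approximation of what the offsite filter does, not Scrapy's actual code, and the name host_matches is made up for this example. As far as I know, requests built from start_urls are sent with dont_filter=True, which would explain why the listing page itself was still fetched.

```python
from urllib.parse import urlparse


def host_matches(url, allowed_domains):
    """Return True if the URL's host equals, or is a subdomain of, an allowed domain.

    Hypothetical helper: a simplified stand-in for the offsite check.
    """
    host = (urlparse(url).hostname or "").lower()
    for domain in allowed_domains:
        domain = domain.lower()
        if host == domain or host.endswith("." + domain):
            return True
    return False


listing = "http://www.appledaily.com.tw/realtimenews/section/new/"
print(host_matches(listing, ["appledaily.com"]))     # False: host ends in .com.tw
print(host_matches(listing, ["appledaily.com.tw"]))  # True
```

The host www.appledaily.com.tw is not a subdomain of appledaily.com, so under this kind of check only the second setting lets the requests through.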

Source

2017-07-10 11:43:48 Granitosaurus

Thank you. I did not think that was the cause. – tzu

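It seems that the content you are interested in within your parse method (i.e. the list items with class rtddt) is generated dynamically: it can be inspected with, for example, Chrome, but it is not present in the HTML source (which is what Scrapy gets as the response).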
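
One way to test this assumption is to count how many elements with that class appear in the raw HTML, before any JavaScript runs. The sketch below uses only the standard library; ClassCounter, count_class and the sample markup are hypothetical names and data made up for illustration.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    """Return the number of elements in html carrying class_name."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    parser.close()
    return parser.count


# Made-up sample markup standing in for a downloaded page.
sample = (
    '<ul><li class="rtddt"><a href="/a">A</a></li>'
    '<li class="rtddt even"><a href="/b">B</a></li></ul>'
)
print(count_class(sample, "rtddt"))  # 2
```

Inside the spider, something like count_class(response.text, "rtddt") could be logged from parse. A result of 0 would support the idea that the list is rendered client-side, while a positive count would point to a different cause.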

You will need to render the page for Scrapy first. I would recommend Splash together with the scrapy-splash package.
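
For reference, a minimal settings.py sketch for scrapy-splash, based on the setup described in the package's README as I remember it. The Splash address is an assumption (a local instance on port 8050), so adjust it to wherever Splash actually runs, and check the README for the current values.

```python
# settings.py fragment (sketch): enable scrapy-splash in a Scrapy project.

# Assumption: a Splash instance is reachable locally on port 8050.
SPLASH_URL = "http://localhost:8050"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Makes request deduplication aware of Splash arguments.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```

With this in place, the spider would yield scrapy_splash.SplashRequest(url, self.parse, args={'wait': 0.5}) instead of scrapy.Request, so that the callback receives the rendered HTML.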

Source

2017-07-10 11:24:41
