
I am running the scraper with a JOBDIR (cf. https://doc.scrapy.org/en/latest/topics/jobs.html) so that the crawl can be paused and resumed. The scraper ran fine for a while, but now whenever I crawl with the spider I get a log that ends as follows, with a TypeError in Scrapy's dupefilter (unsupported operand type(s) for +: 'NoneType' and 'str'):

scraper_1 | 2017-06-21 14:53:10 [scrapy.middleware] INFO: Enabled item pipelines: 
scraper_1 | ['scrapy.pipelines.images.ImagesPipeline', 
scraper_1 | 'scrapy.pipelines.files.FilesPipeline'] 
scraper_1 | 2017-06-21 14:53:10 [scrapy.core.engine] INFO: Spider opened 
scraper_1 | 2017-06-21 14:53:10 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
scraper_1 | 2017-06-21 14:53:10 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023 
scraper_1 | 2017-06-21 14:53:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.apkmirror.com/sitemap_index.xml> (referer: None) 
scraper_1 | 2017-06-21 14:53:13 [scrapy.core.scraper] ERROR: Spider error processing <GET http://www.apkmirror.com/sitemap_index.xml> (referer: None) 
scraper_1 | Traceback (most recent call last): 
scraper_1 | File "/usr/local/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback 
scraper_1 |  yield next(it) 
scraper_1 | GeneratorExit 
scraper_1 | 2017-06-21 14:53:13 [scrapy.core.engine] INFO: Closing spider (closespider_errorcount) 
scraper_1 | Exception ignored in: <generator object iter_errback at 0x7f4cc3a754c0> 
scraper_1 | RuntimeError: generator ignored GeneratorExit 
scraper_1 | 2017-06-21 14:53:13 [scrapy.statscollectors] INFO: Dumping Scrapy stats: 
scraper_1 | {'downloader/request_bytes': 306, 
scraper_1 | 'downloader/request_count': 1, 
scraper_1 | 'downloader/request_method_count/GET': 1, 
scraper_1 | 'downloader/response_bytes': 2498, 
scraper_1 | 'downloader/response_count': 1, 
scraper_1 | 'downloader/response_status_count/200': 1, 
scraper_1 | 'finish_reason': 'closespider_errorcount', 
scraper_1 | 'finish_time': datetime.datetime(2017, 6, 21, 14, 53, 13, 139012), 
scraper_1 | 'log_count/DEBUG': 26, 
scraper_1 | 'log_count/ERROR': 1, 
scraper_1 | 'log_count/INFO': 10, 
scraper_1 | 'memusage/max': 75530240, 
scraper_1 | 'memusage/startup': 75530240, 
scraper_1 | 'request_depth_max': 1, 
scraper_1 | 'response_received_count': 1, 
scraper_1 | 'scheduler/dequeued': 1, 
scraper_1 | 'scheduler/dequeued/disk': 1, 
scraper_1 | 'scheduler/enqueued': 1, 
scraper_1 | 'scheduler/enqueued/disk': 1, 
scraper_1 | 'spider_exceptions/GeneratorExit': 1, 
scraper_1 | 'start_time': datetime.datetime(2017, 6, 21, 14, 53, 10, 532154)} 
scraper_1 | 2017-06-21 14:53:13 [scrapy.core.engine] INFO: Spider closed (closespider_errorcount) 
scraper_1 | Unhandled error in Deferred: 
scraper_1 | 2017-06-21 14:53:13 [twisted] CRITICAL: Unhandled error in Deferred: 
scraper_1 | 
scraper_1 | 2017-06-21 14:53:13 [twisted] CRITICAL: 
scraper_1 | Traceback (most recent call last): 
scraper_1 | File "/usr/local/lib/python3.6/site-packages/twisted/internet/task.py", line 517, in _oneWorkUnit 
scraper_1 |  result = next(self._iterator) 
scraper_1 | File "/usr/local/lib/python3.6/site-packages/scrapy/utils/defer.py", line 63, in <genexpr> 
scraper_1 |  work = (callable(elem, *args, **named) for elem in iterable) 
scraper_1 | File "/usr/local/lib/python3.6/site-packages/scrapy/core/scraper.py", line 183, in _process_spidermw_output 
scraper_1 |  self.crawler.engine.crawl(request=output, spider=spider) 
scraper_1 | File "/usr/local/lib/python3.6/site-packages/scrapy/core/engine.py", line 210, in crawl 
scraper_1 |  self.schedule(request, spider) 
scraper_1 | File "/usr/local/lib/python3.6/site-packages/scrapy/core/engine.py", line 216, in schedule 
scraper_1 |  if not self.slot.scheduler.enqueue_request(request): 
scraper_1 | File "/usr/local/lib/python3.6/site-packages/scrapy/core/scheduler.py", line 54, in enqueue_request 
scraper_1 |  if not request.dont_filter and self.df.request_seen(request): 
scraper_1 | File "/usr/local/lib/python3.6/site-packages/scrapy/dupefilters.py", line 53, in request_seen 
scraper_1 |  self.file.write(fp + os.linesep) 
scraper_1 | TypeError: unsupported operand type(s) for +: 'NoneType' and 'str' 
apkmirrorscrapercompose_scraper_1 exited with code 0 

The error seems to be occurring in dupefilters.py. I have looked at the source code at https://github.com/scrapy/scrapy/blob/master/scrapy/dupefilters.py, but I have not been able to figure out what is causing it. Any ideas?
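
For reference, the crawl is launched with a JOBDIR roughly as in the sketch below (the job directory path and the use of CrawlerProcess are illustrative; the same thing can be done from the command line with something like scrapy crawl apkmirror -s JOBDIR=crawls/apkmirror-1):

# Minimal sketch of (re)starting the crawl with a JOBDIR so that the
# scheduler and dupefilter state persist across runs; the path below is
# an example, not the project's actual directory.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
settings.set('JOBDIR', 'crawls/apkmirror-1')  # requests.seen etc. live here

process = CrawlerProcess(settings)
process.crawl('apkmirror')  # the spider defined below
process.start()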

Update

Here are some more details on how the spider is implemented. It is a SitemapSpider, shown below; its parse method is defined in the BaseSpider class:

import scrapy
from scrapy.spiders import SitemapSpider
from apkmirror_scraper.spiders.base_spider import BaseSpider


class ApkmirrorSitemapSpider(SitemapSpider, BaseSpider):
    name = 'apkmirror'
    sitemap_urls = ['http://www.apkmirror.com/sitemap_index.xml']
    sitemap_rules = [(r'.*-android-apk-download/$', 'parse')]

    custom_settings = {
        'CLOSESPIDER_PAGECOUNT': 0,
        'CLOSESPIDER_ERRORCOUNT': 1,
        'CONCURRENT_REQUESTS': 32,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 16,
        'TOR_RENEW_IDENTITY_ENABLED': True,
        'TOR_ITEMS_TO_SCRAPE_PER_IDENTITY': 50,
        'FEED_URI': '/scraper/apkmirror_scraper/data/apkmirror.json',
        'FEED_FORMAT': 'json',
        'DUPEFILTER_CLASS': 'apkmirror_scraper.dupefilters.URLDupefilter',
    }

    download_timeout = 60 * 15.0  # Allow 15 minutes for downloading APKs

    def start_requests(self):
        for url in self.sitemap_urls:
            yield scrapy.Request(url, self._parse_sitemap, dont_filter=True)

I have defined a custom URLDupefilter as follows:

import os

from scrapy.dupefilters import RFPDupeFilter


class URLDupefilter(RFPDupeFilter):

    def request_fingerprint(self, request):
        '''Simply use the URL as the fingerprint. (Scrapy's default is a hash of the
        request's canonicalized URL, method, body, and (optionally) headers.)
        Omit sitemap pages, which end with ".xml".'''
        if not request.url.endswith('.xml'):
            return request.url

    def request_seen(self, request):
        '''Same as RFPDupeFilter's request_seen, except that a fingerprint of None is
        treated as "not seen" (cf. https://stackoverflow.com/questions/44370949/is-it-ok-for-scrapys-request-fingerprint-method-to-return-none).'''
        fp = self.request_fingerprint(request)
        if fp is None:
            return False  # These two lines are added to the original
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)

However, the error appears to come from Scrapy's built-in RFPDupeFilter class. I don't understand why it still seems to be in effect even though I set DUPEFILTER_CLASS to my custom one?


You shouldn't use `request.url` as the fingerprint, because it isn't exact. For example, if you go to page x and are redirected to page y, the next request to page y will not be filtered, since the fingerprint is x's URL. There are a few more cases like that; that's why you should call the super method to generate the fingerprint. – Granitosaurus

Answer

class URLDupefilter(RFPDupeFilter):

    def request_fingerprint(self, request):
        '''Simply use the URL as the fingerprint. (Scrapy's default is a hash of the
        request's canonicalized URL, method, body, and (optionally) headers.)
        Omit sitemap pages, which end with ".xml".'''
        if not request.url.endswith('.xml'):
            return request.url

This returns None when the request URL ends in .xml. The dupefilter then combines that value with a line separator and tries to write the result, which it expects to be a string, to its file of seen fingerprints. Because it ends up adding None and a string, you get the TypeError. To fix it, simply have request_fingerprint return a string for the .xml pages as well:

def request_fingerprint(self, request):
    if not request.url.endswith('.xml'):
        return request.url
    return ''
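
Alternatively, following the comment above, the dupefilter can keep Scrapy's default fingerprinting and only exempt the sitemap pages from duplicate filtering. A minimal sketch, assuming a Scrapy 1.x-style RFPDupeFilter (the class name simply mirrors the one in the question); this also avoids giving every .xml page the same empty fingerprint:

from scrapy.dupefilters import RFPDupeFilter


class URLDupefilter(RFPDupeFilter):
    '''Keep Scrapy's default request fingerprint, but never treat
    sitemap (.xml) pages as duplicates.'''

    def request_seen(self, request):
        # Sitemap index and sub-sitemap pages should always be re-fetched,
        # so short-circuit before any fingerprint is computed or recorded.
        if request.url.endswith('.xml'):
            return False
        # Otherwise fall back to the built-in behaviour, which hashes the
        # canonicalized URL, method, and body, and appends the fingerprint
        # to the requests.seen file in the JOBDIR.
        return super().request_seen(request)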