
Scrapy error NotSupported

When I crawl data from the detail pages, I get the error scrapy.exceptions.NotSupported. With a small number of pages I can still get the data, but once I increase the number of pages there is no further output; the spider keeps running and never stops. Thank you!

The pages contain images, and I do not want to crawl the images; their response content is not text.

This is the error:

2017-02-18 15:35:35 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.google.com.my:443/maps/place/bs+bio+science+sdn+bhd/@4.109495,109.101269,2856256m/data=!3m1!4b1!4m2!3m1!1s0x0:0xb11eb29219c723f4?source=s_q&hl=en> from <GET http://maps.google.com.my/maps?f=q&source=s_q&hl=en&q=bs+bio+science+sdn+bhd&vps=1&jsv=171b&sll=4.109495,109.101269&sspn=25.686885,46.318359&ie=UTF8&ei=jPeISu6RGI7kugOboeXiDg&cd=1&usq=bs+bio+science+sdn+bhd&geocode=FQdNLwAdEm4QBg&cid=12762834734582014964&li=lmd> 
2017-02-18 15:35:37 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://com> (failed 3 times): DNS lookup failed: address 'com' not found: [Errno 11001] getaddrinfo failed. 
2017-02-18 15:35:37 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.byunature> (failed 3 times): DNS lookup failed: address 'www.byunature' not found: [Errno 11001] getaddrinfo failed. 
2017-02-18 15:35:37 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.borneococonutoil.com> (failed 3 times): DNS lookup failed: address 'www.borneococonutoil.com' not found: [Errno 11001] getaddrinfo failed. 
2017-02-18 15:35:37 [scrapy.core.scraper] ERROR: Error downloading <GET http://com>: DNS lookup failed: address 'com' not found: [Errno 11001] getaddrinfo failed. 
2017-02-18 15:35:37 [scrapy.core.scraper] ERROR: Error downloading <GET http://www.byunature>: DNS lookup failed: address 'www.byunature' not found: [Errno 11001] getaddrinfo failed. 
2017-02-18 15:35:37 [scrapy.core.scraper] ERROR: Error downloading <GET http://www.borneococonutoil.com>: DNS lookup failed: address 'www.borneococonutoil.com' not found: [Errno 11001] getaddrinfo failed. 
2017-02-18 15:35:37 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.google.com.my/maps/place/bs+bio+science+sdn+bhd/@4.109495,109.101269,2856256m/data=!3m1!4b1!4m2!3m1!1s0x0:0xb11eb29219c723f4?source=s_q&hl=en&dg=dbrw&newdg=1> from <GET https://www.google.com.my:443/maps/place/bs+bio+science+sdn+bhd/@4.109495,109.101269,2856256m/data=!3m1!4b1!4m2!3m1!1s0x0:0xb11eb29219c723f4?source=s_q&hl=en> 
2017-02-18 15:35:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.google.com.my/maps/place/bs+bio+science+sdn+bhd/@4.109495,109.101269,2856256m/data=!3m1!4b1!4m2!3m1!1s0x0:0xb11eb29219c723f4?source=s_q&hl=en&dg=dbrw&newdg=1> (referer: http://www.bsbioscience.com/contactus.html) 
2017-02-18 15:35:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.canaanalpha.com/extras/Anistrike_Poster.pdf> (referer: http://www.canaanalpha.com/anistrike.html) 
2017-02-18 15:35:41 [scrapy.core.scraper] ERROR: Spider error processing <GET http://www.canaanalpha.com/extras/Anistrike_Poster.pdf> (referer: http://www.canaanalpha.com/anistrike.html) 
Traceback (most recent call last): 
    File "c:\python27\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback 
    yield next(it) 
    File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output 
    for x in result: 
    File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr> 
    return (_set_referer(r) for r in result or()) 
    File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr> 
    return (r for r in result or() if _filter(r)) 
    File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr> 
    return (r for r in result or() if _filter(r)) 
    File "D:\Scrapy\tutorial\tutorial\spiders\tu2.py", line 17, in parse 
    company = response.css('font:nth-child(3)::text').extract_first() 
    File "c:\python27\lib\site-packages\scrapy\http\response\__init__.py", line 97, in css 
    raise NotSupported("Response content isn't text") 
NotSupported: Response content isn't text 
2017-02-18 15:35:41 [scrapy.core.engine] INFO: Closing spider (finished) 
2017-02-18 15:35:41 [scrapy.extensions.feedexport] INFO: Stored json feed (30 items) in: tu2.json 
2017-02-18 15:35:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats: 
{'downloader/exception_count': 55, 
'downloader/exception_type_count/scrapy.exceptions.NotSupported': 31, 
'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 24, 

Here is my code:

import scrapy
import json
from scrapy.linkextractors import LinkExtractor
# import LxmlLinkExtractor as LinkExtractor

class QuotesSpider(scrapy.Spider):
    name = "tu2"

    def start_requests(self):
        baseurl = 'http://edirectory.matrade.gov.my/application/edirectory.nsf/category?OpenForm&query=product&code=PT&sid=BED1E22D5BE3F9B5394D6AF0E742828F'
        urls = []
        for i in range(1, 3):
            urls.append(baseurl + "&page=" + str(i))

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        company = response.css('font:nth-child(3)::text').extract_first()

        key3 = "Business Address"
        key4 = response.css('tr:nth-child(4) td:nth-child(1) b::text').extract_first()
        key5 = response.css('tr:nth-child(5) td:nth-child(1) b::text').extract_first()

        value3 = response.css('tr:nth-child(3) .table-middle:nth-child(3)::text').extract_first()
        value4 = response.css('tr:nth-child(4) td:nth-child(3)::text').extract_first()
        value5 = response.css('tr:nth-child(5) td:nth-child(3)::text').extract_first()

        # bla = {}
        # if key3 is not None:
        #     bla[key3] = value3

        if value3 is not None:
            json_data = {
                'company': company,
                key3: value3,
                key4: value4,
                key5: value5,
            }
            yield json_data
            # yield json.dumps(bla)

        # detail page
        count = 0
        for button in response.css('td td a'):
            detail_page_url = button.css('::attr(href)').extract_first()
            if detail_page_url is not None:
                page_urls = response.urljoin(detail_page_url)
                yield scrapy.Request(page_urls, callback=self.parse)

1 Answer
[scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.canaanalpha.com/extras/Anistrike_Poster.pdf> (referer: http://www.canaanalpha.com/anistrike.html) 

The spider is crawling a PDF file here. You need to either filter these links out manually or use a LinkExtractor, which already does this for you.

# detail page
link_extractor = LinkExtractor(restrict_css='td td a')
links = link_extractor.extract_links(response)
for link in links:
    yield scrapy.Request(link.url, callback=self.parse)
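
If you would rather take the manual-filtering route mentioned above, here is a minimal sketch (my own illustration, not part of the original answer): it assumes the same 'td td a' links as in the question, skips URLs by file extension before requesting them, and returns early when the response is not a TextResponse, which is exactly the case that raises NotSupported.

import scrapy
from scrapy.http import TextResponse

class QuotesSpider(scrapy.Spider):
    name = "tu2"

    def parse(self, response):
        # Non-text responses (PDFs, images, ...) raise NotSupported as soon as
        # .css() is called, so stop before touching any selectors.
        if not isinstance(response, TextResponse):
            return

        # ... field extraction as in the question ...

        # detail page
        for button in response.css('td td a'):
            detail_page_url = button.css('::attr(href)').extract_first()
            if detail_page_url is None:
                continue
            # Crude, assumed extension check; extend the tuple as needed.
            if detail_page_url.lower().endswith(('.pdf', '.jpg', '.png')):
                continue
            yield scrapy.Request(response.urljoin(detail_page_url), callback=self.parse)

That said, the LinkExtractor shown above is the cleaner option, since its default deny_extensions list already excludes pdf and most other binary formats.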
+0

@Granitosaurus thanks. How do I create a filter for that, following your link? Do I create the filter in __init__.py? And do you mean that I can get rid of them, i.e. not process the pdf links? –

By default the LinkExtractor ignores a lot of non-HTML files such as pdf (see source here for the full list). For your code, try this:

def parse(self, response):
    url = 'someurl'
    if '.pdf' not in url:
        yield Request(url, self.parse2)
    # or
    le = LinkExtractor()
    links = le.extract_links(response)
    for link in links:
        yield Request(link.url, self.parse2)

+0

@RoShanShan yeah, just don't process the pdf links. The second example, after '# or', is really what you need. https://doc.scrapy.org/en/latest/topics/link-extractors.html#link-extractors – Granitosaurus

+0

I really don't know where to put the code that comes after '# or'. I want to extract data from the detail pages of this link: [link](http://edirectory.matrade.gov.my/application/edirectory.nsf/category?OpenForm&query=product&code=PT&sid=BED1E22D5BE3F9B5394D6AF0E742828F). You can see my code above. –
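
For completeness, a minimal sketch (my own illustration, with restrict_css='td td a' assumed so that it mirrors the original selector) of where the '# or' branch could sit inside the spider's parse method:

import scrapy
from scrapy.linkextractors import LinkExtractor

class QuotesSpider(scrapy.Spider):
    name = "tu2"

    def parse(self, response):
        # ... extract company / address fields here, exactly as in the question ...

        # LinkExtractor skips pdf and other non-HTML extensions by default,
        # so the detail-page requests no longer hit NotSupported.
        le = LinkExtractor(restrict_css='td td a')
        for link in le.extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse)

The start_requests method from the question stays unchanged; only the detail-page loop is replaced.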