Passing command-line arguments to my Scrapy crawler


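I have written a Scrapy crawler, but I need to add the ability to read some arguments from the command line and use them to populate some static fields of the spider class. I also need to override the initializer so that it fills in some of the spider's fields.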

import scrapy
from scrapy.spiders import Spider
from scrapy.http import Request
import re


class TutsplusItem(scrapy.Item):
    title = scrapy.Field()


class MySpider(Spider):
    name = "tutsplus"
    allowed_domains = ["bbc.com"]
    start_urls = ["http://www.bbc.com/"]

    def parse(self, response):
        links = response.xpath('//a/@href').extract()
        # We stored already crawled links in this list
        crawledLinks = []

        for link in links:
            # If it is a proper link and is not checked yet, yield it to the Spider
            # if linkPattern.match(link) and not link in crawledLinks:
            if link not in crawledLinks:
                link = "http://www.bbc.com" + link
                crawledLinks.append(link)
                yield Request(link, self.parse)

        titles = response.xpath('//a[contains(@class, "media__link")]/text()').extract()
        for title in titles:
            item = TutsplusItem()
            item["title"] = title
            print("Title is : %s" % title)
            yield item

It needs to be run like this:

scrapy runspider crawler.py arg1 arg2

How can I achieve this?

Asked 2017-02-13 by Luckylukee

You can do this by overriding your spider's __init__ method, like this:

class MySpider(Spider):
    name = "tutsplus"
    allowed_domains = ["bbc.com"]
    start_urls = ["http://www.bbc.com/"]
    arg1 = None
    arg2 = None

    def __init__(self, arg1, arg2, *args, **kwargs):
        # Values given with -a on the command line arrive here as strings
        self.arg1 = arg1
        self.arg2 = arg2
        super(MySpider, self).__init__(*args, **kwargs)

    def parse(self, response):
        links = response.xpath('//a/@href').extract()
        # We stored already crawled links in this list
        crawledLinks = []

        for link in links:
            # If it is a proper link and is not checked yet, yield it to the Spider
            # if linkPattern.match(link) and not link in crawledLinks:
            if link not in crawledLinks:
                link = "http://www.bbc.com" + link
                crawledLinks.append(link)
                yield Request(link, self.parse)

        titles = response.xpath('//a[contains(@class, "media__link")]/text()').extract()
        for title in titles:
            item = TutsplusItem()
            item["title"] = title
            print("Title is : %s" % title)
            yield item

Then run your spider like this:

scrapy crawl tutsplus -a arg1=arg1 -a arg2=arg2

Answered 2017-02-13 08:45:46

Thanks for the response, but what if I want to run it like "scrapy runspider Crawler.py arg1 arg2", for example using getopt.getopt(args, options, [long_options])? – Luckylukee

You would need to dig deep into the class implementation of the runspider command. –
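
One way to get bare "arg1 arg2" arguments without touching the runspider command itself might be a small launcher script that parses the command line with getopt and starts the crawl through Scrapy's CrawlerProcess. The sketch below is only an illustration under stated assumptions: the file name run_crawler.py and the helper parse_spider_args are hypothetical, and it assumes the MySpider class from the answer above is importable from crawler.py. Note also that scrapy runspider accepts the same -a NAME=VALUE options as scrapy crawl, which may be enough on its own.

```python
import getopt
import sys

# Names of the arguments that MySpider.__init__ expects (taken from the answer above).
SPIDER_ARGS = ("arg1", "arg2")


def parse_spider_args(argv):
    """Turn command-line tokens into keyword arguments for the spider.

    Hypothetical helper: accepts long options (--arg1 VALUE --arg2 VALUE),
    bare positional values ("arg1 arg2" as in the question), or a mix.
    """
    opts, positional = getopt.getopt(argv, "", [name + "=" for name in SPIDER_ARGS])
    # getopt returns pairs such as ("--arg1", "foo"); strip the leading dashes.
    kwargs = {name.lstrip("-"): value for name, value in opts}
    # Fill arguments not given as options from the positional values, in order.
    # Surplus positional values are ignored.
    unset = [name for name in SPIDER_ARGS if name not in kwargs]
    kwargs.update(zip(unset, positional))
    missing = [name for name in SPIDER_ARGS if name not in kwargs]
    if missing:
        raise SystemExit("missing spider arguments: " + ", ".join(missing))
    return kwargs


def main(argv):
    kwargs = parse_spider_args(argv)
    # Imported here so the parsing above can be exercised without Scrapy installed.
    from scrapy.crawler import CrawlerProcess
    from crawler import MySpider  # assumption: the spider lives in crawler.py

    process = CrawlerProcess()
    process.crawl(MySpider, **kwargs)  # kwargs are passed on to MySpider.__init__
    process.start()  # blocks until the crawl has finished


if __name__ == "__main__":
    if len(sys.argv) > 1:
        main(sys.argv[1:])
    else:
        print("usage: python run_crawler.py ARG1 ARG2")
```

Saved as run_crawler.py next to crawler.py, this would be started with "python run_crawler.py arg1 arg2" rather than through the scrapy command, which sidesteps the need to modify the runspider implementation.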
