2017-03-09

Python/Scrapy: CrawlSpider quits after fetching the start_urls

I have been spending fruitless hours trying to get my head around Scrapy, reading the documentation as well as other Scrapy blogs and Q&As ;-) The problem: my spider opens and fetches the start_urls, but then apparently does nothing with them. Instead it finishes right away, and that's it. Apparently I never even reach the first self.log() statement. What I have so far is this:

# -*- coding: utf-8 -*-
import scrapy
# from scrapy.shell import inspect_response
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from scrapy.http import HtmlResponse, FormRequest, Request
from KiPieSpider.items import *
from KiPieSpider.settings import *

class KiSpider(CrawlSpider):

    name = "KiSpider"
    allowed_domains = ['www.kiweb.de', 'kiweb.de']
    start_urls = (
        # ST Regra start page:
        'https://www.kiweb.de/default.aspx?pageid=206',
        # follow ST Regra links in the form of:
        # https://www.kiweb.de/default.aspx?pageid=206&page=\d+
        # https://www.kiweb.de/default.aspx?pageid=299&docid=\d{6}
        # ST Thermo start page:
        'https://www.kiweb.de/default.aspx?pageid=202&page=1',
        # follow ST Thermo links in the form of:
        # https://www.kiweb.de/default.aspx?pageid=202&page=\d+
        # https://www.kiweb.de/default.aspx?pageid=299&docid=\d{6}
    )

    rules = (
        # First rule that matches a given link is followed/parsed.
        # Follow category pagination without further parsing:
        Rule(
            LinkExtractor(
                # Extract links in the form:
                allow=r'Default\.aspx?pageid=(202|206])&page=\d+',
                # but only within the pagination table cell:
                restrict_xpaths=('//td[@id="ctl04_teaser_next"]'),
            ),
            follow=True,
        ),
        # Follow links to category (202|206) articles and parse them:
        Rule(
            LinkExtractor(
                # Extract links in the form:
                allow=r'Default\.aspx?pageid=299&docid=\d+',
                # but only within article preview cells:
                restrict_xpaths=("//td[@class='TOC-zelle TOC-text']"),
            ),
            # and parse the resulting pages for article content:
            callback='parse_init',
            follow=False,
        ),
    )

    # Once an article page is reached, check whether a login is necessary:
    def parse_init(self, response):
        self.log('Parsing article: %s' % response.url)
        if not response.xpath('input[@value="Logout"]'):
            # Note: response.xpath() is a shortcut of response.selector.xpath()
            self.log('Not logged in. Logging in...\n')
            return self.login(response)
        else:
            self.log('Already logged in. Continue crawling...\n')
            return self.parse_item(response)

    def login(self, response):
        self.log("Trying to log in...\n")
        self.username = self.settings['KI_USERNAME']
        self.password = self.settings['KI_PASSWORD']
        return FormRequest.from_response(
            response,
            formname='Form1',
            formdata={
                # needs name, not id attributes!
                'ctl04$Header$ctl01$textbox_username': self.username,
                'ctl04$Header$ctl01$textbox_password': self.password,
                'ctl04$Header$ctl01$textbox_logindaten_typ': 'Username_Passwort',
                'ctl04$Header$ctl01$checkbox_permanent': 'True',
            },
            callback=self.parse_item,
        )

    def parse_item(self, response):
        articles = response.xpath('//div[@id="artikel"]')
        items = []
        for article in articles:
            item = KiSpiderItem()
            item['link'] = response.url
            item['title'] = articles.xpath("div[@class='ct1']/text()").extract()
            item['subtitle'] = articles.xpath("div[@class='ct2']/text()").extract()
            item['article'] = articles.extract()
            item['published'] = articles.xpath("div[@class='biblio']/text()").re(r"(\d{2}.\d{2}.\d{4}) PIE")
            item['artid'] = articles.xpath("div[@class='biblio']/text()").re(r"PIE \[(d+)-\d+\]")
            item['lang'] = 'de-DE'
            items.append(item)
        # return(items)
        yield items
        # what is the difference between return and yield?? found both on web.
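        # (My current understanding, which may be wrong: "return items"
        # hands the complete list back once the loop has finished, while
        # "yield" turns the callback into a generator that emits objects
        # one at a time. Scrapy accepts both, but "yield items" as written
        # emits the whole list as a single object; yielding each item
        # inside the loop appears to be the usual pattern.)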

When I run scrapy crawl KiSpider, this is the result:

2017-03-09 18:03:33 [scrapy.utils.log] INFO: Scrapy 1.3.2 started (bot: KiPieSpider) 
2017-03-09 18:03:33 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'KiPieSpider.spiders', 'DEPTH_LIMIT': 3, 'CONCURRENT_REQUESTS': 8, 'SPIDER_MODULES': ['KiPieSpider.spiders'], 'BOT_NAME': 'KiPieSpider', 'DOWNLOAD_TIMEOUT': 60, 'USER_AGENT': 'KiPieSpider (in[email protected])', 'DOWNLOAD_DELAY': 0.25} 
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled extensions: 
['scrapy.extensions.logstats.LogStats', 
'scrapy.extensions.telnet.TelnetConsole', 
'scrapy.extensions.corestats.CoreStats'] 
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled downloader middlewares: 
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 
'scrapy.downloadermiddlewares.retry.RetryMiddleware', 
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 
'scrapy.downloadermiddlewares.stats.DownloaderStats'] 
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled spider middlewares: 
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 
'scrapy.spidermiddlewares.referer.RefererMiddleware', 
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 
'scrapy.spidermiddlewares.depth.DepthMiddleware'] 
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled item pipelines: 
[] 
2017-03-09 18:03:33 [scrapy.core.engine] INFO: Spider opened 
2017-03-09 18:03:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2017-03-09 18:03:33 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023 
2017-03-09 18:03:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.kiweb.de/default.aspx?pageid=206> (referer: None) 
2017-03-09 18:03:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.kiweb.de/default.aspx?pageid=202&page=1> (referer: None) 
2017-03-09 18:03:34 [scrapy.core.engine] INFO: Closing spider (finished) 
2017-03-09 18:03:34 [scrapy.statscollectors] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 465, 
'downloader/request_count': 2, 
'downloader/request_method_count/GET': 2, 
'downloader/response_bytes': 48998, 
'downloader/response_count': 2, 
'downloader/response_status_count/200': 2, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2017, 3, 9, 17, 3, 34, 235000), 
'log_count/DEBUG': 3, 
'log_count/INFO': 7, 
'response_received_count': 2, 
'scheduler/dequeued': 2, 
'scheduler/dequeued/memory': 2, 
'scheduler/enqueued': 2, 
'scheduler/enqueued/memory': 2, 
'start_time': datetime.datetime(2017, 3, 9, 17, 3, 33, 295000)} 
2017-03-09 18:03:34 [scrapy.core.engine] INFO: Spider closed (finished) 

Could it be that the login routine should not end with a callback, but with some kind of return/yield statement instead? Or what am I doing wrong? Unfortunately, the documentation and tutorials I have seen so far have given me only a rather vague idea of how each bit connects with the others. The Scrapy documentation in particular seems to be written as a reference for people who already know a lot about Scrapy.

Somewhat frustrated greetings, Christopher

Answers

rules = (
    # First rule that matches a given link is followed/parsed.
    # Follow category pagination without further parsing:
    Rule(
        LinkExtractor(
            # Extract links in the form:
            # allow=r'Default\.aspx?pageid=(202|206])&page=\d+',
            # but only within the pagination table cell:
            restrict_xpaths=('//td[@id="ctl04_teaser_next"]'),
        ),
        follow=True,
    ),
    # Follow links to category (202|206) articles and parse them:
    Rule(
        LinkExtractor(
            # Extract links in the form:
            # allow=r'Default\.aspx?pageid=299&docid=\d+',
            # but only within article preview cells:
            restrict_xpaths=("//td[@class='TOC-zelle TOC-text']"),
        ),
        # and parse the resulting pages for article content:
        callback='parse_init',
        follow=False,
    ),
)

The tags selected by those XPaths each contain only a single link, so you do not need allow at all.

I cannot quite follow the regular expressions in your allow parameters, but at the very least the ? needs to be escaped.
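For reference, a sketch of what the first pattern might look like with the ? escaped and the stray ] removed, if you decide to keep allow at all. Note that the links in your start_urls use lowercase default.aspx, so the capital D is likely another mismatch, since LinkExtractor patterns are case-sensitive:

Rule(
    LinkExtractor(
        # "\?" matches a literal question mark, the alternation no longer
        # contains the stray "]", and lowercase "default" matches the URLs
        # as they appear in start_urls:
        allow=r'default\.aspx\?pageid=(202|206)&page=\d+',
        # still restricted to the pagination cell:
        restrict_xpaths=('//td[@id="ctl04_teaser_next"]'),
    ),
    follow=True,
),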


Thank you so much, that was it: the unescaped ? inside the allow parameter!
