2016-07-05 18 views
-1

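Scrapy 1.1 crawls 0 pages, but I can get the data with the scrapy shell command

I have been working through the Scrapy tutorial. After running the command from the project's top-level directory, I get the following output: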

2016-07-05 21:06:01 [scrapy] INFO: Scrapy 1.1.0 started (bot: tutorial)
2016-07-05 21:06:01 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'tutorial'}
2016-07-05 21:06:01 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-07-05 21:06:02 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-07-05 21:06:02 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-07-05 21:06:02 [scrapy] INFO: Enabled item pipelines:
[]
2016-07-05 21:06:02 [scrapy] INFO: Spider opened
2016-07-05 21:06:02 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-07-05 21:06:02 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2016-07-05 21:06:02 [scrapy] INFO: Closing spider (finished)
2016-07-05 21:06:02 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 7, 5, 13, 6, 2, 381000),
 'log_count/DEBUG': 1,
 'log_count/INFO': 7,
 'start_time': datetime.datetime(2016, 7, 5, 13, 6, 2, 381000)}
2016-07-05 21:06:02 [scrapy] INFO: Spider closed (finished)

Here is dmoz.py:

# -*- coding: utf-8 -*-
import scrapy
from tutorial.items import TutorialItem


class DmozSpider(scrapy.Spider):
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    strat_urls = ('http://www.dmoz.org/Computers/Programming/Languages/Python/Books/')

    def parse(self, response):
        lislink = response.xpath('/html/body/div[5]/div/section[3]/div/div/div[*]/div[3]/a')

        for li in lislink:
            item = TutorialItem()
            item['link'] = li.xpath('@href').extract()
            yield item

And here is items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    link = scrapy.Field()
    pass

However, when I debug the project in the shell, I can get the URLs:

D:\pythonweb\scrapy\test2>scrapy shell http://www.dmoz.org/Computers/Programming/Languages/Python/Books/
2016-07-05 21:06:40 [scrapy] INFO: Scrapy 1.1.0 started (bot: scrapybot)
2016-07-05 21:06:40 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
2016-07-05 21:06:40 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-07-05 21:06:40 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-07-05 21:06:40 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-07-05 21:06:40 [scrapy] INFO: Enabled item pipelines:
[]
2016-07-05 21:06:40 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2016-07-05 21:06:40 [scrapy] INFO: Spider opened
2016-07-05 21:06:42 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x03BF0E30>
[s] item  {}
[s] request <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s] response <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s] settings <scrapy.settings.Settings object at 0x03BF05F0>
[s] spider  <DefaultSpider 'default' at 0x432b1d0>
[s] Useful shortcuts:
[s] shelp()   Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
>>> lislink = response.xpath('/html/body/div[5]/div/section[3]/div/div/div[*]/div[3]/a')
>>> lislink.xpath('@href').extract()
 
[u'http://www.pearsonhighered.com/educator/academic/product/0,,0130260363,00%2Ben-USS_01DBC.html', u'http://www.brpreiss.com/books/opus7/html/book.html', u'http://www.diveintopython.net/', u'http://rhodesmill.org/brandon/2011/foundations-of-python-network-programming/', u'http://www.techbooksforfree.com/perlpython.shtml', u'http://www.freetechbooks.com/python-f6.html', u'http://greenteapress.com/thinkpython/', u'http://www.network-theory.co.uk/python/intro/', u'http://www.freenetpages.co.uk/hp/alan.gauld/', u'http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471219754.html', u'http://hetland.org/writing/practical-python/', u'http://sysadminpy.com/', u'http://www.qtrac.eu/py3book.html', u'http://www.wiley.com/WileyCDA/WileyTitle/productCd-0764548077.html', u'https://www.packtpub.com/python-3-object-oriented-programming/book', u'http://www.network-theory.co.uk/python/language/', u'http://www.pearsonhighered.com/educator/academic/product/0,,0130409561,00%2Ben-USS_01DBC.html', u'http://www.informit.com/store/product.aspx?isbn=0201616165&redir=1', u'http://www.pearsonhighered.com/educator/academic/product/0,,0201748843,00%2Ben-USS_01DBC.html', u'http://www.informit.com/store/product.aspx?isbn=0672317354', u'http://gnosis.cx/TPiP/', u'http://www.informit.com/store/product.aspx?isbn=0130211192'] 
 
>>>

Here is my platform:

Scrapy    : 1.1.0
lxml      : 3.6.0.0
libxml2   : 2.9.0
Twisted   : 16.2.0
Python    : 2.7.12 (v2.7.12:d33e0cf91556, Jun 27 2016, 15:19:22) [MSC v.1500 32 bit (Intel)]
pyOpenSSL : 16.0.0 (OpenSSL 1.0.2h 3 May 2016)
Platform  : Windows-10-10.0.10586

Answer

1

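The attribute is start_urls, not strat_urls, and it needs to be an iterable with URLs as its items (usually a list). Because of the misspelling, the spider has no start URLs at all, so it opens and closes without crawling anything. Note also that parentheses around a single string, without a trailing comma, produce a plain string rather than a tuple.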

class DmozSpider(scrapy.Spider): 
    name = 'dmoz' 
    allowed_domains = ['dmoz.org'] 
    start_urls = ['http://www.dmoz.org/Computers/Programming/Languages/Python/Books/'] 
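To illustrate why the misspelled attribute leads to zero requests, here is a minimal sketch that does not use Scrapy itself. MiniSpider is a hypothetical stand-in written for this example, assuming only that the base spider falls back to an empty start_urls and generates one request per entry; it is not Scrapy's actual implementation.

```python
# MiniSpider is a hypothetical stand-in for scrapy.Spider, used only to
# illustrate the attribute lookup; it is not Scrapy's real code.
class MiniSpider(object):
    start_urls = []  # assumed default when a subclass defines no start_urls

    def start_requests(self):
        # Yield one "request" (here just the URL string) per start URL.
        for url in self.start_urls:
            yield url


URL = 'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/'


class TypoSpider(MiniSpider):
    # Misspelled name, and parentheses alone do not make a tuple.
    strat_urls = (URL)


class FixedSpider(MiniSpider):
    start_urls = [URL]


typo_requests = list(TypoSpider().start_requests())
fixed_requests = list(FixedSpider().start_requests())
string_not_tuple = isinstance(TypoSpider.strat_urls, str)

print(typo_requests)     # nothing to crawl: the typo'd attribute is never read
print(fixed_requests)    # one request for the Books page
print(string_not_tuple)  # the parenthesized value is a plain str
```

If the name were spelled correctly but the value were still a bare string, iterating over it would step through single characters rather than URLs, which is why a list (or a tuple with a trailing comma) is the safer choice.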
+0

This works really nicely. Thank you!
