
I need some help. I wanted to write a crawler for a specific website (theunderminejournal). I want to pull data from the site and print it to the console, since I work mostly in the console and don't want to keep switching windows. Another point is that I eventually want to push the data into a database (SQL or similar is no problem). But as a Scrapy beginner I'm getting an exception. My spider is this:

# -*- coding: utf-8 -*- 
import scrapy 


class JournalSpider(scrapy.Spider): 
    name = "journal" 
    allowed_domains = ["theunderminejournal.com"] 
    start_urls = (
        'theunderminejournal.com/#eu/eredar/item/124442', 
    ) 

    def parse(self, response): 
        page = response.url.split("/")[-2] 
        filename = 'journal-%s.html' % page 
        with open(filename, 'wb') as f: 
            f.write(response.body) 
            self.log('Saved file %s' % filename) 

But when I try to run the crawler, I somehow just get this:

2016-10-05 10:55:23 [scrapy] INFO: Scrapy 1.0.3 started (bot: undermine) 
2016-10-05 10:55:23 [scrapy] INFO: Optional features available: ssl, http11, boto 
2016-10-05 10:55:23 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'undermine.spiders', 'SPIDER_MODULES': ['undermine.spiders'], 'BOT_NAME': 'undermine'} 
2016-10-05 10:55:23 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState 
2016-10-05 10:55:23 [boto] DEBUG: Retrieving credentials from metadata server. 
2016-10-05 10:55:24 [boto] ERROR: Caught exception reading instance data 
Traceback (most recent call last): 
    File "/usr/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url 
    r = opener.open(req, timeout=timeout) 
    File "/usr/lib/python2.7/urllib2.py", line 429, in open 
    response = self._open(req, data) 
    File "/usr/lib/python2.7/urllib2.py", line 447, in _open 
    '_open', req) 
    File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain 
    result = func(*args) 
    File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open 
    return self.do_open(httplib.HTTPConnection, req) 
    File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open 
    raise URLError(err) 
URLError: <urlopen error timed out> 
2016-10-05 10:55:24 [boto] ERROR: Unable to read instance data, giving up 
2016-10-05 10:55:24 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2016-10-05 10:55:24 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2016-10-05 10:55:24 [scrapy] INFO: Enabled item pipelines: 
2016-10-05 10:55:24 [scrapy] INFO: Spider opened 
2016-10-05 10:55:24 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2016-10-05 10:55:24 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2016-10-05 10:55:24 [scrapy] ERROR: Error while obtaining start requests 
Traceback (most recent call last): 
    File "/usr/lib/python2.7/dist-packages/scrapy/core/engine.py", line 110, in _next_request 
    request = next(slot.start_requests) 
    File "/usr/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 70, in start_requests 
    yield self.make_requests_from_url(url) 
    File "/usr/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url 
    return Request(url, dont_filter=True) 
    File "/usr/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 24, in __init__ 
    self._set_url(url) 
    File "/usr/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 59, in _set_url 
    raise ValueError('Missing scheme in request url: %s' % self._url) 
ValueError: Missing scheme in request url: theunderminejournal.com/#eu/eredar/item/124442 
2016-10-05 10:55:24 [scrapy] INFO: Closing spider (finished) 
2016-10-05 10:55:24 [scrapy] INFO: Dumping Scrapy stats: 
{'finish_reason': 'finished', 
'finish_time': datetime.datetime(2016, 10, 5, 8, 55, 24, 710944), 
'log_count/DEBUG': 2, 
'log_count/ERROR': 3, 
'log_count/INFO': 7, 
'start_time': datetime.datetime(2016, 10, 5, 8, 55, 24, 704378)} 
2016-10-05 10:55:24 [scrapy] INFO: Spider closed (finished) 

The tutorials weren't really helpful here, I think. Does anyone know a hint?

EDIT: after fixing the URL I now get this:

2016-10-05 11:21:35 [scrapy] INFO: Scrapy 1.0.3 started (bot: undermine) 
2016-10-05 11:21:35 [scrapy] INFO: Optional features available: ssl, http11, boto 
2016-10-05 11:21:35 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'undermine.spiders', 'SPIDER_MODULES': ['undermine.spiders'], 'BOT_NAME': 'undermine'} 
2016-10-05 11:21:35 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState 
2016-10-05 11:21:35 [boto] DEBUG: Retrieving credentials from metadata server. 
2016-10-05 11:21:36 [boto] ERROR: Caught exception reading instance data 
Traceback (most recent call last): 
    File "/usr/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url 
    r = opener.open(req, timeout=timeout) 
    File "/usr/lib/python2.7/urllib2.py", line 429, in open 
    response = self._open(req, data) 
    File "/usr/lib/python2.7/urllib2.py", line 447, in _open 
    '_open', req) 
    File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain 
    result = func(*args) 
    File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open 
    return self.do_open(httplib.HTTPConnection, req) 
    File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open 
    raise URLError(err) 
URLError: <urlopen error timed out> 
2016-10-05 11:21:36 [boto] ERROR: Unable to read instance data, giving up 

Answers


ValueError: Missing scheme in request url: theunderminejournal.com/#eu/eredar/item/124442

URLs must always start with either http:// or https://.

start_urls = (
    'theunderminejournal.com/#eu/eredar/item/124442', 
    #^should be: 
    'http://theunderminejournal.com/#eu/eredar/item/124442', 
) 
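
Putting it together, here is a minimal sketch of the corrected spider (assuming plain http:// works for this site; note that the '#eu/...' fragment is resolved client-side and is never sent to the server, so Scrapy will fetch the site's main page):

# -*- coding: utf-8 -*-
import scrapy


class JournalSpider(scrapy.Spider):
    name = "journal"
    allowed_domains = ["theunderminejournal.com"]
    # The scheme is required; the '#eu/...' fragment stays client-side.
    start_urls = (
        'http://theunderminejournal.com/#eu/eredar/item/124442',
    )

    def parse(self, response):
        # Dump the raw response body to a local file for inspection.
        page = response.url.split("/")[-2]
        filename = 'journal-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
            self.log('Saved file %s' % filename)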

The error in your edit is completely unrelated; it comes from the 'boto' package failing to connect somewhere. You can most likely just ignore it. Is the spider itself working now? – Granitosaurus
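
For reference, a commonly suggested way to silence that boto noise in Scrapy 1.0 (a sketch, assuming you never fetch s3:// URLs) is to disable the S3 download handler in the project's settings.py:

# settings.py
# Disabling the s3 handler keeps boto from probing the EC2 metadata
# service at startup; skip this if you actually crawl s3:// URLs.
DOWNLOAD_HANDLERS = {'s3': None}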
