Scrapy spider only touches start_urls

I found that my CrawlSpider crawls only the start_urls and goes no further.

Below is my code:

import scrapy 
from scrapy.linkextractors import LinkExtractor 
from scrapy.spiders import CrawlSpider, Rule 


class ExampleSpider(CrawlSpider): 
    name = 'example' 
    allowed_domains = ['holy-bible-eng'] 
    start_urls = ['file:///G:/holy-bible-eng/OEBPS/bible-toc.xhtml'] 

    rules = (
        Rule(LinkExtractor(allow=r'OEBPS'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        return response

Below is my file:///G:/holy-bible-eng/OEBPS/bible-toc.xhtml, the page listed in start_urls:

<?xml version="1.0" encoding="UTF-8"?> 
 
<!DOCTYPE html 
 
    PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> 
 
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>Holy Bible</title><link href="lds_ePub_scriptures.css" rel="stylesheet" type="text/css" /></head><body class="bible-toc"><div class="titleBlock"><h1 class="toc-title">The Names and Order of All the <br /><span class="dominant">Books of the Old and <br />New Testaments</span></h1></div><div class="bible-toc"><p><a href="bible_dedication.xhtml">Epistle Dedicatory</a> | <a href="quad_abbreviations.xhtml">Abbreviations</a></p><h2 class="toc-title"><a href="ot.xhtml">The Books of the Old Testament</a></h2><p><a href="gen.xhtml">Genesis</a> | <a href="ex.xhtml">Exodus</a> | <a href="lev.xhtml">Leviticus</a> | <a href="num.xhtml">Numbers</a> | <a href="deut.xhtml">Deuteronomy</a> | <a href="josh.xhtml">Joshua</a> | <a href="judg.xhtml">Judges</a> | <a href="ruth.xhtml">Ruth</a> | <a href="1-sam.xhtml">1 Samuel</a> | <a href="2-sam.xhtml">2 Samuel</a> | <a href="1-kgs.xhtml">1 Kings</a> | <a href="2-kgs.xhtml">2 Kings</a> | <a href="1-chr.xhtml">1 Chronicles</a> | <a href="2-chr.xhtml">2 Chronicles</a> | <a href="ezra.xhtml">Ezra</a> | <a href="neh.xhtml">Nehemiah</a> | <a href="esth.xhtml">Esther</a> | <a href="job.xhtml">Job</a> | <a href="ps.xhtml">Psalms</a> | <a href="prov.xhtml">Proverbs</a> | <a href="eccl.xhtml">Ecclesiastes</a> | <a href="song.xhtml">Song of Solomon</a> | <a href="isa.xhtml">Isaiah</a> | <a href="jer.xhtml">Jeremiah</a> | <a href="lam.xhtml">Lamentations</a> | <a href="ezek.xhtml">Ezekiel</a> | <a href="dan.xhtml">Daniel</a> | <a href="hosea.xhtml">Hosea</a> | <a href="joel.xhtml">Joel</a> | <a href="amos.xhtml">Amos</a> | <a href="obad.xhtml">Obadiah</a> | <a href="jonah.xhtml">Jonah</a> | <a href="micah.xhtml">Micah</a> | <a href="nahum.xhtml">Nahum</a> | <a href="hab.xhtml">Habakkuk</a> | <a href="zeph.xhtml">Zephaniah</a> | <a href="hag.xhtml">Haggai</a> | <a href="zech.xhtml">Zechariah</a> | <a 
href="mal.xhtml">Malachi</a></p><h2 class="toc-title"><a href="nt.xhtml">The Books of the New Testament</a></h2><p><a href="matt.xhtml">Matthew</a> | <a href="mark.xhtml">Mark</a> | <a href="luke.xhtml">Luke</a> | <a href="john.xhtml">John</a> | <a href="acts.xhtml">Acts</a> | <a href="rom.xhtml">Romans</a> | <a href="1-cor.xhtml">1 Corinthians</a> | <a href="2-cor.xhtml">2 Corinthians</a> | <a href="gal.xhtml">Galatians</a> | <a href="eph.xhtml">Ephesians</a> | <a href="philip.xhtml">Philippians</a> | <a href="col.xhtml">Colossians</a> | <a href="1-thes.xhtml">1 Thessalonians</a> | <a href="2-thes.xhtml">2 Thessalonians</a> | <a href="1-tim.xhtml">1 Timothy</a> | <a href="2-tim.xhtml">2 Timothy</a> | <a href="titus.xhtml">Titus</a> | <a href="philem.xhtml">Philemon</a> | <a href="heb.xhtml">Hebrews</a> | <a href="james.xhtml">James</a> | <a href="1-pet.xhtml">1 Peter</a> | <a href="2-pet.xhtml">2 Peter</a> | <a href="1-jn.xhtml">1 John</a> | <a href="2-jn.xhtml">2 John</a> | <a href="3-jn.xhtml">3 John</a> | <a href="jude.xhtml">Jude</a> | <a href="rev.xhtml">Revelation</a></p><h2 class="toc-title"><a href="bible-helps_title-page.xhtml">Appendix</a></h2><p><a href="tg.xhtml">Topical Guide</a> | <a href="bd.xhtml">Bible Dictionary</a> | <a href="bible-chron.xhtml">Bible Chronology</a> | <a href="harmony.xhtml">Harmony of the Gospels</a> | <a href="jst.xhtml">Joseph Smith Translation</a> | <a href="bible-maps.xhtml">Bible Maps</a> | <a href="bible-photos.xhtml">Bible Photographs</a></p></div></body></html>

And below is my console output:

(crawl) G:\kjvbible>scrapy crawl example 
...... 
...... 

2017-04-08 09:24:59 [scrapy.core.engine] INFO: Spider opened 
2017-04-08 09:24:59 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2017-04-08 09:24:59 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6026 
2017-04-08 09:24:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET file:///G:/holy-bible-eng/OEBPS/bible-toc.xhtml> (referer: None) 
2017-04-08 09:24:59 [scrapy.core.engine] INFO: Closing spider (finished) 
2017-04-08 09:24:59 [scrapy.statscollectors] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 237, 
'downloader/request_count': 1, 
'downloader/request_method_count/GET': 1, 
'downloader/response_bytes': 3693,

It goes no further.

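Any suggestions are welcome.

Source: 2017-04-08 Aaron

From the CrawlSpider documentation:

follow is a boolean which specifies if links should be followed from each response extracted with this rule. If callback is None, follow defaults to True; otherwise it defaults to False.

That default only applies when follow is omitted. Your rule passes follow=True explicitly, so links are still followed after the callback runs, and the rule itself is not what stops the crawl.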

The main idea of CrawlSpider rules is that they let you specify both the links you actually want to extract data from and the links you only want to follow.
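The documented default for follow can be illustrated in plain Python. This is only a sketch of the rule as the documentation states it, not Scrapy's own code, and `effective_follow` is a hypothetical helper name.

```python
def effective_follow(callback=None, follow=None):
    """Return the follow value a rule ends up with under the documented default."""
    if follow is not None:
        # An explicit follow=True or follow=False always wins.
        return bool(follow)
    # follow omitted: True when there is no callback, False otherwise.
    return callback is None


print(effective_follow(callback=None))                       # True
print(effective_follow(callback="parse_item"))               # False
print(effective_follow(callback="parse_item", follow=True))  # True
```

The last call corresponds to the rule in the question: a callback together with an explicit follow=True.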

Also, Scrapy is not the best tool for checking local files here; a simple script would do the job.
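As a rough idea of what such a script might look like, here is a minimal sketch that uses only the standard library. The names `LinkCollector`, `extract_links`, `resolve_links` and `linked_files` are made up for illustration, and it assumes every href in the table of contents is a relative file name, as in the page shown in the question.

```python
from html.parser import HTMLParser
from pathlib import Path


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(markup):
    """Return all <a href> values found in markup, in document order."""
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links


def resolve_links(toc_path, hrefs):
    """Resolve relative hrefs against the folder that contains toc_path."""
    base = Path(toc_path).parent
    # Drop any "#fragment" so only the file name is joined to the folder.
    return [base / href.split("#")[0] for href in hrefs]


def linked_files(toc_path):
    """Read the table-of-contents file and return the local files it links to."""
    markup = Path(toc_path).read_text(encoding="utf-8")
    return resolve_links(toc_path, extract_links(markup))


sample = '<p><a href="gen.xhtml">Genesis</a> | <a href="ex.xhtml">Exodus</a></p>'
print(extract_links(sample))  # ['gen.xhtml', 'ex.xhtml']
```

Calling `linked_files` with the path to bible-toc.xhtml would return one path per linked book, which the script could then open and process in turn.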

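The error is that you set the allowed_domains class attribute, which specifies the domains to accept. Requests to anything else are rejected, and this only makes sense for links on the internet. If you don't want to reject any domains, or you are not using domains at all (your case), remove that attribute.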
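To see why local links get rejected, consider the kind of host check a domain filter performs. The sketch below is a simplified approximation, not Scrapy's actual implementation, and `is_onsite` is a hypothetical helper. The point is that a file:// URL has no host, so it cannot match any entry in allowed_domains.

```python
from urllib.parse import urlparse


def is_onsite(url, allowed_domains):
    """Simplified check: is the URL's host an allowed domain or a subdomain of one?"""
    if not allowed_domains:
        # No allowed_domains configured: nothing is filtered out.
        return True
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


allowed = ["holy-bible-eng"]
local_url = "file:///G:/holy-bible-eng/OEBPS/gen.xhtml"

# "holy-bible-eng" is part of the path here, not the host.
print(urlparse(local_url).hostname)                                 # None
print(is_onsite(local_url, allowed))                                # False
print(is_onsite("http://holy-bible-eng/OEBPS/gen.xhtml", allowed))  # True
print(is_onsite(local_url, []))                                     # True
```

The last line mirrors the fix: with no allowed_domains, the local links are no longer filtered.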

Source: 2017-04-08 01:45:43 eLRuLL

Thanks for your reply. I commented out 'allowed_domains' and it started following links! – Aaron

Glad it helped! – eLRuLL

@Aaron don't forget to accept the answer if it helped you. – eLRuLL
