Nutch 2.3.1 crawls seed URLs only

I need to crawl all links from a small number of URLs. For this I am using Apache Nutch 2.3.1 with Hadoop and HBase. Below is the nutch-site.xml file used for this purpose.

<?xml version="1.0"?> 
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?> 

<!-- Put site-specific property overrides in this file. --> 

<configuration> 
<property> 
    <name>http.agent.name</name> 
    <value>crawler</value> 
</property> 
<property> 
    <name>storage.data.store.class</name> 
    <value>org.apache.gora.hbase.store.HBaseStore</value> 
</property> 
<property> 
    <name>plugin.includes</name> 
    <value>protocol-httpclient|protocol-http|indexer-solr|urlfilter-regex|parse-(html|tika)|index-(basic|more|urdu)|urlnormalizer-(pass|regex|basic)|scoring-opic</value> 
</property> 
<property> 
    <name>parser.character.encoding.default</name> 
    <value>utf-8</value> 
</property> 
<property> 
    <name>http.robots.403.allow</name> 
    <value>true</value> 
</property> 
<property> 
    <name>db.max.outlinks.per.page</name> 
    <value>-1</value> 
</property> 
<property> 
    <name>http.robots.agents</name> 
    <value>crawler,*</value> 
</property> 

<!-- language-identifier plugin properties --> 

<property> 
    <name>lang.ngram.min.length</name> 
    <value>1</value> 
</property> 

<property> 
    <name>lang.ngram.max.length</name> 
    <value>4</value> 
</property> 

<property> 
    <name>lang.analyze.max.length</name> 
    <value>2048</value> 
</property> 

<property> 
    <name>lang.extraction.policy</name> 
    <value>detect,identify</value> 
</property> 

<property> 
    <name>lang.identification.only.certain</name> 
    <value>true</value> 
</property> 

<!-- Language properties ends here --> 
<property> 
     <name>http.timeout</name> 
     <value>20000</value> 
</property> 
<!-- These properties were added because the number of crawled documents started to decrease --> 
<property> 
    <name>fetcher.max.crawl.delay</name> 
    <value>10</value> 
</property> 
<property> 
    <name>generate.max.count</name> 
    <value>10000</value> 
</property> 

<property> 
    <name>db.ignore.external.links</name> 
    <value>true</value> 
</property> 
</configuration>

When I crawl a few URLs, only the seed URLs are fetched, and the crawl then ends with this message:

GeneratorJob: Selecting best-scoring urls due for fetch. 
GeneratorJob: starting 
GeneratorJob: filtering: false 
GeneratorJob: normalizing: false 
GeneratorJob: topN: 20 
GeneratorJob: finished at 2017-04-21 16:28:35, time elapsed: 00:00:02 
GeneratorJob: generated batch id: 1492774111-8887 containing 0 URLs 
Generate returned 1 (no new segments created) 
Escaping loop: no more URLs to fetch now

A similar problem is described here, but it is for version 1.1. I have implemented that solution and it does not work in my case.

Source

2017-04-21 Shafiq

Have you found a solution to this problem?

After injecting the seeds you need to run the cycle Generate > Fetch > Parse > UpdateDb. A single pass cannot reach all the links, so you need to run this cycle several times.
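The cycle described in the comment above can be sketched as a small shell function. This is a minimal sketch, not the poster's actual setup: the seed directory `urls`, the round count of 3, and the `NUTCH`, `SEED_DIR`, `ROUNDS` and `TOP_N` variables are assumptions made for illustration, and it relies on the Nutch 2.x `fetch`, `parse` and `updatedb` commands accepting `-all`.

```shell
# Assumed defaults; override them through the environment.
NUTCH="${NUTCH:-bin/nutch}"    # path to the Nutch 2.x launcher script
SEED_DIR="${SEED_DIR:-urls}"   # hypothetical directory holding the seed list
ROUNDS="${ROUNDS:-3}"          # hypothetical number of crawl rounds
TOP_N="${TOP_N:-20}"           # same topN as in the GeneratorJob log

run_crawl() {
  # Inject the seeds once, then repeat Generate > Fetch > Parse > UpdateDb.
  "$NUTCH" inject "$SEED_DIR" || return 1
  round=1
  while [ "$round" -le "$ROUNDS" ]; do
    echo "round $round of $ROUNDS"
    # The log in the question shows generate returning 1 when nothing is due.
    if ! "$NUTCH" generate -topN "$TOP_N"; then
      echo "nothing to generate, stopping"
      break
    fi
    "$NUTCH" fetch -all || return 1
    "$NUTCH" parse -all || return 1
    "$NUTCH" updatedb -all || return 1
    round=$((round + 1))
  done
}
```

Calling `run_crawl` from the Nutch runtime directory would run the rounds. Each round can only generate URLs that the previous `updatedb` step added as outlinks, which is why a run that stops after the seeds points at filtering or link-ignoring settings rather than at the loop itself.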

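Could you check your conf/regex-urlfilter.txt to see whether one of its filter regexes is blocking the expected outlinks? By default the file ends with the catch-all rule:

# accept anything else 
+.

Because you set db.ignore.external.links to true, Nutch will not generate outlinks that point to a different host. You should also check whether the db.ignore.internal.links property in conf/nutch-default.xml is false. Otherwise there are no outlinks left to generate.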

<property> 
    <name>db.ignore.internal.links</name> 
    <value>false</value> 
</property> 
<property> 
    <name>db.ignore.external.links</name> 
    <value>true</value> 
</property> 
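Both checks can be done from the command line. The helpers below are a rough sketch under stated assumptions: the function names `prop_value`, `effective_value` and `active_rules` are made up for illustration, each `<value>` is assumed to sit on a single line, and properties inside XML comments are not skipped.

```shell
# Print the first <value> that follows <name>$1</name> in config file $2.
prop_value() {
  awk -v name="$1" '
    index($0, "<name>" name "</name>") { found = 1 }
    found && match($0, /<value>.*<\/value>/) {
      # Drop the 7-char "<value>" prefix and the 8-char "</value>" suffix.
      print substr($0, RSTART + 7, RLENGTH - 15)
      exit
    }
  ' "$2"
}

# nutch-site.xml overrides nutch-default.xml, so look there first.
# $1 = property name, $2 = conf directory.
effective_value() {
  v="$(prop_value "$1" "$2/nutch-site.xml")"
  [ -n "$v" ] || v="$(prop_value "$1" "$2/nutch-default.xml")"
  printf '%s\n' "$v"
}

# List the rules of a regex-urlfilter.txt without comments and blank lines.
active_rules() {
  grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$1"
}
```

For example, `effective_value db.ignore.internal.links conf` should print `false` if internal links are to be followed, and `active_rules conf/regex-urlfilter.txt` shows the rules in file order, so a `-` rule that matches the wanted outlinks before the final `+.` is easy to spot.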

HTH.

Source

2017-04-23 05:51:24
