2017-10-12

Can't start Nutch crawl

I am trying to deploy Nutch 2.3 + ElasticSearch 1.4 + HBase 0.94 on Ubuntu 14.04 by following this tutorial. When I try to start the crawl by injecting the URLs with:

$NUTCH_ROOT/runtime/local/bin/nutch inject urls

I get:

InjectorJob: starting at 2017-10-12 19:27:48
InjectorJob: Injecting urlDir: urls

and the process hangs for hours.

Any idea what is going on?

Configuration files:

nutch-site.xml

<configuration> 
    <property> 
    <name>http.agent.name</name> 
    <value>mycrawlername</value> <!-- this can be changed to something more sane if you like --> 
    </property> 
    <property> 
    <name>http.robots.agents</name> 
    <value>mycrawlername</value> <!-- this is the robot name we're looking for in robots.txt files --> 
    </property> 
    <property> 
    <name>storage.data.store.class</name> 
    <value>org.apache.gora.hbase.store.HBaseStore</value> 
    </property> 
    <property> 
    <name>plugin.includes</name> 
    <!-- do **NOT** enable the parse-html plugin, if you want proper HTML parsing. Use something like parse-tika! --> 
    <value>protocol-httpclient|urlfilter-regex|parse-(text|tika|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-elastic</value> 
    </property> 
    <property> 
    <name>db.ignore.external.links</name> 
    <value>true</value> <!-- do not leave the seeded domains (optional) --> 
    </property> 
    <property> 
    <name>elastic.host</name> 
    <value>localhost</value> <!-- where is ElasticSearch listening --> 
    </property> 
</configuration> 

hbase-site.xml

<configuration> 
    <property> 
     <name>hbase.rootdir</name> 
     <value>/home/kike/RIWS/hbase-0.94.14/</value> 
    </property> 
    <property> 
     <name>hbase.cluster.distributed</name> 
     <value>false</value> 
    </property> 
</configuration> 

Log files:

HBase master log

2017-10-12 19:27:49,593 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:47778 
2017-10-12 19:27:49,596 INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /127.0.0.1:47778 
2017-10-12 19:27:49,609 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x15f11684f3f0017 with negotiated timeout 40000 for client /127.0.0.1:47778 
2017-10-12 19:31:11,092 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Stats: total=1.99 MB, free=239.7 MB, max=241.69 MB, blocks=2, accesses=18, hits=16, hitRatio=88,88%, , cachingAccesses=18, cachingHits=16, cachingHitsRatio=88,88%, , evictions=0, evicted=0, evictedPerRun=NaN 
2017-10-12 19:31:24,623 DEBUG org.apache.hadoop.hbase.client.MetaScanner: Scanning .META. starting at row= for max=2147483647 rows using org.apache.h[email protected]1646b7c 
2017-10-12 19:31:24,630 DEBUG org.apache.hadoop.hbase.master.CatalogJanitor: Scanned 0 catalog row(s) and gc'd 0 unreferenced parent region(s) 
2017-10-12 19:32:13,832 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x15f11684f3f0017 
2017-10-12 19:32:13,849 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:47778 which had sessionid 0x15f11684f3f0017 
2017-10-12 19:32:14,852 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:47817 
2017-10-12 19:32:14,853 INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /127.0.0.1:47817 
2017-10-12 19:32:14,880 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x15f11684f3f0018 with negotiated timeout 40000 for client /127.0.0.1:47817 

Hadoop log

2017-10-12 19:27:48,871 INFO crawl.InjectorJob - InjectorJob: starting at 2017-10-12 19:27:48 
2017-10-12 19:27:48,871 INFO crawl.InjectorJob - InjectorJob: Injecting urlDir: urls 

EDIT:

After about an hour, the Hadoop log shows:

2017-10-12 20:34:59,333 ERROR crawl.InjectorJob - InjectorJob: org.apache.gora.util.GoraException: java.lang.RuntimeException: org.apache.hadoop.hbase.MasterNotRunningException: Retried 14 times 
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167) 
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135) 
    at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:78) 
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:218) 
    at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252) 
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:275) 
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) 
    at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:284) 
Caused by: java.lang.RuntimeException: org.apache.hadoop.hbase.MasterNotRunningException: Retried 14 times 
    at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:133) 
    at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102) 
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161) 
    ... 7 more 
Caused by: org.apache.hadoop.hbase.MasterNotRunningException: Retried 14 times 
    at org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:139) 
    at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:115) 
    ... 9 more 

However, if I run jps, I can see that HMaster is running:

31672 Jps 
20553 HMaster 
19739 Elasticsearch 

How many URLs do you have in the seed file?

Just one, @JorgeLuis (https://www.fic.udc.es/) – Kroka

Judging by the exception 'org.apache.hadoop.hbase.MasterNotRunningException: Retried 14 times', it looks like Nutch is not connecting to the HBase server correctly.

Answers

Your error log shows a MasterNotRunningException, which means the HBase client used by Nutch (through Gora) could not reach a running HBase master:

org.apache.hadoop.hbase.MasterNotRunningException: Retried 14 times 
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167) 
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135) 
    at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:78) 
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:218) 
    at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252) 
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:275) 
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) 
    at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:284) 
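Because jps already lists an HMaster process, it may help to check whether the master is actually reachable and not just alive. The following is a minimal sketch, not a definitive diagnostic. It assumes that nc is installed and that HBase uses its default ports for this version (2181 for the embedded ZooKeeper, 60000 for the master RPC port). The function names has_hmaster, port_open and check_hbase are hypothetical helpers introduced here for illustration.

```shell
#!/bin/sh
# Sketch: check that an HBase master is reachable, not just listed by jps.
# Assumptions (not from the original post): nc is installed, ZooKeeper
# listens on 2181 and the HBase master RPC port is 60000.

# Returns 0 if the jps listing read from stdin contains an HMaster process.
has_hmaster() {
    grep -Eq '^[0-9]+ HMaster *$'
}

# Returns 0 if a TCP connection to host $1, port $2 succeeds within 3 seconds.
port_open() {
    nc -z -w 3 "$1" "$2" >/dev/null 2>&1
}

# Prints a short report for the host given as $1 (default: localhost).
check_hbase() {
    host="${1:-localhost}"
    if jps | has_hmaster; then
        echo "HMaster process: found"
    else
        echo "HMaster process: missing"
    fi
    for port in 2181 60000; do
        if port_open "$host" "$port"; then
            echo "port $port: open"
        else
            echo "port $port: closed"
        fi
    done
}
```

Running check_hbase localhost after start-hbase.sh should report both ports as open. If the process is found but a port is closed, the master is probably listening on a different address or is still initializing, and the HBase master log is the next place to look.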

First, we need to set up HBase.

Open ~/Desktop/Nutch/hbase/conf/hbase-site.xml and add the following two properties. They tell HBase where its root directory is and where ZooKeeper should keep its data.

open ~/Desktop/Nutch/hbase/conf/hbase-site.xml 

<configuration> 
     <property> 
      <name>hbase.rootdir</name> 
      <value>file:///Users/sntiwari/Desktop/Nutch/hbase</value> 
     </property> 
     <property> 
      <name>hbase.zookeeper.property.dataDir</name> 
      <value>/Users/sntiwari/Desktop/Nutch/zookeeper</value> 
     </property> 
    </configuration> 
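To confirm that both properties are really set in the file HBase reads, a small helper can print the value configured for a given name. This is only a line-oriented sketch: it assumes each <name> and <value> element sits on its own line, as in the listing above, and it is not a real XML parser. The function name get_property is a hypothetical helper, not part of HBase.

```shell
#!/bin/sh
# Sketch: print the <value> that follows <name>KEY</name> in a Hadoop-style
# configuration file. Assumes <name> and <value> are on separate lines.
# Usage: get_property FILE KEY
get_property() {
    awk -v key="$2" '
        # Remember that the requested <name> line was seen.
        index($0, "<name>" key "</name>") { found = 1; next }
        # Print the text of the first <value> element after it, then stop.
        found && /<value>/ {
            sub(/.*<value>/, "")
            sub(/<\/value>.*/, "")
            print
            exit
        }
    ' "$1"
}
```

For example, get_property ~/Desktop/Nutch/hbase/conf/hbase-site.xml hbase.rootdir prints the configured root directory, and prints nothing if the property is missing.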

Next, we need to tell Gora to use HBase as its default data store.

open ~/Desktop/Nutch/nutch/conf/gora.properties 
# open ~/Desktop/Nutch/nutch/runtime/local/conf/gora.properties 

# Add this line under `HBaseStore properties` (to keep things organised) 
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore 

We also need to uncomment (or add) the gora-hbase dependency in ivy.xml (it may be around line 118).

open ~/Desktop/Nutch/nutch/ivy/ivy.xml 

# Find and uncomment this line (approx. line 118)
<dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" /> 

**Test your HBase**

# Start it up! 
~/Desktop/Nutch/hbase/bin/start-hbase.sh 

# Stop it (Can take a while, be patient) 
~/Desktop/Nutch/hbase/bin/stop-hbase.sh 

# Access the shell 
~/Desktop/Nutch/hbase/bin/hbase shell 

# list    = list all tables 
# disable 'webpage' = disable the table (before dropping) 
# drop 'webpage'  = drop the table (webpage is created & used by nutch) 
# exit    = exit from hbase 

# For the next part, we need to start hbase 
~/Desktop/Nutch/hbase/bin/start-hbase.sh 

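Also go through these checks:

  1. First, check that the versions you are using are compatible.

  2. Make sure the JAVA_HOME and NUTCH_JAVA_HOME environment variables are set.

  3. Compile Nutch. You need to build Apache Nutch with Ant (ant runtime), so that the changes to ivy.xml and gora.properties end up in runtime/local.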
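Checks 2 and 3 can be combined so that the build only starts when the environment is ready. This is a minimal sketch; require_env is a hypothetical helper introduced here, and $NUTCH_ROOT is assumed to point at the Nutch source directory, as in the question.

```shell
#!/bin/sh
# Sketch: report any of the named environment variables that are unset or empty.
# Returns 0 only if all of them are set.
# Usage: require_env JAVA_HOME NUTCH_JAVA_HOME
require_env() {
    missing=0
    for name in "$@"; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name"
            missing=1
        fi
    done
    return "$missing"
}
```

A possible way to use it: require_env JAVA_HOME NUTCH_JAVA_HOME && (cd "$NUTCH_ROOT" && ant runtime). The build is skipped, and the missing variable names are printed, if either variable is not set.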