
I have successfully crawled several websites with Nutch and created two segments. I have also installed and started the Solr service, but I am failing to integrate Apache Nutch 1.12 with Solr 5.4.1.

However, when I try to index the crawled data into Solr, it does not work.

I tried this command:

bin/nutch index http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/* 

Output:

The input path at crawldb is not a segment... skipping 
Segment dir is complete: crawl/segments/20161214143435. 
Segment dir is complete: crawl/segments/20161214144230. 
Indexer: starting at 2016-12-15 10:55:35 
Indexer: deleting gone documents: false 
Indexer: URL filtering: false 
Indexer: URL normalizing: false 
Active IndexWriters : 
SOLRIndexWriter 
    solr.server.url : URL of the SOLR instance 
    solr.zookeeper.hosts : URL of the Zookeeper quorum 
    solr.commit.size : buffer size when sending to SOLR (default 1000) 
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml) 
    solr.auth : use authentication (default false) 
    solr.auth.username : username for authentication 
    solr.auth.password : password for authentication 


Indexer: java.io.IOException: No FileSystem for scheme: http 
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2385) 
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392) 
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89) 
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431) 
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413) 
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368) 
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) 
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256) 
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) 
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45) 
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304) 
    at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:520) 
    at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:512) 
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394) 
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285) 
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282) 
    at java.security.AccessController.doPrivileged(Native Method) 
    at javax.security.auth.Subject.doAs(Subject.java:422) 
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) 
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282) 
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562) 
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557) 
    at java.security.AccessController.doPrivileged(Native Method) 
    at javax.security.auth.Subject.doAs(Subject.java:422) 
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) 
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557) 
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548) 
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833) 
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145) 
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228) 
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) 
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237) 
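
For reference, the usage dump above suggests that bin/nutch index treats every positional argument as a path, so the Solr URL presumably has to be passed as a property rather than as the first argument. A rough sketch of that form (not verified on this setup):

# Sketch only: pass the Solr endpoint via -D; positional arguments are crawldb, linkdb and the segments 
bin/nutch index -Dsolr.server.url=http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/* 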

And I also tried this command:

bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/* 

Output:

Segment dir is complete: crawl/segments/20161214143435. 
Segment dir is complete: crawl/segments/20161214144230. 
Indexer: starting at 2016-12-15 10:54:07 
Indexer: deleting gone documents: false 
Indexer: URL filtering: false 
Indexer: URL normalizing: false 
Active IndexWriters : 
SOLRIndexWriter 
    solr.server.url : URL of the SOLR instance 
    solr.zookeeper.hosts : URL of the Zookeeper quorum 
    solr.commit.size : buffer size when sending to SOLR (default 1000) 
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml) 
    solr.auth : use authentication (default false) 
    solr.auth.username : username for authentication 
    solr.auth.password : password for authentication 


Indexing 250/250 documents 
Deleting 0 documents 
Indexing 250/250 documents 
Deleting 0 documents 
Indexer: java.io.IOException: Job failed! 
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836) 
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145) 
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228) 
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) 
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237) 

Before running these, I copied the file into /Nutch/solr-5.4.1/server/solr/configsets/data_driven_schema_configs/conf and renamed it to managed-schema, as suggested.
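
Roughly, that copy-and-rename step looked like the following sketch, assuming the file in question is Nutch's conf/schema.xml and that Nutch is installed under /Nutch/apache-nutch-1.12 (both of those paths are guesses about the layout):

# Assumption: the copied file is Nutch's conf/schema.xml; the Nutch install path is hypothetical 
cp /Nutch/apache-nutch-1.12/conf/schema.xml /Nutch/solr-5.4.1/server/solr/configsets/data_driven_schema_configs/conf/managed-schema 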

What could I be doing wrong? Thanks in advance!

EDIT

This is my Nutch log:

........................... 
........................... 
2016-12-15 10:15:48,355 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb 
2016-12-15 10:15:48,355 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb 
2016-12-15 10:15:48,355 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20161214143435 
2016-12-15 10:15:48,378 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20161214144230 
2016-12-15 10:15:49,120 WARN conf.Configuration - file:/tmp/hadoop-kaidul/mapred/staging/kaidul1333791357/.staging/job_local1333791357_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring. 
2016-12-15 10:15:49,122 WARN conf.Configuration - file:/tmp/hadoop-kaidul/mapred/staging/kaidul1333791357/.staging/job_local1333791357_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring. 
2016-12-15 10:15:49,180 WARN conf.Configuration - file:/tmp/hadoop-kaidul/mapred/local/localRunner/kaidul/job_local1333791357_0001/job_local1333791357_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring. 
2016-12-15 10:15:49,181 WARN conf.Configuration - file:/tmp/hadoop-kaidul/mapred/local/localRunner/kaidul/job_local1333791357_0001/job_local1333791357_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring. 
2016-12-15 10:15:49,406 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off 
2016-12-15 10:15:50,930 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter 
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: content dest: content 
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: title dest: title 
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: host dest: host 
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: segment dest: segment 
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: boost dest: boost 
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: digest dest: digest 
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: tstamp dest: tstamp 
2016-12-15 10:15:51,243 INFO solr.SolrIndexWriter - Indexing 250/250 documents 
2016-12-15 10:15:51,243 INFO solr.SolrIndexWriter - Deleting 0 documents 
2016-12-15 10:15:51,384 INFO solr.SolrIndexWriter - Indexing 250/250 documents 
2016-12-15 10:15:51,384 INFO solr.SolrIndexWriter - Deleting 0 documents 
2016-12-15 10:15:51,414 WARN mapred.LocalJobRunner - job_local1333791357_0001 
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr: Expected mime type application/octet-stream but got text/html. <html> 
<head> 
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/> 
<title>Error 404 Not Found</title> 
</head> 
<body><h2>HTTP ERROR 404</h2> 
<p>Problem accessing /solr/update. Reason: 
<pre> Not Found</pre></p><hr><i><small>Powered by Jetty://</small></i><hr/> 

</body> 
</html> 

    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462) 
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529) 
............................ 
............................. 
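
The 404 on /solr/update in this log indicates the requests are hitting the Solr root rather than a core. With Solr 5.x the index URL normally has to include a core name, so a sketch of what that might look like (the core name "nutch" is just an example and would have to be created first):

# Sketch: create a core and point the indexer at it; the core name "nutch" is an assumption 
bin/solr create -c nutch 
bin/nutch solrindex http://localhost:8983/solr/nutch crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/* 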

Answer

The problem was a version incompatibility between Solr, Nutch, and HBase. This article worked perfectly for me.
