Crawling a specific web page with Scrapy

Hi, I'm a bit of a noob with Scrapy. I was trying to crawl articles (content, agency name, correspondent, etc.) from the following page with Scrapy: http://timesofindia.indiatimes.com/topic/Startup

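I have a problem. My spider returns correct results for most articles, but for articles whose agency name is "reuters" (e.g. http://timesofindia.indiatimes.com/business/international-business/novartis-roche-back-french-gene-therapy-start-up-vivet/articleshow/58511702.cms) it returns escape characters instead of the content, although it still returns the headline and the agency name. Here are my XPath variables: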

main_path = response.xpath('//div[@class="main-content"]')

yield {
    'Headline': "".join(main_path.xpath('.//h1[@class="heading1"]/text()').extract()),
    'Correspondent': "".join(main_path.xpath('.//span[@class="auth_detail"]/text()').extract()),
    'Agency': "".join(main_path.xpath('.//span[@itemprop="name"]/text()').extract()),
    'ArticleContent': main_path.xpath('.//div[@class="Normal"]/text()').extract(),
}

Why am I facing this problem? Thanks.

Source

2017-05-11 D.Ace

Solution: insert a second / before text() in your XPath

'ArticleContent':(main_path.xpath('.//div[@class="Normal"]//text()').extract()),
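With //text(), .extract() returns a list of fragments, and that list will usually include whitespace-only entries such as "\n" from the gaps between tags. A minimal sketch of one way to collapse such a list into a single string is below; the helper name join_fragments and the sample list are hypothetical, not taken from the page.

```python
def join_fragments(fragments):
    """Join text fragments and collapse every run of whitespace to one space."""
    # str.split() with no argument splits on any whitespace and drops empties,
    # so newline-only fragments disappear and double spaces are normalised.
    return " ".join(" ".join(fragments).split())


# Hypothetical stand-in for what .extract() might return for a Reuters article.
raw = ["\n", "First paragraph.", "\n", "Second  paragraph.", "\n"]

print(join_fragments(raw))
```

In the spider this could wrap the extract() call, for example 'ArticleContent': join_fragments(main_path.xpath('.//div[@class="Normal"]//text()').extract()).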

Explanation

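The Reuters articles have additional <p> tags inside the article content. /text() captures only the text nodes that are direct children of the selected node/tag, whereas //text() also captures the text inside its sub-tags/sub-nodes. For the Reuters pages the only direct text of the div is the whitespace between the <p> tags, which is why you get escape characters such as \n instead of the content.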
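The difference between direct and descendant text nodes can be illustrated without Scrapy. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors; the two markup snippets and the helper names direct_text and descendant_text are assumptions made for illustration, not the real page source.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one article with bare text, one wrapped in <p> tags.
PLAIN = '<div class="Normal">Plain article text.</div>'
WRAPPED = (
    '<div class="Normal">\n'
    '<p>First paragraph.</p>\n'
    '<p>Second paragraph.</p>\n'
    '</div>'
)


def direct_text(div):
    """Text nodes that are direct children of div, comparable to ./text()."""
    parts = [div.text or ""]
    # Text that follows a child element still belongs to the parent.
    parts.extend(child.tail or "" for child in div)
    return [p for p in parts if p]


def descendant_text(div):
    """All text nodes at any depth below div, comparable to .//text()."""
    return list(div.itertext())


plain = ET.fromstring(PLAIN)
wrapped = ET.fromstring(WRAPPED)

print(direct_text(plain))        # the article text itself
print(direct_text(wrapped))      # only the newlines between the <p> tags
print(descendant_text(wrapped))  # newlines plus both paragraphs
```

The second print shows the same symptom as in the question: when the content sits inside <p> tags, the direct text of the div is nothing but newlines.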

Source

2017-05-11 11:33:07 rrschmidt

Thank you.
