Scrapy: how to select the head and body tags together

I have a crawler that needs to extract some data from meta tags in the head and from some element tags in the body.

When I try this:

for courses in response.xpath("//html"):

and this:

for courses in response.xpath("//head"):

it only fetches data from the meta tags inside the <head>...</head> tag.

When I try this:

for courses in response.xpath("//body"):

it only fetches data from the tags inside the HTML <body>...</body> tag.

How do I combine these two selectors? I also tried this:

for courses in response.xpath("//head | //body"):

It only returned the 'meta' tags from <head>...</head> and did not extract anything from the body.

I also tried this:

for courses in response.xpath("//*"):

It works, but it is very inefficient and takes a long time to extract. I am sure there is a more efficient way to do this.

And here is the Scrapy code, in case it helps. Under yield, the first two elements (pagetype, pagefeatured) are in the <head>...</head> tag. The last two elements (coursetloc, coursetfees) are in the <body>...</body> tag.

And yes, it may look odd, but the website I am scraping has 'meta' tags inside <body>...</body>.

import scrapy


class MySpider(scrapy.Spider):
    name = "dkcourses"
    start_urls = ['http://www.example.com/scrapy/all-courses-listing']
    allowed_domains = ["example.com"]

    def parse(self, response):
        # The queries below are relative to <body>, so meta tags in <head> are not matched.
        for courses in response.xpath("//body"):
            yield {
                'pagetype': ''.join(courses.xpath('.//meta[@name="dkpagetype"]/@content').extract()),
                'pagefeatured': ''.join(courses.xpath('.//meta[@name="dkpagefeatured"]/@content').extract()),
                'coursetloc': ''.join(courses.xpath('.//meta[@name="dkcoursetloc"]/@content').extract()),
                'coursetfees': ''.join(courses.xpath('.//meta[@name="dkcoursetfees"]/@content').extract()),
            }
        # Follow the course links listed on the page.
        for url in response.xpath('//ul[@class="scrapy"]/li/a/@href').extract():
            yield scrapy.Request(response.urljoin(url), callback=self.parse)

Thanks for your help.

Asked 2017-02-10 by Slyper

Post the URL or the HTML code. –

@李 I mean the website URL. – Slyper

... –

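Use extract_first() to get the first value of extract(); don't use join(). //meta finds the meta tags anywhere in the document, and [starts-with(@name, "dkn")] keeps only those whose name starts with "dkn".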

In [5]: for meta in response.xpath('//meta[starts-with(@name, "dkn")]'):
   ...:     name = meta.xpath('@name').extract_first()
   ...:     content = meta.xpath('@content').extract_first()
   ...:     print({name: content})

Out:

{'dknpagetype': 'Course'} 
{'dknpagefeatured': ''} 
{'dknpagedate': '2016-01-01'} 
{'dknpagebanner': 'http://www.deakin.edu.au/__data/assets/image/0006/757986/Banner_Cyber-Alt2.jpg'} 
{'dknpagethumbsquare': 'http://www.deakin.edu.au/__data/assets/image/0009/757989/SQ_Cyber1-2.jpg'} 
{'dknpagethumblandscape': 'http://www.deakin.edu.au/__data/assets/image/0007/757987/LS_Cyber1-1.jpg'} 
{'dknpagethumbportrait': 'http://www.deakin.edu.au/__data/assets/image/0008/757988/PT_Cyber1-3.jpg'} 
{'dknpagetitle': 'Graduate Diploma of Cyber Security'} 
{'dknpageurl': 'http://www.deakin.edu.au/course/graduate-diploma-cyber-security'} 
{'dknpagedescription': "Take your understanding of cyber security to the next level with Deakin's Graduate Diploma of Cyber Security and build your capacity to investigate and combat cyber-crime."} 
{'dknpageid': '723503'}

Answered 2017-02-10 06:09:26

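@宏杰李 Thanks for posting the code, but I want to save the values in variables so I can send them to Elasticsearch, not just display them on the screen as in the sample code above. – Slyper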

Never mind, all I had to change in my own code was 'for courses in response.xpath("//body"):' to 'for courses in response.xpath("//meta"):' just now .... – Slyper
