Python Scrapy: extracting a link using another function

I am new to scraping and I am scraping yellowpages for learning purposes. Everything works fine, but I also need the email address, and the separate parse_email function I use to get it does not work.

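What I mean is that I tested the parse_email function on its own and it works, but it does not work when called from inside the main parse function. I want parse_email to fetch the source of the link, so I call it through a callback. Instead of opening the page and returning the email, it only returns the link itself, like this: <GET https://www.yellowpages.com/los-angeles-ca/mip/palm-tree-la-7254813?lid=7254813>. Here is the code; I have commented the relevant parts:
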
import scrapy
import requests
from urlparse import urljoin

scrapy.optional_features.remove('boto')


class YellowSpider(scrapy.Spider):
    name = 'yellow spider'
    start_urls = ['https://www.yellowpages.com/search?search_terms=restaurant&geo_location_terms=Los+Angeles%2C+CA']

    def parse(self, response):
        SET_SELECTOR = '.info'
        for brickset in response.css(SET_SELECTOR):
            NAME_SELECTOR = 'h3 a ::text'
            ADDRESS_SELECTOR = '.adr ::text'
            PHONE = '.phone.primary ::text'
            WEBSITE = '.links a ::attr(href)'

            # Getting the link of the page that has the email using this selector
            EMAIL_SELECTOR = 'h3 a ::attr(href)'

            # extracting the link
            email = brickset.css(EMAIL_SELECTOR).extract_first()

            # joining and making complete url
            url = urljoin(response.url, brickset.css('h3 a ::attr(href)').extract_first())

            yield {
                'name': brickset.css(NAME_SELECTOR).extract_first(),
                'address': brickset.css(ADDRESS_SELECTOR).extract_first(),
                'phone': brickset.css(PHONE).extract_first(),
                'website': brickset.css(WEBSITE).extract_first(),
                # ONLY returning link of the page, not calling the function
                'email': scrapy.Request(url, callback=self.parse_email),
            }

        NEXT_PAGE_SELECTOR = '.pagination ul a ::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract()[-1]
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )

    def parse_email(self, response):
        # xpath for the email address in the nested page
        EMAIL_SELECTOR = '//a[@class="email-business"]/@href'

        # returning the extracted email
        # XPATH WORKS, I CHECKED, BUT FUNCTION NOT CALLING FOR SOME REASON
        yield {
            'email': response.xpath(EMAIL_SELECTOR).extract_first().replace('mailto:', '')
        }

Source

2017-03-13 Shantanu Shady

Would this [answer](http://stackoverflow.com/a/26196047/5699807) help? – Priyank

You are yielding a dict with a Request inside it. Scrapy won't dispatch that Request because it doesn't know it is there (Requests are not dispatched automatically after they are created). You need to yield the actual Request.
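
To make this point concrete, here is a toy model in plain Python of how a crawler engine treats what a callback yields. It is not Scrapy's real engine: `FakeRequest`, `toy_engine`, the item contents, and the example.com URL are all hypothetical names made up for this illustration. The assumption it encodes is that only objects yielded at the top level are inspected, so a request stored as a dict value is passed along as ordinary data.

```python
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request (illustration only)."""

    def __init__(self, url):
        self.url = url


def toy_engine(callback_output):
    """Sort yielded objects roughly the way a crawler engine would.

    Top-level FakeRequest objects are scheduled; everything else is
    treated as a finished item and left untouched.
    """
    scheduled, items = [], []
    for obj in callback_output:
        if isinstance(obj, FakeRequest):
            scheduled.append(obj.url)
        else:
            items.append(obj)
    return scheduled, items


def broken_parse():
    # The request is hidden inside the dict, so it is never scheduled.
    yield {'name': 'Example Diner', 'email': FakeRequest('https://example.com/a')}


def fixed_parse():
    # The request is yielded directly, so the engine can schedule it.
    yield FakeRequest('https://example.com/a')


broken_scheduled, broken_items = toy_engine(broken_parse())
fixed_scheduled, fixed_items = toy_engine(fixed_parse())
```

In the broken variant nothing is scheduled and the request object ends up as the value of 'email', which matches the `<GET ...>` output described in the question.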

For the parse_email function to "remember" which item each email belongs to, you need to pass the rest of the item data along with the request. You can do this with the meta argument.

Example:

In parse:

yield scrapy.Request(url, callback=self.parse_email, meta={'item': { 
    'name': brickset.css(NAME_SELECTOR).extract_first(), 
    'address': brickset.css(ADDRESS_SELECTOR).extract_first(), 
    'phone': brickset.css(PHONE).extract_first(), 
    'website': brickset.css(WEBSITE).extract_first(), 
}})

In parse_email:

item = response.meta['item'] # The item this email belongs to 
item['email'] = response.xpath(EMAIL_SELECTOR).extract_first().replace('mailto:', '') 
return item
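
One caveat with the snippet above: `extract_first()` returns `None` when the selector matches nothing, so calling `.replace()` on the result would raise an AttributeError for listings that have no email link. A minimal sketch of a guard follows, assuming a hypothetical helper named `attach_email` (it is not part of Scrapy) and made-up sample values:

```python
def attach_email(item, href):
    """Return a copy of item with an 'email' key derived from a mailto: href.

    href is whatever extract_first() returned: a string such as
    'mailto:someone@example.com', or None when no link was found.
    """
    result = dict(item)  # copy so the dict passed via meta is not mutated
    if href and href.startswith('mailto:'):
        result['email'] = href[len('mailto:'):]
    else:
        result['email'] = href  # None, or an href that is not a mailto: link
    return result


with_email = attach_email({'name': 'Example Diner'}, 'mailto:info@example.com')
without_email = attach_email({'name': 'Example Diner'}, None)
```

Inside parse_email this could be used as `return attach_email(response.meta['item'], response.xpath(EMAIL_SELECTOR).extract_first())`.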

Source

2017-03-13 17:28:16 lufte

AttributeError: 'dict' object has no attribute 'item' –

It is now giving me this error –

@ShantanuBedajna response.meta['item']? – alecxe
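
The error quoted in the comments above is what Python raises when a dict key is read with attribute syntax. A small illustration, using a plain dict as a stand-in for `response.meta` (the `meta` variable and its contents are made up for this example):

```python
meta = {'item': {'name': 'Example Diner'}}  # stand-in for response.meta

# Key lookup with square brackets works on a dict.
item = meta['item']

# Attribute-style access does not: dict keys are only reachable through [].
try:
    meta.item
    error_message = None
except AttributeError as exc:
    error_message = str(exc)
```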
