Scrapy/Python: processing the values in a yield

I am trying to write a crawler with Scrapy/Python that reads values from a page.

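Next, I want this crawler to store the highest and lowest values in separate fields.

So far I can read the values from the page (see the code below), but I don't know how to calculate the lowest and highest values and store them in separate fields.

For example, say the crawler reads a page and returns these values: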

burvale score = 75.25
richmond score = 85.04
somano score = '' (value missing)
tucson score = 90.67
cloud score = 50.00

So I want to populate:

'highestscore': 90.67
'lowestscore': 50.00

How do I do that? Do I need to use an array, i.e. put all the values into an array and then pick the highest/lowest?
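As a rough illustration of the "put everything in a list" idea: the sketch below assumes the scores arrive as strings and that an empty string marks a missing value. The `raw_scores` dictionary is made up from the example values above and is not part of the spider.

```python
# Hypothetical raw values, copied from the example; '' marks a missing score.
raw_scores = {
    "burvale": "75.25",
    "richmond": "85.04",
    "somano": "",
    "tucson": "90.67",
    "cloud": "50.00",
}

# Keep only the scores that are actually present, converted to float.
present = [float(value) for value in raw_scores.values() if value.strip()]

# max() and min() pick the extremes directly, so no sorting is needed.
highestscore = max(present)
lowestscore = min(present)
```

Note that `max()` and `min()` raise `ValueError` on an empty list, so a page with no scores at all would need separate handling.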

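Also, note that there are 2 yields in my code. The bottom yield supplies the URLs to crawl, and the first yield is the one that actually crawls/collects the values from each URL supplied by the bottom yield.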
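To make the two-yield pattern concrete, here is a simplified model in plain Python, without Scrapy. Everything in it is an assumption for illustration: `FAKE_PAGES`, the `("follow", url)` tuples standing in for `Request` objects, and the `crawl` driver, which only loosely imitates what the framework does with the results of a callback.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (scores on the page, links on the page).
FAKE_PAGES = {
    "/all-courses-listing": ([], ["/course-a", "/course-b"]),
    "/course-a": ([75.25, 85.04], []),
    "/course-b": ([90.67, 50.00], []),
}


def parse(url):
    scores, links = FAKE_PAGES[url]
    # First yield: the data collected from this page.
    yield {"url": url, "scores": scores}
    # Second yield: follow-up URLs, standing in for Request objects.
    for link in links:
        yield ("follow", link)


def crawl(start_url):
    """Toy driver: stores yielded dicts as items and queues yielded links."""
    items, queue, seen = [], deque([start_url]), {start_url}
    while queue:
        for result in parse(queue.popleft()):
            if isinstance(result, dict):
                items.append(result)
            else:
                _, link = result
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
    return items


items = crawl("/all-courses-listing")
```

The point for this question is that each dict is complete at the moment it is yielded, so any derived fields such as the highest and lowest score have to be computed before that `yield`.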

Thank you for your help. If possible, please provide a code example.

Here is my code so far. I am storing -1 when a value is missing.

from scrapy import Request, Selector, Spider


class MySpider(Spider):
    name = "courses"
    start_urls = ['http://www.example.com/all-courses-listing']
    allowed_domains = ["example.com"]

    def parse(self, response):
        hxs = Selector(response)
        # First yield: the values collected from the current page.
        for courses in response.xpath("//meta"):
            yield {
                'pagetype': courses.xpath('//meta[@name="pagetype"]/@content').extract_first(),
                'pagefeatured': courses.xpath('//meta[@name="pagefeatured"]/@content').extract_first(),
                'pagedate': courses.xpath('//meta[@name="pagedate"]/@content').extract_first(),
                'pagebanner': courses.xpath('//meta[@name="pagebanner"]/@content').extract_first(),
                'pagetitle': courses.xpath('//meta[@name="pagetitle"]/@content').extract_first(),
                'pageurl': courses.xpath('//meta[@name="pageurl"]/@content').extract_first(),
                'pagedescription': courses.xpath('//meta[@name="pagedescription"]/@content').extract_first(),
                'pageid': courses.xpath('//meta[@name="pageid"]/@content').extract_first(),

                # Scores: an empty or missing value is stored as -1.
                'courseatarburvale': float(courses.xpath('//meta[@name="courseatar-burvale"]/@content').extract_first('').strip() or -1),
                'courseatarrichmond': float(courses.xpath('//meta[@name="courseatar-richmond"]/@content').extract_first('').strip() or -1),
                'courseatarsomano': float(courses.xpath('//meta[@name="courseatar-somano"]/@content').extract_first('').strip() or -1),
                'courseatartucson': float(courses.xpath('//meta[@name="courseatar-tucson"]/@content').extract_first('').strip() or -1),
                'courseatarcloud': float(courses.xpath('//meta[@name="courseatar-cloud"]/@content').extract_first('').strip() or -1),
                # 'highestscore': ??????
                # 'lowestscore': ??????
            }
        # Second yield: the URLs to crawl next.
        for url in hxs.xpath('//ul[@class="scrapy"]/li/a/@href').extract():
            yield Request(response.urljoin(url), callback=self.parse)

Source

2017-07-28 Slyper

I would probably break up this part of the code:

yield { 
    'pagetype': courses.xpath('//meta[@name="pagetype"]/@content').extract_first(), 
    'pagefeatured': courses.xpath('//meta[@name="pagefeatured"]/@content').extract_first(), 
    'pagedate': courses.xpath('//meta[@name="pagedate"]/@content').extract_first(), 
    'pagebanner': courses.xpath('//meta[@name="pagebanner"]/@content').extract_first(), 
    'pagetitle': courses.xpath('//meta[@name="pagetitle"]/@content').extract_first(), 
    'pageurl': courses.xpath('//meta[@name="pageurl"]/@content').extract_first(), 
    'pagedescription': courses.xpath('//meta[@name="pagedescription"]/@content').extract_first(), 
    'pageid': courses.xpath('//meta[@name="pageid"]/@content').extract_first(), 

    'courseatarburvale': float(courses.xpath('//meta[@name="courseatar-burvale"]/@content').extract_first('').strip() or -1), 
    'courseatarrichmond': float(courses.xpath('//meta[@name="courseatar-richmond"]/@content').extract_first('').strip() or -1), 
    'courseatarsomano': float(courses.xpath('//meta[@name="courseatar-somano"]/@content').extract_first('').strip() or -1), 
    'courseatartucson': float(courses.xpath('//meta[@name="courseatar-tucson"]/@content').extract_first('').strip() or -1), 
    'courseatarcloud': float(courses.xpath('//meta[@name="courseatar-cloud"]/@content').extract_first('').strip() or -1), 
    # 'highestscore': ??????
    # 'lowestscore': ??????
}

into this:

item = { 
    'pagetype': courses.xpath('//meta[@name="pagetype"]/@content').extract_first(), 
    'pagefeatured': courses.xpath('//meta[@name="pagefeatured"]/@content').extract_first(), 
    'pagedate': courses.xpath('//meta[@name="pagedate"]/@content').extract_first(), 
    'pagebanner': courses.xpath('//meta[@name="pagebanner"]/@content').extract_first(), 
    'pagetitle': courses.xpath('//meta[@name="pagetitle"]/@content').extract_first(), 
    'pageurl': courses.xpath('//meta[@name="pageurl"]/@content').extract_first(), 
    'pagedescription': courses.xpath('//meta[@name="pagedescription"]/@content').extract_first(), 
    'pageid': courses.xpath('//meta[@name="pageid"]/@content').extract_first(), 
} 

scores = { 
    'courseatarburvale': float(courses.xpath('//meta[@name="courseatar-burvale"]/@content').extract_first('').strip() or -1), 
    'courseatarrichmond': float(courses.xpath('//meta[@name="courseatar-richmond"]/@content').extract_first('').strip() or -1), 
    'courseatarsomano': float(courses.xpath('//meta[@name="courseatar-somano"]/@content').extract_first('').strip() or -1), 
    'courseatartucson': float(courses.xpath('//meta[@name="courseatar-tucson"]/@content').extract_first('').strip() or -1), 
    'courseatarcloud': float(courses.xpath('//meta[@name="courseatar-cloud"]/@content').extract_first('').strip() or -1), 
} 

values = sorted(x for x in scores.values() if x > 0) 
scores.update({ 
    'highestscore': values[-1], 
    'lowestscore': values[0], 
}) 

item.update(scores) 
yield item

Source

2017-07-28 06:01:06

Thanks @Tomáš Linhart, I am working on a similar approach. I will reply shortly. – Slyper

Hi @Tomáš Linhart, when I tried your suggestion I got this error: _IndexError: list index out of range_. Any ideas? It is raised on this line: 'highestatar': values[-1], – Slyper

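@Slyper Probably the 'values' list is empty. That happens when no scores are extracted from the page, i.e. every score is just -1. So change the highest-score assignment to 'highestscore': values[-1] if values else ..., and do the same for the lowest score. –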
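The guard described in that last comment could be sketched as follows. The helper name `with_extremes`, the `MISSING` constant, and the choice of -1 as the fallback are assumptions for illustration (the -1 simply reuses the question's convention for a missing value). Unlike the answer's code, this version uses `max()`/`min()` with a `default` instead of sorting and indexing, and it filters on the sentinel rather than on `x > 0`.

```python
MISSING = -1.0  # assumed sentinel, matching the question's "-1 when there is no value"


def with_extremes(scores, missing=MISSING):
    """Return a copy of scores with 'highestscore' and 'lowestscore' added.

    Scores equal to the sentinel are ignored. If nothing is left, both
    fields fall back to the sentinel instead of raising an error.
    """
    values = [value for value in scores.values() if value != missing]
    result = dict(scores)
    result["highestscore"] = max(values, default=missing)
    result["lowestscore"] = min(values, default=missing)
    return result


# Usage with the example values from the question (somano is missing).
example = with_extremes({
    "courseatarburvale": 75.25,
    "courseatarrichmond": 85.04,
    "courseatarsomano": -1.0,
    "courseatartucson": 90.67,
    "courseatarcloud": 50.00,
})

# A page with no scores at all no longer raises IndexError.
empty = with_extremes({"courseatarburvale": -1.0, "courseatarcloud": -1.0})
```

In the spider, the result of such a helper could be merged into the item with `item.update(...)` before the `yield`, in the same way the answer merges `scores`.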
