Scrapy - only the last result

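I have almost finished this Scrapy program, except for one last problem. The spider iterates over a list of multiple entries on a page, follows each entry's URL via its href, extracts some data by iterating over a list on that next page, and creates single items that combine data from the main page and the next page.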

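The problem is that when I open the CSV, I only see duplicates of the last entry of the second iterated list (one set for each entry of the first list).

Am I appending the items incorrectly, or am I misusing response.meta in some way? I tried to follow the response.meta documentation, but I can't understand why this isn't working.

Any help would be appreciated.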

import scrapy
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
from fspeople.items import FspeopleItem

class FSSpider(scrapy.Spider):
    name = "fspeople"
    allowed_domains = ["fs.fed.us"]
    start_urls = [
        "http://www.fs.fed.us/research/people/people_search_results.php?3employeename=&keywords=&station_id=SRS&state_id=ALL",
        #"http://www.fs.fed.us/research/people/people_search_results.php?employeename=&keywords=&station_id=RMRS&state_id=ALL",
        #"http://www.fs.fed.us/research/people/people_search_results.php?employeename=&keywords=&station_id=PSW&state_id=ALL",
        #"http://www.fs.fed.us/research/people/people_search_results.php?employeename=&keywords=&station_id=PNW&state_id=ALL",
        #"http://www.fs.fed.us/research/people/people_search_results.php?employeename=&keywords=&station_id=NRS&state_id=ALL",
        #"http://www.fs.fed.us/research/people/people_search_results.php?employeename=&keywords=&station_id=IITF&state_id=ALL",
        #"http://www.fs.fed.us/research/people/people_search_results.php?employeename=&keywords=&station_id=FPL&state_id=ALL",
        #"http://www.fs.fed.us/research/people/people_search_results.php?employeename=&keywords=&station_id=WO&state_id=ALL"
    ]

    def __init__(self):
        self.i = 0

    def parse(self, response):
        for sel in response.xpath("//a[@title='Click to view their profile ...']/@href"):
            item = FspeopleItem()
            url = response.urljoin(sel.extract())
            item['RStation'] = response.xpath("//table[@id='table_id']/tbody/tr/td[2]/i/b/text() | //table[@id='table_id']/tbody/td[2]/text()").extract_first().strip()
            request = scrapy.Request(url, callback=self.parse_post)
            request.meta['item'] = item
            yield request
        self.i += 1

    def parse_post(self, response):
        theitems = []
        pubs = response.xpath("//div/h2[text()='Featured Publications & Products']/following-sibling::ul[1]/li | //div/h2[text()='Publications']/following-sibling::ul[1]/li")
        for i in pubs:
            item = response.meta['item']
            name = response.xpath("//div[@id='maincol']/h1/text() | //nobr/text()").extract_first().strip()
            pubname = i.xpath("a/text()").extract_first().strip()
            pubauth = i.xpath("text()").extract_first().strip()
            pubURL = i.xpath("a/@href").extract_first().strip()
            #RStation = response.xpath("//div[@id='right-float']/div/div/ul/li/a/text()").extract_first().strip()

            item['link'] = response.url
            item['name'] = name
            item['pubname'] = pubname
            item['pubauth'] = pubauth
            item['pubURL'] = pubURL
            #item['RStation'] = RStation

            theitems.append(item)
        return theitems

Source

2016-04-23 Chris

You are overriding '__init__', but you are not calling super for scrapy.Spider.

You are iterating over the same item in your loop. Try item = response.meta.get('item').
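To illustrate the first comment, here is a minimal sketch of an overriding constructor that still runs the base class setup. BaseSpider is a hypothetical stand-in so the snippet runs without Scrapy installed; it is only loosely modeled on what a spider base class might do, and the station keyword is made up for the example. In the real spider the same pattern would apply to a class deriving from scrapy.Spider.

```python
class BaseSpider:
    """Hypothetical stand-in for a spider base class (not Scrapy's code)."""

    name = None

    def __init__(self, name=None, **kwargs):
        # The base constructor does setup work that subclasses rely on.
        if name is not None:
            self.name = name
        # Extra keyword arguments become instance attributes.
        self.__dict__.update(kwargs)
        self.base_initialized = True


class FSSpider(BaseSpider):
    name = "fspeople"

    def __init__(self, *args, **kwargs):
        # Run the base setup first, then add spider-specific state.
        super().__init__(*args, **kwargs)
        self.i = 0


spider = FSSpider(station="SRS")  # 'station' is an illustrative argument
print(spider.name, spider.i, spider.station, spider.base_initialized)
```

Accepting *args and **kwargs and forwarding them keeps the subclass compatible with whatever arguments the base constructor expects.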

Create a new instance of item on each iteration.

def parse_post(self, response):
    [...]
    for i in pubs:
        item = response.meta['item']
        item = item.copy()
        [...]
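Why the copy matters can be shown without Scrapy at all. The sketch below uses a plain dict as a stand-in for the item (an assumption made for illustration; the function names and sample publication titles are also made up). Appending the same object on every iteration leaves a list of references to one object, so every row ends up showing the values written last.

```python
def build_items_shared(base_item, pubs):
    """Buggy pattern: every iteration mutates and appends the same object."""
    results = []
    for pub in pubs:
        item = base_item          # no copy: still the one shared object
        item["pubname"] = pub
        results.append(item)
    return results


def build_items_copied(base_item, pubs):
    """Fixed pattern: each iteration works on its own copy."""
    results = []
    for pub in pubs:
        item = base_item.copy()   # fresh object per publication
        item["pubname"] = pub
        results.append(item)
    return results


pubs = ["Paper A", "Paper B", "Paper C"]
shared = build_items_shared({"RStation": "SRS"}, pubs)
copied = build_items_copied({"RStation": "SRS"}, pubs)

print([it["pubname"] for it in shared])  # ['Paper C', 'Paper C', 'Paper C']
print([it["pubname"] for it in copied])  # ['Paper A', 'Paper B', 'Paper C']
```

The shared version reproduces the symptom from the question: as many rows as there are publications, all holding the last publication's data.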

Source

2016-04-25 22:09:27
