Scrapy - XPath for the next page

I'm having real trouble getting the XPath for a site's "next page" URL.

The HTML looks like this:

<div class="pagingcont"> 

     <div class="right margintop" id="save_search_header_popup" style="width:550px;"> 
      <div class="left marginleft" style="padding-top:1px;"> 
       <div class="left save_search_env"><img src="/themes/LW1/refresh/images/envelope_icon.gif" alt="Save" />&nbsp;</div> 
       <div class="left"> 
        Save this search and receive email alerts of new listings 
        &nbsp;<input type="text" maxlength="100" value="Name this search" onfocus="doSavedSearchFocus(this,'Name this search');" style="width:120px;height:14px;color:Gray;"/>&nbsp; 
       </div> 
      </div> 
      <div class="left save_search_btn" style="margin-right:10px;"><img class="pointer" src="/themes/LW1/refresh/images/btn_save.gif" alt="Save" onclick="showPopup(document.getElementById('save_search_header_popup'), null, 'In order to be notified of new or updated properties, you need to be registered and signed in.');return false;"/></div> 
     </div> 
     <div class="left margintop marginleft" style="cursor:pointer;height:27px;" onclick="javascript:docompare(true);"> 
      <div class="left"><img src="//www.landwatch.com/themes/LW1/images/comparebtn_btm.gif" style="margin-bottom:0px;">&nbsp;&nbsp;</div> 
      <div class="left active" style="margin-top:4px;">COMPARE</div> 
     </div> 
     <div class="clear topline"></div> 

    <div class="clear margin"> 
     <b>Page &nbsp;</b> 
     &nbsp;<span class="active" style="padding:3px 3px 3px 4px;border:solid 1px black;">1&nbsp;</span>&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=2">2</a>&nbsp;|&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=3">3</a>&nbsp;|&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=4">4</a>&nbsp;|&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=5">5</a>&nbsp;|&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=6">6</a>&nbsp;|&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=7">7</a>&nbsp;|&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=8">8</a>&nbsp;|&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=9">9</a>&nbsp;|&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=10">10</a>&nbsp;|&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=11">11</a>&nbsp;|&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=12">12</a>&nbsp;|&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=13">13</a>&nbsp;| <a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=2">Next</a> 
    </div>

(The href I'm looking for is hard to spot here; it is at the far bottom right...)

In Scrapy, I tried the following:

next_page_url = response.xpath("//div[@class='pagingcont']//span//a[text()='Next']/href") 
    next_page_url = response.urljoin(next_page_url) 

    for href in response.css('div.propName a::attr(href)'): 
     url = response.urljoin(href.extract()) 
     yield scrapy.Request(url, callback=self.parse_product_page) 
    yield scrapy.Request(next_page_url, callback=self.parse)

But every time, Scrapy returns the results for the first page and nothing after that, so I don't think it is actually finding the next page. What is wrong with next_page_url?

Source

2017-12-29 JMP

Your XPath has two problems:

First, it looks for a <span> that is not in your data.

Second, href is an attribute, not a node, so the last step needs to be @href.

A complete working example is below.

from scrapy import Request
from scrapy.spiders import Spider


class LandSpider(Spider):
    name = 'myspider'
    start_urls = [
        'https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2C&pg=1']

    def parse(self, response):
        # @href selects the attribute; there is no <span> around the links.
        next_page_url = response.xpath(
            "//div[@class='pagingcont']//a[text()='Next']/@href").extract_first()

        # Follow every listing on the current results page.
        for href in response.css('div.propName a::attr(href)'):
            url = response.urljoin(href.extract())
            yield Request(url, callback=self.parse_product_page)

        # extract_first() returns None when a page has no "Next" link.
        if next_page_url:
            yield Request(response.urljoin(next_page_url), callback=self.parse)

    def parse_product_page(self, response):
        # Callbacks should produce items (a dict here), not a bare string.
        title = response.xpath("//div[@class='detTitle']/text()").extract_first()
        yield {'title': title}

Result:

[ 
{"title": "Lulaton, Brantley County, Coast, GA Land For Sale - 936 Acres"}, 
{"title": "Oglethorpe County, GA Land For Sale - 515 Acres"}, 
{"title": "Dawsonville, Lumpkin County, GA Land For Sale - 525 Acres"}, 
{"title": "Wheeler County, GA Land For Sale - 594 Acres"}, 
{"title": "Cedartown, Polk County, GA Land For Sale - 1185.65 Acres"}, 
... 
]
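The pagination step of the example can also be isolated into a small helper that is easy to test without a network. This is only a sketch: resolve_next_page is a hypothetical name, the example.com URLs are made up, and it uses urllib.parse.urljoin from the standard library to resolve the extracted href against the current page URL.

```python
from typing import Optional
from urllib.parse import urljoin


def resolve_next_page(page_url: str, href: Optional[str]) -> Optional[str]:
    """Return the absolute next-page URL, or None when no href was found."""
    if not href:
        # No "Next" link was extracted, so there is nothing to follow.
        return None
    # Relative hrefs are resolved against the page they were found on;
    # absolute hrefs are returned unchanged.
    return urljoin(page_url, href)


current = "https://example.com/default.aspx?pg=1"  # hypothetical URL
relative = resolve_next_page(current, "/default.aspx?pg=2")
absolute = resolve_next_page(current, "https://example.com/default.aspx?pg=3")
missing = resolve_next_page(current, None)

print(relative)
print(absolute)
print(missing)
```

In a spider, the returned value would be checked for None before a new Request is yielded.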

Source

2017-12-30 00:37:01 jschnurr

That did it. Thank you very much, jschnurr. – JMP

In the HTML example shown, no span is an ancestor of the a tag, so //span//a will not match anything. Your XPath should probably be:

"//div[@class='pagingcont']//a[text()='Next']/@href"

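That should work fine.

You are also not getting the value into your Python code: your first next_page_url variable (the first line of the code you shared) is actually a Selector, not a string, so the value has to be pulled out with .extract_first():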

next_page_url = response.xpath("//div[@class='pagingcont']//a[text()='Next']/@href").extract_first()

Source

2017-12-29 22:11:41 eLRuLL

Thank you for the response. Unfortunately, it still only goes through the first page and does not move on to the next one. – JMP
