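Why does XPath inside a selector loop still return a list in the Scrapy tutorial?

I am learning Scrapy from the tutorial. I found that even though the code is already looping over the selector list, the title I get from sel.xpath('a/text()').extract() is still a list containing a single string: [u'Python 3 Object Oriented Programming'] rather than u'Python 3 Object Oriented Programming'. In a later example, the list is assigned to an item field as item['title'] = sel.xpath('a/text()').extract(), which does not seem logically correct to me. By contrast, if I use the following code: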

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/",
    ]

    def parse(self, response):
        for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
            link = href.extract()
            print(link)

link is a string rather than a list of strings. The tutorial code that gives me lists is:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            title = sel.xpath('a/text()').extract()
            link = sel.xpath('a/@href').extract()
            desc = sel.xpath('text()').extract()
            print title, link, desc

Is this a bug?

Asked 2016-02-26 by entron

.xpath().extract() and .css().extract() return a list because .xpath() and .css() return SelectorList objects.

See the documentation for SelectorList.extract() at https://parsel.readthedocs.org/en/v1.0.1/usage.html#parsel.selector.SelectorList.extract:

Call the .extract() method for each element in this list and return their results flattened, as a list of unicode strings.
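To make the mechanism concrete, here is a minimal pure-Python sketch of the behavior described above. It is not Scrapy or parsel code: ToySelector, ToySelectorList, and select_children are hypothetical names made up for illustration, and the toy query ignores any expression and simply returns the stored children.

```python
class ToySelector:
    """Hypothetical stand-in for one Selector (a single matched node)."""

    def __init__(self, data, children=()):
        self.data = data
        self.children = list(children)

    def select_children(self):
        # Stands in for .xpath()/.css(): a query always returns a list type,
        # even when it happens to match exactly one node.
        return ToySelectorList(self.children)

    def extract(self):
        # One selector, one string.
        return self.data


class ToySelectorList(list):
    """Hypothetical stand-in for SelectorList (a list of selectors)."""

    def extract(self):
        # Call extract() on each element and return the results as a list.
        return [sel.extract() for sel in self]


title = ToySelector("Python 3 Object Oriented Programming")
li = ToySelector("<li>...</li>", children=[title])
rows = ToySelectorList([li])

for sel in rows:
    # Querying inside the loop produces a new list, so extract() gives a list.
    from_query = sel.select_children().extract()
    # Extracting the loop variable itself gives a single string.
    from_item = sel.extract()
    print(from_query, from_item)
```

In this model, the loop variable is a single selector, but any further query on it starts a fresh list, which is why the one-element list shows up in the question's XPath loop and not in the CSS loop.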

.extract_first() is what you are looking for (it is poorly documented).

Taken from http://doc.scrapy.org/en/latest/topics/selectors.html:

If you want to extract only the first matched element, you can call the selector .extract_first()

>>> response.xpath('//div[@id="images"]/a/text()').extract_first() 
u'Name: My image 1 '
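As a rough mental model of what .extract_first() does with a list of matches, the sketch below takes the first string when there is one and falls back to a default otherwise. The helper name first_or_default is hypothetical and is not part of Scrapy or parsel; the lists stand in for the results of .extract().

```python
def first_or_default(extracted, default=None):
    """Return the first string of an extracted list, or `default` if it is empty."""
    return extracted[0] if extracted else default


title = ["Python 3 Object Oriented Programming"]  # one match
missing = []                                      # no match

print(first_or_default(title))
print(first_or_default(missing))
print(first_or_default(missing, default=""))
```

Indexing with [0] directly would raise IndexError when nothing matched, which is the case the fallback is meant to cover.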

In your other example:

def parse(self, response):
    for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
        link = href.extract()
        print(link)

each href in the loop is a Selector object. Calling .extract() on it gives you back a single Unicode string. Here is what .css() returns on the response:

$ scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/" 
2016-02-26 12:11:36 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot) 
(...) 
In [1]: response.css("ul.directory.dir-col > li > a::attr('href')") 
Out[1]: 
[<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>, 
<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>, 
... 
<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>]

.css() on response returns a SelectorList:

In [2]: type(response.css("ul.directory.dir-col > li > a::attr('href')")) 
Out[2]: scrapy.selector.unified.SelectorList

Looping over that object gives you Selector instances:

In [5]: for href in response.css("ul.directory.dir-col > li > a::attr('href')"): 
    ...:  print href 
    ...:  
<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'> 
<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'> 
(...) 
<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>

And calling .extract() on each of them gives you a single Unicode string:

In [6]: for href in response.css("ul.directory.dir-col > li > a::attr('href')"): 
    ...:  print type(href.extract()) 
    ...:  
<type 'unicode'> 
<type 'unicode'> 
<type 'unicode'> 
<type 'unicode'> 
<type 'unicode'> 
<type 'unicode'> 
<type 'unicode'> 
<type 'unicode'> 
<type 'unicode'> 
<type 'unicode'> 
<type 'unicode'> 
<type 'unicode'> 
<type 'unicode'>

Note: .extract() on Selector is wrongly documented as returning a list of strings. I'll open an issue on parsel (which is the same as Scrapy selectors and is used under the hood in Scrapy 1.1+).

Answered 2016-02-26 10:44:05

Thank you! I edited the post and added an example from the tutorial where extract() gives a string. Is this because I am using CSS? – entron

What I wrote was not correct (I answered too quickly). In fact, .xpath().extract() and .css().extract() return a list because .xpath() and .css() return SelectorList objects. But when you loop over .xpath(), you get Selector objects, and calling .extract() on one of those gives you a single element. I'll fix my answer. –

This part is really confusing, but now I understand! Thank you very much! – entron
