normalize-space is not working

I am trying to extract the chapter titles and their subtitles from the web page at the URL in my spider. My CSV file currently comes out like this:

content_item,full_url,title 
" 

     ,Chapter 1, 



     , 


    , 

     ,Instructor Introduction, 

     ,00:01:00, 



    , 

    , 

     ,Course Overview,

I am using the normalize-space function in my XPath to remove the spaces, but I still get the result above in my CSV file. This is my spider:

import scrapy 
from ..items import ContentsPageSFBItem 

class BasicSpider(scrapy.Spider): 
    name = "contentspage_sfb" 
    #allowed_domains = ["web"] 
    start_urls = [ 
     'https://www.safaribooksonline.com/library/view/shell-programming-in/9780134496696/', 
    ] 

    def parse(self, response): 
      item = ContentsPageSFBItem() 
      item['content_item'] = response.xpath('normalize-space(//ol[@class="detail-toc"]//*/text())').extract(); 
      length = len(response.xpath('//ol[@class="detail-toc"]//*/text()').extract()); #extract() 
      full_url_list = list(); 
      title_list = list(); 
      for i in range(1,length+1): 
       full_url_list.append(response.url) 
      item["full_url"] = full_url_list 
      title = response.xpath('//title[1]/text()').extract(); 
      for j in range(1,length+1): 
       title_list.append(title) 
      item["title"] = title_list 
      return item

How do I get a result with at most one new line after each entry?

Source

2017-05-17 Echchama Nayak

What is your expected output? Do you want to scrape all the text inside 'Table of Contents'? The page in your code does not contain the 'Instructor Introduction' text or the other text from your CSV file. – vold

Yes, the Table of Contents. Thanks @vold. –

I recommend using [scrapy shell](https://doc.scrapy.org/en/latest/topics/shell.html). It is a very handy tool for testing and debugging spider code: you can test your XPath selectors and see what they return. For example, item["title"] is a list of lists that all contain the same string. Can you specify the output you expect from item["title"] and item["full_url"]? – vold

If you want to get all the text inside the Table of Contents section, you need to change your XPath expression in item['content_item'] to:

item['content_item'] = response.xpath('//ol[@class="detail-toc"]//a/text()').extract()

You can rewrite your spider code like this:

import scrapy 

class BasicSpider(scrapy.Spider): 

    name = "contentspage_sfb" 
    start_urls = [ 
     'https://www.safaribooksonline.com/library/view/shell-programming-in/9780134496696/', 
    ] 

    def parse(self, response): 
        for link in response.xpath('//ol[@class="detail-toc"]//a'): 
            # Build a fresh item per link so yielded items do not share state 
            item = dict()  # change dict to your scrapy item 
            item['link_text'] = link.xpath('text()').extract_first() 
            item['link_url'] = response.urljoin(link.xpath('@href').extract_first()) 
            yield item 

# Output: 
{'link_text': 'About This E-Book', 'link_url': 'https://www.safaribooksonline.com/library/view/shell-programming-in/9780134496696/pref00.html#pref00'} 
{'link_text': 'Title Page', 'link_url': 'https://www.safaribooksonline.com/library/view/shell-programming-in/9780134496696/title.html#title'}
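
On the whitespace that can still end up in the CSV: in XPath 1.0, normalize-space() converts a node-set argument to a string using only its first node, so wrapping the whole //ol[...]//*/text() expression in it cannot clean every entry. A minimal sketch of doing the cleanup in Python instead, on the list returned by extract(), might look like the following. The helper name clean_texts and the sample list are hypothetical, made up for illustration; they are not part of Scrapy or of the original spider.

```python
def clean_texts(raw_texts):
    """Collapse runs of whitespace in each string and drop empty results."""
    cleaned = []
    for text in raw_texts:
        # str.split() with no arguments splits on any whitespace run and
        # discards leading/trailing whitespace, roughly like normalize-space()
        collapsed = " ".join(text.split())
        if collapsed:
            cleaned.append(collapsed)
    return cleaned


# Hypothetical sample resembling the whitespace-heavy values in the CSV
sample = ["\n\n     ", "Chapter 1", "\n\n\n     ",
          "  Instructor   Introduction\n", "00:01:00", "\n    "]
print(clean_texts(sample))
# → ['Chapter 1', 'Instructor Introduction', '00:01:00']
```

In the spider, the list returned by extract() could be passed through a helper like this before it is stored on the item.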

Source

2017-05-17 18:22:57 vold

There are still spaces between the titles. –

That's strange. Have you tried testing my XPath expression in scrapy shell? This is the output when running response.xpath('//ol[@class="detail-toc"]//a/text()'): [output](http://icecream.me/24349dd3e159e2847f398c7a1ea0ea3a) – vold

It works. There is whitespace in my CSV, though. –
