In Scrapy, I use XPath to select HTML and get many unwanted "" strings?

I am parsing http://so.gushiwen.org/view_20788.aspx with Scrapy and ran into a problem. This is the output I want:

"detail_text": [" 
    寥落古行宫，宫花寂寞红。白头宫女在，闲坐说玄宗。 
"],

but this is what I get:

"detail_text": [" 
    ", " 
    ", " 
    ", " 
    ", " 
    寥落古行宫，宫花寂寞红。", "白头宫女在，闲坐说玄宗。 
"],

This is my code:

#spider
import scrapy
from scrapy.selector import Selector

# Tangshi3Item is the project's own Item class, defined in its items module.


class Tangshi3Spide(scrapy.Spider):
    name = "tangshi3"
    allowed_domains = ["gushiwen.org"]
    start_urls = [
        "http://so.gushiwen.org/view_20788.aspx"
    ]

    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider finish its own setup before adding state.
        super(Tangshi3Spide, self).__init__(*args, **kwargs)
        self.items = []

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="main3"]/div[@class="shileft"]')
        domain = 'http://so.gushiwen.org'
        for site in sites:
            item = Tangshi3Item()
            item['detail_title'] = site.xpath('div[@class="son1"]/h1/text()').extract()
            item['detail_dynasty'] = site.xpath(
                u'div[@class="son2"]/p/span[contains(text(),"朝代：")]/parent::p/text()').extract()
            item['detail_translate_note_url'] = site.xpath('div[@id="fanyiShort676"]/p/a/u/parent::a/@href').extract()
            item['detail_appreciation_url'] = site.xpath('div[@id="shangxiShort787"]/p/a/u/parent::a/@href').extract()
            item['detail_background_url'] = site.xpath('div[@id="shangxiShort24492"]/p/a/u/parent::a/@href').extract()
            # question line: text() also matches the whitespace-only text nodes
            # that sit between the child elements of div.son2
            item['detail_text'] = site.xpath('div[@class="son2"]/text()').extract()
            self.items.append(item)
        return self.items



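#pipeline
import codecs
import json


class Tangshi3Pipeline(object):
    def __init__(self):
        self.file = codecs.open('tangshi_detail.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False writes the Chinese text as-is instead of \uXXXX
        # escapes, so no unicode_escape decoding step is needed.
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Close the output file when the crawl finishes.
        self.file.close()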

How can I get the right text?

Source

2016-09-07 ZahiZhou

You can add the predicate [normalize-space()] to avoid picking up empty text nodes, i.e. text nodes that contain only whitespace.

item['detail_text'] = site.xpath('div[@class="son2"]/text()[normalize-space()]').extract()
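
Note that the predicate only filters out whitespace-only nodes; the nodes it keeps still carry their leading and trailing whitespace, and the poem still comes back as two strings. If a single clean string is wanted, as in the question, one option is to tidy the extracted list in Python afterwards. The sketch below is only an illustration of that post-processing step, not part of the answer's XPath: the helper name clean_text_nodes is hypothetical, and the sample list is copied from the output shown in the question.

```python
# -*- coding: utf-8 -*-


def clean_text_nodes(nodes):
    """Strip every extracted text node and drop the ones left empty.

    Hypothetical helper for illustration; it is not part of Scrapy.
    """
    stripped = (node.strip() for node in nodes)
    return [text for text in stripped if text]


# Sample data copied from the unwanted output shown in the question.
raw_nodes = [
    " \n    ", " \n    ", " \n    ", " \n    ",
    " \n    寥落古行宫，宫花寂寞红。", "白头宫女在，闲坐说玄宗。 \n",
]

lines = clean_text_nodes(raw_nodes)  # one entry per non-empty text node
detail_text = "".join(lines)         # the whole poem as a single string

print(lines)
print(detail_text)
```

In the spider, the same helper could be applied to the result of .extract() before the value is stored on the item.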

Source

2016-09-07 02:41:13 har07
