Scrapy ItemLoader: combine items

I'm trying to use ItemLoader to combine three fields into an array of items like the one below. As you can see in the JSON output further down, it instead groups all the values of each field together.

[
    {
        "site_title": "Some Site Title",
        "anchor_text": "Click Here",
        "link": "http://example.com/page"
    }
]


How can I get it to output JSON as an array of items like the one I'm looking for?

Spider file:

import scrapy
from linkfinder.items import LinkfinderItem
from scrapy.loader import ItemLoader


class LinksSpider(scrapy.Spider):
    name = "links"
    allowed_domains = ["wpseotest.com"]
    start_urls = ["https://wpseotest.com"]

    def parse(self, response):
        # One loader for the whole page: each XPath below runs against the
        # full response, so every field collects all of its matches.
        l = ItemLoader(item=LinkfinderItem(), response=response)
        l.add_xpath('site_title', '//title/text()')
        l.add_xpath('anchor_text', '//a//text()')
        l.add_xpath('link', '//a/@href')
        return l.load_item()

items.py:

import scrapy
from scrapy import Field


class LinkfinderItem(scrapy.Item):
    # The three fields populated by the spider.
    site_title = Field()
    anchor_text = Field()
    link = Field()

JSON output:

[ 
{"anchor_text": ["Globex Corporation", "Skip to content", "Home", "About", "Globex News", "Events", "Contact Us", "3999 Mission Boulevard,\r", "San Diego, CA 92109", "This is a test scheduled\u00a0post.", "Test Title", "Globex Subsidiary Ice Cream Inc. Creates Chicken Wing\u00a0Flavor", "Globex Inc.", "\r\n", "Blog at WordPress.com."], "link": ["https://wpseotest.com/", "#content", "https://wpseotest.com/", "https://wpseotest.com/about/", "https://wpseotest.com/globex-news/", "https://wpseotest.com/events/", "https://wpseotest.com/contact-us/", "http://maps.google.com/maps?z=16&q=3999+mission+boulevard,+san+diego,+ca+92109", "https://wpseotest.com/2016/08/19/this-is-a-test-scheduled-post/", "https://wpseotest.com/2016/06/28/test-title/", "https://wpseotest.com/2015/10/18/globex-subsidiary-ice-cream-inc-creates-chicken-wing-flavor/", "https://wpseotest.wordpress.com", "https://wordpress.com/?ref=footer_blog"], "site_title": ["Globex Corporation \u2013 We make things better, or, sometimes, worse."]} 
]

Source

2016-10-30 Christopher Smith

I wasn't able to get your example to work. I'm not sure whether there is something on my end that keeps it from working properly.

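Do you want to get one item for every link here?
To get what you want, find the repeating nodes (here, each link node), iterate over them, and extract the fields relative to each node, then combine those fields into a dictionary or scrapy.Item.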

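def parse(self, response):
    # The title is the same for every link, so extract it once.
    site_title = response.xpath("//title/text()").extract_first()
    for link in response.xpath("//a"):
        # One loader per <a> node. Passing the item class matters: Scrapy's
        # default Item declares no fields and would reject these names.
        l = ItemLoader(item=LinkfinderItem(), selector=link)
        l.add_value('site_title', site_title)
        l.add_xpath('anchor_text', 'text()')
        l.add_xpath('link', '@href')
        yield l.load_item()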

Now you can run scrapy crawl myspider -o output.json and you should get something like:

[
    {"site_title": "title",
    "anchor_text": "foo",
    "link": "http://foo.com"},
    {"site_title": "title",
    "anchor_text": "bar",
    "link": "http://bar.com"}
    ...
]
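To see why extracting per node matters, here is a small stand-alone sketch. It deliberately uses Python's standard-library xml.etree.ElementTree instead of Scrapy so it runs without a crawl, and the markup in PAGE is invented for illustration. Page-wide extraction produces two independent lists whose lengths need not match (as with the 15 anchor texts and 13 links in the question's output), while per-node extraction keeps each anchor text paired with its own href.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed markup standing in for a crawled page.
PAGE = """
<html>
  <head><title>Some Site Title</title></head>
  <body>
    <a href="http://example.com/page">Click <b>Here</b></a>
    <a href="#content">Skip to content</a>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())
site_title = root.findtext("./head/title")

# Page-wide extraction: every text fragment and every href, collected separately.
all_texts = [t.strip() for a in root.iter("a") for t in a.itertext() if t.strip()]
all_links = [a.get("href") for a in root.iter("a")]

# Per-node extraction: one record per <a>, so text and href stay together.
records = [
    {
        "site_title": site_title,
        # Join the text fragments inside this link and normalise whitespace.
        "anchor_text": " ".join("".join(a.itertext()).split()),
        "link": a.get("href"),
    }
    for a in root.iter("a")
]

print(len(all_texts), len(all_links))
for record in records:
    print(record)
```

The first link contains a nested element, so it contributes two text fragments to the page-wide list; that is enough to make the two lists impossible to zip back together reliably.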

I ended up using a pipeline.

Source

2016-10-31 12:50:12 Granitosaurus

You can use pipelines to create your desired output.

In case you want to see whether something on my side, rather than your suggested code, was causing the problem, I've uploaded it to a repo (the version that uses a pipeline): https://github.com/chrisfromthelc/scrapy-linkfinder
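For readers wondering what a pipeline for this could look like, below is a minimal sketch. It is not the code from the linked repo; the class name FlattenFieldsPipeline and its behaviour are assumptions made for illustration. It relies only on the fact that an item pipeline is a plain class with a process_item(self, item, spider) method, and that ItemLoader stores every field as a list unless an output processor says otherwise. The pipeline joins each list into a single stripped string, which gives the scalar values shown in the desired output.

```python
class FlattenFieldsPipeline:
    """Hypothetical pipeline: turn list-valued fields into plain strings."""

    def process_item(self, item, spider):
        # Works on anything dict-like, including scrapy.Item instances.
        for key in list(item.keys()):
            value = item[key]
            if isinstance(value, (list, tuple)):
                # Drop empty fragments such as "\r\n" and join the rest.
                parts = [str(v).strip() for v in value]
                item[key] = " ".join(p for p in parts if p)
        return item


# Quick check with a plain dict standing in for a loaded item.
example = {
    "site_title": ["title"],
    "anchor_text": [" foo\r", "bar", "\r\n"],
    "link": ["http://foo.com"],
}
cleaned = FlattenFieldsPipeline().process_item(example, spider=None)
print(cleaned)
```

To enable a pipeline like this, it would be listed under the ITEM_PIPELINES setting in the project's settings.py, for example with the assumed path linkfinder.pipelines.FlattenFieldsPipeline mapped to an order number such as 300.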
