2016-04-13

I'm building a scraping application that needs to extract the complete URL whenever a substring of that URL matches.

Suppose a page contains the following URLs that I'm interested in:

  • /public/flag?cat=Computers/Programming/Languages/Python/Books&url=http://www.pearsonhighered.com/educator/academic/product/0,,0130260363,00%2Ben-USS_01DBC.html
  • /public/flag?cat=Computers/Programming/Languages/Python/Books&url=http://www.brpreiss.com/books/opus7/html/book.html
  • /public/flag?cat=Computers/Programming/Languages/Python/Books&url=http://www.diveintopython.net/
  • /public/flag?cat=Computers/Programming/Languages/Python/Books&url=http://rhodesmill.org/brandon/2011/foundations-of-python-network-programming/
  • [18 more]

However, my search string returns only the matched part of each URL, not the complete URL:

flag?cat=Computers/Programming/Languages/Python/Books

How can I get the complete URLs shown above?

Here is a simple Scrapy test case based on the example:

import scrapy
from scrapy.spiders import Spider


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    ]

    def parse(self, response):
        # Uncomment to inspect the response interactively:
        # scrapy.shell.inspect_response(response, self)
        # Raw string so the backslash in \? reaches the regex engine unchanged.
        results = response.xpath('//body').re(r'(flag\?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks)')
        print(results)

Output:

[ 
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks', 
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks', 
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks', 
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks', 
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks', 
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks', 
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks', 
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks', 
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks', 
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks', 
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks', 
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks', 
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks', 
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks', 
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks', 
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks', 
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks', 
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks', 
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks', 
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks', 
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks', 
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks' 
] 

Expected output:

[ 
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.pearsonhighered.com%2Feducator%2Facademic%2Fproduct%2F0%2C%2C0130260363%2C00%252Ben-USS_01DBC.html"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing' 
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.brpreiss.com%2Fbooks%2Fopus7%2Fhtml%2Fbook.html"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing' 
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.diveintopython.net%2F"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing' 
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Frhodesmill.org%2Fbrandon%2F2011%2Ffoundations-of-python-network-programming%2F"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing' 
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.techbooksforfree.com%2Fperlpython.shtml"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing' 
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.freetechbooks.com%2Fpython-f6.html"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing' 
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fgreenteapress.com%2Fthinkpython%2F"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing' 
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.network-theory.co.uk%2Fpython%2Fintro%2F"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing' 
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.freenetpages.co.uk%2Fhp%2Falan.gauld%2F"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing' 
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.wiley.com%2FWileyCDA%2FWileyTitle%2FproductCd-0471219754.html"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing' 
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fhetland.org%2Fwriting%2Fpractical-python%2F"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing' 
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fsysadminpy.com%2F"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing' 
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.qtrac.eu%2Fpy3book.html"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing' 
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.wiley.com%2FWileyCDA%2FWileyTitle%2FproductCd-0764548077.html"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing' 
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=https%3A%2F%2Fwww.packtpub.com%2Fpython-3-object-oriented-programming%2Fbook"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing' 
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.network-theory.co.uk%2Fpython%2Flanguage%2F"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing' 
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.pearsonhighered.com%2Feducator%2Facademic%2Fproduct%2F0%2C%2C0130409561%2C00%252Ben-USS_01DBC.html"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing' 
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.informit.com%2Fstore%2Fproduct.aspx%3Fisbn%3D0201616165%26redir%3D1"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing' 
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.pearsonhighered.com%2Feducator%2Facademic%2Fproduct%2F0%2C%2C0201748843%2C00%252Ben-USS_01DBC.html"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing' 
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.informit.com%2Fstore%2Fproduct.aspx%3Fisbn%3D0672317354"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing' 
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fgnosis.cx%2FTPiP%2F"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing' 
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.informit.com%2Fstore%2Fproduct.aspx%3Fisbn%3D0130211192"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing' 
] 

Answer

The problem is that .re() returns only the part of the text that matches the expression. If you want to keep using a regular-expression check, use the re:test() function inside the XPath instead, so that the whole href attribute is selected:

response.xpath(r'//body//a/@href[re:test(., "flag\?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks")]').extract()

On my end this produces the following:

[
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&url=http%3A%2F%2Fwww.pearsonhighered.com%2Feducator%2Facademic%2Fproduct%2F0%2C%2C0130260363%2C00%252Ben-USS_01DBC.html',
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&url=http%3A%2F%2Fwww.brpreiss.com%2Fbooks%2Fopus7%2Fhtml%2Fbook.html',
    ...
]
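
As a rough illustration of why the original pattern returns only a fragment, here is a minimal sketch that uses only the Python 3 standard library rather than Scrapy. It is an alternative to the XPath approach, not the answer's method: it widens the capturing group to the whole href value and then decodes the embedded url= parameter. SAMPLE_HTML, NEEDLE, and the three helper functions are hypothetical names made up for this example, and the markup is an assumed, simplified stand-in for the real page.

```python
import re
from html import unescape
from urllib.parse import parse_qs, urlsplit

# Hypothetical, simplified stand-in for the page source (assumption).
SAMPLE_HTML = (
    '<a href="/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks'
    '&amp;url=http%3A%2F%2Fwww.diveintopython.net%2F">'
    '<img src="/img/flag.png" alt="[!]"></a>'
    '<a href="/about">About</a>'
)

# The substring from the question, in its percent-encoded form.
NEEDLE = "flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks"


def matched_part_only(html, needle=NEEDLE):
    """Mimic the original pattern: the group covers only the substring."""
    return re.findall("(" + re.escape(needle) + ")", html)


def full_hrefs(html, needle=NEEDLE):
    """Widen the group to the whole href value that contains the substring."""
    pattern = r'href="([^"]*' + re.escape(needle) + r'[^"]*)"'
    # unescape() turns the HTML entity &amp; back into a plain &.
    return [unescape(href) for href in re.findall(pattern, html)]


def target_url(href):
    """Return the decoded value of the url= query parameter, or None."""
    values = parse_qs(urlsplit(href).query).get("url")
    return values[0] if values else None


if __name__ == "__main__":
    print(matched_part_only(SAMPLE_HTML))
    for href in full_hrefs(SAMPLE_HTML):
        print(href, "->", target_url(href))
```

Running a regular expression over raw HTML is fragile (it assumes double-quoted attributes, for instance), so selecting @href through XPath with re:test() as shown above is likely the more robust choice inside a spider. The target_url() helper can still be applied to whatever hrefs that selector returns.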