How do I get the first request URL after a 302 redirect?

I am crawling the web with Scrapy (ver. 1.1.1). With the code below, how can I get the first request URL after a 302 redirect?

import codecs

import scrapy


class Link_Spider(scrapy.Spider):
    name = 'GetLink'
    allowed_domains = ['example_0.com']
    # Read the start URLs from a file, one URL per line
    with codecs.open('link.txt', 'r', 'utf-8') as f:
        start_urls = [url.strip() for url in f.readlines()]

    def parse(self, response):
        print(response.url)

This is what I am facing. `start_urls` is a list:

start_urls = [
    'http://example_0.com/?id=0',
    'http://example_0.com/?id=1',
    'http://example_0.com/?id=2',
]  # and so on

When I run Scrapy, the debug output shows:

[scrapy] DEBUG: Redirecting (302) to (GET https://example_1.com/?subid=poison_apple) from (GET http://example_0.com/?id=0) 
[scrapy] DEBUG: Redirecting (301) to (GET https://example_1/ture_a.html) from (GET https://example_1.com/?subid=poison_apple) 
[scrapy] DEBUG: Crawled (200) (GET https://example_1/ture_a.html) (referer: None)

How can I make sure that each `http://example_0.com/?id=***` URL from `start_urls` is paired with the final URL it ends up at, such as `https://example_1/ture_a.html`? Can anyone help me?

Source

2016-12-03 xie

If you want to control every request yourself, without being redirected automatically (a redirect is an extra request), you can disable RedirectMiddleware or simply pass the `dont_redirect` meta parameter with the request. In this case:

import codecs

import scrapy
from scrapy import Request


class Link_Spider(scrapy.Spider):
    name = 'GetLink'
    allowed_domains = ['example_0.com']

    # you'll have to control the initial requests with `start_requests`
    # instead of declaring start_urls

    def start_requests(self):
        with codecs.open('link.txt', 'r', 'utf-8') as f:
            start_urls = [url.strip() for url in f.readlines()]
        for start_url in start_urls:
            yield Request(
                start_url,
                callback=self.parse_handle1,
                meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302]},
            )

    def parse_handle1(self, response):
        # here you'll have to handle the redirect yourself
        # remember that the redirected url is in the header: `Location`
        # do something with the response.body, response.headers, etc.
        ...
        yield Request(
            # header values are bytes; urljoin also resolves a relative Location
            response.urljoin(response.headers['Location'].decode('utf-8')),
            callback=self.parse_handle2,
            meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302]},
            # the target is outside allowed_domains, so skip the offsite filter
            dont_filter=True,
        )

    def parse_handle2(self, response):
        # here you'll have to handle the second redirect yourself
        # do something with the response.body, response.headers, etc.
        ...
        yield Request(
            response.urljoin(response.headers['Location'].decode('utf-8')),
            callback=self.parse,
            dont_filter=True,
        )

    def parse(self, response):
        # actual last response
        print(response.url)
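
When the redirects are handled by hand like this, the first URL still has to be carried along explicitly if it should end up paired with the final one. A minimal sketch of one way to do that, using only the standard library so it can be tried without a crawl: `follow_meta`, `next_url` and the `original_url` meta key are hypothetical names chosen for this example, not part of Scrapy.

```python
from urllib.parse import urljoin


def follow_meta(current_url, meta):
    """Build the meta dict for the next hop, remembering the first URL."""
    return {
        'dont_redirect': True,
        'handle_httpstatus_list': [301, 302],
        # 'original_url' is a hypothetical key, not something Scrapy sets.
        # It is copied from the previous hop, or started at the current URL.
        'original_url': meta.get('original_url', current_url),
    }


def next_url(current_url, location):
    """Resolve a Location header value (bytes or str) against the current URL."""
    if isinstance(location, bytes):
        location = location.decode('utf-8')
    return urljoin(current_url, location)


# Inside a callback such as parse_handle1 this could be used as:
#     yield Request(
#         next_url(response.url, response.headers['Location']),
#         callback=self.parse_handle2,
#         meta=follow_meta(response.url, response.meta),
#         dont_filter=True,
#     )
# The last callback can then read response.meta['original_url'].
```

Because every hop copies `original_url` forward, the value seen in the last callback is the URL that came from `link.txt`, however many redirects were followed in between.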

Source

2016-12-04 01:05:08 eLRuLL

To expand on the answer: every response has the request that produced it attached, so you can get the original URL from it:

def parse(self, response): 
    print('original url:') 
    print(response.request.url)

Source

2016-12-03 22:33:22 Granitosaurus

I tried that, but `print response.request.url` did not work; it just prints "https://example_1/ture_a.html". The response belongs to the last debug line, "Crawled (200)", not to the first one, "Redirecting (302)". – xie
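
If the automatic redirects are left enabled, Scrapy's RedirectMiddleware records the URLs it redirected away from in `meta['redirect_urls']`, in order, so the first entry should be the URL that was originally requested. A minimal sketch under that assumption: `first_requested_url` is a hypothetical helper name, and `fake_response` is only a stand-in object built from the debug output in the question so the logic can be checked without running a crawl.

```python
from types import SimpleNamespace


def first_requested_url(response):
    """Return the URL requested before any redirect was followed."""
    # RedirectMiddleware stores the redirected-from URLs, oldest first, under
    # meta['redirect_urls']; the key is absent when nothing was redirected.
    redirect_urls = response.meta.get('redirect_urls')
    return redirect_urls[0] if redirect_urls else response.url


# Stand-in for a Scrapy response, mirroring the debug output in the question.
fake_response = SimpleNamespace(
    url='https://example_1/ture_a.html',
    meta={'redirect_urls': [
        'http://example_0.com/?id=0',
        'https://example_1.com/?subid=poison_apple',
    ]},
)
print(first_requested_url(fake_response), '->', fake_response.url)
```

In the spider from the question, calling `first_requested_url(response)` inside `parse` would give the `start_urls` entry to pair with `response.url`, without having to switch the redirects off.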
