How to scrape anonymously using Scrapy, Tor, Privoxy & UserAgent? (Windows 10)

This question was quite hard to answer because the information is scattered and question titles are sometimes misleading. The answer below gathers all the necessary information in one place.

2017-12-21 J. Does

Your spider should look like this.

# based on https://doc.scrapy.org/en/latest/intro/tutorial.html

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            print('\n\nurl:', url)
            ## use one of the yields below

            # the middleware will process the request
            yield scrapy.Request(url=url, callback=self.parse)

            # check whether Tor has changed the IP
            #yield scrapy.Request('http://icanhazip.com/', callback=self.is_tor_and_privoxy_used)

    def parse(self, response):
        # save the page, then print the proxy and User-Agent actually used
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        print('\n\nSpider: Start')
        print('Is proxy in response.meta?: ', response.meta)
        print("user_agent is: ", response.request.headers['User-Agent'])
        print('\n\nSpider: End')
        self.log('Saved file --- %s' % filename)

    def is_tor_and_privoxy_used(self, response):
        # icanhazip.com returns the caller's public IP as the response body
        print('\n\nSpider: Start')
        print("My IP is : " + str(response.body))
        print("Is proxy in response.meta?: ", response.meta)  # headers not available here
        print('\n\nSpider: End')
        # no file is saved in this callback, so log the checked URL instead
        self.log('Checked IP via %s' % response.url)

You also need to add the matching code to middleware.py and settings.py. If you are not sure how to do that, this will help you
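For orientation, here is a minimal sketch of what those two additions might look like. It is not the code from the linked answer, only an illustration under a few assumptions: Privoxy is listening on its default address (127.0.0.1:8118) and is already set up to forward traffic to Tor, and the class name `ProxyAndUserAgentMiddleware`, the `USER_AGENTS` list and the project name `myproject` are all hypothetical placeholders.

```python
import random

# Assumption: Privoxy runs locally on its default port and forwards to Tor.
PRIVOXY_URL = "http://127.0.0.1:8118"

# Hypothetical sample list; replace with the User-Agent strings you want to rotate.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36",
]


class ProxyAndUserAgentMiddleware:
    """Downloader middleware sketch (hypothetical name), to live in the middleware module."""

    def process_request(self, request, spider):
        # Scrapy takes the proxy for a request from request.meta['proxy'].
        request.meta["proxy"] = PRIVOXY_URL
        # Overwrite whatever User-Agent was set earlier with a random one.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Returning None tells Scrapy to keep processing this request.
        return None


# settings.py fragment. "myproject.middlewares" is a placeholder module path;
# 543 places this middleware after the built-in UserAgentMiddleware (500)
# and before the built-in HttpProxyMiddleware (750).
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.ProxyAndUserAgentMiddleware": 543,
}
```

With something like this in place, the `print` calls in `parse` above should show a `proxy` key in `response.meta` and one of the listed User-Agent strings.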

2017-12-21 16:15:19
