2016-03-29 17 views
5

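Scraping Google Analytics with Scrapy

I have been trying to use Scrapy to get some data from Google Analytics, and although I'm a complete Python beginner I have made some progress. I can now log in to Google Analytics with Scrapy, but I need to make an AJAX request to get the data I want. I have tried to replicate my browser's HTTP request headers with the code below, but it doesn't seem to work. My error log says "too many values to unpack".

Could somebody help? I've been working on it for two days. I feel that I'm very close, but I'm also very confused.

Here is the code: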

from scrapy.spider import BaseSpider 
from scrapy.selector import HtmlXPathSelector 
from scrapy.http import FormRequest, Request 
from scrapy.selector import Selector 
import logging 
from super.items import SuperItem 
from scrapy.shell import inspect_response 
import json 

class LoginSpider(BaseSpider): 
    name = 'super' 
    start_urls = ['https://accounts.google.com/ServiceLogin?service=analytics&passive=true&nui=1&hl=fr&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&followup=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr#identifier'] 

    def parse(self, response):
        return [FormRequest.from_response(response,
                                          formdata={'Email': 'Email'},
                                          callback=self.log_password)]

    def log_password(self, response):
        return [FormRequest.from_response(response,
                                          formdata={'Passwd': 'Password'},
                                          callback=self.after_login)]

    def after_login(self, response):
        if "authentication failed" in response.body:
            self.log("Login failed", level=logging.ERROR)
            return
        else:
            # We've successfully authenticated, let's have some fun!
            print("Login Successful!!")
            return Request(url="https://analytics.google.com/analytics/web/getPage?id=trafficsources-all-traffic&ds=a5425w87291514p94531107&hl=fr&authuser=0",
                           method='POST',
                           headers=[{'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
                                     'Galaxy-Ajax': 'true',
                                     'Origin': 'https://analytics.google.com',
                                     'Referer': 'https://analytics.google.com/analytics/web/?hl=fr&pli=1',
                                     'User-Agent': 'My-user-agent',
                                     'X-GAFE4-XSRF-TOKEN': 'Mytoken'}],
                           callback=self.parse_tastypage, dont_filter=True)

    def parse_tastypage(self, response):
        response = json.loads(jsonResponse)

        inspect_response(response, self)
        yield item

And here is part of the log:

2016-03-28 19:11:39 [scrapy] INFO: Enabled item pipelines: 
[] 
2016-03-28 19:11:39 [scrapy] INFO: Spider opened 
2016-03-28 19:11:39 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2016-03-28 19:11:39 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2016-03-28 19:11:40 [scrapy] DEBUG: Crawled (200) <GET https://accounts.google.com/ServiceLogin?service=analytics&passive=true&nui=1&hl=fr&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&followup=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr#identifier> (referer: None) 
2016-03-28 19:11:46 [scrapy] DEBUG: Crawled (200) <POST https://accounts.google.com/AccountLoginInfo> (referer: https://accounts.google.com/ServiceLogin?service=analytics&passive=true&nui=1&hl=fr&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&followup=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr) 
2016-03-28 19:11:50 [scrapy] DEBUG: Redirecting (302) to <GET https://accounts.google.com/CheckCookie?hl=fr&checkedDomains=youtube&pstMsg=0&chtml=LoginDoneHtml&service=analytics&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&gidl=CAA> from <POST https://accounts.google.com/ServiceLoginAuth> 
2016-03-28 19:11:57 [scrapy] DEBUG: Redirecting (302) to <GET https://www.google.com/analytics/web/?hl=fr> from <GET https://accounts.google.com/CheckCookie?hl=fr&checkedDomains=youtube&pstMsg=0&chtml=LoginDoneHtml&service=analytics&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&gidl=CAA> 
2016-03-28 19:12:01 [scrapy] DEBUG: Crawled (200) <GET https://www.google.com/analytics/web/?hl=fr> (referer: https://accounts.google.com/AccountLoginInfo) 
Login Successful!! 
2016-03-28 19:12:01 [scrapy] ERROR: Spider error processing <GET https://www.google.com/analytics/web/?hl=fr> (referer: https://accounts.google.com/AccountLoginInfo) 
Traceback (most recent call last): 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 577, in _runCallbacks 
    current.result = callback(current.result, *args, **kw) 
    File "/Users/aminbouraiss/super/super/spiders/mySuper.py", line 42, in after_login 
    callback=self.parse_tastypage, dont_filter=True) 
    File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/http/request/__init__.py", line 35, in __init__ 
    self.headers = Headers(headers or {}, encoding=encoding) 
    File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/http/headers.py", line 12, in __init__ 
    super(Headers, self).__init__(seq) 
    File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/datatypes.py", line 193, in __init__ 
    self.update(seq) 
    File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/datatypes.py", line 229, in update 
    super(CaselessDict, self).update(iseq) 
    File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/datatypes.py", line 228, in <genexpr> 
    iseq = ((self.normkey(k), self.normvalue(v)) for k, v in seq) 
ValueError: too many values to unpack 
2016-03-28 19:12:01 [scrapy] INFO: Closing spider (finished) 
2016-03-28 19:12:01 [scrapy] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 6419, 
'downloader/request_count': 5, 
'downloader/request_method_count/GET': 3, 
'downloader/request_method_count/POST': 2, 
'downloader/response_bytes': 75986, 
'downloader/response_count': 5, 
'downloader/response_status_count/200': 3, 
'downloader/response_status_count/302': 2, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2016, 3, 28, 23, 12, 1, 824033), 
'log_count/DEBUG': 6, 
No, I will go through the API! I changed the HTTP request headers and it finally worked.

Answer

2

Your error occurs because headers needs to be a dict, not a list with a dict inside it:

headers={'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
         'Galaxy-Ajax': 'true',
         'Origin': 'https://analytics.google.com',
         'Referer': 'https://analytics.google.com/analytics/web/?hl=fr&pli=1',
         'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36',
         },
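
To see why the list form fails, here is a minimal sketch in plain Python. The `normalize_headers` function is a hypothetical, simplified stand-in for the `for k, v in seq` step shown in the traceback; it is not Scrapy's actual code. A dict is converted to (key, value) pairs first, while a list is assumed to already contain pairs, so a list holding one dict makes Python try to unpack that whole dict into just two names.

```python
def normalize_headers(seq):
    """Simplified, hypothetical stand-in for a header container's update step."""
    # A mapping is turned into (key, value) pairs; anything else is assumed
    # to already be an iterable of such pairs.
    if isinstance(seq, dict):
        seq = seq.items()
    return {key.title(): value for key, value in seq}


headers_as_dict = {
    'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
    'Galaxy-Ajax': 'true',
    'Origin': 'https://analytics.google.com',
}
# Wrapping the dict in a list reproduces the mistake from the question.
headers_as_list = [headers_as_dict]

# The plain dict is processed pair by pair without trouble.
ok = normalize_headers(headers_as_dict)

# The list yields the dict itself as a single element; unpacking its three
# keys into (key, value) raises ValueError.
try:
    normalize_headers(headers_as_list)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(ok)
print(error_message)
```

Note that a wrapped dict with exactly two keys would not raise at all: its two keys would be silently taken as one key and one value, which is arguably worse than the error.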

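Passing a plain dict fixes the unpacking error, but you will then get a 411, because you also need to specify the content length. If you add what you want to pull, I will be able to show you how. You can see the output below: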

2016-03-29 14:02:11 [scrapy] DEBUG: Redirecting (302) to <GET https://www.google.com/analytics/web/?hl=fr> from <GET https://accounts.google.com/CheckCookie?hl=fr&checkedDomains=youtube&pstMsg=0&chtml=LoginDoneHtml&service=analytics&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&gidl=CAA> 
2016-03-29 14:02:13 [scrapy] DEBUG: Crawled (200) <GET https://www.google.com/analytics/web/?hl=fr> (referer: https://accounts.google.com/AccountLoginInfo) 
Login Successful!! 
2016-03-29 14:02:14 [scrapy] DEBUG: Crawled (411) <POST https://analytics.google.com/analytics/web/getPage?id=trafficsources-all-traffic&ds=a5425w87291514p94531107&hl=fr&authuser=0> (referer: https://analytics.google.com/analytics/web/?hl=fr&pli=1) 
2016-03-29 14:02:14 [scrapy] DEBUG: Ignoring response <411 https://analytics.google.com/analytics/web/getPage?id=trafficsources-all-traffic&ds=a5425w87291514p94531107&hl=fr&authuser=0>: HTTP status code is not handled or not allowed 
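
HTTP 411 means "Length Required": the server refuses a request that does not declare a `Content-Length`. The sketch below shows one way to prepare a form-encoded POST body and a matching header using only the standard library. The form fields are placeholders, since the thread does not say which parameters the `getPage` endpoint expects, and `build_post` is a hypothetical helper name. Many HTTP clients fill in `Content-Length` from the body on their own; it is set explicitly here only to make the idea visible.

```python
from urllib.parse import urlencode


def build_post(form_fields, base_headers):
    """Return (body, headers) for a form-encoded POST with Content-Length set."""
    # Encode the fields as application/x-www-form-urlencoded bytes.
    body = urlencode(form_fields).encode('utf-8')
    # Copy the headers so the caller's dict is left untouched.
    headers = dict(base_headers)
    # Content-Length counts bytes of the encoded body, not characters.
    headers['Content-Length'] = str(len(body))
    return body, headers


# Placeholder fields: the real parameters are not given in the question.
example_fields = {'a': '1', 'b': 'x y'}
example_headers = {
    'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
    'Galaxy-Ajax': 'true',
}

body, headers = build_post(example_fields, example_headers)
print(body)
print(headers['Content-Length'])
```

The resulting `body` and `headers` could then be passed to `Request` through its `body=` and `headers=` arguments, assuming the endpoint accepts a form-encoded payload.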
Thanks Padraic, I owe you a beer. I am trying to get some data that I can't get through the API.

@gerardbaste, no prob, glad you got it sorted, happy analyzing.