Scrapy CSS selector returns a broken JSON string


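Hey, I'm new to Python, and to Scrapy in particular. I'm trying to scrape Walmart, but I'm stuck on one problem: the Scrapy selector I run on the response returns a broken JSON string.

I'm using this regex to get the JSON string: __WML_REDUX_INITIAL_STATE__ =*(.*\});\}; However, it sometimes gives me a broken JSON string that json.loads fails on, for example for this walmart product. Is this a problem with the regex or with Scrapy? I don't understand why this is happening.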

Source

2017-07-13 Afraz Ahmad

Please post a sample of the JSON, and show the desired output and the actual output.

Here is a sample JSON: [json file](https://gist.github.com/afrazahmad21/3f76f9010cc847319dcfd8ef4396151a)

The desired output is valid JSON.

Scrapy/Parsel's Selector.re() and .re_first() have the (unfortunate) default behavior of replacing HTML character entity references. This can make JSON decoding fail for the data you are getting.
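The effect can be reproduced without Scrapy. The following is a minimal sketch on a made-up fragment modelled on the product description; it uses the standard library's html.unescape as a stand-in for the selector's entity replacement, which is an assumption for illustration rather than the routine parsel actually calls.

```python
import html
import json

# Hypothetical fragment: inside the page's <script> the inch mark
# is written as the entity &quot; rather than as a bare quote.
raw = '{"values": ["<li>21&quot; Inseam</li>"]}'

# The raw text is valid JSON: &quot; is just six ordinary characters.
parsed = json.loads(raw)

# Replacing entities turns &quot; into a bare double quote, which
# ends the JSON string early and leaves stray text behind it.
replaced = html.unescape(raw)

try:
    json.loads(replaced)
    decode_failed = False
    error_message = ""
except json.JSONDecodeError as exc:
    decode_failed = True
    error_message = exc.msg
```

The raw string decodes, while the entity-replaced one fails with the same kind of delimiter error as in the traceback for the real page.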

Let me illustrate with your sample URL in the scrapy shell:

$ scrapy shell https://www.walmart.com/ip/Riders-by-Lee-Women-s-On-the-Go-Performance-Capri/145227527 -s USER_AGENT='mozilla' 
2017-07-13 15:24:30 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot) 
(..) 
2017-07-13 15:24:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.walmart.com/ip/Riders-by-Lee-Women-s-On-the-Go-Performance-Capri/145227527> (referer: None) 
>>> data = response.xpath('//script/text()').re_first('__WML_REDUX_INITIAL_STATE__ =*(.*\});\};') 
>>> data[:25], data[-25:] 
(' {"uuid":null,"isMobile":', 'nabled":true,"seller":{}}')

However, decoding this string as JSON fails, and a double quote is what causes the trouble:

>>> import json 
>>> json.loads(data) 
Traceback (most recent call last): 
    File "<console>", line 1, in <module> 
    File "/usr/local/lib/python3.6/json/__init__.py", line 354, in loads 
    return _default_decoder.decode(s) 
    File "/usr/local/lib/python3.6/json/decoder.py", line 339, in decode 
    obj, end = self.raw_decode(s, idx=_w(s, 0).end()) 
    File "/usr/local/lib/python3.6/json/decoder.py", line 355, in raw_decode 
    obj, end = self.scan_once(s, idx) 
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 40598 (char 40597) 
>>> data[40500:40650] 
'{"values":["<br /> <b>Riders by Lee Women\'s On the Go Performance Capri</b> <br /> <ul> <li>21" Inseam</li> <li>Rib knit waist with button and zippe'
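Slicing the string by hand around the reported character works, but the position is also available on the exception itself. As a small sketch (the helper name show_decode_error is hypothetical, not part of Scrapy or the json module), the lookup can be automated like this:

```python
import json

def show_decode_error(text, width=40):
    """Return (message, surrounding text) where decoding fails, or None if it succeeds."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        # exc.pos is the index in `text` at which the decoder gave up.
        start = max(exc.pos - width, 0)
        return exc.msg, text[start:exc.pos + width]
    return None

# Made-up input with an unescaped quote inside a string value.
result = show_decode_error('{"a": ["21" Inseam"]}', width=10)
```

In the shell session above, show_decode_error(data) would return the message together with the text around char 40597.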

Your regex does its job: it selects the data you want.

You can use the replace_entities=False argument to avoid replacing entities:

>>> dataraw = response.xpath('//script/text()').re_first('__WML_REDUX_INITIAL_STATE__ =*(.*\});\};', replace_entities=False) 
>>> dataraw[40500:40650] 
'{"values":["<br /> <b>Riders by Lee Women\'s On the Go Performance Capri</b> <br /> <ul> <li>21&quot; Inseam</li> <li>Rib knit waist with button and '

See how the &quot; is left as-is.

And now you can JSON-decode the string:

>>> d = json.loads(dataraw) 
>>> d.keys() 
dict_keys(['uuid', 'isMobile', 'isBot', 'isAdsEnabled', 'isEsiEnabled', 'isInitialStateDeferred', 'isServiceWorkerEnabled', 'isShellRequest', 'productId', 'product', 'showTrustModal', 'productBasicInfo', 'fulfillmentOptions', 'feedback', 'backLink', 'offersOrder', 'sellersHeading', 'fdaCompliance', 'recommendationMap', 'header', 'footer', 'addToRegistry', 'addToList', 'ads', 'btvMap', 'postQuestion', 'autoPartFinder', 'getPromoStatus', 'discoveryModule', 'lastAction', 'isAjaxCall', 'accessModeEnabled', 'seller']) 
>>>

replace_entities was introduced in parsel v1.2.0 (see https://github.com/scrapy/parsel/pull/88).
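If upgrading is not an option, one possible workaround is to skip the selector and run the same regex on the raw page text with the standard re module, which does not touch entities. This is only a sketch: the function name extract_initial_state and the sample markup are made up for illustration, and in a spider the argument would be response.text.

```python
import html
import json
import re

# Same pattern as in the question, written as a raw string.
STATE_RE = re.compile(r'__WML_REDUX_INITIAL_STATE__ =*(.*\});\};')

def extract_initial_state(page_text):
    """Decode the embedded JSON from raw page text, or return None if absent."""
    match = STATE_RE.search(page_text)
    if match is None:
        return None
    return json.loads(match.group(1))

# Made-up page fragment for illustration only.
sample = ('<script>window.__WML_REDUX_INITIAL_STATE__ = '
          '{"product":{"desc":"21&quot; Inseam"}};};</script>')

state = extract_initial_state(sample)

# Entities can be replaced safely once the JSON has been decoded,
# one value at a time.
desc = html.unescape(state["product"]["desc"])
```

Decoding first and unescaping individual values afterwards keeps the quotes from ever reaching the JSON parser.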

Source

2017-07-13 13:36:52

Which version of Scrapy are you using? I get TypeError: re_first() got an unexpected keyword argument 'replace_entities'

'TypeError: re_first() got an unexpected keyword argument 'replace_entities''

Upgrade parsel to 1.2 ('pip install --upgrade parsel'). parsel is a dependency of Scrapy.
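To check up front whether the installed parsel is new enough, something along these lines may help. It is a sketch that assumes Python 3.8+ (for importlib.metadata); the helper name has_replace_entities is hypothetical.

```python
import re
from importlib.metadata import PackageNotFoundError, version

def has_replace_entities(parsel_version):
    """True if a version string such as '1.2.0' is at least 1.2."""
    match = re.match(r"(\d+)\.(\d+)", parsel_version)
    if match is None:
        return False
    return (int(match.group(1)), int(match.group(2))) >= (1, 2)

try:
    installed = version("parsel")  # version string of the installed package
except PackageNotFoundError:
    installed = None  # parsel is not installed in this environment

supported = installed is not None and has_replace_entities(installed)
```

When supported is False, upgrading parsel as suggested above is the simpler fix.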
