2016-09-05 10 views

ElasticSearch bulk update: preparing JSON with a Python script

I am trying to index some public Amazon datasets (products) into ElasticSearch.

I have a very large JSON file containing the data (9.9 GB). I split it into a number of smaller files (for memory reasons), and each file now has the following structure:

{"asin": "0001048791", "salesRank": {"Books": 6334800}, "imUrl": "http://ecx.images-amazon.com/images/I/51MKP0T4DBL.jpg", "categories": [["Books"]], "title": "The Crucible: Performed by Stuart Pankin, Jerome Dempsey & Cast"} 
{"asin": "0000143561", "categories": [["Movies & TV", "Movies"]], "description": "3Pack DVD set - Italian Classics, Parties and Holidays.", "title": "Everyday Italian (with Giada de Laurentiis), Volume 1 (3 Pack): Italian Classics, Parties, Holidays", "price": 12.99, "salesRank": {"Movies & TV": 376041}, "imUrl": "http://g-ecx.images-amazon.com/images/G/01/x-site/icons/no-img-sm._CB192198896_.gif", "related": {"also_viewed": ["B0036FO6SI", "B000KL8ODE", "000014357X", "B0037718RC", "B002I5GNVU", "B000RBU4BM"], "buy_after_viewing": ["B0036FO6SI", "B000KL8ODE", "000014357X", "B0037718RC"]}} 
{"asin": "0000037214", "related": {"also_viewed": ["B00JO8II76", "B00DGN4R1Q", "B00E1YRI4C"]}, "title": "Purple Sequin Tiny Dancer Tutu Ballet Dance Fairy Princess Costume Accessory", "price": 6.99, "salesRank": {"Clothing": 1233557}, "imUrl": "http://ecx.images-amazon.com/images/I/31mCncNuAZL.jpg", "brand": "Big Dreams", "categories": [["Clothing, Shoes & Jewelry", "Girls"], ["Clothing, Shoes & Jewelry", "Novelty, Costumes & More", "Costumes & Accessories", "More Accessories", "Kids & Baby"]]} 

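Each product is a JSON object, and there is one object per line.

Now I want to use the ElasticSearch _bulk API to add all of this data to an index.

Since each document in an ElasticSearch bulk request needs a header line (correct?), I wrote a Python script that creates a new file in the proper format.

The shell script looks like this: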

#!/bin/sh 

# 0. Some constants to re-define to match your environment 
ES_HOST=localhost:9200 
JSON_FILE_IN=/home/aksarora/amazon-sample/parts/newaa 
JSON_FILE_OUT=/home/aksarora/amazon-sample/parts_parsed/newaa.json 

# 1. Python code to transform your JSON file 
PYTHON="import json,sys; 
out = open('$JSON_FILE_OUT', 'w'); 
with open('$JSON_FILE_IN', 'r') as json_in: 
    docs = [json.loads(line) for line in json_in] 
    for doc in docs: 
     out.write('%s\n' % json.dumps({\"index\": {}})); 
     out.write('%s\n' % json.dumps(doc, indent=0).replace('\n', '')); 
" 

# 2. run the Python script from step 1 
python3 -c "$PYTHON" 

# 3. use the output file from step 2 in the curl command 
curl -s -XPOST $ES_HOST/amazon/products/_bulk --data-binary @$JSON_FILE_OUT 

However, when I run it, I get the following error:

Traceback (most recent call last): 
    File "<string>", line 4, in <module> 
    File "<string>", line 4, in <listcomp> 
    File "/usr/lib/python3.5/json/__init__.py", line 319, in loads 
    return _default_decoder.decode(s) 
    File "/usr/lib/python3.5/json/decoder.py", line 339, in decode 
    obj, end = self.raw_decode(s, idx=_w(s, 0).end()) 
    File "/usr/lib/python3.5/json/decoder.py", line 355, in raw_decode 
    obj, end = self.scan_once(s, idx) 
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 42 (char 41) 
{"error":{"root_cause":[{"type":"parse_exception","reason":"Failed to derive xcontent"}],"type":"parse_exception","reason":"Failed to derive xcontent"},"status":400} 

Any idea what I am doing wrong? Thanks.

I think Python is complaining that you passed invalid JSON to 'json.loads'. – Tempux

As a first step, I would recommend throwing away the shell script and doing this entirely in Python. There is no reason to involve two different script interpreters here. – Tomalak
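Following that suggestion, here is a minimal sketch of doing the conversion entirely in Python. It reads the input one line at a time rather than building a list of every document first, and it reports the input line number when a line cannot be parsed. The names `to_bulk_lines` and `convert_file` are made up for this illustration, and the code assumes the input really is one JSON object per line, as described in the question.

```python
import json


def to_bulk_lines(lines, action=None):
    """Yield bulk-format lines: an action line, then the document source.

    `lines` is any iterable of strings (for example an open file).
    `action` defaults to {"index": {}}, the header used in the question.
    """
    action_line = json.dumps(action if action is not None else {"index": {}})
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            doc = json.loads(line)
        except ValueError as exc:  # json.JSONDecodeError is a ValueError
            raise ValueError(
                "input line %d is not valid JSON: %s" % (lineno, exc)
            )
        yield action_line
        # json.dumps without indent always produces a single line
        yield json.dumps(doc)


def convert_file(path_in, path_out):
    """Stream path_in to path_out without holding all documents in memory."""
    with open(path_in, "r", encoding="utf-8") as src, \
            open(path_out, "w", encoding="utf-8") as dst:
        for bulk_line in to_bulk_lines(src):
            dst.write(bulk_line + "\n")  # every bulk line ends with a newline
```

Calling `convert_file` with the two paths from the shell script would produce the same kind of output file, which could then be posted with the existing curl command. Because `json.dumps` is called without `indent`, the `indent=0` plus `replace('\n', '')` step from the original script is not needed.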

Answer

I tried to reproduce the problem, but I do not get the same error when I run the following:

import json 
import sys 

json_in = (
    """{"asin": "0001048791", "salesRank": {"Books": 6334800}, "imUrl": "http://ecx.images-amazon.com/images/I/51MKP0T4DBL.jpg", "categories": [["Books"]], "title": "The Crucible: Performed by Stuart Pankin, Jerome Dempsey &amp; Cast"}""", 
    """{"asin": "0000143561", "categories": [["Movies & TV", "Movies"]], "description": "3Pack DVD set - Italian Classics, Parties and Holidays.", "title": "Everyday Italian (with Giada de Laurentiis), Volume 1 (3 Pack): Italian Classics, Parties, Holidays", "price": 12.99, "salesRank": {"Movies & TV": 376041}, "imUrl": "http://g-ecx.images-amazon.com/images/G/01/x-site/icons/no-img-sm._CB192198896_.gif", "related": {"also_viewed": ["B0036FO6SI", "B000KL8ODE", "000014357X", "B0037718RC", "B002I5GNVU", "B000RBU4BM"], "buy_after_viewing": ["B0036FO6SI", "B000KL8ODE", "000014357X", "B0037718RC"]}}""", 
    """{"asin": "0000037214", "related": {"also_viewed": ["B00JO8II76", "B00DGN4R1Q", "B00E1YRI4C"]}, "title": "Purple Sequin Tiny Dancer Tutu Ballet Dance Fairy Princess Costume Accessory", "price": 6.99, "salesRank": {"Clothing": 1233557}, "imUrl": "http://ecx.images-amazon.com/images/I/31mCncNuAZL.jpg", "brand": "Big Dreams", "categories": [["Clothing, Shoes & Jewelry", "Girls"], ["Clothing, Shoes & Jewelry", "Novelty, Costumes & More", "Costumes & Accessories", "More Accessories", "Kids & Baby"]]}""", 
) 

out = sys.stdout # send output to screen 
docs = [json.loads(line) for line in json_in] # assume one object per line 
for doc in docs: 
    out.write('%s\n' % json.dumps({"index": {}})) 
    out.write('%s\n' % json.dumps(doc)) 

Output:

{"index": {}} 
{"asin": "0001048791", "salesRank": {"Books": 6334800}, "imUrl": "http://ecx.images-amazon.com/images/I/51MKP0T4DBL.jpg", "categories": [["Books"]], "title": "The Crucible: Performed by Stuart Pankin, Jerome Dempsey &amp; Cast"} 
{"index": {}} 
{"asin": "0000143561", "description": "3Pack DVD set - Italian Classics, Parties and Holidays.", "title": "Everyday Italian (with Giada de Laurentiis), Volume 1 (3 Pack): Italian Classics, Parties, Holidays", "price": 12.99, "imUrl": "http://g-ecx.images-amazon.com/images/G/01/x-site/icons/no-img-sm._CB192198896_.gif", "related": {"also_viewed": ["B0036FO6SI", "B000KL8ODE", "000014357X", "B0037718RC", "B002I5GNVU", "B000RBU4BM"], "buy_after_viewing": ["B0036FO6SI", "B000KL8ODE", "000014357X", "B0037718RC"]}, "salesRank": {"Movies & TV": 376041}, "categories": [["Movies & TV", "Movies"]]} 
{"index": {}} 
{"asin": "0000037214", "title": "Purple Sequin Tiny Dancer Tutu Ballet Dance Fairy Princess Costume Accessory", "price": 6.99, "imUrl": "http://ecx.images-amazon.com/images/I/31mCncNuAZL.jpg", "related": {"also_viewed": ["B00JO8II76", "B00DGN4R1Q", "B00E1YRI4C"]}, "salesRank": {"Clothing": 1233557}, "brand": "Big Dreams", "categories": [["Clothing, Shoes & Jewelry", "Girls"], ["Clothing, Shoes & Jewelry", "Novelty, Costumes & More", "Costumes & Accessories", "More Accessories", "Kids & Baby"]]} 