PySpark: counting lines that contain a string

I have multiple XML files that look something like this:

<?xml version="1.0" encoding="UTF-8"?> 
<parent> 
    <row AcceptedAnswerId="15" AnswerCount="5" Body="&lt;p&gt;How should I elicit prior distributions from experts when fitting a Bayesian model?&lt;/p&gt;&#10;" CommentCount="1" CreationDate="2010-07-19T19:12:12.510" FavoriteCount="17" Id="1" LastActivityDate="2010-09-15T21:08:26.077" OwnerUserId="8" PostTypeId="1" Score="26" Tags="&lt;bayesian&gt;&lt;prior&gt;&lt;elicitation&gt;" Title="Eliciting priors from experts" ViewCount="1457" />

I would like to use PySpark to count the lines that do not contain the string: <row

My current idea:

def startWithRow(line):
    # True when the line, ignoring surrounding whitespace, begins with "<row".
    return line.strip().startswith("<row")

# Count the lines that do not start with "<row".
sc.textFile(localpath("folder_containing_xml.gz_files")) \
    .filter(lambda x: not startWithRow(x)) \
    .count()

I have tried to validate this, but the results I get from a simple line count don't make sense (I downloaded the XML file and ran wc on it, and the result did not match the count from PySpark).
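One way to cross-check the Spark count is to run the same count without Spark. The following is a minimal sketch, assuming the files are gzip-compressed text; `count_non_row_lines` is a hypothetical helper (not part of PySpark), and the sample file is made up for illustration.

```python
import gzip
import os
import tempfile


def count_non_row_lines(path):
    """Count lines in a gzip-compressed text file that do not start with "<row"."""
    count = 0
    # Mode "rt" decompresses and decodes, so each line is a str.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip().startswith("<row"):
                count += 1
    return count


# Build a small made-up sample file so the sketch runs on its own.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<parent>\n"
    '    <row Id="1" />\n'
    '    <row Id="2" />\n'
    "</parent>\n"
)
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.xml.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    handle.write(sample)

non_row_count = count_non_row_lines(sample_path)
print(non_row_count)
```

Two things may explain a mismatch with wc. Running wc -l directly on a .gz file counts newline bytes in the compressed data, not lines of XML, so the comparison should be made against the decompressed content. Also, wc -l counts every line, whereas the filter above counts only the lines that do not start with <row.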

Does anything about my approach stand out as wrong or odd?

Source

2017-05-26 Aus_10

Possible duplicate of [How to parse XML files in Apache Spark?](https://stackoverflow.com/questions/33280821/how-to-parse-xml-files-in-apache-spark)

I would just use the lxml library together with Spark to count the row lines or to filter something out. For example:

from lxml import etree

def find_number_of_rows(path):
    # Treat the argument as an XML string first; if that fails to parse,
    # treat it as a path to an XML file instead.
    try:
        tree = etree.fromstring(path)
    except etree.XMLSyntaxError:
        tree = etree.parse(path)
    return len(tree.findall('row'))

rdd = spark.sparkContext.parallelize(paths)  # paths is a list of all your file paths
rdd.map(find_number_of_rows).collect()

If you have a list of XML strings (just a toy example), you can do the following:

text = [
    """
    <parent>
        <row ViewCount="1457" />
        <row ViewCount="1457" />
    </parent>
    """,
    """
    <parent>
        <row ViewCount="1457" />
        <row ViewCount="1457" />
        <row ViewCount="1457" />
    </parent>
    """
]

rdd = spark.sparkContext.parallelize(text)
rdd.map(find_number_of_rows).collect()

In your case, your function has to take a file path instead. Then you can count or filter those rows. I don't have a complete file to test with. Let me know if you need extra help!
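A path-based variant might look like the sketch below. It assumes gzip-compressed files and swaps in the standard library's `xml.etree.ElementTree` for lxml so that it has no extra dependency; `count_rows_in_file` and the sample file are hypothetical names made up for illustration.

```python
import gzip
import os
import tempfile
import xml.etree.ElementTree as ET


def count_rows_in_file(path):
    """Parse one gzip-compressed XML file and count <row> children of the root."""
    with gzip.open(path, "rb") as handle:
        root = ET.parse(handle).getroot()
    return len(root.findall("row"))


# Made-up sample file so the sketch runs on its own.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<parent>\n"
    '  <row Id="1" />\n'
    '  <row Id="2" />\n'
    '  <row Id="3" />\n'
    "</parent>\n"
)
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "posts.xml.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    handle.write(sample)

row_count = count_rows_in_file(sample_path)
print(row_count)
```

With Spark, the same function could be mapped over an RDD of paths, for example `spark.sparkContext.parallelize(paths).map(count_rows_in_file)`, provided every worker can read those paths. Note that this parses each file whole, so it counts elements rather than lines.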

Source

2017-05-26 20:18:15 titipata

Thanks for the quick reply. However, when I run your example, it returns [0, 0].

Can you run find_number_of_rows(text[0]) in plain Python to check whether the function works for you? – titipata

Thanks for the help, I finally figured it out. I wasn't reducing correctly.

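import xml.etree.ElementTree as ET
from collections import Counter

def badRowParser(x):
    # Returns True when the line parses as XML, False otherwise.
    try:
        ET.fromstring(x.strip().encode('utf-8'))
        return True
    except ET.ParseError:
        return False

posts = sc.textFile(localpath('folder_containing_xml.gz_files'))
# Keep the lines that mention "<row", then flag the ones that fail to parse.
rejected = posts.filter(lambda l: "<row" in l).map(lambda x: not badRowParser(x))
ans = rejected.collect()

# Tally True (malformed) against False (well-formed).
Counter(ans)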
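To see what the tally looks like without a cluster, here is a small sketch that applies the same idea to a plain Python list. The sample lines are made up, and `parses_as_xml` is a hypothetical name used only for this illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def parses_as_xml(line):
    """Return True if the stripped line is a well-formed XML element."""
    try:
        ET.fromstring(line.strip())
        return True
    except ET.ParseError:
        return False


# Made-up input: the second row has an unterminated attribute value.
lines = [
    '<parent>',
    '  <row Id="1" />',
    '  <row Id="2" Title="unterminated />',
    '  <row Id="3" />',
    '</parent>',
]

# Keep only lines mentioning "<row", then flag the malformed ones.
flags = [not parses_as_xml(line) for line in lines if "<row" in line]
tally = Counter(flags)
print(tally)
```

On an RDD, calling countByValue() on the flags would produce the same kind of tally without first collecting every flag to the driver.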

Source

2017-05-29 15:57:32
