Extracting field tags and their content from XML using lxml

I have an XML file; its contents are shown at the bottom of this post. I want to be able to create CSV output from it with fields such as 'forum title'; 'title'; 'user'; '{everything in the sentence}'.

I have this code:

from lxml import etree 
xmL = 'huge-xml.xml' 

# Parse the XML file in chunks at a time and output info at every step of the way 

for event, elem in etree.iterparse(xmL, events=('start', 'end', 'start-ns', 'end-ns')): 
    text = elem.text 
    print(event, elem, text)

However, this somehow only picks up the <w> elements; it does not find all of the tagged content.

The XML being parsed:

<corpus id="politics"> 
<forum id="14" title="something & something" url="https://www.at.net/1"> 
<thread id="108" title="a title" postcount="87" lastpost="2005-03-31 06:35" url="https://www.at.net/111/222"> 
<text datefrom="20020526" dateto="20020526" timefrom="230000" timeto="230059" id="1185" username="user123" userid="46" date="2002-03-22 23:00" url="https://www.at.net/111/333"> 
<sentence id="776550f8f2-7765cba9fe"> 
<w pos="NN" msd="NN.UTR.SIN.DEF.NOM" lemma="|gräns|" lex="|gräns..nn.1|" saldo="|gräns..1|" prefix="|grän..nn.1|" suffix="|s..nn.1|" ref="1" dephead="6" deprel="AA">Gränsen</w> 
<w pos="PP" msd="PP" lemma="|mellan|" lex="|mellan..pp.1|" saldo="|mellan..1|" prefix="|" suffix="|" ref="2" dephead="4" deprel="DT">mellan</w> 
<w pos="NN" msd="NN.UTR.SIN.IND.NOM" lemma="|lycka|" lex="|lycka..nn.2|lycka..nn.1|" saldo="|lycka..2|lycka..1|lycka..3|" prefix="|" suffix="|" ref="3" dephead="4" deprel="CJ">lycka</w> 
<w pos="KN" msd="KN" lemma="|och|" lex="|och..kn.1|" saldo="|och..1|" prefix="|" suffix="|" ref="4" dephead="1" deprel="ET">och</w> 
<w pos="NN" msd="NN.UTR.SIN.IND.NOM" lemma="|död|" lex="|död..nn.1|" saldo="|död..2|" prefix="|" suffix="|" ref="5" dephead="4" deprel="CJ">död</w> 
<w pos="JJ" msd="JJ.POS.UTR.SIN.IND.NOM" lemma="|snäv|" lex="|snäv..av.1|" saldo="|snäv..1|" prefix="|" suffix="|" ref="6" dephead="" deprel="ROOT">snäv</w> 
<w pos="MAD" msd="MAD" lemma="|" lex="|" saldo="|" prefix="|" suffix="|" ref="7" dephead="6" deprel="I?">?</w> 
</sentence> 
</text> 
</thread> 

... and so on ...

Source

2017-03-14 textnet

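The code further below extracts the forum title, the thread title, the username, and the text of every <w> element. For each sentence it yields a list of these values, which is written to the CSV file as one row. For reference, this is the input file it was tested with: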

<?xml version="1.0" encoding="UTF-8"?> 
<corpus id="politics"> 
    <forum id="14" title="something &amp; something" url="https://www.at.net/1"> 
     <thread id="108" title="a title" postcount="87" lastpost="2005-03-31 06:35" url="https://www.at.net/111/222"> 
      <text datefrom="20020526" dateto="20020526" timefrom="230000" timeto="230059" id="1185" username="user123" userid="46" date="2002-03-22 23:00" url="https://www.at.net/111/333"> 
       <sentence id="776550f8f2-7765cba9fe"> 
        <w pos="NN" msd="NN.UTR.SIN.DEF.NOM" lemma="|gräns|" lex="|gräns..nn.1|" saldo="|gräns..1|" prefix="|grän..nn.1|" suffix="|s..nn.1|" ref="1" dephead="6" deprel="AA">Gränsen</w> 
        <w pos="PP" msd="PP" lemma="|mellan|" lex="|mellan..pp.1|" saldo="|mellan..1|" prefix="|" suffix="|" ref="2" dephead="4" deprel="DT">mellan</w> 
        <w pos="NN" msd="NN.UTR.SIN.IND.NOM" lemma="|lycka|" lex="|lycka..nn.2|lycka..nn.1|" saldo="|lycka..2|lycka..1|lycka..3|" prefix="|" suffix="|" ref="3" dephead="4" deprel="CJ">lycka</w> 
        <w pos="KN" msd="KN" lemma="|och|" lex="|och..kn.1|" saldo="|och..1|" prefix="|" suffix="|" ref="4" dephead="1" deprel="ET">och</w> 
        <w pos="NN" msd="NN.UTR.SIN.IND.NOM" lemma="|död|" lex="|död..nn.1|" saldo="|död..2|" prefix="|" suffix="|" ref="5" dephead="4" deprel="CJ">död</w> 
        <w pos="JJ" msd="JJ.POS.UTR.SIN.IND.NOM" lemma="|snäv|" lex="|snäv..av.1|" saldo="|snäv..1|" prefix="|" suffix="|" ref="6" dephead="" deprel="ROOT">snäv</w> 
        <w pos="MAD" msd="MAD" lemma="|" lex="|" saldo="|" prefix="|" suffix="|" ref="7" dephead="6" deprel="I?">?</w> 
       </sentence> 
      </text> 
     </thread> 
    </forum> 
</corpus>

Resulting CSV file:

something & something,a title,user123,Gränsen,mellan,lycka,och,död,snäv,?

I am not sure whether you want the <w> texts concatenated (see the note after the code). I tested the code below with the input file above:

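import csv
from lxml import etree


def readXML(xml_file):
    """Yield one row per <sentence>: forum title, thread title, username, then the words."""
    forum, thread, user = [''] * 3
    ws = []

    for event, elem in etree.iterparse(xml_file, events=('start', 'end')):
        # Attributes are already available on the 'start' event.
        if elem.tag == 'forum' and event == 'start':
            forum = elem.attrib['title']
        if elem.tag == 'thread' and event == 'start':
            thread = elem.attrib['title']
        if elem.tag == 'text' and event == 'start':
            user = elem.attrib['username']
        if elem.tag == 'sentence':
            if event == 'start':
                ws.clear()
            else:
                yield [forum, thread, user] + ws
                elem.clear()  # free the processed <w> children; matters for huge files
        # Element text is only guaranteed to be complete on the 'end' event.
        if elem.tag == 'w' and event == 'end':
            ws.append(elem.text or '')


# newline='' is what the csv module expects; UTF-8 keeps characters such as 'ä' intact.
with open('huge-csv.csv', 'w', newline='', encoding='utf-8') as fd:
    w = csv.writer(fd)
    w.writerows(readXML('huge-xml.xml'))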

If you do want the <w> texts concatenated into a single field, replace yield [forum, thread, user] + ws with yield [forum, thread, user, ' '.join(ws)].
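To make the concatenated variant concrete, here is a minimal, self-contained sketch. It is not the code from the answer: it uses the standard library's xml.etree.ElementTree.iterparse as a stand-in for lxml so that it runs without extra dependencies, and the names SAMPLE and sentence_rows, as well as the shortened sample document, are made up for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree in this sketch

# Hypothetical, shortened sample in the same shape as the input file above.
SAMPLE = """<corpus id="politics">
  <forum id="14" title="something &amp; something">
    <thread id="108" title="a title">
      <text id="1185" username="user123">
        <sentence id="s1">
          <w pos="NN">Gränsen</w>
          <w pos="PP">mellan</w>
          <w pos="NN">lycka</w>
          <w pos="MAD">?</w>
        </sentence>
      </text>
    </thread>
  </forum>
</corpus>""".encode("utf-8")


def sentence_rows(source):
    """Yield [forum title, thread title, username, joined sentence text]."""
    forum = thread = user = ''
    words = []
    for event, elem in ET.iterparse(source, events=('start', 'end')):
        if event == 'start':
            # Attributes are available as soon as the opening tag is seen.
            if elem.tag == 'forum':
                forum = elem.get('title', '')
            elif elem.tag == 'thread':
                thread = elem.get('title', '')
            elif elem.tag == 'text':
                user = elem.get('username', '')
            elif elem.tag == 'sentence':
                words = []
        elif elem.tag == 'w':
            # Text is read on the 'end' event, when the element is complete.
            words.append(elem.text or '')
        elif elem.tag == 'sentence':
            yield [forum, thread, user, ' '.join(words)]
            elem.clear()  # drop the processed <w> children


rows = list(sentence_rows(io.BytesIO(SAMPLE)))

# Write to an in-memory buffer instead of a file, just to show the CSV text.
out = io.StringIO()
csv.writer(out).writerows(rows)
print(out.getvalue())
```

With this variant every sentence becomes a row of exactly four columns, which is usually easier to load into a spreadsheet than rows whose length depends on the number of words.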

Source

2017-03-14 17:11:31 jcbsv
