Handling multiple nodes when parsing XML with Python

For an assignment, I need to parse a 2-million-line XML file and load the data into a MySQL database. Since we are using a Python environment with sqlite for the class, I am trying to parse the file with Python. Keep in mind that I am just learning Python, so everything is new to me!

I have made several attempts, but I keep failing and I am getting frustrated. For efficiency, I am testing my code on just a small portion of the full XML, shown here:

<pub> 
<ID>7</ID> 
<title>On the Correlation of Image Size to System Accuracy in Automatic Fingerprint Identification Systems</title> 
<year>2003</year> 
<booktitle>AVBPA</booktitle> 
<pages>895-902</pages> 
<authors> 
    <author>J. K. Schneider</author> 
    <author>C. E. Richardson</author> 
    <author>F. W. Kiefer</author> 
    <author>Venu Govindaraju</author> 
</authors> 
</pub>

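First attempt

Here I successfully pulled all the data out of each tag, except when there are multiple authors under the <authors> tag. I am trying to loop over each node in the authors tag, count them, build a temporary array for those authors, and then insert them into the database with SQL. I get "15" for the number of authors, but there are clearly only 4! How can I fix this?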

from xml.dom import minidom 

xmldoc= minidom.parse("test.xml") 

pub = xmldoc.getElementsByTagName("pub")[0] 
ID = pub.getElementsByTagName("ID")[0].firstChild.data 
title = pub.getElementsByTagName("title")[0].firstChild.data 
year = pub.getElementsByTagName("year")[0].firstChild.data 
booktitle = pub.getElementsByTagName("booktitle")[0].firstChild.data 
pages = pub.getElementsByTagName("pages")[0].firstChild.data 
authors = pub.getElementsByTagName("authors")[0] 
author = authors.getElementsByTagName("author")[0].firstChild.data 
num_authors = len(author) 
print("Number of authors: ", num_authors) 

print(ID) 
print(title) 
print(year) 
print(booktitle) 
print(pages) 
print(author)

Source

2017-04-23 douglasrcjames

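Notice that you are getting the number of characters in the first author's name here, because the code limits the result to the first author only (index 0) and then takes its length: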

author = authors.getElementsByTagName("author")[0].firstChild.data 
num_authors = len(author) 
print("Number of authors: ", num_authors)

To get all of the authors, simply do not restrict the result:

author = authors.getElementsByTagName("author") 
num_authors = len(author) 
print("Number of authors: ", num_authors)
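The difference between the two snippets can be checked on the sample from the question. This is a minimal sketch that parses an inline copy of the `<authors>` block with `minidom.parseString` instead of reading `test.xml`; the names `SAMPLE`, `first_name`, and `author_nodes` are only for illustration.

```python
from xml.dom import minidom

# Inline copy of the <authors> part of the sample, so the sketch runs
# without test.xml on disk.
SAMPLE = """<pub>
<authors>
    <author>J. K. Schneider</author>
    <author>C. E. Richardson</author>
    <author>F. W. Kiefer</author>
    <author>Venu Govindaraju</author>
</authors>
</pub>"""

doc = minidom.parseString(SAMPLE)
authors = doc.getElementsByTagName("authors")[0]

# Indexing with [0] and taking .firstChild.data yields a string,
# so len() counts the characters of "J. K. Schneider".
first_name = authors.getElementsByTagName("author")[0].firstChild.data
print(len(first_name))    # 15

# Without the index, the result is the list of <author> elements,
# so len() counts the authors.
author_nodes = authors.getElementsByTagName("author")
print(len(author_nodes))  # 4
```

So the "15" in the question is the length of the first author's name, not a node count.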

To get all of the author names in a list, rather than a list of author elements, you can use a list comprehension:

author = [a.firstChild.data for a in authors.getElementsByTagName("author")] 
print(author) 
# [u'J. K. Schneider', u'C. E. Richardson', u'F. W. Kiefer', u'Venu Govindaraju']
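For the database step mentioned in the question, one possible approach is sketched below using the standard `sqlite3` module with an in-memory database. The table names (`pub`, `author`), their columns, and the helper `text_of` are assumptions made for illustration, not something given in the question; a MySQL client would need its own connection code and placeholder style.

```python
import sqlite3
from xml.dom import minidom

# Inline copy of the sample record from the question.
SAMPLE = """<pub>
<ID>7</ID>
<title>On the Correlation of Image Size to System Accuracy in Automatic Fingerprint Identification Systems</title>
<year>2003</year>
<booktitle>AVBPA</booktitle>
<pages>895-902</pages>
<authors>
    <author>J. K. Schneider</author>
    <author>C. E. Richardson</author>
    <author>F. W. Kiefer</author>
    <author>Venu Govindaraju</author>
</authors>
</pub>"""


def text_of(parent, tag):
    """Return the text of the first <tag> element under parent (hypothetical helper)."""
    return parent.getElementsByTagName(tag)[0].firstChild.data


# Hypothetical schema: one row per publication, one row per (publication, author).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pub (id INTEGER PRIMARY KEY, title TEXT, year INTEGER, "
    "booktitle TEXT, pages TEXT)"
)
conn.execute("CREATE TABLE author (pub_id INTEGER, name TEXT)")

doc = minidom.parseString(SAMPLE)
for pub in doc.getElementsByTagName("pub"):
    pub_id = int(text_of(pub, "ID"))
    conn.execute(
        "INSERT INTO pub VALUES (?, ?, ?, ?, ?)",
        (pub_id, text_of(pub, "title"), int(text_of(pub, "year")),
         text_of(pub, "booktitle"), text_of(pub, "pages")),
    )
    # The list comprehension from the answer collects every author name.
    names = [a.firstChild.data for a in pub.getElementsByTagName("author")]
    conn.executemany(
        "INSERT INTO author VALUES (?, ?)", [(pub_id, n) for n in names]
    )
conn.commit()

rows = conn.execute("SELECT name FROM author WHERE pub_id = 7").fetchall()
print(len(rows))
```

Keeping authors in a separate table avoids guessing a maximum number of author columns. Note that `minidom` loads the entire document into memory, which may matter for a 2-million-line file.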

Source

2017-04-23 06:27:02 har07

I knew I needed to access each variable in the array, but I did not know the syntax. Thank you so much! – douglasrcjames

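Hey @har07, so I have made progress, but some of my XML data is "bad" in a sense... Names containing special characters such as "í" are stored in the XML file as "&iacute;". How do I handle these special characters in Python? The error I am getting is "ExpatError: undefined entity". – douglasrcjames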
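The thread does not include a reply to this follow-up. One possible workaround, sketched here under the assumption that the file uses HTML named entities such as `&iacute;` without a DTD that declares them, is to rewrite those entities as numeric character references before parsing. The helper name `replace_html_entities` and the sample string `raw` are hypothetical; `html.entities.name2codepoint` is part of the Python 3 standard library.

```python
import re
from html.entities import name2codepoint
from xml.dom import minidom

# The five entities XML itself predefines must be left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def replace_html_entities(text):
    """Turn HTML named entities into numeric references (hypothetical helper)."""
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML entities and unknown names as-is
        return "&#{};".format(name2codepoint[name])
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, text)


# Made-up sample data containing HTML entities plus one real XML entity.
raw = "<authors><author>Jos&eacute; Garc&iacute;a &amp; co.</author></authors>"

doc = minidom.parseString(replace_html_entities(raw))
name = doc.getElementsByTagName("author")[0].firstChild.data
print(name)
```

Numeric references are understood by any XML parser, so the rest of the parsing code stays unchanged. For a real file, the text would be read with its proper encoding and passed through the helper before `minidom.parseString`.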
