Get all article titles from an XML wiki dump - Python

I have a Wikipedia XML dump that I created by exporting all the pages of a particular category. You can see the exact structure of this XML file by generating one yourself at https://en.wikipedia.org/wiki/Special:Export. Now I would like to build a list of the titles of each article in Python. I tried using:

import xml.etree.ElementTree as ET

tree = ET.parse('./comp_sci_wiki.xml')
root = tree.getroot()

for element in root:
    for sub in element:
        print(sub.find("title"))

Nothing is printed. This seems like it should be a relatively simple task. Any help you can provide would be greatly appreciated. Thanks!

Source

2016-04-05 user2585945

If you look at the top of your exported file, you will see that the document declares a default XML namespace:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" 
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLo

That means the document contains no un-namespaced "title" element, which is one reason your sub.find("title") statement fails. You can see this by printing your root element:

>>> print(root)
<Element '{http://www.mediawiki.org/xml/export-0.10/}mediawiki' at 0x7f2a45df6c10>

Note that it does not say <Element 'mediawiki'>: the identifier includes the full namespace. This document describes in detail how to work with namespaces in XML documents, but the tl;dr version is that you need:

>>> from xml.etree import ElementTree as ET
>>> tree = ET.parse('/home/lars/Downloads/Wikipedia-20160405005142.xml')
>>> root = tree.getroot()
>>> ns = 'http://www.mediawiki.org/xml/export-0.10/'
>>> for page in root.findall('{%s}page' % ns):
...     print(page.find('{%s}title' % ns).text)
...
Category:Wikipedia books on computer science 
Computer science in sport 
Outline of computer science 
Category:Unsolved problems in computer science 
Category:Philosophy of computer science 
[...etc...] 
>>>
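
As a side note, the standard library's find() and findall() also accept a prefix-to-URI mapping, which avoids building the '{uri}tag' strings by hand. Here is a minimal sketch of that variant. The SAMPLE string is a made-up, trimmed-down stand-in for a real export, and the 'mw' prefix is an arbitrary alias chosen for this example; only the URI has to match the document.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for a Special:Export dump.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Computer science in sport</title></page>
  <page><title>Outline of computer science</title></page>
</mediawiki>"""

# 'mw' is an arbitrary local alias; the URI is what must match the document.
NS = {'mw': 'http://www.mediawiki.org/xml/export-0.10/'}

root = ET.fromstring(SAMPLE)

# findall() looks at direct children of root, so this picks up each <page>.
titles = [page.find('mw:title', NS).text for page in root.findall('mw:page', NS)]
print(titles)
```

For a real dump you would presumably swap ET.fromstring(SAMPLE) for ET.parse(path).getroot() and keep the rest unchanged.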

In any case, read through the documentation on namespace support; hopefully that, plus these examples, will point you in the right direction. Your life will probably be easier if you install the lxml module:

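>>> from lxml import etree
>>> tree = etree.parse('/home/lars/Downloads/Wikipedia-20160405005142.xml')
>>> nsmap = {'x': 'http://www.mediawiki.org/xml/export-0.10/'}
>>> for title in tree.xpath('//x:title', namespaces=nsmap):
...     print(title.text)
...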
Category:Wikipedia books on computer science 
Computer science in sport 
Outline of computer science 
Category:Unsolved problems in computer science 
Category:Philosophy of computer science 
Category:Computer science organizations 
[...etc...]

lxml includes full XPath support, which is what lets you do something like the above. The takeaway is that XML namespaces matter: a title in one namespace is not the same as a title in another namespace.
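
To make that takeaway concrete, here is a small sketch that reads the namespace off the root element's tag instead of hard-coding it, and shows that an unqualified lookup finds nothing while the qualified one works. The helper names namespace_of and page_titles and the SAMPLE string are invented for this illustration; they are not part of ElementTree or of the export format.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for a Special:Export dump.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <siteinfo><sitename>Wikipedia</sitename></siteinfo>
  <page><title>Outline of computer science</title></page>
</mediawiki>"""


def namespace_of(element):
    """Return the namespace URI from a '{uri}local' tag, or '' if there is none."""
    tag = element.tag
    if tag.startswith('{'):
        return tag[1:].split('}', 1)[0]
    return ''


def page_titles(root):
    """Collect the title text of every <page>, using the root's own namespace."""
    ns = namespace_of(root)
    prefix = '{%s}' % ns if ns else ''
    return [page.findtext(prefix + 'title') for page in root.iter(prefix + 'page')]


root = ET.fromstring(SAMPLE)
first_page = root.find('{%s}page' % namespace_of(root))

# The unqualified name does not match the namespaced element, so this is None.
unqualified = first_page.find('title')

print(namespace_of(root))
print(unqualified)
print(page_titles(root))
```

Deriving the URI from the root tag is only a convenience that assumes the whole dump lives in a single default namespace, as the header shown earlier suggests.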

Source

2016-04-05 01:10:27 larsks
