Python - Parsing XML with escaped angle brackets

I'm trying to parse some XML, but it contains escaped characters. Is there an easy way to do this?

The XML:

<?xml version="1.0" encoding="UTF-8"?> 
<Group id="RHEL-07-010010"> 
    <title>SRG-OS-000257-GPOS-00098</title> 
    <description>&lt;GroupDescription&gt;&lt;/GroupDescription&gt; </description> 
    <Rule id="RHEL-07-010010_rule" severity="high" weight="10.0"> 
     <version>RHEL-07-010010</version> 
     <title>The file permissions, ownership, and group membership of system files and commands must match the vendor values.</title> 
     <description>&lt;VulnDiscussion&gt;Discretionary access control is weakened if a user or group has access permissions to system files and directories greater than the default. 

Satisfies: SRG-OS-000257-GPOS-00098, SRG-OS-000278 GPOS-00108&lt;/VulnDiscussion&gt; 
     </description> 
    </Rule> 
</Group>

I'm trying to pull out the group id, the rule severity, the title, and the VulnDiscussion contained in the description tag. Here is my code:

import xml.etree.ElementTree as ET 
import HTMLParser 


tree = ET.parse("test.xml") 
root = tree.getroot() 


for findings in root.iter('Group'): 
    print findings.get('id') 
    rule = findings.find('Rule') 
    print rule.get('severity') 
    print rule.find('title').text 
    description = rule.find('description') 

    # my attempt at unescaping the description tag to parse the VulnDiscussion 
    embeddedHtml = HTMLParser.HTMLParser() 
    unescapedXML = embeddedHtml.unescape(description) 
    newtree = ET.fromstring(unescapedXML) 
    print newtree.get(VulnDiscussion).text

I can get everything except the VulnDiscussion; I think that is because it contains the escaped characters &gt; and &lt;. The code above crashes with:

newtree = ET.fromstring(unescapedXML) 
    File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1300, in XML 
    parser.feed(text) 
    File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1640, in feed 
    self._parser.Parse(data, 0) 
TypeError: must be string or read-only buffer, not Element

Source

2017-01-10 Joel Parker

Did the answer I posted solve your question, or were you looking for something else?

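I recommend using lxml instead of the standard library's xml package; it is somewhat more robust and feature-rich. It automatically unescapes the escaped symbols in text. Using XPath will also make your life easier here.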

from lxml import etree as ET 

xml = ET.XML(b"""<?xml version="1.0" encoding="UTF-8"?> 
<Group id="RHEL-07-010010"> 
    <title>SRG-OS-000257-GPOS-00098</title> 
    <description>&lt;GroupDescription&gt;&lt;/GroupDescription&gt; </description> 
    <Rule id="RHEL-07-010010_rule" severity="high" weight="10.0"> 
     <version>RHEL-07-010010</version> 
     <title>The file permissions, ownership, and group membership of system files and commands must match the vendor values.</title> 
     <description>&lt;VulnDiscussion&gt;Discretionary access control is weakened if a user or group has access permissions to system files and directories greater than the default. 

Satisfies: SRG-OS-000257-GPOS-00098, SRG-OS-000278 GPOS-00108&lt;/VulnDiscussion&gt; 
     </description> 
    </Rule> 
</Group>""") 

for description in xml.xpath('//description/text()'): 
    vulnDiscussion = next(iter(ET.XML(description).xpath('/VulnDiscussion/text()')), None) 
    print(vulnDiscussion)

The above code produces:

None 
Discretionary access control is weakened if a user or group has access permissions to system files and directories greater than the default. 

Satisfies: SRG-OS-000257-GPOS-00098, SRG-OS-000278 GPOS-00108
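
For comparison, the same idea also appears to work with the standard library alone. The parser converts &lt; and &gt; back to angle brackets when it builds an element's .text, so the TypeError in the question comes from passing the Element itself to the parser rather than its text. The following is only a sketch under stated assumptions: it targets Python 3, the sample is a trimmed stand-in with placeholder title and discussion text, and extract_findings is a hypothetical helper name. Wrapping the inner markup in a synthetic root element is an added assumption, so that a description holding several sibling elements would still parse.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the question's file; the rule title and discussion
# text are shortened placeholders, not the real content.
SAMPLE = """<Group id="RHEL-07-010010">
    <title>SRG-OS-000257-GPOS-00098</title>
    <description>&lt;GroupDescription&gt;&lt;/GroupDescription&gt;</description>
    <Rule id="RHEL-07-010010_rule" severity="high" weight="10.0">
        <version>RHEL-07-010010</version>
        <title>Example rule title</title>
        <description>&lt;VulnDiscussion&gt;Example discussion text.&lt;/VulnDiscussion&gt;</description>
    </Rule>
</Group>"""


def extract_findings(xml_text):
    """Hypothetical helper: collect id, severity, title and VulnDiscussion."""
    root = ET.fromstring(xml_text)
    findings = []
    # iter() also yields the root element itself when its tag matches.
    for group in root.iter('Group'):
        rule = group.find('Rule')
        # .text is a plain str in which &lt; and &gt; are already < and >.
        inner_markup = rule.find('description').text or ''
        # Assumption: a synthetic wrapper keeps the fragment well-formed
        # even if it contains several sibling elements.
        inner = ET.fromstring('<wrapper>' + inner_markup + '</wrapper>')
        findings.append({
            'id': group.get('id'),
            'severity': rule.get('severity'),
            'title': rule.find('title').text,
            'discussion': inner.findtext('VulnDiscussion'),
        })
    return findings


for finding in extract_findings(SAMPLE):
    print(finding)
```

No separate HTML unescaping step is needed in this sketch, because the text handed to the second parse is already unescaped.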

Source

2017-01-10 23:51:00

I'm a Python beginner trying to test your answer, but I get an error. I'm loading the XML from a file, though; here is the code: 'for description in xml.path('//description/text()'): print(description)' fails with AttributeError: 'lxml.etree._ElementTree' object has no attribute 'path'

That looks like a typo to me. The method name should be 'xpath', not 'path'.
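
Loading from a file changes only the entry point: parse returns a tree object rather than an element, and the query is then run on that tree. The sketch below uses the standard library so that it runs without extra dependencies (with lxml the tree object's query method is xpath). It assumes Python 3; the temporary file stands in for test.xml, the sample content is a placeholder, and description_texts is a hypothetical helper name.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Placeholder content standing in for the question's test.xml.
SAMPLE = """<Group id="RHEL-07-010010">
    <Rule id="RHEL-07-010010_rule" severity="high">
        <description>&lt;VulnDiscussion&gt;Example discussion text.&lt;/VulnDiscussion&gt;</description>
    </Rule>
</Group>"""


def description_texts(path):
    """Hypothetical helper: return the text of every <description> in a file."""
    tree = ET.parse(path)  # an ElementTree object, not an Element
    return [node.text for node in tree.iterfind('.//description')]


# Write the sample to a temporary file, query it, then clean up.
with tempfile.NamedTemporaryFile('w', suffix='.xml', delete=False,
                                 encoding='utf-8') as handle:
    handle.write(SAMPLE)
    sample_path = handle.name

try:
    texts = description_texts(sample_path)
finally:
    os.remove(sample_path)

print(texts)
```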

I tried 'for description in xml.xpath('VulnDiscussion/text()')'; the goal was to get the contents of the VulnDiscussion tag.
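
One likely reason a direct query for VulnDiscussion finds nothing is that the outer document has no such element: the escaped markup is just the character data of description, and it only becomes an element after that text is parsed a second time. This is a small sketch of the distinction, written with the standard library's findall instead of lxml's xpath to stay dependency-free; the sample text is a placeholder.

```python
import xml.etree.ElementTree as ET

# Placeholder rule with the same shape as the question's data.
SAMPLE = """<Rule id="RHEL-07-010010_rule" severity="high">
    <description>&lt;VulnDiscussion&gt;Example discussion text.&lt;/VulnDiscussion&gt;</description>
</Rule>"""

root = ET.fromstring(SAMPLE)

# No match: the outer tree holds no VulnDiscussion element, only text.
direct_matches = root.findall('.//VulnDiscussion')

# The markup exists solely as the character data of <description>.
description_text = root.find('description').text

# Parsing that text as its own document makes the element reachable.
inner = ET.fromstring(description_text)
discussion = inner.text

print(direct_matches)
print(description_text)
print(discussion)
```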
