How to parse a .txt file into .xml?

This is my txt file:

In File Name: C:\Users\naqushab\desktop\files\File 1.m1 
Out File Name: C:\Users\naqushab\desktop\files\Output\File 1.m2 
In File Size: Low: 22636 High: 0 
Total Process time: 1.859000 
Out File Size: Low: 77619 High: 0 

In File Name: C:\Users\naqushab\desktop\files\File 2.m1 
Out File Name: C:\Users\naqushab\desktop\files\Output\File 2.m2 
In File Size: Low: 20673 High: 0 
Total Process time: 3.094000 
Out File Size: Low: 94485 High: 0 

In File Name: C:\Users\naqushab\desktop\files\File 3.m1 
Out File Name: C:\Users\naqushab\desktop\files\Output\File 3.m2 
In File Size: Low: 66859 High: 0 
Total Process time: 3.516000 
Out File Size: Low: 217268 High: 0

I am trying to parse it into an XML format like this:

<?xml version='1.0' encoding='utf-8'?> 
<root> 
    <filedata> 
     <InFileName>File 1.m1</InFileName> 
     <OutFileName>File 1.m2</OutFileName> 
     <InFileSize>22636</InFileSize> 
     <OutFileSize>77619</OutFileSize> 
     <ProcessTime>1.859000</ProcessTime> 
    </filedata> 
    <filedata> 
     <InFileName>File 2.m1</InFileName> 
     <OutFileName>File 2.m2</OutFileName> 
     <InFileSize>20673</InFileSize> 
     <OutFileSize>94485</OutFileSize> 
     <ProcessTime>3.094000</ProcessTime> 
    </filedata> 
    <filedata> 
     <InFileName>File 3.m1</InFileName> 
     <OutFileName>File 3.m2</OutFileName> 
     <InFileSize>66859</InFileSize> 
     <OutFileSize>217268</OutFileSize> 
     <ProcessTime>3.516000</ProcessTime> 
    </filedata> 
</root>

Here is the code I wrote to try to achieve that (I am using Python 2):

import re 
import xml.etree.ElementTree as ET 

rex = re.compile(r'''(?P<title>In File Name: 
         |Out File Name: 
         |In File Size: Low: 
         |Total Process time: 
         |Out File Size: Low: 
        ) 
        (?P<value>.*) 
        ''', re.VERBOSE) 

root = ET.Element('root') 
root.text = '\n' # newline before the celldata element 

with open('Performance.txt') as f:
    celldata = ET.SubElement(root, 'filedata')
    celldata.text = '\n' # newline before the collected element
    celldata.tail = '\n\n' # empty line after the celldata element
    for line in f:
        # Empty line starts new celldata element (hack style, ugly)
        if line.isspace():
            celldata = ET.SubElement(root, 'filedata')
            celldata.text = '\n'
            celldata.tail = '\n\n'

        # If the line contains the wanted data, process it.
        m = rex.search(line)
        if m:
            # Fix some problems with the title as it will be used
            # as the tag name.
            title = m.group('title')
            title = title.replace('&', '')
            title = title.replace(' ', '')

            e = ET.SubElement(celldata, title.lower())
            e.text = m.group('value')
            e.tail = '\n'

# Display for debugging 
ET.dump(root) 

# Include the root element to the tree and write the tree 
# to the file. 
tree = ET.ElementTree(root) 
tree.write('Performance.xml', encoding='utf-8', xml_declaration=True)

But I am getting empty values. Is it possible to parse this TXT into XML?

Source

2017-03-16 naqushab

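Can you show the empty values you are getting? Can you be more clear?

If the *complete program* does not give the expected result, try splitting it into smaller parts and testing them separately. Here, you should start by simply parsing the input and printing the parts you find. Only then try to create the XML file.

Also, the regex and the sub-element names do not match. Is that intentional?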

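Correction: with your regex, it should be

m = re.search('(?P<title>(In File Name)|(Out File Name)|(In File Size: *Low)|(Total Process time)|(Out File Size: *Low)):(?P<value>.*)',line)

and not what you have given. Your pattern is compiled with re.VERBOSE, which ignores the unescaped spaces inside it, so titles such as In File Name: never match a line and the filedata elements stay empty. The call above does not pass that flag, so its spaces are literal. You can also do this without using a regex.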
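A quick way to check a pattern like the one above is to run it over a few sample lines and print what the named groups capture. This is only a sketch: the sample lines are copied from the question, and the names pattern, sample_lines and parsed are made up for the illustration.

```python
import re

# Same pattern as above, compiled without re.VERBOSE so the spaces are literal.
pattern = re.compile(
    r'(?P<title>(In File Name)|(Out File Name)|(In File Size: *Low)'
    r'|(Total Process time)|(Out File Size: *Low)):(?P<value>.*)'
)

sample_lines = [
    r'In File Name: C:\Users\naqushab\desktop\files\File 1.m1',
    'In File Size: Low: 22636 High: 0',
    'Total Process time: 1.859000',
]

# Collect (title, value) pairs for every line the pattern recognises.
parsed = []
for line in sample_lines:
    m = pattern.search(line)
    if m:
        parsed.append((m.group('title'), m.group('value').strip()))

for title, value in parsed:
    print('%s -> %s' % (title, value))
```

Note that the value captured for the size line still contains the High: part, and the name lines still carry the full path, so both need further trimming before they go into the XML.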

Suggestion:

xml.dom.minidom can be used to prettify the XML string.

I have added comments inline for better understanding.

Node.toprettyxml([indent=""[, newl=""[, encoding=""]]])

Return a pretty-printed version of the document. indent specifies the indentation string and defaults to a tabulator; newl specifies the string emitted at the end of each line and defaults to \n.

Edit

To dedup consecutive lines in a list, you can use groupby from the itertools package:

import itertools as it
[line[0] for line in it.groupby(lines)]

So,

import itertools as it
import xml.dom.minidom as minidom
import xml.etree.ElementTree as ET

root = ET.Element('root')

with open('file1.txt') as f:
    lines = f.read().splitlines()

# add the first subelement
celldata = ET.SubElement(root, 'filedata')

# for every line in the input file,
# with consecutive duplicates grouped into one
for line, _ in it.groupby(lines):
    # a break between subelements, that is, an empty line
    if not line:
        # add the next subelement and keep it as celldata
        celldata = ET.SubElement(root, 'filedata')
    else:
        # otherwise, split on ":" to get the tag name
        tag = line.split(":")
        # format the tag name
        el = ET.SubElement(celldata, tag[0].replace(" ", ""))
        tag = ' '.join(tag[1:]).strip()

        # get the file name from the file path
        if 'File Name' in line:
            tag = line.split("\\")[-1].strip()
        elif 'File Size' in line:
            # split() drops the empty strings, like filter(None, ...) did
            splist = line.split()
            tag = splist[splist.index('Low:') + 1]
            # splist[splist.index('High:') + 1]
        el.text = tag

# prettify the xml (with an encoding, toprettyxml returns a byte string)
formatedXML = minidom.parseString(
    ET.tostring(root)).toprettyxml(indent=" ", encoding='utf-8').strip()

# display for debugging
print(formatedXML)

# write the formatedXML to file; binary mode because it is a byte string
with open("Performance.xml", "wb") as f:
    f.write(formatedXML)

Output: Performance.xml

<?xml version="1.0" encoding="utf-8"?>
<root>
 <filedata>
  <InFileName>File 1.m1</InFileName>
  <OutFileName>File 1.m2</OutFileName>
  <InFileSize>22636</InFileSize>
  <TotalProcesstime>1.859000</TotalProcesstime>
  <OutFileSize>77619</OutFileSize>
 </filedata>
 <filedata>
  <InFileName>File 2.m1</InFileName>
  <OutFileName>File 2.m2</OutFileName>
  <InFileSize>20673</InFileSize>
  <TotalProcesstime>3.094000</TotalProcesstime>
  <OutFileSize>94485</OutFileSize>
 </filedata>
 <filedata>
  <InFileName>File 3.m1</InFileName>
  <OutFileName>File 3.m2</OutFileName>
  <InFileSize>66859</InFileSize>
  <TotalProcesstime>3.516000</TotalProcesstime>
  <OutFileSize>217268</OutFileSize>
 </filedata>
</root>

Hope it helps!
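The two helpers this answer relies on, itertools.groupby and Node.toprettyxml, can be tried in isolation. The snippet below is a minimal sketch with made-up input (the list lines and the single InFileName element are only for illustration), not part of the original program.

```python
import itertools as it
import xml.dom.minidom as minidom
import xml.etree.ElementTree as ET

# groupby() yields one (key, group) pair per run of equal neighbours,
# so keeping only the key collapses consecutive duplicates such as
# repeated blank lines.
lines = ['', '', 'a: 1', '', '', '', 'b: 2', '']
collapsed = [key for key, _group in it.groupby(lines)]
print(collapsed)

# toprettyxml() re-indents a document that was built without whitespace.
root = ET.Element('root')
filedata = ET.SubElement(root, 'filedata')
ET.SubElement(filedata, 'InFileName').text = 'File 1.m1'
pretty = minidom.parseString(ET.tostring(root)).toprettyxml(indent=' ')
print(pretty)
```

Collapsing leaves one empty string at each end of the list, so a leading or trailing run of blank lines is shortened to a single blank line rather than removed.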

Source

2017-03-16 15:47:01

Perfect! Just one thing: the generated txt can have several empty lines at the start and at the end, so how can I check for multiple newlines? – naqushab

groupby from itertools should do the trick! I have added an edit for it.
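If the blank lines at the start or end of the file should not produce empty filedata elements, one option is to create each filedata element only when the first data line of a record arrives. The following is a rough sketch under a few assumptions: records are separated by one or more blank lines, the TAGS mapping and the build_tree function are hypothetical names chosen here, and the tag names are the ones from the desired XML in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from the labels in the text file to the tag names
# shown in the question's desired XML.
TAGS = {
    'In File Name': 'InFileName',
    'Out File Name': 'OutFileName',
    'In File Size': 'InFileSize',
    'Out File Size': 'OutFileSize',
    'Total Process time': 'ProcessTime',
}


def build_tree(lines):
    """Build <root><filedata>...</filedata>...</root> from report lines."""
    root = ET.Element('root')
    record = None  # current <filedata>, created only when data arrives
    for raw in lines:
        line = raw.strip()
        if not line:
            record = None  # any run of blank lines just ends the record
            continue
        label, _, value = line.partition(':')
        tag = TAGS.get(label.strip())
        if tag is None:
            continue  # skip lines this sketch does not know about
        if record is None:
            record = ET.SubElement(root, 'filedata')
        value = value.strip()
        if 'Name' in tag:
            value = value.split('\\')[-1]  # keep only the file name
        elif 'Size' in tag:
            value = value.split()[1]  # the number right after "Low:"
        ET.SubElement(record, tag).text = value
    return root


sample = [
    '', '',
    r'In File Name: C:\Users\naqushab\desktop\files\File 1.m1',
    'In File Size: Low: 22636 High: 0',
    'Total Process time: 1.859000',
    '', '', '',
    r'In File Name: C:\Users\naqushab\desktop\files\File 2.m1',
    '',
]
root = build_tree(sample)
print(len(root.findall('filedata')))
```

With this approach groupby is no longer needed, because any number of blank lines only resets the current record. The elements are written in the order the lines appear in the file, which differs slightly from the order in the desired XML.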

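From the docs (emphasis mine):

re.VERBOSE
This flag allows you to write regular expressions that look nicer. Whitespace within the pattern is ignored, except when in a character class or preceded by an unescaped backslash, and, when a line contains a '#' neither in a character class or preceded by an unescaped backslash, all characters from the leftmost such '#' through the end of the line are ignored.

Escape the spaces in your regex, or use the \s class.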
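The effect is easy to reproduce on a single line. The snippet below is a small sketch (the variable names are made up here) comparing an unescaped pattern with two ways of keeping the spaces significant under re.VERBOSE.

```python
import re

line = 'Total Process time: 1.859000'

# With re.VERBOSE the unescaped spaces are dropped, so this pattern is
# effectively 'TotalProcesstime:(?P<value>.*)' and cannot match the line.
broken = re.compile(r'Total Process time: (?P<value>.*)', re.VERBOSE)

# Escaping each space, or writing \s, keeps it significant.
escaped = re.compile(r'Total\ Process\ time:\s(?P<value>.*)', re.VERBOSE)

# A space inside a character class is also preserved.
in_class = re.compile(r'Total[ ]Process[ ]time:[ ](?P<value>.*)', re.VERBOSE)

broken_match = broken.search(line)
escaped_match = escaped.search(line)
class_match = in_class.search(line)

print(broken_match)
print(escaped_match.group('value'))
print(class_match.group('value'))
```

Dropping the re.VERBOSE flag altogether is a third option when the pattern does not need comments or line breaks.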

Source

2017-03-16 16:52:12
