Python IO: splitting words from a text file into a Python array while avoiding escape characters, newlines, and hex values

I am having trouble properly importing and splitting words from a simple txt file.

Text file:

test1 file test1 test1  
test2 test2  
test 3 test3, test3. 

test 4, test 4. 


test 5 


^Ltest 6.

If I do a simple for lines in file: array.append(lines), this is the final array I get:

['test1 file test1 test1\xc2\xa0\n', 'test2 test2\xc2\xa0\n', 'test 3 test3, test3.\n', '\n', 'test 4, test 4.\n', '\n', '\n', 'test 5\n', '\n', '\n', '^Ltest 6.\xc2\xa0\n']

I would like it to look like this instead, with one item per actual English word or escape character, and without the \x__ hex substrings:

['test1', 'file', 'test1', 'test1', '\n', 'test2', 'test2', '\n', 'test', '3', 'test3', 'test3', '.', '\n', '\n', 'test', '4', 'test', '4', '.', '\n', '\n', '\n', 'test', '5', '\n', '\n', '\n', 'test 6', '.', '\n']

Any help is really appreciated. Thanks in advance.

Source

2016-06-16 user3107438

re.match(r'[\w\n\.]+$', w) – gdlmx

Hi, thanks for your help. Could you be a bit more specific about how to interpret this? 're.match' gives me a compile error. Does \w mean the regex [a-zA-Z0-9_]? – user3107438

Yes. Did you run 'import re'? – gdlmx
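As a small illustration of the \w question above: in Python 2, \w without flags is equivalent to [a-zA-Z0-9_]. In Python 3, \w on str patterns is Unicode-aware by default and only reduces to [a-zA-Z0-9_] when re.ASCII is passed. The sketch below assumes Python 3, and the variable names are illustrative only.

```python
import re

# Python 3: \w on str patterns also matches non-ASCII letters by default.
unicode_match = re.fullmatch(r'\w+', 'café') is not None

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so 'é' no longer matches.
ascii_match = re.fullmatch(r'\w+', 'café', re.ASCII) is not None

# A non-breaking space (\xa0) is whitespace, never a word character.
nbsp_match = re.fullmatch(r'\w+', 'test1\xa0') is not None

print(unicode_match, ascii_match, nbsp_match)
```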

Single-line solution:

[w for w in file.read().split() if re.match(r'[\w\n\.]+$',w)]

When parsing large files, it is better to precompile the regex.

import re 
word_ptn = re.compile(r'[\w\n\.]+$') 
[w for w in file.read().split() if word_ptn.match(w)]
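A minimal self-contained sketch of this precompiled filter, assuming Python 3 and using io.StringIO as a stand-in for the opened file. The names sample and words and the shortened sample text are illustrative, not from the question. Note that split() with no arguments already discards newlines, and in Python 3 it also treats the non-breaking space \xa0 as whitespace.

```python
import io
import re

# Same pattern as above: the whole token must consist of word characters,
# newlines, or periods.
word_ptn = re.compile(r'[\w\n\.]+$')

# Stand-in for the opened text file; \xa0 is a non-breaking space.
sample = io.StringIO('test1 file test1 test1\xa0\ntest 3 test3, test3.\n')

# Tokens with other characters (such as 'test3,') are dropped entirely.
words = [w for w in sample.read().split() if word_ptn.match(w)]
print(words)
```

Here the token 'test3,' is filtered out because of the comma, while 'test3.' is kept with its period attached.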

The regex above excludes strings such as 'test1\xc2\xa0\n'. If you want to keep them, extract the matched string from the regex result:

word_ptn = re.compile(r'[\w\n\.]+') 
Lp = (word_ptn.match(w) for w in file.read().split()) 
[ w.group(0) for w in Lp if w ]
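Because split() discards newlines, the snippets above do not produce the separate '\n' and '.' items shown in the question's desired output. One possible alternative, which is not part of the original answer, is re.findall with an alternation. The sketch below assumes Python 3, and token_ptn, tokenize, and the sample string are hypothetical names chosen for illustration.

```python
import re

# Hypothetical pattern: a run of word characters, or a single period or newline.
# Commas, spaces, and non-breaking spaces match neither branch and are skipped.
token_ptn = re.compile(r'\w+|[.\n]')

def tokenize(text):
    """Return words, periods, and newlines as separate list items."""
    return token_ptn.findall(text)

# Shortened sample in the style of the question's file; \xa0 is a non-breaking space.
sample = 'test1 file test1 test1\xa0\ntest 3 test3, test3.\n\ntest 4, test 4.\n'
tokens = tokenize(sample)
print(tokens)
```

For a real file, the text could be read with open(path, encoding='utf-8').read(), where path is a placeholder, so that the \xc2\xa0 byte pair is decoded into a single \xa0 character before tokenizing.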

Source

2016-06-16 19:46:42 gdlmx
