Pythonでファイルからスペースを除く特殊文字を削除するには？

私は巨大なコーパスのテキストを（行ごとに）持っており、特殊文字を削除したいが、文字列のスペースと構造を維持したい。Pythonでファイルからスペースを除く特殊文字を削除するには？

hello? there A-Z-R_T(,**), world, welcome to python. 
this **should? the next line#followed- [email protected] an#other %million^ %%like $this.

hello there A Z R T world welcome to python 
this should be the next line followed by another million like this

出典

2017-04-12 pythonlearn

あなたが望む文字のリストを作成するだけで、AZ、az、0-9など。そして、リスト内にない文字をスペースで置き換えて文字列内の各文字を繰り返し処理するために 'for'ループを使います。 – Wright

は、百万行のテキストの膨大なコーパスに対して効率的ですか？ – pythonlearn

あなたがregexで、あまりにも、このパターンを使用することができますする必要があります：

from re 
a = '''hello? there A-Z-R_T(,**), world, welcome to python. 
this **should? the next line#followed- [email protected] an#other %million^ %%like $this.''' 

for k in a.split("\n"): 
    print(re.sub(r"[^a-zA-Z0-9]+", ' ', k)) 
    # Or: 
    # final = " ".join(re.findall(r"[a-zA-Z0-9]+", k)) 
    # print(final)

出力：

hello there A Z R T world welcome to python 
this should the next line followed by an other million like this

編集：

そうしないと、あなたはlistに最終ラインを保存することができます：私はNFNニールの答えは偉大だと思う...しかし、私は単純な正規表現を追加します

['hello there A Z R T world welcome to python ', 'this should the next line followed by an other million like this ']

出典

2017-04-12 01:57:25

これは仕事ですが、すべての行を1つの非常に長い行にまとめることを防ぐ方法はありますか？ – pythonlearn

答えを更新しました。今すぐチェックしてください。 –

：

final = [re.sub(r"[^a-zA-Z0-9]+", ' ', k) for k in a.split("\n")] 
print(final)

が出力すべての言葉の文字を削除していない、しかし、それは辞書のマッピング特別CHARACを作成単語

print re.sub(r'\W+', ' ', string) 
>>> hello there A Z R_T world welcome to python

出典

2017-04-12 02:02:31 Eliethesaiyan

の一部として強調して頂きますなし

d = {c:None for c in special_characters}

translation tableを辞書を使用して作成します。テキスト全体を変数に読み込み、テキスト全体にstr.translateを使用します。

出典

2017-04-12 03:08:40 wwii

Pythonでファイルからスペースを除く特殊文字を削除するには？

答えて

関連する問題