How to read a UTF-8 encoded text file using Python

I need to parse a Tamil (UTF-8 encoded) text file. I am using Python's nltk package in the IDLE interface. When I try to read the text file in IDLE, I get the error shown below. How can I avoid it?

corpus = open('C:\\Users\\Customer\\Desktop\\DISSERTATION\\ettuthokai.txt').read() 

Traceback (most recent call last): 
    File "<pyshell#2>", line 1, in <module> 
    corpus = open('C:\\Users\\Customer\\Desktop\\DISSERTATION\\ettuthokai.txt').read() 
    File "C:\Users\Customer\AppData\Local\Programs\Python\Python35-32\lib\encodings\cp1252.py", line 23, in decode 
    return codecs.charmap_decode(input,self.errors,decoding_table)[0] 
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 33: character maps to <undefined>

Source

2016-12-01 Ramprashanth

I haven't read your question in full, but... if you have a load of bytes, you can decode them into a string with 'your_bytes.decode("utf-8")'. – byxor
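To illustrate the comment above, here is a minimal sketch of decoding raw bytes by hand. The sample word and the variable names (`raw_bytes`, `text`, `cp1252_failed`) are assumptions made for illustration and are not taken from the asker's file. It also shows why the traceback mentions byte 0x8d: that byte occurs in the UTF-8 encoding of the Tamil pulli sign (U+0BCD) and has no mapping in Python's cp1252 codec.

```python
# Hypothetical sample: the Tamil word "Tamil" written in Tamil script,
# given as escapes so the snippet does not depend on the source encoding.
sample_word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

# These are the bytes a UTF-8 encoded file would contain for that word.
raw_bytes = sample_word.encode("utf-8")

# Decoding with the matching codec recovers the original string.
text = raw_bytes.decode("utf-8")

# Decoding with cp1252 (the codec named in the traceback) fails,
# because byte 0x8d is undefined in that code page.
try:
    raw_bytes.decode("cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

print(text == sample_word, cp1252_failed)
```

Decoding by hand like this is mostly useful when the data arrives as bytes (for example from a file opened in binary mode); for ordinary text files, passing an encoding to open() is usually simpler.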

Which Python version are you using? –

@AntonisChristofides - From the traceback, I infer Python 3. –

Since you are using Python 3, just add the encoding parameter to open():

corpus = open('C:\\Users\\Customer\\Desktop\\DISSERTATION\\ettuthokai.txt', 
       encoding='utf-8').read()
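The same idea can be written with a with block so the file is closed automatically. This is only a sketch: the temporary file and its name `sample_tamil.txt` are hypothetical stand-ins for ettuthokai.txt, created here so the example runs on its own.

```python
import os
import tempfile

# Hypothetical stand-in for the asker's file: a short UTF-8 Tamil sample.
sample = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd\n"
path = os.path.join(tempfile.mkdtemp(), "sample_tamil.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(sample)

# Pass encoding explicitly so the platform default (cp1252 in the
# traceback above) is never used; the with block closes the file.
with open(path, encoding="utf-8") as f:
    corpus = f.read()

print(corpus == sample)
```

The resulting `corpus` is an ordinary str, so it can be handed to whatever processing follows.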

Source

2016-12-01 19:14:36

This only works in Python 3 and later. In Python 2, use 'codecs.open'. –
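A rough sketch of the codecs.open approach mentioned in the comment above follows. The temporary file is a hypothetical stand-in for the real corpus, and the snippet is written so that it should also run on Python 3, where codecs.open still exists but the built-in open() with an encoding argument is generally preferred.

```python
import codecs
import os
import tempfile

# Hypothetical stand-in for the asker's file, written as raw UTF-8 bytes.
sample = u"\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"
path = os.path.join(tempfile.mkdtemp(), "sample_tamil.txt")
with open(path, "wb") as f:
    f.write(sample.encode("utf-8"))

# codecs.open decodes while reading, so read() returns text
# (unicode in Python 2, str in Python 3) rather than raw bytes.
with codecs.open(path, "r", encoding="utf-8") as f:
    corpus = f.read()

print(corpus == sample)
```

In Python 2.6 and later, io.open accepts the same encoding argument and is another option.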
