'UCS-2' codec can't encode characters

I'm trying to read a text file, but I get this error: 'UCS-2' codec can't encode characters

UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 12416-12416: Non-BMP character not supported in Tk

I also tried to ignore the error, but that didn't work. Here is the code:

import io

with io.open('reviews1.txt', mode='r', encoding='utf-8') as myfile:
    document1 = myfile.read().replace('\n', '')
print(document1)

Source

2017-07-07 Maitreya Patel

Try the ['surrogateescape' error handler](http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html#unicode-error-handlers)? In any case, please edit your question and show the full traceback. – JosefZ

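The problem is not with reading the file (that would be a *de*coding error). It's the 'print' expression: your environment apparently can't handle characters outside the [BMP](https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane), such as emoticons. Is writing to a file instead an option? – lenz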

I can reproduce the error in the Python 3.5 IDLE environment. However, the script runs smoothly from the console (Windows 'cmd' in my case). The error is related to 'print'. – JosefZ

-1

I can reproduce the error in the Python IDLE environment (Python version 3.5.1, Tk version 8.6.4, IDLE version 3.5.1). It seems to be a bug in Tk. However, the original script runs smoothly from the console (Windows cmd; in my case Python 3.5.1 (v3.5.1:37a07cee5969, Dec 6 2015, 01:54:25) [MSC v.1900 64 bit (AMD64)] on win32).

The only way I can see is to eliminate everything outside the Basic Multilingual Plane by going through the whole document character by character, which could be very slow; see the commented script that follows.

I found this (more Python-ish) solution (thanks to Mark Ransom). Unfortunately, it runs in the Python shell, but the Python console complains:

>>> print(''.join(c if c <= '\uffff' else ''.join(chr(x) for x in struct.unpack(
...     '>2H', c.encode('utf-16be'))) for c in document1)
... )
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Python\Python35\lib\site-packages\win_unicode_console\streams.py", line 179, in write
    return self.base.write(s)
UnicodeEncodeError: 'utf-16-le' codec can't encode character '\ud83d' in position 0: surrogates not allowed
>>>
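To make the '\ud83d' in that traceback less mysterious, here is a minimal sketch of what the `struct.unpack('>2H', ...)` step produces for a single non-BMP character. The helper name `to_surrogates` and the sample character U+1F60A are illustrative assumptions, not part of the original script.

```python
import struct

def to_surrogates(ch):
    """Split one non-BMP character into its two UTF-16 code units.

    Hypothetical helper for illustration; it mirrors the struct.unpack
    call used in the one-liner above.
    """
    # UTF-16BE encodes a non-BMP character as 4 bytes: a high and a low surrogate.
    return struct.unpack('>2H', ch.encode('utf-16be'))

high, low = to_surrogates('\U0001F60A')
print(hex(high), hex(low))

# A lone surrogate is not a valid character on its own, so a strict
# UTF-16 encoder (like the one behind the console stream) rejects it.
try:
    chr(high).encode('utf-16-le')
except UnicodeEncodeError as exc:
    print('rejected:', exc.reason)
```

The Tk-based shell apparently accepts the two surrogate code units as separate characters, while a stream that re-encodes the text strictly as UTF-16 does not, which would explain why the same expression behaves differently in the two environments.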


# -*- coding: utf-8 -*-
import sys, io
import os, codecs                                            # for debugging

print(os.path.basename(sys.executable), sys.argv[0], '\n')   # for debugging

#######################
### original answer ###
#######################
filepath = 'D:\\test\\reviews1.txt'
with io.open(filepath, mode='r',encoding='utf-8') as myfile:
    document1=myfile.read() #.replace('\n', '')

document2=u''
for character in document1:
    ordchar = ord(character)
    if ordchar <= 0xFFFF:
        # debugging # print('U+%.4X' % ordchar, character)
        document2+=character
    else:
        # debugging # print('U+%.6X' % ordchar, '�')
        ### �=Replacement Character; codepoint=U+FFFD; utf8=0xEFBFBD
        document2+='�'

print(document2)                    # original answer, runs universally

######################
### updated answer ###
######################
if os.path.basename(sys.executable) == 'pythonw.exe':
    import struct
    document3 = ''.join(c if c <= '\uffff' else ''.join(chr(x) for x in struct.unpack('>2H', c.encode('utf-16be'))) for c in document1)
    print(document3)                # Pythonw shell
else:
    print(document1)                # Python console

Output, Pythonw shell:

================== RESTART: D:/test/Python/Py/q44965129a.py ==================
pythonw.exe D:/test/Python/Py/q44965129a.py

� smiling face with smiling eyes �
� smiling face with open mouth �
� angry face �
smiling face with smiling eyes
smiling face with open mouth
angry face
>>>

Output, Python console:

==> D:\test\Python\Py\q44965129a.py
python.exe D:\test\Python\Py\q44965129a.py

� smiling face with smiling eyes �
� smiling face with open mouth �
� angry face �
smiling face with smiling eyes
smiling face with open mouth
angry face
==>

Source

2017-07-07 20:55:37 JosefZ

Thank you for helping me. –

What's the reason for using 'io.open' when the built-in 'open' is the same thing? And why check every character manually when 'str.encode' offers the errors='replace' scheme? Don't reinvent the wheel... – lenz

@lenz **1**. I know that ['io.open' is an alias for the built-in open() function](https://docs.python.org/3.5/library/io.html). That's the OP's design... **2**. Did you try to apply an error handler? Apparently you didn't: you "could not test this" ([sic](https://en.wikipedia.org/wiki/Sic)). – JosefZ

The problem is not with reading the file (that would be a decoding error). It's the print expression: your environment can't handle non-BMP characters such as emoticons.

If you want to output these characters to STDOUT, you could check whether your shell/IDE supports an encoding that covers all of Unicode (UTF-8, UTF-16...). Or switch to a different environment for running the script.
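One rough way to check what the current environment claims to support is sketched below. The helper name `can_encode` and the sample string are assumptions made for illustration. Note that this only tests the encoding that `sys.stdout` declares; a Tk-based shell such as IDLE may still reject non-BMP characters at display time.

```python
import sys

def can_encode(text, encoding):
    """Return True if `text` can be encoded with `encoding` without errors.

    Hypothetical helper for illustration only.
    """
    try:
        text.encode(encoding)
        return True
    except (UnicodeEncodeError, LookupError):
        # LookupError covers encoding names the interpreter does not know.
        return False

sample = 'smiling face \U0001F60A'

# sys.stdout.encoding can be missing or None in some setups; fall back to ASCII.
stdout_encoding = getattr(sys.stdout, 'encoding', None) or 'ascii'
print('stdout encoding:', stdout_encoding)
print('sample encodable:', can_encode(sample, stdout_encoding))
```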

If you want to run it in the same setup, you can encode the data yourself, which gives you the option of specifying custom error handling:

import sys

# Untested: standard CPython may not know a codec named 'UCS-2' (LookupError).
data = document1.encode('UCS-2', errors='replace')
sys.stdout.buffer.write(data)

This will replace unsupported characters with ? or some other character. You can also specify errors='ignore' to suppress those characters altogether.

My codec library doesn't know the UCS-2 encoding, so I could not test this. It's an old standard that Windows used up to NT.
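Since a codec named 'UCS-2' may not be available, the same "replace what the display can't show" idea can be sketched without any codec by using a regular expression. This is a different technique from `str.encode(..., errors='replace')`, and the names `NON_BMP` and `replace_non_bmp` are assumptions made for this example.

```python
import re

# Matches any character outside the Basic Multilingual Plane (above U+FFFF).
NON_BMP = re.compile('[\U00010000-\U0010FFFF]')

def replace_non_bmp(text, replacement='\ufffd'):
    """Replace every non-BMP character; pass '' to drop them instead.

    Hypothetical helper for illustration only.
    """
    return NON_BMP.sub(replacement, text)

document1 = 'angry face \U0001F620'
# ascii() keeps the demo output printable in any environment.
print(ascii(replace_non_bmp(document1)))
print(ascii(replace_non_bmp(document1, '')))
```

Passing the default replacement roughly corresponds to errors='replace', and passing an empty string to errors='ignore'.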

Source

2017-07-07 14:52:37 lenz
