Beautiful Soup fails to recognize UTF-8 encoding in Python 3, IPython 6 console

I'm trying to read an XML document with Beautiful Soup in Python 3.6.2, IPython 6.1.0, on Windows 10, and I can't get the encoding right.

Here is my test XML, saved as a UTF-8 encoded file:

<?xml version="1.0" encoding="UTF-8"?>
<root>
<info name="愛よ">ÜÜÜÜÜÜÜ</info>
<items>
<item thing="ÖöÖö">"23Äßßß"</item>
</items>
</root>

First, I check the XML using ElementTree:

import xml.etree.ElementTree as ET

def printXML(xml, indent=''):
    # Print the tag and its text, then the attributes and children one level deeper.
    print(indent + str(xml.tag) + ': ' + (xml.text if xml.text is not None else '').replace('\n', ''))
    for k, v in xml.attrib.items():
        print(indent + '\t' + k + ' - ' + v)
    # Iterate the element directly; getchildren() was removed in Python 3.9.
    for child in xml:
        printXML(child, indent + '\t')

xml0 = ET.parse("test.xml").getroot()
printXML(xml0)

The output is correct:

root:
     info: ÜÜÜÜÜÜÜ
       name - 愛よ
     items:
       item: "23Äßßß"
         thing - ÖöÖö

Now I read the same file with Beautiful Soup and pretty-print it:

import bs4 

with open("test.xml") as ff: 
    xml = bs4.BeautifulSoup(ff,"html5lib") 
print(xml.prettify())

Output:

<!--?xml version="1.0" encoding="UTF-8"?--> 
<html> 
<head> 
</head> 
<body> 
    <root> 
    <info name="æ„›ã‚ˆ"> 
    ÃœÃœÃœÃœÃœÃœÃœ 
    </info> 
    <items> 
    <item thing="Ã–Ã¶Ã–Ã¶"> 
    "23Ã„ÃŸÃŸÃŸ" 
    </item> 
    </items> 
    </root> 
</body> 
</html>

This is simply wrong. Making the call with an explicitly specified encoding (bs4.BeautifulSoup(ff,"html5lib",from_encoding="UTF-8")) does not change the result.

Running

print(xml.original_encoding)

outputs

None

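So Beautiful Soup is apparently unable to detect the original encoding, even though the file is encoded in UTF-8 (according to Notepad++), the header information says UTF-8 as well, and I have chardet installed, as the doc recommends.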

Am I missing something here? What could be causing this?

EDIT: As suggested in the comments, I also tried bs4.BeautifulSoup(ff,"html.parser"), but the problem remains. When I call the code without specifying html5lib, I get this warning:

UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html5lib"). 
This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, 
it may use a different parser and behave differently. 

The code that caused this warning is on line 241 of the file C:\Users\My.Name\AppData\Local\Continuum\Anaconda2\envs\Python3\lib\site-packages\spyder\utils\ipython\start_kernel.py. 
To get rid of this warning, change code that looks like this: 

BeautifulSoup(YOUR_MARKUP}) 

to this: 

BeautifulSoup(YOUR_MARKUP, "html5lib") 

    markup_type=markup_type))

EDIT 2:

Next, I installed lxml and tried bs4.BeautifulSoup(ff,"lxml-xml"), but the output is still the same. What also strikes me as odd is that even when I specify the encoding, as in bs4.BeautifulSoup(ff,"lxml-xml",from_encoding='UTF-8'), the value of xml.original_encoding is None, contrary to what is written in the doc.

EDIT 3:

I put my XML content into a string

xmlstring = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root><info name=\"愛よ\">ÜÜÜÜÜÜÜ</info><items><item thing=\"ÖöÖö\">\"23Äßßß\"</item></items></root>"

and used bs4.BeautifulSoup(xmlstring,"lxml-xml"), and now I get the correct output:

<?xml version="1.0" encoding="utf-8"?> 
<root> 
<info name="愛よ"> 
    ÜÜÜÜÜÜÜ 
</info> 
<items> 
    <item thing="ÖöÖö"> 
    "23Äßßß" 
    </item> 
</items> 
</root>

So it seems that something is wrong with the file after all.

Source

2017-08-22 Khris

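Why are you trying to process an XML file with an html5 parser? Look at the first line of the bs4 output - can you see the problem this has caused? – ekhumoro

This is not an HTML5 document, so the decoding probably could not work as expected. – glenfant

When I run the code without it, I get this: 'UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html5lib").' – Khris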

It turns out the mistake was mine: I need to specify the encoding when opening the file.

with open("test.xml",encoding='UTF-8') as ff: 
    xml = bs4.BeautifulSoup(ff,"html5lib")

I am on Python 3 and thought the default value of encoding was UTF-8, but it turns out to be system dependent, and on my system it is cp1252.
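To illustrate the point, here is a minimal sketch using only the standard library. It assumes a temporary file as a stand-in for test.xml; the names XML_TEXT, MOJIBAKE and path are hypothetical and not from the original post. It reads the same UTF-8 bytes three ways: decoded as cp1252 (which reproduces the garbled attribute from the question), decoded as UTF-8, and in binary mode so that the XML parser can honor the declaration itself, which is presumably why the ElementTree check worked.

```python
import locale
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample, shortened from the XML in the question.
XML_TEXT = '<?xml version="1.0" encoding="UTF-8"?><root><info name="愛よ">ÜÜÜÜÜÜÜ</info></root>'

# The UTF-8 bytes of 愛よ read as cp1252 (the garbled form shown in the question).
MOJIBAKE = "\u00e6\u201e\u203a\u00e3\u201a\u02c6"

# Write the sample as UTF-8 bytes to a temporary file (stand-in for test.xml).
fd, path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "wb") as f:
    f.write(XML_TEXT.encode("utf-8"))

# The locale-dependent encoding that text-mode open() falls back to by default.
print("default text encoding:", locale.getpreferredencoding(False))

# Decoding the UTF-8 bytes as cp1252 garbles every non-ASCII character.
with open(path, encoding="cp1252") as f:
    garbled = f.read()

# Naming the encoding explicitly gives the intended text on any system.
with open(path, encoding="utf-8") as f:
    correct = f.read()

# Binary mode hands raw bytes to the parser, which reads the XML declaration.
with open(path, "rb") as f:
    root = ET.fromstring(f.read())

os.remove(path)

print(MOJIBAKE in garbled)
print(correct == XML_TEXT)
print(root.find("info").get("name") == "愛よ")
```

Because the file is decoded by open() before any parser sees it, a parser that receives already-decoded text has no bytes left to detect an encoding from, which would be consistent with original_encoding being None in the question.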

Source

2017-08-23 06:20:59 Khris

This was helpful to me. I want to run 'responseObject = requests.get(someUrl)' and then 'soup = BeautifulSoup(responseObject.text, "html.parser")'. It seems I need to insert 'responseObject.encoding = responseObject.apparent_encoding' (or '= "utf-8"') between these two lines to fix the encoding. Trying to fix the encoding after creating 'soup' does not work. Like Khris, I am using Windows 10 and Python 3.6. – user1310503
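The same principle can be sketched without any network access. The snippet below uses a plain bytes object as a hypothetical stand-in for the raw response body, and assumes the text was decoded as ISO-8859-1 (the comment does not say which encoding was actually applied). It only shows that the choice of codec has to be right before the text reaches a parser.

```python
# Hypothetical stand-in for the raw bytes of an HTTP response body.
body = "<p>ÖöÖö</p>".encode("utf-8")

# Decoding with the wrong codec garbles the text before any parser sees it.
# ISO-8859-1 is an assumption here; each byte becomes exactly one character.
wrong_text = body.decode("iso-8859-1")

# Decoding with the right codec first gives the parser clean text.
right_text = body.decode("utf-8")

print(wrong_text == right_text)
print(len(wrong_text), len(right_text))
```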
