Getting clean text from a text/html document using BeautifulSoup

I have a document with two content types: text/xml and text/html. I would like to use BeautifulSoup to parse the document and end up with a clean text version. The document starts out as a tuple, so I use repr to turn it into something BeautifulSoup recognizes, and then use find_all to pick out the text/html part of the document by searching for divs, like so:

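from bs4 import BeautifulSoup

# msg_data is the tuple the document arrives in
soup = BeautifulSoup(repr(msg_data))
text = soup.html.find_all("div")

# round-trip the matched divs through a string and a second soup
str_text = str(text)
soup_text = BeautifulSoup(str_text)
soup_text.get_text()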

But then things get strange.

I save the result to a variable, turn it back into a soup object, and call get_text on it, which hands the text back as a unicode string like this:

u'[9:16 PM\xa0Erica: with images, \xa0\xa0and that seemed long to me anyway, 9:17  
PM\xa0me: yeah, \xa0Erica: so feel free to make it shorter, \xa0\xa0or rather, please do, 
9:18 PM\xa0nobody wants to read about that shit for 2 pages, \xa0me: :), \xa0Erica: while 
browsing their site, \xa0me: srsly, \xa0Erica: unless of course your writing is magic, 
\xa0me: My writing saves drowning puppies, \xa0\xa0Just plucks him right out and gives 
them a scratch behind the ears and some kibble, \xa0Erica: Maine is weird, \xa0me: haha]'

I then try re-encoding it as UTF-8, like so:

soup.encode('utf-8')

and I end up back with the unparsed type.

What I want is to have the clean text saved as a string, so that I can then find specific things in it (for example, "puppies" in the sample above).

Basically, I'm running in circles here. Can anybody help? As always, thanks for any help you can give.

Source

2012-03-18 spikem

The encoding isn't mangled. It is exactly what it should be: '\xa0' is the Unicode code point for a non-breaking space.
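As a quick check, the standard library's unicodedata module can confirm what that character is, and an ordinary substring search works on the text as it stands, with no re-encoding. The following is a small sketch written in Python 3 syntax (the thread itself is from the Python 2 era); `sample` is a shortened excerpt of the output in the question, and the variable names are only illustrative.

```python
import unicodedata

# Shortened excerpt of the get_text() output from the question.
sample = u'\xa0me: My writing saves drowning puppies, \xa0\xa0Just plucks him right out'

# Look up the official Unicode name of the character behind '\xa0'.
nbsp_name = unicodedata.name(u'\xa0')
print(nbsp_name)  # NO-BREAK SPACE

# Searching needs no re-encoding; the non-breaking spaces do not get in the way.
has_puppies = u'puppies' in sample
position = sample.find(u'puppies')
print(has_puppies, position)
```

So the goal of finding "puppies" in the clean text can be met on the unicode string directly.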

If you want to encode this (unicode) string as ASCII, you can tell the codec to ignore any characters it doesn't understand:

>>> x = u'[9:16 PM\xa0Erica: with images, \xa0\xa0and that seemed long to me anyway, 9:17 PM\xa0me: yeah, \xa0Erica: so feel free to make it shorter, \xa0\xa0or rather, please do, 9:18 PM\xa0nobody wants to read about that shit for 2 pages, \xa0me: :), \xa0Erica: while browsing their site, \xa0me: srsly, \xa0Erica: unless of course your writing is magic, \xa0me: My writing saves drowning puppies, \xa0\xa0Just plucks him right out and gives them a scratch behind the ears and some kibble, \xa0Erica: Maine is weird, \xa0me: haha]' 
>>> x.encode('ascii', 'ignore') 
'[9:16 PMErica: with images, and that seemed long to me anyway, 9:17 PMme: yeah, Erica: so feel free to make it shorter, or rather, please do, 9:18 PMnobody wants to read about that shit for 2 pages, me: :), Erica: while browsing their site, me: srsly, Erica: unless of course your writing is magic, me: My writing saves drowning puppies, Just plucks him right out and gives them a scratch behind the ears and some kibble, Erica: Maine is weird, me: haha]'
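Note that 'ignore' simply drops each non-breaking space, which is why "PM" and "Erica" run together in the output above. If the word boundaries matter, one alternative (not part of the original answer) is to swap the character for a plain space before encoding. A minimal sketch in Python 3 syntax, where `to_plain_spaces` is a hypothetical helper name and `x` is a shortened version of the string above:

```python
import unicodedata

def to_plain_spaces(text):
    """Replace non-breaking spaces (U+00A0) with ordinary spaces."""
    return text.replace(u'\xa0', u' ')

x = u'[9:16 PM\xa0Erica: with images, \xa0\xa0and that seemed long to me anyway]'

cleaned = to_plain_spaces(x)
print(cleaned)

# NFKC normalization also maps U+00A0 to a regular space, but it rewrites
# other compatibility characters too, so it is a broader change.
normalized = unicodedata.normalize('NFKC', x)

# After the replacement the text is pure ASCII, so a strict encode succeeds
# without the 'ignore' error handler.
ascii_bytes = cleaned.encode('ascii')
```

The explicit replace touches only the one character, while the NFKC route is worth considering only if other compatibility characters should be folded as well.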

If you have time, you should watch Ned Batchelder's recent video, Pragmatic Unicode. It makes everything clear and simple!

Source

2012-03-18 19:59:05 katrielalex

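Yes, that occurred to me just as I posted it. Thanks for the video, I'll take a look. Do you have any text resources I could read? (I know that's only a Google search away, but is there one you particularly like?) – spikem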

What do you expect? You have a string with non-ASCII characters in it (the non-breaking spaces). You can't magic them away. – katrielalex

I suppose I wasn't asking for that, or perhaps I was expecting them to be magicked away. I'm not very familiar with Unicode. – spikem
