Python: removing HTML tags from output

I'm working in Python 2, and I have the following script:

from bs4 import BeautifulSoup 
import requests, re 

page = "http://hidden.com/example" 
headers = {'User-Agent': 'Craig'} 
html = requests.post(page, headers=headers) 

soup = BeautifulSoup(html.text, "html.parser") 

final = soup.find('p',{'class':'text'}) 

print final

This runs against a website that I'm not going to post publicly, and it returns this:

<p>Example text <a href="example">Example</a> more example <a href="second example">Second example</a></p>

How do I remove the <p> and <a href=""> tags, along with any other tags that might be lurking in there?

Source

2017-01-15 Hugh Adam Chalmers

-1

I'd suggest using a regular expression to find the HTML tags and replace them with an empty string.

reg = r'</?[^>]+>'

This seems to work.
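For reference, here is a minimal sketch of how a pattern like this could be applied with re.sub. The exact pattern and the helper name strip_tags are assumptions for illustration, and the sample string is the output quoted in the question.

```python
import re

# Assumed pattern: "<", an optional "/", then everything up to the next ">".
TAG_RE = re.compile(r'</?[^>]+>')


def strip_tags(markup):
    """Return markup with every tag-like span replaced by an empty string.

    strip_tags is a hypothetical helper name, not part of any library.
    """
    return TAG_RE.sub('', markup)


# Sample markup copied from the question.
sample = ('<p>Example text <a href="example">Example</a> more example '
          '<a href="second example">Second example</a></p>')

print(strip_tags(sample))
```

This only holds up for simple markup like the sample. A ">" inside an attribute value, an HTML comment, or script content will trip it up.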

Source

2017-01-15 18:37:49 BloomBlack

"Parsing HTML with regular expressions leads to pitfalls." http://stackoverflow.com/questions/3790681/regular-expression-to-remove-html-tags – DyZ

Most bs4 tags have a .strings attribute, a generator that yields every string inside the tag.

print(''.join(final.strings)) 
# Example text Example more example Second example
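A self-contained sketch of the same idea, assuming the markup from the question (the class="text" attribute is added here so that the question's find() call matches). get_text() and .stripped_strings are other Beautiful Soup accessors, shown only for comparison.

```python
from bs4 import BeautifulSoup

# Sample markup based on the question; class="text" is an assumption
# added so that find('p', {'class': 'text'}) has something to match.
html = ('<p class="text">Example text <a href="example">Example</a> '
        'more example <a href="second example">Second example</a></p>')

soup = BeautifulSoup(html, "html.parser")
final = soup.find('p', {'class': 'text'})

# Join every string found inside the tag, at any nesting depth.
joined = ''.join(final.strings)

# get_text() with no arguments concatenates the same strings.
text = final.get_text()

# stripped_strings yields each string with surrounding whitespace removed.
pieces = list(final.stripped_strings)

print(text)
```

If the pieces need a separator between them, ' '.join(final.stripped_strings) is one way to normalize the spacing.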

Source

2017-01-15 18:44:23 DyZ
