Extracting text using BeautifulSoup

I'm trying to extract the text from an old web page, but I've run into a problem. When I inspect the source of the page (http://www.presidency.ucsb.edu/ws/index.php?pid=119039), the text begins like this:

> </div></div><span class="displaytext"><b>PARTICIPANTS:</b><br>Former Secretary of State 
> Hillary Clinton (D) and<br>Businessman Donald Trump 
> (R)<p><b>MODERATOR:</b><br>Chris Wallace (Fox News)<p><b>WALLACE:</b> 
> Good evening from the Thomas and Mack Center at the University of 
> Nevada, Las Vegas. I'm Chris Wallace of Fox News, and I welcome you to 
> the third and final of the 2016 presidential debates between Secretary 
> of State Hillary Clinton and Donald J. Trump.<p>

I'm trying to extract the text using:

import requests
from bs4 import BeautifulSoup

link = "http://www.presidency.ucsb.edu/ws/index.php?pid=119039"
debate_response = requests.get(link)
debate_soup = BeautifulSoup(debate_response.content, 'html.parser')
debate_text = debate_soup.find_all('div',{'span class':"displaytext"})
print(debate_text)

However, this just returns an empty list. Any ideas on how I can extract the text?

Source

2017-11-25 Adam_G

I got a maximum recursion error with html.parser, so I had to use lxml as the parser. The following extracts all of the text from the children of the <span> tag into a single string:

debate_soup = BeautifulSoup(debate_response.content, 'lxml') 
debate_text = debate_soup.find('span', {'class': 'displaytext'}).get_text()
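
For context on why the original call came back empty: find_all('div', {'span class': "displaytext"}) searches for <div> tags that carry an attribute literally named "span class", and no such tag exists. The tag name goes in the first argument and the attribute filter in the second. Here is a minimal sketch of the difference that runs without a network request. SAMPLE_HTML is a shortened, hypothetical stand-in for the page source quoted in the question, and it uses html.parser only because the snippet is tiny; for the full page, lxml is what worked for me.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the real page source (assumption:
# the real page has the same <span class="displaytext"> wrapper).
SAMPLE_HTML = (
    '<div><span class="displaytext"><b>PARTICIPANTS:</b><br>'
    'Former Secretary of State Hillary Clinton (D) and<br>'
    'Businessman Donald Trump (R)<p><b>MODERATOR:</b><br>'
    'Chris Wallace (Fox News)<p><b>WALLACE:</b> Good evening.</span></div>'
)

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Original attempt: looks for <div> tags with an attribute named "span class".
wrong = soup.find_all('div', {'span class': 'displaytext'})
print(wrong)

# Corrected lookup: tag name first, then the attribute filter.
span = soup.find('span', {'class': 'displaytext'})

# find() returns None when nothing matches, so guard before calling get_text().
# separator/strip put each text fragment on its own line instead of running
# them together.
text = span.get_text(separator='\n', strip=True) if span is not None else ''
print(text)
```

Passing separator='\n' is optional; a plain get_text() as in the answer above concatenates the fragments with nothing between them, which may be fine if you only need the raw text.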

Source

2017-11-25 03:59:54 Jay

Exactly what I needed. Thank you.
