2017-12-31 122 views

I need some help using the BeautifulSoup library for web scraping. Specifically, I have to extract the text from the web page http://thehill.com/…/365407-sean-diddy-combs-wants-to-buy-c… using BeautifulSoup.

My goal is to extract the text exactly as it appears on the web page. I am extracting all the "p" tags and their text, but some of the "p" tags contain "a" tags that also hold text.

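My questions:

  1. How do I convert the Unicode (u"") text into a normal string, as it appears on the web page? When I extract just the "p" tags, the BeautifulSoup library converts the text to Unicode, and even the special characters come out as Unicode escapes, so I want to convert the extracted Unicode text into normal text. How can I do that?

  2. How do I extract the text inside a "p" tag that contains "a" tags? I want the complete text of the "p" tag, including the text inside the nested tags.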

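I tried the following code:

import requests
from bs4 import BeautifulSoup

html = requests.get("http://thehill.com/…/365407-sean-diddy-combs-wants-to-buy-c…").content
news_soup = BeautifulSoup(html, "html.parser")
a_text = news_soup.find_all('p')

# Fails with AttributeError: find_all() returns a list-like ResultSet, which has no .string
y = a_text[1].find_all('a').string

Answer

You can find all the links inside the paragraph tags with a nested list comprehension, and use encode("ascii", 'ignore') to drop the non-ASCII characters: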

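# Python 3 port of the original Python 2 snippet (urllib.urlopen is now urllib.request.urlopen)
from urllib.request import urlopen
from bs4 import BeautifulSoup as soup

url = 'http://thehill.com/blogs/blog-briefing-room/365407-sean-diddy-combs-wants-to-buy-carolina-panthers-and-sign-kaepernick'
s = soup(urlopen(url).read(), 'lxml')
# Text of every <p>, nested tags included; non-ASCII characters are dropped, then bytes are turned back into str
all_text = [i.text.encode("ascii", 'ignore').decode("ascii") for i in s.find_all('p')]
# Link texts grouped per paragraph; filter(None, ...) discards paragraphs that contain no links
all_paragraphs = list(filter(None, [[b.text.encode("ascii", 'ignore').decode("ascii") for b in i.find_all('a')] for i in s.find_all('p')]))
print(all_text)
print(all_paragraphs)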

Output:

['Hip hop mogul Sean Diddy Combs said Sunday night hes interested in buying the Carolina Panthers and signing quarterback Colin Kaepernick, who has been unemployed this season after kneeling during the national anthem in 2016.', 'Panthers owner Jerry Richardson announced Sunday he would be selling the team after the 2017 season, just hours after Sports Illustrated published accusations of sexual misconduct from former employees. Richardson also allegedly used a racial slur about a team scout.', 'Diddy took to Twitter soon after the Panthers announced the upcoming sale, declaring his desire to own a team and increase diversity among NFL ownership.', 'I would like to buy the @Panthers. Spread the word. Retweet!', 'There are no majority African American NFL owners. Lets make history.', '', 'Kaepernick respondedSundaymorning, saying I want in on the ownership group!', 'I want in on the ownership group! Lets make it happen!', 'Other athletes, including NBA starStephen Curryandformer NFL playerGreg Jennings,responded to Combs saying they were interested in part-owning the team.', "Former league MVP Cam Newton is the team's current quarterback.", 'Kaepernick has been a free agent since the end of the 2016 season, when he made headlinesfor kneeling during the national anthem before games to protest issues of racial inequality.', 'President TrumpDonald John TrumpHouse Democrat slams Donald Trump Jr. for serious case of amnesia after testimony Skier Lindsey Vonn: I dont want to represent Trump at Olympics Poll: 4 in 10 Republicans think senior Trump advisers had improper dealings with Russia MORE hascriticized Kaepernick directly, saying the NFL should have suspended him for the demonstration. He has since taken aim at other players who have knelt or sat during the anthem during the 2017 season.', '- This story was updated at 11:03 A.M. EST.', 'View the discussion thread.', 'The Hill 1625 K Street, NW Suite 900 Washington DC 20006 | 202-628-8500 tel | 202-628-8503 fax', 'The contents of this site are 2017 Capitol Hill Publishing Corp., a subsidiary of News Communications, Inc.']
[['Sports Illustrated'], ['@Panthers'], ['Stephen Curry', 'former NFL player'], ['President Trump', 'Donald John Trump', 'House Democrat slams Donald Trump Jr. for serious case of amnesia after testimony', 'Skier Lindsey Vonn: I dont want to represent Trump at Olympics', 'Poll: 4 in 10 Republicans think senior Trump advisers had improper dealings with Russia', 'MORE', 'criticized Kaepernick directly', 'knelt or sat'], ['View the discussion thread.']] 
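
For the second question, a small offline check may help show that a paragraph's text already includes the text of its nested links, and why the attempted `find_all('a').string` fails. This is only a minimal sketch on a made-up HTML fragment: the `sample_html` string and the variable names are assumptions for illustration, not content taken from the page. It assumes Python 3 and the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the real page
sample_html = (
    "<div>"
    "<p>Plain paragraph.</p>"
    "<p>Report by <a href='#'>Sports Illustrated</a> and <a href='#'>others</a>.</p>"
    "</div>"
)

soup = BeautifulSoup(sample_html, "html.parser")
paragraphs = soup.find_all("p")

# get_text() concatenates every text node under the tag, nested <a> included
full_text = [p.get_text() for p in paragraphs]

# find_all() returns a list-like ResultSet, so iterate over it
# instead of calling .string on the result directly
link_text = [[a.get_text() for a in p.find_all("a")] for p in paragraphs]

print(full_text)
print(link_text)
```

If only the complete paragraph text is needed, `full_text` should be enough on its own; the per-paragraph `link_text` lists are useful mainly when the links have to be handled separately.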

Thank you very much.
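
On the first question: in Python 3 the strings BeautifulSoup returns are already ordinary `str` objects, so no conversion is needed just to print them. The glued words in the output above (for example `respondedSundaymorning`) and the missing apostrophes (`hes`, `Lets`) are consistent with `encode("ascii", 'ignore')` silently dropping non-breaking spaces and curly quotes, although that is an inference from the output and has not been checked against the page. If plain ASCII really is required, one alternative is to map such characters to ASCII equivalents before encoding. The sketch below uses only the standard library; the `to_ascii` helper, its `REPLACEMENTS` table, and the `sample` string are hypothetical.

```python
import unicodedata

# Hypothetical mapping of common typographic characters to ASCII
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u00a0": " ",                  # non-breaking space
}

def to_ascii(text):
    """Replace known typographic characters, then drop whatever is still non-ASCII."""
    for src, dst in REPLACEMENTS.items():
        text = text.replace(src, dst)
    # NFKD splits accented letters into base letter + combining mark,
    # so the base letter survives the ASCII encoding step
    text = unicodedata.normalize("NFKD", text)
    return text.encode("ascii", "ignore").decode("ascii")

sample = "Kaepernick responded\u00a0Sunday\u00a0morning, saying \u201cLet\u2019s make it happen!\u201d"
print(to_ascii(sample))                                   # spaces and quotes preserved
print(sample.encode("ascii", "ignore").decode("ascii"))   # plain 'ignore' for comparison
```

The second `print` reproduces the kind of glued text seen in the answer's output, which is why replacing characters before encoding may be preferable to dropping them.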
