2013-01-23 12 views
5

Python: How do I extract URLs from an HTML page using BeautifulSoup?

<div class="article-additional-info"> 
A peculiar situation arose in the Supreme Court on Tuesday when two lawyers claimed to be the representative of one of the six accused in the December 16 gangrape case who has sought shifting of t... 
<a class="more" href="http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece"> 
<span class="arrows">»</span> 
</a> 
</div> 

<div class="article-additional-info"> 
Power consumers in the city will have to brace for spending more on their monthly bills as all three power distribution companies – the Anil Ambani-owned BRPL and BYPL and the Tatas-owned Tata Powe... 
<a class="more" href="http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece"> 
<a class="commentsCount" href="http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments"> 
</div> 

I have an HTML page with several div blocks like the ones above. I am new to BeautifulSoup, and I need to get the <a href=> value from every div with the class article-additional-info, so that I end up with the following URLs:

"http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece" 
"http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece" 

What is the best way to achieve this?

Answers

8

Per your specification this returns three URLs (not two); do you want to filter out the third one?

The basic idea is to iterate over the HTML, pull out only the elements with your class, and then iterate over all the links inside each of those elements and extract the actual href:

In [1]: from bs4 import BeautifulSoup 

In [2]: html = '...'  # your HTML from the question

In [3]: soup = BeautifulSoup(html) 

In [4]: for item in soup.find_all(attrs={'class': 'article-additional-info'}):
   ...:     for link in item.find_all('a'):
   ...:         print link.get('href')
   ...:
http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece 
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece 
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments 

This restricts the search to only those elements with the article-additional-info class, finds every anchor (a) tag within them, and prints the corresponding href link.
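If you only want the two article URLs and not the #comments link, one option is to filter on the anchor's class. This is a minimal sketch assuming, as in the sample HTML, that the links you want always carry class="more" while the comment links carry class="commentsCount":

from bs4 import BeautifulSoup

soup = BeautifulSoup(html)  # html holds the markup from the question

for item in soup.find_all(attrs={'class': 'article-additional-info'}):
    # keep only the "read more" anchors, skipping the commentsCount links
    for link in item.find_all('a', attrs={'class': 'more'}):
        print(link.get('href'))

With the sample markup this prints only the two .ece article URLs.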

2
from bs4 import BeautifulSoup as BS
html = '...'  # your HTML
soup = BS(html)
for text in soup.find_all('div', class_='article-additional-info'):
    for links in text.find_all('a'):
        print links.get('href')

http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece  
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece  
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments 
2

After working through the documentation, I did it the following way. Thank you all for your answers:

>>> import urllib2 
>>> f = urllib2.urlopen('http://www.thehindu.com/news/cities/delhi/?union=citynews') 
>>> soup = BeautifulSoup(f.fp) 
>>> for link in soup.select('.article-additional-info'): 
...     print link.find('a').attrs['href']
... 
http://www.thehindu.com/news/cities/Delhi/airport-metro-express-is-back/article4335059.ece 
http://www.thehindu.com/news/cities/Delhi/91-more-illegal-colonies-to-be-regularised/article4335069.ece 
http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece 
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece 
http://www.thehindu.com/news/cities/Delhi/nurses-women-groups-demand-safety-audit-of-workplaces/article4331470.ece 
http://www.thehindu.com/news/cities/Delhi/delhi-bpl-families-to-get-12-subsidised-lpg-cylinders/article4328990.ece 
http://www.thehindu.com/news/cities/Delhi/shias-condemn-violence-against-religious-minorities/article4328276.ece 
http://www.thehindu.com/news/cities/Delhi/new-archbishop-of-delhi-takes-over/article4328284.ece 
http://www.thehindu.com/news/cities/Delhi/delhi-metro-to-construct-subway-without-disrupting-traffic/article4328290.ece 
http://www.thehindu.com/life-and-style/Food/going-for-the-kill-in-patparganj/article.ece 
http://www.thehindu.com/news/cities/Delhi/fire-at-janpath-bhavan/article4335068.ece 
http://www.thehindu.com/news/cities/Delhi/fiveyearold-girl-killed-as-school-van-overturns/article4335065.ece 
http://www.thehindu.com/news/cities/Delhi/real-life-stories-of-real-women/article4331483.ece 
http://www.thehindu.com/news/cities/Delhi/women-councillors-allege-harassment-by-male-councillors-of-rival-parties/article4331471.ece 
http://www.thehindu.com/news/cities/Delhi/airport-metro-resumes-today/article4331467.ece 
http://www.thehindu.com/news/national/hearing-today-on-plea-to-shift-trial/article4328415.ece 
http://www.thehindu.com/news/cities/Delhi/protestors-demand-change-in-attitude-of-men-towards-women/article4328277.ece 
http://www.thehindu.com/news/cities/Delhi/bjp-promises-5-lakh-houses-for-poor-on-interestfree-loans/article4328280.ece 
http://www.thehindu.com/life-and-style/metroplus/papad-bidi-and-a-dacoit/article4323219.ece 
http://www.thehindu.com/life-and-style/Food/gharana-of-food-not-just-music/article4323212.ece 
>>> 
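For anyone on Python 3, where urllib2 no longer exists, a roughly equivalent sketch (assuming the same listing URL and that the page still uses the article-additional-info class) would use urllib.request and the print function:

import urllib.request
from bs4 import BeautifulSoup

# fetch the listing page (same URL as above)
with urllib.request.urlopen('http://www.thehindu.com/news/cities/delhi/?union=citynews') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

# first anchor inside each article block, as in the approach above
for block in soup.select('.article-additional-info'):
    link = block.find('a')
    if link is not None:
        print(link['href'])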
+1

Thanks for those. Please don't link to your own site again; that is [spam](http://stackoverflow.com/help/promotion) on [so]. –
