Scraping SpeechesUSA.com

I'm trying to scrape the title links from speeches-usa.com. Here is my Python code:

import urllib2
from cookielib import CookieJar
from bs4 import BeautifulSoup

SPEECH_SOURCE = 'http://www.speeches-usa.com/'
PARSER_TYPE = 'html.parser'  # assumed value; not defined in the original post

def get_speeches():
    cj = CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    p = opener.open(SPEECH_SOURCE)
    soup = BeautifulSoup(p.read(), PARSER_TYPE)
    info = soup.find_all('a', class_='ListText')
    elements = []
    for element in info:
        elements.append(element)
    # Print at most the first five matches
    for i in xrange(0, min(len(elements), 5)):
        print elements[i]

(1) I'm not sure what to pass to soup.find_all() to get the links. I tried elements.append(element.get_text()), but that strips the links and gives the following instead:

John Adams - Inaugural
     Address

Samuel Adams - American
     Independence

Spiro Agnew - Television
     News Coverage

Susan B. Anthony - Women's
     Right to Vote

(2) The results look incomplete. For example, Jane Addams is missing from the output below:

<a class="ListText" href="Transcripts/john_adams-inaugural.html">John Adams - Inaugural 
     Address<br/> 
</a> 
0 
<a class="ListText" href="Transcripts/samuel_adams-independence.html">Samuel Adams - American 
     Independence<br/> 
</a> 
1 
<a class="ListText" href="Transcripts/spiro_agnew-networknews.html">Spiro Agnew - Television 
     News Coverage<br/> 
</a> 
2 
<a class="ListText" href="Transcripts/susan_b_anthony-vote.html">Susan B. Anthony - Women's 
     Right to Vote</a> 
3 
<a class="ListText" href="Transcripts/spiro_agnew-networknews.html"></a> 
4

Any help would be greatly appreciated.

2017-01-04 Erica Dohring

This code isn't runnable as posted. Please include the relevant imports and variables (e.g. 'SPEECH_SOURCE') so that we have a usable example. – asongtoruin

Edited! Thanks for catching that. –

Try this:

import urllib2
import urlparse
from bs4 import BeautifulSoup  # find_all() and class_ require BeautifulSoup 4


def get_speeches(input_url):
    p = urllib2.urlopen(input_url)
    soup = BeautifulSoup(p, 'html.parser')
    # Only anchors carrying the ListText class are matched here
    info = soup.find_all('a', class_='ListText')

    for element in info:
        # Resolve the relative href against the page URL
        print urlparse.urljoin(input_url, element['href'])

SOURCE_URL = 'http://speeches-usa.com'
get_speeches(SOURCE_URL)
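
The urljoin call in the code above is what turns a relative href such as Transcripts/john_adams-inaugural.html into a full address. As a minimal sketch of that step alone, here is the Python 3 equivalent (urllib.parse.urljoin instead of urlparse.urljoin). The helper name absolute_links is hypothetical, and the second and third sample hrefs are made up to show how other href forms behave; only the first one comes from the question's output.

```python
from urllib.parse import urljoin


def absolute_links(base_url, hrefs):
    """Resolve each href against base_url and return absolute URLs."""
    return [urljoin(base_url, href) for href in hrefs]


hrefs = [
    # Relative path, as seen in the question's output
    "Transcripts/john_adams-inaugural.html",
    # Root-relative form (hypothetical, for illustration)
    "/Transcripts/samuel_adams-independence.html",
    # Already absolute (hypothetical); urljoin leaves it unchanged
    "http://example.com/other.html",
]

for url in absolute_links("http://speeches-usa.com", hrefs):
    print(url)
```

Resolving against the page URL rather than concatenating strings means it does not matter whether the base URL ends with a slash.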

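element.get_text() does exactly what it says: it gets the text of the element. If you need an attribute, you can use square brackets, as in element['href'].

EDIT: As the comments below point out, not every link has the ListText class, so the code above misses some elements. The following code finds all links and checks whether 'Transcripts' appears in each href (assuming those are the transcript links you want). This can produce duplicates, so set() is used to print only the unique entries.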

import urllib2
import urlparse
from bs4 import BeautifulSoup  # find_all() requires BeautifulSoup 4


def get_speeches(input_url):
    p = urllib2.urlopen(url=input_url)
    soup = BeautifulSoup(p, 'html.parser')
    # Match every anchor that has an href, regardless of its class
    info = soup.find_all('a', href=True)

    all_transcripts = list()

    for element in info:
        # Keep only links that point into the Transcripts directory
        if 'Transcripts' in element['href']:
            all_transcripts.append(urlparse.urljoin(input_url, element['href']))

    # set() drops duplicate URLs (the original page order is not kept)
    for transcript_url in set(all_transcripts):
        print transcript_url

SOURCE_URL = 'http://speeches-usa.com'
get_speeches(SOURCE_URL)
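
One side effect of set() in the code above is that the URLs are no longer printed in page order. If the order matters, one possible alternative (a Python 3 sketch, not part of the original answer) is to deduplicate through dictionary keys, which keep insertion order in Python 3.7 and later. The helper name unique_in_order is hypothetical; the sample URLs are built from hrefs shown in the question's output, where the Spiro Agnew link appears twice.

```python
def unique_in_order(urls):
    """Drop duplicate URLs while keeping the first occurrence of each."""
    # Dictionary keys are unique and remember insertion order (Python 3.7+)
    return list(dict.fromkeys(urls))


found = [
    "http://speeches-usa.com/Transcripts/john_adams-inaugural.html",
    "http://speeches-usa.com/Transcripts/spiro_agnew-networknews.html",
    "http://speeches-usa.com/Transcripts/susan_b_anthony-vote.html",
    # Duplicate of the second entry
    "http://speeches-usa.com/Transcripts/spiro_agnew-networknews.html",
]

for url in unique_in_order(found):
    print(url)
```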

2017-01-04 11:31:09 asongtoruin

Did you test your code? Not every a tag has that class. –

@宏杰李 It definitely misses some links. I'll try to edit and improve this. – asongtoruin

Great! That's wonderful :) –

The following should give you the complete URLs:

# Assumes soup is the BeautifulSoup object built in the question's code
for a in soup.find_all('a', href=True):
    print "Found the URL:", a['href']
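
Note that a['href'] returns the attribute exactly as it is written in the page, so on a site that uses relative links the values still need to be resolved against the page URL. The sketch below illustrates the same idea (every a tag with an href, whatever its class) in Python 3, deliberately swapping in the standard library's html.parser for BeautifulSoup so that it runs without third-party packages. The LinkCollector class, the sample markup, and in particular the jane_addams-example.html href are hypothetical; only the John Adams anchor is modeled on the question's output.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hand-written sample; the second href is made up for illustration
SAMPLE_HTML = """
<a class="ListText" href="Transcripts/john_adams-inaugural.html">John Adams - Inaugural
     Address<br/>
</a>
<a href="Transcripts/jane_addams-example.html">Jane Addams - Example</a>
<a name="top">anchor without href</a>
"""


class LinkCollector(HTMLParser):
    """Collect (href, text) for every <a> tag that has an href attribute."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._chunks = []   # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse line breaks and indentation inside the link text
            text = " ".join("".join(self._chunks).split())
            self.links.append((self._href, text))
            self._href = None


collector = LinkCollector()
collector.feed(SAMPLE_HTML)

BASE_URL = "http://speeches-usa.com/"
results = [(urljoin(BASE_URL, href), text) for href, text in collector.links]
for url, text in results:
    print(url, text, sep="\t")
```

The anchor without a class is still picked up, while the one without an href is skipped, which is the behavior href=True gives in the snippet above.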

2017-01-04 11:19:11

You should test your code before you post it. –

import re

import bs4, requests
r = requests.get('http://speeches-usa.com/')
soup = bs4.BeautifulSoup(r.text, 'lxml')

# Search only the main table; keep anchors that have text and an href containing .html
a_tags = soup.find('table', width="925").find_all('a', text=True, href=re.compile(r'\.html'))
for a in a_tags:
    link = a.get('href')
    text = a.get_text(strip=True).replace('\n  ', '')
    print(link, text, sep="\t\t")

Output:

Transcripts/susan_b_anthony-vote.html  Susan B. Anthony - Women'sRight to Vote 
Transcripts/albert_beveridge-question.html  Albert J. Beveridge - ThePhillipine Question 
Transcripts/william_jennings_bryan-cross.html  William Jennings Bryan - Crossof Gold 
Transcripts/william_jennings_bryan-19002.html  William Jennings Bryan - 1900Democratic Presidential Acceptance 
Transcripts/tony_blair-irish.html  Tony Blair - Addressto Irish Parliament 
Transcripts/napolean_bonaparte-farewell.html  Napolean Bonaparte - Farewell to the Old Guard 
Transcripts/sarah_brady-1996dnc.html  Sarah Brady - 1996DNC Keynote address 
Transcripts/pat_buchanan-citadel.html  Pat Buchannan - Arepublic not an Empire 
Transcripts/edmund_burke.html  Edumund Burke - Thedeath of Marie Antoinette 
Transcripts/barbara_bush-1992rnc.html  Barbara Bush - 1992RNC Speech 
Transcripts/barbara_bush-wellesley.html  Barbara Bush - WelleslyCollege 
Transcripts/george_bush-somalia.html  George Bush - Conditionsin Somalia 
Transcripts/george_bush-1991sou.html  George Bush - 1991State of the Union 
Transcripts/george_bush-saudi.html  George Bush - Defenseof Saudi Arabia 
Transcripts/george_w_bush-knoxville.html  George W. Bush - Anew approach 
Transcripts/stokeley_carmichael-going.html  Stokley Carmichael - BlackPower 
Transcripts/stokeley_carmichael-weaint.html  Stokley Carmichael - "Weain't goin'" 
Transcripts/jimmy_carter-energy.html  Jimmy Carter - EnergyCrisis
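
Several titles in the output above have two words run together (for example "Women'sRight to Vote"), which suggests the replace() call does not handle the line break and indentation inside the link text cleanly. One possible alternative, sketched here under that assumption, is to collapse every run of whitespace into a single space. The helper name normalize_title is hypothetical, and the sample string is modeled on the markup shown in the question; with BeautifulSoup it could be applied to the result of a.get_text().

```python
def normalize_title(raw_text):
    """Collapse every run of whitespace (spaces, tabs, newlines) to one space."""
    # str.split() with no arguments splits on any whitespace and drops empties
    return " ".join(raw_text.split())


raw = "Susan B. Anthony - Women's \n     Right to Vote"
print(normalize_title(raw))
```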

2017-01-04 11:42:43
