2016-12-24 25 views
-1

Extracting text from a website using BeautifulSoup

I want to get only the text of a web page's content, and I am using BeautifulSoup to do it.

I wrote a function like the following:

def textClean(text): 
    """Print the raw HTML, then the text left after stripping the HTML tags.""" 
    from bs4 import BeautifulSoup 
    # Name the parser explicitly so the result does not depend on what is installed 
    souptext = BeautifulSoup(text, 'html.parser') 
    print(text) 
    print(souptext.get_text()) 

This prints both the original HTML source and the text extracted from it.

Here is a sample of the output I get:

HTML output (first print statement):

<p><img style="float:right;" src="http://static4.businessinsider.com/image/56eb68e791058427008b72e5-907-680/5550538407_c22babffba_b.jpg" alt="radar" data-mce-source="US Navy" data-mce-caption="Mineman Seaman Charles Bryan watches for contacts on the SPA 256 radar while on watch in the Combat Directive Center aboard the mine countermeasures ship USS Ardent (MCM 12)." data-link="https://www.flickr.com/photos/usnavy/5550538407/in/photolist-9stXG4-e6i1uU-e6i1tE-dLSiBQ-c9jmg7-f5LbtS-r9jw69-efvjaN-duNiV6-efpeEP-eW8Dg9-q1nZiQ-en2osX-duNiTa-njkj3s-eep3Mb-kUdU5g-9d7u4E-eeoYiC-fr2CuX-axHdte-fsVD3D-drHPeJ-9rAVac-cnMSiW-9vVcbN-enB31b-f23pKF-aBjveY-9rEhwY-9u6GZy-9rDT9L-bojAAh-9uiNiU-9AJSrB-9rFxwQ-bjkanD-aefpN9-ea2WB2-ea2WyR-a1tUoa-9rAUXZ-ea8Bf9-9Wm3Z8-9rNE7o-enB1YY-9rAUHX-ea2WpF-aNR7eD-9NX2pq" /><span class="source">US Navy</span></p><p>The United States has seen Chinese activity around a reef that China seized from the Philippines nearly four years ago that could be a precursor to more land reclamation in the disputed South China Sea, the U.S. Navy chief said on Thursday.</p>

Text output (second print statement):

US NavyThe United States has seen Chinese activity around a reef that China seized from the Philippines nearly four years ago that could be a precursor to more land reclamation in the disputed South China Sea, the U.S. Navy chief said on Thursday. 

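As you can see, the text inside the tag

<span class="source">US Navy</span></p> 

is also being extracted. I don't want that, because if you look at the original article (linked below), that text is not part of the article.

I know get_text() retrieves all the text, so I could restrict extraction to the text inside paragraph tags, but the span sits inside a paragraph tag too, even though I don't consider its text part of the original article.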

Here is the link to the article I used:

enter link description here

EDIT1:

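I get output like the one shown after the code; each column is converted to Unicode.

Here is the mapping function I wrote to map over each record of the Spark DataFrame and strip the HTML tags from its 'desc' column: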

def htmlParsing(x): 
    """Take a Spark Row and return (id, title, text) with the HTML tags cleaned from 'desc'.""" 
    from bs4 import BeautifulSoup 
    row = x.asDict() 
    souptext = BeautifulSoup(row['desc'], 'html.parser') 
    p_tags = souptext.find_all('p') 
    for p in p_tags: 
        # p.string is None when the <p> has more than one child 
        if p.string: 
            # Returns on the first match, so only one paragraph per row is kept 
            return (int(row['id']), row['title'], p.string) 


sdf_cleaned = sdf_rss.map(htmlParsing) 

sdf_cleaned.take(4) 

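[(-33753621,
  u'Royal Bank of Scotland is testing a robot that could solve your banking problems (RBS)',
  u'If you hate dealing with bank tellers or customer service representatives, then the Royal Bank of Scotland might have a solution for you.'),
 (-761323061,
  u'Teen sexting is prompting an overhaul of child pornography laws',
  u'Teen sexting has politicians and law enforcement authorities struggling to find some kind of legal middle ground between prosecuting students for child pornography and letting them go.'),
 (1405376555,
  u'China has begun building a new project in the South China Sea',
  u'The United States has seen Chinese activity around a reef that China seized from the Philippines nearly four years ago that could be a precursor to more land reclamation in the disputed South China Sea, the U.S. Navy chief said on Thursday.'),
 (-1882022821,
  u'Drunk driving laws are reducing drunk driving deaths',
  u'Reuters Health - States that require drunk drivers to install ignition interlock devices in their cars have 15% fewer alcohol-related crash deaths than states without these requirements, a study shows.')]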
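Note that the mapping function above returns as soon as it finds one text-only paragraph, which is why each row carries a single sentence. A minimal sketch of a variant that joins every text-only paragraph instead, written as a plain function over a dict so it runs without Spark. The name `clean_row` and the sample record are hypothetical, not part of the original code:

```python
from bs4 import BeautifulSoup


def clean_row(row):
    """Return (id, title, text) with the text of all text-only <p> tags joined."""
    soup = BeautifulSoup(row["desc"], "html.parser")
    # p.string is None for paragraphs with more than one child, so those are skipped.
    parts = [p.string.strip() for p in soup.find_all("p") if p.string]
    return (int(row["id"]), row["title"], " ".join(parts))


# Hypothetical record standing in for row.asDict() on a Spark Row.
sample = {
    "id": "7",
    "title": "Example",
    "desc": "<p><span>US Navy</span><b>x</b></p><p>First.</p><p>Second.</p>",
}
result = clean_row(sample)
print(result)
```

With Spark this could presumably be applied as `sdf_rss.rdd.map(lambda r: clean_row(r.asDict()))`, assuming `sdf_rss` is a DataFrame with `id`, `title` and `desc` columns.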

Answers

0
import requests, bs4 
r = requests.get('http://www.businessinsider.com/r-exclusive-us-sees-new-chinese-activity-around-south-china-sea-shoal-2016-3') 
soup = bs4.BeautifulSoup(r.text, 'lxml') 

p_tags = soup.find_all('p') 
for p in p_tags: 
    # Only <p> tags whose single child is a text node have a non-None .string 
    if p.string: 
        print(p.string) 

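.string

If a tag has only one child, and that child is a NavigableString, the child is made available as .string.

If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None.

So .string only returns a value for the p tags that contain nothing but text.

Out: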

The United States has seen Chinese activity around a reef that 
    China seized from the Philippines nearly four years ago that 
    could be a precursor to more land reclamation in the disputed 
    South China Sea, the U.S. Navy chief said on Thursday. 


    The head of U.S. naval operations, Admiral John Richardson, 
    expressed concern that an international court ruling expected in 
    coming weeks on a case brought by the Philippines against China 
    over its South China Sea claims could be a trigger for Beijing to 
    declare an exclusion zone in the busy trade route. 


    Richardson told Reuters the United States was weighing responses 
    to such a move. 


    He said the U.S. military had seen Chinese activity around 
    Scarborough Shoal in the northern part of the Spratly 
    archipelago, about 125 miles (200 km) west of the Philippine base 
    of Subic Bay. 


    "I think we see some surface ship activity and those sorts of 
    things, survey type of activity, going on. That's an area of 
    concern ... a next possible area of reclamation," he said. 


    Richardson said it was unclear if the activity near the reef, 
    which China seized in 2012, was related to the pending 
    arbitration decision. 


    He said China's pursuit of South China Sea territory, which has 
    included massive land reclamation to create artificial islands 
    elsewhere in the Spratlys, threatened to reverse decades of open 
    access and introduce new "rules" that required countries to 
    obtain permission before transiting those waters. 


    He said that was a worry given that 30 percent of the world's 
    trade passes through the region. 


    Asked whether China could respond to the ruling by the court of 
    arbitration in The Hague by declaring an air defense 
    identification zone, or ADIZ, as it did farther north in the East 
    China Sea in 2013, Richardson said: "It's definitely a concern." 


    "We will just have to see what happens," he said. "We think about 
    contingencies and … responses." 


    Richardson said the United States planned to continue carrying 
    out freedom-of-navigation exercises within 12 nautical miles of 
    disputed South China Sea geographical features to underscore its 
    concerns about keeping sea lanes in the region open. 


    The United States responded to the East China Sea ADIZ by flying 
    B-52 bombers through the zone in a show of force in November 
    2013. 


    Richardson said he was struck by how China's increasing 
    militarization of the South China Sea had increased the 
    willingness of other countries in the region to work together, 
    not just bilaterally, but also multilaterally. 


    India and Japan joined the U.S. Navy in the Malabar naval 
    exercise since 2014, and were slated to take part again this year 
    in an even more complex exercise that will take place in an area 
    close to the East and South China Seas. 


    South Korea, Japan and the United States were also working 
    together more closely than ever before, he said. 


    Richardson said the United States would welcome the participation 
    of other countries in joint patrols with the United States in the 
    South China Sea, but those decisions needed to be made by the 
    countries in question. 


    He said the U.S. military saw good opportunities to build and 
    rebuild relationships with countries such as Vietnam, the 
    Philippines and India, which have all realized the importance of 
    safeguarding the freedom of the seas. 


    He cited India's recent hosting of an international fleet review 
    that included 75 ships from 50 navies, and said the United States 
    was exploring opportunities to increase its use of ports in the 
    Philippines and Vietnam, among others - including the former U.S. 
    naval base at Vietnam's Cam Ranh Bay. 


    But he said Washington needed to proceed judiciously rather than 
    charging in "very fast and very heavy," given the enormous 
    influence and importance of the Chinese economy in the region. 


    "We have to be sophisticated in how we approach this so that we 
    don't force any of our partners into an uncomfortable position 
    where they have to make tradeoffs that are not in their best 
    interest," he said. 


    "We would hope to have an approach that would ... include us a 
    primary partner but not necessarily to the exclusion of other 
    partners in the region," he said. 

The United States has seen Chinese activity... 
5 innovations in radiology that could impact everything from the Zika virus to dermatology 
Keep tabs on the latest from Business Insider in our new Chrome Extension 
Available on iOS or Android 
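
The .string rule quoted before the output can be checked on a small fragment. A minimal sketch, assuming a shortened, made-up version of the markup from the question (the `radar.jpg` source and the trimmed sentence are placeholders):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the markup in the question.
html = (
    '<p><img src="radar.jpg" alt="radar" /><span class="source">US Navy</span></p>'
    '<p>The United States has seen Chinese activity around a reef.</p>'
)

soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("p")

# Two children (<img> and <span>), so .string is None.
print(first.string)
# A single NavigableString child, so .string returns it.
print(second.string)
# get_text() ignores that distinction and concatenates all nested text.
print(first.get_text())
```

This is why filtering on `p.string` drops the image-credit paragraph while `get_text()` on the whole document keeps it.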
+0

This is a good answer. But I don't want to print the strings; I want to save them as a dataset. When I return them instead, a unicode 'u' gets added and they are not plain strings. How do I get rid of those? – Baktaawar

+0

Can you add the code that saves the data to your question? –

+0

Pls see the edit. – Baktaawar
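
On the 'u' prefix raised in the comments: in Python 2 it is part of how a unicode string is displayed inside a list or tuple (its repr), not part of the stored data. A minimal sketch under Python 3, where every str is already Unicode and no prefix appears. The `rows` values and the in-memory CSV target are made up for illustration:

```python
import csv
import io

# Hypothetical rows shaped like the (id, title, text) tuples in the question.
rows = [
    (1, "First title", "First paragraph."),
    (2, "Second title", "It\u2019s a second paragraph."),
]

# Printing the container shows each element's repr (quotes, and in Python 2
# the u prefix); printing an element shows the text itself.
print(rows[1])
print(rows[1][2])

# Writing the rows out stores only the text, never the repr decorations.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
saved = buffer.getvalue()
print(saved)
```

The same holds when the tuples are saved from Spark: the prefix shows up when a collection is printed, not in the values themselves.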

0

As you noticed, get_text() consumes all the tags and returns the text beneath them.

You need to target the tags you want, like this:

from bs4 import BeautifulSoup 

html = ''' 
<p> 
    <img style="float:right;" src="http://static4.businessinsider.com/image/56eb68e791058427008b72e5-907-680/5550538407_c22babffba_b.jpg" alt="radar" data-mce-source="US Navy" data-mce-caption="Mineman Seaman Charles Bryan watches for contacts on the SPA 256 radar while on watch in the Combat Directive Center aboard the mine countermeasures ship USS Ardent (MCM 12)." data-link="https://www.flickr.com/photos/usnavy/5550538407/in/photolist-9stXG4-e6i1uU-e6i1tE-dLSiBQ-c9jmg7-f5LbtS-r9jw69-efvjaN-duNiV6-efpeEP-eW8Dg9-q1nZiQ-en2osX-duNiTa-njkj3s-eep3Mb-kUdU5g-9d7u4E-eeoYiC-fr2CuX-axHdte-fsVD3D-drHPeJ-9rAVac-cnMSiW-9vVcbN-enB31b-f23pKF-aBjveY-9rEhwY-9u6GZy-9rDT9L-bojAAh-9uiNiU-9AJSrB-9rFxwQ-bjkanD-aefpN9-ea2WB2-ea2WyR-a1tUoa-9rAUXZ-ea8Bf9-9Wm3Z8-9rNE7o-enB1YY-9rAUHX-ea2WpF-aNR7eD-9NX2pq" /> 
    <span class="source">US Navy</span> 
</p> 
<p> 
    The United States has seen Chinese activity around a reef that China seized from the Philippines nearly four years ago that could be a precursor to more land reclamation in the disputed South China Sea, the U.S. Navy chief said on Thursday. 
</p>''' 

soup = BeautifulSoup(html, "html.parser") 

# The parsed document is bound to `soup`; index 1 is the second <p> 
print(soup.find_all('p')[1].get_text()) 
+0

Your code only gives p[1]. With p[0], it also prints US Navy, which is not what I want. – Baktaawar
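
One possible way to cover every paragraph without hard-coding an index is to remove the credit spans from the parsed tree before extracting text. A minimal sketch, assuming the unwanted text always sits in `<span class="source">` as in the sample markup; the fragment below is a shortened, made-up version of it:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the markup from the question.
html = (
    '<p><img src="radar.jpg" alt="radar" /><span class="source">US Navy</span></p>'
    '<p>The United States has seen Chinese activity around a reef.</p>'
)

soup = BeautifulSoup(html, "html.parser")

# Remove every <span class="source"> from the tree before extracting text.
for credit in soup.find_all("span", class_="source"):
    credit.decompose()

# Collect the text of each paragraph, skipping those left empty.
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
paragraphs = [text for text in paragraphs if text]

print(paragraphs)
```

Unlike filtering on `.string`, this keeps paragraphs that mix text with inline tags such as links, at the cost of relying on the `source` class name.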
