2016-03-21
2

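Extracting plain text from a URL with BeautifulSoup in Python, but the result is still not clean

I am trying to extract plain text from a given URL. From my searching, BeautifulSoup seems to be the most relevant tool, so I wrote a simple program to test it. However, it still does not meet my requirement: the result contains a lot of text that is not plain content.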

To see the result, run the following Python code:

import urllib  # Python 2; in Python 3 use urllib.request.urlopen instead
from bs4 import BeautifulSoup

url = "http://www.amfastech.com/2015/07/lenovo-k3-note-brutally-honest-review-specifications-pros-cons.html"
html = urllib.urlopen(url).read().decode('utf8')
raw = BeautifulSoup(html).get_text()

When you look at raw, the result contains code like this:

(function() { (function(){function 
c(a){this.t={};this.tick=function(a,c,b){var d=void 0!=b?b:(new 
Date).getTime();this.t[a]=[d,c];if(void 
0==b)try{window.console.timeStamp("CSI/"+a)}catch(e){}};this.tick("start",null,a)}var 
a;window.performance&&(a=window.performance.timing);var h=a?new 
c(a.responseStart):new c;window.jstiming={Timer:c,load:h};if(a){var 
b=a.navigationStart,e=a.responseStart;0<b&&e>=b&&(window.jstiming.srt=e-b)}if(a){var 
d=window.jstiming.load;0<b&&e>=b&&(d.tick("_wtsrt",void 
0,b),d.tick("wtsrt_", 
"_wtsrt",e),d.tick("tbsd_","wtsrt_"))}try{a=null,window.chrome&&window.chrome.csi&&(a=Math.floor(window.chrome.csi().pageT),d&&0<b&&(d.tick("_tbnd",void 
0,window.chrome.csi().startE),d.tick("tbnd_","_tbnd",b))),null==a&&window.gtbExternal&&(a=window.gtbExternal.pageT()),null==a&&window.external&&(a=window.external.pageT,d&&0<b&&(d.tick("_tbnd",void 
0,window.external.startE),d.tick("tbnd_","_tbnd",b))),a&&(window.jstiming.pt=a)}catch(k){}})();window.tickAboveFold=function(c){var 
a=0;if(c.offsetParent){do 
a+=c.offsetTop;while(c=c.offsetParent)}c=a;750>=c&&window.jstiming.load.tick("aft")};var 
f=!1;function 
g(){f||(f=!0,window.jstiming.load.tick("firstScrollTime"))}window.addEventListener?window.addEventListener("scroll",g,!1):window.attachEvent("onscroll",g); 
})(); 

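So my question is: how can I really get clean plain text from HTML with Python? I know that many web tools support a so-called book view mode, in which you see only the main article in most cases, so I assume extracting clean plain text should not be a problem. Thanks!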

Answers

1

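You are using BeautifulSoup incorrectly. To extract the text you want, do not just take the raw text of the whole page. BS is not a magic wand that guesses what you need from a page. Instead, you have to look for the classes and ids of the elements you want to extract: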

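>>> bs = BeautifulSoup(html)  # html fetched as in the question
>>> bs.find_all('h1')[0].getText()
u'\nLenovo K3 Note Brutally Honest Review: Specifications, Pros and Cons\n'
>>> # The original call listed 'class' twice in one dict; only the last key ('entry-content') takes effect.
>>> bs.find_all(attrs={'class': 'entry-content'})[0].getText()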
u'\n\n\n\n\n\n(adsbygoogle = window.adsbygoogle || []).push({});\n\n\nIt seems like Lenovo has finally caught the pulse of smartphone market in countries like India. After the successful launch of A6000, 6000+ and A7000, the company has come up with something big, both psychically and performance wise, with a name k3 note.The term \u2018Note\u2019 itself reminds us of the large phones which was actually been started mentioning by Samsung for its phablets. Like all other smartphone manufacturer companies, Lenovo also took up the term for its new boy.In this review, I\u2019ll be discussing the specifications of the K3 Note phablet in the price point of view and will be discussing the pros and cons of this device honestly brutally honestly.Let\u2019s begin! In the boxAlong with the handset, you will get a screen guard (non-tamper proof), 2-pin wall mounted charger, USB cable and removable battery in the box. K3 Note will not be accompanied by the headset in the box. That\u2019s somewhat upsetting to see A7000 coming with one and K3 Note with none. DesignNo actual changes were made to the physical design of Lenovo K3 Note compared to its predecessor, A7000. In fact, you will not see the difference between the two devices physically when kept side-by-side. \xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 \xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 \xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 The screen size, body, camera, flash and speaker, buttons and slots are in the same position as A7000. K3 Note\u2019s physical design looks as good as A7000 but not build that tough. The body has low build quality and it can easily be broken under the appliance of little \u2018more\u2019 pressure. DisplayLenovo K3 Note comes with 5.5 inch Full HD IPS display that can render 401 pixels per inch (PPI) on 1080P resolution display.The screen contributes 72% to the body ratio thus making it a large screen-less body device. 
The best viewing angles of the screen has specified to be 178 degrees and it has 5-point touch sensor that can recognize 5-touch points simultaneously. Processor & RAMLenovo K3 Note comes with 1.7 GHz MediaTek Cortex A53 64-bit processor which is 0.2GHz faster than Lenovo A7000. The 2 GB RAM supports the processor at its best in multi-tasking.The combo is supported with ARM Mali-T760 MP2 GPU which is not so different to A7000\u2019s. You can experience good 3D gaming with this GPU configuration in parallel with the processor and RAM. MemoryK3 Note comes with 16 GB built-in ROM and allows users to expand the memory up to 32 GB through microSD card. This is an upgraded feature when compared to Lenovo A7000\u2019s 8 GB ROM. Operating SystemK3 Note runs on Android Lollipop v5.0 which is not even 5.0.2. It is sad to see Lenovo\u2019s next product, after A7000 coming with v5.0. It is expected to get Android Lollipop v5.1 in future. CameraLenovo has upgraded the rear camera for K3 Note from 8MP to 13MP. The dual tone LED flash helps to take best shots in both lighting conditions. The camera is added with some new shooting modes compared to A7000. It can record full HD\xa01080P resolution videos with 30 frames per second rate.The front camera can take 5MP sharp photos and it is good enough to take best selfies.K3 Note\u2019s camera specifications are satisfying for its price range. ConnectivityIt supports 4G LTE networks in both the slots and have the same Wi-Fi, Bluetooth and OTG support specifications that A7000 came up with. BatteryLenovo K3 Note has got 2900mAh powered battery which can hold the charging on moderate usage for 24 hours at most. The 1080P screen absorbs the juice quickly and so it cannot last as long as A7000. 
Pros A bit more fast processor Upgraded camera More internal memory Full HD screen Full HD recording Removable battery Cons Low built quality body Same design as A7000 No Lollipop v5.0.2 at least No Gorilla Glass 3 protection High SAR values 1.590W/KG for head and 0.688W/KG for body Update: Unboxing photos (shared by a fan exclusively for Amfas Tech) \xa0 For more photos: Check out Lenovo K3 Note album on our Facebook page. \xa0 Final VerdictLenovo K3 Note has got some improvements like 16 GB internal storage, 1080P screen and video recording, little faster processor. The rest of the phone is a quite replica of Lenovo A7000. It could have been named as \u2018Lenovo A7000 Plus\u2019 instead of \u2018K3 Note\u2019.After looking at the specifications and advancements, Lenovo K3 Note for such a low price of 9,999 INR is a great deal. If you are planning to buy A7000, dare 1,000 bucks more for K3 Note and you will get a damn good phone for that price (statement made keeping price in mind).Note: If you talk more on phone, think a while choosing this phone as its SAR values are very highly specified.\n\n\n\n\n\n(adsbygoogle = window.adsbygoogle || []).push({});\n\n\n\n\n\n(adsbygoogle = window.adsbygoogle || []).push({});\n\nPlease share this article if you like it! Bless me or curse me in comments! Thank you for reading anyway!\n\n\n\n\n' 

There is still some cleaning to do (mainly the ad JavaScript inside the text), but it is mostly there. You need to look at which tags, classes, and ids you want to keep from the body.
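As a minimal sketch of that remaining cleanup, assuming the article sits in an element with class entry-content (as the output above suggests) and that the ad snippets are inline script tags: select the container, drop its scripts, then read the text. The SAMPLE_HTML string and the extract_article helper are hypothetical stand-ins so the example runs without network access; they are not part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page.
SAMPLE_HTML = """
<html><head><script>var tracking = 1;</script></head>
<body>
  <h1>Sample title</h1>
  <div class="post-body entry-content">
    <script>(adsbygoogle = window.adsbygoogle || []).push({});</script>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
  <div class="sidebar">Unrelated links</div>
</body></html>
"""


def extract_article(html, css_class="entry-content"):
    """Return the text of the first element carrying css_class, without scripts or styles."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(class_=css_class)
    if container is None:
        return ""
    # Remove script/style elements (and their contents) inside the container only.
    for tag in container.find_all(["script", "style"]):
        tag.decompose()
    # One stripped string per line; whitespace-only strings are dropped.
    return container.get_text(separator="\n", strip=True)


article_text = extract_article(SAMPLE_HTML)
print(article_text)
```

On a real page the class name has to be read from the page source; entry-content is only what this particular blog appears to use.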

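"So my question is: how can I really get clean plain text from HTML with Python? I see that many web tools support a so-called book view mode, in which you see only the main article in most cases, so I reckon extracting clean plain text should not be a problem."

That is not related. The "raw" text in such a view is just a different CSS style that displays only the text. It does not make the source of the page any simpler.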

1

You need to find the style and script tags and discard them, together with their contents, using the .decompose method. From there, simply call get_text to get the text of the soup.

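from urllib.request import urlopen  # in Python 2.x: from urllib import urlopen
from bs4 import BeautifulSoup

url = "http://www.amfastech.com/2015/07/lenovo-k3-note-brutally-honest-review-specifications-pros-cons.html"
html = urlopen(url).read()
soup = BeautifulSoup(html, 'lxml')  # requires the lxml package
# Remove script and style elements together with their contents.
for tag in soup.find_all(['script', 'style']):
    tag.decompose()
# Collect the remaining text, stripping whitespace around each string.
text = soup.get_text(strip=True)
print(text)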

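The output begins:

'Lenovo K3 Note Brutally Honest Review: Specifications, Pros and Cons≡HomeAboutUsBlog IndexServicesNewsGuest PostContact Us ... It seems like Lenovo has finally caught the pulse of smartphone market in countries like India. After the successful launch of A6000, 6000+ and A7000, the company has come up with something big, both psychically and performance wise, with a name k3 note.The term .........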
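In the output above, words from adjacent elements run together (for example "Cons" is followed directly by the menu entries), because get_text(strip=True) joins the stripped strings with no separator. A small variation, sketched here on a hypothetical inline HTML snippet rather than the real page, passes a separator so that element boundaries survive:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the downloaded page.
html = """
<html><head><style>p { color: red; }</style>
<script>console.log("not content");</script></head>
<body><h1>Title</h1><ul><li>Home</li><li>About Us</li></ul>
<p>Body text.</p></body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# Drop script and style elements, as in the answer above.
for tag in soup.find_all(["script", "style"]):
    tag.decompose()

joined = soup.get_text(strip=True)                      # strings glued together
separated = soup.get_text(separator="\n", strip=True)   # one string per line

print(joined)
print(separated)
```

With the separator, each text node ends up on its own line, which makes later filtering (for example, dropping short navigation entries) easier. Whether a newline or a space is the better separator depends on what the text is used for.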
