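Trouble scraping a site with BS4

I can usually write a working scraping script, but I'm having some difficulty scraping this site for a table I need for a research project I'm working on. I plan to confirm that the script works for one state before entering the URLs of my target states.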

import requests
import bs4 as bs

url = "http://programs.dsireusa.org/system/program/detail/284"
dsire_get = requests.get(url)
soup = bs.BeautifulSoup(dsire_get.text, 'lxml')
table = soup.find_all('div', {'data-ng-controller': 'DetailsPageCtrl'})
print(table)
# I'm printing `table` just to check that the information I'm looking for is inside this section

I don't know whether the site is trying to stop people from scraping it, but if you look at what `table` prints, all of the information I'm trying to grab is wrapped in "&quot;" entities.
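For readers wondering what those "&quot;" sequences are: they are HTML-escaped double quotes, which usually means the data sits inside an attribute value rather than in rendered text. Below is a minimal sketch of decoding such a string with the standard library. The `escaped` value is a made-up stand-in, not the real page's payload, and whether the real attribute decodes to valid JSON is an assumption.

```python
import html
import json

# Hypothetical escaped string standing in for part of what print(table) shows.
# The real attribute name and contents are not known here.
escaped = "{&quot;name&quot;: &quot;Example Program&quot;, &quot;id&quot;: 284}"

# html.unescape turns &quot; back into a plain double quote.
decoded = html.unescape(escaped)

# Parsing as JSON only works if the decoded text really is valid JSON.
record = json.loads(decoded)
print(record["name"])
```

If the decoded text is not JSON (for example, an Angular expression), rendering the page in a browser, as the answers here do, is the more dependable route.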

Source

2017-07-06 vlepore

Have you tried 'html.parser' instead of 'lxml'? – martinB0103

Which part of the page do you want? The "Program Overview" section? The part headed "Authorities"? Or something else?

@BillBell I'm after the "Program Overview" section. – vlepore

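I finally managed to solve the problem and successfully grab the data from the JavaScript page. The code below worked for me, in case anyone runs into the same issue when trying to scrape a JavaScript web page with Python on Windows (where dryscrape is not compatible).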

import bs4 as bs
from selenium import webdriver

# Let a real browser run the page's JavaScript, then keep the rendered HTML
browser = webdriver.Chrome()
url = "http://programs.dsireusa.org/system/program/detail/284"
browser.get(url)
html_source = browser.page_source
browser.quit()

soup = bs.BeautifulSoup(html_source, "html.parser")
table = soup.find('div', {'class': 'programOverview'})

# Collect the text of every bound field inside the overview block
data = []
for n in table.find_all("div", {"class": "ng-binding"}):
    data.append(str(n.text))
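One thing the snippet above leaves to chance is timing: `page_source` is read right after `get()` returns, and a JavaScript-rendered block may not be in the DOM yet. Here is a minimal sketch of an explicit wait using Selenium's `WebDriverWait`. The function name `fetch_rendered_html` and the 15-second default are my own choices for illustration, and the `div.programOverview` selector is taken from the class name used above.

```python
def fetch_rendered_html(url, css_selector="div.programOverview", timeout=15):
    """Open url in Chrome, wait for css_selector to appear, return the HTML."""
    # Imported here so this module can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    browser = webdriver.Chrome()
    try:
        browser.get(url)
        # Block until the element exists in the DOM; raises TimeoutException
        # if it has not appeared after `timeout` seconds.
        WebDriverWait(browser, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return browser.page_source
    finally:
        # Close the browser even if the wait times out.
        browser.quit()
```

Calling `fetch_rendered_html("http://programs.dsireusa.org/system/program/detail/284")` would then stand in for the `browser.get(...)` / `page_source` / `quit()` lines, with the BeautifulSoup parsing unchanged.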

Source

2017-07-07 17:16:29 vlepore

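The text is rendered with JavaScript. First render the page with dryscrape. (If you don't want to use dryscrape, see "Web-scraping JavaScript page with Python".) Once the page has been rendered, the text can be extracted from a different position on the page, namely the place it was rendered into.

As an example, this code extracts the HTML of the summary.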

import bs4 as bs
import dryscrape

url = "http://programs.dsireusa.org/system/program/detail/284"
session = dryscrape.Session()
session.visit(url)  # dryscrape runs the page's JavaScript
dsire_get = session.body()  # HTML after rendering
soup = bs.BeautifulSoup(dsire_get, 'html.parser')
table = soup.find_all('div', {'class': 'programSummary ng-binding'})
print(table[0])

Output:

<div class="programSummary ng-binding" data-ng-bind-html="program.summary"><p> 
<strong>Eligibility and Availability</strong></p> 
<p> 
Net metering is available to all "qualifying facilities" (QFs), as defined by the federal <i>Public Utility Regulatory Policies Act of 1978</i> (PURPA), which pertains to renewable energy systems and combined heat and power systems up to 80 megawatts (MW) in capacity. There is no statewide cap on the aggregate capacity of net-metered systems.</p> 
<p> 
All utilities subject to Public ...

Source

2017-07-06 17:22:57

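This looks like what I'm after and it seems to work, but dryscrape doesn't officially support Windows, so I can't use it. I'm going to follow the approach laid out in the post you referenced, the one that doesn't use dryscrape. – vlepore

That's why I included the link. Whether you use dryscrape, Selenium, PyQt or something else, the approach is the same.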
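The point that the approach is the same whichever renderer is used can be made concrete by keeping the parsing step separate from the rendering step. This is a rough sketch, not code from the thread: `overview_fields` is a hypothetical helper name, the `sample` markup is invented for illustration, and only the class names `programOverview` and `ng-binding` come from the answers above.

```python
import bs4 as bs


def overview_fields(html_source):
    """Return the text of each div.ng-binding inside div.programOverview."""
    soup = bs.BeautifulSoup(html_source, "html.parser")
    overview = soup.find("div", {"class": "programOverview"})
    if overview is None:
        # Typical when the HTML was fetched without running the page's JavaScript.
        return []
    return [
        div.get_text(strip=True)
        for div in overview.find_all("div", {"class": "ng-binding"})
    ]


# Invented fragment standing in for rendered HTML from any renderer.
sample = """
<div class="programOverview">
  <div class="ng-binding">Example State</div>
  <div class="ng-binding">Example Category</div>
</div>
"""

print(overview_fields(sample))
print(overview_fields("<html><body></body></html>"))
```

The string passed in can come from `browser.page_source` (Selenium) or `session.body()` (dryscrape); the function does not need to know which.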
