Scraping articles with Python 3.4, BeautifulSoup, and requests

I want to scrape this website:

https://xueqiu.com/yaodewang

I am using BeautifulSoup and requests as shown below. I want to scrape all of his articles:

import requests 
from bs4 import BeautifulSoup 
url = 'https://xueqiu.com/yaodewang' 
header = {'user-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36'} 
r = requests.get(url,headers = header).content 
soup = BeautifulSoup(r,'lxml') 
artile = soup.find_all('ul',{'class':'status-list'}) 
print(artile)

The result is nothing! It returns:

[]

So I tried other rules like these:

# art = soup.find_all('div',{'class':'allStatuses no-head'}) 
# art = soup.find_all('div',{'class':'status_bd'}) 
# art = soup.find_all('div',{'class':'status_content container active tab-pane'})

But these return some text that is not the right content. How can I get the article content?

Thank you!

Source

2016-05-01 champion Ch

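Actually, the desired data is not inside the element with the status-list class. If you inspect the page source, you will find an empty container instead:

<div class="status_bd"> 
    <div id="statusLists" class="allStatuses no-head"></div> 
</div>

Instead, the statuses are located inside a script element. You need to locate that element, extract the desired object, load it from JSON into a Python dictionary, and then pull out the information you need: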

import json
import re

import requests
from bs4 import BeautifulSoup

url = 'https://xueqiu.com/yaodewang'
headers = {
    'user-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36'
}
r = requests.get(url, headers=headers).content
soup = BeautifulSoup(r, 'lxml')

# The statuses are assigned to SNB.data.statuses inside a script element;
# capture the object on the right-hand side of that assignment.
pattern = re.compile(r"SNB\.data\.statuses = ({.*?});", re.MULTILINE | re.DOTALL)
# Find the script element whose text matches the pattern.
script = soup.find("script", string=pattern)

# Load the captured JSON into a dictionary and print each description.
data = json.loads(pattern.search(script.text).group(1))
for item in data["statuses"]:
    print(item["description"])

Prints:

The best advice: Remember common courtesy and act toward others as you want them to act toward you. 
Lighten up! It&#39;s the weekend. we&#39;re just having a little fun! Industrial Bank is expected to rise,next week... 
... 
点.点.点... 点到这个，学位、学历、成绩单翻译一下要50块、100块的...
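
The same extraction can be tried without hitting the network. What follows is only a minimal sketch under stated assumptions: the HTML string is a made-up stand-in for the page (not the real xueqiu.com markup), extract_descriptions is a hypothetical helper name, and html.parser is used instead of lxml just to avoid the extra dependency. It also guards against the case where no matching script element is found.

```python
import json
import re

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (assumption, not real markup).
SAMPLE_HTML = """
<html><body>
<div class="status_bd">
  <div id="statusLists" class="allStatuses no-head"></div>
</div>
<script>
SNB.data.statuses = {"statuses": [{"description": "first post"}, {"description": "second post"}]};
</script>
</body></html>
"""

PATTERN = re.compile(r"SNB\.data\.statuses = ({.*?});", re.MULTILINE | re.DOTALL)


def extract_descriptions(html):
    """Return the description of every status embedded in the page's script."""
    soup = BeautifulSoup(html, "html.parser")
    # Pick the script element whose content matches the assignment.
    script = soup.find("script", string=PATTERN)
    if script is None:
        # No such script on this page: nothing to extract.
        return []
    match = PATTERN.search(script.string)
    data = json.loads(match.group(1))
    return [item["description"] for item in data["statuses"]]


descriptions = extract_descriptions(SAMPLE_HTML)
print(descriptions)
```

If the site changes how it embeds the data, the pattern will stop matching and the helper returns an empty list rather than raising on a None script.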

Source

2016-05-01 02:24:49 alecxe

Thank you very much, that is the right method! But I would like to know: once I know the content is located in the script, how do I come up with a regular expression like this one: pattern = re.compile(r"SNB\.data\.statuses = ({.*?});", re.MULTILINE | –

Another question: I want to get a list of the articles, but right now I only have separate strings. I want a result like this: [str01, str02 .....] –

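@championCh Sure, just extract the script text and work on it in [regex101](https://regex101.com/) or a similar tool. As for your second question, I think you are asking how to put the results into a list: `items = [item["description"] for item in data["statuses"]]`. Hope that helps. – alecxe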
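
To make both points concrete, here is a small sketch that runs the pattern against a made-up script text and then builds the list. The script_text value is an assumption for illustration, not the real page content; the comments break the pattern into its pieces.

```python
import json
import re

# Hypothetical script text, shaped like the assignment the pattern targets.
script_text = (
    'var x = 1; SNB.data.statuses = {"statuses": '
    '[{"description": "str01"}, {"description": "str02"}]}; var y = 2;'
)

pattern = re.compile(
    r"SNB\.data\.statuses = "  # the variable name; dots escaped to match a literal "."
    r"({.*?})"                 # group 1: shortest text from "{" to a "}" followed by ";"
    r";",                      # the semicolon that ends the assignment
    # DOTALL lets "." match newlines inside the object.
    # MULTILINE only affects ^ and $, so it has no effect on this pattern.
    re.MULTILINE | re.DOTALL,
)

match = pattern.search(script_text)
data = json.loads(match.group(1))

# Collect the descriptions into one list instead of printing them one by one.
items = [item["description"] for item in data["statuses"]]
print(items)  # ['str01', 'str02']
```

One caveat with the non-greedy group: if a description itself contained the characters "};", the match would end early and json.loads would fail, so this approach is only as robust as the page's content allows.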
