How do I get information from a list of links and dump it into a JSON object?

I'm new to Python and BeautifulSoup. Any help is highly appreciated.

I have an idea of how to build the list of company information, but only after clicking on one of the links.

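import requests
from bs4 import BeautifulSoup


url = "http://data-interview.enigmalabs.org/companies/"
r = requests.get(url)

# Name the parser explicitly to avoid the "no parser specified" warning.
soup = BeautifulSoup(r.content, "html.parser")

links = soup.find_all("a")

link_list = []

# Print each link's target and its text.
for link in links:
    print(link.get("href"), link.text)

g_data = soup.find_all("div", {"class": "table-responsive"})

# Collect the links; list.append returns None, so there is nothing to print.
for link in links:
    link_list.append(link)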

Can anyone give me an idea of how to build the JSON for all the company list data on the first page, and then how to go about scraping the links?
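One way to approach the first step, collecting the per-company links from a listing page, is sketched below. It parses an HTML string instead of fetching the live site, so the sample markup, the `collect_links` helper, and the `companies/...` href layout are all assumptions for illustration, not the actual page structure.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = "http://data-interview.enigmalabs.org/"

# Hypothetical listing markup; the real page may be structured differently.
SAMPLE_LISTING = """
<div class="table-responsive">
  <table>
    <tr><td><a href="companies/Dickens-Tillman">Dickens-Tillman</a></td></tr>
    <tr><td><a href="companies/Klein-Powlowski">Klein-Powlowski</a></td></tr>
  </table>
</div>
"""


def collect_links(html, base=BASE):
    """Return absolute URLs for every <a href> inside the first <table>."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return []
    # href=True skips anchors that have no href attribute.
    return [urljoin(base, a["href"]) for a in table.find_all("a", href=True)]


company_links = collect_links(SAMPLE_LISTING)
print(company_links)
```

Restricting the search to the table, rather than every `a` on the page, should keep navigation and pagination links out of the list, assuming the company links are the only anchors inside it.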

I've attached a sample image for better visualization.

How can I scrape the site and create JSON like the example below without clicking on each individual link?

Example expected output:

all_listing = [ {"Dickens-Tillman":{'Company Detail':
{'Company Name': 'Dickens-Tillman',
    'Address Line 1 ': '7147 Guilford Turnpike Suit816',
    'Address Line 2 ': 'Suite 708',
    'City': 'Connfurt',
    'State': 'Iowa',
    'Zipcode ': '22598',
    'Phone': '00866539483',
    'Company Website ': 'lockman.com',
    'Company Description': 'enable robust paradigms'}}},
{"Klein-Powlowski":{'Company Detail':
{'Company Name': 'Klein-Powlowski',
    'Address Line 1 ': '32746 Gaylord Harbors',
    'Address Line 2 ': 'Suite 866',
    'City': 'Lake Mario',
    'State': 'Kentucky',
    'Zipcode ': '45517',
    'Phone': '1-299-479-5649',
    'Company Website ': 'marquardt.biz',
    'Company Description': 'monetize scalable paradigms'}}}]

print(all_listing)
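
As a small sketch of the target shape: each entry above is a one-key dictionary mapping the company name to a `Company Detail` dictionary. The `make_entry` helper below is a hypothetical name introduced for illustration, and only a few of the fields are copied over. Printing a Python list shows its `repr` with single quotes, whereas `json.dumps` produces actual JSON text.

```python
import json


def make_entry(name, fields):
    """Wrap one company's fields in the nested shape of the expected output."""
    return {name: {"Company Detail": fields}}


# A few field values copied from the expected output (abbreviated on purpose).
all_listing = [
    make_entry(
        "Dickens-Tillman",
        {"Company Name": "Dickens-Tillman", "City": "Connfurt", "State": "Iowa"},
    ),
    make_entry(
        "Klein-Powlowski",
        {"Company Name": "Klein-Powlowski", "City": "Lake Mario", "State": "Kentucky"},
    ),
]

# Serialise to JSON text; indent=4 only affects readability.
json_text = json.dumps(all_listing, indent=4)
print(json_text)
```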

Source

2017-07-07 Vash

Um... could you give us the actual URL? –

@cᴏʟᴅsᴘᴇᴇᴅ Yes, sharing the actual URL is fine. [link](http://data-interview.enigmalabs.org/companies/) – Vash

This looks like a job for selenium + bs4. –

Here is my final solution to the question I asked.

import json
from os.path import basename as bn
from urllib.parse import urljoin

import bs4
import requests

links = []
data = {}
base = 'http://data-interview.enigmalabs.org/'

# Approach
# 1. On each listing page, collect the links.
# 2. Iterate over each link in the list.
# 3. Before moving on, check that the list of links is correct; if not, revisit step 2, then step 1.
# 4. Write the correct data to a JSON file.


def bs(r):
    # Fetch a path relative to base and return the first <table> on that page.
    return bs4.BeautifulSoup(requests.get(urljoin(base, r)).content, 'html.parser').find('table')


for i in range(1, 11):
    print('Collecting page %d' % i)
    # Collect the href of every <a> in the table on this page.
    links += [a['href'] for a in bs('companies?page=%d' % i).find_all('a')]

# Now that all links are collected in one list, iterate over each link.
# All the info is in an HTML table, so collect every row of company info in data.
for link in links:
    print('Processing %s' % link)
    name = bn(link)
    data[name] = {}
    for row in bs(link).find_all('tr'):
        desc, cont = row.find_all('td')
        data[name][desc.text] = cont.text

print(json.dumps(data))

# Final step: write the formatted data to a file.
json_data = json.dumps(data, indent=4)
with open("solution.json", "w") as f:
    f.write(json_data)
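
The solution above unpacks each row's `td` cells into exactly two names, which raises a `ValueError` if any row has a different number of cells, for example a header row made of `th` cells. A minimal sketch of a more tolerant version follows. It works on an HTML string, and the sample markup and the `parse_detail_table` helper are assumptions for illustration, not the site's actual layout.

```python
from bs4 import BeautifulSoup

# Hypothetical detail-page markup; the real page may differ.
SAMPLE_DETAIL = """
<table>
  <tr><th>Field</th><th>Value</th></tr>
  <tr><td>Company Name</td><td>Dickens-Tillman</td></tr>
  <tr><td>City</td><td> Connfurt </td></tr>
</table>
"""


def parse_detail_table(html):
    """Map the first cell of each two-cell row to the second cell."""
    table = BeautifulSoup(html, "html.parser").find("table")
    details = {}
    if table is None:
        return details
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) != 2:
            continue  # skip header rows or rows with an unexpected layout
        # strip=True trims surrounding whitespace from the cell text.
        key, value = (cell.get_text(strip=True) for cell in cells)
        details[key] = value
    return details


details = parse_detail_table(SAMPLE_DETAIL)
print(details)
```

Skipping rows silently is a design choice; logging the skipped rows instead would make unexpected page layouts easier to notice.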

Source

2017-07-13 22:19:18 Vash
