2016-12-05

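I'm currently trying to pull HTML information from a website using BeautifulSoup, but for some reason the output comes out in a broken-up format, and I'm not sure how to merge each row into a single value with pandas or another module.

My current code is as follows: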

from bs4 import BeautifulSoup 
import urllib 
import csv 
import pandas as pd 
url = 'http://www.hkexnews.hk/listedco/listconews/mainindex/SEHK_LISTEDCO_DATETIME_TODAY.HTM' 

html = urllib.urlopen(url) 
soup = BeautifulSoup(html,'html.parser') 

r0 = soup.find_all("tr", class_="row0") 
#removed r1 just to make sure everything works first 
#r1 = soup.find_all("tr", class_="row1") 


f = csv.writer(open('news.csv','w')) 


for a in r0: 
    f.writerow(a.encode('utf-8')) 

First, I don't know how to merge each row into a single cell. Second, is there another way to pull the information so that no merging is needed?

Answer

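import requests
from bs4 import BeautifulSoup

url = 'http://www.hkexnews.hk/listedco/listconews/mainindex/SEHK_LISTEDCO_DATETIME_TODAY.HTM'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')  # the 'lxml' parser requires the lxml package

# class_ takes bare class names, so no leading dot as in a CSS selector
rows = soup.find_all(class_=['row0', 'row1'])
for row in rows:
    # one list per table row, one item per <td> cell
    cell = [i.text for i in row.find_all('td')]
    print(cell)

Output: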

['06/12/201608:41', '01159', 'JIMEI INT ENT', 'Announcements and Notices - [Resumption]EXCHANGE NOTICE - RESUMPTION OF TRADING\xa0(1KB, HTM)'] 
['06/12/201608:15', '03933', 'UNITED LAB', 'Announcements and Notices - [Issue of Convertible Securities]COMPLETION OF THE ISSUE OF U.S.$130,000,000 CONVERTIBLE BONDS DUE 2021\xa0(80KB, PDF)'] 
['06/12/201608:10', '00005', 'HSBC HOLDINGS', 'Announcements and Notices - [Overseas Regulatory Announcement - Other]Transaction in own shares\xa0(860KB, PDF)'] 
['06/12/201607:59', '00763', 'ZTE', 'Announcements and Notices - [Overseas Regulatory Announcement - Board/Supervisory Board Resolutions]Announcement Resolutions of the Eleventh Meeting of the Seventh Session of the Board of Directors\xa0(186KB, PDF)'] 
['06/12/201607:08', '01378', 'CHINAHONGQIAO', 'Announcements and Notices - [Major Transaction]MAJOR TRANSACTION-(1) SUBSCRIPTION OF SHARES OF LOFTEN; AND (2) ACQUISITION OF THE ENTIRE EQUITY INTEREST IN INNOVATIVE METAL\xa0(75KB, PDF)'] 
['06/12/201607:04', '01345', 'PIONEER PHARM', 'Circulars - [Connected Transaction](1) DISCLOSEABLE AND CONNECTED TRANSACTION DISPOSAL OF 100% INTEREST IN A WHOLLY-OWNED SUBSIDIARY AND (2) NOTICE OF EGM\xa0(220KB, PDF)'] 
['06/12/201606:11', '00993', 'HUARONG INT FIN', 'Announcements and Notices - [Discloseable Transaction]DISCLOSEABLE TRANSACTION IN RELATION TO\r\nSUBSCRIPTION FOR NOTES\xa0(144KB, PDF)'] 
['06/12/201606:08', '00300', 'KUNMING MACHINE', 'Announcements and Notices - [Overseas Regulatory Announcement - Other]Announcement on Receiving An Enquiry Letter on \r\nRelated Supplemental Announcement from Shanghai Stock Exchange\xa0(394KB, PDF)'] 
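
The question's loop passes a single encoded string to `writerow`, and `csv.writer` treats a string as a sequence of characters, which is the likely cause of the broken-up output. Below is a minimal sketch of writing one field per `<td>` instead. `SAMPLE_HTML` and `write_rows` are hypothetical names, the markup is a simplified stand-in for the real page (it assumes the date and time are separated by a `<br>`), and `html.parser` is used only to avoid the extra lxml dependency.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the page's table markup.
SAMPLE_HTML = """
<table>
  <tr class="row0"><td>06/12/2016<br>08:41</td><td>01159</td><td>JIMEI INT ENT</td><td>EXCHANGE NOTICE</td></tr>
  <tr class="row1"><td>06/12/2016<br>08:15</td><td>03933</td><td>UNITED LAB</td><td>COMPLETION OF THE ISSUE</td></tr>
</table>
"""


def write_rows(html, out):
    """Write one CSV record per table row and return the number of rows."""
    soup = BeautifulSoup(html, 'html.parser')
    writer = csv.writer(out)
    count = 0
    for row in soup.find_all('tr', class_=['row0', 'row1']):
        # writerow expects a sequence of fields, so build a list with
        # one item per <td>; the space keeps date and time apart.
        cells = [td.get_text(' ', strip=True) for td in row.find_all('td')]
        writer.writerow(cells)
        count += 1
    return count


buffer = io.StringIO()
n_rows = write_rows(SAMPLE_HTML, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk in Python 3, the same function should work with a file opened as `open('news.csv', 'w', newline='', encoding='utf-8')` in place of the `StringIO` buffer.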

Update:

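import requests
from bs4 import BeautifulSoup

url = 'http://www.hkexnews.hk/listedco/listconews/mainindex/SEHK_LISTEDCO_DATETIME_TODAY.HTM'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')  # the 'lxml' parser requires the lxml package

# class_ takes bare class names, so no leading dot as in a CSS selector
rows = soup.find_all(class_=['row0', 'row1'])
for row in rows:
    # Join every text fragment in the row with tabs, so the date and time
    # become separate items; maxsplit=5 keeps the rest in the last item.
    data = row.get_text(separator='\t').split('\t', 5)
    print(data)

Output: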

['07/12/2016', '17:42', '00207', 'JOY CITY PPT', 'Announcements and Notices - [List of Directors and their Role and Function]', 'List of Directors and their Roles and Functions\t\xa0(62KB, PDF)'] 
['07/12/2016', '17:40', '02880', 'DALIAN PORT', 'Announcements and Notices - [Overseas Regulatory Announcement - Corporate Governance Related Matters]', 'An announcement has just been published by the issuer in the Chinese section of this website, a corresponding version of which may or may not be published in this section\t\xa0(1KB, HTM)'] 
['07/12/2016', '17:38', '00193', 'CAPITAL ESTATE', 'Announcements and Notices - [Results of AGM]', 'POLL RESULTS OF THE ANNUAL GENERAL\r\nMEETING HELD ON 7 DECEMBER, 2016\t\xa0(95KB, PDF)'] 
['07/12/2016', '17:35', '00207', 'JOY CITY PPT', 'Announcements and Notices - [Dividend or Distribution/Closure of Books or Change of Book Closure Period]', 'SPECIAL DIVIDEND AND CLOSURE OF REGISTER OF MEMBERS\t\xa0(133KB, PDF)'] 
['07/12/2016', '17:29', '00052', 'FAIRWOOD HOLD', 'Next Day Disclosure Returns - [Share Buyback]', 'Next Day Disclosure Return\t\xa0(125KB, PDF)'] 
['07/12/2016', '17:21', '00756', 'TIANYI SUMMI', 'Announcements and Notices - [Other - Miscellaneous]', 'VOLUNTARY ANNOUNCEMENT - INCREASE IN SHAREHOLDING OF A CONTROLLING SHAREHOLDER\t\xa0(120KB, PDF)'] 
['07/12/2016', '17:16', '00702', 'SINO OIL & GAS', 'Next Day Disclosure Returns - [Share Buyback]', 'NEXT DAY DISCLOSURE RETURN\t\xa0(294KB, PDF)'] 
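
For the link attached to each row, one possible approach is to read the `href` of the first `<a>` in the row and resolve it against the page URL. This is a sketch under stated assumptions: `SAMPLE_HTML`, `rows_with_links`, and the `/example/doc1.pdf` path are hypothetical, and the real page's markup may differ from this simplified stand-in.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'http://www.hkexnews.hk/listedco/listconews/mainindex/SEHK_LISTEDCO_DATETIME_TODAY.HTM'

# Hypothetical, simplified stand-in for one row of the page's table.
SAMPLE_HTML = """
<table>
  <tr class="row0">
    <td>07/12/2016<br>17:42</td><td>00207</td><td>JOY CITY PPT</td>
    <td><a href="/example/doc1.pdf">List of Directors and their Roles and Functions</a></td>
  </tr>
</table>
"""


def rows_with_links(html, base_url):
    """Return one dict per row with date, time, code, name, title and url."""
    soup = BeautifulSoup(html, 'html.parser')
    records = []
    for row in soup.find_all('tr', class_=['row0', 'row1']):
        cells = row.find_all('td')
        # stripped_strings yields each text fragment separately, so the
        # date and time on either side of the <br> arrive as two items.
        parts = list(cells[0].stripped_strings)
        link = row.find('a', href=True)  # first anchor with an href, if any
        records.append({
            'date': parts[0] if parts else '',
            'time': parts[1] if len(parts) > 1 else '',
            'code': cells[1].get_text(strip=True),
            'name': cells[2].get_text(strip=True),
            'title': link.get_text(strip=True) if link else None,
            # urljoin turns a relative href into an absolute URL
            'url': urljoin(base_url, link['href']) if link else None,
        })
    return records


records = rows_with_links(SAMPLE_HTML, BASE_URL)
for record in records:
    print(record)
```

A list of dicts like this can also be passed directly to `pandas.DataFrame(records)` if a table is preferred.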

Thanks for the reply. Is there a way to separate the date and time values, and also to add the link associated with each title? – kimpster

I'm not sure what you mean by the title, but I've updated the code to separate the date and time values.
