BeautifulSoup HTML table parsing: only able to get the last row?

I have a simple HTML table to parse, but somehow BeautifulSoup only gives me results from the last row. Could someone take a look and tell me what is going wrong?

<table class='participants-table'> 
    <thead> 
     <tr> 
      <th data-field="name" class="sort-direction-toggle name">Name</th> 
      <th data-field="type" class="sort-direction-toggle type active-sort asc">Type</th> 
      <th data-field="sector" class="sort-direction-toggle sector">Sector</th> 
      <th data-field="country" class="sort-direction-toggle country">Country</th> 
      <th data-field="joined_on" class="sort-direction-toggle joined-on">Joined On</th> 
     </tr> 
    </thead> 
    <tbody> 
     <tr> 
      <th class='name'><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th> 
      <td class='type'>Company</td> 
      <td class='sector'>General Industrials</td> 
      <td class='country'>Netherlands</td> 
      <td class='joined-on'>2000-09-20</td> 
     </tr> 
     <tr> 
      <th class='name'><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th> 
      <td class='type'>Company</td> 
      <td class='sector'>Pharmaceuticals &amp; Biotechnology</td> 
      <td class='country'>Portugal</td> 
      <td class='joined-on'>2004-02-19</td> 
     </tr> 
    </tbody> 
    </table>

I use the following code to get the rows:

table=soup.find_all("table", class_="participants-table") 
table1=table[0] 
rows=table1.find_all('tr') 
rows=rows[1:]

This gives:

rows=[<tr> 
<th class="name"><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th> 
<td class="type">Company</td> 
<td class="sector">General Industrials</td> 
<td class="country">Netherlands</td> 
<td class="joined-on">2000-09-20</td> 
</tr>, <tr> 
<th class="name"><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th> 
<td class="type">Company</td> 
<td class="sector">Pharmaceuticals &amp; Biotechnology</td> 
<td class="country">Portugal</td> 
<td class="joined-on">2004-02-19</td> 
</tr>]

That is what I expected, so it looks like the rows are already being pulled out of the HTML table as objects. But when I continue with:

for row in rows: 
    cells = row.find_all('th')

I only get the last entry!

cells=[<th class="name"><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th>]

What is going on? This is my first time using BeautifulSoup, and my goal is to export this table to CSV. Any help is greatly appreciated! Thanks

Source

2016-08-03 AD233

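How is 'rows' defined? – alecxe

Thanks! I will add more detail about the table and the code. – AD233

It is doing exactly what you ask it to do. Are you trying to get all the 'td' tags? –

If you want all the th tags in a single list, you must not keep reassigning cells = row.find_all('th') on every iteration. When you print cells outside the loop, you only see what was assigned last, i.e. the th from the last tr. Accumulate them instead: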

cells = []
for row in rows:
    cells.extend(row.find_all('th'))
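
To see the difference concretely, here is a minimal, self-contained sketch. The html string is a trimmed copy of the table from the question (only the name column is kept), and the names last_only and all_cells are made up for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's table (assumption: only the name column kept).
html = """
<table class='participants-table'>
 <thead>
  <tr><th class="name">Name</th></tr>
 </thead>
 <tbody>
  <tr><th class='name'><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th></tr>
  <tr><th class='name'><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th></tr>
 </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# Drop the first <tr>, which is the header row.
rows = soup.find("table", class_="participants-table").find_all("tr")[1:]

# Reassigning inside the loop: once the loop ends, only the last row's result is left.
for row in rows:
    last_only = row.find_all("th")

# Accumulating instead: every row's <th> ends up in one list.
all_cells = []
for row in rows:
    all_cells.extend(row.find_all("th"))

print([th.get_text() for th in last_only])  # ['Groupe Bial']
print([th.get_text() for th in all_cells])  # ['Grontmij', 'Groupe Bial']
```

The loop in the question was visiting every row all along; the variable was simply overwritten on each pass.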

Also, since there is only one table, you can just use find:

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="participants-table")

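If you want to skip the thead row, you can use a CSS selector:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
# The data rows live inside <tbody>, so select them there to skip the <thead> row.
rows = soup.select("table.participants-table tbody tr")
cells = [tr.th for tr in rows]
print(cells)

cells gives you: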

[<th class="name"><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th>, <th class="name"><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th>]
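
If what you ultimately need from each th is the link text and its href, something along these lines should pull both out. This is only a sketch: the html string is a trimmed copy of the question's table, and the (name, url) tuple layout and the names body_rows and links are illustrative choices.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's table (assumption: only two columns kept).
html = """
<table class='participants-table'>
 <thead>
  <tr><th class="name">Name</th><th class="type">Type</th></tr>
 </thead>
 <tbody>
  <tr>
   <th class='name'><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th>
   <td class='type'>Company</td>
  </tr>
  <tr>
   <th class='name'><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th>
   <td class='type'>Company</td>
  </tr>
 </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Select only the rows inside <tbody>, which skips the header row in <thead>.
body_rows = soup.select("table.participants-table tbody tr")

# Pair each participant's name with the href of the link inside its <th>.
links = [(tr.th.get_text(strip=True), tr.th.a["href"]) for tr in body_rows]
print(links)
```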

To write the whole table to a CSV file:

import csv
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
rows = soup.select("table.participants-table tr")

# newline="" lets the csv module control line endings itself.
with open("data.csv", "w", newline="") as out:
    wr = csv.writer(out)
    # Header row, plus an extra column for the link.
    wr.writerow([th.text for th in rows[0].find_all("th")] + ["URL"])
    for row in rows[1:]:
        # Match only th/td; a bare find_all() would also return the nested <a>.
        wr.writerow([tag.text for tag in row.find_all(["th", "td"])] + [row.th.a["href"]])

For your sample, that gives you:

Name,Type,Sector,Country,Joined On,URL
Grontmij,Company,General Industrials,Netherlands,2000-09-20,/what-is-gc/participants/4479-Grontmij
Groupe Bial,Company,Pharmaceuticals & Biotechnology,Portugal,2004-02-19,/what-is-gc/participants/4492-Groupe-Bial

Source
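
If you would like to check the CSV text without touching the disk, one option is to write into an in-memory buffer first. The sketch below assumes the same table as in the question; table_to_csv is a hypothetical helper name, not part of BeautifulSoup or the csv module.

```python
import csv
import io

from bs4 import BeautifulSoup

# Copy of the question's table, with the sorting attributes on the header cells omitted.
html = """
<table class='participants-table'>
 <thead>
  <tr>
   <th class="name">Name</th><th class="type">Type</th><th class="sector">Sector</th>
   <th class="country">Country</th><th class="joined-on">Joined On</th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <th class='name'><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th>
   <td class='type'>Company</td><td class='sector'>General Industrials</td>
   <td class='country'>Netherlands</td><td class='joined-on'>2000-09-20</td>
  </tr>
  <tr>
   <th class='name'><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th>
   <td class='type'>Company</td><td class='sector'>Pharmaceuticals &amp; Biotechnology</td>
   <td class='country'>Portugal</td><td class='joined-on'>2004-02-19</td>
  </tr>
 </tbody>
</table>
"""


def table_to_csv(html_text):
    """Return the participants table as CSV text, with the link as an extra URL column."""
    soup = BeautifulSoup(html_text, "html.parser")
    rows = soup.select("table.participants-table tr")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # Header row: the column titles plus one extra column for the link.
    writer.writerow([th.get_text(strip=True) for th in rows[0].find_all("th")] + ["URL"])
    for row in rows[1:]:
        # Only <th>/<td> tags, so the nested <a> is not counted as a separate cell.
        cells = [c.get_text(strip=True) for c in row.find_all(["th", "td"])]
        writer.writerow(cells + [row.th.a["href"]])
    return out.getvalue()


csv_text = table_to_csv(html)
print(csv_text)
```

Once the text looks right, the same writer calls can target a file opened with open("data.csv", "w", newline="") instead of the buffer.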

2016-08-03 21:13:32

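Thanks! This gets me much closer to my goal! What I actually want is to export this table to a typical CSV format, with the "Name" and the HTML link as separate columns. Is there a way to do this with the "extend" approach you just suggested? Thanks! – AD233

@AD233 So basically you want to recreate the table in CSV? –

That is correct, except that I want to extract the href link as a separate column. Thanks! – AD233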
