Problems merging scraped data using Pandas and numpy in Python

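I am trying to gather information from a large number of different URLs and combine the data based on the year and the golfer's name. Right now I am writing the information to csv and then trying to match it up using pd.merge(), but I have to use a unique name for each dataframe in order to merge. I tried using a numpy array, but I am stuck on the final step of merging all the separate data.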

import csv 
from urllib.request import urlopen 
from bs4 import BeautifulSoup 
import datetime 
import socket 
import urllib.error 
import pandas as pd 
import urllib 
import sqlalchemy 
import numpy as np 

base = 'http://www.pgatour.com/' 
inn = 'stats/stat' 
end = '.html' 
years = ['2017','2016','2015','2014','2013'] 

alpha = [] 
#all pages with links to tables 
urls = ['http://www.pgatour.com/stats.html','http://www.pgatour.com/stats/categories.ROTT_INQ.html','http://www.pgatour.com/stats/categories.RAPP_INQ.html','http://www.pgatour.com/stats/categories.RARG_INQ.html','http://www.pgatour.com/stats/categories.RPUT_INQ.html','http://www.pgatour.com/stats/categories.RSCR_INQ.html','http://www.pgatour.com/stats/categories.RSTR_INQ.html','http://www.pgatour.com/stats/categories.RMNY_INQ.html','http://www.pgatour.com/stats/categories.RPTS_INQ.html'] 
for i in urls: 
    data = urlopen(i) 
    soup = BeautifulSoup(data, "html.parser") 
    for link in soup.find_all('a'): 
     if link.has_attr('href'): 
      alpha.append(base + link['href'][17:]) #may need adjusting 
#data links 
beta = [] 
for i in alpha: 
    if inn in i: 
     beta.append(i) 
#no repeats 
gamma= [] 
for i in beta: 
    if i not in gamma: 
     gamma.append(i) 

#making list of urls with Statistic labels 
jan = [] 
for i in gamma: 
    try: 
     data = urlopen(i) 
     soup = BeautifulSoup(data, "html.parser") 
     for table in soup.find_all('section',{'class':'module-statistics-off-the-tee-details'}): 
      for j in table.find_all('h3'): 
       y=j.get_text().replace(" ","").replace("-","").replace(":","").replace(">","").replace("<","").replace(">","").replace(")","").replace("(","").replace("=","").replace("+","") 
       jan.append([i,str(y+'.csv')]) 
       print([i,str(y+'.csv')]) 
    except Exception as e: 
      print(e) 
      pass 

# practice url 
#jan = [['http://www.pgatour.com/stats/stat.02356.html', 'Last15EventsScoring.csv']] 
#grabbing data 
#write to csv 
row_sp = [] 
rows_sp =[] 
title1 = [] 
title = [] 
for i in jan: 
    try: 
     with open(i[1], 'w+') as fp: 
      writer = csv.writer(fp) 
      for y in years: 
       data = urlopen(i[0][:-4] +y+ end) 
       soup = BeautifulSoup(data, "html.parser") 
       data1 = urlopen(i[0]) 
       soup1 = BeautifulSoup(data1, "html.parser") 
       for table in soup1.find_all('table',{'id':'statsTable'}): 
        title.append('year') 
        for k in table.find_all('tr'): 
         for n in k.find_all('th'): 
          title1.append(n.get_text()) 
          for l in title1: 
           if l not in title: 
            title.append(l) 
        rows_sp.append(title) 
       for table in soup.find_all('table',{'id':'statsTable'}): 
        for h in table.find_all('tr'): 
         row_sp = [y] 
         for j in h.find_all('td'): 
          row_sp.append(j.get_text().replace(" ","").replace("\n","").replace("\xa0"," ").replace("d","")) 
         rows_sp.append(row_sp) 
         print(row_sp) 
         writer.writerows([row_sp]) 
    except Exception as e: 
     print(e) 
     pass 

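from functools import reduce # reduce is not a builtin in Python 3 

# df1, df2, df3 stand for the per-statistic dataframes; they are not built by the code above 
dfs = [df1,df2,df3] # store dataframes in one list 
df_merge = reduce(lambda left,right: pd.merge(left,right,on=['v1'], how='outer'), dfs)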

URLs, type of stat, desired format ... I'm just trying to get all the data onto one row. The URLs for the data below are ['http://www.pgatour.com/stats/stat.02356.html', 'http://www.pgatour.com/stats/stat.02568.html', ..., 'http://www.pgatour.com/stats/stat.111.html'], where the "..." stands for the ones in between.

Stat titles

LAST 15 EVENTS - SCORING, SG: APPROACH-THE-GREEN, ..., SAND SAVE PERCENTAGE 
year rankthisweek ranklastweek name   events rating rounds avg 
2017 2    3    Rickie Fowler 10  8.8  62 .614  
TOTAL SG:APP MEASURED ROUNDS .... %  # SAVES # BUNKERS TOTAL O/U PAR 
26.386   43    ....70.37 76   108   +7.00

Source

2017-08-15 William Bernard

Where in your code are you using pandas? Where is the attempt? – Parfait

There is no attempt yet, but dataframes = [df1, df2, df3] # store dataframes in one list df_merge = reduce(lambda left, right: pd.merge(left, right, on=['column'], how='outer'), dataframes) is the process I was trying to complete; I just can't get to the point where I can use it –

Why doesn't the chained merge work? Errors? Undesired results? Aren't you reading the csvs into dataframes? – Parfait

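UPDATE (per comments)
This question is partly about a technical method (Pandas merge()), but it also seems like an opportunity to discuss useful workflows for data collection and cleaning. For that reason I'm adding a bit more detail and explanation than a coding solution strictly requires.

You can use basically the same approach as in my original answer to get data from different URL categories. I'd recommend keeping a dict of {url: data} entries as you iterate over your URL list, and then building cleaned data frames from that dict.

There's a bit of legwork involved in setting up the cleaning part, because you need to adjust for the different columns in each URL category. I've demonstrated the approach by hand, using only a few test URLs. If you have, say, thousands of different URL categories, you will need to think about how to collect and organize the column names programmatically. That feels out of scope for this question.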
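
As a rough illustration of what "programmatically" could mean here, the following is a minimal sketch, not part of the approach above. It assumes the scraped header row repeats a name (as the 'year' column does in the sample data) and makes repeats unique by appending a counter. The helper name `dedupe_columns` is hypothetical.

```python
def dedupe_columns(header):
    """Return the header with repeated names made unique: year, year1, year2, ..."""
    seen = {}      # how many times each name has appeared so far
    result = []
    for name in header:
        count = seen.get(name, 0)
        # First occurrence keeps its name; later ones get a numeric suffix.
        result.append(name if count == 0 else f"{name}{count}")
        seen[name] = count + 1
    return result


# Header row as it appears in the scraped sample for stat.02356.html.
header = ['year', 'RANK THIS WEEK', 'RANK LAST WEEK', 'PLAYER NAME',
          'EVENTS', 'RATING', 'year', 'year', 'year', 'year']
columns = dedupe_columns(header)
print(columns)
```

The padded names (year1 through year4) can then be dropped by pattern rather than listed by hand for every category.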

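As long as each URL has a year and a PLAYER NAME field, the following merge should work. As before, let's assume you don't need to write to CSV, and leave your scraping code unoptimized for now.

First, define the URL categories in urls. By "URL category" I mean that http://www.pgatour.com/stats/stat.02356.html is actually used multiple times, by inserting a series of years into the URL itself, e.g. http://www.pgatour.com/stats/stat.02356.2017.html and http://www.pgatour.com/stats/stat.02356.2016.html. In this example, stat.02356.html is the URL category that holds multiple years of player data.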

import pandas as pd 

# test urls given by OP 
# note: each url contains >= 1 data fields not shared by the others 
urls = ['http://www.pgatour.com/stats/stat.02356.html', 
     'http://www.pgatour.com/stats/stat.02568.html', 
     'http://www.pgatour.com/stats/stat.111.html'] 

# we'll store data from each url category in this dict. 
url_data = {}
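
The year-splicing described above can be sketched as a small function. This is only an illustration of the string manipulation, assuming every category URL ends in ".html"; the helper name `yearly_urls` is hypothetical.

```python
def yearly_urls(category_url, years):
    """Build one URL per year from a category URL that ends in '.html'."""
    # Drop the trailing 'html' but keep the dot: '.../stat.02356.'
    stem = category_url[:-len("html")]
    return [stem + year + ".html" for year in years]


example = yearly_urls("http://www.pgatour.com/stats/stat.02356.html",
                      ["2017", "2016"])
print(example)
```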

Now iterate over urls. Inside the urls loop, the code is the same as in my original answer, which in turn comes from the OP; only a few variable names have been adjusted to reflect the new way of capturing the data.

for url in urls: 
    print("url: ", url) 
    url_data[url] = {"row_sp": [], 
        "rows_sp": [], 
        "title1": [], 
        "title": []} 
    try: 
     #with open(i[1], 'w+') as fp: 
      #writer = csv.writer(fp) 
     for y in years: 
      current_url = url[:-4] +y+ end 
      print("current url is: ", current_url) 
      data = urlopen(current_url) 
      soup = BeautifulSoup(data, "html.parser") 
      data1 = urlopen(url) 
      soup1 = BeautifulSoup(data1, "html.parser") 
      for table in soup1.find_all('table',{'id':'statsTable'}): 
       url_data[url]["title"].append('year') 
       for k in table.find_all('tr'): 
        for n in k.find_all('th'): 
         url_data[url]["title1"].append(n.get_text()) 
         for l in url_data[url]["title1"]: 
          if l not in url_data[url]["title"]: 
           url_data[url]["title"].append(l) 
       url_data[url]["rows_sp"].append(url_data[url]["title"]) 
      for table in soup.find_all('table',{'id':'statsTable'}): 
       for h in table.find_all('tr'): 
        url_data[url]["row_sp"] = [y] 
        for j in h.find_all('td'): 
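         # note: .replace("d","") removes every lowercase "d", which is why names show up as e.g. "Aam Scott" in the merged output further down 
         url_data[url]["row_sp"].append(j.get_text().replace(" ","").replace("\n","").replace("\xa0"," ").replace("d","")) 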
        url_data[url]["rows_sp"].append(url_data[url]["row_sp"]) 
        #print(row_sp) 
        #writer.writerows([row_sp]) 
    except Exception as e: 
     print(e) 
     pass

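Now, for each key url in url_data, rows_sp contains the data you're interested in for that particular URL category.
Note that rows_sp is now actually url_data[url]["rows_sp"] when iterating over url_data, but the next few code blocks come from my original answer and so use the old rows_sp variable name.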

# example rows_sp 
[['year', 
    'RANK THIS WEEK', 
    'RANK LAST WEEK', 
    'PLAYER NAME', 
    'EVENTS', 
    'RATING', 
    'year', 
    'year', 
    'year', 
    'year'], 
['2017'], 
['2017', '1', '1', 'Sam Burns', '1', '9.2'], 
['2017', '2', '3', 'Rickie Fowler', '10', '8.8'], 
['2017', '2', '2', 'Dustin Johnson', '10', '8.8'], 
['2017', '2', '3', 'Whee Kim', '2', '8.8'], 
['2017', '2', '3', 'Thomas Pieters', '3', '8.8'], 
... 
]

Writing rows_sp directly to a data frame shows that the data is not quite in the right format:

pd.DataFrame(rows_sp).head() 
     0    1    2    3  4  5  6 \ 
0 year RANK THIS WEEK RANK LAST WEEK  PLAYER NAME EVENTS RATING year 
1 2017   None   None   None None None None 
2 2017    1    1  Sam Burns  1  9.2 None 
3 2017    2    3 Rickie Fowler  10  8.8 None 
4 2017    2    2 Dustin Johnson  10  8.8 None 

     7  8  9 
0 year year year 
1 None None None 
2 None None None 
3 None None None 
4 None None None 

pd.DataFrame(rows_sp).dtypes 
0 object 
1 object 
2 object 
3 object 
4 object 
5 object 
6 object 
7 object 
8 object 
9 object 
dtype: object

With a bit of cleanup, we can get rows_sp into a data frame with appropriate numeric data types:

df = pd.DataFrame(rows_sp, columns=rows_sp[0]).drop(0) 
df.columns = ["year","RANK THIS WEEK","RANK LAST WEEK", 
       "PLAYER NAME","EVENTS","RATING", 
       "year1","year2","year3","year4"] 
df.drop(["year1","year2","year3","year4"], axis=1, inplace=True) # axis must be a keyword argument in pandas 2.x 
df = df.loc[df["PLAYER NAME"].notnull()] 
df = df.loc[df.year != "year"] 
num_cols = ["RANK THIS WEEK","RANK LAST WEEK","EVENTS","RATING"] 
df[num_cols] = df[num_cols].apply(pd.to_numeric) 

df.head() 
    year RANK THIS WEEK RANK LAST WEEK  PLAYER NAME EVENTS RATING 
2 2017    1    1.0  Sam Burns  1  9.2 
3 2017    2    3.0 Rickie Fowler  10  8.8 
4 2017    2    2.0 Dustin Johnson  10  8.8 
5 2017    2    3.0  Whee Kim  2  8.8 
6 2017    2    3.0 Thomas Pieters  3  8.8
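
For readers who want to try the cleanup without scraping anything, here is a self-contained sketch on a few of the sample rows shown earlier. It takes a slightly different route from the code above: short rows are filtered out before the data frame is built, rather than dropped afterwards. Treat it as an illustrative variant, not the original method.

```python
import pandas as pd

# A few rows copied from the example rows_sp above.
rows_sp = [
    ["year", "RANK THIS WEEK", "RANK LAST WEEK", "PLAYER NAME", "EVENTS",
     "RATING", "year", "year", "year", "year"],
    ["2017"],
    ["2017", "1", "1", "Sam Burns", "1", "9.2"],
    ["2017", "2", "3", "Rickie Fowler", "10", "8.8"],
]

wanted = ["year", "RANK THIS WEEK", "RANK LAST WEEK", "PLAYER NAME",
          "EVENTS", "RATING"]

# Skip the header and keep only rows that have a value for every wanted column.
records = [row[:len(wanted)] for row in rows_sp[1:] if len(row) >= len(wanted)]
df = pd.DataFrame(records, columns=wanted)

# Convert the statistic columns from strings to numbers.
num_cols = ["RANK THIS WEEK", "RANK LAST WEEK", "EVENTS", "RATING"]
df[num_cols] = df[num_cols].apply(pd.to_numeric)
print(df.dtypes)
```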

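UPDATED CLEANING
Now that we have a series of URL categories to deal with, each with a different set of fields to clean, the section above gets a little more complicated. If you only have a few pages, it's feasible to just review the fields for each category visually and store them, like this: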

cols = {'stat.02568.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK', 
             'PLAYER NAME', 'ROUNDS', 'AVERAGE', 
             'TOTAL SG:APP', 'MEASURED ROUNDS', 
             'year1', 'year2', 'year3', 'year4'], 
          'numeric':['RANK THIS WEEK', 'RANK LAST WEEK', 'ROUNDS', 
             'AVERAGE', 'TOTAL SG:APP', 'MEASURED ROUNDS',] 
          }, 
     'stat.111.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK', 
            'PLAYER NAME', 'ROUNDS', '%', '# SAVES', '# BUNKERS', 
            'TOTAL O/U PAR', 'year1', 'year2', 'year3', 'year4'], 
         'numeric':['RANK THIS WEEK', 'RANK LAST WEEK', 'ROUNDS', 
            '%', '# SAVES', '# BUNKERS', 'TOTAL O/U PAR'] 
         }, 
     'stat.02356.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK', 
             'PLAYER NAME', 'EVENTS', 'RATING', 
             'year1', 'year2', 'year3', 'year4'], 
          'numeric':['RANK THIS WEEK', 'RANK LAST WEEK', 
             'EVENTS', 'RATING'] 
          } 
     }

Then you can loop over url_data again and store the results in a dfs collection:

dfs = {} 

for url in url_data: 
    page = url.split("/")[-1] 
    colnames = cols[page]["columns"] 
    num_cols = cols[page]["numeric"] 
    rows_sp = url_data[url]["rows_sp"] 
    df = pd.DataFrame(rows_sp, columns=rows_sp[0]).drop(0) 
    df.columns = colnames 
    df.drop(["year1","year2","year3","year4"], axis=1, inplace=True) # axis must be a keyword argument in pandas 2.x 
    df = df.loc[df["PLAYER NAME"].notnull()] 
    df = df.loc[df.year != "year"] 
    # tied ranks (e.g. "T9") mess up to_numeric; remove the tie indicators. 
    df["RANK THIS WEEK"] = df["RANK THIS WEEK"].str.replace("T","") 
    df["RANK LAST WEEK"] = df["RANK LAST WEEK"].str.replace("T","") 
    df[num_cols] = df[num_cols].apply(pd.to_numeric) 
    dfs[url] = df
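
The tie handling in the loop can be looked at in isolation. The sketch below uses made-up rank strings (only the "T9" form comes from the comment in the code) and differs from the loop in two hedged ways: it strips only a leading "T", and it passes errors="coerce" so that anything still unparseable becomes NaN instead of raising.

```python
import pandas as pd

# Illustrative rank strings; "n/a" stands in for any non-numeric leftover.
ranks = pd.Series(["1", "T9", "T9", "12", "n/a"])

# Remove a leading tie marker, then convert to numbers.
stripped = ranks.str.replace(r"^T", "", regex=True)
cleaned = pd.to_numeric(stripped, errors="coerce")
print(cleaned)
```

Whether silently turning bad values into NaN is acceptable depends on the data; raising an error (the default) makes unexpected formats easier to notice.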

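At this point, we're ready to merge all the different data categories by year and PLAYER NAME. (You could actually merge iteratively inside the cleaning loop, but I'm keeping the steps separate here for demonstration purposes.)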

master = pd.DataFrame() 
for url in dfs: 
    if master.empty: 
     master = dfs[url] 
    else: 
     master = master.merge(dfs[url], on=['year','PLAYER NAME'])
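
The same chained merge can also be written with functools.reduce, which is the form attempted in the question. The sketch below uses tiny stand-in frames with illustrative values, not scraped data. Note one deliberate difference: the loop above uses merge's default inner join, so a player-year missing from any category is dropped, whereas how="outer" keeps it and fills the gaps with NaN.

```python
from functools import reduce

import pandas as pd

# Toy stand-ins for the cleaned per-category frames (values are illustrative).
dfs = {
    "stat.02356.html": pd.DataFrame(
        {"year": ["2017", "2016"], "PLAYER NAME": ["A", "A"], "RATING": [8.8, 8.1]}),
    "stat.02568.html": pd.DataFrame(
        {"year": ["2017"], "PLAYER NAME": ["A"], "AVERAGE": [0.296]}),
    "stat.111.html": pd.DataFrame(
        {"year": ["2017", "2016"], "PLAYER NAME": ["A", "A"], "%": [61.9, 54.78]}),
}

keys = ["year", "PLAYER NAME"]
# Fold the frames together pairwise, keeping every player-year seen anywhere.
master = reduce(lambda left, right: left.merge(right, on=keys, how="outer"),
                dfs.values())
print(master)
```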

Now master contains the merged data for each player-year. Here's a view into the data, using groupby():

master.groupby(["PLAYER NAME", "year"]).first().head(4) 
        RANK THIS WEEK_x RANK LAST WEEK_x EVENTS RATING \ 
PLAYER NAME year              
Aam Hawin 2015    66    66.0  7  8.2 
      2016    80    80.0  12  8.1 
      2017    72    45.0  8  8.2 
Aam Scott 2013    45    45.0  10  8.2 

        RANK THIS WEEK_y RANK LAST WEEK_y ROUNDS_x AVERAGE \ 
PLAYER NAME year               
Aam Hawin 2015    136    136  95 -0.183 
      2016    122    122  93 -0.061 
      2017    56    52  84 0.296 
Aam Scott 2013    16    16  61 0.548 

        TOTAL SG:APP MEASURED ROUNDS RANK THIS WEEK \ 
PLAYER NAME year             
Aam Hawin 2015  -14.805    81    86 
      2016  -5.285    87    39 
      2017  18.067    61    8 
Aam Scott 2013  24.125    44    57 

        RANK LAST WEEK ROUNDS_y  % # SAVES # BUNKERS \ 
PLAYER NAME year               
Aam Hawin 2015    86  95 50.96  80  157 
      2016    39  93 54.78  86  157 
      2017    6  84 61.90  91  147 
Aam Scott 2013    57  61 53.85  49   91 

        TOTAL O/U PAR 
PLAYER NAME year     
Aam Hawin 2015   47.0 
      2016   43.0 
      2017   27.0 
Aam Scott 2013   11.0

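You may want to do a bit more cleaning on the merged columns, since some are duplicated across data categories (e.g. ROUNDS_x and ROUNDS_y). As far as I can tell, the duplicated field names contain exactly the same information, so you could just drop the _y version of each one. (The RANK columns are the exception: those are rankings for different statistics, so check before dropping them.)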
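
A cautious way to do that cleanup is to drop a _y column only when it really equals its _x partner. The sketch below runs on a one-row stand-in frame; the ROUNDS and AVERAGE values come from the sample row in the question, while the two RANK values are made up to show a pair that should be kept.

```python
import pandas as pd

merged = pd.DataFrame({
    "year": ["2017"],
    "PLAYER NAME": ["Rickie Fowler"],
    "RANK THIS WEEK_x": [2],
    "ROUNDS_x": [62],
    "RANK THIS WEEK_y": [40],   # illustrative: a rank for a different statistic
    "ROUNDS_y": [62],
    "AVERAGE": [0.614],
})

# Find _y columns whose _x partner exists and holds identical values.
safe = [c for c in merged.columns
        if c.endswith("_y")
        and c[:-2] + "_x" in merged.columns
        and merged[c].equals(merged[c[:-2] + "_x"])]

# Drop the redundant _y copies and strip the suffix from their _x partners.
renames = {c[:-2] + "_x": c[:-2] for c in safe}
tidy = merged.drop(columns=safe).rename(columns=renames)
print(list(tidy.columns))
```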

Source

2017-08-15 18:35:11

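Thank you, this is great, but I'm not looking to aggregate the data across years; I want to take the data from all the other URLs and add it to the main dataframe. –

You're welcome! Did this answer provide a sufficient solution to your original question? If so, please mark it as accepted by clicking the check mark to the left of the answer. If not, what is still missing? –

Technically no, but it answered another question I had. I'm stuck on creating one large dataframe from the information contained in all the URLs: turn each URL's data into its own df, then merge based on name and year, so that each player row gets all the information from every URL in a single dataframe. –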
