A better way to merge (update/insert) pandas DataFrames

I have two pandas DataFrames: df_current_data and df_new_data.

My goal is to apply a merge (not the pandas merge function, but a merge in the "update/insert" sense). Matches are checked by the key columns.

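My result should be built from three possible kinds of rows:

Rows that exist in df_current_data but not in df_new_data: inserted into the result "as is".
Rows that exist in df_new_data but not in df_current_data: inserted into the result "as is".
Rows that exist in both df_new_data and df_current_data: the result should take the row from df_new_data.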

This is classic merge-upsert behavior.
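The three rules above can be sketched on the example frames from this question. This is only an illustration: it assumes the key columns are index1 and index2 and that keys are unique within each frame. The helper name upsert is hypothetical, and concat followed by drop_duplicates is just one possible technique, not necessarily the fastest one.

```python
import pandas as pd

def upsert(current, new, keys):
    """Hypothetical helper: rows of `new` win over rows of `current` with the same keys."""
    # Stack current first and new second, so that for a duplicated key
    # the last occurrence is always the row coming from `new`.
    stacked = pd.concat([current, new], ignore_index=True)
    return stacked.drop_duplicates(subset=keys, keep="last").reset_index(drop=True)

df_current_source = pd.DataFrame({"A": [1, 2, 3], "index1": [1, 2, 3], "index2": [4, 5, 6]})
df_new_source = pd.DataFrame({"A": [4, 5, 6], "index1": [2, 3, 4], "index2": [7, 6, 5]})

df_result = upsert(df_current_source, df_new_source, ["index1", "index2"])
print(df_result)
```

The row order of this sketch differs from the expected result shown in the question (the updated row is placed among the new rows), so sort by the key columns if the order matters.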

Example:

# rows 0,1 are in current and not in new (checked by index1 and index2)
# row 2 is common
In [41]: df_current_source
Out[41]:
   A  index1  index2
0  1       1       4
1  2       2       5
2  3       3       6

# rows 0,2 are in new and not in current (checked by index1 and index2)
# row 1 is common
In [42]: df_new_source
Out[42]:
   A  index1  index2
0  4       2       7
1  5       3       6
2  6       4       5

# the result has 2 rows that exist only in current (rows 0,1)
# the result has 2 rows that exist only in new (rows 3,4)
# the result has one row that exists in both current and new (row 2: index1 = 3, index2 = 6),
# so the value of column A comes from new and not from current (5 instead of 3)

In [43]: df_result
Out[43]:
   A  index1  index2
0  1       1       4
1  2       2       5
2  5       3       6
3  4       2       7
4  6       4       5

What I did:

import pandas as pd

# p_curr_keys / p_new_keys: lists with the key column names of each frame
# left join from current to new, bringing in only the key columns of new
df = df_current_source.merge(df_new_source[p_new_keys], how='left',
                             left_on=p_curr_keys, right_on=p_new_keys, indicator=True)

# take only the rows that exist in current and do not exist in new
df_only_current = df.loc[df['_merge'] == 'left_only', df_current_source.columns]

# merge new data into current data
df_result = pd.concat([df_only_current, df_new_source])

Another option is the isin function:

df_result = pd.concat([df_current_source[~df_current_source[p_key_col_name]
                                         .isin(df_new_source[p_key_col_name])],
                       df_new_source])

The problem is that when I have more than one key column I can't use isin; I need merge.
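For what it's worth, an isin-style check over several key columns can be sketched by turning the key columns into a MultiIndex, so that whole key combinations are compared. This is a sketch under assumptions, not a claim about the best approach: the key columns are assumed to be index1 and index2, and the variable names keys and is_replaced are made up for the illustration.

```python
import pandas as pd

keys = ["index1", "index2"]  # assumed key columns

df_current_source = pd.DataFrame({"A": [1, 2, 3], "index1": [1, 2, 3], "index2": [4, 5, 6]})
df_new_source = pd.DataFrame({"A": [4, 5, 6], "index1": [2, 3, 4], "index2": [7, 6, 5]})

# One composite key per row, so isin compares (index1, index2) pairs, not single columns.
current_keys = pd.MultiIndex.from_frame(df_current_source[keys])
new_keys = pd.MultiIndex.from_frame(df_new_source[keys])

# True for rows of current whose key pair also appears in new.
is_replaced = current_keys.isin(new_keys)

# Keep the rows of current that are not replaced, then append all rows of new.
df_result = pd.concat([df_current_source[~is_replaced], df_new_source], ignore_index=True)
print(df_result)
```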

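Assuming the current frame is much larger than the new one, I think it would be better to update the matching rows in current directly with the latest rows, and then append the rows of the "new" DataFrame that do not exist yet to current.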

But I don't know how to do that.
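One way this "update in place, then append" idea might look is sketched below with DataFrame.update and an index built from the key columns. It assumes the keys are index1 and index2 and are unique within each frame; the names current, new and only_new are made up for the illustration. Note that update skips NaN values in the new frame and may change column dtypes, so this is not a drop-in solution.

```python
import pandas as pd

keys = ["index1", "index2"]  # assumed key columns

df_current_source = pd.DataFrame({"A": [1, 2, 3], "index1": [1, 2, 3], "index2": [4, 5, 6]})
df_new_source = pd.DataFrame({"A": [4, 5, 6], "index1": [2, 3, 4], "index2": [7, 6, 5]})

# Index both frames by the key columns so that rows are aligned by key.
current = df_current_source.set_index(keys)
new = df_new_source.set_index(keys)

# Overwrite, in place, the rows of current whose keys also occur in new.
current.update(new)

# Rows of new whose keys do not occur in current are appended.
only_new = new[~new.index.isin(current.index)]
df_result = pd.concat([current, only_new]).reset_index()
print(df_result)
```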

Thanks a lot.

2017-08-21 user2671057

combined_dataframe = df_new_source.set_index('A').combine_first(df_current_source.set_index('A'))
combined_dataframe.reset_index()

Can you use combine_first? With an index? –

Also, please provide the input and the expected output. –

What do you mean by update? I need two columns to match. I have added the input and the expected output. – user2671057

Option 1: use indicator=True as part of merge

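import numpy as np

# outer join on the key columns; _merge records which frame each row came from
df_out = df_current_source.merge(df_new_source,
                                 on=['index1', 'index2'],
                                 how='outer', indicator=True)

# matched rows take A from new (A_y); otherwise take whichever side is not NaN
df_out['A'] = np.where(df_out['_merge'] == 'both',
                       df_out['A_y'],
                       df_out['A_x'].add(df_out['A_y'], fill_value=0)).astype(int)

df_out[['A', 'index1', 'index2']]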

Output:

   A  index1  index2
0  1       1       4
1  2       2       5
2  5       3       6
3  4       2       7
4  6       4       5

Option 2: use combine_first

(df_new_source.set_index(['index1', 'index2'])
              .combine_first(df_current_source.set_index(['index1', 'index2']))
              .reset_index()
              .astype(int))

Output:

   index1  index2  A
0       1       4  1
1       2       5  2
2       2       7  4
3       3       6  5
4       4       5  6

2017-08-21 13:09:18

Could you explain this part: df_out.A_x.add(df_out.A_y, fill_value=0)).astype(int)? Thanks – user2671057

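You can use merge with indicator=True. Because I chose 'outer', where data can come from the left, the right, or both, this labels each record with the DataFrame it came from in the join. Then you can use np.where, which behaves like an if statement: if the _merge label column equals 'both', take the value from the new frame; otherwise take the sum of left and right, with missing values replaced by 0. –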

Basically, for those rows one of A_x and A_y is always NaN, so adding them together with NaN replaced by 0 is a simple trick to get the non-null value from either column. –
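The fill_value trick described in the comments above can be seen in isolation on two small Series. The values and the names a_x, a_y, picked and plain are made up for the illustration; they stand in for the A_x and A_y columns of rows that exist on only one side.

```python
import numpy as np
import pandas as pd

a_x = pd.Series([1.0, 2.0, np.nan])     # value from the left frame, NaN for right-only rows
a_y = pd.Series([np.nan, np.nan, 4.0])  # value from the right frame, NaN for left-only rows

# fill_value=0 substitutes 0 for the missing operand before adding,
# so each result is simply the side that is not NaN.
picked = a_x.add(a_y, fill_value=0)

# The plain + operator propagates NaN instead.
plain = a_x + a_y

# combine_first expresses the same "take the first non-null value" intent directly.
same = a_x.combine_first(a_y)

print(picked.tolist(), plain.isna().all(), same.equals(picked))
```

The trick only works when at most one side has a value; for rows where both sides are present the sum would be wrong, which is why the answer handles the 'both' case separately with np.where.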

Check the linked question "join or merge with overwrite in pandas". You can update in the way described there:

   A  index1  index2
0  1     1.0     4.0
1  2     2.0     5.0
2  3     2.0     7.0
3  5     3.0     6.0
4  6     4.0     5.0

2017-08-21 13:52:08 Vico
