Pandas merge if value is between two fields

I have two dataframes containing some IP information that I want to combine with the equivalent of a left join in SQL. The dataframes have the following fields:

df1: ["company","ip","actions"] 
df2: ["ip_range_start","ip_range_end","country","state","city"]

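The resulting dataframe should have the headers ["company","ip","actions","country","state","city"]. The problem here is the merge criterion: df1 contains a single IP that I want to use to pull the country, state, and city information from df2.

This single IP will fall into one of the ranges given by the "ip_range_start" and "ip_range_end" fields of df2. Since df1 and df2 share no matching values, I don't see how to accomplish this with a normal merge/join.

My question seems very similar to this one, but different enough to warrant a separate question: Pandas: how to merge two dataframes on offset dates?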

Source

2016-08-28 sastrup

Would you mind sharing some data? – Abdou

You could split the IP octets into separate fields. That might simplify the solution. –

Assume we have the following dataframes:

In [5]: df1 
Out[5]: 
    company   ip actions 
0 comp1 10.10.1.2 act1 
1 comp2 10.10.2.20 act2 
2 comp3 10.10.3.50 act3 
3 comp4 10.10.4.100 act4 

In [6]: df2 
Out[6]: 
    ip_range_start ip_range_end country state city 
0  10.10.2.1 10.10.2.254 country2 state2 city2 
1  10.10.3.1 10.10.3.254 country3 state3 city3 
2  10.10.4.1 10.10.4.254 country4 state4 city4

We can create a vectorized function that computes the numeric IP representation, like int(netaddr.IPAddress('192.0.2.1')):

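import numpy as np

def ip_to_int(ip_ser):
    # split each dotted quad into four integer columns (int64 so the 24-bit shift cannot overflow)
    ips = ip_ser.str.split('.', expand=True).astype(np.int64).values
    # per-octet shift amounts (24, 16, 8, 0 bits), repeated for every row
    mults = np.tile(np.array([24, 16, 8, 0]), len(ip_ser)).reshape(ips.shape)
    return np.sum(np.left_shift(ips, mults), axis=1)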
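To make the arithmetic concrete: 10.10.1.2 becomes 10·2^24 + 10·2^16 + 1·2^8 + 2 = 168427778. Here is a minimal scalar sketch of the same idea, cross-checked against Python's standard-library `ipaddress` module. The helper name `ip_to_int_scalar` is made up for illustration and is not part of the answer's code.

```python
import ipaddress

def ip_to_int_scalar(ip):
    # hypothetical helper: split the dotted quad into its four octets
    a, b, c, d = (int(part) for part in ip.split('.'))
    # shift each octet into its byte position and combine
    return (a << 24) | (b << 16) | (c << 8) | d

value = ip_to_int_scalar('10.10.1.2')
# the standard library gives the same integer for an IPv4 address
reference = int(ipaddress.ip_address('10.10.1.2'))
print(value, reference)
```

Both numbers should agree with the `_ip` value shown for comp1.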

Let's convert all IPs to their numeric representations:

df1['_ip'] = ip_to_int(df1.ip) 
df2[['_ip_range_start','_ip_range_end']] = df2.filter(like='ip_range').apply(lambda x: ip_to_int(x)) 

In [10]: df1 
Out[10]: 
    company   ip actions  _ip 
0 comp1 10.10.1.2 act1 168427778 
1 comp2 10.10.2.20 act2 168428052 
2 comp3 10.10.3.50 act3 168428338 
3 comp4 10.10.4.100 act4 168428644 

In [11]: df2 
Out[11]: 
    ip_range_start ip_range_end country state city _ip_range_start _ip_range_end 
0  10.10.2.1 10.10.2.254 country2 state2 city2  168428033  168428286 
1  10.10.3.1 10.10.3.254 country3 state3 city3  168428289  168428542 
2  10.10.4.1 10.10.4.254 country4 state4 city4  168428545  168428798

Now let's add a new column to the df1 DF containing the index of the first matching IP interval from the df2 DF:

In [12]: df1['x'] = (df1._ip.apply(lambda x: df2.query('_ip_range_start <= @x <= _ip_range_end')
   ....:                                      .index
   ....:                                      .values)
   ....:                   .apply(lambda x: x[0] if len(x) else -1))

In [14]: df1
Out[14]:
  company           ip actions        _ip  x
0   comp1    10.10.1.2    act1  168427778 -1
1   comp2   10.10.2.20    act2  168428052  0
2   comp3   10.10.3.50    act3  168428338  1
3   comp4  10.10.4.100    act4  168428644  2
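The lookup above runs one `query` per row of df1. If the ranges in df2 do not overlap (an assumption, though it holds for the sample data), a `pd.IntervalIndex` can possibly do the same lookup in one call. This is a sketch of an alternative technique, not what the answer uses. The integer lists are copied from the `_ip`, `_ip_range_start` and `_ip_range_end` columns shown earlier.

```python
import pandas as pd

# numeric range bounds and IPs, copied from the sample output above
starts = [168428033, 168428289, 168428545]
ends = [168428286, 168428542, 168428798]
ips = [168427778, 168428052, 168428338, 168428644]

# closed='both' makes each interval include its start and end address
intervals = pd.IntervalIndex.from_arrays(starts, ends, closed='both')

# position of the containing interval for each IP, -1 where none matches;
# get_indexer raises an error if the intervals overlap
positions = intervals.get_indexer(ips)
print(positions.tolist())
```

The resulting positions play the same role as the `x` column: -1 marks an IP with no matching range, and the other values point at rows of df2.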

Finally, we can merge both DFs:

In [15]: (pd.merge(df1.drop('_ip', axis=1),
   ....:           df2.filter(regex=r'^((?!.?ip_range_).*)$'),
   ....:           left_on='x',
   ....:           right_index=True,
   ....:           how='left')
   ....:    .drop('x', axis=1)
   ....: )
Out[15]:
  company           ip actions   country   state   city
0   comp1    10.10.1.2    act1       NaN     NaN    NaN
1   comp2   10.10.2.20    act2  country2  state2  city2
2   comp3   10.10.3.50    act3  country3  state3  city3
3   comp4  10.10.4.100    act4  country4  state4  city4

Let's compare the speed of the standard int(IPAddress) with our function, using a 4M-row DF for the comparison:

In [21]: big = pd.concat([df1.ip] * 10**6, ignore_index=True)

In [22]: big.shape
Out[22]: (4000000,)

In [23]: big.head(10)
Out[23]:
0      10.10.1.2
1     10.10.2.20
2     10.10.3.50
3    10.10.4.100
4      10.10.1.2
5     10.10.2.20
6     10.10.3.50
7    10.10.4.100
8      10.10.1.2
9     10.10.2.20
Name: ip, dtype: object

In [24]: %timeit big.apply(lambda x: int(IPAddress(x)))
1 loop, best of 3: 1min 3s per loop

In [25]: %timeit ip_to_int(big)
1 loop, best of 3: 25.4 s per loop

Conclusion: our function is approximately 2.5 times faster.
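As a side note, the `np.tile` step is probably not required, because NumPy broadcasts a length-4 shift vector across all rows. The variant below is a sketch under that assumption. The name `ip_to_int_broadcast` is made up for illustration, and it has not been benchmarked here, so no speed claim is attached to it.

```python
import numpy as np
import pandas as pd

def ip_to_int_broadcast(ip_ser):
    # hypothetical variant of ip_to_int: one int64 column per octet
    octets = ip_ser.str.split('.', expand=True).astype(np.int64).to_numpy()
    # a single row of shift amounts is broadcast against every row of octets
    shifts = np.array([24, 16, 8, 0], dtype=np.int64)
    return (octets << shifts).sum(axis=1)

ips = pd.Series(['10.10.1.2', '10.10.2.20', '10.10.3.50', '10.10.4.100'])
result = ip_to_int_broadcast(ips)
print(result.tolist())
```

The printed values should match the `_ip` column shown earlier for df1.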

Source

2016-08-29 18:39:58 MaxU
