2016-07-11 10 views
3

How do I get unique dates from a dataframe using pandas? I need to get the unique dates (year, month, and day only) for every ID.

2016-06-21 06:25:09 [email protected] GET HTTP/1.1 Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53 200 application/json 2130 https://edge-chat.facebook.com/pull?channel=p_100006170407238&seq=27&clientid=1d67ca6e&profile=mobile&partition=-2&sticky_token=185&msgs_recv=27&qp=y&cb=1830997782&state=active&sticky_pool=frc3c09_chat-proxy&uid=100006170407238&viewer_uid=100006170407238&m_sess=&__dyn=1Z3p5wnE-4UpwDF3GAgy78qzoC6Erz8B0GxG9xu3Z0QwFzohxO3O2G2a1mwYxm48sxadwpVEy1qK78gwUx6&__req=79&__ajax__=AYlbtcBwGC2suZLI-J88V0PWa58vtQeG3YlQLydFRsAl6UwLSjsSpD7peu8mGl6NsHvd2zxfDcB6A0-XunBugUsYZ1lMYmUu97R43iV7XSfpyg&__user=100006170407238 
2016-06-22 06:25:20 [email protected] POST HTTP/1.1 Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53 200 application/x-javascript 20248 https://m.facebook.com/stories.php?aftercursor=MTQ2NjY2MzEwNToxNDY2NjYzMTA1Ojg6NzM0ODg0MDExMjAyNDY1MzA5NToxNDY2NjYyNzk1OjA%3D&tab=h_nor&__m_log_async__=1 
2016-06-23 06:25:25 [email protected] CONNECT HTTP/1.1 Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53 200 - 0 scontent.xx.fbcdn.net:443 
2016-06-23 06:25:25 [email protected] GET HTTP/1.1 Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53 200 text/html 1105 https://m.facebook.com/xti.php?xt=2.qid.6299270070554694533%3Amf_story_key.343726573953754118%3Aei.AI%40ecf11fb3faf9c0b1f73ce2a74bc9f228 
2016-06-24 06:25:25 [email protected] CONNECT HTTP/1.1 Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53 200 - 0 scontent.xx.fbcdn.net:443 
2016-06-25 06:25:25 [email protected] CONNECT HTTP/1.1 Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53 200 - 0 scontent.xx.fbcdn.net:443 
2016-06-25 06:25:25 [email protected] CONNECT HTTP/1.1 Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53 200 - 0 scontent.xx.fbcdn.net:443 

This is my df. Desired output:

[email protected] - 2016-06-21, 2016-06-22, 2016-06-23 
[email protected] - 2016-06-24, 2016-06-25 

How can I get these dates?

+0

It is not clear what "df" is. – peterh

+0

This is not a dataframe. What have you tried? – Merlin

Answers

2

You can first extract the information you need from your dates:

df['filtered date'] = [w[:10] for w in df['date']] 

Then remove the duplicates using drop_duplicates:

output = df[['id','filtered date']].drop_duplicates() 

You can then sort the dataframe for clarity:

output.sort_values(by=['id', 'filtered date'], inplace=True)

You finally get this kind of output:

id    filtered date 
0 [email protected] 2016-06-24 
1 [email protected] 2016-06-25 
3 [email protected] 2016-06-21 
4 [email protected] 2016-06-22 
5 [email protected] 2016-06-23 
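The three steps above can be put together as a minimal sketch. The sample rows and the example.com addresses are hypothetical placeholders, and the column names 'id' and 'date' are assumed to match the answer; the dates are assumed to be strings that start with YYYY-MM-DD.

```python
import pandas as pd

# Hypothetical sample data; the column names 'id' and 'date' are assumed.
df = pd.DataFrame({
    "id": ["a@example.com", "a@example.com", "a@example.com",
           "b@example.com", "b@example.com"],
    "date": ["2016-06-21 06:25:09", "2016-06-23 06:25:25",
             "2016-06-23 06:25:25", "2016-06-25 06:25:25",
             "2016-06-24 06:25:25"],
})

# Keep only the 'YYYY-MM-DD' prefix of each timestamp string.
df["filtered date"] = df["date"].str[:10]

# Drop repeated (id, day) pairs, then sort for readability.
output = (
    df[["id", "filtered date"]]
    .drop_duplicates()
    .sort_values(by=["id", "filtered date"])
    .reset_index(drop=True)
)
print(output)
```

Chaining the calls avoids inplace=True, and reset_index(drop=True) gives the result a clean 0..n-1 index.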
1

pandas provides the groupby function for DataFrames, which is well suited to what you need.

from datetime import datetime, timedelta
from random import choice, randrange

import pandas as pd

# Generate dataframe with random values 
mail = ['[email protected]', '[email protected]', '[email protected]'] 
stime = datetime.strptime('2016-07-01 00:00:00', '%Y-%m-%d %H:%M:%S') 
etime = datetime.strptime('2016-07-30 00:00:00', '%Y-%m-%d %H:%M:%S') 
tdelta = etime - stime 
tdiff = tdelta.days * 24 * 60 * 60 + tdelta.seconds 

df = pd.DataFrame({ 
    'mail': [choice(mail) for _ in range(10)], 
    'time':[stime + timedelta(seconds=randrange(tdiff)) for _ in range(10)] 
}) 

# Group dataframe by column 'mail' and apply the lambda expression to 
# transform the grouped set of values into unique time values. 
r = df.groupby(by='mail').apply(lambda x: set(x['time'].values)) 

You should then be able to work with the result:

print(r) 

mail 
[email protected] {2016-07-24T16:42:12.000000000, 2016-07-07T15:... 
[email protected]  {2016-07-13T18:53:07.000000000, 2016-07-04T06:... 
[email protected]  {2016-07-10T07:37:19.000000000, 2016-07-09T07:... 
dtype: object 
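Note that the result above contains full timestamps, whereas the question asks for year, month, and day only. One possible variation, sketched here with hypothetical data and the same assumed column names 'mail' and 'time', is to reduce each timestamp to its calendar date with the .dt.date accessor before grouping.

```python
import pandas as pd

# Hypothetical data; column names 'mail' and 'time' follow the answer above.
df = pd.DataFrame({
    "mail": ["a@example.com", "a@example.com", "b@example.com", "a@example.com"],
    "time": pd.to_datetime([
        "2016-07-01 08:00:00", "2016-07-01 17:30:00",
        "2016-07-02 09:15:00", "2016-07-03 12:00:00",
    ]),
})

# Reduce each timestamp to its calendar date, then collect the
# distinct dates per address as a sorted list.
days = df["time"].dt.date
r = days.groupby(df["mail"]).agg(lambda s: sorted(s.unique()))
print(r)
```

Here the two timestamps on 2016-07-01 collapse into a single date for the first address.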
1

A one-liner, assuming the relevant columns are named date and ID:

df.groupby('ID').apply(lambda x: (x['date'].str[:10]).unique()) 

and its output:

ID 
[email protected]    [2016-06-24, 2016-06-25] 
[email protected] [2016-06-21, 2016-06-22, 2016-06-23] 
dtype: object 
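If the goal is the exact "id - date, date" layout from the question, the per-ID arrays can be joined into strings. This is only a sketch with hypothetical data; the column names 'ID' and 'date' are assumed as in the one-liner, and it uses the groupby unique() method instead of apply.

```python
import pandas as pd

# Hypothetical data; the column names 'ID' and 'date' are assumed.
df = pd.DataFrame({
    "ID": ["a@example.com", "a@example.com", "b@example.com",
           "a@example.com", "b@example.com"],
    "date": ["2016-06-21 06:25:09", "2016-06-22 06:25:20",
             "2016-06-24 06:25:25", "2016-06-22 07:00:00",
             "2016-06-25 06:25:25"],
})

# Unique 'YYYY-MM-DD' strings per ID, in order of first appearance.
days = df["date"].str[:10]
unique_days = days.groupby(df["ID"]).unique()

# One "<id> - <date>, <date>, ..." line per ID.
lines = [f"{id_} - {', '.join(dates)}" for id_, dates in unique_days.items()]
print("\n".join(lines))
```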
1

Let's read in your sample data:

import pandas as pd 
from io import StringIO  # Python 3; in Python 2 this class lived in the StringIO module

df = pd.read_table(StringIO("""2016-06-21 06:25:09 [email protected] GET HTTP/1.1 Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53 200 application/json 2130 https://edge-chat.facebook.com/pull?channel=p_100006170407238&seq=27&clientid=1d67ca6e&profile=mobile&partition=-2&sticky_token=185&msgs_recv=27&qp=y&cb=1830997782&state=active&sticky_pool=frc3c09_chat-proxy&uid=100006170407238&viewer_uid=100006170407238&m_sess=&__dyn=1Z3p5wnE-4UpwDF3GAgy78qzoC6Erz8B0GxG9xu3Z0QwFzohxO3O2G2a1mwYxm48sxadwpVEy1qK78gwUx6&__req=79&__ajax__=AYlbtcBwGC2suZLI-J88V0PWa58vtQeG3YlQLydFRsAl6UwLSjsSpD7peu8mGl6NsHvd2zxfDcB6A0-XunBugUsYZ1lMYmUu97R43iV7XSfpyg&__user=100006170407238 
2016-06-22 06:25:20 [email protected] POST HTTP/1.1 Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53 200 application/x-javascript 20248 https://m.facebook.com/stories.php?aftercursor=MTQ2NjY2MzEwNToxNDY2NjYzMTA1Ojg6NzM0ODg0MDExMjAyNDY1MzA5NToxNDY2NjYyNzk1OjA%3D&tab=h_nor&__m_log_async__=1 
2016-06-23 06:25:25 [email protected] CONNECT HTTP/1.1 Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53 200 - 0 scontent.xx.fbcdn.net:443 
2016-06-23 06:25:25 [email protected] GET HTTP/1.1 Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53 200 text/html 1105 https://m.facebook.com/xti.php?xt=2.qid.6299270070554694533%3Amf_story_key.343726573953754118%3Aei.AI%40ecf11fb3faf9c0b1f73ce2a74bc9f228 
2016-06-24 06:25:25 [email protected] CONNECT HTTP/1.1 Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53 200 - 0 scontent.xx.fbcdn.net:443 
2016-06-25 06:25:25 [email protected] CONNECT HTTP/1.1 Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53 200 - 0 scontent.xx.fbcdn.net:443 
2016-06-25 06:25:25 [email protected] CONNECT HTTP/1.1 Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53 200 - 0 scontent.xx.fbcdn.net:443 
"""), sep=r"\s+", header=None)  # sep=r"\s+" replaces the deprecated delim_whitespace=True

You are interested in the first column (index 0), which holds the date, and the third column (index 2), which holds the email address.

df2 = df[[0, 2]] 

Purely for readability, we separate them into a new dataframe, which now looks like this:

  0    2 
0 2016-06-21 [email protected] 
1 2016-06-22 [email protected] 
2 2016-06-23 [email protected] 
3 2016-06-23 [email protected] 
4 2016-06-24 [email protected] 
5 2016-06-25 [email protected] 
6 2016-06-25 [email protected] 

We now need to group by the email column and aggregate with a custom function that turns the grouped dates into a list, as in your desired output:

df2.groupby(2).agg(lambda x: x.unique().tolist()).reset_index() 

reset_index() fixes the index, so we get the following dataframe:

    2          0 
0 [email protected]    [2016-06-24, 2016-06-25] 
1 [email protected] [2016-06-21, 2016-06-22, 2016-06-23] 
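The integer column labels 0 and 2 can be hard to read later on. As a small, self-contained sketch of the same read, select, group, and aggregate steps, the version below uses shortened hypothetical log lines and renames the two columns; the names "date" and "id" are assumptions, not part of the original data.

```python
from io import StringIO

import pandas as pd

# Hypothetical, shortened log lines: date, time, id, HTTP method.
raw = """2016-06-21 06:25:09 a@example.com GET
2016-06-22 06:25:20 a@example.com POST
2016-06-22 06:25:25 a@example.com GET
2016-06-24 06:25:25 b@example.com CONNECT
"""

# Split on runs of whitespace; there is no header row in the log.
df = pd.read_csv(StringIO(raw), sep=r"\s+", header=None)

# Keep the date and id columns and give them readable (assumed) names.
df2 = df[[0, 2]].rename(columns={0: "date", 2: "id"})

# Collect the distinct dates per id into a list.
result = (
    df2.groupby("id")["date"]
    .agg(lambda s: list(s.unique()))
    .reset_index()
)
print(result)
```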