2017-03-22

I am transforming some applicant transaction data. I need to create a new flag column (labeled "DESIRED FLAG" in my example), but I can't work out the right loop/apply approach because the logic below can come in so many different variations. Which Pandas apply/loop method is best in this case?

In a perfect world, the sequential application process history would have every "Event Status" set to "Completed" and would look like this:

  • On-Site Interview Kick Off -> Schedule Interviews -> Decision; OR
  • Phone Interview Kick Off -> Schedule Interviews -> Decision

And of course, an applicant can go through many phone interviews and on-site interviews while their application is being processed.

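As the example below shows, a "Schedule Interviews" step is sometimes canceled. When that happens, I need to remove that step and the subsequent steps associated with it, which can be any of "Schedule Interviews", "Decision", "On-Site Interview Kick Off", or "Phone Interview Kick Off". There are also other kinds of events, such as ones that were "Manually Skipped".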

I have other kinds of scenarios that I also need to create flags for, so I need to keep the original dataframe and just add a new column.

import pandas as pd 

data = {'Employee ID': ["100","100", "100", "100","100","100","100","100","100","100","200", "200", "200","200","200","200","200","300","300", "300", "300","300","300","300"], 
     'Completed On Date': ["2009-01-01","2010-01-01","2011-06-05","2012-07-01","2013-01-01","2014-01-01","2015-01-01","2016-01-01","2017-01-01","2018-01-01","2010-01-01","2011-06-05","2012-07-01","2012-08-15","2013-01-01","2014-01-01","2015-01-01","2009-01-01","2010-01-01","2011-06-05","2012-07-01","2013-01-01","2014-01-01","2015-01-01"], 
     'Event': ["Decision","On-Site Interview Kick Off","Schedule Interviews","Decision","On-Site Interview Kick Off","Schedule Interviews","Decision","Phone Interview Kick Off","Schedule Interviews","Decision","On-Site Interview Kick Off","Schedule Interviews","Decision","Decision","Phone Interview Kick Off","Schedule Interviews","Decision","Job Apply","Phone Interview Kick Off","Schedule Interviews","Decision","On-Site Interview Kick Off","Schedule Interviews","Decision"], 
     'Event Status': ["Completed","Completed","CANCELED","Completed","Completed","Completed","Completed","Completed","Completed","Completed","Completed","CANCELED","Manually Skipped","Completed","Completed","Completed","Completed","Completed","Completed","CANCELED","Completed","Completed","Completed","Completed"], 
     'DESIRED FLAG': ["Keep","Keep","Remove","Remove","Remove","Keep","Keep","Keep","Keep","Keep","Keep","Remove","Remove","Remove","Remove","Keep","Keep","Keep","Keep","Remove","Remove","Remove","Keep","Keep"]} 
df = pd.DataFrame(data, columns=['Employee ID','Completed On Date','Event','Event Status','DESIRED FLAG']) 
df = df.sort_values(by=(['Employee ID','Completed On Date'])) 

df 
+0

It would be very helpful to see what the desired output looks like. – pshep123

+0

See the "DESIRED FLAG" column. That is what the output should look like. Thanks! – Christopher

+0

It helps me to have it in the form of a dataframe to visualize, but maybe that's just me. – pshep123

Answers

1

I think the following code solves your problem:

import pandas as pd 

data = {'Employee ID': ["100","100", "100", "100","100","100","100","100","100","100","200", "200", "200","200","200","200","200","300","300", "300", "300","300","300","300"], 
     'Completed On Date': ["2009-01-01","2010-01-01","2011-06-05","2012-07-01","2013-01-01","2014-01-01","2015-01-01","2016-01-01","2017-01-01","2018-01-01","2010-01-01","2011-06-05","2012-07-01","2012-08-15","2013-01-01","2014-01-01","2015-01-01","2009-01-01","2010-01-01","2011-06-05","2012-07-01","2013-01-01","2014-01-01","2015-01-01"], 
     'Event': ["Decision","On-Site Interview Kick Off","Schedule Interviews","Decision","On-Site Interview Kick Off","Schedule Interviews","Decision","Phone Interview Kick Off","Schedule Interviews","Decision","On-Site Interview Kick Off","Schedule Interviews","Decision","Decision","Phone Interview Kick Off","Schedule Interviews","Decision","Job Apply","Phone Interview Kick Off","Schedule Interviews","Decision","On-Site Interview Kick Off","Schedule Interviews","Decision"], 
     'Event Status': ["Completed","Completed","CANCELED","Completed","Completed","Completed","Completed","Completed","Completed","Completed","Completed","CANCELED","Manually Skipped","Completed","Completed","Completed","Completed","Completed","Completed","CANCELED","Completed","Completed","Completed","Completed"], 
     'DESIRED FLAG': ["Keep","Keep","Remove","Remove","Remove","Keep","Keep","Keep","Keep","Keep","Keep","Remove","Remove","Remove","Remove","Keep","Keep","Keep","Keep","Remove","Remove","Remove","Keep","Keep"]} 
df = pd.DataFrame(data, columns=['Employee ID','Completed On Date','Event','Event Status','DESIRED FLAG']) 
df = df.sort_values(by=(['Employee ID','Completed On Date'])) 


index_list_delete = []
start_deleting = False
for i in range(len(df)):
    if not start_deleting:
        # whenever I see a "CANCELED", I know some following rows need to be deleted
        if df.iloc[i]['Event Status'] == 'CANCELED':
            index_list_delete.append(i)
            start_deleting = True
    else:
        # whenever I see a "Schedule Interviews", I need to stop deleting;
        # otherwise keep track of the rows that need to be deleted
        if df.iloc[i]['Event'] == 'Schedule Interviews':
            start_deleting = False
        else:
            index_list_delete.append(i)

# delete the collected rows (positions are mapped to index labels)
df = df.drop(df.index[index_list_delete])
# reset the index
df = df.reset_index(drop=True)

This gives the following result:

    Employee ID  Completed On Date                       Event  Event Status  DESIRED FLAG
 0          100         2009-01-01                    Decision     Completed          Keep
 1          100         2010-01-01  On-Site Interview Kick Off     Completed          Keep
 2          100         2014-01-01         Schedule Interviews     Completed          Keep
 3          100         2015-01-01                    Decision     Completed          Keep
 4          100         2016-01-01    Phone Interview Kick Off     Completed          Keep
 5          100         2017-01-01         Schedule Interviews     Completed          Keep
 6          100         2018-01-01                    Decision     Completed          Keep
 7          200         2010-01-01  On-Site Interview Kick Off     Completed          Keep
 8          200         2014-01-01         Schedule Interviews     Completed          Keep
 9          200         2015-01-01                    Decision     Completed          Keep
10          300         2009-01-01                   Job Apply     Completed          Keep
11          300         2010-01-01    Phone Interview Kick Off     Completed          Keep
12          300         2014-01-01         Schedule Interviews     Completed          Keep
13          300         2015-01-01                    Decision     Completed          Keep
+0

I ran some additional tests on my actual data, and this logic is not restricted to a single Employee ID. The solution needs to run only within each Employee ID set. – Christopher
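One possible way to confine the logic to each Employee ID, while keeping every row and only adding a column, is sketched below. This is not the answerer's code: the function name `add_keep_remove_flag` and the column name "FLAG" are made up for illustration. The sketch also assumes that any row with status "CANCELED" starts a removal run, and that only a "Schedule Interviews" row that is not canceled ends it, which is slightly stricter than the loop in the answer.

```python
import pandas as pd


def add_keep_remove_flag(df: pd.DataFrame, column: str = "FLAG") -> pd.DataFrame:
    """Return a sorted copy of df with a Keep/Remove column (hypothetical helper)."""
    # Sort so that each applicant's rows are contiguous and in date order.
    ordered = df.sort_values(["Employee ID", "Completed On Date"])

    flags = []
    removing = False
    previous_id = None
    for emp_id, event, status in zip(
        ordered["Employee ID"], ordered["Event"], ordered["Event Status"]
    ):
        # A new applicant starts with a clean state, so a cancellation
        # never leaks into the next Employee ID.
        if emp_id != previous_id:
            removing = False
            previous_id = emp_id

        if status == "CANCELED":
            # A canceled step is removed and starts (or continues) a removal run.
            removing = True
            flags.append("Remove")
        elif removing and event != "Schedule Interviews":
            # Follow-up steps of the canceled interview are removed as well.
            flags.append("Remove")
        else:
            # Either nothing is pending, or a non-canceled
            # "Schedule Interviews" row ends the removal run.
            removing = False
            flags.append("Keep")

    # assign() returns a new frame, so the caller's dataframe is left untouched.
    return ordered.assign(**{column: flags})


sample = pd.DataFrame({
    "Employee ID": ["A", "A", "A", "B", "B", "B"],
    "Completed On Date": ["2010-01-01", "2011-01-01", "2012-01-01",
                          "2010-01-01", "2011-01-01", "2012-01-01"],
    "Event": ["On-Site Interview Kick Off", "Schedule Interviews", "Decision",
              "Phone Interview Kick Off", "Schedule Interviews", "Decision"],
    "Event Status": ["Completed", "CANCELED", "Completed",
                     "Completed", "Completed", "Completed"],
})
result = add_keep_remove_flag(sample)
print(result[["Employee ID", "Event", "Event Status", "FLAG"]])
```

In the sample, applicant A ends on a canceled interview, and applicant B's rows are still all flagged "Keep" because the state resets when the Employee ID changes. Calling `add_keep_remove_flag(df)` on the question's dataframe and comparing the new column with "DESIRED FLAG" would be a quick way to check the assumptions against the real data.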

+0

Below is a modest, partial solution. As a next step, I still have to exclude the ones whose last step is "Schedule Interview Team". if (df.iloc[i]['Event Status'] == 'CANCELED') & (df.iloc[i]['Employee ID'] == df.iloc[i+1]['Employee ID']): – Christopher
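The row-by-row condition in this comment looks at row i + 1, which would raise an IndexError on the last row of the dataframe. A rough vectorized sketch of the same check, using `Series.shift(-1)` to look at the next row, is shown here. The sample data and the variable names are invented for illustration, and the rows are assumed to be sorted by Employee ID already.

```python
import pandas as pd

sample = pd.DataFrame({
    "Employee ID": ["A", "A", "B", "B"],
    "Event Status": ["CANCELED", "Completed", "Completed", "CANCELED"],
})

# shift(-1) lines each row up with the next row's Employee ID; the last
# row has no successor, gets a missing value, and compares as False.
same_id_as_next = sample["Employee ID"].eq(sample["Employee ID"].shift(-1))

# True where a row is canceled AND the next row belongs to the same applicant.
canceled_with_follow_up = sample["Event Status"].eq("CANCELED") & same_id_as_next

print(canceled_with_follow_up.tolist())
```

The final canceled row for applicant B has no following row, so it comes out as False instead of failing.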
