2017-11-20 5 views
2

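Keep only the first row of each run of duplicates

In my data frame, when the string Position occurs in several consecutive rows, I want to keep only the first row of that run. Please see the example below: the first table is my input and the second is the output I want. I have tried the duplicated function, but I can't figure out how to keep the first row.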

Time Pos 
2006-01-12 Position 
2006-01-16 Position 
2006-01-17 Position 
2006-02-01 
2006-02-01 Position 
2006-02-02 
2006-02-02 Position 
2006-02-02 Position 
2006-02-02 Position 
2006-04-04 Position 
2006-04-06 Position 
2006-04-06 Position 
2006-10-11 
2006-10-17 Position 
2006-10-18 
2006-10-18 Position 
2006-10-18 
2006-10-18 Position 
2006-10-18 
2006-10-18 Position 
2006-10-18 Position 
2006-10-18 Position 
2006-10-18 Position 
2006-10-19 Position 

Time Pos 
2006-01-12 Position 
2006-02-01 
2006-02-01 Position 
2006-02-02 
2006-02-02 Position 
2006-10-11 
2006-10-17 Position 
2006-10-18 
2006-10-18 Position 
2006-10-18 
2006-10-18 Position 
2006-10-18 
2006-10-18 Position 

Answers

2
df[head(cumsum(c(1, (rle(df$Pos)$lengths))), -1),] 
#   Time  Pos 
#1 2006-01-12 Position 
#4 2006-02-01   
#5 2006-02-01 Position 
#6 2006-02-02   
#7 2006-02-02 Position 
#13 2006-10-11   
#14 2006-10-17 Position 
#15 2006-10-18   
#16 2006-10-18 Position 
#17 2006-10-18   
#18 2006-10-18 Position 
#19 2006-10-18   
#20 2006-10-18 Position 
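
To see why this indexing works, it may help to unpack the one-liner on a small vector. The sketch below uses a made-up vector `x` as a stand-in for `df$Pos` (it is not the question's data): `rle()` gives the length of each run, and the cumulative sum of those lengths, with a leading 1, gives the index at which each run starts.

```r
# Hypothetical toy vector standing in for df$Pos
x <- c("Position", "Position", "", "Position", "Position", "Position", "")

# Length of each run of identical consecutive values
run_lengths <- rle(x)$lengths

# The first run starts at 1; every later run starts right after the previous one ends.
# cumsum(c(1, lengths)) also produces one index past the end of x, so head(..., -1) drops it.
run_starts <- head(cumsum(c(1, run_lengths)), -1)

# First element of every run
first_of_each_run <- x[run_starts]
```

Using `run_starts` as a row index, as in `df[run_starts, ]`, is what the one-liner above does in a single step.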
3

Here is a solution with dplyr + data.table::rleid:

library(dplyr) 

df %>% 
    mutate(ID = data.table::rleid(Pos)) %>% 
    group_by(ID) %>% 
    slice(1) %>% 
    ungroup() %>% 
    select(-ID) 

Result:

# A tibble: 13 x 2 
     Time  Pos 
     <chr> <chr> 
1 2006-01-12 Position 
2 2006-02-01   
3 2006-02-01 Position 
4 2006-02-02   
5 2006-02-02 Position 
6 2006-10-11   
7 2006-10-17 Position 
8 2006-10-18   
9 2006-10-18 Position 
10 2006-10-18   
11 2006-10-18 Position 
12 2006-10-18   
13 2006-10-18 Position 
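
If you would rather not depend on data.table, a similar pipeline can be sketched with `consecutive_id()`, which assumes dplyr 1.1.0 or later. This is an alternative to the answer's method, not a restatement of it, and the data frame `toy` below is made up for illustration.

```r
library(dplyr)

# Hypothetical small data frame in the same shape as the question's df
toy <- data.frame(
  Time = c("2006-01-12", "2006-01-16", "2006-02-01", "2006-02-01", "2006-02-02"),
  Pos  = c("Position", "Position", "", "Position", "Position"),
  stringsAsFactors = FALSE
)

toy_runs <- toy %>%
  group_by(run = consecutive_id(Pos)) %>%  # new id each time Pos changes
  slice_head(n = 1) %>%                    # first row of every run
  ungroup() %>%
  select(-run)                             # drop the helper column
```

Here `run` plays the same role as the `ID` column built with `rleid` above.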

Or the data.table equivalent:

library(data.table) 

setDT(df)[, .SD[1], by = rleid(Pos), .SDcols = c("Time", "Pos")] 

Result:

rleid  Time  Pos 
1:  1 2006-01-12 Position 
2:  2 2006-02-01   
3:  3 2006-02-01 Position 
4:  4 2006-02-02   
5:  5 2006-02-02 Position 
6:  6 2006-10-11   
7:  7 2006-10-17 Position 
8:  8 2006-10-18   
9:  9 2006-10-18 Position 
10: 10 2006-10-18   
11: 11 2006-10-18 Position 
12: 12 2006-10-18   
13: 13 2006-10-18 Position 

Data:

df = structure(list(Time = c("2006-01-12", "2006-01-16", "2006-01-17", 
"2006-02-01", "2006-02-01", "2006-02-02", "2006-02-02", "2006-02-02", 
"2006-02-02", "2006-04-04", "2006-04-06", "2006-04-06", "2006-10-11", 
"2006-10-17", "2006-10-18", "2006-10-18", "2006-10-18", "2006-10-18", 
"2006-10-18", "2006-10-18", "2006-10-18", "2006-10-18", "2006-10-18", 
"2006-10-19"), Pos = c("Position", "Position", "Position", "", 
"Position", "", "Position", "Position", "Position", "Position", 
"Position", "Position", "", "Position", "", "Position", "", "Position", 
"", "Position", "Position", "Position", "Position", "Position" 
)), .Names = c("Time", "Pos"), class = "data.frame", row.names = c(NA, 
-24L)) 
1

You could try using lag.

library(dplyr) 

df2 <- df %>% 
    mutate(pos = ifelse(Pos == "Position", 1, 0), 
      lag = lag(pos, n = 1)) %>% 
    filter(is.na(lag) | lag == 0) 
+0

This also removes all rows with an empty 'Pos', which does not match the expected output – useR
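
One way the lag idea might be adjusted to address the comment above is to compare each `Pos` value with the one in the row before it, instead of testing whether the previous row was empty. The following is a minimal sketch, assuming `Pos` contains no `NA` values; the data frame `toy` is made up for illustration.

```r
library(dplyr)

# Hypothetical small data frame in the same shape as the question's df
toy <- data.frame(
  Time = c("2006-01-12", "2006-01-16", "2006-02-01", "2006-02-01", "2006-02-02"),
  Pos  = c("Position", "Position", "", "Position", "Position"),
  stringsAsFactors = FALSE
)

# Keep the first row (where lag(Pos) is NA) and every row whose Pos
# differs from the row directly above it, including empty-Pos rows.
toy_first <- toy %>%
  filter(is.na(lag(Pos)) | Pos != lag(Pos))
```

Because the filter looks only at whether the value changed, rows with an empty `Pos` survive whenever they begin a new run.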
