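My original data frame has columns such as: unique ID, event date-time, event date, event day of week, event hour, numeric variable 1, numeric variable 2, and so on.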
library(MASS) # ginv()
library(FactoMineR) # PCA()
library(dplyr) # data_frame()
df <- read.csv("mm.csv",header=TRUE,sep=",")
for (i in unique(df$customer_id)) {
#I initialize the output data frame so I can rbind to it as I loop through the grains. This data frame is always emptied out once we move on to the next customer_id
output.df <- data_frame(seller_name = factor(), is_anomaly_date = integer(), event_date_hr = double(), event_day_of_wk = integer(), event_day = double(), ...)
for (k in unique(df$event_day_of_wk)) {
for (z in unique(df$event_hr)) {
merchant.df = df[df$customer_id==i & df$event_day_of_wk==k & df$event_hr==z,10:19] #columns 10:19 are the 9 different numeric variables for which I am creating anomaly thresholds
#1st anomaly threshold - I have multiple different anomaly thresholds
# TRANSFORM VARIABLES - at some point within the for loop I run another loop that transforms the subset of data within it.
for(j in names(merchant.df)){
merchant.df[[paste0(j,"_log")]] <- log(merchant.df[[j]]+1)
#merchant.df[[paste0(j,"_scale")]] <- scale(merchant.df[[j]])
#merchant.df[[paste0(j,"_cube")]] <- merchant.df[[j]]**3
#merchant.df[[paste0(j,"_cos")]] <- cos(merchant.df[[j]])
}
mu_vector = apply(merchant.df, 2, mean)
sigma_matrix = cov(merchant.df, use="complete.obs", method='pearson')
inv_sigma_matrix = ginv(sigma_matrix)
det_sigma_matrix = det(sigma_matrix)
z_probas = apply(merchant.df, 1, mv_gaussian, mu_vector, det_sigma_matrix, inv_sigma_matrix)
eps = quantile(z_probas,0.01)
mv_outliers = ifelse(z_probas<eps, TRUE, FALSE)
#2nd anomaly threshold
nov = ncol(merchant.df)
pca_result <- PCA(merchant.df,graph = F, ncp = nov, scale.unit = T)
pca.var <- pca_result$eig[, 'cumulative percentage of variance']/100
lambda <- pca_result$eig[, 'eigenvalue']
anomaly_score = (as.matrix(pca_result$ind$coord)^2) %*% (1/as.matrix(lambda, ncol = 1))
significance <- 0.99
thresh = qchisq(significance, nov)
pca_outliers = ifelse(anomaly_score > thresh , TRUE, FALSE)
#This is where I bind the anomaly flags to the subset and then row-bind the result onto the output data frame. The code then goes back to the top and loops through the next hour and then the next day of the week. temp.output.df is remade on every iteration and output.df slowly grows bigger.
temp.output.df <- cbind(merchant.df, mv_outliers, pca_outliers)
output.df <- rbind(output.df, temp.output.df)
}
}
#Again, this is where I write the output for a particular unique_ID; output.df is then recreated at the top for the next unique_ID
}
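The code above shows the idea of what I am doing. As you can see, I run three for loops to compute multiple anomaly detections at the lowest grain, which is the hour level within each day of the week, and then write the output for every unique customer_id to a csv.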
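The mv_gaussian helper called inside the loop is not shown above. A minimal sketch of a function with the same call signature might look like the following, assuming it evaluates the multivariate normal density of a single row. The toy data frame and its column names a and b are made up for illustration, and solve() stands in for ginv() only to keep the sketch in base R.

```r
# Hypothetical stand-in for mv_gaussian (the real helper is not shown above).
# Assumption: it returns the multivariate normal density of one observation x.
mv_gaussian <- function(x, mu, det_sigma, inv_sigma) {
  k <- length(mu)
  d <- as.numeric(x) - mu
  quad <- as.numeric(t(d) %*% inv_sigma %*% d)  # squared Mahalanobis distance
  exp(-0.5 * quad) / sqrt((2 * pi)^k * det_sigma)
}

# Toy data, invented purely for illustration.
set.seed(1)
toy <- data.frame(a = rnorm(200), b = rnorm(200))

mu_vector <- colMeans(toy)
sigma_matrix <- cov(toy)
inv_sigma_matrix <- solve(sigma_matrix)  # base R inverse; the loop above uses ginv()
det_sigma_matrix <- det(sigma_matrix)

# Same pattern as in the loop: one density per row, then flag the lowest 1%.
z_probas <- apply(toy, 1, mv_gaussian, mu_vector, det_sigma_matrix, inv_sigma_matrix)
eps <- quantile(z_probas, 0.01)
mv_outliers <- z_probas < eps
```

Rows whose density falls below the 1% quantile are the ones flagged, which is what the first anomaly threshold in the loop does for each customer, day of week and hour subset.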
