Factor levels remain the same even after removing one level

I am working through the Kaggle Titanic Machine Learning dataset example and have run into the following problem. The error message reads:

Error in predict.randomForest(modelFit, newtest) : 
Type of predictors in new data do not match that of the training data.

Here is my entire code:

#Load the libraries: 
library(ggplot2) 
library(randomForest) 

#Load the data: 
set.seed(1) 
train <- read.csv("train.csv") 
test <- read.csv("test.csv") 
gendermodel <- read.csv("gendermodel.csv") 
genderclassmodel <- read.csv("genderclassmodel.csv") 

#Preprocess the data and feature extraction: 
features <- c("Pclass", "Age", "Sex", "Parch", "SibSp", "Fare", "Embarked")     

newtrain <- train[,features] 
newtest <- test[,features] 

newtrain$Embarked[newtrain$Embarked==""] <- "S" 
newtrain$Fare[newtrain$Fare == 0] <- median(newtrain$Fare, na.rm=TRUE) 
newtrain$Age[is.na(newtrain$Age)] <- -1 

newtest$Embarked[newtest$Embarked==""] <- "S" 
newtest$Fare[newtest$Fare == 0] <- median(newtest$Fare, na.rm=TRUE) 
newtest$Fare <- ifelse(is.na(newtest$Fare), mean(newtest$Fare, na.rm = TRUE), newtest$Fare) 
newtest$Age[is.na(newtest$Age)] <- -1 

#Model building 

modelFit <- randomForest(newtrain, as.factor(train$Survived), ntree = 100, importance = TRUE) 
predictedOutput <- data.frame(PassengerID = test$PassengerId) 
predictedOutput$Survived <- predict(modelFit, newtest) 
write.csv(predictedOutput, file = "TitanicPrediction.csv", row.names=FALSE) 

MDA <- importance(modelFit, type=1) 
featureImportance <- data.frame(Feature = row.names(MDA), Importance = MDA[,1]) 

#Plots 
g <- ggplot(featureImportance, aes(x=Feature, y=Importance)) + geom_bar(stat="identity") + xlab("Feature") + ylab("Importance") + ggtitle("Feature importance") 
ggsave("FeatureImportance.png", p)

I understand what the error message means, but when I run str(newtrain) and str(newtest), I get the following even after assigning newtrain$Embarked[newtrain$Embarked==""] <- "S":

str(newtrain) 
'data.frame': 891 obs. of 7 variables: 
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ... 
$ Age  : num 22 38 26 35 35 -1 54 2 27 14 ... 
$ Sex  : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ... 
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ... 
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ... 
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ... 
$ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ... 
> length(which(train$Embarked == "")) 
[1] 2 
> length(which(newtrain$Embarked == "")) 
[1] 0

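When I count the empty values in the Embarked column of train and of newtrain, I get the expected output shown above, yet the "" level is still listed. I don't know where I am going wrong. Any help is greatly appreciated! Thanks!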

Source

2016-08-05 Gingerbread

If the problem is the factor levels, have you tried droplevels? – aosmith

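After your line

newtrain$Embarked[newtrain$Embarked==""] <- "S"

do

newtrain$Embarked <- factor(newtrain$Embarked)

This resets the factor levels to those actually present in the modified newtrain$Embarked. Assigning "S" to the empty entries changes the values but leaves the unused "" level in place, so the factor still has four levels.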
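To see why the levels linger, here is a minimal base R sketch. The vectors train_embarked and test_embarked are made-up stand-ins for the Embarked columns, not the Kaggle data, and the assumption is that the test column has no "" level, which is what would produce the mismatch reported by predict().

```r
# Toy stand-ins for the Embarked columns (illustrative values, not the Kaggle data).
train_embarked <- factor(c("S", "C", "", "Q", "S", ""))
test_embarked  <- factor(c("Q", "S", "C", "S"))

# Replacing the "" values changes the data but not the set of levels.
train_embarked[train_embarked == ""] <- "S"
sum(train_embarked == "")   # 0: no "" values remain
levels(train_embarked)      # "" is still listed as a level

# The leftover level makes the two factors look different.
identical(levels(train_embarked), levels(test_embarked))   # FALSE

# Either call rebuilds the level set from the values actually present.
fixed_factor <- factor(train_embarked)
fixed_drop   <- droplevels(train_embarked)

identical(levels(fixed_factor), levels(test_embarked))     # TRUE
identical(levels(fixed_drop), levels(test_embarked))       # TRUE
```

Both factor() and droplevels() give the same result here; droplevels() can also be applied to a whole data frame if several factor columns need cleaning.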

Also, in the last line of the posted code, p should be g.

Good luck with Kaggle!

Source

2016-08-05 21:23:00 aichao

That worked! Thank you so much! :D – Gingerbread
