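Subset a SparkR DataFrame based on column values that match column values in another DataFrame

I have two SparkR DataFrames, newHiresDF and salesTeamDF. I want to get the subset of newHiresDF whose newHiresDF$name values also appear in salesTeamDF$name, but I can't figure out how to do this. Below is the code for my attempts.
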
#Create DataFrames
newHires <- data.frame(name = c("Thomas", "George", "Bill", "John"),
                       surname = c("Smith", "Williams", "Brown", "Taylor"))
salesTeam <- data.frame(name = c("Thomas", "Bill", "George"),
                        surname = c("Martin", "Clark", "Williams"))
newHiresDF <- createDataFrame(newHires)
salesTeamDF <- createDataFrame(salesTeam)
display(newHiresDF)
#Try to subset newHiresDF based on name values in salesTeamDF
#All of the below result in errors
NHsubset1 <- filter(newHiresDF, newHiresDF$name %in% salesTeamDF$name)
NHsubset2 <- filter(newHiresDF, intersect(select(newHiresDF, 'name'),
                                          select(salesTeamDF, 'name')))
NHsubset3 <- newHiresDF[newHiresDF$name %in% salesTeamDF$name,] #This is how it would be done in R
#What I'd like NHsubset to look like:
    name  surname
1 Thomas    Smith
2 George Williams
3   Bill    Brown
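
One likely reason the attempts above fail is that salesTeamDF$name is a distributed Column rather than a local R vector, so %in% has nothing local to compare against. A minimal sketch of one possible approach is a left semi join, which keeps only the rows of the left DataFrame that have a match in the right one and returns only the left DataFrame's columns. The sketch assumes Spark 2.x or later (sparkR.session); on 1.6 the context setup and the createDataFrame call differ. The names salesNames and NHsubset are illustrative.

```r
library(SparkR)

# Assumption: Spark 2.x+ session setup. On Databricks a session already exists.
sparkR.session()

# Same sample data as in the question
newHires <- data.frame(name = c("Thomas", "George", "Bill", "John"),
                       surname = c("Smith", "Williams", "Brown", "Taylor"))
salesTeam <- data.frame(name = c("Thomas", "Bill", "George"),
                        surname = c("Martin", "Clark", "Williams"))
newHiresDF <- createDataFrame(newHires)
salesTeamDF <- createDataFrame(salesTeam)

# Keep only the key column from the right-hand side
salesNames <- select(salesTeamDF, "name")

# Left semi join: rows of newHiresDF whose name appears in salesNames
NHsubset <- join(newHiresDF, salesNames,
                 newHiresDF$name == salesNames$name, "leftsemi")

# Row order is not guaranteed after a join
head(NHsubset)
```

If the list of names is small enough to bring to the driver, an alternative sketch is to collect it first and filter against the local vector, for example filter(newHiresDF, newHiresDF$name %in% collect(salesNames)$name). The join keeps everything distributed, which is usually the safer choice when salesTeamDF is large.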
PySpark code is also welcome.
I just realized this is a near duplicate of https://stackoverflow.com/questions/43095208/subset-dataframe-based-on-matching-values-in-another-dataframe-pyspark-1-6- but that question has not been answered either.