2017-09-13 11 views
1

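Merging two document-term matrices by row

I have customer queries and customer-service replies stored in a CSV file. I need to identify the subject of each query and later build a classification model on it. After cleaning the documents, I created two document-term matrices, one for the queries and one for the replies. I reduced their size by keeping only the words that occur roughly 400 times or more across all documents (about 40k queries and replies).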

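I want to merge these two matrices by row and build a data frame that keeps the words common to the query dtm and the reply dtm, with their frequencies added together, so that I can label each query with its highest-frequency word.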

Any help or suggestions on the approach would be highly appreciated.

> str(inspect(dtmaf))
<<DocumentTermMatrix (documents: 38697, terms: 237)>>
Non-/sparse entries: 326124/8845065
Sparsity           : 96%
Maximal term length: 13
Weighting          : term frequency (tf)
Sample             :
       Terms
Docs    booking card change check confirm confirmation email make port wish
  12316       3    1      0     0       0            0     0    0    1    1
  137         4    1      2     0       1            0     0    0    0    0
  17618       4    1      0     0       0            0     0    2    0    2
  18082       2    1      3     1       1            0     0    0    1    0
  19141       3    0      2     0       1            0     0    0    1    0
  21862       2    0      0     0       0            0     0    1    0    0
  2756        1    0      2     0       0            0     0    1    0    1
  27578       2    1      5     0       0            0     0    0    0    1
  30312       4    1      2     0       0            0     0    2    0    2
  9019        1    1      1     0       0            0     0    0    0    0
 num [1:10, 1:10] 3 4 4 2 3 2 1 2 4 1 ...
 - attr(*, "dimnames")=List of 2
  ..$ Docs : chr [1:10] "12316" "137" "17618" "18082" ...
  ..$ Terms: chr [1:10] "booking" "card" "change" "check" ...

> str(inspect(dtmc))
<<DocumentTermMatrix (documents: 38697, terms: 189)>>
Non-/sparse entries: 204107/7109626
Sparsity           : 97%
Maximal term length: 13
Weighting          : term frequency (tf)
Sample             :
       Terms
Docs    booking car change confirmation like number possible reservation return ticket
  14091       0   0      0            0    2      0        0           2      0      0
  18220       6   0      0            2    0      0        0           0      0      0
  20103       1   0      1            0    0      1        0           0      0      0
  20184       0   3      0            0    0      1        0           4      1      0
  21005       3   5      0            1    2      0        1           0      0      0
  24877       0   1      1            0    0      0        0           2      0      1
  26135       0   0      0            0    0      0        0           1      0      0
  28200       5   2      1            0    0      0        0           1      0      0
  2979       12   7      2            0    1      0        0           0      0      0
  680         0   0      1            2    0      1        0           0      0      0
 num [1:10, 1:10] 0 6 1 0 3 0 0 5 12 0 ...
 - attr(*, "dimnames")=List of 2
  ..$ Docs : chr [1:10] "14091" "18220" "20103" "20184" ...
  ..$ Terms: chr [1:10] "booking" "car" "change" "confirmation" ...

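The expected output is a matrix with 38697 rows and up to (237 + 189) term columns. A term that appears in both dtms gets a single column with its frequencies summed; terms that do not match are carried over unchanged. Here is a reproducible example with 10 documents: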

> dput(datamsg) 
structure(list(cmessage = c("No answer ?", "Hello the third number is . I bought this boarding card immediately after the operator has told me from the previous logbook the number can not be found in the system. Therefore I request to return money. It was not my fault !", 
"Hi I forget probably choose items on the How can I do this now. ", 
"Hi I forget probably choose items How can i do this now. ", 
"Hello I tell if I have booked . If not is it possible and what would it cost? ", 
"First I wanted to transfer fromThen I wanted to know if you can spontaneously postpone the return ", 
"Hello. Does the have an exact address? With this address I do not find it on the navigation. Have an exact address where I can get the ticets. Where I get the Tikets then. Is the automatic chekin. Or do I then mot the tickets to the Chekin. Thank you. But rather ask more questions. ", 
"Dear booked everything again. Also the journey through In my previous message I stated that it is a complete cancellation and I have booked the return trip. I do not intend to pay twice for travel. ", 
"Thank you. When will the new registration show ?...as it still shows the . Thanks", 
"So my phone number is .Please tell me how this works."), afreply = c("Hello afraid there is no space on the September. I have also checked but are all fully booked. Would you like us to check any other dates for you? ", 
"Hello As far as we can see the booking No was a valid reservation. We have however contacted and can confirm that administration fee was refunded back to your card. ", 
"Good afternoon You are currently booked as high plane. You have requested an amendment to change the height which will be more expensive. Could you please confirm the actual height of . We have cancelled you amendment request please submit a new one with an accurate height ofreply to this message. ", 
"Hello thanks for your message. I have checked and can see you have amended your height to on your booking. If you require any other assistance with your booking please contact us.", 
"Hello you booked any In order to make a change to your booking kindly send us a amendment request via", 
"Dear Mr. what dimensions you want to take with you? here is only the possibility to change your departure for a change of booking fee and a possible ticket price difference. The ticket price difference can be requested if you call us an alternative travel date.", 
"Dear Sir or Madam we will send you the address ", "Hello your crossing with was already refunded. As my colleague told you your with was still valid. In case you have booked a second ticket with please send us the new booking reference number but we cannot guarantee that you will be entitle to a refund. ", 
"if you can authorise us to take the payment from the card you used to make the we can then make the change.", 
"Good morning we could not reach you by telephone. If you do not have we can send you an invoice via PayPal. The change can not be made until paid. . Do you want to pay the change to 1. " 
)), .Names = c("cmessage", "afreply"), class = "data.frame", row.names = c(NA, 
-10L)) 

library(tm)

corpus1 <- Corpus(VectorSource(datamsg$cmessage))
corpus2 <- Corpus(VectorSource(datamsg$afreply))
dtmc <- DocumentTermMatrix(corpus1, control = list(weighting = weightTf))
dtmaf <- DocumentTermMatrix(corpus2, control = list(weighting = weightTf))
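
Can you give an example of the desired output? – PoGibas

I have added the expected output to the question above. – NKaz

This doesn't help anyone. If you want answers, please add a reproducible sample and an example of the desired output to your question: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – PoGibas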

Answers

1

Your code:

#dput(datamsg) 
datamsg <- 
     structure(
       list(
         cmessage = c(
           "No answer ?", 
           "Hello the third number is . I bought this boarding card immediately after the operator has told me from the previous logbook the number can not be found in the system. Therefore I request to return money. It was not my fault !", 
           "Hi I forget probably choose items on the How can I do this now. ", 
           "Hi I forget probably choose items How can i do this now. ", 
           "Hello I tell if I have booked . If not is it possible and what would it cost? ", 
           "First I wanted to transfer fromThen I wanted to know if you can spontaneously postpone the return ", 
           "Hello. Does the have an exact address? With this address I do not find it on the navigation. Have an exact address where I can get the ticets. Where I get the Tikets then. Is the automatic chekin. Or do I then mot the tickets to the Chekin. Thank you. But rather ask more questions. ", 
           "Dear booked everything again. Also the journey through In my previous message I stated that it is a complete cancellation and I have booked the return trip. I do not intend to pay twice for travel. ", 
           "Thank you. When will the new registration show ?...as it still shows the . Thanks", 
           "So my phone number is .Please tell me how this works." 
         ), 
         afreply = c(
           "Hello afraid there is no space on the September. I have also checked but are all fully booked. Would you like us to check any other dates for you? ", 

           "Hello As far as we can see the booking No was a valid reservation. We have however contacted and can confirm that administration fee was refunded back to your card. ", 
           "Good afternoon You are currently booked as high plane. You have requested an amendment to change the height which will be more expensive. Could you please confirm the actual height of . We have cancelled you amendment request please submit a new one with an accurate height ofreply to this message. ", 
           "Hello thanks for your message. I have checked and can see you have amended your height to on your booking. If you require any other assistance with your booking please contact us.", 
           "Hello you booked any In order to make a change to your booking kindly send us a amendment request via", 
           "Dear Mr. what dimensions you want to take with you? here is only the possibility to change your departure for a change of booking fee and a possible ticket price difference. The ticket price difference can be requested if you call us an alternative travel date.", 
           "Dear Sir or Madam we will send you the address ", 
           "Hello your crossing with was already refunded. As my colleague told you your with was still valid. In case you have booked a second ticket with please send us the new booking reference number but we cannot guarantee that you will be entitle to a refund. ", 
           "if you can authorise us to take the payment from the card you used to make the we can then make the change.", 
           "Good morning we could not reach you by telephone. If you do not have we can send you an invoice via PayPal. The change can not be made until paid. . Do you want to pay the change to 1. " 
         ) 
       ), 
       .Names = c("cmessage", "afreply"), 
       class = "data.frame", 
       row.names = c(NA,-10L) 
     ) 

library(tm) # needed for Corpus(), VectorSource() and DocumentTermMatrix()

corpus1 <- Corpus(VectorSource(datamsg$cmessage)) # 10 docs
corpus2 <- Corpus(VectorSource(datamsg$afreply)) # 10 docs

dtmc <- DocumentTermMatrix(corpus1, control = list(weighting = weightTf))
dtmaf <- DocumentTermMatrix(corpus2, control = list(weighting = weightTf))

My code follows:

library(tm)
library(dplyr)
library(tibble) # provides rownames_to_column() and column_to_rownames()

# rename anonymous document ids:
rownames(dtmc) <- dtmc %>% rownames() %>% as.numeric() %>% sprintf("doc%05d", .)
rownames(dtmaf) <- dtmaf %>% rownames() %>% as.numeric() %>% sprintf("doc%05d", .)

# transpose to term-document matrices
tdmc <- dtmc %>% t()
tdmaf <- dtmaf %>% t()

# convert to data frames with the terms in a new first column "word"
tdmc_df <- tdmc %>% as.matrix() %>% as.data.frame() %>% rownames_to_column(var = "word")
tdmaf_df <- tdmaf %>% as.matrix() %>% as.data.frame() %>% rownames_to_column(var = "word")

# keep only the words common to both matrices
tdm_df <- tdmc_df %>% inner_join(tdmaf_df, by = "word")
tdm_df <- tdm_df %>% arrange(word)
dtm_df <- tdm_df %>% column_to_rownames("word") %>% t()

# count occurrences of matching words (summed over queries and replies)
colSums(dtm_df)

# find nonmatching words: those in the queries that never occur in the replies
dtm_df_nonmatching <- tdmc_df %>% anti_join(tdmaf_df, by = "word") %>% arrange(word) %>% column_to_rownames("word")

# count occurrences of nonmatching words
rowSums(dtm_df_nonmatching)

The common words, with their counts:

colSums(dtm_df)
 address     also      and   booked      but      can     card     dear      for     from     have    hello  message
       4        2        5        7        3       13        3        3        4        2       12        8        3
    more      new      not   number      pay   please possible  request    still   thanks     that      the     then
       2        3        8        4        2        5        2        3        2        2        3       32        3
    this     told   travel      was     what     will     with    would      you
       6        2        2        5        2        4        7        2       25
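
The counts above cover only the words shared by both matrices. The output described in the question (every term kept once, with the frequencies of shared terms summed) could be sketched in base R roughly as follows. The toy matrices `m_q` and `m_r` are hypothetical stand-ins for `as.matrix(dtmc)` and `as.matrix(dtmaf)`, and the sketch assumes both hold the same documents in the same row order.

```r
# Hypothetical toy stand-ins for as.matrix(dtmc) and as.matrix(dtmaf):
# rows are documents (same order in both), columns are terms.
m_q <- matrix(c(1, 0, 2,
                0, 3, 0),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("doc1", "doc2"), c("booking", "car", "change")))
m_r <- matrix(c(2, 1,
                0, 4),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("doc1", "doc2"), c("booking", "card")))

# Put all columns side by side, then add up the columns that share a term name.
# rowsum() sums rows by group, so transpose, group by term, and transpose back.
both <- cbind(m_q, m_r)
merged <- t(rowsum(t(both), group = colnames(both)))

# Label each document with its most frequent term (the first one wins on ties).
top_term <- colnames(merged)[max.col(merged, ties.method = "first")]
names(top_term) <- rownames(merged)
```

Here `merged` has one column per distinct term (`booking` is summed across both inputs), and `top_term` gives one label per document. Note that `as.matrix()` on the real matrices produces dense matrices, so memory use grows with documents times terms.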
1

There is an easier way using the quanteda package.

library("quanteda") 
packageVersion("quanteda") 
# [1] ‘0.99.9’ 

First, we create two document-feature matrices and work out which features they have in common:

dfm_c <- dfm(datamsg$cmessage, remove_punct = TRUE) 
dfm_af <- dfm(datamsg$afreply, remove_punct = TRUE) 
common_feature_names <- intersect(featnames(dfm_c), featnames(dfm_af)) 

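Then we can combine them using cbind(), which (correctly) issues a warning that you now have duplicated features. The second line selects just the common features, and the third line sums the features that share a name within the dfm, which gives you what you need.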

combined_dfm <- cbind(dfm_c, dfm_af) %>% 
    dfm_select(pattern = common_feature_names) %>% 
    dfm_compress() 
head(combined_dfm) 
# Document-feature matrix of: 6 documents, 6 features (41.7% sparse). 
# 6 x 6 sparse Matrix of class "dfmSparse" 
#  features 
# docs no hello the number is i 
# text1 2  1 1  0 1 1 
# text2 1  2 6  2 1 2 
# text3 0  0 3  0 0 2 
# text4 0  1 0  0 0 3 
# text5 0  2 0  0 1 2 
# text6 0  0 3  0 1 2 

If you really, really want to get this back into tm, you can convert it using:

convert(combined_dfm, to = "tm") 
# <<DocumentTermMatrix (documents: 10, terms: 49)>> 
# Non-/sparse entries: 189/301 
# Sparsity   : 61% 
# Maximal term length: 8 
# Weighting   : term frequency (tf) 

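Note: You did not say clearly whether you need to merge the dfms across different documents; here I have assumed (from the example) that the documents are the same. If they differ, that is also easily solved, but the question does not specify it.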
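
If the two matrices did cover different documents, one possible approach is to pad each matrix with zeros out to the union of document ids and terms and then add them. This is an assumption about what "merging" should mean in that case, since the question does not specify it. A base R sketch, with a hypothetical helper `pad_to()` and toy matrices in place of the real data:

```r
# Hypothetical helper: expand m to the given document and term sets,
# filling the cells that m does not cover with 0.
pad_to <- function(m, docs, terms) {
  out <- matrix(0, nrow = length(docs), ncol = length(terms),
                dimnames = list(docs, terms))
  out[rownames(m), colnames(m)] <- m
  out
}

# Toy matrices (assumed data): only doc2 appears in both.
m_q <- matrix(c(1, 2,
                0, 3),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("doc1", "doc2"), c("booking", "change")))
m_r <- matrix(c(4, 1,
                0, 5),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("doc2", "doc3"), c("booking", "card")))

all_docs  <- union(rownames(m_q), rownames(m_r))
all_terms <- sort(union(colnames(m_q), colnames(m_r)))

# Both padded matrices now have identical dimnames, so they can simply be added.
merged <- pad_to(m_q, all_docs, all_terms) + pad_to(m_r, all_docs, all_terms)
```

This relies on document ids being unique within each matrix, because the padding step indexes by row name.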
