CountVectorizer TypeError in scikit-learn: expected string or buffer

I am trying to solve a classification problem. When I feed text to CountVectorizer, it raises this error:

expected string or buffer.

Is there something wrong with my dataset? Some of the messages contain a mix of numbers, words, and even special characters. A sample of what the messages look like:

0   I have not received my gifts which I ordered ok 
1     hth her wells idyll McGill kooky bbc.co 
2         test test test 1 test 
3             test 
4       hello where is my reward points 
5  hi, can you get koovs coupons or vouchers here...

Here is the code I use for the classification:

import pandas as pd 
from sklearn.feature_extraction.text import CountVectorizer 
df = pd.read_excel('training_data.xlsx') 
X_train = df.message 
print(X_train.shape)
map_class_label = {'checkin':0, 'greeting':1,'more reward options':2,'noclass':3, 'other':4,'points':5, 
          'referral points':6,'snapbill':7, 'thanks':8,'voucher not working':9,'voucher':10} 
df['label_num'] = df['Final Category'].map(map_class_label) 
y_train = df.label_num 
vectorizer = CountVectorizer(lowercase=False,decode_error='ignore') 
X_train_dtm = vectorizer.fit_transform(X_train)
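Before changing anything, it can help to check which Python types the message column actually holds, since a column read from Excel may mix text with numbers and empty cells. The following is only a sketch: the `messages` Series is made-up stand-in data, not the real `training_data.xlsx`.

```python
import pandas as pd

# Hypothetical stand-in for df.message: text plus a number, an empty cell (NaN) and a float.
messages = pd.Series(
    ["hello where is my reward points", "test", 12345, float("nan"), 3.5]
)

# Count how many values of each Python type the column contains.
type_counts = messages.map(lambda v: type(v).__name__).value_counts()
print(type_counts)

# Select the rows that are not text; these are the ones a vectorizer cannot tokenize.
not_text = ~messages.map(lambda v: isinstance(v, str))
print(messages[not_text])
```

If anything other than `str` shows up in the counts, those rows are the likely cause of the error.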

Source

2016-08-24 Pooja Pandey

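@jezrael Final Category is the class label (text data) for each message, which I convert to a number by mapping it into the label_num column. It is not missing from my dataset; I just did not show it, because the problem occurs when I try to fit and transform the messages with CountVectorizer.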

Does my solution work? – jezrael

You need to convert the message column to string with astype, because some of the values are numeric:

df = pd.read_excel('training_data.xlsx') 
df['message'] = df['message'].values.astype('unicode') 
... 
...
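A small end-to-end sketch of the idea, assuming Python 3 and a recent pandas and scikit-learn. The three-row DataFrame is invented for illustration, and the exact exception text depends on the Python and scikit-learn versions (Python 3 words it as "expected string or bytes-like object").

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical data: two text messages and one cell that was read as a number.
df = pd.DataFrame(
    {"message": ["test test test 1 test", "hello where is my reward points", 12345]}
)

vectorizer = CountVectorizer(lowercase=False, decode_error="ignore")

# The raw column fails because the tokenizer receives an int instead of text.
try:
    vectorizer.fit_transform(df["message"])
    failed = False
except (TypeError, AttributeError, ValueError) as exc:
    failed = True
    print("fit_transform failed:", exc)

# Convert every value to text first, then vectorize.
df["message"] = df["message"].astype(str)
X_train_dtm = vectorizer.fit_transform(df["message"])
print(X_train_dtm.shape)
```

Note that `astype(str)` turns empty cells into the literal text `nan`, so it may be worth dropping or filling missing messages beforehand.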

Source

2016-08-24 07:01:18 jezrael

It cannot be converted because of a UnicodeEncodeError. I also tried df.message.apply(str). –
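For context: on Python 2, `str()` encodes unicode text with the ASCII codec, so it raises UnicodeEncodeError as soon as a message contains a non-ASCII character. On Python 3, `str` is already Unicode and that particular failure does not occur. A minimal sketch of a conversion helper, assuming Python 3; `to_text` is a hypothetical name, not a pandas or scikit-learn function, and the sample values are made up:

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as text without raising on non-ASCII input."""
    if isinstance(value, str):
        # Already text: leave it untouched.
        return value
    if isinstance(value, bytes):
        # Raw bytes: decode, replacing any undecodable bytes instead of failing.
        return value.decode(encoding, errors="replace")
    # Numbers and other objects: use their string representation.
    return str(value)


# Made-up sample values: text, UTF-8 bytes, an int and a float.
samples = ["hi, can you get koovs coupons", b"caf\xc3\xa9", 42, 3.5]
cleaned = [to_text(v) for v in samples]
print(cleaned)
```

On a DataFrame this could be applied as `df['message'].apply(to_text)` before calling `fit_transform`.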

Hmm, one idea: is it possible to set the 'message' column in Excel to a string type? – jezrael

Thank you. – jezrael
