Python TfidfVectorizer throws TypeError: expected string or bytes-like object on a CSV file

I am parsing a very large CSV file and trying to extract tf-idf information from it with scikit-learn. Unfortunately, the processing never finishes because it throws this TypeError. Is there a way to programmatically alter the CSV file to get rid of this error? Here is my code:

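import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the papers; sep=None asks pandas to detect the delimiter
df = pd.read_csv("C:/Users/aidan/Downloads/papers/papers.csv", sep=None)
# Attempt to filter out null values
df = df[pd.notnull(df)]

n_features = 1000
n_topics = 8
n_top_words = 10

tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=n_features,
                                   stop_words='english', lowercase=False)

# This is the line that raises the TypeError
tfidf = tfidf_vectorizer.fit_transform(df['paper_text'])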

The error comes from the last line. Thanks!

Traceback (most recent call last): 
    File "C:\Users\aidan\NIPS Analysis 2.0.py", line 35, in <module> 
    tfidf = tfidf_vectorizer.fit_transform(df['paper_text']) 
    File "c:\python\python36\lib\site-packages\sklearn\feature_extraction\text.py", line 1352, in fit_transform 
    X = super(TfidfVectorizer, self).fit_transform(raw_documents) 
    File "c:\python\python36\lib\site-packages\sklearn\feature_extraction\text.py", line 839, in fit_transform 
    self.fixed_vocabulary_) 
    File "c:\python\python36\lib\site-packages\sklearn\feature_extraction\text.py", line 762, in _count_vocab 
    for feature in analyze(doc): 
    File "c:\python\python36\lib\site-packages\sklearn\feature_extraction\text.py", line 241, in <lambda> 
    tokenize(preprocess(self.decode(doc))), stop_words) 
    File "c:\python\python36\lib\site-packages\sklearn\feature_extraction\text.py", line 216, in <lambda> 
    return lambda doc: token_pattern.findall(doc) 
TypeError: expected string or bytes-like object

Source

2017-05-12 Aidan Kehoe

Have you checked df.dtypes? What is the output?
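A minimal sketch of that check, assuming a small in-memory CSV as a hypothetical stand-in for papers.csv (the sample rows are invented for illustration). A text column is usually reported as object, but an object column can still hold non-string values such as NaN, so it can help to look at the element types as well:

```python
import io

import pandas as pd

# Hypothetical stand-in for papers.csv: the second row has an empty paper_text.
csv_data = io.StringIO(
    "id,paper_text\n"
    "1,neural networks learn features\n"
    "2,\n"
    "3,kernel methods and features\n"
)
df = pd.read_csv(csv_data)

# Column-level view: text columns usually show up as "object"
# (newer pandas versions may report a string dtype instead).
print(df.dtypes)

# Element-level view: count the Python types actually stored in the column.
type_counts = df["paper_text"].map(type).value_counts()
print(type_counts)

# Rows whose paper_text is not a str are the ones the tokenizer cannot handle.
bad_rows = df[~df["paper_text"].map(lambda v: isinstance(v, str))]
print(bad_rows)
```

If bad_rows is non-empty, those entries are the likely cause of "expected string or bytes-like object", because the vectorizer applies a regular expression to each document.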

You can add dtype=str as an argument to the .read_csv() call.
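A rough sketch of how that might look, again using a hypothetical in-memory CSV in place of papers.csv. One caveat that goes beyond what was asked: as far as I know, pandas still reads empty fields as NaN even with dtype=str, and df[pd.notnull(df)] only masks values rather than dropping rows, so this sketch also drops rows with a missing paper_text explicitly. The vectorizer settings are simplified for the tiny sample data:

```python
import io

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample data: one empty paper_text and one purely numeric one.
csv_data = io.StringIO(
    "id,paper_text\n"
    "1,neural networks learn features\n"
    "2,\n"
    "3,kernel methods learn features\n"
    "4,12345\n"
)

# dtype=str keeps every non-empty field as a string instead of inferring numbers.
df = pd.read_csv(csv_data, dtype=str)

# Empty fields are still NaN, so remove the rows that have no text at all.
df = df.dropna(subset=["paper_text"])

tfidf_vectorizer = TfidfVectorizer(min_df=2, stop_words="english", lowercase=False)
tfidf = tfidf_vectorizer.fit_transform(df["paper_text"])

print(tfidf.shape)
print(sorted(tfidf_vectorizer.vocabulary_))
```

An alternative to dropping rows would be df["paper_text"].fillna(""), which keeps the row count unchanged at the cost of adding empty documents.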

Source

2017-05-12 20:51:08 neox

The output has dtype: object at the bottom. Above that there is a table of words and characters that also say "object". OK, I'll try that.

Yes! It works! Thanks @neox!

Glad it helped, you're welcome! – neox
