
I have computed my LDA model and retrieved the topics, but I am looking for a way to calculate the weight/proportion of each topic on the corpus. Surprisingly, I have not been able to find a way of doing this. What I have seen suggested on other forums is to do the following:

from itertools import chain 
print(type(doc_set)) 
print(len(doc_set)) 

for top in ldamodel.print_topics(): 
    print(top) 
print 

# Assigning the topics to the documents in the corpus 
lda_corpus = ldamodel[corpus] 
#print(lda_corpus) 

# Find the threshold, let's set the threshold to be 1/#clusters, 
# To prove that the threshold is sane, we average the sum of all probabilities: 
scores = list(chain(*[[score for topic_id, score in doc] 
                      for doc in lda_corpus])) 
print(sum(scores)) 
print(len(scores)) 
threshold = sum(scores)/len(scores) 
print(threshold) 

cluster1 = [j for i,j in zip(lda_corpus,doc_set) if i[0][1] > threshold] 
cluster2 = [j for i,j in zip(lda_corpus,doc_set) if i[1][1] > threshold] 
cluster3 = [j for i,j in zip(lda_corpus,doc_set) if i[2][1] > threshold] 

So far my code looks like this:

## Libraries to download 
from nltk.tokenize import RegexpTokenizer 
from nltk.corpus import stopwords 
from nltk.stem.porter import PorterStemmer 
from gensim import corpora, models 
import gensim 

## Tokenizing 
tokenizer = RegexpTokenizer(r'\w+') 

# create English stop words list 
en_stop = stopwords.words('english') 

# Create p_stemmer of class PorterStemmer 
p_stemmer = PorterStemmer() 

import json 
import nltk 
import re 
import pandas 

appended_data = [] 

#for i in range(20014,2016): 
# df0 = pandas.DataFrame([json.loads(l) for l in open('SDM_%d.json' % i)]) 
# appended_data.append(df0) 

for i in range(2005,2016): 
    if i > 2013: 
        df0 = pandas.DataFrame([json.loads(l) for l in open('SDM_%d.json' % i)]) 
        appended_data.append(df0) 
    df1 = pandas.DataFrame([json.loads(l) for l in open('Scot_%d.json' % i)]) 
    df2 = pandas.DataFrame([json.loads(l) for l in open('APJ_%d.json' % i)]) 
    df3 = pandas.DataFrame([json.loads(l) for l in open('TH500_%d.json' % i)]) 
    df4 = pandas.DataFrame([json.loads(l) for l in open('DRSM_%d.json' % i)]) 
    appended_data.append(df1) 
    appended_data.append(df2) 
    appended_data.append(df3) 
    appended_data.append(df4) 


appended_data = pandas.concat(appended_data) 
# doc_set = df1.body 

doc_set = appended_data.body 

# list for tokenized documents in loop 
texts = [] 

# loop through document list 
for i in doc_set: 

    # clean and tokenize document string 
    raw = i.lower() 
    tokens = tokenizer.tokenize(raw) 

    # remove stop words from tokens 
    stopped_tokens = [tok for tok in tokens if tok not in en_stop] 

    # add tokens to list 
    texts.append(stopped_tokens) 

# turn our tokenized documents into an id <-> term dictionary 
dictionary = corpora.Dictionary(texts) 

# convert tokenized documents into a document-term matrix 
corpus = [dictionary.doc2bow(text) for text in texts] 

# generate LDA model 
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=15, id2word = dictionary, passes=50) 
ldamodel.save("model.lda0") 

With this code I get an error on cluster2: IndexError: list index out of range. Any thoughts?

Answer


You need to set the minimum probability to zero in the LDA call. By default gensim drops topics whose probability for a document falls below a small threshold, so a document can come back with fewer than three topics, which is why indexing i[1][1] and i[2][1] raises the IndexError:

ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=15, id2word = dictionary, passes=50, minimum_probability=0) 

また、あなただけですべての記事のためのトピック分布を取得することができます:

for i in range(len(doc_set)): 
    print(ldamodel[corpus[i]]) 
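
If what you ultimately want is the overall weight of each topic on the whole corpus, one way to get it (a minimal sketch, assuming the model was trained with minimum_probability=0 as above, so that every document returns a probability for all 15 topics) is to average each topic's probability over all documents:

import numpy as np 

# accumulate each topic's probability over every document, 
# then divide by the number of documents so the weights sum to ~1 
topic_weights = np.zeros(ldamodel.num_topics) 
for bow in corpus: 
    for topic_id, prob in ldamodel[bow]: 
        topic_weights[topic_id] += prob 
topic_weights /= len(corpus) 

for topic_id, weight in enumerate(topic_weights): 
    print(topic_id, weight) 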