If I understand correctly, you could try something like the following: given two categories, doctors and definitions, related_terms would be related_terms = [['doctor', 'some synonym for doctor', 'another synonym', ...], ['definition', 'some_synonym', 'maybe the term arthritis here or any other term that would help you navigate to the correct information?', ...]]
from nltk import pos_tag
from nltk.corpus import stopwords, wordnet as wn
from nltk.tokenize import word_tokenize

stopset = set(stopwords.words('english'))
sentence = raw_input("Enter query:\n")  # raw_input, since input() would eval in Python 2
sent_words = word_tokenize(sentence)
tokens = [w for w in sent_words if w not in stopset]
tagged_sent = pos_tag(sentence.split())
nouns = [word for word, pos in tagged_sent if pos in ('NN', 'NNP', 'NNS', 'NNPS')]
lower_nouns = [nn.lower() for nn in nouns]

synonyms = []
categories = ["doctors", "definitions"]
categories_folder_paths = ["path to doctors", "path to definitions"]
related_terms = [['synonyms for doctors'], ['synonyms for definitions']]

for wt in lower_nouns:
    for syn in wn.synsets(wt):
        for l in syn.lemmas():
            synonyms.append(l.name())

# find out which category has the most synonyms in common with the terms the user asked about
doctors_intersect = list(set(synonyms).intersection(set(related_terms[categories.index("doctors")])))
definitions_intersect = list(set(synonyms).intersection(set(related_terms[categories.index("definitions")])))

# the length of the results should indicate which category the query matches best
if len(doctors_intersect) > len(definitions_intersect):
    # check the doctors folder with similar logic as above in order to use e.g. the cancer file.
    pass
elif len(doctors_intersect) < len(definitions_intersect):
    # check the definitions folder with similar logic as above in order to use e.g. the arthritis file.
    pass
else:
    print "I didn't get that, did you mean.... ?"
    # and try again?
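To see the category-matching step in isolation: pick the category whose term list overlaps most with the query's expanded synonyms, and fall back to a re-prompt on a tie. This sketch uses Python 3 and made-up synonym lists (the actual lists would come from your data):

```python
# Minimal sketch of the intersection-based routing, with hypothetical term lists.
categories = ["doctors", "definitions"]
related_terms = [
    ["doctor", "physician", "therapist"],    # terms for "doctors"
    ["definition", "meaning", "arthritis"],  # terms for "definitions"
]

def best_category(query_synonyms):
    """Return the category whose term list overlaps most with the query's
    expanded synonyms, or None on a tie (so the caller can re-prompt)."""
    overlaps = [len(set(query_synonyms) & set(terms)) for terms in related_terms]
    if overlaps.count(max(overlaps)) > 1:
        return None  # ambiguous: "I didn't get that, did you mean ...?"
    return categories[overlaps.index(max(overlaps))]

print(best_category(["physician", "doctor", "cancer"]))  # doctors
print(best_category(["meaning", "arthritis"]))           # definitions
print(best_category(["unrelated"]))                      # None (tie at 0)
```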
Hope this helps, but it depends on how you have structured your data and categories.
P.S.: For this task I would also try classifying, e.g., "doctors", "definitions" and the contents of their respective data as train/test sets with something like a simple bag-of-words model. But without knowing what your data is, and the details in general, I can't say more than that.
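A bag-of-words classifier along those lines can be as simple as counting how often the query's words occur in each category's training text. The toy training sentences below are invented for illustration; a real version would use, e.g., scikit-learn's CountVectorizer plus a Naive Bayes classifier:

```python
from collections import Counter

# Hypothetical mini training corpus: one bag of words per category.
train_docs = {
    "doctors": "doctor physician surgeon contact appointment clinic",
    "definitions": "definition meaning disease symptom what is arthritis",
}
bags = {cat: Counter(text.split()) for cat, text in train_docs.items()}

def classify(query):
    """Score each category by the summed counts of the query's words in its bag."""
    words = query.lower().split()
    scores = {cat: sum(bag[w] for w in words) for cat, bag in bags.items()}
    return max(scores, key=scores.get)

print(classify("what is the meaning of arthritis"))  # definitions
print(classify("contact a physician"))               # doctors
```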
Good luck!
EDIT:
Here is a very basic working example of one way you could do this. I used a custom POS-tagging model that works well on short text, together with a lemmatizer.
def _prepare_gate_pos_tags_model(self):
    # stored the processed model as a dict in a file, to minimize load time
    with open('path to/model_dict.txt', 'r') as f:
        train_model = ast.literal_eval(f.read())
Example output (I hope this helps a bit):
C:\Python27_32b\python.exe C:/Users/m.karanasou/Documents/figurative-text-analysis/figurative-text-analysis/FigurativeTextAnalysis/helpers/test_snippets.py
Enter query:
Show me about fever
Details about fever....
C:\Python27_32b\python.exe C:/Users/m.karanasou/Documents/figurative-text-analysis/figurative-text-analysis/FigurativeTextAnalysis/helpers/test_snippets.py
Enter query:
Tell me about doctors for cancer.
Doctor A name, surname, contact details
Doctor B name, surname, contact details
Process finished with exit code 0
The custom model is loaded as shown below; the model dict itself is here: https://bitbucket.org/mkaranasou/figurative-text-analysis/raw/c804fd3163d8682da2f1ab69095600de296eae56/figurative-text-analysis/TweetUtils/data/model_dict.txt
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet as wn


def q_and_a():
    stopset = set(stopwords.words('english'))
    sentence = raw_input("Enter query:\n")
    sent_words = word_tokenize(sentence)
    tokens = [w for w in sent_words if w not in stopset]
    # tagged_sent = pos_tag(sentence.split())
    tagged_sent = get_custom_pos_tags(sentence.split())
    nouns = [word for word, pos in tagged_sent if pos in ('NN', 'NNP', 'NNS', 'NNPS')]
    lower_nouns = [nn.lower() for nn in nouns]
    synonyms = []
    lemmatizer = nltk.WordNetLemmatizer()  # lemmatize/stem for better results

    # terms should be the terms of interest along with their synonyms
    terms = [
        ['cancer', 'carcenogenic'],
        ['fever', 'feverish', 'temperature'],
        ['cancer', 'carcenogenic', 'doctor', 'therapist', 'professional']
    ]
    path_to_terms = ["../data/cancer.txt", "../data/fever.txt", "../data/cancer_doctors.txt"]

    # a good idea would be to stem the synonyms too
    for wt in lower_nouns:
        for syn in wn.synsets(wt):
            for l in syn.lemmas():
                synonyms.append(lemmatizer.lemmatize(l.name()))
        synonyms.append(lemmatizer.lemmatize(wt))  # include the original term

    results = []
    for each in terms:
        results.append(get_similarity(synonyms, each))

    max_sim = max(results)  # note: there could be two results with the same percentage
    if max_sim >= 20:  # some threshold
        with open(path_to_terms[results.index(max_sim)]) as f:
            print " ".join(f.readlines())
    else:
        print "I couldn't find something regarding: '%s'" % sentence


def get_similarity(input_terms, category_terms):
    input_set = set(input_terms)
    category_set = set(category_terms)
    common = input_set.intersection(category_set)
    # how many of the category's terms did we cover
    return (len(common) / float(len(category_set))) * 100


def get_custom_pos_tags(word_list):
    """
    Get POS tagging results using a custom tagger with the model provided by
    the GATE Twitter tagger.
    Reference: https://gate.ac.uk/wiki/twitter-postagger.html
    L. Derczynski, A. Ritter, S. Clarke, and K. Bontcheva, 2013: "Twitter
    Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data". In:
    Proceedings of the International Conference on Recent Advances in Natural
    Language Processing.
    """
    # use a custom model that works well on short text
    default_tagger = nltk.data.load(nltk.tag._POS_TAGGER)
    train_model = g.train_model  # custom gate model, loaded from model_dict.txt as above
    tagger = nltk.tag.UnigramTagger(model=train_model, backoff=default_tagger)
    return tagger.tag(word_list)
That's one way to do it. Obviously it won't work well in every case, but you get the idea. There are many approaches and data structures that can help you do this.
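The similarity measure above is just the percentage of a category's terms that are covered by the query's expanded synonym list. Isolated (and written for Python 3, unlike the Python 2 snippets above) it behaves like this:

```python
def get_similarity(input_terms, category_terms):
    """Percentage of the category's terms that also appear among the input terms."""
    common = set(input_terms) & set(category_terms)
    return len(common) / len(category_terms) * 100

# A query expanded to {'fever', 'temperature', 'doctor'} covers 2 of the 3
# terms in the hypothetical 'fever' category, i.e. ~66.7%:
print(get_similarity({'fever', 'temperature', 'doctor'},
                     ['fever', 'feverish', 'temperature']))
```

Note that the score is relative to the category size, so small categories are easier to "cover"; that is why a threshold such as the 20% used above is worth tuning against your data.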
Yes, you are right, there are two categories, definitions and doctors. But these categories and their respective folders are accessed based on the user's input question. Suppose the user types: What is cancer? Then I need a definition as output and have to look in the definitions folder, under cancer.txt (definitions/cancer.txt). Alternatively, one could ask who the best doctor for fever is; then I'd look in the doctors folder, in a txt with the names of doctors for fever. I think this makes sense now. –
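That folder layout (a category folder, then one txt per term) maps naturally to a simple path lookup. A sketch with an invented base folder, assuming files like definitions/cancer.txt and doctors/fever.txt exist:

```python
import os

def path_for(category, term, base="data"):
    """Build e.g. data/definitions/cancer.txt or data/doctors/fever.txt.
    The base folder name and the layout are assumptions about the setup
    described above, not part of the original code."""
    return os.path.join(base, category, term + ".txt")

print(path_for("definitions", "cancer"))  # e.g. data/definitions/cancer.txt
print(path_for("doctors", "fever"))       # e.g. data/doctors/fever.txt
```

With this, the classifier only has to produce a (category, term) pair; opening and printing the file is then the same for every category.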
Any further help on this would be highly appreciated. –