Calculating a similarity "score" between multiple dictionaries

I have a reference dictionary, "dictA", and I need to compare it (that is, calculate the similarity of keys and values) against n dictionaries that are generated on the spot. All dictionaries have the same length. For the sake of discussion, say there are three dictionaries to compare against: dictB, dictC and dictD.

Here is what dictA looks like:

dictA={'1':"U", '2':"D", '3':"D", '4':"U", '5':"U",'6':"U"}

Here is what dictB, dictC and dictD look like:

dictB={'1':"U", '2':"U", '3':"D", '4':"D", '5':"U",'6':"D"}
dictC={'1':"U", '2':"U", '3':"U", '4':"D", '5':"U",'6':"D"}
dictD={'1':"D", '2':"U", '3':"U", '4':"U", '5':"D",'6':"D"}

I have a solution, but only for the case of two dictionaries:

sharedValue = set(dictA.items()) & set(dictD.items()) 
dictLength = len(dictA) 
scoreOfSimilarity = len(sharedValue) 
similarity = scoreOfSimilarity/dictLength

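My question is: how can I iterate over n dictionaries, with dictA being the primary dictionary that I compare against the others? The goal is to get a "similarity" value for each dictionary I iterate over, measured against the primary dictionary.

Thank you.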

2016-10-11 lechiffre

1) Do those n dictionaries live in a list somewhere? 2) How should the similarity score be calculated across multiple iterations (e.g. as an average)? – SuperSaiyan

Why not just loop over a list of the dictionaries B through D? Are you trying to meet specific performance or data-structure constraints in solving this? –

In Python 3, 'dict.items()' already works with '&' and the other set operators. It is not a list but a set-like object that is a view of the dictionary's items. –
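The point about item views can be sketched with the question's own dictA and dictD. This is only an illustration, assuming Python 3 and hashable values (which holds for the strings used here); the variable names shared and similarity are chosen for the example.

```python
# Data copied from the question
dictA = {'1': "U", '2': "D", '3': "D", '4': "U", '5': "U", '6': "U"}
dictD = {'1': "D", '2': "U", '3': "U", '4': "U", '5': "D", '6': "D"}

# items() views support & directly, so no explicit set() conversion is needed
shared = dictA.items() & dictD.items()
similarity = len(shared) / len(dictA)

print(shared)      # {('4', 'U')}
print(similarity)
```

The result of & is an ordinary set of (key, value) tuples, so len() works on it exactly as in the question's code.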

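Here is a general structure, assuming you can generate the dictionaries one at a time and use each one before generating the next. This sounds like what you might want. calculate_similarity is a function containing your "I have a solution" code above.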

reference = {'1':"U", '2':"D", '3':"D", '4':"U", '5':"U",'6':"U"}
while True:
    on_the_spot = generate_dictionary()
    if on_the_spot is None:
        break
    calculate_similarity(reference, on_the_spot)
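To make the structure above concrete, here is a minimal runnable sketch. Both generate_dictionary and calculate_similarity are stand-ins written for this example: the first simply hands out the question's three dictionaries and then returns None, and the second wraps the question's two-dictionary code. The real generator of on-the-spot dictionaries is not shown in the thread.

```python
reference = {'1': "U", '2': "D", '3': "D", '4': "U", '5': "U", '6': "U"}

# Hypothetical stand-in for whatever produces dictionaries on the spot:
# it hands out prepared dictionaries one by one, then None when exhausted.
_pending = [
    {'1': "U", '2': "U", '3': "D", '4': "D", '5': "U", '6': "D"},
    {'1': "U", '2': "U", '3': "U", '4': "D", '5': "U", '6': "D"},
    {'1': "D", '2': "U", '3': "U", '4': "U", '5': "D", '6': "D"},
]

def generate_dictionary():
    return _pending.pop(0) if _pending else None

def calculate_similarity(a, b):
    # Fraction of a's (key, value) pairs that also appear in b
    return len(set(a.items()) & set(b.items())) / len(a)

scores = []
while True:
    on_the_spot = generate_dictionary()
    if on_the_spot is None:
        break
    scores.append(calculate_similarity(reference, on_the_spot))

print(scores)
```

Collecting the scores in a list keeps each result available after the loop ends.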

If you need to iterate over dictionaries that have already been generated, you have to put them in an iterable Python structure. As you generate them, build a list of dictionaries:

victim_list = [
    {'1':"U", '2':"U", '3':"D", '4':"D", '5':"U",'6':"D"},
    {'1':"U", '2':"U", '3':"U", '4':"D", '5':"U",'6':"D"},
    {'1':"D", '2':"U", '3':"U", '4':"U", '5':"D",'6':"D"}
]
for on_the_spot in victim_list:
    # Proceed as above
    calculate_similarity(reference, on_the_spot)

Are you familiar with the Python construct called a generator? It is like a function that hands back its values with yield rather than return. If so, use that instead of the list above.
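A generator version might look like the sketch below. The function generate_dictionaries and its row format (strings such as "UUDDUD", one character per key) are assumptions made for this example, since the thread does not say how the dictionaries are actually produced.

```python
reference = {'1': "U", '2': "D", '3': "D", '4': "U", '5': "U", '6': "U"}

def generate_dictionaries(rows):
    # Hypothetical source: each row is a string such as "UUDDUD";
    # key '1' maps to the first character, '2' to the second, and so on.
    for row in rows:
        yield {str(i): value for i, value in enumerate(row, start=1)}

def calculate_similarity(a, b):
    # Fraction of a's (key, value) pairs that also appear in b
    return len(set(a.items()) & set(b.items())) / len(a)

# These rows reproduce dictB, dictC and dictD from the question
rows = ["UUDDUD", "UUUDUD", "DUUUDD"]

# Each dictionary is built only when the loop asks for it
scores = [calculate_similarity(reference, d) for d in generate_dictionaries(rows)]
print(scores)
```

Because the generator yields one dictionary at a time, the full set never has to be held in memory at once.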

2016-10-11 22:29:04 Prune

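If you stick your solution in a function, you can call it by name for any two dicts. Also, if you curry the function by splitting the arguments across nested functions, you can partially apply the first dict and get back a function that only wants the second (or you could use functools.partial):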

def similarity(a):
    def _(b):
        sharedValue = set(a.items()) & set(b.items())
        dictLength = len(a)
        scoreOfSimilarity = len(sharedValue)
        return scoreOfSimilarity/dictLength
    return _

The above can also be written as a single expression via nested lambdas:

similarity = lambda a: lambda b: len(set(a.items()) & set(b.items()))/len(a)

Now you can get the similarity between dictA and the rest with a map:

otherDicts = [dictB, dictC, dictD]
scores = map(similarity(dictA), otherDicts)

Now you can use min() (or max(), or whatever) to pull the best out of the scores:

winner = min(scores)

Warning: I have not tested any of the above.
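The functools.partial alternative mentioned in this answer could look roughly like this. It is a sketch, not the answerer's code: similarity here is a plain two-argument function, and compare_to_A is a name made up for the example.

```python
from functools import partial

def similarity(a, b):
    # Fraction of a's (key, value) pairs that also appear in b
    shared = set(a.items()) & set(b.items())
    return len(shared) / len(a)

dictA = {'1': "U", '2': "D", '3': "D", '4': "U", '5': "U", '6': "U"}
dictB = {'1': "U", '2': "U", '3': "D", '4': "D", '5': "U", '6': "D"}
dictC = {'1': "U", '2': "U", '3': "U", '4': "D", '5': "U", '6': "D"}
dictD = {'1': "D", '2': "U", '3': "U", '4': "U", '5': "D", '6': "D"}
otherDicts = [dictB, dictC, dictD]

# Fix the first argument; the result is a one-argument function
compare_to_A = partial(similarity, dictA)

# list() materialises the map so the scores can be reused
scores = list(map(compare_to_A, otherDicts))
print(scores)
print(max(scores))
```

Note that max(scores) picks the most similar dictionary and min(scores) the least similar. In Python 3, map returns a one-shot iterator, which is why the sketch wraps it in list() before using the scores twice.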

2016-10-11 22:36:26

Don't use "_" as a function name, even for an inner function. http://stackoverflow.com/questions/5893163/what-is-the-purpose-of-the-single-underscore-variable-in-python – lejlot

Thanks to everyone for the answers. Here is the result that does what I need:

def compareTwoDictionaries(self, absolute, reference, listOfDictionaries):
    # note: despite its name, listOfDictionaries is a single dictionary
    # look only for absolute fit, yes or no
    if absolute:
        similarity = reference == listOfDictionaries
    else:
        # items that are the same between the two dictionaries
        shared_items = set(reference.items()) & set(listOfDictionaries.items())
        # length of the reference dictionary, for calculating the fraction
        dictLength = len(reference)
        # number of shared items, for calculating the fraction
        scoreOfSimilarity = len(shared_items)
        # final score: similarity
        similarity = scoreOfSimilarity/dictLength
    return similarity

Here is the function call, as described above:

for candidate in victim_list:
    output = oandaConnectorCalls.compareTwoDictionaries(False, reference, candidate)

It uses the "reference" dictionary and the "victim_list" dictionaries from above.
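Since the loop above overwrites output on every pass, one way to keep all results is to collect them in a list. The sketch below uses a standalone function, compare_two_dictionaries, which is a hypothetical variant of the method without self; it is not the poster's class.

```python
def compare_two_dictionaries(absolute, reference, candidate):
    # Hypothetical standalone variant of compareTwoDictionaries (no self)
    if absolute:
        # exact match only: True or False
        return reference == candidate
    shared_items = set(reference.items()) & set(candidate.items())
    return len(shared_items) / len(reference)

reference = {'1': "U", '2': "D", '3': "D", '4': "U", '5': "U", '6': "U"}
victim_list = [
    {'1': "U", '2': "U", '3': "D", '4': "D", '5': "U", '6': "D"},
    {'1': "U", '2': "U", '3': "U", '4': "D", '5': "U", '6': "D"},
    {'1': "D", '2': "U", '3': "U", '4': "U", '5': "D", '6': "D"},
]

# One result per dictionary, in the same order as victim_list
scores = [compare_two_dictionaries(False, reference, d) for d in victim_list]
exact = [compare_two_dictionaries(True, reference, d) for d in victim_list]
print(scores)
print(exact)
```

With absolute set to True the function returns a boolean rather than a fraction, so the two modes produce lists of different types.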

2016-10-12 13:55:55 lechiffre

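Given your problem setup, there appears to be no alternative to looping over the input list of dictionaries. However, there is a multiprocessing trick that can be applied here.

Here is your input: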

dict_a = {'1': "U", '2': "D", '3': "D", '4': "U", '5': "U", '6': "U"} 
dict_b = {'1': "U", '2': "U", '3': "D", '4': "D", '5': "U", '6': "D"} 
dict_c = {'1': "U", '2': "U", '3': "U", '4': "D", '5': "U", '6': "D"} 
dict_d = {'1': "D", '2': "U", '3': "U", '4': "U", '5': "D", '6': "D"} 
other_dicts = [dict_b, dict_c, dict_d]

I have included @gary_fixler's map technique as similarity1, along with a similarity2 function that I use for the loop technique.

def similarity1(a):
    def _(b):
        shared_value = set(a.items()) & set(b.items())
        dict_length = len(a)
        score_of_similarity = len(shared_value)
        return score_of_similarity/dict_length
    return _

def similarity2(c):
    a, b = c
    shared_value = set(a.items()) & set(b.items())
    dict_length = len(a)
    score_of_similarity = len(shared_value)
    return score_of_similarity/dict_length

We are evaluating three techniques here:
(1) @gary_fixler's map
(2) a simple loop through the list of dicts
(3) multiprocessing the list of dicts

Here are the execution statements:

import itertools
import multiprocessing

print(list(map(similarity1(dict_a), other_dicts)))
print([similarity2((dict_a, dict_v)) for dict_v in other_dicts])

# use at least one worker; cpu_count()/2-1 is 0 on a two-core machine
max_processes = max(1, int(multiprocessing.cpu_count()/2-1))
pool = multiprocessing.Pool(processes=max_processes)
print([x for x in pool.map(similarity2, zip(itertools.repeat(dict_a), other_dicts))])

All three techniques give the same result:

[0.5, 0.3333333333333333, 0.16666666666666666] 
[0.5, 0.3333333333333333, 0.16666666666666666] 
[0.5, 0.3333333333333333, 0.16666666666666666]

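Note that, for multiprocessing, you have multiprocessing.cpu_count()/2 cores (with each core having hyper-threading). Assuming you have nothing else running on your system and your program has no I/O or synchronization needs, you will often get the best performance with multiprocessing.cpu_count()/2-1 processes, the -1 being for the parent process.

Now, to time the three techniques: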

import timeit

print(timeit.timeit("list(map(similarity1(dict_a), other_dicts))",
        setup="from __main__ import similarity1, dict_a, other_dicts",
        number=10000))

print(timeit.timeit("[similarity2((dict_a, dict_v)) for dict_v in other_dicts]",
        setup="from __main__ import similarity2, dict_a, other_dicts",
        number=10000))

# import itertools in the setup, since the timed statement uses itertools.repeat
print(timeit.timeit("[x for x in pool.map(similarity2, zip(itertools.repeat(dict_a), other_dicts))]",
        setup="import itertools; from __main__ import similarity2, dict_a, other_dicts, pool",
        number=10000))

This produces the following results on my laptop:

0.07092539698351175 
0.06757041101809591 
1.6528456939850003

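You can see that the basic loop technique performs best. Multiprocessing was significantly worse than the other two techniques because of the overhead of creating processes and passing data back and forth. That does not mean multiprocessing is useless here. Quite the opposite. Look at the results for a larger number of input dictionaries: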

for _ in range(7): 
    other_dicts.extend(other_dicts)

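This expands the dictionary list to 384 items (the three original dictionaries, doubled seven times). Here are the timing results for this input: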

7.934810006991029 
8.184540337068029 
7.466550623998046

For any larger set of input dictionaries, the multiprocessing technique becomes the fastest of the three.

2016-10-12 16:53:15
