Counting the unique overlap between two strings

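I have a dataset with two columns. The first column contains unique user IDs, and the second column contains attributes associated with those IDs. For example:

------------------------
User ID  Attribute
------------------------
1234  blond
1235  brunette
1236  blond
1234  tall
1235  tall
1236  short
------------------------

What I want to know is how the attributes correlate. In the example above, how many times is someone who is blond also tall? My desired output looks like this: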

------------------------------ 
Attr 1  Attr 2  Overlap 
------------------------------ 
blond  tall   1 
blond  short  1 
brunette tall   1 
brunette short  0 
------------------------------

I tried using pandas to pivot the data and get this output, but my dataset has hundreds of attributes, so my current attempt is not practical.

import pandas

df = pandas.read_csv('myfile.csv')

df.pivot_table(index='User ID', columns='Attribute', aggfunc=len, fill_value=0)

My current output:

-------------------------------- 
Blond Brunette Short Tall 
-------------------------------- 
    0  1   0  1 
    1  0   0  1 
    1  0   1  0 
--------------------------------

Is there a way to get the output I want? Thanks in advance.

Source

2016-11-02 MARWEBIST

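I think your first step should be to put this into a better relational form. As it stands, there is no way to logically split these attributes into hair color and height attributes – brianpck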

Indeed! I tried to write an answer but could not make those distinctions –

You can use itertools product to find every possible pair of attributes, and then match the rows on each pair:

import pandas as pd
from itertools import product

# 1) create the pandas dataframe
df = [["1234", "blond"],
      ["1235", "brunette"],
      ["1236", "blond"],
      ["1234", "tall"],
      ["1235", "tall"],
      ["1236", "short"]]

df = pd.DataFrame(df)
df.columns = ["id", "attribute"]

# 2) build every possible ordered pair of attributes
attributs = set(df.attribute)
for attribut1, attribut2 in product(attributs, attributs):
    if attribut1 != attribut2:
        # 3) select the ids of the rows carrying each attribute
        df1 = df[df.attribute == attribut1]["id"]
        df2 = df[df.attribute == attribut2]["id"]
        # 4) count the ids that carry both attributes
        intersection = len(set(df1).intersection(set(df2)))
        if intersection:
            # 5) display the number of matches
            print(attribut1, attribut2, intersection)

This gives (the order of the lines may vary, because sets are unordered):

tall brunette 1 
tall blond 1 
brunette tall 1 
blond tall 1 
blond short 1 
short blond 1
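The listing above reports every pair twice (for example both `tall blond` and `blond tall`) and filters the frame again for each pair. As a rough sketch of one way around both, and not part of the original answer, the ids for each attribute can be collected once and `itertools.combinations` used in place of `product`. The names `ids_by_attribute` and `overlaps` are made up for this illustration.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame(
    [
        ["1234", "blond"],
        ["1235", "brunette"],
        ["1236", "blond"],
        ["1234", "tall"],
        ["1235", "tall"],
        ["1236", "short"],
    ],
    columns=["id", "attribute"],
)

# Collect the set of ids for each attribute once, instead of
# filtering the whole frame again for every pair.
ids_by_attribute = {
    attribute: set(ids) for attribute, ids in df.groupby("attribute")["id"]
}

# combinations() yields each unordered pair exactly once, so
# (blond, tall) is produced but (tall, blond) is not.
overlaps = {}
for attr1, attr2 in combinations(sorted(ids_by_attribute), 2):
    shared = len(ids_by_attribute[attr1] & ids_by_attribute[attr2])
    if shared:
        overlaps[(attr1, attr2)] = shared

for (attr1, attr2), shared in overlaps.items():
    print(attr1, attr2, shared)
```

With hundreds of attributes this still visits every pair, but each step is only a set intersection rather than two boolean filters over the full frame.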

EDIT

From there it is easy to narrow this down to get your desired output:

import pandas as pd
from itertools import product

# 1) create the pandas dataframe
df = [["1234", "blond"],
      ["1235", "brunette"],
      ["1236", "blond"],
      ["1234", "tall"],
      ["1235", "tall"],
      ["1236", "short"]]

df = pd.DataFrame(df)
df.columns = ["id", "attribute"]

# attributes allowed in the first column of the output
wanted_attribute_1 = ["blond", "brunette"]

# 2) build every possible ordered pair of attributes
attributs = set(df.attribute)
for attribut1, attribut2 in product(attributs, attributs):
    # keep only pairs of the form (wanted attribute, any other attribute)
    if attribut1 in wanted_attribute_1 and attribut2 not in wanted_attribute_1:
        if attribut1 != attribut2:
            # 3) select the ids of the rows carrying each attribute
            df1 = df[df.attribute == attribut1]["id"]
            df2 = df[df.attribute == attribut2]["id"]
            # 4) count the ids that carry both attributes
            intersection = len(set(df1).intersection(set(df2)))
            # 5) display the number of matches, including zero
            print(attribut1, attribut2, intersection)

This gives:

brunette tall 1 
brunette short 0 
blond tall 1 
blond short 1

Source

2016-11-02 14:41:23

Thank you. This gives me the output I am looking for. How can I export the results to a .csv file? – MARWEBIST

You first need to create an empty [results] dataframe, then append [attribut1, attribut2, intersection] to it inside the loop (for append, see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html). Pandas dataframes provide a [to_csv] method, which you can use to save the result to a file. –
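As a rough illustration of that suggestion: `DataFrame.append` was removed in pandas 2.0, so this sketch collects the rows in a plain list and builds the results frame once at the end. The column names are taken from the desired output in the question; `rows`, `results`, `other_attributes` and the file name `overlap.csv` are assumptions made for the example.

```python
from itertools import product

import pandas as pd

df = pd.DataFrame(
    [
        ["1234", "blond"],
        ["1235", "brunette"],
        ["1236", "blond"],
        ["1234", "tall"],
        ["1235", "tall"],
        ["1236", "short"],
    ],
    columns=["id", "attribute"],
)

wanted_attribute_1 = ["blond", "brunette"]
# Every attribute that is not in the "wanted" list goes in the second column.
other_attributes = sorted(set(df["attribute"]) - set(wanted_attribute_1))

# Collect one dict per pair, then build the frame in a single step.
rows = []
for attr1, attr2 in product(wanted_attribute_1, other_attributes):
    ids1 = set(df.loc[df["attribute"] == attr1, "id"])
    ids2 = set(df.loc[df["attribute"] == attr2, "id"])
    rows.append({"Attr 1": attr1, "Attr 2": attr2, "Overlap": len(ids1 & ids2)})

results = pd.DataFrame(rows, columns=["Attr 1", "Attr 2", "Overlap"])

# Without a path, to_csv() returns the CSV text as a string.
csv_text = results.to_csv(index=False)
print(csv_text)

# To write a file instead (hypothetical file name):
# results.to_csv("overlap.csv", index=False)
```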

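Starting from your pivoted table, you can compute the cross product of its transpose with itself, and then convert the upper triangular part of the result to long format.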

import pandas as pd 
import numpy as np 
mat = df.pivot_table(index='User ID', columns='Attribute', aggfunc=len, fill_value=0) 

tprod = mat.T.dot(mat)   # calculate the tcrossprod here 
result = tprod.where((np.triu(np.ones(tprod.shape, bool), 1)), np.nan).stack().rename('value') 
           # extract the upper triangular part 
result.index.names = ['Attr1', 'Attr2'] 
result.reset_index().sort_values('value', ascending = False)
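The snippet above assumes `df` already exists. A self-contained variant of the same idea on the question's sample data might look like the sketch below. It swaps in `pd.crosstab` to build the 0/1 user-by-attribute matrix and `np.triu_indices` to pick the pairs above the diagonal; both are substitutions made for this illustration rather than the answer's exact calls, and `co_counts` and the `Overlap` column name are likewise assumed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "User ID": [1234, 1235, 1236, 1234, 1235, 1236],
        "Attribute": ["blond", "brunette", "blond", "tall", "tall", "short"],
    }
)

# Indicator matrix: one row per user, one column per attribute.
# clip(upper=1) guards against duplicated (user, attribute) rows.
mat = pd.crosstab(df["User ID"], df["Attribute"]).clip(upper=1)

# Entry (i, j) counts the users that have both attribute i and attribute j.
co_counts = mat.T.dot(mat)

# Positions strictly above the diagonal: each unordered pair once,
# and no attribute paired with itself.
row_idx, col_idx = np.triu_indices(len(co_counts), k=1)

result = pd.DataFrame(
    {
        "Attr1": co_counts.index.to_numpy()[row_idx],
        "Attr2": co_counts.columns.to_numpy()[col_idx],
        "Overlap": co_counts.to_numpy()[row_idx, col_idx],
    }
).sort_values("Overlap", ascending=False, ignore_index=True)

print(result)
```

Unlike the loop-based answer, this lists every pair of attributes, including hair color with hair color, because the data carries no information about which attributes belong to the same category.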

Source

2016-11-02 14:42:44 Psidom
