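I'm trying to cluster a dataset using k-means. When I run the algorithm for only one iteration it returns random clusters, but when I try multiple iterations it returns only 0s. The matrix I'm using is a 50k x 140 binary matrix. Each row represents a user and each column represents an item. In short, my k-means clustering in Python is not giving correct results.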
import numpy as np

def clusterizator(matriz, nDeClusters, it=10):  # matrix, number of clusters, number of iterations
    nOfLines = matriz.shape[0]  # number of rows (users)
    nOfColumns = matriz.shape[1]  # number of columns (items)
    clusterCurrently = np.zeros((nOfLines, 1))  # cluster currently assigned to each user
    listOfCurrentlyAssigneds = []  # list of size nDeClusters; each element is the list of rows currently assigned to that cluster
    clusterCentroid = []  # centroid of each cluster
    clusterCentroid = np.random.randint(2, size=(nDeClusters, nOfColumns))  # start with random centroids
    for repeat in xrange(it):  # number of iterations
        listOfCurrentlyAssigneds = [[] for i in xrange(nDeClusters)]  # create an empty list for each cluster
        for i in xrange(nOfLines):  # for each user
            closestCentroid = clusterMaisProximo(matriz[i], clusterCentroid)  # find the closest centroid
            clusterCurrently[i] = closestCentroid  # assign the user to the closest centroid
            listOfCurrentlyAssigneds[closestCentroid].append(matriz[i])  # put the user on that centroid's list
        for i in xrange(nDeClusters):  # for each cluster
            if listOfCurrentlyAssigneds[i] != []:  # if the list is not empty
                clusterCentroid[i] = centeroidnp(listOfCurrentlyAssigneds[i])  # compute the new centroid
    return clusterCurrently  # return a 1-column matrix mapping each user to a cluster

def distanciaEucl(elemento1, elemento2):
    return np.linalg.norm(elemento2 - elemento1)  # Euclidean distance between two items (or one user and one centroid)

def clusterMaisProximo(elemento, listaDeClusters):  # receives one user and the list of cluster centroids, returns the index of the closest one
    closest = 0
    closestDist = distanciaEucl(elemento, listaDeClusters[0])  # start with cluster 0
    for i in xrange(len(listaDeClusters) - 1):  # for each remaining cluster
        dist = distanciaEucl(elemento, listaDeClusters[i + 1])  # distance to the current cluster's centroid
        if dist < closestDist:  # if it is closer to the element
            closest = i + 1  # update the closest cluster
            closestDist = dist  # update the closest distance
    return closest  # return the closest cluster's index

# from https://stackoverflow.com/questions/23020659/fastest-way-to-calculate-the-centroid-of-a-set-of-coordinate-tuples-in-python-wi
# by Retozi (adapted)
def centeroidnp(lista):  # receives a list of elements (number of elements x items)
    shape = list(lista[0].shape)
    shape[:0] = [len(lista)]
    arr = np.concatenate(lista).reshape(shape)  # build an array from the list
    length = arr.shape[0]
    somas = np.zeros(arr.shape[1])
    for i in xrange(arr.shape[1]):  # for each item (dimension)
        somas[i] = (np.sum(arr[:, i])) / length  # sum all elements and divide by the number of elements
    return somas  # return the array that will be the new centroid position
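My variables were originally written in Portuguese and I translated them afterwards to make the code clearer, so some of the comments may be dumb. I commented everything to try to make clear what each line is doing.
I'm running it like this: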
clust = clusterizator(train, 10, 2)
Example matrix:
train = np.array([[0, 1, 1, 0], [1, 0, 0, 0], [0, 1, 1, 1], [1, 0, 0, 1], [1, 0, 0, 0]])
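
One thing that may be worth checking, offered as a minimal sketch rather than a confirmed diagnosis: `np.random.randint` returns an integer array, and NumPy casts values assigned into an integer array back to integers. The column means of 0/1 data are fractions between 0 and 1, so they would be truncated to 0 when stored in such an array. Separately, under Python 2 the `/` in `centeroidnp` divides an integer sum by an integer length, which also floors unless `from __future__ import division` is in effect. The names `mean_row`, `int_centroids` and `float_centroids` are hypothetical and exist only for this illustration.

```python
import numpy as np

# Example rows from the question, as a NumPy array.
train = np.array([[0, 1, 1, 0], [1, 0, 0, 0], [0, 1, 1, 1],
                  [1, 0, 0, 1], [1, 0, 0, 0]])

# Column means of the binary data: fractions in [0, 1].
mean_row = train.mean(axis=0)  # [0.6, 0.4, 0.4, 0.4]

# Integer centroid array, like the one np.random.randint creates.
int_centroids = np.zeros((2, train.shape[1]), dtype=int)
int_centroids[0] = mean_row  # values are cast to int on assignment

# A float centroid array keeps the fractional part.
float_centroids = np.zeros((2, train.shape[1]), dtype=float)
float_centroids[0] = mean_row

print(int_centroids[0])
print(float_centroids[0])
```

If this is what is happening, creating the centroid array with a float dtype (for example by calling `.astype(float)` on the result of `np.random.randint`) would be one way to keep the fractional coordinates.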
There is no such thing as "kNN clustering". You are probably trying to implement k-means. (kNN classification does exist.)
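
For reference, the k-means loop mentioned in the comment above (assign each row to its nearest centroid, then move each centroid to the mean of its rows) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the question: the function name `kmeans_sketch` and its parameters are hypothetical, the centroids are stored as floats, and the initial centroids are drawn from the data rows instead of being random binary vectors.

```python
import numpy as np

def kmeans_sketch(data, n_clusters, n_iter=10, seed=0):
    """Plain Lloyd-style k-means; returns (labels, centroids)."""
    data = np.asarray(data, dtype=float)
    rng = np.random.RandomState(seed)
    # Assumption: start from n_clusters distinct rows of the data.
    start = rng.choice(data.shape[0], size=n_clusters, replace=False)
    centroids = data[start].copy()  # float array, so means are not truncated
    labels = np.zeros(data.shape[0], dtype=int)
    dist = np.empty((data.shape[0], n_clusters))
    for _ in range(n_iter):
        # Squared Euclidean distance from every row to each centroid.
        for k in range(n_clusters):
            dist[:, k] = ((data - centroids[k]) ** 2).sum(axis=1)
        labels = dist.argmin(axis=1)  # index of the nearest centroid per row
        for k in range(n_clusters):
            members = data[labels == k]
            if len(members) > 0:  # keep the old centroid if the cluster is empty
                centroids[k] = members.mean(axis=0)
    return labels, centroids

train = [[0, 1, 1, 0], [1, 0, 0, 0], [0, 1, 1, 1], [1, 0, 0, 1], [1, 0, 0, 0]]
labels, centroids = kmeans_sketch(train, n_clusters=2, n_iter=5)
print(labels)
print(centroids)
```

Computing the distances one centroid at a time keeps memory use at roughly the size of the data, which matters for a 50k x 140 matrix.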