Kは条件付きであることを意味します

1

表記：
- Kがクラスタ
の数である -
のは、最初の2つの変数が点coordinnates（X、Y）であるものとする - CI - Vは、第三
可変意味：合計
0123： - S総和（サムCI）
- 閾値T

問題定義各クラスタI
オーバーVの私が理解したところでは、制約を尊重しながらkmeansの精神を保つアルゴリズムを実行することが目的です。重心への近接によってグループポイント[関数kmeans]
タスク2 - -

タスク1は、制約問題に対する各クラスタI、CI> T * [制約]

正規関数kmeans制限するため：
通常のkmeansは、ポイントを任意の順序で取り入れてセントロイドに割り当てます。私たちの場合、これはポイントを追加しながらCiの制御不能な成長につながります。例えば、K = 2、T = 40および4点で、第3の変数をV1 = 50、V2 = 1、V3 = 50、V4 = 50に等しい。
1--テイクポイント：もその点P1、P3、P4は、1ポイントP2を重心に近いと仮定すると、2

のは、通常の関数kmeansのassignementのステップを実行して、CIを追跡してみましょうを重心に近いですP1をクラスタ1に割り当てます。C1 = 50> T
2-取得ポイントP2をクラスタ2に割り当てます。C2 = 1
3-テイクポイントP3をクラスタ1に割り当てます。C1 = 100> T => C1が大きくなりすぎる！
4-ポイントP4を取得し、クラスタ1に割り当てます。C1 = 150> T => !!!関数kmeans修正

：以前に
を、我々はあまりにも成長しているからC1を防ぎ、C2の成長を助けるしたいです。

これは、シャンパンをいくつかの眼鏡に注ぐようなものです。シャンパンが少ないガラスが見える場合は、それを記入してください。それはあなたに制約があるためです：シャンパンの限界（Sが限定されている）と、すべてのガラスに十分なシャンペーン（Ci> T）を持たせたいからです。

もちろん、これは単にアナロジーです。変更されたkmeansは、制約が達成されるまで（タスク2）、最小限のCiでクラスタに新しいポイントを追加します。ここでポイントを追加する順序は？重心に近接することによって（タスク1）。すべてのクラスタiに対してすべての制約が達成されたら、未割り当ての残りのポイントに対して通常のkmeansを実行できます。

実装：
次に、修正されたアルゴリズムのPython実装を示します。図1は、大きなVSの低い値を仮想化するための透明度を使用した第3変数の表現を示しています。図2は、色を用いた進化クラスタを示す。

accept_threshパラメータを使用して再生することができます。
accept_thresh = 0 =>通常のkmeans（すぐに制約に達する）
accept_thresh = third_var.sum（）。sum（）/（2 * K）の場合、次のような点があります。制約理由のために、与えられた重心に近いものは別の重心に影響を受ける。

CODE：

import numpy as np import matplotlib.pyplot as plt from sklearn import datasets import time nb_samples = 1000 K = 3 # for demo purpose, used to generate cloud points c_std = 1.2 # Generate test samples : points, classes = datasets.make_blobs(n_features=2, n_samples=nb_samples, \ centers=K, cluster_std=c_std) third_var_distribution = 'cubic_bycluster' # 'uniform' if third_var_distribution == 'uniform': third_var = np.random.random((nb_samples)) elif third_var_distribution == 'linear_bycluster': third_var = np.random.random((nb_samples)) third_var = third_var * classes elif third_var_distribution == 'cubic_bycluster': third_var = np.random.random((nb_samples)) third_var = third_var * classes # Threshold parameters : # Try with K=3 and : # T = K => one cluster reach cosntraint, two clusters won't converge # T = 2K => accept_thresh = third_var.sum().sum()/(2*K) def dist2centroids(points, centroids): '''return arrays of ordered points to each centroids first array is index of points second array is distance to centroid dim 0 : centroid dim 1 : distance or point index ''' dist = np.sqrt(((points - centroids[:, np.newaxis]) ** 2).sum(axis=2)) ord_dist_indices = np.argsort(dist, axis=1) ord_dist_indices = ord_dist_indices.transpose() dist = dist.transpose() return ord_dist_indices, dist def assign_points_with_constraints(inds, dists, tv, accept_thresh): assigned = [False] * nb_samples assignements = np.ones(nb_samples, dtype=int) * (-1) cumul_third_var = np.zeros(K, dtype=float) current_inds = np.zeros(K, dtype=int) max_round = nb_samples * K for round in range(0, max_round): # we'll break anyway # worst advanced cluster in terms of cumulated third_var : cluster = np.argmin(cumul_third_var) if cumul_third_var[cluster] > accept_thresh: continue # cluster had enough samples while current_inds[cluster] < nb_samples: # add points to increase cumulated third_var on this cluster i_inds = current_inds[cluster] closest_pt_index = inds[i_inds][cluster] if assigned[closest_pt_index] == True: current_inds[cluster] += 1 continue # pt already assigned to a cluster assignements[closest_pt_index] = cluster cumul_third_var[cluster] += tv[closest_pt_index] assigned[closest_pt_index] = True current_inds[cluster] += 1 new_cluster = np.argmin(cumul_third_var) if new_cluster != cluster: break return assignements, cumul_third_var def assign_points_with_kmeans(points, centroids, assignements): new_assignements = np.array(assignements, copy=True) count = -1 for asg in assignements: count += 1 if asg > -1: continue pt = points[count, :] distances = np.sqrt(((pt - centroids) ** 2).sum(axis=1)) centroid = np.argmin(distances) new_assignements[count] = centroid return new_assignements def move_centroids(points, labels): centroids = np.zeros((K, 2), dtype=float) for k in range(0, K): centroids[k] = points[assignements == k].mean(axis=0) return centroids rgba_colors = np.zeros((third_var.size, 4)) rgba_colors[:, 0] = 1.0 rgba_colors[:, 3] = 0.1 + (third_var/max(third_var))/1.12 plt.figure(1, figsize=(14, 14)) plt.title("Three blobs", fontsize='small') plt.scatter(points[:, 0], points[:, 1], marker='o', c=rgba_colors) # Initialize centroids centroids = np.random.random((K, 2)) * 10 plt.scatter(centroids[:, 0], centroids[:, 1], marker='X', color='red') # Step 1 : order points by distance to centroid : inds, dists = dist2centroids(points, centroids) # Check if clustering is theoriticaly possible : tv_sum = third_var.sum() tv_max = third_var.max() if (tv_max > 1/3 * tv_sum): print("No solution to the clustering problem !\n") print("For one point : third variable is too high.") sys.exit(0) stop_criter_eps = 0.001 epsilon = 100000 prev_cumdist = 100000 plt.figure(2, figsize=(14, 14)) ln, = plt.plot([]) plt.ion() plt.show() while epsilon > stop_criter_eps: # Modified kmeans assignment : assignements, cumul_third_var = assign_points_with_constraints(inds, dists, third_var, accept_thresh) # Kmeans on remaining points : assignements = assign_points_with_kmeans(points, centroids, assignements) centroids = move_centroids(points, assignements) inds, dists = dist2centroids(points, centroids) epsilon = np.abs(prev_cumdist - dists.sum().sum()) print("Delta on error :", epsilon) prev_cumdist = dists.sum().sum() plt.clf() plt.title("Current Assignements", fontsize='small') plt.scatter(points[:, 0], points[:, 1], marker='o', c=assignements) plt.scatter(centroids[:, 0], centroids[:, 1], marker='o', color='red', linewidths=10) plt.text(0,0,"THRESHOLD T = "+str(accept_thresh), va='top', ha='left', color="red", fontsize='x-large') for k in range(0, K): plt.text(centroids[k, 0], centroids[k, 1] + 0.7, "Ci = "+str(cumul_third_var[k])) plt.show() plt.pause(1)
改善：
- 割り当てのための第三の変数の分布を使用します。
- 確かに、合計で私は、クラスタごとに3番目の変数の合計を意味し、あなたがこのケースでRandomSearchCVを使用する方法をもう少し説明してくださいすることができ、より良い初期化（k-means ++法）

出典

2017-05-12 15:47:59 MHICV

+0

ありがとうございます！それは私をたくさん助けました。私はまた、第二の溶液をしようとしている： - 無制約で関数kmeansを適用する - 制約を満たすクラスタを脇においてください - 制約を満たしていないグループのクラスタ - 実行関数kmeans - 反復：基本的に、それはこのように書きます –

+0

あなたは大歓迎です。あなたの解は、各反復で選択するkの値に応じて機能します。ただし、アルゴリズムの3番目のステップでは、制約を満たしていないN個のクラスタがある場合、制約を満たすN個のクラスタの残りのポイントの新しいパーティションを見つけることはできません。なぜなら、グローバル量SUM_over_points（third_variable）< N×閾値。したがって、ステップ4（ 'Kmeansを実行する'）では、厳密にNより下のkmeansに対してkを選択する必要があります。 – MHICV

+0

実際、各繰り返しで、k = int（sum_over_points（third_value）/ threshold）を選択します。結局、私は当然制約を尊重しない1つのクラスターを持っていますが、それは他のクラスターを重要なものにするために私ができることです –

0

これを処理する1つの方法は、をフィルタリングしてからクラスタリングすることです。和によってクラスタあたり3番目の変数の合計を意味する場合

>>> cluster_data = df.loc[df['third_variable'] > some_value] 

>>> from sklearn.cluster import KMeans 
>>> y_pred = KMeans(n_clusters=2).fit_predict(cluster_data)

あなたがやるかの条件を満たしていないKMeansのハイパーを見つけるためにRandomSearchCVを使用することができます。

出典

2017-05-10 14:34:08 spies006

+0

？ - アルゴリズム
の発散を管理しますか –

+0

@YahyaMortassim私がそれをする前に、私にあなたに質問をさせてください。あなたはKMeansアルゴリズムを変更しようとしていますか、またはハイパーパラメータの範囲を検索し、条件を満たすものを見つけることでOKですか？ – spies006

+0

私はKMeansのアルゴリズムを変更しようとしています –

0

K-手段自体が最適化の問題です。

追加の制約は、やや一般的な最適化制約です。

だから、私はむしろ最適化ソルバーでこれにアプローチしたいと思います。

出典

2017-05-10 19:41:41

Kは条件付きであることを意味します

答えて

関連する問題