-
Notifications
You must be signed in to change notification settings - Fork 35
/
Copy path_curriculum_clustering.py
335 lines (267 loc) · 16.3 KB
/
_curriculum_clustering.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
# -*- coding: utf-8 -*-
"""
Curriculum clustering algorithm
When teaching a particular concept, it can be useful to break down the learning into lessons of increasing complexity.
Consider a curriculum, of various subsets of difficulty, for machine learning. Instead of presenting samples
in random order, imagine starting with the easiest examples first, and the most complex ones later.
In this way, the machine can develop a solid foundation in a particular concept, before being exposed to more
sophistication. This can help enable a machine to learn from noisy (complex) samples, by learning easy (simple) ones
first. In terms of technique, this forms the basic principle of a weakly supervised learning approach called
Curriculum Learning (CurriculumNet).
The input is a set of feature vectors and corresponding concept (category) labels. Normally clustering algorithms
do not need labels, however this algorithm does require (possibly noisy) labels because they represent the concepts
that should be learned over a curriculum of increasing complexity. The number of subsets of the curriculum is set by
a subsets parameter. The algorithm will cluster *each* concept category into N subsets using distribution density
in an unsupervised manner.
Consider the application to computer vision, where each vector represents an image. A curriculum subset with a high
density value means all images are close to each other in feature space, suggesting that these images have a strong
similarity. We define this subset as a clean one, by assuming most of the labels are correct. The subset with a small
density value means the images have a large diversity in visual appearance, which may include more irrelevant images
with incorrect labels. This subset is considered as noisy data. Therefore, we generate a number of subsets in each
category, arranged from clean, noisy, to highly noisy ones, which are ordered with increasing complexity.
If you'd like to learn more please refer to, and / or if you find this work useful for your research, please cite:
CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images
S. Guo, W. Huang, H. Zhang, C. Zhuang, D. Dong, M. R. Scott, D. Huang
European Conference on Computer Vision (ECCV), 2018 (arXiv:1808.01097)
"""
# Copyright (c) Malong LLC.
# All rights reserved.
#
# Contact: [email protected]
#
# This source code is licensed under the LICENSE file in the root directory of this source tree.
import time
import numpy as np
from sklearn.base import BaseEstimator, ClusterMixin
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.utils import check_array, check_consistent_length, gen_batches
def cluster_curriculum_subsets(X, y, n_subsets=3, method='default', density_t=0.6, verbose=False,
dim_reduce=256, batch_max=500000, random_state=None, calc_auxiliary=False):
"""Perform Curriculum Clustering from vector array or distance matrix.
The algorithm will cluster each category into n subsets by analyzing distribution density.
Parameters
----------
X : array or sparse (CSR) matrix of shape (n_samples, n_features), or \
array of shape (n_samples, n_samples)
A feature array (the embedding space of the samples.)
y : array-like, size=[n_samples]
category labels (the curriculum will be learned for each of these categories into N subsets)
verbose : bool, optional, default False
Whether to print progress messages to stdout.
density_t : float, optional
The density threshold for neighbors to be clustered together into a subset
n_subsets : int, optional (default = 3)
The number of subsets to cluster each category into. For example, if set to 3, then the categories
outputted will be assigned a label 0, 1, or 2. Where 0 contains the simplest (most similar) samples,
1 contains middle level (somewhat similar samples), and 2 contains most complex (most diverse) samples.
random_state : int, RandomState instance or None (default), optional
The generator used to make random selections within the algorithm. Use an int to make the
randomness deterministic.
method : {'default', 'gaussian'}, optional
The algorithm to be used to calculate local density values. The default algorithm
follows the approach outlined in the scientific paper referenced below.
dim_reduce : int, optional (default = 256)
The dimensionality to reduce the feature vector to, prior to calculating distance.
Lower dimension is more efficient, but degrades performance, and visa-versa.
batch_max : int, optional (default = 500000)
The maximum batch of feature vectors to process at one time (loaded into memory).
calc_auxiliary : bool, optional (default = False)
Provide auxiliary including delta centers and density centers.
This can be useful information for visualization, and debugging, amongst other use-cases.
The downside is the processing time significantly increases if turned on.
Returns
-------
all_clustered_labels : array [n_samples]
Clustered labels for each point. Labels are integers ordered from most simple to most complex.
E.g. if curriculum subsets=3, then label=0 is simplest, labels=1 is harder, and label=n is hardest.
auxiliary_info : list
If calc_auxiliary is set to True, this list contains collected auxiliary information
during the clustering process, including delta centers, which can be useful for visualization.
References
----------
S. Guo, W. Huang, H. Zhang, C. Zhuang, D. Dong, M. R. Scott, D. Huang,
"CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images".
In: Proceedings of the European Conference on Computer Vision (ECCV),
Munich, Germany, 2018. (arXiv:1808.01097)
"""
if not density_t > 0.0:
raise ValueError("density_thresh must be positive.")
X = check_array(X, accept_sparse='csr')
check_consistent_length(X, y)
unique_categories = set(y)
t0 = None
pca = None
auxiliary_info = []
if X.shape[1] > dim_reduce:
pca = PCA(n_components=dim_reduce, copy=False, random_state=random_state)
# Initialize all labels as negative one which represents un-clustered 'noise'.
# Post-condition: after clustering, there should be no negatives in the label output.
all_clustered_labels = np.full(len(y), -1, dtype=np.intp)
for cluster_idx, current_category in enumerate(unique_categories):
if verbose:
t0 = time.time()
# Collect the "learning material" for this particular category
dist_list = [i for i, label in enumerate(y) if label == current_category]
for batch_range in gen_batches(len(dist_list), batch_size=batch_max):
batch_dist_list = dist_list[batch_range]
# Load data subset
subset_vectors = np.zeros((len(batch_dist_list), X.shape[1]), dtype=np.float32)
for subset_idx, global_idx in enumerate(batch_dist_list):
subset_vectors[subset_idx, :] = X[global_idx, :]
# Calc distances
if pca:
subset_vectors = pca.fit_transform(subset_vectors)
m = np.dot(subset_vectors, np.transpose(subset_vectors))
t = np.square(subset_vectors).sum(axis=1)
distance = np.sqrt(np.abs(-2 * m + t + np.transpose(np.array([t]))))
# Calc densities
if method == 'gaussian':
densities = np.zeros((len(subset_vectors)), dtype=np.float32)
distance = distance / np.max(distance)
for i in xrange(len(subset_vectors)):
densities[i] = np.sum(1 / np.sqrt(2 * np.pi) * np.exp((-1) * np.power(distance[i], 2) / 2.0))
else:
densities = np.zeros((len(subset_vectors)), dtype=np.float32)
flat_distance = distance.reshape(distance.shape[0] * distance.shape[1])
dist_cutoff = np.sort(flat_distance)[int(distance.shape[0] * distance.shape[1] * density_t)]
for i in xrange(len(batch_dist_list)):
densities[i] = len(np.where(distance[i] < dist_cutoff)[0]) - 1 # remove itself
if len(densities) < n_subsets:
raise ValueError("Cannot cluster into {} subsets due to lack of density diversification,"
" please try a smaller n_subset number.".format(n_subsets))
# Optionally, calc auxiliary info
if calc_auxiliary:
# Calculate deltas
deltas = np.zeros((len(subset_vectors)), dtype=np.float32)
densities_sort_idx = np.argsort(densities)
for i in xrange(len(densities_sort_idx) - 1):
larger = densities_sort_idx[i + 1:]
larger = larger[np.where(larger != densities_sort_idx[i])] # remove itself
deltas[i] = np.min(distance[densities_sort_idx[i], larger])
# Find the centers and package
center_id = np.argmax(densities)
center_delta = np.max(distance[np.argmax(densities)])
center_density = densities[center_id]
auxiliary_info.append((center_id, center_delta, center_density))
model = KMeans(n_clusters=n_subsets, random_state=random_state)
model.fit(densities.reshape(len(densities), 1))
clusters = [densities[np.where(model.labels_ == i)] for i in xrange(n_subsets)]
n_clusters_made = len(set([k for j in clusters for k in j]))
if n_clusters_made < n_subsets:
raise ValueError("Cannot cluster into {} subsets, please try a smaller n_subset number, such as {}.".
format(n_subsets, n_clusters_made))
cluster_mins = [np.min(c) for c in clusters]
bound = np.sort(np.array(cluster_mins))
# Distribute into curriculum subsets, and package into global adjusted returnable array, optionally aux too
other_bounds = xrange(n_subsets - 1)
for i in xrange(len(densities)):
# Check if the most 'clean'
if densities[i] >= bound[n_subsets - 1]:
all_clustered_labels[batch_dist_list[i]] = 0
# Else, check the others
else:
for j in other_bounds:
if bound[j] <= densities[i] < bound[j + 1]:
all_clustered_labels[batch_dist_list[i]] = len(bound) - j - 1
if verbose:
print "Clustering {} of {} categories into {} curriculum subsets ({:.2f} secs).".format(
cluster_idx + 1, len(unique_categories), n_subsets, time.time() - t0)
if (all_clustered_labels > 0).all():
raise ValueError("A clustering error occurred: incomplete labels detected.")
return all_clustered_labels, auxiliary_info
class CurriculumClustering(BaseEstimator, ClusterMixin):
"""Perform Curriculum Clustering from vector array or distance matrix.
The algorithm will cluster *each* category into N subsets using distribution density in an unsupervised manner.
The subsets can be thought of stages in an educational curriculum, going from easiest to hardest learning material.
For information, please see see CurriculumNet, a weakly supervised learning approach that leverages this technique.
Parameters
----------
X : array or sparse (CSR) matrix of shape (n_samples, n_features), or \
array of shape (n_samples, n_samples)
A feature array (the embedding space of the samples.)
y : array-like, size=[n_samples]
category labels (the curriculum will be learned for each of these categories into N subsets)
verbose : bool, optional, default False
Whether to print progress messages to stdout.
density_t : float, optional
The density threshold for neighbors to be clustered together
n_subsets : int, optional (default = 3)
The number of subsets to cluster each category into. For example, if set to 3, then the categories
outputted will be assigned a label 0, 1, or 2. Where 0 contains the simplest (most similar) samples,
1 contains middle level (somewhat similar samples), and 2 contains most complex (most diverse) samples.
random_state : int, RandomState instance or None (default), optional
The generator used to make random selections within the algorithm. Use an int to make the
randomness deterministic.
method : {'default', 'gaussian'}, optional
The algorithm to be used to calculate local density values. The default algorithm
follows the approach outlined in the scientific paper referenced below.
dim_reduce : int, optional (default = 256)
The dimensionality to reduce the feature vector to, prior to calculating distance.
Lower dimension is more efficient, but degrades performance, and visa-versa.
batch_max : int, optional (default = 500000)
The maximum batch of feature vectors to process at one time (loaded into memory).
calc_auxiliary : bool, optional (default = False)
Provide auxiliary including delta centers and density centers.
This can be useful information for visualization, and debugging, amongst other use-cases.
The downside is the processing time significantly increases if turned on.
Returns
-------
all_clustered_labels : array [n_samples]
Clustered labels for each point. Labels are integers ordered from most simple to most complex.
E.g. if curriculum subsets=3, then label=0 is simplest, labels=1 is harder, and label=n is hardest.
auxiliary_info : list
If calc_auxiliary is set to True, this list contains collected auxiliary information
during the clustering process, including delta centers, which can be useful for visualization.
References
----------
S. Guo, W. Huang, H. Zhang, C. Zhuang, D. Dong, M. R. Scott, D. Huang,
"CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images".
In: Proceedings of the European Conference on Computer Vision (ECCV),
Munich, Germany, 2018. (arXiv:1808.01097)
"""
def __init__(self, n_subsets=3, method='default', density_t=0.6, verbose=False,
dim_reduce=256, batch_max=500000, random_state=None, calc_auxiliary=False):
self.n_subsets = n_subsets
self.method = method
self.density_t = density_t
self.verbose = verbose
self.output_labels = None
self.random_state = random_state
self.dim_reduce = dim_reduce
self.batch_max = batch_max
self.calc_auxiliary = calc_auxiliary
def fit(self, X, y):
"""Perform curriculum clustering.
Parameters
----------
X : array or sparse (CSR) matrix of shape (n_samples, n_features), or \
array of shape (n_samples, n_samples)
A feature array (the embedding space of the samples.)
y : array-like, size=[n_samples]
category labels (the curriculum will be learned for each of these categories into N subsets)
"""
X = check_array(X, accept_sparse='csr')
check_consistent_length(X, y)
self.output_labels, _ = cluster_curriculum_subsets(X, y, **self.get_params())
return self
def fit_predict(self, X, y=None):
"""Performs curriculum clustering on X and returns clustered labels (subsets).
Parameters
----------
X : array or sparse (CSR) matrix of shape (n_samples, n_features), or \
array of shape (n_samples, n_samples)
A feature array (the embedding space of the samples.)
y : array-like, size=[n_samples]
category labels (the curriculum will be learned for each of these categories into N subsets)
Returns
-------
all_clustered_labels : array [n_samples]
Clustered labels for each point. Labels are integers ordered from most simple to most complex.
E.g. if curriculum subsets=3, then label=0 is simplest, labels=1 is harder, and label=n is hardest.
auxiliary_info : list
If calc_auxiliary is set to True, this list contains collected auxiliary information
during the clustering process, including delta centers, which can be useful for visualization.
"""
self.fit(X, y)
return self.output_labels