DBSTREAM fix (related to issue #1324) (#1336)
* Add a negative sign before `fading_factor` in the steps of Algorithm 2 of the paper by Hahsler and Bolaños, so that clusters with low weight can be removed.

* Set `clustering_is_up_to_date` to `True` every time the `_recluster` function is called.

* Initialize a new micro cluster with a key derived from the maximum key of the existing micro clusters, or with key 0 if the set of micro clusters is still empty.

* Add description to the UNRELEASED.md file
hoanganhngo610 authored Oct 9, 2023
1 parent 7468665 commit f27e7cb
Showing 2 changed files with 22 additions and 5 deletions.
8 changes: 8 additions & 0 deletions docs/releases/unreleased.md
@@ -6,6 +6,14 @@ River's mini-batch methods now support pandas v2. In particular, River conforms
 
 - Added `anomaly.LocalOutlierFactor`, which is an online version of the LOF algorithm for anomaly detection that matches the scikit-learn implementation.
 
+## cluster
+
+- Fixed several bugs in `cluster.DBSTREAM`, including:
+    - Added the `-` sign before the `fading_factor`, in accordance with Algorithm 2 of Hahsler and Bolaños (2016), so that clusters with low weights can be removed.
+    - A new micro cluster is now keyed by the maximum key of the existing micro clusters plus one, or by 0 if the set of micro clusters is still empty.
+    - `clustering_is_up_to_date` is set to `True` at the end of the `self._recluster()` function.
+
+
 ## datasets
 
 - Added `datasets.WebTraffic`, which is a dataset that counts the occurrences of events on a website. It is a multi-output regression dataset with two outputs.
19 changes: 14 additions & 5 deletions river/cluster/dbstream.py
@@ -182,9 +182,14 @@ def _update(self, x):

         if len(neighbor_clusters) < 1:
             # create new micro cluster
-            self._micro_clusters[len(self._micro_clusters)] = DBSTREAMMicroCluster(
-                x=x, last_update=self._time_stamp, weight=1
-            )
+            if len(self._micro_clusters) > 0:
+                self._micro_clusters[max(self._micro_clusters.keys()) + 1] = DBSTREAMMicroCluster(
+                    x=x, last_update=self._time_stamp, weight=1
+                )
+            else:
+                self._micro_clusters[0] = DBSTREAMMicroCluster(
+                    x=x, last_update=self._time_stamp, weight=1
+                )
         else:
             # update existing micro clusters
             current_centers = {}
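The key-selection change in this hunk can be sketched in isolation. This is an illustrative stand-alone helper (`next_key` is hypothetical, not part of river) showing why `max(keys) + 1` is safer than `len(...)` once micro clusters have been pruned:

```python
# Sketch only: choosing the key for a newly created micro cluster.
# Keying by len(d) can collide with an existing key after deletions;
# keying by max(keys) + 1 (or 0 when empty) is always fresh.
def next_key(micro_clusters: dict) -> int:
    """Return a fresh key for a new micro cluster."""
    if len(micro_clusters) > 0:
        return max(micro_clusters.keys()) + 1
    return 0

clusters = {0: "a", 2: "b"}  # key 1 was removed during cleanup
assert len(clusters) == 2    # the old len-based key would overwrite key 2
assert next_key(clusters) == 3
assert next_key({}) == 0
```

With the old `len`-based scheme, creating a cluster after a deletion would silently overwrite an existing entry rather than add a new one.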
@@ -253,7 +258,9 @@ def _cleanup(self):
         micro_clusters = copy.deepcopy(self._micro_clusters)
         for i, micro_cluster_i in self._micro_clusters.items():
             try:
-                value = 2 ** (self.fading_factor * (self._time_stamp - micro_cluster_i.last_update))
+                value = 2 ** (
+                    -self.fading_factor * (self._time_stamp - micro_cluster_i.last_update)
+                )
             except OverflowError:
                 continue

@@ -266,7 +273,7 @@
         for i in self.s.keys():
             for j in self.s[i].keys():
                 try:
-                    value = 2 ** (self.fading_factor * (self._time_stamp - self.s_t[i][j]))
+                    value = 2 ** (-self.fading_factor * (self._time_stamp - self.s_t[i][j]))
                 except OverflowError:
                     continue
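The effect of the restored minus sign can be checked numerically. A minimal sketch, assuming the fading of Algorithm 2 in Hahsler and Bolaños (2016) (`faded_weight` is an illustrative helper, not river code):

```python
# Sketch only: a micro cluster's weight decays as 2 ** (-fading_factor * dt),
# where dt is the time since its last update. Without the minus sign the
# factor grows with dt, so stale clusters would never drop below any
# removal threshold during cleanup.
def faded_weight(weight: float, dt: float, fading_factor: float = 0.25) -> float:
    return weight * 2 ** (-fading_factor * dt)

assert faded_weight(1.0, 0) == 1.0       # fresh cluster keeps its full weight
assert faded_weight(1.0, 4) == 0.5       # 2 ** (-0.25 * 4) == 2 ** -1
assert faded_weight(1.0, 40) < 1e-3      # stale cluster can now be pruned
```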

@@ -375,6 +382,8 @@ def _recluster(self):
         self._n_clusters, self._clusters = self._generate_clusters_from_labels(labels)
         self._centers = {i: self._clusters[i].center for i in self._clusters.keys()}
 
+        self.clustering_is_up_to_date = True
+
     def learn_one(self, x, sample_weight=None):
         self._update(x)
