DBSTREAM fix (related to issue #1324) (#1336)
* Add a negative sign before `fading_factor` in the steps of Algorithm 2 of the paper by Hahsler and Bolaños, so that clusters with low weight can be removed.

* Set `clustering_is_up_to_date` to `True` every time the `_recluster` function is called.

* Initialize a new micro cluster with a key derived from the maximum key of the existing micro clusters, or with key 0 if the set of micro clusters is still empty.

* Add description to the UNRELEASED.md file
hoanganhngo610 authored Oct 9, 2023
1 parent 7468665 commit f27e7cb
Showing 2 changed files with 22 additions and 5 deletions.
8 changes: 8 additions & 0 deletions docs/releases/unreleased.md
@@ -6,6 +6,14 @@ River's mini-batch methods now support pandas v2. In particular, River conforms
 
 - Added `anomaly.LocalOutlierFactor`, which is an online version of the LOF algorithm for anomaly detection that matches the scikit-learn implementation.
 
+## cluster
+
+- Fixed several bugs in `cluster.DBSTREAM`, including:
+    - Added the `-` sign before the `fading_factor`, in accordance with Algorithm 2 of Hahsler and Bolaños (2016), so that clusters with low weights can be removed.
+    - A new micro cluster is now keyed by the maximum key of the existing micro clusters plus one, or by 0 if the set of micro clusters is still empty.
+    - `clustering_is_up_to_date` is set to `True` at the end of the `self._recluster()` function.
+
+
 ## datasets
 
 - Added `datasets.WebTraffic`, which is a dataset that counts the occurrences of events on a website. It is a multi-output regression dataset with two outputs.
19 changes: 14 additions & 5 deletions river/cluster/dbstream.py
@@ -182,9 +182,14 @@ def _update(self, x):

         if len(neighbor_clusters) < 1:
             # create new micro cluster
-            self._micro_clusters[len(self._micro_clusters)] = DBSTREAMMicroCluster(
-                x=x, last_update=self._time_stamp, weight=1
-            )
+            if len(self._micro_clusters) > 0:
+                self._micro_clusters[max(self._micro_clusters.keys()) + 1] = DBSTREAMMicroCluster(
+                    x=x, last_update=self._time_stamp, weight=1
+                )
+            else:
+                self._micro_clusters[0] = DBSTREAMMicroCluster(
+                    x=x, last_update=self._time_stamp, weight=1
+                )
         else:
             # update existing micro clusters
             current_centers = {}
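The key-selection change in this hunk can be sketched in isolation. This is an illustrative stand-alone helper (`next_key` is hypothetical, not part of river) showing why `max(keys) + 1` is safer than `len(...)` once micro clusters have been pruned:

```python
# Sketch only: choosing the key for a newly created micro cluster.
# Keying by len(d) can collide with an existing key after deletions;
# keying by max(keys) + 1 (or 0 when empty) is always fresh.
def next_key(micro_clusters: dict) -> int:
    """Return a fresh key for a new micro cluster."""
    if len(micro_clusters) > 0:
        return max(micro_clusters.keys()) + 1
    return 0

clusters = {0: "a", 2: "b"}  # key 1 was removed during cleanup
assert len(clusters) == 2    # the old len-based key would overwrite key 2
assert next_key(clusters) == 3
assert next_key({}) == 0
```

With the old `len`-based scheme, creating a cluster after a deletion would silently overwrite an existing entry rather than add a new one.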
@@ -253,7 +258,9 @@ def _cleanup(self):
         micro_clusters = copy.deepcopy(self._micro_clusters)
         for i, micro_cluster_i in self._micro_clusters.items():
             try:
-                value = 2 ** (self.fading_factor * (self._time_stamp - micro_cluster_i.last_update))
+                value = 2 ** (
+                    -self.fading_factor * (self._time_stamp - micro_cluster_i.last_update)
+                )
             except OverflowError:
                 continue

@@ -266,7 +273,7 @@
         for i in self.s.keys():
             for j in self.s[i].keys():
                 try:
-                    value = 2 ** (self.fading_factor * (self._time_stamp - self.s_t[i][j]))
+                    value = 2 ** (-self.fading_factor * (self._time_stamp - self.s_t[i][j]))
                 except OverflowError:
                     continue
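The effect of the restored minus sign can be checked numerically. A minimal sketch, assuming the fading of Algorithm 2 in Hahsler and Bolaños (2016) (`faded_weight` is an illustrative helper, not river code):

```python
# Sketch only: a micro cluster's weight decays as 2 ** (-fading_factor * dt),
# where dt is the time since its last update. Without the minus sign the
# factor grows with dt, so stale clusters would never drop below any
# removal threshold during cleanup.
def faded_weight(weight: float, dt: float, fading_factor: float = 0.25) -> float:
    return weight * 2 ** (-fading_factor * dt)

assert faded_weight(1.0, 0) == 1.0       # fresh cluster keeps its full weight
assert faded_weight(1.0, 4) == 0.5       # 2 ** (-0.25 * 4) == 2 ** -1
assert faded_weight(1.0, 40) < 1e-3      # stale cluster can now be pruned
```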

@@ -375,6 +382,8 @@ def _recluster(self):
         self._n_clusters, self._clusters = self._generate_clusters_from_labels(labels)
         self._centers = {i: self._clusters[i].center for i in self._clusters.keys()}
 
+        self.clustering_is_up_to_date = True
+
     def learn_one(self, x, sample_weight=None):
         self._update(x)
