DBSTREAM: update shared density #1468

Closed
ShkarupaDC opened this issue Dec 5, 2023 · 22 comments

@ShkarupaDC

ShkarupaDC commented Dec 5, 2023

Description

Hi! I have a question about the DBSTREAM implementation. DBSTREAM updates a shared density by multiplying the current value by the fading factor 2^(-lambda * (t_current - t_last)) and adding 1. Thus, if two micro-clusters (MCs) have not shared data samples before but now receive a data sample that falls into the assignment areas of both, their shared density should be 1. However, in the code (the snippet supporting my statement is shown below), it is set to 0. Am I right that this is a bug we should fix, or did I miss something? @hoanganhngo610, could you take a look?

Code

# river/cluster/dbstream.py, 228 - 234
except KeyError:
    try:
        self.s[i][j] = 0
        self.s_t[i][j] = self._time_stamp
    except KeyError:
        self.s[i] = {j: 0}
        self.s_t[i] = {j: self._time_stamp}

Pseudo code

The update rule is described on line 11.
[Screenshot: DBSTREAM pseudocode from the paper]
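
For illustration, a minimal standalone sketch of this update rule (the function name and arguments are mine, not the library's; fading_factor plays the role of lambda):

def update_shared_density(s_prev, fading_factor, t_now, t_last):
    # fade the previous shared density, then add 1 for the newly shared sample
    return s_prev * 2 ** (-fading_factor * (t_now - t_last)) + 1

# two micro-clusters that have never shared a sample before (s_prev = 0)
# should end up with a shared density of 1, not 0
assert update_shared_density(0, fading_factor=0.01, t_now=42, t_last=42) == 1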

@ShkarupaDC
Author

ShkarupaDC commented Dec 5, 2023

A second question arose while I was creating this issue.

Could you explain why we take the square root of the Minkowski distance when that is already accounted for in the metric implementation?

# river/cluster/dbstream.py, 162 - 164
@staticmethod
def _distance(point_a, point_b):
    return math.sqrt(utils.math.minkowski_distance(point_a, point_b, 2))

@hoanganhngo610
Contributor

Thank you so much @ShkarupaDC for raising this issue. I will have a look shortly.

@hoanganhngo610
Contributor

@ShkarupaDC For the second question, it's true that the math.sqrt is redundant. This is because, at the time this algorithm was implemented, the minkowski_distance function did not apply the **(1/p) exponent, hence the square root. I will modify this in the next commit.
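
Assuming minkowski_distance now applies the **(1/p) exponent, the corrected helper could simply be (a sketch, not necessarily the exact commit):

# river/cluster/dbstream.py, 162 - 164 (sketch)
@staticmethod
def _distance(point_a, point_b):
    # with p=2 this is already the Euclidean distance; no extra square root needed
    return utils.math.minkowski_distance(point_a, point_b, 2)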

hoanganhngo610 added a commit that referenced this issue Dec 5, 2023
@ShkarupaDC
Author

ShkarupaDC commented Dec 5, 2023

@hoanganhngo610, thank you for your answer! Could you also help me understand the reason for catching the OverflowError? The expression

# river/cluster/dbstream.py, 261 - 263
value = 2 ** (
    -self.fading_factor * (self._time_stamp - micro_cluster_i.last_update)
)

may only produce a number large enough to cause an OverflowError if fading_factor is negative. Since a negative value is not expected, it may be a good idea to add parameter validation as a protection mechanism. With a positive fading_factor the value can only become too small, and Python simply returns 0 in that case.
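
For instance, a minimal sketch of such a guard in DBSTREAM.__init__ (illustrative only, not existing library code):

if fading_factor <= 0:
    raise ValueError(f"fading_factor must be positive, got {fading_factor}")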

@ShkarupaDC
Author

ShkarupaDC commented Dec 5, 2023

From the official implementation (C++, R), I also found that:

  1. Entries in the shared density dict should be removed for weak clusters that are themselves removed. That looks pretty obvious, so it would be great if we implemented it (line).
  2. Centers are reverted for all clusters whenever at least one pair of clusters ends up closer than the clustering_threshold to each other (see the rough sketch after this list). I could not find a case that breaks the current implementation, so this is just FYI. Maybe we should do the same to stay compatible with the official implementation (line).
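
A rough sketch of the behaviour described in point 2, assuming each micro-cluster exposes a center dict as in river's implementation (the helper _move_micro_clusters_towards is hypothetical):

# snapshot the centers before the update step
old_centers = {i: dict(mc.center) for i, mc in self._micro_clusters.items()}
self._move_micro_clusters_towards(x)  # hypothetical: the step that moves centers
# if any pair of centers is now closer than the clustering threshold,
# the reference implementation reverts *all* centers to their previous positions
ids = list(self._micro_clusters)
too_close = any(
    self._distance(self._micro_clusters[a].center, self._micro_clusters[b].center)
    < self.clustering_threshold
    for idx, a in enumerate(ids)
    for b in ids[idx + 1:]
)
if too_close:
    for i, center in old_centers.items():
        self._micro_clusters[i].center = center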

P.S. I could create a PR instead, but probably only on Saturday or Sunday. Which would work better for you?

@hoanganhngo610
Contributor

@ShkarupaDC Regarding your question on handling the OverflowError within DBSTREAM: originally, we did not implement anything to handle this error. However, when a large amount of data is taken into account (say, millions of data points), the difference between the current time stamp and the last time stamp of the respective micro-cluster becomes large enough that this error occurs. That's the reason why this was implemented.

You can have a look at the discussion within the PR containing this change here (PR #872).

hoanganhngo610 added a commit that referenced this issue Dec 6, 2023
hoanganhngo610 added a commit that referenced this issue Dec 6, 2023
@ShkarupaDC
Author

@hoanganhngo610, I asked because I could not get the reason from the discussion you mentioned. If an overflow is caused by a large difference here

self._time_stamp - micro_cluster_i.last_update
# or here
self._time_stamp - self.s_t[i][j] 

then value should be close to 0 and therefore below the weak-weight threshold (or the threshold multiplied by intersection_factor). So we should not continue, but rather remove the micro-cluster (or set the shared density to 0).

I am referring to these two cases:

# river/cluster/dbstream.py, 260 - 265
try:
    value = 2 ** (
        -self.fading_factor * (self._time_stamp - micro_cluster_i.last_update)
    )
except OverflowError:
    continue
# river/cluster/dbstream.py, 275 - 278
try:
    value = 2 ** (-self.fading_factor * (self._time_stamp - self.s_t[i][j]))
except OverflowError:
    continue
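
Concretely, the suggestion above would amount to something like this for the first case (a sketch of my suggestion, not the library code):

try:
    value = 2 ** (
        -self.fading_factor * (self._time_stamp - micro_cluster_i.last_update)
    )
except OverflowError:
    # with a positive fading factor the power of 2 can only shrink towards 0,
    # so treat the faded weight as 0 and let the weak-cluster check remove it
    value = 0.0
if micro_cluster_i.weight * value < weight_weak:
    micro_clusters.pop(i)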

@hoanganhngo610
Contributor

@ShkarupaDC I'm really sorry, but I don't quite get your point here. If I understand it correctly, in practice the calculation

self._time_stamp - micro_cluster_i.last_update
or
self._time_stamp - self.s_t[i][j]

would never be large enough to overflow on its own. Instead, what we are trying to catch here is the computation of the value variable itself, i.e. when it becomes too large. As discussed in the linked PR, whenever value is that large, even multiplying it by micro_cluster_i.weight would never bring it below the threshold, which is why we simply catch the error and continue.

@ShkarupaDC
Author

ShkarupaDC commented Dec 6, 2023

That is how I initially understood why the error was being caught, too. But value cannot be a large number, because the exponent of 2 is negative.
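
For example, with a float exponent as in the code above, a positive fading factor can only underflow (Python quietly returns 0.0); the OverflowError can only appear when the sign flips, i.e. with a negative fading factor:

>>> 2 ** (-0.01 * (10 ** 9))  # positive fading factor: underflow, no exception
0.0
>>> 2 ** (0.01 * (10 ** 9))   # exponent as it would be with a negative fading factor
Traceback (most recent call last):
    ...
OverflowError: (34, 'Numerical result out of range')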

@hoanganhngo610
Contributor

@ShkarupaDC Sorry, I finally get your point now. Yes, even if self._time_stamp - micro_cluster_i.last_update or self._time_stamp - self.s_t[i][j] gets large, this can only cause an overflow if (logically, if and only if) the fading factor is negative, which is not supposed to happen. However, there have been some problems when people apply DBSTREAM in certain use cases (have a look at #1168), which I will follow up on now.

For now, I suggest we keep things as they are (since in my use cases this does not have any major impact) until we fully understand what caused the overflow.

@hoanganhngo610
Contributor

Regarding your original question @ShkarupaDC, I agree that the value s_ij should be set to 1 instead of 0: since it is updated at that moment, the previous value defaults to 0, which, after the update, makes it equal to 1. I will incorporate the change and modify the related tests accordingly.
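
The corresponding change to the snippet quoted at the top of the issue would then look like this (a sketch of the fix, not necessarily the exact commit):

# river/cluster/dbstream.py, 228 - 234 (0 -> 1 for a first shared sample)
except KeyError:
    try:
        self.s[i][j] = 1
        self.s_t[i][j] = self._time_stamp
    except KeyError:
        self.s[i] = {j: 1}
        self.s_t[i] = {j: self._time_stamp}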

@ShkarupaDC
Author

Thank you! Could you also look at the first point here? The current implementation may lead to a situation where a cluster is removed and later recreated with non-zero shared densities, because we do not clear them when popping the cluster here:

# river/cluster/dbstream.py, 259 - 268
for i, micro_cluster_i in self._micro_clusters.items():
    try:
        value = 2 ** (
            -self.fading_factor * (self._time_stamp - micro_cluster_i.last_update)
        )
    except OverflowError:
        continue

    if micro_cluster_i.weight * value < weight_weak:
        micro_clusters.pop(i)

@MaxHalford
Member

Guys, I'm not following deeply, but all I can recommend is that you write a couple of unit tests :)

@ShkarupaDC
Author

@MaxHalford, I agree with you. It would also be a good idea to reproduce the experiments from the DBSTREAM paper and compare the results.
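
As a starting point, a unit test could check something along these lines (a sketch only: the constructor arguments follow river's DBSTREAM API, but the data and the invariant being checked are purely illustrative, not the tests that were eventually added):

from river import cluster

def test_shared_density_refers_only_to_existing_micro_clusters():
    dbstream = cluster.DBSTREAM(clustering_threshold=1.5, fading_factor=0.05)
    for x in [{"x": 1.0, "y": 1.0}, {"x": 1.2, "y": 0.9}, {"x": 5.0, "y": 5.0}]:
        dbstream.learn_one(x)
    # the shared-density graph should only ever reference existing micro-clusters
    cluster_ids = set(dbstream._micro_clusters)
    for i, row in dbstream.s.items():
        assert i in cluster_ids
        assert set(row) <= cluster_ids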

@hoanganhngo610
Contributor

hoanganhngo610 commented Dec 8, 2023

@MaxHalford Totally agree! I will write up the unit tests once we have resolved all the issues.

@ShkarupaDC If I understand correctly, after removing the micro-cluster i (via micro_clusters.pop(i)), you want to erase all of its entries in the shared densities and their time stamps (i.e. removing all s[i][j] or s[j][i] and s_t[i][j] or s_t[j][i]). Is that correct?

@ShkarupaDC
Author

@hoanganhngo610, yes, you are right, that is exactly my point!

@hoanganhngo610
Contributor

hoanganhngo610 commented Dec 8, 2023

@ShkarupaDC Thank you very much! If that's the case, I suggest modifying the code as follows:

# starting from line 267
            if micro_cluster_i.weight * value < weight_weak:
                micro_clusters.pop(i)
                try:
                    self.s.pop(i)
                    self.s_t.pop(i)
                except KeyError:
                    pass
                for j in range(i):
                    try:
                        self.s[j].pop(i)
                        self.s_t[j].pop(i)
                    except KeyError:
                        continue

Do you think this would be appropriate for the task?

@ShkarupaDC
Author

@hoanganhngo610, looks good to me! We can use pop(i, None) to avoid having to catch the KeyError.

@hoanganhngo610
Contributor

hoanganhngo610 commented Dec 8, 2023

@ShkarupaDC Thank you so much, I totally forgot about that. So this now becomes:

# starting from line 267
            if micro_cluster_i.weight * value < weight_weak:
                micro_clusters.pop(i)
                self.s.pop(i, None)
                self.s_t.pop(i, None)
                # Since self.s and self.s_t always have the same keys
                for j in self.s:
                    if j < i:
                        self.s[j].pop(i, None)
                        self.s_t[j].pop(i, None)
                    else:
                        break

Would that be correct?

hoanganhngo610 added a commit that referenced this issue Dec 14, 2023
…removing a micro-cluster (related to an inquiry from #1468)
@hoanganhngo610
Contributor

@ShkarupaDC All of the changes that we have discussed here have been incorporated into the main branch. If any other problem still persists, please do not hesitate to let me know!
@MaxHalford I have also modified the unit tests and added one more on a synthetic dataset to investigate the behavior of DBSTREAM after the changes. As @ShkarupaDC suggested earlier, at some point we can run experiments to compare the performance of our version against the original implementation.
However, if nothing remains, I will proceed to close this PR.

@MaxHalford
Member

Good job @hoanganhngo610! In the future, can you please do pull requests instead of pushing to main? It's easier to see what changes you made that way.

However, if nothing remains, I will proceed to close this PR.

You mean this issue?

@hoanganhngo610
Contributor

@MaxHalford I will definitely do that for future issues! And yes, I mean closing this issue :D
