DBSTREAM: update shared density #1468

Closed
ShkarupaDC opened this issue Dec 5, 2023 · 22 comments

@ShkarupaDC

ShkarupaDC commented Dec 5, 2023

Description

Hi! I have a question about the DBSTREAM implementation. DBSTREAM updates a shared density by multiplying the current value by the fading factor 2^(-lambda * (t_current - t_last)) and adding 1. Thus, if two micro-clusters (MCs) have not shared data samples before but now receive a data sample that falls into the assignment areas of both, their shared density should be 1. However, in the code (the snippet supporting my statement is shown below), it is set to 0. Am I right that this is a bug we should fix, or did I miss something? @hoanganhngo610, could you take a look?

Code

# river/cluster/dbstream.py, 228 - 234
except KeyError:
    try:
        self.s[i][j] = 0
        self.s_t[i][j] = self._time_stamp
    except KeyError:
        self.s[i] = {j: 0}
        self.s_t[i] = {j: self._time_stamp}

Pseudo code

The update rule is described on line 11.
[Screenshot: DBSTREAM pseudocode from the paper]
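
For illustration, a minimal standalone sketch of this update rule (the function name and arguments are mine, not the library's; fading_factor plays the role of lambda):

def update_shared_density(s_prev, fading_factor, t_now, t_last):
    # fade the previous shared density, then add 1 for the newly shared sample
    return s_prev * 2 ** (-fading_factor * (t_now - t_last)) + 1

# two micro-clusters that have never shared a sample before (s_prev = 0)
# should end up with a shared density of 1, not 0
assert update_shared_density(0, fading_factor=0.01, t_now=42, t_last=42) == 1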

@ShkarupaDC
Author

ShkarupaDC commented Dec 5, 2023

A second question arose while I was creating this issue.

Could you explain why we take the square root of the Minkowski distance when that is already accounted for in the metric implementation?

# river/cluster/dbstream.py, 162 - 164
@staticmethod
def _distance(point_a, point_b):
    return math.sqrt(utils.math.minkowski_distance(point_a, point_b, 2))

@hoanganhngo610
Contributor

Thank you so much @ShkarupaDC for raising this issue. I will have a look shortly.

@hoanganhngo610
Contributor

@ShkarupaDC For the second question, it's true that the math.sqrt is redundant. This is because, at the time this algorithm was implemented, the minkowski_distance function did not apply the **(1/p) exponent, hence the square root. I will modify this in the next commit.
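
Assuming minkowski_distance now applies the **(1/p) exponent, the corrected helper could simply be (a sketch, not necessarily the exact commit):

# river/cluster/dbstream.py, 162 - 164 (sketch)
@staticmethod
def _distance(point_a, point_b):
    # with p=2 this is already the Euclidean distance; no extra square root needed
    return utils.math.minkowski_distance(point_a, point_b, 2)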

hoanganhngo610 added a commit that referenced this issue Dec 5, 2023
@ShkarupaDC
Author

ShkarupaDC commented Dec 5, 2023

@hoanganhngo610, thank you for your answer! Could you also help me understand the reason for catching the OverflowError? The expression

# river/cluster/dbstream.py, 261 - 263
value = 2 ** (
    -self.fading_factor * (self._time_stamp - micro_cluster_i.last_update)
)

may only produce a number large enough to cause an OverflowError if fading_factor is negative. Since a negative value is not expected, it may be a good idea to add parameter validation as a protection mechanism. With a positive fading_factor the value can only become too small, and Python simply returns 0 in that case.
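
For instance, a minimal sketch of such a guard in DBSTREAM.__init__ (illustrative only, not existing library code):

if fading_factor <= 0:
    raise ValueError(f"fading_factor must be positive, got {fading_factor}")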

@ShkarupaDC
Author

ShkarupaDC commented Dec 5, 2023

From the official implementation (C++, R), I also found that:

  1. Entries in the shared density dict should be removed for weak clusters that are themselves removed. That looks pretty obvious, so it would be great if we implemented it (line).
  2. Centers are reverted for all clusters whenever at least one pair of clusters ends up closer than the clustering_threshold to each other (see the rough sketch after this list). I could not find a case that breaks the current implementation, so this is just FYI. Maybe we should do the same to stay compatible with the official implementation (line).
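
A rough sketch of the behaviour described in point 2, assuming each micro-cluster exposes a center dict as in river's implementation (the helper _move_micro_clusters_towards is hypothetical):

# snapshot the centers before the update step
old_centers = {i: dict(mc.center) for i, mc in self._micro_clusters.items()}
self._move_micro_clusters_towards(x)  # hypothetical: the step that moves centers
# if any pair of centers is now closer than the clustering threshold,
# the reference implementation reverts *all* centers to their previous positions
ids = list(self._micro_clusters)
too_close = any(
    self._distance(self._micro_clusters[a].center, self._micro_clusters[b].center)
    < self.clustering_threshold
    for idx, a in enumerate(ids)
    for b in ids[idx + 1:]
)
if too_close:
    for i, center in old_centers.items():
        self._micro_clusters[i].center = center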

P.S. I could create a PR instead, but probably only on Saturday or Sunday. Which would work better for you?

@hoanganhngo610
Contributor

@ShkarupaDC Regarding your question on handling the OverflowError within DBSTREAM: originally, we did not implement anything to handle this error. However, when a large amount of data is taken into account (say, millions of data points), the difference between the current time stamp and the last time stamp of the respective micro-cluster becomes large enough that this error occurs. That's the reason why this was implemented.

You can have a look at the discussion within the PR containing this change here (PR #872).

hoanganhngo610 added a commit that referenced this issue Dec 6, 2023
hoanganhngo610 added a commit that referenced this issue Dec 6, 2023
@ShkarupaDC
Author

@hoanganhngo610, I asked because I could not get the reason from the discussion you mentioned. If an overflow is caused by a large difference here

self._time_stamp - micro_cluster_i.last_update
# or here
self._time_stamp - self.s_t[i][j] 

then value should be close to 0 and therefore below the weak-weight threshold (or the threshold multiplied by intersection_factor). So we should not continue, but rather remove the micro-cluster (or set the shared density to 0).

I am referring to these two cases:

# river/cluster/dbstream.py, 260 - 265
try:
    value = 2 ** (
        -self.fading_factor * (self._time_stamp - micro_cluster_i.last_update)
    )
except OverflowError:
    continue
# river/cluster/dbstream.py, 275 - 278
try:
    value = 2 ** (-self.fading_factor * (self._time_stamp - self.s_t[i][j]))
except OverflowError:
    continue
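
Concretely, the suggestion above would amount to something like this for the first case (a sketch of my suggestion, not the library code):

try:
    value = 2 ** (
        -self.fading_factor * (self._time_stamp - micro_cluster_i.last_update)
    )
except OverflowError:
    # with a positive fading factor the power of 2 can only shrink towards 0,
    # so treat the faded weight as 0 and let the weak-cluster check remove it
    value = 0.0
if micro_cluster_i.weight * value < weight_weak:
    micro_clusters.pop(i)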

@hoanganhngo610
Contributor

@ShkarupaDC I'm really sorry, but I don't quite get your point here. If I understand it correctly, in practice the calculation

self._time_stamp - micro_cluster_i.last_update
or
self._time_stamp - self.s_t[i][j]

would never be large enough to overflow on its own. Instead, what we are trying to catch here is the computation of the value variable itself, i.e. when it becomes too large. As discussed in the linked PR, whenever value is that large, even multiplying it by micro_cluster_i.weight would never bring it below the threshold, which is why we simply catch the error and continue.

@ShkarupaDC
Author

ShkarupaDC commented Dec 6, 2023

That is how I initially understood why the error was being caught, too. But value cannot be a large number, because the exponent of 2 is negative.
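
For example, with a float exponent as in the code above, a positive fading factor can only underflow (Python quietly returns 0.0); the OverflowError can only appear when the sign flips, i.e. with a negative fading factor:

>>> 2 ** (-0.01 * (10 ** 9))  # positive fading factor: underflow, no exception
0.0
>>> 2 ** (0.01 * (10 ** 9))   # exponent as it would be with a negative fading factor
Traceback (most recent call last):
    ...
OverflowError: (34, 'Numerical result out of range')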

@hoanganhngo610
Contributor

@ShkarupaDC Sorry, I finally get your point now. Yes, even if self._time_stamp - micro_cluster_i.last_update or self._time_stamp - self.s_t[i][j] gets large, this can only cause an overflow if (logically, if and only if) the fading factor is negative, which is not supposed to happen. However, there have been some problems when people apply DBSTREAM in certain use cases (have a look at #1168), which I will follow up on now.

For now, I suggest we keep things as they are (since in my use cases this does not have any major impact) until we fully understand what caused the overflow.

@hoanganhngo610
Contributor

Regarding your original question @ShkarupaDC, I agree that the value s_ij should be set to 1 instead of 0: since it is updated at that moment, the previous value defaults to 0, which, after the update, makes it equal to 1. I will incorporate the change and modify the related tests accordingly.
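
The corresponding change to the snippet quoted at the top of the issue would then look like this (a sketch of the fix, not necessarily the exact commit):

# river/cluster/dbstream.py, 228 - 234 (0 -> 1 for a first shared sample)
except KeyError:
    try:
        self.s[i][j] = 1
        self.s_t[i][j] = self._time_stamp
    except KeyError:
        self.s[i] = {j: 1}
        self.s_t[i] = {j: self._time_stamp}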

@ShkarupaDC
Author

Thank you! Could you also look at the first point here? The current implementation may lead to a situation where a cluster is removed and later recreated with non-zero shared densities, because we do not clear them when popping the cluster here:

# river/cluster/dbstream.py, 259 - 268
for i, micro_cluster_i in self._micro_clusters.items():
    try:
        value = 2 ** (
            -self.fading_factor * (self._time_stamp - micro_cluster_i.last_update)
        )
    except OverflowError:
        continue

    if micro_cluster_i.weight * value < weight_weak:
        micro_clusters.pop(i)

@MaxHalford
Member

Guys, I'm not following deeply, but all I can recommend is that you write a couple of unit tests :)

@ShkarupaDC
Author

@MaxHalford, I agree with you. It would also be a good idea to reproduce the experiments from the DBSTREAM paper and compare the results.
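
As a starting point, a unit test could check something along these lines (a sketch only: the constructor arguments follow river's DBSTREAM API, but the data and the invariant being checked are purely illustrative, not the tests that were eventually added):

from river import cluster

def test_shared_density_refers_only_to_existing_micro_clusters():
    dbstream = cluster.DBSTREAM(clustering_threshold=1.5, fading_factor=0.05)
    for x in [{"x": 1.0, "y": 1.0}, {"x": 1.2, "y": 0.9}, {"x": 5.0, "y": 5.0}]:
        dbstream.learn_one(x)
    # the shared-density graph should only ever reference existing micro-clusters
    cluster_ids = set(dbstream._micro_clusters)
    for i, row in dbstream.s.items():
        assert i in cluster_ids
        assert set(row) <= cluster_ids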

@hoanganhngo610
Contributor

hoanganhngo610 commented Dec 8, 2023

@MaxHalford Totally agree! I will write up the unit tests once we have resolved all the issues.

@ShkarupaDC If I understand correctly, after removing the micro-cluster i (via micro_clusters.pop(i)), you want to erase all of its entries in the shared densities and their time stamps (i.e. removing all s[i][j] or s[j][i] and s_t[i][j] or s_t[j][i]). Is that correct?

@ShkarupaDC
Author

@hoanganhngo610, yes, you are right, that is exactly my point!

@hoanganhngo610
Contributor

hoanganhngo610 commented Dec 8, 2023

@ShkarupaDC Thank you very much! If that's the case, I suggest modifying the code as follows:

# starting from line 267
            if micro_cluster_i.weight * value < weight_weak:
                micro_clusters.pop(i)
                try:
                    self.s.pop(i)
                    self.s_t.pop(i)
                except KeyError:
                    pass
                for j in range(i):
                    try:
                        self.s[j].pop(i)
                        self.s_t[j].pop(i)
                    except KeyError:
                        continue

Do you think this would be appropriate for the task?

@ShkarupaDC
Author

@hoanganhngo610, looks good to me! We can use pop(i, None) to avoid having to catch the KeyError.

@hoanganhngo610
Contributor

hoanganhngo610 commented Dec 8, 2023

@ShkarupaDC Thank you so much, I totally forgot about that. So this now becomes:

# starting from line 267
            if micro_cluster_i.weight * value < weight_weak:
                micro_clusters.pop(i)
                self.s.pop(i, None)
                self.s_t.pop(i, None)
                # Since self.s and self.s_t always have the same keys
                for j in self.s:
                    if j < i:
                        self.s[j].pop(i, None)
                        self.s_t[j].pop(i, None)
                    else:
                        break

Would that be correct?

hoanganhngo610 added a commit that referenced this issue Dec 14, 2023
…removing a micro-cluster (related to an inquiry from #1468)
@hoanganhngo610
Contributor

@ShkarupaDC All of the changes that we have discussed here have been incorporated into the main branch. If any other problem still persists, please do not hesitate to let me know!
@MaxHalford I have also modified the unit tests and added one more on a synthetic dataset to investigate the behavior of DBSTREAM after the changes. As @ShkarupaDC suggested earlier, at some point we can run experiments to compare the performance of our version against the original implementation.
However, if nothing remains, I will proceed to close this PR.

@MaxHalford
Member

Good job @hoanganhngo610! In the future, can you please do pull requests instead of pushing to main? It's easier to see what changes you made that way.

However, if nothing remains, I will proceed to close this PR.

You mean this issue?

@hoanganhngo610
Contributor

@MaxHalford I will definitely do that for future issues! And yes, I mean closing this issue :D
