DBSTREAM: update shared density #1468
Comments
The second question arose when I created this issue. Could you explain why we take the square root of the Minkowski distance, given that we already accounted for that in the metric implementation?

```python
# river/cluster/dbstream.py, lines 162-164
@staticmethod
def _distance(point_a, point_b):
    return math.sqrt(utils.math.minkowski_distance(point_a, point_b, 2))
```
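For readers following along, a minimal sketch of what is at stake, assuming hypothetically that the metric returns the sum of powered differences without taking the p-th root. Under that assumption the outer `math.sqrt` is what produces a Euclidean distance; it would be redundant only if the root were already applied inside the metric:

```python
import math

# Hypothetical stand-in for utils.math.minkowski_distance, assuming it
# returns sum(|a_k - b_k| ** p) WITHOUT taking the p-th root.
def minkowski_distance(a, b, p):
    return sum(abs(a[k] - b[k]) ** p for k in a)

a = {0: 0.0, 1: 0.0}
b = {0: 3.0, 1: 4.0}
print(minkowski_distance(a, b, 2))             # 25.0 (sum of squares)
print(math.sqrt(minkowski_distance(a, b, 2)))  # 5.0  (Euclidean distance)
```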
Thank you so much @ShkarupaDC for raising the issue. I will have a look at it shortly.
@ShkarupaDC For the second question, it's true that the […]
…AM (resolve partially issue #1468).
@hoanganhngo610, thank you for your answer! Could you also help me understand the reason for catching the OverflowError? The expression

```python
# river/cluster/dbstream.py, lines 261-263
value = 2 ** (
    -self.fading_factor * (self._time_stamp - micro_cluster_i.last_update)
)
```

may result in a number large enough to cause an OverflowError only if a negative fading_factor is used. Since a negative value is not expected, it may be a good idea to add parameter validation as a protection mechanism. With a positive fading_factor the value can only be too small, and Python simply returns 0.0 in that case.
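A small demonstration of the asymmetry described above (CPython float behaviour; the numbers are illustrative):

```python
# 2 ** x with a float exponent raises OverflowError for a large positive x,
# but silently underflows to 0.0 for a large negative x.
fading_factor = 0.01       # typical positive value
time_delta = 1_000_000     # very large gap between time stamps

print(2 ** (-fading_factor * time_delta))  # 0.0 -- underflow, no exception

try:
    # Exponent +10000.0, as would happen with a negative fading_factor.
    2 ** (fading_factor * time_delta)
except OverflowError as err:
    print("OverflowError:", err)
```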
From the official implementation (C++, R), I also found that: […]
P.S. I could create a PR instead, but probably on Saturday or Sunday. What works better for you?
@ShkarupaDC Regarding your question on handling the OverflowError within DBSTREAM: originally, we did not implement anything to handle this error. However, when a large amount of data is taken into account (say, millions of data points), the difference between the current time stamp and the time stamp when the respective micro cluster was last updated becomes large enough that this error occurs. That's why this was implemented. You can have a look at the discussion within the PR containing this change (PR #872).
…elated to an issue raised in #1468).
@hoanganhngo610, I asked because I did not get the reason from the mentioned discussion. If an overflow occurs because

```python
self._time_stamp - micro_cluster_i.last_update
# or
self._time_stamp - self.s_t[i][j]
```

is huge, then value should be close to 0 and less than the weak density threshold (or the weak density multiplied by intersection_factor). Therefore, we should not `continue`; we should remove the micro-cluster (or set the shared density to 0). I am referring to these two cases:

```python
# river/cluster/dbstream.py, lines 260-265
try:
    value = 2 ** (
        -self.fading_factor * (self._time_stamp - micro_cluster_i.last_update)
    )
except OverflowError:
    continue
```

```python
# river/cluster/dbstream.py, lines 275-278
try:
    value = 2 ** (-self.fading_factor * (self._time_stamp - self.s_t[i][j]))
except OverflowError:
    continue
```
@ShkarupaDC I'm really sorry, but I don't really get your point here. If I understand it correctly, the difference `self._time_stamp - micro_cluster_i.last_update` would usually never be large enough for an overflow. Instead, what we are trying to catch here is the calculation of the power `2 ** (...)`.
That is how I understood why the error was being caught in the first place. But the value cannot be a large number, because the power of 2 has a negative exponent.
@ShkarupaDC Sorry, I can finally get your point now. Yes, even if […]. At this stage, I suggest that we keep the current implementation (since in my use cases they don't really have any major impact) until we fully understand what caused the overflow.
Regarding your original question @ShkarupaDC, I agree that the values […]
Thank you! Could you also look at the first point here? The current implementation may lead to a case where a micro-cluster is removed and then later recreated with non-zero shared densities, because we do not clear those densities when popping the cluster here:

```python
# river/cluster/dbstream.py, lines 259-268
for i, micro_cluster_i in self._micro_clusters.items():
    try:
        value = 2 ** (
            -self.fading_factor * (self._time_stamp - micro_cluster_i.last_update)
        )
    except OverflowError:
        continue
    if micro_cluster_i.weight * value < weight_weak:
        micro_clusters.pop(i)
```
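To make the failure mode concrete, a toy sketch with plain dicts standing in for the model's internal state:

```python
# Removing a micro-cluster leaves its shared-density entries behind.
micro_clusters = {0: "mc0", 1: "mc1"}
s = {0: {1: 0.7}}    # shared density between clusters 0 and 1
s_t = {0: {1: 42}}   # time stamp of the last shared-density update

micro_clusters.pop(1)  # cluster 1 is removed...
# ...but s[0][1] and s_t[0][1] survive. A new cluster later created
# under key 1 would start with this stale, non-zero shared density.
assert s[0][1] == 0.7
```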
Guys, I'm not following deeply, but all I can recommend is that you write a couple of unit tests :)
@MaxHalford, I agree with you. It would also be a good idea to reproduce the experiments from the DBSTREAM paper and compare results.
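For what it's worth, a minimal sketch of the kind of test being suggested, written against river's public DBSTREAM API (the parameter values, the data, and the assertion are illustrative and would likely need tuning):

```python
from river import cluster

def test_dbstream_separates_two_blobs():
    # Hypothetical smoke test: two well-separated 1-D blobs should end
    # up in different clusters once the stream has been consumed.
    dbstream = cluster.DBSTREAM(clustering_threshold=1.0, fading_factor=0.05)
    for _ in range(50):
        for x in ({"x": 0.0}, {"x": 0.1}, {"x": 10.0}, {"x": 10.1}):
            dbstream.learn_one(x)
    assert dbstream.predict_one({"x": 0.0}) != dbstream.predict_one({"x": 10.0})
```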
@MaxHalford Totally agree! I will write up the unit tests once we have resolved all the issues. @ShkarupaDC If I understand correctly, after removing the micro-cluster `i` (by `micro_clusters.pop(i)`), we should also remove the shared densities associated with it. Is that your point?
@hoanganhngo610, yes, you are right, that is exactly my point!
@ShkarupaDC Thank you very much! If that's the case, I suggest modifying the code as follows:

```python
# starting from line 267
if micro_cluster_i.weight * value < weight_weak:
    micro_clusters.pop(i)
    try:
        self.s.pop(i)
        self.s_t.pop(i)
    except KeyError:
        pass
    for j in range(i):
        try:
            self.s[j].pop(i)
            self.s_t[j].pop(i)
        except KeyError:
            continue
```

Do you think this would be appropriate for the task?
@hoanganhngo610, looks good to me! We can use `dict.pop(key, default)` with a default value to avoid the try/except blocks.
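For reference, the standard dict behaviour that makes the try/except unnecessary:

```python
d = {1: "a"}
print(d.pop(1))        # "a"
print(d.pop(1, None))  # None; no KeyError is raised, so no try/except is needed
```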
@ShkarupaDC Thank you so much, I totally forgot about that. So this now becomes:

```python
# starting from line 267
if micro_cluster_i.weight * value < weight_weak:
    micro_clusters.pop(i)
    self.s.pop(i, None)
    self.s_t.pop(i, None)
    # Since self.s and self.s_t always have the same keys
    for j in self.s:
        if j < i:
            self.s[j].pop(i, None)
            self.s_t[j].pop(i, None)
        else:
            break
```

Would that be correct?
…removing a micro-cluster (related to an inquiry from #1468)
@ShkarupaDC All of the changes that we have discussed here have been incorporated into the main branch. If any other problem persists, please do not hesitate to let me know!
Good job @hoanganhngo610! In the future, can you please open pull requests instead of pushing to main? It's easier to see what changes you made that way.
You mean this issue? |
@MaxHalford I will definitely do that for future issues! And yes, I mean closing this issue :D |
Description
Hi! I have a question about the DBSTREAM implementation. DBSTREAM updates shared densities by multiplying the current value by the decay factor 2^(-lambda * (t_current - t_last)) and adding 1. Thus, if two micro-clusters (MCs) have not shared data samples before but now receive a data sample that falls into the assignment areas of both, their shared density should be 1. However, in the code (the snippet that supports my statement is presented below), we get 0. Am I right that we should fix this bug, or did I miss something? @hoanganhngo610, could you take a look?
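For concreteness, a tiny sketch of the update rule as stated above (a hypothetical helper, not river's actual code):

```python
# Shared-density update as described: s <- s * 2 ** (-lam * (t_now - t_last)) + 1
def update_shared_density(s_old, lam, t_now, t_last):
    return s_old * 2 ** (-lam * (t_now - t_last)) + 1

# Two micro-clusters that never shared a sample before (s_old == 0):
# after the first shared sample, the density should be 1, not 0.
print(update_shared_density(0.0, lam=0.01, t_now=5, t_last=5))  # 1.0
```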
Code
Pseudo code
The update rule is described on line 11.