-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
version is 3.6.0 #378
version is 3.6.0 #378
Conversation
Should we change the version here to 3.6.0-SNAPSHOT so its available on maven as a -SNAPSHOT before official 3.6.0 release. |
Good idea. Let me push a change. We would have to also update the forecast horizon changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
read a subset of files. Partial comments below
@@ -131,10 +130,6 @@ public RandomCutForest scoreAndUpdate(BenchmarkState state, Blackhole blackhole) | |||
} | |||
|
|||
blackhole.consume(score); | |||
if (!forest.parallelExecutionEnabled) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you remove the code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The memory meter was unstable for JDK 17 see here https://github.com/clojure-goes-fast/clj-memory-meter/blob/master/README.md
|
||
/** | ||
* This function is supposed to validate the integrity of the tree and rebuild | ||
* internal data structures. At this moment the only internal structure is the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another structure is bounding box, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but the bounding boxes can get initialized later as well. PointSum has no other path to be initialized.
@@ -151,18 +146,20 @@ public void deleteInternalNode(int index) { | |||
if (parentIndex != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For increaseMassOfInternalNode, why do we need an modulo here? When the mass goes too high, it will go back to 0?
@Override
protected void increaseMassOfInternalNode(int node) {
mass[node] = (mass[node] + 1) % (capacity + 1);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The possible values of mass are 0..N. For N=256 (which has been a common use case) this means that we would have needed more than 1 byte. At the same time, no (reachable) internal node can have mass 0 -- the mass has to be at least 2. So the encoding here performed the modulo arithmetic and to represent 256 by 0. Unless there is a bug elsewhere (in the sampler code, or tree code) the mass should never exceed N.
@@ -151,18 +146,20 @@ public void deleteInternalNode(int index) { | |||
if (parentIndex != null) { | |||
parentIndex[index] = capacity; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For decreaseMassOfInternalNode, why do you + capacity instead of -1?
@Override
protected void decreaseMassOfInternalNode(int node) {
mass[node] = (mass[node] + capacity) % (capacity + 1);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are the same in modulo arithmetic -- it made sense to not use the minus operation, and stay in non-negative (unsigned) range.
freeNodeManager.releaseIndex(index); | ||
} | ||
|
||
public int getMass(int index) { | ||
return (isLeaf(index)) ? getLeafMass(index) : mass[index] != 0 ? mass[index] : (capacity + 1); | ||
return mass[index] != 0 ? mass[index] : (capacity + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the default value of mass is capacity + 1 instead of 0? Also, what is mass in a node? Is the following understanding correct?
leaf node: number of times we have added this node
interior node: total mass of all descendent points
Also, in NodeStoreSmall, a mass is of byte. Does that mean its max is 255? When it goes beyond 255, does it go back to 0? Would we undercount mass if using 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mass is the number of leaves which are descendants of a node. For any reachable (from root) node, mass cannot be 0; so the value 0 is used to represent (capacity+1).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly in a future refactoring, I think it would be helpful to group together all methods that perform mass calculations, possibly into a utility class, and make the mass value only accessible through methods.
} | ||
} | ||
|
||
public void addPointToPartialTree(Integer pointIndex, long sequenceIndex) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Compared to addPoint, do we use the function to add points in a batch manner and recomputing point sum and bounding box in the final core? This would save redundant effort to react new point sum and bounding box construction for each new addition. If so, could you add that in the function comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Performing in a batch would save time. For a fully balanced binary tree, one'd need to perform lg n times more work. Would add to comments.
@@ -300,7 +312,7 @@ public Integer addPoint(Integer pointIndex, long sequenceIndex) { | |||
if (currentBox.contains(point) || parent == Null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When will separation be false when there are only two nodes (oldPoint and point)? Is it when there is a floating point issue as in #376 ? Why do we insert the same leaf node again into parentPath "parentPath.push(new int[] { node, sibling });" ? That "savedDim == Integer.MAX_VALUE" is true means separation failed on the leaf node (first iteration in the loop), right? You then will retry a random cut for a bounding box of the leaf/internal node and its union with a point. After that, you will throw IllegalStateException no matter whether the cut succeeds or not. Is that intended?
boolean separation = ((point[dim] <= value && value < currentBox.getMinValue(dim)
|| point[dim] > value && value >= currentBox.getMaxValue(dim)));
if (separation) {
savedCutValue = value;
savedDim = dim;
savedParent = parent;
savedNode = node;
savedBox = currentBox.copy();
parentPath.clear();
} else {
parentPath.push(new int[] { node, sibling });
}
if (savedDim == Integer.MAX_VALUE) {
randomCut(factor, point, currentBox);
throw new IllegalStateException(" cut failed ");
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The process of adding a node is bottom up; with the caveat that any cut at the higher level (closer to root) invalidates previous cuts (e.g., look at parentPath.clear() inside the while(true) loop). Interestingly this makes the Update Tree operation the same in spirit as the traversal. It is possible to write a traversal that updates the tree in this manner -- but that may become more opaque.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the other part - you're right, we can probably use the new checkArgument to raise the exception; and producing a new random cut is not required
@@ -318,8 +330,15 @@ public Integer addPoint(Integer pointIndex, long sequenceIndex) { | |||
assert (pathToRoot.lastElement()[0] == savedParent); | |||
} | |||
|
|||
int mergedNode = nodeStore.addNode(pathToRoot, point, sequenceIndex, pointIndex, savedNode, savedDim, | |||
savedCutValue, savedBox); | |||
int childMassIfLeaf = isLeaf(savedNode) ? getLeafMass(savedNode) : 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When is savedNode not a leaf node?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost always - the while(true) will check every level all the way to the root and the cuts closer to node will have a different savedNode
if (point.length != dimension) { | ||
throw new IllegalStateException(" incorrect projection"); | ||
} | ||
checkArgument(point.length != dimension, () -> " incorrect projection"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it be point.length == dimension?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, changed, the new push should overwrite
@@ -38,11 +39,19 @@ private CommonUtils() { | |||
* @throws IllegalArgumentException if {@code condition} is false. | |||
*/ | |||
public static void checkArgument(boolean condition, String message) { | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: with the new overloaded checkArgument
definition, we could change this implementation to:
public static void checkArgument(boolean condition, String message) {
checkArgument(condition, () => message);
}
freeNodeManager.releaseIndex(index); | ||
} | ||
|
||
public int getMass(int index) { | ||
return (isLeaf(index)) ? getLeafMass(index) : mass[index] != 0 ? mass[index] : (capacity + 1); | ||
return mass[index] != 0 ? mass[index] : (capacity + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly in a future refactoring, I think it would be helpful to group together all methods that perform mass calculations, possibly into a utility class, and make the mass value only accessible through methods.
Description of changes: Spurred by recent issue of floating point based exception, this version adds a number of checks and validations of state conversions. The code for a tree was spread across multiple classes since at some point the code supported multiple versions of trees (compact/non-compact/double precision) etc. The code is refactored. As a consequence it would be significantly easier to have an ensemble forest where each tree operates on a different transform of the input (possibly over different number of dimensions), or to have labels associated with points in a tree leading to a classification or semi-supervision capabilities.
The change to version 3.6 also resolves a regression in 3.5.1 as in issue #381
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.