Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

version is 3.6.0 #378

Merged
merged 6 commits into from
Apr 4, 2023
Merged

version is 3.6.0 #378

merged 6 commits into from
Apr 4, 2023

Conversation

sudiptoguha
Copy link
Contributor

@sudiptoguha sudiptoguha commented Mar 22, 2023

Description of changes: Spurred by recent issue of floating point based exception, this version adds a number of checks and validations of state conversions. The code for a tree was spread across multiple classes since at some point the code supported multiple versions of trees (compact/non-compact/double precision) etc. The code is refactored. As a consequence it would be significantly easier to have an ensemble forest where each tree operates on a different transform of the input (possibly over different number of dimensions), or to have labels associated with points in a tree leading to a classification or semi-supervision capabilities.

The change to version 3.6 also resolves a regression in 3.5.1 as in issue #381

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@sudiptoguha sudiptoguha requested a review from jotok March 22, 2023 22:41
@amitgalitz
Copy link
Contributor

Should we change the version here to 3.6.0-SNAPSHOT so its available on maven as a -SNAPSHOT before official 3.6.0 release.

@sudiptoguha
Copy link
Contributor Author

Good idea. Let me push a change. We would have to also update the forecast horizon changes.

@sudiptoguha sudiptoguha changed the title version is 3.6 version is 3.6.0-SNAPSHOT Mar 23, 2023
Copy link
Collaborator

@kaituo kaituo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

read a subset of files. Partial comments below

@@ -131,10 +130,6 @@ public RandomCutForest scoreAndUpdate(BenchmarkState state, Blackhole blackhole)
}

blackhole.consume(score);
if (!forest.parallelExecutionEnabled) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you remove the code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


/**
* This function is supposed to validate the integrity of the tree and rebuild
* internal data structures. At this moment the only internal structure is the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another structure is bounding box, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but the bounding boxes can get initialized later as well. PointSum has no other path to be initialized.

@@ -151,18 +146,20 @@ public void deleteInternalNode(int index) {
if (parentIndex != null) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For increaseMassOfInternalNode, why do we need an modulo here? When the mass goes too high, it will go back to 0?

@Override
    protected void increaseMassOfInternalNode(int node) {
        mass[node] = (mass[node] + 1) % (capacity + 1);
    }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The possible values of mass are 0..N. For N=256 (which has been a common use case) this means that we would have needed more than 1 byte. At the same time, no (reachable) internal node can have mass 0 -- the mass has to be at least 2. So the encoding here performed the modulo arithmetic and to represent 256 by 0. Unless there is a bug elsewhere (in the sampler code, or tree code) the mass should never exceed N.

@@ -151,18 +146,20 @@ public void deleteInternalNode(int index) {
if (parentIndex != null) {
parentIndex[index] = capacity;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For decreaseMassOfInternalNode, why do you + capacity instead of -1?

@Override
    protected void decreaseMassOfInternalNode(int node) {
        mass[node] = (mass[node] + capacity) % (capacity + 1);
    }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are the same in modulo arithmetic -- it made sense to not use the minus operation, and stay in non-negative (unsigned) range.

freeNodeManager.releaseIndex(index);
}

public int getMass(int index) {
return (isLeaf(index)) ? getLeafMass(index) : mass[index] != 0 ? mass[index] : (capacity + 1);
return mass[index] != 0 ? mass[index] : (capacity + 1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the default value of mass is capacity + 1 instead of 0? Also, what is mass in a node? Is the following understanding correct?

leaf node: number of times we have added this node
interior node: total mass of all descendent points

Also, in NodeStoreSmall, a mass is of byte. Does that mean its max is 255? When it goes beyond 255, does it go back to 0? Would we undercount mass if using 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mass is the number of leaves which are descendants of a node. For any reachable (from root) node, mass cannot be 0; so the value 0 is used to represent (capacity+1).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly in a future refactoring, I think it would be helpful to group together all methods that perform mass calculations, possibly into a utility class, and make the mass value only accessible through methods.

}
}

public void addPointToPartialTree(Integer pointIndex, long sequenceIndex) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compared to addPoint, do we use the function to add points in a batch manner and recomputing point sum and bounding box in the final core? This would save redundant effort to react new point sum and bounding box construction for each new addition. If so, could you add that in the function comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Performing in a batch would save time. For a fully balanced binary tree, one'd need to perform lg n times more work. Would add to comments.

@@ -300,7 +312,7 @@ public Integer addPoint(Integer pointIndex, long sequenceIndex) {
if (currentBox.contains(point) || parent == Null) {
Copy link
Collaborator

@kaituo kaituo Mar 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When will separation be false when there are only two nodes (oldPoint and point)? Is it when there is a floating point issue as in #376 ? Why do we insert the same leaf node again into parentPath "parentPath.push(new int[] { node, sibling });" ? That "savedDim == Integer.MAX_VALUE" is true means separation failed on the leaf node (first iteration in the loop), right? You then will retry a random cut for a bounding box of the leaf/internal node and its union with a point. After that, you will throw IllegalStateException no matter whether the cut succeeds or not. Is that intended?

                    boolean separation = ((point[dim] <= value && value < currentBox.getMinValue(dim)
                            || point[dim] > value && value >= currentBox.getMaxValue(dim)));
                    if (separation) {
                        savedCutValue = value;
                        savedDim = dim;
                        savedParent = parent;
                        savedNode = node;
                        savedBox = currentBox.copy();
                        parentPath.clear();
                    } else {
                        parentPath.push(new int[] { node, sibling });
                    }

                    if (savedDim == Integer.MAX_VALUE) {
                        randomCut(factor, point, currentBox);
                        throw new IllegalStateException(" cut failed ");
                    }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The process of adding a node is bottom up; with the caveat that any cut at the higher level (closer to root) invalidates previous cuts (e.g., look at parentPath.clear() inside the while(true) loop). Interestingly this makes the Update Tree operation the same in spirit as the traversal. It is possible to write a traversal that updates the tree in this manner -- but that may become more opaque.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the other part - you're right, we can probably use the new checkArgument to raise the exception; and producing a new random cut is not required

@@ -318,8 +330,15 @@ public Integer addPoint(Integer pointIndex, long sequenceIndex) {
assert (pathToRoot.lastElement()[0] == savedParent);
}

int mergedNode = nodeStore.addNode(pathToRoot, point, sequenceIndex, pointIndex, savedNode, savedDim,
savedCutValue, savedBox);
int childMassIfLeaf = isLeaf(savedNode) ? getLeafMass(savedNode) : 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When is savedNode not a leaf node?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost always - the while(true) will check every level all the way to the root and the cuts closer to node will have a different savedNode

if (point.length != dimension) {
throw new IllegalStateException(" incorrect projection");
}
checkArgument(point.length != dimension, () -> " incorrect projection");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be point.length == dimension?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, changed, the new push should overwrite

@sudiptoguha sudiptoguha changed the title version is 3.6.0-SNAPSHOT version is 3.6.0 Apr 3, 2023
@@ -38,11 +39,19 @@ private CommonUtils() {
* @throws IllegalArgumentException if {@code condition} is false.
*/
public static void checkArgument(boolean condition, String message) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: with the new overloaded checkArgument definition, we could change this implementation to:

public static void checkArgument(boolean condition, String message) {
    checkArgument(condition, () => message);
}

freeNodeManager.releaseIndex(index);
}

public int getMass(int index) {
return (isLeaf(index)) ? getLeafMass(index) : mass[index] != 0 ? mass[index] : (capacity + 1);
return mass[index] != 0 ? mass[index] : (capacity + 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly in a future refactoring, I think it would be helpful to group together all methods that perform mass calculations, possibly into a utility class, and make the mass value only accessible through methods.

@jotok jotok merged commit 3765c20 into aws:main Apr 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants