-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update spatial index memory while building R-Tree #13251
Conversation
b1c2f0e
to
718e21f
Compare
// update memory size when 1000 more leaf nodes are inserted | ||
if (position % 1000 == 999) { | ||
localUserMemoryContext.setBytes(estimatedLeafNodesSize); | ||
} | ||
} | ||
|
||
rtree.build(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about the trailing geometries after the last localUserMemoryContext, and the cost of the built rtree? For the JTS rtree this is quite significant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The overhead of non-leaf node is about 10% of leaf nodes. After SRTree is built, a final size check is already done on all nodes at all levels, which includes trailing geometries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viczhang861 Vic, thanks for making this change. Updating memory usage while building the R-Tree will help avoid JVM OOMs and allow for proper memory cap enforcement. I'm curious how this is done for regular hash table. CC: @nezihyigitbasi
The commit title Monitor spatial index memory usage
is misleading because this change doesn't do anything about the monitoring. I enabled the monitoring logic by starting to send timely updates to memory usage. I'd change the commit message to:
Update memory usage while building R-Tree
presto-main/src/main/java/com/facebook/presto/operator/PagesSpatialIndexSupplier.java
Outdated
Show resolved
Hide resolved
presto-main/src/main/java/com/facebook/presto/operator/PagesSpatialIndexSupplier.java
Outdated
Show resolved
Hide resolved
presto-main/src/main/java/com/facebook/presto/operator/PagesSpatialIndexSupplier.java
Outdated
Show resolved
Hide resolved
estimatedLeafNodesSize = estimatedLeafNodesSize + ENVELOPE_INSTANCE_SIZE + geometryWithPosition.getEstimatedMemorySizeInBytes(); | ||
|
||
// update memory size when 1000 more leaf nodes are inserted | ||
if (position % 1000 == 999) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better to update the memory every so many bytes (or megabytes) rather than every so many objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about 100 MB? for a cluster with 600 nodes, this means > 60GB total user memory change per update.
@viczhang861 Vic, is there a way to test this change in a unit test? CC: @nezihyigitbasi |
718e21f
to
0ed7ee1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this will end up doing what we want in terms of aborting if there isn't enough memory due to how setBytes works.
This is how I had to fix it https://github.com/prestodb/presto/compare/master...aweisberg:fix_spatial_index_accounting?expand=1
@arhimondr Can probably confirm, but basically you have to yield to the driver loop with the memory accounted for as each page is processed in order to get the query to halt if it uses too much memory.
@aweisberg This diff was tested in a real cluster, I can share you query ids offline if you are interested |
I'm also working on the memory accounting for the hash table associated with PagesIndex. It's not as bad because it tends to be smaller, but it's also built in one giant go with no memory accounting. |
@nezihyigitbasi Nezih, could you clarify how memory accounting/limit enforcement works? |
@viczhang861 Can you try 20190806_050942_00000_yrda6 and see how it does? |
I don't understand what this means, but let me explain why this PR is working. SpatialIndexBuilderOperator updates memory when each page is added (query will be killed if memory budget exceeds at this step) or when R-Tree build is done (query could also be killed at this step). However, memory is not updated when R-Tree is built. Memory will increase significantly when data is deserialized and geometry objects are created by JTS library. We don't need to worry about setBytes, which is already doing the right thing, otherwise, query will not be killed in above mentioned steps. For this reason, unit test on behaviors of calling setBytes is not the scope of this PR. |
@aweisberg This is exactly the query I tested. Query id after fixing: 20190819_051830_00005_hpdfs |
I don't follow why this works. Doesn't mean that it doesn't though. That query builds a 180+ gigabyte tree on one node that causes an OOM. Calling setBytes won't fix that because it should still go ahead and build the entire tree and OOM the worker. Even if the coordinator detects the query exceeding the limits and returns an error to the client saying it killed the query (is killing asynchronous?) we still want to avoid OOMing the worker. The coordinator can't kill the query as it is building the tree. The operator is in a loop that won't be interrupted as far as I can tell. I checked the logs and couldn't really tell if the workers had OOMed. There are eventually a spate of worker restarts but exit code is 0 and there is no message. This is not typical of what I have seen for OOMs, but maybe this cluster was configured differently? If it does work, great! I just want to make sure we aren't missing something because it's an unexpected result to me. |
@aweisberg I monitored heap memory usage on worker nodes before and after fix, the worker nodes didn't crash after fix. I assume that is the result of killing the query due to "Exceeded global user memory limit of ***" |
@aweisberg Another good signal is that worker didn't restart after the fix. |
GeometryWithPosition geometryWithPosition = new GeometryWithPosition(ogcGeometry, partition, position); | ||
rtree.insert(getJtsEnvelope(ogcGeometry, radius), geometryWithPosition); | ||
|
||
estimatedMemorySizeInBytes += ENVELOPE_INSTANCE_SIZE + geometryWithPosition.getEstimatedMemorySizeInBytes(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an array of references collecting the geometries inside the rtree. It's not 100% accurate, but can you add size of an object reference each time as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you be more specific about which array ? I only include leaf nodes into consideration, the total size will be off by 11%.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason not to be as precise as we can?
The rtree before you invoke build is wrapping what you provide in an ItemBoundable and stashing it in a big array which it then discards when the rtree is built.
This is a slightly more accurate accounting https://github.com/aweisberg/presto/blob/fix_spatial_index_accounting/presto-main/src/main/java/com/facebook/presto/operator/PagesSpatialIndexBuilder.java#L143
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
11% more nodes are generated when insertion is done. I will add ItemBoundable. What is the 8-bytes array? reference to new element in ItemBoundable array?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What file/line in the rtree library generates more nodes on insertion? When I looked at the rtree code it just wraps it and stores it in an array.
Do you mean it uses more after it builds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean it uses more after it builds?
Yes, when insertion is done, parent nodes will be created after final build
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
// update memory size when estimated change is more than 100 MB | ||
if (estimatedMemorySizeInBytes - lastUpdatedMemorySizeInBytes >= 100_000_000) { | ||
localUserMemoryContext.setBytes(estimatedMemorySizeInBytes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is dropping the memory used by the PagesIndex itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I included PagesIndex in
0ed7ee1
to
687d496
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viczhang861 Looks good overall % some comments.
presto-main/src/main/java/com/facebook/presto/operator/PagesSpatialIndexSupplier.java
Outdated
Show resolved
Hide resolved
presto-main/src/main/java/com/facebook/presto/operator/PagesSpatialIndexSupplier.java
Outdated
Show resolved
Hide resolved
GeometryWithPosition geometryWithPosition = new GeometryWithPosition(ogcGeometry, partition, position); | ||
rtree.insert(getJtsEnvelope(ogcGeometry, radius), geometryWithPosition); | ||
|
||
// 8-bytes for reference to the new element in itemBoundables array of STRtree |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an implementation detail of the STRtree and therefore shouldn't be used here. How about just tracking size of the envelope and geometry?
private static final int MEMORY_USAGE_UPDATE_INCREMENT_BYTES = 100 * 1024 * 1024; // 100 MB
addedSizeInBytes += ENVELOPE_INSTANCE_SIZE + geometryWithPosition.getEstimatedMemorySizeInBytes();
if (addedSizeInBytes >= MEMORY_USAGE_UPDATE_INCREMENT_BYTES) {
localUserMemoryContext.setBytes(recordedSizeInBytes + addedSizeInBytes);
recordedSizeInBytes += addedSizeInBytes;
addedSizeInBytes = 0;
}
presto-main/src/main/java/com/facebook/presto/operator/PagesSpatialIndexSupplier.java
Outdated
Show resolved
Hide resolved
687d496
to
e107042
Compare
e107042
to
f4367da
Compare
@mbasmanova Thanks for review, all comments addressed and rebased to be compatible with FlatBush Rtree |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viczhang861 Looks good to me.
Agree with @aweisberg , an Operator should return control to the engine between memory usage updates to ensure the memory manager works. No idea how it works. @nezihyigitbasi , you have the most expertise in the memory management. Your input here will be greatly appreciated. |
Update memory usage during construction of rtree
Before this change: large spatial query fail with internal communication error
After this change : same query failed with message "Query exceeded distributed user memory limit of 5TB"