Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-17338: [Java] The maximum request memory of BaseVariableWidthVector should limit to Integer.MAX_VALUE #13815

Merged
merged 4 commits into from
Aug 17, 2022
Merged
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -46,6 +46,7 @@ public abstract class BaseVariableWidthVector extends BaseValueVector
implements VariableWidthVector, FieldVector, VectorDefinitionSetter {
private static final int DEFAULT_RECORD_BYTE_COUNT = 8;
private static final int INITIAL_BYTE_COUNT = INITIAL_VALUE_ALLOCATION * DEFAULT_RECORD_BYTE_COUNT;
private static final int MAX_BUFFER_SIZE = (int) Math.min(MAX_ALLOCATION_SIZE, Integer.MAX_VALUE);
private int lastValueCapacity;
private long lastValueAllocationSizeInBytes;

Expand Down Expand Up @@ -430,7 +431,7 @@ public void allocateNew(int valueCount) {

/* Check if the data buffer size is within bounds. */
private void checkDataBufferSize(long size) {
if (size > MAX_ALLOCATION_SIZE || size < 0) {
if (size > MAX_BUFFER_SIZE || size < 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I assume the check for negative size values is for overflows, but even so, the appearance of a negative in the error message text below could be misleading as a literal reading would say that a negative number is more than the max allowed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small update the error message for the overflow case.

throw new OversizedAllocationException("Memory required for vector " +
" is (" + size + "), which is more than max allowed (" + MAX_ALLOCATION_SIZE + ")");
}
Expand All @@ -445,7 +446,7 @@ private long computeAndCheckOffsetsBufferSize(int valueCount) {
* an additional slot in offset buffer.
*/
final long size = computeCombinedBufferSize(valueCount + 1, OFFSET_WIDTH);
if (size > MAX_ALLOCATION_SIZE) {
if (size > MAX_BUFFER_SIZE) {
throw new OversizedAllocationException("Memory required for vector capacity " +
valueCount +
" is (" + size + "), which is more than max allowed (" + MAX_ALLOCATION_SIZE + ")");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
" is (" + size + "), which is more than max allowed (" + MAX_ALLOCATION_SIZE + ")");
" is (" + size + "), which is more than max allowed (" + MAX_BUFFER_SIZE + ")");

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether the exception messages should point users to LargeVar*Vectors when exceeding buffer capacity. Based on my experiences with RDBMS, I expected LargeVarCharVector to be suitable for storing large values, and missed that it is needed for many small values as well. I'm not sure how well-understood this is, and perhaps users would benefit from being pointed in an appropriate direction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. I think the error message should give more information about how to solve the problem. Updated the message to point to the LargeVar*Vectors.

Expand Down Expand Up @@ -1240,7 +1241,7 @@ protected final void handleSafe(int index, int dataLength) {
* So even though we may have setup an initial capacity of 1024
* elements in the vector, it is quite possible
* that we need to reAlloc() the data buffer when we are setting
* the 5th element in the vector simply because previous
* the 1025th element in the vector simply because previous
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I don't think this change is right. You should read this example as:

  • the binary/string vector is 1024 elements long
  • the 4 first binary/string elements already occupy 1024 bytes in the data buffer, so need to resize the data buffer as soon as the 5th binary/string element is appended

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed explanation. Reverted the change.

* variable length elements have exhausted the buffer capacity.
* However, we really don't need to reAlloc() validity and
* offset buffers until we try to set the 1025th element
Expand All @@ -1250,8 +1251,10 @@ protected final void handleSafe(int index, int dataLength) {
while (index >= getValueCapacity()) {
reallocValidityAndOffsetBuffers();
}
final int startOffset = lastSet < 0 ? 0 : getStartOffset(lastSet + 1);
while (valueBuffer.capacity() < (startOffset + dataLength)) {
final long startOffset = lastSet < 0 ? 0 : getStartOffset(lastSet + 1);
final long targetCapacity = startOffset + dataLength;
checkDataBufferSize(targetCapacity);
while (valueBuffer.capacity() < targetCapacity) {
reallocDataBuffer();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly unrelated, but why is this using a while loop? Ideally it would be more efficient to write:

Suggested change
while (valueBuffer.capacity() < targetCapacity) {
reallocDataBuffer();
}
reallocDataBuffer(targetCapacity);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the reallocation may not meet the memory request if the dataLength is larger than two times of valueBuffer.capacity(). That I think the while loop is needed here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. My point is that it's wasteful to reallocate several times in a row, instead of reallocating directly to the desired target capacity. Anyway, this was already the case before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got your point. And updated the code. It should be the right way.

}
Expand Down