ARROW-17338: [Java] The maximum request memory of BaseVariableWidthVector should limit to Integer.MAX_VALUE #13815
Conversation
Hi @pitrou, could you help to review this when you are free? Thanks a lot.
Would it be easy to add a unit test for this (without consuming 2GB RAM)?
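A minimal sketch of one way such a test might look, assuming the size check in `allocateNew` fires before any memory is actually reserved (this is an illustration, not necessarily the test that was added to the PR):

```java
import static org.junit.jupiter.api.Assertions.assertThrows;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.util.OversizedAllocationException;
import org.junit.jupiter.api.Test;

class VarCharVectorSizeLimitTest {

  @Test
  void allocationBeyondIntMaxIsRejected() {
    try (BufferAllocator allocator = new RootAllocator();
         VarCharVector vector = new VarCharVector("v", allocator)) {
      // Requesting a data buffer larger than Integer.MAX_VALUE bytes should be
      // rejected up front, before the allocator ever tries to reserve ~2 GiB.
      assertThrows(OversizedAllocationException.class,
          () -> vector.allocateNew((long) Integer.MAX_VALUE + 1, 1));
    }
  }
}
```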
@@ -1240,7 +1241,7 @@ protected final void handleSafe(int index, int dataLength) {
     * So even though we may have setup an initial capacity of 1024
     * elements in the vector, it is quite possible
     * that we need to reAlloc() the data buffer when we are setting
-    * the 5th element in the vector simply because previous
+    * the 1025th element in the vector simply because previous
No, I don't think this change is right. You should read this example as:
- the binary/string vector is 1024 elements long
- the first 4 binary/string elements already occupy 1024 bytes in the data buffer, so the data buffer needs to be resized as soon as the 5th binary/string element is appended (see the sketch below)
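To make the scenario concrete, here is a small sketch (the numbers are chosen for illustration, assuming a plain VarCharVector with a 1024-byte initial data buffer):

```java
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VarCharVector;

public class ReallocExample {
  public static void main(String[] args) {
    try (BufferAllocator allocator = new RootAllocator();
         VarCharVector vector = new VarCharVector("v", allocator)) {
      // Room for 1024 elements, but only 1024 bytes of data buffer.
      vector.allocateNew(1024, 1024);
      byte[] value = new byte[256];
      for (int i = 0; i < 5; i++) {
        // The first 4 values (4 * 256 bytes) fill the data buffer completely,
        // so writing the 5th value forces a reAlloc() of the data buffer even
        // though the vector still has room for ~1020 more elements.
        vector.setSafe(i, value);
      }
      vector.setValueCount(5);
    }
  }
}
```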
Thanks for the detailed explanation. Reverted the change.
while (valueBuffer.capacity() < targetCapacity) {
  reallocDataBuffer();
}
Slightly unrelated, but why is this using a `while` loop? Ideally it would be more efficient to write:
-    while (valueBuffer.capacity() < targetCapacity) {
-      reallocDataBuffer();
-    }
+    reallocDataBuffer(targetCapacity);
I think the reallocation may not meet the memory request if the `dataLength` is larger than two times `valueBuffer.capacity()`. That is why I think the while loop is needed here.
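To illustrate that point with made-up numbers, assuming each `reallocDataBuffer()` call roughly doubles the buffer:

```java
public class DoublingExample {
  public static void main(String[] args) {
    // Hypothetical numbers: with a 1 KiB data buffer, appending a 5 KiB value
    // needs three doublings (1 KiB -> 2 KiB -> 4 KiB -> 8 KiB), so a single
    // reallocDataBuffer() call would not reach the required capacity.
    long capacity = 1024;
    final long targetCapacity = 5 * 1024;
    int doublings = 0;
    while (capacity < targetCapacity) {
      capacity *= 2;   // roughly what each reallocDataBuffer() call does
      doublings++;
    }
    System.out.println(doublings + " doublings, final capacity " + capacity); // 3, 8192
  }
}
```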
Yes. My point is that it's wasteful to reallocate several times in a row, instead of reallocating directly to the desired target capacity. Anyway, this was already the case before.
Got your point, and updated the code. It should be the right way now.
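For reference, a sketch of the shape the updated logic could take, assuming a `reallocDataBuffer` overload that accepts the desired capacity (as in the suggestion above); this is an outline, not necessarily the exact merged code:

```java
protected final void handleSafe(int index, int dataLength) {
  while (index >= getValueCapacity()) {
    reallocValidityAndOffsetBuffers();
  }
  final int startOffset = lastSet < 0 ? 0 : getStartOffset(lastSet + 1);
  // Widen to long so startOffset + dataLength cannot overflow int.
  final long targetCapacity = (long) startOffset + dataLength;
  if (valueBuffer.capacity() < targetCapacity) {
    reallocDataBuffer(targetCapacity); // assumed overload taking the desired capacity
  }
}
```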
Also cc @lwhite1 for review / opinions.
Is this modification consistent with the goals of https://issues.apache.org/jira/browse/ARROW-6112?
I think so. Regular binary/string types have 32-bit signed offsets so cannot handle more than a 2GB data buffer by construction.
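A small illustration of why 32-bit offsets impose that cap (the concrete numbers here are made up):

```java
public class OffsetLayoutExample {
  public static void main(String[] args) {
    // In a regular VarChar/VarBinary vector, element i occupies bytes
    // [offsets[i], offsets[i + 1]) of the data buffer, and the offsets are
    // signed 32-bit ints. The last offset equals the total number of data
    // bytes, so the data buffer can never address more than
    // Integer.MAX_VALUE (~2 GiB - 1) bytes.
    int[] offsets = {0, 3, 8, 8, 20}; // offsets for 4 variable-width values
    int totalDataBytes = offsets[offsets.length - 1];
    System.out.println("total data bytes = " + totalDataBytes
        + ", hard upper bound = " + Integer.MAX_VALUE);
  }
}
```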
LGTM, but I'd like @lwhite1's validation.
@@ -445,7 +446,7 @@ private long computeAndCheckOffsetsBufferSize(int valueCount) {
      * an additional slot in offset buffer.
      */
     final long size = computeCombinedBufferSize(valueCount + 1, OFFSET_WIDTH);
-    if (size > MAX_ALLOCATION_SIZE) {
+    if (size > MAX_BUFFER_SIZE) {
       throw new OversizedAllocationException("Memory required for vector capacity " +
           valueCount +
           " is (" + size + "), which is more than max allowed (" + MAX_ALLOCATION_SIZE + ")");
" is (" + size + "), which is more than max allowed (" + MAX_ALLOCATION_SIZE + ")"); | |
" is (" + size + "), which is more than max allowed (" + MAX_BUFFER_SIZE + ")"); |
I wonder whether the exception messages should point users to LargeVar*Vectors when exceeding buffer capacity. Based on my experiences with RDBMS, I expected LargeVarCharVector to be suitable for storing large values, and missed that it is needed for many small values as well. I'm not sure how well-understood this is, and perhaps users would benefit from being pointed in an appropriate direction.
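For readers who hit that limit, the alternative looks roughly like this (a minimal sketch; LargeVarCharVector and LargeVarBinaryVector use 64-bit offsets, so their data buffers are not capped at Integer.MAX_VALUE bytes):

```java
import java.nio.charset.StandardCharsets;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.LargeVarCharVector;

public class LargeVarCharExample {
  public static void main(String[] args) {
    try (BufferAllocator allocator = new RootAllocator();
         LargeVarCharVector vector = new LargeVarCharVector("v", allocator)) {
      // Same API shape as VarCharVector, but 8-byte offsets allow the data
      // buffer to grow past the 2 GiB limit of the regular vector.
      vector.allocateNew();
      vector.setSafe(0, "hello".getBytes(StandardCharsets.UTF_8));
      vector.setValueCount(1);
    }
  }
}
```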
Agreed. I think the error message should give more information about how to solve the problem. I updated the message to point to the LargeVar*Vectors.
I think suggestions by @toddfarmer and @pitrou may be worth implementing, but overall it looks good.
@@ -430,7 +431,7 @@ public void allocateNew(int valueCount) {

   /* Check if the data buffer size is within bounds. */
   private void checkDataBufferSize(long size) {
-    if (size > MAX_ALLOCATION_SIZE || size < 0) {
+    if (size > MAX_BUFFER_SIZE || size < 0) {
Nit: I assume the check for negative size values is for overflows, but even so, the appearance of a negative value in the error message text below could be misleading, as a literal reading would say that a negative number is more than the max allowed.
Made a small update to the error message for the overflow case.
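A sketch of the kind of message split being discussed, with hypothetical wording (not necessarily the exact text that was merged):

```java
/* Check if the data buffer size is within bounds; a negative value indicates
 * that the requested size overflowed. */
private void checkDataBufferSize(long size) {
  if (size > MAX_BUFFER_SIZE || size < 0) {
    throw new OversizedAllocationException("Memory required for vector is (" + size +
        "), which overflowed or exceeds the max allowed (" + MAX_BUFFER_SIZE + "). " +
        "Consider using LargeVarCharVector/LargeVarBinaryVector for larger data.");
  }
}
```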
Sorry for the delay @ConeyLiu. Given the several +1's, I'm gonna merge if/when CI is green.
Benchmark runs are scheduled for baseline = f0688d0 and contender = 4fa4007. 4fa4007 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
['Python', 'R'] benchmarks have a high level of regressions.
Thanks @pitrou for merging this, and also thanks to everyone for taking the time to review.
…ctor should limit to Integer.MAX_VALUE (apache#13815)

We got an IndexOutOfBoundsException:

```
2022-08-03 09:33:34,076 Error executing query, currentState RUNNING, java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3315 in stage 5.0 failed 4 times, most recent failure: Lost task 3315.3 in stage 5.0 (TID 3926) (30.97.116.209 executor 49): java.lang.IndexOutOfBoundsException: index: 2147312542, length: 777713 (expected: range(0, 2147483648))
    at org.apache.iceberg.shaded.org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:699)
    at org.apache.iceberg.shaded.org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:826)
    at org.apache.iceberg.arrow.vectorized.parquet.VectorizedParquetDefinitionLevelReader$VarWidthReader.nextVal(VectorizedParquetDefinitionLevelReader.java:418)
    at org.apache.iceberg.arrow.vectorized.parquet.VectorizedParquetDefinitionLevelReader$BaseReader.nextBatch(VectorizedParquetDefinitionLevelReader.java:235)
    at org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator$VarWidthTypePageReader.nextVal(VectorizedPageIterator.java:353)
    at org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator$BagePageReader.nextBatch(VectorizedPageIterator.java:161)
    at org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator$VarWidthTypeBatchReader.nextBatchOf(VectorizedColumnIterator.java:191)
    at org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator$BatchReader.nextBatch(VectorizedColumnIterator.java:74)
    at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.read(VectorizedArrowReader.java:158)
    at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:51)
    at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:35)
    at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:134)
    at org.apache.iceberg.spark.source.BaseDataReader.next(BaseDataReader.java:98)
    at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79)
    at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
```

The root cause is that the following code in `BaseVariableWidthVector.handleSafe` could fail to reallocate because of int overflow, which then led to the `IndexOutOfBoundsException` when we put the data into the vector:

```java
protected final void handleSafe(int index, int dataLength) {
  while (index >= getValueCapacity()) {
    reallocValidityAndOffsetBuffers();
  }
  final int startOffset = lastSet < 0 ? 0 : getStartOffset(lastSet + 1);
  // startOffset + dataLength could overflow
  while (valueBuffer.capacity() < (startOffset + dataLength)) {
    reallocDataBuffer();
  }
}
```

The offset width of `BaseVariableWidthVector` is 4, while the maximum memory allocation is Long.MAX_VALUE. This makes the memory allocation check invalid.

Authored-by: xianyangliu <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
While working on apache/spark#39572 to support the large variable width vectors in Spark, I think I found that this PR effectively limits these regular width variable vectors to 1 GiB total. While it was definitely a bug how things were handled before this PR, now whenever you try to add data beyond 1 GiB, the vector will try to double itself to the next power of two, which would be 2 GiB and thus exceed the Integer.MAX_VALUE limit.
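A back-of-the-envelope illustration of that concern (the capacities here are hypothetical):

```java
public class DoublingLimitExample {
  public static void main(String[] args) {
    // Once the data buffer has grown to 1 GiB, the next power-of-two doubling
    // asks for 2 GiB, which is one byte more than Integer.MAX_VALUE, so the
    // reallocation is rejected even though the write that triggered it may be
    // tiny compared to 2 GiB.
    long currentCapacity = 1L << 30;            // 1 GiB
    long doubled = currentCapacity * 2;         // 2 GiB
    System.out.println(doubled > Integer.MAX_VALUE); // true -> allocation rejected
  }
}
```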