
ARROW-17468: [C++] Validation for RLE arrays #13916

Closed: wants to merge 368 commits into from

Conversation
@zagto zagto commented Aug 18, 2022

No description provided.


felixonmars and others added 27 commits October 7, 2022 22:59
Signed-off-by: Felix Yan <[email protected]>

Lead-authored-by: Yibo Cai <[email protected]>
Co-authored-by: Felix Yan <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
* Update the CUDA runtime version as CUDA 9.1 images are not available anymore
* Fix passing child command arguments to "docker run"

Checked locally on an Ubuntu 20.04 host with:
```
UBUNTU=18.04 archery --debug docker run ubuntu-cuda-cpp
UBUNTU=20.04 archery --debug docker run ubuntu-cuda-cpp
```

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…ead_metadata (apache#13629)

Add `filesystem` support to `pq.read_metadata` and `pq.read_schema`.

Lead-authored-by: kshitij12345 <[email protected]>
Co-authored-by: Kshiteej K <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
…pache#13899)

Checked locally on an Ubuntu 20.04 host with:
```
archery docker run ubuntu-cuda-python
```

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
apache#13821)

Will fix [ARROW-13763](https://issues.apache.org/jira/browse/ARROW-13763)

A separate Jira issue will be made to address closing files in V2 ParquetDataset, which needs to be handled in the C++ layer. 

Adds a context manager to `pq.ParquetFile` to close the input file, and ensures reads within `pq.ParquetDataset` and `pq.read_table` are closed.

```python

# user opened file-like object will not be closed
with open('file.parquet', 'rb') as f:
    with pq.ParquetFile(f) as p:
        table = p.read()
        assert not f.closed  # did not inadvertently close the open file
        assert not p.closed
    assert not f.closed      # ParquetFile context exit didn't close it
    assert not p.closed      # reflects the status of the input file
assert f.closed              # closed by the outer `open` context exit
assert p.closed              # follows, since p wraps f

# path-like will be closed upon exit or `ParquetFile.close`
with pq.ParquetFile('file.parquet') as p:
    table = p.read()
    assert not p.closed
assert p.closed
```

Authored-by: Miles Granger <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
See https://issues.apache.org/jira/browse/ARROW-17289

Lead-authored-by: Yaron Gvili <[email protected]>
Co-authored-by: rtpsw <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…ctor should limit to Integer.MAX_VALUE (apache#13815)

We got an IndexOutOfBoundsException:
```
2022-08-03 09:33:34,076 Error executing query, currentState RUNNING, java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3315 in stage 5.0 failed 4 times, most recent failure: Lost task 3315.3 in stage 5.0 (TID 3926) (30.97.116.209 executor 49): java.lang.IndexOutOfBoundsException: index: 2147312542, length: 777713 (expected: range(0, 2147483648))
	at org.apache.iceberg.shaded.org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:699)
	at org.apache.iceberg.shaded.org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:826)
	at org.apache.iceberg.arrow.vectorized.parquet.VectorizedParquetDefinitionLevelReader$VarWidthReader.nextVal(VectorizedParquetDefinitionLevelReader.java:418)
	at org.apache.iceberg.arrow.vectorized.parquet.VectorizedParquetDefinitionLevelReader$BaseReader.nextBatch(VectorizedParquetDefinitionLevelReader.java:235)
	at org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator$VarWidthTypePageReader.nextVal(VectorizedPageIterator.java:353)
	at org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator$BagePageReader.nextBatch(VectorizedPageIterator.java:161)
	at org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator$VarWidthTypeBatchReader.nextBatchOf(VectorizedColumnIterator.java:191)
	at org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator$BatchReader.nextBatch(VectorizedColumnIterator.java:74)
	at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.read(VectorizedArrowReader.java:158)
	at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:51)
	at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:35)
	at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:134)
	at org.apache.iceberg.spark.source.BaseDataReader.next(BaseDataReader.java:98)
	at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79)
	at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
```

The root cause is that the following code in `BaseVariableWidthVector.handleSafe` can fail to reallocate because of int overflow, which then leads to an `IndexOutOfBoundsException` when we put the data into the vector.

```java
  protected final void handleSafe(int index, int dataLength) {
    while (index >= getValueCapacity()) {
      reallocValidityAndOffsetBuffers();
    }
    final int startOffset = lastSet < 0 ? 0 : getStartOffset(lastSet + 1);
    // startOffset + dataLength could overflow
    while (valueBuffer.capacity() < (startOffset + dataLength)) {
      reallocDataBuffer();
    }
  }
```

The offset width of `BaseVariableWidthVector` is 4 bytes, while the maximum memory allocation is `Long.MAX_VALUE`. This makes the `int`-based capacity check invalid.
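The overflow is easy to reproduce outside Java. A minimal sketch simulating Java's wrapping 32-bit signed addition, using the numbers from the stack trace above:

```python
def add_int32(a, b):
    """Simulate Java's wrapping 32-bit signed integer addition."""
    s = (a + b) & 0xFFFFFFFF
    return s - 2**32 if s >= 2**31 else s

start_offset = 2_147_312_542  # offset near the int32 limit (from the stack trace)
data_length = 777_713

# In Java, startOffset + dataLength silently wraps to a negative value...
assert add_int32(start_offset, data_length) < 0
# ...so `valueBuffer.capacity() < (startOffset + dataLength)` is false,
# reallocDataBuffer() is never called, and the write later fails with
# IndexOutOfBoundsException.
```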


Authored-by: xianyangliu <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Currently, Java JNI builds on Github Actions can take one hour due to a very long Arrow C++ build phase
(example: https://github.com/apache/arrow/runs/7881918943?check_suite_focus=true#step:6:3512).

Disable unused Arrow C++ components so as to make the C++ build faster.

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…nt updates (apache#13769)

Building on apache#12157

Lead-authored-by: Jacob Wujciak-Jens <[email protected]>
Co-authored-by: Jonathan Keane <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
Add a `--validate` option to `archery crossbow status`.
If `--validate` is specified and there are any missing artifacts, `archery crossbow status --validate` exits
with a non-zero exit code. We can use this in CI to detect missing artifacts.

We can't use `@github-actions crossbow submit` for this change because it isn't merged into the master branch
yet. See https://github.com/ursacomputing/crossbow/branches/all?query=build-674, where `nightly-packages`
was submitted manually.

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…and *-glib-devel should have .gir (apache#13876)

The current configuration is inverted.
*-glib-libs have .gir and *-glib-devel have .typelib.

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…pache#13910)

Looking at pkg.go.dev, I noticed that there really isn't anyone using the existing `compute` module, which makes sense since it isn't really finished and currently provides only limited utility.

This change marks the `compute` module as a separate sub-module inside the `arrow` module, allowing us to use `go1.18` in this new code without forcing anyone who *isn't* using the compute module to upgrade. That way I can leverage generics when writing the new compute code where appropriate.

Authored-by: Matt Topol <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
… GetSchema (apache#13898)

Consistently implements and tests the GetSchema method in Flight SQL.

Builds on apache#13897.

Authored-by: David Li <[email protected]>
Signed-off-by: David Li <[email protected]>
…lization for `LocalFileSystem` (apache#13796)

Introduce a specialization of `GetFileInfoGenerator` in the `LocalFileSystem` class.

This implementation tries to improve performance by hiding latencies at two levels:
1. Child directories can be read ahead, so that listing directory entries from disk proceeds in parallel with other work;
2. Directory entries can be `stat`'ed and yielded in chunks, so that the `FileInfoGenerator` consumer can start receiving entries before a large directory is fully processed.

Both mechanisms can be tuned using dedicated parameters in `LocalFileSystemOptions`.

Signed-off-by: Pavel Solodovnikov <[email protected]>
Co-Authored-by: Igor Seliverstov <[email protected]>

Lead-authored-by: Pavel Solodovnikov <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
It looks like the entries in the truth tables were copy-pasted and the
_results_ were updated to match the function, but not the operator.

Authored-by: Gil Forsyth <[email protected]>
Signed-off-by: Yibo Cai <[email protected]>
…pache#13906)

Typical real-life Arrow datasets contain List-type vectors of primitive types. This PR introduces a ListBinder that maps lists of primitive types to java.sql.Types.ARRAY.

Lead-authored-by: Igor Suhorukov <[email protected]>
Co-authored-by: igor.suhorukov <[email protected]>
Signed-off-by: David Li <[email protected]>
…rdered after adding duplicated fields (apache#13321)

Authored-by: Hongze Zhang <[email protected]>
Signed-off-by: David Li <[email protected]>
This PR aims to upgrade ORC to version 1.7.6.

Apache ORC 1.7.6 is the most recent maintenance release with the following bug fixes.

- https://github.com/apache/orc/releases/tag/v1.7.6
- https://orc.apache.org/news/2022/08/17/ORC-1.7.6/ 

Authored-by: William Hyun <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…datafusion-c (apache#13923)

The binary uploaders are dev/release/05-binary-upload.sh and
dev/release/post-02-binary.sh. We need to customize the .deb package
name.

This also adds missing environment variable entries.

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
As part of building the Compute functionality in Go for Arrow, this is the implementation of ArraySpan / ExecValue / ExecResult, etc.

It could be separated out from the function interface definitions, so I was able to open this PR while apache#13924 is still being reviewed.

Authored-by: Matt Topol <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
lidavidm and others added 14 commits October 7, 2022 23:00
…apache#14210)

I couldn't reproduce it, so I added a suppression instead. 

In both cases, the error is that the server is uncontactable. That shouldn't happen, but I changed the tests to also bind to port 0 instead of using a potentially flaky free port finder.
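The port-0 trick avoids the inherent race in "find a free port, then bind to it" helpers: the OS assigns a free port atomically at bind time. A Python sketch of the idea (not the actual test code):

```python
import socket


def bind_ephemeral():
    """Bind to port 0 so the OS picks a free port atomically,
    then report which port was actually assigned."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", 0))
    port = sock.getsockname()[1]  # the OS-assigned port
    return sock, port
```

A free-port finder that binds, reads the port, and closes the socket leaves a window where another process can grab the port before the server binds it; binding to 0 and keeping the socket closes that window.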

Authored-by: David Li <[email protected]>
Signed-off-by: David Li <[email protected]>
This is a follow-up of apache#14204.

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: David Li <[email protected]>
…tructor with new (apache#14216)

Advantages: readability, exception safety, and efficiency (only for shared_ptr).

Cases where this doesn't apply: when calling a private/protected constructor within a class member function, make_shared/make_unique can't work.

Authored-by: Jin Shang <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…pache#14228)

Temporarily pin LLVM version on Appveyor due to a bug in Conda's packaging of LLVM.

Authored-by: Jin Shang <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
)

This is a follow-up of apache#14216. We can't use std::make_shared for CUDA related classes because their constructors aren't public.

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…th output buffer (apache#14230)

When the output type of an expression is of variable length, e.g. string, Gandiva reallocs the output buffer to make space for new outputs for each row. When the number of rows is high, some memory allocators perform poorly.

We can use a std::vector-like approach to amortize the allocation cost: first allocate some initial space depending on the input size; each time we run out of space, double the buffer size; in the end, shrink it to fit the actual size. The Arrow string builder also uses this approach.

Authored-by: Jin Shang <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
… OpenTelemetry propagation (apache#11920)

Adds a client middleware that sends span/trace ID to the server, and a server middleware that gets the span/trace ID and starts a child span.

The middleware are available in builds without OpenTelemetry; they simply do nothing.
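The propagation pattern itself is simple: the client attaches the trace context to the call headers, and the server reads it back. A hedged sketch of the idea in Python (the header name and both middleware classes here are illustrative, not the actual Arrow Flight API):

```python
# Hypothetical header key; real implementations use the W3C
# "traceparent" header via the OpenTelemetry propagators.
TRACE_HEADER = "x-trace-id"


class ClientTracingMiddleware:
    """Attaches the current span/trace ID to outgoing call headers."""

    def __init__(self, trace_id):
        self.trace_id = trace_id

    def sending_headers(self):
        return {TRACE_HEADER: self.trace_id}


class ServerTracingMiddleware:
    """Reads the propagated trace ID and starts a child span under it."""

    def __init__(self):
        self.child_spans = []

    def received_headers(self, headers):
        trace_id = headers.get(TRACE_HEADER)
        if trace_id is not None:
            # In the real middleware this starts an OpenTelemetry
            # child span; here we just record the propagated ID.
            self.child_spans.append(trace_id)
```

When OpenTelemetry is compiled out, both sides degrade to no-ops, which matches the behavior described above.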

Authored-by: David Li <[email protected]>
Signed-off-by: David Li <[email protected]>
Use `File.deleteOnExit` to delete the JNI lib file on JVM exit. `File.deleteOnExit` actually adds a shutdown hook to make sure the file is deleted.

Authored-by: jackylee-ch <[email protected]>
Signed-off-by: David Li <[email protected]>
zeroshade pushed a commit that referenced this pull request Feb 17, 2023
This PR gathers work from multiple PRs that can be closed after this one is merged:

 - Closes #13752
 - Closes #13754
 - Closes #13842
 - Closes #13882
 - Closes #13916
 - Closes #14063
 - Closes #13970

And the issues associated with those PRs can also be closed:

 - Fixes #20350
 - Add RunEndEncodedScalarType
 - Fixes #32543
 - Fixes #32544
 - Fixes #32688
 - Fixes #32731
 - Fixes #32772
 - Fixes #32774

* Closes: #32104

Lead-authored-by: Felipe Oliveira Carvalho <[email protected]>
Co-authored-by: Tobias Zagorni <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
gringasalpastor pushed a commit to gringasalpastor/arrow that referenced this pull request Feb 17, 2023
…pache#33641)

fatemehp pushed a commit to fatemehp/arrow that referenced this pull request Feb 24, 2023
…pache#33641)
