ARROW-17468: [C++] Validation for RLE arrays #13916
Closed

No description provided.

Conversation
Signed-off-by: Felix Yan <[email protected]> Lead-authored-by: Yibo Cai <[email protected]> Co-authored-by: Felix Yan <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
* Update the CUDA runtime version as CUDA 9.1 images are not available anymore
* Fix passing child command arguments to "docker run"

Checked locally on an Ubuntu 20.04 host with:
```
UBUNTU=18.04 archery --debug docker run ubuntu-cuda-cpp
UBUNTU=20.04 archery --debug docker run ubuntu-cuda-cpp
```
Authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…ead_metadata (apache#13629) Add `filesystem` support to `pq.read_metadata` and `pq.read_schema`. Lead-authored-by: kshitij12345 <[email protected]> Co-authored-by: Kshiteej K <[email protected]> Co-authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: Joris Van den Bossche <[email protected]>
…pache#13899) Checked locally on an Ubuntu 20.04 host with:
```
archery docker run ubuntu-cuda-python
```
Authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
apache#13821) Will fix [ARROW-13763](https://issues.apache.org/jira/browse/ARROW-13763). A separate Jira issue will be made to address closing files in the V2 ParquetDataset, which needs to be handled in the C++ layer. Adds a context manager to `pq.ParquetFile` to close the input file, and ensures reads within `pq.ParquetDataset` and `pq.read_table` are closed.
```python
import pyarrow.parquet as pq

# user opened file-like object will not be closed
with open('file.parquet', 'rb') as f:
    with pq.ParquetFile(f) as p:
        table = p.read()
        assert not f.closed  # did not inadvertently close the open file
        assert not p.closed
    assert not f.closed  # parquet context exit didn't close it
    assert not p.closed  # references the input file status
assert f.closed  # normal context exit close
assert p.closed

# ...

# path-like will be closed upon exit or `ParquetFile.close`
with pq.ParquetFile('file.parquet') as p:
    table = p.read()
    assert not p.closed
assert p.closed
```
Authored-by: Miles Granger <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
See https://issues.apache.org/jira/browse/ARROW-17289 Lead-authored-by: Yaron Gvili <[email protected]> Co-authored-by: rtpsw <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…ctor should limit to Integer.MAX_VALUE (apache#13815) We got an IndexOutOfBoundsException:
```
2022-08-03 09:33:34,076 Error executing query, currentState RUNNING, java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3315 in stage 5.0 failed 4 times, most recent failure: Lost task 3315.3 in stage 5.0 (TID 3926) (30.97.116.209 executor 49): java.lang.IndexOutOfBoundsException: index: 2147312542, length: 777713 (expected: range(0, 2147483648))
  at org.apache.iceberg.shaded.org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:699)
  at org.apache.iceberg.shaded.org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:826)
  at org.apache.iceberg.arrow.vectorized.parquet.VectorizedParquetDefinitionLevelReader$VarWidthReader.nextVal(VectorizedParquetDefinitionLevelReader.java:418)
  at org.apache.iceberg.arrow.vectorized.parquet.VectorizedParquetDefinitionLevelReader$BaseReader.nextBatch(VectorizedParquetDefinitionLevelReader.java:235)
  at org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator$VarWidthTypePageReader.nextVal(VectorizedPageIterator.java:353)
  at org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator$BagePageReader.nextBatch(VectorizedPageIterator.java:161)
  at org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator$VarWidthTypeBatchReader.nextBatchOf(VectorizedColumnIterator.java:191)
  at org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator$BatchReader.nextBatch(VectorizedColumnIterator.java:74)
  at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.read(VectorizedArrowReader.java:158)
  at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:51)
  at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:35)
  at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:134)
  at org.apache.iceberg.spark.source.BaseDataReader.next(BaseDataReader.java:98)
  at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79)
  at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112)
  at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
  at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
  at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
```
The root cause is the following code in `BaseVariableWidthVector.handleSafe`, which can fail to reallocate because of int overflow and then lead to an `IndexOutOfBoundsException` when we put the data into the vector.
```java
protected final void handleSafe(int index, int dataLength) {
  while (index >= getValueCapacity()) {
    reallocValidityAndOffsetBuffers();
  }
  final int startOffset = lastSet < 0 ? 0 : getStartOffset(lastSet + 1);
  // startOffset + dataLength could overflow
  while (valueBuffer.capacity() < (startOffset + dataLength)) {
    reallocDataBuffer();
  }
}
```
The offset width of `BaseVariableWidthVector` is 4, while the maximum memory allocation is Long.MAX_VALUE. This makes the memory allocation check invalid. Authored-by: xianyangliu <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
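The numbers in the stack trace make the wraparound easy to see. Below is a minimal, self-contained C++ sketch of the arithmetic (an illustration of the overflow only, not the actual Java fix):
```cpp
#include <cstdint>
#include <iostream>

int main() {
  // Values taken from the stack trace above.
  const int32_t start_offset = 2147312542;  // current end of the data buffer
  const int32_t data_length = 777713;       // bytes about to be appended

  // Widened arithmetic shows the true required size: 2148090255,
  // which exceeds INT32_MAX (2147483647).
  const int64_t needed = static_cast<int64_t>(start_offset) + data_length;

  // A 32-bit sum wraps around to a negative value (two's complement),
  // so a "capacity < needed" check done in 32 bits can spuriously pass
  // and the buffer is never reallocated before the out-of-bounds write.
  const int32_t wrapped = static_cast<int32_t>(needed);  // -2146877041

  std::cout << "64-bit sum: " << needed << "\n"
            << "32-bit wrapped sum: " << wrapped << "\n";
  return 0;
}
```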
Currently, Java JNI builds on GitHub Actions can take one hour due to a very long Arrow C++ build phase (example: https://github.com/apache/arrow/runs/7881918943?check_suite_focus=true#step:6:3512). Disable unused Arrow C++ components so as to make the C++ build faster. Authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…nt updates (apache#13769) Building on apache#12157 Lead-authored-by: Jacob Wujciak-Jens <[email protected]> Co-authored-by: Jonathan Keane <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
Add a `--validate` option to `archery crossbow status`. If `--validate` is specified and there are any missing artifacts, `archery crossbow status --validate` exits with a non-zero exit code. We can use it in CI to detect missing artifacts. We can't use `@github-actions crossbow submit` for this change because it isn't merged into the master branch yet. See https://github.com/ursacomputing/crossbow/branches/all?query=build-674, where `nightly-packages` was submitted manually. Authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…and *-glib-devel should have .gir (apache#13876) The current configuration is inverted. *-glib-libs have .gir and *-glib-devel have .typelib. Authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…pache#13910) I noticed looking at pkg.go.dev that there really isn't anyone using the existing `compute` module, which makes sense since it isn't really finished and only provides limited utility currently. This change marks the `compute` module as a separate sub-module inside the `arrow` module, allowing us to use `go1.18` in this new code without forcing anyone who *isn't* using the compute module to upgrade. That way I can leverage generics when writing the new compute code where appropriate. Authored-by: Matt Topol <[email protected]> Signed-off-by: Matt Topol <[email protected]>
… GetSchema (apache#13898) Consistently implements and tests the GetSchema method in Flight SQL. Builds on apache#13897. Authored-by: David Li <[email protected]> Signed-off-by: David Li <[email protected]>
…lization for `LocalFileSystem` (apache#13796) Introduce a specialization of `GetFileInfoGenerator` in the `LocalFileSystem` class. This implementation tries to improve performance by hiding latencies at two levels:
1. Child directories can be read ahead, so that listing directory entries from disk can proceed in parallel with other work;
2. Directory entries can be `stat`'ed and yielded in chunks, so that the `FileInfoGenerator` consumer can start receiving entries before a large directory is fully processed.

Both mechanisms can be tuned using dedicated parameters in `LocalFileSystemOptions` (see the sketch below). Signed-off-by: Pavel Solodovnikov <[email protected]> Co-Authored-by: Igor Seliverstov <[email protected]> Lead-authored-by: Pavel Solodovnikov <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
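A hedged C++ sketch of how a consumer might tune those parameters; the field names `directory_readahead` and `file_info_batch_size` are my reading of this change and worth double-checking against `arrow/filesystem/localfs.h`:
```cpp
#include <memory>

#include <arrow/filesystem/filesystem.h>
#include <arrow/filesystem/localfs.h>

arrow::fs::FileInfoGenerator ListBigTree() {
  arrow::fs::LocalFileSystemOptions options;
  options.directory_readahead = 4;      // list child directories in parallel (assumed name)
  options.file_info_batch_size = 1000;  // yield entries in chunks (assumed name)

  auto fs = std::make_shared<arrow::fs::LocalFileSystem>(options);

  arrow::fs::FileSelector selector;
  selector.base_dir = "/data";  // hypothetical directory
  selector.recursive = true;

  // Yields FileInfo in chunks; consumers can start working before the
  // whole tree has been walked.
  return fs->GetFileInfoGenerator(selector);
}
```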
It looks like the entries in the truth tables were copy-pasted and the _results_ were updated to match the function, but not the operator. Authored-by: Gil Forsyth <[email protected]> Signed-off-by: Yibo Cai <[email protected]>
…pache#13906) Typical real-life Arrow datasets contain List-type vectors of primitive types. This PR introduces a `ListBinder` that maps lists of primitive types to `java.sql.Types.ARRAY`. Lead-authored-by: Igor Suhorukov <[email protected]> Co-authored-by: igor.suhorukov <[email protected]> Signed-off-by: David Li <[email protected]>
…rdered after adding duplicated fields (apache#13321) Authored-by: Hongze Zhang <[email protected]> Signed-off-by: David Li <[email protected]>
…railing bits (apache#13915) Authored-by: Matt Topol <[email protected]> Signed-off-by: Matt Topol <[email protected]>
…pache#13913) Authored-by: Jacob Wujciak-Jens <[email protected]> Signed-off-by: Rok <[email protected]>
This PR aims to upgrade ORC to version 1.7.6. Apache ORC 1.7.6 is the most recent maintenance release with the following bug fixes. - https://github.com/apache/orc/releases/tag/v1.7.6 - https://orc.apache.org/news/2022/08/17/ORC-1.7.6/ Authored-by: William Hyun <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
… PKGBUILD (apache#13917) Authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…datafusion-c (apache#13923) The binary uploaders are dev/release/05-binary-upload.sh and dev/release/post-02-binary.sh. We need to customize the .deb package name. This also adds missing environment variable entries. Authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…me or index (apache#13652) Authored-by: anjakefala <[email protected]> Signed-off-by: David Li <[email protected]>
Relating to the building of the Compute functionality in Go with Arrow, this is the implementation of ArraySpan / ExecValue / ExecResult etc. It could be separated out from the function interface definitions, so I was able to make this PR while apache#13924 is still being reviewed. Authored-by: Matt Topol <[email protected]> Signed-off-by: Matt Topol <[email protected]>
Authored-by: Matt Topol <[email protected]> Signed-off-by: Matt Topol <[email protected]>
Authored-by: Matt Topol <[email protected]> Signed-off-by: Matt Topol <[email protected]>
…apache#14210) I couldn't reproduce it, so I added a suppression instead. In both cases, the error is that the server is uncontactable. That shouldn't happen, but I changed the tests to also bind to port 0 instead of using a potentially flaky free port finder. Authored-by: David Li <[email protected]> Signed-off-by: David Li <[email protected]>
This is a follow-up of apache#14204. Authored-by: Sutou Kouhei <[email protected]> Signed-off-by: David Li <[email protected]>
…tructor with new (apache#14216) Advantages: readability, exception safety, and efficiency (the last only for shared_ptr). Cases where it doesn't apply: when calling a private/protected constructor within a class member function, make_shared/make_unique can't work. Authored-by: Jin Shang <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
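As an illustration of the pattern being applied (with a hypothetical `Widget` class, not an Arrow type):
```cpp
#include <memory>

struct Widget {
  Widget(int size, bool ready) : size(size), ready(ready) {}
  int size;
  bool ready;
};

int main() {
  // Before: shared_ptr(new T) performs two allocations (object and
  // control block) and spells the type twice.
  std::shared_ptr<Widget> a(new Widget(42, true));

  // After: one allocation, exception safe, and more readable.
  auto b = std::make_shared<Widget>(42, true);

  // make_unique gives the same readability and exception-safety benefits;
  // the allocation saving applies only to shared_ptr, since unique_ptr
  // has no control block to co-allocate.
  auto c = std::make_unique<Widget>(42, true);
  return 0;
}
```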
…pache#14228) Temporarily pin LLVM version on Appveyor due to a bug in Conda's packaging of LLVM. Authored-by: Jin Shang <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…) This is a follow-up of apache#14216. We can't use std::make_shared for CUDA-related classes because their constructors aren't public. Authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
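A minimal sketch of why the factory-method case has to keep `new`; `Context` here is a hypothetical stand-in for the CUDA classes:
```cpp
#include <memory>

class Context {
 public:
  static std::shared_ptr<Context> Make() {
    // make_shared would have to invoke the private constructor from
    // inside the standard library's allocator machinery, which fails:
    //   return std::make_shared<Context>();  // error: ctor is private
    return std::shared_ptr<Context>(new Context());  // OK: member access
  }

 private:
  Context() = default;
};

int main() {
  auto ctx = Context::Make();
  return 0;
}
```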
…th output buffer (apache#14230) When the output type of an expression is of variable length, e.g. string, Gandiva would realloc the output buffer to make space for new outputs for each row. When the number of rows is high, some memory allocators perform poorly. We can use a std::vector-like approach to amortize the allocation cost: first allocate some initial space depending on the input size; each time we run out of space, double the buffer size; in the end, shrink it to fit the actual size. The Arrow string builder also uses this approach. Authored-by: Jin Shang <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
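A minimal C++ sketch of the doubling-then-shrink strategy described above (hypothetical names, error handling elided; not Gandiva's actual resizer):
```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>

struct GrowableBuffer {
  uint8_t* data = nullptr;
  int64_t size = 0;
  int64_t capacity = 0;

  void Reserve(int64_t additional) {
    const int64_t needed = size + additional;
    if (needed <= capacity) return;
    // Double until it fits, so N appends trigger only O(log N) reallocs.
    int64_t new_capacity = capacity == 0 ? 64 : capacity;
    while (new_capacity < needed) new_capacity *= 2;
    data = static_cast<uint8_t*>(std::realloc(data, new_capacity));
    capacity = new_capacity;
  }

  void Append(const uint8_t* bytes, int64_t length) {
    Reserve(length);
    std::memcpy(data + size, bytes, length);
    size += length;
  }

  // Final pass: give the unused tail back to the allocator.
  void ShrinkToFit() {
    if (size > 0 && size < capacity) {
      data = static_cast<uint8_t*>(std::realloc(data, size));
      capacity = size;
    }
  }
};

int main() {
  GrowableBuffer buf;
  const uint8_t row[] = "hello";
  for (int i = 0; i < 1000; ++i) buf.Append(row, 5);  // few reallocs
  buf.ShrinkToFit();  // capacity == 5000 afterwards
  std::free(buf.data);
  return 0;
}
```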
TweakValidityBit returns a new Array, so the calling function should use the returned value. https://github.com/apache/arrow/blob/6cc37cf2d1ba72c46b64fbc7ac499bd0d7296d20/cpp/src/arrow/testing/gtest_util.cc#L568-L579 Authored-by: kshitij12345 <[email protected]> Signed-off-by: David Li <[email protected]>
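A small sketch of the bug class being fixed; the `TweakValidityBit` signature below is my reading of `arrow/testing/gtest_util.h`:
```cpp
#include <memory>

#include <arrow/array.h>
#include <arrow/testing/gtest_util.h>

// TweakValidityBit is pure: it returns a copy with one validity bit
// flipped, so a call whose result is discarded does nothing.
std::shared_ptr<arrow::Array> FlipFirstBit(
    const std::shared_ptr<arrow::Array>& array) {
  // Buggy: result ignored; `array` itself is never modified.
  // arrow::TweakValidityBit(array, /*index=*/0, /*validity=*/false);

  // Fixed: keep and use the returned Array.
  return arrow::TweakValidityBit(array, /*index=*/0, /*validity=*/false);
}
```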
… OpenTelemetry propagation (apache#11920) Adds a client middleware that sends the span/trace ID to the server, and a server middleware that gets the span/trace ID and starts a child span. The middleware are available in builds without OpenTelemetry; they simply do nothing. Authored-by: David Li <[email protected]> Signed-off-by: David Li <[email protected]>
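A sketch of wiring the middleware into a client and a server; the factory helper names and headers are my reading of this change and worth verifying:
```cpp
#include <arrow/flight/client.h>
#include <arrow/flight/client_tracing_middleware.h>
#include <arrow/flight/server.h>
#include <arrow/flight/server_tracing_middleware.h>

using namespace arrow::flight;

void ConfigureClient(FlightClientOptions* options) {
  // Client side: injects the current span/trace ID into call headers.
  options->middleware.push_back(MakeTracingClientMiddlewareFactory());
}

void ConfigureServer(FlightServerOptions* options) {
  // Server side: extracts the propagated context and starts a child span.
  options->middleware.emplace_back("tracing",
                                   MakeTracingServerMiddlewareFactory());
}
```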
Use `File.deleteOnExit` to delete the JNI lib file on JVM exit. `File.deleteOnExit` actually adds a shutdown hook to make sure the file is deleted. Authored-by: jackylee-ch <[email protected]> Signed-off-by: David Li <[email protected]>
zeroshade pushed a commit that referenced this pull request on Feb 17, 2023:
This PR gathers work from multiple PRs that can be closed after this one is merged:
- Closes #13752
- Closes #13754
- Closes #13842
- Closes #13882
- Closes #13916
- Closes #14063
- Closes #13970

And the issues associated with those PRs can also be closed:
- Fixes #20350 - Add RunEndEncodedScalarType
- Fixes #32543
- Fixes #32544
- Fixes #32688
- Fixes #32731
- Fixes #32772
- Fixes #32774

* Closes: #32104

Lead-authored-by: Felipe Oliveira Carvalho <[email protected]> Co-authored-by: Tobias Zagorni <[email protected]> Signed-off-by: Matt Topol <[email protected]>
gringasalpastor pushed a commit to gringasalpastor/arrow that referenced this pull request on Feb 17, 2023 (same commit message as above).

fatemehp pushed a commit to fatemehp/arrow that referenced this pull request on Feb 24, 2023 (same commit message as above).