
ARROW-17468: [C++] Validation for RLE arrays #13916

Closed: wants to merge 368 commits into from

Conversation
@zagto zagto commented Aug 18, 2022

No description provided.


felixonmars and others added 27 commits October 7, 2022 22:59
Signed-off-by: Felix Yan <[email protected]>

Lead-authored-by: Yibo Cai <[email protected]>
Co-authored-by: Felix Yan <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
* Update the CUDA runtime version as CUDA 9.1 images are not available anymore
* Fix passing child command arguments to "docker run"

Checked locally on an Ubuntu 20.04 host with:
```
UBUNTU=18.04 archery --debug docker run ubuntu-cuda-cpp
UBUNTU=20.04 archery --debug docker run ubuntu-cuda-cpp
```

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…ead_metadata (apache#13629)

Add `filesystem` support to `pq.read_metadata` and `pq.read_schema`.

Lead-authored-by: kshitij12345 <[email protected]>
Co-authored-by: Kshiteej K <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
…pache#13899)

Checked locally on an Ubuntu 20.04 host with:
```
archery docker run ubuntu-cuda-python
```

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
apache#13821)

Will fix [ARROW-13763](https://issues.apache.org/jira/browse/ARROW-13763)

A separate Jira issue will be made to address closing files in V2 ParquetDataset, which needs to be handled in the C++ layer. 

Adds a context manager to `pq.ParquetFile` to close the input file, and ensures reads within `pq.ParquetDataset` and `pq.read_table` are closed.

```python

# user opened file-like object will not be closed
with open('file.parquet', 'rb') as f:
    with pq.ParquetFile(f) as p:
        table = p.read()
        assert not f.closed  # did not inadvertently close the open file
        assert not p.closed
    assert not f.closed      # ParquetFile context exit didn't close it
    assert not p.closed      # reflects the status of the input file
assert f.closed              # closed by the outer `open` context exit
assert p.closed              # follows, since p wraps f

# path-like will be closed upon exit or `ParquetFile.close`
with pq.ParquetFile('file.parquet') as p:
    table = p.read()
    assert not p.closed
assert p.closed
```

Authored-by: Miles Granger <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
See https://issues.apache.org/jira/browse/ARROW-17289

Lead-authored-by: Yaron Gvili <[email protected]>
Co-authored-by: rtpsw <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…ctor should limit to Integer.MAX_VALUE (apache#13815)

We got an IndexOutOfBoundsException:
```
2022-08-03 09:33:34,076 Error executing query, currentState RUNNING, java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3315 in stage 5.0 failed 4 times, most recent failure: Lost task 3315.3 in stage 5.0 (TID 3926) (30.97.116.209 executor 49): java.lang.IndexOutOfBoundsException: index: 2147312542, length: 777713 (expected: range(0, 2147483648))
	at org.apache.iceberg.shaded.org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:699)
	at org.apache.iceberg.shaded.org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:826)
	at org.apache.iceberg.arrow.vectorized.parquet.VectorizedParquetDefinitionLevelReader$VarWidthReader.nextVal(VectorizedParquetDefinitionLevelReader.java:418)
	at org.apache.iceberg.arrow.vectorized.parquet.VectorizedParquetDefinitionLevelReader$BaseReader.nextBatch(VectorizedParquetDefinitionLevelReader.java:235)
	at org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator$VarWidthTypePageReader.nextVal(VectorizedPageIterator.java:353)
	at org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator$BagePageReader.nextBatch(VectorizedPageIterator.java:161)
	at org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator$VarWidthTypeBatchReader.nextBatchOf(VectorizedColumnIterator.java:191)
	at org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator$BatchReader.nextBatch(VectorizedColumnIterator.java:74)
	at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.read(VectorizedArrowReader.java:158)
	at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:51)
	at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:35)
	at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:134)
	at org.apache.iceberg.spark.source.BaseDataReader.next(BaseDataReader.java:98)
	at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79)
	at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
```

The root cause is that the following code in `BaseVariableWidthVector.handleSafe` can fail to reallocate because of int overflow, which then leads to an `IndexOutOfBoundsException` when we put the data into the vector.

```java
  protected final void handleSafe(int index, int dataLength) {
    while (index >= getValueCapacity()) {
      reallocValidityAndOffsetBuffers();
    }
    final int startOffset = lastSet < 0 ? 0 : getStartOffset(lastSet + 1);
    // startOffset + dataLength could overflow
    while (valueBuffer.capacity() < (startOffset + dataLength)) {
      reallocDataBuffer();
    }
  }
```

The offset width of `BaseVariableWidthVector` is 4 bytes, while the maximum memory allocation is `Long.MAX_VALUE`. This makes the `int`-based capacity check invalid.
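The overflow is easy to reproduce outside Java. A minimal sketch simulating Java's wrapping 32-bit signed addition, using the numbers from the stack trace above:

```python
def add_int32(a, b):
    """Simulate Java's wrapping 32-bit signed integer addition."""
    s = (a + b) & 0xFFFFFFFF
    return s - 2**32 if s >= 2**31 else s

start_offset = 2_147_312_542  # offset near the int32 limit (from the stack trace)
data_length = 777_713

# In Java, startOffset + dataLength silently wraps to a negative value...
assert add_int32(start_offset, data_length) < 0
# ...so `valueBuffer.capacity() < (startOffset + dataLength)` is false,
# reallocDataBuffer() is never called, and the write later fails with
# IndexOutOfBoundsException.
```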


Authored-by: xianyangliu <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Currently, Java JNI builds on Github Actions can take one hour due to a very long Arrow C++ build phase
(example: https://github.com/apache/arrow/runs/7881918943?check_suite_focus=true#step:6:3512).

Disable unused Arrow C++ components so as to make the C++ build faster.

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…nt updates (apache#13769)

Building on apache#12157

Lead-authored-by: Jacob Wujciak-Jens <[email protected]>
Co-authored-by: Jonathan Keane <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
Add a `--validate` option to `archery crossbow status`.
If `--validate` is specified and there are any missing artifacts, `archery crossbow status --validate` exits
with a non-zero exit code. We can use this in CI to detect missing artifacts.

We can't use `@github-actions crossbow submit` for this change because it isn't merged into the master branch
yet. See https://github.com/ursacomputing/crossbow/branches/all?query=build-674, where `nightly-packages`
was submitted manually.

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…and *-glib-devel should have .gir (apache#13876)

The current configuration is inverted.
*-glib-libs have .gir and *-glib-devel have .typelib.

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…pache#13910)

Looking at pkg.go.dev, I noticed that there really isn't anyone using the existing `compute` module, which makes sense since it isn't really finished and currently provides only limited utility.

This change marks the `compute` module as a separate sub-module inside the `arrow` module, allowing us to use `go1.18` in this new code without forcing anyone who *isn't* using the compute module to upgrade. That way I can leverage generics when writing the new compute code where appropriate.

Authored-by: Matt Topol <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
… GetSchema (apache#13898)

Consistently implements and tests the GetSchema method in Flight SQL.

Builds on apache#13897.

Authored-by: David Li <[email protected]>
Signed-off-by: David Li <[email protected]>
…lization for `LocalFileSystem` (apache#13796)

Introduce a specialization of `GetFileInfoGenerator` in the `LocalFileSystem` class.

This implementation tries to improve performance by hiding latencies at two levels:
1. Child directories can be read ahead, so that listing directory entries from disk proceeds in parallel with other work;
2. Directory entries can be `stat`'ed and yielded in chunks, so that the `FileInfoGenerator` consumer can start receiving entries before a large directory is fully processed.

Both mechanisms can be tuned using dedicated parameters in `LocalFileSystemOptions`.

Signed-off-by: Pavel Solodovnikov <[email protected]>
Co-Authored-by: Igor Seliverstov <[email protected]>

Lead-authored-by: Pavel Solodovnikov <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
It looks like the entries in the truth tables were copy-pasted and the
_results_ were updated to match the function, but not the operator.

Authored-by: Gil Forsyth <[email protected]>
Signed-off-by: Yibo Cai <[email protected]>
…pache#13906)

Typical real-life Arrow datasets contain List-type vectors of primitive types. This PR introduces a ListBinder that maps lists of primitive types to java.sql.Types.ARRAY.

Lead-authored-by: Igor Suhorukov <[email protected]>
Co-authored-by: igor.suhorukov <[email protected]>
Signed-off-by: David Li <[email protected]>
…rdered after adding duplicated fields (apache#13321)

Authored-by: Hongze Zhang <[email protected]>
Signed-off-by: David Li <[email protected]>
This PR aims to upgrade ORC to version 1.7.6.

Apache ORC 1.7.6 is the most recent maintenance release with the following bug fixes.

- https://github.com/apache/orc/releases/tag/v1.7.6
- https://orc.apache.org/news/2022/08/17/ORC-1.7.6/ 

Authored-by: William Hyun <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…datafusion-c (apache#13923)

The binary uploaders are dev/release/05-binary-upload.sh and
dev/release/post-02-binary.sh. We need to customize the .deb package
name.

This also adds missing environment variable entries.

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
As part of building the Compute functionality in Go for Arrow, this is the implementation of ArraySpan / ExecValue / ExecResult, etc.

It could be separated out from the function interface definitions, so I was able to open this PR while apache#13924 is still being reviewed.

Authored-by: Matt Topol <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
lidavidm and others added 14 commits October 7, 2022 23:00
…apache#14210)

I couldn't reproduce it, so I added a suppression instead. 

In both cases, the error is that the server is uncontactable. That shouldn't happen, but I changed the tests to also bind to port 0 instead of using a potentially flaky free port finder.
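The port-0 trick avoids the inherent race in "find a free port, then bind to it" helpers: the OS assigns a free port atomically at bind time. A Python sketch of the idea (not the actual test code):

```python
import socket


def bind_ephemeral():
    """Bind to port 0 so the OS picks a free port atomically,
    then report which port was actually assigned."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", 0))
    port = sock.getsockname()[1]  # the OS-assigned port
    return sock, port
```

A free-port finder that binds, reads the port, and closes the socket leaves a window where another process can grab the port before the server binds it; binding to 0 and keeping the socket closes that window.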

Authored-by: David Li <[email protected]>
Signed-off-by: David Li <[email protected]>
This is a follow-up of apache#14204.

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: David Li <[email protected]>
…tructor with new (apache#14216)

Advantages: readability, exception safety, and efficiency (only for shared_ptr).

Cases where this doesn't apply: when calling a private/protected constructor within a class member function, make_shared/make_unique can't work.

Authored-by: Jin Shang <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…pache#14228)

Temporarily pin LLVM version on Appveyor due to a bug in Conda's packaging of LLVM.

Authored-by: Jin Shang <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
)

This is a follow-up of apache#14216. We can't use std::make_shared for CUDA related classes because their constructors aren't public.

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…th output buffer (apache#14230)

When the output type of an expression is of variable length, e.g. string, Gandiva reallocs the output buffer to make space for new outputs for each row. When the number of rows is high, some memory allocators perform poorly.

We can use a std::vector-like approach to amortize the allocation cost: first allocate some initial space depending on the input size; each time we run out of space, double the buffer size; in the end, shrink it to fit the actual size. The Arrow string builder also uses this approach.

Authored-by: Jin Shang <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
… OpenTelemetry propagation (apache#11920)

Adds a client middleware that sends span/trace ID to the server, and a server middleware that gets the span/trace ID and starts a child span.

The middleware are available in builds without OpenTelemetry; they simply do nothing.
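The propagation pattern itself is simple: the client attaches the trace context to the call headers, and the server reads it back. A hedged sketch of the idea in Python (the header name and both middleware classes here are illustrative, not the actual Arrow Flight API):

```python
# Hypothetical header key; real implementations use the W3C
# "traceparent" header via the OpenTelemetry propagators.
TRACE_HEADER = "x-trace-id"


class ClientTracingMiddleware:
    """Attaches the current span/trace ID to outgoing call headers."""

    def __init__(self, trace_id):
        self.trace_id = trace_id

    def sending_headers(self):
        return {TRACE_HEADER: self.trace_id}


class ServerTracingMiddleware:
    """Reads the propagated trace ID and starts a child span under it."""

    def __init__(self):
        self.child_spans = []

    def received_headers(self, headers):
        trace_id = headers.get(TRACE_HEADER)
        if trace_id is not None:
            # In the real middleware this starts an OpenTelemetry
            # child span; here we just record the propagated ID.
            self.child_spans.append(trace_id)
```

When OpenTelemetry is compiled out, both sides degrade to no-ops, which matches the behavior described above.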

Authored-by: David Li <[email protected]>
Signed-off-by: David Li <[email protected]>
Use `File.deleteOnExit` to delete the JNI lib file on JVM exit. `File.deleteOnExit` actually adds a shutdown hook to make sure the file is deleted.

Authored-by: jackylee-ch <[email protected]>
Signed-off-by: David Li <[email protected]>
zeroshade pushed a commit that referenced this pull request Feb 17, 2023
This PR gathers work from multiple PRs that can be closed after this one is merged:

 - Closes #13752
 - Closes #13754
 - Closes #13842
 - Closes #13882
 - Closes #13916
 - Closes #14063
 - Closes #13970

And the issues associated with those PRs can also be closed:

 - Fixes #20350
 - Add RunEndEncodedScalarType
 - Fixes #32543
 - Fixes #32544
 - Fixes #32688
 - Fixes #32731
 - Fixes #32772
 - Fixes #32774

* Closes: #32104

Lead-authored-by: Felipe Oliveira Carvalho <[email protected]>
Co-authored-by: Tobias Zagorni <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
gringasalpastor pushed a commit to gringasalpastor/arrow that referenced this pull request Feb 17, 2023
…pache#33641)

fatemehp pushed a commit to fatemehp/arrow that referenced this pull request Feb 24, 2023
…pache#33641)
