Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-352: [Format] Add time resolution to Interval, schema documentation #920

Closed
wants to merge 880 commits into from

Conversation

wesm
Copy link
Member

@wesm wesm commented Jul 31, 2017

I also added brief explanatory notes to some other types.

Internal storage for timedelta / interval types seem to be somewhat inconsistent across SQL systems. ANSI SQL provides the YEAR_MONTH and DAY_TIME semantic interval types. I propose that we settle on 64-bit integer representation and add a time resolution (consistent with Time, Timestamp) for the DAY_TIME variety.

If we run into problems down the road and need something other than 64-bits, we may need to add a bit width member to Interval, but this seems like unnecessary complexity at this stage.

Is this type used in the Java implementation or any downstream users of Arrow?

@siddharthteotia @jacques-n @julianhyde

bgosztonyi and others added 30 commits May 11, 2017 15:58
__WIN32 is not standard for all windows compilers.

Author: bgosztonyi <[email protected]>

Closes apache#669 from bgosztonyi/patch-1 and squashes the following commits:

618ab3e [bgosztonyi] _WIN32 instead of __WIN32
This verifies that all is working properly after PARQUET-965.

Author: Wes McKinney <[email protected]>

Closes apache#665 from wesm/ARROW-901 and squashes the following commits:

6433c14 [Wes McKinney] Add Parquet unit test for fixed size binary
…odules

Author: Uwe L. Korn <[email protected]>

Closes apache#660 from xhochy/ARROW-813 and squashes the following commits:

47459ba [Uwe L. Korn] Call git later
5b3b4a4 [Uwe L. Korn] Symlink fixes
70ca0d8 [Uwe L. Korn] ARROW-813: [Python] setup.py sdist must also bundle dependent cmake modules
Author: Kouhei Sutou <[email protected]>

Closes apache#666 from kou/glib-example-go-add-missing-error-handling and squashes the following commits:

7baa8ba [Kouhei Sutou] [GLib] Add missing error checks in Go examples
Author: Kouhei Sutou <[email protected]>

Closes apache#667 from kou/site-fix-a-typo and squashes the following commits:

dc66a18 [Kouhei Sutou] [Website] Fix a typo
I added the metadata to the `FieldType` object. It doesn't necessarily seem the right place for it, but it minimizes having to change method signatures to pass around the metadata. I also tried to standardize vector constructors on `FieldType` to ensure it's passed correctly.

Author: Emilio Lahr-Vivaz <[email protected]>

Closes apache#611 from elahrvivaz/ARROW-482 and squashes the following commits:

63a76e0 [Emilio Lahr-Vivaz] ARROW-482 [Java] Exposing custom field metadata
Author: Kouhei Sutou <[email protected]>

Closes apache#668 from kou/website-0.3-release-announce-japanese and squashes the following commits:

cc7871f [Kouhei Sutou] [Website] Add 0.3.0 release announce in Japanese
Author: Max Risuhin <[email protected]>

Closes apache#674 from MaxRis/ARROW-29 and squashes the following commits:

e2d5d72 [Max Risuhin] ARROW-29: [C++] FindRe2 cmake module
… post in blogroll

Author: Wes McKinney <[email protected]>

Closes apache#675 from wesm/ARROW-1010 and squashes the following commits:

35893e8 [Wes McKinney] Mark JA version of 0.3.0 release post
Author: Uwe L. Korn <[email protected]>

Closes apache#678 from xhochy/ARROW-1016 and squashes the following commits:

bc9b5ee [Uwe L. Korn] Build pyarrow using multiple cores
c982c97 [Uwe L. Korn] Remove existing include folder
0b3209a [Uwe L. Korn] More cmake debug output
590f629 [Uwe L. Korn] Add output to wrap cmake command
3de6a8a [Uwe L. Korn] ARROW-1016: Python: Include C++ headers (optionally) in wheels
…:import_pyarrow method

I have been looking at LXML's approach to creating both a public Cython API and C++ API

https://github.com/lxml/lxml

While this may seem like a somewhat radical reorganization of the code, putting all of the main symbols in a single Cython extension makes generating a C++ API for them significantly simpler. By using `.pxi` files we can break the codebase into as small pieces as we like (as long as there are no circular dependencies). As a convenient side effect, the build times are shorter.

Author: Wes McKinney <[email protected]>

Closes apache#680 from wesm/ARROW-819 and squashes the following commits:

9e6ee24 [Wes McKinney] Fix up optional extensions
cff757d [Wes McKinney] Expose pyarrow C API in arrow/python/pyarrow.h
b39d19c [Wes McKinney] Fix test suite. Move _config into lib
ff1b5e5 [Wes McKinney] Rename things a bit
d4a8391 [Wes McKinney] Reorganize Cython code in the style of lxml so make declaring a public C API easier
Author: Brian Hulette <[email protected]>

Closes apache#677 from TheNeuralBit/js_travis and squashes the following commits:

3d5b54b [Brian Hulette] Revert "Testing a failure"
a84b364 [Brian Hulette] Testing a failure
70310b6 [Brian Hulette] add flatbuffers download
e0c1100 [Brian Hulette] Add travis config and scripts for JS
according to the specification in the `Null bitmaps` section:
> Bitmaps are to be initialized to be all unset at allocation time (this includes padding).

null bitmaps should always be padded with 0

Author: Wenchen Fan <[email protected]>

Closes apache#673 from cloud-fan/first and squashes the following commits:

8566f7c [Wenchen Fan] fix typo and mistakes in Layout.md
…e clearer names to IPC reader/writer classes

The main motivation for this patch was to make `StreamReader` and `StreamWriter` abstract, so that other implementations can be created. I would also like to add the option for asynchronous reading and writing.

I also added a CMake option `ARROW_NO_DEPRECATED_API` for more graceful name deprecations.

@kou do you think these names for the IPC classes are more clear?

Author: Wes McKinney <[email protected]>

Closes apache#679 from wesm/ARROW-1008 and squashes the following commits:

d7b7c9c [Wes McKinney] Add missing dtors for pimpl pattern
a797ee3 [Wes McKinney] Fix glib
04fa285 [Wes McKinney] Feedback on ipc reader/writer names. Add open_stream/open_file Python APIs
22346d4 [Wes McKinney] Fix unit tests
10837a6 [Wes McKinney] Add abstract stream writer and reader C++ APIs. Rename record batch stream reader and writer classes for better clarity
Author: Wes McKinney <[email protected]>

Closes apache#682 from wesm/ARROW-1022 and squashes the following commits:

8fd241e [Wes McKinney] Add multithreaded read option to read_feather
Change-Id: I5cfe95272d43fd0cb08cac6646ffde30adc94bdf

Author: Uwe L. Korn <[email protected]>

Closes apache#684 from xhochy/ARROW-1024 and squashes the following commits:

0d1f2f3 [Uwe L. Korn] ARROW-1024: Python: Update build time numpy version to 1.10.1
Notes:

* `PyList_Append` increments ref count, so new objects must be DECREF'd after being inserted
* `PyArray_SimpleNewFromDescr` does not set `NPY_ARRAY_OWNDATA`, neither does `NewFromDescr`

Author: Wes McKinney <[email protected]>

Closes apache#685 from wesm/ARROW-1017 and squashes the following commits:

8459123 [Wes McKinney] Fix memory leak caused by list append ref count, lack of setting NPY_ARRAY_OWNDATA
Author: Uwe L. Korn <[email protected]>

Closes apache#683 from xhochy/macos-wheels and squashes the following commits:

bdc5651 [Uwe L. Korn] Correct the amount of dots
e8d643f [Uwe L. Korn] Correct the amount of dots
4fa6d33 [Uwe L. Korn] Correct the amount of dots
5853b7c [Uwe L. Korn] Prepend LIBARRY_DIR
51a2b13 [Uwe L. Korn] Fix typo
7074eda [Uwe L. Korn] Version numbers are at a different place on macOS
…egers and floats

Author: Wes McKinney <[email protected]>

Closes apache#681 from wesm/ARROW-1004 and squashes the following commits:

9e0b2ea [Wes McKinney] Code review comments
45f1ecb [Wes McKinney] Fixes for manylinux1
4e4c752 [Wes McKinney] Add conversions for numpy object arrays with integers and floats
Author: Wes McKinney <[email protected]>

Closes apache#687 from wesm/python-update-ipc-docs and squashes the following commits:

a502abe [Wes McKinney] Fix IPC docs per API changes
…rrow Table and Schema objects

Negative indexing causes an interpreter segfault right now. After this PR, no segfaults, and reasonable error messages for out of bounds indexes.

Author: Phillip Cloud <[email protected]>

Closes apache#686 from cpcloud/ARROW-1027 and squashes the following commits:

1bfcef4 [Phillip Cloud] ARROW-1027: [Python] Allow negative indexing in fields/columns on pyarrow Table and Schema objects
Author: Kouhei Sutou <[email protected]>

Closes apache#688 from kou/glib-support-pretty-print and squashes the following commits:

815f87f [Kouhei Sutou] [GLib] Support pretty print
Author: Phillip Cloud <[email protected]>

Closes apache#689 from cpcloud/ARROW-1033 and squashes the following commits:

60a3bc3 [Phillip Cloud] ARROW-1033: [Python] pytest discovers scripts/test_leak.py
Author: Julien Le Dem <[email protected]>

Closes apache#644 from julienledem/TZ and squashes the following commits:

37987b9 [Julien Le Dem] add integration tests
a58fcae [Julien Le Dem] add integration test
39966aa [Julien Le Dem] add other vectors and tests
bf245ce [Julien Le Dem] add TZ vectors
Author: Emilio Lahr-Vivaz <[email protected]>

Closes apache#676 from elahrvivaz/ARROW-1015 and squashes the following commits:

9e94e1a [Emilio Lahr-Vivaz] Changing accessor to getCustomMetadata
9ffdb5a [Emilio Lahr-Vivaz] ARROW-1015 [Java] Schema-level metadata
…rtlog

This patch combines new features and improvements in the changelog into a single category and lists these first on the website (since these may be more interesting for readers). I also added the `git shortlog` to show who contributed to each release

Author: Wes McKinney <[email protected]>

Closes apache#691 from wesm/ARROW-1025 and squashes the following commits:

416c876 [Wes McKinney] Consolidated changelog for the website with new features/improvements shown first. Include git shortlog
…tional copy of the schema

cc @TheNeuralBit

Author: Wes McKinney <[email protected]>

Closes apache#694 from wesm/ARROW-998 and squashes the following commits:

eec4fad [Wes McKinney] Clarify that the IPC file footer contains an additional copy of the schema
This makes it easier to use templates for writing more validation logic, and also we don't need the additional array methods

Author: Wes McKinney <[email protected]>

Closes apache#692 from wesm/ARROW-182 and squashes the following commits:

9753a59 [Wes McKinney] Factor out Array::Validate into a separate function
Better API naming consistency with C++

Author: Wes McKinney <[email protected]>

Closes apache#690 from wesm/ARROW-961 and squashes the following commits:

85b352c [Wes McKinney] Add more graceful deprecation warnings for renamed classes, test suite
69a99cd [Wes McKinney] Fix Cython compilation
a15910a [Wes McKinney] Rename InMemoryOutputStream to BufferOutputStream
… format

cc @TheNeuralBit -- the 64-byte padding in the C++ file writer was incorrect (http://arrow.apache.org/docs/ipc.html indicate padding to an 8-byte boundary), so this fixes that.

Author: Wes McKinney <[email protected]>

Closes apache#693 from wesm/ARROW-1002 and squashes the following commits:

35c023f [Wes McKinney] Fix C++ inconsistency with padding at start of IPC file format
wesm and others added 18 commits July 27, 2017 14:18
Author: Wes McKinney <[email protected]>

Closes apache#900 from wesm/fix-license-headers and squashes the following commits:

73941bf [Wes McKinney] Fix many license headers to use proper ASF one
… single Parquet file fails

cc @jreback

Author: Wes McKinney <[email protected]>

Closes apache#902 from wesm/ARROW-1285 and squashes the following commits:

b8f9ef4 [Wes McKinney] Delete any incomplete file when attempt to write single Parquet file fails
Fixes ARROW-1276 and fixes Python dev. documentation (encountered during the preparation of this PR).

Author: Marco Neumann <[email protected]>

Closes apache#906 from crepererum/ARROW-1276 and squashes the following commits:

1c1c92c [Marco Neumann] ARROW-1276: enable parquet serialization of empty DataFrames
1d9cc41 [Marco Neumann] add missing conda packages to python dev. doc
…and Python

We aren't testing this in Travis CI because spinning up an HDFS cluster is a bit heavy weight, but this will at least enable us to do easier ongoing validation that this functionality is working properly.

Author: Wes McKinney <[email protected]>

Closes apache#895 from wesm/ARROW-1281 and squashes the following commits:

a96e166 [Wes McKinney] Fix header
4effee7 [Wes McKinney] Fix license header
d12eea4 [Wes McKinney] Fix license headers
591e7c6 [Wes McKinney] Add Python tests
bbbd8c1 [Wes McKinney] Docker HDFS testing scripts, use hdfs-client.xml from Apache HAWQ (incubating)
Contributed by Stepan Kadlec <[email protected]>

Closes apache#899

Change-Id: I3ad6cb9cc30946fda76e2dd454e0e19966276abc
…::BufferBuilder as in array builders

Kind of an embarrassing oversight, but it's good that we caught it.

In a test for incrementally building a BinaryArray, this yields about 4x speedup

```
Benchmark                                     Time           CPU Iterations
---------------------------------------------------------------------------
BM_BuildBinaryArray/repeats:3             11892 us      11892 us         59   840.886MB/s
BM_BuildBinaryArray/repeats:3             11903 us      11904 us         59   840.082MB/s
BM_BuildBinaryArray/repeats:3             11909 us      11910 us         59   839.662MB/s
BM_BuildBinaryArray/repeats:3_mean        11902 us      11902 us         59    840.21MB/s
BM_BuildBinaryArray/repeats:3_stddev          7 us          7 us          0   520.137kB/s
```

before:

```
Benchmark                                     Time           CPU Iterations
---------------------------------------------------------------------------
BM_BuildBinaryArray/repeats:3             45678 us      45571 us         15   219.439MB/s
BM_BuildBinaryArray/repeats:3             45416 us      45209 us         15   221.197MB/s
BM_BuildBinaryArray/repeats:3             45227 us      45122 us         15   221.619MB/s
BM_BuildBinaryArray/repeats:3_mean        45440 us      45301 us         15   220.752MB/s
BM_BuildBinaryArray/repeats:3_stddev        185 us        194 us          0   966.716kB/s
```

Author: Wes McKinney <[email protected]>

Closes apache#905 from wesm/ARROW-1290 and squashes the following commits:

59d4d9c [Wes McKinney] Double buffer size when exceeding capacity in arrow::BufferBuilder, like in other array builder classes
…ce functions

cc @xhochy @cpcloud for feedback on API

Author: Wes McKinney <[email protected]>

Closes apache#904 from wesm/ARROW-1273 and squashes the following commits:

1372565 [Wes McKinney] Add Parquet read_metadata, read_schema convenience functions
…mantics of --with-parquet

Now the test suite does not fail if you build the Plasma libraries but forget to pass `--with-plasma` to the Python build

Author: Wes McKinney <[email protected]>

Closes apache#903 from wesm/ARROW-1289 and squashes the following commits:

0e9ce78 [Wes McKinney] Add PYARROW_BUILD_PLASMA CMake option, make plasma build work like parquet build
@jacques-n
@StevenMPhillips

PR for the change: dremio@b794dfa

A new unit test has been added on top of original change.

Author: Steven Phillips <[email protected]>
Author: siddharth <[email protected]>

Closes apache#890 from siddharthteotia/ARROW-1267-PR and squashes the following commits:

89a08d9 [siddharth] Handle zero length in BitVector.splitAndTransfer
84946b7 [Steven Phillips] Handle zero length case in BitVector.splitAndTransfer()
b55c146 [Steven Phillips] Handle zero length case in BitVector.splitAndTransfer()
…not Bas…

Fixing the inheritance tree:

Nullable <Fixed-length | Var-Length>Vectors do not use "ArrowBuf data" field in BaseDataValueVector. Therefore, they should extend BaseValueVector class..

Author: siddharth <[email protected]>

Closes apache#892 from siddharthteotia/ARROW-276 and squashes the following commits:

d919538 [siddharth] ARROW-276: Nullable Vectors should extend BaseValueVector and not BaseDataValueVector
…Union Vectors.

@jacques-n
@StevenMPhillips

PR for the original change-set dremio@75396ba
with corresponding unit tests.

Author: siddharth <[email protected]>
Author: Steven Phillips <[email protected]>

Closes apache#901 from siddharthteotia/ARROW-1192-PR and squashes the following commits:

3d89c99 [siddharth] ARROW-1192: splitAndTransfer changes for ListVector,UnionVector and corresponding unit tests
035e886 [Steven Phillips] Use buffer slice for splitAndTransfer in List and Union vectors
….seek

I still need to validate this against the use case in dask/fastparquet#188

Author: Wes McKinney <[email protected]>

Closes apache#907 from wesm/ARROW-1287 and squashes the following commits:

933f3f6 [Wes McKinney] Add testing script for checking thirdparty library against pyarrow.HdfsClient
423ca87 [Wes McKinney] Implement whence argument for pyarrow.NativeFile.seek
Author: Wes McKinney <[email protected]>

Closes apache#908 from wesm/ARROW-968 and squashes the following commits:

47b71a5 [Wes McKinney] Support slices in RecordBatch.__getitem__
See conda-forge/cmake-feedstock#38. Not sure the origin of the build failure but we will pin at 3.8.0 for now

Author: Wes McKinney <[email protected]>

Closes apache#910 from wesm/ARROW-1294 and squashes the following commits:

69dfecc [Wes McKinney] Pin cmake=3.8.0 in MSVC toolchainbuild
…RecordBatch/Table.from_pandas

Author: Wes McKinney <[email protected]>

Closes apache#911 from wesm/ARROW-1291 and squashes the following commits:

d442f3b [Wes McKinney] Cast non-string DataFrame columns to strings in RecordBatch/Table.from_pandas
… cannot connect to Plasma store

cc @pcmoritz @robertnishihara

Author: Wes McKinney <[email protected]>

Closes apache#912 from wesm/ARROW-1264 and squashes the following commits:

bd134d7 [Wes McKinney] Add flags to disable certain classes of unit tests
1d9de77 [Wes McKinney] Raise exception in Python instead of aborting if cannot connect to Plasma store
… and -Werror in CI

Author: Wes McKinney <[email protected]>

Closes apache#913 from wesm/ARROW-932 and squashes the following commits:

9534ae9 [Wes McKinney] Only pass PYARROW_CXXFLAGS when set
dedcbb9 [Wes McKinney] Fix typo
b5a6d9a [Wes McKinney] Supress another clang warning
2e8f105 [Wes McKinney] typo
5740f00 [Wes McKinney] Add PYARROW_CXXFLAGS option, fix MSVC compiler warnings
c32ee09 [Wes McKinney] Remove print statement. Disable MSVC 4190 warning
…rval

Change-Id: Ibfacba119e49e11fa7a5c34cf472db4ee2f81313
/// The kind of interval, YEAR_MONTH or DAY_TIME
///
/// TODO(wesm): Should this be renamed to kind and change resolution to be
/// "unit" for consistency with the other temporal types?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be in favor of this.

/// libraries, like NumPy, have a 64-bit integer-based timedelta type with time
/// resolution metadata
///
/// We support two styles of intervals encoded in a 64-bit integer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

based on prior threads I'm not clear if both types are 64-bit or only DAY_TIME with YEAR-MONTh being 32-bit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arrow Java uses 4 byte vectors for IntervalYearVector and 8 byte vectors for IntervalDayVector.

We would have to use 64-bit for DAY_TIME and 32-bit for YEAR_MONTH

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

32-bits for YEAR_MONTH intervals is fine with me. The main purpose of this patch is to support other interval resolutions than milliseconds in DAY_TIME

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have two different widths would it make sense to make to separate interval tables?
It might make the contract a little bit less confusing (YEAR_MONTH has a fixed resolution), DAY_TIME has a configurable one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree with @emkornfield. The two interval types differ on both the width and the resolution. It's probably cleaner to create two new types, and deprecate this one (but not delete it to handle backward compatibility).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for deleting last comment, I missed the one white line among all the green :(

@wesm do you agree with the approach? Do you have bandwidth to update the PR?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can update this PR. I'd like to get @jacques-n or others to sign off on the format changes before it gets merged.

I don't think the current iteration of the metadata is implemented for IPC in Java, so this is not that disruptive.

Copy link
Contributor

@emkornfield emkornfield Feb 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually looking at this file it seems like there are other types that have different widths based on metadata (i.e.date in seconds since epoch are 32 bit and in milliseconds 64 bits, so we can probably keep as is).

Still ramping up on the java side of things, but it looks like the easiest path forward to not break existing class/structure hierarchy or cause undue confusion is to add a new IntervalUnit (maybe called EPOCH?) above and denote DAY_TIME as "optional" in the standard, but still have it reflect the current java definition (days and milliseconds stored contiguosly as 2 int32s, https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/IntervalDayVector.java)

It appears YEAR_MONTH already reflects the definition here (32 bit integer representing months https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/IntervalYearVector.java)

@xhochy
Copy link
Member

xhochy commented Jan 26, 2019

@wesm can you rebase? This sounds also fine to me.

@wesm
Copy link
Member Author

wesm commented Jan 26, 2019

I will rebase. We need to figure out what is Dremio's migration plan and timeline for this. My understanding is that we've been blocked on someone committing to doing the Java changes. We have yet to implement anything in C++ pending this

Since eventually Gandiva will want to compile expressions involving intervals I don't think we want the memory representation to be a moving target @pravindra @praveenbingo @jacques-n

@emkornfield
Copy link
Contributor

I made #3644 and #3645 to show possible implications for changes to the java side with two different approaches to adding a Interval type that supports seconds, milliseconds, etc to replace the DAY_TIME resolution type. I think the existing YEAR_MONTH type matches what has already been discussed on the mailing list. If either of these approaches look promising can we try to port the FBS file changes to this PR and then continue with implementation?

@emkornfield
Copy link
Contributor

I think the way to go is a new interval type instead of trying to retrofit the new format on top of day-time. This will retain backward compatibility with the existing consumers of the java code base. I updated #3644 which contains the new type and a java implementation with light unit tests for it. @wesm @pravindra @praveenbingo @jacques-n could you take a look at this to see if it is a reasonable way forward?

@wesm
Copy link
Member Author

wesm commented Mar 29, 2019

Closing in favor of new @emkornfield pull request

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.