-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-352: [Format] Add time resolution to Interval, schema documentation #920
Conversation
__WIN32 is not standard for all windows compilers. Author: bgosztonyi <[email protected]> Closes apache#669 from bgosztonyi/patch-1 and squashes the following commits: 618ab3e [bgosztonyi] _WIN32 instead of __WIN32
This verifies that all is working properly after PARQUET-965. Author: Wes McKinney <[email protected]> Closes apache#665 from wesm/ARROW-901 and squashes the following commits: 6433c14 [Wes McKinney] Add Parquet unit test for fixed size binary
…odules Author: Uwe L. Korn <[email protected]> Closes apache#660 from xhochy/ARROW-813 and squashes the following commits: 47459ba [Uwe L. Korn] Call git later 5b3b4a4 [Uwe L. Korn] Symlink fixes 70ca0d8 [Uwe L. Korn] ARROW-813: [Python] setup.py sdist must also bundle dependent cmake modules
Author: Kouhei Sutou <[email protected]> Closes apache#666 from kou/glib-example-go-add-missing-error-handling and squashes the following commits: 7baa8ba [Kouhei Sutou] [GLib] Add missing error checks in Go examples
Author: Kouhei Sutou <[email protected]> Closes apache#667 from kou/site-fix-a-typo and squashes the following commits: dc66a18 [Kouhei Sutou] [Website] Fix a typo
I added the metadata to the `FieldType` object. It doesn't necessarily seem the right place for it, but it minimizes having to change method signatures to pass around the metadata. I also tried to standardize vector constructors on `FieldType` to ensure it's passed correctly. Author: Emilio Lahr-Vivaz <[email protected]> Closes apache#611 from elahrvivaz/ARROW-482 and squashes the following commits: 63a76e0 [Emilio Lahr-Vivaz] ARROW-482 [Java] Exposing custom field metadata
Author: Kouhei Sutou <[email protected]> Closes apache#668 from kou/website-0.3-release-announce-japanese and squashes the following commits: cc7871f [Kouhei Sutou] [Website] Add 0.3.0 release announce in Japanese
Author: Max Risuhin <[email protected]> Closes apache#674 from MaxRis/ARROW-29 and squashes the following commits: e2d5d72 [Max Risuhin] ARROW-29: [C++] FindRe2 cmake module
… post in blogroll Author: Wes McKinney <[email protected]> Closes apache#675 from wesm/ARROW-1010 and squashes the following commits: 35893e8 [Wes McKinney] Mark JA version of 0.3.0 release post
Author: Uwe L. Korn <[email protected]> Closes apache#678 from xhochy/ARROW-1016 and squashes the following commits: bc9b5ee [Uwe L. Korn] Build pyarrow using multiple cores c982c97 [Uwe L. Korn] Remove existing include folder 0b3209a [Uwe L. Korn] More cmake debug output 590f629 [Uwe L. Korn] Add output to wrap cmake command 3de6a8a [Uwe L. Korn] ARROW-1016: Python: Include C++ headers (optionally) in wheels
…:import_pyarrow method I have been looking at LXML's approach to creating both a public Cython API and C++ API https://github.com/lxml/lxml While this may seem like a somewhat radical reorganization of the code, putting all of the main symbols in a single Cython extension makes generating a C++ API for them significantly simpler. By using `.pxi` files we can break the codebase into as small pieces as we like (as long as there are no circular dependencies). As a convenient side effect, the build times are shorter. Author: Wes McKinney <[email protected]> Closes apache#680 from wesm/ARROW-819 and squashes the following commits: 9e6ee24 [Wes McKinney] Fix up optional extensions cff757d [Wes McKinney] Expose pyarrow C API in arrow/python/pyarrow.h b39d19c [Wes McKinney] Fix test suite. Move _config into lib ff1b5e5 [Wes McKinney] Rename things a bit d4a8391 [Wes McKinney] Reorganize Cython code in the style of lxml so make declaring a public C API easier
Author: Brian Hulette <[email protected]> Closes apache#677 from TheNeuralBit/js_travis and squashes the following commits: 3d5b54b [Brian Hulette] Revert "Testing a failure" a84b364 [Brian Hulette] Testing a failure 70310b6 [Brian Hulette] add flatbuffers download e0c1100 [Brian Hulette] Add travis config and scripts for JS
according to the specification in the `Null bitmaps` section: > Bitmaps are to be initialized to be all unset at allocation time (this includes padding). null bitmaps should always be padded with 0 Author: Wenchen Fan <[email protected]> Closes apache#673 from cloud-fan/first and squashes the following commits: 8566f7c [Wenchen Fan] fix typo and mistakes in Layout.md
…e clearer names to IPC reader/writer classes The main motivation for this patch was to make `StreamReader` and `StreamWriter` abstract, so that other implementations can be created. I would also like to add the option for asynchronous reading and writing. I also added a CMake option `ARROW_NO_DEPRECATED_API` for more graceful name deprecations. @kou do you think these names for the IPC classes are more clear? Author: Wes McKinney <[email protected]> Closes apache#679 from wesm/ARROW-1008 and squashes the following commits: d7b7c9c [Wes McKinney] Add missing dtors for pimpl pattern a797ee3 [Wes McKinney] Fix glib 04fa285 [Wes McKinney] Feedback on ipc reader/writer names. Add open_stream/open_file Python APIs 22346d4 [Wes McKinney] Fix unit tests 10837a6 [Wes McKinney] Add abstract stream writer and reader C++ APIs. Rename record batch stream reader and writer classes for better clarity
Author: Wes McKinney <[email protected]> Closes apache#682 from wesm/ARROW-1022 and squashes the following commits: 8fd241e [Wes McKinney] Add multithreaded read option to read_feather
Change-Id: I5cfe95272d43fd0cb08cac6646ffde30adc94bdf Author: Uwe L. Korn <[email protected]> Closes apache#684 from xhochy/ARROW-1024 and squashes the following commits: 0d1f2f3 [Uwe L. Korn] ARROW-1024: Python: Update build time numpy version to 1.10.1
Notes: * `PyList_Append` increments ref count, so new objects must be DECREF'd after being inserted * `PyArray_SimpleNewFromDescr` does not set `NPY_ARRAY_OWNDATA`, neither does `NewFromDescr` Author: Wes McKinney <[email protected]> Closes apache#685 from wesm/ARROW-1017 and squashes the following commits: 8459123 [Wes McKinney] Fix memory leak caused by list append ref count, lack of setting NPY_ARRAY_OWNDATA
Author: Uwe L. Korn <[email protected]> Closes apache#683 from xhochy/macos-wheels and squashes the following commits: bdc5651 [Uwe L. Korn] Correct the amount of dots e8d643f [Uwe L. Korn] Correct the amount of dots 4fa6d33 [Uwe L. Korn] Correct the amount of dots 5853b7c [Uwe L. Korn] Prepend LIBARRY_DIR 51a2b13 [Uwe L. Korn] Fix typo 7074eda [Uwe L. Korn] Version numbers are at a different place on macOS
…egers and floats Author: Wes McKinney <[email protected]> Closes apache#681 from wesm/ARROW-1004 and squashes the following commits: 9e0b2ea [Wes McKinney] Code review comments 45f1ecb [Wes McKinney] Fixes for manylinux1 4e4c752 [Wes McKinney] Add conversions for numpy object arrays with integers and floats
Author: Wes McKinney <[email protected]> Closes apache#687 from wesm/python-update-ipc-docs and squashes the following commits: a502abe [Wes McKinney] Fix IPC docs per API changes
…rrow Table and Schema objects Negative indexing causes an interpreter segfault right now. After this PR, no segfaults, and reasonable error messages for out of bounds indexes. Author: Phillip Cloud <[email protected]> Closes apache#686 from cpcloud/ARROW-1027 and squashes the following commits: 1bfcef4 [Phillip Cloud] ARROW-1027: [Python] Allow negative indexing in fields/columns on pyarrow Table and Schema objects
Author: Kouhei Sutou <[email protected]> Closes apache#688 from kou/glib-support-pretty-print and squashes the following commits: 815f87f [Kouhei Sutou] [GLib] Support pretty print
Author: Phillip Cloud <[email protected]> Closes apache#689 from cpcloud/ARROW-1033 and squashes the following commits: 60a3bc3 [Phillip Cloud] ARROW-1033: [Python] pytest discovers scripts/test_leak.py
Author: Julien Le Dem <[email protected]> Closes apache#644 from julienledem/TZ and squashes the following commits: 37987b9 [Julien Le Dem] add integration tests a58fcae [Julien Le Dem] add integration test 39966aa [Julien Le Dem] add other vectors and tests bf245ce [Julien Le Dem] add TZ vectors
Author: Emilio Lahr-Vivaz <[email protected]> Closes apache#676 from elahrvivaz/ARROW-1015 and squashes the following commits: 9e94e1a [Emilio Lahr-Vivaz] Changing accessor to getCustomMetadata 9ffdb5a [Emilio Lahr-Vivaz] ARROW-1015 [Java] Schema-level metadata
…rtlog This patch combines new features and improvements in the changelog into a single category and lists these first on the website (since these may be more interesting for readers). I also added the `git shortlog` to show who contributed to each release Author: Wes McKinney <[email protected]> Closes apache#691 from wesm/ARROW-1025 and squashes the following commits: 416c876 [Wes McKinney] Consolidated changelog for the website with new features/improvements shown first. Include git shortlog
…tional copy of the schema cc @TheNeuralBit Author: Wes McKinney <[email protected]> Closes apache#694 from wesm/ARROW-998 and squashes the following commits: eec4fad [Wes McKinney] Clarify that the IPC file footer contains an additional copy of the schema
This makes it easier to use templates for writing more validation logic, and also we don't need the additional array methods Author: Wes McKinney <[email protected]> Closes apache#692 from wesm/ARROW-182 and squashes the following commits: 9753a59 [Wes McKinney] Factor out Array::Validate into a separate function
Better API naming consistency with C++ Author: Wes McKinney <[email protected]> Closes apache#690 from wesm/ARROW-961 and squashes the following commits: 85b352c [Wes McKinney] Add more graceful deprecation warnings for renamed classes, test suite 69a99cd [Wes McKinney] Fix Cython compilation a15910a [Wes McKinney] Rename InMemoryOutputStream to BufferOutputStream
… format cc @TheNeuralBit -- the 64-byte padding in the C++ file writer was incorrect (http://arrow.apache.org/docs/ipc.html indicate padding to an 8-byte boundary), so this fixes that. Author: Wes McKinney <[email protected]> Closes apache#693 from wesm/ARROW-1002 and squashes the following commits: 35c023f [Wes McKinney] Fix C++ inconsistency with padding at start of IPC file format
Author: Wes McKinney <[email protected]> Closes apache#900 from wesm/fix-license-headers and squashes the following commits: 73941bf [Wes McKinney] Fix many license headers to use proper ASF one
… single Parquet file fails cc @jreback Author: Wes McKinney <[email protected]> Closes apache#902 from wesm/ARROW-1285 and squashes the following commits: b8f9ef4 [Wes McKinney] Delete any incomplete file when attempt to write single Parquet file fails
Fixes ARROW-1276 and fixes Python dev. documentation (encountered during the preparation of this PR). Author: Marco Neumann <[email protected]> Closes apache#906 from crepererum/ARROW-1276 and squashes the following commits: 1c1c92c [Marco Neumann] ARROW-1276: enable parquet serialization of empty DataFrames 1d9cc41 [Marco Neumann] add missing conda packages to python dev. doc
…and Python We aren't testing this in Travis CI because spinning up an HDFS cluster is a bit heavy weight, but this will at least enable us to do easier ongoing validation that this functionality is working properly. Author: Wes McKinney <[email protected]> Closes apache#895 from wesm/ARROW-1281 and squashes the following commits: a96e166 [Wes McKinney] Fix header 4effee7 [Wes McKinney] Fix license header d12eea4 [Wes McKinney] Fix license headers 591e7c6 [Wes McKinney] Add Python tests bbbd8c1 [Wes McKinney] Docker HDFS testing scripts, use hdfs-client.xml from Apache HAWQ (incubating)
Contributed by Stepan Kadlec <[email protected]> Closes apache#899 Change-Id: I3ad6cb9cc30946fda76e2dd454e0e19966276abc
…::BufferBuilder as in array builders Kind of an embarrassing oversight, but it's good that we caught it. In a test for incrementally building a BinaryArray, this yields about 4x speedup ``` Benchmark Time CPU Iterations --------------------------------------------------------------------------- BM_BuildBinaryArray/repeats:3 11892 us 11892 us 59 840.886MB/s BM_BuildBinaryArray/repeats:3 11903 us 11904 us 59 840.082MB/s BM_BuildBinaryArray/repeats:3 11909 us 11910 us 59 839.662MB/s BM_BuildBinaryArray/repeats:3_mean 11902 us 11902 us 59 840.21MB/s BM_BuildBinaryArray/repeats:3_stddev 7 us 7 us 0 520.137kB/s ``` before: ``` Benchmark Time CPU Iterations --------------------------------------------------------------------------- BM_BuildBinaryArray/repeats:3 45678 us 45571 us 15 219.439MB/s BM_BuildBinaryArray/repeats:3 45416 us 45209 us 15 221.197MB/s BM_BuildBinaryArray/repeats:3 45227 us 45122 us 15 221.619MB/s BM_BuildBinaryArray/repeats:3_mean 45440 us 45301 us 15 220.752MB/s BM_BuildBinaryArray/repeats:3_stddev 185 us 194 us 0 966.716kB/s ``` Author: Wes McKinney <[email protected]> Closes apache#905 from wesm/ARROW-1290 and squashes the following commits: 59d4d9c [Wes McKinney] Double buffer size when exceeding capacity in arrow::BufferBuilder, like in other array builder classes
…ce functions cc @xhochy @cpcloud for feedback on API Author: Wes McKinney <[email protected]> Closes apache#904 from wesm/ARROW-1273 and squashes the following commits: 1372565 [Wes McKinney] Add Parquet read_metadata, read_schema convenience functions
…mantics of --with-parquet Now the test suite does not fail if you build the Plasma libraries but forget to pass `--with-plasma` to the Python build Author: Wes McKinney <[email protected]> Closes apache#903 from wesm/ARROW-1289 and squashes the following commits: 0e9ce78 [Wes McKinney] Add PYARROW_BUILD_PLASMA CMake option, make plasma build work like parquet build
@jacques-n @StevenMPhillips PR for the change: dremio@b794dfa A new unit test has been added on top of original change. Author: Steven Phillips <[email protected]> Author: siddharth <[email protected]> Closes apache#890 from siddharthteotia/ARROW-1267-PR and squashes the following commits: 89a08d9 [siddharth] Handle zero length in BitVector.splitAndTransfer 84946b7 [Steven Phillips] Handle zero length case in BitVector.splitAndTransfer() b55c146 [Steven Phillips] Handle zero length case in BitVector.splitAndTransfer()
…not Bas… Fixing the inheritance tree: Nullable <Fixed-length | Var-Length>Vectors do not use "ArrowBuf data" field in BaseDataValueVector. Therefore, they should extend BaseValueVector class.. Author: siddharth <[email protected]> Closes apache#892 from siddharthteotia/ARROW-276 and squashes the following commits: d919538 [siddharth] ARROW-276: Nullable Vectors should extend BaseValueVector and not BaseDataValueVector
…Union Vectors. @jacques-n @StevenMPhillips PR for the original change-set dremio@75396ba with corresponding unit tests. Author: siddharth <[email protected]> Author: Steven Phillips <[email protected]> Closes apache#901 from siddharthteotia/ARROW-1192-PR and squashes the following commits: 3d89c99 [siddharth] ARROW-1192: splitAndTransfer changes for ListVector,UnionVector and corresponding unit tests 035e886 [Steven Phillips] Use buffer slice for splitAndTransfer in List and Union vectors
….seek I still need to validate this against the use case in dask/fastparquet#188 Author: Wes McKinney <[email protected]> Closes apache#907 from wesm/ARROW-1287 and squashes the following commits: 933f3f6 [Wes McKinney] Add testing script for checking thirdparty library against pyarrow.HdfsClient 423ca87 [Wes McKinney] Implement whence argument for pyarrow.NativeFile.seek
Author: Wes McKinney <[email protected]> Closes apache#908 from wesm/ARROW-968 and squashes the following commits: 47b71a5 [Wes McKinney] Support slices in RecordBatch.__getitem__
See conda-forge/cmake-feedstock#38. Not sure the origin of the build failure but we will pin at 3.8.0 for now Author: Wes McKinney <[email protected]> Closes apache#910 from wesm/ARROW-1294 and squashes the following commits: 69dfecc [Wes McKinney] Pin cmake=3.8.0 in MSVC toolchainbuild
…RecordBatch/Table.from_pandas Author: Wes McKinney <[email protected]> Closes apache#911 from wesm/ARROW-1291 and squashes the following commits: d442f3b [Wes McKinney] Cast non-string DataFrame columns to strings in RecordBatch/Table.from_pandas
… cannot connect to Plasma store cc @pcmoritz @robertnishihara Author: Wes McKinney <[email protected]> Closes apache#912 from wesm/ARROW-1264 and squashes the following commits: bd134d7 [Wes McKinney] Add flags to disable certain classes of unit tests 1d9de77 [Wes McKinney] Raise exception in Python instead of aborting if cannot connect to Plasma store
… and -Werror in CI Author: Wes McKinney <[email protected]> Closes apache#913 from wesm/ARROW-932 and squashes the following commits: 9534ae9 [Wes McKinney] Only pass PYARROW_CXXFLAGS when set dedcbb9 [Wes McKinney] Fix typo b5a6d9a [Wes McKinney] Supress another clang warning 2e8f105 [Wes McKinney] typo 5740f00 [Wes McKinney] Add PYARROW_CXXFLAGS option, fix MSVC compiler warnings c32ee09 [Wes McKinney] Remove print statement. Disable MSVC 4190 warning
…rval Change-Id: Ibfacba119e49e11fa7a5c34cf472db4ee2f81313
/// The kind of interval, YEAR_MONTH or DAY_TIME | ||
/// | ||
/// TODO(wesm): Should this be renamed to kind and change resolution to be | ||
/// "unit" for consistency with the other temporal types? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be in favor of this.
/// libraries, like NumPy, have a 64-bit integer-based timedelta type with time | ||
/// resolution metadata | ||
/// | ||
/// We support two styles of intervals encoded in a 64-bit integer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
based on prior threads I'm not clear if both types are 64-bit or only DAY_TIME with YEAR-MONTh being 32-bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arrow Java uses 4 byte vectors for IntervalYearVector and 8 byte vectors for IntervalDayVector.
We would have to use 64-bit for DAY_TIME and 32-bit for YEAR_MONTH
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
32-bits for YEAR_MONTH intervals is fine with me. The main purpose of this patch is to support other interval resolutions than milliseconds in DAY_TIME
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have two different widths would it make sense to make to separate interval tables?
It might make the contract a little bit less confusing (YEAR_MONTH has a fixed resolution), DAY_TIME has a configurable one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree with @emkornfield. The two interval types differ on both the width and the resolution. It's probably cleaner to create two new types, and deprecate this one (but not delete it to handle backward compatibility).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for deleting last comment, I missed the one white line among all the green :(
@wesm do you agree with the approach? Do you have bandwidth to update the PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can update this PR. I'd like to get @jacques-n or others to sign off on the format changes before it gets merged.
I don't think the current iteration of the metadata is implemented for IPC in Java, so this is not that disruptive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually looking at this file it seems like there are other types that have different widths based on metadata (i.e.date in seconds since epoch are 32 bit and in milliseconds 64 bits, so we can probably keep as is).
Still ramping up on the java side of things, but it looks like the easiest path forward to not break existing class/structure hierarchy or cause undue confusion is to add a new IntervalUnit (maybe called EPOCH?) above and denote DAY_TIME as "optional" in the standard, but still have it reflect the current java definition (days and milliseconds stored contiguosly as 2 int32s, https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/IntervalDayVector.java)
It appears YEAR_MONTH already reflects the definition here (32 bit integer representing months https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/IntervalYearVector.java)
@wesm can you rebase? This sounds also fine to me. |
I will rebase. We need to figure out what is Dremio's migration plan and timeline for this. My understanding is that we've been blocked on someone committing to doing the Java changes. We have yet to implement anything in C++ pending this Since eventually Gandiva will want to compile expressions involving intervals I don't think we want the memory representation to be a moving target @pravindra @praveenbingo @jacques-n |
I made #3644 and #3645 to show possible implications for changes to the java side with two different approaches to adding a Interval type that supports seconds, milliseconds, etc to replace the DAY_TIME resolution type. I think the existing YEAR_MONTH type matches what has already been discussed on the mailing list. If either of these approaches look promising can we try to port the FBS file changes to this PR and then continue with implementation? |
I think the way to go is a new interval type instead of trying to retrofit the new format on top of day-time. This will retain backward compatibility with the existing consumers of the java code base. I updated #3644 which contains the new type and a java implementation with light unit tests for it. @wesm @pravindra @praveenbingo @jacques-n could you take a look at this to see if it is a reasonable way forward? |
Closing in favor of new @emkornfield pull request |
I also added brief explanatory notes to some other types.
Internal storage for timedelta / interval types seem to be somewhat inconsistent across SQL systems. ANSI SQL provides the YEAR_MONTH and DAY_TIME semantic interval types. I propose that we settle on 64-bit integer representation and add a time resolution (consistent with Time, Timestamp) for the DAY_TIME variety.
If we run into problems down the road and need something other than 64-bits, we may need to add a bit width member to Interval, but this seems like unnecessary complexity at this stage.
Is this type used in the Java implementation or any downstream users of Arrow?
@siddharthteotia @jacques-n @julianhyde