Codecov Report
@@ Coverage Diff @@
## main #849 +/- ##
==========================================
- Coverage 71.50% 70.56% -0.95%
==========================================
Files 337 343 +6
Lines 18445 18693 +248
==========================================
+ Hits 13190 13191 +1
- Misses 5255 5502 +247
|
This might also be something to consider in the future when it is more worked out: https://arrow.apache.org/blog/2022/02/16/introducing-arrow-flight-sql/ |
One design question that I have no answer to: Arrow uses variable-sized utf8 and binary, but ODBC requires a fixed length. Any ideas? |
How does that work with different sized strings? Does it take the max length? Or pointers to the heap? |
It has the following representation: given a maximum length N and a batch size of
where |
So a slot in |
exactly, some bytes are un-used:
is represented as
|
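To make the fixed-length layout discussed above concrete, here is a minimal sketch in plain Rust. It is not odbc-api's actual buffer type: `FixedWidthColumn`, `pack` and `slot` are hypothetical names, and the sketch ignores details such as terminating zeros and how a real driver reports truncation. It only shows the idea that every slot reserves `max_len` bytes, with a separate indicator array holding the real length (or a NULL marker).

```rust
/// Marker for NULL values; mirrors the value of ODBC's SQL_NULL_DATA.
const NULL_DATA: isize = -1;

/// Hypothetical fixed-width text column: every slot reserves `max_len` bytes.
struct FixedWidthColumn {
    max_len: usize,         // N: maximum number of bytes per value
    values: Vec<u8>,        // batch_size * max_len bytes, zero padded
    indicators: Vec<isize>, // actual length of each slot, or NULL_DATA
}

/// Packs one batch of optional strings into the fixed-width layout.
fn pack(rows: &[Option<&str>], max_len: usize) -> FixedWidthColumn {
    let mut values = vec![0u8; rows.len() * max_len];
    let mut indicators = Vec::with_capacity(rows.len());
    for (i, row) in rows.iter().enumerate() {
        match row {
            Some(s) => {
                let bytes = s.as_bytes();
                // Simplification: values longer than N are cut off at N bytes.
                let len = bytes.len().min(max_len);
                let start = i * max_len;
                values[start..start + len].copy_from_slice(&bytes[..len]);
                indicators.push(len as isize);
            }
            None => indicators.push(NULL_DATA),
        }
    }
    FixedWidthColumn { max_len, values, indicators }
}

/// Returns the used bytes of slot `i`, or `None` if the slot is NULL.
fn slot(col: &FixedWidthColumn, i: usize) -> Option<&[u8]> {
    let len = col.indicators[i];
    if len == NULL_DATA {
        return None;
    }
    let start = i * col.max_len;
    Some(&col.values[start..start + len as usize])
}
```

With this layout the buffer always occupies `batch_size * max_len` bytes regardless of how short the actual values are, which is why the choice of maximum length matters for memory use.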
Right, and this is per batch? If so, I would choose a relatively low size so that outliers don't blow up memory. |
This is almost done, pending pacman82/odbc-api#174 |
Just curious: are the ODBC drivers compiled via cargo, or is this something that should be set up on the client side? I know that setting up ODBC with TurboDBC was non-trivial. Edit: note to self, click a bit before asking questions:
I do wonder how this would work with precompiled binaries, and whether we can dynamically link here. |
Usually the complexity comes from building turbodbc with the right Arrow / Python / C++ ABI. This is because turbodbc depends on the Python ABI, the C++ ABI and the Arrow version, for calling into C++ code, and creating Following the issue tracker of turbodbc, I feel like well over 90% of the issues are build-chain related. This had been my motivation to create This crate does not link against the Python interpreter directly. It does not rely on ABI compatibility with the Arrow C++ interface. It does not even use C++ (and therefore does not depend on the C++ ABI). My hope is that it should be way easier to set up. Setting up the ODBC driver manager itself is not required on Windows, and is usually trivial on macOS or Linux, at least if you have Installing the driver for a specific data source is an additional step. Yet this is also the point of ODBC: we don't want to ship every driver ever, and we could not ship the ones which will be written in the future. The effort it takes to install a specific driver varies hugely between vendors. Usually the system package manager works fine, though.
a) Redeploying |
Co-authored-by: Jorge Leitao <[email protected]>
This PR adds support for reading from, and writing to, an ODBC driver.
I anticipate this to be one of the most efficient ways of loading data into arrow, as odbc-api offers an API to load data in a columnar format whereby most buffers are copied back-to-back into arrow (even when nulls are present). Variable-length data and validity need a small O(N) deserialization, so this is not as fast as Arrow IPC (but likely much faster than parquet).
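The O(N) step mentioned above can be sketched as follows. This is an illustration in plain Rust, not the code in this PR: `to_variable_size` is a hypothetical helper, and it assumes a fixed-width byte buffer with `max_len` bytes per slot plus an indicator array in which a negative value marks NULL. It produces the three buffers of Arrow's variable-size binary layout: offsets, contiguous values, and a validity bitmap with least-significant-bit ordering.

```rust
/// Converts a fixed-width column into Arrow-style (offsets, values, validity).
/// Assumption: `fixed` holds `indicators.len() * max_len` bytes and a negative
/// indicator marks a NULL slot.
fn to_variable_size(
    fixed: &[u8],
    indicators: &[isize],
    max_len: usize,
) -> (Vec<i32>, Vec<u8>, Vec<u8>) {
    let n = indicators.len();
    let mut offsets = Vec::with_capacity(n + 1);
    let mut values = Vec::new();
    // One bit per slot, rounded up to whole bytes; all slots start as NULL.
    let mut validity = vec![0u8; (n + 7) / 8];

    offsets.push(0i32);
    for (i, &ind) in indicators.iter().enumerate() {
        if ind >= 0 {
            // Copy only the used bytes of the slot, skipping the padding.
            let len = (ind as usize).min(max_len);
            let start = i * max_len;
            values.extend_from_slice(&fixed[start..start + len]);
            // Set bit `i` (LSB-first within each byte) to mark the slot valid.
            validity[i / 8] |= 1 << (i % 8);
        }
        // NULL slots repeat the previous offset, i.e. they have length zero.
        offsets.push(values.len() as i32);
    }
    (offsets, values, validity)
}
```

Each slot is visited once, so the cost is linear in the batch size plus the number of bytes copied. Fixed-size types such as integers would not need this pass for their values, only for the validity bitmap.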