Deal with, document dependencies issue, add first README.md #22
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,60 @@ | ||
# Installing Records Mover | ||
|
||
You can install records-mover with the following 'extras': | ||
|
||
* `pip3 install records-mover` - Install minimal version, not | ||
including `pandas` (needed only for local data copy), `psycopg2` | ||
(needed for Redshift or PostgreSQL connections) or `pyarrow` (needed | ||
for local Parquet manipulation). | ||
* `pip3 install 'records-mover[gsheets]'` - Minimal install plus API | ||
libraries to access Google Sheets. | ||
* `pip3 install 'records-mover[movercli]'` - Install everything and | ||
Review comment: Is it mover-cli or movercli? I’ve seen it both ways in the docs (will flag)
|
||
make assumptions compatible with using mvrec on the command line. | ||
Installs `pandas`, `psycopg2-binary` and `pyarrow`. | ||
|
||
Don't use this extra if you plan on using the library because of the | ||
`psycopg2-binary` risk below. | ||
|
||
## Why this is complicated | ||
|
||
Records mover relies on a number of external libraries. Here are some | ||
things to keep in mind when using `pip install`: | ||
|
||
### pandas | ||
|
||
|
||
Only when installing with `pip3 install 'records-mover[movercli]'` | ||
will you get pandas installed by default. | ||
|
||
Pandas is a large dependency which is needed in cases where we need to | ||
process data locally. If you are using cloud-native import/export | ||
functionality only, you shouldn't need it and can avoid the bloat. | ||
|
||
### psycopg2 | ||
|
||
psycopg2 is a library used for access to both Redshift and PostgreSQL databases. | ||
|
||
The project is | ||
[dealing](https://www.postgresql.org/message-id/CA%2Bmi_8bd6kJHLTGkuyHSnqcgDrJ1uHgQWvXCKQFD3tPQBUa2Bw%40mail.gmail.com) | ||
[with](https://www.psycopg.org/articles/2018/02/08/psycopg-274-released/) | ||
a thorny compatibility issue with native code and threading. They've | ||
published three separate versions of their library to PyPI as a | ||
result: | ||
|
||
* `psycopg2` - requires local compilation, and as such you need certain | ||
tools and maybe configuration set up. This is the hardest one to | ||
install as a result. | ||
* `psycopg2-binary` - pre-compiled version that might have threading | ||
issues if you try to use it in a multi-threaded environment with | ||
other code that might be using libssl from a different source. | ||
* `psycopg2cffi` - The version to use if you use `pypy` | ||
|
||
If you are using the mvrec command line only, you can use `pip3 | ||
install 'records-mover[movercli]'` and it just uses `psycopg2-binary`. | ||
|
||
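Because `psycopg2` and `psycopg2-binary` both install the same importable `psycopg2` module, an import check alone can't tell you which one you have. Here is a minimal sketch, using only the standard library's `importlib.metadata` (Python 3.8+), that reports which of the three distributions listed above are installed. The function name `installed_psycopg2_variants` is made up for illustration and is not part of records-mover.

```python
from importlib import metadata

# Distribution names as published on PyPI (see the list above).
PSYCOPG2_DISTRIBUTIONS = ("psycopg2", "psycopg2-binary", "psycopg2cffi")


def installed_psycopg2_variants(candidates=PSYCOPG2_DISTRIBUTIONS):
    """Return {distribution name: version} for each candidate that is installed."""
    found = {}
    for name in candidates:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed in this environment; skip it.
            continue
    return found


if __name__ == "__main__":
    variants = installed_psycopg2_variants()
    if not variants:
        print("No psycopg2 distribution is installed.")
    for name, version in variants.items():
        print(f"{name} {version}")
```

If more than one name is reported, the environment likely has overlapping installs, which is worth checking before debugging any threading or libssl problem.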
### pyarrow | ||
|
||
`pyarrow` is a Python wrapper around the Apache Arrow native library. | ||
Review comment: Hmm... is it worth making a mover-cli target that doesn’t include this? I guess it might depend on how hard it is to install these libraries in general. Should we maybe include steps for homebrew and whatever Linux distro we have already figured out for CI/docker?
Reply: No objections here.
That'd be great! I don't have any off-the-shelf to drop in. Same feeling on not holding up this PR, but if I wasn't super clear about it, having spaces for those sorts of instructions are exactly why I added this file. |
||
It's used by records mover to manipulate Parquet files locally. The | ||
Apache Arrow native library can require build tools to install and is | ||
large; if you don't need to deal with Parquet files in the local | ||
environment you can work without it. |
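Since `pandas` and `pyarrow` are only needed for local processing, code that uses records-mover as a library may want to tolerate their absence. Below is a minimal sketch of the usual optional-import pattern, using the standard library only. `load_optional` is a hypothetical helper, not something records-mover provides.

```python
import importlib


def load_optional(module_name):
    """Import and return the named module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


if __name__ == "__main__":
    # Report which of the optional local-processing libraries are present.
    for name in ("pandas", "pyarrow"):
        status = "available" if load_optional(name) is not None else "not installed"
        print(f"{name}: {status}")
```

A caller can then skip local DataFrame or Parquet handling when the helper returns `None`, rather than failing at import time.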
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,85 @@ | ||
# Records Mover | ||
# Records Mover - mvrec | ||
|
||
Records Mover is a command-line tool and Python library you can | ||
use to move relational data from one place to another. | ||
|
||
Relational data here means anything roughly "rectangular" - with | ||
columns and rows. For example, it supports reading and writing | ||
data in: | ||
|
||
* Databases, including using native high-speed methods of | ||
import/export of bulk data. Redshift and Vertica are | ||
well-supported, with some support for BigQuery and PostgreSQL. | ||
* Google Sheets | ||
* Pandas DataFrames | ||
* CSV files, either alone or in a records directory - a structured | ||
directory of CSV/Parquet/etc files containing some JSON metadata | ||
about their format and origins. Records directories are especially | ||
helpful for the ever-ambiguous CSV format, where they solve the | ||
problem of 'hey, this may be a CSV - but what's the schema? What's | ||
the format of the CSV itself? How is it escaped?' | ||
|
||
Records Mover can be extended to handle additional database | ||
and data file types by building on top of their | ||
[SQLAlchemy](https://www.sqlalchemy.org/) drivers, and is able to | ||
auto-negotiate the most efficient way of moving data from one to the | ||
other. | ||
|
||
Example CLI use: | ||
|
||
```sh | ||
pip3 install 'records_mover[movercli]' | ||
|
||
mvrec --help | ||
mvrec table2table mydb1 myschema1 mytable1 mydb2 myschema2 mytable2 | ||
``` | ||
|
||
For more installation notes, see [INSTALL.md](./INSTALL.md) | ||
|
||
Note that the connection details for the database names here must be | ||
configured using | ||
[db-facts](https://github.com/bluelabsio/db-facts/blob/master/CONFIGURATION.md). | ||
|
||
Example Python library use: | ||
|
||
First, install records_mover. We'll also use Pandas, so we'll install | ||
that, too: | ||
|
||
```sh | ||
pip3 install records_mover pandas | ||
``` | ||
|
||
Now we can run this code: | ||
|
||
```python | ||
#!/usr/bin/env python3 | ||
|
||
# Pull in the records-mover library - be sure to run the pip install above first! | ||
from records_mover import Session | ||
from pandas import DataFrame | ||
|
||
session = Session() | ||
records = session.records | ||
|
||
# This is a SQLAlchemy database engine. | ||
# | ||
# You can instead call session.get_db_engine('cred name'). | ||
Review comment: Some old uses of job_context in here
Reply: Fixed; that's all that grep found. |
||
# | ||
# On your laptop, 'cred name' is the same thing passed to dbcli (mapping to something in LastPass). | ||
# | ||
# In Airflow, 'cred name' maps to the connection ID in the admin Connections UI. | ||
# | ||
# Or you can build your own and pass it in! | ||
db_engine = session.get_default_db_engine() | ||
|
||
df = DataFrame.from_dict([{'a': 1}]) # or make your own! | ||
|
||
source = records.sources.dataframe(df=df) | ||
target = records.targets.table(schema_name='myschema', | ||
table_name='mytable', | ||
db_engine=db_engine) | ||
results = records.move(source, target) | ||
``` | ||
|
||
When moving data, the sources supported can be found | ||
[here](./records_mover/records/sources/factory.py), and the | ||
targets supported can be found [here](./records_mover/records/targets/factory.py). |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -11,16 +11,6 @@ python_version=3.8.1 | |
# You may need `xcode-select --install` on OS X | ||
# https://github.com/pyenv/pyenv/issues/451#issuecomment-151336786 | ||
pyenv install -s "${python_version:?}" | ||
if [ "$(uname)" == Darwin ] | ||
then | ||
# Python has needed this in the past when installed by 'pyenv | ||
# install'. The current version of 'psycopg2' seems to require it | ||
# now, but Python complains when it *is* set. 🤦 | ||
CFLAGS="-I$(brew --prefix openssl)/include" | ||
export CFLAGS | ||
LDFLAGS="-L$(brew --prefix openssl)/lib" | ||
export LDFLAGS | ||
fi | ||
Review comment: I tested this and this is no longer needed with |
||
pyenv virtualenv "${python_version:?}" records-mover-"${python_version:?}" || true | ||
pyenv local records-mover-"${python_version:?}" | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1 @@ | ||
88.8400 | ||
88.8500 |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -39,8 +39,19 @@ | |
}, | ||
install_requires=[ | ||
'boto>=2,<3', 'boto3', | ||
'jsonschema', 'timeout_decorator', 'awscli', | ||
'PyYAML', 'psycopg2', | ||
Review comment: See notes in INSTALL.md for why I think dropping |
||
'jsonschema', 'timeout_decorator', | ||
'awscli>=1,<2', | ||
# awscli pins PyYAML below 5.3 so they can maintain support | ||
# for old versions of Python. This can cause issues at | ||
# run-time if we don't constrain things here as well, as a | ||
# newer version seems to sneak in: | ||
# | ||
# pkg_resources.ContextualVersionConflict: | ||
# (PyYAML 5.3 (.../lib/python3.7/site-packages), | ||
# Requirement.parse('PyYAML<5.3,>=3.10'), {'awscli'}) | ||
# | ||
# https://github.com/aws/aws-cli/blob/develop/setup.py | ||
'PyYAML<5.3', | ||
# sqlalchemy-vertica-python 0.5.5 introduced | ||
# https://github.com/bluelabsio/sqlalchemy-vertica-python/pull/7 | ||
# which fixed a bug pulling schema information from Vertica | ||
|
@@ -64,7 +75,9 @@ | |
], | ||
extras_require={ | ||
'gsheets': gsheet_dependencies, | ||
'movercli': gsheet_dependencies + ['typing_inspect', 'docstring_parser', | ||
'movercli': gsheet_dependencies + ['typing_inspect', | ||
'docstring_parser', | ||
'psycopg2-binary', | ||
'pandas<2', | ||
'pyarrow'], | ||
}, | ||
|
Review comment: Whoops! This wasn't the same as the pip install below and it should have been. This may speed up our tests a touch.