Deal with, document dependencies issue, add first README.md #22

Merged (13 commits) on Feb 26, 2020
2 changes: 1 addition & 1 deletion .circleci/config.yml
@@ -24,7 +24,7 @@ commands:
. venv/bin/activate
# venv/ dir doesn't seem to save enough info to keep the
# editable installation
pip install --progress-bar=off -e .
pip install --progress-bar=off -e '.[movercli]'
Contributor Author:

Whoops! This wasn't the same as the pip install below and it should have been. This may speed up our tests a touch.

else
python -m venv venv
. venv/bin/activate
60 changes: 60 additions & 0 deletions INSTALL.md
@@ -0,0 +1,60 @@
# Installing Records Mover

You can install records-mover with the following 'extras':

* `pip3 install records-mover` - Install minimal version, not
including `pandas` (needed only for local data copy), `psycopg2`
(needed for Redshift or PostgreSQL connections) or `pyarrow` (needed
for local Parquet manipulation).
* `pip3 install records-mover[gsheets]` - Minimal install plus API
libraries to access Google Sheets.
* `pip3 install records-mover[mover-cli]` - Install everything and
Contributor:

Is it mover-cli or movercli? I’ve seen it both ways in the docs (will flag)

Contributor Author:

movercli. Where else did you see mover-cli in the docs?

make assumptions compatible with using mvrec on the command line.
Installs `pandas`, `psycopg2-binary` and `pyarrow`.

Don't use this extra if you plan on using records-mover as a library,
because of the `psycopg2-binary` risk described below.
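If you're not sure which of these optional dependencies your environment already has, a quick runtime check can tell you (a minimal sketch using only the standard library; the module names checked are the optional dependencies discussed below, not part of records-mover's own API):

```python
import importlib.util

def has_module(name: str) -> bool:
    """Return True if the named module can be imported in this environment."""
    return importlib.util.find_spec(name) is not None

# Optional dependencies discussed in this document:
# pandas (local data copy), psycopg2 (Redshift/PostgreSQL),
# pyarrow (local Parquet manipulation)
for optional in ["pandas", "psycopg2", "pyarrow"]:
    status = "present" if has_module(optional) else "missing"
    print(f"{optional}: {status}")
```

This avoids discovering a missing extra only when an import fails deep inside a data move.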

## Why this is complicated

Records mover relies on a number of external libraries. Here are some
things to keep in mind when using `pip install`:

### pandas

Only when installing with `pip3 install 'records-mover[movercli]'`
will you get pandas installed by default.

Pandas is a large dependency, needed only when records mover must
process data locally. If you are using cloud-native import/export
functionality only, you shouldn't need it and can avoid the bloat.

### psycopg2

psycopg2 is a library used for access to both Redshift and PostgreSQL databases.

The project is
[dealing](https://www.postgresql.org/message-id/CA%2Bmi_8bd6kJHLTGkuyHSnqcgDrJ1uHgQWvXCKQFD3tPQBUa2Bw%40mail.gmail.com)
[with](https://www.psycopg.org/articles/2018/02/08/psycopg-274-released/)
a thorny compatibility issue with native code and threading. They've
published three separate versions of their library to PyPI as a
result:

* `psycopg2` - requires local compilation, and as such you need certain
tools and maybe configuration set up. This is the hardest one to
install as a result.
* `psycopg2-binary` - pre-compiled version that might have threading
issues if you try to use it in a multi-threaded environment with
other code that might be using libssl from a different source.
* `psycopg2cffi` - The version to use if you use `pypy`

If you are using the mvrec command line only, you can use `pip3
install 'records-mover[movercli]'` and it just uses `psycopg2-binary`.
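For library code that must work with whichever variant is installed, one common pattern is to fall back to psycopg2cffi's compatibility shim, which registers itself under the `psycopg2` name. (This is an illustrative sketch, not part of records-mover itself; the `load_psycopg2` helper name is made up for this example.)

```python
def load_psycopg2():
    """Return a usable psycopg2 module, or None if no variant is installed.

    Works with psycopg2 and psycopg2-binary (both import as 'psycopg2'),
    and with psycopg2cffi (which registers itself under the psycopg2 name).
    """
    try:
        import psycopg2
        return psycopg2
    except ImportError:
        pass
    try:
        from psycopg2cffi import compat
        compat.register()  # after this, 'import psycopg2' resolves to psycopg2cffi
        import psycopg2
        return psycopg2
    except ImportError:
        return None
```

Under pypy, only the psycopg2cffi branch will succeed; under CPython, the first import usually wins.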

### pyarrow

`pyarrow` is a Python wrapper around the Apache Arrow native library.
Contributor:

Hmm... is it worth making a mover-cli target that doesn’t include this? I guess it might depend on how hard it is to install these libraries in general. Should we maybe include steps for homebrew and whatever Linux distro we have already figured out for CI/docker?

Contributor Author (@vinceatbluelabs, Feb 26, 2020):

> Hmm... is it worth making a mover-cli target that doesn’t include this

No objections here.

> Should we maybe include steps for homebrew and whatever Linux distro we have already figured out for CI/docker?

That'd be great! I don't have anything off-the-shelf to drop in. Same feeling about not holding up this PR, but in case I wasn't super clear about it: having space for those sorts of instructions is exactly why I added this file.

It's used by records mover to manipulate Parquet files locally. The
Apache Arrow native library can require build tools to install and is
large; if you don't need to deal with Parquet files in the local
environment you can work without it.
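Because the extra is optional, library code that touches Parquet can guard the import and fail with an actionable message instead of a bare `ImportError` (an illustrative sketch only; records-mover's own error handling may differ, and `read_parquet_table` is a made-up helper name):

```python
def read_parquet_table(path):
    """Read a Parquet file with pyarrow, failing with an actionable message
    when pyarrow is not installed."""
    try:
        import pyarrow.parquet as pq
    except ImportError as exc:
        raise RuntimeError(
            "pyarrow is required for local Parquet manipulation; "
            "install it with: pip3 install 'records-mover[movercli]'"
        ) from exc
    return pq.read_table(path)
```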
86 changes: 85 additions & 1 deletion README.md
@@ -1 +1,85 @@
# Records Mover
# Records Mover - mvrec

Records Mover is a command-line tool and Python library you can
use to move relational data from one place to another.

Relational data here means anything roughly "rectangular" - with
columns and rows. For example, it supports reading and writing
data in:

* Databases, including using native high-speed methods of
import/export of bulk data. Redshift and Vertica are
well-supported, with some support for BigQuery and PostgreSQL.
* Google Sheets
* Pandas DataFrames
* CSV files, either alone or in a records directory - a structured
directory of CSV/Parquet/etc files containing some JSON metadata
about their format and origins. Records directories are especially
helpful for the ever-ambiguous CSV format, where they solve the
problem of 'hey, this may be a CSV - but what's the schema? What's
the format of the CSV itself? How is it escaped?'

Records Mover can be extended to handle additional database
and data file types by building on top of their
[SQLAlchemy](https://www.sqlalchemy.org/) drivers, and is able to
auto-negotiate the most efficient way of moving data from one to the
other.

Example CLI use:

```sh
pip3 install 'records_mover[movercli]'
mvrec --help
mvrec table2table mydb1 myschema1 mytable1 mydb2 myschema2 mytable2
```

For more installation notes, see [INSTALL.md](./INSTALL.md)

Note that the connection details for the database names here must be
configured using
[db-facts](https://github.com/bluelabsio/db-facts/blob/master/CONFIGURATION.md).

Example Python library use:

First, install records_mover. We'll also use Pandas, so we'll install
that, too:

```sh
pip3 install records_mover pandas
```

Now we can run this code:

```python
#!/usr/bin/env python3

# Pull in the records mover library - be sure to run the pip install above first!
from records_mover import Session
from pandas import DataFrame

session = Session()
records = session.records

# This is a SQLAlchemy database engine.
#
# You can instead call session.get_db_engine('cred name').
Contributor:

Some old uses of job_context in here

Contributor Author:

Fixed; that's all that grep found.

#
# On your laptop, 'cred name' is the same thing passed to dbcli (mapping to something in LastPass).
#
# In Airflow, 'cred name' maps to the connection ID in the admin Connections UI.
#
# Or you can build your own and pass it in!
db_engine = session.get_default_db_engine()

df = DataFrame.from_dict([{'a': 1}]) # or make your own!

source = records.sources.dataframe(df=df)
target = records.targets.table(schema_name='myschema',
table_name='mytable',
db_engine=db_engine)
results = records.move(source, target)
```

When moving data, the sources supported can be found
[here](./records_mover/records/sources/factory.py), and the
targets supported can be found [here](./records_mover/records/targets/factory.py).
10 changes: 0 additions & 10 deletions deps.sh
@@ -11,16 +11,6 @@ python_version=3.8.1
# You may need `xcode-select --install` on OS X
# https://github.com/pyenv/pyenv/issues/451#issuecomment-151336786
pyenv install -s "${python_version:?}"
if [ "$(uname)" == Darwin ]
then
# Python has needed this in the past when installed by 'pyenv
# install'. The current version of 'psycopg2' seems to require it
# now, but Python complains when it *is* set. 🤦
CFLAGS="-I$(brew --prefix openssl)/include"
export CFLAGS
LDFLAGS="-L$(brew --prefix openssl)/lib"
export LDFLAGS
fi
Contributor Author:

I tested this and this is no longer needed with psycopg2-binary!

pyenv virtualenv "${python_version:?}" records-mover-"${python_version:?}" || true
pyenv local records-mover-"${python_version:?}"

2 changes: 1 addition & 1 deletion metrics/mypy_high_water_mark
@@ -1 +1 @@
88.8400
88.8500
8 changes: 0 additions & 8 deletions requirements.txt
@@ -2,14 +2,6 @@
setuptools>34.3.0
wheel
twine
#
# awscli seems to artifically limit the max version of PyYAML:
#
# https://github.com/aws/aws-cli/pull/4403/files
#
# pip._vendor.pkg_resources.ContextualVersionConflict: (PyYAML 5.2 (/home/circleci/project/venv/lib/python3.6/site-packages), Requirement.parse('PyYAML<5.2,>=3.10; python_version != "2.6" and python_version != "3.3"'), {'awscli'})
#
PyYAML<5.2,>3.10
flake8
nose
nose-progressive
19 changes: 16 additions & 3 deletions setup.py
@@ -39,8 +39,19 @@
},
install_requires=[
'boto>=2,<3', 'boto3',
'jsonschema', 'timeout_decorator', 'awscli',
'PyYAML', 'psycopg2',
Contributor Author:

See notes in INSTALL.md for why I think dropping psycopg2 here is the right change.

'jsonschema', 'timeout_decorator',
'awscli>=1,<2',
# awscli pins PyYAML below 5.3 so they can maintain support
# for old versions of Python. This can cause issues at
# run-time if we don't constrain things here as well, as a
# newer version seems to sneak in:
#
# pkg_resources.ContextualVersionConflict:
# (PyYAML 5.3 (.../lib/python3.7/site-packages),
# Requirement.parse('PyYAML<5.3,>=3.10'), {'awscli'})
#
# https://github.com/aws/aws-cli/blob/develop/setup.py
'PyYAML<5.3',
# sqlalchemy-vertica-python 0.5.5 introduced
# https://github.com/bluelabsio/sqlalchemy-vertica-python/pull/7
# which fixed a bug pulling schema information from Vertica
@@ -64,7 +75,9 @@
],
extras_require={
'gsheets': gsheet_dependencies,
'movercli': gsheet_dependencies + ['typing_inspect', 'docstring_parser',
'movercli': gsheet_dependencies + ['typing_inspect',
'docstring_parser',
'psycopg2-binary',
'pandas<2',
'pyarrow'],
},