Deal with, document dependencies issue, add first README.md (#22)
* Add README.md, INSTALL.md and move psycopg2 to mover-cli extra

* Add psycopg2-binary to movercli extra

* deps-v2 -> deps-v1

* Make PyYAML constraint transitive
vinceatbluelabs authored Feb 26, 2020
1 parent 058decb commit b1997bb
Showing 7 changed files with 163 additions and 24 deletions.
2 changes: 1 addition & 1 deletion .circleci/config.yml
@@ -24,7 +24,7 @@ commands:
. venv/bin/activate
# venv/ dir doesn't seem to save enough info to keep the
# editable installation
-pip install --progress-bar=off -e .
+pip install --progress-bar=off -e '.[movercli]'
else
python -m venv venv
. venv/bin/activate
60 changes: 60 additions & 0 deletions INSTALL.md
@@ -0,0 +1,60 @@
# Installing Records Mover

You can install records-mover with the following 'extras':

* `pip3 install records-mover` - Install the minimal version, without
  `pandas` (needed only for local data copy), `psycopg2` (needed for
  Redshift or PostgreSQL connections) or `pyarrow` (needed for local
  Parquet manipulation).
* `pip3 install 'records-mover[gsheets]'` - Minimal install plus the
  API libraries needed to access Google Sheets.
* `pip3 install 'records-mover[movercli]'` - Install everything, with
  assumptions suitable for running `mvrec` on the command line.
  Installs `pandas`, `psycopg2-binary` and `pyarrow`.

Don't use the `movercli` extra if you plan to use records-mover as a
library, because of the `psycopg2-binary` threading risk described below.

## Why this is complicated

Records mover relies on a number of external libraries. Here are some
things to keep in mind when using `pip install`:

### pandas

Only when installing with `pip3 install 'records-mover[movercli]'`
will you get pandas installed by default.

Pandas is a large dependency, needed only when records mover must
process data locally. If you use cloud-native import/export
functionality exclusively, you don't need it and can avoid the bloat.
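As a rough sketch of what that means in practice (the helper below is hypothetical, not part of the records-mover API): code that needs local data processing can check for pandas up front and fail with a useful message instead of an `ImportError` deep inside a copy.

```python
import importlib.util


def pandas_available() -> bool:
    """Report whether pandas is importable, without importing it."""
    return importlib.util.find_spec("pandas") is not None


if not pandas_available():
    # Hypothetical guidance message; adjust the extra to your use case.
    print("pandas is not installed; local data copy is unavailable. "
          "Try: pip3 install 'records-mover[movercli]'")
```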

### psycopg2

psycopg2 is a library used for access to both Redshift and PostgreSQL databases.

The project is
[dealing](https://www.postgresql.org/message-id/CA%2Bmi_8bd6kJHLTGkuyHSnqcgDrJ1uHgQWvXCKQFD3tPQBUa2Bw%40mail.gmail.com)
[with](https://www.psycopg.org/articles/2018/02/08/psycopg-274-released/)
a thorny compatibility issue involving native code and threading.
They've published three separate packages to PyPI as a result:

* `psycopg2` - requires local compilation, so you need build tools and
  possibly extra configuration. This makes it the hardest one to
  install.
* `psycopg2-binary` - pre-compiled version that might have threading
issues if you try to use it in a multi-threaded environment with
other code that might be using libssl from a different source.
* `psycopg2cffi` - the version to use if you run under `pypy`.

If you are using the `mvrec` command line only, you can run `pip3
install 'records-mover[movercli]'`, which simply uses `psycopg2-binary`.
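If you're writing code that must tolerate any of the three variants, one possible pattern (a sketch, not records-mover code; `compat.register()` is psycopg2cffi's documented shim for making `import psycopg2` resolve) is to try them in order:

```python
def load_psycopg2():
    """Return a usable psycopg2 module, or None if no variant is installed."""
    try:
        # Both 'psycopg2' (source) and 'psycopg2-binary' install the
        # same 'psycopg2' module, so a single import covers both.
        import psycopg2
        return psycopg2
    except ImportError:
        pass
    try:
        # On pypy, psycopg2cffi registers itself under the psycopg2 name.
        from psycopg2cffi import compat
        compat.register()
        import psycopg2
        return psycopg2
    except ImportError:
        return None


driver = load_psycopg2()
```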

### pyarrow

`pyarrow` is a Python wrapper around the Apache Arrow native library.
It's used by records mover to manipulate Parquet files locally. The
Apache Arrow native library can require build tools to install and is
large; if you don't need to deal with Parquet files in the local
environment you can work without it.
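As one hypothetical example of treating `pyarrow` as optional (assuming only the standard `pyarrow.parquet.read_table` API), a helper can degrade gracefully when the library is missing:

```python
def read_parquet_head(path, n=5):
    """Return the first n rows of a local Parquet file as a pyarrow Table,
    or None when the optional pyarrow dependency is not installed."""
    try:
        import pyarrow.parquet as pq
    except ImportError:
        return None
    return pq.read_table(path).slice(0, n)
```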
86 changes: 85 additions & 1 deletion README.md
@@ -1 +1,85 @@
-# Records Mover
+# Records Mover - mvrec

Records Mover is a command-line tool and Python library you can
use to move relational data from one place to another.

Relational data here means anything roughly "rectangular", with
columns and rows - a CSV file, for example. It supports reading and
writing data in:

* Databases, including using native high-speed methods of
import/export of bulk data. Redshift and Vertica are
well-supported, with some support for BigQuery and PostgreSQL.
* Google Sheets
* Pandas DataFrames
* CSV files, either alone or in a records directory - a structured
directory of CSV/Parquet/etc files containing some JSON metadata
about their format and origins. Records directories are especially
helpful for the ever-ambiguous CSV format, where they solve the
problem of 'hey, this may be a CSV - but what's the schema? What's
the format of the CSV itself? How is it escaped?'

Records mover can be extended to handle additional database
and data file types by building on top of their
[SQLAlchemy](https://www.sqlalchemy.org/) drivers, and is able to
auto-negotiate the most efficient way of moving data from one to the
other.

Example CLI use:

```sh
pip3 install 'records_mover[movercli]'
mvrec --help
mvrec table2table mydb1 myschema1 mytable1 mydb2 myschema2 mytable2
```

For more installation notes, see [INSTALL.md](./INSTALL.md)

Note that the connection details for the database names here must be
configured using
[db-facts](https://github.com/bluelabsio/db-facts/blob/master/CONFIGURATION.md).

Example Python library use:

First, install records_mover. We'll also use Pandas, so we'll install
that, too:

```sh
pip3 install records_mover pandas
```

Now we can run this code:

```python
#!/usr/bin/env python3

# Pull in the records mover library - be sure to run the pip install above first!
from records_mover import Session
from pandas import DataFrame

session = Session()
records = session.records

# This is a SQLAlchemy database engine.
#
# You can instead call session.get_db_engine('cred name').
#
# On your laptop, 'cred name' is the same thing passed to dbcli (mapping to something in LastPass).
#
# In Airflow, 'cred name' maps to the connection ID in the admin Connections UI.
#
# Or you can build your own and pass it in!
db_engine = session.get_default_db_engine()

df = DataFrame.from_dict([{'a': 1}]) # or make your own!

source = records.sources.dataframe(df=df)
target = records.targets.table(schema_name='myschema',
table_name='mytable',
db_engine=db_engine)
results = records.move(source, target)
```

When moving data, the sources supported can be found
[here](./records_mover/records/sources/factory.py), and the
targets supported can be found [here](./records_mover/records/targets/factory.py).
10 changes: 0 additions & 10 deletions deps.sh
@@ -11,16 +11,6 @@ python_version=3.8.1
# You may need `xcode-select --install` on OS X
# https://github.com/pyenv/pyenv/issues/451#issuecomment-151336786
pyenv install -s "${python_version:?}"
-if [ "$(uname)" == Darwin ]
-then
-# Python has needed this in the past when installed by 'pyenv
-# install'. The current version of 'psycopg2' seems to require it
-# now, but Python complains when it *is* set. 🤦
-CFLAGS="-I$(brew --prefix openssl)/include"
-export CFLAGS
-LDFLAGS="-L$(brew --prefix openssl)/lib"
-export LDFLAGS
-fi
pyenv virtualenv "${python_version:?}" records-mover-"${python_version:?}" || true
pyenv local records-mover-"${python_version:?}"

2 changes: 1 addition & 1 deletion metrics/mypy_high_water_mark
@@ -1 +1 @@
-88.8400
+88.8500
8 changes: 0 additions & 8 deletions requirements.txt
@@ -2,14 +2,6 @@
setuptools>34.3.0
wheel
twine
-#
-# awscli seems to artifically limit the max version of PyYAML:
-#
-# https://github.com/aws/aws-cli/pull/4403/files
-#
-# pip._vendor.pkg_resources.ContextualVersionConflict: (PyYAML 5.2 (/home/circleci/project/venv/lib/python3.6/site-packages), Requirement.parse('PyYAML<5.2,>=3.10; python_version != "2.6" and python_version != "3.3"'), {'awscli'})
-#
-PyYAML<5.2,>3.10
flake8
nose
nose-progressive
19 changes: 16 additions & 3 deletions setup.py
@@ -39,8 +39,19 @@
},
install_requires=[
'boto>=2,<3', 'boto3',
-'jsonschema', 'timeout_decorator', 'awscli',
-'PyYAML', 'psycopg2',
+'jsonschema', 'timeout_decorator',
+'awscli>=1,<2',
+# awscli pins PyYAML below 5.3 so they can maintain support
+# for old versions of Python. This can cause issues at
+# run-time if we don't constrain things here as well, as a
+# newer version seems to sneak in:
+#
+# pkg_resources.ContextualVersionConflict:
+#   (PyYAML 5.3 (.../lib/python3.7/site-packages),
+#   Requirement.parse('PyYAML<5.3,>=3.10'), {'awscli'})
+#
+# https://github.com/aws/aws-cli/blob/develop/setup.py
+'PyYAML<5.3',
# sqlalchemy-vertica-python 0.5.5 introduced
# https://github.com/bluelabsio/sqlalchemy-vertica-python/pull/7
# which fixed a bug pulling schema information from Vertica
@@ -64,7 +75,9 @@
],
extras_require={
'gsheets': gsheet_dependencies,
-'movercli': gsheet_dependencies + ['typing_inspect', 'docstring_parser',
+'movercli': gsheet_dependencies + ['typing_inspect',
+'docstring_parser',
+'psycopg2-binary',
'pandas<2',
'pyarrow'],
},
