Deal with, document dependencies issue, add first README.md (#22)
* Add README.md, INSTALL.md and move psycopg2 to mover-cli extra

* Add psycopg2-binary to movercli extra

* deps-v2 -> deps-v1

* Make PyYAML constraint transitive
vinceatbluelabs authored Feb 26, 2020
1 parent 058decb commit b1997bb
Showing 7 changed files with 163 additions and 24 deletions.
2 changes: 1 addition & 1 deletion .circleci/config.yml
@@ -24,7 +24,7 @@ commands:
. venv/bin/activate
# venv/ dir doesn't seem to save enough info to keep the
# editable installation
-pip install --progress-bar=off -e .
+pip install --progress-bar=off -e '.[movercli]'
else
python -m venv venv
. venv/bin/activate
60 changes: 60 additions & 0 deletions INSTALL.md
@@ -0,0 +1,60 @@
# Installing Records Mover

You can install records-mover with the following 'extras':

* `pip3 install records-mover` - Install the minimal version, without
  `pandas` (needed only for local data copy), `psycopg2` (needed for
  Redshift or PostgreSQL connections) or `pyarrow` (needed for local
  Parquet manipulation).
* `pip3 install 'records-mover[gsheets]'` - Minimal install plus the
  API libraries needed to access Google Sheets.
* `pip3 install 'records-mover[movercli]'` - Install everything, with
  assumptions suitable for running `mvrec` on the command line.
  Installs `pandas`, `psycopg2-binary` and `pyarrow`.

Don't use the `movercli` extra if you plan to use records-mover as a
library, because of the `psycopg2-binary` threading risk described below.

## Why this is complicated

Records mover relies on a number of external libraries. Here are some
things to keep in mind when using `pip install`:

### pandas

Only when installing with `pip3 install 'records-mover[movercli]'`
will you get pandas installed by default.

Pandas is a large dependency, needed only when records mover must
process data locally. If you use cloud-native import/export
functionality exclusively, you don't need it and can avoid the bloat.
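As a rough sketch of what that means in practice (the helper below is hypothetical, not part of the records-mover API): code that needs local data processing can check for pandas up front and fail with a useful message instead of an `ImportError` deep inside a copy.

```python
import importlib.util


def pandas_available() -> bool:
    """Report whether pandas is importable, without importing it."""
    return importlib.util.find_spec("pandas") is not None


if not pandas_available():
    # Hypothetical guidance message; adjust the extra to your use case.
    print("pandas is not installed; local data copy is unavailable. "
          "Try: pip3 install 'records-mover[movercli]'")
```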

### psycopg2

psycopg2 is a library used for access to both Redshift and PostgreSQL databases.

The project is
[dealing](https://www.postgresql.org/message-id/CA%2Bmi_8bd6kJHLTGkuyHSnqcgDrJ1uHgQWvXCKQFD3tPQBUa2Bw%40mail.gmail.com)
[with](https://www.psycopg.org/articles/2018/02/08/psycopg-274-released/)
a thorny compatibility issue involving native code and threading.
They've published three separate packages to PyPI as a result:

* `psycopg2` - requires local compilation, so you need build tools and
  possibly extra configuration. This makes it the hardest one to
  install.
* `psycopg2-binary` - pre-compiled version that might have threading
issues if you try to use it in a multi-threaded environment with
other code that might be using libssl from a different source.
* `psycopg2cffi` - the version to use if you run under `pypy`.

If you are using the `mvrec` command line only, you can run `pip3
install 'records-mover[movercli]'`, which simply uses `psycopg2-binary`.
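If you're writing code that must tolerate any of the three variants, one possible pattern (a sketch, not records-mover code; `compat.register()` is psycopg2cffi's documented shim for making `import psycopg2` resolve) is to try them in order:

```python
def load_psycopg2():
    """Return a usable psycopg2 module, or None if no variant is installed."""
    try:
        # Both 'psycopg2' (source) and 'psycopg2-binary' install the
        # same 'psycopg2' module, so a single import covers both.
        import psycopg2
        return psycopg2
    except ImportError:
        pass
    try:
        # On pypy, psycopg2cffi registers itself under the psycopg2 name.
        from psycopg2cffi import compat
        compat.register()
        import psycopg2
        return psycopg2
    except ImportError:
        return None


driver = load_psycopg2()
```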

### pyarrow

`pyarrow` is a Python wrapper around the Apache Arrow native library.
It's used by records mover to manipulate Parquet files locally. The
Apache Arrow native library can require build tools to install and is
large; if you don't need to deal with Parquet files in the local
environment you can work without it.
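As one hypothetical example of treating `pyarrow` as optional (assuming only the standard `pyarrow.parquet.read_table` API), a helper can degrade gracefully when the library is missing:

```python
def read_parquet_head(path, n=5):
    """Return the first n rows of a local Parquet file as a pyarrow Table,
    or None when the optional pyarrow dependency is not installed."""
    try:
        import pyarrow.parquet as pq
    except ImportError:
        return None
    return pq.read_table(path).slice(0, n)
```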
86 changes: 85 additions & 1 deletion README.md
@@ -1 +1,85 @@
-# Records Mover
+# Records Mover - mvrec

Records Mover is a command-line tool and Python library you can
use to move relational data from one place to another.

Relational data here means anything roughly "rectangular", with
columns and rows - a CSV file, for example. It supports reading and
writing data in:

* Databases, including using native high-speed methods of
import/export of bulk data. Redshift and Vertica are
well-supported, with some support for BigQuery and PostgreSQL.
* Google Sheets
* Pandas DataFrames
* CSV files, either alone or in a records directory - a structured
directory of CSV/Parquet/etc files containing some JSON metadata
about their format and origins. Records directories are especially
helpful for the ever-ambiguous CSV format, where they solve the
problem of 'hey, this may be a CSV - but what's the schema? What's
the format of the CSV itself? How is it escaped?'

Records mover can be extended to handle additional database
and data file types by building on top of their
[SQLAlchemy](https://www.sqlalchemy.org/) drivers, and is able to
auto-negotiate the most efficient way of moving data from one to the
other.

Example CLI use:

```sh
pip3 install 'records_mover[movercli]'
mvrec --help
mvrec table2table mydb1 myschema1 mytable1 mydb2 myschema2 mytable2
```

For more installation notes, see [INSTALL.md](./INSTALL.md)

Note that the connection details for the database names here must be
configured using
[db-facts](https://github.com/bluelabsio/db-facts/blob/master/CONFIGURATION.md).

Example Python library use:

First, install records_mover. We'll also use Pandas, so we'll install
that, too:

```sh
pip3 install records_mover pandas
```

Now we can run this code:

```python
#!/usr/bin/env python3

# Pull in the records mover library - be sure to run the pip install above first!
from records_mover import Session
from pandas import DataFrame

session = Session()
records = session.records

# This is a SQLAlchemy database engine.
#
# You can instead call session.get_db_engine('cred name').
#
# On your laptop, 'cred name' is the same thing passed to dbcli (mapping to something in LastPass).
#
# In Airflow, 'cred name' maps to the connection ID in the admin Connections UI.
#
# Or you can build your own and pass it in!
db_engine = session.get_default_db_engine()

df = DataFrame.from_dict([{'a': 1}]) # or make your own!

source = records.sources.dataframe(df=df)
target = records.targets.table(schema_name='myschema',
table_name='mytable',
db_engine=db_engine)
results = records.move(source, target)
```

When moving data, the sources supported can be found
[here](./records_mover/records/sources/factory.py), and the
targets supported can be found [here](./records_mover/records/targets/factory.py).
10 changes: 0 additions & 10 deletions deps.sh
@@ -11,16 +11,6 @@ python_version=3.8.1
# You may need `xcode-select --install` on OS X
# https://github.com/pyenv/pyenv/issues/451#issuecomment-151336786
pyenv install -s "${python_version:?}"
-if [ "$(uname)" == Darwin ]
-then
-# Python has needed this in the past when installed by 'pyenv
-# install'. The current version of 'psycopg2' seems to require it
-# now, but Python complains when it *is* set. 🤦
-CFLAGS="-I$(brew --prefix openssl)/include"
-export CFLAGS
-LDFLAGS="-L$(brew --prefix openssl)/lib"
-export LDFLAGS
-fi
pyenv virtualenv "${python_version:?}" records-mover-"${python_version:?}" || true
pyenv local records-mover-"${python_version:?}"

2 changes: 1 addition & 1 deletion metrics/mypy_high_water_mark
@@ -1 +1 @@
-88.8400
+88.8500
8 changes: 0 additions & 8 deletions requirements.txt
@@ -2,14 +2,6 @@
setuptools>34.3.0
wheel
twine
-#
-# awscli seems to artifically limit the max version of PyYAML:
-#
-# https://github.com/aws/aws-cli/pull/4403/files
-#
-# pip._vendor.pkg_resources.ContextualVersionConflict: (PyYAML 5.2 (/home/circleci/project/venv/lib/python3.6/site-packages), Requirement.parse('PyYAML<5.2,>=3.10; python_version != "2.6" and python_version != "3.3"'), {'awscli'})
-#
-PyYAML<5.2,>3.10
flake8
nose
nose-progressive
19 changes: 16 additions & 3 deletions setup.py
@@ -39,8 +39,19 @@
},
install_requires=[
'boto>=2,<3', 'boto3',
-'jsonschema', 'timeout_decorator', 'awscli',
-'PyYAML', 'psycopg2',
+'jsonschema', 'timeout_decorator',
+'awscli>=1,<2',
+# awscli pins PyYAML below 5.3 so they can maintain support
+# for old versions of Python. This can cause issues at
+# run-time if we don't constrain things here as well, as a
+# newer version seems to sneak in:
+#
+# pkg_resources.ContextualVersionConflict:
+#   (PyYAML 5.3 (.../lib/python3.7/site-packages),
+#   Requirement.parse('PyYAML<5.3,>=3.10'), {'awscli'})
+#
+# https://github.com/aws/aws-cli/blob/develop/setup.py
+'PyYAML<5.3',
# sqlalchemy-vertica-python 0.5.5 introduced
# https://github.com/bluelabsio/sqlalchemy-vertica-python/pull/7
# which fixed a bug pulling schema information from Vertica
@@ -64,7 +75,9 @@
],
extras_require={
'gsheets': gsheet_dependencies,
-'movercli': gsheet_dependencies + ['typing_inspect', 'docstring_parser',
+'movercli': gsheet_dependencies + ['typing_inspect',
+'docstring_parser',
+'psycopg2-binary',
'pandas<2',
'pyarrow'],
},
