Wrappers to streamline data-science tasks using Python toolkits (such as pandas, matplotlib, etc.).
The intent is to improve the ergonomics of data science exploration and implementation by simplifying repetitive tasks, such as checking data types (i.e., data quality), filling in subplots procedurally, and accessing the local filesystem and Amazon S3 through a consistent interface.
Capabilities (non-exhaustive list):
- smallmatter.pathlib.Path2: pathlib-compatible interface to abstract certain single-file operations on the local filesystem and Amazon S3.
- smallmatter.pathlib.S3Path: pathlib-compatible interface to abstract certain single-file operations on Amazon S3.
- smallmatter.sm: utilities for Amazon SageMaker.
  - smallmatter.sm.FrameworkProcessor: a prototype to support SageMaker Python processing jobs that accept multiple files. This is done through source_dir, dependencies, requirements.txt, and git_config, similar to the SageMaker estimator APIs. Furthermore, this FrameworkProcessor supports the SageMaker managed framework containers (i.e., MXNet, PyTorch, TensorFlow, Scikit-learn, and XGBoost).

    It aims to give you the familiar workflow of (1) instantiate a processor, then immediately (2) call the run(...) method.

    Here's an example of how to use this FrameworkProcessor class. Right now, the example is a .py file as opposed to .ipynb. Run the Python example using this shell script. You need to update the shell script's S3 prefix and SageMaker execution role, and optionally choose your preferred framework container.

        --s3-prefix <s3://bucket/prefix/sagemaker>
        --role <arn:aws:iam::111122223333:role/service-role/my-amazon-sagemaker-execution-role-1234>
        # Optional: update with your preferred container; it is PyTorch here
        --framework_version 1.6.0 sagemaker.pytorch.estimator.PyTorch

    It slightly changes the processing API by adding a SageMaker Framework estimator, which was done for two purposes: (1) auto-detect the container URI, and (2) re-use the packaging mechanism in the estimator to upload to s3://.../sourcedir.tar.gz.
  - PyTestHelpers: intended to run pytest tests on a git repo with multiple SageMaker sourcedirs.
  - get_sm_execution_role(): an opinionated function to unify fetching the execution role inside and outside of a notebook instance.
- smallmatter.typecheck: check possible dtypes of a CSV file.
  - For each column, list the auto-detected dtypes. Useful for checking data quality (e.g., strings and numbers mixed in the same column).
  - Generate HTML reports of the auto-detected dtypes.
  - CLI to generate those HTML reports.
- smallmatter.ds: for rapid prototyping of some data science tasks.
  - SimpleMatrixPlotter: a simple helper class to fill in subplots one after another.
  - MontagePager: a pager to group and save subplots into multiple montage image files.
  - DFBuilder: a helper class to build a pandas DataFrame incrementally, row by row.
  - json_np_converter: convert NumPy values to JSON-compliant data types.
  - PyExec: typical use case: implement a data dictionary or config file that can mix in Python code to construct certain variables.
- bin/pp.sh: standard template to run pandas-profiling.
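
To illustrate the kind of per-column check that smallmatter.typecheck is described as performing, here is a minimal sketch that uses only the Python standard library. It is not the package's implementation or API: the names guess_type and column_types are hypothetical, and the three-way int/float/str classification is an assumption made for illustration.

```python
import csv
import io
from collections import Counter


def guess_type(value: str) -> str:
    """Classify one CSV cell as 'empty', 'int', 'float', or 'str' (hypothetical helper)."""
    text = value.strip()
    if text == "":
        return "empty"
    # Try the narrowest numeric type first, then fall back to wider ones.
    for name, caster in (("int", int), ("float", float)):
        try:
            caster(text)
            return name
        except ValueError:
            continue
    return "str"


def column_types(csv_file) -> dict:
    """Map each column name to a Counter of the cell types seen in that column."""
    reader = csv.DictReader(csv_file)
    counts = {name: Counter() for name in (reader.fieldnames or [])}
    for row in reader:
        for name in counts:
            # Short rows yield None for missing cells; treat them as empty.
            counts[name][guess_type(row[name] or "")] += 1
    return counts


# A tiny in-memory CSV where the 'price' column mixes numbers and a string.
sample = io.StringIO("id,price\n1,9.5\n2,n/a\n3,12\n")
report = column_types(sample)
for name, seen in report.items():
    flag = "  <- more than one type" if len(seen) > 1 else ""
    print(f"{name}: {dict(seen)}{flag}")
```

A column that reports more than one type, such as price above, is a candidate for a closer look before loading the file into pandas with a fixed dtype.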
pip install \
'git+https://github.com/aws-samples/[email protected]#egg=smallmatter'
# Install extras capability; need the single quotes.
pip install \
'git+https://github.com/aws-samples/[email protected]#egg=smallmatter[all]'
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.