This repository provides scripts and notebooks that make it easy to export data in bulk from CourtListener's freely available downloads.
- Create a first version of the notebook suitable for data scientists
- Define appropriate dtypes to reduce pandas memory usage
- Load only the necessary columns with `usecols`; for example, the 'created_by' date field, which only records a database insert, isn't needed
- Read opinions.csv (190+ GB) from disk one chunk at a time while converting it to JSON
- Create a standalone script that can be piped to other tools
- Create a PyPI library packaged with Poetry
- Have the script output the JSON Lines format
- Improve speed by using a Dask DataFrame
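
The chunked-read items above (explicit dtypes, `usecols`, chunk-at-a-time conversion, JSON Lines output) could be combined roughly as follows. This is a minimal sketch, not the repository's actual code: the function name `csv_to_jsonl`, the column names `id` and `plain_text`, the dtypes, and the tiny in-memory sample are all assumptions made for illustration.

```python
import io
import json

import pandas as pd


def csv_to_jsonl(src, dst, usecols, dtypes, chunksize=10_000):
    """Stream a CSV into JSON Lines one chunk at a time; return rows written.

    Hypothetical helper: `src` is a path or file-like object, `dst` is a
    writable text stream.
    """
    rows = 0
    # chunksize makes read_csv return an iterator of DataFrames, so only one
    # chunk is held in memory at a time.
    reader = pd.read_csv(src, usecols=usecols, dtype=dtypes, chunksize=chunksize)
    for chunk in reader:
        # orient="records" with lines=True emits one JSON object per row.
        text = chunk.to_json(orient="records", lines=True)
        # Normalize the trailing newline so chunks concatenate cleanly.
        dst.write(text.rstrip("\n") + "\n")
        rows += len(chunk)
    return rows


# Tiny stand-in for opinions.csv (assumed column names, made-up rows).
sample = io.StringIO(
    "id,created_by,plain_text\n"
    "1,2020-01-01,First opinion\n"
    "2,2020-01-02,Second opinion\n"
    "3,2020-01-03,Third opinion\n"
)
out = io.StringIO()
written = csv_to_jsonl(
    sample,
    out,
    usecols=["id", "plain_text"],  # 'created_by' is skipped on load
    dtypes={"id": "int32", "plain_text": "string"},
    chunksize=2,  # deliberately small to exercise multiple chunks
)
records = [json.loads(line) for line in out.getvalue().splitlines()]
```

For the standalone-script item, passing `sys.stdout` as `dst` would let the output be piped to other tools; a real run would use a much larger `chunksize` than the one shown here.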