Clean up README #13

Merged
merged 9 commits into from
Oct 23, 2024
109 changes: 46 additions & 63 deletions README.md
@@ -1,47 +1,47 @@
# CHAI

This is an attempt at an open-source data pipeline for package managers. The goal is to
have a pipeline that can run on any package manager and provide a normalized data
source for different use cases.
CHAI is an attempt at an open-source data pipeline for package managers. The
goal is to have a pipeline that can use the data from any package manager and
provide a normalized data source for a myriad of different use cases.

1. db: postgres to store package specific data
1. alembic: for running migrations
1. pipeline: which fetches and writes data
## Getting Started

## Requirements
Use [Docker](https://docker.com)

- docker
1. Run `docker compose build` to create the latest Docker images.
2. Then, run `docker compose up` to launch.

> [!TIP]
> [!NOTE]
> This will run CHAI for all package managers. As an example, crates by
> itself will take over an hour and consume >5GB of storage.
>
> for local development, all the requirements are within the [pkgx yaml](pkgx.yaml) file
> To run only a specific backend, comment out the others in `docker-compose.yml`.

## Setup

1. Run `docker compose build` to create the latest Docker images.
2. Run `docker compose up` to launch.
<!-- we'd like to change the above to be more friendly to users trying to run a specific
pipeline -->

### Arguments

- `PKG_MANAGER`: which package manager the pipeline will be run for. Currently, the
supported values are:
- `crates`
- `FREQUENCY`: how frequently **(in hours)** the pipeline will run. Defaults to `24`
Specify these, e.g. `docker compose -e FOO=bar up`:

- `FREQUENCY`: how frequently **(in hours)** the pipeline will run
(defaults to `24`)
- `FETCH`: whether the pipeline will fetch the data. Defaults to `true`
- `DEBUG`: whether the pipeline will run in debug mode. Defaults to `true`

These arguments are all configurable in the `docker-compose.yml` file.
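
As a rough illustration of how a pipeline process could read these arguments from its environment, here is a minimal Python sketch. The function names `parse_bool` and `read_settings` and the set of accepted truthy strings are assumptions made for this example, not taken from the CHAI codebase; only the variable names and their defaults (`FREQUENCY=24`, `FETCH=true`, `DEBUG=true`) come from the list above.

```python
import os
from typing import Mapping


def parse_bool(raw: str) -> bool:
    """Interpret common truthy strings; anything else counts as False (assumed convention)."""
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def read_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read the documented arguments, falling back to the documented defaults."""
    return {
        "frequency_hours": int(env.get("FREQUENCY", "24")),  # hours between runs
        "fetch": parse_bool(env.get("FETCH", "true")),        # whether to fetch data
        "debug": parse_bool(env.get("DEBUG", "true")),        # whether to run in debug mode
    }


# Hypothetical usage: override two values and keep the default for FETCH.
settings = read_settings({"FREQUENCY": "12", "DEBUG": "false"})
print(settings)
```

Passing the environment in as a mapping keeps the function easy to exercise without touching the real process environment.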

## Hard Reset
### Docker Services Overview

If at all you need to do a hard reset, here's the steps
1. `db`: [PostgreSQL] database for the reduced package data
2. `alembic`: handles migrations
3. `pipeline`: fetches and writes data
Review comment (Member):

should add

4. `api`: a simple REST api for reading from the db


1. `rm -rf data`: removes all the data the fetcher is putting
2. `docker system prune -a -f --volumes`: removes **everything** docker-related
### Hard Reset

> [!WARNING]
>
> step 4 deletes all your docker stuff...be careful
Stuff happens. Start over:

`rm -rf ./data`: removes all the data the fetcher has written.

<!-- this is handled now that alembic/psycopg2 are in pkgx -->
<!--
@@ -51,45 +51,25 @@ If at all you need to do a hard reset, here's the steps
- vapor's migrations are written in swift
-->

## Usage
## Goals

Our goal is to build a data schema that looks like this:

![db/CHAI_ERD.png](db/CHAI_ERD.png)

Our specific application extracts the dependency graph to understand which pieces of the
open-source graph are critical. There are many other potential use cases for this data:

- license compatibility checker
- developer publications
- package popularity
- dependency analysis vulnerability tool (requires translating semver)

<!-- TODO: add these to the examples folder-->
Our specific application extracts the dependency graph to understand which
pieces of the open-source graph are critical. There are many other potential
use cases for this data:

### license compatibility checker
- License compatibility checker
- Developer publications
- Package popularity
- Dependency analysis vulnerability tool (requires translating semver)

> [!WARNING]
>
> it's probably better to start with a global list of licenses and then map each
> version's to the global list...but this isn't part of v1

```sql
SELECT DISTINCT
p.name,
l.name AS license,
dep.name AS dependency,
dep_l.name AS dependency_license
FROM packages p
JOIN versions v ON p.id = v.package_id
JOIN dependencies d ON v.id = d.version_id
JOIN packages dep ON d.dependency_id = dep.id
JOIN licenses l ON v.license_id = l.id
JOIN versions dep_v ON dep.id = dep_v.package_id
JOIN licenses dep_l ON dep_v.license_id = dep_l.id
```
> [!TIP]
> Help us add the above to the examples folder.

### package popularity
### Package Popularity

```sql
SELECT p.name, SUM(v.downloads) as total_downloads
@@ -100,7 +80,7 @@ ORDER BY total_downloads DESC
LIMIT 10;
```

### developer publications
### Developer Publications

```sql
SELECT u.username, p.name, COUNT(uv.id) as publications
@@ -112,15 +92,15 @@ GROUP BY u.username, p.name
ORDER BY p.name;
```

## FAQs / common issues
## FAQs / Common Issues

1. The database url is `postgresql://postgres:s3cr3t@localhost:5435/chai`, and is used
as `CHAI_DATABASE_URL` in the environment.
1. The database URL is `postgresql://postgres:s3cr3t@localhost:5435/chai`, and
   is used as `CHAI_DATABASE_URL` in the environment.
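
To make the pieces of that URL explicit, the following Python sketch splits it into its components using only the standard library. The helper name `database_parts` is hypothetical, and falling back to the URL above when `CHAI_DATABASE_URL` is unset is an assumption for this example rather than documented CHAI behavior.

```python
import os
from urllib.parse import urlsplit

# URL quoted in the FAQ entry above; used here as a fallback (assumption).
DEFAULT_URL = "postgresql://postgres:s3cr3t@localhost:5435/chai"


def database_parts(url: str) -> dict:
    """Split a PostgreSQL connection URL into its named components."""
    parts = urlsplit(url)
    return {
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,                   # parsed as an int
        "dbname": parts.path.lstrip("/"),     # path is "/chai", database is "chai"
    }


url = os.environ.get("CHAI_DATABASE_URL", DEFAULT_URL)
print(database_parts(url)["host"])
```

Note the non-default port `5435` in the URL above, which is easy to miss when connecting with a client that assumes PostgreSQL's usual `5432`.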

## tasks
## Tasks

These are tasks that can be run using xcfile.dev. if you have pkgx, just run `dev` to
inject into your environment. if you don't...go get it.
These are tasks that can be run using [xcfile.dev]. If you use `pkgx`, typing
`dev` loads the environment. Alternatively, run them manually.

### reset

@@ -235,3 +215,6 @@ Refreshes table knowledge from the db.
```sh
docker-compose restart api
```

[PostgreSQL]: https://www.postgresql.org
[`pkgx`]: https://pkgx.sh
Comment on lines +220 to +221 (Member):

strong backlinking