Clean up README #13
Merged
Changes from 8 commits

Commits (9):
- 3b8da88 Clean up README (mxcl)
- 395e6a6 wip (mxcl)
- cda545a wip (mxcl)
- 8860b0b wip (mxcl)
- 990456b wip (mxcl)
- 45c4366 removed the pruning ur docker stuff (sanchitram1)
- 03263fc removed the PKG_MANAGER reference (sanchitram1)
- c3191cb Merge branch 'main' into readme-cleanup (sanchitram1)
- 3cdc3f5 +api (sanchitram1)

@@ -1,47 +1,47 @@
# CHAI

This is an attempt at an open-source data pipeline for package managers. The goal is to
have a pipeline that can run on any package manager and provide a normalized data
source for different use cases.
CHAI is an attempt at an open-source data pipeline for package managers. The
goal is to have a pipeline that can use the data from any package manager and
provide a normalized data source for myriads of different use cases.

1. db: postgres to store package specific data
1. alembic: for running migrations
1. pipeline: which fetches and writes data
## Getting Started

## Requirements
Use [Docker](https://docker.com)

- docker
1. Run `docker compose build` to create the latest Docker images.
2. Then, run `docker compose up` to launch.

> [!TIP]
> [!NOTE]
> This will run CHAI with for all package managers. As an example crates by
> itself will take over an hour and consume >5GB storage.
>
> for local development, all the requirements are within the [pkgx yaml](pkgx.yaml) file
> To run only a specific backend, comment out the others in `docker-compose.yml`.

## Setup

1. Run `docker compose build` to create the latest Docker images.
2. Run `docker compose up` to launch.
<!-- we'd like to change the above to be more friendly to users trying to run a specific
pipeline -->

### Arguments

- `PKG_MANAGER`: which package manager the pipeline will be run for. Currently, the
supported values are:
- `crates`
- `FREQUENCY`: how frequently **(in hours)** the pipeline will run. Defaults to `24`
Specify these eg. `docker compose -e FOO=bar up`:

- `FREQUENCY`: how frequently **(in hours)** the pipeline will run
(defaults to `24`)
- `FETCH`: whether the pipeline will fetch the data. Defaults to `true`
- `DEBUG`: whether the pipeline will run in debug mode. Defaults to `true`

These arguments are all configurable in the `docker-compose.yml` file.

## Hard Reset
### Docker Services Overview

If at all you need to do a hard reset, here's the steps
1. `db`: [PostgreSQL] database for the reduced package data
2. `alembic`: handles migrations
3. `pipeline`: fetches and writes data

1. `rm -rf data`: removes all the data the fetcher is putting
2. `docker system prune -a -f --volumes`: removes **everything** docker-related
### Hard Reset

> [!WARNING]
>
> step 4 deletes all your docker stuff...be careful
Stuff happens. Start over:

`rm -rf ./data`: removes all the data the fetcher is putting.

<!-- this is handled now that alembic/psycopg2 are in pkgx -->
<!--
@@ -51,45 +51,25 @@ If at all you need to do a hard reset, here's the steps
- vapor's migrations are written in swift
-->

## Usage
## Goals

Our goal is to build a data schema that looks like this:

![db/CHAI_ERD.png](db/CHAI_ERD.png)

Our specific application extracts the dependency graph understand what are critical
pieces of the open-source graph. there are many other potential use cases for this data:

- license compatibility checker
- developer publications
- package popularity
- dependency analysis vulnerability tool (requires translating semver)

<!-- TODO: add these to the examples folder-->
Our specific application extracts the dependency graph understand what are
critical pieces of the open-source graph. there are many other potential use
cases for this data:

### license compatibility checker
- License compatibility checker
- Developer publications
- Package popularity
- Dependency analysis vulnerability tool (requires translating semver)

> [!WARNING]
>
> it's probably better to start with a global list of licenses and then map each
> version's to the global list...but this isn't part of v1

```sql
SELECT DISTINCT
p.name,
l.name AS license,
dep.name AS dependency,
dep_l.name AS dependency_license
FROM packages p
JOIN versions v ON p.id = v.package_id
JOIN dependencies d ON v.id = d.version_id
JOIN packages dep ON d.dependency_id = dep.id
JOIN licenses l ON v.license_id = l.id
JOIN versions dep_v ON dep.id = dep_v.package_id
JOIN licenses dep_l ON dep_v.license_id = dep_l.id
```
> [!TIP]
> Help us add the above to the examples folder.

### package popularity
### Package Popularity

```sql
SELECT p.name, SUM(v.downloads) as total_downloads
@@ -100,7 +80,7 @@ ORDER BY total_downloads DESC
LIMIT 10;
```

### developer publications
### Developer Publications

```sql
SELECT u.username, p.name, COUNT(uv.id) as publications
@@ -112,15 +92,15 @@ GROUP BY u.username, p.name
ORDER BY p.name;
```

## FAQs / common issues
## FAQs / Common Issues

1. The database url is `postgresql://postgres:s3cr3t@localhost:5435/chai`, and is used
as `CHAI_DATABASE_URL` in the environment.
1. The database url is `postgresql://postgres:s3cr3t@localhost:5435/chai`, and
is used as `CHAI_DATABASE_URL` in the environment.

## tasks
## Tasks

These are tasks that can be run using xcfile.dev. if you have pkgx, just run `dev` to
inject into your environment. if you don't...go get it.
These are tasks that can be run using [xcfile.dev]. If you use `pkgx`, typing
`dev` loads the environment. Alternatively, run them manually.

### reset

@@ -235,3 +215,6 @@ Refreshes table knowledge from the db.
```sh
docker-compose restart api
```

[PostgreSQL]: https://www.postgresql.org
[`pkgx`]: https://pkgx.sh
Review comment on lines +220 to +221: strong backlinking
should add
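
The Arguments section in the diff above documents `FREQUENCY` (default `24`), `FETCH` (default `true`), and `DEBUG` (default `true`). As a rough sketch of how those defaults behave, the function below resolves each variable with POSIX default expansion. `resolve_chai_args` is a hypothetical helper, not part of the repository. The closing comment assumes `docker-compose.yml` interpolates these variables from the calling environment, which the diff does not confirm.

```sh
#!/bin/sh
# Hypothetical helper (not in the CHAI repo): print the pipeline arguments,
# falling back to the defaults documented in the README diff.
resolve_chai_args() {
  # ${VAR:-default} yields the default when VAR is unset or empty.
  printf 'FREQUENCY=%s FETCH=%s DEBUG=%s\n' \
    "${FREQUENCY:-24}" "${FETCH:-true}" "${DEBUG:-true}"
}

# Start from a clean slate so the documented defaults apply.
unset FREQUENCY FETCH DEBUG
resolve_chai_args
# prints: FREQUENCY=24 FETCH=true DEBUG=true

# Override two values inside a subshell so they do not leak into later calls.
( FREQUENCY=6; DEBUG=false; resolve_chai_args )
# prints: FREQUENCY=6 FETCH=true DEBUG=false

# Assumption: if docker-compose.yml reads these variables from the
# environment, the same overrides could be supplied as
#   FREQUENCY=6 DEBUG=false docker compose up
```

Setting the variables in front of the command (or exporting them first) may be a safer way to pass them than the `-e` form shown in the diff, since `-e` is documented for `docker compose run` and `exec` rather than `up`.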