crates.io top 20k crates verification process #11

Open
paolobarbolini opened this issue Apr 1, 2024 · 15 comments

Comments

@paolobarbolini
Member

paolobarbolini commented Apr 1, 2024

I've scraped (I'm lazy, I should have used the database dumps) the top 20k crates by recent download count. I've published the list and the script at https://gist.github.com/paolobarbolini/b5101b3ad378bcb6bc5c282349edfd4c.

I'll soon be getting a server from Hetzner with 320 GB of disk and see if I can go through the entire list without running out of disk space. I'll also use the list as a way of fixing some of the shortcomings which have been reported in other issues.

⚠️⚠️⚠️ WARNING ⚠️⚠️⚠️

Before you open issues in the projects you think are affected, investigate the reports thoroughly. This software is still v0.0.1 for a very good reason.

@paolobarbolini
Member Author

While analyzing crates I caught an error and got stuck trying to fetch submodules from https://github.com/quickwit-oss/tantivy/tree/dff022b30aff6bcd4df7e908f6fa2f86e551204b because it used git over SSH. I guess GIT_TERMINAL_PROMPT=0 isn't enough.
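
A likely reason is that `GIT_TERMINAL_PROMPT=0` only disables git's own credential prompts, while an SSH remote hands off to `ssh`, which can still ask for a passphrase or host-key confirmation on the terminal. Below is a minimal sketch of one way to guard against that, not the fix used in this project. It assumes OpenSSH is the ssh client (`BatchMode=yes` is an OpenSSH option), and the function names and `repo_dir` parameter are hypothetical. It also adds a hard timeout as a last resort.

```rust
use std::io;
use std::path::Path;
use std::process::{Command, ExitStatus, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Build a `git submodule update` command that should fail rather than prompt.
/// (Hypothetical helper, not part of cargo-goggles.)
fn non_interactive_submodule_update(repo_dir: &Path) -> Command {
    let mut cmd = Command::new("git");
    cmd.current_dir(repo_dir)
        .args(["submodule", "update", "--init", "--recursive"])
        // Disables git's own username/password prompts (e.g. over HTTPS).
        .env("GIT_TERMINAL_PROMPT", "0")
        // Assumption: OpenSSH. BatchMode makes ssh fail instead of asking
        // for a passphrase, password, or host-key confirmation.
        .env("GIT_SSH_COMMAND", "ssh -o BatchMode=yes")
        .stdin(Stdio::null());
    cmd
}

/// Run a command, killing it once `limit` has elapsed.
/// Returns `Ok(None)` if the command had to be killed.
fn run_with_timeout(cmd: &mut Command, limit: Duration) -> io::Result<Option<ExitStatus>> {
    let mut child = cmd.spawn()?;
    let deadline = Instant::now() + limit;
    loop {
        if let Some(status) = child.try_wait()? {
            return Ok(Some(status));
        }
        if Instant::now() >= deadline {
            child.kill()?;
            child.wait()?; // reap the killed process
            return Ok(None);
        }
        thread::sleep(Duration::from_millis(100));
    }
}
```

A caller would build the command with `non_interactive_submodule_update(path)` and pass it to `run_with_timeout` with whatever limit suits the machine, treating both a non-zero exit status and `None` as "skip this crate".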

@paolobarbolini
Member Author

I've opened #12 to stop missing repositories from continuously blocking the clone process. This patch is already present on the machine I'm doing the scanning from.

@paolobarbolini
Member Author

The clone process was interrupted by ntex-rs/ntex#333 😅. I've applied a patch locally for now and I'll see how to fix it permanently. Turns out that with a large enough pool of crates, almost every ensure! will probably get hit at some point.
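
One general way to keep a single failing check from stopping a long batch is to treat each crate as its own unit of work and record the error instead of propagating it. The sketch below illustrates that pattern only. `analyze` is a hypothetical stand-in for the real per-crate analysis, and plain `String` errors are used so the example needs nothing beyond the standard library.

```rust
/// Hypothetical stand-in for the real per-crate analysis.
/// It fails on an empty name, the way a failed `ensure!` check would.
fn analyze(name: &str) -> Result<(), String> {
    if name.is_empty() {
        return Err("crate name must not be empty".to_owned());
    }
    Ok(())
}

/// Analyze every crate, collecting failures instead of aborting on the first one.
fn analyze_all<'a>(names: &[&'a str]) -> Vec<(&'a str, String)> {
    let mut failures = Vec::new();
    for &name in names {
        if let Err(err) = analyze(name) {
            // Log and move on so the remaining crates still get processed.
            eprintln!("skipping {name:?}: {err}");
            failures.push((name, err));
        }
    }
    failures
}
```

The returned list can be written out at the end of the run, which makes it easier to see which checks are tripping and how often.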

@paolobarbolini
Member Author

> While analyzing crates I caught an error and got stuck trying to fetch submodules from https://github.com/quickwit-oss/tantivy/tree/dff022b30aff6bcd4df7e908f6fa2f86e551204b because it used git over SSH. I guess GIT_TERMINAL_PROMPT=0 isn't enough.

It just happened again with the investments crate.

@paolobarbolini
Member Author

I'm starting to analyze the logs, which I'll publish once we finish analyzing all crates. I've already encountered something, which I've reported at rust-db/refinery#323

@paolobarbolini
Member Author

I've also opened TimelyDataflow/timely-dataflow#559

@link2xt
Contributor

link2xt commented Apr 2, 2024

Not sure if it is in top 20k crates, but here is a pgp crate issue: rpgp/rpgp#327
EDIT: pgp is in top 20k

@paolobarbolini
Member Author

Processing got stuck at ~18.5k crates. Here's the log: output.log.gz

WARNING: I've already verified that there are a lot of false positives.

I'll merge #18, #19 and #20 locally and re-run it on all crates.

@paolobarbolini
Member Author

> pgp is in top 20k

Don't worry about the 20k limit, it's just a number I've picked for doing the "official" scrape after having done a very rough 5k one in the previous days 😃

@link2xt
Contributor

link2xt commented Apr 2, 2024

Issue for brotli, have not tested if it is reproducible: dropbox/rust-brotli#178
async-rusqlite PR to add repository: jsdw/async-rusqlite#2

@paolobarbolini
Member Author

Here are the results from the second run: output.log.gz

@paolobarbolini
Member Author

There are too many crates without a repository field. I'd like to start opening issues on crates that have recently released new versions, since their maintainers are more likely to respond. I wrote this very rough scraper to find the last-updated date of each crate in the list:

Cargo.toml:

[package]
name = "cargo-recent-crates"
version = "0.1.0"
edition = "2021"

[dependencies]
reqwest = { version = "0.12", default-features = false, features = ["rustls-tls", "json", "blocking"] }
chrono = { version = "0.4", features = ["serde"] }
serde = { version = "1", features = ["derive"] }

src/main.rs:

use std::{thread, time::Duration};

use chrono::{DateTime, Utc};

fn main() {
    // Identify the scraper to crates.io through the User-Agent header.
    let client = reqwest::blocking::Client::builder()
        .user_agent("https://github.com/M4SS-Code/cargo-goggles/issues/11 scraping crates with no repository field")
        .build()
        .unwrap();

    // Names of the crates to look up. The explicit type lets this compile
    // even while the list is still empty.
    let crates: &[&str] = &[
        // put crates here
    ];

    #[derive(Debug, serde::Deserialize)]
    struct C {
        #[serde(rename = "crate")]
        c: Cr,
    }

    #[derive(Debug, serde::Deserialize)]
    struct Cr {
        updated_at: DateTime<Utc>,
    }

    for k in crates {
        // Try each crate up to three times before giving up on it.
        for _ in 0..3 {
            let j = match client
                .get(format!("https://crates.io/api/v1/crates/{k}"))
                .send()
            {
                Ok(r) => match r.json::<C>() {
                    Ok(j) => j,
                    Err(err) => {
                        eprintln!("{err:?}");
                        thread::sleep(Duration::from_secs(5));
                        continue;
                    }
                },
                Err(err) => {
                    eprintln!("{err:?}");
                    thread::sleep(Duration::from_secs(5));
                    continue;
                }
            };

            println!("{k}\t{}", j.c.updated_at.to_rfc3339());

            break;
        }

        // Pause between crates to avoid hammering the crates.io API.
        thread::sleep(Duration::from_secs(2));
    }
}
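
The scraper prints one `name<TAB>timestamp` line per crate, so picking out the recently updated ones is a matter of filtering that output. The following is a rough sketch of that step under a few assumptions: the timestamps are the UTC RFC 3339 strings printed above (so the first ten characters are a `YYYY-MM-DD` date that compares correctly as plain text), and the function name and cutoff parameter are made up for illustration.

```rust
/// Keep crates updated on or after `cutoff` (a `YYYY-MM-DD` date),
/// newest first. Input is `name<TAB>rfc3339-timestamp`, one per line.
fn recently_updated<'a>(tsv: &'a str, cutoff: &str) -> Vec<(&'a str, &'a str)> {
    let mut rows: Vec<(&str, &str)> = tsv
        .lines()
        // Lines without a tab are silently skipped.
        .filter_map(|line| line.split_once('\t'))
        // Assumption: all timestamps are UTC, so the date prefix can be
        // compared as text against the cutoff.
        .filter(|(_, updated)| updated.get(..10).is_some_and(|date| date >= cutoff))
        .collect();
    // Sort by date, most recent first (ties keep their input order).
    rows.sort_by(|a, b| b.1[..10].cmp(&a.1[..10]));
    rows
}
```

For example, calling `recently_updated(&contents, "2024-01-01")` on the scraper output would leave only crates touched since the start of 2024.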

@link2xt
Contributor

link2xt commented Apr 2, 2024

Maybe also make a post on Mastodon with #rust and #rustlang tags asking maintainers to add repository field?
Then at least some will set it before you have to make an issue.
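
For anyone checking their own crate: the field in question is the `repository` key in the `[package]` table of `Cargo.toml`. Below is a deliberately naive sketch of such a check using only string matching. The function name is hypothetical, and a real tool should use a proper TOML parser, since this version will miss less common layouts such as inline or dotted table syntax.

```rust
/// Naive check for a `repository` key in the `[package]` table of a manifest.
/// Illustration only; it does not parse TOML properly.
fn has_repository_field(manifest: &str) -> bool {
    let mut in_package = false;
    for line in manifest.lines().map(str::trim) {
        // A new table header ends the previous table.
        if line.starts_with('[') {
            in_package = line == "[package]";
            continue;
        }
        if in_package {
            if let Some(rest) = line.strip_prefix("repository") {
                // Accept `repository = "..."` and `repository.workspace = true`.
                if rest.trim_start().starts_with('=') || rest.starts_with('.') {
                    return true;
                }
            }
        }
    }
    false
}
```

Running it over the manifest text of each downloaded crate would give a quick list of the ones that still lack the field.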

@paolobarbolini
Member Author

paolobarbolini commented Apr 2, 2024

> Maybe also make a post on Mastodon with #rust and #rustlang tags asking maintainers to add repository field? Then at least some will set it before you have to make an issue.

Sounds like a good idea.

In the meantime here's the list (it's actually .tsv but GitHub didn't like it): crates.csv

@paolobarbolini
Member Author

paolobarbolini commented Apr 3, 2024

I haven't posted it on Twitter or Mastodon yet, or seen if cargo could make it more obvious when the repository field is missing, but I did open a few issues on projects I recognized from the list and I've gotten this response [1], which is an interesting wake-up call [2].

I'm not sure opening issues this way is doable at this point. For one, we're still just 3 people playing with our toys and figuring out what to do with them. I think I'll dedicate more time to the development side to get something much more usable than the current version and see if this can also help others, be it in CLI or library form.

Footnotes

  1. https://github.com/zellij-org/zellij/issues/3241#issuecomment-2033667800

  2. Full disclosure: we're not going to monetize this, but there are other benefits we could enjoy as a company, like the publicity.
