
Move ecosystem detection tool to nodejs org #7935

Closed
cjihrig opened this issue Aug 1, 2016 · 24 comments
Labels
meta Issues and PRs related to the general management of the project.

Comments

@cjihrig
Contributor

cjihrig commented Aug 1, 2016

This is a continuation of the discussion in #7619. For a while now, we have pinged @ChALkeR every time we wanted to know how heavily something was used in the ecosystem. I'm proposing that we bring this into the nodejs org so that it is more accessible to other people (busFactor++;), and more likely to get additional maintainers. @ChALkeR is currently working on the project in https://github.com/ChALkeR/Gzemnid.

@cjihrig cjihrig added the discuss Issues opened for discussions and feedbacks. label Aug 1, 2016
@ChALkeR
Member

ChALkeR commented Aug 1, 2016

Quoting myself from #7619 (comment) and below:

I'm trying to host the scripts at https://github.com/ChALkeR/Gzemnid, but it's only slowly transitioning from a set of hacky bash scripts that require manual steps to something more sensible. Perhaps I will push some updates soon. It also works as a server that allows searches through a web API. I had it running as a web service at http://gzemnid.oserv.org/ some time ago, but it's not ready yet, and at the moment it's down.

Perhaps I will spend a bit more time on that in the coming days.

The 2016-01-28 dataset is hosted at http://oserv.org/npm/Gzemnid/2016-01-28/ (the *.lzo files and the bash script). Anyone can download the pre-built dataset and obtain exactly the same grep results.
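For illustration, assuming lzop is available, downloading that dataset and reproducing a grep could look roughly like this (the pattern is just an example, not a specific query anyone ran):

```bash
# Minimal sketch (not the actual Gzemnid script): mirror the dataset
# directory and grep the decompressed *.lzo files. Requires wget and lzop.
wget -r -np -nd -A '*.lzo,*.sh' http://oserv.org/npm/Gzemnid/2016-01-28/

# "process.binding(" is only an example pattern.
for f in *.lzo; do
  lzop -dc "$f"
done | grep -F 'process.binding(' > hits.txt
```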

The repo also provides the deep dependency checker (used to build lists like this); pre-built files are named deps-nested.json in the other folders.

Perhaps we need a separate issue for that somewhere.

I will try to make it work without any additional undocumented steps, document it a bit (at least list the commands), and then it will be usable by people other than myself. I'm all for moving it into the org once it reaches a «usable» state.

@jasnell
Member

jasnell commented Aug 1, 2016

@ChALkeR : Can you describe the infrastructure / hosting requirements that you have to run this? I can look into whether or not some part of our existing dedicated hosting resources can be used.

@mscdex mscdex added the meta Issues and PRs related to the general management of the project. label Aug 1, 2016
@ChALkeR
Member

ChALkeR commented Aug 1, 2016

@jasnell At the moment it's suboptimal; some fixes are required to make it faster and less resource-hungry. The main requirement is storage, but I can't give exact numbers yet. Something around 200 GiB, perhaps (it could be hosted on slow storage, since it only affects dataset rebuild time). With proper fixes that could be brought down to a few GiB, but a cache of about 80 GiB for the packed files would still be useful in case the rebuild rules change.

The built dataset itself is around 5 GiB, so it's quite small.

@ChALkeR
Member

ChALkeR commented Aug 1, 2016

@jasnell I will move the whole process to a small VPS again to check that everything works fine (another reason is that I am away from my PC at the moment and can't rebuild this on my notebook due to a slow internet connection). That would also help me to outline (and lower) the requirements for running this thing autonomously.

@jasnell
Member

jasnell commented Aug 2, 2016

No worries.
@mhdawson ... can some of our dedicated SoftLayer resources be used for this? (Particularly noting the storage requirements.)

@ChALkeR
Member

ChALkeR commented Aug 2, 2016

@jasnell More specifically:

Tarballs for @latest currently consume 79 GiB. It's not required to store them, but they would be needed if the algorithm changes.

Partials (uncompressed) are expected to consume around 30-50 GiB (I need to readjust the blacklist). Those are required to get a reasonable rebuild speed and are essentially pre-built dataset chunks for each package. They could be stored in compressed form (reducing their size to around 5 GiB), but that's not supported yet.

The dataset itself should be about the same size as all the partials together (minus a few GiB). The compressed size is expected to be about 5 GiB, and we might want to keep several versions of it.

The deep dependencies builder has some memory requirements, but I don't remember exactly how much memory it consumes. It's below 2 GiB, I think.

I will post an update with real numbers once I finish rebuilding the dataset on my VPS.

@ChALkeR
Member

ChALkeR commented Aug 3, 2016

Ok. Partials are consuming 40 GiB.

That could increase, as I will probably want to add more information there, e.g. package.json (to check for postinstall scripts) and disk usage (to build a list of abnormally huge packages). That would only add 2-4 GiB, I suppose. It could also decrease once I update the blacklist.

Update: partials are 36 GiB with package.json files and a fixed, updated blacklist. This could significantly increase in the future, though, since it grows with the size of npm.
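As a rough sketch of what those checks could look like once the package.json data is in place (the packages/ directory layout here is an assumption, not how Gzemnid actually stores partials):

```bash
# Rough sketch, not Gzemnid itself: assumes one unpacked package per
# directory under ./packages/, each with a package.json at its root.

# List packages whose package.json declares a postinstall script:
grep -l '"postinstall"' packages/*/package.json

# Show the 20 largest packages by on-disk size:
du -s packages/* | sort -rn | head -n 20
```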

@ChALkeR
Member

ChALkeR commented Aug 4, 2016

Ok. The unpacked dataset size is 29 GiB, 27 GiB of which is code search data.

The actual (packed) dataset size is 4.3 GiB, latest version is uploaded to http://oserv.org/npm/Gzemnid/2016-08-04/.

@ChALkeR ChALkeR self-assigned this Aug 25, 2016
@ChALkeR
Member

ChALkeR commented Aug 25, 2016

Status update: no additional manual steps are required anymore, and the tool can be used on its own to build the same datasets that I'm using, without any internal knowledge.

I have begun documenting the commands and making the tool easier to work with; I will hopefully turn it into something sensible this week.

@jbergstroem
Member

Perhaps the build group can host this? We have room in our infrastructure.

@ChALkeR
Member

ChALkeR commented Aug 30, 2016

Status update: initial documentation is in place and the search script got merged; there are no bash parts left. Usage is pretty simple now.

I believe we could start moving this into the org at this point, if we decide to do so.

@ChALkeR
Member

ChALkeR commented Aug 30, 2016

/cc @nodejs/ctc: should this be mentioned at a CTC meeting?

@MylesBorins
Contributor

Big +1 on adding this to the org and having documentation that will allow individuals to make use of the tool!

@jasnell
Member

jasnell commented Aug 31, 2016

+1 to moving this in.

@jbergstroem
Member

+1 from me too. Does anyone else from the build group want to chip in, seeing as we will likely be the ones deploying this?

@targos
Member

targos commented Aug 31, 2016

@ChALkeR does your tool need to make a lot of calls to the npm database? If so, it may be useful (for speed) to host a copy with continuous replication of https://skimdb.npmjs.com/registry on the same server.

@ChALkeR
Member

ChALkeR commented Aug 31, 2016

@targos I thought about that. No, it doesn't; I don't think a replica is needed.

Everything I get from skimdb could be handled with a follower (and takes only one or two minutes per run even without one), and I don't think that replica would include the api.npmjs.org data.

The dependency builder is the only thing that could actually benefit from a skimdb replica, not because calls to the npm registry are slow, but because it could speed up the storage side, and I am not sure that is worth it at the moment; by my estimate it would save only about 30-40% of the dependency builder's build time.
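For context, skimdb.npmjs.com is a public CouchDB mirror of the registry, so a follower in this sense is just a consumer of its _changes feed; a minimal sketch with curl (illustrative only, not part of Gzemnid):

```bash
# Illustration only: skimdb.npmjs.com exposes CouchDB's _changes feed,
# so "following" the registry is just streaming that feed.
curl -sN 'https://skimdb.npmjs.com/registry/_changes?feed=continuous&since=now&heartbeat=10000' |
while read -r change; do
  # Each non-empty line is a JSON change record; empty lines are heartbeats.
  [ -n "$change" ] && printf '%s\n' "$change"
done
```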

@Trott
Member

Trott commented Sep 7, 2016

Result from CTC discussion last week was:

  • Prep repo for migration.
  • Open issue in nodejs/tsc.

@ChALkeR
Member

ChALkeR commented Oct 23, 2016

Ok, this stalled for a bit, mostly due to personal time constraints; I will now try to continue this effort =).

@ChALkeR
Member

ChALkeR commented Oct 23, 2016

Pre-built dataset update: http://oserv.org/npm/Gzemnid/2016-10-22/. It's under 5 GiB.

To perform code search, you need the three slim.code.*.txt.lz4 files and search.code.sh (a one-liner).

That dataset is exactly the one I currently use for code search, and it was built following the instructions in https://github.com/ChALkeR/Gzemnid/blob/master/README.md.
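As a rough idea of the shape of that one-liner (this is a guess, not the actual contents of search.code.sh):

```bash
#!/bin/sh
# Guess at the shape of such a search, not the actual search.code.sh:
# stream-decompress the slim.code datasets and grep for the pattern
# given as the first argument. Requires the lz4 command-line tool.
pattern="$1"
for f in slim.code.*.txt.lz4; do
  lz4 -dc "$f"
done | grep -F -- "$pattern"
```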

@Trott
Member

Trott commented Jul 15, 2017

Should this remain open?

@gibfahn
Member

gibfahn commented Jul 16, 2017

> Should this remain open?

I think so. Gzemnid seems like it'll be as useful as CitGM for gauging how breaking a change is, and I assume this is just waiting on people having the time to add it to our infra.

@Trott
Member

Trott commented Mar 9, 2018

Removing discuss label because it seems like decisions have been made. Re-add if you think it should remain.

@Trott Trott removed the discuss Issues opened for discussions and feedbacks. label Mar 9, 2018
@cjihrig
Contributor Author

cjihrig commented May 29, 2018

This moved to nodejs/TSC#490 and subsequently nodejs/admin#130. Closing this.

@cjihrig cjihrig closed this as completed May 29, 2018