
Move ecosystem detection tool to nodejs org #7935

Closed
cjihrig opened this issue Aug 1, 2016 · 24 comments
Labels
meta Issues and PRs related to the general management of the project.

Comments

@cjihrig
Contributor

cjihrig commented Aug 1, 2016

This is a continuation of the discussion in #7619. For a while now, we have pinged @ChALkeR every time we wanted to know how heavily something was used in the ecosystem. I'm proposing that we bring this into the nodejs org so that it is more accessible to other people (busFactor++;), and more likely to get additional maintainers. @ChALkeR is currently working on the project in https://github.com/ChALkeR/Gzemnid.

@cjihrig cjihrig added the discuss Issues opened for discussions and feedbacks. label Aug 1, 2016
@ChALkeR
Member

ChALkeR commented Aug 1, 2016

Quoting myself from #7619 (comment) and below:

I'm trying to host the scripts at https://github.com/ChALkeR/Gzemnid, but it's only slowly transitioning from a set of hacky bash scripts that require manual steps to something more sensible. Perhaps I will push some updates soon. It also works as a server that allows searches through a web API. I had it running as a web service at http://gzemnid.oserv.org/ some time ago, but it's not ready yet, and at the moment it's down.

Perhaps I will spend a bit more time on that in the coming days.

The 2016-01-28 dataset is hosted at http://oserv.org/npm/Gzemnid/2016-01-28/ (the *.lzo files and the bash script). Anyone can download the pre-built dataset and obtain exactly the same grep results.
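For illustration, assuming lzop is available, downloading that dataset and reproducing a grep could look roughly like this (the pattern is just an example, not a specific query anyone ran):

```bash
# Minimal sketch (not the actual Gzemnid script): mirror the dataset
# directory and grep the decompressed *.lzo files. Requires wget and lzop.
wget -r -np -nd -A '*.lzo,*.sh' http://oserv.org/npm/Gzemnid/2016-01-28/

# "process.binding(" is only an example pattern.
for f in *.lzo; do
  lzop -dc "$f"
done | grep -F 'process.binding(' > hits.txt
```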

The repo also provides the deep dependency checker (used to build lists like this); pre-built files are named deps-nested.json in the other folders.

Perhaps we need a separate issue for that somewhere.

I will try to make it work without any additional undocumented steps, document it a bit (at least list the commands), and then it will be usable by people other than myself. I'm all for moving it into the org once it reaches a «usable» state.

@jasnell
Member

jasnell commented Aug 1, 2016

@ChALkeR : Can you describe the infrastructure / hosting requirements that you have to run this? I can look into whether or not some part of our existing dedicated hosting resources can be used.

@mscdex mscdex added the meta Issues and PRs related to the general management of the project. label Aug 1, 2016
@ChALkeR
Member

ChALkeR commented Aug 1, 2016

@jasnell At the moment it's suboptimal; some fixes are required to make it faster and less resource-hungry. The main requirement is storage, but I can't give exact numbers yet. Something around 200 GiB, perhaps (it could be hosted on slow storage, since it only affects dataset rebuild time). With proper fixes that could be brought down to a few GiB, but a cache of about 80 GiB for the packed files would still be useful in case the rebuild rules change.

The built dataset itself is around 5 GiB, so it's quite small.

@ChALkeR
Member

ChALkeR commented Aug 1, 2016

@jasnell I will move the whole process to a small VPS again to check that everything works fine (another reason is that I am away from my PC at the moment and can't rebuild this on my notebook due to a slow internet connection). That would also help me to outline (and lower) the requirements for running this thing autonomously.

@jasnell
Member

jasnell commented Aug 2, 2016

No worries.
@mhdawson ... can some of our dedicated SoftLayer resources be used for this? (Particularly noting the storage requirements.)

@ChALkeR
Member

ChALkeR commented Aug 2, 2016

@jasnell More specifically:

Tarballs for @latest currently consume 79 GiB. It's not required to store them, but they would be needed if the algorithm changes.

Partials (uncompressed) are expected to consume around 30-50 GiB (I need to readjust the blacklist). Those are required to get a reasonable rebuild speed and are essentially pre-built dataset chunks for each package. They could be stored in compressed form (reducing their size to around 5 GiB), but that's not supported yet.

The dataset itself should be about the same size as all the partials together (minus a few GiB). The compressed size is expected to be about 5 GiB, and we might want to keep several versions of it.

The deep dependencies builder has some memory requirements, but I don't remember exactly how much memory it consumes. It's below 2 GiB, I think.

I will post an update with real numbers once I finish rebuilding the dataset on my VPS.

@ChALkeR
Member

ChALkeR commented Aug 3, 2016

Ok. Partials are consuming 40 GiB.

That could increase, as I will probably want to add more information there, e.g. package.json (to check for postinstall scripts) and disk usage (to build a list of abnormally huge packages). That would only add 2-4 GiB, I suppose. It could also decrease once I update the blacklist.

Update: partials are 36 GiB with package.json files and a fixed, updated blacklist. This could significantly increase in the future, though, since it grows with the size of npm.
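As a rough sketch of what those checks could look like once the package.json data is in place (the packages/ directory layout here is an assumption, not how Gzemnid actually stores partials):

```bash
# Rough sketch, not Gzemnid itself: assumes one unpacked package per
# directory under ./packages/, each with a package.json at its root.

# List packages whose package.json declares a postinstall script:
grep -l '"postinstall"' packages/*/package.json

# Show the 20 largest packages by on-disk size:
du -s packages/* | sort -rn | head -n 20
```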

@ChALkeR
Member

ChALkeR commented Aug 4, 2016

Ok. The unpacked dataset size is 29 GiB, 27 GiB of which is code search data.

The actual (packed) dataset size is 4.3 GiB, latest version is uploaded to http://oserv.org/npm/Gzemnid/2016-08-04/.

@ChALkeR ChALkeR self-assigned this Aug 25, 2016
@ChALkeR
Member

ChALkeR commented Aug 25, 2016

Status update: no additional manual steps are required anymore, and the tool can be used on its own to build the same datasets that I'm using, without any internal knowledge.

I have begun documenting the commands and making the tool easier to work with; I will hopefully turn it into something sensible this week.

@jbergstroem
Member

Perhaps the build group can host this? We have room in our infrastructure.

@ChALkeR
Member

ChALkeR commented Aug 30, 2016

Status update: initial documentation is in place and the search script got merged; there are no bash parts left. Usage is pretty simple now.

I believe we could start moving this into the org at this point, if we decide to do so.

@ChALkeR
Member

ChALkeR commented Aug 30, 2016

/cc @nodejs/ctc: should this be mentioned at a CTC meeting?

@MylesBorins
Contributor

Big +1 on adding this to the org and having documentation that will allow individuals to make use of the tool!

@jasnell
Member

jasnell commented Aug 31, 2016

+1 to moving this in.

@jbergstroem
Member

+1 from me too. Does anyone else from the build group want to chip in, seeing as we will likely be the ones deploying this?

@targos
Member

targos commented Aug 31, 2016

@ChALkeR does your tool need to make a lot of calls to the npm database? If so, it may be useful (for speed) to host a copy with continuous replication of https://skimdb.npmjs.com/registry on the same server.

@ChALkeR
Member

ChALkeR commented Aug 31, 2016

@targos I thought about that. No, it doesn't; I don't think a replica is needed.

Everything I get from skimdb could be handled with a follower (and takes only one or two minutes per run even without one), and I don't think that replica would include the api.npmjs.org data.

The dependency builder is the only thing that could actually benefit from a skimdb replica, not because calls to the npm registry are slow, but because it could speed up the storage side, and I am not sure that is worth it at the moment; by my estimate it would save only about 30-40% of the dependency builder's build time.
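For context, skimdb.npmjs.com is a public CouchDB mirror of the registry, so a follower in this sense is just a consumer of its _changes feed; a minimal sketch with curl (illustrative only, not part of Gzemnid):

```bash
# Illustration only: skimdb.npmjs.com exposes CouchDB's _changes feed,
# so "following" the registry is just streaming that feed.
curl -sN 'https://skimdb.npmjs.com/registry/_changes?feed=continuous&since=now&heartbeat=10000' |
while read -r change; do
  # Each non-empty line is a JSON change record; empty lines are heartbeats.
  [ -n "$change" ] && printf '%s\n' "$change"
done
```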

@Trott
Member

Trott commented Sep 7, 2016

Result from CTC discussion last week was:

  • Prep repo for migration.
  • Open issue in nodejs/tsc.

@ChALkeR
Member

ChALkeR commented Oct 23, 2016

Ok, this stalled for a bit, mostly due to personal time constraints; I will now try to continue this effort =).

@ChALkeR
Member

ChALkeR commented Oct 23, 2016

Pre-built dataset update: http://oserv.org/npm/Gzemnid/2016-10-22/. It's under 5 GiB.

To perform code search, you need the three slim.code.*.txt.lz4 files and search.code.sh (a one-liner).

That dataset is exactly the one I currently use for code search, and it was built following the instructions in https://github.com/ChALkeR/Gzemnid/blob/master/README.md.
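As a rough idea of the shape of that one-liner (this is a guess, not the actual contents of search.code.sh):

```bash
#!/bin/sh
# Guess at the shape of such a search, not the actual search.code.sh:
# stream-decompress the slim.code datasets and grep for the pattern
# given as the first argument. Requires the lz4 command-line tool.
pattern="$1"
for f in slim.code.*.txt.lz4; do
  lz4 -dc "$f"
done | grep -F -- "$pattern"
```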

@Trott
Member

Trott commented Jul 15, 2017

Should this remain open?

@gibfahn
Member

gibfahn commented Jul 16, 2017

> Should this remain open?

I think so. Gzemnid seems like it'll be as useful as CitGM for gauging how breaking a change is, and I assume this is just waiting on people having the time to add it to our infra.

@Trott
Member

Trott commented Mar 9, 2018

Removing discuss label because it seems like decisions have been made. Re-add if you think it should remain.

@Trott Trott removed the discuss Issues opened for discussions and feedbacks. label Mar 9, 2018
@cjihrig
Contributor Author

cjihrig commented May 29, 2018

This moved to nodejs/TSC#490 and subsequently nodejs/admin#130. Closing this.

@cjihrig cjihrig closed this as completed May 29, 2018