Read the registry straight from the tarball #2431

Merged (3 commits) on May 26, 2021


KristofferC
Member

@KristofferC KristofferC commented Mar 12, 2021

Example:

julia> readdir(joinpath(homedir(), ".julia/registries")) # empty
String[]

(Pkg) pkg> add Example
  Installing known registries into `~/.julia`
   Resolving package versions...
   Installed Example ─ v0.5.3

julia> readdir(joinpath(homedir(), ".julia/registries/General"))  # only the tar file and an info file with uuid, filename, tree hash
2-element Vector{String}:
 ".registry_info.toml"
 "General.tar.gz"

TODO

  • Discuss possible implications of not having the registry files available.

@DilumAluthge
Member

A lot of tools (RegistryCI, CompatHelper, RetroCap, etc.) access registry files directly.

Maybe we could have an environment variable that disables this functionality? So, by default, we would use this functionality.

But if you set something like this...

ENV["JULIA_UNPACK_REGISTRY_TARBALLS"] = "true"

... then it would untar the registry tarballs so that all of the registry files would be available.
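As a rough sketch of how such an opt-out could be read: the variable name is taken from the suggestion above and is only a proposal at this point in the thread, while the helper name `unpack_registry_tarballs` and the accepted spellings are assumptions made for illustration.

```julia
# Hypothetical helper: JULIA_UNPACK_REGISTRY_TARBALLS is the name proposed above,
# not a setting confirmed anywhere in this thread.
function unpack_registry_tarballs(env = ENV)
    raw = get(env, "JULIA_UNPACK_REGISTRY_TARBALLS", "false")
    value = lowercase(strip(raw))
    # Assumption: treat "true" and "1" as enabling the unpacked behavior.
    return value in ("true", "1")
end

# Passing a plain Dict instead of ENV keeps the check easy to try out.
enabled = unpack_registry_tarballs(Dict("JULIA_UNPACK_REGISTRY_TARBALLS" => "true"))
disabled = unpack_registry_tarballs(Dict{String,String}())
```

With no variable set, the sketch falls back to reading the tarball directly, which matches the "by default, we would use this functionality" part of the suggestion.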

@c42f
Member

c42f commented Mar 18, 2021

> Significantly slower if we already have the registry unpacked

Is the underlying problem here that tar files have no global index, so you're forced to do a lot of reading and seeking just to get to one small part of it? In that case it might be an improvement to do a one-time creation of a much smaller index sidecar file for the tar when it's downloaded, and store the index. Essentially a cache of the offsets found in create_inmemory_filesystem()?

@KristofferC
Member Author

> Is the underlying problem here that tar files have no global index, so you're forced to do a lot of reading and seeking just to get to one small part of it?

Partially, however, reading all the headers is quite fast:

julia> @btime system = Pkg.Registry.create_inmemory_filesystem(joinpath(homedir(), ".julia/registries/General"))
  49.823 ms (587812 allocations: 47.08 MiB)

The big difference in time is likely because the other types of registries benefit from a bunch of caching that I haven't hooked this up to yet. After that, I think the timing should be fine.

@c42f
Member

c42f commented Mar 18, 2021

> however, reading all the headers is quite fast

Right, that seems quite conclusive. 50 ms to scan the tar file is only roughly 1/10 of the slowdown.

@KristofferC
Member Author

> A lot of tools (RegistryCI, CompatHelper, RetroCap, etc.) access registry files directly.

Aren't they using the git version of the registry?

@DilumAluthge
Member

> > A lot of tools (RegistryCI, CompatHelper, RetroCap, etc.) access registry files directly.
>
> Aren't they using the git version of the registry?

They usually use the Git version. But CompatHelper, for example, can be toggled to use the Pkg server version instead, which matters when someone wants to use a private registry that is only available from a private Pkg server.

@KristofferC
Member Author

Those bots could start using the API in https://github.com/JuliaLang/Pkg.jl/blob/master/src/Registry/registry_instance.jl.

I guess we can keep the existing one, but the number of configurations (git, uncompressed pkg server, compressed pkg server) is getting hard to manage.

@KristofferC KristofferC force-pushed the kc/read_tarball branch 2 times, most recently from 4ffea6a to fea2361 Compare March 18, 2021 13:32
@KristofferC KristofferC force-pushed the kc/read_tarball branch 3 times, most recently from 9cf03f5 to 6aa5b53 Compare March 31, 2021 15:09
@KristofferC
Member Author

Should be ready for review.

@KristofferC KristofferC changed the title WIP: Read the registry straight from the tarball Read the registry straight from the tarball Mar 31, 2021
@KristofferC KristofferC force-pushed the kc/read_tarball branch 3 times, most recently from 8a3de1d to 6a64c9b Compare April 1, 2021 20:26
@StefanKarpinski
Member

I haven't reviewed everything, but I approve of the method of extracting the tarball into a Dict{String,String} mapping tarball paths to file contents. We could add that as an official API for Tar but for now the code to implement it is small and simple enough that it's fine for it to live here. Will try to review the rest at some point but don't let me hold this up.
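For illustration, here is a minimal sketch of the tarball-to-`Dict{String,String}` idea. It is not the code in this PR: it assumes an uncompressed tarball and takes a detour through a temporary directory using only the public `Tar.create` and `Tar.extract` functions, whereas the approach described above reads the contents straight into memory. The helper name `tarball_to_dict` and the demo files are made up.

```julia
using Tar

# Read every regular file of an uncompressed tarball into memory,
# keyed by its path inside the tarball (always with '/' separators).
function tarball_to_dict(tarball::AbstractString)
    files = Dict{String,String}()
    dir = Tar.extract(tarball)   # extracts into a fresh temporary directory
    try
        for (root, _, names) in walkdir(dir), name in names
            full = joinpath(root, name)
            key = replace(relpath(full, dir), '\\' => '/')
            files[key] = read(full, String)
        end
    finally
        rm(dir; recursive = true, force = true)
    end
    return files
end

# Build a tiny stand-in "registry" and round-trip it through a tarball.
src = mktempdir()
write(joinpath(src, "Registry.toml"), "name = \"Demo\"\n")
mkpath(joinpath(src, "E", "Example"))
write(joinpath(src, "E", "Example", "Package.toml"), "name = \"Example\"\n")
tarball = Tar.create(src)
files = tarball_to_dict(tarball)
```

Lookups such as `files["E/Example/Package.toml"]` then replace reads from the file system, which is the point of keeping the mapping in memory.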

@KristofferC
Member Author

KristofferC commented Apr 1, 2021

One thing you might have a comment about is https://github.com/JuliaLang/Pkg.jl/pull/2431/files#diff-207055bf1cfbe124497d7079ce1ffb06c549f3534fc7b0988b2051fd1a74f1c2R191.

Since we cache things based on the tree hash, we don't want to have to read the compressed file just to learn the tree hash of that registry. Therefore, registry installation creates another file (.registry_info.toml) next to the compressed registry, containing the uuid, the git-tree-sha1, and the name of the compressed file. For example:

base ❯ cat .registry_info.toml
git-tree-sha1 = "2b9cdcd382d82f90833c9b57a87686da568d9f62"
uuid = "23338594-aafe-5451-b93e-139f81909106"
filename = "General.tar.gz"
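A small sketch of how such an info file could drive a tree-hash-keyed cache, using the TOML and UUIDs standard libraries. The three keys are the ones shown above; the cache, the `load_registry` helper, and the `loader` callback are hypothetical names for illustration and are not taken from the PR.

```julia
using TOML
using UUIDs

info_text = """
git-tree-sha1 = "2b9cdcd382d82f90833c9b57a87686da568d9f62"
uuid = "23338594-aafe-5451-b93e-139f81909106"
filename = "General.tar.gz"
"""

info = TOML.parse(info_text)
registry_uuid = UUID(info["uuid"])

# Hypothetical cache: tree hash => in-memory registry contents.
const REGISTRY_CACHE = Dict{String,Dict{String,String}}()

# Only call `loader` (which would open the tarball) when the tree hash
# from the info file has not been seen before.
function load_registry(info::AbstractDict, loader)
    return get!(REGISTRY_CACHE, info["git-tree-sha1"]) do
        loader(info["filename"])
    end
end

# Stand-in loader that counts how often the "tarball" is actually read.
calls = Ref(0)
loader(filename) = (calls[] += 1; Dict("source" => filename))
first_load = load_registry(info, loader)
second_load = load_registry(info, loader)
```

The second call returns the cached value without invoking the loader, so the tarball is never touched when the tree hash is unchanged.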

@StefanKarpinski
Member

Yeah, that seems fine to me. Other options I considered when thinking about this feature were:

  • Append/prepend some data to the tarball that includes the tree hash — the TAR format supports comment blocks.
  • Change the way we save these to ~/.julia/registries/General/$tree.tar.gz or something

But just having a file with this info seems simple too. What about a single TOML file with metadata about all of the registries? We'd have to lock on modifying that file, obviously, but that's a thing we do often in Pkg, so not worse than anything else we do.

@KristofferC
Member Author

I also thought about using the file name, but you probably want to be able to retrieve both the UUID and the tree SHA, and at that point the file name becomes rather long, which is why I went with the separate file.

I don't see any big advantage in keeping the registry info for all registries in one file, and it makes the implementation slightly harder, so I would rather keep it the way it is now.

@StefanKarpinski
Member

If I'm understanding correctly, is the layout with this PR like this?

~/.julia/registries/General.tar.gz
~/.julia/registries/General/.registry_info.toml

If so, maybe avoid the extra directory level and call the second file ~/.julia/registries/General.toml or something?

@KristofferC
Member Author

KristofferC commented Apr 2, 2021

Almost, it is:

~/.julia/registries/General/General.tar.gz
~/.julia/registries/General/.registry_info.toml

I thought about putting the two files directly under registries, but then you can't (at least not as trivially) replace existing registries/General registries (the ones you get from 1.6, git, etc.) with the compressed version. Keeping the same folder structure lets it smoothly "upgrade" and "downgrade" between the compressed and non-compressed versions.

@StefanKarpinski
Member

Hmm. Feels a bit messy to me. I think as long as we have clear criteria for detecting which one exists we should be ok. The current criteria are:

  • if $registry/.git exists then it's a git cloned registry
  • if $registry/.tree_info.toml exists then it's an extracted tarball
  • if $registry/{$registry.tar.gz,.registry_info.toml} exist then it's an unextracted tarball

If someone is switching back to older Julia versions, how is the last case treated? It might be better if the $registry/ directory doesn't exist at all, no? Then an older Julia version will just ignore it instead of getting confused.
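The three criteria listed above can be sketched as a small classifier. This only restates the checks from this comment; the function name `registry_kind` and the returned symbols are made up for illustration, and the sketch assumes `dir` has no trailing path separator.

```julia
# Classify a registry directory using the criteria listed above.
# Assumption: `dir` has no trailing separator, so basename(dir) is the registry name.
function registry_kind(dir::AbstractString)
    name = basename(dir)
    if isdir(joinpath(dir, ".git"))
        return :git_clone
    elseif isfile(joinpath(dir, ".tree_info.toml"))
        return :extracted_tarball
    elseif isfile(joinpath(dir, name * ".tar.gz")) &&
           isfile(joinpath(dir, ".registry_info.toml"))
        return :unextracted_tarball
    else
        return :unknown
    end
end

# Fake registries in a temporary depot, one per case.
root = mktempdir()
git_dir = joinpath(root, "FromGit")
mkpath(joinpath(git_dir, ".git"))

extracted_dir = joinpath(root, "Extracted")
mkpath(extracted_dir)
touch(joinpath(extracted_dir, ".tree_info.toml"))

packed_dir = joinpath(root, "General")
mkpath(packed_dir)
touch(joinpath(packed_dir, "General.tar.gz"))
touch(joinpath(packed_dir, ".registry_info.toml"))

kinds = map(registry_kind, [git_dir, extracted_dir, packed_dir])
```

The order of the checks matters only if a directory satisfies more than one criterion; here the git check wins, which is an arbitrary choice for the sketch.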

@KristofferC
Member Author

KristofferC commented Apr 2, 2021

> If someone is switching back to older Julia versions, how is the last case treated?

1.6 detects that the folder has no Registry.toml, considers it invalid, and downloads a new registry that replaces it. On 1.5 it might go bananas.

> Then an older Julia version will just ignore it instead of getting confused.

Yes, but when you switch back to Julia 1.7 it will find two copies of the General registry (remember, the old 1.6 registry is still valid). Figuring out which one to use (which one is more up to date) feels awkward.

@StefanKarpinski
Member

> Figuring out which one to use (which one is more up to date) feels awkward.

If there are two, I would say pick the one that more Julia versions will understand and get rid of the other one.

@GunnarFarneback
Contributor

> On 1.5 it might go bananas.

More specifically:

(@v1.5) pkg> registry add
 Installing known registries into `/tmp/testdepot`
######################################################################## 100,0%
ERROR: SystemError: opening file "/tmp/testdepot/registries/General/Registry.toml": No such file or directory
[30 steps of backtrace omitted]

Julia 1.1 to 1.4, and 1.5 with the package server disabled, first make a git clone, then apparently discard it and give the same error. Julia 1.0 gives up without trying to clone.

@KristofferC KristofferC mentioned this pull request Apr 14, 2021
@KristofferC KristofferC force-pushed the kc/read_tarball branch 2 times, most recently from 0075dbc to 3e29422 Compare May 26, 2021 11:44
@KristofferC
Member Author

I've been trying this out today and it seems to work well. I also made it backwards compatible, so that old Julia versions won't freak out when they find compressed registries.

@StefanKarpinski
Member

What file layout did you end up going with? On the last Pkg call we had settled on this:

  • ~/.julia/registries/General.toml with info about the registry and where to find it
  • ~/.julia/registries/General.tar.gz for compressed tarballs with the path in the TOML file
  • ~/.julia/registries/General/ with .tree_info.toml inside for legacy unpacked tarball registries
  • ~/.julia/registries/General/ with .git inside for git cloned registries

The first two would be ignored by older Julias, allowing them to keep working.
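A sketch of how a registries directory could be scanned under this layout. The `"path"` key name in the TOML file is an assumption (the list above only says the TOML file contains the path), and `find_registries` and the returned symbols are hypothetical names.

```julia
using TOML

# Classify the entries of a registries directory under the layout listed above.
# Assumption: the TOML file stores the tarball location under a "path" key.
function find_registries(registries_dir::AbstractString)
    found = Pair{String,Symbol}[]
    for entry in readdir(registries_dir)   # readdir returns sorted names
        full = joinpath(registries_dir, entry)
        if isfile(full) && endswith(entry, ".toml")
            info = TOML.parsefile(full)
            tarball = joinpath(registries_dir, get(info, "path", ""))
            if isfile(tarball)
                push!(found, first(splitext(entry)) => :compressed_tarball)
            end
        elseif isdir(joinpath(full, ".git"))
            push!(found, entry => :git_clone)
        elseif isfile(joinpath(full, ".tree_info.toml"))
            push!(found, entry => :unpacked_tarball)
        end
    end
    return found
end

# Demo depot with one compressed registry and one git clone (both fake).
registries = mktempdir()
write(joinpath(registries, "General.toml"), "path = \"General.tar.gz\"\n")
touch(joinpath(registries, "General.tar.gz"))
mkpath(joinpath(registries, "Other", ".git"))
layout = find_registries(registries)
```

Because the compressed registry lives in top-level files rather than in a `General/` directory, a scan that only looks at directories, as older Julia versions are described as doing, would not see it at all.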

@KristofferC
Member Author

Exactly that
