Proposal: separate website repository and assets #12048
Comments
I think we shouldn't push these assets to a Git repository. The tldr archives, PDFs, etc. do not need version control. Every client downloads the latest version anyway. On top of that, such a repository grows in size very quickly (we push a couple of megabytes of binary data on every commit that changes a page), so we are going to arrive at the exact same problem later on. This is a band-aid solution, and if we're going to change the way we distribute pages, we might as well do it right. In my opinion, we need something that can be easily overwritten, without garbage piling up on every commit.
Actually, it's about 15 GB. I've been thinking about this for a while now, and I actually wanted to open a similar issue. We could upload the assets to the latest release of I've already edited the deploy script to upload assets to both places - if everyone agrees, I can make a PR.
That should definitely be done - this repo currently takes 16 minutes to clone. |
Agreed, this is indeed an efficient approach (in the long run). But I am not sure how some of our clients would fetch it: wildcards? The GitHub REST API? (With the REST API, there would be issues when fetching multiple archives cross-platform.)
Wow, that's larger than I initially thought (I haven't cloned the repo in a while).
This looks good on paper; I am interested in hearing what others think. If we went with it, we could set up link redirects to the correct location (either as DNS records or via a separate repo like https://github.com/tldr-pages/chatroom).
Exactly, will do the currently proposed changes once others agree too. |
No need for any API calls. Clients will fetch it the same way they do it now, just from a different location.
Some clients download directly from https://raw.githubusercontent.com, and you can't tell GitHub to redirect this to something else. Additionally, it would be great if all clients used the same URL, because currently some use the redirect and some do not.
If we decide to use releases, we can only switch repositories once all clients have been updated and we're ready to stop supporting the old method. |
My only concern with releases is the frequency. Right now all clients can access the "latest" version. If we decide to use releases, a client can only access the latest release. And when is a change "big" enough to warrant a release, so that we avoid a release per change? After changing X number of pages? We need to think this release strategy through. We could keep the release process as is and keep release numbers, but add a timestamp for the commit? E.g. |
I meant uploading assets to the same release (overwriting them) every commit, because we do not need old versions. Releases will still be created only on client specification updates. |
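A minimal sketch of the overwrite-in-place idea, assuming the deploy job has the GitHub CLI (gh) available. `gh release upload` accepts a `--clobber` flag that replaces an existing asset of the same name; the tag and file name below are placeholders, and `release_upload_command` is a name invented for this example, not something taken from the deploy script.

```python
from typing import List


def release_upload_command(tag: str, asset_paths: List[str]) -> List[str]:
    """Build a `gh release upload` invocation that overwrites same-named assets.

    `tag` and `asset_paths` are hypothetical inputs; the real deploy script
    may choose them differently.
    """
    if not asset_paths:
        raise ValueError("at least one asset is required")
    # --clobber asks gh to replace an existing asset with the same name,
    # so old copies do not pile up on the release.
    return ["gh", "release", "upload", tag, *asset_paths, "--clobber"]


# Placeholder values, only to show the shape of the command.
command = release_upload_command("v0.0.0-example", ["tldr.zip"])
print(" ".join(command))
```

The list can be handed to `subprocess.run(command, check=True)` in a job that is authenticated against the repository.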
Yep, I guess I wasn't clear in my previous comment 😅. I meant that, if others agree, I will first perform the changes suggested here (#12048 (comment)), to allow working on the website independently, and then discuss your long-term method (maybe in a separate issue for better visibility/trackability). |
These changes will break clients that use https://raw.githubusercontent.com/tldr-pages/tldr-pages.github.io instead of https://tldr.sh/assets (the URL will then be https://raw.githubusercontent.com/tldr-pages/assets). |
Oh, I hadn't considered this. Yeah, this would indeed break clients linking directly to the repository instead of the website. I will fix that first (by opening PRs in the clients soon). |
And if we are going to make PRs that change the URL, then they might as well be with the final solution. If you do this now, that will force every client to create a patch release, and then another one with the URL to GitHub releases.
|
I agree with @acuteenvy's suggestion to use release artifacts instead of hosting them using a bespoke mechanism (git-tracked, even 😱)
If we feel it's too onerous to update the release artifacts on every commit, perhaps we could adopt a snapshot strategy, where a new archive would be generated on a time basis (say, weekly, or daily) rather than on a commit basis. I think that either option ought to be frequent enough for the vast majority of users. |
On a separate note: since we'd be recreating the website repository, I'd strongly suggest that we take the opportunity to filter the git history to remove all the asset update commits but preserve all the other changes; that way we wouldn't have a split in the history of the website code between the old repository and the new one. |
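One possible way to do this kind of history filtering is git filter-repo (a separate tool, not part of core Git). A minimal sketch, assuming all generated files live under a single directory called assets in the website repository; `strip_directory_command` is a name made up for this example. With `--path` plus `--invert-paths`, everything except that directory is kept, and commits that become empty as a result are pruned by the tool's default settings.

```python
from typing import List


def strip_directory_command(directory: str) -> List[str]:
    """Build a `git filter-repo` call that drops one directory from history.

    Assumption: asset update commits touch only `directory`, so they end up
    empty after filtering and are pruned.
    """
    # --path selects the directory; --invert-paths flips the selection so
    # that everything *except* it survives the rewrite.
    return ["git", "filter-repo", "--path", directory, "--invert-paths"]


print(" ".join(strip_directory_command("assets")))
```

This rewrites every commit hash, so it is meant to be run on a fresh clone before the new repository is published.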
Building the assets and overwriting them in the release doesn't take a lot of time, and does not produce garbage (unlike the git method). I don't see why we wouldn't want to do that. |
Autolinks are part of the CommonMark spec (ref <https://spec.commonmark.org/0.30/#autolinks>) and well supported. Redirect indicators are removed as a part of #12048.
This change seems like a good plan to me. Thank you once again, @kbdharun, for taking the initiative here wrt infrastructure! I suggest we implement it at the earliest available opportunity.
@acuteenvy: I agree that something like GitHub releases would be a better plan, given that the assets really don't need version control, but that would likely take longer to implement. I suggest we first implement this as @kbdharun suggests, to buy ourselves some time and fix the immediate problem (since they've gone to all the trouble of testing it etc :P), and then look at releases in a separate issue.
A related plan, once we have a separate assets repo, could be to adjust the script to always amend the last commit and force-push, with a normal commit only every ~month or so. This would be more compatible with existing clients than GitHub releases.
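The amend-and-force-push idea can be sketched as a small decision function. This is only an illustration under stated assumptions: the 30-day interval stands in for "every ~month or so", the commit message is a placeholder, and `deploy_commands` is a hypothetical helper rather than part of the existing script.

```python
from datetime import date
from typing import List


def deploy_commands(today: date, last_normal_commit: date,
                    interval_days: int = 30) -> List[List[str]]:
    """Return the git commands for one asset deployment.

    Most deployments amend the previous commit and force-push, so history
    does not grow; once `interval_days` have passed, a normal commit is made.
    """
    stage = ["git", "add", "--all"]
    if (today - last_normal_commit).days >= interval_days:
        # Periodic normal commit: history grows by one entry.
        return [stage, ["git", "commit", "-m", "Update assets"], ["git", "push"]]
    # Usual case: overwrite the last commit in place.
    return [stage, ["git", "commit", "--amend", "--no-edit"], ["git", "push", "--force"]]


for cmd in deploy_commands(date(2024, 2, 15), date(2024, 2, 1)):
    print(" ".join(cmd))
```

Amended commits that are no longer reachable still occupy space on the server until they are garbage-collected, so this limits growth for anyone cloning rather than guaranteeing a small repository.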
It will, but the client spec says clients MUST download from e.g. https://tldr.sh/assets/tldr.zip, so any client downloading from elsewhere is in violation of the client spec. We do need to update the client spec regarding this, though, since it indicates where each URL redirects to. I've opened PR #12133 to resolve this. |
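A client author could check a configured download URL against the canonical host before a migration. A minimal sketch, assuming that comparing hostnames is a good-enough proxy for the spec's requirement (the client specification itself remains authoritative); `CANONICAL_HOST` and `uses_canonical_host` are names invented for this example.

```python
from urllib.parse import urlparse

# Host taken from the URL quoted in the thread.
CANONICAL_HOST = "tldr.sh"


def uses_canonical_host(url: str) -> bool:
    """Return True if `url` points at the canonical asset host."""
    # urlparse().hostname is lowercased and excludes any port.
    return urlparse(url).hostname == CANONICAL_HOST


print(uses_canonical_host("https://tldr.sh/assets/tldr.zip"))
print(uses_canonical_host("https://raw.githubusercontent.com/tldr-pages/tldr-pages.github.io"))
```

Clients that pass this check keep working when the files behind the redirect move; clients that fail it are the ones that would need a patch release.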
I've already implemented and tested it (#12062). Updating all clients will definitely take a while, but that is not much of a problem if we continue to support the old method for some time (which is what I did in the PR). |
You're welcome. @acuteenvy's solution would work best for fixing the issue permanently. Pasting the DM I sent you here, as I don't have time to type it again 😅. My initial proposal was meant as a short-term solution (I hadn't considered GitHub releases, or clients using redirects for GitHub Pages links). Now that we have discussed it (I have personally used this approach in other projects), I think going with releases would be better in the long run. [I can separate the website once we have fully migrated, maybe next year; there is no need to do it immediately.] Once your PR (#12133) and acuteenvy's PR (#12062) are merged, I think we can make a minor release informing client authors about the future change in 1 year [in the meantime we can try updating existing clients to use release links, i.e. https://github.com/tldr-pages/tldr/releases/download/latest/tldr-pages..zip]. We can leave this issue open for now and close it when we fully drop this method of uploading assets via GitHub Pages. Also, I would love to hear some client authors' feedback on publishing to GitHub releases. cc @dbrgn @niklasmohrin @rwv |
tldr.inbrowser.app uses the tldr Git repository's archive download directly (https://github.com/tldr-pages/tldr/archive/refs/heads/main.zip), so this shouldn't matter for it. |
Tealdeer uses tldr.sh to download the latest archive. As long as the semantics of that endpoint don't change, there shouldn't be anything for us to do. Personally, I would like the delay between the GitHub repo and the zip downloaded from tldr.sh to be kept to a minimum to avoid possible user confusion. I think something like 30 min should be the maximum. If feasible, an instant update on every commit would be nice (although care must be taken to always keep the newest version when two PRs get merged right after one another). |
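The "keep the newest version" concern can be illustrated with a small guard that a deploy job might apply before publishing. This is a sketch under assumptions: it supposes each build records the timestamp of the commit it was built from, and that these timestamps increase along the main branch (Git does not guarantee this in general). `Build` and `pick_published` are names invented here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Build:
    commit: str        # commit the archive was built from
    committed_at: int  # Unix timestamp of that commit (assumed monotonic on main)


def pick_published(current: Optional[Build], candidate: Build) -> Build:
    """Return the build that should be live after `candidate` finishes.

    A slow job for an older commit must not overwrite a newer archive.
    """
    if current is None or candidate.committed_at > current.committed_at:
        return candidate
    return current


older = Build("aaa111", 1_700_000_000)
newer = Build("bbb222", 1_700_000_600)
# The job for the newer commit finished first; the older one arrives late.
print(pick_published(newer, older).commit)
```

The same rule covers retriggered old workflow runs: a rerun for an earlier commit loses against whatever newer build is already published.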
Maybe we can redirect the asset URL to https://github.com/tldr-pages/tldr/archive/refs/heads/main.zip and let GitHub handle archiving and caching for us. This also keeps asset.zip always up to date. The per-locale assets would be a problem, though. |
Client spec:
My personal opinion: do we really need language-specific caching? The whole repo is only 7.2 MB. I see little benefit, and the complexity increases, since the client needs to deal with locales and not-found cases. |
While space and connectivity aren't a concern for most of us, they matter in some clients, like the Node client (when a page isn't found or you make a typo, the asset is fetched again, adding to the overhead). The main reason for this approach is that certain clients/integrations (extensions) target only a few select languages (where shipping the others isn't necessary [to avoid increasing the application's size]), and some embedded-system/cellular users (from remote regions, etc.) have raised the same issue in the past. [Thus we introduced this method.] Take me as an example: my university's wireless network is capped at 2 MBPS, and the speeds you get from GitHub are even lower. (I can use cellular 5G instead.) If I use the university network, fetching the entire archive takes anywhere between 30 seconds and a minute, whereas with the current method it is much faster. |
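As a rough cross-check of these numbers, the 7.2 MB figure quoted earlier and a 2 Mbit/s link can be plugged into a one-line estimate. This is back-of-the-envelope arithmetic only, assuming "2 MBPS" means megabits per second and ignoring protocol overhead and throttling; `download_seconds` is a helper made up for this example.

```python
def download_seconds(size_megabytes: float, link_megabits_per_second: float) -> float:
    """Ideal transfer time: megabytes are converted to megabits (x8)."""
    return size_megabytes * 8 / link_megabits_per_second


full_archive = download_seconds(7.2, 2.0)
print(f"{full_archive:.1f} s")  # prints "28.8 s"
```

That lands at the lower end of the quoted 30 to 60 seconds; since real throughput from GitHub is reported to be lower than the cap, the observed range is plausible.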
If you think that's not needed, don't implement it. There are many clients for many different use cases, and we will still continue to provide the full archive. |
Merged #12062 and it has successfully added the assets to the latest release.
Regarding this, we could add a branch protection rule for the main branch that enforces a merge queue (i.e. when multiple PRs are merged, the next one starts only once the previous job has completed). I think I have proposed this in the chatroom or a thread before, not sure where 😅 . Check out https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/configuring-pull-request-merges/managing-a-merge-queue for more information. |
Tested the implementation of merge queues for the past hour; the advantage it introduces is that we can require PRs to be in sync with the main branch, and we can also limit the number of PRs that can be merged at a time. I tested it in my fork (by setting the default branch, enabling it in branch protection, using squash as the merge strategy, allowing at most 1 PR to run a build in the queue, and allowing only one PR to be merged per merge commit); I also removed the "push" parameter from the action. But it comes with a lot of disadvantages:
After reading about this online and in the docs, and playing around with it, I conclude that in its current state it is not feasible for us. Since this is a fairly new feature, I hope it will improve and we can revisit it in the future. Until then, maintainers should make sure not to retrigger old failed workflow runs if a newer one has succeeded (and the asset is deployed), as we specify in the maintainers guide. I have attached the workflow files I used for future reference: |
The new assets on the release look cool, but I don't see an archive for English there? I also don't see an archive for all pages attached to the release as we currently have in the website git repo? |
See #12062 (comment) for more information. TL;DR: release assets aren't sorted alphabetically, and recently updated ones are shown last (all the assets in the website git repo are present here too). |
* CLIENT-SPECIFICATION: remove redirect indicators, use autolinks
Autolinks are part of the CommonMark spec (ref <https://spec.commonmark.org/0.30/#autolinks>) and are well supported. Redirect indicators are removed as a part of #12048.
* CLIENT-SPECIFICATION: update changelog
---------
Co-authored-by: Lena <[email protected]>
Co-authored-by: K.B.Dharun Krishna <[email protected]>
Cool, ty for the clarification @kbdharun! |
I have been thinking about opening this issue for a while. I initially hinted at it in this thread in the chat room and mentioned it occasionally in PRs. I had time to test the changes during the winter holidays last year, so I am now opening a formal issue to discuss it and the detailed steps for performing the transition seamlessly.
Proposal
Summary
This proposal aims to separate assets and website contents into separate repositories to ease up contributing, maintenance, etc.
Problem
git takes a lot of time and also takes a whopping 6 GB+ storage space (the main culprit being the assets directory's git cache, which gets large over time); this makes it nearly impossible for most contributors to contribute/improve our website.
Solution
My solution to this issue is to move the website and assets to separate repositories (tldr-pages.github.io, assets) and archive the current website repository under a different name like old-site.
Advantages
assets repos in future (if we receive an intimation from GitHub about site size) [i.e. we can download the contents as ZIP from assets repo and unpublish it, then rename and archive it like assets-1, create a new one under the same name].
Considerations
Modifications (@ main repo)
deploy.yml would need to be modified to deploy to the new assets repository instead of inside a directory and the repo slug needs to be updated too. (See the patch file section below for more information)
Clarifications
tldr-bot for it to commit and push changes.
https://tldr.sh/repo-name/contents, in this case, it would be https://tldr.sh/assets/ (we already use this approach for showing web version of manpages at tlrc at https://tldr-pages.github.io/tlrc [https://tldr.sh/tlrc/]).
Steps
deploy.yml (to commit/push to new assets repo).
assets directory (and then rest of website files except .git).
old-website.
tldr-pages.github.io and for assets under the name assets, then commit the files taken from the ZIP archive to the respective repositories.
Patch for deploy.yml
Location: https://github.com/tldr-pages/tldr/blob/main/scripts/deploy.sh.
References and Testing
I tested these changes two weeks ago in my fork with this assets repository (the live version [available until this issue's closure] can be found here)
Conclusion
We have been optimizing the build and deploy processes for the past few weeks so that new archives/PDFs are built and committed only when they are modified. This proposal is the last part of that optimization work.
If given the green light, I can perform the changes soon.
I would like to ping some of our active maintainers and people with access to infrastructure for your opinion about this. (Will inform the same in the chatroom)
cc @sebastiaanspeck , @sbrl, @SethFalco, @agnivade, @acuteenvy, @owenvoke, @blueskyson, @waldyrious