Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Serve static HTML #634

Closed
rviscomi opened this issue Jan 15, 2020 · 19 comments
Closed

Serve static HTML #634

rviscomi opened this issue Jan 15, 2020 · 19 comments
Labels
development Building the Almanac tech stack enhancement New feature or request performance Issues related to site performance

Comments

@rviscomi
Copy link
Member

Our backend is currently hosted on Google App Engine running a Python server and rendering Jinja templates in Flask based on JSON config data.

For every URL, there is effectively one static HTML page. However, every time a user visits one of these pages, the Python server needs to render the templates using (in-memory) JSON config data. Because each type of page requires a different amount of rendering, performance varies across the site:

image

This chart tracks the 30-day trend of CrUX data for four of our pages: Home, CSS, JS, and Table of Contents. The ToC page has the highest % of fast FCP on mobile, at ~80%, although it's been dropping over time. The CSS page, which is quite large in terms of content and assets, has the fewest fast FCP experiences, at about 40%.

I'd argue that these pages should take more or less the same amount of time to download and render, and they should be closer to 80% fast than 40%. Ideally 100% fast!

The key to this, I think, is to stop server-side rendering pages. When we deploy the site, none of the HTML will change. So we should take a snapshot of the markup and serve those files statically as we would a JPEG. Sure, the caching headers can be different but the key is that the server doesn't have to think before it returns the contents. For larger documents like the CSS or HTTP/2 chapters, I'd love to explore HTTP chunking as a way to give clients a head start to load dependencies and render above-fold content early.

I think this can simplify our tech stack as well. For example, we can host our static assets (HTML, JS, CSS, images) on a host like Netlify or even directly from this repo using GitHub Pages. Without server-side code in production, we can more quickly deliver contents to users, which should improve our bottom-line metrics like engagement rates.

One very small exception to this is the Contributors page which shuffles the list of contributors on each request. The HTML is different each time and we can either shuffle in JS (bad because of a delay in rendering) or remove that feature entirely (bad because it unfairly features some contributors at the top more prominently). Open to discussion about how to resolve that.

@HTTPArchive/developers any thoughts on this proposal?

@rviscomi rviscomi added enhancement New feature or request development Building the Almanac tech stack labels Jan 15, 2020
@borisschapira
Copy link
Contributor

borisschapira commented Jan 15, 2020

As a strong Jamstack advocate, I'm in. I recommend using 11ty which is a pretty straightforward node SSR. 11ty support nunjucks templates, and nunjucks is heavily inspired by jinja2 so I guess migrating the templates won't be long. And there's an easy way to deal with internationalization through conventions and a content key.

Regarding the contributors list, we can always request the data through XRH and insert it in the DOM after a shuffle.

@rviscomi
Copy link
Member Author

Thanks @borisschapira. Question about 11ty: does that require us to change our development workflow as well? Or could we simply take the static resources from our current local setup and push that out to the 11ty server?

@rviscomi
Copy link
Member Author

Regarding the contributors list, we can always request the data through XRH and insert it in the DOM after a shuffle.

This is effectively the same as shuffling in JS with the drawback of having to block rendering on the XHR. If we did shuffle in JS we could always bake the JSON contributors into the HTML document. 😕

@tunetheweb
Copy link
Member

tunetheweb commented Jan 15, 2020

I'm not convinced server side rendering is the problem.

The TOC chapter is the only one of the one you've tested which has no hero image. That's why it's faster IMHO. The delay is in downloading the images - and particularly the hero image which will be delivered as the large image (though from last release it will use smaller images if screen doesn't support larger images and <picture> is supported).

Does the CrUX data show TTFB which would be a better indicator of whether the server side rendering was holding us up?

I would also be interested in seeing the CrUX data in a months time as we implemented a number of fixes this weekend in the first release since the freeze:

I would hope these would improve things, though I admit it doesn't appear to have from above graphs despite apparently including this week when those changes were already live 😔

However, I do like the idea of a static site. Can we make more use of GitHub Actions? I don't know much about them (though would be keen to learn) but could flow be:

  • We open pull request as now.
  • A GitHub action runs:
    • if the PR does not contains chapter edits to markdown/other edits to HTML templates then end, else:
    • spin up a linux server
    • check out current feature branch
    • auto update the appropriate timestamp in the changed file(s) - Autogenerate timestamps #317
    • run npm run generate
    • run the development server (would it have to set up the python env each time?)
    • grab the rendered HTML from the development server (using curl or similar) for the altered page(s) and store in /static directory
    • check in everything to feature branch
  • if this passes pull request is ready to merge to master.

Now I don't know much about GitHub actions and whether above is possible, but if it is it would massively simplify making changes (and would make it much easier to accept small, typo type edits from new, random contributors who are not as familiar with how to generate the site), would give us the static site we potentially want and wouldn't require a big move off the current tech stack which has served us reasonably well. In future we could even take it to the next stage and auto-deploy if we wanted.

Or is the use of using a development server to generate the HTML for us a bit hacky and is there a better way to generate the HTML from Jinja2 templates from the command line?

HTTP Chunked Encoding is no longer a thing in HTTP/2 (which Google App Engine uses) - it's inbuilt in the protocol and happens by default. Even if this was not the case, I'm not sure we need it with the size of our HTML files. The HTTP/2 chapter is the CDN one and at 25.4KB gzipped that's not much at all really.

For contributors page I'd have the contributors listed alphabetical and an inline <script> tag to quick reorder them. I can't see that causing much delay and leaves us with the fallback when JavaScript is not enabled.

@DirtyF
Copy link

DirtyF commented Jan 16, 2020

@ethomson might help on the GitHub Actions topic?

@tunetheweb
Copy link
Member

Oh that would be good @DirtyF !

The other option (whether we do GitHub actions - which would have other benefits) is just to stick Cloudflare or similar in front of it, Benefit from it's caching of the HTML pages and get HTTP/3, improved HTTP/2 prioritisation implementation as Google's CDN is sub-optimal in this respect and even further improvements prioritiations on top of that better HTTP/2 and 3 prioritisation. But not sure if we have budget for that,

@ethomson
Copy link

👋 What you’ve described is not impossible with GitHub Actions but I fear that it will be a bit challenging at present: reason being that pull requests from forks have downgraded security rights. So - doing something like spinning up a Linux server sounds like it would require some sort of secret or token that had permission to start a service. (eg you might need an AWS token to start the VM.)

You can add that key as a secret in your repository and GitHub Actions workflows will be able to access it... except workflows that are running as pull requests from forks. Those won’t get secrets (to prevent a malicious person from trying to steal your secrets).

I don’t have a very good workaround for you at present, I’m afraid. I think a GitHub App may make more sense at the moment.

@tunetheweb
Copy link
Member

Thanks for getting back so quickly @ethomson!

pull requests from forks have downgraded security rights.

Is there such a thing as a post-commit hook or GitHub action to apply changes to master after a PR has been merged?

So - doing something like spinning up a Linux server sounds like it would require some sort of secret or token that had permission to start a service. (eg you might need an AWS token to start the VM.)

For development this is run as a Flask development server started from the command line and running on port 8080 so don't need any particular service permissions. Though does require some dependencies as detailed in steps 1-3 of the README.md in our src directory.

As I said above, using the whole flask development server is probably overkill and there may be better ways to convert Jinja2 templates to actual HTML but still, I'm curious to know what the options are anyway since I'm less familiar with GitHub and GitHub actions in particular.

@ethomson
Copy link

Is there such a thing as a post-commit hook or GitHub action to apply changes to master after a PR has been merged?

Yes! You can use this trigger:

on:
  push:
    branches: [master]

which will trigger on any push to master, or merge of a pull request into master.

For development this is run as a Flask development server started from the command line and running on port 8080 so don't need any particular service permissions.

Aha - thanks for the explanation. If that's the case then this makes a lot more sense for a GitHub Actions workflow.

name: Build
on: 
  push:
    branches:
    - master
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
    - run: |
        sudo pip install virtualenv
        virtualenv --python python3 env
        source env/bin/activate
        pip install -r requirements.txt
        python main.py
        ...

@tunetheweb
Copy link
Member

Oh this looks promising (and interesting to learn about!!).

And that would allow us to checkout master, make changes, and then commit to master again?

@ethomson
Copy link

Yeah, you'll need to set up a personal access token that has write access to the repository. Then add two secrets to your repository, one with the username of the user that created that PAT, and another with the PAT itself. For example, if you set username to PUSH_USERNAME and the PAT itself to PUSH_TOKEN, then you can do this:

git checkout master
echo hello world > foo.txt
git add foo.txt
git config --global user.email "[email protected]"
git config --global user.name "Automatic Update"
git commit -m"Automatic update"
echo 'machine github.com login ${{ secrets.PUSH_USERNAME }} password ${{ secrets.PUSH_TOKEN }}' > ~/.netrc
git push origin master

@tunetheweb
Copy link
Member

@ethompson I've set this up in my fork, and it seems to be working well.

The one thing I'm stuck with is getting a list of files changed in that commit. This post seems to suggest using this:

git diff-tree --no-commit-id --name-only -r ${{ github.sha }}

But that is returning blank - see the "Update timestamp for modified files" section of an example run

Run echo "files changed in commit $GITHUB_SHA:"
files changed in commit dc0374f86f61f738b9e7ae0ce5036d87374a18a2:

No files are returned by that command, but clicking on the dc0374f link at top of the screen shows the file did change, and the same command run locally shows that it changed.

Any ideas why that would not work? It's almost like the commit hasn't fully committed by this point, but since this is a post-commit action it should be there in master by this point shouldn't it?

Or is there a better way of getting the list of files included in this commit?

@ethomson
Copy link

By default, the checkout action does a depth=1 on master (and the PR branch being built, if there is one), which means that diff-tree can't compare against its first-parent. In this case, it's running:

/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --progress --no-recurse-submodules --depth=1 origin +refs/heads/master*:refs/remotes/origin/master* +refs/tags/master*:refs/tags/master*

If I understand the depth and reachability, I think that you could get away with fetch-depth: 1. You could change your checkout to:

- uses: actions/checkout@v2
  with:
    fetch-depth: 1

(You only need to diff github.sha against its first parent, which is master to understand what changed.)

@ethomson
Copy link

But you can also set fetch-depth: 0 to get the whole, not-at-all shallow, repository.

@tunetheweb
Copy link
Member

Thanks. fetch-depth: 1 didn't work for some reason, but fetch-depth: 0 did.

@ethomson
Copy link

Thanks - I'm a little surprised, but not completely surprised. I'll 👀 at this a bit to better understand it, but I'm glad that fetch-depth: 0 did.

@tunetheweb
Copy link
Member

OK so at the moment I have the following working on my fork, after a Pull Request is merged:

  1. spin up a linux server
  2. check out current master
  3. auto update the appropriate timestamp in the changed file(s) - Autogenerate timestamps #317
  4. run npm run generate
  5. check in everything to master branch

This takes care of #317 (with the added bonus of npm run generate too).

I can't get it to start the flask server in the background (also doesn't work locally on my command line, though I can background it once it's started).

But then honestly think having markdown->jinja2 template->web server->html is a bit of a hacky workflow to generate a static site.

Separately, I'm not sure we really need a static site as don't think that's the cause for what @rviscomi has raised in #634 (comment) as discussed in #638

Anyway still been very useful to learn about Github Actions (they're very cool!), spotted and fixed a couple of bugs, and think we can implement what I have in to solve #317. So will move this discussion over there.

Thanks @ethomson for all your help!

@tunetheweb
Copy link
Member

Actually @ethomson I've another problem.

That works when I commit to master directly. When I do it via a pull request, it again returns no files when I run git diff-tree in the Github Action 😞

@tunetheweb
Copy link
Member

It would appear from the last release that since we added caching headers in #608 Google Cloud is caching our site for the 3 hours that we set. That means this line from @rviscomi 's opening comment is incorrect:

However, every time a user visits one of these pages, the Python server needs to render the templates using (in-memory) JSON config data.

Only the first user from each cloud server (set by region I presume?) takes this hit.

This can be proven with the contributors page:

This is supposed to give a random ordering with each refresh. However it does not - it gives the same "random" ordering on each refresh and you have to wait 3 hours since it was cached to get a random ordering.

However if you add a query string you get a new random ordering (which is also now locked for 3 hours no matter how many times you refresh) as detailed in Google Cloud Caching documentation:

Different query string, different random ordering:

Therefore we effectively have a static site already, thanks to the Google Cloud CDN in front of us. Yeah we possibly could improve that first user's experience with a truly static site, but I don't think it's worth changing tech stat just for that to be honest. Plus we've yet to prove that was ever the reason for the slowness and, as per #638 , there are other things I think are bigger impact.

So I vote to close this issue.

Barry

P.S. @ethomson would still like to get GitHub actions working to build the site, if you get a chance to figure out the answer to my last comment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
development Building the Almanac tech stack enhancement New feature or request performance Issues related to site performance
Projects
None yet
Development

No branches or pull requests

5 participants