docs: ADR for serving static assets #110

ormsbee · 2023-10-31T17:20:28Z

A first stab at an asset serving ADR that could serve Studio and the LMS. I'm especially looking for feedback on how we could do the cookie auth handoff (or for someone to tell me why it's a terrible idea).

ormsbee · 2023-10-31T18:43:09Z

docs/decisions/0015-serving-static-assets.rst

+
+OPEN QUESTIONS:
+
+* What's the best way to do handoff between LMS/Studio and get that cookie information over to the asset domains?


One idea would be to have a one-time key that we store in cache and redirect the user's browser with it. So it could be something like:

Browser makes a request to a new endpoint in Studio/LMS.

Studio/LMS generates a random token, and stores user information associated with it in our backend cache (redis or memcached).

Studio/LMS then redirects the request to an auth URL on the static assets server with that random token in the querystring.

Asset server finds that entry in the cache, logs that user in, and removes the cache entry.

Asset server is now using session auth for that user.

It's cheating in a way because it relies on the asset server and the Studio or LMS instance being on the same service backend. But we're making that the case anyway because of permissions checking integration. And it's simpler than having signed requests (though it could ultimately be extended to go in that direction if we ever want to separate it into an entirely different service).
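The handoff steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: a plain dictionary stands in for the shared redis/memcached cache, and the function names, the `token` query parameter, and the 30-second lifetime are all hypothetical rather than anything decided in this thread.

```python
import secrets
import time
from typing import Dict, Optional, Tuple

# Stand-in for the shared backend cache (redis or memcached in the proposal).
# Maps token -> (expiry timestamp, username).
_cache: Dict[str, Tuple[float, str]] = {}

TOKEN_TTL_SECONDS = 30  # assumed lifetime; not specified in the discussion


def issue_handoff_url(username: str, asset_auth_url: str) -> str:
    """Studio/LMS side: store user info under a random token, build the redirect URL."""
    token = secrets.token_urlsafe(32)
    _cache[token] = (time.monotonic() + TOKEN_TTL_SECONDS, username)
    return f"{asset_auth_url}?token={token}"


def redeem_token(token: str) -> Optional[str]:
    """Asset server side: find the cache entry, remove it, return the user to log in."""
    entry = _cache.pop(token, None)  # pop() makes the token single-use
    if entry is None:
        return None
    expires_at, username = entry
    if time.monotonic() > expires_at:
        return None  # token existed but is too old to honor
    return username
```

Because the entry is removed on first use, replaying the redirect URL does nothing, which is the property that makes it reasonably safe to pass the token in a query string.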

docs/decisions/0015-serving-static-assets.rst

bradenmacdonald · 2023-10-31T18:50:19Z

docs/decisions/0015-serving-static-assets.rst

+
+OPEN QUESTIONS:
+
+* What's the best way to do handoff between LMS/Studio and get that cookie information over to the asset domains?


Maybe:

Browser requests a file without any auth headers

Edge function requests the metadata from the LMS API (if not already cached) to determine if auth is needed.

If no auth is needed, serve the file directly.

If auth is needed, return a 302 redirect to a special LMS endpoint. That endpoint redirects back to the CDN but includes an auth token in the query string.

Edge function serves the file along with a Set-Cookie header, moving the auth token from query string to cookie.

For future requests, the browser will send the auth cookie and the edge function can verify it with the LMS if needed, or ignore it if the asset doesn't require auth.
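A minimal sketch of the decision logic in that flow, written as a plain Python function rather than for any real edge runtime. The `asset_auth` cookie name, the `next` query parameter, and the two callbacks (which stand in for the cached LMS metadata lookup and the token verification) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional
from urllib.parse import quote


@dataclass
class EdgeResponse:
    status: int
    headers: Dict[str, str]


def handle_asset_request(
    path: str,
    cookie_token: Optional[str],
    query_token: Optional[str],
    requires_auth: Callable[[str], bool],        # stands in for the LMS metadata lookup
    token_is_valid: Callable[[str, str], bool],  # stands in for verifying with the LMS
    lms_auth_url: str,
) -> EdgeResponse:
    # Assets that need no auth are served directly.
    if not requires_auth(path):
        return EdgeResponse(200, {})
    # A token arriving in the query string is moved into a cookie.
    if query_token and token_is_valid(path, query_token):
        cookie = f"asset_auth={query_token}; Secure; HttpOnly"
        return EdgeResponse(200, {"Set-Cookie": cookie})
    # Later requests carry the cookie instead of the query string.
    if cookie_token and token_is_valid(path, cookie_token):
        return EdgeResponse(200, {})
    # No usable credentials: bounce through the special LMS endpoint.
    location = f"{lms_auth_url}?next={quote(path, safe='')}"
    return EdgeResponse(302, {"Location": location})
```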

bradenmacdonald · 2023-10-31T18:51:39Z

docs/decisions/0015-serving-static-assets.rst

+
+The high scale version of this approach will require having a CDN with programmable workers and a scalable object store that supports signed URLs (such as S3). The main objective would be to shift the file streaming burden out of Django and onto the CDN and object store.
+
+In Learning Core, we would implement an APIView (or possibly extend the asset-serving one) to return a JSON response of file metadata. This would include things like size, MIME type, last modified date, cache expiration policy, etc. The response would also contain a signed URL pointing to an object store resource, like S3. The CDN worker then does the fetch on that resource. We would create an example worker for at least CloudFlare.


As mentioned above, I think we should just go with this approach right away, though for now we can substitute a minimal in-process Python Django app in lieu of an edge function.
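To make the shape of that metadata response concrete, here is a rough sketch. The field names, the secret, and the HMAC-based signing are all assumptions for illustration; in particular, this is not the S3 presigned URL scheme, just a stand-in that shows an expiring signature being attached to a URL.

```python
import hashlib
import hmac
import json
import time
from typing import Optional
from urllib.parse import urlencode

SIGNING_KEY = b"example-secret"  # hypothetical secret shared with the object store


def sign_url(base_url: str, expires_at: int, key: bytes = SIGNING_KEY) -> str:
    """Append an expiry and an HMAC-SHA256 signature (illustrative, not the S3 scheme)."""
    message = f"{base_url}\n{expires_at}".encode()
    signature = hmac.new(key, message, hashlib.sha256).hexdigest()
    query = urlencode({"expires": expires_at, "signature": signature})
    return f"{base_url}?{query}"


def build_asset_metadata(
    base_url: str,
    size: int,
    mime_type: str,
    last_modified: str,
    max_age: int,
    now: Optional[int] = None,
) -> str:
    """Return the JSON document a CDN worker would read before fetching the file."""
    now = int(time.time()) if now is None else now
    return json.dumps({
        "size": size,
        "mime_type": mime_type,
        "last_modified": last_modified,
        "cache_control": f"private, max-age={max_age}",
        "signed_url": sign_url(base_url, now + max_age),
    })
```

The worker would then fetch `signed_url` itself, so the file bytes never pass through Django.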

connorhaugh

Intriguing. I have some questions above and then a document level question.

I think an implicit choice made is that there is a 1-1 mapping of assets to xblocks. We, however, know that reuse of assets is very common across blocks. How does that square with permissions and versioning?

connorhaugh · 2023-10-31T17:50:32Z

docs/decisions/0015-serving-static-assets.rst

+Context
+--------
+
+Both Studio and the LMS need to serve course team authored static assets as part of the authoring and learning experiences. These will most often be images, but may also include things like subtitles, audio files, and even JavaScript. It does NOT typically include video files, which are treated separately because of their size.


Suggested change

Both Studio and the LMS need to serve course team authored static assets as part of the authoring and learning experiences. These will most often be images, but may also include things like subtitles, audio files, and even JavaScript. It does NOT typically include video files, which are treated separately because of their size.

Both Studio and the LMS need to serve course team authored static assets as part of the authoring and learning experiences. "Static assets" in the edx-platform context presently refers to: image files, audio files, text document files like PDFs, video transcription files, and even JavaScript and Python files. As video files have a different implementation due to their large file size, they are not commonly referred to as static assets.

I think this is an important distinction to make, so I made it a little clearer. I also think that you should note in the decision that we are choosing to maintain this distinction between static assets and videos.

connorhaugh · 2023-10-31T18:55:33Z

docs/decisions/0015-serving-static-assets.rst

+  Assets may reference each other in relative links, e.g. a JavaScript file that references images or other JavaScript files. That means that our solution cannot require querystring-based authorization tokens in the style of S3 signed URLs, since asset files would have no way to encode those into their relative links.
+
+**Multiple versions of the asset should be available at the same time.**
+  Our system should be able to serve at minimum the current draft and published versions of an asset. Ideally, it should be able to serve any version of an asset. This is a departure from the way Studio and the LMS currently handle files and uploads, since there is currently no versioning at all–assets exist in a flat namespace at the course level and are immediately published.


I see why this is useful, but I want to dig into this a little bit, because this is inherently a costly choice, right? I am curious: what is the process by which an author augments a file to create a new version? Is it re-upload with the same name?

Yeah, basically. If you have a Problem and /static/figure1.webp is referenced, and you upload a new /static/figure1.webp, then you're creating a new version of that XBlock with that file updated.

It is somewhat costly, since we're holding onto both the old and new images in this case. But at the same time, it gives us the ability to actually batch changes in XBlock content and files together, e.g. for publishing them both at the same time.

connorhaugh · 2023-10-31T18:56:44Z

docs/decisions/0015-serving-static-assets.rst

+Security Requirements
+~~~~~~~~~~~~~~~~~~~~~
+
+**Assets must enforce user+file read permissions at the Learning Context level.**


+1 to this as being a major req.

ormsbee · 2023-11-01T15:12:48Z

@connorhaugh:

I think an implicit choice made is that there is a 1-1 mapping of assets to xblocks. We, however, know that reuse of assets is very common across blocks. How does that square with permissions and versioning?

The mapping isn't really 1:1 though. The raw data for the asset lives in a file referenced by a RawContent model. But that same asset can exist in many different XBlock components. It can even exist as multiple different file names within the same XBlock if we really want it to, though I can't think of a plausible use case for that.

In terms of re-use, I think there are two scenarios that can play out:

Incidental re-use, with component-local assets.

We upload a cute cat image in one problem, and then upload it (and a few others) in a second problem in the same library. We're effectively "re-using" the image data, but not in any way that requires centralized management. The system saves the raw data once, but there are two different rows of ComponentVersionRawContent that reference it. The two images will be available at two different URLs and can have separate permissions checks applied, since they are addressed as part of two different XBlock components, and the system may give a 403 for one you're not yet allowed to see.

Intentional re-use, with centralized management.

I think this one's trickier, because there are a couple flavors.

The first kind is linking to the assets of other Components directly. Under the covers, this is how existing Files and Uploads in Studio would be handled (with a Component that represents all the current files and uploads stored for a course). Things could be linked directly, and anyone enrolled in the course would have access to it.

There's also linking to a shared asset where the dependency is more explicitly tracked. Say we have a library of custom grader code, that includes both Python code that can be executed, as well as JavaScript code for certain problem types. Our component XBlock (a ProblemBlock) uses these assets. There are a few broad approaches I can think of:

Optimize for the browser by downloading things from the original component location, even if that's in another Learning Package. This is the best for browser caching, especially if we're using it across many different ProblemBlocks in our course, but it can make permissions checking difficult.
Make the download URL of the shared content appear to be nested inside the borrowing component–i.e. make it look like the asset belongs to the ProblemBlock and not wherever it originally came from. This simplifies permissions checking because you're only ever checking that you have permissions to the ProblemBlock, not to anything it might be using.
Something in between, possibly at the LearningPackage level.
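To illustrate the incidental re-use scenario described earlier in this comment (raw data saved once, referenced from two components), here is a toy sketch. The two containers loosely mirror RawContent and ComponentVersionRawContent, but keying the raw data by a SHA-256 digest and the tuple layout are assumptions made for the example, not the actual models.

```python
import hashlib
from typing import Dict, List, Tuple

# digest -> bytes; stands in for RawContent (one entry per distinct file body)
raw_content: Dict[str, bytes] = {}

# (component key, path within component, digest); stands in for
# ComponentVersionRawContent rows
component_files: List[Tuple[str, str, str]] = []


def add_file(component_key: str, path: str, data: bytes) -> str:
    """Attach a file to a component, storing the bytes only if they are new."""
    digest = hashlib.sha256(data).hexdigest()
    raw_content.setdefault(digest, data)  # no-op when the bytes already exist
    component_files.append((component_key, path, digest))
    return digest


cat = b"\x89PNG...cute cat bytes..."
add_file("problem-1", "static/cat.png", cat)
add_file("problem-2", "static/cat.png", cat)
```

After both uploads there is one stored copy of the bytes and two references to it, each addressable (and permission-checked) through its own component.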

ormsbee · 2023-11-05T04:26:02Z

Made some major revisions. I'll probably ping the forums for more eyes this coming week.

docs/decisions/0015-serving-static-assets.rst

* Accepted some suggested edits on how we define assets.
* Revised proposal to take advantage of X-Accel-Redirect.
* Mandated object storage server.
* Made a first pass at how auth could work.

ormsbee · 2023-12-02T20:57:52Z

@bradenmacdonald: Finally getting back to this. Addressed your last comment about how an object store should no longer be a requirement. Honestly relieved to not have to try to push that change through just yet. 😛

ormsbee · 2023-12-02T20:59:14Z

@connorhaugh: I've included your edit suggestion. Please let me know if you have any other concerns.

bradenmacdonald

We'll learn more as we go but this seems like a good direction to me.

regisb

I learned about this ADR via https://github.com/openedx/blockstore/issues/314#issuecomment-1924458461 on the Blockstore DEPR issue. I realize I'm very late to the game but I think very strongly that we must reconsider the new requirement of a second top-level domain.

regisb · 2024-02-05T08:59:50Z

docs/decisions/0015-serving-static-assets.rst

+
+  The further implication of this requirement is that *permissions checking must be extensible*. The openedx-learning repo will implement the details of how to serve an asset, but it will not have the necessary models and logic to determine whether it is allowed to.
+
+**Assets must be served from an entirely different domain than the LMS and Studio instances.**


Does this requirement apply to all Open edX platforms, and not just edX.org? We can't possibly expect Open edX users to register a second domain name to host their platforms. At the very least, this item should be discussed with the BTR. Personally, I would oppose such a requirement.

@regisb: It would apply to all Open edX platforms.

For now, we'll use a different sub-domain for development purposes (so it would come across like a new service). There are still some long term security risks with that, but it's better than the current state of things. I'll put a writeup together on security tradeoffs before bringing this to the BTR and likely the security working group. We won't add a hard requirement of a new domain for assets before doing that feedback cycle.

That being said, right now I'm really trying to unblock some folks on libraries work, and then I'm taking time off to go back to my hometown for the first time since the pandemic. It's likely that I won't have the time to go through this process until next month.

Thank you for your feedback.

FYI @bradenmacdonald, @kdmccormick: ^

I'll put a writeup together on security tradeoffs before bringing this to the BTR and likely the security working group.

Thanks Dave, I'll keep an eye out for that.

We can't possibly expect Open edX users to register a second domain name to host their platforms.

This is a good point, and it makes me wonder if a sibling subdomain would be sufficient for security. For example, if a site operator today had:

example.org # lms

studio.example.org # cms

apps.example.org # mfes

could we move them to:

assets.example.org # assets

lms.example.org # lms

studio.example.org # cms

apps.example.org # mfes

example.org/* # caddy wildcard redirect to lms.example.org/*, just to preserve existing LMS URLs.

A sibling subdomain only works if you are extremely careful to never set cookies onto the root domain. If you look at tutor today, for example, many cookies like lms_sessionid get set onto the root .local.edly.io domain. While it may be possible to configure Open edX to work without root domain cookies, there is not really any secure way to enforce this "don't set cookies on the root domain" as a policy, so it remains a potential security issue. Plus it works both ways - if you use a subdomain for untrusted content, that untrusted content can set cookies onto your root domain. (Unless it's a public suffix like opencraft.hosting.)

It's much safer to use a completely unrelated domain. This is exactly why GitHub puts Pages on github.io rather than GitHub.com, btw (and many other sites do similar things).
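The root-domain cookie problem can be shown with a simplified version of the cookie domain-matching rule from RFC 6265. This sketch ignores IP addresses and the public suffix list, so it is an approximation of what browsers do, and the hostnames are just the examples from this thread.

```python
def domain_match(request_host: str, cookie_domain: str) -> bool:
    """Simplified RFC 6265 domain matching: exact match or a dot-separated suffix."""
    host = request_host.lower()
    domain = cookie_domain.lower().lstrip(".")  # a leading dot is ignored
    return host == domain or host.endswith("." + domain)


# A cookie set on the root domain is sent to every sibling subdomain,
# including the one hosting untrusted assets.
print(domain_match("assets.example.org", "example.org"))      # True
# A cookie scoped to the LMS host alone is not sent to the asset host.
print(domain_match("assets.example.org", "lms.example.org"))  # False
# An unrelated domain never receives either cookie.
print(domain_match("example-assets.net", "example.org"))      # False
```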

We can't possibly expect Open edX users to register a second domain name to host their platforms.

To be clear, registering a second domain name is usually not going to be necessary except for aesthetic reasons.

Almost all production deployments happen on the cloud and virtually all cloud providers provide free domain names on public suffixes, e.g. d111111abcdef8.cloudfront.net or x.s3.us-west-2.amazonaws.com (AWS), x.r2.dev or x.workers.dev (CloudFlare), etc. Any of these free domains will suffice. (Edit: of course these aren't actually "free" unless you're already paying for a CDN, block storage, edge workers, or a load balancer. But you probably want at least one of those anyways.)

regisb · 2024-02-28T05:01:29Z

docs/decisions/0015-serving-static-assets.rst

+  The further implication of this requirement is that *permissions checking must be extensible*. The openedx-learning repo will implement the details of how to serve an asset, but it will not have the necessary models and logic to determine whether it is allowed to.
+
+**Assets must be served from an entirely different domain than the LMS and Studio instances.**
+  To reduce our chance of maliciously uploaded JavaScript compromising LMS and Studio users, user-uploaded assets must live on an entirely different domain from LMS and Studio (i.e. not just another subdomain). So if our LMS is located at ``sandbox.openedx.org``, the files should be accessed at a URL like ``assets.sandbox.openedx.io``.


I fail to see how this security issue would be mitigated by hosting the javascript files on a different domain. It seems to me that the cookie access capability of custom javascript depends on the runtime context, not the hosting domain.

For instance, javascript loaded from abc.com, but executed on def.com, would be able to access the cookies from def.com, not abc.com. What am I missing?

Good point. Actually this is only required assuming that content is being run in an iframe, and the main source code file (HTML) for the iframe is being hosted among the static assets. It's true that as long as the iframe has a different origin than the LMS, it doesn't matter where the actual asset files are stored. I think we need to think this through a bit more. CC @ormsbee

FWIW, people can (and occasionally do) upload HTML files as static assets.

Right. I probably don't have the whole picture, but it seems to me that this is a very niche use case. It would be a shame if this specific scenario by itself forced us to set up another domain name.

Is this a use case that we want to preserve, or is it a new one? In its current form, isn't it already a security liability, as it allows course staff to run arbitrary scripts?

Is this a use case that we want to preserve, or is it a new one? In its current form, isn't it already a security liability, as it allows course staff to run arbitrary scripts?

It is a liability. One we're hoping to close here.

Whether these use cases should be preserved is a product call. I've seen HTML uploads used by folks who have some fancy JS simulation already built out from some other project, and they want to run it in their course. I've also seen it used by people who have some kind of syllabus from a Word export, where all the images are base64-encoded data urls embedded into the HTML itself, leading to a 20 MB file that would choke the HTMLBlock editor.

I'd be delighted to kick those cases to the curb, but I don't know what other vectors there are. PDFs run JavaScript for form validation these days. I think that's locked down fairly tight and wouldn't have access to cookies, but I'm not sure. A more serious vulnerability is that JS can be embedded into an SVG file. Browsers won't run it when you include SVGs using the <img> tag, but if you send someone a link to look at this cool image and they click it, their session could be compromised.

Now there are ways to sanitize it, set up proper content security policies, etc. But I guess my point is that I don't know what else is out there, or will be out there five years from now. Using a separate domain helps to remove this class of vulnerability, which is probably why places like MDN recommend storing it on a different domain:

Sandbox uploaded files. Store them on a different server and allow access to the file only through a different subdomain or even better through a completely different domain.

Sorry for the slow answer...

Now I understand the general recommendation about hosting uploaded assets on a different subdomain or domain. But that should be a recommendation, not a requirement. It's a whole different thing to say "we recommend you host your uploaded assets on a different domain, but if you understand the risks you can use a subdomain of your LMS, or even the same domain" -- and to actually support those use cases.

If we do end up going the route of having this be a recommendation instead of a hard requirement, I think we would still want the two-domain implementation to be the default from a "security by default" perspective. I don't mind people choosing not to do this, but I don't want the software to make it easy to do the "wrong thing."

regisb · 2024-02-29T03:47:28Z

docs/decisions/0015-serving-static-assets.rst

+~~~~~~~~~~~~~~~~~~~~~~~~
+
+**The asset server must be capable of handling high levels of traffic.**
+  Django views are a poor choice for streaming files at scale, especially when deploying using WSGI (as Open edX does), since it will tie down a worker process for the entire duration of the response. While a Django-based streaming response may be sufficient for small-to-medium traffic sites, we should allow for a more scalable solution that fully takes advantage of modern CDN capabilities.


As a side note, this is not the case for uwsgi, which can serve static assets without pausing workers. This is the primary reason why tutor uses uwsgi. But on the other hand uwsgi is no longer maintained, and thus we find ourselves shopping for a replacement, hoping that we can find one with the same feature set.

Sure. But even if uwsgi can send files efficiently when its file serving subsystem is invoked, a Django view is still going to tie down the worker process because it's Django doing the streaming in that case. I don't know the specifics of how you'd offload the file serving from the Django view to uwsgi (I'm sure it's possible), but at that point, it's functionally equivalent to using the X-Sendfile header and letting caddy do it, isn't it? In either case, the Django view is saying, "I can't send this efficiently, so I'm just going to send the headers for where to find it, and delegate the sending to something with better concurrency/performance than I have."
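The delegation described here can be sketched without any framework: the application decides whether the request is allowed and, if so, returns only headers that tell the reverse proxy which file to send. The `/internal-assets/` prefix and the function name are hypothetical, and whether the path needs percent-encoding depends on the proxy in use.

```python
from typing import Dict, Tuple
from urllib.parse import quote

# Hypothetical location that the reverse proxy maps to file storage and
# marks as internal-only, so clients cannot request it directly.
INTERNAL_PREFIX = "/internal-assets/"


def build_delegated_response(
    storage_path: str, content_type: str, allowed: bool
) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, body); the body stays empty when delegating."""
    if not allowed:
        return 403, {"Content-Type": "text/plain"}, b"Forbidden"
    headers = {
        "Content-Type": content_type,
        # The proxy sees this header, discards our empty body, and streams
        # the file itself, freeing the application worker immediately.
        "X-Accel-Redirect": INTERNAL_PREFIX + quote(storage_path),
    }
    return 200, headers, b""
```

The worker is only occupied for the permissions check and header generation, not for the duration of the download.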

This adds a number of API calls to the Authoring public API in order to support associating static assets with Components and allowing them to be downloaded.

Added:

* get_component_by_uuid
* get_component_version_by_uuid
* get_content_info_headers
* get_redirect_response_for_component_asset

Modified:

* create_next_component_version - made title optional
* create_component_version_content - annotation for learner_downloadable

Most of the justification for this approach can be found in the docstring for get_redirect_response_for_component_asset.

Note that there is currently no backend redis/memcached caching of the ComponentVersion and Content information. This means that every time a request is made, we will do a couple of database queries in order to do this lookup. I had actually done a version with such caching in an earlier iteration of this PR, but it added a lot of complexity for what I thought would be minimal gain, since going through the middleware will cause about 20 database queries anyway.

This implements the first part of the static assets ADR for Learning Core: openedx#110

Important notes:

* No view or auth is implemented. That is the responsibility of edx-platform. Learning Core only provides the storage and response generation.
* The responses generated will require the use of a reverse proxy that can handle the X-Accel-Redirect header. See ADR for details.


ormsbee requested review from bradenmacdonald, kdmccormick, connorhaugh and feanil October 31, 2023 17:21

bradenmacdonald reviewed Oct 31, 2023

connorhaugh reviewed Oct 31, 2023

bradenmacdonald reviewed Nov 6, 2023

ormsbee added 3 commits December 2, 2023 15:42

docs: ADR for serving static assets

714a7ac

fixup!: revise based on feedback/research

00caad7

fixup!: removed outdated info about requiring an object store

bc9d457

ormsbee force-pushed the assets-adr branch from a6f4931 to bc9d457 Compare December 2, 2023 20:56

bradenmacdonald approved these changes Dec 4, 2023

ormsbee merged commit 66f4fa2 into openedx:main Dec 4, 2023
7 checks passed

ormsbee deleted the assets-adr branch December 4, 2023 22:38

ormsbee mentioned this pull request Feb 2, 2024

[DEPR]: Blockstore openedx/public-engineering#238

Closed

regisb reviewed Feb 5, 2024

regisb reviewed Feb 28, 2024

regisb reviewed Feb 29, 2024
