RFC-1 Asset abstraction #3792
Comments
Housekeeping note: should Flysystem (https://github.com/thephpleague/flysystem) be included in References?
Looks like a nice RFC. Some of my thoughts:
If accepted and developed then this will make a lot of lives easier. 👍 |
@willmorgan we can't drop 5.3 support on a 3.x release |
Although I am all for asset abstraction, I am raising 2 issues pertaining to backward compatibility.
I'm being a stickler about this because it is going to be well-nigh impossible to explain to clients that we have to burn a lot of hours rewriting a lot of code and massaging their content because we want them to upgrade to SilverStripe 3.2 from 3.1. Keep in mind that these same clients will have recently had substantial expenses upgrading from 2.4. Also, the project has publicly committed to semantic versioning, so it is perplexing that the first RFC isn't in compliance. |
I have to agree with @oddnoc - I don't see any way in which this can be in a 3.x release - it's a major change to an API. I thought that we were adopting semantic versioning once the PDO release had come out, which I assumed would be a 3.2 release under the current schema. That would make this a 4.x release at best. I'm also worried about the continued persistence in maintaining PHP 5.3 support, and this is a perfect example - having to resolve a third party's non-support of 5.3 isn't a trivial matter (who knows what they think about it!)
My expectation is that as long as you use the simple APL with the local …

Regarding minimum PHP version, semver is why PHP needs to stay at 5.3. We …

Regarding the APL interface's flexibility: the interface itself is a public …

Regarding syncing files from a file source (remote or local): you could …

I've seen no use case for sharding filestores, and allowing it …

Hamish Friedlander | CTO
Thanks for feedback everyone :-)

The targeted Framework version seems to be the most controversial issue. I have the following proposition: let's delay that decision. It's hard to say at this point what backwards-incompatible changes would need to be introduced exactly anyway. I think we should simply leave our options open: if the dev team manages to solve all BC issues during implementation (which seems technically possible), then we'll submit it against 3.x in accordance with semver. If not -> 4.x. I'll rewrite the BC section to this effect. @oddnoc, @stevie-mayhew, @dhensby is the above approach acceptable?

Also, people have provided a couple of opinions, e.g. whether the tuple should be an object, or whether we should support multiple APLs. There is a fine line to tread there: it's like the tabs vs spaces discussion. My belief is that the best approach is to let the RFC writer's opinion prevail, because it's easiest - as long as that solution works. It doesn't need to be perfect, it just needs to work for enough people.

@dhensby, clarification regarding the "multiple APLs": you don't need to support parallel APLs to support multiple backends - see the "Amazon S3 and clustered sites" chapter. The APL is designed to be more like a singleton - it's "the" abstraction layer. You can put whatever you want underneath :-)

@willmorgan, re syncing files from a remote source: we will only get "Sync files & assets" to work in a backwards-compatible way with the Simple APL. Not sure if reflection should be built into the APL interface. What I'm saying is there is a limit to what we can architect up front - and I think we are near that limit.

@oddnoc, re the migration script: if you use the Simple APL, you won't need to migrate the data in the Content - both representations will happily live side by side. We don't rule out writing a migration script, but it's not necessary for BC.

@marijnkampf added, thanks!

edit: So the "proposal" now says:
The "backwards-incompatible changes" chapter has also been tweaked accordingly. edit2: Even better wording for the "proposal":
Yes that makes complete sense. I think that having a "target version" for an RFC is a good idea, but as you say it shouldn't preclude a version bump if the end result becomes overly complex or obtuse due to attempts to keep within that constraint.
I think that schema changes are okay for minor releases, as long as running dev/build takes care of everything for you, as that is understood as a standard silverstripe post-update action. I'd also expect that you don't need to introduce any new configuration, which would mean that, if no asset backend configuration is included, the system would assume that you're using a simple asset backend, pointing to the assets directory, and using a PHP 5.3-compatible non-Flysystem filestore - I'd probably recommend a class that deals only with local files and conforms to the Flysystem API.

Whether or not it's worth going to that effort to get this into 3.x is up to the opinion of the implementor. It's still "dangerous" and I think the implementation would need broad, rigorous scrutiny to see if it was actually viable for 3.x, and unless we got consensus that it was, it would go into 4.0. I'd like to see it in 3.x, but it's much more important that we honour our commitment to semantic versioning.
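A minimal sketch of that kind of local-only filestore, mirroring a small subset of the Flysystem API without depending on it (class and method coverage here are illustrative, not a complete adapter):

// Illustrative only: mirrors a subset of the Flysystem adapter API
// without depending on league/flysystem (and therefore without needing PHP 5.4).
class LocalAssetStore
{
    protected $root;

    public function __construct($root)
    {
        $this->root = rtrim($root, '/');
    }

    public function has($path)
    {
        return file_exists($this->root . '/' . $path);
    }

    public function read($path)
    {
        return file_get_contents($this->root . '/' . $path);
    }

    public function write($path, $contents)
    {
        $full = $this->root . '/' . $path;
        if (!is_dir(dirname($full))) {
            mkdir(dirname($full), 0777, true);
        }
        return file_put_contents($full, $contents) !== false;
    }
}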
The issue isn't so much the tuple object itself as the illusion of flexibility that it would provide. Changing the parameters that get passed to uniquely identify an asset will require that all callers and backends get updated. Under the current model, this will be obvious - method signatures will change for the callers, and the interface that backends implement will change, causing parse-time errors. If we have a tuple, things might still work on the surface, but bugs may appear - and since it's to do with asset persistence, these could be data-loss-causing bugs. A tuple object would make the method signatures a bit more DRY, so maybe there are ways of addressing my concern while still using tuples. Essentially, you need to ensure that the code chokes quickly until it's updated to support the new tuple format, which is a little counter-intuitive, as I'm essentially asking that we make upgrading more difficult. ;-)
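A minimal sketch of the contrast being described, with hypothetical names (neither interface is from the RFC):

// Hypothetical value object - not part of the RFC.
class FileTuple
{
    public $hash, $variant, $filename, $parentObject;
}

// Discrete parameters: adding or changing a tuple element changes the signature,
// so out-of-date backends fail loudly (fatal error) as soon as they are loaded.
interface DiscreteParamsAPL
{
    public function setFromString($data, $hash, $variant, $filename, $parentObject);
}

// Tuple object: the signature never changes, so out-of-date code keeps "working"
// and only misbehaves at runtime - which, for asset persistence, risks data loss.
interface TupleObjectAPL
{
    public function setFromString($data, FileTuple $tuple);
}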
So I had a draft written up on the mailing list thread that I never actually hit 'send' on, so will paste below. As mentioned in the RFC there's been some discussion on this previously, but we've actually built a working set of modules (and the s3 layer for it) that meets a significant portion of the requirements. It was built as a module with an eye on replacing the File/Folder implementation completely, but the scope of re-writing all that too would have pushed past the deadlines we had at the time. The RFC in general has quite a lot of similarities to work already done (file reference identifier, an APL-type service, underlying file operation abstraction via flysystem - which unfortunately wasn't around when I first started on my stuff!), so there's not much I can fault with the high-level design (except of course the version requirement - PHP 5.4 & SS 4.0 seem much more realistic!). The APL interface does seem a little awkward to me though, but I'll need to digest things a bit more before commenting further on that. Anyway, the following are some points based on the work already done
But if I've also got something uploaded separately to a path in that same S3 location (eg other/files/generated/by/other/application/file.jpg) I can simply reference it like
This kind of referencing is very handy for the following point
// in this case, the $service figures out what 'S3Bucket' means
$reader = $service->findReaderFor('S3Bucket:||other/files/generated/by/other/application/file.jpg');
$writer = new FilesystemWriter('some/path/locally.jpg');
$writer->write($reader);
S3Service:
  constructor:
    key: {my_key}
    secret: {secret}
    region: ap-southeast-2
S3ContentReader:
  type: prototype
  properties:
    s3service: %$S3Service
    bucket: public-content
    baseUrl: https://public-content.s3-ap-southeast-2.amazonaws.com
S3ContentWriter:
  type: prototype
  properties:
    s3service: %$S3Service
    bucket: public-content
    baseUrl: https://public-content.s3-ap-southeast-2.amazonaws.com
PrivateS3ContentReader:
  type: prototype
  properties:
    s3service: %$S3Service
    bucket: private-content
    baseUrl: https://private-content.s3-ap-southeast-2.amazonaws.com
PrivateS3ContentWriter:
  type: prototype
  properties:
    s3service: %$S3Service
    bucket: private-content
    baseUrl: https://private-content.s3-ap-southeast-2.amazonaws.com
ContentService:
  properties:
    stores:
      File:
        ContentReader: FileContentReader
        ContentWriter: FileContentWriter
      S3DevBucket:
        ContentReader: S3ContentReader
        ContentWriter: S3ContentWriter
      PrivateBucket:
        ContentReader: PrivateS3ContentReader
        ContentWriter: PrivateS3ContentWriter
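A hypothetical lookup against the stores configured above, using the identifier style shown earlier in this comment (getReader appears elsewhere in the thread; getWriter-by-identifier is an assumption):

// The prefix before ':||' selects the named store, the rest identifies the item in it.
$reader = $contentService->getReader('PrivateBucket:||reports/2014/annual.pdf');
$url    = $reader->getURL();

// Copy the same item into another configured store.
$writer = $contentService->getWriter('S3DevBucket:||reports/2014/annual.pdf');
$writer->write($reader);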
@mateusz Thanks for the followup. The approach of targeting 3.x if and only if backward compatibility is retained is completely acceptable. |
Wow, massive effort guys - and very well written! I largely agree with the proposal, just some curve balls to consider below.

Would you consider a third party DMS to be a viable APL backend? It would complicate the APL in terms of inferring a folder hierarchy based on externally stored data (rather than …)

Related to this: we're assuming that SilverStripe is the only consumer of a specific file storage service, right? Given that organisational data (hierarchies, tags) is only optionally stored in the local database …

Sharded stores: one use case could be different backup policies in an S3 store. One bucket might be used for small assets hooked up to CloudFront, another bucket for more important documents which are backed up to Glacier. While this could be solved through cascading backends, one of these will always carry the performance penalty of the failover check. It'll add significant API surface, so not sure it's worth the effort. Do you see a use for …

I'm aware that the APL interface is an MVP at the moment, so I'll hold off asking about stream support and such.
@nyeholt Thanks for posting, happy to hear you find the high-level design watertight :-) I must admit I'm struggling to confidently build a mental picture of your architecture just from the code so I can't appreciate your design in full. Maybe you would like to write up a short architecture document, i.e. a mini-RFC, for your proposition? Can you confirm my understanding though: your solution seems less abstract than this RFC in that it works only with the storage backends, and does not define any intermediate layer to decouple file manipulation and representation from the core (i.e. bottom-up approach). So we'd need to come up with an appropriate abstraction for the Framework to encapsulate all the backend operations anyway (and we would end up with something like the APL). This RFC comes from the opposite perspective: it works out the intermediate layer's interface, but does not define anything about the backends (i.e. top-down approach). It seems to me that you would be able to implement your proposition as a concrete APL, would you not? Now to the specific points raised.
Correct, it doesn't. In this RFC the storage is decoupled completely. You can build new data object structure based on the APL interface without using Folder & Files. APL doesn't care about the Folders either - it uses the ParentObject as a generalised reference to a container.
Yes, you use the "Variant" field of the tuple to store variants of a file. In other words, the APL is aware of the variants and may optimise for it.
This comes at a cost of having to know the storage details of files, and hence at a cost of coupling the storage representation to your code. Nevertheless no-one will stop people from bypassing the APL and interacting directly with the backends, perhaps with some nominal support from the APL (i.e. getBackendIdentifier() to fetch the internal identifier).
This can be done within the concrete APL, we don't limit what functionality the APL provides, but it needs to be encapsulated within.
As above, this is up to the APL implementation. It seems better not to force the solution to natively use "secureBackend" and "insecureBackend", because some APLs might be able to support security within a single backend.
Again, this couples the storage representation to the Framework. Thanks again for raising concerns :-) |
@chillu Welcome to the RFC and thanks for helping to bullet-proof the solution ;-)
Maybe I'm missing something, but I don't see any problem with writing a DMS-supporting APL. The APL interface is "directory-agnostic" (see "Handling directories" chapter). As long as you can describe your file objects via the tuple, you can build the APL however you want. For example the Simple APL will infer the directory tree from the ParentObjects, but an S3 APL would perhaps ignore the directory tree and work only with Hashes. For readonly, how about throwing Exceptions from the write operations?
Kind of. You are right in assuming the DB is the canonical store for the file metadata. However, since the APL specifics are undefined, you are free to store this metadata in your backend and support other clients in this way. This means you will need to handle the synchronisation issues (which would land you with similar problems as with the current asset system, but it would at least be encapsulated within the APL this time). On the practical side, you should be able to derive all the information you need from the tuple (specifically, the ParentObject).
I think this can still be implemented within the APL. You can for example derive the shard ID from the Filename. The only question that remains in my mind is whether we need to have some hinting mechanism to tell the APL which backend we prefer (i.e. is there enough data in the tuple for the APL to derive the appropriate backend, or do we need some more information).
Yes, my assumption is … The corollary is you can build your own Files & Images section where there are no folders, or folders are in reverse order :-) The APL doesn't care.
My knowledge is vague here. I assume this is for serving video streams? If yes, then that should work fine because, again, APL is abstract, and the URLs don't need to point back to SilverStripe at all. Say you link a video resource into your CMS Content. A shortcode will be injected. Then when a user requests the page, APL::getAsURL gets called based on the tuple from the shortcode. APL verifies that this tuple is a streamable resource (i.e. by looking at the extension in Filename, or inspecting ParentObject), and is free to return a direct URL to a box with streaming capability. Uff. I hope that answers your questions? :-) Keep asking if you have anything more. |
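As a rough sketch of that flow (class name, base class and URL scheme are assumptions; only getAsURL and the tuple come from this RFC):

// Hypothetical custom APL - illustrative only.
class StreamingAwareAPL extends SimpleAPL
{
    public function getAsURL($hash, $variant, $filename, $parentObj)
    {
        // The tuple is enough to decide that this resource lives on the streaming box,
        // so the returned URL doesn't need to point back at SilverStripe at all.
        if (preg_match('/\.(mp4|m3u8|webm)$/i', $filename)) {
            return 'https://stream.example.com/' . $hash . '/' . rawurlencode($filename);
        }
        return parent::getAsURL($hash, $variant, $filename, $parentObj);
    }
}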
@chillu One thing this solution doesn't do at all is deal with the case where some other file store wants to be the truth about what files exist and have SilverStripe be a client of that truth. So a DMS could be used as the backend store, but there's no support for having SilverStripe interrogate that store to find out what files exist. In fact, it makes that slightly harder, because it removes the single "all files live here" property that Files & Images used to have. |
Yep, it's much simpler than you're possibly guessing - it'll be worth putting together even if just to highlight the kinds of things I'd like to be able to do... some of which I'll put inline below :). I think the main bit that makes it confusing is that the cdn-content module is working within the framework's extension points, rather than being cleanly defined APIs - and that there's a lot of stuff in there to translate between raw file i/o operations and the API abstractions.
The Framework still makes use of File objects for assets, but effectively the only way between the framework and the backend is the file reference, eg

$file = File::get()->byID(2);
$reader = $file->getReader();
$url = $reader->getURL();

or, based on a FileContent db field on a data object:

class MyObject extends DataObject {
    private static $db = array(
        'MyFile' => 'FileContent',
    );
    public function FileLink() {
        // getReader() returns the read wrapper for the stored content
        $reader = $this->MyFile->getReader();
        return $reader->getURL();
    }
}

or, just from an identifier (which is what's under the cover of the above two):

$identifier = 'NamedSource:||awef/seg35w/file.jpg';
$reader = $contentService->getReader($identifier);
$url = $reader->getURL();

So from a framework level, the separation is abstracted at the ContentService point; at that level, you're dealing just with string identifiers going in, and receiving read/write wrappers back (implementers of underlying file I/O functionality). To my understanding, where the RFC has the APL as the abstraction point, in the content-services module that is explicitly split across 3 separate abstractions: the ContentService deals with finding an appropriate backend for a given reference and is responsible for handling things like sharding (and the implementation could be swapped to handle cascading lookups), whereas the Reader/Writer pair are responsible for exposing the actual operations that can be performed on a file. From my read of things, the equivalent to the ContentService type of structure is a hierarchical APL type of structure, where the …

Part of my motivation for having the explicitly separate interfaces is so that user-land code can operate on content items with a consistent API (see further below).
Yep, that would be as far as I'd expect things to be; the main thing I'd be hoping for is that I could load into a File asset using a consistent API without having to translate between raw operations, eg

// passed in via config/form or something
$remoteItemId = 'RemoteStore:||known/path.jpg';
$remoteReader = $contentService->getReader($remoteItemId);
$file = File::get()->byID(2);
$file->getWriter()->write($remoteReader);

But it's not just paths, the same API should hold for raw content:

$file = File::get()->byID(2);
$file->getWriter()->write(new RawContentReader("some kind of string here"));
$file->getWriter()->write(new StreamContentReader(fopen('http://hax.ru/totally_legit.php', 'r')));
And to cover Hamish's point about remote content stores making content available, it'd be really nice if the API would provide this capability, if for no other reason than importing from that remote store. I've got the following in place in the content-services module, though it's not completely thought through just yet.

// the remote folder ID is a known structure in the remote store, passed in via config/form or something
$remoteFolderId = 'RemoteStore:||known/path';
$remoteFolder = $contentService->getReader($remoteFolderId);
foreach ($remoteFolder->getList() as $remoteReader) {
    $file = File::create();
    $file->Title = basename($remoteReader->getId());
    $file->getWriter()->write($remoteReader);
    $file->write();
}
True, my point is that I'd like to see interfaces defined in such a way that this is mandated so that user-land code has a consistent understanding of how to copy a file from one location to another (as in the examples above).
Agreed, I guess my point is more that the 'problem' of secure files should be a framework handled thing, rather than a module necessarily.
Not really - by specific backend store, what I mean is a label that points at something configured via DI, which could actually be an instance of Anything, and interpreted in any way that Anything chooses. So while …

Your comment in response to Ingo's …

Anyway - will try and put together a diagram and the specific interfaces separately and link them a bit later today.
Please take my comments with a grain of salt, as I think the proposal as written would suit most use cases however I do have a few things to note. First, this is a very complex proposal, that ends up actually not solving much more than, for example, micmania1's pull request. Complexity is added to support versioning and workflow, but then they are out of the scope of this RFC. I also wonder whether these would be better solved in other ways? Versioned files, for example, are possible to retrofit (in a way) within the current framework - e.g. by introducing an intermediary data object that represents the versioned file, and just storing a new file for every version. I don't really see why swapping to a strange mashed up "File Reference" field is a positive step - it's just screaming out to be an object in its own right, stored in its own table. Solving the versioning and workflow on DataObjects in general would then be extendable to solve versioning of files. Unless I'm just not getting it? On the 5.4 issue with Flysystem, why even bother looking at 5.3 compatibility? It's just going to be a waste of time, and it seems like the fact that Flysystem requires 5.4 should make the decision easy to go with SS4.x and bump version requirements. |
And the separate architecture diagram / interface description at https://github.com/silverstripe-australia/silverstripe-content-services/wiki/Architecture You'll notice that the diagram is very familiar... :) |
@nyeholt nice diagram, I've seen it somewhere ;-)

I think the biggest issue is that you are ignoring the versioning. Native versioning in the Framework is an important design goal for us, a strategic choice if you will. To make files first-class data types, they must support this natively as well. Otherwise files might as well remain DataObjects, and we could use something akin to @micmania1's solution.

Your file identifier assumes that the store+path+filename describes the file uniquely - but that's only true in an operating system context. The Framework also needs to handle the time component - a more appropriate mental model here is a git repo, not a filesystem. Effectively, you are making the Framework bow to the requirements of a classical unix file system, instead of trying to make the FS serve the actual needs of the Framework.

So I'm pretty much arguing that the tuple stored in the DBField needs to contain enough information to identify a file uniquely, and that "your" file identifier hasn't got enough granularity - it lacks support for versioning and variants. @hafriedlander was initially arguing that the hash by itself is a good enough tuple, but that was struck down because of the lack of context for constructing the URLs.

So to summarise the proposed solutions so far:
*) you are right in saying that having a ParentObject reference is a hidden way of specifying the details of the store, which you accomplish by introducing a textual label.

If we can agree that the tuple presented in this RFC is necessary for our goals, then the only major difference that remains between your solution and this RFC (which can be easily seen by comparing the two diagrams) is how context is handled. You prefer to use injection via a string identifier (i.e. your label) and to have a well-defined interface for readers and writers. We are more vague and just point back at the ParentObject, saying - figure it out on your own. Both approaches require people to decorate DataObjects, so it feels like this difference is just a matter of opinion :-)

So what remains is: do we agree on the tuple contents, and if we do not, then how do you propose the versioning is handled? If we can't reconcile our views, then the only remaining option is arbitration of the competing proposals by the core-committers.

P.S. I don't know about everyone else out there, but this discussion really helps me to understand what problems are out there and how they can or cannot be solved. I hope it's as useful to you too!
Not ignoring versioning at all; the point of having the FileContent field stored as a string is to allow for this. I think you're assuming that the file identifier string is being constructed based on a file system representation, which is not the case at all. A ContentWriter is responsible for creating a unique identifier for a given file (or make use of the default abstract implementation). The only requirement being that the identifier can be used to look the asset up later. I think the examples having / chars have given a false impression :) I'll add a couple examples later tonight that show how a file actually gets mapped from a few use cases; uploading a file, creating a variation, and storing a theme in a CDN. Which is another use case I had forgotten about until now 😂 |
@spronkey thanks for commenting - I'm pretty much going to link other comments in here, because we already spoke about these things.

Like I wrote in the response to @nyeholt, versioning support is a critical feature of the Framework. Retrofitting versioning into the current model is what we do at the moment (see the silverstripe-versionedfiles module), and it is rather ugly, because we end up hacking the API by packing the version number into the Filename. What we are trying to do is to solve the root issue here, by making files first-class data types.

Regarding how the tuple is represented, see Sam's argument (search upwards for "The issue isn't so much the tuple object itself as the illusion of flexibility that it would provide.").

Aiming at 3.x just makes the implementation harder - and it's us doing the work, so why are you worried? :-) The community will have the final say in judging if the solution is backwards compatible before merging.

Regarding Flysystem, I don't think we should get stuck on seeing it as the one and only solution - in all proposed solutions there is an added abstraction layer on top. We should definitely be thinking in broader terms :-)
@nyeholt I see, yes I did assume that "your" file id was well defined. Would definitely be great to see how you solve all requirements listed in the motivation at the top. |
Oh yeah sorry @nyeholt - one more thing :-) with the RFC process, the author drives the development. If you make a proposition, it's assumed you have the resources to implement it. In other words, we don't need to all agree on the specifics of the solution. If this RFC is approved by the core-committers as "a solution", then we are free to implement it that way - because it's us doing the work. This is especially true if we only differ in opinions, not the substance :-)
@mateusz My point re: the DBField thing - I'm trying to understand why FileReferenceField is a glorified string, and not a DataObject in its own right. You say this proposed "first class data type" is solving the root issue - but I see it as a dirty hack and a step backwards that's bound to cause headaches in the future.

DBField instances should represent single values - FileReferenceField is storing multiple values. Some RDBMSs have denormalised type support (e.g. Postgres and its JSON type), but SS's ORM supports databases that don't (e.g. MySQL), which tells me that FileReferenceField really should be a table of its own, and thus in SS represented as an object in its own right, with hash, variant, filename and parent object as its fields.

Someone above mentioned the case where systems other than SilverStripe will be accessing the files - I think this is worth further consideration. With the FileReferenceField, there's a big impediment to using raw SQL to access the files and the hierarchy information stored in the database - one has to pull apart a string. Using a DataObject for this removes this impediment completely.

Re: versioned files - if FileReferenceField was an object in its own right, versioning could be taken care of in the same way that SiteTree versioning is. The only catch is that some additional handling would have to take place when files are stored, to ensure that versions remain distinct.

Sam's comments around the tuple object were referring to using a tuple as the params to the APL API, not the storage of file information in a denormalised DBField tuple. I agree with Sam's comments re: not using a tuple for the APL API, and using distinct params as proposed. If the API changes there should be a clear notice to the API consumer.
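A rough sketch of the alternative being described here - the reference stored as its own table/DataObject rather than a packed DBField (class and field names are illustrative, not part of the RFC):

// Illustrative only - one row per stored file reference.
class FileRecord extends DataObject
{
    private static $db = array(
        'Hash'     => 'Varchar(64)',
        'Variant'  => 'Varchar(255)',
        'Filename' => 'Varchar(255)',
    );

    // The "ParentObject" part of the tuple becomes a relation
    // (shown here as a polymorphic has_one; a ParentClass/ParentID pair would also work).
    private static $has_one = array(
        'Parent' => 'DataObject',
    );
}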
@spronkey http://api.silverstripe.org/3.1/class-CompositeField.html - each element of the tuple would be its own column in the table, but there is no reason we need to, or should, have a single field represent a single column.
Sorry, I meant http://api.silverstripe.org/3.0/class-CompositeDBField.html |
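For illustration, a minimal sketch of that idea - one logical FileReference field backed by several real columns (names are assumptions; a complete version would implement the full CompositeDBField contract):

// Illustrative only.
class FileReferenceField extends DBField
{
    // Each element of the tuple becomes its own column on the owning table,
    // e.g. MyFileHash, MyFileVariant, MyFileFilename.
    public function compositeDatabaseFields()
    {
        return array(
            'Hash'     => 'Varchar(64)',
            'Variant'  => 'Varchar(255)',
            'Filename' => 'Varchar(255)',
        );
    }

    public function requireField()
    {
        // A real implementation would register the columns above with the schema here.
    }
}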
Ended up getting tied up all weekend, so will respond in a bit more detail from work. Regarding
My main reason for going through this thoroughly is because I have an existing implementation that has been built with the same goals in mind, and want to make sure I understand how the proposed API will work under the use cases I've implemented or have in mind - I'm quite happy to dump a bunch of code for a better and more widely maintained alternative :). A couple of thoughts on the APL interface
Other than versioning and private files: is there anything Flysystem + relative database file paths do not handle? Trying to understand the needs/priorities for the added complexity. :) |
@assertchris by definition, any asset persistency system that needs/wants to use hashes :-) But versioning is important - it's a basic premise of this RFC. If that premise is challenged, a new RFC needs to be written :-) |
@mateusz Out of my depth, in here. Moving along... :) |
So besides arguing about which semantic version this goes into or what the low-level API is, no-one has been screaming blue murder?
No murder. Even better: the core-committers approved this, so the games may begin. I mean implementation. I hope we can start implementing this soon! Yay! 🎆 |
One query from the implementation side of things... is it envisaged that there will be a set of explicit |
No updates for a bit, is this underway or on deck soon? I think it would increase SS adoption significantly since horizontal scaling is quite important and tricky without this. |
@clyonsEIS I've clarified the milestones and added the pretty multicolour labels, so anyone coming from UserVoice can quickly see the status. https://silverstripe.uservoice.com/forums/251266-new-features/suggestions/6572660-asset-abstraction |
I've created a PR for an initial implementation of this feature (not including updating of the File dataobject, or things like file variants).
Second pull request is at #4652 |
@tractorcow I think this has now been completed? |
Yes this is done! |
Note to readers: the body of this RFC contains the current proposal, fully updated based on feedback. It's not necessary to read the comments; however, they might aid your understanding and contain answers to some of your questions.
RFC-1 Asset abstraction
Purpose and outcome
This RFC proposes a change to the SilverStripe Framework to add flexibility to the asset subsystem.
Community is welcome to give feedback. The proposal will be submitted to the core-committers for approval. We expect it will evolve further during development, and it will eventually form a basis for an architecture document.
Motivation
The driver for this change is solving the following problems we keep encountering in practice: changing the root path for asset storage, storing assets in Amazon S3 (including on clustered sites), versioned files, private files, and files with workflow (see the "Rationale" chapter for details).
We currently cannot accomplish this due to the monolithic approach to file handling in the Framework. This capability is currently expressed as a set of specialised DataObjects: File, Folder and Image, which implicitly tie the database representation to the local filesystem, preventing any flexibility.

Additionally, we self-impose the following requirements which we see as important:
Proposal
To solve these problems we propose to change the treatment of files from models (DataObjects) to model properties (DBFields).

Specifically, we would do so by introducing a new concept of a generalised file reference - a set of values (i.e. a tuple) uniquely pointing to a file regardless of the storage backend in use.

Then we would change the model representation of the files: instead of using the coarse-grained data objects such as File, we would move to using the $db fields. For this a new DBField would be introduced, called FileReference.

The File and Folder would retain their APIs as much as possible, but be retrofitted to use the FileReference. They would now conceptually become an inseparable part of the Files & Images subsystem of the CMS. UploadField would be updated to be able to work directly with the FileReferences, additionally to the existing has_one support.

Finally, we would remove all direct filesystem operations from the Framework and place them behind the new Asset Persistence Layer (APL) interface. This RFC does not define how an APL should work under the hood. Neither does it define the architecture nor the interfaces of the storage backends. An APL is tied to the data stored by that APL, and changing the APL will cause data loss without a migration script.
This RFC is targeted at the next major release.
File reference tuple
File reference is a set of values (i.e. a tuple) describing a file uniquely, and independently of a storage backend. It must be generalised to allow different kinds of backends. We envisage the following values to be included in this tuple: Hash, Variant, Filename and ParentObject (a File reference or another DataObject).

The available values are not extensible. Backends may use any or all values of the tuple.
The tuple itself is not represented in the codebase directly as a class or an object, but manifests in the
FileReference
as stored fields and in the discrete APL operation parameters and return values.FileReference
FileReference
is a mechanism for expressing the file reference in the code and for storing it in the database. It would most likely be implemented as a packed or compositeDBField
, stored in table columns. The ParentObject does not need to be stored because it can be derived from the containing data object.Additionally, the
FileReference
acts as a convenient proxy object to the APL interface, since it has the ability to introspect all the necessary values (APL may still be used directly, albeit more verbosely). By necessity theFileReference
s API it would follow the APL interface as laid out in "Appendix A: APL interface".FileReference
will deal internally with some of the plumbing required by the APL, such as passing the ParentObject reference, or storing the returned, possibly updated, tuple values (see the "Asset Persistence Layer (APL)" chapter for code samples).FileReference can be attached to any DataObject - not only to
File
s.If you want specific values in the tuple, you will need to initialise them before storing any data. As an example, you might want to give the APL a hint that you require a specific Filename.
See the "Appendix B: FileReference class mockup" for the proposed layout of that class.
Asset Persistence Layer (APL)
All existing direct file operations in the Framework would be rewritten to use the APL, either indirectly through the
FileReference
, or directly. A concrete APL would be configured once per site using a dependency injection.APL requires all File Reference tuple values to be passed to all its operations.
Expanding on the previous example:
See the "Appendix A: APL interface" for the proposed interface to the APL.
Using parameters is at the APL discretion - it would be perfectly legal for an implementation to ignore some of the values as long as it can ensure uniqueness.
Additionally, APL setters may modify the tuple values passed to it to ensure consistency. Callers have to inspect the return value and make sure to update their understanding of the file accordingly. This would for example be used in the situation of filename conflicts. Also see the "Conflict resolution" chapter below.
A special case of a tuple rewrite is when a caller passes "null" as one of the tuple values. Caller can then expect the APL implementation to generate the value. This would be used when creating new files to obtain a hash.
Internally, APL's storage format should be treated as proprietary to that APL. APL's are interchangeable in terms of the API, but not the data already stored - a data migration script would be necessary for that.
See "Appendix C: Trivial APL method" for some contextual pseudocode.
Storage backends
This RFC does not define how a concrete APL would work under the hood. The architecture and the interfaces of the storage backends is left to the implementor.
That said, we expect the Flysystem will be useful for this.
Simple APL
A default, mostly-backwards-compatible implementation of the APL would be built as part of this RFC. With this APL, the apparent filesystem layout on disk would remain as it currently is, to allow easier migration of existing environments.
The Hash value in the tuple would be ignored by this backend and most likely Flysystem will be used as backend.
Although the Simple APL would support different backends through the Flysystem, because of the problems described in the "Asynchronous APL API" and the "Performance" chapters, we wouldn't recommend using the S3 backend here.
Other considerations
Handling directories
A directory (a "folder") is a concept that is not reflected in the APL interface. APL operations can only handle file references.
Under the hood however, a concrete APL may use the tuple values provided in the method calls to reconstruct the directories (specifically, the ParentObject might be used to recover the structure). It is entirely up to the implementation to handle this.
On the other end of the spectrum, the Files & Images' notion of "folders" will remain expressed as
Folder
data objects, but theirFileReference
would be set to null (again, that's because the APL does not handle folders explicitly).Conflict resolution
A conflict may occur if an attempt is made to write data to a tuple which already exists. The APL caller can pass their resolution preference though the
$conflictResolution
parameter.The APL may support the following resolution approaches:
Derived files
The
Image
class provides file processing services. Files derived in this way are not represented in the database and are currently written to disk to an arbitrary location by theGDBackend
.To keep this RFC focused on the file abstraction, we propose to do the minimum adjustment needed to remove the direct filesystem operations while avoiding refactoring of the processing subsystem.
The
FileReference
would allow us to obtain a derived file reference (which is not stored in the database).Note the APL should not change tuple values for derived files because we have no place to store them on the consuming side.
Images in content
These will need to be rewritten to use shortcodes. Direct file references by path are no longer valid.
Rationale
How does this solve the mentioned problems
Changing root path for asset storage
With all filesystem operations abstracted away, the default APL can include a configuration parameter for setting the filesystem root path.
Amazon S3 and clustered sites
The access pattern used by an APL can be non-trivial. An example of a more sophisticated approach to persistency is a load-balanced site using S3 as a primary storage, and the filesystem as a secondary storage. Such a site would have the following problems:
Here is a sequence diagram illustrating a cascading operation of such an APL. The backends are internal to the APL and invisible to the userspace code.
Versioned files
The APL does not deal with file versions, but it has the potential to store many distinct files with the same name thanks to the presence of the Hash in the file reference tuple.
The
FileReference
serialises to the database as strings, so it may versioned just as any other database field. This paves the way to usingVersioned
extension directly onFile
.Private files
Once we switch to the File References, using
File
objects to manage files would no longer be obligatory. It would now be possible to bypassFile
and "attach" files to arbitraryDataObject
s using the File ReferenceDBField
directly.This would mean opting out of the usual functionality - the file would no longer appear in the "Files & Images" section. Custom methods to manipulate the assets would need to be built, or the
UploadField
could be used to directly interface with theFileReference
s.Files with workflow
Essentially a mix of the file versioning and private files approach could be used to accomplish the workflow. Since files are now treated as model properties, we have a freedom to create new models and interfaces. We can also easily move files between the models by copying the tuple - e.g. from the "Workflow" model into the "Files & Images".
Alternative proposals
Common objections
Deferred & rejected features
Deferred: Rewriting derived files subsystem into a separate API
Deferred to keep this RFC focused. See "Derived files" chapter.
Rejected: Versioned APL
We have decided the APL is not the right layer to implement file versioning and that it should be implemented on a higher level. This RFC will make it possible to accomplish both versioning folders and files.
See "How does this solve the problems" chapter.
Rejected: Tuple as an object
This would be too complex and ultimately make the conceptual model too convoluted.
The drawback of not doing this is that in this proposal the
FileReference
must act both as the tuple representation in code (with all tuple manipulation methods), and as a database field representation.This violates the single-responsibility principle and leads to edge-case problems such as the one presented in the "Derived files" chapter, where a crippled
FileReference
is produced that does not represent any database entity (which is counter-intuitive, because it's aDBField
after all).Rejected: Extending the tuple with more values
We think the tuple is such a fundamental representation of the file resource that it should be fixed (just as a
Varchar
representation is fixed). Tuple was designed such that it should contain enough specificity to allow for all use cases listed in "Motivation" chapter.Allowing to extend it with custom values would cause module compatibility problems, and we want the APLs to be interchangeable.
Rejected: Leaving derived files subsystem as it is (with direct local filesystem operations)
This wouldn't work for clustered sites. File derivation isn't always triggered on demand, which means some derived file will only be available on some local filesystem. Passing the derived files into the APL makes sure the files are always treated consistently.
For the same reason this won't work for deployments where there is no local persistent storage available (e.g. AWS deployment utilising only S3).
Rejected: Asynchronous APL API
Writes to a backend might take a long time, and it's unclear what a "rename" or "exception" conflict resolution should do if we execute a subsequent write to the same resource reference without waiting for completion. In this case we could end up losing data.
To solve this we could introduce an asynchronous API for the APL. A complete tuple corresponds to a single write operation, so we could possibly add a "getWriteStatus()" method to the APL. It could then return "in progress", "success" or "failure". A UI could poll that until the result was success; if it returned failure then the rename/replace/retry logic could be put in place.
However this can also be solved by using an APL that supports content hashes (there wouldn't ever be any conflicts in such an APL), so we decided that it's not worth introducing the additional complexity related to the asynchronous API.
It's worth noting the Flysystem developers discarded asynchronous operation in their API.
Impact
Backwards-incompatible changes
If using the default Simple APL, we are aiming at minimising the backwards-incompatible changes when compared to "3.x" Framework branch. There will certainly be a DB schema change to
File
.For the Simple APL with non-default configuration, for other APLs, and for future major Framework versions the following changes are expected:
File
,Image
andFolder
(e.g. removal of Filename field).ASSETS_DIR
andASSETS_PATH
removed.Security
The significant impact is that until the "secureassets" module is rewritten to support the APL and to work with the tuples there won't be any easy way to support secure files.
From the user perspective, there won't be any changes in how
File
objects handle security. AssetAdmin, GridField and Subsites should not be impacted.The real potential for improvement lies in custom APLs which will be able to secure files in ways different from the
.htaccess
approach. TheFileReference
won't have any inherent notion of security and it will be up to the implementors of the APL to derive the security properties from the file reference tuple (especially the ParentObject).For example, APL, while performing a write operation, could distinguish between secure (proxied) and insecure (directly accessible) assets by inspecting the ParentObject. It'd then store the resource in an appropriate backend. It would be up to the "secure" backend to ensure the authorisation is done upon a request, e.g. by passing the request through a Controller.
Performance
The Simple APL will write to the local filesystem, so we expect the performance not to be different than before the change. The APL method calls will be synchronous.
However a special consideration needs to be given when writing APLs with long-running operations. The APL layer does not inherently support asynchronicity: writing data asynchronously leaves the backend in an unspecified state and the API does not make any provisions for introspecting this state. Instead we recommend using content hashes in your APL (see also the "Asynchronous APL API" chapter) - this allows the APL to avoid tuple clashes and unexpected overwrites of content.
Another thing to keep in mind for custom APLs is that the presence of long-running synchronous operations will impact the UI considerably. One solution here could be to execute some part of the write synchronously (i.e. to the local filesystem), but delegate the long-running writes to a queue.
Scalability
This change is a prerequisite for better scalability. Shifting the assets off the local machine is required for any kind of clustered or scalable environment. That said, the work included in this RFC will not affect the scalability directly - the Simple APL will predominantly be targeted towards the local filesystem.
Maintenance & operational
Backups
With Simple APL, the backups will be done as before: a database dump and the filesystem snapshot are enough to restore all data.
However backups may not be trivial when other APLs are used. People performing maintenance will need to know how to find out which APL is in use, and act accordingly.
Migration
With the Simple APL the filesystem representation will remain in place. However on the database level the migration from older sites will not be a straightforward copy. A migration script will be provided to convert the old File, Folder and Image data into the format used by the
FileReferences
.The same situation will occur when an APL is exchanged on a project with existing data - storage format is proprietary to the APL, so the data needs to be transformed for the new APL to understand it.
Internationalisation & localisation
We expect minimal or no modifications to the UI, so there should be no impact to i18n.
For l10n, the situation will not change either: with the Simple APL the filesystem representation will remain as is. The database representation will change, but it will still hold the filenames. Other parts of the tuple do not introduce any l10n considerations.
References
Appendix A: APL interface
Here is our proposition for the initial base APL interface. Note that we expect this interface to undergo some degree of evolution. We might discover other generic methods that can be added, and we also need to decide how to provide for the delete, copy, move and "check if exists" operations.
The tuple needs to be included in all calls and is always $hash, $variant, $filename, $parentObj. Any values in the tuple may be set to null to indicate the caller is expecting the APL implementation to generate the values. The possibly modified tuple is also returned from the setters.

See the "Proposal" chapter for more information on the tuple.
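As a rough sketch only (not the proposed interface itself) - getAsURL, the tuple order and the $conflictResolution parameter come from this RFC, while the remaining method names are assumptions:

// Illustrative shape of the interface described above.
interface AssetPersistenceLayer
{
    // Getters: the full tuple identifies the resource.
    public function getAsString($hash, $variant, $filename, $parentObj);
    public function getAsURL($hash, $variant, $filename, $parentObj);

    // Setters return the (possibly rewritten) tuple; null values are generated by the APL.
    public function setFromString(
        $data, $hash, $variant, $filename, $parentObj, $conflictResolution = null
    );
    public function setFromLocalFile(
        $path, $hash, $variant, $filename, $parentObj, $conflictResolution = null
    );
}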
Appendix B: FileReference class mockup
Here is our initial proposition for the base FileReference interface. As with the APL interface, note that we expect this interface to undergo some degree of evolution.

See the "Proposal" chapter for more information on the FileReference.

Appendix C: Trivial APL method
Example pseudocode for a trivial APL method to illustrate several concepts from this RFC. Note this is not a code from any actual APL - Simple APL will use Flysystem and will be more complex.
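A sketch of the sort of thing this describes, assuming a trivial APL that writes under ASSETS_PATH, keys purely on Filename/Variant and ignores the Hash, as the Simple APL is described to do (all names and paths are illustrative):

// Illustrative pseudocode only - not from any actual APL.
public function setFromString($data, $hash, $variant, $filename, $parentObj, $conflictResolution = null)
{
    // Derive the on-disk location from the tuple; this trivial APL ignores Hash and ParentObject.
    $relativePath = $variant ? "_variants/{$variant}/{$filename}" : $filename;

    if (file_exists(ASSETS_PATH . '/' . $relativePath)) {
        if ($conflictResolution === 'exception') {
            throw new Exception("Tuple already has data: {$relativePath}");
        }
        if ($conflictResolution === 'rename') {
            // Rewrite the Filename part of the tuple to avoid the clash.
            $filename = time() . '-' . $filename;
            $relativePath = $variant ? "_variants/{$variant}/{$filename}" : $filename;
        }
        // Any other value means "replace": just overwrite below.
    }

    $fullPath = ASSETS_PATH . '/' . $relativePath;
    if (!is_dir(dirname($fullPath))) {
        mkdir(dirname($fullPath), 0777, true);
    }
    file_put_contents($fullPath, $data);

    // Return the possibly updated tuple so the caller can store it.
    return array($hash, $variant, $filename, $parentObj);
}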