Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC-1 Asset abstraction #3792

Closed
mateusz opened this issue Jan 14, 2015 · 65 comments
Closed

RFC-1 Asset abstraction #3792

mateusz opened this issue Jan 14, 2015 · 65 comments

Comments

@mateusz
Copy link
Contributor

mateusz commented Jan 14, 2015

Note to readers: the body of this RFC contains the current proposal, fully updated based on feedback. It's not necessary to read the comments, however they might aid your understanding and contain answers to some of your questions.

RFC-1 Asset abstraction

Purpose and outcome

This RFC proposes a change to the SilverStripe Framework to add flexibility to the asset subsystem.

Community is welcome to give feedback. The proposal will be submitted to the core-committers for approval. We expect it will evolve further during development, and it will eventually form a basis for an architecture document.

Motivation

The driver for this change is solving the following problems we keep encountering in practice:

  • changing root path for asset storage
  • support Amazon S3 as a backend
  • support backends for sites residing on multiple servers (clustered sites)
  • have files versioned
  • have files with workflow
  • be able to model private files that do not appear in the Files & Images section

We currently cannot accomplish this due to the monolithic approach to the file handling in the Framework. This capability is currently expressed as a set of specialised DataObjects: File, Folder and Image which implicitly tie the database representation to the local filesystem, preventing any flexibility.

Additionaly, we self-impose the following requirements which we see as important:

  • ability to preserve the current assets UI
  • ability to connect into the existing asset storage (i.e. a degree of backwards compatibility)

Proposal

To solve these problems we propose to:

  1. Create a backend-independent way of referring to files in the Framework.
  2. "Degrade" files from entities in their own right (DataObjects) to model properties (DBFields).
  3. Hide all file operations behind an interface.

Specifically, we would do so by introducing a new concept of a generalised file reference - a set of values (i.e. a tuple) uniquely pointing to a file regardless of the storage backend in use.

Then we would change the model representation of the files: instead of using the coarse-grained data objects such as File we would move to using the $db fields. For this a new DBField would be introduced, called FileReference.

The File and Folder would retain their APIs as much as possible, but be retrofitted to use the FileReference. They would now conceptually become an inseparable part of the Files & Images subsystem of the CMS.

UploadField would be updated to be able to work directly with the FileReferences, additonally to the existing has_one support.

Finally, we would remove all direct filesystem operations from the Framework and place them behind the new Asset Persistence Layer (APL) interface. This RFC does not define how an APL should work under the hood. Neither does it define the architecture nor interfaces of the storage backends. APL is tied to data stored by that APL, and changing the APL will cause data loss without a migration script.

This RFC is targeted at the next major release.

File reference tuple

File reference is a set of values (i.e. a tuple) describing a file uniquely, and independently of a storage backend. It must be generalised to allow different kinds of backends. We envisage the following values to be included in this tuple:

Hash Variant Filename ParentObject
Example "d0be2dc..." "Resized640x480" "logo.jpg" A File reference
Description sha1 of the base content Variant of the base content. "null" for the base file The name of the file The specific object that contains the given tuple in its FileReference
Rationale For managing content conflicts For supporting derived files For reconstructing the URL given a file reference and for managing naming conflicts For reconstructing the URL and the directory structure

The available values are not extensible. Backends may use any or all values of the tuple.

The tuple itself is not represented in the codebase directly as a class or an object, but manifests in the FileReference as stored fields and in the discrete APL operation parameters and return values.

FileReference

FileReference is a mechanism for expressing the file reference in the code and for storing it in the database. It would most likely be implemented as a packed or composite DBField, stored in table columns. The ParentObject does not need to be stored because it can be derived from the containing data object.

Additionally, the FileReference acts as a convenient proxy object to the APL interface, since it has the ability to introspect all the necessary values (APL may still be used directly, albeit more verbosely). By necessity the FileReferences API it would follow the APL interface as laid out in "Appendix A: APL interface".

FileReference will deal internally with some of the plumbing required by the APL, such as passing the ParentObject reference, or storing the returned, possibly updated, tuple values (see the "Asset Persistence Layer (APL)" chapter for code samples).

$file = File::get()->byID(1);

// The following becomes deprecated, we no longer are allowed direct access:
// file_get_contents($file->Filename)

// Instead we need to operate through the FileReference:
$fileReference = $file->obj('FileReference');
$fileReference->getAsString();

FileReference can be attached to any DataObject - not only to Files.

class MyDocument extends DataObject {
    private static $db = array(
        'AttachedFile' => 'FileReference'
    );
    ...
}

If you want specific values in the tuple, you will need to initialise them before storing any data. As an example, you might want to give the APL a hint that you require a specific Filename.

$doc = new MyDocument();
// If using CompositeDBField, one could set the sub-fields directly.
$doc->AttachedFile = array(null, null, 'document.txt');
$doc->obj('AttachedFile')->setFromString('James James Morrison Morrison');

See the "Appendix B: FileReference class mockup" for the proposed layout of that class.

Asset Persistence Layer (APL)

All existing direct file operations in the Framework would be rewritten to use the APL, either indirectly through the FileReference, or directly. A concrete APL would be configured once per site using a dependency injection.

APL requires all File Reference tuple values to be passed to all its operations.

Expanding on the previous example:

// Under the hood of the FileReference
function getAsString() {
    // Obtain the APL reference, most likely via DI.
    $apl = Injector::inst()->get('AssetPersistenceLayer');
    // Obtain the parent object reference.
    $parentObj = DataList::create($this->getParentClass())->byID($this->getParent());

    // Pass the tuple values as discrete parameters.
    return $apl->getAsString($this->Hash, $this->Variant, $this->Filename, $parentObj);
}

See the "Appendix A: APL interface" for the proposed interface to the APL.

Using parameters is at the APL discretion - it would be perfectly legal for an implementation to ignore some of the values as long as it can ensure uniqueness.

Additionally, APL setters may modify the tuple values passed to it to ensure consistency. Callers have to inspect the return value and make sure to update their understanding of the file accordingly. This would for example be used in the situation of filename conflicts. Also see the "Conflict resolution" chapter below.

A special case of a tuple rewrite is when a caller passes "null" as one of the tuple values. Caller can then expect the APL implementation to generate the value. This would be used when creating new files to obtain a hash.

Internally, APL's storage format should be treated as proprietary to that APL. APL's are interchangeable in terms of the API, but not the data already stored - a data migration script would be necessary for that.

See "Appendix C: Trivial APL method" for some contextual pseudocode.

Storage backends

This RFC does not define how a concrete APL would work under the hood. The architecture and the interfaces of the storage backends is left to the implementor.

That said, we expect the Flysystem will be useful for this.

Simple APL

A default, mostly-backwards-compatible implementation of the APL would be built as part of this RFC. With this APL, it would be possible for the apparent filesystem on disk would remain as it currently is to allow easier migration of the existing environments.

The Hash value in the tuple would be ignored by this backend and most likely Flysystem will be used as backend.

Although the Simple APL would support different backends through the Flysystem, because of the problems described in the "Asynchronous APL API" and the "Performance" chapters, we wouldn't recommend using the S3 backend here.

Other considerations

Handling directories

A directory (a "folder") is a concept that is not reflected in the APL interface. APL operations can only handle file references.

Under the hood however, a concrete APL may use the tuple values provided in the method calls to reconstruct the directories (specifically, the ParentObject might be used to recover the structure). It is entirely up to the implementation to handle this.

On the other end of the spectrum, the Files & Images' notion of "folders" will remain expressed as Folder data objects, but their FileReference would be set to null (again, that's because the APL does not handle folders explicitly).

Conflict resolution

A conflict may occur if an attempt is made to write data to a tuple which already exists. The APL caller can pass their resolution preference though the $conflictResolution parameter.

The APL may support the following resolution approaches:

  • "exception" - throw an APLConflict exception.
  • "overwrite" - overwrite the data at the tuple
  • "rename" - allow the APL to change the tuple values to non-conflicting and write the data to a new location. APL will return the modified tuple.

Derived files

The Image class provides file processing services. Files derived in this way are not represented in the database and are currently written to disk to an arbitrary location by the GDBackend.

To keep this RFC focused on the file abstraction, we propose to do the minimum adjustment needed to remove the direct filesystem operations while avoiding refactoring of the processing subsystem.

The FileReference would allow us to obtain a derived file reference (which is not stored in the database).

// Obtain the derived FileReference from an existing one.
$derivedReference = $fileReference->derive("Resized640x480");

// Generate if needed.
if (some_way_of_checking_if_we_need_to_generate($derivedReference)) {
    $derivedContent = existing_derivation_function($fileReference->getAsString());
    $derivedReference->setFromString($derivedContent);
}

// Work with the new reference as normal.
$derivedReference->getAsURL();

Note the APL should not change tuple values for derived files because we have no place to store them on the consuming side.

Images in content

These will need to be rewritten to use shortcodes. Direct file references by path are no longer valid.

Rationale

How does this solve the mentioned problems

Changing root path for asset storage

With all filesystem operations abstracted away, the default APL can include a configuration parameter for setting the filesystem root path.

Amazon S3 and clustered sites

The access pattern used by an APL can be non-trivial. An example of a more sophisticated approach to persistency is a load-balanced site using S3 as a primary storage, and the filesystem as a secondary storage. Such a site would have the following problems:

  • S3 is eventually consistent and so the resource might not yet be there when we ask for it, especially if it has just been written.
  • the asset might have been uploaded to another machine in the cluster, so the file might not yet be synced back to the local filesystem.

Here is a sequence diagram illustrating a cascading operation of such an APL. The backends are internal to the APL and invisible to the userspace code.

Versioned files

The APL does not deal with file versions, but it has the potential to store many distinct files with the same name thanks to the presence of the Hash in the file reference tuple.

The FileReference serialises to the database as strings, so it may versioned just as any other database field. This paves the way to using Versioned extension directly on File.

Private files

Once we switch to the File References, using File objects to manage files would no longer be obligatory. It would now be possible to bypass File and "attach" files to arbitrary DataObjects using the File Reference DBField directly.

private static $db = array(
    'AttachedFile' => 'FileReference'
);

This would mean opting out of the usual functionality - the file would no longer appear in the "Files & Images" section. Custom methods to manipulate the assets would need to be built, or the UploadField could be used to directly interface with the FileReferences.

Files with workflow

Essentially a mix of the file versioning and private files approach could be used to accomplish the workflow. Since files are now treated as model properties, we have a freedom to create new models and interfaces. We can also easily move files between the models by copying the tuple - e.g. from the "Workflow" model into the "Files & Images".

Alternative proposals

  • micmania1's pull request refactors the Framework so that the filesystem operations are abstracted away, but the file references are not generalised (i.e. the backend still utilises the notion of path).
  • Marcus's proposal is similar to this RFC in that the filesystem operations are abstracted away, and a notion of a file reference is introduced (called a pointer - "ContentId"). What differs is this solution does not involve Framework modifications and is delivered as modules.

Common objections

  • Targeted version was seen as 4.x. The common argument was that with a change of this size maintaining backwards-compatibility with the 3.x branch according to semver would be distracting and hard. We have changed this RFC to target 4.x.
  • Versioning of files was not universally seen as an important driver for this refactoring. It was thought that just decoupling the filesystem operations would be a good-enough solution (micmania1's solution would work in this case). This RFC is of opposite opinion: a potential for file versioning is a strategic feature of the Framework.
  • File reference composition varied between people. Some seen it as just the hash. Some preferred additional metadata. This RFC opinion is to settle on a minimum that allows us to deliver the goals, while still allowing for rich URLs and sharding.
  • File refenrece representation was variously seen as an object containing all tuple values, or just a string. This RFC's opinion is to use discrete values for a less obscure API.

Deferred & rejected features

Deferred: Rewriting derived files subsystem into a separate API

Deferred to keep this RFC focused. See "Derived files" chapter.

Rejected: Versioned APL

We have decided the APL is not the right layer to implement file versioning and that it should be implemented on a higher level. This RFC will make it possible to accomplish both versioning folders and files.

See "How does this solve the problems" chapter.

Rejected: Tuple as an object

This would be too complex and ultimately make the conceptual model too convoluted.

The drawback of not doing this is that in this proposal the FileReference must act both as the tuple representation in code (with all tuple manipulation methods), and as a database field representation.

This violates the single-responsibility principle and leads to edge-case problems such as the one presented in the "Derived files" chapter, where a crippled FileReference is produced that does not represent any database entity (which is counter-intuitive, because it's a DBField after all).

Rejected: Extending the tuple with more values

We think the tuple is such a fundamental representation of the file resource that it should be fixed (just as a Varchar representation is fixed). Tuple was designed such that it should contain enough specificity to allow for all use cases listed in "Motivation" chapter.

Allowing to extend it with custom values would cause module compatibility problems, and we want the APLs to be interchangeable.

Rejected: Leaving derived files subsystem as it is (with direct local filesystem operations)

This wouldn't work for clustered sites. File derivation isn't always triggered on demand, which means some derived file will only be available on some local filesystem. Passing the derived files into the APL makes sure the files are always treated consistently.

For the same reason this won't work for deployments where there is no local persistent storage available (e.g. AWS deployment utilising only S3).

Rejected: Asynchronous APL API

Writes to a backend might take long time, it's unclear what a "rename" or "exception" conflict resolution should do if we execute subsequent write to the same resource reference without waiting for completion. In this case we could end up losing data.

To solve this we could introduce an asynchronous API for the APL. A complete tuple corresponds to a single write operation, so we could possibly add a "getWriteStatus()" method to the APL. It could then return "in progress", "success", "failure". a UI could poll that until the result was success; if it returned failure then the rename/replace/retry logic could be put in place.

However this can also be solved by using an APL that supports content hashes (there woudln't ever be any conflicts in such an APL) so we decided that it's not worth introducing the additional complexity related to the asynchronous API.

It's worth noting the Flysystem developers discarded asynchronous operation in their API.

Impact

Backwards-incompatible changes

If using the default Simple APL, we are aiming at minimising the backwards-incompatible changes when compared to "3.x" Framework branch. There will certainly be a DB schema change to File.

For the Simple APL with non-default configuration, for other APLs, and for future major Framework versions the following changes are expected:

  • API changes for File, Image and Folder (e.g. removal of Filename field).
  • Direct filesystem access no longer permitted (will break with non-default APLs).
  • ASSETS_DIR and ASSETS_PATH removed.
  • Shortcodes for Content imagery introduced - non-shortcoded images will cease to work with non-default APLs.
  • possibly more: to be seen during the development.

Security

The significant impact is that until the "secureassets" module is rewritten to support the APL and to work with the tuples there won't be any easy way to support secure files.

From the user perspective, there won't be any changes in how File objects handle security. AssetAdmin, GridField and Subsites should not be impacted.

The real potential for improvement lies in custom APLs which will be able to secure files in ways different from the .htaccess approach. The FileReference won't have any inherent notion of security and it will be up to the implementors of the APL to derive the security properties from the file reference tuple (especially the ParentObject).

For example, APL, while performing a write operation, could distinguish between secure (proxied) and insecure (directly accessible) assets by inspecting the ParentObject. It'd then store the resource in an appropriate backend. It would be up to the "secure" backend to ensure the authorisation is done upon a request, e.g. by passing the request through a Controller.

Performance

The Simple APL will write to the local filesystem, so we expect the performance not to be different than before the change. The APL method calls will be synchronous.

However a special consideration needs to be given when writing APLs with long-running operations. The APL layer does not inherently support asynchronicity: writing data asynchronously leaves the backend in an unspecified state and the API does not make any provisions for introspecting this state. Instead we recommend using content hashes in your APL (see also the "Asynchronous APL API" chapter) - this allows the APL to avoid tuple clashes and unexpected overwrites of content.

Another thing to keep in mind for custom APLs is that the presence of long-running synchronous operations will impact the UI considerably. One solution here could be to execute some part of the write synchronously (i.e. to the local filesystem), but delegate the long-running writes to a queue.

Scalability

This change is a prerequisite for better scalability. Shifting the assets off the local machine is required for any kind of clustered or scalable environment. That said, the work included in this RFC will not affect the scalability directly - the Simple APL will predominantly be targeted towards the local filesystem.

Maintenance & operational

Backups

With Simple APL, the backups will be done as before: a database dump and the filesystem snapshot are enough to restore all data.

However backups may not be trivial when other APLs are used. People performing maintenance will need to know how to find out which APL is in use, and act accordingly.

Migration

With Simple APL the filesystem representation will remain in place. However on the database level the migration from older sites will not be a straighforward copy. A migration script will be provided to convert the old File, Folder and Image data into the format used by the FileReferences.

The same situation will occur when an APL is exchanged on a project with existing data - storage format is proprietary to the APL, so the data needs to be transformed for the new APL to understand it.

Internationalisation & localisation

We expect minimal or no modifications to the UI, so there should be no impact to i18n.

For l10n, situation will not change either: with Simple APL the filesystem representation will remain as is. The database representation will change, but it will still hold the filenames. Other parts of the tuple do not introduce any l10n considerations.

References

  1. UserVoice: Files should be stored in DataObject fields
  2. UserVoice: Filesystem Abstraction
  3. GitHub: micmania1's pull request
  4. dev-list: Storing files in dataobject fields thread
  5. dev-list: Historical ASSET_BASE_PATH thread
  6. dev-list: Historical Silverstripe (Assets) vs. Scaling in the Cloud thread
  7. Flysystem

Appendix A: APL interface

Here is our proposition for the initial base APL interface. Note that we expect this interface to undergo some degree of evolution. We might discover other generic methods that can ba added, and we also need to decide how to provide for the delete, copy, move and "check if exists" operations.

The tuple needs to be included in all calls and is always $hash, $variant, $filename, $parentObj. Any values in the tuple may be set to null to indicate the caller is expecting the APL implementation to generate the values. The possibly modified tuple is also returned from the setters.

See the "Proposal" chapter for more information on the tuple.

interface AssetPersistenceLayer {

    /**
     * Write the data directly.
     *
     * The tuple may be modified by this method - it's the caller's responsibility
     * to update their data structures based on the return value. However
     * the APL should not change tuple values for derived files - i.e. if the Variant
     * value of the tuple is not null.
     *
     * @param $hash string|null Tuple part 1
     * @param $variant string|null Tuple part 2  
     * @param $filename string|null Tuple part 3
     * @param $parentObj Object|null Tuple part 4
     * @param $data Binary data to set on the file reference.
     * @param $conflictResolution string Conflict resolution hint (exception|overwrite|rename)
     *
     * @return array($hash, $variant, $filename, $parentObj) The (possibly modified) file reference tuple.
     */
    function setFromString($hash, $variant, $filename, $parentObj, $data, $conflictResolution);

    /**
     * Possible variant #1: it might be faster for the implementation to operate directly from disk.
     */
    function setFromLocalFile(..., $path, ...);

    /**
     * Possible variant #2: for large objects it would be useful to provide for direct stream handling.
     */
    function setFromStream(..., $stream, ...);

    /** 
     * @param $hash string|null Tuple part 1
     * @param $variant string|null Tuple part 2  
     * @param $filename string|null Tuple part 3
     * @param $parentObj Object|null Tuple part 4
     *
     * @return string Data from the file.
     */
    function getAsString($hash, $variant, $filename, $parentObj);

    /**
     * Possible variant #1: for large objects it might be useful to provide for direct stream access.
     */
    function getAsStream(...);

    /**
     * @param $hash string|null Tuple part 1
     * @param $variant string|null Tuple part 2  
     * @param $filename string|null Tuple part 3
     * @param $parentObj Object|null Tuple part 4
     *
     * @return string URL for the data blob that can be fetched directly by the user.
     */
    function getAsURL($hash, $variant, $filename, $parentObj);
}

Appendix B: FileReference class mockup

Here is our initial proposition for the base FileReference interface. As with the APL interface, note that we expect this interface to undergo some degree of evolution.

See the "Proposal" chapter for more information on the FileReference.

class FileReference implement CompositeDBField {

    /**
     * Write the data directly to the tuple represented by this FileReference.
     *
     * @param $data Binary data to set on the file reference.
     * @param $conflictResolution string Conflict resolution hint (exception|overwrite|rename)
     */
    function setFromString($data, $conflictResolution);

    /**
     * It might be faster for the implementation to operate directly on the file,
     * so the consuming code may choose to use this instead.
     *
     * @param $path Absolute path to the file
     * @param $conflictResolution string Conflict resolution hint (exception|overwrite|rename)
     */
    function setFromLocalFile($path, $conflictResolution);

    /** 
     * Get the data from the file resource represented by this FileReference.
     * @return string Data from the file.
     */
    function getAsString();

    /**
     * Get the URL for the file resource represented by this FileReference. 
     *
     * @return string URL for the data blob that can be fetched directly by the user.
     */
    function getAsURL();

    // ... other methods implemented as required by the CompositeDBField interface.
}

Appendix C: Trivial APL method

Example pseudocode for a trivial APL method to illustrate several concepts from this RFC. Note this is not a code from any actual APL - Simple APL will use Flysystem and will be more complex.

function setFromString(
    $hash,
    $variant,
    $filename,
    $parentObj,
    $data,
    $conflictResolution
) {

    // This backend does not support hashes, make sure it's null.
    $hash = null;

    // It is technically legal to pass an empty tuple value and expect the APL
    // to generate it.
    if (!$filename) $filename = generate_random_name();

    // Find out the directory from the context.
    if ($parentObj instanceof File) {
        $path = $parentObj->getParent()->getRelativePath() . '/';
    } else {
        $path = 'homeless/';
    }

    // Configurable root directory for the backend.
    $path = $this->config()->root_directory . '/' . $path;

    // Do not permit tuple changes for derived files.
    if ($variant && $conflictResolution!='overwrite') throw new Exception(...);

    // Variable conflict handling.
    if (file_exists($path . $filename)) {
        switch($conflictResolution) {
            case 'exception': 
                throw new APLConflict(...);
                break;
            case 'rename':
                $filename = $this->resolveFilenameConflict($filename);
                break;
        }
    }

    // Finally - write the data.
    file_put_contents($path . $filename, $data);

    // Return the new - possibly updated - tuple.
    return array($hash, $variant, $filename, $parentObj);
}
@mateusz mateusz changed the title RFC-1 Filesystem independence RFC-1 Filesystem abstraction Jan 14, 2015
@mateusz mateusz changed the title RFC-1 Filesystem abstraction RFC-1 Asset abstraction Jan 14, 2015
@marijnkampf
Copy link
Contributor

Housekeeping note, should Flysystem https://github.com/thephpleague/flysystem be included in References?

@willmorgan
Copy link
Contributor

Looks like a nice RFC. Some of my thoughts:

  • I agree that Folders can be safely discarded - code-wise, I don't think we should be relying on asset folder structure beyond ensuring that all assets sit under a certain base path / URL which can be set by the developer. Their existence in the current assets functionality in the CMS is a real pain, so yeah - I initially disagreed but upon reflection, that part is 👍
  • One thing that might not have been thought about is syncing existing files from a remote source.
  • Using Flysystem ❤️
  • After reading the discussion in the greater PHP community around PHP version upgrades, I think a bump to 5.4 (or even 5.5) would be fine. In fact I would encourage us to drop 5.3 support in the next 3.x anyway.

If accepted and developed then this will make a lot of lives easier. 👍

@dhensby
Copy link
Contributor

dhensby commented Jan 14, 2015

  1. I believe we need to be able to run multiple APLs on one site. I foresee that developers may want to store some files on a local filesystem and some remotely
  2. 3.x seems an unrealistic release milestone if we are identifying API breakages at this early stage
  3. I think the parameters defiend in the APL interface are too rigid and don't leave us future flexibility without irritating API changes. The obvious way to solve it is using a tuple object (which also makes this solution more customisable) but I see this option was rejected. I think this should be reconsidered.

@willmorgan we can't drop 5.3 support on a 3.x release

@oddnoc
Copy link
Contributor

oddnoc commented Jan 14, 2015

Although I am all for asset abstraction, I am raising 2 issues pertaining to backward compatibility.

  1. This RFC violates semantic versioning by proposing backward-incompatible changes on the 3.x branch. There needs to be a 100% compatibility layer, or this RFC must be deferred to 4.x.
  2. A migration script will need to be provided to convert old filesystem paths in content to shortcodes.

I'm being a stickler about this because it is going to be well-nigh impossible to explain to clients that we have to burn a lot of hours rewriting a lot of code and massaging their content because we want them to upgrade to SilverStripe 3.2 from 3.1. Keep in mind that these same clients will have recently had substantial expenses upgrading from 2.4. Also, the project has publicly committed to semantic versioning, so it is perplexing that the first RFC isn't in compliance.

@stevie-mayhew
Copy link
Contributor

I have to agree with @oddnoc - I don't see any way which this can be in a 3.x release - its a major change to an API. I thought that we were adopting semantic versioning once the PDO release has come out, which I assumed would be a 3.2 release under the current schema. That would make this a 4.x release at best.

I'm also worried about the continued persistence in maintaining PHP 5.3 support and this is a perfect example - having to resolve a third party's non-support of 5.3 isn't a trivial matter (who knows what they think about it!)

@hafriedlander
Copy link
Contributor

My expectation is that as long as you use the simple APL with the local
filesystem, the system would be backwards compatible with one exception:
there will be a schema change. Upgrading to the new schema would be
automate-able. I'm not sure if that schema change is enough to trigger
requiring a new major version. If you use anything except the simple APL,
or anything except the local filesystem then you would definitely need to
change code too, but I don't see this as breaking backwards compatibility.
If it turned out we couldn't maintain backwards compatibility with the
simple APL and local filesystem, I agree this would need to go into 4.0,
but I don't see we need to pre-emptively decide that.

Regarding minimum PHP version, semver is why PHP needs to stay at 5.3. We
can't change the minimum version for 3.2, because that would be a
backwards-incompatible change. If this goes into 4.0 instead, and we decide
that the minimum PHP version should go up for 4.0 (which I think it should,
and it's likely it will, but that should be a deliberate decision, not be
decided indirectly by an RFC), then we would adjust the RFC's expectations.

Regarding the APL interface's flexibility: the interface itself is a public
API. Having an old APL implementation need upgrading when the API changes
is a feature, not a bug. If we make it flexible, then how do we manage
compatibility across multiple APLs? How do we flag when a new version of
the framework's expectations of an APL have changed?

Regarding syncing files from a file source (remote or local): you could
write a script to do that if you liked, but it's outside the scope of this
RFC & the framework's responsibilities. Depending on the APL, it might not
be possible to re-construct the database from the filestore. For instance,
in the hypothetical versioned store, where everything is a hash, there are
no directories. After this change, the reference is only guaranteed one way
(database -> filestore) and is lossy.

I've seen no use case for sharding filestores, and allowing it
significantly complicates the API, but if there's specific use cases you
can think of that would make it desirable I'd be keen to hear them. Note
that as designed it does allow files to be stored both locally and
remotely, just not for some files to be one way and some to be another.

Hamish Friedlander | CTO
SilverStripe

@mateusz
Copy link
Contributor Author

mateusz commented Jan 14, 2015

Thanks for feedback everyone :-) The targeted Framework version seems to be the most controversial issue.

I have the following proposition: let's delay that decision. It's hard to say at this point what backwards-incompatible changes would need to be introduced exactly anyway. I think we should simply leave our options open: if the dev team manages to solve all BC issues during implementation (which seems technically possible), then we'll submit it against 3.x in accordance with semver. If not -> 4.x. I'll rewrite the BC section to this effect.

@oddnoc, @stevie-mayhew, @dhensby is the above approach acceptable?

Also, people have provided a couple of opinions i.e. whether the tuple should be an object, or whether we should support mutliple APLs. There is a fine line to tread there: it's like the tabs vs spaces discussion. My belief is that the best approach is to let the RFC writer's opinion prevail, because it's easiest - as long as that solution works. It doesn't need to be perfect, it just needs to work for enough people.

@dhensby, clarification regarding the "multiple APLs": you don't need to support parallel APLs to support multiple backends - see the "Amazon S3 and clustered sites" chapter. APL is designed to be more like a singleton - it's "the" abstraction layer. You can put whatever you want underneath :-)

@willmorgan, re syncing files from remote source, we will only get "Sync files & assets" work in a backwards-compatible way with the Simple API. Not sure if reflection should be built into the APL interface. What I'm saying is there is a limit of what we can architect up front - and I think we are near that limit.

@oddnoc re migration script, if you use Simple APL, you won't need to migrate the data in the Content - both representations will happily live side by side. We don't rule out writing a migration script, but it's not necessary for BC.

@marijnkampf added, thanks!

edit: So the "proposal" now says:

We don't commit to any target Framework versions for this feature. However we would like to submit this against "3.x" if we can resolve all breaking changes by introducing a Simple APL.

The "backwards-incompatible changes" chapter has also been tweaked accordingly.

edit2: Even better wording for the "proposal":

We would endeavour to write a backwards-compatible solution that could be included in a minor release of 3.x. Once the solution is implemented we would open it up for review and get consensus that it was backwards compatible. If not, then the change would be incorporated into 4.x instead.

@stevie-mayhew
Copy link
Contributor

Yes that makes complete sense. I think that having a "target version" for a RFC is a good idea, but as you say it shouldn't preclude a version bump if the end result becomes overly complex or obtuse due to attempts to keep within that constraint.

@sminnee
Copy link
Member

sminnee commented Jan 14, 2015

I think that schema changes are okay for minor releases, as long as running dev/build takes care of everything for you, as that is understood as a standard silverstripe post-update action.

I'd also expect that you don't need to introduce any new configuration, which would mean that, if no asset backend configuration is included, the system would assume that you're using a simple asset backend, pointing to the assets directory, and using a PHP 5.3-compatible non-Flysystem filestore—i'd probably recommend a class that deals only with local files and conforms to Flysystem API. Whether or not it's worth going to that effort to get this into 3.x is up to the opinion of the implementor.

It's still "dangerous" and I think the implementation would need broad, rigorous scrutiny to see if it was actually viable for 3.x, and unless we got consensus that it was, it would go into 4.0.

I'd like to see it in 3.x, but it's much more important that we honour our commitment to semantic versioning.

@sminnee
Copy link
Member

sminnee commented Jan 14, 2015

I think the parameters defiend in the APL interface are too rigid and don't leave us future flexibility without irritating API changes. The obvious way to solve it is using a tuple object (which also makes this solution more customisable) but I see this option was rejected. I think this should be reconsidered.

The issue isn't so much the tuple object itself as the illusion of flexibility that it would provide. Changing the parameters that get passed to uniquely identify an assets will require that all callers and backends get updated. Under the current model, this will be obvious—method signatures will change for the callers, and the interface that backends implement will change, causing parse-time errors. If we have a tuple, things might still work on the surface, but there may bugs may appear—and since it's do to with asset persistence, these could be data-loss causing bugs.

A tuple object would make the method signatures a bit more DRY, so maybe there are ways of addressing my concern while still using tuples. Essentially, you need to ensure that the code chokes quickly until it's updated to support the new tuple format, which is a little counter-intuitive, as I'm essentially asking that we make upgrading more difficult. ;-)

@nyeholt
Copy link
Contributor

nyeholt commented Jan 15, 2015

So I had a draft written up on the mailing list thread that I never actually hit 'send' on, so will paste below. As mentioned in the RFC there's been some discussion on this previously, but we've actually built a working set of modules (and the s3 layer for it) that meets a significant portion of the requirements. It was built as a module with an eye on replacing the File/Folder implementation completely, but the scope of re-writing all that too would have pushed past the deadlines we had at the time.

The RFC in general has quite a lot of similarities to work already done (file reference identifier, an APL type service, underlying file operation abstraction (via flysystem, which unfortunately wasn't around when I first started on my stuff!) so there's not much I can fault with the high-level design (except of course version requirement - PHP5.4 & SS4.0 seem much more realistic!). The APL interface does seem a little awkward to me though, but I'll need to digest things a bit more before commenting further on that.

Anyway, the following are some points based on the work already done

  • The Model level representation (Folder / File objects) shouldn't dictate the file system representation, so that changing the title of a Folder data object shouldn't trigger a cascading change to any of the blobs on disk (or remote storage). Some file backends may not support a 'simple' way to rename folders in a file hierarchy, while it may be way too expensive in terms of number of operations for other backends - and a failure halfway through may not be easily recovered.
  • File operations (eg resampling, doc conversions etc) need to either a) have a simple way to copy the item locally, then store back in the CDN, in a way that's simple to look up again or b) have a backend that supports those file operations on your behalf. a) is much simpler to achieve. I followed path a) and so far things work as expected, though there are edge cases that keep cropping up as to the ordering of operations depending on how a variation is referenced by the CMS (particular during the replacement of a file). For speed, these variations are also tracked via the DB which makes for much speedier lookups of "does this exist in a backend".
  • It's very handy to be able to make use of existing remote objects via the same API, even if it wasn't generated by SS (I believe Flysystem will have this covered). For example, a file pointer in my module is something like

S3Bucket:||g4a/g4af3204etc/filename.jpg

But if I've also got something uploaded separately to a path in that same S3 location (eg other/files/generated/by/other/application/file.jpg) I can simply reference it like

S3Bucket:||other/files/generated/by/other/application/file.jpg

This kind of referencing is very handy for the following point

  • It should also be simple to copy between items, regardless of file backend (I believe Flysystem will have this covered too)
// in this case, the $sservice figures out what S3Bucket means)
$reader = $service->findReaderFor('S3Bucket:||other/files/generated/by/other/application/file.jpg'); 

$writer = new FilesystemWriter('some/path/locally.jpg');
$writer->write($reader);
  • It would be really good if 'secure' file management was a native part of things; even if this means the creation of a 'private' asset store and a 'public' asset store, and copying between the two as permissions change. I'd really like to see a 'defaultPublicStore' and 'defaultPrivateStore' for assets.
  • I encode the specific backend store that an item is located in the filepointer; for example, S3Bucket:||g4a/g4af3204etc/filename.jpg points at an reader/writer pair that is configured like the below yml. In this case, I've got two separate S3 buckets configured for content storage; from the CMS I can configure a folder to have content stored in either of them if desired. It'd be really nice to have that flexibility (I think someone's already mentioned this)
  S3Service:
    constructor:
      key: {my_key}
      secret: {secret}
      region: ap-southeast-2
  S3ContentReader:
    type: prototype
    properties:
      s3service: %$S3Service
      bucket: public-content
      baseUrl: https://public-content.s3-ap-southeast-2.amazonaws.com
  S3ContentWriter:
    type: prototype
    properties:
      s3service: %$S3Service
      bucket: public-content
      baseUrl: https://public-content.s3-ap-southeast-2.amazonaws.com
  PrivateS3ContentReader:
    type: prototype
    properties:
      s3service: %$S3Service
      bucket: private-content
      baseUrl: https://private-content.s3-ap-southeast-2.amazonaws.com
  PrivateS3ContentWriter:
    type: prototype
    properties:
      s3service: %$S3Service
      bucket: private-content
      baseUrl: https://private-content.s3-ap-southeast-2.amazonaws.com
  ContentService:
    properties:
      stores:
        File:
          ContentReader: FileContentReader
          ContentWriter: FileContentWriter
        S3DevBucket:
          ContentReader: S3ContentReader
          ContentWriter: S3ContentWriter
        PrivateBucket:
          ContentReader: PrivateS3ContentReader
          ContentWriter: PrivateS3ContentWriter

@oddnoc
Copy link
Contributor

oddnoc commented Jan 15, 2015

@mateusz Thanks for the followup. The approach of targeting 3.x if and only if backward compatibility is retained is completely acceptable.

@chillu
Copy link
Member

chillu commented Jan 15, 2015

Wow, massive effort guys - and very well written! I largely agree with the proposal, just some curve balls to consider below.

Would you consider a third party DMS to be a viable APL backend? It would complicate the APL in terms of inferring a folder hierarchy based on externally stored data (rather than Folder), and require the notion of a readonly filesystem. I assume that its out of scope and should rather be supported by a more generic file picker for the CMS, but want to make sure we've got our bases covered.

Related to this: We're assuming that SilverStripe is the only consumer of a specific file storage service, right? Given that organisational data (hierarchies, tags) is only optionally stored in the local database File, Folder), external data essentially becomes anonymous hashed blobs. Makes it harder for other systems to access the same external data in a meaningful way, they'll have to go through some SilverStripe web service.

Sharded stores: One use case could be different backup policies in an S3 store. One bucket might be used for small assets hooked up to CloudFront, another bucket for more important documents which are backed up to Glacier. While this could be solved through cascading backends, one of these will always carry the performance penalty of the failover check. It'll add significant API surface, so not sure its worth the effort.

Do you see a use for ParentObject outside of the built-in Folder structure? Nothing in core could rely on it since the object would be implementation specific (e.g. an S3 folder reference).

I'm aware that the APL interface is an MVP at the moment, so I'll hold off asking about stream support and such.

@mateusz
Copy link
Contributor Author

mateusz commented Jan 15, 2015

@nyeholt Thanks for posting, happy to hear you find the high-level design watertight :-)

I must admit I'm struggling to confidently build a mental picture of your architecture just from the code so I can't appreciate your design in full. Maybe you would like to write up a short architecture document, i.e. a mini-RFC, for your proposition?

Can you confirm my understanding though: your solution seems less abstract than this RFC in that it works only with the storage backends, and does not define any intermediate layer to decouple file manipulation and representation from the core (i.e. bottom-up approach). So we'd need to come up with an appropriate abstraction for the Framework to encapsulate all the backend operations anyway (and we would end up with something like the APL).

This RFC comes from the opposite perspective: it works out the intermediate layer's interface, but does not define anything about the backends (i.e. top-down approach). It seems to me that you would be able to implement your proposition as a concrete APL, would you not?

Now to the specific points raised.

The Model level representation (Folder / File objects) shouldn't dictate the file system representation

Correct, it doesn't. In this RFC the storage is decoupled completely. You can build new data object structure based on the APL interface without using Folder & Files. APL doesn't care about the Folders either - it uses the ParentObject as a generalised reference to a container.

File operations (eg resampling, doc conversions etc) need to either a) have a simple way to copy the item locally, then store back in the CDN, in a way that's simple to look up again

Yes, you use the "Variant" field of the tuple to store variants of a file. In other words, the APL is aware of the variants and may optimise for it.

It's very handy to be able to make use of existing remote objects via the same API

This comes at a cost of having to know the storage details of files, and hence at a cost of coupling the storage representation to your code. Nevertheless no-one will stop people from bypassing the APL and interacting directly with the backends, perhaps with some nominal support from the APL (i.e. getBackendIdentifier() to fetch the internal identifier).

It should also be simple to copy between items, regardless of file backend

This can be done within the concrete APL, we don't limit what functionality the APL provides, but it needs to be encapsulated within.

It would be really good if 'secure' file management was a native part of things

As above, this is up to the APL implementation. It seems better not to force the solution to natively use "secureBackend" and "insecureBackend", because some APLs might be able to support security within a single backend.

I encode the specific backend store that an item is located in the filepointer

Again, this couples the storage representation to the Framework.

Thanks again for raising concerns :-)

@mateusz
Copy link
Contributor Author

mateusz commented Jan 15, 2015

@chillu Welcome to the RFC and thanks for helping to bullet-proof the solution ;-)

Would you consider a third party DMS to be a viable APL backend.

Maybe I'm missing something, but I don't see any problem with writing a DMS-supporting APL. The APL interface is "directory-agnostic" (see "Handling directories" chapter). As long as you can describe your file objects via the tuple, you can build the APL however you want. For example the Simple APL will infer the directory tree from the ParentObjects, but an S3 APL would perhaps ignore the directory tree and work only with Hashes.

For readonly, how about throwing Exceptions from the write operations?

We're assuming that SilverStripe is the only consumer of a specific file storage service

Kind of. You are right in assuming the DB is the canonical store for the file metadata. However, since the APL specifics are undefined, you are free to store this metadata in your backend and support other clients in this way. This means you will need to handle the synchronisation issues (which would land you similar problems as with the current asset system, but it would at least be incapsulated withint the APL this time).

On the practical side, you should be able to derive all the information you need from the tuple (specifically, the ParentObject).

Sharded stores

I think this can still be implemented within the APL. You can for example derive the shard ID from the Filename. The only question that remains in my mind is do we need to have some hinting mechanism to tell the APL which backend we prefer (i.e. is there enough data in the tuple for the APL to derive the appropriate backend, or we need some more information).

Do you see a use for ParentObject outside of the built-in Folder structure?

Yes, my assumption is ParentObject can be any Object representing the context (or can be empty). The abstract APL doesn't care about the Folders. The Folder is a construct used by File & Images to structure the files in the CMS via the database. A concrete APL might store files in a similar tree on the filesystem (read up on Simple APL), but it does not guarantee to do so.

The corollary is you can build your own Files & Images section where there is no folders, or folders are in reverse order :-) APL doesn't care.

Stream support

My knowledge is vague here. I assume this is for serving video streams? If yes, then that should work fine because, again, APL is abstract, and the URLs don't need to point back to SilverStripe at all.

Say you link a video resource into your CMS Content. A shortcode will be injected. Then when a user requests the page, APL::getAsURL gets called based on the tuple from the shortcode. APL verifies that this tuple is a streamable resource (i.e. by looking at the extension in Filename, or inspecting ParentObject), and is free to return a direct URL to a box with streaming capability.

Uff. I hope that answers your questions? :-) Keep asking if you have anything more.

@hafriedlander
Copy link
Contributor

@chillu One thing this solution doesn't do at all is deal with the case where some other file store wants to be the truth about what files exist and have SilverStripe be a client of that truth. So a DMS could be used as the backend store, but there's no support for having SilverStripe interrogate that store to find out what files exist. In fact, it makes that slightly harder, because it removes the single "all files live here" property that Files & Images used to have.

@nyeholt
Copy link
Contributor

nyeholt commented Jan 16, 2015

Maybe you would like to write up a short architecture document, i.e. a mini-RFC, for your proposition?

Yep, it's much simpler than you're possibly guessing - it'll be worth putting together even if just to highlight the kinds of things I'd like to be able to do... some of which I'll put inline below :). I think the main bit that makes it confusing is that the cdn-content module is working within the framework's extension points, rather than being cleanly defined APIs - and that there's a lot of stuff in there to translate between raw file i/o operations and the API abstractions.

works only with the storage backends, and does not define any intermediate layer to decouple file manipulation and representation from the core

The Framework still makes use of File objects for assets, but effectively the only way between the framework and the backend is the file reference (eg Label:||identifier/string/name.jpg) and the ContentService; the content service chooses wrapper objects representing 'read' and 'write' actions around the specific backend based on the Label part of the reference; these wrappers then defer to the specific backends for actually performing operations. So to get the URL for an actual file object;

$file = File::get()->byID(2);
$reader = $file->getReader();
$url = $reader->getURL();

or, based on a FileContent db field on a data object

class MyObject {
    $db = array(
        'MyFile'    => 'FileContent',
    );

    public function FileLink() {
        $reader = $this->MyFile->getURL();
        return $reader->getURL();
    }   
}

or, just from an identifier (which is what's under the cover of the above two)

$identifier = 'NamedSource:||awef/seg35w/file.jpg';
$reader = $contentService->getReader($identifier);
$url = $reader->getURL();

So from a framework level, the separation is abstracted at the ContentService point; at that level, you're dealing just with string identifiers going in, and receiving read/write wrappers back (implementers of underlying file I/O functionality). To my understanding, where the RFC has the APL as the abstraction point, in the content-services module that is explicitly split across 3 separate abstractions; the ContentService deals with finding an appropriate backend for a given reference, and is responsible for handling things like sharding, (and implementation could be swapped to handle cascading lookups), whereas the Reader/Writer pair are responsible for exposing the actual operations that can be performed on a file.

From my read of things, the equivalent to the ContentService type of structure is a hierarchical APL type of structure, where the AssetPersistenceLayer retrieved from DI would be an instance of something like SelectivePersistenceLayer, where its getAsString and setAsString methods would then load up the specific instances of a ContentReader/Writer and use those to hook into the relevant backend layers to retrieve bits and pieces.

Part of my motivation for having the explicitly separate interfaces is so that user-land code can operate on content items with a consistent API (see further below).

hence at a cost of coupling the storage representation to your code ... perhaps with some nominal support from the APL (i.e. getBackendIdentifier() to fetch the internal identifier

Yep, that would be as far as I'd expect things to be; the main thing I'd be hoping for is that I could load into a File asset using a consistent API without having to translate between raw operations. eg

// passed in via config/form or something
$remoteItemId = 'RemoteStore:||known/path.jpg';

$remoteReader = $contentService->getReader($remoteItemId);

$file = File::get()->byID(2);
$file->getWriter()->write($remoteReader);

But it's not just paths, the same API should hold for raw content

$file = File::get()->byID(2);
$file->getWriter()->write(new RawContentReader("some kind of string here"));
$file->getWriter()->write(new StreamContentReader(fopen('http://hax.ru/totally_legit.php')));

And to cover Hamish's point about remote content stores making content available, it'd be really nice if the API would provide this capability, if for no other reason than importing from that remote store. I've got the following in place in the content-services module, though it's not completely thought through just yet.

// the remote folder ID is a known structure in the remote store, passed in via config/form or something
$remoteFolderId = 'RemoteStore:||known/path';

$remoteFolder = $contentService->getReader($remoteFolderId);
foreach ($remoteFolder->getList() as $remoteReader) {
    $file = File::create();
    $file->Title = basename($remoteReader->getId());
    $file->getWriter()->write($remoteReader);
    $file->write();
}

This can be done within the concrete APL, we don't limit what functionality the APL provides, but it needs to be encapsulated within.

True, my point is that I'd like to see interfaces defined in such a way that this is mandated so that user-land code has a consistent understanding of how to copy a file from one location to another (as in the examples above).

As above, this is up to the APL implementation. It seems better not to force the solution to natively use "secureBackend" and "insecureBackend", because some APLs might be able to support security within a single backend.

Agreed, I guess my point is more that the 'problem' of secure files should be a framework handled thing, rather than a module necessarily.

Again, this couples the storage representation to the Framework.

Not really - by specific backend store, what I mean is a label that points at something configured via DI, which could be actually be an instance of Anything, and interpreted in any way that Anything chooses. So while S3Bucket:||aw3/aw3eshe34/file.jpg might initially be configured to dump into an S3 store somewhere using the S3ContentReader/Writer implementation, I might at some point copy those files to a local directory and choose to instead configure S3Bucket to use the FileContentReader/Writer implementation. Or I might configure S3Bucket to use a CascadeReader/Writer pair that first looks in local filesystem, falling back to S3 if it can't find it.

Your comment in response to Ingo's Sharded stores question is essentially this implementation, wherein the shard info is encoded in the identifier. The ContentService interprets this shard information to find a backend that will handle the Label part of the identifier.

Anyway - will try and put together a diagram and the specific interfaces separately and link them a bit later today.

@spronkey
Copy link

Please take my comments with a grain of salt, as I think the proposal as written would suit most use cases however I do have a few things to note.

First, this is a very complex proposal, that ends up actually not solving much more than, for example, micmania1's pull request. Complexity is added to support versioning and workflow, but then they are out of the scope of this RFC. I also wonder whether these would be better solved in other ways? Versioned files, for example, are possible to retrofit (in a way) within the current framework - e.g. by introducing an intermediary data object that represents the versioned file, and just storing a new file for every version.

I don't really see why swapping to a strange mashed up "File Reference" field is a positive step - it's just screaming out to be an object in its own right, stored in its own table. Solving the versioning and workflow on DataObjects in general would then be extendable to solve versioning of files. Unless I'm just not getting it?

On the 5.4 issue with Flysystem, why even bother looking at 5.3 compatibility? It's just going to be a waste of time, and it seems like the fact that Flysystem requires 5.4 should make the decision easy to go with SS4.x and bump version requirements.

@nyeholt
Copy link
Contributor

nyeholt commented Jan 16, 2015

And the separate architecture diagram / interface description at https://github.com/silverstripe-australia/silverstripe-content-services/wiki/Architecture

You'll notice that the diagram is very familiar... :)

@mateusz
Copy link
Contributor Author

mateusz commented Jan 17, 2015

@nyeholt nice diagram, I've seen it somewhere ;-)

I think the biggest issue is that you are ignoring the versioning. Native versioning in the Framework is an important design goal for us, a strategic choice if you will. To make files first-class data types, they must support this natively as well. Otherwise files might as well remain DataObjects, and we could use something akin to @micmania1's solution.

Your file identifier assumes that the store+path+filename describes the file uniquely - but that's only true in a operating system context. Framework also needs to handle the time component - a more appropriate mental model here is git repo, not a filesystem. Effectively, you are making the Framework bow to the requirements of a classical unix file system, instead of trying to make the FS serve the actual needs of the Framework.

So I'm pretty much arguing that the tuple stored in the DBField needs to contain enough information to identify a file uniquely, and that "your" file identifier hasn't got enough granularity - it lacks support for versioning and variants. @hafriedlander was initially arguing that the hash by itself is good enough tuple, but that was struck down because of lack of context for constructing the URLs.

So to summarise the proposed solutions so far:

micmania1 nyeholt this rfc
path X X
filename X X X
context* X X
variant X
hash X

*) you are right in saying that having a ParentObject reference is a hidden way of specifying the details of the store, which you accomplish by introducing a textual label.

If we can agree that the tuple presented in this RFC is necessary for our goals, then the only major difference that remains between your solution and this RFC (which can be easily seen by comparing the two diagrams) is how context is handled. You prefer to use injection via a string identifier (i.e. your label) and to have well-defined interface for readers and writers. We are more vague and just point back at the ParentObject, saying - figure it out on your own. Both approaches require people to decorate DataObjects, so it feels like this difference is just a matter of opinion :-)

So what remains is: do we agree on the tuple contents, and if we do not, then how do you propose the versioning is handled? If we can't reconcile our views, then the only remaining option is arbitrage of the competing proposals by the core-committers.

P.S. I don't know about everyone else out there, but this discussion really helps me to understand what problems are out there and how they can or cannot be solved. I hope it's as useful to you too!

@nyeholt
Copy link
Contributor

nyeholt commented Jan 17, 2015

Not ignoring versioning at all; the point of having the FileContent field stored as a string is to allow for this.

I think you're assuming that the file identifier string is being constructed based on a file system representation, which is not the case at all. A ContentWriter is responsible for creating a unique identifier for a given file (or make use of the default abstract implementation). The only requirement being that the identifier can be used to look the asset up later. I think the examples having / chars have given a false impression :) I'll add a couple examples later tonight that show how a file actually gets mapped from a few use cases; uploading a file, creating a variation, and storing a theme in a CDN. Which is another use case I had forgotten about until now 😂

@mateusz
Copy link
Contributor Author

mateusz commented Jan 17, 2015

@spronkey thanks for commenting - I'm pretty much going to link other comments in here, because we already spoke about these things.

Like I wrote in the response to @nyeholt, versioning support is a critical feature of the Framework. Retrofitting versioning into the current model is what do at the moment (see silverstripe-versionedfiles module), and it is rather ugly, because we end up hacking the API by packing the version number into the Filename. What we are trying to do is to solve the root issue here, by making files first-class data types.

Regarding how the tuple is represented, see Sam's argument (search upwards for "The issue isn't so much the tuple object itself as the illusion of flexibility that it would provide.").

Aiming at 3.x just makes the implementation harder - and it's us doing the work, so why are you worried? :-) Community will have the final say in judging if the solution is backwards compatible before merging.

Regarding the Flysystem, I don't think we should get stuck at seeing it as the one and only solution - in all proposed solutions there is an added abstraction layer on top. We should definitely be thinking in broader terms :-)

@mateusz
Copy link
Contributor Author

mateusz commented Jan 17, 2015

@nyeholt I see, yes I did assume that "your" file id was well defined. Would definitely be great to see how you solve all requirements listed in the motivation at the top.

@mateusz
Copy link
Contributor Author

mateusz commented Jan 17, 2015

Oh yeah sorry @nyeholt - one more thing :-) with the RFC process, the author drives the development. If you make a proposition, it's assumed you have the resources implement it.

In other words, we don't need to all agree on the specifics of the solution. If this RFC is approved by the core-committers as "a solution", then we are free to implement it that way - because it's us doing the work. This is especially true if we only differ in opinions, not the substance :-)

@spronkey
Copy link

@mateusz My point re: the DBField thing - I'm trying to understand why FileReferenceField is a glorified string, and not a DataObject in its own right. You say this proposed "first class data type" is solving the root issue - but I see it as a dirty hack and a step backwards that's bound to cause headaches in the future.

DBField instances should represent single values - FileReferenceField is storing multiple values. Some RDMSs have denormalised type support (i.e. postgres and JSON type), but SS's ORM supports databases that don't (i.e. mysql), which tells me that FileReferenceField really should be a table on its own, and thus in SS represented as an object in its own right, with hash, variant, filename, parent object as its fields?

Someone above mentioned the case where systems other than SilverStripe will be accessing the files - I think this is worth further consideration. With the FileReferenceField, there's a big impediment to using raw SQL to access the files, and hierarchy information stored in the database - one has to pull apart a string. Using a DataObject for this removes this impediment completely.

Re: versioned files - if FileReferenceField was an object in its own right, versioning could be taken care of in the same way that SiteTree versioning is. The only catch is that some additional handling would have to take place when files are stored, to ensure that versions remain distinct.

Sam's comments around the tuple object were referring to using a tuple as the params to the APL API, not the storage of file information in a denormalised DBField tuple. I agree with Sam's comments re: not using a tuple for APL API, and using distinct params as proposed. If the API changes there should be a clear notice to the API consumer.

@hafriedlander
Copy link
Contributor

@spronkey http://api.silverstripe.org/3.1/class-CompositeField.html - each element of the tuple would be it's own column in the table, but there is no reason we need to, or should, have a single field represent a single column.

@hafriedlander
Copy link
Contributor

@nyeholt
Copy link
Contributor

nyeholt commented Jan 18, 2015

Ended up getting tied up all weekend, so will respond in a bit more detail from work. Regarding

In other words, we don't need to all agree on the specifics of the solution. If this RFC is approved by the core-committers as "a solution", then we are free to implement it that way - because it's us doing the work. This is especially true if we only differ in opinions, not the substance :-)

My main reason for going through this thoroughly is because I have an existing implementation that has been built with the same goals in mind, and want to make sure I understand how the proposed API will work under the use cases I've implemented or have in mind - I'm quite happy to dump a bunch of code for a better and more widely maintained alternative :).

A couple of thoughts on the APL interface

  • How do you delete what a tuple is pointing at?
  • Is copying meant to be supported?

@assertchris
Copy link
Contributor

Other than versioning and private files: is there anything Flysystem + relative database file paths do not handle? Trying to understand the needs/priorities for the added complexity. :)

@mateusz
Copy link
Contributor Author

mateusz commented Feb 25, 2015

@assertchris by definition, any asset persistency system that needs/wants to use hashes :-) But versioning is important - it's a basic premise of this RFC. If that premise is challenged, a new RFC needs to be written :-)

@assertchris
Copy link
Contributor

@mateusz Out of my depth, in here. Moving along... :)

@stojg
Copy link

stojg commented Feb 25, 2015

So besides arguing about which semantic version this goes into or what the low level API is no-one has been screaming blue murder?

@mateusz
Copy link
Contributor Author

mateusz commented Mar 5, 2015

No murder. Even better: the core-committers approved this, so the games may begin. I mean implementation. I hope we can start implementing this soon! Yay! 🎆

@nyeholt
Copy link
Contributor

nyeholt commented Mar 6, 2015

One query from the implementation side of things... is it envisaged that there will be a set of explicit interfaces defined for a concrete implementation to be written against, or an implied interface set based on an initial concrete implementation?

@clyonsEIS
Copy link
Contributor

No updates for a bit, is this underway or on deck soon? I think it would increase SS adoption significantly since horizontal scaling is quite important and tricky without this.

@willmorgan
Copy link
Contributor

@clyonsEIS I've clarified the milestones and added the pretty multicolour labels, so anyone coming from UserVoice can quickly see the status.

https://silverstripe.uservoice.com/forums/251266-new-features/suggestions/6572660-asset-abstraction

@tractorcow
Copy link
Contributor

I've created a PR for initial implementation of this feature (not including updating of File dataobject). Or things like file variants.

#4599

@tractorcow
Copy link
Contributor

Second pull request is at #4652

@sminnee sminnee modified the milestones: 4.0-alpha1, 4.0.0 Oct 13, 2015
@sminnee sminnee added the ready label Apr 11, 2016
@sminnee
Copy link
Member

sminnee commented Apr 11, 2016

@tractorcow I think this has now been completed?

@tractorcow
Copy link
Contributor

Yes this is done!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests