Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Add support for external files via RFC2017 mime types #1234

Closed
wants to merge 6 commits into from

Conversation

cjcolvar
Copy link
Member

@cjcolvar cjcolvar commented Apr 20, 2017

This PR takes the pattern used by Fedora and some Hydra implementors (see related links below) and enables it at the ActiveFedora::File level. The use case for Avalon is that we want to store technical metadata and track derivative files that do not live in our repository but somewhere external, possibly a streaming server.

Since the only way to determine if there is external content is to inspect the mime type, a request to fcr:metadata is required to get the file's content. You can see I had to remove the test in spec/integration/attached_files_spec.rb due to this additional request. Certain calls now require requesting fcr:metadata to get the mime type to check if there is external content.

I'm guess there are things I've overlooked or edge cases that aren't handled so please give feedback on improvements!

The last failing test is related to #1233 and should disappear if I rebase once that is merged.

Related tickets and code:
#1156
samvera/ldp#71
https://github.com/projecthydra/hydra-works/blob/master/lib/hydra/works/services/add_external_file_to_file_set.rb
https://github.com/lbiedinger/ldp/tree/redirects
https://github.com/pulibrary/plum/blob/master/app/jobs/ingest_file_job.rb

@cjcolvar cjcolvar requested review from jcoyne, escowles and afred April 20, 2017 18:05
Copy link
Contributor

@escowles escowles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me — I've got just a couple of questions.

content_will_change! unless @content == string_or_io
@content = string_or_io
end

def content
local_or_remote_content(true)
external_content? ? external_content : local_or_remote_content(true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to explicitly retrieve the external content URL from the description, and then retrieve it? Can't we just follow Fedora's redirect?

def external_url
return nil unless mime_type.start_with? "message/external-body"
url = mime_type.split(';').at(2)
url.nil? ? nil : url[/\"(.*?)\"/, 1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a HTTP header parser we can use that would be more robust against slight changes to the external-body format? E.g., I know the spaces are optional, and I think you might be able to use single quotes or change the order of the tokens so the URL was in the second position, etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't really find anything to do this parsing for us. Any suggestions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought one of the HTTP clients would have this (and they must internally), but I couldn't find anything. I did find a gem called content-type that will do this: https://rubygems.org/gems/content-type

ContentType.parse("message/external-body; access-type=\"URL\"; URL=\"http://example.org/1\"").parameters['url']

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth pulling in this gem (of 2 commits from 3 years ago) along with its dependency parslet for this? Maybe it would be better to come up with something good enough that we own instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'm overthinking this — we could probably even just use a regex like:

url.match(/url=['"](.+?)['"]/i).captures.first

end

def external_url=(external_file_url)
self.mime_type = "message/external-body; access-type=URL; URL=\"#{external_file_url}\""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cjcolvar This definitely makes it easy. I'm wondering if we could use another property. Could we make content itself an HTML redirect so that when someone hits the resource it actually redirects to the location? Or does that work with using mime_type?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fedora currently redirects if the file has a mime_type using message/external-body (RFC2017). This means the content is still stored but is never exposed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cjcolvar interesting. so you could still have binary content on content but the override to mime_type would force a redirect?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd have to check to make sure, but I think you could store content but you wouldn't be able to retrieve it. But once you changed the mime type to something else, then you could probably retrieve the content again.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is currently how File and Fedora work. I'm attempting to have .content return the bits for either external or internal. Thus making the File act like it either has external or internal content but never both at the same time. I had started with a separate .external_content method but thought this approach clearer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@escowles Are you suggesting if I want to create an AF::File that points to a non-resolvable or non-HTTP URL then I might be better off storing that in another field (like ebucore:locator) and setting empty content instead of using message/external-body in the mime_type?

I figured it would be better to use the same approach to store the external reference no matter the protocol or resolvability and mark the AF::File as external content in all of those cases because that is what it is representing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cjcolvar I don't know about this. Right now, Fedora sends redirects, but there is a lot of interest in having Fedora proxy content instead (both HTTP and potentially local files too). My main concern is that if we support anything beyond the currently-implemented behavior (which is already in use by several apps), it will complicate implementing whatever gets decided about proxying or supporting files instead of just HTTP. I'm hoping to discuss this with the new Fedora API spec group that's meeting tomorrow.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@escowles Great! Looking forward to the outcome of that then!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@escowles Any updates from the Fedora API spec group meeting?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current proposal for the API spec is here: https://github.com/fcrepo/fcrepo-specification/wiki/External-Content (this wiki page has been generally agreed upon and will be incorporated into the spec shortly).

The basic pattern is different from the current implementation:

  • Either url or local-file are supported, and they are to be interpreted by the server.
  • The server will then proxy the content and the client doesn't need to know anything about the content being external.
  • The location of the external content will be exposed to the client in a Content-Location header.
  • Supporting redirects like the current implementation is still undecided — if anyone has a use case for redirects that proxying won't cover, I'd recommend opening an issue in https://github.com/fcrepo/fcrepo-specification/issues

So, I think we should support the current behavior now, but I don't think we need to do anything about local-file or other types right now.

@@ -147,6 +161,7 @@ def to_solr(solr_doc = {}, _opts = {})
end

def content=(string_or_io)
raise "This file has external content. First call external_url=nil if you want to use content=." if external_content?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cjcolvar might be nice to have a external_content_reset method that does the setting to nil for you.

@@ -196,7 +216,7 @@ def uploaded_file?(payload)
end

def local_or_remote_content(ensure_fetch = true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to settle on the word "external" instead of "remote" for this method (and any others that may be applicable)? We could alias :local_or_remote_content, :local_or_external_content, and add a deprecation warning.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that remote_content here means the version of the content that lives in fedora as opposed to the local yet to be persisted version of the content. External content is different from those two and thus a new name. It might be appropriate to call this method :local_or_remote_or_external_content now but that seems wordy and confusing.

@cjcolvar
Copy link
Member Author

Closing because I'm no longer working on this and might be irrelevant now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants