Support extraction of all metadata #22339
Conversation
As we have an ingest processor here, we can offer to extract all possible metadata instead of only a small subset. That makes the ingest processor even more interesting: it can receive, for example, a picture, and it will be possible to extract much more information than before.

This PR adds a new property `raw_metadata` which is not set by default. That means that nothing changes for users unless they explicitly ask for `"properties": [ "raw_metadata" ]`. For example:

```
PUT _ingest/pipeline/attachment
{
  "description" : "Extract all metadata",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "properties": [ "raw_metadata" ]
      }
    }
  ]
}

PUT my_index/my_type/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}

GET my_index/my_type/my_id
```

gives back:

```json
{
  "found": true,
  "_index": "my_index",
  "_type": "my_type",
  "_id": "my_id",
  "_version": 1,
  "_source": {
    "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
    "attachment": {
      "raw_metadata": {
        "X-Parsed-By": "org.apache.tika.parser.rtf.RTFParser",
        "Content-Type": "application/rtf"
      }
    }
  }
}
```

Of course, much more metadata can be extracted. For example, this is what a `docx` Word document can generate:

```json
"attachment": {
  "raw_metadata": {
    "date": "2015-02-20T11:36:00Z",
    "cp:revision": "22",
    "Total-Time": "6",
    "extended-properties:AppVersion": "15.0000",
    "meta:paragraph-count": "1",
    "meta:word-count": "15",
    "dc:creator": "Windows User",
    "extended-properties:Company": "JDI",
    "Word-Count": "15",
    "dcterms:created": "2012-10-12T11:17:00Z",
    "meta:line-count": "1",
    "Last-Modified": "2015-02-20T11:36:00Z",
    "dcterms:modified": "2015-02-20T11:36:00Z",
    "Last-Save-Date": "2015-02-20T11:36:00Z",
    "meta:character-count": "92",
    "Template": "Normal.dotm",
    "Line-Count": "1",
    "Paragraph-Count": "1",
    "meta:save-date": "2015-02-20T11:36:00Z",
    "meta:character-count-with-spaces": "106",
    "Application-Name": "Microsoft Office Word",
    "extended-properties:TotalTime": "6",
    "modified": "2015-02-20T11:36:00Z",
    "Content-Type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "X-Parsed-By": "org.apache.tika.parser.microsoft.ooxml.OOXMLParser",
    "creator": "Windows User",
    "meta:author": "Windows User",
    "meta:creation-date": "2012-10-12T11:17:00Z",
    "extended-properties:Application": "Microsoft Office Word",
    "meta:last-author": "Luka Lampret",
    "Creation-Date": "2012-10-12T11:17:00Z",
    "xmpTPg:NPages": "1",
    "Character-Count-With-Spaces": "106",
    "Last-Author": "Luka Lampret",
    "Character Count": "92",
    "Page-Count": "1",
    "Revision-Number": "22",
    "Application-Version": "15.0000",
    "extended-properties:Template": "Normal.dotm",
    "Author": "Windows User",
    "publisher": "JDI",
    "meta:page-count": "1",
    "dc:publisher": "JDI"
  }
}
```
@spinscale Could you review it please?
left a few comments, not sure we need that extra nesting into `raw_metadata`, which is not a very descriptive field
```
{
  "attachment" : {
    "field" : "data",
    "properties": [ "raw_metadata" ]
```
from a user perspective: why is this called `raw_metadata`? Doesn't this simply mean `all`?
Well. It's coming from https://github.com/dadoonet/fscrawler#disabling-raw-metadata where I'm actually putting all that stuff under a `meta.raw` field.
Using "raw" because it's unfiltered and not modified.
```
"_source": {
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
  "attachment": {
    "raw_metadata": {
```
same here from a user perspective: why embed it into `raw_metadata`? Couldn't that be part of the upper-level `attachment` data structure?
Yes, we can change that, but I think it can be easier for users to keep it separated. That way they can more easily ignore the whole content of this inner object at index time with `enabled: false` in the mapping.
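
For illustration, this is a sketch of what such a mapping could look like (reusing the `my_index`/`my_type` names from the example above; this snippet is not part of the PR):

```json
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "attachment": {
          "properties": {
            "raw_metadata": {
              "type": "object",
              "enabled": false
            }
          }
        }
      }
    }
  }
}
```

With `enabled: false`, the `raw_metadata` object is still kept in `_source`, but none of its key/value pairs are parsed or indexed, so an arbitrary number of metadata keys cannot cause a mapping explosion.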
```java
for (Map.Entry<String, Object> entry : rawMetadata.entrySet()) {
    logger.info("assertThat(rawMetadata.get(\"{}\"), is(\"{}\"));", entry.getKey(), entry.getValue());
}*/
assertThat(rawMetadata.get("date"), is("2015-02-20T11:36:00Z"));
```
`assertThat(rawMetadata, hasEntry("date", "2015..."));`
++. Thanks!
```java
    return parseBase64Document(getAsBase64(file), processor);
}

// Adding this method to more easily write the asciidoc documentation
```
I don't understand this comment
I just left it here as a comment. It helps developers easily add a new type of file to the test suite: run it to collect all the assertions in the logs, then copy and paste the log output into the test case itself.
By default, the `ingest-attachment` plugin only extracts a subset of the most common metadata.

If you want to get back all the raw metadata, you can set `properties` to `raw_metadata`.
It will populate a subfield `raw_metadata` with all key/value pairs found as metadata in the document.
should we mention that this can be a lot of fields and that it might make more sense to select only the ones you want? Looks pretty verbose to me
About the number of fields generated, I agree. That's why I never added this feature to the mapper attachments plugin. Here, we know that ingest can help to filter out some fields later on.
We can indeed add another property: instead of modifying `properties`, just have a `raw_properties` list which is by default `[ "_none_" ]`, but can be `[ "_all_" ]` or a list of specific properties to be included.
So users would write:

```json
{
  "attachment" : {
    "field" : "data",
    "raw_properties": [ "_all_" ]
  }
}
```
WDYT?
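
As a sketch of the "ingest can filter out some fields later on" idea (the `attachment` and `remove` processors are real ingest processors, but the specific metadata key dropped here is just an example), a pipeline could look like this:

```json
PUT _ingest/pipeline/attachment
{
  "description" : "Extract all metadata, then drop a noisy key",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "properties": [ "raw_metadata" ]
      }
    },
    {
      "remove" : {
        "field" : "attachment.raw_metadata.X-Parsed-By"
      }
    }
  ]
}
```

This keeps the extraction side simple while still letting users prune unwanted raw metadata before indexing.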
@spinscale ping?
@spinscale Ping? :)
After reading this a few more times I am not too happy with the way the configuration works. Reasons below
The good thing is, we have the rename processor to easily rename all the fields, but I feel we should rethink the configuration of this processor to be more consistent - the single field vs. field thing is just too confusing to me. Also just using the […] How about a configuration like […]?
The user does not care if we extracted a field from the raw data or if it is the content-length from another field, so I think we should hide that.
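
For reference, the rename processor mentioned here can be chained after the attachment processor; a sketch (the target field name is just an example) might look like:

```json
PUT _ingest/pipeline/attachment
{
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    },
    {
      "rename" : {
        "field" : "attachment.content_type",
        "target_field" : "attachment.mime_type"
      }
    }
  ]
}
```

So even if the processor emits fields under fixed names, users are free to reshape the output afterwards.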
@spinscale I really like your proposal of specifying the properties we want to extract. In such a case, setting […] Really smart. I'm going to update my PR. Thanks!
@spinscale So I updated the PR based on your feedback. People can now do:
Which will extract all "raw" metadata plus the content itself. Note that I deprecated the old field names to avoid any conflict with raw metadata names. So we are now using […] It's deprecated so we can still read the "old" property names like […] I wonder if we should also have a special "hack" where […]
instead of:
WDYT?
@spinscale ping?
I still think that […]
I can do it. It's actually a decision between flexibility vs. complexity. So I'm going to implement what you said. Thanks for the feedback!
@spinscale I pushed a new change. LMK. Thanks!
```java
    throw new IllegalArgumentException(value + " is not one of the known keys");
}

public static ReservedProperty findDeprecatedProperty(String value) {
```
unused?
```java
    this.key = key;
}

public static ReservedProperty parse(String value) {
```
unused?
```java
    return properties;
}

public Set<ReservedProperty> getReservedProperties() {
```
no need to be public?
@spinscale a friendly reminder here. :)
@dadoonet - apologies for such a long PR process. If you are able to fix the merge conflicts we will pick this back up and work towards getting this merged in.

```
# Conflicts:
#	plugins/ingest-attachment/src/main/java/org/elasticsearch/ingest/attachment/AttachmentProcessor.java
```
@jakelandis I did merge the master branch into my branch.
@dadoonet - we have had a lot of instability in the builds recently. Things are getting better; can you merge master in again?
@jakelandis I just merged the latest master into my branch. I can see that some errors might not be related to my PR.
Ping? Does someone have some spare time to review it?
@dadoonet Thx for this PR, can you merge master in again?
Hi @dadoonet. Sorry for the really long delay on this one. We'd like to push it through. If you're still interested, would you re-merge master and fix the conflicts? Thanks.
I created #78754 today, which extracts more standard metadata than before. I'm actually wondering if there is still a use case for the current PR. Do people need to extract the "raw" metadata and do specific post-processing on it? Just asking because if we don't want it anymore (because of #78754), it is useless for me to update this PR again. @masseyke WDYT?
Thanks @dadoonet. That makes sense to me. Let me discuss it with the team to make sure there are no objections. I'm not sure whether we would still need this one or not.
Until now, we have been extracting only a small number of fields from the binary files sent to the ingest attachment plugin:

* `content`,
* `title`,
* `author`,
* `keywords`,
* `date`,
* `content_type`,
* `content_length`,
* `language`.

Tika has a list of more standard properties which can be extracted:

* `modified`,
* `format`,
* `identifier`,
* `contributor`,
* `coverage`,
* `modifier`,
* `creator_tool`,
* `publisher`,
* `relation`,
* `rights`,
* `source`,
* `type`,
* `description`,
* `print_date`,
* `metadata_date`,
* `latitude`,
* `longitude`,
* `altitude`,
* `rating`,
* `comments`

This commit exposes those new fields. Related to elastic#22339.