
Support extraction of all metadata #22339

Closed
wants to merge 22 commits

Conversation

dadoonet
Member

Since we have an ingest processor here, we can offer extracting all possible metadata instead of only a small subset.
That makes the ingest processor even more interesting: it can receive, for example, a picture, and it becomes possible to extract much more information than before.

This PR adds a new property `raw_metadata` which is not set by default. That means nothing changes for users unless they explicitly ask for `"properties": [ "raw_metadata" ]`.

For example:

```
PUT _ingest/pipeline/attachment
{
  "description" : "Extract all metadata",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "properties": [ "raw_metadata" ]
      }
    }
  ]
}
PUT my_index/my_type/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
GET my_index/my_type/my_id
```

gives back:

```json
{
  "found": true,
  "_index": "my_index",
  "_type": "my_type",
  "_id": "my_id",
  "_version": 1,
  "_source": {
    "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
    "attachment": {
      "raw_metadata": {
        "X-Parsed-By": "org.apache.tika.parser.rtf.RTFParser",
        "Content-Type": "application/rtf"
      }
    }
  }
}
```

Of course, much more metadata can be extracted. For example, this is what a `docx` Word document can generate:

```
"attachment": {
  "raw_metadata": {
    "date": "2015-02-20T11:36:00Z",
    "cp:revision": "22",
    "Total-Time": "6",
    "extended-properties:AppVersion": "15.0000",
    "meta:paragraph-count": "1",
    "meta:word-count": "15",
    "dc:creator": "Windows User",
    "extended-properties:Company": "JDI",
    "Word-Count": "15",
    "dcterms:created": "2012-10-12T11:17:00Z",
    "meta:line-count": "1",
    "Last-Modified": "2015-02-20T11:36:00Z",
    "dcterms:modified": "2015-02-20T11:36:00Z",
    "Last-Save-Date": "2015-02-20T11:36:00Z",
    "meta:character-count": "92",
    "Template": "Normal.dotm",
    "Line-Count": "1",
    "Paragraph-Count": "1",
    "meta:save-date": "2015-02-20T11:36:00Z",
    "meta:character-count-with-spaces": "106",
    "Application-Name": "Microsoft Office Word",
    "extended-properties:TotalTime": "6",
    "modified": "2015-02-20T11:36:00Z",
    "Content-Type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "X-Parsed-By": "org.apache.tika.parser.microsoft.ooxml.OOXMLParser",
    "creator": "Windows User",
    "meta:author": "Windows User",
    "meta:creation-date": "2012-10-12T11:17:00Z",
    "extended-properties:Application": "Microsoft Office Word",
    "meta:last-author": "Luka Lampret",
    "Creation-Date": "2012-10-12T11:17:00Z",
    "xmpTPg:NPages": "1",
    "Character-Count-With-Spaces": "106",
    "Last-Author": "Luka Lampret",
    "Character Count": "92",
    "Page-Count": "1",
    "Revision-Number": "22",
    "Application-Version": "15.0000",
    "extended-properties:Template": "Normal.dotm",
    "Author": "Windows User",
    "publisher": "JDI",
    "meta:page-count": "1",
    "dc:publisher": "JDI"
  }
}
```
@dadoonet
Member Author

@spinscale Could you review it please?

Copy link
Contributor

spinscale left a comment


Left a few comments. Not sure we need that extra nesting into `raw_metadata`, which is not a very descriptive field name.

```
{
  "attachment" : {
    "field" : "data",
    "properties": [ "raw_metadata" ]
```
Contributor


From a user perspective: why is this called `raw_metadata`? Doesn't this simply mean "all"?

Member Author


Well, it's coming from https://github.com/dadoonet/fscrawler#disabling-raw-metadata where I'm actually putting all that stuff under a `meta.raw` field.

Member Author


Using "raw" because it's unfiltered and not modified.

"_source": {
"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
"attachment": {
"raw_metadata": {
Contributor


Same here, from a user perspective: why embed it into `raw_metadata`? Couldn't that be part of the upper-level `attachment` data structure?

Member Author


Yes, we can change that, but I think it is easier for users if we keep it separate. That way they can more easily ignore the whole content of this inner object at index time with `enabled: false` in the mapping, as in the sketch below.
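
For illustration, a minimal mapping sketch that disables indexing for everything under `attachment.raw_metadata` (the index and type names are assumptions for the example, not part of this PR):

```
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "attachment": {
          "properties": {
            "raw_metadata": {
              "type": "object",
              "enabled": false
            }
          }
        }
      }
    }
  }
}
```

With `enabled: false`, the object is still kept in `_source` but none of its fields are indexed or searchable.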

```java
for (Map.Entry<String, Object> entry : rawMetadata.entrySet()) {
    logger.info("assertThat(rawMetadata.get(\"{}\"), is(\"{}\"));", entry.getKey(), entry.getValue());
}*/
assertThat(rawMetadata.get("date"), is("2015-02-20T11:36:00Z"));
```
Contributor


`assertThat(rawMetadata, hasEntry("date", "2015...");`

Member Author


++. Thanks!

```java
    return parseBase64Document(getAsBase64(file), processor);
}

// Adding this method to more easily write the asciidoc documentation
```
Contributor


I don't understand this comment.

Member Author


I just left it here as a comment. It helps developers add a new file type to the test suite more easily: run the test once, collect all the generated assertions from the logs, then copy and paste them into the test case itself.

By default, the `ingest-attachment` plugin only extracts a subset of the most common metadata.

If you want to get back all the raw metadata, you can add `raw_metadata` to the `properties` list.
This will populate a subfield `raw_metadata` with all key/value pairs found as metadata in the document.
Contributor


Should we mention that this can produce a lot of fields and that it might make more sense to select only the ones you want? Looks pretty verbose to me.

Member Author


About the number of fields generated, I agree. That's why I never added this feature to the mapper-attachments plugin.
Here, we know that ingest can help filter out some fields later on.

We could indeed add another property: instead of modifying `properties`, have a `raw_properties` list which defaults to `[ "_none_" ]` but can be `[ "_all_" ]` or a list of specific properties to include.

So users would write:

```
{
  "attachment" : {
    "field" : "data",
    "raw_properties": [ "_all_" ]
  }
}
```

WDYT?

Member Author


@spinscale ping?

@dadoonet
Member Author

@spinscale Ping? :)

@dadoonet dadoonet added v5.3.0 and removed v5.2.0 labels Jan 20, 2017
@spinscale
Contributor

spinscale commented Jan 23, 2017

After reading this a few more times, I am not too happy with the way the configuration works. Reasons below:

  • There is one configuration property that enables dozens of other ones. This is fundamentally different from the other, already existing configuration params.
  • There are overlaps between existing properties and the raw field names, i.e. Content-Type, Date - who wins here?
  • There are inconsistent field names, i.e. meta: (all lowercase), hyphenated ones like Last-Save-Date, or just publisher vs. Author - which is just part of the territory of file formats, I'd say (we can clean this up with a rename processor though).
  • There are fields which have been enriched by the processor and are not coming from the document, it seems? The X-Parsed-By header seems to be such a candidate. Also confusing, but still manageable.

The good thing is, we have the rename processor to easily rename all the fields (see the sketch after this comment), but I feel we should rethink the configuration of this processor to be more consistent - the single field vs. field thing is just too confusing to me. properties is already an array, but it could easily be configured as an array of regular expressions (as this is part of the configuration, it can be precompiled, or we can use Regex.simpleMatchToAutomaton(String... regexes)).

Also, using the raw configuration just to extract one or two more fields and then removing all the others (of which you don't know all the names, because of all the different ways the field names are written) does not sound like a good idea to me.

How about a configuration like

```
{
  "attachment" : {
    "field" : "data",
    "properties": [ "content", "title", "pdf:*", "*-Count" ]
  }
}
```

The user does not care whether a field was extracted from the raw data or whether it is, say, the content length derived from another field, so I think we should hide that distinction.
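
As a minimal sketch of the rename-processor idea mentioned above (the target field name `attachment.parsed_by` is an assumption for the example, not something this PR defines):

```
PUT _ingest/pipeline/attachment
{
  "description" : "Extract raw metadata, then rename one field",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "properties": [ "raw_metadata" ]
      }
    },
    {
      "rename" : {
        "field" : "attachment.raw_metadata.X-Parsed-By",
        "target_field" : "attachment.parsed_by"
      }
    }
  ]
}
```

Note that `rename` fails if the source field is missing, so such a pipeline is only safe for fields known to be present in every document.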

@dadoonet
Member Author

@spinscale I really like your proposal of specifying the properties we want to extract. In such a case, setting `"properties": [ "*" ]` would extract everything.

Really smart. I'm going to update my PR. Thanks!

@dadoonet
Member Author

@spinscale So I updated the PR based on your feedback.

People can now do:

```
PUT _ingest/pipeline/attachment
{
  "description" : "Extract all metadata",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "properties": [ "*", "_content_" ]
      }
    }
  ]
}
```

Which will extract all "raw" metadata plus the content itself.

Note that I deprecated the old field names to avoid any conflict with raw metadata names. We now use `_content_`, `_title_`, `_author_`, `_keywords_`, `_date_`, `_content_type_`, `_content_length_`, `_language_` as the "specific" properties.

They are only deprecated, so we can still read the "old" property names like `content`, `title`...

I wonder if we should also have a special "hack" where `_*_` would mean any of the fixed ones as well.
That way people could write:

        "properties": [ "pdf:*", "_*_" ]

instead of:

        "properties": [ "pdf:*", "_content_", "_title_", "_author_", "_keywords_", "_date_", "_content_type_", "_content_length_", "_language_" ]

WDYT?

@dadoonet
Member Author

@spinscale ping?

@spinscale
Contributor

I still think that `"properties": [ "*", "_content_" ]` is confusing. Why do I need to match everything, and then some more, to really get everything? Shouldn't `*` match everything plus our own static fields?

@dadoonet
Member Author

> Shouldn't `*` match everything plus our own static fields?

I can do it. It's really a trade-off between flexibility and complexity.
That said, in the context of ingest, since we don't add that many static fields, it is easy to remove them manually if they are not needed (see the sketch below).

So I'm going to implement what you said.

Thanks for the feedback!
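
For illustration, a minimal sketch of removing an unwanted static field with the `remove` processor (the choice of `attachment.content_length` is an assumption for the example):

```
PUT _ingest/pipeline/attachment
{
  "description" : "Extract everything, then drop a field we do not need",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "properties": [ "*" ]
      }
    },
    {
      "remove" : {
        "field" : "attachment.content_length"
      }
    }
  ]
}
```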

@dadoonet
Member Author

@spinscale I pushed a new change. LMK. Thanks!

```java
    throw new IllegalArgumentException(value + " is not one of the known keys");
}

public static ReservedProperty findDeprecatedProperty(String value) {
```
Contributor


unused?

```java
    this.key = key;
}

public static ReservedProperty parse(String value) {
```
Contributor


unused?

```java
    return properties;
}

public Set<ReservedProperty> getReservedProperties() {
```
Contributor


no need to be public?

@dadoonet
Member Author

@spinscale a friendly reminder here. :)

@jakelandis
Contributor

@dadoonet - apologies for such a long PR process. If you are able to fix the merge conflicts, we will pick this back up and work towards getting this merged in.

```
# Conflicts:
#	plugins/ingest-attachment/src/main/java/org/elasticsearch/ingest/attachment/AttachmentProcessor.java
```
@dadoonet
Member Author

dadoonet commented Feb 5, 2019

@jakelandis I did merge the master branch into my branch.
Not sure why the build is failing though.

@dadoonet dadoonet removed the v7.0.0 label Feb 5, 2019
@jakelandis
Contributor

@dadoonet - we have had a lot of instability in the builds recently. Things are getting better; can you merge master in again?

@dadoonet
Member Author

@jakelandis I just merged the latest master into my branch. I can see that some errors might not be related to my PR.

@dadoonet
Member Author

Ping? Does someone have some spare time to review it?

@theroch

theroch commented Feb 26, 2021

@dadoonet Thanks for this PR - can you merge master in again?
I'm really interested in this PR.

@masseyke
Member

Hi @dadoonet. Sorry for the really long delay on this one. We'd like to push it through. If you're still interested, would you re-merge master and fix the conflicts? Thanks.

@dadoonet
Member Author

dadoonet commented Oct 6, 2021

Today I created #78754, which extracts more standard metadata than before.

I'm actually wondering whether there is still a use case for the current PR. Do people need to extract the "raw" metadata and do specific post-processing on it?

Just asking because, if we don't want it anymore (because of #78754), there is no point in me updating this PR again.

@masseyke WDYT?

@masseyke
Member

masseyke commented Oct 7, 2021

Thanks @dadoonet. That makes sense to me. Let me discuss it with the team to make sure there are no objections. I'm not sure if we would still need this one or not.

@dakrone dakrone requested review from masseyke and removed request for masseyke October 14, 2021 15:24
dadoonet added a commit to dadoonet/elasticsearch that referenced this pull request Oct 21, 2021
Until now, we have extracted only a small number of fields from the binary files sent to the ingest-attachment plugin:

* `content`,
* `title`,
* `author`,
* `keywords`,
* `date`,
* `content_type`,
* `content_length`,
* `language`.

Tika has a list of more standard properties which can be extracted:

* `modified`,
* `format`,
* `identifier`,
* `contributor`,
* `coverage`,
* `modifier`,
* `creator_tool`,
* `publisher`,
* `relation`,
* `rights`,
* `source`,
* `type`,
* `description`,
* `print_date`,
* `metadata_date`,
* `latitude`,
* `longitude`,
* `altitude`,
* `rating`,
* `comments`

This commit exposes those new fields.

Related to elastic#22339.
dadoonet added a commit that referenced this pull request Nov 23, 2021
masseyke added a commit to masseyke/elasticsearch that referenced this pull request Nov 29, 2021
masseyke added a commit that referenced this pull request Nov 29, 2021
@masseyke
Member

masseyke commented Dec 3, 2021

@dadoonet I don't think we ever really finished the conversation, but we can close this one now since we have #78754, right?

@dadoonet
Member Author

> @dadoonet I don't think we ever really finished the conversation, but we can close this one now since we have #78754, right?

Yes. I think that if there is some demand, we can always revisit this later.

Let's close it.

@dadoonet dadoonet closed this Dec 14, 2021
Labels
:Data Management/Ingest Node · >enhancement · Team:Data Management