Extract more standard metadata from binary files #78754
Pinging @elastic/es-data-management (Team:Data Management)
@elasticmachine run elasticsearch-ci/part-1
@elasticmachine run elasticsearch-ci/rest-compatibility
@elasticmachine run elasticsearch-ci/bwc
@dadoonet the code looks good to me, but I think some of the counters need to be updated for the integration tests. I've just rerun the 3 failing builds because the results had aged out.
Ha! I forgot about the integration tests! :) Should be ok now.
Until now, we have been extracting only a small number of fields from the binary files sent to the ingest attachment plugin:

* `content`,
* `title`,
* `author`,
* `keywords`,
* `date`,
* `content_type`,
* `content_length`,
* `language`.

Tika has a list of more standard properties which can be extracted:

* `modified`,
* `format`,
* `identifier`,
* `contributor`,
* `coverage`,
* `modifier`,
* `creator_tool`,
* `publisher`,
* `relation`,
* `rights`,
* `source`,
* `type`,
* `description`,
* `print_date`,
* `metadata_date`,
* `latitude`,
* `longitude`,
* `altitude`,
* `rating`,
* `comments`

This commit exposes those new fields.

Related to elastic#22339.
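To illustrate the idea behind this commit, here is a minimal sketch in plain Java with no Tika dependency. The `Property` enum, the `example:` metadata keys and the `extract` helper are all hypothetical stand-ins; the real plugin reads values from Tika's metadata object, and the actual property keys are not shown in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class AttachmentFieldsSketch {

    /** Hypothetical subset of output fields; the metadata keys are illustrative only. */
    enum Property {
        TITLE("title", "example:title"),
        AUTHOR("author", "example:author"),
        PUBLISHER("publisher", "example:publisher"),
        RIGHTS("rights", "example:rights"),
        RATING("rating", "example:rating");

        final String fieldName;
        final String metadataKey;

        Property(String fieldName, String metadataKey) {
            this.fieldName = fieldName;
            this.metadataKey = metadataKey;
        }
    }

    /** Copies each requested property into the output, skipping missing or empty values. */
    static Map<String, Object> extract(Map<String, String> metadata, Set<Property> wanted) {
        Map<String, Object> fields = new LinkedHashMap<>();
        for (Property property : wanted) {
            String value = metadata.get(property.metadataKey);
            if (value != null && !value.isEmpty()) {
                fields.put(property.fieldName, value);
            }
        }
        return fields;
    }
}
```

Exposing a new field in this sketch means adding one enum constant, which is why adding fields is a small change in code but still changes the shape of the output document.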
401cf64 to 2645084
@elasticmachine run elasticsearch-ci/rest-compatibility
@masseyke What do you think should be fixed on my end to pass the rest-compatibility test?
@dadoonet I'm not sure. I don't know why the test is failing the way it is, because it appears you have fixed it. I am about to re-pull your branch and run the test locally.
I have no idea what the rest-compatibility test is doing, but maybe it's because new fields have been added, so when you compare a previous version with the current one, it looks like a breaking change?
Yeah, you are right that the test sees this as a breaking change since the fields weren't in 7.x. And it doesn't look like we have the version information of the requesting node in the AttachmentProcessor, so we could not selectively filter out fields (which is how we handle this in other places). I'm trying to figure out what the best way forward is. It might be, as you say, to just declare this a breaking change (and verify that it does not actually break anything).
Because we added new fields, this change fails the bwc tests. We do not have access to `RestApiVersion` in the `AttachmentProcessor`, so we can't decide whether or not to produce the fields depending on the version. As the change is trivial and does not remove any existing field, we could skip this regression test.
I updated the tests to skip bwc tests with 7.x versions. Let's see if it's ok. The other option would be to change the code and find a way to propagate the REST API version to the AttachmentProcessor and not emit the new fields when the version is lower than 8.x.
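For reference, the second option could look roughly like the sketch below. It is only an illustration: as noted above, the processor does not currently receive the REST API version, so the `callerMajorVersion` parameter and the `fieldsFor` helper are hypothetical. The legacy field names are the eight fields listed in the description of this change.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class VersionGatedFieldsSketch {

    /** Fields that were already extracted before this change. */
    static final Set<String> LEGACY_FIELDS = Set.of(
        "content", "title", "author", "keywords", "date",
        "content_type", "content_length", "language");

    /**
     * Returns all extracted fields for callers on 8.x or later, and only the
     * legacy fields for older callers. callerMajorVersion is a hypothetical
     * input; nothing in the thread shows how it would be obtained.
     */
    static Map<String, Object> fieldsFor(int callerMajorVersion, Map<String, Object> extracted) {
        if (callerMajorVersion >= 8) {
            return new LinkedHashMap<>(extracted);
        }
        Map<String, Object> filtered = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : extracted.entrySet()) {
            if (LEGACY_FIELDS.contains(entry.getKey())) {
                filtered.put(entry.getKey(), entry.getValue());
            }
        }
        return filtered;
    }
}
```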
@masseyke Do you know what I should do to fix the test?
# Conflicts: # plugins/ingest-attachment/src/test/java/org/elasticsearch/ingest/attachment/AttachmentProcessorTests.java
@dadoonet sorry for the slow reply. I'm going to take a look at this today. But I think that the problem is that it is actually a breaking change for backwards compatibility so it might be tricky to fix.
So I could make this configurable? Like having a setting which activates the new feature?
Yeah I think it'll have to be something like that. I believe I had looked previously, and we don't have access to the version of the calling server like we do in a lot of REST calls. That's the way we usually get around this kind of thing -- if the version of the caller is detected as 7.x, we return what 7.x expects to see.
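A rough sketch of the opt-in idea discussed here, assuming a boolean processor option that defaults to off so existing output is unchanged. The option name `extended_metadata` and the helper methods are hypothetical; the thread does not say what the final setting looks like.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class OptInFieldsSketch {

    /** Hypothetical option name, used only for this illustration. */
    static final String SETTING = "extended_metadata";

    /** The feature is off unless the option is explicitly set to true. */
    static boolean extendedEnabled(Map<String, Object> config) {
        return Boolean.TRUE.equals(config.get(SETTING));
    }

    /** Always emits the legacy fields; adds the new ones only when opted in. */
    static Map<String, Object> apply(Map<String, Object> config,
                                     Map<String, Object> legacy,
                                     Map<String, Object> extended) {
        Map<String, Object> result = new LinkedHashMap<>(legacy);
        if (extendedEnabled(config)) {
            result.putAll(extended);
        }
        return result;
    }
}
```

Defaulting to off keeps the output identical to what 7.x callers expect, at the cost of requiring users to enable the new fields explicitly.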
@elasticmachine run elasticsearch-ci/bwc
Looks good to me.
Thanks!
It has not been backported to the 8.0 branch. Is there anything to do to trigger the backport action?
Co-authored-by: Keith Massey <[email protected]>
Co-authored-by: David Pilato <[email protected]>