Snapshots as simple archives #81210

ywelsch · 2021-12-01T14:10:09Z

Goal

The goal of this effort is to provide access to older Elasticsearch data, for compliance or regulatory reasons, the occasional lookback or investigation, or to rehydrate parts of it. Access to the data is expected to be very infrequent, and can therefore happen with limited performance and query capabilities. Running old versions of Elasticsearch to access the old data is not practical as it would require running outdated and unsupported software.

A non-goal of this effort is to fully solve the major version upgrade problem. "Snapshots as simple archives" is, however, an important first step towards longer-term data retention and access. It will allow some users to refrain from upgrading their archived data, and refraining from upgrading is probably the simplest upgrade option.

Solution

Snapshots have long been used for backup purposes. With this new feature, they can now be used for archival purposes as well. Elasticsearch will have the ability to access older snapshot repositories and the data therein. In addition, some basic query and aggregation capabilities are available, and the data can be reindexed into newer Elasticsearch clusters without having the old cluster present. The feature provides the guarantee that data put into Elasticsearch (and stored in snapshots) does not have an EOL, but can be accessed for a long time into the future (even if at reduced speed). The data can either be restored with read-only access, or accessed via searchable snapshots so that the archived data won't even need to fully reside on local disks.

Phases

Phase 0: Prototype

Basic implementation showing feasibility (Allow reading _source from older snapshots #77542)

Phase 1: MVP (target release: 8.3)

Allow Elasticsearch 8 nodes to access snapshot repositories written by previous Elasticsearch versions going back to Elasticsearch 5.0. Allow restoring indices from snapshots in the old repository into the Elasticsearch 8 cluster as well as mounting them as searchable snapshots. Allow basic query and aggregation capabilities based on postings / doc values as well as runtime fields on these indices.
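As a rough illustration of the two access paths described above, the sketch below only assembles the HTTP method, path, and JSON body for a restore request and for a searchable-snapshot mount request; it does not contact a cluster. The repository, snapshot, and index names are hypothetical, and `storage=shared_cache` is just one mount option picked as an assumption.

```python
import json
from urllib.parse import quote

# Hypothetical names, used only for illustration.
REPO = "old_repo"
SNAPSHOT = "snap_2017"
INDEX = "logs-2017"


def restore_request(repo, snapshot, index):
    """Build a restore request that copies the index onto local disks."""
    path = f"/_snapshot/{quote(repo, safe='')}/{quote(snapshot, safe='')}/_restore"
    return "POST", path, {"indices": index}


def mount_request(repo, snapshot, index, storage="shared_cache"):
    """Build a searchable-snapshot mount request for the same index."""
    path = (
        f"/_snapshot/{quote(repo, safe='')}/{quote(snapshot, safe='')}"
        f"/_mount?storage={storage}"
    )
    return "POST", path, {"index": index}


for method, path, body in (
    restore_request(REPO, SNAPSHOT, INDEX),
    mount_request(REPO, SNAPSHOT, INDEX),
):
    print(method, path, json.dumps(body))
```

Either request would be issued against the Elasticsearch 8 cluster after the old repository has been registered there.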

Supported field types

Old mappings are imported as much "as-is" as possible into Elasticsearch 8, but only provide regular query / aggregation capabilities on a select subset of fields:

Numeric types
boolean type
ip type
geo_point type
date types: the date format setting on date fields is supported insofar as it behaves similarly across these versions. Where it does not, the setting can be updated on legacy indices by the user if need be.
keyword type: the normalizer setting on keyword fields is supported insofar as it behaves similarly across these versions. Where it does not, the setting can be updated on legacy indices if need be.
text type: scoring capabilities are limited, and all queries return a constant score of 1.0. The analyzer settings on text fields are supported insofar as they behave similarly across these versions. Where they do not, they can be updated on legacy indices if need be.
Multi-fields
Field aliases
object fields
some basic metadata fields, e.g. _type for querying Elasticsearch 5 indices
runtime fields
_source field

Elasticsearch 5 indices with mappings that have multiple mapping types are collapsed together on a best-effort basis before they are imported.
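To make "collapsed together" concrete, here is a toy sketch of merging several mapping types into a single set of properties. It is not the actual import logic: the first-definition-wins rule, the function name, and the sample mapping are all assumptions made for illustration.

```python
def collapse_mapping_types(mappings):
    """Merge the properties of several mapping types into one type-less mapping.

    Toy rule (an assumption): the first definition seen for a field name wins.
    """
    merged = {}
    for type_name in sorted(mappings):  # sorted for a deterministic result
        properties = mappings[type_name].get("properties", {})
        for field, definition in properties.items():
            merged.setdefault(field, definition)
    return {"properties": merged}


# Hypothetical 5.x-style mapping with two types in one index.
legacy = {
    "tweet": {"properties": {"user": {"type": "keyword"}, "text": {"type": "text"}}},
    "profile": {"properties": {"user": {"type": "keyword"}, "age": {"type": "integer"}}},
}

collapsed = collapse_mapping_types(legacy)
print(sorted(collapsed["properties"]))  # ['age', 'text', 'user']
```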

In case the auto-import of mappings does not work, or the new version can't make sense of the mapping, it falls back to a lightweight import of the mapping where the original mapping is stored in the _meta section of the imported index's mapping, and relies on the user to put the relevant mapping parts manually in place.
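A minimal sketch of the manual step in the fallback case: pick field definitions out of the stored legacy mapping and build a mapping-update body from them. The exact layout under `_meta` (a `legacy_mappings` key holding a type name and its `properties`) is an assumption here, as are the helper name and the sample fields.

```python
import json


def promote_legacy_fields(imported_mapping, type_name, field_names):
    """Build a mapping-update body from definitions kept under _meta.legacy_mappings."""
    legacy = imported_mapping["_meta"]["legacy_mappings"]
    legacy_properties = legacy[type_name]["properties"]
    selected = {name: legacy_properties[name] for name in field_names}
    return {"properties": selected}


# Hypothetical result of a lightweight import: original mapping kept in _meta only.
imported = {
    "_meta": {
        "legacy_mappings": {
            "doc": {
                "properties": {
                    "status": {"type": "keyword"},
                    "created": {"type": "date"},
                    "payload": {"type": "some_removed_type"},
                }
            }
        }
    }
}

body = promote_legacy_fields(imported, "doc", ["status", "created"])
# The body would be sent as: PUT /<index>/_mapping
print(json.dumps(body, sort_keys=True))
```

Fields that the new version cannot parse (`payload` above) are simply left out of the update and remain reachable through `_source`.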

Supported APIs

Archive indices are read-only, and provide data access via the search and field capabilities APIs. They do not support the Get API nor any write APIs.

Archive indices allow running queries as well as aggregations in so far as they are supported by the given field type (see above).
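For example, a filter-plus-aggregation request over supported field types might look like the body built below. This is a sketch only: the field names and date bounds are hypothetical, and the body would be sent to the regular search endpoint of the archive index.

```python
import json


def archive_search_body(keyword_field, value, date_field, gte, lt, agg_field):
    """Build a search body that filters on a keyword and a date range,
    then buckets the matching documents by another keyword field."""
    return {
        "size": 0,  # only aggregation results are wanted
        "query": {
            "bool": {
                "filter": [
                    {"term": {keyword_field: value}},
                    {"range": {date_field: {"gte": gte, "lt": lt}}},
                ]
            }
        },
        "aggs": {"by_field": {"terms": {"field": agg_field}}},
    }


# Hypothetical fields of an archived log index.
body = archive_search_body(
    "status", "error", "created", "2017-01-01", "2017-02-01", "host"
)
print(json.dumps(body, indent=2))
```

Filters are used rather than scoring clauses because scoring on archive indices is limited anyway.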

Due to _source access the data can also be reindexed to a new index that has full compatibility with the current Elasticsearch version.
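A sketch of what such a reindex request body could look like, with an optional query so that only part of the archive is rehydrated. The index names and the helper are hypothetical; the body would be sent as `POST /_reindex`.

```python
import json


def reindex_body(archive_index, dest_index, query=None):
    """Build a reindex body that copies _source from an archive index
    into a regular index, optionally restricted by a query."""
    source = {"index": archive_index}
    if query is not None:
        source["query"] = query
    return {"source": source, "dest": {"index": dest_index}}


# Copy everything, or only documents matching a filter.
full = reindex_body("logs-2017-archive", "logs-2017-v8")
partial = reindex_body(
    "logs-2017-archive",
    "logs-2017-errors-v8",
    query={"term": {"status": "error"}},
)
print(json.dumps(full))
print(json.dumps(partial))
```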

List of tasks:

Phase 2: Cluster management & ILM integration

Phase 1 still requires users to take extra steps during a major version upgrade: snapshot the data that can't make it to the next major version and delete it from the cluster, then do the upgrade, and finally restore / mount the data again as legacy indices. The goal of phase 2 is to automate some of this, making it easier for users to go through a major version upgrade. Some steps could include providing an ILM integration so that indices can be transitioned to an "archive" where they will be limited to doc-values / source-only access, as well as allowing users to upgrade to the next major version by auto-converting indices to archival.

Phase 2 won't be worked on immediately and is captured in #87291.

elasticmachine · 2021-12-01T14:10:13Z

Pinging @elastic/es-distributed (Team:Distributed)

elasticmachine · 2021-12-01T14:10:13Z

Pinging @elastic/es-search (Team:Search)

Adds Lucene support for reading _id and _source from ES 5 / ES 6 indices. Relates #81210

Adapts peer recovery so that it properly integrates with the hook to convert old indices. Relates #81210

Adds support for reading doc values formats of ES 5 and 6. Relates #81210

Allows searching on number field types (long, short, int, float, double, byte, half_float) when those fields are not indexed (index: false) but just doc values are enabled. This enables searches on archive data, which has access to doc values but not index structures. When combined with searchable snapshots, it allows downloading only data for a given (doc value) field to quickly filter down to a select set of documents. Note to reviewers: I have split isSearchable into two separate methods isIndexed and isSearchable on MappedFieldType. The former one is about whether actual indexing data structures have been used (postings or points), and the latter one on whether you can run queries on the given field (e.g. used by field caps). For number field types, queries are now allowed whenever points are available or when doc values are available (i.e. searchability is expanded). Relates #81210 and #52728
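The split between the two methods can be pictured with a small toy model. This is not the Java code from the PR: the class and attribute names below are hypothetical, and only the rule stated above (queries allowed when points or doc values are available) is modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NumberFieldModel:
    """Toy stand-in for a numeric mapped field type (not the real Java class)."""

    has_points: bool      # corresponds to "index": true in the mapping
    has_doc_values: bool  # corresponds to "doc_values": true in the mapping

    def is_indexed(self) -> bool:
        # True only when actual index structures (points) exist.
        return self.has_points

    def is_searchable(self) -> bool:
        # Queries can run on points or, more slowly, on doc values alone.
        return self.has_points or self.has_doc_values


regular = NumberFieldModel(has_points=True, has_doc_values=True)
doc_values_only = NumberFieldModel(has_points=False, has_doc_values=True)
disabled = NumberFieldModel(has_points=False, has_doc_values=False)

for name, field in [
    ("regular", regular),
    ("doc_values_only", doc_values_only),
    ("disabled", disabled),
]:
    print(name, field.is_indexed(), field.is_searchable())
```

The doc-values-only case is the one that matters for archive data: not indexed, but still searchable.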

Similar to #82409, but for date fields. Allows searching on date field types (date, date_nanos) when those fields are not indexed (index: false) but just doc values are enabled. This enables searches on archive data, which has access to doc values but not index structures. When combined with searchable snapshots, it allows downloading only data for a given (doc value) field to quickly filter down to a select set of documents. Relates #81210 and #52728

Allows searching on keyword fields when those fields are not indexed (index: false) but just doc values are enabled. This enables searches on archive data, which has access to doc values but not index structures. When combined with searchable snapshots, it allows downloading only data for a given (doc value) field to quickly filter down to a select set of documents. Relates #81210 and #52728

Allows searching on boolean fields when those fields are not indexed (index: false) but just doc values are enabled. This enables searches on archive data, which has access to doc values but not index structures. When combined with searchable snapshots, it allows downloading only data for a given (doc value) field to quickly filter down to a select set of documents. Relates #81210 and #52728

Allows searching on ip fields when those fields are not indexed (index: false) but just doc values are enabled. This enables searches on archive data, which has access to doc values but not index structures. When combined with searchable snapshots, it allows downloading only data for a given (doc value) field to quickly filter down to a select set of documents. Relates #81210 and #52728

For archival indices, where mappings might not be parseable by new ES versions anymore, copy the mapping to the `_meta/legacy_mappings` section. Relates #81210

Allows running queries against _type on 5.x indices as well as returning _type in search results. Relates #81210

Ensure archive indices have a write block and the write block can't be removed. Relates #81210

Archival won't ship with 8.2, hence removing the files there. I found it easier to remove the code than reintroducing a way to only enable it in snapshot builds. Relates #81210

As part of #81210 we would like to add support for handling legacy (Elasticsearch 5 and 6) mappings in newer Elasticsearch versions. The idea is to import old mappings "as-is" into Elasticsearch 8, and adapt the mapper parsers so that they can handle those old mappings. Only a select subset of the legacy mapping will actually be parsed, and fields that are neither known to newer ES versions nor supported for search will be mapped as "placeholder fields", i.e., they are still represented as fields in the system so that they can give proper error messages when queried by a user.

Fields that are supported:

- field data types that support doc-values-only fields
- normalizer on keyword fields and date formats on date fields are only supported insofar as they behave similarly across versions. In case they do not, these settings are now updateable on legacy indices so that they can be "fixed" by the user.
- object fields
- nested fields in limited form (not supporting nested queries)
  - add tests / checks in follow-up PR
- multi fields
- field aliases
- metadata fields
- runtime fields (auto-import to be added for future versions)

5.x indices with mappings that have multiple mapping types are collapsed together on a best-effort basis before they are imported. Relates #81210

Adds support for the Lucene 5 postings format (used by Lucene 6 and 7). Relates #81210

Integrates the fields API with placeholder fields. Relates #81210

The get API relies under the hood on accessing postings to look up the _id and retrieve the corresponding document. Access via postings is not something we would like to guarantee on archive indices. While we are adding "text field support" for archive indices, we reserve the flexibility to eventually swap that out with a "runtime-text field" variant, and hence only provide those capabilities that can be emulated via a runtime field. Doing the same for "get" would mean doing a full scan of the index (using stored fields), which runs counter to what the get API is meant for (quick lookup of a document). We would therefore rather not have the API accessible on archive indices. Relates #81210

Codifies the requirement to add Lucene BWC codecs and corresponding tests when upgrading Elasticsearch to the next major version. Relates #81210

Adds support for "text" fields in archive indices, with the goal of adding simple filtering support on text fields when querying archive indices. There are some differences to regular text fields:

- no global statistics: queries on text fields return constant score (similar to match_only_text).
- analyzer fields can be updated
- if defined analyzer is not available, falls back to default analyzer
- no guarantees that analyzers are BWC

The above limitations also give us the flexibility to eventually swap out the implementation with a "runtime-text field" variant, and hence only provide those capabilities that can be emulated via a runtime field. Relates #81210

Adds documentation for the new snapshots as archive feature. Relates #81210

ywelsch · 2022-06-01T12:35:00Z

🚢ed in 8.3.0. Items for a possible future phase 2 are captured in #87291

ywelsch added >feature, :Distributed Coordination/Snapshot/Restore, :Search/Search, and Meta labels Dec 1, 2021

ywelsch self-assigned this Dec 1, 2021

elasticmachine added Team:Distributed (Obsolete) and Team:Search labels Dec 1, 2021

ywelsch mentioned this issue Dec 2, 2021

Add codec support for Lucene 6 and 7 versions #81258

Merged

ywelsch added a commit that referenced this issue Dec 8, 2021

Add codec support for Lucene 6 and 7 versions (#81258)

0685af2

Adds Lucene support for reading _id and _source from ES 5 / ES 6 indices. Relates #81210

ywelsch mentioned this issue Dec 8, 2021

Make peer recovery work with archive data #81522

Merged

ywelsch added a commit that referenced this issue Dec 14, 2021

Make peer recovery work with archive data (#81522)

1e99bc6

Adapts peer recovery so that it properly integrates with the hook to convert old indices. Relates #81210

ywelsch mentioned this issue Jan 4, 2022

Add doc values support for ES 5 and ES 6 #82207

Merged

ywelsch added a commit that referenced this issue Jan 10, 2022

Add doc values support for ES 5 and ES 6 (#82207)

c55a460

Adds support for reading doc values formats of ES 5 and 6. Relates #81210

ywelsch mentioned this issue Jan 11, 2022

Allow docvalues-only search on number types #82409

Merged

ywelsch mentioned this issue Jan 14, 2022

Allow doc-values only search on date types #82602

Merged

ywelsch mentioned this issue Jan 20, 2022

Allow doc-values only search on keyword fields #82846

Merged

This was referenced Jan 24, 2022

Allow doc-values only search on boolean fields #82925

Merged

Allow doc-values only search on ip fields #82929

Merged

ywelsch mentioned this issue Jan 25, 2022

Copy old mappings to _meta section #83041

Merged

ywelsch added a commit that referenced this issue Jan 27, 2022

Copy old mappings to _meta section (#83041)

5707b65

For archival indices, where mappings might not be parseable by new ES versions anymore, copy the mapping to the `_meta/legacy_mappings` section. Relates #81210

ywelsch mentioned this issue Jan 27, 2022

Provide access to _type in 5.x indices #83195

Merged

ywelsch added a commit that referenced this issue Jan 27, 2022

Provide access to _type in 5.x indices (#83195)

ac9f30a

Allows running queries against _type on 5.x indices as well as returning _type in search results. Relates #81210

This was referenced Mar 17, 2022

Handle legacy mappings with placeholder fields #85059

Merged

Use write block on archive indices #85102

Merged

ywelsch added a commit that referenced this issue Mar 18, 2022

Use write block on archive indices (#85102)

9fdd0dc

Ensure archive indices have a write block and the write block can't be removed. Relates #81210

ywelsch mentioned this issue Mar 24, 2022

Support older postings formats #85303

Merged

ywelsch changed the title ~~Snapshots as a simple archive tier~~ Snapshots as simple archives Mar 28, 2022

ywelsch mentioned this issue Mar 31, 2022

Remove archival functionality from 8.2 branch #85524

Merged

ywelsch added a commit that referenced this issue Apr 28, 2022

Support older postings formats (#85303)

c082858

Adds support for the Lucene 5 postings format (used by Lucene 6 and 7). Relates #81210

This was referenced Apr 28, 2022

Docs for snapshots as simple archives #86261

Merged

Allow field retrieval on placeholder fields #86289

Merged

ywelsch added a commit that referenced this issue Apr 29, 2022

Allow field retrieval on placeholder fields (#86289)

425be6a

Integrates the fields API with placeholder fields. Relates #81210

This was referenced May 10, 2022

Add text field support to archive indices #86591

Merged

Disable get API on legacy indices #86594

Merged

This was referenced May 11, 2022

Add points metadata support for archive indices #86655

Merged

Add reminder to add BWC codecs on major version upgrade #86844

Merged

ywelsch added a commit that referenced this issue May 17, 2022

Add reminder to add BWC codecs on major version upgrade (#86844)

f9641b8

Codifies the requirement to add Lucene BWC codecs and corresponding tests when upgrading Elasticsearch to the next major version. Relates #81210

ywelsch added a commit that referenced this issue May 30, 2022

Docs for snapshots as simple archives (#86261)

46b386b

Adds documentation for the new snapshots as archive feature. Relates #81210

ywelsch added a commit to ywelsch/elasticsearch that referenced this issue May 30, 2022

Docs for snapshots as simple archives (elastic#86261)

5dd5a7b

Adds documentation for the new snapshots as archive feature. Relates elastic#81210

ywelsch added release highlight v8.3.0 labels Jun 1, 2022

ywelsch mentioned this issue Jun 1, 2022

Snapshots as archives improvements #87291

Open

3 tasks

ywelsch closed this as completed Jun 1, 2022

cbuescher mentioned this issue Nov 28, 2024

Update archival indices logic to support ES 7 indices #116565

Open
