Centralize Lucene files extensions in one place #71416

Merged
merged 10 commits into from
Apr 12, 2021


tlrx
Member

@tlrx tlrx commented Apr 7, 2021

Today Elasticsearch enumerates Lucene file extensions for various purposes: grouping files in segment stats under a description, mapping files in memory through HybridDirectory, or adjusting the caching strategy for Lucene files in searchable snapshots.

But when a new extension is handled somewhere (let's say, added to the list of files to mmap), it is easy to forget to add it in other places. This pull request is an attempt to centralize all known Lucene file extensions in Elasticsearch in a single place.
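To make the idea concrete, here is a minimal sketch of what a centralized enum could look like. It is not the code from this PR: the name `FileExtensionSketch` and the accessors are hypothetical, and reading the two boolean arguments as (metadata, mmap) is an assumption inferred from the snippets quoted in the review comments.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one enum is the only place that knows about an extension.
// Assumed constructor order: (extension, description, metadata, mmap).
enum FileExtensionSketch {
    CFS("cfs", "Compound Files", false, true),
    KDM("kdm", "Points Metadata", true, false),
    LIV("liv", "Live Documents", false, false);

    private final String extension;
    private final String description;
    private final boolean metadata;
    private final boolean mmap;

    FileExtensionSketch(String extension, String description, boolean metadata, boolean mmap) {
        this.extension = extension;
        this.description = description;
        this.metadata = metadata;
        this.mmap = mmap;
    }

    String getExtension() {
        return extension;
    }

    String getDescription() {
        return description;
    }

    boolean isMetadata() {
        return metadata;
    }

    boolean shouldMmap() {
        return mmap;
    }

    // Each consumer derives its own view from the enum instead of keeping a separate list.
    static List<String> mmapExtensions() {
        final List<String> result = new ArrayList<>();
        for (FileExtensionSketch e : values()) {
            if (e.mmap) {
                result.add(e.extension);
            }
        }
        return result;
    }
}
```

With this shape, adding a new extension means adding one enum constant, and every derived list (files to mmap, metadata files, stats descriptions) picks it up automatically.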

@tlrx tlrx added >non-issue :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. v7.13.0 labels Apr 7, 2021
@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Apr 7, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

// Compound files are tricky because they store all the information for the segment. Benchmarks
// suggested that not mapping them hurts performance.
CFS("cfs", "Compound Files", false, true),
CMP("cmp", "Completion Index", true, false),
Member Author

Introduced in this PR: this file will be treated as a metadata file by searchable snapshots.

Contributor

Does it mean that this file will be downloaded eagerly even if the query doesn't need it? This doesn't sound like the right trade-off to me?

Member Author

No, it means that this file will be cached a bit differently to speed up recoveries of searchable snapshot shards. For example, if this file is smaller than or equal to 64KB, it will be fully cached in a document in the .snapshot-blob-cache system index, which is later retrieved when opening the Directory.
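The size rule mentioned above can be sketched as a small predicate. This is only an illustration under stated assumptions: `BlobCacheSketch` and `isFullyCached` are hypothetical names, 64KB is assumed to mean 64 * 1024 bytes, and the sketch does not model what happens to larger or non-metadata files.

```java
// Hypothetical helper illustrating the "metadata file of at most 64KB is fully cached" rule.
final class BlobCacheSketch {

    // Assumption: 64KB is interpreted as 64 * 1024 bytes.
    static final long FULLY_CACHED_MAX_BYTES = 64L * 1024L;

    private BlobCacheSketch() {}

    // Returns true when the whole file would be stored in the blob cache.
    static boolean isFullyCached(boolean isMetadataFile, long fileLength) {
        return isMetadataFile && fileLength <= FULLY_CACHED_MAX_BYTES;
    }
}
```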

Contributor

Ah sorry I was confused in my previous comment, I had missed that this file was read eagerly when opening an index.

TVX("tvx", "Term Vector Index", false, false),
VEC("vec", "Vector Data", false, false),
// Lucene 9.0 indexed vectors metadata
VEM("vem", "Vector Metadata", true, false);
Member Author

In this PR, the Segments Stats API uses this list of extensions instead of a specific, more limited one. If this PR is merged, the API will return more file types, which I think is good.

Contributor

++ that's great

public static LuceneFilesExtensions fromExtension(String ext) {
    if (ext != null && ext.isEmpty() == false) {
        final LuceneFilesExtensions extension = extensions.get(ext);
        assert extension != null: "unknown Lucene file extension [" + ext + ']';
Member Author

This is a best effort to catch any missing extension
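For readers who want to see the whole pattern, here is a self-contained sketch of an extension lookup with a best-effort assertion. It follows the shape of the fragment quoted above, but `ExtensionLookupSketch` is a hypothetical name and the three constants are just a small sample. The assertion only fires when the JVM runs with assertions enabled (`-ea`), as is typically the case in tests; otherwise an unknown extension yields `null`.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a lookup from extension string to enum constant.
enum ExtensionLookupSketch {
    TMP("tmp"),
    TVX("tvx"),
    VEC("vec");

    private final String extension;

    ExtensionLookupSketch(String extension) {
        this.extension = extension;
    }

    // Built once from values(), so the map can never drift from the enum constants.
    private static final Map<String, ExtensionLookupSketch> EXTENSIONS;
    static {
        final Map<String, ExtensionLookupSketch> map = new HashMap<>(values().length);
        for (ExtensionLookupSketch e : values()) {
            map.put(e.extension, e);
        }
        EXTENSIONS = Collections.unmodifiableMap(map);
    }

    // Returns the matching constant, or null for null, empty or unknown extensions.
    static ExtensionLookupSketch fromExtension(String ext) {
        if (ext != null && ext.isEmpty() == false) {
            final ExtensionLookupSketch extension = EXTENSIONS.get(ext);
            // Best effort: only checked when assertions are enabled.
            assert extension != null : "unknown file extension [" + ext + ']';
            return extension;
        }
        return null;
    }
}
```

Because the result can be `null`, callers need to handle that case explicitly.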

@tlrx tlrx requested review from jimczi and ywelsch April 8, 2021 12:35
Contributor

@ywelsch ywelsch left a comment

I've left minor comments, looking good o.w.


private static final Map<String, LuceneFilesExtensions> extensions;
static {
    final Map<String, LuceneFilesExtensions> list = new HashMap<>(values().length);
Contributor

why call this list?

Member Author

It's a leftover, I renamed it to map

@@ -42,7 +42,7 @@ public void testPreload() throws IOException {
 doTestPreload("*");
 Settings build = Settings.builder()
 .put(IndexModule.INDEX_STORE_TYPE_SETTING.getKey(), IndexModule.Type.HYBRIDFS.name().toLowerCase(Locale.ROOT))
-.putList(IndexModule.INDEX_STORE_PRE_LOAD_SETTING.getKey(), "dvd", "bar")
+.putList(IndexModule.INDEX_STORE_PRE_LOAD_SETTING.getKey(), "dvd", "tmp")
Contributor

why was this change necessary? bar is not a valid extension, just like tmp?

Member Author

bar is not a valid extension while tmp is used by Lucene for temporary files. I've added

// Temporary Lucene file
TMP("tmp", "Temporary File", false, false),

for this purpose

@@ -314,7 +271,7 @@ public ByteRange computeBlobCacheByteRange(String fileName, long fileLength, Byt
 }
 }

-if (METADATA_FILES_EXTENSIONS.contains(fileExtension)) {
+if (fileExtension.isMetadata()) {
Contributor

This would lead to an NPE in the case where we have an unknown extension - not good. Let's safeguard against this.

Member Author

🤦 thanks for spotting this
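The conversation does not show the final fix, so the following is only one possible form of the safeguard, written as a sketch. `MetadataFlagSketch` and `NullSafeCheckSketch` are hypothetical names; the point is simply that a `null` extension (unknown file type) is treated as "not metadata" instead of being dereferenced.

```java
// Hypothetical stand-in for the extensions enum, reduced to the metadata flag.
enum MetadataFlagSketch {
    KDM(true),
    LIV(false);

    private final boolean metadata;

    MetadataFlagSketch(boolean metadata) {
        this.metadata = metadata;
    }

    boolean isMetadata() {
        return metadata;
    }
}

final class NullSafeCheckSketch {

    private NullSafeCheckSketch() {}

    // An unknown extension is represented by null and is never treated as metadata.
    static boolean isMetadata(MetadataFlagSketch fileExtension) {
        return fileExtension != null && fileExtension.isMetadata();
    }
}
```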

@tlrx tlrx added the v8.0.0 label Apr 12, 2021
@tlrx
Member Author

tlrx commented Apr 12, 2021

@elasticmachine run elasticsearch-ci/2 (#66392 caused the test to fail)

@tlrx
Member Author

tlrx commented Apr 12, 2021

I opened #71556 for the CI failure.

@tlrx
Member Author

tlrx commented Apr 12, 2021

@ywelsch Thanks for your review! I've updated the code to apply your feedback. CI checks found some issues unrelated to this change. Can you please have another look?

@tlrx tlrx requested a review from ywelsch April 12, 2021 12:57
Contributor

@ywelsch ywelsch left a comment

LGTM

@tlrx tlrx merged commit 8a0bece into elastic:master Apr 12, 2021
@tlrx tlrx deleted the centralize-lucene-files-exts branch April 12, 2021 13:58
@tlrx
Member Author

tlrx commented Apr 12, 2021

Thanks Yannick!

tlrx added a commit that referenced this pull request Apr 12, 2021
Backport of #71416
KDI("kdi", "Points Index", false, true),
// Lucene 8.6 point format metadata file
KDM("kdm", "Points Metadata", true, false),
LIV("liv", "Live Documents", false, false),
Contributor

With the current implementation of live docs, they are fully read when opening an index; should we treat them as metadata?

Member Author

IIRC we did not flag liv files as metadata since we were expecting most indices to use soft-deletes and also because in my mind liv files can become large (?)

Contributor

.liv files can indeed be quite large, but so can .cmp files.

tlrx added a commit that referenced this pull request Apr 15, 2021
… APIs (#71643)

Since #16661 it is possible to know the total sizes for some Lucene segment files 
by using the Node Stats or Indices Stats API with the include_segment_file_sizes 
parameter, and the list of file extensions has been extended in #71416.

This commit adds a bit more information about file sizes: the number of files
(count) and the min, max and average file sizes in bytes for files that share the same extension.

Here is a sample:
"cfs" : {
  "description" : "Compound Files",
  "size_in_bytes" : 2260,
  "min_size_in_bytes" : 2260,
  "max_size_in_bytes" : 2260,
  "average_size_in_bytes" : 2260,
  "count" : 1
}

This commit also simplifies how compound file sizes are computed. Previously,
compound segment files were extracted and their sizes aggregated with the sizes of
regular non-compound files (which can be confusing and is out of the scope of
the original issue #6728); now CFS/CFE files appear as distinct files.

This new information is provided to give a better view of the segment
files and is useful in many cases, especially with frozen searchable snapshots,
whose segment stats can now be introspected thanks to the
include_unloaded_segments parameter.
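
As an illustration of the per-extension statistics described in this commit message (count, min, max, average and total size), here is a minimal sketch that groups file sizes by extension. It is not the Elasticsearch implementation: `FileSizeStatsSketch`, `extensionOf` and `summarize` are hypothetical names, and the input is assumed to be a plain map from file name to size in bytes.

```java
import java.util.LongSummaryStatistics;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: aggregate file sizes per extension.
final class FileSizeStatsSketch {

    private FileSizeStatsSketch() {}

    // Returns the text after the last dot, or an empty string when there is no dot.
    static String extensionOf(String fileName) {
        final int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1);
    }

    // Groups sizes by extension; each LongSummaryStatistics exposes count, sum, min, max and average.
    static Map<String, LongSummaryStatistics> summarize(Map<String, Long> fileSizes) {
        final Map<String, LongSummaryStatistics> stats = new TreeMap<>();
        for (Map.Entry<String, Long> entry : fileSizes.entrySet()) {
            stats.computeIfAbsent(extensionOf(entry.getKey()), k -> new LongSummaryStatistics())
                .accept(entry.getValue().longValue());
        }
        return stats;
    }
}
```

For an input with a single `_0.cfs` file of 2260 bytes, the `cfs` entry would report a count of 1 and identical min, max, average and total sizes, which matches the shape of the sample above.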
tlrx added a commit that referenced this pull request Apr 15, 2021
… Stats APIs (#71725)

Backport of #71643
Labels
:Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. >non-issue Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v7.13.0 v8.0.0-alpha1
5 participants