-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Centralize Lucene files extensions in one place #71416
Conversation
Pinging @elastic/es-distributed (Team:Distributed) |
// Compound files are tricky because they store all the information for the segment. Benchmarks | ||
// suggested that not mapping them hurts performance. | ||
CFS("cfs", "Compound Files", false, true), | ||
CMP("cmp", "Completion Index", true, false), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Introduced in this PR, this file will be treated as a metadata one by searchable snapshots.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it mean that this file will be downloaded eagerly even if the query doesn't need it? This doesn't sound like the right trade-off to me?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it means that this file will be cached a bit differently to speed up searchable snapshots shards recoveries. For example if this file is smaller or equal to 64KB it will be fully cached into a doc in the .snapshot-blob-cache
system index that will be later retrieved when opening the Directory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sorry I was confused in my previous comment, I had missed that this file was read eagerly when opening an index.
TVX("tvx", "Term Vector Index", false, false), | ||
VEC("vec", "Vector Data", false, false), | ||
// Lucene 9.0 indexed vectors metadata | ||
VEM("vem","Vector Metadata", true, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this PR the Segments Stats API uses this list of extensions instead of a specific, more limited one. If this PR is merged then the API will return more file types which I think is good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++ that's great
public static LuceneFilesExtensions fromExtension(String ext) { | ||
if (ext != null && ext.isEmpty() == false) { | ||
final LuceneFilesExtensions extension = extensions.get(ext); | ||
assert extension != null: "unknown Lucene file extension [" + ext + ']'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a best effort to catch any missing extension
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left minor comments, looking good o.w.
TVX("tvx", "Term Vector Index", false, false), | ||
VEC("vec", "Vector Data", false, false), | ||
// Lucene 9.0 indexed vectors metadata | ||
VEM("vem","Vector Metadata", true, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++ that's great
server/src/main/java/org/elasticsearch/index/store/LuceneFilesExtensions.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/store/LuceneFilesExtensions.java
Outdated
Show resolved
Hide resolved
|
||
private static final Map<String, LuceneFilesExtensions> extensions; | ||
static { | ||
final Map<String, LuceneFilesExtensions> list = new HashMap<>(values().length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why call this list
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a leftover, I renamed it to map
@@ -42,7 +42,7 @@ public void testPreload() throws IOException { | |||
doTestPreload("*"); | |||
Settings build = Settings.builder() | |||
.put(IndexModule.INDEX_STORE_TYPE_SETTING.getKey(), IndexModule.Type.HYBRIDFS.name().toLowerCase(Locale.ROOT)) | |||
.putList(IndexModule.INDEX_STORE_PRE_LOAD_SETTING.getKey(), "dvd", "bar") | |||
.putList(IndexModule.INDEX_STORE_PRE_LOAD_SETTING.getKey(), "dvd", "tmp") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why way this change necessary? bar
is not a valid extension, just as tmp
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bar
is not a valid extension while tmp
is used by Lucene for temporary files. I've added
// Temporary Lucene file
TMP("tmp", "Temporary File", false, false),
for this purpose
@@ -314,7 +271,7 @@ public ByteRange computeBlobCacheByteRange(String fileName, long fileLength, Byt | |||
} | |||
} | |||
|
|||
if (METADATA_FILES_EXTENSIONS.contains(fileExtension)) { | |||
if (fileExtension.isMetadata()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would lead to a NPE in case where we have an unknown extension - not good. Let's safeguard against this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤦 thanks for spotting this
@elasticmachine run elasticsearch-ci/2 (#66392 caused the test to fail) |
I opened #71556 for the CI failure. |
@ywelsch Thanks for your review! I've updated the code to apply your feedback. CI checks found some issues unrelated with this change. Can you please have another look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks Yannick! |
Elasticsearch enumerates Lucene files extensions for various purposes: grouping files in segment stats under a description, mapping files in memory through HybridDirectory or adjusting the caching strategy for Lucene files in searchable snapshots. But when a new extension is handled somewhere(let's say, added to the list of files to mmap) it is easy to forget to add it in other places. This commit is an attempt to centralize in a single place all known Lucene files extensions in Elasticsearch.
Elasticsearch enumerates Lucene files extensions for various purposes: grouping files in segment stats under a description, mapping files in memory through HybridDirectory or adjusting the caching strategy for Lucene files in searchable snapshots. But when a new extension is handled somewhere(let's say, added to the list of files to mmap) it is easy to forget to add it in other places. This commit is an attempt to centralize in a single place all known Lucene files extensions in Elasticsearch. Backport of #71416
KDI("kdi", "Points Index", false, true), | ||
// Lucene 8.6 point format metadata file | ||
KDM("kdm", "Points Metadata", true, false), | ||
LIV("liv", "Live Documents", false, false), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the current implementation of live docs, they are fully read when opening an index, should we treat them as metadata?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC we did not flag liv
files as metadata since we were expecting most indices to use soft-deletes and also because in my mind liv files can become large (?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.liv
files can indeed be quite large, but so can be .cmp
files.
… APIs (#71643) Since #16661 it is possible to know the total sizes for some Lucene segment files by using the Node Stats or Indices Stats API with the include_segment_file_sizes parameter, and the list of file extensions has been extended in #71416. This commit adds a bit more information about file sizes like the number of files (count), the min, max and average file sizes in bytes that share the same extension. Here is a sample: "cfs" : { "description" : "Compound Files", "size_in_bytes" : 2260, "min_size_in_bytes" : 2260, "max_size_in_bytes" : 2260, "average_size_in_bytes" : 2260, "count" : 1 } This commit also simplifies how compound file sizes were computed: before compound segment files were extracted and sizes aggregated with regular non-compound files sizes (which can be confusing and out of the scope of the original issue #6728), now CFS/CFE files appears as distinct files. These new information are provided to give a better view of the segment files and are useful in many cases, specially with frozen searchable snapshots whose segment stats can now be introspected thanks to the include_unloaded_segments parameter.
… APIs (elastic#71643) Since elastic#16661 it is possible to know the total sizes for some Lucene segment files by using the Node Stats or Indices Stats API with the include_segment_file_sizes parameter, and the list of file extensions has been extended in elastic#71416. This commit adds a bit more information about file sizes like the number of files (count), the min, max and average file sizes in bytes that share the same extension. Here is a sample: "cfs" : { "description" : "Compound Files", "size_in_bytes" : 2260, "min_size_in_bytes" : 2260, "max_size_in_bytes" : 2260, "average_size_in_bytes" : 2260, "count" : 1 } This commit also simplifies how compound file sizes were computed: before compound segment files were extracted and sizes aggregated with regular non-compound files sizes (which can be confusing and out of the scope of the original issue elastic#6728), now CFS/CFE files appears as distinct files. These new information are provided to give a better view of the segment files and are useful in many cases, specially with frozen searchable snapshots whose segment stats can now be introspected thanks to the include_unloaded_segments parameter.
… Stats APIs (#71725) Since #16661 it is possible to know the total sizes for some Lucene segment files by using the Node Stats or Indices Stats API with the include_segment_file_sizes parameter, and the list of file extensions has been extended in #71416. This commit adds a bit more information about file sizes like the number of files (count), the min, max and average file sizes in bytes that share the same extension. Here is a sample: "cfs" : { "description" : "Compound Files", "size_in_bytes" : 2260, "min_size_in_bytes" : 2260, "max_size_in_bytes" : 2260, "average_size_in_bytes" : 2260, "count" : 1 } This commit also simplifies how compound file sizes were computed: before compound segment files were extracted and sizes aggregated with regular non-compound files sizes (which can be confusing and out of the scope of the original issue #6728), now CFS/CFE files appears as distinct files. These new information are provided to give a better view of the segment files and are useful in many cases, specially with frozen searchable snapshots whose segment stats can now be introspected thanks to the include_unloaded_segments parameter. Backport of #71643
Today Elasticsearch enumerates Lucene files extensions for various purposes: grouping files in segment stats under a description, mapping files in memory through
HybridDirectory
or adjusting the caching strategy for Lucene files in searchable snapshots.But when a new extension is handled somewhere(let's say, added to the list of files to mmap) it is easy to forget to add it in other places. This pull request is an attempt to centralize in a single place all known Lucene files extensions in Elasticsearch.