Centralize Lucene files extensions in one place #71416

tlrx · 2021-04-07T15:18:49Z

Today Elasticsearch enumerates Lucene file extensions for various purposes: grouping files in segment stats under a description, mapping files in memory through HybridDirectory, or adjusting the caching strategy for Lucene files in searchable snapshots.

But when a new extension is handled somewhere (say, added to the list of files to mmap), it is easy to forget to add it in other places. This pull request attempts to centralize all known Lucene file extensions in Elasticsearch in a single place.
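To illustrate the idea, here is a minimal sketch of such a registry. It is not the code from this PR: the enum name `FileExtensionSketch`, the accessor names and the helper `mmapExtensions()` are hypothetical, and the meaning of the two boolean flags (metadata first, mmap second) is inferred from the review comments on the entries quoted in this conversation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, trimmed-down registry; the real enum in the PR has many more entries.
enum FileExtensionSketch {
    // Assumed argument order: extension, description, metadata, mmap
    CFS("cfs", "Compound Files", false, true),
    CMP("cmp", "Completion Index", true, false),
    TVX("tvx", "Term Vector Index", false, false),
    VEM("vem", "Vector Metadata", true, false);

    private final String extension;
    private final String description;
    private final boolean metadata;
    private final boolean mmap;

    FileExtensionSketch(String extension, String description, boolean metadata, boolean mmap) {
        this.extension = extension;
        this.description = description;
        this.metadata = metadata;
        this.mmap = mmap;
    }

    String getExtension() {
        return extension;
    }

    String getDescription() {
        return description;
    }

    boolean isMetadata() {
        return metadata;
    }

    boolean shouldMmap() {
        return mmap;
    }

    // Every consumer derives its own list from the single source of truth.
    static List<String> mmapExtensions() {
        final List<String> result = new ArrayList<>();
        for (FileExtensionSketch e : values()) {
            if (e.mmap) {
                result.add(e.extension);
            }
        }
        return result;
    }
}
```

With this shape, adding a new extension means adding one enum entry, and the segment stats, mmap and caching code paths all see it without separate lists to keep in sync.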

elasticmachine · 2021-04-07T15:18:52Z

Pinging @elastic/es-distributed (Team:Distributed)

tlrx · 2021-04-08T11:58:52Z

server/src/main/java/org/elasticsearch/index/store/LuceneFilesExtensions.java

+    // Compound files are tricky because they store all the information for the segment. Benchmarks
+    // suggested that not mapping them hurts performance.
+    CFS("cfs", "Compound Files", false, true),
+    CMP("cmp", "Completion Index", true, false),


This extension is introduced in this PR; searchable snapshots will treat this file as a metadata file.

Does it mean that this file will be downloaded eagerly even if the query doesn't need it? This doesn't sound like the right trade-off to me?

No, it means that this file will be cached a bit differently to speed up recoveries of searchable snapshot shards. For example, if this file is smaller than or equal to 64KB it will be fully cached in a document in the .snapshot-blob-cache system index, which is later retrieved when opening the Directory.

Ah sorry I was confused in my previous comment, I had missed that this file was read eagerly when opening an index.

tlrx · 2021-04-08T12:00:40Z

server/src/main/java/org/elasticsearch/index/store/LuceneFilesExtensions.java

+    TVX("tvx", "Term Vector Index", false, false),
+    VEC("vec", "Vector Data", false, false),
+    // Lucene 9.0 indexed vectors metadata
+    VEM("vem","Vector Metadata", true, false);


In this PR the Segments Stats API uses this list of extensions instead of a specific, more limited one. If this PR is merged then the API will return more file types which I think is good.

++ that's great

tlrx · 2021-04-08T12:05:57Z

server/src/main/java/org/elasticsearch/index/store/LuceneFilesExtensions.java

+    public static LuceneFilesExtensions fromExtension(String ext) {
+        if (ext != null && ext.isEmpty() == false) {
+            final LuceneFilesExtensions extension = extensions.get(ext);
+            assert extension != null: "unknown Lucene file extension [" + ext + ']';


This is a best effort to catch any missing extension
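As a rough sketch of this lookup pattern (not the PR's actual class): the enum name `ExtensionLookupSketch`, its two entries and the field name `BY_EXTENSION` are hypothetical. The point is that the assertion only trips when the JVM runs with assertions enabled (`-ea`), as in tests; otherwise an unknown extension simply yields `null`, so callers still have to handle that case.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical two-entry enum used only to show the lookup and its assertion.
enum ExtensionLookupSketch {
    DVD("dvd"),
    TMP("tmp");

    // Built once, after the enum constants have been initialized.
    private static final Map<String, ExtensionLookupSketch> BY_EXTENSION;
    static {
        final Map<String, ExtensionLookupSketch> map = new HashMap<>(values().length);
        for (ExtensionLookupSketch e : values()) {
            map.put(e.extension, e);
        }
        BY_EXTENSION = Map.copyOf(map);
    }

    private final String extension;

    ExtensionLookupSketch(String extension) {
        this.extension = extension;
    }

    // Returns null for a null, empty or unknown extension.
    // With assertions enabled, an unknown extension fails fast instead.
    static ExtensionLookupSketch fromExtension(String ext) {
        if (ext != null && ext.isEmpty() == false) {
            final ExtensionLookupSketch found = BY_EXTENSION.get(ext);
            assert found != null : "unknown file extension [" + ext + ']';
            return found;
        }
        return null;
    }
}
```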

ywelsch

I've left minor comments; looking good otherwise.

ywelsch · 2021-04-12T07:02:30Z

server/src/main/java/org/elasticsearch/index/store/LuceneFilesExtensions.java

+    TVX("tvx", "Term Vector Index", false, false),
+    VEC("vec", "Vector Data", false, false),
+    // Lucene 9.0 indexed vectors metadata
+    VEM("vem","Vector Metadata", true, false);


++ that's great

ywelsch · 2021-04-12T07:04:55Z

server/src/main/java/org/elasticsearch/index/store/LuceneFilesExtensions.java

+
+    private static final Map<String, LuceneFilesExtensions> extensions;
+    static {
+        final Map<String, LuceneFilesExtensions> list = new HashMap<>(values().length);


why call this list?

It's a leftover, I renamed it to map

ywelsch · 2021-04-12T07:06:03Z

server/src/test/java/org/elasticsearch/index/store/FsDirectoryFactoryTests.java

@@ -42,7 +42,7 @@ public void testPreload() throws IOException {
        doTestPreload("*");
        Settings build = Settings.builder()
            .put(IndexModule.INDEX_STORE_TYPE_SETTING.getKey(), IndexModule.Type.HYBRIDFS.name().toLowerCase(Locale.ROOT))
-            .putList(IndexModule.INDEX_STORE_PRE_LOAD_SETTING.getKey(), "dvd", "bar")
+            .putList(IndexModule.INDEX_STORE_PRE_LOAD_SETTING.getKey(), "dvd", "tmp")


Why was this change necessary? bar is not a valid extension, just like tmp?

bar is not a valid extension while tmp is used by Lucene for temporary files. I've added

// Temporary Lucene file
TMP("tmp", "Temporary File", false, false),

for this purpose

ywelsch · 2021-04-12T07:08:08Z

.../main/java/org/elasticsearch/xpack/searchablesnapshots/cache/blob/BlobStoreCacheService.java

@@ -314,7 +271,7 @@ public ByteRange computeBlobCacheByteRange(String fileName, long fileLength, Byt
            }
        }

-        if (METADATA_FILES_EXTENSIONS.contains(fileExtension)) {
+        if (fileExtension.isMetadata()) {


This would lead to an NPE when we have an unknown extension - not good. Let's safeguard against this.

🤦 thanks for spotting this
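One way to guard against this, shown as a hedged sketch rather than the fix that was actually committed: treat a `null` lookup result as "not a metadata file". The class `MetadataCheckSketch`, the nested enum `Ext` and the helper names are hypothetical.

```java
// Hypothetical stand-in for the extension enum and the caller that checks it.
final class MetadataCheckSketch {

    enum Ext {
        CMP(true),
        LIV(false);

        private final boolean metadata;

        Ext(boolean metadata) {
            this.metadata = metadata;
        }

        boolean isMetadata() {
            return metadata;
        }
    }

    // Returns null when the extension is unknown (or null), like a map lookup would.
    static Ext lookup(String ext) {
        for (Ext e : Ext.values()) {
            if (e.name().equalsIgnoreCase(ext)) {
                return e;
            }
        }
        return null;
    }

    // Null-safe check: unknown extensions are treated as non-metadata files
    // instead of triggering a NullPointerException.
    static boolean isMetadataFile(String ext) {
        final Ext e = lookup(ext);
        return e != null && e.isMetadata();
    }
}
```

Falling back to the non-metadata path for unknown files keeps the caching logic working for files that the registry does not know about, such as those written by custom codecs.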

tlrx · 2021-04-12T09:52:34Z

@elasticmachine run elasticsearch-ci/2 (#66392 caused the test to fail)

tlrx · 2021-04-12T12:09:31Z

I opened #71556 for the CI failure.

tlrx · 2021-04-12T12:57:52Z

@ywelsch Thanks for your review! I've updated the code to apply your feedback. CI checks found some issues unrelated to this change. Can you please have another look?

ywelsch

LGTM

tlrx · 2021-04-12T13:58:42Z

Thanks Yannick!

jpountz · 2021-04-12T16:40:20Z

server/src/main/java/org/elasticsearch/index/store/LuceneFilesExtensions.java

+    KDI("kdi", "Points Index", false, true),
+    // Lucene 8.6 point format metadata file
+    KDM("kdm", "Points Metadata", true, false),
+    LIV("liv", "Live Documents", false, false),


With the current implementation of live docs, they are fully read when opening an index, should we treat them as metadata?

IIRC we did not flag liv files as metadata since we were expecting most indices to use soft-deletes and also because in my mind liv files can become large (?)

.liv files can indeed be quite large, but so can .cmp files.

… APIs (#71643)

Since #16661 it has been possible to get the total sizes of some Lucene segment files by using the Node Stats or Indices Stats API with the include_segment_file_sizes parameter, and the list of file extensions was extended in #71416. This commit adds more information about file sizes: the number of files that share the same extension (count) and their minimum, maximum and average sizes in bytes. Here is a sample:

    "cfs" : {
      "description" : "Compound Files",
      "size_in_bytes" : 2260,
      "min_size_in_bytes" : 2260,
      "max_size_in_bytes" : 2260,
      "average_size_in_bytes" : 2260,
      "count" : 1
    }

This commit also simplifies how compound file sizes are computed. Previously, compound segment files were extracted and their sizes aggregated with those of regular non-compound files (which can be confusing and is out of the scope of the original issue #6728); now CFS/CFE files appear as distinct files.

This new information gives a better view of the segment files and is useful in many cases, especially with frozen searchable snapshots, whose segment stats can now be introspected thanks to the include_unloaded_segments parameter.
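The per-extension figures described in this commit message (count, min, max, average) can be sketched with the JDK's `LongSummaryStatistics`. This is only an illustration of the aggregation, not the Elasticsearch implementation: the class `FileSizeStatsSketch`, the method `summarize` and the input shape (a map from file name to size in bytes) are assumptions.

```java
import java.util.LongSummaryStatistics;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper that groups file sizes by extension.
final class FileSizeStatsSketch {

    // Input: file name -> size in bytes. Output: extension -> count/min/max/sum/average.
    static Map<String, LongSummaryStatistics> summarize(Map<String, Long> fileSizes) {
        final Map<String, LongSummaryStatistics> stats = new TreeMap<>();
        for (Map.Entry<String, Long> entry : fileSizes.entrySet()) {
            final String name = entry.getKey();
            final int dot = name.lastIndexOf('.');
            if (dot < 0) {
                continue; // no extension in the name, skipped in this sketch
            }
            final String ext = name.substring(dot + 1);
            final long size = entry.getValue();
            stats.computeIfAbsent(ext, k -> new LongSummaryStatistics()).accept(size);
        }
        return stats;
    }
}
```

For a single 2260-byte `.cfs` file, as in the sample response, the count is 1 and the minimum, maximum and average all equal 2260.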

Centralize Lucene files extensions

be21bf5

tlrx added the >non-issue, :Distributed Indexing/Engine and v7.13.0 labels Apr 7, 2021

elasticmachine added the Team:Distributed (Obsolete) label Apr 7, 2021

tlrx added 6 commits April 8, 2021 10:50

add tmp

a991bf9

Merge branch 'master' into centralize-lucene-files-exts

939c553

add cmp/lkp

66dec29

more fixes

9bdcae9

more fixes

7ff0eea

cmp as metadata

b4eb5b3

tlrx commented Apr 8, 2021

tlrx requested review from jimczi and ywelsch April 8, 2021 12:35

ywelsch reviewed Apr 12, 2021

tlrx added 2 commits April 12, 2021 10:33

feedback

11d4483

Merge branch 'master' into centralize-lucene-files-exts

bd17456

tlrx added the v8.0.0 label Apr 12, 2021

Merge branch 'master' into centralize-lucene-files-exts

b3fc0ae

tlrx requested a review from ywelsch April 12, 2021 12:57

ywelsch approved these changes Apr 12, 2021

tlrx merged commit 8a0bece into elastic:master Apr 12, 2021

tlrx deleted the centralize-lucene-files-exts branch April 12, 2021 13:58

tlrx mentioned this pull request Apr 12, 2021

[7.x] Centralize Lucene files extensions in one place #71568

Merged

jpountz reviewed Apr 12, 2021

tlrx mentioned this pull request Apr 13, 2021

Enhanced segment files sizes information in Nodes Stats/Indices Stats APIs #71643

Merged

tlrx mentioned this pull request Apr 15, 2021

[7.x] Enhanced segment files sizes information in Nodes Stats/Indices Stats APIs #71725

Merged

mattweber mentioned this pull request Jun 15, 2021

Custom Lucene Extensions Blocked #74150

Open

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
