Doc-value-only fields #52728

Closed

jpountz opened this issue Feb 24, 2020 · 11 comments
Labels
>feature · :Search Foundations/Mapping · Team:Search Foundations

Comments

jpountz (Contributor) commented Feb 24, 2020

Users who index time series typically care a lot about indexing rate and space efficiency. Disabling inverted structures like the inverted index and points would help on both fronts. Queries could still work using doc values, but more slowly, which is a trade-off that these users are often happy to make.

Default mappings would still create inverted structures, so users would have to opt-in to trade search efficiency for disk space / indexing rate.

I'd like to make this change depend on some feedback mechanism as outlined in #48058, so that slow queries caused by disabled inverted structures never come as a surprise to users.

jpountz added the :Search Foundations/Mapping and >feature labels on Feb 24, 2020
elasticmachine (Collaborator) commented:

Pinging @elastic/es-search (:Search/Mapping)

mayya-sharipova (Contributor) commented:

I would like to clarify something. Users can currently disable inverted data structures for a field by setting "index": false:

"mappings": {
      "properties": {
          "location": {
              "type": "geo_point",
              "doc_values": true, 
              "index": false
          }
      }
  }

But this will disable some queries. Is your proposal to make all queries work on doc values instead of inverted data structures? I am wondering whether this is even possible for all field types.

jpountz (Contributor, Author) commented Feb 26, 2020

> Is your proposal to make all queries work on doc values instead of inverted data structures?

Yes. For instance, all numeric fields already create queries on doc values today, but only as a way to speed up query execution when the query contains another filter that is more selective than this one (via IndexOrDocValuesQuery).
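
For illustration, here is the kind of query where that helps (index and field names are hypothetical): when the term filter is highly selective, it is cheaper to verify the range clause per matching document via doc values than to evaluate it up front against the points index.

GET my-index/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term":  { "status": "error" } },
        { "range": { "response_time_ms": { "gte": 100, "lte": 500 } } }
      ]
    }
  }
}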

> I am wondering whether this is even possible for all field types.

I think this would work for any field that supports doc values, but the feature would be most useful for number fields, so I was thinking of focusing on those first.

jtibshirani (Contributor) commented:

The issue #48665 could relate to this idea -- if an index is sorted on a numeric field, we could perform fast range queries that rely only on the field's doc values.
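
As a sketch of that idea (index, settings, and field names are illustrative, and this assumes the index sort key only needs doc values):

PUT sorted-index
{
  "settings": {
    "index.sort.field": "timestamp",
    "index.sort.order": "asc"
  },
  "mappings": {
    "properties": {
      "timestamp": { "type": "long", "index": false, "doc_values": true }
    }
  }
}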

javanna (Member) commented Jan 13, 2022

This is being worked on as part of #82409 for numeric fields. @jpountz you mentioned above that you would like this type of slower query to be linked to some warning mechanism. Do you think that is a hard requirement, or can we add support for queries on doc_values even though we haven't yet figured out the details of the warning mechanism?

ywelsch added a commit that referenced this issue Jan 13, 2022
Allows searching on number field types (long, short, int, float, double, byte, half_float) when those fields are not indexed (index: false) but doc values are enabled.

This enables searches on archive data, which has access to doc values but not index structures. When combined with searchable snapshots, it allows downloading only the data for a given (doc value) field to quickly filter down to a select set of documents.

Note to reviewers:

I have split isSearchable into two separate methods, isIndexed and isSearchable, on MappedFieldType. The former is about whether actual index data structures (postings or points) have been written, and the latter about whether queries can be run on the given field (e.g. used by field caps). For number field types, queries are now allowed whenever points are available or when doc values are available (i.e. searchability is expanded).

Relates #81210 and #52728
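
As a minimal sketch of what this enables (index and field names hypothetical), a number field that has only doc values can now be queried directly:

PUT archive-index
{
  "mappings": {
    "properties": {
      "bytes": { "type": "long", "index": false, "doc_values": true }
    }
  }
}

GET archive-index/_search
{
  "query": {
    "range": { "bytes": { "gte": 1024 } }
  }
}
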
jpountz (Contributor, Author) commented Jan 13, 2022

I think the direction we're taking involves doing more and more costly operations (think of runtime fields, or identifying sequences of events with EQL), and we will want to reconsider whether we really want to warn on slow operations. So I wouldn't make this a hard requirement, and I wonder if it should be a requirement at all.

ywelsch added a commit that referenced this issue Jan 17, 2022
Similar to #82409, but for date fields.

Allows searching on date field types (date, date_nanos) when those fields are not indexed (index: false) but just doc values are enabled.

This enables searches on archive data, which has access to doc values but not index structures. When combined with searchable snapshots, it allows downloading only data for a given (doc value) field to quickly filter down to a select set of documents.

Relates #81210 and #52728
ywelsch added a commit that referenced this issue Jan 24, 2022
Allows searching on keyword fields when those fields are not indexed (index: false) but just doc values are enabled.

This enables searches on archive data, which has access to doc values but not index structures. When combined with searchable snapshots, it allows downloading only data for a given (doc value) field to quickly filter down to a select set of documents.

Relates #81210 and #52728
ywelsch added a commit that referenced this issue Jan 24, 2022
Allows searching on boolean fields when those fields are not indexed (index: false) but just doc values are enabled.

This enables searches on archive data, which has access to doc values but not index structures. When combined with searchable snapshots, it allows downloading only data for a given (doc value) field to quickly filter down to a select set of documents.

Relates #81210 and #52728
ywelsch added a commit that referenced this issue Jan 25, 2022
Allows searching on ip fields when those fields are not indexed (index: false) but just doc values are enabled.

This enables searches on archive data, which has access to doc values but not index structures. When combined with searchable snapshots, it allows downloading only data for a given (doc value) field to quickly filter down to a select set of documents.

Relates #81210 and #52728
ywelsch (Contributor) commented Jan 25, 2022

The following doc-value-only fields (term + range query support) have been implemented in 8.1.0: number types (long, short, int, float, double, byte, half_float), date types (date, date_nanos), keyword, boolean, and ip.

For feature parity with runtime fields, geo_point is still missing, as well as some other queries, such as distanceFeatureQuery support for date, or wildcard/regexp/prefix/fuzzy etc. queries for keyword.

ywelsch added a commit that referenced this issue Feb 2, 2022
Similar to #82409, but for geo_point fields.

Allows searching on geo_point fields when those fields are not indexed (index: false) but just doc values are enabled.

Also adds distance feature query support for date fields (bringing the date field to feature parity with runtime fields).

This enables searches on archive data, which has access to doc values but not index structures. When combined with searchable snapshots, it allows downloading only data for a given (doc value) field to quickly filter down to a select set of documents.

Relates #81210 and #52728
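
For illustration (index and field names hypothetical), a geo query like the following now runs against doc values alone:

GET archive-index/_search
{
  "query": {
    "geo_distance": {
      "distance": "10km",
      "pickup_location": { "lat": 40.73, "lon": -73.98 }
    }
  }
}
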
ywelsch added a commit that referenced this issue Feb 2, 2022
Adds doc-values-only search support for wildcard/regexp/prefix/fuzzy etc. queries on keyword fields.

Relates #81210 and #52728
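
For illustration (index and field names hypothetical), queries such as the following now work on a doc-values-only keyword field:

GET archive-index/_search
{
  "query": {
    "wildcard": { "host.name": { "value": "web-*" } }
  }
}
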
ywelsch (Contributor) commented Feb 2, 2022

> For feature parity with runtime fields, geo_point is still missing, as well as some other queries, such as distanceFeatureQuery support for date, or wildcard/regexp/prefix/fuzzy etc. queries for keyword.

These missing pieces for feature parity with runtime fields have now been implemented as well (ES 8.1.0).

I've used native doc-value based queries where possible, and have used their runtime field equivalents when not available.

jpountz (Contributor, Author) commented Feb 3, 2022

Implemented in 8.1.

jpountz closed this as completed on Feb 3, 2022
ywelsch (Contributor) commented Feb 3, 2022

Just some basic benchmarks:

I've run the nyc_taxis Rally benchmark twice: once with all eligible fields set to index:true, and once with all eligible fields set to index:false, providing some insight into indexing throughput / storage improvements and query impact:


Comparing baseline
  Race ID: 329a13d3-986f-4a8c-a720-9df828bc4615
  Race timestamp: 2022-02-03 13:35:21
  Challenge: append-no-conflicts
  Car: defaults
  User tags: nyc_taxis=index-true

with contender
  Race ID: 2501b726-4abe-454f-8488-567d29aa6ebf
  Race timestamp: 2022-02-03 15:01:48
  Challenge: append-no-conflicts
  Car: defaults
  User tags: nyc_taxis=index-false

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------
            
|                                                        Metric |                Task |    Baseline |   Contender |     Diff |   Unit |     Diff % |
|--------------------------------------------------------------:|--------------------:|------------:|------------:|---------:|-------:|-----------:|
|                    Cumulative indexing time of primary shards |                     |     194.355 |     153.399 | -40.9562 |    min |    -21.07% |
|             Min cumulative indexing time across primary shard |                     |     194.355 |     153.399 | -40.9562 |    min |    -21.07% |
|          Median cumulative indexing time across primary shard |                     |     194.355 |     153.399 | -40.9562 |    min |    -21.07% |
|             Max cumulative indexing time across primary shard |                     |     194.355 |     153.399 | -40.9562 |    min |    -21.07% |
|           Cumulative indexing throttle time of primary shards |                     |           0 |           0 |        0 |    min |      0.00% |
|    Min cumulative indexing throttle time across primary shard |                     |           0 |           0 |        0 |    min |      0.00% |
| Median cumulative indexing throttle time across primary shard |                     |           0 |           0 |        0 |    min |      0.00% |
|    Max cumulative indexing throttle time across primary shard |                     |           0 |           0 |        0 |    min |      0.00% |
|                       Cumulative merge time of primary shards |                     |     81.1003 |     42.0936 | -39.0067 |    min |    -48.10% |
|                      Cumulative merge count of primary shards |                     |         213 |         123 |      -90 |        |    -42.25% |
|                Min cumulative merge time across primary shard |                     |     81.1003 |     42.0936 | -39.0067 |    min |    -48.10% |
|             Median cumulative merge time across primary shard |                     |     81.1003 |     42.0936 | -39.0067 |    min |    -48.10% |
|                Max cumulative merge time across primary shard |                     |     81.1003 |     42.0936 | -39.0067 |    min |    -48.10% |
|              Cumulative merge throttle time of primary shards |                     |     2.86078 |     12.4765 |  9.61568 |    min |   +336.12% |
|       Min cumulative merge throttle time across primary shard |                     |     2.86078 |     12.4765 |  9.61568 |    min |   +336.12% |
|    Median cumulative merge throttle time across primary shard |                     |     2.86078 |     12.4765 |  9.61568 |    min |   +336.12% |
|       Max cumulative merge throttle time across primary shard |                     |     2.86078 |     12.4765 |  9.61568 |    min |   +336.12% |
|                     Cumulative refresh time of primary shards |                     |     1.45248 |    0.941633 | -0.51085 |    min |    -35.17% |
|                    Cumulative refresh count of primary shards |                     |          99 |          83 |      -16 |        |    -16.16% |
|              Min cumulative refresh time across primary shard |                     |     1.45248 |    0.941633 | -0.51085 |    min |    -35.17% |
|           Median cumulative refresh time across primary shard |                     |     1.45248 |    0.941633 | -0.51085 |    min |    -35.17% |
|              Max cumulative refresh time across primary shard |                     |     1.45248 |    0.941633 | -0.51085 |    min |    -35.17% |
|                       Cumulative flush time of primary shards |                     |     2.45805 |      1.4157 | -1.04235 |    min |    -42.41% |
|                      Cumulative flush count of primary shards |                     |          31 |          24 |       -7 |        |    -22.58% |
|                Min cumulative flush time across primary shard |                     |     2.45805 |      1.4157 | -1.04235 |    min |    -42.41% |
|             Median cumulative flush time across primary shard |                     |     2.45805 |      1.4157 | -1.04235 |    min |    -42.41% |
|                Max cumulative flush time across primary shard |                     |     2.45805 |      1.4157 | -1.04235 |    min |    -42.41% |
|                                       Total Young Gen GC time |                     |      86.211 |      70.502 |  -15.709 |      s |    -18.22% |
|                                      Total Young Gen GC count |                     |       16076 |       11686 |    -4390 |        |    -27.31% |
|                                         Total Old Gen GC time |                     |           0 |           0 |        0 |      s |      0.00% |
|                                        Total Old Gen GC count |                     |           0 |           0 |        0 |        |      0.00% |
|                                                    Store size |                     |     24.2728 |     18.5978 | -5.67501 |     GB |    -23.38% |
|                                                 Translog size |                     | 5.12227e-08 | 5.12227e-08 |        0 |     GB |      0.00% |
|                                        Heap used for segments |                     |           0 |           0 |        0 |     MB |      0.00% |
|                                      Heap used for doc values |                     |           0 |           0 |        0 |     MB |      0.00% |
|                                           Heap used for terms |                     |           0 |           0 |        0 |     MB |      0.00% |
|                                           Heap used for norms |                     |           0 |           0 |        0 |     MB |      0.00% |
|                                          Heap used for points |                     |           0 |           0 |        0 |     MB |      0.00% |
|                                   Heap used for stored fields |                     |           0 |           0 |        0 |     MB |      0.00% |
|                                                 Segment count |                     |          35 |          30 |       -5 |        |    -14.29% |
|                                   Total Ingest Pipeline count |                     |           0 |           0 |        0 |        |      0.00% |
|                                    Total Ingest Pipeline time |                     |           0 |           0 |        0 |     ms |      0.00% |
|                                  Total Ingest Pipeline failed |                     |           0 |           0 |        0 |        |      0.00% |
|                                                Min Throughput |               index |     88025.9 |      106313 |  18286.8 | docs/s |    +20.77% |
|                                               Mean Throughput |               index |     90703.2 |      108634 |  17931.1 | docs/s |    +19.77% |
|                                             Median Throughput |               index |     89972.1 |      108041 |  18068.7 | docs/s |    +20.08% |
|                                                Max Throughput |               index |       95306 |      111936 |  16629.5 | docs/s |    +17.45% |
|                                       50th percentile latency |               index |     736.684 |     630.131 | -106.553 |     ms |    -14.46% |
|                                       90th percentile latency |               index |      996.58 |     764.178 | -232.402 |     ms |    -23.32% |
|                                       99th percentile latency |               index |     2282.64 |      1646.3 | -636.339 |     ms |    -27.88% |
|                                     99.9th percentile latency |               index |     3556.14 |     2379.08 | -1177.05 |     ms |    -33.10% |
|                                    99.99th percentile latency |               index |     4885.32 |     2653.19 | -2232.13 |     ms |    -45.69% |
|                                      100th percentile latency |               index |     5491.51 |     2778.72 | -2712.78 |     ms |    -49.40% |
|                                  50th percentile service time |               index |     736.684 |     630.131 | -106.553 |     ms |    -14.46% |
|                                  90th percentile service time |               index |      996.58 |     764.178 | -232.402 |     ms |    -23.32% |
|                                  99th percentile service time |               index |     2282.64 |      1646.3 | -636.339 |     ms |    -27.88% |
|                                99.9th percentile service time |               index |     3556.14 |     2379.08 | -1177.05 |     ms |    -33.10% |
|                               99.99th percentile service time |               index |     4885.32 |     2653.19 | -2232.13 |     ms |    -45.69% |
|                                 100th percentile service time |               index |     5491.51 |     2778.72 | -2712.78 |     ms |    -49.40% |
|                                                    error rate |               index |           0 |           0 |        0 |      % |      0.00% |
|                                                Min Throughput |             default |     3.01882 |     3.01892 |  0.00011 |  ops/s |      0.00% |
|                                               Mean Throughput |             default |     3.03072 |     3.03086 |  0.00014 |  ops/s |      0.00% |
|                                             Median Throughput |             default |       3.028 |     3.02814 |  0.00014 |  ops/s |      0.00% |
|                                                Max Throughput |             default |     3.05434 |     3.05458 |  0.00024 |  ops/s |      0.01% |
|                                       50th percentile latency |             default |     9.08409 |     9.15136 |  0.06727 |     ms |     +0.74% |
|                                       90th percentile latency |             default |     10.2095 |     9.89905 | -0.31045 |     ms |     -3.04% |
|                                       99th percentile latency |             default |     15.8616 |      10.952 | -4.90962 |     ms |    -30.95% |
|                                      100th percentile latency |             default |     16.8962 |      12.853 | -4.04328 |     ms |    -23.93% |
|                                  50th percentile service time |             default |     7.64502 |     7.66309 |  0.01807 |     ms |     +0.24% |
|                                  90th percentile service time |             default |     8.80482 |     8.43405 | -0.37077 |     ms |     -4.21% |
|                                  99th percentile service time |             default |     14.1504 |     9.90026 | -4.25015 |     ms |    -30.04% |
|                                 100th percentile service time |             default |     15.0338 |     11.5141 | -3.51969 |     ms |    -23.41% |
|                                                    error rate |             default |           0 |           0 |        0 |      % |      0.00% |
|                                                Min Throughput |               range |    0.701676 |    0.704622 |  0.00295 |  ops/s |     +0.42% |
|                                               Mean Throughput |               range |    0.702749 |     0.70761 |  0.00486 |  ops/s |     +0.69% |
|                                             Median Throughput |               range |    0.702505 |     0.70692 |  0.00442 |  ops/s |     +0.63% |
|                                                Max Throughput |               range |    0.704951 |    0.713771 |  0.00882 |  ops/s |     +1.25% |
|                                       50th percentile latency |               range |     734.247 |     8.50008 | -725.747 |     ms |    -98.84% |
|                                       90th percentile latency |               range |     751.682 |      9.3546 | -742.328 |     ms |    -98.76% |
|                                       99th percentile latency |               range |     758.811 |     10.4575 | -748.354 |     ms |    -98.62% |
|                                      100th percentile latency |               range |     762.375 |     10.8897 | -751.485 |     ms |    -98.57% |
|                                  50th percentile service time |               range |     732.134 |     5.93663 | -726.198 |     ms |    -99.19% |
|                                  90th percentile service time |               range |     749.884 |     6.76047 | -743.123 |     ms |    -99.10% |
|                                  99th percentile service time |               range |     756.849 |     8.03772 | -748.811 |     ms |    -98.94% |
|                                 100th percentile service time |               range |     759.911 |     8.27144 |  -751.64 |     ms |    -98.91% |
|                                                    error rate |               range |           0 |           0 |        0 |      % |      0.00% |
|                                                Min Throughput | distance_amount_agg |     2.01194 |     2.01193 |   -1e-05 |  ops/s |     -0.00% |
|                                               Mean Throughput | distance_amount_agg |     2.01965 |     2.01962 |   -3e-05 |  ops/s |     -0.00% |
|                                             Median Throughput | distance_amount_agg |     2.01786 |     2.01783 |   -3e-05 |  ops/s |     -0.00% |
|                                                Max Throughput | distance_amount_agg |     2.03528 |     2.03526 |   -2e-05 |  ops/s |     -0.00% |
|                                       50th percentile latency | distance_amount_agg |     6.17123 |     6.11228 | -0.05894 |     ms |     -0.96% |
|                                       90th percentile latency | distance_amount_agg |     7.36044 |     7.13197 | -0.22846 |     ms |     -3.10% |
|                                       99th percentile latency | distance_amount_agg |     7.75873 |     8.25812 |  0.49939 |     ms |     +6.44% |
|                                      100th percentile latency | distance_amount_agg |     8.08622 |     14.2681 |  6.18191 |     ms |    +76.45% |
|                                  50th percentile service time | distance_amount_agg |     4.60604 |     4.56346 | -0.04258 |     ms |     -0.92% |
|                                  90th percentile service time | distance_amount_agg |     5.66406 |     5.51616 |  -0.1479 |     ms |     -2.61% |
|                                  99th percentile service time | distance_amount_agg |     6.13562 |     6.46549 |  0.32987 |     ms |     +5.38% |
|                                 100th percentile service time | distance_amount_agg |     6.31664 |     12.9113 |  6.59465 |     ms |   +104.40% |
|                                                    error rate | distance_amount_agg |           0 |           0 |        0 |      % |      0.00% |
|                                                Min Throughput |       autohisto_agg |     1.48352 |    0.999732 | -0.48379 |  ops/s |    -32.61% |
|                                               Mean Throughput |       autohisto_agg |     1.49077 |     1.10926 |  -0.3815 |  ops/s |    -25.59% |
|                                             Median Throughput |       autohisto_agg |     1.49156 |      1.1199 | -0.37166 |  ops/s |    -24.92% |
|                                                Max Throughput |       autohisto_agg |     1.49437 |     1.16959 | -0.32478 |  ops/s |    -21.73% |
|                                       50th percentile latency |       autohisto_agg |     449.693 |     21635.7 |    21186 |     ms |  +4711.21% |
|                                       90th percentile latency |       autohisto_agg |     462.611 |     26051.9 |  25589.3 |     ms |  +5531.49% |
|                                       99th percentile latency |       autohisto_agg |     473.142 |     27035.5 |  26562.4 |     ms |  +5614.04% |
|                                      100th percentile latency |       autohisto_agg |     474.255 |     27146.6 |  26672.4 |     ms |  +5624.05% |
|                                  50th percentile service time |       autohisto_agg |     448.318 |     777.905 |  329.587 |     ms |    +73.52% |
|                                  90th percentile service time |       autohisto_agg |     461.221 |     792.415 |  331.194 |     ms |    +71.81% |
|                                  99th percentile service time |       autohisto_agg |     471.023 |     805.855 |  334.832 |     ms |    +71.09% |
|                                 100th percentile service time |       autohisto_agg |     473.589 |     806.224 |  332.635 |     ms |    +70.24% |
|                                                    error rate |       autohisto_agg |           0 |           0 |        0 |      % |      0.00% |
|                                                Min Throughput |  date_histogram_agg |     1.50061 |     1.25783 | -0.24278 |  ops/s |    -16.18% |
|                                               Mean Throughput |  date_histogram_agg |     1.50085 |     1.25821 | -0.24264 |  ops/s |    -16.17% |
|                                             Median Throughput |  date_histogram_agg |     1.50081 |     1.25825 | -0.24256 |  ops/s |    -16.16% |
|                                                Max Throughput |  date_histogram_agg |     1.50122 |     1.25853 | -0.24269 |  ops/s |    -16.17% |
|                                       50th percentile latency |  date_histogram_agg |     246.812 |     96794.2 |  96547.3 |     ms | +39117.83% |
|                                       90th percentile latency |  date_histogram_agg |     259.229 |      122427 |   122168 |     ms | +47127.42% |
|                                       99th percentile latency |  date_histogram_agg |     266.048 |      128103 |   127837 |     ms | +48050.18% |
|                                      100th percentile latency |  date_histogram_agg |     274.515 |      128754 |   128479 |     ms | +46802.27% |
|                                  50th percentile service time |  date_histogram_agg |     244.722 |     794.614 |  549.892 |     ms |   +224.70% |
|                                  90th percentile service time |  date_histogram_agg |      257.43 |     808.432 |  551.002 |     ms |   +214.04% |
|                                  99th percentile service time |  date_histogram_agg |     263.994 |      822.15 |  558.156 |     ms |   +211.43% |
|                                 100th percentile service time |  date_histogram_agg |     271.715 |     831.092 |  559.377 |     ms |   +205.87% |
|                                                    error rate |  date_histogram_agg |           0 |           0 |        0 |      % |      0.00% |

ywelsch (Contributor) commented Feb 4, 2022

Some observations on the above:

  • Indexing throughput improved by roughly 20% and the store size shrank by roughly 23%.
  • The target-throughput for many queries/aggs in the benchmark was defined too high to account for doc-value-only fields, so latency can be ignored in this comparison; we should look only at service time here (or go through the benchmark again and define new target-throughput values).
  • Doc-value-based queries are not performing too badly. The range query even performed better. This is because track_total_hits is false (by default) and the query only retrieves the top 10 docs (in _doc order), allowing it to terminate quickly without running over the full data set.
  • As an additional data point, I ran the range query with track_total_hits:true: the doc-value-based range query takes 3.1s over the full nyc_taxis data set vs 1.3s for the point-based one (see the sketch after this list).
  • While aggregations look slower here for doc-value-only fields, that is because a query always filters docs first here, so we're not measuring pure agg performance.
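
For reference, the track_total_hits comparison above can be reproduced with something like the following (the exact field and range bounds of the nyc_taxis track's range operation are assumptions here, so treat them as illustrative):

GET nyc_taxis/_search
{
  "track_total_hits": true,
  "query": {
    "range": { "total_amount": { "gte": 5, "lt": 15 } }
  }
}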
