tsdb: Fix split16 to avoid indexing 14 docs from the end #378

pquentin · 2023-02-06T07:59:57Z

We drop 2 documents from the corpus so that the document count is a multiple of 16, which allows Rally to split the corpus at exact boundaries.
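The arithmetic behind this can be sketched roughly as follows. This is only an illustration: `docs_to_drop` and the example count are hypothetical (the real corpus size is not stated here), and the sketch assumes an even 16-way split rather than reproducing Rally's actual partitioning code.

```python
def docs_to_drop(doc_count: int, clients: int = 16) -> int:
    """Return how many trailing documents to drop so that the
    remaining count is an exact multiple of `clients`."""
    if clients <= 0:
        raise ValueError("clients must be positive")
    return doc_count % clients


# Hypothetical document count, chosen only so that the remainder is 2.
example_count = 16 * 1000 + 2
print(docs_to_drop(example_count))  # prints 2
```

After dropping that remainder, every one of the 16 parts holds the same number of documents.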

To confirm the fix, I've run the following queries at 2% of indexing and nothing from 2021-04-29 got indexed:

% curl http://localhost:39200/tsdb/_search --data-binary @d.json -H 'Content-Type: application/json' | jq .
{
  "took": 2557,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "NAME": {
      "buckets": [
        {
          "key_as_string": "2021-04-28T17:00:00.000Z",
          "key": 1619629200000,
          "doc_count": 2099499
        },
        {
          "key_as_string": "2021-04-28T18:00:00.000Z",
          "key": 1619632800000,
          "doc_count": 484385
        }
      ]
    }
  }
}

with d.json having the following contents:

{
  "size": 0,
  "aggs": {
    "NAME": {
      "date_histogram": {
        "field": "@timestamp",
        "calendar_interval": "hour"
      }
    }
  }
}
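The same check could also be scripted instead of read off by eye. The sketch below is an assumption-laden illustration, not part of this change: `buckets_on_day` is a hypothetical helper, and the `response` dict is just the aggregation part of the output above, pasted in by hand.

```python
def buckets_on_day(response: dict, day: str, agg_name: str = "NAME") -> list:
    """Return the date_histogram buckets whose key_as_string falls on `day`
    (given as YYYY-MM-DD)."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [b for b in buckets if b["key_as_string"].startswith(day)]


# Aggregation section of the search response shown above.
response = {
    "aggregations": {
        "NAME": {
            "buckets": [
                {"key_as_string": "2021-04-28T17:00:00.000Z",
                 "key": 1619629200000, "doc_count": 2099499},
                {"key_as_string": "2021-04-28T18:00:00.000Z",
                 "key": 1619632800000, "doc_count": 484385},
            ]
        }
    }
}

# No bucket should fall on 2021-04-29 this early in the indexing run.
assert buckets_on_day(response, "2021-04-29") == []
```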

dliappis

LGTM

dliappis · 2023-02-06T09:46:55Z

tsdb/README.md

+### Generating the split16 corpus
+
+By default, with N indexing clients Rally will split documents.json in N parts and bulk index from
+them in parallel. As a result, by default ingest is not done in order, which makes TSDB sorting


These two sentences might be useful to add to the Rally docs.

tsdb: Fix split16 to avoid indexing 14 docs from the end

2a3a189

pquentin added the bug label Feb 6, 2023

pquentin requested review from martijnvg and dliappis February 6, 2023 07:59

pquentin self-assigned this Feb 6, 2023

dliappis approved these changes Feb 6, 2023

pquentin merged commit 7df06b3 into elastic:master Feb 6, 2023

pquentin deleted the tsdb-split16-fix branch February 6, 2023 12:22

This was referenced Feb 7, 2023

Document bulk behavior with multiple clients/documents elastic/rally#1666

Merged

Allow indexing data in order with multiple indexing clients elastic/rally#1650

Closed
