Implement first stages of the ZDT migration algorithm #152219
Conversation
Self-review
```ts
/**
 * The current model versions of the mapping of the index.
 *
 * @remark Only defined for indices using the zdt migration algorithm.
 */
mappingVersions?: { [k: string]: number };
```
`IndexMapping` / `IndexMappingMeta` are used in a lot of places, and it was going to be hard to have distinct types for each algo without entering the generic loop of hell. Using the same type for both algorithms' meta felt like the most pragmatic approach by a very large margin.
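For illustration, a minimal sketch of what such a shared meta type could look like. The `mappingVersions` field comes from the snippet above; the v2 field name `migrationMappingPropertyHashes` is an assumption for this sketch:

```typescript
// Hedged sketch: one meta interface shared by both algorithms, with
// algorithm-specific fields kept optional rather than split into generics.
interface IndexMappingMeta {
  // v2 algorithm: hash of each type's mappings (assumed field name)
  migrationMappingPropertyHashes?: { [k: string]: string };
  // zdt algorithm: current model version of each type's mappings
  mappingVersions?: { [k: string]: number };
}

// An index managed by the zdt algorithm only populates the zdt fields:
const zdtMeta: IndexMappingMeta = {
  mappingVersions: { dashboard: 2, visualization: 1 },
};
console.log(zdtMeta.mappingVersions?.dashboard); // 2
```

The optional-fields approach trades some type safety for a single type usable everywhere, which is the pragmatism the comment above describes.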
```ts
case 'CREATE_TARGET_INDEX':
  return Stages.createTargetIndex(
    current,
    response as StateActionResponse<'CREATE_TARGET_INDEX'>,
    context
  );
```
NIT: I wanted to automate the dispatch by just defining a map of `AllActionStates => ModelStage`, but it was a pain given the non-inferable generic parameters of `ModelStage`. So for now it's a 4-line copy-paste for each stage. I will eventually revisit it later.
```ts
describe('Action: init', () => {
  let context: MockedMigratorContext;

  const currentIndex = '.kibana_1';
```
The only stage I added unit tests for until we validate the workflow.
```ts
export const init: ModelStage<
  'INIT',
  'CREATE_TARGET_INDEX' | 'UPDATE_INDEX_MAPPINGS' | 'UPDATE_ALIASES' | 'FATAL'
> = (state, res, context) => {
```
So, this is a big stage. I wondered if it made sense to split it into multiple stages, just to increase readability and isolation of concerns. But given that all the data necessary for all the decisions made in this stage is fetched in a single request, splitting the stage was causing artificial noop-type actions and transitions just to improve an artificial isolation, so I didn't like it. Also, we have the same kind of gigantic init stage logic in the v2 algo, so it's probably our best option.
```ts
// completed. Retry this step to see if the task has completed after an
// exponential delay. We will basically keep polling forever until the
// Elasticsearch task succeeds or fails.
return delayRetryState(state, left.message, Number.MAX_SAFE_INTEGER);
```
NIT: not a big fan of the `Number.MAX_SAFE_INTEGER`, but that's what we do in the v2 algorithm so I did the same.
Yeah... the thinking is that if there's an active task picking up the mappings, then it's kinda bad if we throw a FATAL and then start again by creating another update_by_query... so we just wait... forever.
But there's probably some point at which we could say that if it takes 2 days, something is wrong and we want an SRE to take a look.
```ts
return {
  ...state,
  controlState: 'DONE',
};
```
Note: I'm currently moving to `DONE` so that the algo completes, and so I can run integration tests against the part of the logic already implemented.
When we add the next part (doc migration), this will transition to the next steps of the migration.
```ts
// TODO: later we will want to generate the proper diff from `SavedObjectsModelExpansionChange.addedMappings`
// for this first implementation this is acceptable given we only allow compatible mapping changes anyway.
// we may want to implement the proper logic before this gets used by real (non-test) type owners.

const changedTypes = delta.diff.map((diff) => diff.name);

const addedMappings: SavedObjectsMappingProperties = {};
changedTypes.forEach((type) => {
  addedMappings[type] = typeMap[type].mappings;
});
```
As commented: in theory, we'll want to generate the additive mapping changes based on the mappings defined on each model version of the delta. However, given we're planning on only allowing compatible mapping changes before the algo gets released, it's probably fine for now to just use the global mappings of the types.
Or at least I assumed so, WDYT?
> Or at least I assumed so, WDYT?

Makes sense to me. Just trying to make sure I understand this in practice: with this approach, we will be sending an update mappings request that contains the types' full set of mappings rather than only the subset of new fields?
```ts
const actions: AliasAction[] = [];

const globalAlias = indexPrefix;
const versionAlias = `${indexPrefix}_${kibanaVersion}`;
```
So, I decided to keep the version alias for now.
Main reason is that the SOR is accessing the indices via these aliases at the moment:
Lines 18 to 25 in 34c228b
```ts
export const getIndexForType = ({
  type,
  typeRegistry,
  defaultIndex,
  kibanaVersion,
}: GetIndexForTypeOptions): string => {
  return `${typeRegistry.getIndex(type) || defaultIndex}_${kibanaVersion}`;
};
```
And I didn't want to handle this problem right now.
We can revisit / think about it when we perform the SOR changes in #150312.
```ts
let lastCount = -1;
Object.keys(indices).forEach((indexName) => {
  const match = matcher.exec(indexName);
  if (match && match.groups?.counter) {
    const suffix = parseInt(match.groups.counter, 10);
    lastCount = Math.max(lastCount, suffix);
  }
});
```
Note: there should always be a single `[prefix]_1` index present, but just in case (and because I wanted to add a named-group regexp), we're scanning all the indices for the highest suffix.
NIT: if somebody entered `.kibana_003`, the function would return `.kibana_3`.
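A self-contained sketch of that scan (the `.kibana` prefix and the `getLastCount` name are assumptions for this illustration). It also shows why the `.kibana_003` case above behaves that way: `parseInt` collapses leading zeros.

```typescript
// Hypothetical standalone version of the suffix scan: a named-group regexp
// extracts the numeric counter, and we keep the highest one seen.
const matcher = /^\.kibana_(?<counter>\d+)$/;

const getLastCount = (indexNames: string[]): number => {
  let lastCount = -1;
  for (const indexName of indexNames) {
    const match = matcher.exec(indexName);
    if (match?.groups?.counter) {
      // parseInt drops leading zeros: '.kibana_003' yields counter 3
      lastCount = Math.max(lastCount, parseInt(match.groups.counter, 10));
    }
  }
  return lastCount;
};

console.log(getLastCount(['.kibana_1', '.kibana_003'])); // 3
```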
```ts
describe('ZDT upgrades - running on a fresh cluster', () => {
  let esServer: TestElasticsearchUtils['es'];

  const startElasticsearch = async () => {
    const { startES } = createTestServers({
      adjustTimeout: (t: number) => jest.setTimeout(t),
```
Some basic integration tests on the part of the algo implemented in this PR. Will add more for edge cases once we agree on the overall workflow.
Great work @pgayvallet! I left a few non-blocking comments, and overall this is looking like a solid start.
For my own understanding: is the "update alias" step something that will mostly be a noop for serverless (since we will deploy faster than new stack versions)?
If yes: I am a little concerned that we will not have a way to prevent old nodes from continuing to write to our SO index. I know this is something we will revise along with changes to SOR, but I was curious about whether we have a plan for how to manage more frequent non-stack version releases - are we going to use sub-patch versions like build numbers?
[UPDATE] Or are we guaranteed to have no "old nodes" running at that point?
Approving to unblock progress, but I think getting another review would be great.
```ts
const logFileContent = await fs.readFile(logFilePath, 'utf-8');
const records = logFileContent
  .split('\n')
  .filter(Boolean)
  .map((str) => JSON5.parse(str)) as LogRecord[];

const expectLogsContains = (messagePrefix: string) => {
  expect(records.find((entry) => entry.message.includes(messagePrefix))).toBeDefined();
};

expectLogsContains('INIT -> CREATE_TARGET_INDEX');
expectLogsContains('CREATE_TARGET_INDEX -> UPDATE_ALIASES');
expectLogsContains('UPDATE_ALIASES -> DONE');
expectLogsContains('Migration completed');
```
Minor comment: I've seen this approached a couple different ways (partly my fault), and I think it is worth consolidating:
1. The `JSON5.parse` way (used here).
2. Reading the full string content from the log file, then `expect(logs).toMatch('A -> B')`.

Way (2) is provided by the `kibana_migrator_test_kit` and used in src/core/server/integration_tests/saved_objects/migrations/group3/skip_reindex.test.ts, for example.
Not major, but for these Jest integration tests it would be nice to use one approach. Personally I prefer (2) because it is "good enough" and avoids using a lib. Disclaimer: I have not benchmarked either of these.
cc @gsoldevila
In the recent enhancements to the migration logic I ended up using (2): parsing the whole text into a string, and then using `expect(logs).toMatch('certain string');`.
I tried to simplify this and extract it into a method that allowed checking that multiple strings were present in the logs. Whilst it was clearer in terms of readability, it added more calls to the stack, and in case of failure it prevented me from seeing the real call that was failing (Jest is currently printing 3 lines of the stack). Thus, I tend to agree with JL for these 2 reasons (avoiding a lib + clearer stack traces).
> Personally I prefer (2) bc it is "good enough"

Personal opinion, but I don't think it's good enough for all scenarios.
The limitation of the `expect(myWholeFileAsAString).toContain(someText)` approach is that you won't be able to perform advanced tests on the order of the logs (and imho we should assert against the order of the logs too).
Like what I did here a while ago (I know, I'm not doing it, yet, in the test we're talking about):
Lines 122 to 127 in fcf1098
```ts
expectMatchOrder(errorLines, [
  {
    mode: 'equal',
    value: '- foo:3: Error: Migration function for version 7.14.0 threw an error',
  },
  {
```
Yes, the error messages are terrible in case of failure, won't deny it. But maybe adding our own matchers/expects (not even sure what the term is for Jest) would be the way to go?
> I don't think it's good enough for all scenarios... the order of the logs

Good point! I hadn't thought of order. Just brainstorming here:

```ts
expectLogOrder(logs, stringsToMatch: string[]);
// ...
const logs = (await readLog()).split('\n');
expectLogOrder(logs, ['INIT -> FOO', 'FOO -> BAR' /* , ... */]); // or other messages
```

But yeah, happy to go with whoever has the strongest opinion. Would just be nice to keep the test itself super lean.
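One possible shape for such a helper (hypothetical, not code from this PR): scan the log lines with a moving cursor, so each expected fragment must appear strictly after the previous one.

```typescript
// Hypothetical ordered-log assertion: throws if a fragment is missing, or
// only appears before the previously matched fragment.
const expectLogOrder = (logLines: string[], expected: string[]): void => {
  let cursor = 0;
  for (const fragment of expected) {
    const idx = logLines.findIndex((line, i) => i >= cursor && line.includes(fragment));
    if (idx === -1) {
      throw new Error(`log fragment not found in order: "${fragment}"`);
    }
    cursor = idx + 1;
  }
};

const logs = [
  'INIT -> CREATE_TARGET_INDEX',
  'CREATE_TARGET_INDEX -> UPDATE_ALIASES',
  'UPDATE_ALIASES -> DONE',
];
expectLogOrder(logs, ['INIT -> CREATE_TARGET_INDEX', 'UPDATE_ALIASES -> DONE']); // passes
```

Wrapping this in a custom Jest matcher via `expect.extend` would also give better failure output than a bare `throw`.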
```ts
    bar: 5,
  },
});
const versionMap = getModelVersionsFromMappings({ mappings, source: 'mappingVersions' });
```
We are not testing the scenario where types don't have a valid model version.
```ts
  deletedTypes,
}: CheckVersionCompatibilityOpts): CompareModelVersionResult => {
  const appVersions = getModelVersionMapForTypes(types);
  const indexVersions = getModelVersionsFromMappings({ mappings, source });
```
Looking at the code, both `getModelVersionMapForTypes` and `getModelVersionsFromMappings` will throw if any of the mapping versions is invalid. We don't seem to be catching these errors here, nor in the different places where we call these two methods. I suppose such configuration errors are critical and should make Kibana fail, but I just wanted to make sure that this is the expected behavior.
`getModelVersionMapForTypes` will never throw, because we validated each type during its registration.
`getModelVersionsFromMappings` could throw if someone manually tampered with the index / cluster state. But yeah, in that case it would be fatal. We could be more explicit instead of implicitly bubbling up the thrown error, though, eventually.
I think it's fine to have uncaught exceptions bubbling up for scenarios where it really should never happen. If we expect it to happen when e.g. Elasticsearch is unhealthy it feels worth catching but in this case I think it's fine. On 8.x we've only seen Elastic internal staff meddle with system indices while testing.
```ts
  deletedTypes: [],
});

expect(result.status).toEqual('conflict');
```
I'm probably missing something here, but why is this test behaving differently from the last one? Oh I see, the `deletedTypes` is used to explicitly specify which types have been deleted. Nvm!
```ts
 */
export const getModelVersionMapForTypes = (types: SavedObjectsType[]): ModelVersionMap => {
  return types.reduce<ModelVersionMap>((versionMap, type) => {
    versionMap[type.name] = getLatestModelVersion(type);
```
I'm wondering if it would be worth checking for "no gaps" in the defined versions.
It's done during type registration validation, so we take the validation for granted later on.
Lines 95 to 103 in e64d545
```ts
if (missingVersions.length) {
  throw new Error(
    `Type ${
      type.name
    }: gaps between model versions aren't allowed (missing versions: ${missingVersions.join(
      ','
    )})`
  );
}
```
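For illustration, a minimal standalone sketch of the gap check quoted above (the `getMissingVersions` helper name is an assumption; the real validation lives in the registration code):

```typescript
// Hypothetical sketch: given the declared model version numbers for a type,
// return the versions missing from the contiguous 1..max sequence.
const getMissingVersions = (versions: number[]): number[] => {
  const max = Math.max(0, ...versions);
  const declared = new Set(versions);
  const missing: number[] = [];
  for (let v = 1; v <= max; v++) {
    if (!declared.has(v)) missing.push(v);
  }
  return missing;
};

console.log(getMissingVersions([1, 2, 4])); // [3] -> registration would throw
console.log(getMissingVersions([1, 2, 3])); // [] -> no gaps, registration passes
```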
Pinging @elastic/kibana-core (Team:Core)
No new comments since my last review. Great work @pgayvallet , exciting to see ZDT moving forward 🍻 . Also, thanks for adding those tests.
💚 Build Succeeded
LGTM! The 2-layer separation in stages is a great strategy to avoid the huge model.ts.
Looking forward to seeing the transform part:
- In v2 we're doing the transform before updating the mappings. I believe we're doing this to force all instances to go through that phase, so this is not mandatory for serverless.
- Also, we're doing it systematically at startup, to account for the case where we re-enable a disabled plugin and must suddenly migrate its SOs. I'm not sure this applies to serverless. Can plugins be enabled / disabled there? Will we disable them for all instances (including the one with the migrator role)?
- As an improvement, we said that we could avoid "picking up" the whole index after updating the mappings:
  - We could pick up only the types that have a new version.
  - We could take into account that some documents are transformed (and already updated). We can probably default to an identity transform function for the types that have a new version and don't define a custom transform function (is that even possible?). This way, we can probably get rid of the "pickup operation" after updating the mappings.
```ts
return {
  mappingVersions: modelVersions,
  docVersions: modelVersions,
```
So when `docVersions` has `dashboard: '2'`, the repository will switch to v2 writes? And then, once we have transformations, we'll have another meta key to indicate that the transformation is complete and plugins can switch to v2 queries?
That's correct
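To make the mechanics concrete, a hedged sketch of how a repository could derive the write version from the index meta. The field names come from the snippet above; the selection logic and names (`IndexMeta`, `getWriteVersion`) are assumptions, not the PR's actual code:

```typescript
// Hypothetical: pick the write version for a type from the index meta,
// falling back to version 1 when the type is not listed yet.
interface IndexMeta {
  mappingVersions?: Record<string, number>;
  docVersions?: Record<string, number>;
}

const getWriteVersion = (meta: IndexMeta, type: string): number =>
  meta.docVersions?.[type] ?? 1;

const meta: IndexMeta = { docVersions: { dashboard: 2 } };
console.log(getWriteVersion(meta, 'dashboard')); // 2 -> repository writes v2 documents
console.log(getWriteVersion(meta, 'visualization')); // 1 -> still writing v1 documents
```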
## Summary (follow-up PR)
Part of #150309. Follow-up of #152219.
Implements the second part of the zero-downtime migration algorithm: the document conversion.
### TODO / notepad
- ~~check that all types have model versions in INIT~~ will do later when we start to have real types using MVs
- [x] Optimize to skip document migration when creating a new index
- [x] documentsUpdateInit: extract remaining logic to utilities
- [x] outdatedDocumentsSearchRead: clean up corrupted doc logic
- [x] outdatedDocumentsSearchTransform: clean up corrupted doc logic
- [x] tests for /zdt/actions/wait_for_delay.ts?
- ~~support for coreMigrationVersion~~ added as a follow-up in the parent issue
- [x] init -> equal -> check if aliasActions is empty

Co-authored-by: kibanamachine <[email protected]>
Summary
Part of #150309
This PR implements the first stage (mapping check / update) of the ZDT algorithm, following the schema from the design document.