Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement first stages of the ZDT migration algorithm #152219

Merged
merged 32 commits into from
Mar 8, 2023

Conversation

pgayvallet
Copy link
Contributor

@pgayvallet pgayvallet commented Feb 27, 2023

Summary

Part of #150309

This PR implements the first stage (mapping check / update) of the ZDT algorithm, following the schema from the design document:

Screenshot 2023-02-28 at 09 23 07

Which translates to this:

Screenshot 2023-03-01 at 14 30 50

@pgayvallet pgayvallet added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc release_note:skip Skip the PR/issue when compiling release notes Feature:Migrations v8.8.0 labels Feb 27, 2023
Copy link
Contributor Author

@pgayvallet pgayvallet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-review

Comment on lines +68 to +73
/**
* The current model versions of the mapping of the index.
*
* @remark: Only defined for indices using the zdt migration algorithm.
*/
mappingVersions?: { [k: string]: number };
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IndexMapping / IndexMappingMeta are used in a lot of places, and it was going to be hard to have distinct types for each algo without entering the Generic loop of hell. Using the same type for both algo's meta felt like the most pragmatic approach by a very large margin.

Comment on lines +35 to +40
case 'CREATE_TARGET_INDEX':
return Stages.createTargetIndex(
current,
response as StateActionResponse<'CREATE_TARGET_INDEX'>,
context
);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: I wanted to automate the dispatch by just defining a map of AllActionStates=>ModelStage, but it was a pain given the non unknow-able generic parameters of ModelStage. So for now it's a 4 line copy pasta for each stage. I will eventually revisit later.

Comment on lines 22 to 26
describe('Action: init', () => {
let context: MockedMigratorContext;

const currentIndex = '.kibana_1';

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only stage I added unit test for until we validate the workflow

Comment on lines +24 to +27
export const init: ModelStage<
'INIT',
'CREATE_TARGET_INDEX' | 'UPDATE_INDEX_MAPPINGS' | 'UPDATE_ALIASES' | 'FATAL'
> = (state, res, context) => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this is a big stage. I wondered if it made sense to split it into multiple, just to increase readability and isolation of concern, but given all the data necessary for all the decisions made in this stage are fetched in a single request, splitting the stage was causing artificial noop-type actions and transitions just to improve an artificial isolation, so I didn't like it. Also, we have the same kind of gigantic init stage logic in the v2 algo, so it's probably our best option.

// completed. Retry this step to see if the task has completed after an
// exponential delay. We will basically keep polling forever until the
// Elasticsearch task succeeds or fails.
return delayRetryState(state, left.message, Number.MAX_SAFE_INTEGER);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: not a big fan of the Number.MAX_SAFE_INTEGER, but that's what we do in the v2 algorithm so I did the same.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah... the thinking is that if there's a active task picking up the mappings then it's kinda bad if we throw a FATAL and then start again creating another update_by_query... so we just wait... forever.

But there's probably some point at which we could say if it takes 2 days something is wrong and we want an SRE to take a look.

Comment on lines +21 to +24
return {
...state,
controlState: 'DONE',
};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: I'm currently moving to DONE so that the algo completes, and so I can run integration test against the part of the logic already implemented.

When we'll be added the next part (doc migration), this will transition to the next steps of the migration.

Comment on lines +57 to +66
// TODO: later we will want to generate the proper diff from `SavedObjectsModelExpansionChange.addedMappings`
// for this first implementation this is acceptable given we only allow compatible mapping changes anyway.
// we may want to implement the proper logic before this get used by real (non-test) type owners.

const changedTypes = delta.diff.map((diff) => diff.name);

const addedMappings: SavedObjectsMappingProperties = {};
changedTypes.forEach((type) => {
addedMappings[type] = typeMap[type].mappings;
});
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As commented: In theory, we'll want to generate the additive mapping changes based on the mappings defined on each model version of the delta. However, given we're planning on only allowing compatible mapping changes before the algo get released, it's probably fine for now to just use the global mappings of the types.

Or at least I assumed so, WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or at least I assumed so, WDYT?

Makes sense to me. Just trying to make sure I understand this in practice:

With this approach we will be sending an update mappings request that contains the types' full set of mappings rather than only the subset of new fields?

const actions: AliasAction[] = [];

const globalAlias = indexPrefix;
const versionAlias = `${indexPrefix}_${kibanaVersion}`;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I decided to keep the version alias for now.

Main reason is that the SOR is accessing the indices via these aliases at the moment:

export const getIndexForType = ({
type,
typeRegistry,
defaultIndex,
kibanaVersion,
}: GetIndexForTypeOptions): string => {
return `${typeRegistry.getIndex(type) || defaultIndex}_${kibanaVersion}`;
};

And I didn't want to handle this problem right now.

We can revisit / think about it when we'll perform the SOR changes in #150312

Comment on lines +18 to +25
let lastCount = -1;
Object.keys(indices).forEach((indexName) => {
const match = matcher.exec(indexName);
if (match && match.groups?.counter) {
const suffix = parseInt(match.groups.counter, 10);
lastCount = Math.max(lastCount, suffix);
}
});
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: there should always be a single [prefix]_1 index present, but just in case of (and because I wanted to add a named group regexp), we're scanning all the indices for the highest suffix.

Copy link
Contributor

@gsoldevila gsoldevila Mar 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT if somebody entered '.kibana_003' the function would return '.kibana_3'.

Comment on lines +20 to +25
describe('ZDT upgrades - running on a fresh cluster', () => {
let esServer: TestElasticsearchUtils['es'];

const startElasticsearch = async () => {
const { startES } = createTestServers({
adjustTimeout: (t: number) => jest.setTimeout(t),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some basic integration tests on the part of the algo implemented in this PR. Will add more for edge cases once we agree on the overall workflow.

Copy link
Contributor

@jloleysens jloleysens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @pgayvallet ! I left a few non-blocker comments and overall this is looking like a solid start.

For my own understanding: is the "update alias" step something that will mostly be a noop for serverless (since we will deploy faster than new stack versions)?

If yes: I am a little concerned that we will not have a way to prevent old nodes from continuing to write to our SO index. I know this is something we will revise along with changes to SOR, but I was curious about whether we have a plan for how to manage more frequent non-stack version releases - are we going to use sub-patch versions like build numbers?

[UPDATE] Or are we guaranteed to have no "old nodes" running at that point?

Approving to unblock progress, but I think getting another review would be great.

Comment on lines +57 to +66
// TODO: later we will want to generate the proper diff from `SavedObjectsModelExpansionChange.addedMappings`
// for this first implementation this is acceptable given we only allow compatible mapping changes anyway.
// we may want to implement the proper logic before this get used by real (non-test) type owners.

const changedTypes = delta.diff.map((diff) => diff.name);

const addedMappings: SavedObjectsMappingProperties = {};
changedTypes.forEach((type) => {
addedMappings[type] = typeMap[type].mappings;
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or at least I assumed so, WDYT?

Makes sense to me. Just trying to make sure I understand this in practice:

With this approach we will be sending an update mappings request that contains the types' full set of mappings rather than only the subset of new fields?

Comment on lines +100 to +113
const logFileContent = await fs.readFile(logFilePath, 'utf-8');
const records = logFileContent
.split('\n')
.filter(Boolean)
.map((str) => JSON5.parse(str)) as LogRecord[];

const expectLogsContains = (messagePrefix: string) => {
expect(records.find((entry) => entry.message.includes(messagePrefix))).toBeDefined();
};

expectLogsContains('INIT -> CREATE_TARGET_INDEX');
expectLogsContains('CREATE_TARGET_INDEX -> UPDATE_ALIASES');
expectLogsContains('UPDATE_ALIASES -> DONE');
expectLogsContains('Migration completed');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comment: I've seen this approached a couple different ways (partly my fault). I think it is worth consolidating:

  1. JSON5.parse way (used here)
  2. Reading full string content from log file then expect(logs).toMatch('A -> B')

Way (2) is provided by the kibana_migrator_test_kit and used in src/core/server/integration_tests/saved_objects/migrations/group3/skip_reindex.test.ts, for example.

Not major, but for these Jest integration tests would be nice to use one approach. Personally I prefer (2) bc it is "good enough" and avoids using a lib. Disclaimer: I have not benchmarked either of these.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the recent enhancements to the migration logic I ended up using (2), parsing the whole text into a string, and then using expect(logs).toMatch('certain string');.

I tried to simplify this and extract it into a method that allowed to check multiple strings were present in the logs. Whilst it was clearer in terms of readability, it added more calls in the stack, and in case of failure it prevented me from seeing the real call that was failing (Jest is currently printing 3 lines of the stack). Thus, I tend to agree with JL for these 2 reasons (avoid using lib + clearer stack traces).

Copy link
Contributor Author

@pgayvallet pgayvallet Mar 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I prefer (2) bc it is "good enough"

Personal opinion, but I don't think it's good enough for all scenarios.

The limitations of the expect(myWholeFileAsAString).toContains(someText) is that you won't be able to perform advanced tests on the order of the logs (and ihmo we should assert against the order of the logs too).

Like what I did here a while ago (I know, I'm no doing it, yet, in the test we're talking about)

expectMatchOrder(errorLines, [
{
mode: 'equal',
value: '- foo:3: Error: Migration function for version 7.14.0 threw an error',
},
{

Yes. the error message are terrible in case of failure, won't deny it. But maybe adding out own matched/expects (not even sure what the term is for jest) would be the way to go?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's good enough for all scenarios...the order of the logs

Good point! I hadn't thought of order. Just brainstorming here:

  expectLogOrder(logs, stringToMatch: string[])
  ...
  const logs = (await readLog()).split('\n')
  expectLogOrder(logs, ['INIT -> FOO', 'FOO -> BAR', ...]) // or other messages

But yeah, happy to go with whoever has the strongest opinion. Would just be nice to keep the test itself super lean.

bar: 5,
},
});
const versionMap = getModelVersionsFromMappings({ mappings, source: 'mappingVersions' });
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are not testing the scenario where types don't have a valid model version.

deletedTypes,
}: CheckVersionCompatibilityOpts): CompareModelVersionResult => {
const appVersions = getModelVersionMapForTypes(types);
const indexVersions = getModelVersionsFromMappings({ mappings, source });
Copy link
Contributor

@gsoldevila gsoldevila Mar 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at code, both getModelVersionMapForTypes and getModelVersionsFroMapping will throw if any of the mappings versions is invalid. We don't seem to be catching these errors here, nor in the different places where we call these two methods. I suppose such configuration errors are critical and should make Kibana fail, but I just wanted to make sure that this is the expected behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getModelVersionMapForTypes will never throw, because we validated each type during their registration.

getModelVersionsFroMapping could throw if someone manually tempered with the index / cluster state. But yeah in that case, it would be fatal. We could be more explicit instead of implicitly bubbling the thrown error though, eventually.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to have uncaught exceptions bubbling up for scenarios where it really should never happen. If we expect it to happen when e.g. Elasticsearch is unhealthy it feels worth catching but in this case I think it's fine. On 8.x we've only seen Elastic internal staff meddle with system indices while testing.

deletedTypes: [],
});

expect(result.status).toEqual('conflict');
Copy link
Contributor

@gsoldevila gsoldevila Mar 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm probably missing something here, but why is this test behaving differently from the last one? Oh I see, the deletedTypes is used to explicitly specify which types have been deleted. Nvm!

*/
export const getModelVersionMapForTypes = (types: SavedObjectsType[]): ModelVersionMap => {
return types.reduce<ModelVersionMap>((versionMap, type) => {
versionMap[type.name] = getLatestModelVersion(type);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if it would be worth checking for "no gaps" in the defined versions.

Copy link
Contributor Author

@pgayvallet pgayvallet Mar 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's done during type registration validation, so we take the validation as granted later on.

if (missingVersions.length) {
throw new Error(
`Type ${
type.name
}: gaps between model versions aren't allowed (missing versions: ${missingVersions.join(
','
)})`
);
}

@pgayvallet pgayvallet marked this pull request as ready for review March 7, 2023 08:13
@pgayvallet pgayvallet requested a review from a team as a code owner March 7, 2023 08:13
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-core (Team:Core)

Copy link
Contributor

@jloleysens jloleysens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No new comments since my last review. Great work @pgayvallet , exciting to see ZDT moving forward 🍻 . Also, thanks for adding those tests.

@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
@kbn/core-saved-objects-base-server-internal 36 56 +20

Public APIs missing exports

Total count of every type that is part of your API that should be exported but is not. This will cause broken links in the API documentation system. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats exports for more detailed information.

id before after diff
@kbn/core-saved-objects-base-server-internal 6 7 +1
Unknown metric groups

API count

id before after diff
@kbn/core-saved-objects-base-server-internal 47 76 +29

ESLint disabled line counts

id before after diff
securitySolution 428 430 +2

Total ESLint disabled count

id before after diff
securitySolution 505 507 +2

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

Copy link
Contributor

@gsoldevila gsoldevila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! The 2-layer separation in stages is a great strategy to avoid the huge model.ts.
Looking forward to seeing the transform part:

  • In v2 we're doing the transform before updating the mappings. I believe we're doing this to force all instances to go through that phase, so this is not mandatory for serverless.
  • Also, we're doing it systematically at startup, to account for the case when we re-enable a disabled plugin and must suddenly migrate its SOs. I'm not sure this applies to serverless. Can plugins be enabled / disabled there? Will we disable them for all instances? (including the one with the migrator role?)
  • As an improvement, we said that we could avoid "picking up" the whole index after updating the mappings:
    • We could pick up only the types that have a new version.
    • We could take into account that some documents are transformed (and already updated). We can probably default to an identity transform function for the types that have a new version and don't define a custom transform function (is that even possible?). This way, we can probably get rid of the "pickup operation" after updating the mappings.

deletedTypes,
}: CheckVersionCompatibilityOpts): CompareModelVersionResult => {
const appVersions = getModelVersionMapForTypes(types);
const indexVersions = getModelVersionsFromMappings({ mappings, source });
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to have uncaught exceptions bubbling up for scenarios where it really should never happen. If we expect it to happen when e.g. Elasticsearch is unhealthy it feels worth catching but in this case I think it's fine. On 8.x we've only seen Elastic internal staff meddle with system indices while testing.

// completed. Retry this step to see if the task has completed after an
// exponential delay. We will basically keep polling forever until the
// Elasticsearch task succeeds or fails.
return delayRetryState(state, left.message, Number.MAX_SAFE_INTEGER);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah... the thinking is that if there's a active task picking up the mappings then it's kinda bad if we throw a FATAL and then start again creating another update_by_query... so we just wait... forever.

But there's probably some point at which we could say if it takes 2 days something is wrong and we want an SRE to take a look.


return {
mappingVersions: modelVersions,
docVersions: modelVersions,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so when docVersions has dashboard: '2' then the repository will switch to v2 writes? And then once we have transformations we'll have another meta key to indicate that transformation is complete and plugins can switch to v2 queries

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct

@pgayvallet pgayvallet merged commit 95b393b into elastic:main Mar 8, 2023
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label Mar 8, 2023
bmorelli25 pushed a commit to bmorelli25/kibana that referenced this pull request Mar 10, 2023
## Summary

Part of elastic#150309

This PR implements the first stage (mapping check / update) of the ZDT
algorithm, following the schema from the design document:

<img width="1114" alt="Screenshot 2023-02-28 at 09 23 07"
src="https://user-images.githubusercontent.com/1532934/221795647-4e3d8ad0-18a1-4e2a-8c0d-dd70e66a3c25.png">

Which translates to this:

<img width="700" alt="Screenshot 2023-03-01 at 14 30 50"
src="https://user-images.githubusercontent.com/1532934/222153028-8e2cc6e8-4da2-4ca6-b299-61db6fbb624e.png">
pgayvallet added a commit that referenced this pull request Mar 28, 2023
## Summary

Part of #150309
Follow-up of #152219

Implement the second part of the zero-downtime migration algorithm: the
document conversion.

### Schema

because a schema is worth a thousand words:

<img width="650" alt="Screenshot 2023-03-22 at 08 33 44"
src="https://user-images.githubusercontent.com/1532934/226832339-d74d8349-9969-4c51-a5fe-f77558f17b67.png">


### TODO / notepad

- ~check that all types have model versions in INIT~ will do later when
we'll start have real types using MVs
- [x] Optimize to skip document migration when creating new index
- [x] documentsUpdateInit: extract remaining logic to utilities
- [x] outdatedDocumentsSearchRead: cleanup corrupted doc logic
- [x] outdatedDocumentsSearchTransform: cleanup corrupted doc logic
- [x] tests for /zdt/actions/wait_for_delay.ts ?
- ~support for coreMigrationVersion~ added as a follow-up in the parent
issue
- [x] init -> equal -> check if aliasActions is empty

---------

Co-authored-by: kibanamachine <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting Feature:Migrations release_note:skip Skip the PR/issue when compiling release notes Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc v8.8.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants