Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bootstrap ZDT migration algorithm #151282

Merged
merged 12 commits into from
Feb 27, 2023

Conversation

pgayvallet
Copy link
Contributor

@pgayvallet pgayvallet commented Feb 15, 2023

Summary

Part of #150309

Purpose of the PR is to create the skeleton of the ZDT algorithm, in order to make sure we're all aligned on the way we'll be managing our codebase between the 2 implementation (and to ease with the review of the follow-up PRs by not having the bootstrap of the algo to review at the same time)

@pgayvallet pgayvallet added v8.8.0 Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc release_note:skip Skip the PR/issue when compiling release notes Feature:Migrations labels Feb 15, 2023
Copy link
Contributor Author

@pgayvallet pgayvallet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-review, trying to explain my reasoning:

Comment on lines +14 to +16
algorithm: schema.oneOf([schema.literal('v2'), schema.literal('zdt')], {
defaultValue: 'v2',
}),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure how to name it, so I went with zdt. If anyone prefers managed, or anything else, please feel free to tell me. We could even change the algorithm name to something else.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with a more descriptive algorithm name, managed is a little too obscure.

Copy link
Contributor

@jloleysens jloleysens Feb 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From someone using this API's perspective the values v2 and zdt seem to not have any relation. This is OK, but we need a good doc comment about what each means and where it came from.

IMO zero-downtime or operator (bc I'm guessing this algo is intended to work with the "operator pattern") are also good candidates.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my 2cents: I'd use a name that makes it obvious that there is an external agent required (operator seems like a good candidate).

At the same time, I tend to prefer options that resemble the functionality they enable: Zero-downtime...

Mixing both thoughts... I wonder if we should be more explicit about when we plan to use it (or anyone willing to change the defaults is expected to use this): with the k8s operator? Something like k8s-zdt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, naming is hard.... we could also go with serverless and default/standard. OTOH at some point we may want to allow this algo to be officially supported on prem (with 'manual' orchestration of the workflow)... So I really don't know.

Comment on lines +26 to +29
export const logStateTransition = (
logger: Logger,
logMessagePrefix: string,
prevState: LogAwareState,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was extracted from being inlined in a state machine file of the v2 algorithm to be re-used for zdt.

Please note the src/common folder of the package. This is where I'm planning to move stuff shared/common between the two algo.

Basically the ideal end structure for me would be

- src
  - common
    - actions 
    - other_stuff 
  - v2
    - model
    - other_v2_specific_stuff
  - zdt
    - folders_this_pr_introduced

Does that seems fine to you, or should we go with another folder structure?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that seems fine to you, or should we go with another folder structure?

Seems good to me. It's not worth getting too tied up in where the code lives, as long as it's somewhat categorized.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm all good with src/common... just wondering about the differences between src/common and src/core... Should we pick one or are we good with having both?

If we are at risk of having circular references, I'm OK with having 2 common dirs 😇

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src/core is the remain of old history tbh 😅 . Ideally we would be moving src/core to subfolders of src/common soon. I just wanted to avoid doing it in the initial PR.

/**
* Merge mappings from all registered saved object types.
*/
export const buildTypesMappings = (
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extracted from v2. Moved to src/core because the other mapping-related helpers are there already. src/core may eventually move to src/common/core or something later (if we feel like it's worth the effort)

Comment on lines +139 to +143
if (migrationAlgorithm === 'zdt') {
return this.runMigrationZdt();
} else {
return this.runMigrationV2();
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The branching


export async function cleanup(client: ElasticsearchClient, state?: State) {
if (!state) return;
type CleanableState = { sourceIndexPitId: string } | {};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adapted the type to make it reusable with different State types (as the 2 algos have different shapes of states)

import type { State } from '../../state';
import type { ModelStage } from '../types';

export const init: ModelStage<'INIT'> = (state, res, context): State => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allowing to define them just like that (parameter types are all properly inferred)

Copy link
Contributor

@jloleysens jloleysens Feb 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separating these into individual functions is really great! What do you think of something like this as an iteration on this idea:

// just an example
export const init: ModelTransition<'INIT', 'FATAL' | 'DONE'> = (state, res, context) => {}

In this way we will:

  1. leverage TS sanity check on our expected return values
  2. replace State return value declarations from these functions

ModelTransition would be defined as:

/**
 * Defines a transition function for the model
 */
export type ModelTransition<T extends AllActionStates, R extends AllControlStates> = (
  state: StateFromActionState<T>,
  res: StateActionResponse<T>,
  context: MigratorContext
) => StateFromControlState<T | R>; // T | R because we assume a state can go back to itself as QOL for user of this type...

Plus utility:

export type StateFromControlState<T extends AllControlStates> = ControlStateMap[T];

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your improvements are great. Done in f7d6c7f

Comment on lines +27 to +31
// nothing implemented yet, just going to 'DONE'
return {
...state,
controlState: 'DONE',
};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, not the purpose of this PR, implementation will be done in follow-up(s)

Comment on lines +15 to +17
describe('Action: init', () => {
let context: MockedMigratorContext;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per model stage test file

Comment on lines +32 to +38
export const next = (context: MigratorContext) => {
const map = nextActionMap(context);

return (state: State) => {
const delay = <F extends (...args: any) => any>(fn: F): (() => ReturnType<F>) => {
return () => {
return state.retryDelay > 0
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also encountered an issue trying to factorize this. nextActionMap cannot be factorized, and next depends on it (and most importantly, on the shape of the State).

We could try spending some time to find a proper TS way to make this generic, but for 50 lines, it didn't felt like a priority.

Comment on lines +12 to +16
export interface BaseState extends ControlState {
readonly retryCount: number;
readonly retryDelay: number;
readonly logs: MigrationLog[];
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

State types are different between the two algorithm (both the BaseState and the composition of the State) so I don't think we can really re-use anything in this specific part of the code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not totally onboard with having logs as part of the state.
Stages won't require checking logs for anything, so it feels weird to have them as part of the state.
That being said, alternatives would probably require having stages as classes or objects that implement some sort of interface, so perhaps that's not ideal either.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. TBH I tried just having the logger in the context initially, and it brings two issues:

  1. incompatibilities between helper functions of the two algos
  2. making it harder to control when logs are outputted
    So I went back to putting the logs in the state...

So if we were to change that, I think we would need to do it in both implementations at the same time, which kinda closed the question for me (given it's not the same amount of work at all)

@pgayvallet pgayvallet marked this pull request as ready for review February 15, 2023 15:41
@pgayvallet pgayvallet requested a review from a team as a code owner February 15, 2023 15:41
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-core (Team:Core)

Copy link
Contributor

@TinaHeiligers TinaHeiligers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a few comments and nits, none of which block merging.
Exciting times!
LGTM

}

logger.info(
logMessagePrefix + `${prevState.controlState} -> ${currState.controlState}. took: ${tookMs}ms.`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't see it here but do we add a migration log identifier? I mean, does the logMessagePrefix include some indication that the logs come from the zdt migration algorithm?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, atm we're using the same prefix than the v2 algo:

We should probably allow to distinguish between the two, but maybe we should use a dedicated logger child/context instead of this prefix, to make sure all our logs are properly identifiable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dedicated logger child/context instead of this prefix, to make sure all our logs are properly identifiable

++

});
}

private runMigrationV2(): Promise<MigrationResult[]> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

asside: The prep needed for runResilientMigrator looks clunky compared to runZeroDowntimeMigration.
Would it make sense to move that to a 'prepareResilientMigration' setup function? The cleanup's not needed now but possibly some time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree. Ideally most of the 'architectural'/'code' improvements we do in this second algo should be ported to the existing one. Not sure our current priorities will allow us to take such chore tasks in the following months.

*/

export const StageMocks = {
init: jest.fn().mockImplementation((state: unknown) => state),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be fine and it even gives us testing options to mock stages individually.

context = createContextMock();
});

test('INIT -> DONE because its not done yet', () => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
test('INIT -> DONE because its not done yet', () => {
test("INIT -> DONE because it's not implemented yet", () => {

"not done yet" could mean that the migration hasn't finished yet, making for a confused dev reading the code 😉

test('INIT -> INIT when cluster routing allocation is incompatible', () => {
const state = createState();
const res: StateActionResponse<'INIT'> = Either.left({
type: 'incompatible_cluster_routing_allocation',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we going to enforce routing allocation given that these will be managed instances?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great question, and I'm not sure tbh. I mostly re-used the existing init action as a POC that the algorithm got executed. The zdt workflow may start by something different, and we may not need to check for such things on managed environment, yeah.

Comment on lines 17 to 28
export type StateActionResponse<T extends AllActionStates> = ExcludeRetryableEsError<
ResponseType<T>
>;

/**
* Defines a stage delegation function for the model
*/
export type ModelStage<T extends AllActionStates> = (
state: StateFromActionState<T>,
res: StateActionResponse<T>,
context: MigratorContext
) => State;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🪄

Copy link
Contributor

@jloleysens jloleysens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some initial comments, will finish rest of review soon! Let me know what you think!

Comment on lines +14 to +16
algorithm: schema.oneOf([schema.literal('v2'), schema.literal('zdt')], {
defaultValue: 'v2',
}),
Copy link
Contributor

@jloleysens jloleysens Feb 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From someone using this API's perspective the values v2 and zdt seem to not have any relation. This is OK, but we need a good doc comment about what each means and where it came from.

IMO zero-downtime or operator (bc I'm guessing this algo is intended to work with the "operator pattern") are also good candidates.

}

logger.info(
logMessagePrefix + `${prevState.controlState} -> ${currState.controlState}. took: ${tookMs}ms.`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dedicated logger child/context instead of this prefix, to make sure all our logs are properly identifiable

++

import type { State } from '../../state';
import type { ModelStage } from '../types';

export const init: ModelStage<'INIT'> = (state, res, context): State => {
Copy link
Contributor

@jloleysens jloleysens Feb 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separating these into individual functions is really great! What do you think of something like this as an iteration on this idea:

// just an example
export const init: ModelTransition<'INIT', 'FATAL' | 'DONE'> = (state, res, context) => {}

In this way we will:

  1. leverage TS sanity check on our expected return values
  2. replace State return value declarations from these functions

ModelTransition would be defined as:

/**
 * Defines a transition function for the model
 */
export type ModelTransition<T extends AllActionStates, R extends AllControlStates> = (
  state: StateFromActionState<T>,
  res: StateActionResponse<T>,
  context: MigratorContext
) => StateFromControlState<T | R>; // T | R because we assume a state can go back to itself as QOL for user of this type...

Plus utility:

export type StateFromControlState<T extends AllControlStates> = ControlStateMap[T];

Copy link
Member

@afharo afharo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So exciting to have this 🎉

Comment on lines +14 to +16
algorithm: schema.oneOf([schema.literal('v2'), schema.literal('zdt')], {
defaultValue: 'v2',
}),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my 2cents: I'd use a name that makes it obvious that there is an external agent required (operator seems like a good candidate).

At the same time, I tend to prefer options that resemble the functionality they enable: Zero-downtime...

Mixing both thoughts... I wonder if we should be more explicit about when we plan to use it (or anyone willing to change the defaults is expected to use this): with the k8s operator? Something like k8s-zdt?

Comment on lines +26 to +29
export const logStateTransition = (
logger: Logger,
logMessagePrefix: string,
prevState: LogAwareState,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm all good with src/common... just wondering about the differences between src/common and src/core... Should we pick one or are we good with having both?

If we are at risk of having circular references, I'm OK with having 2 common dirs 😇

Comment on lines 29 to 30
export function throwBadResponse(state: { controlState: string }, p: never): never;
export function throwBadResponse(state: { controlState: string }, res: any): never {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: does res: unknown work here?

*/

export const StageMocks = {
init: jest.fn().mockImplementation((state: unknown) => state),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably something like jest.requireActual('./stages') and loop over its keys would automate the creation of the mocks... although I don't think it's a big deal

Copy link
Contributor

@jloleysens jloleysens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for integrating my feedback @pgayvallet , overall LGTM!

@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

Unknown metric groups

ESLint disabled line counts

id before after diff
securitySolution 428 430 +2

Total ESLint disabled count

id before after diff
securitySolution 506 508 +2

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@pgayvallet pgayvallet merged commit bbbf8d1 into elastic:main Feb 27, 2023
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label Feb 27, 2023
bmorelli25 pushed a commit to bmorelli25/kibana that referenced this pull request Mar 10, 2023
## Summary

Part of elastic#150309

Purpose of the PR is to create the skeleton of the ZDT algorithm, in
order to make sure we're all aligned on the way we'll be managing our
codebase between the 2 implementation (and to ease with the review of
the follow-up PRs by not having the bootstrap of the algo to review at
the same time)

---------

Co-authored-by: kibanamachine <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting Feature:Migrations release_note:skip Skip the PR/issue when compiling release notes Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc v8.8.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants