[Saved object migrations] Collect all documents that fail to transform before stopping the migration #96986

TinaHeiligers
Contributor

@TinaHeiligers TinaHeiligers commented Apr 13, 2021

Resolves #90279
Approach:

  • Collect ids from saved objects that can't be transformed because of a CorruptSavedObjectError that we used to throw
  • Return either the transformed saved objects or an array of the failed doc ids
  • Rework the state machine logic:
    • If all the docs could be transformed, index them and continue searching for outdated docs (assuming we have PIT searches by then, so that we don't search for the same outdated docs again).
    • If some docs failed, we don't index the ones that did transform. Instead, we continue transforming the remaining outdated docs and build up a list of the docs that failed.
    • Eventually, we'll either end the migration successfully, or throw with the full list of docs that have issues.
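
The flow above can be sketched roughly as follows. This is a simplified illustration, not the actual state machine: `BatchResult`, `MigrationOutcome` and `runBatches` are hypothetical names, and documents are reduced to their ids.

```typescript
// Hypothetical, simplified shapes; the real migrationsv2 state has many more fields.
type BatchResult =
  | { ok: true; processedDocs: string[] }
  | { ok: false; failedIds: string[] };

interface MigrationOutcome {
  indexed: string[];
  failedIds: string[];
  status: 'DONE' | 'FATAL';
}

function runBatches(batches: BatchResult[]): MigrationOutcome {
  const indexed: string[] = [];
  const failedIds: string[] = [];

  for (const batch of batches) {
    if (!batch.ok) {
      // Record the failures and keep going, so later batches are still inspected.
      failedIds.push(...batch.failedIds);
    } else if (failedIds.length === 0) {
      // Only index while no failure has been seen so far.
      indexed.push(...batch.processedDocs);
    }
    // A batch that transforms cleanly after an earlier failure is not indexed.
  }

  return { indexed, failedIds, status: failedIds.length > 0 ? 'FATAL' : 'DONE' };
}
```

With this shape, a single failing batch is enough to end in `FATAL`, but the reported list still covers every failing document found in later batches.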

Implementation:

  • We use a new method, migrateRawDocsNonThrowing, similar to migrateRawDocs. If there are issues transforming a batch of outdated saved object documents, it returns an instance of Either.left containing an array of transformErrors and corruptDocumentIds. If there aren't any issues with the current batch, it returns an instance of Either.right with the processedDocs (the transformed saved object documents).
    (migrateRawDocs is used in v1 migrations in kibana_migrator, hence not refactoring the original)
  • If at any point during the migration we come across documents that either couldn't be transformed or have corrupt ids, we stop bulk indexing but carry on trying to transform remaining outdated documents.
  • When we have no more outdated documents to transform, the migration fails with a list of all failures.
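
A minimal sketch of the non-throwing transform described above, for illustration only. The `Either` type here is a local stand-in for the one from fp-ts, and `migrateBatchNonThrowing`, `RawDoc`, `isValid` and `transform` are hypothetical names rather than the PR's actual signatures.

```typescript
// Minimal stand-in for the Either type from fp-ts, so the sketch has no dependencies.
type Left<E> = { readonly _tag: 'Left'; readonly left: E };
type Right<A> = { readonly _tag: 'Right'; readonly right: A };
type Either<E, A> = Left<E> | Right<A>;

function left<E>(e: E): Either<E, never> {
  return { _tag: 'Left', left: e };
}
function right<A>(a: A): Either<never, A> {
  return { _tag: 'Right', right: a };
}

// Hypothetical, simplified document and result shapes.
interface RawDoc {
  _id: string;
  _source: { [key: string]: unknown };
}
interface TransformErrorObject {
  rawId: string;
  err: Error;
}
interface TransformFailed {
  corruptDocumentIds: string[];
  transformErrors: TransformErrorObject[];
}
interface TransformSuccess {
  processedDocs: RawDoc[];
}

function migrateBatchNonThrowing(
  rawDocs: RawDoc[],
  isValid: (doc: RawDoc) => boolean,
  transform: (doc: RawDoc) => RawDoc
): Either<TransformFailed, TransformSuccess> {
  const processedDocs: RawDoc[] = [];
  const corruptDocumentIds: string[] = [];
  const transformErrors: TransformErrorObject[] = [];

  for (const raw of rawDocs) {
    if (!isValid(raw)) {
      // Instead of throwing on a corrupt document, only record its id.
      corruptDocumentIds.push(raw._id);
      continue;
    }
    try {
      processedDocs.push(transform(raw));
    } catch (e) {
      // Normalise whatever was thrown into an Error and keep the raw id with it.
      const err = e instanceof Error ? e : new Error(String(e));
      transformErrors.push({ rawId: raw._id, err });
    }
  }

  if (corruptDocumentIds.length > 0 || transformErrors.length > 0) {
    return left({ corruptDocumentIds, transformErrors });
  }
  return right({ processedDocs });
}
```

The caller can then branch on the tag: a `Right` batch is safe to index, while a `Left` batch only contributes to the accumulated failure list.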

Dependency:

@TinaHeiligers TinaHeiligers added enhancement New value added to drive a business result Feature:Saved Objects v8.0.0 v7.14.0 project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient labels Apr 13, 2021
@TinaHeiligers TinaHeiligers requested a review from rudolf April 13, 2021 15:29
@elastic elastic deleted a comment from kibanamachine Apr 13, 2021
@TinaHeiligers TinaHeiligers force-pushed the so-migrations/collect-failing-docs-WIP branch from fbfa86b to d307ea4 Compare April 15, 2021 15:45
@elastic elastic deleted a comment from kibanamachine Apr 18, 2021
@TinaHeiligers TinaHeiligers force-pushed the so-migrations/collect-failing-docs-WIP branch from d307ea4 to 993a5fa Compare April 19, 2021 21:51
@TinaHeiligers TinaHeiligers force-pushed the so-migrations/collect-failing-docs-WIP branch from 2362862 to 0b14388 Compare April 20, 2021 15:59
Contributor

@pgayvallet pgayvallet left a comment

Good job entering in the SO migration v2 code. You're now a migration expert!

Implementation looks fine to me. What's currently missing is the change in the OUTDATED_DOCUMENTS_SEARCH step to use a PIT instead of just looping on search, to avoid trying to migrate the failing documents indefinitely.

It seems that it's not currently handled in #97222, so not sure which PR should take care of that.

this.id = id;
this.namespace = namespace;
this.type = type;
// Removed because not including still seems to work, it may have been an old Typescript 2.1 issue:
Contributor Author

@TinaHeiligers TinaHeiligers Apr 21, 2021

@pgayvallet ideally, we also want to capture the original stack trace. I think we are, since we're on node v14.16.1 but I'm not as familiar as you might be. Do you happen to know if I need to add something along the lines of:

Suggested change
// Removed because not including still seems to work, it may have been an old Typescript 2.1 issue:
// Maintains proper stack trace for where our error was thrown (only available on V8)
if (Error.captureStackTrace) {
Error.captureStackTrace(this, TransformSavedObjectDocumentError)
}
// Removed because not including still seems to work, it may have been an old Typescript 2.1 issue:

as suggested in the "ES6 Custom Error Class" https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Error#instance_properties section?

Contributor

Error.captureStackTrace captures the stack trace at the time it is invoked, i.e. when the TransformSavedObjectDocumentError is created. Is that what you want?

If you want to use the stack from the original error instead (originalError), you can just copy it in the constructor (this.stack = originalError.stack), or even access it from consumer code, as the property is public: transformError.originalError.stack
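
As a rough illustration of the second option, here is a sketch of an error class that keeps the original error and reuses its stack. `TransformError` is a hypothetical, simplified stand-in for TransformSavedObjectDocumentError, not the class from this PR.

```typescript
// Hypothetical, simplified stand-in for TransformSavedObjectDocumentError.
class TransformError extends Error {
  readonly rawId: string;
  readonly originalError: Error;

  constructor(rawId: string, originalError: Error) {
    super('Failed to transform document ' + rawId);
    this.name = 'TransformError';
    this.rawId = rawId;
    this.originalError = originalError;
    // Reuse the stack of the underlying error, so logs point at the failing transform
    // rather than at the place where this wrapper was constructed.
    if (originalError.stack !== undefined) {
      this.stack = originalError.stack;
    }
  }
}

const original = new Error('boom');
const wrapped = new TransformError('foo:1', original);
// Consumer code can also reach the original stack through the public property.
const sameStack = wrapped.stack === wrapped.originalError.stack;
```

Copying the stack loses the wrapper's own construction site, so keeping `originalError` as a public property leaves both pieces of information reachable.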

Contributor Author

I want the stack trace from the original error because that's ultimately what's adding to the reason the migration will fail.

Contributor Author

I tried to add the stack traces to the FATAL controlState reason but realized it's going to clutter up the logs badly! For now, we'll go with adding the raw id and the transform that failed. We can tack on the stack trace later if we see that it's not enough info for debugging failed migrations.

Contributor

> I tried to add the stack traces to the FATAL controlState reason but realized it's going to clutter up the logs badly!

How much worse are they getting? I'd say keeping the error stack is a must.

Contributor

maybe we can improve it in a follow-up

Contributor

@pgayvallet pgayvallet left a comment

Overall looking good. A few nits and questions

err,
});
} else {
transformErrors.push({ rawId: 'unknown', err }); // cases we haven't accounted for yet
Contributor

The error might be from a specific transformation script that threw, as an example

Having the document._id of the doc that encountered the failure (or at least the doc we were processing while encountering this unhandled error) seems like valuable information though, and is available in raw, so I think we ideally want to surface it in the logs when !(err instanceof TransformSavedObjectDocumentError)
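
A minimal sketch of that suggestion, under assumed names (`DocTransformError`, `ErrorEntry` and `toErrorEntry` are hypothetical). Note the parentheses around the `instanceof` check: in JavaScript `!` binds tighter than `instanceof`, so `!err instanceof X` would test `(!err) instanceof X`.

```typescript
// Hypothetical stand-in for TransformSavedObjectDocumentError.
class DocTransformError extends Error {
  readonly rawId: string;

  constructor(rawId: string, message: string) {
    super(message);
    this.rawId = rawId;
  }
}

interface ErrorEntry {
  rawId: string;
  err: Error;
}

function toErrorEntry(raw: { _id: string }, thrown: unknown): ErrorEntry {
  const err = thrown instanceof Error ? thrown : new Error(String(thrown));
  if (!(err instanceof DocTransformError)) {
    // Unhandled error type: fall back to the id of the doc being processed
    // instead of recording 'unknown'.
    return { rawId: raw._id, err };
  }
  return { rawId: err.rawId, err };
}
```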

src/core/server/saved_objects/migrationsv2/model.ts
};
} else {
- const left = res.left;
+ const left = Either.left;
Contributor

I'm sorry, even with your comment, I don't understand this change, given that Either is the library import, not one of our constants?

import * as Either from 'fp-ts/lib/Either';

How does const left = Either.left make sense here?
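
To illustrate the distinction being asked about: in fp-ts, `Either.left` is a constructor function, whereas a `Left` value carries its payload in a `.left` property. The sketch below uses a small local stand-in for the module (the `Either` object, `EitherT` and `failureType` are illustrative names, not the fp-ts source).

```typescript
// Local stand-in mirroring the shape of fp-ts's Either module.
type Left<E> = { readonly _tag: 'Left'; readonly left: E };
type Right<A> = { readonly _tag: 'Right'; readonly right: A };
type EitherT<E, A> = Left<E> | Right<A>;

const Either = {
  left<E>(e: E): EitherT<E, never> {
    return { _tag: 'Left', left: e };
  },
  right<A>(a: A): EitherT<never, A> {
    return { _tag: 'Right', right: a };
  },
  isLeft<E, A>(ma: EitherT<E, A>): ma is Left<E> {
    return ma._tag === 'Left';
  },
};

function failureType(res: EitherT<{ type: string }, string[]>): string {
  if (Either.isLeft(res)) {
    // res.left is the failure value carried by this particular result.
    const failure = res.left;
    return failure.type;
  }
  return 'none';
}

// Either.left, by contrast, is a function that builds a Left; it has no `type` field.
const constructorKind = typeof Either.left;
```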

@TinaHeiligers
Contributor Author

@elasticmachine merge upstream

Contributor

@pgayvallet pgayvallet left a comment

Last few minor NITs, but overall LGTM.

Comment on lines 104 to 105
// const savedObject = convertToRawAddMigrationVersion(raw, options, serializer);
try {
Contributor

NIT: commented line can be removed

const resultsWithProcessDocs = ((await transformTask()) as Either.Right<DocumentsTransformSuccess>)
.right.processedDocs;
expect(resultsWithProcessDocs.length).toEqual(2);
// const foo2 = hits.find((h) => h._id === 'foo:2');
Contributor

NIT: can be removed

Comment on lines 114 to 116
[...err.message.matchAll(corruptFooSOs)].concat(
[...err.message.matchAll(corruptBarSOs)],
[...err.message.matchAll(corruptBazSOs)]
Contributor

NIT: no need to concat

[
  ...err.message.matchAll(corruptFooSOs),  
  ...err.message.matchAll(corruptBarSOs), 
  ...err.message.matchAll(corruptBazSOs)
]

Comment on lines +110 to +112
const corruptFooSOs = /foo:/g;
const corruptBarSOs = /bar:/g;
const corruptBazSOs = /baz:/g;
Contributor

  • can the regexps be a little more 'precise'?
  • can we also add a test on the 'prefix' of the error message?

Contributor Author

I'll address those changes in the follow up PR. The test will likely change anyway.

Contributor

@mshustov mshustov left a comment

LGTM, left a few nits

const processedDocs: SavedObjectsRawDoc[] = [];
const transformErrors: TransformErrorObjects[] = [];
const corruptSavedObjectIds: string[] = [];
const options = { namespaceTreatment: 'lax' as const };
Contributor

optional nit: let's move it outside to reduce GC pressure

const soParseOptions = { namespaceTreatment: 'lax' } as const;
export function migrateRawDocsSafely(...

const options = { namespaceTreatment: 'lax' as const };
for (const raw of rawDocs) {
if (serializer.isRawSavedObject(raw, options)) {
// const savedObject = convertToRawAddMigrationVersion(raw, options, serializer);
Contributor

let's remove it

// the doc id we get from the error is only the uuid part
// we transform the id to a raw saved object id.
transformErrors.push({
rawId: serializer.generateRawId(err.namespace, err.type, err.id),
Contributor

Why don't we use the original raw._id?

Contributor Author

That's a very good question!

import { TransformSavedObjectDocumentError } from '.';

export interface DocumentsTransformFailed {
type: string;
Contributor

nit: readonly for all the properties. And below.
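
For illustration, the nit could look roughly like this. Only `type: string` on DocumentsTransformFailed is visible in the diff above; the other fields and the `'documents_transform_failed'` value are assumptions based on the PR description.

```typescript
// Assumed shapes; only `type: string` appears in the quoted diff.
interface TransformErrorObjects {
  readonly rawId: string;
  readonly err: Error;
}

interface DocumentsTransformFailed {
  readonly type: string;
  readonly corruptDocumentIds: readonly string[];
  readonly transformErrors: readonly TransformErrorObjects[];
}

const failed: DocumentsTransformFailed = {
  type: 'documents_transform_failed', // assumed value
  corruptDocumentIds: ['foo:1'],
  transformErrors: [{ rawId: 'bar:2', err: new Error('boom') }],
};

// failed.type = 'other';               // compile error: read-only property
// failed.corruptDocumentIds.push('x'); // compile error: no push on readonly string[]
```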

expect(result.left.transformErrors.length).toEqual(1);
expect(result.left.transformErrors[0].err.message).toMatchInlineSnapshot(`
"Failed to transform document b. Transform: a1.2.3
Doc: {\\"type\\":\\"a\\",\\"id\\":\\"b\\",\\"attributes\\":{\\"name\\":\\"AAA\\"},\\"references\\":[],\\"migrationVersion\\":{}}"
Contributor

TransformSavedObjectDocumentError shows the problem SO, but not the source of the problem. Doesn't it? Let's refactor it to include the original error message as well.

Contributor Author

I'll do that in a follow up PR.

Contributor Author

Actually, the original error is included in the transformation error as originalError.

}

/**
* Sanitizes the raw saved object document and sets the migration version
Contributor

It doesn't set the migration version. Does it?

const resultsWithProcessDocs = ((await transformTask()) as Either.Right<DocumentsTransformSuccess>)
.right.processedDocs;
expect(resultsWithProcessDocs.length).toEqual(2);
// const foo2 = hits.find((h) => h._id === 'foo:2');
Contributor

let's remove it.

.map((errObj) => `${errObj.rawId}: ${errObj.err.message}\n ${errObj.err.stack ?? ''}`)
.join('\n')
: '';
return {
Contributor

it could return the final error message Migrations failed. Reason:... to encapsulate all the error message logic
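
One way to read this suggestion, as a sketch only: a single helper that builds the whole message. The function name `buildMigrationFailureMessage` and the section labels are hypothetical; only the "Migrations failed. Reason:" prefix comes from the comment above.

```typescript
interface TransformErrorObject {
  rawId: string;
  err: Error;
}

// Hypothetical helper that owns all of the error-message formatting.
function buildMigrationFailureMessage(
  corruptDocumentIds: string[],
  transformErrors: TransformErrorObject[]
): string {
  const parts: string[] = [];
  if (corruptDocumentIds.length > 0) {
    // Section label is an assumption, not the wording used in the PR.
    parts.push('Corrupt saved object documents: ' + corruptDocumentIds.join(','));
  }
  if (transformErrors.length > 0) {
    const lines = transformErrors.map((e) => `${e.rawId}: ${e.err.message}`);
    parts.push('Transformation errors: ' + lines.join('\n'));
  }
  return 'Migrations failed. Reason: ' + parts.join('\n');
}
```

Callers would then pass the returned string straight to the FATAL state's reason, without assembling any part of the message themselves.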

const corruptBarSOs = /bar:/g;
const corruptBazSOs = /baz:/g;
expect(
[...err.message.matchAll(corruptFooSOs)].concat(
Contributor

Would you mind extending the test case to include transformErrors?

Contributor Author

@TinaHeiligers TinaHeiligers May 10, 2021

We can do that in a follow up PR if you don't mind. I wasn't sure how to inject a failing transform and need to research that a bit more.

// foo: '7.13.0',
// },
// },
// contains migrated index with 8.0 aliases to skip migration, but run outdated doc search
Contributor

Optional: it's better to use a 7.x index unless you want to test the OUTDATED_DOCUMENTS_* steps. Otherwise, we have to skip the test on the 7.x branch.

Contributor Author

@TinaHeiligers TinaHeiligers May 10, 2021

For now, testing OUTDATED_DOCUMENTS_* should be a good start and we can explicitly test the changes introduced to REINDEX_SOURCE_TO_TEMP_* in a follow up PR.

@TinaHeiligers TinaHeiligers added the auto-backport Deprecated - use backport:version if exact versions are needed label May 10, 2021
@TinaHeiligers TinaHeiligers enabled auto-merge (squash) May 10, 2021 19:12
@kibanamachine
Contributor

💚 Build Succeeded

Metrics [docs]

Unknown metric groups

References to deprecated APIs

id                       before  after  diff
canvas                       29     25    -4
crossClusterReplication       8      6    -2
fleet                         4      2    -2
globalSearch                  4      2    -2
indexManagement              12      7    -5
infra                         5      3    -2
licensing                    18     15    -3
monitoring                  109     56   -53
total                                     -73

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@TinaHeiligers TinaHeiligers merged commit 59f42ec into elastic:master May 10, 2021
kibanamachine added a commit to kibanamachine/kibana that referenced this pull request May 10, 2021
…m before stopping the migration (elastic#96986)

Co-authored-by: Kibana Machine <[email protected]>
@kibanamachine
Contributor

💚 Backport successful

Status Branch Result
7.x

This backport PR will be merged automatically after passing CI.

kibanamachine added a commit that referenced this pull request May 10, 2021
…ansform before stopping the migration (#96986) (#99713)

* [Saved object migrations] Collect all documents that fail to transform before stopping the migration (#96986)

Co-authored-by: Kibana Machine <[email protected]>

* Update src/core/server/saved_objects/migrationsv2/integration_tests/corrupt_outdated_docs.test.ts

Test relies on an archive with saved objects from version 8.0.0

Co-authored-by: Christiane (Tina) Heiligers <[email protected]>
@TinaHeiligers TinaHeiligers deleted the so-migrations/collect-failing-docs-WIP branch July 21, 2021 21:53
Labels
auto-backport Deprecated - use backport:version if exact versions are needed enhancement New value added to drive a business result Feature:Saved Objects project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient release_note:enhancement v7.14.0 v8.0.0
5 participants