
Establish batch semantics #655

Closed
gingerwizard opened this issue Jul 8, 2022 · 8 comments

@gingerwizard
Collaborator

gingerwizard commented Jul 8, 2022

Some challenges with batches:

  1. For columnar insertion we provide Append methods on columns e.g. https://github.com/ClickHouse/clickhouse-go/blob/v2/examples/native/write-columnar/main.go.

    For some types, Append can fail, e.g. when parsing strings as IP addresses. Currently, in the event of failure, we have inconsistent semantics. Consider:

    The former is faster - it saves iterations, at the cost of trickier error recovery for the user.

  2. Append to a batch can fail, e.g. when an enum value doesn't map (see "Batch.Send panics if AppendStruct couldn't map Enum" #703). This releases the connection, and any subsequent use of the batch fails with a panic.
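For illustration, the two failure modes can be contrasted with a minimal Go sketch. These are toy types with hypothetical names, not the clickhouse-go API: one column validates eagerly and returns an error from Append; the other defers validation and panics at Send, as in #703.

```go
package main

import (
	"fmt"
	"net/netip"
)

// eagerCol validates on Append and returns an error immediately,
// so the caller can handle the bad value at the point of insertion.
type eagerCol struct{ vals []netip.Addr }

func (c *eagerCol) Append(v string) error {
	addr, err := netip.ParseAddr(v)
	if err != nil {
		return err
	}
	c.vals = append(c.vals, addr)
	return nil
}

// lazyCol accepts anything on Append and only validates when the
// batch is serialised, mirroring the panic-on-Send behaviour.
type lazyCol struct{ raw []string }

func (c *lazyCol) Append(v string) { c.raw = append(c.raw, v) }

func (c *lazyCol) Send() {
	for _, v := range c.raw {
		if _, err := netip.ParseAddr(v); err != nil {
			panic(err) // surfaces far from the offending Append
		}
	}
}

func main() {
	e := &eagerCol{}
	fmt.Println(e.Append("not-an-ip") != nil) // true: caller sees the error

	l := &lazyCol{}
	l.Append("not-an-ip") // silently accepted...
	defer func() { fmt.Println(recover() != nil) }()
	l.Send() // ...and blows up here
}
```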

We should decide on the semantics, document them and be consistent.

@ernado @genzgd

@gingerwizard gingerwizard self-assigned this Jul 8, 2022
@mshustov
Member

Do I understand correctly that the problem might happen when transforming data for the insert action?

IMO we should align with ClickHouse's default logic for such cases. ClickHouse implements a fail-fast pattern, even though it's configurable. From the input_format_allow_errors_num docs:

Sets the maximum number of acceptable errors when reading from text formats (CSV, TSV, etc.).
The default value is 0.

I'd suggest avoiding the complexity of making this configurable (at least for now) and throwing an exception if a value cannot be parsed. In this case, a user can be informed about the problem and fix the data source.
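As a toy sketch of that setting's semantics (hypothetical function, not clickhouse-go or ClickHouse server code): a threshold of 0, the default, means the first unparsable value aborts the whole read.

```go
package main

import (
	"fmt"
	"net/netip"
)

// parseWithThreshold mimics input_format_allow_errors_num: tolerate up
// to maxErrors bad values, fail on the next one. With maxErrors == 0
// (the ClickHouse default) the first bad value aborts the read.
func parseWithThreshold(raw []string, maxErrors int) ([]netip.Addr, error) {
	var out []netip.Addr
	errs := 0
	for i, v := range raw {
		addr, err := netip.ParseAddr(v)
		if err != nil {
			errs++
			if errs > maxErrors {
				return nil, fmt.Errorf("value %d (%q): %w", i, v, err)
			}
			continue // skip the bad value, within budget
		}
		out = append(out, addr)
	}
	return out, nil
}

func main() {
	vals := []string{"1.1.1.1", "oops", "2.2.2.2"}
	_, err := parseWithThreshold(vals, 0) // default: fail fast
	fmt.Println(err != nil)               // true
	out, _ := parseWithThreshold(vals, 1) // tolerate one bad value
	fmt.Println(len(out))                 // 2
}
```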

@gingerwizard
Collaborator Author

Fail-fast will invariably slow down inserts - the client will need to parse the data twice. We could add a switch to disable it?

@ernado
Copy link
Collaborator

ernado commented Jul 19, 2022

IMO just failing whole batch is ok

@gingerwizard
Collaborator Author

@ernado @mshustov note I've added issue (2). Should failure to append to a batch invalidate the batch?

@gingerwizard gingerwizard added this to the 2.4.0 milestone Sep 13, 2022
@gingerwizard gingerwizard modified the milestones: 2.4.0, 2.5.0 Nov 23, 2022
@gingerwizard
Collaborator Author

This occurs with both columnar API appends and standard appends. Currently, an invalid append to a batch means it can't be sent - even if previous rows were appended successfully. We had a bug related to this - see #798

Noting we considered:

One proposal is that the Append of invalid data should fail but the batch should remain valid, i.e. a subsequent Send() should succeed. This is trickier because columns are appended one by one: if the error occurs on a column other than the first, we have to undo the earlier changes. I suspect this isn't trivial, as it means reversing the changes to the buffer. Alternatively, we could check that all values are valid beforehand - this incurs an overhead, though.
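The "check all values beforehand" variant can be sketched with a hypothetical IP column (toy type, not the real clickhouse-go column types): the buffer is only mutated once every value in the call has parsed, so a failed Append leaves the batch in its previous, sendable state.

```go
package main

import (
	"fmt"
	"net/netip"
)

// ipColumn keeps Append atomic: validate everything first, then commit.
type ipColumn struct{ vals []netip.Addr }

func (c *ipColumn) Append(values ...string) error {
	parsed := make([]netip.Addr, 0, len(values))
	for i, v := range values {
		addr, err := netip.ParseAddr(v)
		if err != nil {
			// Buffer untouched: nothing to undo, batch stays valid.
			return fmt.Errorf("value %d (%q): %w", i, v, err)
		}
		parsed = append(parsed, addr)
	}
	c.vals = append(c.vals, parsed...) // commit only on full success
	return nil
}

func main() {
	c := &ipColumn{}
	_ = c.Append("10.0.0.1", "10.0.0.2")
	err := c.Append("10.0.0.3", "not-an-ip")
	fmt.Println(err != nil, len(c.vals)) // true 2
}
```

The cost mentioned above is visible here: every value is effectively processed twice (once into the temporary slice, once into the buffer).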

@gingerwizard
Collaborator Author

This needs to be resolved and made consistent. @mshustov let's discuss

@mdonkers
Contributor

Giving my thoughts also on (2) (appending to a batch might fail):

My preference would be to invalidate the entire batch, so that trying to Send() returns an error. The reason is that appending data and having it fail can basically only happen when types are incorrect, right? That makes it a 'programmer' error. And if it fails for one record, it very likely fails for all records, so sending an empty batch is useless anyhow.
As a developer I'd much rather see it fail completely and fix it than get some partial behaviour, especially if adding some records would succeed for whatever reason. Then what ends up stored in ClickHouse would be highly nondeterministic.
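The "invalidate the whole batch" semantics could be sketched like this (toy types, hypothetical names, not the clickhouse-go API): the first bad Append latches an error, and every later Append or Send returns that error instead of panicking.

```go
package main

import (
	"fmt"
	"strconv"
)

// batch latches the first append error and refuses further use,
// surfacing the failure as an error rather than a panic.
type batch struct {
	rows [][]int64
	err  error // first append error, latched
}

func (b *batch) Append(raw ...string) error {
	if b.err != nil {
		return b.err
	}
	row := make([]int64, 0, len(raw))
	for _, v := range raw {
		n, err := strconv.ParseInt(v, 10, 64)
		if err != nil {
			b.err = fmt.Errorf("batch invalidated: %w", err)
			return b.err
		}
		row = append(row, n)
	}
	b.rows = append(b.rows, row)
	return nil
}

func (b *batch) Send() error {
	if b.err != nil {
		return b.err // deterministic: nothing partial reaches the server
	}
	return nil // flushing to the server elided in this sketch
}

func main() {
	b := &batch{}
	_ = b.Append("1", "2")
	err := b.Append("3", "x") // invalidates the batch
	fmt.Println(err != nil)      // true
	fmt.Println(b.Send() != nil) // true: Send errors, no panic
}
```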

@jkaflik jkaflik assigned jkaflik and unassigned gingerwizard Dec 16, 2022
@jkaflik
Contributor

jkaflik commented Dec 16, 2022

Agree with @mshustov:

IMO we should align with the ClickHouse default logic for such cases. ClickHouse implements fail-fast pattern, even though it's configurable. From input_format_allow_errors_num docs:

Fail-fast is the right approach here, but I don't want to go too far and introduce a configurable threshold. I'll submit a PR with a test that validates this behavior.
