Fix weird case when s3/s3Cluster return incomplete result or exception #71947

Merged
merged 9 commits into from
Nov 20, 2024

Conversation

nickitat
Member

@nickitat nickitat commented Nov 14, 2024

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fixed a case where the s3/s3Cluster functions could return an incomplete result or throw an exception. The problem occurred when a glob pattern was used in the S3 URI (like pattern/*) and an empty object existed with the key pattern/ (such objects are created automatically by the S3 Console). Also, the default value of the setting s3_skip_empty_files changed from false to true.

CI Settings (Only check the boxes if you know what you are doing):

  • Allow: All Required Checks
  • Allow: Stateless tests
  • Allow: Stateful tests
  • Allow: Integration Tests
  • Allow: Performance tests
  • Allow: All Builds
  • Allow: batch 1, 2 for multi-batch jobs
  • Allow: batch 3, 4, 5, 6 for multi-batch jobs

  • Exclude: Style check
  • Exclude: Fast test
  • Exclude: All with ASAN
  • Exclude: All with TSAN, MSAN, UBSAN, Coverage
  • Exclude: All with aarch64, release, debug

  • Run only fuzzers related jobs (libFuzzer fuzzers, AST fuzzers, etc.)
  • Exclude: AST fuzzers

  • Do not test
  • Woolen Wolfdog
  • Upload binaries for special builds
  • Disable merge-commit
  • Disable CI cache

@robot-clickhouse robot-clickhouse added the pr-not-for-changelog This PR should not be mentioned in the changelog label Nov 14, 2024
@robot-ch-test-poll
Contributor

robot-ch-test-poll commented Nov 14, 2024

This is an automated comment for commit 5aeeec0 with description of existing statuses. It's updated for the latest CI running

✅ Click here to open a full report in a separate page

Successful checks
| Check name | Description | Status |
| --- | --- | --- |
| AST fuzzer | Runs randomly generated queries to catch program errors. The build type is optionally given in parenthesis. If it fails, ask a maintainer for help | ✅ success |
| Bugfix validation | Checks that either a new test (functional or integration) or there some changed tests that fail with the binary built on master branch | ✅ success |
| Builds | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| ClickBench | Runs [ClickBench](https://github.com/ClickHouse/ClickBench/) with instant-attach table | ✅ success |
| Compatibility check | Checks that clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help | ✅ success |
| Docker keeper image | The check to build and optionally push the mentioned image to docker hub | ✅ success |
| Docker server image | The check to build and optionally push the mentioned image to docker hub | ✅ success |
| Docs check | Builds and tests the documentation | ✅ success |
| Fast test | Normally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here | ✅ success |
| Flaky tests | Checks if new added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer, and additional randomization of thread scheduling. Integration tests are run up to 10 times. If at least once a new test has failed, or was too long, this check will be red. We don't allow flaky tests, read the doc | ✅ success |
| Install packages | Checks that the built packages are installable in a clear environment | ✅ success |
| Integration tests | The integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests | ✅ success |
| Performance Comparison | Measure changes in query performance. The performance test report is described in detail here. In square brackets are the optional part/total tests | ✅ success |
| Stateful tests | Runs stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | ✅ success |
| Stateless tests | Runs stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | ✅ success |
| Stress test | Runs stateless functional tests concurrently from several clients to detect concurrency-related errors | ✅ success |
| Style check | Runs a set of checks to keep the code style clean. If some of tests failed, see the related log from the report | ✅ success |
| Unit tests | Runs the unit tests for different release types | ✅ success |
| Upgrade check | Runs stress tests on server version from last release and then tries to upgrade it to the version from the PR. It checks if the new server can successfully startup without any errors, crashes or sanitizer asserts | ✅ success |

@nikitamikhaylov
Member

nikitamikhaylov commented Nov 14, 2024

Backport?
I would also say this is a bug-fix, because s3 is widely used.

@nickitat
Member Author

> Backport? I would also say this is a bug-fix, because s3 is widely used.

ok, will do. first i need to read the code and understand what will break

@kssenii kssenii self-assigned this Nov 15, 2024
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added pr-bugfix Pull request with bugfix, not backported by default and removed pr-not-for-changelog This PR should not be mentioned in the changelog labels Nov 16, 2024
SET max_threads = 1;
SET s3_truncate_on_insert = 1;

INSERT INTO FUNCTION s3(s3_conn, filename='dir1/03271_s3_table_function_asterisk_glob/', format=Parquet) SELECT 0 as num;

Contributor

#72007 would break this test by no longer including paths with empty file names when using a trailing wildcard (/*). Is that ok or is this behavior needed for a specific use case?

Member Author

i don't have any use case in mind, it just seems to be the correct behaviour, because s3 is not a filesystem: an object key is a single string that might end with a '/'. initially i was asked by Pete why a query with the /dir/* pattern didn't work, and it turned out that he had an object named dir/. we got it in the List output, found that its file name was empty, and threw away all the other objects too. so i decided to fix both problems: don't give up when we see an "empty file name", and read from such objects too.
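To make the symptom concrete, here is a toy Python model of it. This is only a sketch under stated assumptions, not ClickHouse's actual code: `glob_to_regex`, `match_keys_buggy` and `match_keys_fixed` are hypothetical names, and the glob is simplified so that `*` matches any run of characters except `/`, including the empty run.

```python
import re
from posixpath import basename


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a simplified glob into a regex.

    Assumption: '*' matches any run of non-'/' characters, including
    the empty run. Other glob features are ignored in this sketch.
    """
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile("[^/]*".join(literal_parts))


def match_keys_buggy(keys, pattern):
    """Hypothetical model of the old behaviour: a matching key with an
    empty file name (e.g. 'dir/') makes us discard the whole listing."""
    regex = glob_to_regex(pattern)
    matched = []
    for key in keys:
        if not regex.fullmatch(key):
            continue
        if basename(key) == "":
            return []  # give up: all other objects are lost as well
        matched.append(key)
    return matched


def match_keys_fixed(keys, pattern):
    """Hypothetical model of the fixed behaviour: a key is a plain
    string, so 'dir/' is kept like any other matching key."""
    regex = glob_to_regex(pattern)
    return [key for key in keys if regex.fullmatch(key)]


keys = ["dir/", "dir/a.parquet", "dir/b.parquet", "dir/sub/c.parquet"]
print(match_keys_buggy(keys, "dir/*"))  # → []
print(match_keys_fixed(keys, "dir/*"))  # → ['dir/', 'dir/a.parquet', 'dir/b.parquet']
```

In this model the single key `dir/` is enough to turn a three-object result into an empty one, which is the kind of incomplete result described above.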

Contributor

I was solving the same original problem with #72007. :-)

> don't give up when we see an "empty file name" and read from such objects too

Makes sense to read from it when there is data in it, but shouldn't we rather skip it if it's empty?

Because Amazon creates such objects when you create a directory in the S3 console and it doesn't really make sense to try to read from such objects (from here):

When you create a folder in Amazon S3, S3 creates a 0-byte object with a key that's set to the folder name that you provided. For example, if you create a folder named photos in your bucket, the Amazon S3 console creates a 0-byte object with the key photos/. The console creates this object to support the idea of folders.
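The "skip it if it's empty" idea can be sketched in a few lines of Python. This is a hedged illustration, not the ClickHouse implementation: `ListedObject`, `is_folder_marker` and `objects_to_read` are hypothetical names, and the `skip_empty` flag only mimics what a setting like s3_skip_empty_files is meant to achieve.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ListedObject:
    """Hypothetical stand-in for one entry of an S3 List response."""
    key: str
    size: int  # object size in bytes


def is_folder_marker(obj: ListedObject) -> bool:
    """A 0-byte object whose key ends with '/', like the ones the
    S3 console creates when a folder is made."""
    return obj.size == 0 and obj.key.endswith("/")


def objects_to_read(listing, skip_empty=True):
    """Return the objects worth reading.

    With skip_empty=True every 0-byte object is dropped, which covers
    console-created folder markers. An object whose key ends with '/'
    but that contains data is still read.
    """
    if not skip_empty:
        return list(listing)
    return [obj for obj in listing if obj.size > 0]


listing = [
    ListedObject("photos/", 0),            # folder marker from the console
    ListedObject("photos/cat.parquet", 1024),
    ListedObject("photos/data/", 5),       # ends with '/', but has content
]
print([obj.key for obj in objects_to_read(listing)])
# → ['photos/cat.parquet', 'photos/data/']
```

Filtering on size rather than on the trailing `/` keeps both points from the discussion: objects with data are read whatever their key looks like, and empty marker objects are skipped.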

Member Author

@nickitat nickitat Nov 18, 2024


thx for this reference - i was wondering who creates these empty objects, seemed hard to believe that people really do it manually.

@nickitat nickitat changed the title from "Fix weird case when s3 function returns incomplete result" to "Fix weird case when s3 function returns incomplete result or exception" Nov 19, 2024
@nickitat nickitat changed the title from "Fix weird case when s3 function returns incomplete result or exception" to "Fix weird case when s3/s3Cluster return incomplete result or exception" Nov 19, 2024
@kssenii kssenii added this pull request to the merge queue Nov 20, 2024
Merged via the queue into master with commit c6a1015 Nov 20, 2024
220 checks passed
@kssenii kssenii deleted the fix_weird_problem branch November 20, 2024 17:10
@aalexfvk
Contributor

aalexfvk commented Dec 16, 2024

Hello! Will the backport be made in the LTS/stable versions (from 24.3) ?

@kssenii kssenii added the pr-must-backport Pull request should be backported intentionally. Use this label with great care! label Dec 16, 2024
nickitat added a commit that referenced this pull request Dec 16, 2024
Cherry pick #71947 to 24.10: Fix weird case when `s3`/`s3Cluster` return incomplete result or exception
robot-clickhouse added a commit that referenced this pull request Dec 16, 2024
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label Dec 16, 2024
robot-clickhouse added a commit that referenced this pull request Dec 16, 2024
Backport #71947 to 24.10: Fix weird case when `s3`/`s3Cluster` return incomplete result or exception
@aalexfvk
Contributor

aalexfvk commented Jan 8, 2025

Hello! Is it possible to backport in 24.3 - 24.9 also ? They were closed

@nickitat
Member Author

nickitat commented Jan 8, 2025

> Hello! Is it possible to backport in 24.3 - 24.9 also ? They were closed

Hi! Unfortunately not - there were merge conflicts.

Labels
pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore pr-backports-created-cloud pr-bugfix Pull request with bugfix, not backported by default pr-must-backport Pull request should be backported intentionally. Use this label with great care! pr-must-backport-cloud pr-synced-to-cloud The PR is synced to the cloud repo
9 participants