Compression: optimize the CompressionCodecFactory #9299

Merged
merged 6 commits into from
Aug 12, 2024

Conversation

Lloyd-Pottiger
Contributor

What problem does this PR solve?

Issue Number: close #8982

Problem Summary:

What is changed and how it works?

Compression: optimize the CompressionCodecFactory

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Aug 7, 2024
@ti-chi-bot ti-chi-bot bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Aug 8, 2024
Member

@CalvinNeo CalvinNeo left a comment

lgtm

@ti-chi-bot ti-chi-bot bot added approved needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Aug 9, 2024
@@ -69,12 +69,13 @@ UInt32 CompressionCodecLightweight::doCompressData(const char * source, UInt32 s
case CompressionDataType::Float32:
case CompressionDataType::Float64:
case CompressionDataType::String:
case CompressionDataType::Unknown:
Contributor

Is it OK to compress data of the "unknown" type?

Contributor Author

Yes. I added "unknown" to distinguish strings (char/varchar) from other types (binary).
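
As a rough illustration of the grouping in the hunk above, here is a minimal sketch of a type tag that routes several types to one shared path. Everything in it is an assumption made for illustration: the `DataType` and `Strategy` enums and the `chooseStrategy` function are hypothetical names, and the excerpt does not show what the real `doCompressData` does for these cases.

```cpp
#include <cassert>

// Hypothetical mirror of the data-type tag discussed above; the real enum may differ.
enum class DataType
{
    Int8,
    Int16,
    Int32,
    Int64,
    Float32,
    Float64,
    String,
    Unknown,
};

// Hypothetical strategy labels, introduced only for this sketch.
enum class Strategy
{
    IntegerEncoding, // type-aware path for fixed-width integers
    GeneralPurpose, // byte-oriented fallback
};

// Groups the non-integer tags onto one shared fallback path (an assumption,
// not the behavior confirmed by the PR).
inline Strategy chooseStrategy(DataType type)
{
    switch (type)
    {
    case DataType::Int8:
    case DataType::Int16:
    case DataType::Int32:
    case DataType::Int64:
        return Strategy::IntegerEncoding;
    case DataType::Float32:
    case DataType::Float64:
    case DataType::String:
    case DataType::Unknown:
        return Strategy::GeneralPurpose;
    }
    return Strategy::GeneralPurpose; // not reached for valid enum values
}
```

Keeping `String` and `Unknown` as separate tags lets later code treat them differently even while they share a path today.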

        if (auto codec = create(setting); codec)
            codecs.push_back(std::move(codec));
    }
    return codecs;
Contributor

Should we add a check that we will return at least one valid codec?

Contributor Author

    RUNTIME_CHECK(codec);
#ifndef DBMS_PUBLIC_GTEST
    RUNTIME_CHECK(codec->isCompression());
#endif
    return codec;

It is checked here.

Contributor

Theoretically, CompressionCodecFactory::createCodecs can return an empty vector<Codec>, and CompressionCodecMultiple accepts it.

CompressionCodecMultiple::CompressionCodecMultiple(Codecs && codecs_)
    : codecs(std::move(codecs_))
{}

It would also pass the check you added here, because the result is a non-null CompressionCodecMultiple instance:

CompressionCodecPtr CompressionCodecFactory::create(const CompressionSettings & settings)
{
    RUNTIME_CHECK(!settings.settings.empty());
    CompressionCodecPtr codec = (settings.settings.size() > 1)
        ? std::make_unique<CompressionCodecMultiple>(createCodecs(settings))
        : create(settings.settings.front());
    RUNTIME_CHECK(codec);
#ifndef DBMS_PUBLIC_GTEST
    RUNTIME_CHECK(codec->isCompression());
#endif
    return codec;
}

I think we need to add an assert(!codecs.empty()) in CompressionCodecFactory::createCodecs or in the constructor of CompressionCodecMultiple.

Contributor Author

done
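
The guard requested in this thread can be sketched as follows. This is a minimal illustration under stated assumptions, not the code that was merged: `Codec`, `Codecs`, `CodecMultiple`, and `rejectsEmptyChain` are hypothetical stand-ins, and a thrown `std::invalid_argument` is swapped in for the project's `RUNTIME_CHECK` or `assert` so the sketch stays self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the real codec types; only the shape matters here.
struct Codec
{
    virtual ~Codec() = default;
};
using CodecPtr = std::unique_ptr<Codec>;
using Codecs = std::vector<CodecPtr>;

class CodecMultiple : public Codec
{
public:
    explicit CodecMultiple(Codecs && codecs_)
        : codecs(std::move(codecs_))
    {
        // Reject an empty chain up front, so a non-null wrapper
        // can never hide the absence of a real codec.
        if (codecs.empty())
            throw std::invalid_argument("CodecMultiple requires at least one codec");
    }

    std::size_t size() const { return codecs.size(); }

private:
    Codecs codecs;
};

// Illustration helper: reports whether construction from an empty list is rejected.
inline bool rejectsEmptyChain()
{
    try
    {
        Codecs empty;
        CodecMultiple multiple(std::move(empty));
    }
    catch (const std::invalid_argument &)
    {
        return true;
    }
    return false;
}
```

Putting the check in the constructor covers every caller, whereas a check in the factory function only covers that one construction path.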

Contributor

@JaySon-Huang JaySon-Huang left a comment

IMO, a negative if statement is not as readable as a positive one.

Contributor

@JaySon-Huang JaySon-Huang left a comment

LGTM

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Aug 12, 2024
Contributor

ti-chi-bot bot commented Aug 12, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-08-09 06:34:57.006756672 +0000 UTC m=+593026.873855757: ☑️ agreed by CalvinNeo.
  • 2024-08-12 03:36:56.494176847 +0000 UTC m=+152701.197646491: ☑️ agreed by JaySon-Huang.

    return it->second;
if (lz4_map.size() >= MAX_LZ4_MAP_SIZE)
    lz4_map.clear();
lz4_map.emplace(setting.level, std::make_shared<CompressionCodecLZ4>(setting.level));
Contributor

What if lz4_map is updated concurrently?

Contributor Author

fixed
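
The thread does not show how the concurrency concern was fixed. One common approach, sketched here purely as an assumption, is to guard the lookup, the size-triggered clear, and the insert with a single mutex. `Lz4Codec` and `Lz4CodecCache` are hypothetical names standing in for `CompressionCodecLZ4` and the factory's `lz4_map`.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <unordered_map>

// Hypothetical stand-in for CompressionCodecLZ4.
struct Lz4Codec
{
    explicit Lz4Codec(int level_)
        : level(level_)
    {}
    int level;
};

// Hypothetical cache keyed by compression level, bounded by max_size.
class Lz4CodecCache
{
public:
    explicit Lz4CodecCache(std::size_t max_size_)
        : max_size(max_size_)
    {}

    std::shared_ptr<Lz4Codec> get(int level)
    {
        // One lock covers find, clear, and emplace, so no thread can
        // observe or mutate the map between those steps.
        std::lock_guard<std::mutex> lock(mutex);
        if (auto it = map.find(level); it != map.end())
            return it->second;
        if (map.size() >= max_size)
            map.clear();
        auto pos = map.emplace(level, std::make_shared<Lz4Codec>(level));
        return pos.first->second;
    }

    std::size_t size() const
    {
        std::lock_guard<std::mutex> lock(mutex);
        return map.size();
    }

private:
    mutable std::mutex mutex;
    std::size_t max_size;
    std::unordered_map<int, std::shared_ptr<Lz4Codec>> map;
};
```

Because the cache hands out `std::shared_ptr`, a codec already returned to a caller stays valid even after `clear()` drops the map's own reference.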

@Lloyd-Pottiger
Contributor Author

/hold

@ti-chi-bot ti-chi-bot bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 12, 2024
@Lloyd-Pottiger Lloyd-Pottiger force-pushed the optimize-codec-factory branch from e61baa2 to 1dbcb16 Compare August 12, 2024 04:01
Signed-off-by: Lloyd-Pottiger <[email protected]>
@Lloyd-Pottiger Lloyd-Pottiger force-pushed the optimize-codec-factory branch from 1dbcb16 to e61fc41 Compare August 12, 2024 04:55
Contributor

ti-chi-bot bot commented Aug 12, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CalvinNeo, JaySon-Huang, JinheLin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [CalvinNeo,JaySon-Huang,JinheLin]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Lloyd-Pottiger
Contributor Author

/unhold

@ti-chi-bot ti-chi-bot bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 12, 2024
@ti-chi-bot ti-chi-bot bot merged commit 5e9e6c2 into pingcap:master Aug 12, 2024
5 checks passed
@Lloyd-Pottiger Lloyd-Pottiger deleted the optimize-codec-factory branch August 12, 2024 06:49
Labels
approved lgtm release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Adaptively Decide Compression Algorithm for integers data type
4 participants