Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(core-clp)!: Migrate archive metadata file format to MessagePack. #700

Merged
merged 14 commits into from
Feb 5, 2025

Conversation

davemarco
Copy link
Contributor

@davemarco davemarco commented Jan 29, 2025

Description

This PR is breaking

MIgrates the archive metadata format from a byte stream to MessagePack Map (same as sfa spec). Using MessagePack allows us to more easily store variable length metadata like strings. In previous format, a size specifier was required for any variable length field.

Note the metadata file will take up slightly more space, but likely small relative to archive.

Lastly the metadata size is now excluded from the compressed size. Now that the metadata is variable length, it is not trivial to incorporate the metadata size in the metadata itself (there is a circularity problem). The simplest solution is just to remove it from compressed size value.

This PR is sfa related like:
#698

Validation performed

Tested compression/decompression of archive.

Summary by CodeRabbit

Summary by CodeRabbit

  • Refactor

    • Updated the ArchiveMetadata class to use MessagePack serialization for more efficient file reading and writing.
    • Replaced the constructor with a static create_from_file method.
    • Simplified the object creation process for ArchiveMetadata in the Archive::open method.
  • New Features

    • Added a default constructor for ArchiveMetadata.
    • Enhanced error handling for file reading and metadata creation.
    • Introduced a new method for writing ArchiveMetadata to a file.
  • Improvements

    • Streamlined metadata serialization process.
    • Added a new member variable for improved metadata handling.

Copy link
Contributor

coderabbitai bot commented Jan 29, 2025

Walkthrough

The pull request introduces significant changes to the ArchiveMetadata class in the streaming archive component. The primary modifications involve refactoring the file reading and writing methods to use MessagePack serialization. A new static method create_from_file replaces the previous constructor, streamlining the file handling process. The Archive::open method has also been updated to utilize the new static method for instantiating ArchiveMetadata.

Changes

File Change Summary
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp - Removed constructor accepting FileReader
- Introduced static method create_from_file for file reading
- Updated write_to_file to use MessagePack serialization
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp - Added default constructor
- Introduced new method for file reading and serialization
- Updated serialization macro to include new members
- Added member variable m_metadata_size
components/core/src/clp/streaming_archive/reader/Archive.cpp - Updated metadata creation to use new static method create_from_file

Possibly related PRs

Suggested reviewers

  • kirkrodrigues
  • LinZhihao-723

Tip

🌐 Web search-backed reviews and chat
  • We have enabled web search-based reviews and chat for all users. This feature allows CodeRabbit to access the latest documentation and information on the web.
  • You can disable this feature by setting web_search: false in the knowledge_base settings.
  • Please share any feedback in the Discord discussion.

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a885dc8 and fc00b4a.

📒 Files selected for processing (2)
  • components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (3 hunks)
  • components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • components/core/src/clp/streaming_archive/ArchiveMetadata.cpp
  • components/core/src/clp/streaming_archive/ArchiveMetadata.hpp
⏰ Context from checks skipped due to timeout of 90000ms (12)
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: ubuntu-focal-static-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-focal-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: lint-check (ubuntu-latest)
  • GitHub Check: build-macos (macos-13, false)
  • GitHub Check: build-macos (macos-13, true)
  • GitHub Check: build (macos-latest)

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (5)
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (2)

5-5: Remove unused header if not required.
<fmt/core.h> does not appear to be used in this file. Consider removing it to keep dependencies minimal.

-#include <fmt/core.h>

29-46: Address potential large file memory usage.
Reading the entire metadata file into memory (std::vector<char> buf(file_stat.st_size)) could pose memory concerns if the file is large. Consider chunk-based reading or additional checks (e.g., file size limit) to reduce risk.

components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (2)

15-15: Fixed compression type constant.
Using a static constexpr string_view for the "ZSTD" compression type is clear. If you anticipate multiple compression methods in the future, ensure expandability.


122-132: MSGPACK_DEFINE_MAP ordering.
Keeping the newly added fields in a stable order can help maintain backward compatibility. Consider documenting any assumptions about field ordering in public references.

components/core/src/clp/streaming_archive/reader/Archive.cpp (1)

41-41: Explicit naming for metadata object.
Using auto const metadata = ArchiveMetadata::create_from_file_reader(file_reader); is readable. If multiple metadata objects are used in the future, a more specific variable name (e.g., archive_metadata) may be beneficial.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ff90d7f and a6974c7.

📒 Files selected for processing (3)
  • components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (3 hunks)
  • components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (5 hunks)
  • components/core/src/clp/streaming_archive/reader/Archive.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
components/core/src/clp/streaming_archive/reader/Archive.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

⏰ Context from checks skipped due to timeout of 90000ms (11)
  • GitHub Check: ubuntu-focal-static-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: ubuntu-focal-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: build-macos (macos-13, false)
  • GitHub Check: lint-check (ubuntu-latest)
  • GitHub Check: build-macos (macos-13, true)
🔇 Additional comments (12)
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (2)

3-3: Validate necessity of the <sys/stat.h> header.
It seems this header is required for the struct stat usage at lines 30–31. Please confirm it remains necessary if file statistics are still being used.


59-62: Confirm write operation’s success path.
You rely on the FileWriter to throw on failure. While this is acceptable, providing explicit handling or logging around partial writes can offer more clarity for troubleshooting.

components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (10)

7-7: Include for encoding methods constants.
Ensuring ffi::encoding_methods.hpp is included allows proper default values for new member fields. Nicely done.


11-11: Essential include for MessagePack.
The msgpack.hpp header is essential for the serialisation and deserialisation logic used in this class.


36-36: Default constructor utility.
Providing a default constructor can simplify certain use cases or testing scenarios when creating ArchiveMetadata objects.


49-49: Concise method documentation.
The comment for the factory method clarifies its purpose and expected usage.


51-59: Thorough docstring for MessagePack creation.
These lines effectively describe how create_from_file_reader works and the possible exceptions. Excellent detail in the error-handling notes.


95-97: Getter for variable encoding methods version.
Returning a const reference avoids unnecessary copies and ensures data consistency.


99-101: Getter for variables schema version.
Likewise, returning a const reference aligns with your efficiency pattern and design style.


103-105: Getter for compression type.
Maintains consistency with other getters. Straightforward and efficient.


114-119: Comprehensive documentation for write operation.
Explicitly mentioning exceptions aids users of this method in planning robust error handling.


148-150: Initial values for new fields.
Pre-setting defaults for m_variable_encoding_methods_version, m_variables_schema_version, and m_compression_type ensures consistent behaviour across all code paths.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (3)
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (2)

5-5: Consider removing unused include.

The fmt/core.h header appears to be unused in the current implementation.


59-62: Improve error handling and memory efficiency in serialization.

The current implementation:

  1. Doesn't handle potential msgpack::pack errors
  2. Uses intermediate string buffer, causing double memory allocation

Consider using msgpack's direct streaming to file writer or a more memory-efficient approach:

-    std::ostringstream buf;
-    msgpack::pack(buf, *this);
-    auto const& string_buf = buf.str();
-    file_writer.write(string_buf.data(), string_buf.size());
+    // Option 1: Use msgpack's streaming interface
+    msgpack::packer<FileWriter> packer(file_writer);
+    packer.pack(*this);
+
+    // Option 2: Pre-calculate buffer size
+    size_t buffer_size = msgpack::pack_calculate_size(*this);
+    std::vector<char> buffer(buffer_size);
+    msgpack::pack(buffer.data(), *this);
+    file_writer.write(buffer.data(), buffer_size);
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (1)

95-105: Add documentation for new getter methods.

The new getter methods lack documentation comments explaining their purpose and return values.

Add documentation similar to other methods in the class:

+    /**
+     * @return The version of variable encoding methods used in this archive
+     */
     [[nodiscard]] auto get_variable_encoding_methods_version() const -> std::string const& {
         return m_variable_encoding_methods_version;
     }

+    /**
+     * @return The version of variables schema used in this archive
+     */
     [[nodiscard]] auto get_variables_schema_version() const -> std::string const& {
         return m_variables_schema_version;
     }

+    /**
+     * @return The compression algorithm type used in this archive
+     */
     [[nodiscard]] auto get_compression_type() const -> std::string const& {
         return m_compression_type;
     }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a6974c7 and d3a52d4.

📒 Files selected for processing (2)
  • components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (3 hunks)
  • components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (5 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

⏰ Context from checks skipped due to timeout of 90000ms (11)
  • GitHub Check: ubuntu-focal-static-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: ubuntu-focal-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: build-macos (macos-13, false)
  • GitHub Check: build-macos (macos-13, true)
  • GitHub Check: lint-check (macos-latest)
🔇 Additional comments (1)
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (1)

7-15: LGTM! Well-organized includes and constants.

The new include and compression type constant are appropriately placed and follow the project's conventions.

Comment on lines 122 to 132
MSGPACK_DEFINE_MAP(
MSGPACK_NVP("archive_format_version", m_archive_format_version),
MSGPACK_NVP("variable_encoding_methods_version", m_variable_encoding_methods_version),
MSGPACK_NVP("variables_schema_version", m_variables_schema_version),
MSGPACK_NVP("compression_type", m_compression_type),
MSGPACK_NVP("creator_id", m_creator_id),
MSGPACK_NVP("begin_timestamp", m_begin_timestamp),
MSGPACK_NVP("end_timestamp", m_end_timestamp),
MSGPACK_NVP("uncompressed_size", m_uncompressed_size),
MSGPACK_NVP("compressed_size", m_compressed_size)
);
Copy link
Contributor

@coderabbitai coderabbitai bot Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Add missing creation_idx to MessagePack serialization.

The creation_idx field is initialized in the constructor but not included in the MessagePack serialization map.

Add the missing field:

     MSGPACK_DEFINE_MAP(
             MSGPACK_NVP("archive_format_version", m_archive_format_version),
             MSGPACK_NVP("variable_encoding_methods_version", m_variable_encoding_methods_version),
             MSGPACK_NVP("variables_schema_version", m_variables_schema_version),
             MSGPACK_NVP("compression_type", m_compression_type),
             MSGPACK_NVP("creator_id", m_creator_id),
+            MSGPACK_NVP("creation_idx", m_creation_idx),
             MSGPACK_NVP("begin_timestamp", m_begin_timestamp),
             MSGPACK_NVP("end_timestamp", m_end_timestamp),
             MSGPACK_NVP("uncompressed_size", m_uncompressed_size),
             MSGPACK_NVP("compressed_size", m_compressed_size)
     );
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
MSGPACK_DEFINE_MAP(
MSGPACK_NVP("archive_format_version", m_archive_format_version),
MSGPACK_NVP("variable_encoding_methods_version", m_variable_encoding_methods_version),
MSGPACK_NVP("variables_schema_version", m_variables_schema_version),
MSGPACK_NVP("compression_type", m_compression_type),
MSGPACK_NVP("creator_id", m_creator_id),
MSGPACK_NVP("begin_timestamp", m_begin_timestamp),
MSGPACK_NVP("end_timestamp", m_end_timestamp),
MSGPACK_NVP("uncompressed_size", m_uncompressed_size),
MSGPACK_NVP("compressed_size", m_compressed_size)
);
MSGPACK_DEFINE_MAP(
MSGPACK_NVP("archive_format_version", m_archive_format_version),
MSGPACK_NVP("variable_encoding_methods_version", m_variable_encoding_methods_version),
MSGPACK_NVP("variables_schema_version", m_variables_schema_version),
MSGPACK_NVP("compression_type", m_compression_type),
MSGPACK_NVP("creator_id", m_creator_id),
MSGPACK_NVP("creation_idx", m_creation_idx),
MSGPACK_NVP("begin_timestamp", m_begin_timestamp),
MSGPACK_NVP("end_timestamp", m_end_timestamp),
MSGPACK_NVP("uncompressed_size", m_uncompressed_size),
MSGPACK_NVP("compressed_size", m_compressed_size)
);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davemarco is this an issue?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Copy link
Contributor Author

@davemarco davemarco Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps. Thanks @coderabbitai. I just realized that this field was omitted from sfa format, so left out here unintentionally. I believe this field is an archive order for split archives, so maybe it was omitted because the archives are not split. We can either have private branch add this back, since it is harmless. Or maybe I can just default initialize it as zero.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davemarco Thanks for the explanation! Since the archives aren't split, defaulting creation_idx to zero in the MessagePack serialization seems like a reasonable approach. Would you like me to provide an updated suggestion with this change?


✏️ Learnings added
Learnt from: davemarco
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.hpp:122-132
Timestamp: 2025-01-29T21:09:06.779Z
Learning: The `creation_idx` field in `ArchiveMetadata` is used for tracking archive order in split archives. When archives aren't split, it can be safely initialized to zero.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I think either works. I guess we can omit creation_idx for now, but add a comment to explain it's for format consistency. Then we can open another issue tracking the effort to handle this field properly in private branch

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added it back, since I assume this is used somewhere. I will add issue to doc. Note it is default initialized to zero, so even if missing in private branch, it will show up as zero when deserialized (if missing), so not much of a problem

@@ -38,7 +38,7 @@ void Archive::open(string const& path) {
archive_format_version_t format_version{};
try {
FileReader file_reader{metadata_file_path};
ArchiveMetadata const metadata{file_reader};
auto const metadata = ArchiveMetadata::create_from_file_reader(file_reader);
Copy link
Contributor

@haiqi96 haiqi96 Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since the FileReader goes out of scope right after the metadata is created, can the method takes metadata_file_path as its parameter and create a FileReader internally?

Edit: Now i see that you are just trying to reuse the original signature, which makes sense.. but maybe we can also improve the method interface in this PR since it's a few lines of change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you feel about taking a clp::array instead as input. It will move some of the code (the read and stat) into archive reader, but then this constructor can also be used for sfa. Like for sfa, we can't pass in a filereader (since no way to get the size of metadata in larger file by itself, and the filepath is also for the entire sfa)?

We could also have 2 constructors. One for file path, and one for buffer. And they just share a method to deserialize from buffer? Let me know your thoughts

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will move some of the code (the read and stat) into archive reader, but then this constructor can also be used for sfa.

Is this planned in a future PR? I think it makes sense, but the change you propose might be easier to reason about when we are actually making the changes that include sfa.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay for now, I can take in a file path. I can add the buffer feature later when sfa reader is implemented

@@ -102,6 +145,9 @@ class ArchiveMetadata {
// The size of the archive
uint64_t m_compressed_size{0};
uint64_t m_dynamic_compressed_size{0};
std::string m_variable_encoding_methods_version{ffi::cVariableEncodingMethodsVersion};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new coding standard encourages using std::string_view where possible. is it possible to use string_view instead?

throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
}

std::vector<char> buf(file_stat.st_size);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LinZhihao-723 should we use clp::array instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved to clp::array

Comment on lines 31 to 34
auto clp_rc = file_reader.try_fstat(file_stat);
if (clp::ErrorCode::ErrorCode_Success != clp_rc) {
throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
}
Copy link
Contributor

@haiqi96 haiqi96 Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another way to write this code is:
Do we prefer one way to another? @LinZhihao-723

Suggested change
auto clp_rc = file_reader.try_fstat(file_stat);
if (clp::ErrorCode::ErrorCode_Success != clp_rc) {
throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
}
if (auto const clp_rc = file_reader.try_fstat(file_stat);
clp::ErrorCode::ErrorCode_Success != clp_rc) {
throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Supposedly second is better since the variable is not needed in remaining scope

Comment on lines 37 to 40
clp_rc = file_reader.try_read_exact_length(buf.data(), buf.size());
if (clp::ErrorCode::ErrorCode_Success != clp_rc) {
throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
}
Copy link
Contributor

@haiqi96 haiqi96 Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, maybe can be rewritten as

Suggested change
clp_rc = file_reader.try_read_exact_length(buf.data(), buf.size());
if (clp::ErrorCode::ErrorCode_Success != clp_rc) {
throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
}
if (auto const clp_rc = file_reader.try_read_exact_length(buf.data(), buf.size());
clp::ErrorCode::ErrorCode_Success != clp_rc) {
throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
}

* @throw `ArchiveMetadata::OperationFailed` if stat or read operation on metadata file fails.
* @throw `msgpack::unpack_error` if data cannot be unpacked into MessagePack object.
* @throw `msgpack::type_error` if MessagePack object can't be converted to
* `ArchiveMetadata`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can it be at end of the preivious line?

Copy link
Contributor Author

@davemarco davemarco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed comment, and provided feedback for couple questions

Comment on lines 31 to 34
auto clp_rc = file_reader.try_fstat(file_stat);
if (clp::ErrorCode::ErrorCode_Success != clp_rc) {
throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Supposedly second is better since the variable is not needed in remaining scope

Comment on lines 122 to 132
MSGPACK_DEFINE_MAP(
MSGPACK_NVP("archive_format_version", m_archive_format_version),
MSGPACK_NVP("variable_encoding_methods_version", m_variable_encoding_methods_version),
MSGPACK_NVP("variables_schema_version", m_variables_schema_version),
MSGPACK_NVP("compression_type", m_compression_type),
MSGPACK_NVP("creator_id", m_creator_id),
MSGPACK_NVP("begin_timestamp", m_begin_timestamp),
MSGPACK_NVP("end_timestamp", m_end_timestamp),
MSGPACK_NVP("uncompressed_size", m_uncompressed_size),
MSGPACK_NVP("compressed_size", m_compressed_size)
);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added it back, since I assume this is used somewhere. I will add issue to doc. Note it is default initialized to zero, so even if missing in private branch, it will show up as zero when deserialized (if missing), so not much of a problem

@@ -38,7 +38,7 @@ void Archive::open(string const& path) {
archive_format_version_t format_version{};
try {
FileReader file_reader{metadata_file_path};
ArchiveMetadata const metadata{file_reader};
auto const metadata = ArchiveMetadata::create_from_file_reader(file_reader);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you feel about taking a clp::array instead as input. It will move some of the code (the read and stat) into archive reader, but then this constructor can also be used for sfa. Like for sfa, we can't pass in a filereader (since no way to get the size of metadata in larger file by itself, and the filepath is also for the entire sfa)?

We could also have 2 constructors. One for file path, and one for buffer. And they just share a method to deserialize from buffer? Let me know your thoughts

throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
}

std::vector<char> buf(file_stat.st_size);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved to clp::array

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (2)

37-39: Enhance the default constructor documentation.

While the comment explains the MessagePack requirement, consider making it more explicit:

-    // The class must be constructible to convert from msgpack::object.
-    // https://github.com/msgpack/msgpack-c/wiki/v2_0_cpp_adaptor
+    /**
+     * Default constructor required for MessagePack deserialization.
+     * @see https://github.com/msgpack/msgpack-c/wiki/v2_0_cpp_adaptor
+     * @note This constructor should not be used directly.
+     */

117-122: Enhance write_to_file documentation.

Consider adding more details about the MessagePack serialization format and potential size implications.

     /**
-     * Packs `ArchiveMetadata` to MessagePack and writes to the open file.
+     * Serializes the metadata to MessagePack format and writes to the open file.
+     * Note: MessagePack format may result in slightly larger file sizes compared
+     * to the previous binary format.
      *
      * @param file_writer Writer for archive metadata file.
      * @throw FileWriter::OperationFailed if failed to write.
      */
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (2)

5-6: Remove unused include.

The fmt/core.h header appears to be unused in this file. Consider removing it to keep the includes clean and minimize compilation time.

-#include <fmt/core.h>

63-66: Improve write_to_file performance and error handling.

The current implementation has two areas for improvement:

  1. Double buffering could be avoided by writing directly to the file
  2. Missing error handling for the write operation
 void ArchiveMetadata::write_to_file(FileWriter& file_writer) const {
-    std::ostringstream buf;
-    msgpack::pack(buf, *this);
-    auto const& string_buf = buf.str();
-    file_writer.write(string_buf.data(), string_buf.size());
+    // Pack directly to a pre-sized buffer to avoid double buffering
+    auto const packed_size = msgpack::pack_calculate_size(*this);
+    clp::Array<char> buf(packed_size);
+    msgpack::pack_raw(buf.data(), packed_size, *this);
+    
+    if (auto const clp_rc = file_writer.try_write(buf.data(), buf.size());
+        clp::ErrorCode::ErrorCode_Success != clp_rc) {
+        throw OperationFailed(clp_rc, __FILENAME__, __LINE__);
+    }
 }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d3a52d4 and 72452c1.

📒 Files selected for processing (2)
  • components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (3 hunks)
  • components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (5 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (2)
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (1)
Learnt from: haiqi96
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.cpp:29-46
Timestamp: 2025-01-29T22:16:52.665Z
Learning: In C++ code, when throwing exceptions, prefer throwing the original return code (RC) from lower-level operations instead of a generic error code to preserve error context and aid in debugging.
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (1)
Learnt from: davemarco
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.hpp:122-132
Timestamp: 2025-01-29T21:09:06.878Z
Learning: The `creation_idx` field in `ArchiveMetadata` is used for tracking archive order in split archives. When archives aren't split, it can be safely initialized to zero.
⏰ Context from checks skipped due to timeout of 90000ms (12)
  • GitHub Check: ubuntu-focal-static-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-focal-dynamic-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build-macos (macos-13, false)
  • GitHub Check: lint-check (ubuntu-latest)
  • GitHub Check: build-macos (macos-13, true)
  • GitHub Check: lint-check (macos-latest)
  • GitHub Check: build (macos-latest)
🔇 Additional comments (7)
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (5)

5-16: LGTM! The new includes and constants are well-organized.

The use of std::string_view aligns with the coding standards, and the compression type constant follows proper naming conventions.


53-64: LGTM! Well-documented factory method.

The static factory method is well-designed with comprehensive error documentation.


98-109: LGTM! Consistent getter implementations.

The new getters follow a consistent pattern and properly use std::string_view const& as return types.


152-154: LGTM! Well-structured member variables.

The new member variables properly use std::string_view for constant strings and follow consistent initialization patterns.


125-136: Verify MessagePack field order impact.

The order of fields in MSGPACK_DEFINE_MAP might affect backward compatibility when reading older archives.

components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (2)

33-44: Well-implemented error handling and buffer management.

Good practices observed:

  1. Using the recommended if-with-initializer syntax
  2. Throwing original return codes for better error context
  3. Using clp::Array for safe buffer management

31-50: 🛠️ Refactor suggestion

Add file size validation to prevent excessive memory allocation.

While the implementation follows good practices for error handling and buffer management, it lacks size validation before allocating memory for the file contents.

 auto ArchiveMetadata::create_from_file_reader(FileReader& file_reader) -> ArchiveMetadata {
     struct stat file_stat{};
     if (auto const clp_rc = file_reader.try_fstat(file_stat);
         clp::ErrorCode::ErrorCode_Success != clp_rc)
     {
         throw OperationFailed(clp_rc, __FILENAME__, __LINE__);
     }
 
+    // Validate file size to prevent excessive memory allocation
+    constexpr size_t max_metadata_size = 1024 * 1024; // 1MB should be more than enough for metadata
+    if (file_stat.st_size <= 0 || file_stat.st_size > max_metadata_size) {
+        throw OperationFailed(ErrorCode_BadParam, __FILENAME__, __LINE__);
+    }
+
     clp::Array<char> buf(file_stat.st_size);

Likely invalid or redundant comment.

@davemarco davemarco requested a review from haiqi96 January 30, 2025 15:55
Copy link
Contributor Author

@davemarco davemarco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to pass in file path instead of fileReader @haiqi96

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (3)
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (1)

16-16: Add documentation for the compression type constant.

Add a comment to explain the purpose and usage of this constant.

Apply this diff to add documentation:

+/**
+ * Constant representing ZSTD compression type for archive metadata.
+ */
 static constexpr std::string_view cCompressionTypeZstd = "ZSTD";
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (2)

5-5: Remove unused include.

The fmt/core.h header appears to be unused in the current implementation.

-#include <fmt/core.h>

65-68: Optimize string buffer allocation.

Consider reserving space in the string buffer to avoid reallocations during serialization.

     std::ostringstream buf;
+    // Reserve approximate size based on typical metadata size
+    buf.str().reserve(1024);
     msgpack::pack(buf, *this);
     auto const& string_buf = buf.str();
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 72452c1 and aafee6e.

📒 Files selected for processing (3)
  • components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (3 hunks)
  • components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (5 hunks)
  • components/core/src/clp/streaming_archive/reader/Archive.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • components/core/src/clp/streaming_archive/reader/Archive.cpp
🧰 Additional context used
📓 Path-based instructions (2)
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (2)
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (1)
Learnt from: haiqi96
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.cpp:29-46
Timestamp: 2025-01-29T22:16:52.665Z
Learning: In C++ code, when throwing exceptions, prefer throwing the original return code (RC) from lower-level operations instead of a generic error code to preserve error context and aid in debugging.
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (1)
Learnt from: davemarco
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.hpp:122-132
Timestamp: 2025-01-29T21:09:06.878Z
Learning: The `creation_idx` field in `ArchiveMetadata` is used for tracking archive order in split archives. When archives aren't split, it can be safely initialized to zero.
⏰ Context from checks skipped due to timeout of 90000ms (12)
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-focal-static-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-focal-dynamic-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: build-macos (macos-13, false)
  • GitHub Check: lint-check (ubuntu-latest)
  • GitHub Check: build-macos (macos-13, true)
  • GitHub Check: build (macos-latest)
🔇 Additional comments (12)
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (2)

32-52: 🛠️ Refactor suggestion

Add file size validation to prevent excessive memory allocation.

The implementation should validate the file size before allocating memory for the buffer.

Apply this diff to add file size validation:

 auto ArchiveMetadata::create_from_file(std::string_view path) -> ArchiveMetadata {
     auto file_reader = FileReader(std::string(path));
     struct stat file_stat{};
     if (auto const clp_rc = file_reader.try_fstat(file_stat);
         clp::ErrorCode::ErrorCode_Success != clp_rc)
     {
         throw OperationFailed(clp_rc, __FILENAME__, __LINE__);
     }
 
+    // Validate file size to prevent excessive memory allocation
+    constexpr size_t max_metadata_size = 1024 * 1024; // 1MB should be more than enough for metadata
+    if (file_stat.st_size <= 0 || file_stat.st_size > max_metadata_size) {
+        throw OperationFailed(ErrorCode_BadParam, __FILENAME__, __LINE__);
+    }
+
     clp::Array<char> buf(file_stat.st_size);
     if (auto const clp_rc = file_reader.try_read_exact_length(buf.data(), buf.size());
         clp::ErrorCode::ErrorCode_Success != clp_rc)
     {
         throw OperationFailed(clp_rc, __FILENAME__, __LINE__);
     }

Likely invalid or redundant comment.


32-46: 🛠️ Refactor suggestion

Add file size validation to prevent excessive memory allocation.

While the implementation correctly handles file operations, it should validate the file size before allocating memory.

     if (auto const clp_rc = file_reader.try_fstat(file_stat);
         clp::ErrorCode::ErrorCode_Success != clp_rc)
     {
         throw OperationFailed(clp_rc, __FILENAME__, __LINE__);
     }
 
+    // Validate file size to prevent excessive memory allocation
+    constexpr size_t max_metadata_size = 1024 * 1024; // 1MB should be sufficient for metadata
+    if (file_stat.st_size <= 0 || file_stat.st_size > max_metadata_size) {
+        throw OperationFailed(ErrorCode_BadParam, __FILENAME__, __LINE__);
+    }
+
     clp::Array<char> buf(file_stat.st_size);

Likely invalid or redundant comment.

components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (10)

37-40: LGTM!

The documentation clearly explains why the default constructor is necessary for MessagePack deserialization.


54-64: LGTM!

The documentation is comprehensive and accurately describes the method's behaviour, including all possible exceptions.


99-109: LGTM!

The getter methods are well-implemented and consistent with the existing codebase style.


118-123: LGTM!

The documentation clearly describes the method's purpose and potential exceptions.


126-137: LGTM!

The MessagePack map definition includes all necessary fields for serialization.


5-16: LGTM! Good use of modern C++ features.

The use of std::string_view and constexpr for the compression type constant is appropriate.


37-40: LGTM! Clear documentation.

The comment effectively explains why the default constructor is necessary for MessagePack deserialization.


126-137: LGTM! Complete serialization map.

All necessary fields are included in the MessagePack serialization map, including creation_idx for format consistency.


99-110: LGTM! Consistent getter implementation.

The new getter methods follow the same pattern as existing ones and correctly return const references.


153-155: Consider using std::string instead of std::string_view for member variables.

While std::string_view is efficient, storing it as a member variable can lead to dangling references if the referenced string's lifetime ends. Since these are configuration values that don't change frequently, the overhead of std::string would be negligible.

Comment on lines +65 to +68
std::ostringstream buf;
msgpack::pack(buf, *this);
auto const& string_buf = buf.str();
file_writer.write(string_buf.data(), string_buf.size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add error handling for the write operation.

The write operation should be checked for errors to ensure the data is written successfully.

Apply this diff to add error handling:

 void ArchiveMetadata::write_to_file(FileWriter& file_writer) const {
     std::ostringstream buf;
     msgpack::pack(buf, *this);
     auto const& string_buf = buf.str();
-    file_writer.write(string_buf.data(), string_buf.size());
+    if (auto const clp_rc = file_writer.try_write(string_buf.data(), string_buf.size());
+        clp::ErrorCode::ErrorCode_Success != clp_rc)
+    {
+        throw OperationFailed(clp_rc, __FILENAME__, __LINE__);
+    }
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
std::ostringstream buf;
msgpack::pack(buf, *this);
auto const& string_buf = buf.str();
file_writer.write(string_buf.data(), string_buf.size());
std::ostringstream buf;
msgpack::pack(buf, *this);
auto const& string_buf = buf.str();
if (auto const clp_rc = file_writer.try_write(string_buf.data(), string_buf.size());
clp::ErrorCode::ErrorCode_Success != clp_rc)
{
throw OperationFailed(clp_rc, __FILENAME__, __LINE__);
}

Comment on lines 153 to 155
std::string_view m_variable_encoding_methods_version{ffi::cVariableEncodingMethodsVersion};
std::string_view m_variables_schema_version{ffi::cVariablesSchemaVersion};
std::string_view m_compression_type{cCompressionTypeZstd};
Copy link
Contributor

@coderabbitai coderabbitai bot Jan 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Use std::string instead of std::string_view for member variables.

Using std::string_view for member variables that store string literals is risky as it doesn't own the string data. If the string literal's lifetime isn't guaranteed, this could lead to dangling references.

Apply this diff to use std::string:

-    std::string_view m_variable_encoding_methods_version{ffi::cVariableEncodingMethodsVersion};
-    std::string_view m_variables_schema_version{ffi::cVariablesSchemaVersion};
-    std::string_view m_compression_type{cCompressionTypeZstd};
+    std::string m_variable_encoding_methods_version{std::string(ffi::cVariableEncodingMethodsVersion)};
+    std::string m_variables_schema_version{std::string(ffi::cVariablesSchemaVersion)};
+    std::string m_compression_type{std::string(cCompressionTypeZstd)};
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
std::string_view m_variable_encoding_methods_version{ffi::cVariableEncodingMethodsVersion};
std::string_view m_variables_schema_version{ffi::cVariablesSchemaVersion};
std::string_view m_compression_type{cCompressionTypeZstd};
std::string m_variable_encoding_methods_version{std::string(ffi::cVariableEncodingMethodsVersion)};
std::string m_variables_schema_version{std::string(ffi::cVariablesSchemaVersion)};
std::string m_compression_type{std::string(cCompressionTypeZstd)};

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stings are constexpression

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davemarco Ah, you're absolutely right. Since these are constexpr strings with static storage duration, using std::string_view is actually the correct and more efficient approach. I apologize for the confusion in my previous review.


✏️ Learnings added
Learnt from: davemarco
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.hpp:153-155
Timestamp: 2025-01-30T19:26:33.726Z
Learning: When working with constexpr strings (string literals with static storage duration), std::string_view is the preferred choice for member variables as it's more efficient and safe, avoiding unnecessary memory allocations.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

haiqi96
haiqi96 previously approved these changes Jan 30, 2025
Copy link
Contributor

@haiqi96 haiqi96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. as long as you ensure that the PR is tested.

for title, how about

feat(core-clp)!: Migrate archive metadata to MessagePack format and add new archive encoding fields.

@haiqi96 haiqi96 changed the title feat(clp)!: Migrate archive metadata format to MessagePack and add new archive encoding fields. feat(clp-text)!: Migrate archive metadata format to MessagePack and add new archive encoding fields. Jan 30, 2025
@haiqi96 haiqi96 changed the title feat(clp-text)!: Migrate archive metadata format to MessagePack and add new archive encoding fields. feat(clp)!: Migrate archive metadata format to MessagePack and add new archive encoding fields. Jan 30, 2025
@davemarco
Copy link
Contributor Author

davemarco commented Jan 30, 2025

I lightly tested at beginning, but will do more extensive tests before merging. Like test clg, and run unit tests.

Also I think @kirkrodrigues, may want to lightly review before merge? If not let us know

@davemarco davemarco changed the title feat(clp)!: Migrate archive metadata format to MessagePack and add new archive encoding fields. feat(core-clp)!: Migrate archive metadata format to MessagePack and add new archive encoding fields. Jan 31, 2025
Comment on lines 38 to 45
throw OperationFailed(clp_rc, __FILENAME__, __LINE__);
}

clp::Array<char> buf(file_stat.st_size);
if (auto const clp_rc = file_reader.try_read_exact_length(buf.data(), buf.size());
clp::ErrorCode::ErrorCode_Success != clp_rc)
{
throw OperationFailed(clp_rc, __FILENAME__, __LINE__);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, exceptions should only be used for exceptional circumstances. I don't think corrupt data is exceptional since it can occur under normal circumstances (e.g., an interrupted download, a drive failure, etc.). Thus, I think this method should return these errors as error codes. To support a return value that can be an error code OR an object, you can use outcome. See clp::ffi::get_schema_subtree_bitmap for an example.

Copy link
Contributor Author

@davemarco davemarco Feb 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is called in a try-catch block, where any exception reading metadata is critical and re thrown. I think it would look weird to return std::result here , then throw an exception if there is an error, inside the existing try-catch block. It may be possible to get rid of the try-catch block entirely, and just use std::result, but then any undocumented exceptions in msgpack/file_reader may be caught higher up in code, and not in metadata handler like they are now. Lmk ur thoughts

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. So from a guideline perspective, we should:

  1. Catch the MessagePack exceptions and return an error
    1. I think the FileReader exceptions should be truly exception, so we can leave them as exceptions for the top-layer of code to catch; but if not, then we should ideally use calls that return an error.
  2. Remove the external try-catch and change it into an error check
  3. Change Archive::open to return an error

I think (1) & (2) is doable in this PR (but feel free to disagree if it seems like too much code to change). (3) will probably be larger than the scope of this PR. So if we just do (1) & (2), to be able to propagate the error to the caller, I'd throw an exception.

Let's just make sure we're on the same page with the above before we go ahead with the change (don't want you to have to change too much).

Copy link
Contributor Author

@davemarco davemarco Feb 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I am still a little suspect about using std::result now that i have looked at a bit more. We need to use since this function returns ArchiveMetadata. We don't have custom error codes defined in this namespace yet (see Zhihao's custom codes), and dosen't look like the standard clp error codes have been migrated? It may be out-of-scope of this PR to add these new error codes, unless you want to do a new PR first, to add support for new codes into this class, or migrate clp codes or something else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can use regular error code, and pass in a pointer to default constructed ArchivedMetadata, but that is a bit gross

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh, we usually use the generic error codes (I gave a bad example, lol), but I guess since it isn't clean, let's leave it for now. I'll review the rest of the PR when I get a chance today.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (1)

64-69: ⚠️ Potential issue

Add error handling for the write operation.

The write operation should check for errors to ensure data is written successfully.

     std::ostringstream buf;
     msgpack::pack(buf, *this);
     auto const& string_buf = buf.str();
-    file_writer.write(string_buf.data(), string_buf.size());
+    if (auto const clp_rc = file_writer.try_write(string_buf.data(), string_buf.size());
+        clp::ErrorCode::ErrorCode_Success != clp_rc)
+    {
+        throw OperationFailed(clp_rc, __FILENAME__, __LINE__);
+    }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f19a7ed and 1b1b290.

📒 Files selected for processing (3)
  • components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (3 hunks)
  • components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (4 hunks)
  • components/core/src/clp/streaming_archive/reader/Archive.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • components/core/src/clp/streaming_archive/reader/Archive.cpp
  • components/core/src/clp/streaming_archive/ArchiveMetadata.hpp
🧰 Additional context used
📓 Path-based instructions (1)
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (1)
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (1)
Learnt from: haiqi96
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.cpp:29-46
Timestamp: 2025-01-29T22:16:52.665Z
Learning: In C++ code, when throwing exceptions, prefer throwing the original return code (RC) from lower-level operations instead of a generic error code to preserve error context and aid in debugging.
⏰ Context from checks skipped due to timeout of 90000ms (12)
  • GitHub Check: ubuntu-focal-static-linked-bins
  • GitHub Check: ubuntu-focal-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: build-macos (macos-13, false)
  • GitHub Check: lint-check (ubuntu-latest)
  • GitHub Check: build (macos-latest)
  • GitHub Check: build-macos (macos-13, true)
🔇 Additional comments (2)
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (2)

3-9: LGTM! All new includes are necessary.

The added includes support the new MessagePack serialization functionality and file operations.


41-46: LGTM! Proper error handling with original return codes.

The implementation correctly propagates the original error codes from lower-level operations, which aids in debugging.

Comment on lines 32 to 53
auto ArchiveMetadata::create_from_file(std::string_view file_path) -> ArchiveMetadata {
FileReader file_reader{std::string(file_path)};
struct stat file_stat{};
if (auto const clp_rc = file_reader.try_fstat(file_stat);
clp::ErrorCode::ErrorCode_Success != clp_rc)
{
throw OperationFailed(clp_rc, __FILENAME__, __LINE__);
}

clp::Array<char> buf{static_cast<size_t>(file_stat.st_size)};
if (auto const clp_rc = file_reader.try_read_exact_length(buf.data(), buf.size());
clp::ErrorCode::ErrorCode_Success != clp_rc)
{
throw OperationFailed(clp_rc, __FILENAME__, __LINE__);
}

ArchiveMetadata metadata;
msgpack::object_handle const obj_handle = msgpack::unpack(buf.data(), buf.size());
msgpack::object const obj = obj_handle.get();
obj.convert(metadata);
return metadata;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add file size validation to prevent excessive memory allocation.

Consider adding a size limit check before allocating memory for the buffer.

     if (auto const clp_rc = file_reader.try_fstat(file_stat);
         clp::ErrorCode::ErrorCode_Success != clp_rc)
     {
         throw OperationFailed(clp_rc, __FILENAME__, __LINE__);
     }
 
+    constexpr size_t max_metadata_size = 1024 * 1024; // 1MB limit for metadata
+    if (file_stat.st_size <= 0 || file_stat.st_size > max_metadata_size) {
+        throw OperationFailed(ErrorCode_BadParam, __FILENAME__, __LINE__);
+    }
+
     clp::Array<char> buf{static_cast<size_t>(file_stat.st_size)};
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
auto ArchiveMetadata::create_from_file(std::string_view file_path) -> ArchiveMetadata {
FileReader file_reader{std::string(file_path)};
struct stat file_stat{};
if (auto const clp_rc = file_reader.try_fstat(file_stat);
clp::ErrorCode::ErrorCode_Success != clp_rc)
{
throw OperationFailed(clp_rc, __FILENAME__, __LINE__);
}
clp::Array<char> buf{static_cast<size_t>(file_stat.st_size)};
if (auto const clp_rc = file_reader.try_read_exact_length(buf.data(), buf.size());
clp::ErrorCode::ErrorCode_Success != clp_rc)
{
throw OperationFailed(clp_rc, __FILENAME__, __LINE__);
}
ArchiveMetadata metadata;
msgpack::object_handle const obj_handle = msgpack::unpack(buf.data(), buf.size());
msgpack::object const obj = obj_handle.get();
obj.convert(metadata);
return metadata;
}
auto ArchiveMetadata::create_from_file(std::string_view file_path) -> ArchiveMetadata {
FileReader file_reader{std::string(file_path)};
struct stat file_stat{};
if (auto const clp_rc = file_reader.try_fstat(file_stat);
clp::ErrorCode::ErrorCode_Success != clp_rc)
{
throw OperationFailed(clp_rc, __FILENAME__, __LINE__);
}
constexpr size_t max_metadata_size = 1024 * 1024; // 1MB limit for metadata
if (file_stat.st_size <= 0 || file_stat.st_size > max_metadata_size) {
throw OperationFailed(ErrorCode_BadParam, __FILENAME__, __LINE__);
}
clp::Array<char> buf{static_cast<size_t>(file_stat.st_size)};
if (auto const clp_rc = file_reader.try_read_exact_length(buf.data(), buf.size());
clp::ErrorCode::ErrorCode_Success != clp_rc)
{
throw OperationFailed(clp_rc, __FILENAME__, __LINE__);
}
ArchiveMetadata metadata;
msgpack::object_handle const obj_handle = msgpack::unpack(buf.data(), buf.size());
msgpack::object const obj = obj_handle.get();
obj.convert(metadata);
return metadata;
}

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (2)

51-61: Validate path parameters and file accessibility.

While the documentation states that create_from_file throws exceptions if the file cannot be read, consider verifying that file_path is not empty or invalid. This can provide more descriptive errors or early detection of misconfiguration.


102-106: Ensure documentation consistency for error cases.

If write_to_file can throw due to I/O failures, consider documenting such potential exceptions or clarifying that FileWriter handles them elsewhere.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1b1b290 and 202c4ee.

📒 Files selected for processing (2)
  • components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (3 hunks)
  • components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • components/core/src/clp/streaming_archive/ArchiveMetadata.cpp
🧰 Additional context used
📓 Path-based instructions (1)
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (1)
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (2)
Learnt from: davemarco
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.hpp:153-155
Timestamp: 2025-01-30T19:26:33.869Z
Learning: When working with constexpr strings (string literals with static storage duration), std::string_view is the preferred choice for member variables as it's more efficient and safe, avoiding unnecessary memory allocations.
Learnt from: davemarco
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.hpp:122-132
Timestamp: 2025-01-29T21:09:06.878Z
Learning: The `creation_idx` field in `ArchiveMetadata` is used for tracking archive order in split archives. When archives aren't split, it can be safely initialized to zero.
⏰ Context from checks skipped due to timeout of 90000ms (12)
  • GitHub Check: ubuntu-focal-static-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-focal-dynamic-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: build-macos (macos-13, false)
  • GitHub Check: lint-check (ubuntu-latest)
  • GitHub Check: build-macos (macos-13, true)
  • GitHub Check: lint-check (macos-latest)
  • GitHub Check: build (macos-latest)
🔇 Additional comments (3)
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (3)

5-7: No issues found with the newly added includes.

Including <string_view> and <msgpack.hpp> is consistent with the new usage of string views and MessagePack functionalities.


33-35: Good addition of the default constructor.

Having a default constructor is necessary for msgpack conversion. This ensures safe deserialisation into an ArchiveMetadata object without imposing extra burdens on the caller.


49-49: No substantive changes.

This is merely a comment line update with no functional impact.

kirkrodrigues
kirkrodrigues previously approved these changes Feb 4, 2025
Copy link
Member

@kirkrodrigues kirkrodrigues left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the PR title, how about:

feat(core-clp)!: Migrate archive metadata file format to MessagePack.

@davemarco davemarco changed the title feat(core-clp)!: Migrate archive metadata format to MessagePack and add new archive encoding fields. feat(core-clp)!: Migrate archive metadata file format to MessagePack. Feb 4, 2025
@davemarco
Copy link
Contributor Author

davemarco commented Feb 4, 2025

I looked through this again prior to merging and noticed an issue. The compressed size for the metadata itself is calculated based on old metadata size. It is not the easiest thing to fix.

I propose 3 options:

Option A

  1. Modify m_compressed_size to m_compressed_size_without_metadata
  2. Add new function get_metadata_size() which serializes to MsgPack, and gets the current size
  3. Modify get_compressed_size_bytes() to return m_compressed_size_without_metadata + m_dynamic_compressed_size + get_metadata_size()

Option B
Just drop this PR. Is it not entirely necessary for SFA compatibility.

Option C
Exclude metadata size from compressed size

@kirkrodrigues
Copy link
Member

How about let's only add the metadata file's size to m_compressed_size inside write_to_file?

@davemarco
Copy link
Contributor Author

davemarco commented Feb 4, 2025

How about let's only add the metadata file's size to m_compressed_size inside write_to_file?

write_to_file is actually called whenever a segment is closed, so they would accumulate.

Even if called only once, or some behavior to handle that issue, there could also be issues with global_metadatadb archive update. Say we update global metadataDB with archive size before actually writting metadata to file.

@kirkrodrigues

@kirkrodrigues
Copy link
Member

How about let's only add the metadata file's size to m_compressed_size inside write_to_file?

write_to_file is actually called whenever a segment is closed, so they would accumulate.

Can we do the following?

  • Add a new member that just holds the size of the metadata file and which is set only in write_to_file.
  • Update get_compressed_size_bytes to include this new member.

Even if called only once, or some behavior to handle that issue, there could also be issues with global_metadatadb update. Say we update global metadataDB before actually writing to file.

  • I think it would be fine for the size in the global metadata database to only reflect the size on disk, i.e., post write?
  • We should generally ensure that the global metadata is updated after writes to file. If we're updating the metadata before something is on disk, that is a bug and should be fixed.

@davemarco
Copy link
Contributor Author

^ i think the above should work. ill let u know if any issues, will request ur review when updated

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (3)

61-61: Consider using a more descriptive parameter name.

The parameter name file_path would be more descriptive than just path.

-[[nodiscard]] static auto create_from_file(std::string_view file_path) -> ArchiveMetadata;
+[[nodiscard]] static auto create_from_file(std::string_view archive_file_path) -> ArchiveMetadata;

102-107: Add exception documentation.

Document potential exceptions that could be thrown during MessagePack serialization or file writing.

     /**
      * Packs this instance into a MessagePack object and writes it to the open file.
      *
      * @param file_writer
+     * @throw msgpack::type_error if serialization fails
+     * @throw FileWriter::OperationFailed if writing to file fails
      */

131-131: Add documentation for m_metadata_size.

The purpose and usage of this new member variable should be documented.

     uint64_t m_compressed_size{0};
+    // The size of the serialized metadata in bytes
     uint64_t m_metadata_size{0};
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 202c4ee and a885dc8.

📒 Files selected for processing (2)
  • components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (3 hunks)
  • components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • components/core/src/clp/streaming_archive/ArchiveMetadata.cpp
🧰 Additional context used
📓 Path-based instructions (1)
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (1)
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (2)
Learnt from: davemarco
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.hpp:153-155
Timestamp: 2025-01-30T19:26:33.869Z
Learning: When working with constexpr strings (string literals with static storage duration), std::string_view is the preferred choice for member variables as it's more efficient and safe, avoiding unnecessary memory allocations.
Learnt from: davemarco
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.hpp:122-132
Timestamp: 2025-01-29T21:09:06.878Z
Learning: The `creation_idx` field in `ArchiveMetadata` is used for tracking archive order in split archives. When archives aren't split, it can be safely initialized to zero.
🔇 Additional comments (3)
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (3)

33-35: LGTM! Clear explanation for the default constructor.

The comment effectively explains why a default constructor is needed for MessagePack deserialization.


109-117: LGTM! MessagePack serialization map is complete.

All essential metadata fields are included in the serialization map.


7-7: 🛠️ Refactor suggestion

Use angle brackets for library includes.

Library headers should use angle brackets instead of quotes.

-#include "msgpack.hpp"
+#include <msgpack.hpp>

Likely invalid or redundant comment.

@davemarco
Copy link
Contributor Author

davemarco commented Feb 4, 2025

@kirkrodrigues I did the fix.

Note I made a slight change. I did not

Update get_compressed_size_bytes to include this new member.

Instead I set
m_compressed_size = m_compressed_size - previous_metadata_size + m_metadata_size
in write_to_file

Therefore m_compressed_size is always inclusive of m_metadata_size.

I did this because MsgPack will only take one variable m_compressed_size to serialize, it will not accept m_compressed_size + m_metadata_size

kirkrodrigues
kirkrodrigues previously approved these changes Feb 5, 2025
@davemarco
Copy link
Contributor Author

davemarco commented Feb 5, 2025

Hey I thought about this more yesterday night, and I think your idea is clever , but still a bit suspicious. Like fundamentally there is a chicken & egg problem, where you don't know the size of the metadata until after you write it, making it difficult to include the metadata_size in the msgpack metadata itself.

There are some weird effects:

  1. If you update the metadata, the size is only updated the next time you write it. So the size in the metadata file, is potentially always slightly off. Maybe we can live with this, but it is still a bit strange.

  2. Currently the value written to metadata file, and the size in the globalDB are 1 step out of sync. Like the globalDB gets the newer value before the metadata itself. I can fix this, so the globalDB also uses the old value, but thought I would mention.

I am still partial to Option C - just dropping the metadata from the compressed size.
We could also stick with the current approach Option "K"
Option B - old fixed-size metadata is also still available. Then try to come up with a better solution at a later date.
There could be a fancier approach (Option D) where we add some sort of fixed size header to store the msgpack size, after it's written. But that will take some design effort. Note we will have to be careful here to ensure compatability with private branch.

Let me know your thoughts.
@kirkrodrigues

@davemarco
Copy link
Contributor Author

Discussed offline new plan to remove metadata size from compressed size. Made changes

Copy link
Member

@kirkrodrigues kirkrodrigues left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved pending Haiqi's final approval.

Copy link
Contributor

@haiqi96 haiqi96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quickly took a look. The final version looks reasonable.

@davemarco davemarco merged commit 9efb1bc into y-scope:main Feb 5, 2025
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants