feat(core-clp)!: Migrate archive metadata file format to MessagePack. #700

Merged
merged 14 commits into from
Feb 5, 2025
47 changes: 30 additions & 17 deletions components/core/src/clp/streaming_archive/ArchiveMetadata.cpp
@@ -1,5 +1,11 @@
#include "ArchiveMetadata.hpp"

#include <sys/stat.h>

#include <fmt/core.h>

#include "../Array.hpp"

namespace clp::streaming_archive {
ArchiveMetadata::ArchiveMetadata(
archive_format_version_t archive_format_version,
@@ -22,15 +28,26 @@ ArchiveMetadata::ArchiveMetadata(
+ sizeof(m_end_timestamp) + sizeof(m_compressed_size);
}

ArchiveMetadata::ArchiveMetadata(FileReader& file_reader) {
file_reader.read_numeric_value(m_archive_format_version, false);
file_reader.read_numeric_value(m_creator_id_len, false);
file_reader.read_string(m_creator_id_len, m_creator_id, false);
file_reader.read_numeric_value(m_creation_idx, false);
file_reader.read_numeric_value(m_uncompressed_size, false);
file_reader.read_numeric_value(m_compressed_size, false);
file_reader.read_numeric_value(m_begin_timestamp, false);
file_reader.read_numeric_value(m_end_timestamp, false);
auto ArchiveMetadata::create_from_file_reader(FileReader& file_reader) -> ArchiveMetadata {
struct stat file_stat{};
if (auto const clp_rc = file_reader.try_fstat(file_stat);
clp::ErrorCode::ErrorCode_Success != clp_rc)
{
throw OperationFailed(clp_rc, __FILENAME__, __LINE__);
}

clp::Array<char> buf(file_stat.st_size);
if (auto const clp_rc = file_reader.try_read_exact_length(buf.data(), buf.size());
clp::ErrorCode::ErrorCode_Success != clp_rc)
{
throw OperationFailed(clp_rc, __FILENAME__, __LINE__);
Member

Generally, exceptions should only be used for exceptional circumstances. I don't think corrupt data is exceptional since it can occur under normal circumstances (e.g., an interrupted download, a drive failure, etc.). Thus, I think this method should return these errors as error codes. To support a return value that can be an error code OR an object, you can use outcome. See clp::ffi::get_schema_subtree_bitmap for an example.
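As a rough illustration of the "error code OR object" return value described in the comment above, the sketch below uses `std::variant` from the standard library rather than outcome, so it is not the API the reviewer refers to. `Metadata`, `parse_metadata`, and `creator_or_default` are hypothetical names, and treating an empty buffer as corrupt input is an assumption made only to keep the example small.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <system_error>
#include <variant>

// Hypothetical stand-in for the real metadata class.
struct Metadata {
    std::uint64_t creation_idx{0};
    std::string creator_id;
};

// Holds either a parsed object or an error code, never both.
using MetadataResult = std::variant<Metadata, std::error_code>;

// Hypothetical parser. Assumption: an empty buffer represents corrupt input,
// which is reported through the return value instead of an exception.
inline auto parse_metadata(std::string const& buf) -> MetadataResult {
    if (buf.empty()) {
        return std::make_error_code(std::errc::illegal_byte_sequence);
    }
    return Metadata{1, buf};
}

// The caller inspects the result explicitly; no try-catch is involved.
inline auto creator_or_default(std::string const& buf) -> std::string {
    auto const result = parse_metadata(buf);
    if (auto const* metadata = std::get_if<Metadata>(&result)) {
        return metadata->creator_id;
    }
    return "<unreadable>";
}

inline void example_usage() {
    assert("abc" == creator_or_default("abc"));
    assert("<unreadable>" == creator_or_default(""));
}
```

Calling `example_usage()` from a `main` function exercises both the success and the corrupt-input path.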

Contributor Author

@davemarco davemarco Feb 3, 2025

This function is called in a try-catch block, where any exception from reading the metadata is treated as critical and rethrown. I think it would look odd to return std::result here and then, inside the existing try-catch block, throw an exception if there is an error. It may be possible to get rid of the try-catch block entirely and just use std::result, but then any undocumented exceptions from msgpack or file_reader might be caught higher up in the code, and not in the metadata handler as they are now. Let me know your thoughts.

Member

Gotcha. So from a guideline perspective, we should:

  1. Catch the MessagePack exceptions and return an error
    1. I think the FileReader exceptions should be truly exceptional, so we can leave them as exceptions for the top layer of code to catch; but if not, then we should ideally use calls that return an error.
  2. Remove the external try-catch and change it into an error check
  3. Change Archive::open to return an error

I think (1) & (2) are doable in this PR (but feel free to disagree if it seems like too much code to change). (3) will probably be larger than the scope of this PR. So if we just do (1) & (2), I'd throw an exception to propagate the error to the caller.

Let's just make sure we're on the same page with the above before we go ahead with the change (don't want you to have to change too much).
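Steps (1) and (2) from the list above could look roughly like the following sketch. It does not use msgpack or the project's `FileReader`: `decode_or_throw` is a hypothetical stand-in for a throwing library call, and `DecodeResult`, `try_decode`, and `open_archive` are illustrative names chosen for this example only.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>
#include <system_error>

// Hypothetical decoder that throws on bad input, standing in for a
// third-party library call.
inline auto decode_or_throw(std::string const& buf) -> int {
    if (buf.empty()) {
        throw std::runtime_error("truncated input");
    }
    return static_cast<int>(buf.size());
}

struct DecodeResult {
    std::error_code error;  // Falsy (default-constructed) on success
    int value{0};
};

// Step (1): catch the library's exceptions at the boundary and return an error.
inline auto try_decode(std::string const& buf) -> DecodeResult {
    try {
        return DecodeResult{std::error_code{}, decode_or_throw(buf)};
    } catch (std::exception const&) {
        return DecodeResult{std::make_error_code(std::errc::protocol_error), 0};
    }
}

// Step (2): the caller does an error check instead of wrapping the call in
// try-catch. Until the caller can itself return errors (step (3)), it throws
// to propagate the failure upwards.
inline auto open_archive(std::string const& buf) -> int {
    auto const result = try_decode(buf);
    if (result.error) {
        throw std::system_error(result.error, "failed to read metadata");
    }
    return result.value;
}

inline void example_usage() {
    assert(3 == open_archive("abc"));
    assert(try_decode("").error == std::errc::protocol_error);
}
```

The only `try` block left is the one at the library boundary, so exceptions from the decoder cannot leak past `try_decode`.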

Contributor Author

@davemarco davemarco Feb 3, 2025

In general, I am still a little skeptical about using std::result now that I have looked at it a bit more. We would need to use it since this function returns ArchiveMetadata. We don't have custom error codes defined in this namespace yet (see Zhihao's custom codes), and it doesn't look like the standard clp error codes have been migrated. It may be out of scope for this PR to add these new error codes, unless you want to do a new PR first to add support for new codes in this class, migrate the clp codes, or something else?

Contributor Author

Maybe we can use a regular error code and pass in a pointer to a default-constructed ArchiveMetadata, but that is a bit gross.
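For comparison, the out-parameter alternative mentioned above might look like this minimal sketch. The `ErrorCode` enum, `Metadata` struct, and `try_load_metadata` function are hypothetical and are not the project's actual types; the empty-buffer check is an assumed stand-in for real validation.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical error codes, loosely modelled on a generic error-code enum.
enum class ErrorCode : std::uint8_t {
    Success,
    BadParam,
    Corrupt
};

// Hypothetical stand-in for the real metadata class.
struct Metadata {
    std::string creator_id;
};

// Fills `*out` on success. Assumption: an empty buffer means corrupt input.
inline auto try_load_metadata(std::string const& buf, Metadata* out) -> ErrorCode {
    if (nullptr == out) {
        return ErrorCode::BadParam;
    }
    if (buf.empty()) {
        return ErrorCode::Corrupt;
    }
    out->creator_id = buf;
    return ErrorCode::Success;
}

inline void example_usage() {
    // The caller has to create a placeholder object before the call, and must
    // not use it if the call fails.
    Metadata metadata;
    assert(ErrorCode::Success == try_load_metadata("abc", &metadata));
    assert("abc" == metadata.creator_id);
}
```

The placeholder object is what makes this style awkward: the class needs a public default constructor, and nothing stops a caller from reading the object after a failed call.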

Member

Uh, we usually use the generic error codes (I gave a bad example, lol), but I guess since it isn't clean, let's leave it for now. I'll review the rest of the PR when I get a chance today.

}

ArchiveMetadata metadata;
msgpack::object_handle oh = msgpack::unpack(buf.data(), buf.size());
msgpack::object obj = oh.get();
obj.convert(metadata);
return metadata;
}

void ArchiveMetadata::expand_time_range(epochtime_t begin_timestamp, epochtime_t end_timestamp) {
@@ -43,13 +60,9 @@ void ArchiveMetadata::expand_time_range(epochtime_t begin_timestamp, epochtime_t
}

void ArchiveMetadata::write_to_file(FileWriter& file_writer) const {
file_writer.write_numeric_value(m_archive_format_version);
file_writer.write_numeric_value(m_creator_id_len);
file_writer.write_string(m_creator_id);
file_writer.write_numeric_value(m_creation_idx);
file_writer.write_numeric_value(m_uncompressed_size + m_dynamic_uncompressed_size);
file_writer.write_numeric_value(m_compressed_size + m_dynamic_uncompressed_size);
file_writer.write_numeric_value(m_begin_timestamp);
file_writer.write_numeric_value(m_end_timestamp);
std::ostringstream buf;
msgpack::pack(buf, *this);
auto const& string_buf = buf.str();
file_writer.write(string_buf.data(), string_buf.size());
Comment on lines +53 to +56
Contributor

🛠️ Refactor suggestion

Add error handling for the write operation.

The write operation should be checked for errors to ensure the data is written successfully.

Apply this diff to add error handling:

 void ArchiveMetadata::write_to_file(FileWriter& file_writer) const {
     std::ostringstream buf;
     msgpack::pack(buf, *this);
     auto const& string_buf = buf.str();
-    file_writer.write(string_buf.data(), string_buf.size());
+    if (auto const clp_rc = file_writer.try_write(string_buf.data(), string_buf.size());
+        clp::ErrorCode::ErrorCode_Success != clp_rc)
+    {
+        throw OperationFailed(clp_rc, __FILENAME__, __LINE__);
+    }
 }

}
} // namespace clp::streaming_archive
58 changes: 54 additions & 4 deletions components/core/src/clp/streaming_archive/ArchiveMetadata.hpp
@@ -2,13 +2,19 @@
#define CLP_STREAMING_ARCHIVE_ARCHIVEMETADATA_HPP

#include <cstdint>
#include <string_view>

#include "../Defs.h"
#include "../ffi/encoding_methods.hpp"
#include "../FileReader.hpp"
#include "../FileWriter.hpp"
#include "Constants.hpp"
#include "msgpack.hpp"

namespace clp::streaming_archive {

static constexpr std::string_view cCompressionTypeZstd = "ZSTD";

/**
* A class to encapsulate metadata directly relating to an archive.
*/
@@ -28,6 +34,10 @@ class ArchiveMetadata {
};

// Constructors
// The class must be constructible to convert from msgpack::object.
// https://github.com/msgpack/msgpack-c/wiki/v2_0_cpp_adaptor
ArchiveMetadata() = default;

/**
* Constructs a metadata object with the given parameters
* @param archive_format_version
@@ -40,13 +50,19 @@ class ArchiveMetadata {
uint64_t creation_idx
);

// Methods
/**
* Constructs a metadata object and initializes it from the given file reader
* @param file_reader
* Reads serialized MessagePack data from open file, unpacks it into an
* `ArchiveMetadata` instance.
*
* @param file_reader Reader for the file containing archive metadata.
* @return The created instance.
* @throw `ArchiveMetadata::OperationFailed` if stat or read operation on metadata file fails.
* @throw `msgpack::unpack_error` if data cannot be unpacked into MessagePack object.
* @throw `msgpack::type_error` if MessagePack object can't be converted to `ArchiveMetadata`.
*/
explicit ArchiveMetadata(FileReader& file_reader);
[[nodiscard]] static auto create_from_file_reader(FileReader& file_reader) -> ArchiveMetadata;

// Methods
[[nodiscard]] auto get_archive_format_version() const { return m_archive_format_version; }

[[nodiscard]] auto get_creator_id() const -> std::string const& { return m_creator_id; }
@@ -79,15 +95,46 @@ class ArchiveMetadata {

[[nodiscard]] auto get_end_timestamp() const { return m_end_timestamp; }

[[nodiscard]] auto get_variable_encoding_methods_version() const -> std::string_view const& {
return m_variable_encoding_methods_version;
}

[[nodiscard]] auto get_variables_schema_version() const -> std::string_view const& {
return m_variables_schema_version;
}

[[nodiscard]] auto get_compression_type() const -> std::string_view const& {
return m_compression_type;
}

/**
* Expands the archive's time range based to encompass the given time range
* @param begin_timestamp
* @param end_timestamp
*/
void expand_time_range(epochtime_t begin_timestamp, epochtime_t end_timestamp);

/**
* Packs `ArchiveMetadata` to MessagePack and writes to the open file.
*
* @param file_writer Writer for archive metadata file.
* @throw FileWriter::OperationFailed if failed to write.
*/
void write_to_file(FileWriter& file_writer) const;

MSGPACK_DEFINE_MAP(
MSGPACK_NVP("archive_format_version", m_archive_format_version),
MSGPACK_NVP("variable_encoding_methods_version", m_variable_encoding_methods_version),
MSGPACK_NVP("variables_schema_version", m_variables_schema_version),
MSGPACK_NVP("compression_type", m_compression_type),
MSGPACK_NVP("creator_id", m_creator_id),
MSGPACK_NVP("creation_idx", m_creation_idx),
MSGPACK_NVP("begin_timestamp", m_begin_timestamp),
MSGPACK_NVP("end_timestamp", m_end_timestamp),
MSGPACK_NVP("uncompressed_size", m_uncompressed_size),
MSGPACK_NVP("compressed_size", m_compressed_size)
);

private:
// Variables
archive_format_version_t m_archive_format_version{cArchiveFormatVersion};
@@ -102,6 +149,9 @@ class ArchiveMetadata {
// The size of the archive
uint64_t m_compressed_size{0};
uint64_t m_dynamic_compressed_size{0};
std::string_view m_variable_encoding_methods_version{ffi::cVariableEncodingMethodsVersion};
std::string_view m_variables_schema_version{ffi::cVariablesSchemaVersion};
std::string_view m_compression_type{cCompressionTypeZstd};
Contributor

@coderabbitai coderabbitai bot Jan 30, 2025

🛠️ Refactor suggestion

Use std::string instead of std::string_view for member variables.

Using std::string_view for member variables that store string literals is risky as it doesn't own the string data. If the string literal's lifetime isn't guaranteed, this could lead to dangling references.

Apply this diff to use std::string:

-    std::string_view m_variable_encoding_methods_version{ffi::cVariableEncodingMethodsVersion};
-    std::string_view m_variables_schema_version{ffi::cVariablesSchemaVersion};
-    std::string_view m_compression_type{cCompressionTypeZstd};
+    std::string m_variable_encoding_methods_version{std::string(ffi::cVariableEncodingMethodsVersion)};
+    std::string m_variables_schema_version{std::string(ffi::cVariablesSchemaVersion)};
+    std::string m_compression_type{std::string(cCompressionTypeZstd)};

Contributor Author

The strings are constexpr.

Contributor

@davemarco Ah, you're absolutely right. Since these are constexpr strings with static storage duration, using std::string_view is actually the correct and more efficient approach. I apologize for the confusion in my previous review.
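The lifetime argument in this exchange can be sketched as follows. `cCompressionTypeZstd` mirrors the constant added in the diff; `Holder`, `default_compression_type`, and `owned_copy` are hypothetical names, and the `"LZMA"` value is made up purely for the example.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Mirrors the constant in the diff: the characters live in static storage for
// the whole program, so views of them never dangle.
constexpr std::string_view cCompressionTypeZstd{"ZSTD"};

// Hypothetical, reduced version of a class with a non-owning member.
struct Holder {
    std::string_view compression_type{cCompressionTypeZstd};
};

// Safe: the returned view still points at static storage after the temporary
// `Holder` is destroyed.
inline auto default_compression_type() -> std::string_view {
    return Holder{}.compression_type;
}

// When the source has a shorter lifetime (e.g., a local std::string), an
// owning copy is needed instead of a view.
inline auto owned_copy(std::string_view view) -> std::string {
    return std::string(view);
}

inline void example_usage() {
    assert("ZSTD" == default_compression_type());
    std::string const local{"LZMA"};
    assert("LZMA" == owned_copy(local));
}
```

One caveat worth keeping in mind: the view members stay safe only as long as they are always bound to data with static storage duration. If such a member were ever made to view a buffer with a shorter lifetime, the original concern about dangling references would apply again.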


✏️ Learnings added
Learnt from: davemarco
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.hpp:153-155
Timestamp: 2025-01-30T19:26:33.726Z
Learning: When working with constexpr strings (string literals with static storage duration), std::string_view is the preferred choice for member variables as it's more efficient and safe, avoiding unnecessary memory allocations.


};
} // namespace clp::streaming_archive

@@ -38,7 +38,7 @@ void Archive::open(string const& path) {
archive_format_version_t format_version{};
try {
FileReader file_reader{metadata_file_path};
ArchiveMetadata const metadata{file_reader};
auto const metadata = ArchiveMetadata::create_from_file_reader(file_reader);
Contributor

@haiqi96 haiqi96 Jan 29, 2025

Since the FileReader goes out of scope right after the metadata is created, can the method take metadata_file_path as its parameter and create a FileReader internally?

Edit: Now I see that you are just trying to reuse the original signature, which makes sense. But maybe we can also improve the method interface in this PR, since it's only a few lines of change.

Contributor Author

How do you feel about taking a clp::Array as input instead? It will move some of the code (the read and stat) into archive reader, but then this constructor can also be used for sfa. For sfa, we can't pass in a FileReader, since there is no way to get the size of the metadata by itself within the larger file, and the file path is for the entire sfa.

We could also have two constructors, one for a file path and one for a buffer, which share a method to deserialize from a buffer. Let me know your thoughts.
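The "two entry points sharing one deserialization method" idea could be sketched like this. To stay self-contained it reads from a `std::istream` rather than a file path or the project's `FileReader`, and the "serialized form" is assumed to be just the creator ID; `Metadata`, `create_from_buffer`, and `create_from_stream` are hypothetical names.

```cpp
#include <cassert>
#include <istream>
#include <iterator>
#include <sstream>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical stand-in for the real metadata class. Assumption: the
// "serialized form" is just the creator ID, to keep the decoding trivial.
struct Metadata {
    std::string creator_id;

    // Shared decoding step: works on any in-memory bytes, including a slice of
    // a larger file.
    static auto create_from_buffer(std::string_view buf) -> Metadata {
        if (buf.empty()) {
            throw std::invalid_argument("empty metadata buffer");
        }
        return Metadata{std::string(buf)};
    }

    // Convenience entry point: reads the whole stream, then delegates.
    static auto create_from_stream(std::istream& in) -> Metadata {
        std::string const buf(
                std::istreambuf_iterator<char>{in},
                std::istreambuf_iterator<char>{}
        );
        return create_from_buffer(buf);
    }
};

inline void example_usage() {
    std::istringstream in("abc");
    assert("abc" == Metadata::create_from_stream(in).creator_id);

    // Metadata embedded in a larger blob: pass only the relevant slice.
    std::string_view const blob{"xxabcxx"};
    assert("abc" == Metadata::create_from_buffer(blob.substr(2, 3)).creator_id);
}
```

With this split, a caller that already knows where the metadata sits inside a larger file only needs the buffer overload, while the stream overload covers the standalone metadata file case.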

Contributor

> It will move some of the code (the read and stat) into archive reader, but then this constructor can also be used for sfa.

Is this planned in a future PR? I think it makes sense, but the change you propose might be easier to reason about when we are actually making the changes that include sfa.

Contributor Author

Okay, for now I can take in a file path. I can add the buffer feature later, when the sfa reader is implemented.

format_version = metadata.get_archive_format_version();
} catch (TraceableException& traceable_exception) {
auto error_code = traceable_exception.get_error_code();