clp-s: Add support for serializing structured arrays. #413
Can you show the logs you are using for testing? It would be better if they cover all structured array cases in generate_structured_array_template, generate_structured_object_template, and find_intersection_and_fix_brackets.
/**
 * @param schema
 * @return the first column ID found in the given schema, or -1 if the schema contains no
 * columns
 */
static inline int32_t get_first_column_in_span(std::span<int32_t> schema);

void find_intersection_and_fix_brackets(
Do you want to add a description for this method?
int32_t global_child_id = m_local_id_to_global_id[child_id];
auto structured_it = m_global_id_to_unordered_object.find(global_child_id);
if (m_global_id_to_unordered_object.end() != structured_it) {
    size_t column_start = structured_it->second.first;
    std::span<int32_t> structured_schema = structured_it->second.second;
    generate_structured_array_template(
            global_child_id,
int32_t global_child_id = m_local_id_to_global_id[child_id];
auto structured_it = m_global_id_to_unordered_object.find(global_child_id);
if (m_global_id_to_unordered_object.end() != structured_it) {
    size_t column_start = structured_it->second.first;
    std::span<int32_t> structured_schema = structured_it->second.second;
    generate_structured_array_template(
            global_child_id,

auto structured_it = m_global_id_to_unordered_object.find(child_global_id);
if (m_global_id_to_unordered_object.end() != structured_it) {
    size_t column_start = structured_it->second.first;
    std::span<int32_t> structured_schema = structured_it->second.second;
    generate_structured_array_template(
            child_global_id,
 * @param id
 * @param column_start
 * @param schema
 */
What about the return value?
 * @param id
 * @param column_start
 * @param schema
 */
Same as the last comment
int32_t cur_root,
int32_t next_root,
int32_t cur_root,
int32_t next_root,

int32_t cur_node_id,
int32_t next_node_id,
I think I'd prefer to keep the cur_root/next_root names. They carry some meaning because they're supposed to indicate the root of the subtree for which we've pushed all of the required keys/brackets to m_json_serializer before the field we are about to push. I'm open to other names that carry meaning though.
Got it. Personally I prefer using ...id for int32_t node IDs as we extensively use ...column_id, ...child_id elsewhere. But it's fine to leave them as they are since it seems that we are following ...node for SchemaNode and ...root for int32_t.
Right now I'm using
which covers all of the cases for structured array and find_intersection_and_fix_brackets, and decompresses to the original input byte-for-byte.
} else {
    cur_root = cur_node->get_parent_id();
    cur_node = &m_global_schema_tree->get_node(cur_root);
    m_json_serializer.add_op(JsonSerializer::Op::EndObject);
    path_to_intersection.push_back(next_root);
    next_root = next_node->get_parent_id();
    next_node = &m_global_schema_tree->get_node(next_root);
}
What if cur_node and next_node have the same parent? The code will never go into this branch. For example, after decompression, the log {"a": [1, {"b": {"c":3}, "d": [4]}]} becomes {"a":[1,{"b":{"c":3,4]}]}.
This case works on the most recent commit -- I pushed a fix for this issue when I re-requested review. I added a code block after the loop that checks if cur_node and next_node are different, and if so fixes the last bracket and adds the last node to the path.
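To make the sibling case concrete, here is a minimal sketch of the walk-to-intersection idea with the post-loop check described above. It is not the code from this PR: `Node`, `BracketFix`, `node_at`, and `find_intersection` are hypothetical names, the tree is a plain vector of parent/depth pairs, and "closing a bracket" is reduced to a counter.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a schema tree node; only parent ID and depth matter here.
struct Node {
    int32_t parent;  // -1 for the root
    int32_t depth;   // 0 for the root
};

// Result of the walk: how many brackets to close on the old side, and which
// nodes to open on the new side (outermost first).
struct BracketFix {
    int32_t num_closes{0};
    std::vector<int32_t> nodes_to_open;
};

static Node const& node_at(std::vector<Node> const& tree, int32_t id) {
    assert(id >= 0 && static_cast<size_t>(id) < tree.size());
    return tree[static_cast<size_t>(id)];
}

BracketFix find_intersection(std::vector<Node> const& tree, int32_t cur_root, int32_t next_root) {
    BracketFix fix;
    // Walk both sides up until they share a parent.
    while (node_at(tree, cur_root).parent != node_at(tree, next_root).parent) {
        int32_t const cur_depth = node_at(tree, cur_root).depth;
        int32_t const next_depth = node_at(tree, next_root).depth;
        if (cur_depth >= next_depth) {
            // Leaving cur_root's subtree: one bracket to close.
            ++fix.num_closes;
            cur_root = node_at(tree, cur_root).parent;
        }
        if (cur_depth <= next_depth) {
            // next_root must be opened later, once the intersection is known.
            fix.nodes_to_open.push_back(next_root);
            next_root = node_at(tree, next_root).parent;
        }
    }
    // The loop also exits when the two nodes are distinct siblings. Without
    // this check the old sibling is never closed and the new one never opened.
    if (cur_root != next_root) {
        ++fix.num_closes;
        fix.nodes_to_open.push_back(next_root);
    }
    std::reverse(fix.nodes_to_open.begin(), fix.nodes_to_open.end());
    return fix;
}
```

For the log in the question, with "b" and "d" as sibling nodes under the same object, the loop body never runs, so the post-loop check is the only place where "b" gets closed and "d" gets opened.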
Great work! Just one small thing.
Co-authored-by: wraymo <[email protected]>
append_unordered_reader_columns(
        m_schema_reader,
        column_id,
        schema.get_view(i, 0),
Do you want to change it to std::span<int32_t>() given it's an empty span?
Co-authored-by: wraymo <[email protected]>
Can you fix the lint error?
The PR title looks good to me.
How about: clp-s: Add support for serializing structured arrays.
Description
This PR implements serialization for structurized arrays. All of the code changes are confined to SchemaReader and JsonSerializer. Most of the heavy lifting to support serializing arrays was implemented in #355, so this change mostly consists of code to walk over the schema for different unordered object types and push JsonSerializer ops as appropriate.
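The op-pushing approach can be sketched roughly as follows. This is an illustrative toy, not the JsonSerializer from this codebase: `MiniSerializer`, its `Op` values other than `EndObject`, and the `serialize` method are assumptions made for the example.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical op set; only EndObject is named in the diff excerpts above.
enum class Op { BeginObject, EndObject, BeginArray, EndArray, Key, Value };

class MiniSerializer {
public:
    // Records an op; `text` carries the key name or the already-encoded value.
    void add_op(Op op, std::string text = {}) { m_ops.emplace_back(op, std::move(text)); }

    // Replays the recorded ops into a JSON string, inserting commas as needed.
    std::string serialize() const {
        std::string out;
        bool need_comma = false;
        for (auto const& [op, text] : m_ops) {
            switch (op) {
                case Op::BeginObject:
                case Op::BeginArray:
                    if (need_comma) { out += ','; }
                    out += (Op::BeginObject == op) ? '{' : '[';
                    need_comma = false;
                    break;
                case Op::EndObject:
                case Op::EndArray:
                    out += (Op::EndObject == op) ? '}' : ']';
                    need_comma = true;
                    break;
                case Op::Key:
                    if (need_comma) { out += ','; }
                    out += '"' + text + "\":";
                    need_comma = false;
                    break;
                case Op::Value:
                    if (need_comma) { out += ','; }
                    out += text;
                    need_comma = true;
                    break;
            }
        }
        return out;
    }

private:
    std::vector<std::pair<Op, std::string>> m_ops;
};
```

Because the output is a pure replay of the op sequence, a missing close or open op shows up directly in the text: dropping the EndObject for "b" and the Key/BeginArray for "d" from the sequence for {"a":[1,{"b":{"c":3},"d":[4]}]} yields the malformed {"a":[1,{"b":{"c":3,4]}]} reported in the review.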
Validation performed