Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clp-s: Add support for serializing structured arrays. #413

Merged
merged 11 commits into from
Jun 7, 2024

Conversation

gibber9809
Copy link
Contributor

@gibber9809 gibber9809 commented May 23, 2024

Description

This PR implements serialization for structurized arrays. All of the code changes are relegated to SchemaReader and JsonSerializer. Most of the heavy lifting to support serializing arrays was implemented in #355, so this code change mostly consists of code to walk over the schema for different unordered object types and push JsonSerializer ops as appropriate.

Validation performed

  • Validated that decompression works correctly for several array corner cases
  • Validated that search on arrays returns expected results

@gibber9809 gibber9809 marked this pull request as ready for review May 24, 2024 20:14
@gibber9809 gibber9809 requested a review from wraymo May 24, 2024 20:14
@gibber9809 gibber9809 changed the title [WIP] Implement serialization for structurized arrays. clp-s: Implement serialization for structurized arrays. May 27, 2024
Copy link
Contributor

@wraymo wraymo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you show the logs your are using for testing (It would be better if it covers all structured array cases in generate_structured_array_template, generate_structured_object_template and find_intersection_and_fix_brackets)?

/**
* @param schema
* @return the first column ID found in the given schema, or -1 if the schema contains no
* columns
*/
static inline int32_t get_first_column_in_span(std::span<int32_t> schema);

void find_intersection_and_fix_brackets(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to add a description for this method?

Comment on lines 582 to 588
int32_t global_child_id = m_local_id_to_global_id[child_id];
auto structured_it = m_global_id_to_unordered_object.find(global_child_id);
if (m_global_id_to_unordered_object.end() != structured_it) {
size_t column_start = structured_it->second.first;
std::span<int32_t> structured_schema = structured_it->second.second;
generate_structured_array_template(
global_child_id,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
int32_t global_child_id = m_local_id_to_global_id[child_id];
auto structured_it = m_global_id_to_unordered_object.find(global_child_id);
if (m_global_id_to_unordered_object.end() != structured_it) {
size_t column_start = structured_it->second.first;
std::span<int32_t> structured_schema = structured_it->second.second;
generate_structured_array_template(
global_child_id,
auto structured_it = m_global_id_to_unordered_object.find(child_global_id);
if (m_global_id_to_unordered_object.end() != structured_it) {
size_t column_start = structured_it->second.first;
std::span<int32_t> structured_schema = structured_it->second.second;
generate_structured_array_template(
child_global_id,

* @param id
* @param column_start
* @param schema
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the return value?

* @param id
* @param column_start
* @param schema
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the last comment

Comment on lines +306 to +307
int32_t cur_root,
int32_t next_root,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
int32_t cur_root,
int32_t next_root,
int32_t cur_node_id,
int32_t next_node_id,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd prefer to keep the cur_root/next_root names. They carry some meaning because they're supposed to indicate the root of the subtree for which we've pushed all of the required keys/brackets to m_json_serializer before the field we are about to push. I'm open to other names that carry meaning though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Personally I prefer using...id for int32_t node IDs as we extensively use ...column_id, ...child_id elsewhere. But it's fine to leave them as they are since it seems that we are following ...node for SchemaNode and ...root for int32_t.

@gibber9809
Copy link
Contributor Author

Can you show the logs your are using for testing (It would be better if it covers all structured array cases in generate_structured_array_template, generate_structured_object_template and find_intersection_and_fix_brackets)?

Right now I'm using

{"a":["a",{"b":"c"},{},[],{"e":{"f":{"g":{}}}},null,10,false,"h"],"b":[]}
{"a":["b",{"b":"d"},{},[],{"e":{"f":{"g":{}}}},null,11,true,"i"],"b":[]}
{"b":[0,1,2,3,[[]],[[4]],{"c":["d",{"e":{"f":["g",{}]}}]}]}
{"c":[{"a":{"b":"c","d":"e"},"f":{"g":{"h":{"i":"j"}},"k":"l"}}]}

which covers all of the cases for structured array and find_intersection_and_fix_brackets, and decompresses to the original input byte-for-byte.

@gibber9809 gibber9809 requested a review from wraymo May 31, 2024 13:23
Comment on lines +321 to +328
} else {
cur_root = cur_node->get_parent_id();
cur_node = &m_global_schema_tree->get_node(cur_root);
m_json_serializer.add_op(JsonSerializer::Op::EndObject);
path_to_intersection.push_back(next_root);
next_root = next_node->get_parent_id();
next_node = &m_global_schema_tree->get_node(next_root);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if cur_node and next_node have the same parents? The code will never go into this branch. For example, after decompression, the log {"a": [1, {"b": {"c":3}, "d": [4]}]} becomes {"a":[1,{"b":{"c":3,4]}]}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case works on the most recent commit -- I pushed a fix for this issue when I re-requested review. I added a code block after the loop that checks if cur_node and next_node are different, and if so fixes the last bracket + adds the last node to the path .

@gibber9809 gibber9809 requested a review from wraymo June 3, 2024 14:12
Copy link
Contributor

@wraymo wraymo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! Just one small thing.

components/core/src/clp_s/SchemaReader.cpp Outdated Show resolved Hide resolved
Comment on lines +306 to +307
int32_t cur_root,
int32_t next_root,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Personally I prefer using...id for int32_t node IDs as we extensively use ...column_id, ...child_id elsewhere. But it's fine to leave them as they are since it seems that we are following ...node for SchemaNode and ...root for int32_t.

@gibber9809 gibber9809 requested a review from wraymo June 3, 2024 18:32
append_unordered_reader_columns(
m_schema_reader,
column_id,
schema.get_view(i, 0),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to change it to std::span<int32_t>() given it's an empty span?

@gibber9809 gibber9809 requested a review from wraymo June 6, 2024 16:31
@wraymo
Copy link
Contributor

wraymo commented Jun 6, 2024

Can you fix the lint error?

Copy link
Contributor

@wraymo wraymo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR title looks good to me.

@kirkrodrigues
Copy link
Member

How about:

clp-s: Add support for serializing structured arrays.

@gibber9809 gibber9809 changed the title clp-s: Implement serialization for structurized arrays. clp-s: Add support for serializing structured arrays. Jun 7, 2024
@gibber9809 gibber9809 merged commit 2725b9a into y-scope:main Jun 7, 2024
11 checks passed
@gibber9809 gibber9809 deleted the structurized-array-serialization branch June 7, 2024 01:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants