Adds JSON/CBOR support and an Io-type option #243

Merged
merged 6 commits into from
Mar 30, 2023

Conversation

@cheqianh cheqianh commented Mar 10, 2023

Description:

This PR integrates JSON and CBOR into the benchmark CLI. It also adds an IO type option that selects whether the benchmark reads and writes in-memory buffers or files.

Details

In detail, this PR adds the following features and components:

  1. Enables the CLI tool to benchmark Ion data alongside modern JSON/CBOR libraries in a single run.
  2. Adds an IO type option that lets the tool measure read/write time against either in-memory buffers or files.
  3. Refactors the existing code and adds unit tests.
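The difference between the two IO types can be sketched with the standard library alone. This is a minimal illustration, not the tool's code: `time_read` and its parameters are hypothetical names, and the built-in `json` module stands in for any of the benchmarked formats. With `buffer`, the bytes are read up front and only parsing is timed; with `file`, opening and reading the file is part of the timed work.

```python
import json
import os
import tempfile
import timeit


def time_read(path, io_type, iterations=10):
    """Return the average seconds per read for the given IO type (hypothetical helper)."""
    if io_type == "buffer":
        # Load the bytes once, outside the timed function, so only parsing is measured.
        with open(path, "rb") as f:
            data = f.read()

        def func():
            return json.loads(data)
    elif io_type == "file":
        # Opening and reading the file is included in the measurement.
        def func():
            with open(path, "rb") as f:
                return json.load(f)
    else:
        raise ValueError(f"unknown io_type: {io_type}")
    return timeit.timeit(func, number=iterations) / iterations


# Usage: write a small sample file, then time both IO types.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump({"name": "sample", "values": [1, 2, 3]}, tmp)
buffer_seconds = time_read(tmp.name, "buffer")
file_seconds = time_read(tmp.name, "file")
os.remove(tmp.name)
print(f"buffer: {buffer_seconds:.6f}s, file: {file_seconds:.6f}s")
```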

Output Example

Example One

Benchmark the load API for all JSON libraries as well as Ion

Command and Output:

python amazon/ionbenchmark/ion_benchmark_cli.py read --format json --format orjson --format simplejson --format ion_text --format ion_binary --format ujson

Screenshot 2023-03-09 at 5 53 25 PM

Example Two

Benchmark the dumps API (--io-type buffer) for CBOR2 and Ion_binary

Command and Output:

python amazon/ionbenchmark/ion_benchmark_cli.py write --io-type buffer --format cbor2 --format ion_binary test_unit_int

Screenshot 2023-03-09 at 5 56 19 PM

Example Three

Benchmark the dump API (--io-type file) for CBOR2 and Ion_binary

Command and Output:

python amazon/ionbenchmark/ion_benchmark_cli.py write --io-type file --format cbor2 --format ion_binary test_unit_int

Screenshot 2023-03-09 at 5 56 58 PM

Follow-up issues

  1. Memory profiling enhancement
    Currently, the tool only profiles peak memory usage. Because of optimizations Python may perform, the memory metrics might be inaccurate after running different formats over multiple executions. Memory-related metrics need more investigation - Benchmark CLI memory profiling feature enhancement. #245
  2. A tool to convert between file formats is needed.
    Benchmark-cli read command should support --format option (format conversion feature). #234
  3. Pretty-printed option output is needed.
    Right now, the options shown in the output table are unclear. For example, (simpleion, ion_binary, file) means that --api is simpleion, --format is ion_binary and --io-type is file. This needs improvement - Benchmark CLI needs pretty printed option log. #244.
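One possible shape for the pretty-printed option log in item 3 is sketched below. The helper name `describe_options` and the `OPTION_NAMES` constant are hypothetical; the only thing taken from the description above is the ordering of the tuple (api, format, io-type).

```python
# Order of the values in the option tuple, as described in the follow-up item above.
OPTION_NAMES = ("--api", "--format", "--io-type")


def describe_options(option_tuple):
    """Label each positional value with the option it belongs to (hypothetical helper)."""
    return " ".join(f"{name} {value}" for name, value in zip(OPTION_NAMES, option_tuple))


print(describe_options(("simpleion", "ion_binary", "file")))
# prints: --api simpleion --format ion_binary --io-type file
```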

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@cheqianh cheqianh requested a review from tgregg March 10, 2023 02:24
Comment on lines +23 to +24
def rewrite_file_to_format(file, format_option):
    return file
@cheqianh cheqianh Mar 10, 2023

The file format conversion logic will go here.

@cheqianh cheqianh left a comment

CI/CD failed because some JSON libraries, e.g. orjson, do not support PyPy. We can either

  1. Treat orjson as the regular built-in json library when using the PyPy interpreter, or
  2. Remove orjson support on PyPy.

Apart from that, it works as expected, with the 10 known failing tests tracked in #249

tgregg commented Mar 15, 2023

> CI/CD failed because some JSON libraries, e.g. orjson, do not support PyPy. We can either
>
> 1. Treat `orjson` as the regular built-in json library when using the `pypy` interpreter, or
> 2. Remove orjson support on PyPy.
>
> Apart from that, it works as expected, with the 10 known failing tests tracked in #249

Let's go with option 2
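Option 2 could look roughly like the sketch below, which filters the requested formats by interpreter. This is an assumption about how the check might be written, not the code merged in this PR: `supported_formats` and `PYPY_UNSUPPORTED` are hypothetical names, and only `platform.python_implementation()` is a real standard-library call.

```python
import platform

# Formats assumed to be unavailable on PyPy (orjson is the one named in this thread).
PYPY_UNSUPPORTED = {"orjson"}


def supported_formats(requested, implementation=None):
    """Drop formats that the current interpreter cannot run (hypothetical helper)."""
    impl = implementation or platform.python_implementation()
    unsupported = PYPY_UNSUPPORTED if impl == "PyPy" else set()
    return [fmt for fmt in requested if fmt not in unsupported]


print(supported_formats(["json", "orjson", "ion_binary"], implementation="PyPy"))
# prints: ['json', 'ion_binary']
```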

ion-c Outdated
Contributor

Since this introduces test failures that haven't yet been resolved, let's leave the ion-c submodule update for a separate PR, and revert it here.

@tgregg tgregg left a comment

Since format conversion is left as a TODO, can you explain what is actually happening today in each of your examples in the description? Do those come from an actual execution?

Once we have data conversion, I'd expect the data size stat to be different for each of the formats, as it will reflect the size of the converted data, for both read and write.

Contributor

For the --api option, the allowed values should be changed to something like load_dump and streaming. The option simple_ion is no longer accurate for the other formats, and looks weird in the options list when those formats are selected.
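To make the suggestion concrete, here is a sketch of an option surface with the proposed `--api` values. It uses `argparse` purely for illustration; this thread does not show which parsing library the CLI actually uses, and the defaults, `dest` names, and the required positional `input_file` are assumptions.

```python
import argparse


def build_parser():
    """Build an illustrative parser; not the real ion_benchmark_cli.py interface."""
    parser = argparse.ArgumentParser(prog="ion_benchmark_cli.py")
    parser.add_argument("command", choices=["read", "write"])
    parser.add_argument("input_file")
    # Format-neutral API names, as suggested in the review comment above.
    parser.add_argument("--api", choices=["load_dump", "streaming"], default="load_dump")
    # --format may be repeated to benchmark several formats in one run.
    parser.add_argument("--format", action="append", dest="formats")
    parser.add_argument("--io-type", choices=["buffer", "file"], default="file")
    return parser


args = build_parser().parse_args(
    ["write", "--io-type", "buffer", "--format", "cbor2", "--format", "ion_binary", "test_unit_int"]
)
print(args.command, args.api, args.formats, args.io_type, args.input_file)
```

Naming the values after the operation style rather than after a specific Ion module keeps the options list meaningful when a JSON or CBOR format is selected.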

Contributor Author

Changed.



# Generates benchmark code for json/cbor load/loads APIs
def generate_read_test_code(file, memory_profiling, format_option, binary, io_type):

Contributor

Is there a reason we can't follow the same path for Ion?

Contributor Author

Merged them into one.



# Generates benchmark code for json dump API
def generate_write_test_code(obj, memory_profiling, format_option, io_type, binary):

Contributor

Is there a reason Ion can't follow this path?

Contributor Author

Same as the comment above.

# reset each option configuration
api, format_option, io_type = reset_for_each_execution(each_option)
binary = format_is_binary(format_option)
# TODO. currently, we must provide the tool a corresponding file format for read benchmarking. For example,

Contributor

The tool to convert to a corresponding file format for read benchmarking. ?

Contributor Author

Fixed.

Comment on lines 132 to 138
def test_func():
    tracemalloc.start()
    data = ion.loads(benchmark_data, single_value=single_value, emit_bare_values=emit_bare_values)
    global read_memory_usage_peak
    read_memory_usage_peak = tracemalloc.get_traced_memory()[1] / BYTES_TO_MB
    tracemalloc.stop()
    return data

Contributor

test_func() is what is being timed, correct? That means the cost of using tracemalloc is included in the results. Is there a way for us to extract this logic outside of the timed block?

Contributor Author

Yes, we execute it twice; once with memory_profiling enabled, and once with it disabled.

We profile the memory before all performance benchmarking here. Then we benchmark the performance separately.
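The two-pass approach described here can be sketched as follows: one untimed pass under `tracemalloc` to record peak memory, then a separate timing pass with profiling off. The helper names are hypothetical, the built-in `json` module stands in for `ion.loads`, and the value of `BYTES_TO_MB` is an assumption since the thread does not show its definition.

```python
import json
import timeit
import tracemalloc

BYTES_TO_MB = 1024 * 1024  # assumed value; the real constant is not shown in this thread


def profile_peak_memory_mb(func):
    """Run func once under tracemalloc and return peak traced memory in MB."""
    tracemalloc.start()
    try:
        func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / BYTES_TO_MB


def time_average_seconds(func, iterations=10):
    """Time func with tracemalloc disabled, so tracing overhead is excluded."""
    return timeit.timeit(func, number=iterations) / iterations


benchmark_data = json.dumps(list(range(10_000)))


def read():
    return json.loads(benchmark_data)


peak_mb = profile_peak_memory_mb(read)   # pass 1: memory only
avg_seconds = time_average_seconds(read)  # pass 2: time only
print(f"peak: {peak_mb:.3f} MB, average: {avg_seconds:.6f} s")
```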

cheqianh commented Mar 15, 2023

> Since format conversion is left as a TODO, can you explain what is actually happening today in each of your examples in the description? Do those come from an actual execution?

Yep, these metrics are generated by the tool directly. Overall, the tool assumes that format conversion already exists, but it does not actually convert any files yet.

For each example:

(1)
Benchmarking both read and write for JSON/Ion_text with simple files does NOT require format conversion.

One thing I want to point out is that ion_binary is included in the example, but it is doing exactly the same thing as the ion_text format. We have an enhancement ticket describing this - #234. Once we support format conversion, the tool will pick it up automatically without any changes.

(2) and (3)
Benchmarking the write command for CBOR/Ion does NOT require format conversion; we provide Python-level objects and don't care how they are encoded and written into files. Both examples target the write command above.

One thing I found that breaks the fairness of Ion/CBOR benchmarking is the slight difference in the Python objects they generate (e.g., cbor2 reads "9233720363654371807" as -12852 in Python). But the tool will automatically generate the correct object once we support format conversion. (This is one benefit of the tenet I followed while working on this PR: always assume that the tool already supports format conversion. So once we have the format conversion tool, cbor2 will automatically generate correct cbor2 objects from the desired file format.)

> Once we have data conversion, I'd expect the data size stat to be different for each of the formats, as it will reflect the size of the converted data, for both read and write.

Yeah, the converted file is used to calculate the file size. Once we support format conversion, it will report the new file size automatically.

@cheqianh
Contributor Author

The new commit addresses all the feedback above, passes all CI/CD checks, resolves the PyPy incompatibility issue, and deprecates orjson for now.

Here is the GH issue to add orjson back.

@cheqianh cheqianh merged commit b08b9a7 into amazon-ion:master Mar 30, 2023