Feature blocky benchmark #541
Conversation
The benchmarking code looks pretty good to me. I'd like to see some indication of the number and size of blocks used in each experiment. Ideally we could configure the experiments to run with different sizes, but I realize that would be tricky.
Please update `docs/benchmarking.rst` before merging; in particular it would be good to see more about how the blocks were created.
A feature request - we could show some form of progress in the benchmark container's output. If we printed the result token, the user could attach a rest_client to watch progress if they really wanted.
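Roughly what I have in mind is something like the sketch below. It assumes a status endpoint of the form `/api/v1/projects/{project_id}/runs/{run_id}/status` and that the result token is passed in the `Authorization` header; treat the path, header and response field names as assumptions for illustration, not the actual API:

```python
import time

import requests


def watch_run(server, project_id, run_id, result_token, poll_seconds=5):
    """Poll a run's status until it finishes.

    Illustrative only: the endpoint path, auth header and response fields
    below are assumptions, not something defined in this PR.
    """
    url = f"{server}/api/v1/projects/{project_id}/runs/{run_id}/status"
    while True:
        resp = requests.get(url, headers={"Authorization": result_token})
        resp.raise_for_status()
        status = resp.json()
        print(f"run {run_id}: state={status.get('state')}")
        if status.get("state") in ("completed", "error"):
            return status
        time.sleep(poll_seconds)
```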
The performance it reveals is another story 🤢
Running on my desktop I see uploading 100k `clknblocks` takes ~45s versus ~6s for binary encodings!
During upload I see that I log the size of each block (oops) and that most(?) blocks have just 1 element.
For the 100k x 100k experiment it creates 112551 chunks. Creating the chunks appears to take almost 3 minutes. I scaled up to 10 workers to give it a chance of finishing. On my machine one chunk takes as much as 50ms, although I saw some at ~10ms. My CPU cores are all <20% active during this process :-/
benchmarking/benchmark.py
Outdated
and `clk_{user}_{size_data}.json` where $user is a letter starting from `a` indexing the data owner, and `size_data`
is a integer representing the number of data rows in the dataset (e.g. 10000). Note that the csv usually has a header.
the 3 party linkage), and then a number a file following the format `PII_{user}_{size_data}.csv`,
`clk_{user}_{size_data}_v2.bin`, `clk_{user}_{size_data}.json` and `clknblocks_{user}_{size_data}.json` where $user |
Suggested change:
`clk_{user}_{size_data}_v2.bin`, `clk_{user}_{size_data}.json` and `clknblocks_{user}_{size_data}.json` where $user
`clk_{user}_{size_data}_v2.bin`, `clk_{user}_{size_data}.json` and `clknblocks_{user}_{size_data}.json` where `user`
benchmarking/benchmark.py
Outdated
is a integer representing the number of data rows in the dataset (e.g. 10000). Note that the csv usually has a header.
the 3 party linkage), and then a number a file following the format `PII_{user}_{size_data}.csv`,
`clk_{user}_{size_data}_v2.bin`, `clk_{user}_{size_data}.json` and `clknblocks_{user}_{size_data}.json` where $user
is a letter starting from `a` indexing the data owner, and `size_data` is a integer representing the number of data |
Suggested change:
is a letter starting from `a` indexing the data owner, and `size_data` is a integer representing the number of data
is a letter starting from `a` indexing the data providers, and `size_data` is an integer representing the number of
I left it running last night with a 12-hour timeout, 6 workers, and these worker settings:
Benchmark Logs
The gist is one 100k x 100k run failed with a timeout, and the other took
Re: binary encodings, we should definitely look into allowing binary CLKs and block info to be uploaded in separate files for big jobs.
This extends the benchmark script to be able to run experiments which use blocking.
An experiment definition for blocking looks like this:
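Something along these lines (an illustrative sketch only; the key names below are placeholders rather than the exact schema used by the benchmark script):

```json
{
  "sizes": ["100K", "100K"],
  "threshold": 0.85,
  "use_blocking": true
}
```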
The corresponding 'clknblocks' files are uploaded to S3.
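As far as I understand it, each of those files follows the usual `clknblocks` JSON layout, where every entry starts with the base64-encoded CLK followed by the IDs of the blocks it belongs to. A minimal sketch with made-up values:

```json
{
  "clknblocks": [
    ["BASE64_ENCODED_CLK_1", "block_1", "block_7"],
    ["BASE64_ENCODED_CLK_2", "block_7"]
  ]
}
```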
For now I haven't changed the `default-experiements.json` file, as the blocked experiments take a very long time and will most likely trigger a timeout. Once we have addressed that issue in the entity service, we can replace `default-experiements.json` with `default-experiements-wawo-blocking.json`.