Consider switching to async inserts or batch tables #68
hellais added a commit that referenced this issue on Jun 27, 2024:
This is an important refactor of the table models. It moves ProbeMeta and MeasurementMeta into nested composed classes, which is nicer because you don't get lost in complicated class inheritance, but most importantly it significantly boosts performance because we don't have to make copies of each MeasurementMeta to pass it around.

I also introduced two better patterns for handling the TableModels. Basically, you decorate a table that should end up inside the database via the `table_model` decorator, and then when it's used, type safety is enforced by the `TableModelProtocol`. Thanks to this refactoring it's also possible to improve how we handle the buffering and serialization of writes, as well as the generation of the `CREATE TABLE` queries, by using Python type hints.

Some of these features require recentish versions of Python (i.e. >=3.10); however, we have already decided that backward compatibility is not a priority for the pipeline. We might need some kind of compatibility layer if some of these functions need to be used by oonidata (though we might also drop older Python support there too at some point if it gets too complex to manage).

There are still several parts which need to be refactored, but I suggest doing that later; they are marked as TODO(art).

This also adds support for making use of buffer tables, which gives a significant performance boost in a parallelized workflow by avoiding the issue outlined here: #68. Moreover, we came up with a better pattern to wait for table buffers to be flushed before starting the dependent workflow; this can be implemented using Temporal primitives. We also enrich columns with the new processing-time metadata for performance monitoring.

---------

Co-authored-by: DecFox <[email protected]>
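The `table_model` decorator and `TableModelProtocol` pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual oonipipeline implementation: the type mapping, table name, and column names are all hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Protocol, runtime_checkable

# Hypothetical mapping from Python type hints to ClickHouse column types
PYTHON_TO_CLICKHOUSE = {str: "String", int: "Int64", float: "Float64"}


@runtime_checkable
class TableModelProtocol(Protocol):
    """Anything decorated with @table_model satisfies this protocol."""

    __table_name__: str


def table_model(table_name: str):
    """Decorator marking a dataclass as a database-backed table model."""

    def wrapper(cls):
        cls = dataclass(cls)
        cls.__table_name__ = table_name
        return cls

    return wrapper


def create_query(model: type) -> str:
    """Build a CREATE TABLE statement from the model's type hints."""
    cols = ", ".join(
        f"{f.name} {PYTHON_TO_CLICKHOUSE[f.type]}" for f in fields(model)
    )
    return f"CREATE TABLE {model.__table_name__} ({cols})"


@table_model("obs_web")
class WebObservation:
    measurement_uid: str
    probe_asn: int
```

With this shape, `create_query(WebObservation)` derives the DDL directly from the annotations, and `isinstance(obj, TableModelProtocol)` can be used to enforce that only decorated models reach the write path.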
This was done in c797c26.
At the moment, if you run too many workers on a machine that is too fast, you can run into issues related to performing too many inserts per second, even with the current approach of batching inserts inside the custom ClickhouseConnection we use in ooni/pipeline: https://github.com/ooni/data/blob/main/oonipipeline/src/oonipipeline/db/connections.py#L34.
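The batching approach referred to above can be sketched as a per-table write buffer that coalesces rows and only issues an INSERT once a row-count threshold is reached. This is an illustrative sketch, not the actual ClickhouseConnection code; class and attribute names are made up.

```python
class BatchedConnection:
    """Buffers rows per table and flushes them in batches, reducing the
    number of INSERT statements sent to ClickHouse."""

    def __init__(self, flush_threshold: int = 10_000):
        self.flush_threshold = flush_threshold
        self._buffers: dict[str, list[tuple]] = {}
        self.flush_count = 0  # counts how many INSERTs we would issue

    def write_row(self, table: str, row: tuple) -> None:
        buf = self._buffers.setdefault(table, [])
        buf.append(row)
        if len(buf) >= self.flush_threshold:
            self.flush(table)

    def flush(self, table: str) -> None:
        rows = self._buffers.pop(table, [])
        if not rows:
            return
        # In the real connection this would execute a single
        # `INSERT INTO <table> VALUES` with all buffered rows.
        self.flush_count += 1

    def flush_all(self) -> None:
        """Flush every table's buffer, e.g. on shutdown."""
        for table in list(self._buffers):
            self.flush(table)
```

Even with this coalescing, many fast parallel workers each flushing their own buffer can still exceed ClickHouse's comfortable inserts-per-second rate, which is what motivates the server-side alternatives below.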
We should consider switching to one of ClickHouse's native mechanisms for this: either the Buffer table engine or async inserts.
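For reference, the two options look roughly like this in ClickHouse SQL. The table name `obs_web` and all numeric parameters are illustrative, not taken from the pipeline:

```sql
-- Option 1: a Buffer table that absorbs frequent small inserts in memory
-- and flushes them to the destination table in larger blocks.
CREATE TABLE obs_web_buffer AS obs_web
ENGINE = Buffer(
    currentDatabase(), 'obs_web',
    16,                   -- num_layers
    10, 100,              -- min_time, max_time (seconds)
    10000, 1000000,       -- min_rows, max_rows
    10000000, 100000000   -- min_bytes, max_bytes
);
-- Writers then INSERT INTO obs_web_buffer instead of obs_web.

-- Option 2: let the server coalesce small inserts with async inserts.
INSERT INTO obs_web SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES (...);
```

With the Buffer engine the batching happens transparently server-side, but buffered data is lost if the server crashes before a flush; async inserts with `wait_for_async_insert = 1` make the client block until the batch is actually written.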
For the daily processing it's not so much of a concern, however it's a bit more of an issue for backfilling.