Consider switching to async inserts or batch tables #68
hellais added a commit that referenced this issue on Jun 27, 2024:
This is an important refactor of the table models. It moves ProbeMeta and MeasurementMeta into nested composed classes, which is nicer because you don't get lost in complicated class inheritance, but most importantly it significantly boosts performance because we don't have to make copies of each MeasurementMeta to pass it around.

I also introduced two better patterns for handling the TableModels. Basically, you decorate a table that should end up inside the database via the `table_model` decorator, and then when it's used, type safety is enforced by the `TableModelProtocol`. Thanks to this refactoring it's also possible to improve how we handle the buffering and serialization of writes, as well as the generation of the `CREATE TABLE` queries, by using Python type hints.

Some of these features require recentish versions of Python (i.e. >=3.10); however, we have already decided that backward compatibility is not a priority for the pipeline. We might need some kind of compatibility layer if some of these functions need to be used by oonidata (though we might also drop older Python support there too at some point if it gets too complex to manage).

There are still several parts which need to be refactored, but I suggest doing that later; they are marked as TODO(art).

This also adds support for making use of buffer tables, which gives a significant performance boost in a parallelized workflow by avoiding the issue outlined here: #68. Moreover, we came up with a better pattern to wait for table buffers to be flushed before starting the dependent workflow; this can be implemented using Temporal primitives. We also enrich columns with the new processing-time metadata for performance monitoring.

---------

Co-authored-by: DecFox <[email protected]>
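The `table_model` decorator and `TableModelProtocol` pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual oonipipeline implementation: the type mapping, table name, and column names are all hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Protocol, runtime_checkable

# Hypothetical mapping from Python type hints to ClickHouse column types
PYTHON_TO_CLICKHOUSE = {str: "String", int: "Int64", float: "Float64"}


@runtime_checkable
class TableModelProtocol(Protocol):
    """Anything decorated with @table_model satisfies this protocol."""

    __table_name__: str


def table_model(table_name: str):
    """Decorator marking a dataclass as a database-backed table model."""

    def wrapper(cls):
        cls = dataclass(cls)
        cls.__table_name__ = table_name
        return cls

    return wrapper


def create_query(model: type) -> str:
    """Build a CREATE TABLE statement from the model's type hints."""
    cols = ", ".join(
        f"{f.name} {PYTHON_TO_CLICKHOUSE[f.type]}" for f in fields(model)
    )
    return f"CREATE TABLE {model.__table_name__} ({cols})"


@table_model("obs_web")
class WebObservation:
    measurement_uid: str
    probe_asn: int
```

With this shape, `create_query(WebObservation)` derives the DDL directly from the annotations, and `isinstance(obj, TableModelProtocol)` can be used to enforce that only decorated models reach the write path.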
This was done in c797c26.
At the moment, if you run too many workers on a machine that is too fast, you can run into issues related to performing too many inserts per second, even with the current approach of batching inserts inside the custom ClickhouseConnection we use in ooni/pipeline: https://github.com/ooni/data/blob/main/oonipipeline/src/oonipipeline/db/connections.py#L34.
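The batching approach referred to above can be sketched as a per-table write buffer that coalesces rows and only issues an INSERT once a row-count threshold is reached. This is an illustrative sketch, not the actual ClickhouseConnection code; class and attribute names are made up.

```python
class BatchedConnection:
    """Buffers rows per table and flushes them in batches, reducing the
    number of INSERT statements sent to ClickHouse."""

    def __init__(self, flush_threshold: int = 10_000):
        self.flush_threshold = flush_threshold
        self._buffers: dict[str, list[tuple]] = {}
        self.flush_count = 0  # counts how many INSERTs we would issue

    def write_row(self, table: str, row: tuple) -> None:
        buf = self._buffers.setdefault(table, [])
        buf.append(row)
        if len(buf) >= self.flush_threshold:
            self.flush(table)

    def flush(self, table: str) -> None:
        rows = self._buffers.pop(table, [])
        if not rows:
            return
        # In the real connection this would execute a single
        # `INSERT INTO <table> VALUES` with all buffered rows.
        self.flush_count += 1

    def flush_all(self) -> None:
        """Flush every table's buffer, e.g. on shutdown."""
        for table in list(self._buffers):
            self.flush(table)
```

Even with this coalescing, many fast parallel workers each flushing their own buffer can still exceed ClickHouse's comfortable inserts-per-second rate, which is what motivates the server-side alternatives below.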
We should consider switching to one of ClickHouse's native mechanisms for this: either the Buffer table engine or async inserts.
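For reference, the two options look roughly like this in ClickHouse SQL. The table name `obs_web` and all numeric parameters are illustrative, not taken from the pipeline:

```sql
-- Option 1: a Buffer table that absorbs frequent small inserts in memory
-- and flushes them to the destination table in larger blocks.
CREATE TABLE obs_web_buffer AS obs_web
ENGINE = Buffer(
    currentDatabase(), 'obs_web',
    16,                   -- num_layers
    10, 100,              -- min_time, max_time (seconds)
    10000, 1000000,       -- min_rows, max_rows
    10000000, 100000000   -- min_bytes, max_bytes
);
-- Writers then INSERT INTO obs_web_buffer instead of obs_web.

-- Option 2: let the server coalesce small inserts with async inserts.
INSERT INTO obs_web SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES (...);
```

With the Buffer engine the batching happens transparently server-side, but buffered data is lost if the server crashes before a flush; async inserts with `wait_for_async_insert = 1` make the client block until the batch is actually written.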
For the daily processing it's not so much of a concern, however it's a bit more of an issue for backfilling.