Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider switching to async inserts or batch tables #68

Closed
hellais opened this issue Apr 30, 2024 · 1 comment
Closed

Consider switching to async inserts or batch tables #68

hellais opened this issue Apr 30, 2024 · 1 comment

Comments

@hellais
Copy link
Member

hellais commented Apr 30, 2024

At the moment if you run too many workers on machine that is too fast, you can run into issues related to performing too many inserts per second even with the current approach of batching inserts inside of the custom ClickhouseConnection we use ooni/pipeline: https://github.com/ooni/data/blob/main/oonipipeline/src/oonipipeline/db/connections.py#L34.

We should consider switching to some of the native methods of either using the BufferTable engine or async inserts.

For the daily processing it's not so much of a concern, however it's a bit more of an issue for backfilling.

hellais added a commit that referenced this issue Jun 27, 2024
This is an important refactor of the table models.

It moves the ProbeMeta and MeasurementMeta into nested composed classes,
which is nicer because you don't get lost in the complicated class
inheritance, but most importantly it significantly boosts performance
because we don't have to make copies of each MeasurementMeta to pass it
around.

I also introduced to better patterns for handling the TableModels.
Basically you decorate a table that should end up inside of the database
via the `table_model` decorator and then when it's used type safety is
enforced by the `TableModelProtocol`.

Thanks to this refactoring it's also possible to improve the way in
which we handle both the buffering and serialization of writes, but also
the creation of the `CREATE` table queries by using python type hints.

Some of these features require recentish versions of python (i.e.
>=3.10), however we have already decided that backward compatibility is
not a priority for the pipeline.

We might however need some kind of compatibility layer if some of these
functions need to be used by oonidata (though we might also drop older
python support there too at some point if it gets too complex to
manage).

There are still several parts which need to be refactored, but I suggest
doing that later and they are marked as TODO(art).

This also adds support for making use of buffer tables, which has a
significant performance boost in a parallalized
workflow avoiding the issue outlined in here:
#68

Moreover, we come up with better pattern to wait for table buffers being
flushed before starting the dependant workflow. this can be implemented
using primitives of temporal.

We also enrich columns with the new processing time metadata for
performance monitoring.

---------

Co-authored-by: DecFox <[email protected]>
@hellais
Copy link
Member Author

hellais commented Aug 22, 2024

This was done in here: c797c26

@hellais hellais closed this as completed Aug 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant