Inefficiencies when listing dataset flows via GraphQL query #856

zaychenko-sergei · 2024-09-26T09:27:53Z

A casual dataset flows view that lists about 10 flows takes ~1.36s to run and performs highly inefficient repository access operations.
(see Grafana trace)

There are over 7000 spans, including numerous calls to get_active_polling_source for the very same dataset (the only one). Internally, this causes a lot of metadata chain iteration activity: multiple S3 files are read, and the cached versions are then re-used.

Possible solutions:

general improvement of SetPollingSource access (via database materialization or summary extensions)
improving how flow GraphQL objects are organized, so that dataset query is issued only once for N flows
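
A minimal sketch of the second idea, assuming a request-scoped cache sits between the flow resolvers and the metadata chain scan. All names here (`PollingSource`, `PollingSourceCache`, `scans_performed`) are hypothetical and do not come from the kamu codebase; the `scan` closure stands in for whatever get_active_polling_source does internally.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the result of a metadata chain scan.
#[derive(Clone, Debug, PartialEq)]
pub struct PollingSource {
    pub dataset_id: String,
}

/// Hypothetical request-scoped cache: each dataset is scanned at most once,
/// no matter how many flow objects ask for its polling source.
pub struct PollingSourceCache<F>
where
    F: FnMut(&str) -> Option<PollingSource>,
{
    /// The expensive lookup (assumed to iterate the metadata chain).
    scan: F,
    /// Results already resolved during this request, including "not found".
    resolved: HashMap<String, Option<PollingSource>>,
    /// Counter kept only to make the saving observable in this sketch.
    pub scans_performed: usize,
}

impl<F> PollingSourceCache<F>
where
    F: FnMut(&str) -> Option<PollingSource>,
{
    pub fn new(scan: F) -> Self {
        Self {
            scan,
            resolved: HashMap::new(),
            scans_performed: 0,
        }
    }

    /// Returns the polling source for a dataset, scanning only on first access.
    pub fn get(&mut self, dataset_id: &str) -> Option<PollingSource> {
        if let Some(hit) = self.resolved.get(dataset_id) {
            return hit.clone();
        }
        self.scans_performed += 1;
        let found = (self.scan)(dataset_id);
        self.resolved.insert(dataset_id.to_string(), found.clone());
        found
    }
}
```

With such a cache shared across one GraphQL request, resolving 10 flows of the same dataset would call `get` 10 times but run the underlying scan once. Whether this belongs in the GraphQL layer or below it is a design choice this sketch does not settle.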

In addition, the same Grafana trace uncovered the need for #850.


zaychenko-sergei added the rust and performance labels Sep 26, 2024

zaychenko-sergei mentioned this issue Sep 26, 2024

Metadata scanning performance kamu-data/kamu-node#61

Open
