[WIP] es/query: request_id-based derivation tasks statistics #187

Open
wants to merge 8 commits into
base: master
@mgolosova mgolosova commented Dec 5, 2018

Added a query to get hashtag-based derivation task statistics.

The query returns the following information:

  • aggregated by output data formats;
  • within formats -- aggregated by task status;
  • for each format+status bucket:
    • total number of input events;
    • total size of input datasets;
    • total number of output events;
    • total size of output datasets;
    • average task walltime;
    • estimated total cpu time.
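
The bucket layout described above could be expressed as a nested Elasticsearch aggregation. The following is only a minimal sketch, not the query from this PR: it assumes a flat task document, filters by request ID, and every field name in it (`request_id`, `status`, `input_events`, `input_bytes`, `output_events`, `output_bytes`, `walltime`, `cpu_time`) is a hypothetical placeholder except `output_formats`, which is mentioned in the commit messages. How the CPU time estimate is actually computed is not stated here, so a plain sum over an assumed precomputed field stands in for it.

```python
import json

# Hypothetical field names: the real ES mapping may differ.
METRICS = {
    "input_events": {"sum": {"field": "input_events"}},
    "input_bytes": {"sum": {"field": "input_bytes"}},
    "output_events": {"sum": {"field": "output_events"}},
    "output_bytes": {"sum": {"field": "output_bytes"}},
    "avg_walltime": {"avg": {"field": "walltime"}},
    "cpu_time": {"sum": {"field": "cpu_time"}},
}


def build_stats_query(request_id, max_buckets=100):
    """Build a request body: output formats -> task statuses -> metrics."""
    return {
        "size": 0,  # only aggregations are needed, not the matching documents
        "query": {"term": {"request_id": request_id}},
        "aggs": {
            "formats": {
                "terms": {"field": "output_formats", "size": max_buckets},
                "aggs": {
                    "statuses": {
                        "terms": {"field": "status", "size": max_buckets},
                        "aggs": METRICS,
                    }
                },
            }
        },
    }


if __name__ == "__main__":
    # The request ID here is an arbitrary illustrative value.
    print(json.dumps(build_stats_query(12345), indent=2))
```

The resulting dictionary can be serialized to JSON and sent as the body of a search request; each `formats` bucket then contains `statuses` sub-buckets carrying the six metrics.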

ToDo

@mgolosova mgolosova self-assigned this Dec 5, 2018
@mgolosova
Collaborator Author

Reportedly, derivation tasks are more commonly looked up by
request ID than by hashtag(s).

The ES "terms" aggregation returns only the top 10 buckets by default;
to get the others, "size" must be specified.
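
As an illustration of the "size" setting, here is a small sketch of a helper that builds a "terms" clause with or without it. The helper name `terms_agg` and the `status` field are hypothetical; only the "terms" aggregation and its "size" parameter come from Elasticsearch itself.

```python
def terms_agg(field, size=None):
    """Return a "terms" aggregation clause; set "size" only when requested."""
    clause = {"field": field}
    if size is not None:
        clause["size"] = size
    return {"terms": clause}


# Without "size", Elasticsearch falls back to its default of 10 buckets.
default_agg = terms_agg("status")

# With "size", up to the given number of buckets is returned.
explicit_agg = terms_agg("status", size=50)
```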

The "data_format" field of an output dataset is artificially extended
with a "general" format: "DAOD_EXOT12" is turned into ["DAOD", "DAOD_EXOT12"]
(see PR #102, commit 8c5ca49).

For a given task this is not ideal: we get an extra format "DAOD" that
does not match any specific dataset yet matches all the "DAOD_*" datasets.

To work around this issue, the list of data formats can be taken from
the task metadata (the "output_formats" field).
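
The effect of the extra "general" format on bucketing can be sketched in plain Python. This is an illustration only: the splitting rule in `expand_format` (take the part before the first underscore) is an assumption that reproduces the "DAOD_EXOT12" example, not necessarily the logic from PR #102, and "DAOD_EXOT13" is just a second sample value.

```python
from collections import Counter


def expand_format(data_format):
    """Assumed rule: "DAOD_EXOT12" -> ["DAOD", "DAOD_EXOT12"]."""
    general = data_format.split("_", 1)[0]
    if general == data_format:
        return [data_format]
    return [general, data_format]


def buckets_from_datasets(dataset_formats):
    """Bucket by the extended dataset "data_format" values."""
    counts = Counter()
    for fmt in dataset_formats:
        counts.update(expand_format(fmt))
    return dict(counts)


def buckets_from_task(output_formats):
    """Bucket by the task's own "output_formats" list."""
    return dict(Counter(output_formats))


formats = ["DAOD_EXOT12", "DAOD_EXOT13"]
by_dataset = buckets_from_datasets(formats)  # contains an extra "DAOD" bucket
by_task = buckets_from_task(formats)         # only the two real formats
```

Bucketing by the dataset field produces a "DAOD" bucket that collects both datasets, while bucketing by the task metadata yields one bucket per actual output format.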

For some reason there are tasks with `start_time` > `end_time` in the ProdSys2 DB,
so we have to check for this explicitly to get a correct result.
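
One possible form of such a check, sketched outside of Elasticsearch: tasks with missing or inverted timestamps are skipped when averaging walltime. Skipping them is an assumption made for this sketch (the commit message only says the case must be checked), and the function names are hypothetical.

```python
from datetime import datetime


def task_walltime_seconds(start_time, end_time):
    """Return walltime in seconds, or None for missing/inverted timestamps."""
    if start_time is None or end_time is None or start_time > end_time:
        return None
    return (end_time - start_time).total_seconds()


def average_walltime(tasks):
    """Average walltime over (start_time, end_time) pairs, skipping bad ones."""
    values = []
    for start_time, end_time in tasks:
        walltime = task_walltime_seconds(start_time, end_time)
        if walltime is not None:
            values.append(walltime)
    return sum(values) / len(values) if values else None
```

Without the check, an inverted pair would contribute a negative duration and pull the average down.
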
@mgolosova changed the title from "[WIP] Hashtag-based derivation tasks statistics" to "[WIP] Request-based derivation tasks statistics" on Jan 11, 2019
@mgolosova changed the title from "[WIP] Request-based derivation tasks statistics" to "[WIP] es/query: request_id-based derivation tasks statistics" on Jan 23, 2019
@mgolosova
Collaborator Author

The [WIP] status remains because we still don't know whether the query does what it was made for.

Initially the query's main parameter was supposed to be a hashtag (or
a list of hashtags), but it was later changed to the request ID.

NOTE: the output sample will be updated later, when the data in ES are ready.