fix: allocation csv: gpu_hours -> slot_hours, add resource_pool [DET-10408] #9616

Merged: 4 commits, Jul 12, 2024 (showing changes from all commits)
12 changes: 6 additions & 6 deletions docs/manage/historical-cluster-usage-data.rst
@@ -6,7 +6,7 @@

Determined provides insights into the usage of your cluster, measured in compute hours allocated.
Note that this is based on allocation, not resource utilization. For example, if a user has 1 GPU
- allocated but uses only 20% of it, we still report one GPU hour.
+ allocated but uses only 20% of it, we still report one slot-hour.

.. warning::

@@ -23,11 +23,11 @@ allocated but uses only 20% of it, we still report one GPU hour.

.. note::

- When using the export to CSV functionality, ``gpu_hours`` reflects only the GPU hours used during
- the export time window. This means that allocations overlapping the export window have their GPU
- hours calculated only for the time within the window. As a result, allocations not starting and
- ending within the export window may appear to have incorrect GPU hours when calculated manually
- from their start and end times.
+ When using the export to CSV functionality, ``slot_hours`` reflects only the slot-hours used
+ during the export time window. This means that allocations overlapping the export window have
+ their slot-hours calculated only for the time within the window. As a result, allocations not
+ starting and ending within the export window may appear to have incorrect slot-hours when
+ calculated manually from their start and end times.

*********************
WebUI Visualization
7 changes: 7 additions & 0 deletions docs/reference/experiment-config-reference.rst
@@ -1174,6 +1174,13 @@ multiple GPUs is done using data parallelism. Configuring ``slots_per_trial`` to
certain models, as described in the `PyTorch documentation
<https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html#torch.nn.DataParallel>`__.

``slots``
=========

For historical reasons, this field usually passes config validation steps, but has no practical
effect when present in experiment config. Use :ref:`slots_per_trial
<exp-config-resources-slots-per-trial>` instead.

``max_slots``
=============

7 changes: 7 additions & 0 deletions docs/release-notes/relabel-allocation-csv.rst
@@ -0,0 +1,7 @@
:orphan:

**Breaking Changes**

- Tasks: The :ref:`historical usage <historical-cluster-usage-data>` CSV header row for slot-hours
is now named ``slot_hours`` as it may also track allocation time for resource pools without GPUs.
Also, this CSV now has an additional column providing the ``resource_pool`` for each allocation.
15 changes: 10 additions & 5 deletions master/internal/core.go
@@ -362,12 +362,13 @@ type AllocationMetadata struct {
TaskType model.TaskType
Username string
WorkspaceName string
+ ResourcePool string
ExperimentID int
Slots int
StartTime time.Time
EndTime time.Time
ImagepullingTime float64
- GPUHours float64
+ SlotHours float64
}

// canGetUsageDetails checks if the user has permission to get cluster usage details.
@@ -438,7 +439,8 @@ func (m *Master) getResourceAllocations(c echo.Context) error {
ColumnExpr("a.start_time").
ColumnExpr("a.end_time").
ColumnExpr("a.slots").
- ColumnExpr("CASE WHEN a.start_time is NULL THEN 0.0 ELSE extract(epoch FROM (LEAST(GREATEST(coalesce(a.end_time, now()), a.start_time), ? :: timestamptz) - GREATEST(a.start_time, ? :: timestamptz))) * a.slots END AS gpu_seconds", end, start).
+ ColumnExpr("a.resource_pool").
+ ColumnExpr("CASE WHEN a.start_time is NULL THEN 0.0 ELSE extract(epoch FROM (LEAST(GREATEST(coalesce(a.end_time, now()), a.start_time), ? :: timestamptz) - GREATEST(a.start_time, ? :: timestamptz))) * a.slots END AS slot_seconds", end, start).
TableExpr("allocations a").
Where("tstzrange(start_time - interval '1 microsecond', greatest(start_time, coalesce(end_time, now()))) && tstzrange(? :: timestamptz, ? :: timestamptz)", start, end)

@@ -470,7 +472,8 @@ func (m *Master) getResourceAllocations(c echo.Context) error {
ColumnExpr("a.start_time").
ColumnExpr("a.end_time").
ColumnExpr("ip.imagepulling_time").
- ColumnExpr("a.gpu_seconds / 3600.0 AS gpu_hours").
+ ColumnExpr("a.slot_seconds / 3600.0 AS slot_hours").
+ ColumnExpr("a.resource_pool").
With("tasks_in_range", tasksInRange).
With("allocations_in_range", allocationsInRange).
With("task_owners", taskOwners).
@@ -500,7 +503,8 @@ func (m *Master) getResourceAllocations(c echo.Context) error {
"start_time",
"end_time",
"imagepulling_time",
- "gpu_hours",
+ "slot_hours",
+ "resource_pool",
}

formatTimestamp := func(t time.Time) string {
@@ -545,7 +549,8 @@ func (m *Master) getResourceAllocations(c echo.Context) error {
formatTimestamp(allocationMetadata.StartTime),
formatTimestamp(allocationMetadata.EndTime),
formatDuration(allocationMetadata.ImagepullingTime),
- formatDuration(allocationMetadata.GPUHours),
+ formatDuration(allocationMetadata.SlotHours),
+ allocationMetadata.ResourcePool,
}
if err := csvWriter.Write(fields); err != nil {
return err