fix: allocation csv: gpu_hours -> slot_hours, add resource_pool [DET-…
jesse-amano-hpe authored Jul 12, 2024
1 parent 6299dcd commit e9e4458
Showing 4 changed files with 30 additions and 11 deletions.
12 changes: 6 additions & 6 deletions docs/manage/historical-cluster-usage-data.rst
@@ -6,7 +6,7 @@

Determined provides insights into the usage of your cluster, measured in compute hours allocated.
Note that this is based on allocation, not resource utilization. For example, if a user has 1 GPU
- allocated but uses only 20% of it, we still report one GPU hour.
+ allocated but uses only 20% of it, we still report one slot-hour.

.. warning::

@@ -23,11 +23,11 @@ allocated but uses only 20% of it, we still report one GPU hour.

.. note::

- When using the export to CSV functionality, ``gpu_hours`` reflects only the GPU hours used during
- the export time window. This means that allocations overlapping the export window have their GPU
- hours calculated only for the time within the window. As a result, allocations not starting and
- ending within the export window may appear to have incorrect GPU hours when calculated manually
- from their start and end times.
+ When using the export to CSV functionality, ``slot_hours`` reflects only the slot hours used
+ during the export time window. This means that allocations overlapping the export window have
+ their slot-hours calculated only for the time within the window. As a result, allocations not
+ starting and ending within the export window may appear to have incorrect slot-hours when
+ calculated manually from their start and end times.

*********************
WebUI Visualization
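The note above means an allocation that spans the window boundary only accrues slot-hours for the portion inside the window. A minimal Go sketch of that clamping (illustration only, not the server code; the function name and example timestamps are invented), mirroring the LEAST/GREATEST expression this commit uses for ``slot_seconds`` in master/internal/core.go:

```go
package main

import (
	"fmt"
	"time"
)

// slotHoursInWindow clamps an allocation's lifetime to the export window and
// converts the overlap into slot-hours. Illustration only, not the server code.
func slotHoursInWindow(allocStart time.Time, allocEnd *time.Time, slots int, winStart, winEnd time.Time) float64 {
	end := time.Now() // coalesce(a.end_time, now())
	if allocEnd != nil {
		end = *allocEnd
	}
	if end.Before(allocStart) { // GREATEST(end, start): never end before start
		end = allocStart
	}
	clampedEnd := end
	if clampedEnd.After(winEnd) { // LEAST(..., window end)
		clampedEnd = winEnd
	}
	clampedStart := allocStart
	if clampedStart.Before(winStart) { // GREATEST(start, window start)
		clampedStart = winStart
	}
	seconds := clampedEnd.Sub(clampedStart).Seconds()
	if seconds < 0 {
		seconds = 0 // allocation does not overlap the window at all
	}
	return seconds * float64(slots) / 3600.0
}

func main() {
	winStart := time.Date(2024, 7, 1, 0, 0, 0, 0, time.UTC)
	winEnd := time.Date(2024, 7, 2, 0, 0, 0, 0, time.UTC)
	// A 2-slot allocation that started 6 hours before the window and ended
	// 6 hours into it only accrues the in-window time: 6 h * 2 slots = 12 slot-hours.
	allocStart := winStart.Add(-6 * time.Hour)
	allocEnd := winStart.Add(6 * time.Hour)
	fmt.Printf("%.1f slot-hours\n", slotHoursInWindow(allocStart, &allocEnd, 2, winStart, winEnd))
}
```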
7 changes: 7 additions & 0 deletions docs/reference/experiment-config-reference.rst
@@ -1174,6 +1174,13 @@ multiple GPUs is done using data parallelism. Configuring ``slots_per_trial`` to
certain models, as described in the `PyTorch documentation
<https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html#torch.nn.DataParallel>`__.

+ ``slots``
+ =========
+
+ For historical reasons, this field usually passes config validation steps, but has no practical
+ effect when present in experiment config. Use :ref:`slots_per_trial
+ <exp-config-resources-slots-per-trial>` instead.
+
``max_slots``
=============

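For context on the field the new ``slots`` entry points readers to, a minimal experiment config sketch (values are arbitrary; illustration only) requesting slots via ``resources.slots_per_trial``:

```yaml
# Illustration only: ask for two slots (e.g., GPUs) per trial.
# Prefer slots_per_trial; a legacy slots field may pass validation
# but does not control scheduling.
resources:
  slots_per_trial: 2
```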
7 changes: 7 additions & 0 deletions docs/release-notes/relabel-allocation-csv.rst
@@ -0,0 +1,7 @@
+ :orphan:
+
+ **Breaking Changes**
+
+ - Tasks: The :ref:`historical usage <historical-cluster-usage-data>` CSV header row for slot-hours
+   is now named ``slot_hours`` as it may also track allocation time for resource pools without GPUs.
+   Also, this CSV now has an additional column providing the ``resource_pool`` for each allocation.
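Because the header rename and the extra column can break scripts that index CSV columns by position, here is a hedged Go sketch of a reader that resolves columns by header name instead. The file name is a placeholder, and only columns visible in this commit are referenced:

```go
package main

import (
	"encoding/csv"
	"errors"
	"fmt"
	"io"
	"os"
)

func main() {
	// "allocations.csv" is a placeholder for a previously exported usage CSV.
	f, err := os.Open("allocations.csv")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	r := csv.NewReader(f)
	header, err := r.Read()
	if err != nil {
		panic(err)
	}
	// Resolve columns by header name instead of by position, so the
	// gpu_hours -> slot_hours rename and the appended resource_pool
	// column do not silently break the reader.
	col := map[string]int{}
	for i, name := range header {
		col[name] = i
	}
	for _, want := range []string{"resource_pool", "slot_hours", "start_time", "end_time"} {
		if _, ok := col[want]; !ok {
			panic("missing expected column: " + want)
		}
	}

	for {
		row, err := r.Read()
		if errors.Is(err, io.EOF) {
			break
		}
		if err != nil {
			panic(err)
		}
		fmt.Printf("pool=%s slot_hours=%s start=%s end=%s\n",
			row[col["resource_pool"]], row[col["slot_hours"]],
			row[col["start_time"]], row[col["end_time"]])
	}
}
```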
15 changes: 10 additions & 5 deletions master/internal/core.go
@@ -362,12 +362,13 @@ type AllocationMetadata struct {
TaskType model.TaskType
Username string
WorkspaceName string
+ ResourcePool string
ExperimentID int
Slots int
StartTime time.Time
EndTime time.Time
ImagepullingTime float64
- GPUHours float64
+ SlotHours float64
}

// canGetUsageDetails checks if the user has permission to get cluster usage details.
@@ -438,7 +439,8 @@ func (m *Master) getResourceAllocations(c echo.Context) error {
ColumnExpr("a.start_time").
ColumnExpr("a.end_time").
ColumnExpr("a.slots").
ColumnExpr("CASE WHEN a.start_time is NULL THEN 0.0 ELSE extract(epoch FROM (LEAST(GREATEST(coalesce(a.end_time, now()), a.start_time), ? :: timestamptz) - GREATEST(a.start_time, ? :: timestamptz))) * a.slots END AS gpu_seconds", end, start).
ColumnExpr("a.resource_pool").
ColumnExpr("CASE WHEN a.start_time is NULL THEN 0.0 ELSE extract(epoch FROM (LEAST(GREATEST(coalesce(a.end_time, now()), a.start_time), ? :: timestamptz) - GREATEST(a.start_time, ? :: timestamptz))) * a.slots END AS slot_seconds", end, start).
TableExpr("allocations a").
Where("tstzrange(start_time - interval '1 microsecond', greatest(start_time, coalesce(end_time, now()))) && tstzrange(? :: timestamptz, ? :: timestamptz)", start, end)

@@ -470,7 +472,8 @@ func (m *Master) getResourceAllocations(c echo.Context) error {
ColumnExpr("a.start_time").
ColumnExpr("a.end_time").
ColumnExpr("ip.imagepulling_time").
ColumnExpr("a.gpu_seconds / 3600.0 AS gpu_hours").
ColumnExpr("a.slot_seconds / 3600.0 AS slot_hours").
ColumnExpr("a.resource_pool").
With("tasks_in_range", tasksInRange).
With("allocations_in_range", allocationsInRange).
With("task_owners", taskOwners).
@@ -500,7 +503,8 @@ func (m *Master) getResourceAllocations(c echo.Context) error {
"start_time",
"end_time",
"imagepulling_time",
"gpu_hours",
"slot_hours",
"resource_pool",
}

formatTimestamp := func(t time.Time) string {
@@ -545,7 +549,8 @@ func (m *Master) getResourceAllocations(c echo.Context) error {
formatTimestamp(allocationMetadata.StartTime),
formatTimestamp(allocationMetadata.EndTime),
formatDuration(allocationMetadata.ImagepullingTime),
- formatDuration(allocationMetadata.GPUHours),
+ formatDuration(allocationMetadata.SlotHours),
+ allocationMetadata.ResourcePool,
}
if err := csvWriter.Write(fields); err != nil {
return err
