-
This sounds like a great plan already. With respect to the questions, I think that starting out with a 1:1 approach for resource management should be easier, and it might not be too much work to change that later, e.g. between iteration 2 and 3. However, I think that even for iteration 3 a 1:1 approach might be better, since you only have to implement the resource management once in the RPI for 1:1, while you would have to do it multiple times for 1:N.

In terms of resource management, would it make sense to consider various types of resources? E.g., there are some resources that are only uploaded to the device (or multiple devices), like vertex and index buffers or textures. Then there are transient resources that don't need to be shared at all, like a depth buffer that is only used in a part of a pipeline running on one device. And finally there are the resources that actually need to be shared/transferred. The RHI cannot really distinguish between those, which would be another argument for the 1:1 approach.
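To make the distinction concrete, the three categories could be expressed as something like the following (a hypothetical classification for discussion, not an existing RHI type):

```cpp
// Hypothetical classification of how a resource relates to devices; sketches
// the three categories described above.
enum class DeviceResidency
{
    Mirrored,  // uploaded independently to each device: vertex/index buffers, textures
    Transient, // lives on exactly one device for part of a pipeline: e.g. a depth buffer
    Shared     // must actually be shared/transferred between devices
};
```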
-
We had a closer look at where any references to the device are made in the engine and there are a few categories you can put those into.
-
- On a higher level, the device is of course used during viewport/window creation, which is what we want for the first iteration anyway; we just have to select a device there first.
- One question is how to handle RenderTick() in the RPISystem with multiple frame graphs, since we might want to support running frame graphs at different frame rates, e.g. when rendering to an HMD and a monitor with different refresh rates.
- Various functions that access device-specific functionality are used for profiling or displaying performance numbers. Those should not be problematic, since they can iterate over all devices, either using each separately or averaging over the devices.
- Some accesses query feature support, possible image formats, or other device information. While those don't sound difficult to handle, they can be, especially in the case of pipeline/shader caches (check GetPipelineLibraryPath, which calls GetPhysicalDeviceDescriptor), since those calls are quite high up in the hierarchy, i.e. within shader assets, which should probably even be device independent. So the question with these is where to actually introduce the device specificity. In some cases it might be possible to query feature support at later stages, when it is clear which device is used for some resource.
-
Personally, I would advocate a much simpler option to start, which is to simply multiplex all resource creation, and even entire pipelines, to all devices. Managing node masks, cross-device synchronization, etc. is likely to be a complex undertaking, and this complexity is something I would recommend tacking on later. Here's what I would consider starting with: …

As an example configuration, the … Afterwards, we can expose more low-level control as part of the …
-
Thank you for all your responses.

@jeremyong-az: We are not sure if split rendering is the goal to strive for. According to Nvidia, AFR has been the preferred method over SFR (https://docs.nvidia.com/gameworks/content/technologies/desktop/sli.htm). Considering that neither AFR nor SFR is widespread today (quite the contrary, Nvidia is dropping support), I wouldn't count on it being the ideal solution. Thinking about a complex rendering pipeline like the one in O3DE, I don't even know how you would do SFR throughout the pipeline. E.g., when rendering shadow maps, do you render them on both devices (diminishing the gains you would get from parallelizing over GPUs), or do you split every rendering operation, which would result in many more syncs and much more transfer overhead? The latter is especially detrimental without a fast GPU interconnect, which seems to be our situation right now, where we need to go through the CPU. For these reasons we are actually not interested in SFR and would like to experiment with other approaches.

@VickyAtAZ: As I wrote, we're in favor of pushing the device choice further up in the hierarchy, i.e. to …

@moudgils: We think that it might be a good idea to target having the possibility of a frame graph using multiple devices, where each pass is linked to one device, rather than splitting a pass over multiple devices (i.e. iteration 3). That would require being able to put passes within the graph that transfer data and synchronize between devices. This would give the passes/pipeline/developer full control over which pass runs on which device. This can then be used, for example, to build AFR on top. What do you think about that?
-
Action item: Create a working group.
-
Sorry for missing the meeting last week! Our current issue is that we still lack approval for open source work, and it looks like it is going to take a few more weeks before we get that approval and know exactly what we can communicate. It is probably best to wait until then before discussing further and creating a working group.
-
Hello everyone! We finally got approval for our open source work. While we were waiting, we spent some time further investigating and experimenting with a multi-GPU implementation in Atom, and I would like to detail what we tried and what issues we found along the way.

First steps towards multi-device support

First of all, we found that the way GLAD (which is used for handling dynamic loading of Vulkan) is used does not support multiple devices; a patch to fix this will be our first patch that we will submit a pull request for soon. Edit: here it is. We then initialized multiple devices in the … Next we experimented with introducing an …

RPI level multi-device objects

Next, we investigated having RPI level multi-device objects for the various resources that the RPI system handles. Specifically, we tried Buffers, Shaders, StreamingImages and ShaderResourceGroups. Other objects, like the Material for example, can then automatically become multi-device objects by using the multi-device versions for the internal buffers, images and SRGs. In this experiment we basically use the multi-device objects (which can be considered device independent, DI) in parallel with the device dependent (DD) objects. …

The major issue with the RPI layer DI object approach is that some objects that would need a DI version are not RPI but RHI layer objects. Important examples for these are: …

Additionally, sometimes DI and DD objects are not properly separated in the code base. For example, some … Another issue: theoretically …

Open questions

With the current state of our investigation it makes sense to reconsider the approach. The reason to have DI and DD resource objects at the RPI level was to have the device decision as high up as possible and to incur less overhead when a resource is only needed on one device. However, this approach comes at the disadvantage of some overhead, like having to consider which kind of object to use at every place. Another option would be to use RPI DI objects everywhere, with the option of having DI objects carry a policy that they only exist on one device. This doesn't solve the issue of RHI objects that would need DI versions (like …), though.

Of course we don't want to add additional overhead, so in the case of just one actual device, or of resources that only need to exist on one device, we need to optimize this in-between layer to have as little performance impact as possible. This should be possible by using a combination of inheritance and templates for the corresponding policies (a sketch of what this could look like follows at the end of this comment).

This brings us to the policies that actually govern the device selection for various objects. The two most basic ones are a single-device policy and an all-device policy. However, we are not yet sure which other policies would make sense. For example, a policy for which devices mesh data (and thus draw packets/items) should be on could ensure that this data is only present on the devices that need to rasterize geometry. Such a policy system could be implemented similarly to the draw tag or other tag systems already present in Atom.

The same question regarding device assignment is still open for passes, i.e. how to assign passes to a specific device. It is simpler in the sense that only one device has to be chosen for a …

We hope to be able to discuss this with you during tomorrow's meeting.
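As promised above, here is a minimal sketch of how inheritance and templates could keep the single-device case cheap. This is illustrative only; all names are assumptions, not code from our experiments:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <memory>

class DeviceBuffer {}; // stand-in for the existing single-device RHI object

// Hypothetical device-selection policies.
struct SingleDevicePolicy { static constexpr uint32_t MaxDevices = 1; };
struct AllDevicePolicy   { static constexpr uint32_t MaxDevices = 8; }; // arbitrary bound

template<typename Policy>
class MultiDeviceBuffer
{
public:
    DeviceBuffer* GetDeviceBuffer(uint32_t deviceIndex)
    {
        if constexpr (Policy::MaxDevices == 1)
        {
            return m_buffers[0].get(); // single-device: indexing compiles away
        }
        else
        {
            assert(deviceIndex < Policy::MaxDevices);
            return m_buffers[deviceIndex].get();
        }
    }

private:
    std::array<std::unique_ptr<DeviceBuffer>, Policy::MaxDevices> m_buffers;
};
```

A common base class could still offer a policy-agnostic interface where needed, at the cost of a virtual call.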
-
Have a look at this merge request please: o3de/o3de#14079

Since we've been working on multi-device support for Atom as a side project for quite a while now and have already put a considerable amount of time into it, we decided it is time to show you what we have done, get feedback, and discuss next steps to hopefully get our work merged. The changes we are showing here are substantial, and thus pretty much all areas of Atom are touched in some way, so this is a huge change set, and merging it will need some coordination, I guess. However, this is not a finished merge request yet: the code is currently based on a state of master from the end of September last year, and rebasing it onto the current state of master is again a considerable amount of work, so we wanted to get feedback first before we put more hours into this project. Also, not everything is working as before yet, so there are still some things to iron out. For some issues where we could not yet find the cause, we would appreciate help as well, but that's for a later version of this merge request. Most of the test cases work, however, and we tried them on Windows and Linux with DX12 and Vulkan, though not with Metal on macOS, since we don't have a machine for that.

So, what did we do?

I should probably start with an overview of what the plan is: before we can consider actual implementations of techniques where multiple devices (=GPUs) are utilized at the same time, we need to handle resources on multiple devices, i.e., a way to tell which resource should reside on which device. Our first attempt at introducing multi-device resources was on the RPI level, where each RPI resource could reference one or multiple RHI resources. However, during our reworking of the whole RPI this route turned out to be impossible, since some objects where we would have needed a multi-device consideration were not in the RPI but in the RHI, like DrawItems, and thus could not reference RPI objects.

Now this merge request is our second attempt at getting multi-device resources into Atom. The main point is to introduce multi-device resources in the RHI as a layer between RPI objects and the existing single-device RHI resource objects. Our goal here was to make this additional layer as independent from the APIs as possible, so that only few modifications within the API implementations are necessary, since those would need to be done three times, i.e., for each API (DX12, Vulkan, Metal). Spoiler alert: this was not completely possible, but I'll write about that below.

We rewrote the history of this merge request into four commits, which I would like to detail now: …
We also made the appropriate changes in the atom-sampleviewer, i.e., the first and last commit. You can find the corresponding merge request here: o3de/o3de-atom-sampleviewer#573

Discussion

Let me start by repeating what I have stated above already: of course adding an additional layer introduces a (hopefully negligible) overhead in the system. Future developments will also have to keep multi-device capabilities in mind. That said, we don't really see a different way to introduce multi-device support into Atom, at least not without even bigger changes.

One discussion point would be to figure out the best location where the code switches from handling multi-device objects to single-device objects. In this merge request we are doing this somewhere in the frame graph; more specifically, we are trying to do this between frame graph compilation and execution. This is probably one of the more crucial points to get right when adding multi-device support to Atom. That said, it is not really a crucial point for this merge request yet, since the focus here is to get multi-device resources working and then build on that. It is a huge change set currently anyway.

Please let us know what the further steps should be here. With your approval, we would like to rebase this code onto the current state of master and then merge it as quickly as possible, since every rebase currently is a huge amount of work. However, if there are still significant changes to be made to the way we tackle things before this can be merged, we need to address these first, of course.
-
As requested, here is a list of all the classes which were renamed to … Of course any … not renamed:

classes: …
The following classes and structs, which are mostly descriptors, requests and items (for copy, dispatch and draw), were all renamed and turned into multi-device classes in order to facilitate the use of the (multi-)device classes. This is required because they store either a reference or a pointer to a (multi-)device object. We have both variants of these classes available because in most cases the multi-device versions are used as a parameter for some method of the classes:
structs:
For example: multi-device …
-
The one thing I will highly recommend is that we do this first step in a manner where most of the changes are within the RHI itself and do not affect the RPI, or at least minimize that work as much as possible for this PR; we can then create a future PR as a next step. This will help us considerably when reviewing the PR and testing the PR, as well as being confident about not introducing a regression. Here is what I recommend in terms of design changes that can help with this goal. I propose that we do not create a DeviceXXX version of an object XXX that wasn't already inheriting from RHI::Resource or RHI::ResourcePool. This would mean that we will not be creating classes like DeviceDrawItem or DeviceDispatchItem, etc., and it would also mean that we will not have to update the RHI::CommandList::Submit API, thus indirectly not impacting the RPI at all.
- We could then remove objects like DeviceIndexBufferView and DeviceStreamBufferView. Instead, just modify the following function within IndexBufferView and StreamBufferView to be able to support buffers for multiple devices.
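A minimal sketch of what such an accessor could look like (the types and signatures here are assumptions for illustration, not the actual proposal):

```cpp
#include <cstdint>

class DeviceBuffer;   // existing single-device buffer (stub)
class Buffer          // multi-device buffer (stub)
{
public:
    const DeviceBuffer* GetDeviceBuffer(int deviceIndex) const;
};

// The view stores the multi-device Buffer and resolves the single-device
// buffer on demand, instead of storing a DeviceBuffer directly.
class IndexBufferView
{
public:
    const DeviceBuffer* GetBuffer(int deviceIndex) const
    {
        return m_buffer ? m_buffer->GetDeviceBuffer(deviceIndex) : nullptr;
    }

private:
    const Buffer* m_buffer = nullptr;
    uint32_t m_byteOffset = 0;
    uint32_t m_byteCount = 0;
};
```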
This accessor can then only be used by the RHI.
-
Something I realized later last night is that the CommandList::Submit API has no idea about the frame graph context and hence does not know which device index to use to translate. So one option is to add a default device index (int deviceIndex = 0) to the Submit API and then have the DX12/Vk/Mt CommandList::Submit methods change GetBuffer to GetBuffer(deviceIndex). The same applies to PSO and SRG objects. I think that is fine for now. The alternative is to keep the Device version of the DrawItem/DispatchItem around and have RHI::CommandList::Submit do the conversion from RHI::DrawItem to RHI::DeviceDrawItem and pass it to the backend version of the Submit API. I am not a fan of the latter approach, as it means that for thousands of calls to Submit we are creating a new Device version of the item each time, which seems inefficient. The RPI has code where it caches draw items for meshes, and that optimization will be nullified if we re-create the Device version of the draw items under the hood. Another option is to expand the Draw/Dispatch item to hold device-specific resources that are set by the RHI::CommandList::Submit API if not already set (so set once). This means we will not need to do an extra lookup for all the objects on each Submit call; if the RPI is re-using draw items, the RHI will only pay the cost of doing the translation once.

Regarding the ResourceView conversation, I would prefer the latter solution, as it is a simpler design (RHI::Buffer->RHI::DeviceBuffer->RHI::BufferView). In general, we have struggled with Resource vs ResourceView in terms of handling their relationship when considering multi-threaded programming. Basically we have had to fix a lot of edge cases around deletion of ResourceViews/Resources when multiple threads are trying to create a bunch of them and destroy a bunch of them together. You will see that we have added a lot of unit testing just around creation and deletion of ResourceViews to help solidify the code. Throwing in a new design where you can now have RHI::ResourceView and RHI::DeviceResourceView would make everything much harder.

As for your question related to other structs: it is hard for me to pass a general decision, but as an example let's visit a struct like BufferInitRequest vs DeviceBufferInitRequest. I think there is probably value in having both. RHI::BufferInitRequest is probably only going to be used by the RPI to create an RHI::Buffer, whereas RHI::DeviceBufferInitRequest is set up to only build out one RHI::DeviceBuffer and should only be used by the RHI. It would make sense to add and remove variables within the structs to that effect. For example, BufferInitRequest could contain a device mask telling the RHI how many DeviceBuffers to create (a rough sketch of this split follows below). BufferDescriptor probably only makes sense to live within BufferInitRequest and not within DeviceBufferInitRequest. At Buffer::Init the RHI will then use the BufferInitRequest to build out multiple DeviceBufferInitRequests and populate the RHI::Buffer as needed. BufferMapRequest and DeviceBufferMapRequest can follow the same logic, where one is for RPI use and the other for RHI use. Maybe the structs that are for RHI use only can live within RHI.Private to ensure the RPI is not using them. We can also discuss a specific struct that you may have in mind as well.
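To illustrate the split just described, here is a rough sketch; all field names are assumptions rather than the actual O3DE definitions:

```cpp
#include <cstdint>

class Buffer;       // multi-device RHI buffer (stub)
class DeviceBuffer; // single-device buffer (stub)
struct BufferDescriptor { uint64_t m_byteCount = 0; };

// RPI-facing: describes the buffer once and says which devices need it.
struct BufferInitRequest
{
    Buffer* m_buffer = nullptr;
    BufferDescriptor m_descriptor;   // lives only here, not per device
    uint32_t m_deviceMask = 0x1;     // bit i set => create on device i
    const void* m_initialData = nullptr;
};

// RHI-internal (e.g. RHI.Private): targets exactly one device. Buffer::Init
// derives one of these per set bit in m_deviceMask.
struct DeviceBufferInitRequest
{
    DeviceBuffer* m_buffer = nullptr;
    int m_deviceIndex = 0;
    const void* m_initialData = nullptr;
};
```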
-
We posted an RFC just now, please have a look: #120 :)
-
MGPU Support
Summary
Multi-GPU support can allow for better performance, as you are now able to use multiple GPUs to render frame(s) within window(s). With OpenXR support on the roadmap there is also an opportunity to use multi-GPU support if the app is run on PC and viewed via a VR device, or if the VR device has multiple adapters. Historically, adding multi-GPU support has been tricky to get right, and sometimes it is not worth the effort, depending on how it is implemented and what hardware is used for the purpose. O3DE currently does not provide multi-GPU support; this document goes over possible strategies, with the final proposed solution at the end. Multi-GPU support can also provide great benefits when used within cloud services where multiple GPU nodes are in play. Before we go any further, I have added a high-level view of how Atom is set up, in case there is any confusion around how the RPI and RHI are structured.
Goals
Known Strategies
Multi-GPU support can be done in many ways, and the drivers actually provide underlying support for some of these strategies. Below are three possible strategies for linked devices; there is literature out there on how they can be supported on Nvidia and AMD hardware.
Vulkan
Dx12
Metal
In theory, using the 'linked adapter' approach with one of the above-mentioned strategies would be a good approach, but there are a few problems with it. The linked adapter approach is not cleanly abstracted across all APIs in the same way, and hence creating a single API on top would be much harder to accomplish. Experiments also show that the existing APIs do not have mature support: after trying various Nvidia GPUs with SLI or NVLink on Windows and Linux, no configuration was found where vulkaninfo reported device groups with more than one GPU. This is in line with Nvidia dropping support for SLI.
Proposed strategy
Based on the goals mentioned above, I propose the following development path:
Explicit heterogeneous support only (no linked adapters)
Scalability support via development in an iterative manner. Each iteration builds on the previous one, and by the end of Iteration 3 we should be able to scale to N GPUs across any part of the render pipeline.
Virtual GPU support to help with debuggability - this will allow us to debug multi-GPU support with just one adapter
GPU query support across all devices to help with profiling
I think that adding support for all of the above will help us achieve all of the goals mentioned earlier in the document. It tries to address the issues of fractured API support across DX12/Metal/Vk, immature driver support, iterative development, better debugging support, and the ability to better profile all the passes within the pipeline. Let's go over more specific implementation details below.
Device selection and Management
Currently RHI::Factory provides a way to enumerate physical device instances on the system. A physical device is basically a handle to a GPU adapter, with some platform-independent information about it. The descriptor is as follows:
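(Abbreviated sketch from memory; only m_heapSizePerLevel, which is referenced below, matters for this proposal, and the other fields may not match the code exactly.)

```cpp
// Abbreviated sketch of RHI::PhysicalDeviceDescriptor.
class PhysicalDeviceDescriptor
{
public:
    AZStd::string m_description;            // human-readable adapter name
    PhysicalDeviceType m_type = PhysicalDeviceType::Unknown;
    VendorId m_vendorId = VendorId::Unknown;
    uint32_t m_deviceId = 0;
    uint32_t m_driverVersion = 0;
    // Dedicated video/system memory sizes, indexed by heap memory level.
    AZStd::array<size_t, HeapMemoryLevelCount> m_heapSizePerLevel = {{0}};
};
```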
The PhysicalDeviceDescriptor can be extended to hold more information about the adapter, which can help us build a priority list from the strongest GPU to the weakest. As a start we could just use m_heapSizePerLevel, which contains information about dedicated video/system memory, to assign a priority to an adapter.
At the moment we pick one adapter (with preference given to Nvidia/AMD) and create a device instance for it. We introduce a new variable within the settings registry that dictates the number of virtual instances per adapter. The default value is 1, but it can be raised in order to create multiple instances from the same adapter. Multiple instances may mean duplicated device memory (depending on the implementation), so caution needs to be taken when doing this; it should mainly be used for debugging purposes in the single-adapter case.
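For illustration, reading such a setting could look like this (the registry key name is a made-up placeholder, not an agreed-upon key):

```cpp
#include <AzCore/Settings/SettingsRegistry.h>

// Hypothetical key; the actual name would be decided during implementation.
constexpr const char* VirtualDeviceCountKey = "/O3DE/Atom/RHI/VirtualDeviceCountPerAdapter";

AZ::u64 GetVirtualDeviceCount()
{
    AZ::u64 count = 1; // default: one device instance per adapter
    if (auto* registry = AZ::SettingsRegistry::Get())
    {
        registry->Get(count, VirtualDeviceCountKey);
    }
    return count;
}
```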
A DeviceManager class will be needed which contains all the device instances and their mapping to the related adapter. It will be a singleton and can live within the RHISystem. New methods will need to be added to the RHISystem so that the RPI is able to access the list of device instances and all the information related to them. As part of init, the RPI is able to direct the RHI as to which GPUs should be enabled for multi-GPU support. By default the RHI will use the first instance of the chosen adapter. At the end of initialization the RHI will have successfully created a device instance mask with the bits for all the 'activated' device instances set to 1. This mask can then be used for validation purposes and can also be used by backends like DX12 as a 'node mask'.
We want to design around the fact that the RPI is able to activate a previously un-activated device instance. This allows flexibility in terms of adding or removing instances based on the load of the application and can be very beneficial when we have access to multiple nodes and are trying to load-balance across all of them.
API changes will be required to RHISystem::InitDevice, since we need to initialize multiple devices. A new API will also need to be added to activate or de-activate an existing device instance.
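A sketch of what this could look like (all names and signatures are assumptions for discussion, not an implemented API):

```cpp
#include <cstdint>
#include <memory>
#include <vector>

class Device;         // existing RHI device (stub)
class PhysicalDevice; // existing RHI physical device (stub)

// Hypothetical singleton owning all device instances and the activation mask.
class DeviceManager
{
public:
    static DeviceManager& Get();

    // Creates 'instanceCount' virtual device instances for the given adapter
    // and returns the index of the first one.
    int AddDeviceInstances(PhysicalDevice& adapter, uint32_t instanceCount);

    // Activation toggles bits in the device instance mask; backends such as
    // DX12 can reuse this mask as a node mask.
    void ActivateDeviceInstance(int deviceIndex)   { m_activeMask |=  (1u << deviceIndex); }
    void DeactivateDeviceInstance(int deviceIndex) { m_activeMask &= ~(1u << deviceIndex); }
    uint32_t GetActiveDeviceMask() const           { return m_activeMask; }

    Device* GetDevice(int deviceIndex) const;

private:
    std::vector<std::shared_ptr<Device>> m_devices;
    uint32_t m_activeMask = 0;
};
```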
Resource management
Before I go over each iteration separately, we need to make a big decision related to resource management. We have two modes of thinking here: a 1 resource → 1 device approach, where the RPI creates and manages a separate RHI resource per device, and a 1 resource → N devices approach, where a single RHI resource internally spans multiple devices. Both will have a severe impact on how we develop multi-GPU support.
Having said that, we should be able to use either of the options described above and still follow the iterative plan I have prescribed below.
Iteration 1 - Framegraph per device
For this approach no data transfer needs to happen from one GPU to another. This iteration essentially tries to apply the work related to each window to a different GPU, and hence the performance boost will be limited in some scenarios. For example, it would be very beneficial if we are using the Editor and have access to multiple GPUs, whereby we can assign a different GPU to each editor window, like the main editor, Material editor, UI editor, Animation editor, etc. If you have a game with one window/scene, this iteration will not be as beneficial, but it sets up the groundwork for more complicated iterations in the future.
The RPI will need a way to assign a priority to an editor, which can be used to map a GPU to an editor if all the windows are active at the same time. We can match the highest-priority adapter device instance to the highest-priority editor window. Based on this device→window mapping we will need to modify the ViewportContextManager to pass in a different device instance per window handle. The swapchain is already part of the window context, so we are good there.
The RPI will be able to assign a frame graph (i.e. m_frameScheduler) to a window, and this can be arbitrary. The frame scheduler provides the user-facing API for preparing (constructing), compiling, and executing a frame graph, so we will now need to create N frame schedulers for N GPUs. We would create a window handle→device instance→frame graph id mapping, and we would have to modify some of the RHISystem's API around managing m_frameScheduler to provide the mapping between frame graphs and device instances.
The RHISystem will need to change to hold N FrameSchedulers. We will need to update RHISystem::Init to pass the frame graph id and the device instance associated with the given frame graph. The init function initializes the frame scheduler, so at that point we establish the mapping between the frame scheduler and the device it will be using. The call to FrameUpdate will need to change to something like this:
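A sketch of the per-scheduler update (the callback and scheduler types are assumptions):

```cpp
// Sketch: one frame scheduler per active device instance.
void RHISystem::FrameUpdate(FrameGraphCallback frameGraphCallback)
{
    for (auto& [frameGraphId, scheduler] : m_frameSchedulers)
    {
        // Each scheduler was bound to one device instance during Init.
        scheduler.BeginFrame();
        frameGraphCallback(frameGraphId, scheduler); // prepare/compile the graph
        scheduler.EndFrame();                        // execute on the mapped device
    }
}
```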
1 resource→ 1 device approach
If we follow the one-device-per-resource approach, the RPI will be responsible for maintaining residency of the resource pools on their separate devices, based on the frame scheduler assigned to each device. This may not be too tricky, as the idea here is that we will have very different pipelines per window (think UI editor vs main window), and hence managing different pools per device may not be as bad. Since no communication is needed, nothing else should be required.
1 resource→ N device approach
If we follow this resource approach, nothing more will be needed at the RPI level. Of course we will have to expand the RHI to support this approach first, but once that is accomplished, if we want to execute frame graph X on device Y, then during frame graph X's execution it will query the resource present on device Y and use it as needed. No other changes will be required.
Iteration 2 - Pass per device (Essentially Iteration 1 but with device communication support)
For this iteration we are building upon the previous one: on top of running N frame graphs across N GPUs, we want to allow for communication between the frame graphs. This will let us address the case where we have one window with one or more cameras/scenes, which is the most common use case when playing a game. The idea behind this approach is that we now want to run multiple frame graphs on multiple GPUs that are bound to the same window and hence are allowed to communicate.
With this approach the RPI would have to allow for multiple disparate frame graphs per device. This means that pipeline authors have the option to break up the main pipeline into multiple sub-pipelines, where each sub-pipeline is able to run on a different virtual device instance. Each sub-pipeline will be assigned a separate FrameScheduler, which will create a separate frame graph to be run on a separate device. Imagine an example where you have two separate frame graphs running on separate GPUs: GPU A renders the Shadow, Depth and GBuffer passes, whose output is then copied over to GPU B for lighting, post-processing and presentation. The synchronization can be built around a CrossDeviceCopyPass using fences.
DX12 multi-GPU fencing API
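A minimal sketch of cross-adapter fencing in DX12 (error handling omitted; the devices and queues are assumed to be already created on the two adapters):

```cpp
#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Create a fence on device A that device B can wait on.
void LinkQueuesAcrossAdapters(
    ID3D12Device* deviceA, ID3D12CommandQueue* queueA,
    ID3D12Device* deviceB, ID3D12CommandQueue* queueB)
{
    ComPtr<ID3D12Fence> fenceA;
    deviceA->CreateFence(0, D3D12_FENCE_FLAG_SHARED | D3D12_FENCE_FLAG_SHARED_CROSS_ADAPTER,
                         IID_PPV_ARGS(&fenceA));

    HANDLE sharedHandle = nullptr;
    deviceA->CreateSharedHandle(fenceA.Get(), nullptr, GENERIC_ALL, nullptr, &sharedHandle);

    ComPtr<ID3D12Fence> fenceB;
    deviceB->OpenSharedHandle(sharedHandle, IID_PPV_ARGS(&fenceB));
    CloseHandle(sharedHandle);

    // GPU A signals once its copy into the shared staging buffer is done...
    queueA->Signal(fenceA.Get(), 1);
    // ...and GPU B waits for that value before consuming the staging buffer.
    queueB->Wait(fenceB.Get(), 1);
}
```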
RHI abstraction
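On the RHI side, a possible shape for the abstraction (all names are assumptions for discussion):

```cpp
#include <cstdint>
#include <memory>

class Device;       // existing RHI device (stub)
class CommandQueue; // existing RHI queue (stub)

// Hypothetical wrapper pairing a signaling fence on one device with its
// imported counterpart on another device.
class CrossDeviceFence
{
public:
    // Creates the fence on 'signalingDevice' and imports it on 'waitingDevice'.
    static std::unique_ptr<CrossDeviceFence> Create(Device& signalingDevice, Device& waitingDevice);

    void SignalOnQueue(CommandQueue& queue, uint64_t value); // signaling device side
    void WaitOnQueue(CommandQueue& queue, uint64_t value);   // waiting device side
};
```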
The RPI could have a special pass made to copy data from one device to another, called CrossDeviceCopyPass, where it would create fences like the ones described above. It would then attach fenceA as a signaling fence to the CrossDeviceCopyPass scope in frame graph A, which copies the data from GPU A to a shared staging buffer. The CrossDeviceCopyPass in frame graph B would then wait on fenceB and copy the data from the shared staging buffer to GPU B. We may need to add support for the RPI to attach a fence that a scope waits on.
1 resource→ 1 device approach
Let's discuss memory transfer with this approach. DirectX requires that you create a heap on the first device, then open a shared handle to that heap on the second device; you then have to create a separate placed resource for each device. A possible API could look like this:
DX12 API
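A sketch following the sequence described above (error handling and exact alignment rules omitted):

```cpp
#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Cross-adapter staging buffer: heap on device A, shared handle opened on
// device B, one placed resource per device.
void CreateCrossAdapterBuffer(ID3D12Device* deviceA, ID3D12Device* deviceB, UINT64 byteCount,
                              ComPtr<ID3D12Resource>& bufferA, ComPtr<ID3D12Resource>& bufferB)
{
    D3D12_HEAP_DESC heapDesc = {};
    heapDesc.SizeInBytes = byteCount;
    heapDesc.Properties.Type = D3D12_HEAP_TYPE_DEFAULT;
    heapDesc.Flags = D3D12_HEAP_FLAG_SHARED | D3D12_HEAP_FLAG_SHARED_CROSS_ADAPTER;

    ComPtr<ID3D12Heap> heapA;
    deviceA->CreateHeap(&heapDesc, IID_PPV_ARGS(&heapA));

    HANDLE heapHandle = nullptr;
    deviceA->CreateSharedHandle(heapA.Get(), nullptr, GENERIC_ALL, nullptr, &heapHandle);

    ComPtr<ID3D12Heap> heapB;
    deviceB->OpenSharedHandle(heapHandle, IID_PPV_ARGS(&heapB));
    CloseHandle(heapHandle);

    D3D12_RESOURCE_DESC bufferDesc = {};
    bufferDesc.Dimension = D3D12_RESOURCE_DIMENSION_BUFFER;
    bufferDesc.Width = byteCount;
    bufferDesc.Height = 1;
    bufferDesc.DepthOrArraySize = 1;
    bufferDesc.MipLevels = 1;
    bufferDesc.SampleDesc.Count = 1;
    bufferDesc.Layout = D3D12_TEXTURE_LAYOUT_ROW_MAJOR;
    bufferDesc.Flags = D3D12_RESOURCE_FLAG_ALLOW_CROSS_ADAPTER;

    deviceA->CreatePlacedResource(heapA.Get(), 0, &bufferDesc,
        D3D12_RESOURCE_STATE_COPY_DEST, nullptr, IID_PPV_ARGS(&bufferA));
    deviceB->CreatePlacedResource(heapB.Get(), 0, &bufferDesc,
        D3D12_RESOURCE_STATE_COPY_SOURCE, nullptr, IID_PPV_ARGS(&bufferB));
}
```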
Possible direction for RHI API
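One possible direction, sketched with assumed names:

```cpp
#include <cstddef>
#include <cstdint>

class Buffer;       // multi-device RHI buffer (stub)
class DeviceBuffer; // single-device buffer (stub)
enum class ResultCode { Success, Fail };

// Hypothetical pool that backs a buffer with a shared heap on the owning
// device and placed resources on every device in the mask.
class CrossDeviceBufferPool
{
public:
    ResultCode Init(int ownerDeviceIndex, uint32_t deviceMask, size_t heapByteCount);

    // Returns the placed resource for 'buffer' on a specific device.
    DeviceBuffer* GetDeviceBuffer(Buffer& buffer, int deviceIndex);
};
```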
1 resource→ N device approach
With this approach the RPI will not have to worry about creating separate shared pools. We could introduce a new buffer bind flag that tags a buffer as needing to alias across multiple devices. When that flag is set, the RHI backend itself could directly create the cross-device resources/resource views, and the API mentioned above may not need to be exposed outside of the RHI. The fence API will still exist, and in the CrossDeviceCopyPass the RPI would set up the correct copies (GPU A buffer to staging memory, and staging memory to GPU B) by just querying the RHI for the correct buffer pointer for a specific device.
Iteration 3 - Pass split across multiple devices
This should be the last iterative step and would allow for multi-GPU scaling across specific supported passes. For this approach we can split the screen along the X axis across N GPUs. For example, the forward pass can be split across N GPUs by setting the RT viewport/scissor dimensions to (Resolution_Width / N, Resolution_Height).
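For concreteness, the per-device slice could be computed like this (the Viewport struct here is a simplified stand-in):

```cpp
#include <cstdint>

struct Viewport { float m_minX, m_minY, m_maxX, m_maxY; }; // simplified stand-in

// Slice of the render target assigned to 'deviceIndex' of 'deviceCount'
// when splitting along the X axis.
Viewport GetDeviceViewport(uint32_t width, uint32_t height,
                           uint32_t deviceIndex, uint32_t deviceCount)
{
    const float sliceWidth = static_cast<float>(width) / deviceCount;
    return Viewport{
        sliceWidth * deviceIndex,       // minX
        0.0f,                           // minY
        sliceWidth * (deviceIndex + 1), // maxX
        static_cast<float>(height)      // maxY
    };
}
```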
Use Case 1 - Deferred pipeline: we run the Depth→Shadow→GBuffer passes on N-1 GPUs and have them copy the result for their section of the screen over to the primary GPU, which then applies Lighting→PostProcessing→Present. The vertex work is duplicated across all devices. We have to be careful with what we partition, as any passes that require filtering cannot be partitioned.
Use Case 2 - Ray tracing pipeline: generate rays on one GPU, trace them on N GPUs, copy the data back to the primary GPU to be filtered.
1 resource→ 1 device approach
The RPI will need to do a lot of the heavy lifting here. It will need to understand which resources need to live on which device for its appropriate frame graph and manage that accordingly. Based on the pipeline changes, it should be able to set appropriate viewport/scissor dimensions for the passes that need to be partitioned across multiple GPUs.
1 resource→ N device approach
The RPI will have much fewer changes. It will be able to query the RHI for the appropriate driver handle and set up the scopes appropriately. The synchronization code will be the same regardless. Below are a few thoughts on what will need to change for 1 resource→ N device support in general.
Resource creation
Resource Binding
Execution
Open Questions
We will need further research and discussion around how to proceed based on our use cases. I recommend setting up a working group to help facilitate further development in a much more transparent manner.