-
This sounds like a great plan already. With respect to the questions, I think that starting out with a 1:1 approach for resource management should be easier, and it might not be too much work to change that later, e.g. between iteration 2 and 3. However, I think that even for iteration 3 a 1:1 approach might be better, since you only have to implement the resource management once in the RPI for 1:1, while you would have to do it multiple times for 1:N.

In terms of resource management, would it make sense to consider various types of resources? E.g., there are some resources that are only uploaded to the device (or multiple devices), like vertex and index buffers or textures. Then there are transient resources that don't need to be shared at all, like a depth buffer that is only used in a part of a pipeline running on one device. And finally there are the resources that actually need to be shared/transferred. The RHI cannot really distinguish between those, which would be another argument for the 1:1 approach.
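To make the distinction concrete, the three categories could be expressed as something like the following (a hypothetical classification for discussion, not an existing RHI type):

```cpp
// Hypothetical classification of how a resource relates to devices; sketches
// the three categories described above.
enum class DeviceResidency
{
    Mirrored,  // uploaded independently to each device: vertex/index buffers, textures
    Transient, // lives on exactly one device for part of a pipeline: e.g. a depth buffer
    Shared     // must actually be shared/transferred between devices
};
```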
-
We had a closer look at where any references to the device are made in the engine and there are a few categories you can put those into.
-
- On a higher level, the device is of course used during viewport/window creation, which is what we want for the first iteration anyway; we just have to select a device there first.
- One question is how to handle RenderTick() in the RPISystem with multiple frame graphs, since we might want to support running frame graphs at different frame rates, e.g. when rendering to an HMD and a monitor with different refresh rates.
- Various functions that access device-specific functionality are used for profiling or displaying performance numbers. Those should not be problematic, since they can iterate over all devices, either using each separately or averaging over the devices.
- Some accesses query feature support, possible image formats, or other device information. While those don't sound difficult to handle, they can be, especially in the case of pipeline/shader caches (check GetPipelineLibraryPath, which calls GetPhysicalDeviceDescriptor), since those calls are quite high up in the hierarchy, i.e. within shader assets, which should probably even be device independent. So the question with these is where to actually introduce the device specificity. In some cases it might be possible to query feature support at later stages, when it is clear which device is used for some resource.
-
Personally, I would advocate a much simpler option to start, which is to simply multiplex all resource creation, and even entire pipelines, to all devices. Managing node masks, cross-device synchronization, etc. is likely to be a complex undertaking, and this complexity is something I would recommend tacking on later. Here's what I would consider starting with: …

As an example configuration, the … Afterwards, we can expose more low-level control as part of the …
-
Thank you for all your responses.

@jeremyong-az: We are not sure if split rendering is the goal to strive for. According to Nvidia, AFR has been the preferred method over SFR (https://docs.nvidia.com/gameworks/content/technologies/desktop/sli.htm). Considering that neither AFR nor SFR is widespread today (quite the contrary, Nvidia is dropping support), I wouldn't count on it being the ideal solution. Thinking about a complex rendering pipeline like the one in O3DE, I don't even know how you would do SFR throughout the pipeline. E.g., when rendering shadow maps, do you render them on both devices (diminishing the gains you would get from parallelizing over GPUs), or do you split every rendering operation, which would result in many more syncs and much more transfer overhead? The latter is especially detrimental without a fast GPU interconnect, which seems to be our situation right now, where we need to go through the CPU. For these reasons we are actually not interested in SFR and would like to experiment with other approaches.

@VickyAtAZ: As I wrote, we're in favor of pushing the device choice further up in the hierarchy, i.e. to …

@moudgils: We think that it might be a good idea to target having the possibility of a frame graph using multiple devices, where each pass is linked to one device, rather than splitting a pass over multiple devices (i.e. iteration 3). That would require being able to put passes within the graph that transfer data and synchronize between devices. This would give the passes/pipeline/developer full control over which pass runs on which device. This can then be used, for example, to build AFR on top. What do you think about that?
-
Action item: Create a working group.
-
Sorry for missing the meeting last week! Our current issue is that we still lack approval for open source work, and it looks like it is going to take a few more weeks before we get that approval and know exactly what we can communicate. It is probably best to wait until then before discussing further and creating a working group.
-
Hello everyone! We finally got approval for our open source work. While we were waiting, we spent some time further investigating and experimenting with a multi-GPU implementation in Atom, and I would like to detail what we tried and what issues we found along the way.

First steps towards multi-device support

First of all, we found that the way GLAD (which is used for handling dynamic loading of Vulkan) is used does not support multiple devices; a patch to fix this will be our first patch that we will submit a pull request for soon. Edit: here it is. We then initialized multiple devices in the … Next we experimented with introducing an …

RPI level multi-device objects

Next, we investigated having RPI level multi-device objects for the various resources that the RPI system handles. Specifically, we tried Buffers, Shaders, StreamingImages and ShaderResourceGroups. Other objects, like the Material for example, can then automatically become multi-device objects by using the multi-device versions for the internal buffers, images and SRGs. In this experiment we basically use the multi-device objects (which can be considered device independent, DI) in parallel with the device dependent (DD) objects. …

The major issue with the RPI layer DI object approach is that some objects that would need a DI version are not RPI but RHI layer objects. Important examples for these are: …

Additionally, sometimes DI and DD objects are not properly separated in the code base. For example, some … Another issue: theoretically …

Open questions

With the current state of our investigation it makes sense to reconsider the approach. The reason to have DI and DD resource objects at the RPI level was to have the device decision as high up as possible and to incur less overhead when a resource is only needed on one device. However, this approach comes at the disadvantage of some overhead, like having to consider which kind of object to use at every place. Another option would be to use RPI DI objects everywhere, with the option of having DI objects carry a policy that they only exist on one device. This doesn't solve the issue of RHI objects that would need DI versions (like …), though.

Of course we don't want to add additional overhead, so in the case of just one actual device, or of resources that only need to exist on one device, we need to optimize this in-between layer to have as little performance impact as possible. This should be possible by using a combination of inheritance and templates for the corresponding policies (a sketch of what this could look like follows at the end of this comment).

This brings us to the policies that actually govern the device selection for various objects. The two most basic ones are a single-device policy and an all-device policy. However, we are not yet sure which other policies would make sense. For example, a policy for which devices mesh data (and thus draw packets/items) should be on could ensure that this data is only present on the devices that need to rasterize geometry. Such a policy system could be implemented similarly to the draw tag or other tag systems already present in Atom.

The same question regarding device assignment is still open for passes, i.e. how to assign passes to a specific device. It is simpler in the sense that only one device has to be chosen for a …

We hope to be able to discuss this with you during tomorrow's meeting.
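As promised above, here is a minimal sketch of how inheritance and templates could keep the single-device case cheap. This is illustrative only; all names are assumptions, not code from our experiments:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <memory>

class DeviceBuffer {}; // stand-in for the existing single-device RHI object

// Hypothetical device-selection policies.
struct SingleDevicePolicy { static constexpr uint32_t MaxDevices = 1; };
struct AllDevicePolicy   { static constexpr uint32_t MaxDevices = 8; }; // arbitrary bound

template<typename Policy>
class MultiDeviceBuffer
{
public:
    DeviceBuffer* GetDeviceBuffer(uint32_t deviceIndex)
    {
        if constexpr (Policy::MaxDevices == 1)
        {
            return m_buffers[0].get(); // single-device: indexing compiles away
        }
        else
        {
            assert(deviceIndex < Policy::MaxDevices);
            return m_buffers[deviceIndex].get();
        }
    }

private:
    std::array<std::unique_ptr<DeviceBuffer>, Policy::MaxDevices> m_buffers;
};
```

A common base class could still offer a policy-agnostic interface where needed, at the cost of a virtual call.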
-
Have a look at this merge request please: o3de/o3de#14079

Since we've been working on multi-device support for Atom as a side project for quite a while now and have already put a considerable amount of time into it, we decided it is time to show you what we have done, get feedback, and discuss next steps to hopefully get our work merged. The changes we are showing here are substantial, and thus pretty much all areas of Atom are touched in some way, so this is a huge change set, and merging it will need some coordination, I guess. However, this is not a finished merge request yet: the code is currently based on a state of master from the end of September last year, and rebasing it onto the current state of master is again a considerable amount of work, so we wanted to get feedback first before we put more hours into this project. Also, not everything is working as before yet, so there are still some things to iron out. For some issues where we could not yet find the cause, we would appreciate help as well, but that's for a later version of this merge request. Most of the test cases work, however, and we tried them on Windows and Linux with DX12 and Vulkan, though not with Metal on macOS, since we don't have a machine for that.

So, what did we do?

I should probably start with an overview of what the plan is: before we can consider actual implementations of techniques where multiple devices (=GPUs) are utilized at the same time, we need to handle resources on multiple devices, i.e., a way to tell which resource should reside on which device. Our first attempt at introducing multi-device resources was on the RPI level, where each RPI resource could reference one or multiple RHI resources. However, during our reworking of the whole RPI this route turned out to be impossible, since some objects where we would have needed a multi-device consideration were not in the RPI but in the RHI, like DrawItems, and thus could not reference RPI objects.

Now this merge request is our second attempt at getting multi-device resources into Atom. The main point is to introduce multi-device resources in the RHI as a layer between RPI objects and the existing single-device RHI resource objects. Our goal here was to make this additional layer as independent from the APIs as possible, so that only few modifications within the API implementations are necessary, since those would need to be done three times, i.e., for each API (DX12, Vulkan, Metal). Spoiler alert: this was not completely possible, but I'll write about that below.

We rewrote the history of this merge request into four commits, which I would like to detail now: …
We also made the appropriate changes in the atom-sampleviewer, i.e., the first and last commit. You can find the corresponding merge request here: o3de/o3de-atom-sampleviewer#573

Discussion

Let me start by repeating what I have stated above already: of course adding an additional layer introduces a (hopefully negligible) overhead in the system. Future developments will also have to keep multi-device capabilities in mind. That said, we don't really see a different way to introduce multi-device support into Atom, at least not without even bigger changes.

One discussion point would be to figure out the best location where the code switches from handling multi-device objects to single-device objects. In this merge request we are doing this somewhere in the frame graph; more specifically, we are trying to do this between frame graph compilation and execution. This is probably one of the more crucial points to get right when adding multi-device support to Atom. That said, it is not really a crucial point for this merge request yet, since the focus here is to get multi-device resources working and then build on that. It is a huge change set currently anyway.

Please let us know what the further steps should be here. With your approval, we would like to rebase this code onto the current state of master and then merge it as quickly as possible, since every rebase currently is a huge amount of work. However, if there are still significant changes to be made to the way we tackle things before this can be merged, we need to address these first, of course.
-
As requested, here is a list of all the classes which were renamed to … Of course any … not renamed:

classes: …
The following classes and structs, which are mostly descriptors, requests and items (for copy, dispatch and draw), were all renamed and turned into multi-device classes in order to facilitate the use of the (multi-)device classes. This is required because they store either a reference or a pointer to a (multi-)device object. We have both variants of these classes available because in most cases the multi-device versions are used as a parameter for some method of the classes:
structs:
For example: multi-device …
-
The one thing I will highly recommend is that we do this first step in a manner where most of the changes are within the RHI itself and do not affect the RPI, or at least minimize that work as much as possible for this PR; we can then create a future PR as a next step. This will help us considerably when reviewing the PR and testing the PR, as well as being confident about not introducing a regression. Here is what I recommend in terms of design changes that can help with this goal. I propose that we do not create a DeviceXXX version of an object XXX that wasn't already inheriting from RHI::Resource or RHI::ResourcePool. This would mean that we will not be creating classes like DeviceDrawItem or DeviceDispatchItem, etc., and it would also mean that we will not have to update the RHI::CommandList::Submit API, thus indirectly not impacting the RPI at all.
- We could then remove objects like DeviceIndexBufferView and DeviceStreamBufferView. Instead, just modify the following function within IndexBufferView and StreamBufferView to be able to support buffers for multiple devices.
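A minimal sketch of what such an accessor could look like (the types and signatures here are assumptions for illustration, not the actual proposal):

```cpp
#include <cstdint>

class DeviceBuffer;   // existing single-device buffer (stub)
class Buffer          // multi-device buffer (stub)
{
public:
    const DeviceBuffer* GetDeviceBuffer(int deviceIndex) const;
};

// The view stores the multi-device Buffer and resolves the single-device
// buffer on demand, instead of storing a DeviceBuffer directly.
class IndexBufferView
{
public:
    const DeviceBuffer* GetBuffer(int deviceIndex) const
    {
        return m_buffer ? m_buffer->GetDeviceBuffer(deviceIndex) : nullptr;
    }

private:
    const Buffer* m_buffer = nullptr;
    uint32_t m_byteOffset = 0;
    uint32_t m_byteCount = 0;
};
```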
This accessor can then only be used by the RHI.
-
Something I realized later last night is that the CommandList::Submit API has no idea about the frame graph context and hence does not know which device index to use to translate. So one option is to add a default device index (int deviceIndex = 0) to the Submit API and then have the DX12/Vk/Mt CommandList::Submit methods change GetBuffer to GetBuffer(deviceIndex). The same applies to PSO and SRG objects. I think that is fine for now. The alternative is to keep the Device version of the DrawItem/DispatchItem around and have RHI::CommandList::Submit do the conversion from RHI::DrawItem to RHI::DeviceDrawItem and pass it to the backend version of the Submit API. I am not a fan of the latter approach, as it means that for thousands of calls to Submit we are creating a new Device version of the item each time, which seems inefficient. The RPI has code where it caches draw items for meshes, and that optimization will be nullified if we re-create the Device version of the draw items under the hood. Another option is to expand the Draw/Dispatch item to hold device-specific resources that are set by the RHI::CommandList::Submit API if not already set (so set once). This means we will not need to do an extra lookup for all the objects on each Submit call; if the RPI is re-using draw items, the RHI will only pay the cost of doing the translation once.

Regarding the ResourceView conversation, I would prefer the latter solution, as it is a simpler design (RHI::Buffer->RHI::DeviceBuffer->RHI::BufferView). In general, we have struggled with Resource vs ResourceView in terms of handling their relationship when considering multi-threaded programming. Basically we have had to fix a lot of edge cases around deletion of ResourceViews/Resources when multiple threads are trying to create a bunch of them and destroy a bunch of them together. You will see that we have added a lot of unit testing just around creation and deletion of ResourceViews to help solidify the code. Throwing in a new design where you can now have RHI::ResourceView and RHI::DeviceResourceView would make everything much harder.

As for your question related to other structs: it is hard for me to pass a general decision, but as an example let's visit a struct like BufferInitRequest vs DeviceBufferInitRequest. I think there is probably value in having both. RHI::BufferInitRequest is probably only going to be used by the RPI to create an RHI::Buffer, whereas RHI::DeviceBufferInitRequest is set up to only build out one RHI::DeviceBuffer and should only be used by the RHI. It would make sense to add and remove variables within the structs to that effect. For example, BufferInitRequest could contain a device mask telling the RHI how many DeviceBuffers to create (a rough sketch of this split follows below). BufferDescriptor probably only makes sense to live within BufferInitRequest and not within DeviceBufferInitRequest. At Buffer::Init the RHI will then use the BufferInitRequest to build out multiple DeviceBufferInitRequests and populate the RHI::Buffer as needed. BufferMapRequest and DeviceBufferMapRequest can follow the same logic, where one is for RPI use and the other for RHI use. Maybe the structs that are for RHI use only can live within RHI.Private to ensure the RPI is not using them. We can also discuss a specific struct that you may have in mind as well.
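To illustrate the split just described, here is a rough sketch; all field names are assumptions rather than the actual O3DE definitions:

```cpp
#include <cstdint>

class Buffer;       // multi-device RHI buffer (stub)
class DeviceBuffer; // single-device buffer (stub)
struct BufferDescriptor { uint64_t m_byteCount = 0; };

// RPI-facing: describes the buffer once and says which devices need it.
struct BufferInitRequest
{
    Buffer* m_buffer = nullptr;
    BufferDescriptor m_descriptor;   // lives only here, not per device
    uint32_t m_deviceMask = 0x1;     // bit i set => create on device i
    const void* m_initialData = nullptr;
};

// RHI-internal (e.g. RHI.Private): targets exactly one device. Buffer::Init
// derives one of these per set bit in m_deviceMask.
struct DeviceBufferInitRequest
{
    DeviceBuffer* m_buffer = nullptr;
    int m_deviceIndex = 0;
    const void* m_initialData = nullptr;
};
```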
-
We posted an RFC just now, please have a look: #120 :)
-
MGPU Support
Summary
Multi-GPU support can allow for better performance, as you are now able to use multiple GPUs to render frame(s) within window(s). With OpenXR support on the roadmap there is also an opportunity to use multi-GPU support if the app is run on PC and viewed via a VR device, or if the VR device has multiple adapters. Historically, adding multi-GPU support has been tricky to get right, and sometimes it is not worth the effort, depending on how it is implemented and what hardware is used for the purpose. O3DE currently does not provide multi-GPU support; this document goes over possible strategies, with the final proposed solution at the end. Multi-GPU support can also provide great benefits when used within cloud services where multiple GPU nodes are in play. Before we go any further, I have added a high-level view of how Atom is set up, in case there is any confusion around how the RPI and RHI are structured.
Goals
Known Strategies
Multi-GPU support can be done in many ways, and the drivers actually provide underlying support for some of these strategies. Below are three possible strategies for linked devices; there is literature out there on how they can be supported on Nvidia and AMD hardware.
Vulkan
Dx12
Metal
In theory, using the 'linked adapter' approach with one of the above-mentioned strategies would be a good approach, but there are a few problems with it. The linked adapter approach is not cleanly abstracted across all APIs in the same way, and hence creating a single API on top would be much harder to accomplish. Experiments also show that the existing APIs do not have mature support: after trying various Nvidia GPUs with SLI or NVLink on Windows and Linux, no configuration was found where vulkaninfo reported device groups with more than one GPU. This is in line with Nvidia dropping support for SLI.
Proposed strategy
Based on the goals mentioned above, I propose the following development path:
Explicit heterogeneous support only (no linked adapters)
Scalability support via development in an iterative manner. Each iteration builds on the previous one, and by the end of Iteration 3 we should be able to scale to N GPUs across any part of the render pipeline.
Virtual GPU support to help with debuggability - this will allow us to debug multi-GPU support with just one adapter
GPU query support across all devices to help with profiling
I think that adding support for all of the above will help us achieve all of the goals mentioned earlier in the document. It tries to address the issues of fractured API support across DX12/Metal/Vk, immature driver support, iterative development, better debugging support, and the ability to better profile all the passes within the pipeline. Let's go over more specific implementation details below.
Device selection and Management
Currently RHI::Factory provides a way to enumerate physical device instances on the system. A physical device is basically a handle to a GPU adapter, with some platform-independent information about it. The descriptor is as follows:
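(Abbreviated sketch from memory; only m_heapSizePerLevel, which is referenced below, matters for this proposal, and the other fields may not match the code exactly.)

```cpp
// Abbreviated sketch of RHI::PhysicalDeviceDescriptor.
class PhysicalDeviceDescriptor
{
public:
    AZStd::string m_description;            // human-readable adapter name
    PhysicalDeviceType m_type = PhysicalDeviceType::Unknown;
    VendorId m_vendorId = VendorId::Unknown;
    uint32_t m_deviceId = 0;
    uint32_t m_driverVersion = 0;
    // Dedicated video/system memory sizes, indexed by heap memory level.
    AZStd::array<size_t, HeapMemoryLevelCount> m_heapSizePerLevel = {{0}};
};
```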
The PhysicalDeviceDescriptor can be extended to hold more information about the adapter, which can help us build a priority list from the strongest GPU to the weakest. As a start we could just use m_heapSizePerLevel, which contains information about dedicated video/system memory, to assign a priority to an adapter.
At the moment we pick one adapter (with preference given to Nvidia/AMD) and create a device instance for it. We introduce a new variable within the settings registry that dictates the number of virtual instances per adapter. The default value is 1, but it can be raised in order to create multiple instances from the same adapter. Multiple instances may mean duplicated device memory (depending on the implementation), so caution needs to be taken when doing this; it should mainly be used for debugging purposes in the single-adapter case.
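For illustration, reading such a setting could look like this (the registry key name is a made-up placeholder, not an agreed-upon key):

```cpp
#include <AzCore/Settings/SettingsRegistry.h>

// Hypothetical key; the actual name would be decided during implementation.
constexpr const char* VirtualDeviceCountKey = "/O3DE/Atom/RHI/VirtualDeviceCountPerAdapter";

AZ::u64 GetVirtualDeviceCount()
{
    AZ::u64 count = 1; // default: one device instance per adapter
    if (auto* registry = AZ::SettingsRegistry::Get())
    {
        registry->Get(count, VirtualDeviceCountKey);
    }
    return count;
}
```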
A DeviceManager class will be needed which contains all the device instances and their mapping to the related adapter. It will be a singleton and can live within the RHISystem. New methods will need to be added to the RHISystem so that the RPI is able to access the list of device instances and all the information related to them. As part of init, the RPI is able to direct the RHI as to which GPUs should be enabled for multi-GPU support. By default the RHI will use the first instance of the chosen adapter. At the end of initialization the RHI will have successfully created a device instance mask with the bits for all the 'activated' device instances set to 1. This mask can then be used for validation purposes and can also be used by backends like DX12 as a 'node mask'.
We want to design around the fact that the RPI is able to activate a previously un-activated device instance. This allows flexibility in terms of adding or removing instances based on the load of the application and can be very beneficial when we have access to multiple nodes and are trying to load-balance across all of them.
API changes will be required to RHISystem::InitDevice, since we need to initialize multiple devices. A new API will also need to be added to activate or de-activate an existing device instance.
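A sketch of what this could look like (all names and signatures are assumptions for discussion, not an implemented API):

```cpp
#include <cstdint>
#include <memory>
#include <vector>

class Device;         // existing RHI device (stub)
class PhysicalDevice; // existing RHI physical device (stub)

// Hypothetical singleton owning all device instances and the activation mask.
class DeviceManager
{
public:
    static DeviceManager& Get();

    // Creates 'instanceCount' virtual device instances for the given adapter
    // and returns the index of the first one.
    int AddDeviceInstances(PhysicalDevice& adapter, uint32_t instanceCount);

    // Activation toggles bits in the device instance mask; backends such as
    // DX12 can reuse this mask as a node mask.
    void ActivateDeviceInstance(int deviceIndex)   { m_activeMask |=  (1u << deviceIndex); }
    void DeactivateDeviceInstance(int deviceIndex) { m_activeMask &= ~(1u << deviceIndex); }
    uint32_t GetActiveDeviceMask() const           { return m_activeMask; }

    Device* GetDevice(int deviceIndex) const;

private:
    std::vector<std::shared_ptr<Device>> m_devices;
    uint32_t m_activeMask = 0;
};
```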
Resource management
Before I go over each iteration separately, we need to make a big decision related to resource management. We have two modes of thinking here: a 1 resource → 1 device approach, where the RPI creates and manages a separate RHI resource per device, and a 1 resource → N devices approach, where a single RHI resource internally spans multiple devices. Both will have a severe impact on how we develop multi-GPU support.
Having said that, we should be able to use either of the options described above and still follow the iterative plan I have prescribed below.
Iteration 1 - Framegraph per device
For this approach no data transfer needs to happen from one GPU to another. This iteration essentially tries to apply the work related to each window to a different GPU, and hence the performance boost will be limited in some scenarios. For example, it would be very beneficial if we are using the Editor and have access to multiple GPUs, whereby we can assign a different GPU to each editor window, like the main editor, Material editor, UI editor, Animation editor, etc. If you have a game with one window/scene, this iteration will not be as beneficial, but it sets up the groundwork for more complicated iterations in the future.
The RPI will need a way to assign a priority to an editor, which can be used to map a GPU to an editor if all the windows are active at the same time. We can match the highest-priority adapter device instance to the highest-priority editor window. Based on this device→window mapping we will need to modify the ViewportContextManager to pass in a different device instance per window handle. The swapchain is already part of the window context, so we are good there.
The RPI will be able to assign a frame graph (i.e. m_frameScheduler) to a window, and this can be arbitrary. The frame scheduler provides the user-facing API for preparing (constructing), compiling, and executing a frame graph, so we will now need to create N frame schedulers for N GPUs. We would create a window handle→device instance→frame graph id mapping, and we would have to modify some of the RHISystem's API around managing m_frameScheduler to provide the mapping between frame graphs and device instances.
The RHISystem will need to change to hold N FrameSchedulers. We will need to update RHISystem::Init to pass the frame graph id and the device instance associated with the given frame graph. The init function initializes the frame scheduler, so at that point we establish the mapping between the frame scheduler and the device it will be using. The call to FrameUpdate will need to change to something like this:
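A sketch of the per-scheduler update (the callback and scheduler types are assumptions):

```cpp
// Sketch: one frame scheduler per active device instance.
void RHISystem::FrameUpdate(FrameGraphCallback frameGraphCallback)
{
    for (auto& [frameGraphId, scheduler] : m_frameSchedulers)
    {
        // Each scheduler was bound to one device instance during Init.
        scheduler.BeginFrame();
        frameGraphCallback(frameGraphId, scheduler); // prepare/compile the graph
        scheduler.EndFrame();                        // execute on the mapped device
    }
}
```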
1 resource→ 1 device approach
If we follow the one-device-per-resource approach, the RPI will be responsible for maintaining residency of the resource pools on their separate devices, based on the frame scheduler assigned to each device. This may not be too tricky, as the idea here is that we will have very different pipelines per window (think UI editor vs main window), and hence managing different pools per device may not be as bad. Since no communication is needed, nothing else should be required.
1 resource→ N device approach
If we follow this resource approach, nothing more will be needed at the RPI level. Of course we will have to expand the RHI to support this approach first, but once that is accomplished, if we want to execute frame graph X on device Y, then during frame graph X's execution it will query the resource present on device Y and use it as needed. No other changes will be required.
Iteration 2 - Pass per device (Essentially Iteration 1 but with device communication support)
For this iteration we are building upon the previous one: on top of running N frame graphs across N GPUs, we want to allow for communication between the frame graphs. This will let us address the case where we have one window with one or more cameras/scenes, which is the most common use case when playing a game. The idea behind this approach is that we now want to run multiple frame graphs on multiple GPUs that are bound to the same window and hence are allowed to communicate.
With this approach the RPI would have to allow for multiple disparate frame graphs per device. This means that pipeline authors have the option to break up the main pipeline into multiple sub-pipelines, where each sub-pipeline is able to run on a different virtual device instance. Each sub-pipeline will be assigned a separate FrameScheduler, which will create a separate frame graph to be run on a separate device. Imagine an example where you have two separate frame graphs running on separate GPUs: GPU A renders the Shadow, Depth and GBuffer passes, whose output is then copied over to GPU B for lighting, post-processing and presentation. The synchronization can be built around a CrossDeviceCopyPass using fences.
DX12 multi-GPU fencing API
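A minimal sketch of cross-adapter fencing in DX12 (error handling omitted; the devices and queues are assumed to be already created on the two adapters):

```cpp
#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Create a fence on device A that device B can wait on.
void LinkQueuesAcrossAdapters(
    ID3D12Device* deviceA, ID3D12CommandQueue* queueA,
    ID3D12Device* deviceB, ID3D12CommandQueue* queueB)
{
    ComPtr<ID3D12Fence> fenceA;
    deviceA->CreateFence(0, D3D12_FENCE_FLAG_SHARED | D3D12_FENCE_FLAG_SHARED_CROSS_ADAPTER,
                         IID_PPV_ARGS(&fenceA));

    HANDLE sharedHandle = nullptr;
    deviceA->CreateSharedHandle(fenceA.Get(), nullptr, GENERIC_ALL, nullptr, &sharedHandle);

    ComPtr<ID3D12Fence> fenceB;
    deviceB->OpenSharedHandle(sharedHandle, IID_PPV_ARGS(&fenceB));
    CloseHandle(sharedHandle);

    // GPU A signals once its copy into the shared staging buffer is done...
    queueA->Signal(fenceA.Get(), 1);
    // ...and GPU B waits for that value before consuming the staging buffer.
    queueB->Wait(fenceB.Get(), 1);
}
```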
RHI abstraction
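On the RHI side, a possible shape for the abstraction (all names are assumptions for discussion):

```cpp
#include <cstdint>
#include <memory>

class Device;       // existing RHI device (stub)
class CommandQueue; // existing RHI queue (stub)

// Hypothetical wrapper pairing a signaling fence on one device with its
// imported counterpart on another device.
class CrossDeviceFence
{
public:
    // Creates the fence on 'signalingDevice' and imports it on 'waitingDevice'.
    static std::unique_ptr<CrossDeviceFence> Create(Device& signalingDevice, Device& waitingDevice);

    void SignalOnQueue(CommandQueue& queue, uint64_t value); // signaling device side
    void WaitOnQueue(CommandQueue& queue, uint64_t value);   // waiting device side
};
```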
The RPI could have a special pass made to copy data from one device to another, called CrossDeviceCopyPass, where it would create fences like the ones described above. It would then attach fenceA as a signaling fence to the CrossDeviceCopyPass scope in frame graph A, which copies the data from GPU A to a shared staging buffer. The CrossDeviceCopyPass in frame graph B would then wait on fenceB and copy the data from the shared staging buffer to GPU B. We may need to add support for the RPI to attach a fence that a scope waits on.
1 resource→ 1 device approach
Let's discuss memory transfer with this approach. DirectX requires that you create a heap on the first device, then open a shared handle to that heap on the second device; you then have to create a separate placed resource for each device. A possible API could look like this:
DX12 API
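A sketch following the sequence described above (error handling and exact alignment rules omitted):

```cpp
#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Cross-adapter staging buffer: heap on device A, shared handle opened on
// device B, one placed resource per device.
void CreateCrossAdapterBuffer(ID3D12Device* deviceA, ID3D12Device* deviceB, UINT64 byteCount,
                              ComPtr<ID3D12Resource>& bufferA, ComPtr<ID3D12Resource>& bufferB)
{
    D3D12_HEAP_DESC heapDesc = {};
    heapDesc.SizeInBytes = byteCount;
    heapDesc.Properties.Type = D3D12_HEAP_TYPE_DEFAULT;
    heapDesc.Flags = D3D12_HEAP_FLAG_SHARED | D3D12_HEAP_FLAG_SHARED_CROSS_ADAPTER;

    ComPtr<ID3D12Heap> heapA;
    deviceA->CreateHeap(&heapDesc, IID_PPV_ARGS(&heapA));

    HANDLE heapHandle = nullptr;
    deviceA->CreateSharedHandle(heapA.Get(), nullptr, GENERIC_ALL, nullptr, &heapHandle);

    ComPtr<ID3D12Heap> heapB;
    deviceB->OpenSharedHandle(heapHandle, IID_PPV_ARGS(&heapB));
    CloseHandle(heapHandle);

    D3D12_RESOURCE_DESC bufferDesc = {};
    bufferDesc.Dimension = D3D12_RESOURCE_DIMENSION_BUFFER;
    bufferDesc.Width = byteCount;
    bufferDesc.Height = 1;
    bufferDesc.DepthOrArraySize = 1;
    bufferDesc.MipLevels = 1;
    bufferDesc.SampleDesc.Count = 1;
    bufferDesc.Layout = D3D12_TEXTURE_LAYOUT_ROW_MAJOR;
    bufferDesc.Flags = D3D12_RESOURCE_FLAG_ALLOW_CROSS_ADAPTER;

    deviceA->CreatePlacedResource(heapA.Get(), 0, &bufferDesc,
        D3D12_RESOURCE_STATE_COPY_DEST, nullptr, IID_PPV_ARGS(&bufferA));
    deviceB->CreatePlacedResource(heapB.Get(), 0, &bufferDesc,
        D3D12_RESOURCE_STATE_COPY_SOURCE, nullptr, IID_PPV_ARGS(&bufferB));
}
```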
Possible direction for RHI API
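One possible direction, sketched with assumed names:

```cpp
#include <cstddef>
#include <cstdint>

class Buffer;       // multi-device RHI buffer (stub)
class DeviceBuffer; // single-device buffer (stub)
enum class ResultCode { Success, Fail };

// Hypothetical pool that backs a buffer with a shared heap on the owning
// device and placed resources on every device in the mask.
class CrossDeviceBufferPool
{
public:
    ResultCode Init(int ownerDeviceIndex, uint32_t deviceMask, size_t heapByteCount);

    // Returns the placed resource for 'buffer' on a specific device.
    DeviceBuffer* GetDeviceBuffer(Buffer& buffer, int deviceIndex);
};
```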
1 resource→ N device approach
With this approach the RPI will not have to worry about creating separate shared pools. We could introduce a new buffer bind flag that tags a buffer as needing to alias across multiple devices. When that flag is set, the RHI backend itself could directly create the cross-device resources/resource views, and the API mentioned above may not need to be exposed outside of the RHI. The fence API will still exist, and in the CrossDeviceCopyPass the RPI would set up the correct copies (GPU A buffer to staging memory, and staging memory to GPU B) by just querying the RHI for the correct buffer pointer for a specific device.
Iteration 3 - Pass split across multiple devices
This should be the last iterative step and would allow for multi-GPU scaling across specific supported passes. For this approach we can split the screen along the X axis across N GPUs. For example, the forward pass can be split across N GPUs by setting the RT viewport/scissor dimensions to (Resolution_Width / N, Resolution_Height).
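For concreteness, the per-device slice could be computed like this (the Viewport struct here is a simplified stand-in):

```cpp
#include <cstdint>

struct Viewport { float m_minX, m_minY, m_maxX, m_maxY; }; // simplified stand-in

// Slice of the render target assigned to 'deviceIndex' of 'deviceCount'
// when splitting along the X axis.
Viewport GetDeviceViewport(uint32_t width, uint32_t height,
                           uint32_t deviceIndex, uint32_t deviceCount)
{
    const float sliceWidth = static_cast<float>(width) / deviceCount;
    return Viewport{
        sliceWidth * deviceIndex,       // minX
        0.0f,                           // minY
        sliceWidth * (deviceIndex + 1), // maxX
        static_cast<float>(height)      // maxY
    };
}
```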
Use Case 1 - Deferred pipeline: we run the Depth→Shadow→GBuffer passes on N-1 GPUs and have them copy the result for their section of the screen over to the primary GPU, which then applies Lighting→PostProcessing→Present. The vertex work is duplicated across all devices. We have to be careful with what we partition, as any passes that require filtering cannot be partitioned.
Use Case 2 - Ray tracing pipeline: generate rays on one GPU, trace them on N GPUs, copy the data back to the primary GPU to be filtered.
1 resource→ 1 device approach
The RPI will need to do a lot of the heavy lifting here. It will need to understand which resources need to live on which device for its appropriate frame graph and manage that accordingly. Based on the pipeline changes, it should be able to set appropriate viewport/scissor dimensions for the passes that need to be partitioned across multiple GPUs.
1 resource→ N device approach
The RPI will have much fewer changes. It will be able to query the RHI for the appropriate driver handle and set up the scopes appropriately. The synchronization code will be the same regardless. Below are a few thoughts on what will need to change for 1 resource→ N device support in general.
Resource creation
Resource Binding
Execution
Open Questions
We will need further research and discussion around how to proceed based on our use cases. I recommend setting up a working group to help facilitate further development in a much more transparent manner.