August 11, 2015, SIGGRAPH, Los Angeles—Tim Foley from Nvidia provided an overview of the latest generation of APIs for graphics work on various platforms. These new APIs are emerging as systems become performance- and power-limited.
The new generation of APIs includes Vulkan, D3D12 (or DX12), and Metal. The rationale for the new APIs is the growing CPU bottleneck in both performance and predictability. Earlier APIs use a single thread to connect the CPU and GPU and depend on the driver to set up all of the handshakes. Moving to a multi-threaded flow helps, but the overhead remains unpredictable because the application does not control the transfers to the GPU and back.
The overall flow is for the CPU to submit data to a command buffer, which is handed to the driver. The driver then sets up pipeline state objects, establishes resources and bindings, and releases commands to the execution queue, which forwards them to the GPU. The biggest problem is that the CPU cannot keep the various queues and buffers full enough for the GPU to stay busy.
These non-deterministic interfaces are being replaced by console-like APIs with explicit buffers, submits, and multiple queues. The app creates the threads and is responsible for synchronization. Objects are owned by the threads, and no state is retained across buffers. Metal has one-shot buffers, while Vulkan and D3D12 allow buffer reuse.
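To make the explicit-buffer model concrete, here is a minimal sketch using the Vulkan C API (covered in more detail later in this article) of one worker thread recording into its own command buffer. The function name, queue-family argument, and reuse policy are illustrative assumptions, not details from the talk.

```c
#include <vulkan/vulkan.h>

/* Sketch: each worker thread owns a command pool and records into its own
   command buffer; the app, not the driver, decides when buffers are reused.
   'device' and 'queueFamilyIndex' are assumed to exist already, and the pool
   and buffer lifetimes remain the app's responsibility. */
VkCommandBuffer record_thread_work(VkDevice device, uint32_t queueFamilyIndex)
{
    VkCommandPoolCreateInfo poolInfo = {
        .sType            = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO,
        .flags            = VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT,
        .queueFamilyIndex = queueFamilyIndex,
    };
    VkCommandPool pool;
    vkCreateCommandPool(device, &poolInfo, NULL, &pool);

    VkCommandBufferAllocateInfo allocInfo = {
        .sType              = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO,
        .commandPool        = pool,
        .level              = VK_COMMAND_BUFFER_LEVEL_PRIMARY,
        .commandBufferCount = 1,
    };
    VkCommandBuffer cmd;
    vkAllocateCommandBuffers(device, &allocInfo, &cmd);

    VkCommandBufferBeginInfo beginInfo = {
        .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO,
        /* a Metal-style one-shot buffer would use ONE_TIME_SUBMIT here;
           leaving flags at zero lets the buffer be resubmitted */
    };
    vkBeginCommandBuffer(cmd, &beginInfo);
    /* ... record draws, dispatches, barriers ... */
    vkEndCommandBuffer(cmd);
    return cmd;   /* the app submits this and handles all synchronization */
}
```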
The app now generates and controls pipeline state objects (PSOs), which change state at a coarser granularity. Each object encapsulates most of the state vector, which allows early compilation and validation. The new interfaces require the app to identify all shaders and the state for those shaders, separating the code and data streams. The app also has to set the remaining non-PSO state.
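The following is a compressed sketch of what such a PSO looks like in Vulkan terms. It assumes the shader modules, pipeline layout, and render pass were created elsewhere, and most fixed-function values are placeholders chosen for illustration.

```c
#include <vulkan/vulkan.h>

/* Sketch: a PSO bundles the shaders plus most fixed-function state so the
   driver can compile and validate everything up front, before any draw. */
VkPipeline build_pso(VkDevice device, VkShaderModule vs, VkShaderModule fs,
                     VkPipelineLayout layout, VkRenderPass renderPass)
{
    VkPipelineShaderStageCreateInfo stages[2] = {
        { .sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO,
          .stage = VK_SHADER_STAGE_VERTEX_BIT,   .module = vs, .pName = "main" },
        { .sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO,
          .stage = VK_SHADER_STAGE_FRAGMENT_BIT, .module = fs, .pName = "main" },
    };
    VkPipelineVertexInputStateCreateInfo vertexInput = {
        .sType = VK_STRUCTURE_TYPE_PIPELINE_VERTEX_INPUT_STATE_CREATE_INFO };
    VkPipelineInputAssemblyStateCreateInfo inputAssembly = {
        .sType = VK_STRUCTURE_TYPE_PIPELINE_INPUT_ASSEMBLY_STATE_CREATE_INFO,
        .topology = VK_PRIMITIVE_TOPOLOGY_TRIANGLE_LIST };
    VkPipelineViewportStateCreateInfo viewport = {
        .sType = VK_STRUCTURE_TYPE_PIPELINE_VIEWPORT_STATE_CREATE_INFO,
        .viewportCount = 1, .scissorCount = 1 };  /* values set dynamically at draw time */
    VkPipelineRasterizationStateCreateInfo raster = {
        .sType = VK_STRUCTURE_TYPE_PIPELINE_RASTERIZATION_STATE_CREATE_INFO,
        .polygonMode = VK_POLYGON_MODE_FILL, .cullMode = VK_CULL_MODE_BACK_BIT,
        .frontFace = VK_FRONT_FACE_COUNTER_CLOCKWISE, .lineWidth = 1.0f };
    VkPipelineMultisampleStateCreateInfo multisample = {
        .sType = VK_STRUCTURE_TYPE_PIPELINE_MULTISAMPLE_STATE_CREATE_INFO,
        .rasterizationSamples = VK_SAMPLE_COUNT_1_BIT };
    VkPipelineColorBlendAttachmentState blendAttachment = {
        .colorWriteMask = VK_COLOR_COMPONENT_R_BIT | VK_COLOR_COMPONENT_G_BIT |
                          VK_COLOR_COMPONENT_B_BIT | VK_COLOR_COMPONENT_A_BIT };
    VkPipelineColorBlendStateCreateInfo blend = {
        .sType = VK_STRUCTURE_TYPE_PIPELINE_COLOR_BLEND_STATE_CREATE_INFO,
        .attachmentCount = 1, .pAttachments = &blendAttachment };
    VkDynamicState dynamics[] = { VK_DYNAMIC_STATE_VIEWPORT, VK_DYNAMIC_STATE_SCISSOR };
    VkPipelineDynamicStateCreateInfo dynamic = {
        .sType = VK_STRUCTURE_TYPE_PIPELINE_DYNAMIC_STATE_CREATE_INFO,
        .dynamicStateCount = 2, .pDynamicStates = dynamics };

    VkGraphicsPipelineCreateInfo info = {
        .sType = VK_STRUCTURE_TYPE_GRAPHICS_PIPELINE_CREATE_INFO,
        .stageCount = 2, .pStages = stages,
        .pVertexInputState   = &vertexInput,
        .pInputAssemblyState = &inputAssembly,
        .pViewportState      = &viewport,
        .pRasterizationState = &raster,
        .pMultisampleState   = &multisample,
        .pColorBlendState    = &blend,
        .pDynamicState       = &dynamic,
        .layout = layout, .renderPass = renderPass, .subpass = 0,
    };
    VkPipeline pipeline;
    vkCreateGraphicsPipelines(device, VK_NULL_HANDLE, 1, &info, NULL, &pipeline);
    return pipeline;   /* compiled and validated before any draw is issued */
}
```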
The result is that the app now has a sequence of draw calls which can share render targets. For D3D12 and Vulkan, the memory and registers for textures are allocated by the app: an allocation covers a range of virtual address space, a resource describes the memory and its layout, and a view binds a resource and a format for each use.
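Here is a sketch of that separation in Vulkan, assuming the memory type index has already been chosen; the 1024x1024 RGBA8 texture is an illustrative choice.

```c
#include <vulkan/vulkan.h>

/* Sketch: the texture itself (layout/usage), the raw memory backing it, and
   the view used by shaders are three separate app-managed objects.
   'memoryTypeIndex' would come from vkGetPhysicalDeviceMemoryProperties. */
void create_texture(VkDevice device, uint32_t memoryTypeIndex,
                    VkImage *imageOut, VkImageView *viewOut, VkDeviceMemory *memOut)
{
    VkImageCreateInfo imageInfo = {
        .sType         = VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO,
        .imageType     = VK_IMAGE_TYPE_2D,
        .format        = VK_FORMAT_R8G8B8A8_UNORM,
        .extent        = { 1024, 1024, 1 },
        .mipLevels     = 1,
        .arrayLayers   = 1,
        .samples       = VK_SAMPLE_COUNT_1_BIT,
        .tiling        = VK_IMAGE_TILING_OPTIMAL,
        .usage         = VK_IMAGE_USAGE_SAMPLED_BIT | VK_IMAGE_USAGE_TRANSFER_DST_BIT,
        .initialLayout = VK_IMAGE_LAYOUT_UNDEFINED,
    };
    vkCreateImage(device, &imageInfo, NULL, imageOut);      /* resource: layout/usage */

    VkMemoryRequirements req;                    /* size and alignment the resource needs */
    vkGetImageMemoryRequirements(device, *imageOut, &req);

    VkMemoryAllocateInfo allocInfo = {
        .sType           = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
        .allocationSize  = req.size,
        .memoryTypeIndex = memoryTypeIndex,
    };
    vkAllocateMemory(device, &allocInfo, NULL, memOut);     /* app-owned allocation */
    vkBindImageMemory(device, *imageOut, *memOut, 0);       /* bind resource to memory */

    VkImageViewCreateInfo viewInfo = {
        .sType    = VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO,
        .image    = *imageOut,
        .viewType = VK_IMAGE_VIEW_TYPE_2D,
        .format   = VK_FORMAT_R8G8B8A8_UNORM,               /* format fixed per use */
        .subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 },
    };
    vkCreateImageView(device, &viewInfo, NULL, viewOut);    /* view: how shaders see it */
}
```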
The binding process takes the PSO plus binding tables for samplers, textures, and buffers. A descriptor is a GPU-specific encoding of a resource binding, and descriptors are allocated out of a descriptor table. The shaders' requirements define the descriptors through these tables, with root layouts (D3D12) and pipeline layouts (Vulkan) tying the tables to the GPU PSO.
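The Vulkan flavor of this might look like the following sketch, assuming a single uniform-buffer binding; the function and variable names are hypothetical.

```c
#include <vulkan/vulkan.h>

/* Sketch: a descriptor set layout describes the table shape the shaders
   expect, the pipeline layout ties it to the PSO, and sets allocated from a
   pool hold the GPU-specific descriptor encodings. */
VkDescriptorSet build_bindings(VkDevice device, VkBuffer uniformBuffer,
                               VkPipelineLayout *pipelineLayoutOut)
{
    VkDescriptorSetLayoutBinding binding = {
        .binding         = 0,
        .descriptorType  = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER,
        .descriptorCount = 1,
        .stageFlags      = VK_SHADER_STAGE_VERTEX_BIT,
    };
    VkDescriptorSetLayoutCreateInfo layoutInfo = {
        .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO,
        .bindingCount = 1, .pBindings = &binding };
    VkDescriptorSetLayout setLayout;
    vkCreateDescriptorSetLayout(device, &layoutInfo, NULL, &setLayout);

    /* Pipeline layout: the root-layout equivalent the PSO is compiled against. */
    VkPipelineLayoutCreateInfo plInfo = {
        .sType = VK_STRUCTURE_TYPE_PIPELINE_LAYOUT_CREATE_INFO,
        .setLayoutCount = 1, .pSetLayouts = &setLayout };
    vkCreatePipelineLayout(device, &plInfo, NULL, pipelineLayoutOut);

    VkDescriptorPoolSize poolSize = { VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER, 1 };
    VkDescriptorPoolCreateInfo poolInfo = {
        .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO,
        .maxSets = 1, .poolSizeCount = 1, .pPoolSizes = &poolSize };
    VkDescriptorPool pool;
    vkCreateDescriptorPool(device, &poolInfo, NULL, &pool);

    VkDescriptorSetAllocateInfo allocInfo = {
        .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO,
        .descriptorPool = pool, .descriptorSetCount = 1, .pSetLayouts = &setLayout };
    VkDescriptorSet set;
    vkAllocateDescriptorSets(device, &allocInfo, &set);

    /* Write the actual descriptor: the buffer reference in the GPU's encoding. */
    VkDescriptorBufferInfo bufferInfo = { uniformBuffer, 0, VK_WHOLE_SIZE };
    VkWriteDescriptorSet write = {
        .sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET,
        .dstSet = set, .dstBinding = 0, .descriptorCount = 1,
        .descriptorType = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER,
        .pBufferInfo = &bufferInfo };
    vkUpdateDescriptorSets(device, 1, &write, 0, NULL);
    return set;   /* bound at draw time with vkCmdBindDescriptorSets */
}
```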
As part of this greater control of the pipeline, the developer must now be aware of data hazards and image object lifetimes. In previous APIs the driver mapped and scheduled all of these functions, but the new APIs require the developer to manage synchronization, data flow, residency, and resource transitions. While this helps performance and predictability, the explicit management of objects means giving up the automation behind reset, map-discard, and similar conveniences.
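A minimal sketch of one such explicit transition in Vulkan follows, moving a texture from its upload state to a shader-readable state; the exact stages and access masks are assumptions for this example.

```c
#include <vulkan/vulkan.h>

/* Sketch: the app, not the driver, declares the hazard by recording a
   barrier that transitions an image from transfer-destination layout to a
   shader-read layout before it is sampled. */
void transition_for_sampling(VkCommandBuffer cmd, VkImage image)
{
    VkImageMemoryBarrier barrier = {
        .sType               = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER,
        .srcAccessMask       = VK_ACCESS_TRANSFER_WRITE_BIT,        /* prior writes */
        .dstAccessMask       = VK_ACCESS_SHADER_READ_BIT,           /* later reads  */
        .oldLayout           = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL,
        .newLayout           = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL,
        .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
        .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
        .image               = image,
        .subresourceRange    = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 },
    };
    vkCmdPipelineBarrier(cmd,
        VK_PIPELINE_STAGE_TRANSFER_BIT,                /* wait for the copy      */
        VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,         /* before sampling begins */
        0, 0, NULL, 0, NULL, 1, &barrier);
}
```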
The new APIs will take a lot of work for additional functions like object lifetimes and all of the transitions. The use of explicit signals, resources, and transitions makes the APIs more console-like. Every resource is in exactly one state at a time, and switches between states are explicit. The reduction in automation means that if something goes wrong, the application has created the error. The upside is that the new APIs also come with more complete tool sets that can detect these errors. The overall tradeoff is greater control and predictability in exchange for more responsibility in managing the details.
As an example of the new APIs, Graham Sellers from AMD described the Vulkan architecture. The intent is to reduce overhead for high performance and to scale to many threads across a range of platforms. The app creates instances and all of the state needed to run. The API allows multiple instances, and the instances own their connection to the drivers, which can be aggregated.
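A minimal sketch of creating an instance, the app's explicit connection to the Vulkan loader and drivers; the application name and version are placeholders.

```c
#include <vulkan/vulkan.h>

/* Sketch: the instance is created explicitly by the app; layers such as the
   validation tools and any instance extensions would be listed here too. */
VkInstance create_instance(void)
{
    VkApplicationInfo appInfo = {
        .sType              = VK_STRUCTURE_TYPE_APPLICATION_INFO,
        .pApplicationName   = "siggraph-demo",       /* placeholder name */
        .applicationVersion = 1,
        .apiVersion         = VK_API_VERSION_1_0,
    };
    VkInstanceCreateInfo createInfo = {
        .sType            = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO,
        .pApplicationInfo = &appInfo,
        /* enabledLayerCount / ppEnabledLayerNames would enable validation */
    };
    VkInstance instance = VK_NULL_HANDLE;
    vkCreateInstance(&createInfo, NULL, &instance);
    return instance;
}
```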
The app creates instances that define all of the state, enumerates the physical devices, and retrieves information on GPU capabilities, including performance, memory, and so on. The app talks to a logical device after querying its feature set, queues, and extensions. Queues are asynchronous, and the capability flags identify graphics, compute, DMA (transfer), and other functions. Queue handles are retrieved from the device, and a family can expose more than one queue.
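A sketch of that enumeration and device-creation sequence, with error handling and any real device-selection policy omitted:

```c
#include <vulkan/vulkan.h>
#include <stdlib.h>

/* Sketch: enumerate the physical GPUs, find a queue family with graphics
   support, create a logical device with one queue from that family, and
   fetch the queue handle. */
VkDevice create_device(VkInstance instance, VkQueue *queueOut)
{
    uint32_t gpuCount = 0;
    vkEnumeratePhysicalDevices(instance, &gpuCount, NULL);
    VkPhysicalDevice *gpus = malloc(gpuCount * sizeof(VkPhysicalDevice));
    vkEnumeratePhysicalDevices(instance, &gpuCount, gpus);
    VkPhysicalDevice gpu = gpus[0];                    /* naive: take the first GPU */

    uint32_t familyCount = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(gpu, &familyCount, NULL);
    VkQueueFamilyProperties *families = malloc(familyCount * sizeof(*families));
    vkGetPhysicalDeviceQueueFamilyProperties(gpu, &familyCount, families);

    uint32_t graphicsFamily = 0;                       /* flags mark graphics/compute/transfer */
    for (uint32_t i = 0; i < familyCount; ++i)
        if (families[i].queueFlags & VK_QUEUE_GRAPHICS_BIT) { graphicsFamily = i; break; }

    float priority = 1.0f;
    VkDeviceQueueCreateInfo queueInfo = {
        .sType            = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO,
        .queueFamilyIndex = graphicsFamily,
        .queueCount       = 1,                         /* families may offer several queues */
        .pQueuePriorities = &priority,
    };
    VkDeviceCreateInfo deviceInfo = {
        .sType                = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO,
        .queueCreateInfoCount = 1,
        .pQueueCreateInfos    = &queueInfo,
    };
    VkDevice device = VK_NULL_HANDLE;
    vkCreateDevice(gpu, &deviceInfo, NULL, &device);
    vkGetDeviceQueue(device, graphicsFamily, 0, queueOut);
    free(gpus); free(families);
    return device;
}
```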
Command buffers hold information on various creation parameters: which queue family they target, hints for driver optimization, and so on. This feature is a major part of the work for the drivers. Pipelines are compiled up front, include the shaders, blend state, and so on, and are then bound in the command buffers. Compiled pipelines can be serialized into multiple caches. Vulkan uses the SPIR-V language for the shaders.
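Here is a sketch of loading a SPIR-V shader and seeding a pipeline cache; how the SPIR-V blob and the previously saved cache data get loaded from disk is assumed.

```c
#include <vulkan/vulkan.h>

/* Sketch: shaders arrive as SPIR-V words, and compiled pipelines can be
   captured in a cache that the app serializes and reloads between runs. */
void shaders_and_caches(VkDevice device,
                        const uint32_t *spirvWords, size_t spirvBytes,
                        const void *savedCache, size_t savedCacheSize)
{
    VkShaderModuleCreateInfo shaderInfo = {
        .sType    = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO,
        .codeSize = spirvBytes,                 /* size in bytes of the SPIR-V code */
        .pCode    = spirvWords,
    };
    VkShaderModule module;
    vkCreateShaderModule(device, &shaderInfo, NULL, &module);

    VkPipelineCacheCreateInfo cacheInfo = {
        .sType           = VK_STRUCTURE_TYPE_PIPELINE_CACHE_CREATE_INFO,
        .initialDataSize = savedCacheSize,      /* seed with data from a previous run */
        .pInitialData    = savedCache,
    };
    VkPipelineCache cache;
    vkCreatePipelineCache(device, &cacheInfo, NULL, &cache);

    /* ... pass 'cache' to vkCreateGraphicsPipelines so compiles are reused;
       several thread-local caches can be combined with vkMergePipelineCaches ... */

    size_t size = 0;                            /* serialize the cache back out */
    vkGetPipelineCacheData(device, cache, &size, NULL);
    /* allocate 'size' bytes, call again with the pointer, then write it to disk */
}
```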
The API allows for mutable state, but most state is immutable. A few states can be made dynamic, allowing small chunks of state to change cheaply. State binding is established in the command buffer and is inherited from draw to draw. Creating a derivative pipeline uses an existing one as a template-like object, with modifications submitted for the new PSO.
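A sketch of using dynamic state at draw time; the viewport size is an arbitrary example, and the PSO is assumed to have been built with viewport and scissor marked dynamic.

```c
#include <vulkan/vulkan.h>

/* Sketch: states marked dynamic at PSO creation (here viewport and scissor)
   are the small chunks set directly in the command buffer; everything else
   is baked into the pipeline and changed by binding a different PSO. */
void draw_with_dynamic_state(VkCommandBuffer cmd, VkPipeline pso)
{
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, pso);

    VkViewport viewport = { 0.0f, 0.0f, 1280.0f, 720.0f, 0.0f, 1.0f };
    VkRect2D   scissor  = { { 0, 0 }, { 1280, 720 } };
    vkCmdSetViewport(cmd, 0, 1, &viewport);   /* inherited by the following draws */
    vkCmdSetScissor(cmd, 0, 1, &scissor);

    vkCmdDraw(cmd, 3, 1, 0, 0);
}

/* A derivative pipeline uses an existing PSO as its template: the parent is
   created with VK_PIPELINE_CREATE_ALLOW_DERIVATIVES_BIT, and the child with
   VK_PIPELINE_CREATE_DERIVATIVE_BIT plus basePipelineHandle = parent,
   changing only the state that differs. */
```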
Resources are the data to be accessed, with memory allocated by the app. The app determines what memory a resource needs, allocates it, and then binds that memory to the resource. The app also does memory management such as pool management and data sharing through flags. Control over caching, coherence, and so on enables zero-copy and unified-memory arrangements.
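The following sketch shows the allocate-bind-map flow for a CPU-visible buffer; a real application would sub-allocate from larger pools rather than making one allocation per buffer.

```c
#include <vulkan/vulkan.h>
#include <string.h>

/* Sketch: the app picks a memory type by its property flags, allocates it,
   binds it to the buffer, and maps it for a zero-copy update from the CPU. */
VkBuffer create_upload_buffer(VkPhysicalDevice gpu, VkDevice device,
                              const void *data, VkDeviceSize size)
{
    VkBufferCreateInfo bufferInfo = {
        .sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO,
        .size  = size,
        .usage = VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT,
    };
    VkBuffer buffer;
    vkCreateBuffer(device, &bufferInfo, NULL, &buffer);

    VkMemoryRequirements req;
    vkGetBufferMemoryRequirements(device, buffer, &req);

    /* Choose a CPU-visible, coherent memory type that the buffer allows. */
    VkPhysicalDeviceMemoryProperties props;
    vkGetPhysicalDeviceMemoryProperties(gpu, &props);
    VkMemoryPropertyFlags wanted =
        VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT;
    uint32_t typeIndex = 0;
    for (uint32_t i = 0; i < props.memoryTypeCount; ++i)
        if ((req.memoryTypeBits & (1u << i)) &&
            (props.memoryTypes[i].propertyFlags & wanted) == wanted)
            { typeIndex = i; break; }

    VkMemoryAllocateInfo allocInfo = {
        .sType           = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
        .allocationSize  = req.size,
        .memoryTypeIndex = typeIndex,
    };
    VkDeviceMemory memory;
    vkAllocateMemory(device, &allocInfo, NULL, &memory);
    vkBindBufferMemory(device, buffer, memory, 0);

    void *mapped = NULL;                        /* zero-copy: write straight into it */
    vkMapMemory(device, memory, 0, size, 0, &mapped);
    memcpy(mapped, data, (size_t)size);
    vkUnmapMemory(device, memory);
    return buffer;
}
```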
The descriptors for GPU resources are allocated as sets from pools, with layouts that are known at pipeline creation. Render pass objects define the number of passes, the render regions, and the framebuffers. Subpasses allow passes to be merged, and transient attachments shared between them allow memory to be reused.
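A sketch of a minimal render pass with a single color attachment and one subpass; additional subpasses sharing transient attachments would be declared in the same structures, giving the driver the information it needs to reuse their memory.

```c
#include <vulkan/vulkan.h>

/* Sketch: one color attachment, one graphics subpass. */
VkRenderPass create_render_pass(VkDevice device, VkFormat colorFormat)
{
    VkAttachmentDescription color = {
        .format         = colorFormat,
        .samples        = VK_SAMPLE_COUNT_1_BIT,
        .loadOp         = VK_ATTACHMENT_LOAD_OP_CLEAR,
        .storeOp        = VK_ATTACHMENT_STORE_OP_STORE,
        .stencilLoadOp  = VK_ATTACHMENT_LOAD_OP_DONT_CARE,
        .stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE,
        .initialLayout  = VK_IMAGE_LAYOUT_UNDEFINED,
        .finalLayout    = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL,
    };
    VkAttachmentReference colorRef = {
        .attachment = 0, .layout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL };
    VkSubpassDescription subpass = {
        .pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS,
        .colorAttachmentCount = 1,
        .pColorAttachments    = &colorRef,
    };
    VkRenderPassCreateInfo info = {
        .sType           = VK_STRUCTURE_TYPE_RENDER_PASS_CREATE_INFO,
        .attachmentCount = 1, .pAttachments = &color,
        .subpassCount    = 1, .pSubpasses   = &subpass,
    };
    VkRenderPass renderPass;
    vkCreateRenderPass(device, &info, NULL, &renderPass);
    return renderPass;
}
```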
Drawing takes place inside the render pass, and multiple asynchronous compute workloads can be launched through dispatch. Synchronization is through event primitives; event objects can be set, reset, polled, and so on. Resource barriers are now an application responsibility rather than something the driver inserts during execution.
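A sketch of event-based synchronization around a dispatch; the dispatch dimensions and the pipeline stages are illustrative assumptions.

```c
#include <vulkan/vulkan.h>

/* Sketch: a dispatch signals an event once its stage completes, and later
   commands wait on it; the host can also poll, set, or reset the event. */
void compute_with_event(VkDevice device, VkCommandBuffer cmd)
{
    VkEventCreateInfo eventInfo = { .sType = VK_STRUCTURE_TYPE_EVENT_CREATE_INFO };
    VkEvent event;
    vkCreateEvent(device, &eventInfo, NULL, &event);

    vkCmdDispatch(cmd, 64, 64, 1);                       /* launch compute work  */
    vkCmdSetEvent(cmd, event, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);

    /* ... unrelated work can be recorded here ... */

    vkCmdWaitEvents(cmd, 1, &event,                      /* wait for the signal  */
                    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                    VK_PIPELINE_STAGE_VERTEX_SHADER_BIT,
                    0, NULL, 0, NULL, 0, NULL);

    /* Host side, after submission: vkGetEventStatus(device, event) polls it,
       and vkSetEvent / vkResetEvent flip it from the CPU. */
}
```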
Work is submitted to a queue, and an execution fence provides the handshake with the CPU. Resource sharing across queues is handled with semaphore primitives. The API does not lock threads, which allows concurrent reads of the same object and concurrent writes to different objects. Presentation is an optional function, exposed like an extension, since displays are abstracted. Teardown is now explicit, so objects have to be deleted in order. There is no reference counting or implicit object lifetime, so it is important not to delete objects that are still in use.
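Finally, a sketch of a submit that signals a semaphore for other queues and a fence for the CPU, waits on the fence, and then tears the sync objects down explicitly; in a real app the command buffer and its pool would be destroyed in order as well.

```c
#include <vulkan/vulkan.h>
#include <stdint.h>

/* Sketch: fence = CPU handshake, semaphore = cross-queue handshake, and
   nothing is reference counted for us, so teardown is explicit. */
void submit_and_wait(VkDevice device, VkQueue queue, VkCommandBuffer cmd)
{
    VkFenceCreateInfo fenceInfo = { .sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO };
    VkFence fence;
    vkCreateFence(device, &fenceInfo, NULL, &fence);

    VkSemaphoreCreateInfo semInfo = { .sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO };
    VkSemaphore done;                          /* another queue could wait on this */
    vkCreateSemaphore(device, &semInfo, NULL, &done);

    VkSubmitInfo submit = {
        .sType                = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .commandBufferCount   = 1,
        .pCommandBuffers      = &cmd,
        .signalSemaphoreCount = 1,
        .pSignalSemaphores    = &done,
    };
    vkQueueSubmit(queue, 1, &submit, fence);   /* fence signals when the GPU is done */

    vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);

    /* Explicit teardown: only delete objects once the GPU has finished with them. */
    vkDestroySemaphore(device, done, NULL);
    vkDestroyFence(device, fence, NULL);
}
```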