Fundamental Vulkan Samples rework

Fundamental Vulkan Samples rework (Sync, command buffers and more)

Some history

It’s been almost ten years since I started working on my open source C++ Vulkan samples. Never would I have imagined what the coming years would look like. Back then I only knew a bit of OpenGL and very little DirectX. And I wasn’t that proficient in C++ either; my main programming language at the time was Delphi (which has become a small niche today). And all of that on a non-professional level.

Among the things that I would’ve never imagined back then:

  • Having one of the most popular realtime graphics related repositories on GitHub (~11k stars and counting)
  • Working on that almost every day for almost a decade
  • IHVs and ISVs using my sample code internally for testing, learning and onboarding (even some I would’ve never expected)
  • Becoming part of the Khronos group as an individual contributor
  • Getting donations from one of the most influential 3D graphics programmers of all time
  • Helping to shape open standards
  • Working together with some of the industry’s best

And last but not least the impact I seemingly had on a lot of people:

  • Thousands of people have forked my samples
  • Almost 150 people have contributed to them
  • Countless mentions of my samples on all kinds of social media
  • Getting feedback from people actually landing jobs after learning from my samples
  • Getting recognized by my name alone (which still feels very odd at times)

That’s why I enjoy Open Source so much. Though I gotta admit, maintaining such a widely used and popular repository can become a burden sometimes. But I still enjoy all of this and have no plans to change any of this.

Biggest issues

But there are a few things I do regret on a technical level. While I had done 3D graphics at a lower level back in the DOS days, I didn’t know much about low-level graphics APIs. And switching over from OpenGL wasn’t easy, esp. with the first Vulkan implementations being prone to crashes if you made even the slightest mistake. And back then validation layers were in their infancy. Getting BSODs was very common, and you could literally hear them coming: first your audio output would start to stutter, then your whole system crashed. A lot has improved since then; the ecosystem is on a whole different level today.

One area that I was totally uncomfortable with (and still am sometimes, even today) was synchronization. There was nothing comparable in OpenGL or OpenGL ES. Getting the CPU and GPU to do work in parallel felt like black magic. And “unlearning” OpenGL concepts wasn’t easy either.

As a result, the samples, while doing fine at showing specific features, extensions, etc., had some serious flaws that I’ve been wanting to fix for years.

Draining the queue

The biggest flaw was doing a vkQueueWaitIdle at the end of each frame:

// In the sample
void draw()
{
	VulkanExampleBase::prepareFrame();
	submitInfo.commandBufferCount = 1;
	submitInfo.pCommandBuffers = &drawCmdBuffers[currentBuffer];
	// Notice the lack of a fence here (last argument)
	VK_CHECK_RESULT(vkQueueSubmit(queue, 1, &submitInfo, VK_NULL_HANDLE));
	VulkanExampleBase::submitFrame();
}

// In the base class
void VulkanExampleBase::submitFrame()
{
	VkResult result = swapChain.queuePresent(queue, currentBuffer, semaphores.renderComplete);
	if ((result == VK_ERROR_OUT_OF_DATE_KHR) || (result == VK_SUBOPTIMAL_KHR)) {
		windowResize();
		if (result == VK_ERROR_OUT_OF_DATE_KHR) {
			return;
		}
	}
	else {
		VK_CHECK_RESULT(result);
	}
    // This should never be done in a Vulkan application
	VK_CHECK_RESULT(vkQueueWaitIdle(queue));
}

That’s kind of like cheating in regards to synchronization, because it “saves” you from properly syncing things. So instead of making sure that resources shared by the CPU and the GPU are implemented such that CPU writes won’t occur while GPU reads are still running, one can simply wait for a queue to become idle and get away with not having to deal with that. Same for command buffers: no need to use fences to ensure that their execution has finished if you drain the queue they’ve been submitted to. The samples only use a single queue, so this is pretty much equivalent to waiting for the whole device to become idle.

As a result, I got away with never having to e.g. duplicate resources written by the CPU and read by the GPU. So samples only ever had to create a single uniform buffer for stuff like scene matrices:

struct UniformData {
	glm::mat4 projection;
	glm::mat4 modelView;
	glm::vec4 lightPos{ 0.0f, 2.0f, 1.0f, 0.0f };
} uniformData;
vks::Buffer uniformBuffer;

void updateUniformBuffers()
{
	...
	memcpy(uniformBuffer.mapped, &uniformData, sizeof(UniformData));
}

void render()
{
	VulkanExampleBase::prepareFrame();
	updateUniformBuffers();
	VulkanExampleBase::submitFrame();
}

Without some sort of blocking sync like vkQueueWaitIdle, this would result in the uniform buffer getting updated by the CPU mid-frame while the GPU wasn’t yet finished reading from it. That’s undefined behavior, and every programmer knows that undefined behavior can cause all sorts of issues. On some devices, esp. at lower frame rates, this would cause visual artifacts.

But the issue with such a blocking call, aside from possible performance losses, is that commands like vkQueueWaitIdle kill any chance of CPU/GPU parallelism, something that Vulkan did actually allow in a far better and more controllable way than OpenGL. It’s pretty much the very definition of a bad (Vulkan) practice.

Sadly, a lot of people learning from my samples adopted this bad practice. Or worse, they did not, and then wondered why they were getting all sorts of synchronization issues. I can’t remember how often I had to explain why my samples were working with single uniform buffers.

Pre-recording Command Buffers

Command buffers (and having to record them) were a concept new to Vulkan that wasn’t present in OpenGL (no, display lists don’t count πŸ˜‰). In Vulkan you don’t directly submit e.g. draw or bind commands like in OpenGL; instead you record a command buffer with those commands and at some later point submit it to a queue for execution.

And so it seemed only natural to pre-record all command buffers for all frame buffer attachments at startup like this:

// This is an override in the base class, so e.g. the old UI overlay called this when the UI had to be rebuilt
void buildCommandBuffers()
{
	VkCommandBufferBeginInfo cmdBufInfo = vks::initializers::commandBufferBeginInfo();

	// Pre-record all command buffers in a loop
	for (size_t i = 0; i < drawCmdBuffers.size(); ++i)
	{
		VK_CHECK_RESULT(vkBeginCommandBuffer(drawCmdBuffers[i], &cmdBufInfo));
		...
		drawUI(drawCmdBuffers[i]);
		VK_CHECK_RESULT(vkEndCommandBuffer(drawCmdBuffers[i]));			
	}

	...
}

void prepare()
{
	VulkanExampleBase::prepare();
	...
	buildCommandBuffers();
	prepared = true;
}

While that did work in the early stages of development, it made things unnecessarily complicated when I started adding more complex samples and a user interface. Now I had to actually write logic to detect when to tell a sample to re-record its pre-recorded command buffers, make sure the command buffers were not in use (hello again, vkQueueWaitIdle) and then use a virtual function to trigger the command buffer build function of the actual sample. That complexity hurt sample readability and made the control flow hard to follow.

And in reality almost nobody would pre-record command buffers. Vulkan was created with better ways for CPU and GPU parallelism in mind after all, so offloading command buffer recording to a separate thread was the more common pattern. And for something like a sample, where command buffers are usually pretty small, the overhead of re-recording them on the fly is so minor that pre-recording them didn’t make much sense.

Fixing things

I’ve been wanting to fix these issues for a long time. But with almost 100 samples, that’s not an easy undertaking, esp. with me having less time and energy (you do get older) to work on them. So four years ago I took a first stab at fixing all of this and failed. I made pretty good progress early on, but doing a huge change on a repository that’s so widely used and gets regular contributions wasn’t easy to manage. I just wanted too much: not only fixing those issues, but also trying to rework architectural stuff up to a point where some samples would’ve looked completely different, all while maintaining the main branch of the samples. So after working on that for almost a year I gave up. Lesson learned.

But with me now also porting and writing samples for the official Khronos Vulkan samples repository, those issues lingered in the back of my head. Every time I wrote a new sample I thought: “This is not how it should be! You’re misleading all those looking at your code. That’s not good.”

So fast forward to 2025: I made another attempt and tried to keep the scope much smaller:

  • Get rid of the vkQueueWaitIdle
  • Get rid of the command buffer pre-recording
  • Fix at most minor issues like wrong image barriers
  • Do no big architectural changes
  • Only small code improvements, like better naming

And that did indeed work πŸ‘πŸ». Even with my limited spare time it took only ~1.5 months (incl. some planning beforehand) to fix those fundamental issues.

But even with this limited scope, this is one of the biggest PRs I ever did. In total it touches ~15,000 lines of code.

A positive side-effect: This PR also removes almost 2,000 lines of code, making things easier to read, easier to follow and easier to debug.

tl;dr: As of today my samples are much closer to what Vulkan in a real-world application should look like πŸ₯³πŸŽ‰.

Changes

If you want to see all changes in detail, you can find them in this pull request.

Doing proper sync

Main goal no. 1: Replace the queue wait idle with proper frame synchronization, with the possibility of getting the CPU and GPU to work in parallel. Doing that the right way actually adds some complexity to the code. You now have to be more explicit about synchronization and make sure the CPU and GPU don’t get in each other’s way when accessing data that is written by one and read by the other. This e.g. applies to all uniform buffers that can change (from the CPU) between frames, but also to command buffers.

First I redid the sync object setup. Before this there was one semaphore to signal render completion, one to signal present completion and no fence for command buffer execution. That’s not sufficient for getting the CPU and GPU to work in parallel. The new sync object setup contains everything required to get that parallelism working:

// Vulkan sync objects
std::array<VkSemaphore, maxConcurrentFrames> presentCompleteSemaphores{};
std::vector<VkSemaphore> renderCompleteSemaphores{};
std::array<VkFence, maxConcurrentFrames> waitFences{};

// Indices used for sync
uint32_t currentImageIndex{ 0 };
uint32_t currentBuffer{ 0 };

maxConcurrentFrames refers to frames in flight, aka how many frames can be worked on in parallel. Think of it as “max. frames to render ahead”. While reworking the samples I also used a few more modern C++ features, so this is a constexpr, which lets me size arrays at compile time.

Note that it’s not used for sizing renderCompleteSemaphores. That’s because those semaphores are tied to the images owned by the swapchain, and the no. of swapchain images can only be determined at runtime. A more detailed explanation of the swapchain and semaphore relation can be found in this excellent Vulkan Guide chapter.

This also affects how we can index into things. The command buffers are owned by the application, so it explicitly controls the index like this:

void VulkanExampleBase::submitFrame(bool skipQueueSubmit)
{
	...
	// Select the next frame to render to, based on the max. no. of concurrent frames
	currentBuffer = (currentBuffer + 1) % maxConcurrentFrames;
}

Whereas the swapchain image index is controlled by the Vulkan implementation (aka the driver), so we have no explicit control over it but rather have to get this index from a call to vkAcquireNextImageKHR. This can (and will) be out of order, so it’s important to decouple it from the command buffer index.

Next I reworked how sync is done in the base example class, which is used by all examples except the two hello triangle samples (which deliberately use as little base class code as possible) and the two headless samples (which just render a single frame and then exit).

This applies to the actual command submission and image presentation:

void VulkanExampleBase::submitFrame()
{
	const VkPipelineStageFlags waitPipelineStage{ VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT };
	VkSubmitInfo submitInfo{ .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO };
	submitInfo.pWaitDstStageMask = &waitPipelineStage;
	submitInfo.commandBufferCount = 1;
	submitInfo.pCommandBuffers = &drawCmdBuffers[currentBuffer];
	submitInfo.pWaitSemaphores = &presentCompleteSemaphores[currentBuffer];
	submitInfo.waitSemaphoreCount = 1;
	submitInfo.pSignalSemaphores = &renderCompleteSemaphores[currentImageIndex];
	submitInfo.signalSemaphoreCount = 1;
	VK_CHECK_RESULT(vkQueueSubmit(queue, 1, &submitInfo, waitFences[currentBuffer]));

	VkPresentInfoKHR presentInfo{ .sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR };
	presentInfo.waitSemaphoreCount = 1;
	presentInfo.pWaitSemaphores = &renderCompleteSemaphores[currentImageIndex];
	presentInfo.swapchainCount = 1;
	presentInfo.pSwapchains = &swapChain.swapChain;
	presentInfo.pImageIndices = &currentImageIndex;
	VkResult result = vkQueuePresentKHR(queue, &presentInfo);
	...
}

The queue submission now uses the semaphores based on their relevant index (buffer or image) and signals a fence so we actually know when that command buffer has completed execution. That’s a requirement for getting rid of the vkQueueWaitIdle.

And that also paves the way for having the CPU work on a command buffer while the GPU is still executing another one. But an important requirement for that is to separate resources that are written by the CPU and read by the GPU. Something that applies mostly to uniform buffers, and fixing that meant changing pretty much every sample. With the possibility of the CPU and GPU working in parallel, it’s no longer possible to just have a single uniform buffer, as the CPU writes happen on a different timeline (e.g. via a simple memcpy) than the GPU reads. That would cause all sorts of hazards.

There are different ways of dealing with this in Vulkan, but since these are samples that should be easy to follow, I went with the straightforward way of simply duplicating those buffers based on maxConcurrentFrames. So one copy of a uniform buffer per command buffer:

struct UniformData {
	glm::mat4 projection;
	glm::mat4 model;
	glm::vec4 lightPos = glm::vec4(5.0f, 5.0f, -5.0f, 1.0f);
	glm::vec4 viewPos;
} uniformData;
std::array<vks::Buffer, maxConcurrentFrames> uniformBuffers;

Duplicating uniform buffers also means duplicating descriptor sets, which I also adjusted for all samples:

std::array<VkDescriptorSet, maxConcurrentFrames> descriptorSets{};
...
void setupDescriptors()
{
	std::vector<VkDescriptorPoolSize> poolSizes = {
		vks::initializers::descriptorPoolSize(VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER, maxConcurrentFrames),
	};
	// One set for matrices and one per model image/texture
	const uint32_t maxSetCount = maxConcurrentFrames;
	VkDescriptorPoolCreateInfo descriptorPoolInfo = vks::initializers::descriptorPoolCreateInfo(poolSizes, maxSetCount);
	VK_CHECK_RESULT(vkCreateDescriptorPool(device, &descriptorPoolInfo, nullptr, &descriptorPool));

	// Descriptor set for scene matrices per frame, just like the buffers themselves
	for (size_t i = 0; i < uniformBuffers.size(); i++) {
		VkDescriptorSetAllocateInfo allocInfo = vks::initializers::descriptorSetAllocateInfo(descriptorPool, &descriptorSetLayouts.matrices, 1);
		VK_CHECK_RESULT(vkAllocateDescriptorSets(device, &allocInfo, &descriptorSets[i]));
		VkWriteDescriptorSet writeDescriptorSet = vks::initializers::writeDescriptorSet(descriptorSets[i], VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER, 0, &uniformBuffers[i].descriptor);
		vkUpdateDescriptorSets(device, 1, &writeDescriptorSet, 0, nullptr);
	}
	...
}

And that also meant having to update the appropriate uniform buffer using an index instead:

void updateUniformBuffers()
{
	uniformData.projection = camera.matrices.perspective;
	uniformData.model = camera.matrices.view;
	uniformData.viewPos = camera.viewPos;
	memcpy(uniformBuffers[currentBuffer].mapped, &uniformData, sizeof(UniformData));
}

Adjusting those parts was actually the bulk of the changes, as this had to be done on a per-sample basis. One specific area that was a bit tricky to get working with the new sync setup was the UI overlay. Unlike most samples that only load vertex and index data once, the UI overlay is dynamic and requires destroying and (re)creating index and vertex buffers. Since I had to rework that logic anyway, I not only adjusted it for the new sync but also tried to optimize and simplify how those UI related buffers are (re)created and went for a chunk based approach. Those buffers now grow in multiples of a fixed chunk size, which reduces buffer recreation:

void UIOverlay::update(uint32_t currentBuffer)
{	
	ImDrawData* imDrawData = ImGui::GetDrawData();
	...
	VkDeviceSize vertexBufferSize = imDrawData->TotalVtxCount * sizeof(ImDrawVert);
	VkDeviceSize indexBufferSize = imDrawData->TotalIdxCount * sizeof(ImDrawIdx);
	...

	// Create buffers with multiple of a chunk size to minimize the need to recreate them
	const VkDeviceSize chunkSize = 16384;
	vertexBufferSize = ((vertexBufferSize + chunkSize - 1) / chunkSize) * chunkSize;
	indexBufferSize = ((indexBufferSize + chunkSize - 1) / chunkSize) * chunkSize;

	// Recreate vertex buffer only if necessary
	if ((buffers[currentBuffer].vertexBuffer.buffer == VK_NULL_HANDLE) || (buffers[currentBuffer].vertexBuffer.size < vertexBufferSize)) {
		...
		buffers[currentBuffer].vertexBuffer.destroy();
		VK_CHECK_RESULT(device->createBuffer(VK_BUFFER_USAGE_VERTEX_BUFFER_BIT, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT, &buffers[currentBuffer].vertexBuffer, vertexBufferSize));
		...
	}

	// Recreate index buffer only if necessary
	if ((buffers[currentBuffer].indexBuffer.buffer == VK_NULL_HANDLE) || (buffers[currentBuffer].indexBuffer.size < indexBufferSize)) {
		buffers[currentBuffer].indexBuffer.destroy();
		VK_CHECK_RESULT(device->createBuffer(VK_BUFFER_USAGE_INDEX_BUFFER_BIT, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT, &buffers[currentBuffer].indexBuffer, indexBufferSize));
	}
	...
}

Again I tried to be conservative with all these changes, which means that some samples might look like they’re unnecessarily duplicating descriptors. But that’s not wrong per se, and I wanted to avoid the mistake of trying to change too much, which made my first attempt at reworking sync fail.

Record command buffers on the fly

Main goal no. 2: Instead of pre-recording all command buffers at application startup, they’re now re-recorded on the fly right before they’re submitted. With the new sync in place, this fits nicely into the CPU/GPU parallelism topic. This way the CPU can start re-recording command buffer n+1 while the GPU is still processing command buffer n. This is often referred to as “frames in flight”; a good explanation of that topic can be found here. This builds upon the reworked sync setup:

// No longer an override
void buildCommandBuffer()
{
	VkCommandBuffer cmdBuffer = drawCmdBuffers[currentBuffer];
	VkCommandBufferBeginInfo cmdBufInfo = vks::initializers::commandBufferBeginInfo();
	VK_CHECK_RESULT(vkBeginCommandBuffer(cmdBuffer, &cmdBufInfo));		
	...
	// Descriptors accessing buffers shared by CPU/GPU now need to be properly indexed
	vkCmdBindDescriptorSets(cmdBuffer, VK_PIPELINE_BIND_POINT_GRAPHICS, pipelineLayout, 0, 1, &descriptorSets[currentBuffer], 0, nullptr);
	...
	drawUI(cmdBuffer);
	VK_CHECK_RESULT(vkEndCommandBuffer(cmdBuffer));
}

virtual void render()
{
	// Sample abstracts this in VulkanExampleBase::prepareFrame();		
	VK_CHECK_RESULT(vkWaitForFences(device, 1, &waitFences[currentBuffer], VK_TRUE, UINT64_MAX));
	VK_CHECK_RESULT(vkResetFences(device, 1, &waitFences[currentBuffer]));
	...
	swapChain.acquireNextImage(presentCompleteSemaphores[currentBuffer], currentImageIndex);		
	...
	buildCommandBuffer();
	// Sample abstracts this VulkanExampleBase::submitFrame();
	...
	submitInfo.pCommandBuffers = &drawCmdBuffers[currentBuffer];
	VK_CHECK_RESULT(vkQueueSubmit(queue, 1, &submitInfo, waitFences[currentBuffer]));
	...
	VK_CHECK_RESULT(vkQueuePresentKHR(queue, &presentInfo));
}

Fixing up validation

The Vulkan validation layers are one of the ecosystem components that have evolved the most since Vulkan was released. They’re an essential tool for Vulkan developers, and new validation checks are added with each version. Since I had to touch every sample anyway, I also did a full validation run and fixed several outstanding validation issues. Most of them were related to missing or wrong image layout transitions, others to features that weren’t properly enabled. I would never claim my samples to be perfect showcases for Vulkan, but I tried my best, and all samples now validate clean when using standard validation. There are still a few race conditions in regards to images in a few samples that I wasn’t able to fix yet though.

Cleaning up code

As a side goal I also wanted to do at least some conservative cleanup. I used a little bit of modern C++ (designated initializers, constexpr). I try to stay away from many of the more modern C++ concepts and language constructs though, as they can have a negative impact on readability and make it hard for non-C++ people to understand the code.

And while I had to add more code (e.g. the descriptor setup is now more involved), esp. the removal of command buffer pre-recording got rid of several complicated and unnecessary couplings. That was an opportunity to clean up stuff. The samples’ base class and the UI overlay class benefited the most. Maybe not in line count, but in removed coupling, making them easier to understand and adapt for your own needs. I also fixed some odd quirks, some of which came to be due to my (back then) lacking C++ skills.

Closing words

While it wasn’t easy to get this done, it was totally worth the effort. Having code out there that’s used by so many people on so many levels while teaching some bad practices has been bothering me for years. This no longer being the case feels like taking off a heavy burden.

One thing that makes big changes challenging is the fact that my samples nowadays support 10 different platforms, many of which have been added by contributors, so testing on all of them isn’t possible for me. I did test on Windows, Linux and Android, so hopefully nothing is fundamentally broken. But if you do encounter any problems, feel free to open an issue at the samples repository.