I've recently acquired an old dual socket server machine. Her name is Susan. We met at a place near my old apartment that sells off-lease refurbished server hardware for very reasonable rates and she followed me home. This is a system set up with 12 drive bays, 4x 10-gigabit ethernet jacks, a dedicated IPMI jack, 2x 18c/36t Xeon parts, 100 MB of cache, 128 GB of RAM, redundant 1000 W PSUs, a grand total of two (2) USB ports, and it consumes about 300 W under load. The fans are very loud, and using it for development work feels a bit like running a woodchipper. You absolutely need hearing protection for any kind of chronic exposure. It is a machine designed for headless work: no GPU, only a VGA port on the motherboard. There is a SAS controller for the drive backplane in one of the PCIe slots, and two more open for "normal size" two-slot GPUs.
The adaptations that have been made for serviceability and reliability under load are very interesting - one of the most unique is that the redundant power supplies enable hot-plugging between circuits. I can unplug one PSU - it alarms to signal the power failure, but there's no functional impact thanks to the redundant unit - plug it into another circuit, unplug the second PSU, replug it somewhere else, all while never losing power to the machine. I don't have any practical purpose for this right now, but I have demonstrated it and I think it's an extremely interesting capability: you can think about things like battery backups, transporting it while running, even taking it in a vehicle.
A piece of hardware like this, new, is on the order of 10k USD. The numbers involved in these systems are fairly eye-watering from a consumer PC standpoint, on most fronts. This machine was under 600 USD, including the chassis, RAM, and two big server CPUs - it's really unlike anything I've ever had access to before. You'll spend that much on a DDR5 RAM kit, easy. With this many drive bays, you could do very well setting it up as network attached storage, but the CPUs are totally overkill for an application like that. If you're not hard up for floating point compute (this is "only" 4 TFLOPS), this is a cheaper toy than most meaningful consumer GPUs, now. For me in particular, it's a very economical way to force myself to adapt to a new system architecture and do some remedial work on CPU multithreading and sync primitives - something I hadn't gotten a chance to study before. I learn best with applications like this, and I found it a great exercise in managing threads and in using atomics and mutexes for correct operation between threads, where you would otherwise create unstable race conditions. I've had a lot of fun "playing supercomputer" and learning how to make use of its capabilities the past few weeks.
Ubuntu is not the default choice that it once was. My favorite Linux distro during college hasn't been maintained since 2014, but it spawned a community project to recreate it. It is a bare metal install (to a Windows sensibility) built on top of Debian Linux, with a beautifully simple desktop config using OpenBox, Tint2, and Conky. This is more or less all that runs during a user session, plus the compositor and a few other little odds and ends - this should be an expected default, but that's becoming less and less the case with whatever Microsoft seems to think they're allowed to do on your hardware. Windows 11 is getting scary bad, in increasingly user-facing ways. I write system software in C++, and take as given that doing so in JavaScript is unacceptable - they have no shame, they literally put ads in the OS. It feels like a sign of pending collapse. When I do use it, I'm encountering bugs in Windows Explorer a couple times a week. Right-click menu options populate for several frames after the menu appears. This shouldn't happen; it shouldn't be possible for it to happen. It shatters the UI paradigm - you can easily click "The Wrong Thing" as it populates under your mouse. This async bullshit is happening in mobile operating systems, too: options populating with your finger a millimeter off the glass, mid-tap. This introduces a monster class of user-facing vulnerabilities in a simple UI element, which the OS will regard as correct behavior. This is a total failure of the software development process - it is profoundly unethical to ship software like this, an increasingly common condition for which there is no accountability. A testing framework does not replace user testing. This is not a game. Rapidly approaching zero trust. This doesn't work. But I digress.
Because I have no GPU in the system, I avoided setting up a graphics API for the small codebase I spun up for Crystal. I decided to look at options for textmode UIs, and found a nice one called FTXUI that provides a familiar UI paradigm right in a terminal. One of the sample applications sets up a familiar window analogy that is draggable just like OS windows, inside the terminal. The Debian repos have libftxui-dev for the headers and ftxui-examples for a set of little demo applications that show the UI functionality, with corresponding docs and source code here (see examples). The library puts the terminal in an interactive mode where it receives mouse events and can logically treat the terminal as a framebuffer to render into. It also supports keyboard interaction. I think I may quite possibly be able to do the plumbing to run this inside the text renderer in my own engine, but that would be further down the line - I can render all the terminal UI characters, but I need to figure out passing input events in. The library provides a nice functional-style interface that somewhat resembles a builder pattern. With some coaxing, you can set up dynamic UIs pretty easily, showing and hiding elements with the Maybe() component, which takes a pointer to a bool enable flag. This is nice because it enables central management of those flags for several elements, if you need it.
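As a tiny sketch of that pattern (not Crystal's actual UI - the checkbox, names, and layout here are just for illustration), a Maybe()-wrapped element toggled by a bool looks something like this:

```cpp
#include <ftxui/component/component.hpp>
#include <ftxui/component/screen_interactive.hpp>
#include <ftxui/dom/elements.hpp>

using namespace ftxui;

int main() {
    bool show_details = false;   // the enable flag that Maybe() watches

    auto toggle  = Checkbox("Show Details", &show_details);
    auto details = Renderer([] { return text("extra detail panel") | border; });

    auto layout = Container::Vertical({
        toggle,
        Maybe(details, &show_details),   // hidden until the flag flips to true
    });

    auto screen = ScreenInteractive::TerminalOutput();
    screen.Loop(layout);
}
```

The flag itself can live wherever is convenient (UI state struct, config object), which is what makes the central management of several of these so easy.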
I have had my frustrations with ImGui, which became the de facto drop-in cheap-and-easy UI solution everywhere. We used it at id, and had endless frustrations with event passing, DPI issues, issues when running it inside a Qt viewport in idStudio (setting aside the question of why we were doing that with any regularity in the first place), etc., mostly arising from a custom backend that got hacked in. Nobody had a good time with it if they were doing anything beyond "click the button". I am appreciating quite a few aspects of using a textmode UI like this. First of all, who needs 60 Hz on a UI? What are you doing, low latency 3D tasks? No. You're clicking buttons and sliders; 10 Hz is already more than you need. Do you even need to open a separate window from the terminal where you launch the program? When I'm really loading down a system, I'm dealing with latencies north of 5 seconds for single inputs. Why make this user interaction additionally contingent on render work?
And so I think using a terminal UI like this is actually quite compelling. You can run a thread for this terminal UI and run it at 10 Hz. No need for more than that, and really no point tying it to the render thread, as it is fundamentally unrelated. It encourages good practice in synchronizing resources between several threads. If the UI needs to do something, it can spawn a thread to do so. In doing so, we hand our latency constraint to the system scheduler. I've got however many threads doing work, and here's another one in a loop doing relatively light work - managing inputs, updating the display and terminal output, sending any required messages to worker threads - then sleeping for 100ms.
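Structurally, that works out to something like this (a sketch, not Crystal's actual code - the function name and the fixed tick count are just for illustration):

```cpp
#include <chrono>
#include <thread>

// The terminal UI gets its own thread, ticking at ~10 Hz, entirely decoupled
// from the render/worker threads. In the real thing this loops until shutdown
// is signalled; here it just runs a fixed number of ticks.
void UiThreadLoop() {
    for (int tick = 0; tick < 50; tick++) {
        // 1. poll input events from the terminal
        // 2. update UI state and redraw
        // 3. post any messages to the worker threads
        std::this_thread::sleep_for(std::chrono::milliseconds(100));   // ~10 Hz is plenty
    }
}

int main() {
    std::thread ui(UiThreadLoop);   // runs independently of any render loop
    // ... simulation / render threads would be doing their own thing here ...
    ui.join();
}
```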
My coverage of this material has been spotty, and I learned a lot studying up over the past few weeks. A lot of university staff are behind the times on new standards and things of this nature, so in my undergrad C++ classes C++11 was considered shiny and new; C++14 and C++17 weren't really even discussed. I had some scattered exposure to bits and pieces of more modern C++ over the years, and none of it is fundamentally surprising. It's good to get into the practice of correctly managing lambda capture lists and synchronization primitives.
Four pieces of functionality I've had real application for in undertaking this project:
std::jthread
There is an important distinction between std::thread and std::jthread. They are fundamentally the same, in that they represent a thread of execution, and are usually specified with a lambda expression to express the work they perform. The difference is more ergonomic: std::thread requires an awkward call to .join(), and if you fail to do so before the object is destroyed, your program simply terminates via std::terminate() (which calls abort()). std::jthread joins automatically on destruction. What's interesting is that you can spawn one of these from a UI button and basically forget about it. It manages its own execution and termination.
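A minimal illustration of the difference:

```cpp
#include <iostream>
#include <thread>

int main() {
    {
        std::thread t([] { std::cout << "std::thread work\n"; });
        t.join();   // forget this and the destructor calls std::terminate()
    }
    {
        std::jthread jt([] { std::cout << "std::jthread work\n"; });
    }               // no .join() needed: std::jthread joins in its destructor (C++20)
}
```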
This is only tangentially related, but I want to mention it because it screwed me up for more time than it should have: the thread_local storage specifier creates a separate copy for every thread of execution - but it also behaves like static. The value is initialized once per thread and will not be reinitialized when you enter scope, the way other stack variables would be. Something to be careful of. I found it most useful for creating random number generators, one for each thread.
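A sketch of that per-thread RNG idea (a hypothetical helper, not Crystal's actual one):

```cpp
#include <random>

// One generator per thread, seeded once per thread, and (like a static) never
// reinitialized on subsequent calls from the same thread.
float RandomFloat01() {
    thread_local std::mt19937 gen{ std::random_device{}() };
    thread_local std::uniform_real_distribution<float> dist{ 0.0f, 1.0f };
    return dist(gen);
}

int main() {
    return RandomFloat01() < 1.0f ? 0 : 1;   // trivially exercise the helper
}
```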
std::shared_mutex
Mutexes are mutual exclusion primitives. The point is that something can be uniquely held by one thread of execution, in a way that precludes all other threads of execution from doing the same. What's unique about std::shared_mutex is that it actually has two modes. Where a std::mutex establishes, via a std::lock_guard, that this thread of execution is the only one in this block of code, we actually have more options. This is fairly performance-critical in Crystal, because I need synchronized access to the hashmap containing the anchored particles... but mutually exclusive is too constraining. You cannot check whether a mutex is held without trying to acquire it, and polling it that way is bad practice.
The more-correct way to approach this, from the standard's point of view, is to use a std::shared_mutex. This has two different types of locks. One is exclusive, and behaves exactly like your expected std::mutex functionality: std::unique_lock. When you put this in a function's scope, only one thread of execution can be there at a time. This is enforced by the blocking behavior of the constructor of std::unique_lock, the same way as std::lock_guard. In Crystal, I use this type of lock when I'm adding information to a grid cell in the hashmap (more details on the data structure in that post), because it changes the data structure in a way that will affect all reads. But we need a lighter guarantee on the read side. This is provided by std::shared_lock, which only blocks while a std::unique_lock on the same std::shared_mutex is held.
The key distinction is that there can be many separate threads of execution simultaneously holding this std::shared_lock. In Crystal, this was very important to avoid race conditions on the particle contents of the grid cells.
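A minimal sketch of that reader/writer pattern (hypothetical types and names, not Crystal's actual grid code, but the same unique_lock/shared_lock split):

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <shared_mutex>
#include <unordered_map>
#include <vector>

struct Particle { float x, y, z; };

// A grid-cell hashmap keyed by a packed cell index, guarded by one std::shared_mutex.
class AnchorGrid {
public:
    void AddParticle(uint64_t cellKey, const Particle& p) {
        std::unique_lock lock(mutex_);   // exclusive: blocks all readers and writers
        cells_[cellKey].push_back(p);    // may rehash, so writers must be exclusive
    }

    size_t CountParticles(uint64_t cellKey) const {
        std::shared_lock lock(mutex_);   // shared: many readers may hold this at once
        auto it = cells_.find(cellKey);
        return it == cells_.end() ? 0 : it->second.size();
    }

private:
    mutable std::shared_mutex mutex_;
    std::unordered_map<uint64_t, std::vector<Particle>> cells_;
};
```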
std::atomic<T>
Coming from a graphics and GPU compute background, I've got a strong intuition for atomic operations. But in C++, they serve a slightly different purpose. These are a wrapper around particular types, which inserts the correct hardware barriers to ensure that the operations on them are atomic. This is nontrivial - even for a "pause" flag in Crystal, it's actually bad form to have that be a raw bool, even if you perceive that as the simplest possible transition, from false to true. It is important to make the associated guarantees that things are happening in the correct order. This will manage the appropriate things at the hardware level - flushing values from caches and so on - as needed. Specific hardware primitives can make more specific guarantees, but I have not researched this.
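A minimal sketch of that pause-flag idea (not Crystal's actual code): with a raw bool this would be a data race, and therefore undefined behavior; std::atomic<bool> makes the store on one thread reliably visible to the load on the other.

```cpp
#include <atomic>
#include <thread>

int main() {
    std::atomic<bool> pause{ false };

    std::thread ui([&] { pause.store(true); });     // e.g. a UI button handler
    std::thread worker([&] {
        while (!pause.load()) { /* keep simulating until asked to pause */ }
    });

    ui.join();
    worker.join();
}
```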
There are two primary applications for atomics in Crystal. First, like I mentioned, are signal flags that can serve as communication to worker thread pools - pause is a good example. The second is unique work dispatch. By doing atomic increments on a 64-bit uint "job counter", the job system gives each thread of execution a new sort of "unique work serial number" off this counter, each time it finishes a job and asks for a new one. 64 bits should be roughly enough to run until the sun explodes. I was able to use a very similar system to orchestrate cutting over from simulation work to render work in Crystal, where a second counter is maintained for a linear pixel buffer index, representing a dispatch for a particular pixel in the framebuffer (see below, re: Worker Threads). Each time .fetch_add( 1 ) is called, we get back the number that it was before we added 1.
We know how many pixels are in the image - it's just the height times the width - so if we're over this threshold, we know there is no more screenshot work to be done (or maybe it's moving on to the next frame); otherwise, solve for an X and Y pixel index and call RenderPixel( x, y ). The "screenshot trigger" is setting the value of the pixel dispatch counter to 0. That's it. The job threads start seeing valid indices, and cut over instantly. This is slick.
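A sketch of how that counter-based dispatch might look. The counter name ssDispatch and RenderPixel() come from this post; the wrapper function, image dimensions, stub body, and the "parked" initial value are my own illustration.

```cpp
#include <atomic>
#include <cstdint>

constexpr uint64_t imageWidth  = 1920;
constexpr uint64_t imageHeight = 1080;
constexpr uint64_t pixelCount  = imageWidth * imageHeight;

std::atomic<uint64_t> ssDispatch{ pixelCount };   // >= pixelCount means "no screenshot work"

void RenderPixel(uint64_t x, uint64_t y) { /* shade the pixel (stub) */ }

bool TryRenderNextPixel() {
    // fetch_add returns the value *before* the increment - a unique work serial number
    const uint64_t idx = ssDispatch.fetch_add(1);
    if (idx >= pixelCount)
        return false;                             // past the end: no screenshot in flight
    RenderPixel(idx % imageWidth, idx / imageWidth);
    return true;
}

int main() {
    ssDispatch.store(0);              // the "screenshot trigger"
    while (TryRenderNextPixel()) {}   // one thread draining every pixel, for illustration
}
```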
std::chrono_literals
C++ has some super nice utilities for dealing with time. The type names suck, but you can deal with that with some typedefs, no big deal. I had never encountered this before, but you can write values like "1ms" in your code, have that expand into the associated time interval representation for std::chrono, and have yourself some very easy-to-read wait statements like sleep_for( 100ms ). Very nice.
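A tiny illustration of the literals and a typedef taming the type names (a sketch, not code from Crystal):

```cpp
#include <chrono>
#include <thread>

using namespace std::chrono_literals;

int main() {
    using clk = std::chrono::steady_clock;      // typedef to tame the type names
    const auto t0 = clk::now();
    std::this_thread::sleep_for(100ms);         // reads exactly like the intent
    const auto elapsed =
        std::chrono::duration_cast<std::chrono::milliseconds>(clk::now() - t0);
    return elapsed.count() >= 100 ? 0 : 1;      // roughly 100ms elapsed
}
```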
The problem statement is maybe relatively simple: I have 72 independent threads of execution that I want to keep busy. How do you do that? What does it look like? How do you monitor it?
The program starts by creating several threads: one service thread, one UI thread, and a pool of 72 worker threads. Initially, a Crystal was a first-class entity, the primary class for the application. I've since moved to encapsulating that in its own class and managing several of them at a time, to be able to pipeline it and make use of the huge memory buffer on this machine (I can now easily saturate 128 GB and spill into swap - this chassis can be kitted out with 3 TB; I think moving to 512 gigs would enable some pretty cool stuff). I'll describe the program before that point, because it is more pertinent to this discussion.
The /proc/ filesystem is a utility exposed by the Linux kernel. It also exposes /proc/self/, so you can monitor the memory allocations made by your own individual process - I haven't looked into this yet. I translated a parse script for a top clone I wrote in college. This thread does nothing but maintain a double-buffered copy of CPU data. You can see the square of flickering colored block characters - this is actually monitoring of all the hardware threads, at about 10 Hz. The proc monitor thread is an infinite loop until the program is killed: reading the values, applying a low pass filter for some smoothing, and then sleeping for 5ms.
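For reference, a minimal sketch of that kind of /proc/stat polling (not the actual parse script - field handling is simplified): usage per core comes from the delta of busy over total jiffies between two samples, which then gets low-pass filtered.

```cpp
#include <cctype>
#include <cstdint>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

struct CpuSample { uint64_t busy = 0, total = 0; };

std::vector<CpuSample> ReadProcStat() {
    std::vector<CpuSample> samples;
    std::ifstream file("/proc/stat");
    std::string line;
    while (std::getline(file, line)) {
        // keep only the per-core "cpuN" lines, skip the aggregate "cpu" line
        if (line.rfind("cpu", 0) != 0 || line.size() < 4 ||
            !std::isdigit(static_cast<unsigned char>(line[3])))
            continue;
        std::istringstream ss(line);
        std::string label;
        uint64_t user, nice, system, idle, iowait, irq, softirq, steal;
        ss >> label >> user >> nice >> system >> idle >> iowait >> irq >> softirq >> steal;
        CpuSample s;
        s.busy  = user + nice + system + irq + softirq + steal;
        s.total = s.busy + idle + iowait;
        samples.push_back(s);
    }
    return samples;
}

int main() {
    // sample twice with a delay between, then (busy1-busy0)/double(total1-total0) per core
    const auto a = ReadProcStat();
    return a.empty() ? 1 : 0;
}
```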
Most system monitors don't have a high refresh mode like this. I had a lot of fun putting together some blinkenlights to show the CPU usage while the program is running. This same information could be passed to a master machine for central monitoring, in a larger system. I've also given some thought to running a hardware device with a grid of LEDs, in the spirit of the status register displays of the old Connection Machine Supercomputers, I've always been fond of this type of display and the idea of total expression of internal state of the hardware. Obviously this becomes more and more difficult as the numbers become billions and trillions, but millions is still manageable with single pixels on a display. I've been thinking a lot about how to operate at this kind of scale.
The UI thread handles FTXUI's input handling, state updating, and display. Inside the FTXUI config structure - the top-level Component - we keep a hierarchical structure that, among other things, includes lambdas for the specific functionality we want to use to control the simulation at runtime. It also prepares the grid display of the black-to-orange or black-to-green gradient block characters used to show the activity status of the 72 logical CPUs that are present in the proc filesystem.
Spawning and destroying threads has nontrivial system overhead; it is not a free operation. If you are creating and destroying threads frequently, it is unlikely that you are operating in a very efficient manner. A "thread pool" here is just an array of std::thread objects which persist while an operation or series of operations take place. These represent separate threads of execution, on which we do whatever structured calculations the situation at hand calls for. In Crystal, each one runs this function, which has some minor subtleties:
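(What follows is a hedged reconstruction from the description below, not the original listing - the helpers are stubs named after the surrounding text, and SimulationWorkAvailable() is my own hypothetical stand-in for the second dispatch helper.)

```cpp
#include <atomic>

extern std::atomic<bool> threadkill;   // global shutdown flag, set by the UI thread
bool ScreenshotIndicated();            // atomic check: is there screenshot work to claim?
bool SimulationWorkAvailable();        // atomic check: is there simulation work to claim?
void DrawPixel();                      // render work for the claimed pixel
void UpdateParticle();                 // simulation work for the claimed particle

void WorkerThreadLoop() {
    while (!threadkill) {
        // (the pause-flag check lives inside this loop - omitted here, as noted below)
        if (ScreenshotIndicated())
            DrawPixel();
        else if (SimulationWorkAvailable())
            UpdateParticle();
    }
}
```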
I haven't included the check of the pause flag, but it's just inside of the while loop. You need to put it there because you need to stop the increments from happening, or they will occur without mapping to real work. Threadkill is a global atomic bool used to shut down all the worker threads and the proc polling thread. This flag is set by the UI thread when indicated by keystroke or button press, and the program shuts down. These two bool-returning functions are internally doing the atomic work to determine what work should be performed.
This is the structure which allows the atomic dispatch of simulation work, and the cutover to render work when indicated by the monitor thread setting the counter ssDispatch to 0, so that ScreenshotIndicated() immediately starts returning true. The next worker thread to enter the function will see that flag and enter DrawPixel instead of UpdateParticle.
This machine has more network bandwidth than you can shake a stick at. I don't have the infrastructure to take advantage of even 10% of it - I've got a couple big gigabit switches, but Susan has 4x 10-gig ports. I don't have anything that I can connect to it in a way that would operate at full capacity - interesting, because this is a machine from 2014. However, this would be quite interesting to explore, connecting it to other devices like itself. I'm not sure if you have to go through a switch or if, ideally, you can go directly port-to-port. I've recently been talking to a couple friends about some networking libraries, something a bit higher level than dealing with sockets: one called ENet, which was apparently originally made for a game I played as a kid, Cube 2: Sauerbraten, and one called rpclib, which provides more of a function-call interface. I'll need to pass a significant amount of data between a couple machines to sync scene data and rendered frames. I'd like to move towards building a "graphics supercomputer" and experimenting with different architectures to take advantage of larger scale hardware. With a GPU, I can offload expensive float operations to the device, and do control-flow-heavy jobs - maybe work like sorting and organizing the results that come back from the GPU - on the CPU. One current idea I'm toying with is to set up a ring network between several machines like this, and pass around progressively refined branching tree structures representing rays for a given pixel. Ring networks are interesting because there's an inherent limit on the jump distance between machines: the number of elements in the ring. Very high bandwidth and zero contention along links. There are redundancy issues, though - if any machine goes down, the network stops functioning correctly. Running another line to a switch acting as a central hub means you can at least be aware of this kind of failure condition from whatever head unit acts as "master".
I'd also like to get away from using the VGA output on this machine. It wasn't really designed to be used this way, and I'd like to figure out how to SSH/network into it and do X server forwarding for a virtual desktop. This is probably one of the next skillsets I'll focus on developing - at least some amount of basic networking - and I'll be looking into it soon. If I have several machines like this, I can have each one report its system monitoring data back to a master machine and do central monitoring, which I think is a very cool opportunity for blinkenlights. I also need to figure out the correct cables from the GPU_PWR headers to 8-pin PCIe connectors to run a GPU in it. This would massively expand its capabilities, with a significant GPU. I have a 7900 XTX that I am planning on using for this.
Last updated 11/29/2025