Jon Baker, Graphics Programming

GPU Sphere Packing with Dart Throwing

 This is another one of the pet algorithms that I've been playing with on and off for the past couple of years. I think it's an interesting exercise, filling up a space with shapes - interestingly enough, I've found some relatively current research related to doing it with more complex geometries, specifically using it for procedural generation for games. Incidentally, that author also mentions potential applications in tasseography. In the past I've mostly implemented sphere packing as a very naive O(N²) algorithm, where each newly placed point explicitly checked its distance against each element of a list of already-placed spheres.

  On Shadertoy a couple of years ago, I found a few examples of a GPU-based algorithm for this that I can't find a source for beyond that paper; it's generally referred to as "dart throwing" on Shadertoy. This is a kind of a-bit-better-than-brute-force implementation of that paper. It uses a grid which holds samples of an SDF, progressively updated as more and more shapes are added to the model. This functions as a cache which replaces the O(N²) check of a new radius against the whole list with a constant-time texture sample of global distance.

  I went through one by leon, and it's a grid-based, iterative algorithm that slowly builds up a Voronoi-type structure. Most of the samples on Shadertoy are 2D; the 3D examples use a different approach called Minkowski offset that I'm not really familiar with, which seems to be based on spheres repelling one another. That said, the "dart throwing" algorithm works just as well in 3D as it does in 2D, and I was able to write a compute shader that uses it to pack spheres instead of circles.

Algorithm

 My implementation keeps a double-buffered resource holding three pieces of data for each voxel. The first two, "distance" and "radius", are pretty straightforward; "seed" is a little more involved and has some further implications I will talk about later. This algorithm requires many iterations, which operate on these voxels. The field is seeded with some initial conditions, and can be reset at any time by restoring those conditions.
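  As a rough sketch of that per-voxel state and the initial seeding - the names Voxel and resetField here are mine, not from the actual implementation, and the real thing is a GPU buffer rather than a std::vector, but the layout is the same idea:

    #include <cstdint>
    #include <vector>

    struct Voxel {
        float    distance; // cached SDF sample: distance to the surface of the nearest placed sphere
        float    radius;   // radius of that nearest sphere
        uint32_t seed;     // hash seed that deterministically regenerates that sphere's center
    };

    // restore the initial conditions: no spheres placed yet, so "distance" is
    // effectively infinite and radius/seed are zeroed sentinels
    void resetField ( std::vector< Voxel > &front, std::vector< Voxel > &back, size_t numVoxels ) {
        const Voxel initial = { 1e30f, 0.0f, 0u };
        front.assign( numVoxels, initial );
        back.assign( numVoxels, initial );
    }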

 Once the field is seeded with the initial values, you are ready to run the iterative process. The structure here is interesting: essentially, we make a centralized decision about the proposed point, and each voxel's compute invocation does some checks to see whether it will update with the new information or retain its prior contents.

  In more detail (a code sketch of the full per-voxel update follows this list):

  1. Generate

      This part of the process is centralized, or at least deterministic. We need a way for every voxel to be presented with the same information, which is a 32-bit unsigned integer seed value. This seed value is used in conjunction with the Wang hash in order to get the proposed location. More on the particulars later; for now, the pertinent point is that this is what places the center point of the proposed sphere for this iteration.

  2. Check/Adjust

      This is where we have some creative potential. We are now defining how we pick the radius of the new sphere. We have the location indicated by the current shader invocation, and the location of the center of this iteration's proposed sphere. We can now take a sample of the "distance" from the buffer at that center point... just like the usage of "distance" during raymarching, this defines a maximally sized sphere that can be placed without touching the scene geometry. With a negative value of distance, we already know we have failed to place a sphere this iteration, as it is globally impossible, and we can terminate the iteration (this is an opportunity for a big speedup, if you can sample your distance/rejection criteria from the CPU).

      If we get a positive value of distance, we now have the maximum possible radius of a sphere that can be placed in the field. We can use it directly, clamp it with an arbitrary maximum value, sample a noise field or an analytic SDF at the proposed sphere center point for a scalar... the options are basically endless. I have a sense that you need to enforce a minimum sphere size because of Nyquist (I chose 1.5 voxels as a minimum radius). With a combination of these influences, potentially also using them to reject proposed spheres, you can drive both the overall shape of the structure and the local distribution of radii. I love this.

  3. Overlap?

      By this stage we have enough information to make a final determination of whether or not this is a viable sphere to add to the global model on the grid. We know the position, and we know the radius. We now need to compute the distance from the center point of that sphere to a point representing the voxel for the current compute shader invocation. This operation represents a threshold check to locally determine whether we accept this new sphere as "closest". For example, a sphere may be valid, but located too far from the invocation to be added to the model (this is another area with a lot of potential speedup, if you know ahead of time and can constrain your compute dispatch to only the pertinent voxels that might be updated by that proposed sphere... sparse models/early iterations will be doing a lot of work updating empty space for nothing). We need to handle two cases:

      A: PROPOSED SPHERE OVERLAPS: if the distance on the grid is less than or equal to the proposed sphere's distance, we want to keep the values that are currently on the grid. Just copy last frame's data over.

      B: PROPOSED SPHERE IS A VALID CHOICE: if the distance on the grid is greater than the proposed sphere's distance, we have new data to store:

    1. Distance: to the surface of the proposed sphere - may be negative, if the invocation's voxel is inside
    2. Radius: this is important, because the information is not included in the hash
    3. Hash Seed: this can be used to generate the exact center position on-the-fly, due to hash determinism
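  Here is a minimal CPU-side sketch of one iteration's per-voxel update, under my own naming (Voxel as above, wangHashFloat, proposeCenter); the actual implementation does this in a compute shader with one invocation per voxel, but the decision logic is the same:

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    struct Voxel { float distance; float radius; uint32_t seed; }; // same layout as above
    struct vec3 { float x, y, z; };

    // Wang hash, used here as an advancing RNG state returning values in [0, 1)
    float wangHashFloat ( uint32_t &state ) {
        state = ( state ^ 61u ) ^ ( state >> 16u );
        state *= 9u;
        state ^= state >> 4u;
        state *= 0x27d4eb2du;
        state ^= state >> 15u;
        return float( state ) / 4294967296.0f;
    }

    // deterministically map a 32-bit seed to a proposed center inside the grid
    vec3 proposeCenter ( uint32_t seed, int dimX, int dimY, int dimZ ) {
        uint32_t state = seed;
        return { wangHashFloat( state ) * dimX,
                 wangHashFloat( state ) * dimY,
                 wangHashFloat( state ) * dimZ };
    }

    void iterate ( std::vector< Voxel > &front, const std::vector< Voxel > &back,
                   uint32_t iterationSeed, int dimX, int dimY, int dimZ, float maxRadius ) {
        // 1. Generate: every "invocation" sees the same seed, and therefore the same center
        const vec3 c = proposeCenter( iterationSeed, dimX, dimY, dimZ );

        // 2. Check/Adjust: sample the cached distance at the proposed center and pick a radius
        const int cx = std::min( int( c.x ), dimX - 1 );
        const int cy = std::min( int( c.y ), dimY - 1 );
        const int cz = std::min( int( c.z ), dimZ - 1 );
        const float dAtCenter = back[ ( size_t( cz ) * dimY + cy ) * dimX + cx ].distance;
        const float radius = std::min( dAtCenter, maxRadius );
        if ( radius < 1.5f ) { front = back; return; } // below the minimum size (or negative): reject this iteration

        // 3. Overlap?: each voxel decides whether the proposed sphere becomes its new "closest"
        for ( int z = 0; z < dimZ; z++ )
        for ( int y = 0; y < dimY; y++ )
        for ( int x = 0; x < dimX; x++ ) {
            const size_t i = ( size_t( z ) * dimY + y ) * dimX + x;
            const float dx = x + 0.5f - c.x, dy = y + 0.5f - c.y, dz = z + 0.5f - c.z;
            const float dNew = std::sqrt( dx * dx + dy * dy + dz * dz ) - radius; // signed distance to the proposed sphere
            if ( back[ i ].distance <= dNew ) {
                front[ i ] = back[ i ];                            // case A: keep last frame's data
            } else {
                front[ i ] = Voxel{ dNew, radius, iterationSeed }; // case B: store new distance, radius, and seed
            }
        }
    }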

Significance of the Hash Seed

  How do you represent a high-precision 3D point with a single 32-bit seed? That's an interesting question. In leon's example, the center point is stored directly as a 2D value, and that works; it would work here, too. But that takes more memory, and I figured out an interesting way to do it in a third of the space.

  This is an interesting application of determinism in random number generation. Because of the discrete operation of the Wang hash on the bits of an unsigned int seed value, we can derive an exact sequence of arbitrary length from a given seed value. This is a very interesting property, and if we combine it with the grid dimensions that we already know, we can use a float in the range of 0 to 1 times each grid dimension to generate a point along that axis. I can do this three times, in a known sequence... and there's your 3D point. You just restore the RNG state and generate three values, by doing seed=seedValue, x=hash(), y=hash(), z=hash(), and you end up with the same sequence of values each and every time.

  You can continue hashing this seed further for material properties, etc, that are "uniform per sphere". I don't really know how to measure this to make that statement with any kind of certainty; I haven't put a lot of thought into it. The point is that you have a uniquely seeded random process for each sphere, and you can use it as many times as you would like, getting the same sequence of values for your seed. Say I hash the seed further for more material properties, roughness=hash(), albedo=vec3(hash(), hash(), hash())... This is the essence of what I'm getting at.
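  A small sketch of that reconstruction, using my own helper names; the restore-and-replay pattern is the point:

    #include <cstdint>

    struct vec3 { float x, y, z; };

    // Wang hash as an advancing state, normalized to [0, 1)
    float wangHashFloat ( uint32_t &state ) {
        state = ( state ^ 61u ) ^ ( state >> 16u );
        state *= 9u;
        state ^= state >> 4u;
        state *= 0x27d4eb2du;
        state ^= state >> 15u;
        return float( state ) / 4294967296.0f;
    }

    struct SphereProperties {
        vec3 center;
        float roughness;
        vec3 albedo;
    };

    // restore the RNG state from the stored seed and replay the same sequence:
    // the first three hashes rebuild the center, further hashes are per-sphere extras
    SphereProperties reconstruct ( uint32_t seed, vec3 gridDims ) {
        uint32_t state = seed;
        SphereProperties s;
        s.center    = { wangHashFloat( state ) * gridDims.x,
                        wangHashFloat( state ) * gridDims.y,
                        wangHashFloat( state ) * gridDims.z };
        s.roughness = wangHashFloat( state );
        s.albedo    = { wangHashFloat( state ), wangHashFloat( state ), wangHashFloat( state ) };
        return s; // identical output for the same seed, every single time
    }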

Data Storage

  I had to come up with a good format for transferring this between programs. I decided to keep a separate generator, because of how slow this approach is in practice. Filling in the block with this random "dart throwing" process, without any higher-level guidance, is a minutes-to-tens-of-minutes kind of thing. Turned over to an unguided, fully stochastic process, it is very, very slow - over time you will have more and more failing tests, trending towards zero efficiency. I think you will always eventually reach a state where no more spheres can be placed, for a given minimum sphere size.

  This motivated getting a straightforward format I could dump from one program and read straight into another. This cache format is a PNG image, which is used to encode 4 bytes per voxel for the radius and 4 bytes for the seed, interleaved in neighboring pixels of a 2D image. These pixels each losslessly contain 4 bytes, so we have something that can basically be trivially loaded on the read side, with knowledge of exactly one dimension of the block. This losslessness is important because we are keeping 4-byte uints as part of our state - a 4-byte uint-encoded float for the radius and a 4-byte uint for the hash seed - and we need to keep them bit-intact for the data to be meaningful.
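  A sketch of the packing side, under my assumptions about the exact interleaving (two adjacent RGBA8 pixels per voxel, radius first, then seed); the PNG write itself is omitted, any lossless 8-bit-per-channel writer will do:

    #include <cstdint>
    #include <cstring>
    #include <vector>

    // split a 32-bit value across the four channels of one RGBA8 pixel, losslessly
    void packU32 ( uint32_t value, uint8_t *pixel ) {
        pixel[ 0 ] = uint8_t( value );
        pixel[ 1 ] = uint8_t( value >> 8 );
        pixel[ 2 ] = uint8_t( value >> 16 );
        pixel[ 3 ] = uint8_t( value >> 24 );
    }

    uint32_t unpackU32 ( const uint8_t *pixel ) {
        return uint32_t( pixel[ 0 ] ) | ( uint32_t( pixel[ 1 ] ) << 8 ) |
               ( uint32_t( pixel[ 2 ] ) << 16 ) | ( uint32_t( pixel[ 3 ] ) << 24 );
    }

    // two pixels per voxel: the radius (float bit-cast to uint) then the hash seed
    std::vector< uint8_t > encodeBlock ( const std::vector< float > &radii,
                                         const std::vector< uint32_t > &seeds ) {
        std::vector< uint8_t > image( radii.size() * 2 * 4 );
        for ( size_t i = 0; i < radii.size(); i++ ) {
            uint32_t radiusBits;
            std::memcpy( &radiusBits, &radii[ i ], 4 );      // keep the float bit-intact
            packU32( radiusBits, &image[ ( 2 * i + 0 ) * 4 ] );
            packU32( seeds[ i ], &image[ ( 2 * i + 1 ) * 4 ] );
        }
        return image;                                        // rows of this buffer become the PNG
    }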

Application as a Raytracing Acceleration Structure

  I have dedicated another article to this point by itself, because I think it's very interesting and bears further discussion. Basically, we can reuse information from the underlying grid while doing a DDA traversal. This lifts the limitation from the prior implementation that primitives need to be contained inside of grid cells. No longer does a given shape need to fit inside of the voxel... now, we load the radius and seed value from the grid, and hash the seed to get a position value. Test the ray against a sphere with the specified properties... it is, by definition from this process, the nearest sphere, and you can get nice raytraced results out.
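  The per-cell test looks something like the following sketch (my naming, and the DDA stepping itself is omitted; the center reconstruction is the same Wang hash replay as above):

    #include <cmath>
    #include <cstdint>

    struct vec3 { float x, y, z; };
    float dot ( vec3 a, vec3 b ) { return a.x * b.x + a.y * b.y + a.z * b.z; }

    float wangHashFloat ( uint32_t &state ) {
        state = ( state ^ 61u ) ^ ( state >> 16u );
        state *= 9u;
        state ^= state >> 4u;
        state *= 0x27d4eb2du;
        state ^= state >> 15u;
        return float( state ) / 4294967296.0f;
    }

    // returns the hit distance t along the ray, or a negative value on a miss;
    // radius and seed come straight from the grid cell the DDA is currently visiting
    float testCell ( float radius, uint32_t seed, vec3 gridDims, vec3 rayOrigin, vec3 rayDirection ) {
        if ( radius <= 0.0f ) return -1.0f;                   // empty cell, keep stepping
        uint32_t state = seed;
        const vec3 center = { wangHashFloat( state ) * gridDims.x,
                              wangHashFloat( state ) * gridDims.y,
                              wangHashFloat( state ) * gridDims.z };
        const vec3 oc = { rayOrigin.x - center.x, rayOrigin.y - center.y, rayOrigin.z - center.z };
        const float b = dot( oc, rayDirection );              // rayDirection assumed normalized
        const float c = dot( oc, oc ) - radius * radius;
        const float discriminant = b * b - c;
        if ( discriminant < 0.0f ) return -1.0f;              // ray misses this sphere
        return -b - std::sqrt( discriminant );                // nearest intersection distance
    }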

  I implemented this in the visualizer for this generator and was surprised by how well it worked, so I went ahead and implemented it in Daedalus. I've been extremely pleased with how many spheres it can handle. In fact, perhaps somewhat paradoxically, sparse volumes are arguably more expensive to traverse than densely populated ones. The traversal time is "relatively deterministic" and mostly has to do with how many traversal steps are required to get to a valid hit with a sphere. The scaling is not particularly straightforward and also depends on the properties of the contents of the block. I'm curious to see if this is also viable for other shapes...

Future Directions

  There's something extremely interesting you could do for the hash seeds. This may be an overnight computation, and I'm not sure exactly how it would scale... but you could derive a table of hash seeds which would get you any point in the volume. Even a very small subset of complete coverage would be a valuable tool for explicit placement. A table entry would represent a particular voxel, and it would hold a seed value that gets you back to that voxel. Start with 0 in all voxels as a reserved value for "not hit yet", and just start exhaustively running all possible seed values. For a 32-bit seed value, you have about 4 billion possible RNG states. This is a tractable scale for brute force.
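  A sketch of that brute force, under the assumptions above (0 reserved for "not hit yet", and the first seed to land in a voxel wins):

    #include <cstdint>
    #include <vector>

    float wangHashFloat ( uint32_t &state ) {
        state = ( state ^ 61u ) ^ ( state >> 16u );
        state *= 9u;
        state ^= state >> 4u;
        state *= 0x27d4eb2du;
        state ^= state >> 15u;
        return float( state ) / 4294967296.0f;
    }

    // exhaustively map seeds to the voxel their generated point lands in;
    // this will take a while, it is ~4 billion hash chains
    std::vector< uint32_t > buildSeedTable ( int dimX, int dimY, int dimZ ) {
        std::vector< uint32_t > table( size_t( dimX ) * dimY * dimZ, 0u );     // 0 == not hit yet
        for ( uint64_t seed = 1; seed <= 0xffffffffull; seed++ ) {             // skip 0, it is the sentinel
            uint32_t state = uint32_t( seed );
            const int x = int( wangHashFloat( state ) * dimX );
            const int y = int( wangHashFloat( state ) * dimY );
            const int z = int( wangHashFloat( state ) * dimZ );
            const size_t idx = ( size_t( z ) * dimY + y ) * dimX + x;
            if ( table[ idx ] == 0u ) table[ idx ] = uint32_t( seed );         // keep the first seed found
        }
        return table;
    }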

  What's interesting about this problem is that the values you're tracking are dependent on one another, and you could potentially make the process more efficient by pipelining it... I'm not sure. Since you hashed values for y and z in this check, you could reuse them as x and y in the next check and hash only a new value for z, keeping the hash state across one subsequent update? In any event, the product here would be something that you could use to direct the process; being able to specify particular points is a very basic property you expect from a lot of drawing primitives, but it's not easy to get here. The property, specifically, is being able to specify particular points during the global Generate step (1.) from above. I'm quite sure that you can do it with this process, and it would again be a major opportunity for speedup.

  Overall I think there are some major reasons to duplicate the logic for scaling, etc, on the CPU - perhaps even doing a full CPU implementation, where the limiting part would be the actual per-voxel evaluation and update. I would have enough information to inform the sizing of those jobs' dispatch, and get a much more substantial benefit from the early-out of evaluating noise thresholds a single time, on a newly proposed point. This would eliminate the massive redundancy of computing the hash value and these noise values over and over, repeated across every voxel. I think this could reasonably be 100x-1000x faster if you really dedicated some time to optimizing these couple of things.


Last updated 9/26/2025