Jon Baker, Graphics Programming

bitfontCore2

  The author of the hoard-of-bitfonts repo has added a number of new bitmapped fonts over the last couple of months. I only noticed this in the last couple of days - I didn't realize the project was still being actively maintained. I had forked it a while back and used my fork to process the data into my own resource, so I was missing 70-some commits since the last time I had looked.

  In going through and extracting all the data this time, I wanted a process I could repeat without any trouble: a simple pipeline that parses all the files and produces my model as output. I took a different approach than before, getting some practice with regex - it worked much better, and I was able to parse the whole dataset much more efficiently. Interestingly, I ended up with almost twice as many distinct glyphs as last time.

  I went through a couple of iterations on how to process the data before arriving at the current approach. At first I was pattern matching strings by hand, removing certain labels with the find-and-replace-in-files feature of my text editor. This was slow, and several times I damaged the dataset irreparably and had to clone it again. From there I got into figuring out regex pattern matching for the strings, and found that this is actually a good application for ChatGPT - it will write you a regular expression to match a pattern that you describe in plain sentences, which I thought was an interesting usage.

  After several iterations of trying to match lines starting with hashes ( which indicate comments ), and other patterns to find labels containing a colon somewhere in the line, I realized I was making this much more complicated than it had to be - the whole thing can be done in two moves. I wrote it with C++ std::regex, operating on a string holding each font file's contents:

// apply the regex replaces
// [a-zA-Z0-9]-[a-zA-Z0-9] finds dashes between alphanumerics ( break these up with a newline -
//  the surrounding alphanumerics get consumed too, but the next pass removes those anyway )
content = std::regex_replace( content, std::regex( "[a-zA-Z0-9]-[a-zA-Z0-9]" ), "\n" );
// [^-#.@\n] finds every character that is not part of the glyph data, leaving newlines alone
content = std::regex_replace( content, std::regex( "[^-#.@\n]" ), "" );

  The first pass breaks labels up with newlines, to give more spacing and weed out a couple of weird cases I found. The second removes everything besides the characters used to represent the data itself - tabs, spaces, alphanumerics, everything but these four characters and newlines. These four characters are what the two formats use to represent the 1-bit data of the glyphs ( -/# for .draw files, ./@ for .yaff files ). As a final processing step, you take the contents of the string, append it to a big running string with a couple of newlines as padding, and then do the following to standardize everything and make the next parsing steps completely trivial:

// replace characters - standardize both formats' on/off characters to '1' and '0'
std::replace( content.begin(), content.end(), '-', '0' );
std::replace( content.begin(), content.end(), '.', '0' );
std::replace( content.begin(), content.end(), '#', '1' );
std::replace( content.begin(), content.end(), '@', '1' );

  This did have a couple of very minor issues, which I was able to detect and fix. Some glyphs picked up extra leading zero values on the first line - I'm not exactly sure why, but by searching the files for matching lines I was able to confirm that this was what was happening, and it was easy to catch and remove. With that done, I had a big 50 megabyte file of glyphs in about the simplest format you could ask for, ready to parse:

	11110111
	11101011
	11111111
	11111111
	11111111
	11111111
	11111111
	11111111


	11011011
	11111111
	11000011
	10111101
	10111101
	10111101
	11000011
	11111111


	11011011
	11111111
	10111101
	10111101
	10111101
	10111001
	11000101
	11111111
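
  Parsing that dump is about as simple as it sounds - a minimal sketch, reading rows of 0s and 1s and splitting glyphs on blank lines ( the filename is a placeholder ):

// a minimal sketch of parsing the standardized dump
#include <fstream>
#include <string>
#include <vector>

int main () {
	std::ifstream file( "glyphs.txt" );
	std::vector< std::vector< std::string > > glyphs;
	std::vector< std::string > current;
	std::string line;
	while ( std::getline( file, line ) ) {
		if ( line.find_first_of( "01" ) != std::string::npos ) {
			current.push_back( line ); // one row of the glyph's 1-bit data
		} else if ( !current.empty() ) {
			glyphs.push_back( current ); // a blank line terminates the glyph
			current.clear();
		}
	}
	if ( !current.empty() ) glyphs.push_back( current ); // last glyph, if no trailing blank line
	return 0;
}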

  Duplicate removal was done pretty much the same as last time, a simple N-by-N comparison to find all the distinct bit patterns in the dataset. You could write a hashmap kind of thing, or encode the glyphs into integer values, to speed it up, but I didn't get into that - I basically just reused a lot of the code from last time. It took about 40 minutes on a single thread, going from some 450k glyphs down to a final count of 140863 distinct bit patterns.
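
  For reference, the hashmap idea mentioned above is only a few lines with std::unordered_set - a sketch, assuming each glyph has already been flattened to a single string of its rows:

// a sketch of hash-based duplicate removal - glyphs assumed to be
//  strings of '0'/'1' rows, separated by newlines
#include <string>
#include <unordered_set>
#include <vector>

std::vector< std::string > distinctGlyphs ( const std::vector< std::string > &glyphs ) {
	std::unordered_set< std::string > seen;
	std::vector< std::string > distinct;
	for ( const auto &g : glyphs ) {
		// insert() reports whether this bit pattern was newly added
		if ( seen.insert( g ).second ) {
			distinct.push_back( g );
		}
	}
	return distinct; // expected O(N), versus O(N^2) for the pairwise checking
}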

  I again used my modified version of stb_rect_pack.h, which adds an id value to the rectangles being packed. I also added a small amount of padding around each rectangle, by making it two pixels larger in each direction. This let me construct the output in a different way this time, which ended up containing twice as many glyphs in about the same memory footprint, with a unique index for each glyph.
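
  A minimal sketch of the packing step, written against stock stb_rect_pack.h - the width/height pairs passed in are a stand-in for however the glyph sizes are actually stored:

// a sketch of packing the padded glyph rects
#define STB_RECT_PACK_IMPLEMENTATION
#include "stb_rect_pack.h"
#include <utility>
#include <vector>

// sizes: width/height pair for each distinct glyph ( hypothetical input )
std::vector< stbrp_rect > packGlyphs ( const std::vector< std::pair< int, int > > &sizes ) {
	const int atlasW = 7664, atlasH = 5597;
	stbrp_context context;
	std::vector< stbrp_node > nodes( atlasW ); // one node per pixel of atlas width
	stbrp_init_target( &context, atlasW, atlasH, nodes.data(), int( nodes.size() ) );

	std::vector< stbrp_rect > rects( sizes.size() );
	for ( size_t i = 0; i < sizes.size(); i++ ) {
		rects[ i ].id = int( i );             // unique index for each glyph
		rects[ i ].w = sizes[ i ].first + 2;  // two pixels larger in each direction,
		rects[ i ].h = sizes[ i ].second + 2; //  for the padding between neighbors
	}
	stbrp_pack_rects( &context, rects.data(), int( rects.size() ) );
	// rects come back with .x and .y set, and .was_packed nonzero on success
	return rects;
}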

drop shadow

  This was originally a bug, but I ended up really liking the way it looked - like a little drop shadow below each glyph. The data now carries a couple of things: two distinct gray levels which make up the bit pattern for each glyph, on a white background, plus an identifying color in the drop shadow. It is easy to pick out all the contiguous rectangles consisting of the two gray levels, and then read the bit patterns in.
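
  A sketch of that extraction, using stb_image - this assumes the dark gray level marks the set bits, and the png filename is a placeholder:

// a sketch of picking glyph rectangles back out of the image
#define STB_IMAGE_IMPLEMENTATION
#include "stb_image.h"
#include <cstdint>
#include <vector>

bool isGray ( const uint8_t *p ) {
	// glyph pixels are one of the two gray levels: ( 69, 69, 69 ) or ( 205, 205, 205 )
	return ( p[ 0 ] == 69 || p[ 0 ] == 205 ) && p[ 0 ] == p[ 1 ] && p[ 1 ] == p[ 2 ];
}

int main () {
	int w, h, n;
	uint8_t *image = stbi_load( "bitfontCore2.png", &w, &h, &n, 4 );
	auto at = [ & ] ( int x, int y ) { return image + 4 * ( y * w + x ); };
	for ( int y = 0; y < h; y++ ) {
		for ( int x = 0; x < w; x++ ) {
			// a glyph's top left corner: gray, with no gray pixel above or to the left
			if ( isGray( at( x, y ) ) &&
				( x == 0 || !isGray( at( x - 1, y ) ) ) &&
				( y == 0 || !isGray( at( x, y - 1 ) ) ) ) {
				int gw = 0, gh = 0; // measure the extent of the contiguous gray rectangle
				while ( x + gw < w && isGray( at( x + gw, y ) ) ) gw++;
				while ( y + gh < h && isGray( at( x, y + gh ) ) ) gh++;
				// read the bit pattern - assuming dark gray marks the set bits
				std::vector< std::vector< int > > bits( gh, std::vector< int >( gw ) );
				for ( int gy = 0; gy < gh; gy++ )
					for ( int gx = 0; gx < gw; gx++ )
						bits[ gy ][ gx ] = ( at( x + gx, y + gy )[ 0 ] == 69 ) ? 1 : 0;
				// ... do something with the glyph's bit pattern
			}
		}
	}
	stbi_image_free( image );
	return 0;
}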

  If desired, you can recover the identifying index from the drop shadow color, by adding the red value to the green value times 256, plus the blue value times 256 * 256. This treats the color data as a base-256 number system, and would allow you to get back the ordering from my original dataset, if for some reason you wanted that. There are no collisions with the gray levels, because the "dark" value is ( 69, 69, 69 ), the "light" value is ( 205, 205, 205 ), and none of the blue values get high enough to interfere.
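
  In code, taking the base-256 description at face value, the decode is just a couple of multiply-adds:

#include <cstdint>

// treat the shadow color's r, g, b channels as the digits of a base-256 number
uint32_t shadowIndex ( uint8_t r, uint8_t g, uint8_t b ) {
	return uint32_t( r ) + 256u * uint32_t( g ) + 65536u * uint32_t( b );
}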

full model

  This is the final output, a 5.5 megabyte, 7664 by 5597 png. I think this is a much better format for the data than the uint-encoded model in the JSON from last time: it is not only more space efficient, it's also less complex to parse. Going through all the image data does take significantly longer than parsing the JSON, but I think it's a real feature that this image-based format retains human readability. It will be a good replacement for keeping the hoard-of-bitfonts repo as a submodule in NQADE, and will be nicer to use for the next iteration of the Voraldo spaceship generator.


Last updated 1/31/2023