Writing C in 2024

To learn how it works, and because I don’t use Java, I decided to port Professor Michael Pound’s Java implementation of the Enigma machine and the accompanying fitness and analysis code.

The initial port was to Python. Unfortunately, this turned out to be one instance where Python was rather slow, because of the large number of settings that have to be tried.

So I decided to port it to C instead.

Initial port to Python

The initial port was to Python, using the same class-based approach as in Java. Unfortunately, this is one instance where Python was rather slow, mostly when looking for the best of the sixty possible rotor combinations, and especially when going through the 17576 possible starting positions for each rotor combination. Of course I used multiprocessing to spread the work over all available cores, but apart from that I did not see obvious ways to speed it up, e.g. by using numpy.

Why C and not C++

Some might ask why not C++. When C++ first came out I started using it, but over time its increasing complexity made me grow to dislike it. So I decided that C++ is not for me in my private projects. Other people will see that differently, and that is fine.

Getting re-acquainted with C

Since Python has been my go-to language for years, I have not written much C in the last decade or so. But of course I still have my templates for C code and Makefiles.

To get up to speed again, I did some reading on modern C.

That reading was a good refresher. I found the modern way of initializing structures and arrays especially useful, and I like stdint.h and stdbool.h.
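
If the modern initialization referred to here means C99’s designated initializers (my assumption), a small made-up illustration looks like this:

#include <stdbool.h>
#include <stdint.h>

// Hypothetical example, not taken from the Enigma code: members and
// array elements are initialized by name or index, everything else
// becomes zero.
typedef struct {
  int32_t position;
  bool at_notch;
} Example;

static const Example ex = {.position = 3, .at_notch = true};
static const int32_t offsets[26] = {[0] = 4, [1] = 10, [25] = 9};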

I’ve also enjoyed Chris Wellons’ blog articles about C coding.

The nature of this program means that it only requires statically allocated variables and arrays. This avoids a whole class of possible error conditions around dynamic memory allocation.

Where Java and Python use classes, one can use structures and functions in C. If you don’t require inheritance, that is relatively simple. In the case of the Enigma machine, composition is used instead of inheritance. For example, this is the definition of the Enigma machine in C:

typedef struct {
  Rotor rotors[3];
  Reflector reflector;
  int32_t plugboard[26];
} Enigma;

extern void enigma_rotate(Enigma *m);
extern void enigma_reset(Enigma *m);
extern void enigma_encrypt(Enigma *m, const char in[], char out[],
                           int32_t outlen);

It would be weird to say that an Enigma is a rotor, a reflector and a plugboard. Rather, it has rotors, a reflector and a plugboard.
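
In the function bodies, composition simply means operating on the members. Here is a minimal sketch of that idea; rotor_reset() is an assumed helper, not necessarily the function the actual port uses:

// Hypothetical sketch: the Enigma resets the rotors it owns by
// delegating to an assumed per-rotor helper.
void enigma_reset(Enigma *m) {
  for (int32_t j = 0; j < 3; j++) {
    rotor_reset(&m->rotors[j]);
  }
}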

Data structures

While Python has data structures such as lists and dictionaries (hash maps) built in, C only has arrays.

The analysis code uses bigrams and quadgrams to check whether a decryption is a good fit for English text. These would be naturally expressed as dictionaries, but C doesn’t have those.

Bigrams

The data for the English bigrams, trigrams and quadgrams comes from https://github.com/mikepound/enigma.git.

The beginning of the bigrams file looks like this:

TH,-1.449013409
HE,-1.512191427
IN,-1.61390336
ER,-1.688613896
AN,-1.702206436
RE,-1.73181457
ON,-1.754969661
AT,-1.827767223
EN,-1.837361388
ND,-1.868932911
TI,-1.872060209
ES,-1.873091732
OR,-1.893965891
TE,-1.919059941
OF,-1.92997127
ED,-1.932511289
IS,-1.947525422
IT,-1.949514395

So a bigram is a string of two letters, e.g. “AC”, and the value reflects its frequency in English text: higher (less negative) values mean the bigram occurs more often.

For fast lookup in C, an array is the best candidate. If the bigram values are stored in sorted order, calculating the index into the array is simple. Sorted order means that the values are stored in the sequence AA..AZ, BA..BZ, etc. The letter A has index 0 while Z has index 25. The index value for the bigram is simply 26¹×(index first letter) + 26⁰×(index second letter), where 26⁰ is of course 1.

This index value calculation can be considered a collision-free hash function.

Calculating any offset will take approximately the same amount of time, and accessing the actual value is just a single memory reference. So access time is O(1).

An important lesson here is that in C, a sorted array can be used as a simple hash map!
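
For example, looking up the score of a single bigram such as “AC” is just the index calculation followed by one array access, using the BIGRAMS array defined further down:

// 'A' - 'A' = 0 and 'C' - 'A' = 2, so “AC” lives at index 26 * 0 + 2 = 2.
float ac_score = BIGRAMS[26 * ('A' - 'A') + ('C' - 'A')];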

There are 26² = 676 possible bigrams. The bigrams file only contains 668 bigrams. That means there are 676-668 = 8 invalid bigrams. These will be given the value -9.522878745280337 as in the original code.

The Python script mkbigrams.py reads the bigrams file, and emits C code for a sorted array. This is of course a prime example where Python shines. Generating the table only has to be done once, so speed is not all that relevant. In any case, generating the bigram array only took around 0.02 s on my PC.

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
L2 = len(LETTERS)**2

print("// This code has been generated by mkbigrams.py.")
print("// Bigram values are stored in a sorted array.")
print("// The index is 26×first character offset + second character offset.")
print("// Invalid bigrams are represented by -9.522878745280337")
print(f"const float BIGRAMS[{L2}] = {{")
with open("bigrams") as f:
    bigramdict = dict(ln.strip().split(",") for ln in f)
index = 0
for j in LETTERS:
    for k in LETTERS:
        key = j + k
        value = bigramdict.get(key, "-9.522878745280337")
        if key == "ZZ":
            print(f"{value} // [{index}] = {key}")
        else:
            print(f"{value}, // [{index}] = {key}")
        index += 1
print("};")

The C code that computes the bigram score of a text is very simple thanks to the sorted array of data.

#include <stdint.h>
#include <string.h>

extern const float BIGRAMS[676];

float bigram_score(const char txt[]) {
  int32_t n = strnlen(txt, 1024);
  float total = 0.0;
  for (int32_t j = 0; j < (n - 1); j++) {
    // txt[j] - 65 maps 'A'..'Z' to 0..25.
    total += BIGRAMS[26 * (txt[j] - 65) + (txt[j + 1] - 65)];
  }
  return total;
}

Quadgrams

A quadgram has four letters, e.g. “ATIO”. The quadgrams file has 42170 quadgrams. The first valid quadgram is AAAA, and the last is ZZYI. The number of possible quadgrams is 26⁴ = 456976, so only 42170/456976 ≈ 9% of all possible quadgrams are valid. Invalid quadgrams are given the value -9.522878745280337, just like with bigrams.

Still, the total array of float values would only use 26⁴×4/1024 ≈ 1785 kiB of memory. This is a small amount for a PC in 2024! It is a small price to pay for O(1) lookup.

Here is the Python code that generates the C code for the QUADGRAMS array.

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
L4 = len(LETTERS)**4

print("// This code has been generated by mkquadgrams.py")
print("// Quadgram values are stored in a sorted array.")
print("// The index is 26⁴×first character offset + 26³×second character offset +")
print("//  26²×third character offset + fourth character offset.")
print("// Invalid quadgrams are represented by -9.522878745280337")
print(f"const float QUADGRAMS[{L4}] = {{")
with open("quadgrams") as f:
    quadgramdict = dict(ln.strip().split(",") for ln in f)
for j in LETTERS:
    for k in LETTERS:
        for m in LETTERS:
            for p in LETTERS:
                key = j + k + m + p
                value = quadgramdict.get(key, "-9.522878745280337")
                if key == "ZZZZ":
                    print(f"{value} // {key}")
                else:
                    print(f"{value}, // {key}")
print("};")

This code runs in approximately 0.4 s on my PC.

And the C code for the lookup is as follows.

extern const float QUADGRAMS[456976];

float quadgram_score(const char txt[]) {
  int32_t n = strnlen(txt, 1024);
  float total = 0.0;
  for (int32_t j = 0; j < (n - 3); j++) {
    int32_t idx = 17576 * (txt[j] - 65) + 676 * (txt[j + 1] - 65) +
                  26 * (txt[j + 2] - 65) + (txt[j + 3] - 65);
    total += QUADGRAMS[idx];
  }
  return total;
}

Time/memory trade off for quadgrams

If the data were too large for that, we could instead store only the valid quadgrams in a sorted array and use a binary search. This is the fastest general way to search a sorted array, taking O(log₂ n) comparisons; for an array of 42170 quadgrams, log₂(42170) ≈ 15.4.

Python code for this operation is shown below.

with open("quadgrams.sorted") as f:
    lines = f.readlines()
items = [ln.strip().split(",") for ln in lines]
quadkeys = tuple(j[0] for j in items)
quadvalues = tuple(j[1] for j in items)

def find_key(quad):
    left = 0
    right = len(quadkeys)-1
    mid = (left+right)//2
    count = 1
    while True:
        print(f"{count} left={left}, right={right}")
        v = quadkeys[mid]
        if quad < v:
            if quad > quadkeys[mid-1]:
               return -1
            right = mid
        elif quad > v:
            if quad < quadkeys[mid+1]:
               return -1
            left = mid
        elif quad == v:
            return mid
        mid = (left+right)//2
        count += 1

Let’s try it:

In [23]: find_key('OUGH')
1 left=0, right=42170
2 left=21085, right=42170
3 left=21085, right=31627
4 left=26356, right=31627
5 left=26356, right=28991
6 left=27673, right=28991
7 left=27673, right=28332
8 left=27673, right=28002
9 left=27837, right=28002
10 left=27837, right=27919
11 left=27878, right=27919
12 left=27898, right=27919
13 left=27898, right=27908
14 left=27903, right=27908
Out[23]: 27905

In [24]: quadkeys[27905]
Out[24]: 'OUGH'

In [25]: find_key('QQQQ')
1 left=0, right=42170
2 left=21085, right=42170
3 left=21085, right=31627
4 left=26356, right=31627
5 left=28991, right=31627
6 left=28991, right=30309
7 left=29650, right=30309
8 left=29650, right=29979
9 left=29650, right=29814
10 left=29732, right=29814
11 left=29732, right=29773
12 left=29752, right=29773
13 left=29752, right=29762
14 left=29752, right=29757
15 left=29754, right=29757
Out[25]: -1

In [26]: quadkeys[29754], quadkeys[29755], quadkeys[29756], quadkeys[29757]
Out[26]: ('QING', 'QQAR', 'QSAR', 'QTIP')

In [27]: find_key('AAAA')
1 left=0, right=42170
2 left=0, right=21085
3 left=0, right=10542
4 left=0, right=5271
5 left=0, right=2635
6 left=0, right=1317
7 left=0, right=658
8 left=0, right=329
9 left=0, right=164
10 left=0, right=82
11 left=0, right=41
12 left=0, right=20
13 left=0, right=10
14 left=0, right=5
15 left=0, right=2
16 left=0, right=1
Out[27]: 0

It generally takes 14 to 16 iterations to find the answer, which agrees nicely with the expected log₂(42170) ≈ 15.4.

If we want to do this in C, we are not going to compare strings, but integer keys. These can be calculated as follows:

In [31]: sum((ord(c)-65)*26**n for n, c in enumerate("AAAA"[::-1], start=0))
Out[31]: 0

In [32]: sum((ord(c)-65)*26**n for n, c in enumerate("TION"[::-1], start=0))
Out[32]: 339729

In [33]: sum((ord(c)-65)*26**n for n, c in enumerate("ATIO"[::-1], start=0))
Out[33]: 13066

In [34]: sum((ord(c)-65)*26**n for n, c in enumerate("ZZZZ"[::-1], start=0))
Out[34]: 456975

In [35]: sum((ord(c)-65)*26**n for n, c in enumerate("OUGH"[::-1], start=0))
Out[35]: 259747

In [36]: sum((ord(c)-65)*26**n for n, c in enumerate("QING"[::-1], start=0))
Out[36]: 286968

This approach requires 42170*4/1024 ≈ 165 kiB for the sorted indexes, and another 165 kiB for the sorted float values.

Here is the Python code that generates the tables.

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

with open("quadgrams") as f:
    items = [ln.strip().split(",") for ln in f]
# Very important: sort by quadgrams!
items.sort(key=lambda x: x[0])
keys = [j[0] for j in items]
values = [j[1] for j in items]
L = len(keys)

print("// This code has been generated by mkquadgrams2.py")
print("// It contains two arrays, both sorted by the quadgrams")
print("// Quadgram keys are stored in QUADKEYS.")
print("// Quadgram values are stored in QUADVALUES.")
print("// The index is 26⁴×first character offset + 26³×second character offset +")
print("//  26²×third character offset + fourth character offset.")
print("#include <stdint.h>")
print()

print(f"const int32_t QUADKEYS[{L}] = " + "{")
for key in keys[:-1]:
    keyval = sum((ord(c)-65)*26**n for n, c in enumerate(key[::-1], start=0))
    print(f"  {keyval:6d}, // {key}")
# Last key has no “,” after the keyval!
keyval = sum((ord(c)-65)*26**n for n, c in enumerate(keys[-1][::-1], start=0))
print(f"  {keyval:6d}  // {keys[-1]}")
print("};")
print()

print(f"const float QUADVALUES[{L}] = " + "{")
for value in values[:-1]:
    print(f"  {value},")
print(f"  {values[-1]}")
print("};")

The time needed to generate these tables is around 0.2 s on my machine.

The lookup algorithm translated into C is shown below.

// Binary lookup in a *sorted* array of int32_t.
// Returns the index of “value”, or -1 if “value” is not in the array.
int32_t array_lookup(int32_t value, const int32_t array[], int32_t arraylen) {
    int32_t left = 0, right = arraylen - 1;
    while (left <= right) {
        int32_t mid = left + (right - left) / 2;
        int32_t midval = array[mid];
        if (value == midval) {
            return mid;
        }
        if (value < midval) {
            if (mid > 0 && value > array[mid - 1]) {
                // Value falls between two adjacent keys: not present.
                return -1;
            }
            right = mid - 1;
        } else {
            if (mid < arraylen - 1 && value < array[mid + 1]) {
                // Value falls between two adjacent keys: not present.
                return -1;
            }
            left = mid + 1;
        }
    }
    // Value is not present in the array.
    return -1;
}

The fitness function based on this is shown below.

extern const int32_t QUADKEYS[42171];
extern const float QUADVALUES[42171];

float quadgram_score2(const char txt[]) {
    int32_t n = strnlen(txt, 1024);
    float total = 0.0;
    for (int32_t j = 0; j < (n - 3); j++) {
        int32_t val = 17576 * (txt[j] - 65) + 676 * (txt[j + 1] - 65) +
                    26 * (txt[j + 2] - 65) + (txt[j + 3] - 65);
        // 26³ = 17576, 26² = 676
        int32_t idx = array_lookup(val, QUADKEYS, 42171);
        if (idx != -1) {
            total += QUADVALUES[idx];
        }
    }
    return total;
}

This solution uses approximately 18.5% of the memory needed for the single full table. As a consequence, a lookup theoretically takes around 15 comparisons instead of a single array access. In practice, it was around 11× slower than the direct lookup, so in this case the first solution was used. But it is a valuable technique for datasets that would otherwise not fit in memory.

Using all available cores

It was a pleasant surprise how relatively easy it was to run code on multiple cores using OpenMP. Of course I had to read up on how it works. In the end, all it required in this case was two #pragma statements and a compiler flag. From my perspective, parallelizing a loop in C using OpenMP is easier than using multiprocessing in Python.

OpenMP was used when testing all the rotor combinations and start positions. Each of these calculations is independent of the others, so this is a good candidate for parallelization. Parallelizing reduced the runtime by 3/4, which is the best possible reduction on my machine with four cores.
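
To give an impression of what such a parallel loop can look like, here is a sketch with made-up names (score_position() stands in for the real fitness evaluation); it is not the actual code. It compiles with cc -O2 -fopenmp sketch.c:

#include <stdint.h>
#include <stdio.h>

// Stand-in for scoring one of the 26³ = 17576 starting positions.
static float score_position(int32_t pos) { return (float)(pos % 26); }

int main(void) {
  static float scores[17576];
  // The pragma splits the independent iterations over all available cores.
#pragma omp parallel for
  for (int32_t pos = 0; pos < 17576; pos++) {
    scores[pos] = score_position(pos);
  }
  // Pick the best result on a single thread afterwards.
  int32_t best = 0;
  for (int32_t pos = 1; pos < 17576; pos++) {
    if (scores[pos] > scores[best]) {
      best = pos;
    }
  }
  printf("best starting position: %d\n", best);
  return 0;
}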

The compiler in the FreeBSD base system (clang 18) supports it out of the box. The gcc13 compiler from ports also works. However, when compiled with identical flags, the runtime of the gcc OpenMP code is double that of the clang code, even though both were using four threads according to ps. Since I use the compiler from the FreeBSD base system by default, I have not investigated this further.

Lessons learned

  • The vim editor has very good support for C coding.
  • Where possible, use statically or automatically allocated arrays and structures.
  • Sorted C arrays can be used as simple dictionaries.
  • Using a struct plus associated functions works fine where object orientation would be a good fit. Composition can be used instead of inheritance.
  • OpenMP is easy to use from C and works well in suitable cases.
  • Unlike the old indent, clang-format is a decent tool for getting consistent formatting of modern C code. I use the “google” style, but with sorting of include files disabled.
  • Python is very well suited to generating static data in the form of C structures and arrays.

All in all, it was a much more pleasurable experience than I anticipated.


For comments, please send me an e-mail.

