CUDA VERSUS T52AB UNKNOWN PLAINTEXT

Following on from the attack on known plaintext for a T52ab Sturgeon cipher machine, this post describes an attack with unknown plaintext. The known plaintext attack relied on a brute-force application of all possible XOR wheel settings to the plaintext. This created a representation of the state halfway through encryption (prior to the permutation). Since this intermediate five-bit state was only subject to a permutation by the rest of the key, the number of set bits in the intermediate state will match the number of set bits in the ciphertext, character by character. With about 20 characters, this generates a unique result and therefore solves the XOR part of the key. An attack without known plaintext cannot use the same method as the machine applies the XOR to the plaintext. The ciphertext cannot be treated with the XOR as it would have to be permuted first, and the permutation is unknown.

However if we look at a candidate XOR stream and find a place where the 5-bit XOR applied to a character has either no bits set or all bits set, then the bit permutation is irrelevant in terms of deciding how the XOR will act as the XOR is the same for each bit. This does not quite return us to the same state as the known plaintext attack – in the known plaintext attack we could check that the number of bits matched between an XOR’d-plaintext and a ciphertext character. But here we do not have plaintext characters. All we can do is statistical – rely on the fact that plaintext contains a bias in the expected number of set bits. Fortunately it does. T52 relies on the standard teleprinter alphabet which tends to favour a low number of set bits for common letters.

The disadvantage of relying on XOR characters that have either none or all bits set is that these occur at only about 1/16th of the points in the keystream. We can improve matters by observing that a similar effect occurs when the CT itself has all or no bits set. Again this occurs 1/16th of the time on average, although some will clash with the known cases, so a suitable case occurs in about 12% of the message characters. Despite this, the test is quite powerful and works for typical T52 message lengths (1000-2000 characters). However we can improve.

If we know, say, that there is one bit set in the XOR and one bit set in the ciphertext, there are then only two possible situations. If the two bits align after the unknown permutation (p=0.2), this will result in zero bits set in the plaintext. Conversely if they do not align (p=0.8), there will be two bits set in the plaintext. Without knowing which of these two cases is right, we can use the probabilities to aggregate a likelihood estimate. Similar situations, with different probabilities, exist for each combination of the number of bits set in the ciphertext and XOR characters. The all-or-none-set case is simply a special case, albeit the most powerful one.

The algorithm, as might be expected, does need significantly more characters than the known plaintext solution (hundreds as opposed to just twenty). Somewhat counterintuitively, it operates faster than the known plaintext algorithm (12s-30s per partition on a P106-100 for 1000 character text, partitions vary in size). This is mostly because not knowing the internal permutation does not get ‘double counted’ against not knowing the ordering of the XOR wheels. In the known plaintext case, we must apply the right XOR and so must not just use the right partition of XOR wheels, we must use the right order. This involves permuting the choice of five wheels (i.e. a factor of 120). This permutation is distinct from the permutation the machine does internally. For the unknown plaintext case, the algorithm is such that while these are different permutations, their net effect is of one permutation. It might also be argued that the unknown plaintext algorithm does less : it can determine the right XOR wheels and their position, but only as a set of wheels, it does not determine their ordering.

Taken together the fasted solution to a likely message (1000+ characters of which a short preamble could be guessed) would be to determine the wheel partition using the Unknown plaintext algorithm and then the ordering using the known plaintext algorithm.

Cuda implementation. In this implementation, the keystream for the specified set of XOR wheels is generated in a first step. However unlike the known plaintext case where the full keystream was stored, all that is required here is the number of set bits at each position in the stream. This fits in under 0.7GB, reasonable with current GPU memories. The second step involves calculating from every startpoint in the stream, the log likelihood score of those XORs, given each position in the ciphertext. While the correct key is not guaranteed to have the highest score, it is likely to be a high scoring result – and this likelihood obviously increases with ciphertext length. Other high-scoring keys are likely to be similar to the true key (which is why a hill-climbing attack may be preferable to this brute force attack).

The CUDA kernels are short and simple and the code for this attack is available on GitHub.

CUDA versus T52ab Known Plaintext

The T52ab cipher machine has a basic structure shown here (the Permutation Circuit, F-K, is configurable, and only one possible configuration is shown):

A plaintext character is represented by five bits (Pn). Each bit is first XOR’d with a keystream bit (A-E) and then the bits are permuted based on other keystream bits (F-K). Ten wheels of known patterns are assigned, in a key-dependent way, to either one of five XOR controls or one of five switches that can swap the lanes with one another. The switches are also arranged in a key-dependent way (e.g. although the above figure does not show a direct swap between the lanes 0 and 2, a different key setting exists that could have selected this permutation). The ten wheels are each also at some key-dependent offset in their cycles and therefore provide a bitstream, one bit per wheel at each character in the text. The wheels are each of a relatively-prime length to each other, and this ensures that while they all advance one step at each encryption, the cycle length is large.

At the point where the Pn have each passed through an XOR (A-E) they are then solely subject to an unknown permutation to create the ciphertext (C). It follows that the number of set bits at this intermediate stage must be the same as the number of set bits in the ciphertext. This provides an opportunity for a known plaintext attack to solve the five XOR wheels independently from the permutation/switch wheels, and the permutation circuit. It therefore solves a part of the key, independently of the remainder, as part of a divide-and-conquer attack.

There is the necessity for a suitable amount of known plaintext to make this attack give a unique solution, as the bitcount could match by chance. Experiments suggest around 22 characters, which need not be contiguous, are enough.

Using a GPU seems suited to carrying out this attack, as there is little data transfer to and from the GPU, and instead the vast majority of the work takes place on the GPU. I’ve put together a CUDA implementation on that basis. The use of CUDA in cryptanalysis is discussed in a new Chapter in the CrypTool book.

The keyspace in question is firstly the 30,240 ways in which five specific wheels from the ten wheels could be connected to the XOR controls. This is based on 5C10 = 252 partitions, and 5! = 120 ways the wheels can be ordered. Each wheel is also at an unknown offset. The keysize here depends on the length of the specific wheels. A rough approximation takes the average length, c.64. Given there are five wheels, the keyspace from this component is c.645 = 230. Since 30,240 is approximately 215 the total keyspace to be solved is c.245.

On the basis that the necessary length of the known plaintext to achieve the attack with tolerable false alarms is found to be probably a little over 20 characters, depending on the work required to resolve false alarms, the warp-size of 32 (currently used by Nvidia GPUs) was used to attack the whole text in parallel. i.e. all 32 characters are tested simultaneously by different threads in the warp. A mask allows fine-grained control of this. For instance, if only 20 characters of text are available, the highest 12 threads are disregarded. Alternatively, if the known plaintext is non-contiguous (e.g. if it were know to have a standard preamble, an unknown block of defined length, and a known suffix) the unknown characters can be omitted from the test based on the mask.

Secondly, the program is designed to run for one specific partition of the wheels. Internally it generates the 120 orderings of the wheels in the partition, and tests all possible offsets. Therefore each invocation of the program tests c.120 x 230 = c237 keys, and runs in between 30s and 90s on a P106-100 GPU (a GeForce 1060 equivalent). The time depends on the product of the wheel lengths in the partition, so can vary noticeably.

The keystream we have generated is for some arbitrary order of the five XOR wheels. To address the 120 permutations of the selected XOR wheels, the code in fact pre-permutes the plaintext into 120 copies and applies the same key to each of these. Since it is the bit count of the result that matters, the approach is equivalent to applying the correct key to the correct plaintext. The advantage is that the pre-computation need only be done once.

It might seem that with five bits per character, only 32 permutation are possible, making the 120 superfluous. However we are seeking the permutation from the 120 where the result matches for every plaintext position. In that case, we have to test all 120. This is done as a serial operation.

The program first generates the keystream. This is (at worst, with the longest wheels) about 1.6×109 steps long. With packing it fits into a little over 1GB, suitable for a GPU memory. The program then invokes a warp for every startpoint on that keystream. The warp considers 32 characters of the plaintext simultaneously, including the keystream advanced to match.

This might not be the most efficient attack against known plaintext, but the advantage of a brute-force solution is that it is deterministic and the true key will be found. The only question is the number of false alarms that remain.

The code for this attack is available on GitHub.