The jitterentropy library is written by Stephan Mueller, is available at https://github.com/smuellerDD/jitterentropy-library, and is documented at http://www.chronox.de/jent.html. In Zircon, it's used as a simple entropy source to seed the system CPRNG.
The companion document about jitterentropy's basic configuration options describes two options that fundamentally affect how jitterentropy runs. This document instead describes the numeric parameters that control how fast jitterentropy is and how much entropy it collects, without fundamentally altering its principles of operation. It also describes how to test various parameters and what to look for in the output (e.g. when adding support for a new device, or to do a more thorough job of optimizing the parameters).
A rundown of jitterentropy's parameters
The following tunable parameters control how fast jitterentropy runs, and how fast it collects entropy:
kernel.jitterentropy.ll ("ll" stands for "LFSR loops"): jitterentropy uses a (deliberately inefficient implementation of an) LFSR to exercise the CPU, as part of its noise generation. The inner loop shifts the LFSR 64 times; the outer loop repeats kernel.jitterentropy.ll times.
In my experience, the LFSR code significantly slows down jitterentropy, but doesn't generate very much entropy. I tested this on RPi3 and qemu-arm64 with qualitatively similar results, but it hasn't been tested on x86 yet. This is something to consider when tuning: using fewer LFSR loops tends to lead to better overall performance.
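To make the loop structure concrete, here is a minimal sketch of the shape described above: an inner loop that shifts a 64-bit LFSR 64 times, wrapped in an outer loop that repeats ll times. The feedback taps and the function name are placeholders for illustration; they are not jitterentropy's actual polynomial or code.

```cpp
#include <cstdint>

// Illustrative sketch only: inner loop shifts a 64-bit LFSR 64 times,
// outer loop repeats ll (i.e. kernel.jitterentropy.ll) times.
// The taps (63, 61, 60, 0) are arbitrary placeholders, not the real polynomial.
uint64_t lfsr_exercise(uint64_t state, unsigned ll) {
    for (unsigned outer = 0; outer < ll; ++outer) {
        for (int i = 0; i < 64; ++i) {
            uint64_t feedback =
                ((state >> 63) ^ (state >> 61) ^ (state >> 60) ^ state) & 1;
            state = (state << 1) | feedback;  // shift left, feed back one bit
        }
    }
    return state;
}
```

The point is only the cost model: the work is linear in ll, which is why shrinking ll speeds up jitterentropy roughly proportionally.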
Note that setting kernel.jitterentropy.ll=0 causes jitterentropy to choose the number of LFSR loops in a "random-ish" way. As described in the basic config doc, I discourage the use of this option.
kernel.jitterentropy.ml ("ml" stands for "memory access loops"): jitterentropy walks through a moderately large chunk of RAM, reading and writing each byte. The size of the chunk and the access pattern are controlled by the two parameters below. The memory access loop is repeated kernel.jitterentropy.ml times.
In my experience, the memory access loops are a good source of raw entropy. Again, I've only tested this on RPi3 and qemu-arm64 so far.
Like kernel.jitterentropy.ll, if you set kernel.jitterentropy.ml=0, then jitterentropy will choose a "random-ish" value for the memory access loop count. I discourage this, too.
kernel.jitterentropy.bs ("bs" stands for "block size"): jitterentropy divides its chunk of RAM into blocks of this size. The memory access loop starts with byte 0 of block 0, then "byte -1" of block 1 (which is actually the last byte of block 0), then "byte -2" of block 2 (i.e. the second-to-last byte of block 1), and so on. This pattern ensures that every byte gets hit, and that most consecutive accesses land in different blocks.
I have usually tested jitterentropy with
kernel.jitterentropy.bs=64, based on the size of a cache
line. I haven't tested yet to see whether there's a better option on some/all platforms.
kernel.jitterentropy.bc ("bc" stands for "block count"): jitterentropy uses this many blocks of RAM, each of size kernel.jitterentropy.bs, in its memory access loops.
Since I choose kernel.jitterentropy.bs=64, I usually choose kernel.jitterentropy.bc=1024. This means using 64KB of RAM, which is enough to overflow the L1 cache.
The jitterentropy source code, in the comment before jent_memaccess, suggests choosing the block size and count so that the RAM used is bigger than L1. Confusingly, the default values in upstream jitterentropy (block size = 32, block count = 64) aren't big enough to overflow L1.
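The arithmetic behind that observation is easy to check. The 32KB L1 data cache size below is an assumption (common, but not universal); the point is only the comparison.

```cpp
#include <cstddef>

// Buffer sizes implied by the parameter choices discussed above.
constexpr std::size_t kUpstreamDefault = 32 * 64;    // bs = 32, bc = 64   -> 2KB
constexpr std::size_t kTunedChoice     = 64 * 1024;  // bs = 64, bc = 1024 -> 64KB
constexpr std::size_t kAssumedL1       = 32 * 1024;  // assumed 32KB L1 data cache

// Upstream defaults fit comfortably inside L1; the tuned choice overflows it.
static_assert(kUpstreamDefault < kAssumedL1, "upstream defaults do not overflow L1");
static_assert(kTunedChoice > kAssumedL1, "tuned values overflow L1");
```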
The tuning process

The basic idea is simple: on a particular target device, try different values for the parameters. Collect a large amount of data for each parameter set (ideally around 1MB), then run the NIST test suite to analyze the data. Determine which parameters give the best entropy per unit time. The time taken to draw the entropy samples is logged on the system under test.
One complication is the startup testing built into jitterentropy. This essentially draws and discards 400 samples, after performing some basic analysis (mostly making sure that the clock is monotonic and has a high enough resolution and variability). A more accurate test would reboot twice for each set of parameters: once to collect around 1MB of data for analysis, and a second time to boot with the "right" amount of entropy (as computed according to the entropy estimate in the first phase, with appropriate safety margins, etc. See "Determining the entropy_per_1000_bytes statistic", below). This second phase of testing simulates a real boot, including the startup tests. After completing the second phase, choose the parameter set that boots fastest. Of course, each phase of testing should be repeated a few times to reduce random variations.
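One way to make "best entropy per unit time" concrete: for each parameter set, multiply the measured min-entropy per byte by the sample size, divide by the collection time, and keep the maximum. The helper below is hypothetical (the struct and function names are invented), but it captures the figure of merit:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical summary of one test run; field names are illustrative.
struct RunResult {
    double entropy_bits_per_byte;  // min-entropy estimate from the NIST suite
    double bytes_collected;        // size of the data sample
    double seconds;                // collection time logged on the system under test
};

// Entropy collected per second of runtime.
double entropy_rate(const RunResult& r) {
    return r.entropy_bits_per_byte * r.bytes_collected / r.seconds;
}

// Index of the parameter set with the best entropy per unit time.
std::size_t best_run(const std::vector<RunResult>& runs) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < runs.size(); ++i) {
        if (entropy_rate(runs[i]) > entropy_rate(runs[best])) best = i;
    }
    return best;
}
```

Note that this ranks only the first phase; the second (boot-time) phase described above still decides the final winner.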
Determining the entropy_per_1000_bytes statistic
The crypto::entropy::Collector interface requires a parameter, entropy_per_1000_bytes, from its instantiations. The value relevant to jitterentropy is currently hard-coded in jitterentropy_collector.cpp.
This value is meant to measure how much min-entropy is contained in each byte of data produced by
jitterentropy (since the bytes aren't independent and uniformly distributed, this will be less than
8 bits). The "per 1000 bytes" part simply makes it possible to specify fractional amounts of
entropy, like "0.123 bits / byte", without requiring fractional arithmetic (since floating-point math is disallowed in kernel code, and fixed-point arithmetic is confusing).
The value should be determined by using the NIST test suite to analyze random data samples, as described in the entropy quality tests document.
The test suite produces an estimate of the min-entropy; repeated tests of the same RNG have (in my
experience) varied by a few tenths of a bit (which is pretty significant when entropy values can be
around 0.5 bits per byte of data!). After getting good, consistent results from the test suites,
apply a safety factor (e.g. divide the entropy estimate by 2), and update the value of
entropy_per_1000_bytes (don't forget to multiply by 1000).
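A worked example, with invented numbers: suppose the suite consistently reports about 0.5 bits of min-entropy per byte. Halving for safety gives 0.25 bits/byte, so entropy_per_1000_bytes would be 250.

```cpp
#include <cstdint>

// Worked example (numbers invented): convert a measured min-entropy estimate
// into the integer entropy_per_1000_bytes value.
constexpr double kMeasuredBitsPerByte = 0.5;  // hypothetical NIST suite result
constexpr double kSafetyFactor = 2.0;         // divide the estimate by 2
constexpr uint64_t kEntropyPer1000Bytes =
    static_cast<uint64_t>(kMeasuredBitsPerByte / kSafetyFactor * 1000.0);

static_assert(kEntropyPer1000Bytes == 250, "0.5 / 2 * 1000 == 250");
```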
Note that eventually
entropy_per_1000_bytes should probably be configured somewhere instead of
hard-coded in jitterentropy_collector.cpp. Kernel cmdlines or even a preprocessor symbol could work.
Notes about the testing script
The scripts/entropy-test/jitterentropy/test-tunable script automates the practice of looping through a large test matrix. The downside is that tests run in sequence on a single machine, so (1) an error stalls the whole pipeline, meaning supervision is required, and (2) the machine is constantly rebooted rather than cold-booted (plus it's a netboot reboot), which could conceivably confound the tests. Still, it beats hitting power-off/power-on a thousand times by hand!
Some happy notes:
When netbooting, the script leaves bootserver running while waiting for netcp to successfully export the data file. If the system hangs, you can power it off and back on, and the existing bootserver process will restart the failed test.
If the test is going to run (say) 16 combinations of parameters 10 times each, it will go like this:
    test # 0: ml = 1    ll = 1   bc = 1   bs = 1
    test # 1: ml = 1    ll = 1   bc = 1   bs = 64
    test # 2: ml = 1    ll = 1   bc = 32  bs = 1
    test # 3: ml = 1    ll = 1   bc = 32  bs = 64
    ...
    test #15: ml = 128  ll = 16  bc = 32  bs = 64
    test #16: ml = 1    ll = 1   bc = 1   bs = 1
    test #17: ml = 1    ll = 1   bc = 1   bs = 64
    ...
(The output files are numbered starting with 0, so I started with 0 above.)
So, if test #17 fails, you can delete tests #16 and #17, and re-run 9 more iterations of each test. You can at least keep the complete results from the first iteration. In theory, the tests could be smarter and also keep the existing result from test #16, but the current shell scripts aren't that sophisticated.
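With 16 combinations per iteration, the numbering above is simply output_index = iteration * 16 + combination, so recovering which run a failed file belongs to is a division and a remainder. A tiny helper (hypothetical; the scripts contain nothing like this) makes the rule concrete:

```cpp
#include <cstddef>
#include <utility>

// Map a sequential output-file number to (iteration, combination), given the
// number of parameter combinations in the matrix. Hypothetical helper.
std::pair<std::size_t, std::size_t> locate(std::size_t output_index,
                                           std::size_t combinations) {
    return {output_index / combinations, output_index % combinations};
}
```

For example, with 16 combinations, test #17 is iteration 1, combination 1, i.e. the same parameter set as test #1 in the listing above.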
The scripts don't do a two-phase process like I suggested in the "Tuning process" section above. It's certainly possible, but again the existing scripts aren't that sophisticated.
How much do we trust the low-entropy extreme?
It's a priori possible that we maximize entropy per unit time by choosing small parameter values. The most extreme case is of course ll=1, ml=1, bs=1, bc=1, but even something like ll=1, ml=1, bs=64, bc=32 is the kind of thing I have in mind. Part of the concern is the variability in the test suite: if hypothetically the tests are only accurate to within 0.2 bits of entropy per byte, and they're reporting 0.15 bits of entropy per byte, what do we make of it? Hopefully running the same test a few hundred times in a row will reveal a clear modal value, but it's still a little risky to rely on that low estimate being accurate.
The NIST publication states (line 1302, page 35, second draft) that the estimators "work well when the entropy-per-sample is greater than 0.1". This is fairly low, so hopefully it isn't an issue in practice. Still, the fact that there is a lower bound means we should probably leave a fairly conservative envelope around it.
How device-dependent is the optimal choice of parameters?
There's evidently a significant difference in the actual "bits of entropy per byte" metric on different architectures and different hardware. Is it possible that most systems are near-optimal at similar parameter values (so that we can just hard-code these values into kernel/lib/crypto/entropy/jitterentropy_collector.cpp)? Or do we need to put the parameters into MDI or into a preprocessor macro, so that we can use different defaults on a per-platform basis (or at whatever level of granularity is appropriate)?
Can we even record optimal parameters with enough granularity?
I mentioned it above, but one of our targets is "x86", which is what runs on any x86 PC. Naturally, x86 PCs can vary quite a bit. Even if we did something like add preprocessor symbols JITTERENTROPY_LL_VALUE etc. to the build, customized per target, could we pick a good value for all PCs?
If not, what are our options?
We could store a lookup table based on values accessible at runtime (like the exact CPU model, the core memory size, cache line size, etc.). This seems rather unwieldy. Maybe if we could find one or two simple properties to key off of, say "CPU core frequency" and "L1 cache size", we could make this relatively non-terrible.
We could try an adaptive approach: monitor the quality of the entropy stream, and adjust the parameters accordingly on the fly. This would take a lot of testing and justification before we could trust it.
We could settle for "good enough" parameters on most devices, with the option to tune via kernel cmdlines or a similar mechanism. This seems like the most likely outcome to me. I expect that "good enough" parameters will be easy to find, and not disruptive enough to justify extreme solutions.
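To illustrate the lookup-table option above, here is a sketch keyed on L1 data cache size alone. Everything in it — the struct, the function, the threshold values, and the choice of keying property — is hypothetical, not a measured recommendation.

```cpp
#include <cstddef>

// Hypothetical per-device parameter record; field names mirror the
// kernel.jitterentropy.* options discussed above.
struct JitterentropyParams {
    unsigned ll, ml, bs, bc;
};

// Pick parameters by L1 data cache size (in bytes). Placeholder values only:
// the idea is to scale bc so the buffer always overflows L1.
JitterentropyParams params_for_l1(std::size_t l1_bytes) {
    if (l1_bytes <= 16 * 1024) return {1, 32, 64, 512};
    if (l1_bytes <= 32 * 1024) return {1, 32, 64, 1024};
    return {1, 32, 64, 2048};
}
```

Even this one-key table shows why the approach is unwieldy: each row needs its own round of per-device testing to justify.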