Generating random numbers from quantum hardware is only half the challenge. The other half is proving they're actually random. TrueEntropy runs every batch of quantum entropy through 7 core statistical tests from the NIST SP 800-22 test suite before it enters the entropy pool. This article explains what each test does, why it matters, and what we learned building the pipeline.
What Is NIST SP 800-22?
NIST Special Publication 800-22 is a statistical test suite published by the National Institute of Standards and Technology. It was designed to evaluate the randomness of binary sequences produced by hardware and software random number generators. The full specification defines 15 tests, each targeting a different statistical property of a random sequence.
Each test produces a p-value - the probability that a truly random sequence would produce a result as extreme as the one observed. If the p-value falls below a significance threshold (α), the sequence fails that test. We use α = 0.01, meaning a truly random sequence will be wrongly rejected by a given test about 1% of the time. A sequence must pass all 7 of our implemented tests to be admitted to the entropy pool.
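In code, the admission rule is just a threshold check over every test's p-value. The sketch below is illustrative - `batch_passes` and the dictionary shape are conventions for this post, not our internal API:

```python
# Illustrative admission gate: a batch enters the entropy pool only if
# every test's p-value clears the significance threshold.
ALPHA = 0.01

def batch_passes(p_values: dict) -> bool:
    # p >= ALPHA: consistent with randomness; p < ALPHA: that test fails.
    return all(p >= ALPHA for p in p_values.values())
```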
The 7 Tests We Run
1. Frequency (Monobit) Test
The most fundamental test. It counts the number of ones and zeros in the entire sequence and checks whether the proportion is close to 50/50. A biased generator - one that produces more ones than zeros, or vice versa - fails this test immediately.
What it catches: Systematic bias in qubit measurement. If a qubit is not properly initialised or the Hadamard gate is miscalibrated, the output will skew toward 0 or 1. The Frequency test is the first line of defence against hardware calibration issues.
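The whole test is a few lines. This stdlib-only sketch assumes the batch is a Python string of '0'/'1' characters (our representation for this post, not something the spec mandates):

```python
import math

def monobit_p_value(bits: str) -> float:
    """Frequency (Monobit) test: p = erfc(|S_n| / sqrt(2n))."""
    n = len(bits)
    # Map each bit to +1/-1 and sum; a balanced sequence sums near zero.
    s = sum(1 if b == "1" else -1 for b in bits)
    return math.erfc(abs(s) / math.sqrt(2.0 * n))
```

A perfectly balanced sequence gives p = 1.0; a heavily biased one gives a p-value near zero and fails at α = 0.01.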
2. Frequency Within a Block Test
Divides the sequence into blocks of a fixed length and checks whether the proportion of ones in each block is approximately 50%. This is more sensitive than the Monobit test because a sequence could have perfect global balance while containing locally biased regions.
What it catches: Block-level patterns. During development, we discovered that our bitstring extraction code was grouping identical measurement outcomes together before concatenation. The global Frequency test passed, but Block Frequency failed because identical bitstrings clustered in adjacent blocks. The fix was to shuffle the expanded bitstrings before concatenation - a critical lesson in how output ordering affects statistical quality.
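A sketch of the computation, again assuming a string of '0'/'1' bits. The block length of 128 is a common choice for sequences of this size; the `igamc` helper computes the regularized upper incomplete gamma function Q(a, x) for half-integer a via the standard recurrence, so the example stays stdlib-only:

```python
import math

def igamc(a: float, x: float) -> float:
    # Regularized upper incomplete gamma Q(a, x) for a a positive
    # multiple of 1/2, built up from the closed-form base cases via
    # Q(a, x) = Q(a - 1, x) + x**(a-1) * exp(-x) / gamma(a).
    if a == 0.5:
        return math.erfc(math.sqrt(x))
    if a == 1.0:
        return math.exp(-x)
    return igamc(a - 1.0, x) + x ** (a - 1.0) * math.exp(-x) / math.gamma(a)

def block_frequency_p_value(bits: str, block_size: int = 128) -> float:
    """Frequency Within a Block test: chi^2 over per-block bias."""
    n_blocks = len(bits) // block_size  # trailing partial block discarded
    chi_sq = 0.0
    for i in range(n_blocks):
        block = bits[i * block_size:(i + 1) * block_size]
        pi = block.count("1") / block_size
        chi_sq += (pi - 0.5) ** 2
    chi_sq *= 4.0 * block_size
    return igamc(n_blocks / 2.0, chi_sq / 2.0)
```

A sequence that is all ones in its first half and all zeros in its second half passes the Monobit test with p = 1.0 but fails this one outright - exactly the failure mode described above.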
3. Runs Test
A "run" is an uninterrupted sequence of identical bits - for example, 1111 is a run of four ones. The Runs test checks whether the number of runs in the sequence matches what you'd expect from a truly random source. Too few runs suggests the sequence is "sticky" (long stretches of the same bit); too many suggests it alternates too rapidly.
What it catches: Correlation between consecutive bits. If the quantum measurement process introduces temporal correlation - one measurement influencing the next - the Runs test will detect it.
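A sketch of the Runs test on a '0'/'1' string. The precondition check comes from the spec: if the sequence is too biased, the runs statistic is meaningless, so the test fails immediately:

```python
import math

def runs_p_value(bits: str) -> float:
    """Runs test: compares observed runs to the expected count."""
    n = len(bits)
    pi = bits.count("1") / n
    # Spec precondition: the sequence must be close enough to balanced,
    # otherwise report failure without computing the statistic.
    if abs(pi - 0.5) >= 2.0 / math.sqrt(n):
        return 0.0
    # Number of runs = 1 + number of positions where the bit changes.
    v_obs = 1 + sum(bits[i] != bits[i + 1] for i in range(n - 1))
    num = abs(v_obs - 2.0 * n * pi * (1.0 - pi))
    den = 2.0 * math.sqrt(2.0 * n) * pi * (1.0 - pi)
    return math.erfc(num / den)
```

Note the two failure directions: "0011" repeated has exactly the expected number of runs and passes, while "01" repeated alternates on every bit - far too many runs - and fails.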
4. Longest Run of Ones in a Block Test
Divides the sequence into blocks and finds the longest run of ones within each block. The distribution of these longest runs is compared to the theoretical distribution for a truly random sequence. An unusually long run (or an absence of long runs) indicates a departure from randomness.
What it catches: Structural patterns in the output that might not affect the mean or variance but create suspiciously long (or short) streaks. This is particularly relevant for gambling applications where a "streak" of outcomes could indicate a compromised generator.
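A sketch specialised to 8-bit blocks. The class probabilities are the tabulated values from NIST SP 800-22 for this block size (classes: longest run ≤ 1, = 2, = 3, ≥ 4); the `igamc` helper is the same stdlib-only incomplete-gamma recurrence used elsewhere in this post:

```python
import math

def igamc(a: float, x: float) -> float:
    # Regularized upper incomplete gamma Q(a, x) for half-integer a.
    if a == 0.5:
        return math.erfc(math.sqrt(x))
    if a == 1.0:
        return math.exp(-x)
    return igamc(a - 1.0, x) + x ** (a - 1.0) * math.exp(-x) / math.gamma(a)

# Tabulated class probabilities for 8-bit blocks (NIST SP 800-22):
# P(longest run <= 1), P(= 2), P(= 3), P(>= 4).
PI_M8 = (0.2148, 0.3672, 0.2305, 0.1875)

def longest_run_p_value(bits: str) -> float:
    counts = [0, 0, 0, 0]
    n_blocks = len(bits) // 8
    for i in range(n_blocks):
        block = bits[i * 8:(i + 1) * 8]
        # Longest run of ones = longest chunk between zeros.
        longest = max(len(run) for run in block.split("0"))
        counts[min(max(longest - 1, 0), 3)] += 1
    chi_sq = sum((counts[j] - n_blocks * PI_M8[j]) ** 2 / (n_blocks * PI_M8[j])
                 for j in range(4))
    return igamc(3 / 2.0, chi_sq / 2.0)
```

An alternating sequence puts every block into the "longest run ≤ 1" class - a wildly non-random distribution of streak lengths - and fails, even though its ones/zeros balance is perfect.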
5. Serial Test
Examines the frequency of all possible overlapping patterns of a given length. For a pattern length of m, there are 2^m possible patterns. The Serial test checks whether all patterns occur with approximately equal frequency. This generalises the Frequency and Runs tests to higher-order patterns.
What it catches: Complex multi-bit patterns that simpler tests miss. A generator might pass the Frequency and Runs tests while still producing detectable 3-bit or 4-bit patterns.
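A sketch for pattern length m = 3, treating the sequence as circular (the spec wraps the last m−1 patterns around to the start). The helper `igamc` here only needs integer arguments, which holds as long as m ≥ 3:

```python
import math

def igamc(a: float, x: float) -> float:
    # Regularized upper incomplete gamma Q(a, x) for integer a >= 1.
    if a == 1.0:
        return math.exp(-x)
    return igamc(a - 1.0, x) + x ** (a - 1.0) * math.exp(-x) / math.gamma(a)

def psi_sq(bits: str, m: int) -> float:
    # Frequency statistic over all overlapping m-bit patterns,
    # with wrap-around so every position contributes one pattern.
    if m <= 0:
        return 0.0
    n = len(bits)
    ext = bits + bits[:m - 1]
    counts: dict = {}
    for i in range(n):
        pat = ext[i:i + m]
        counts[pat] = counts.get(pat, 0) + 1
    return (2 ** m / n) * sum(c * c for c in counts.values()) - n

def serial_p_values(bits: str, m: int = 3) -> tuple:
    # First and second differences of psi^2; assumes m >= 3 so both
    # gamma arguments below are integers.
    d1 = psi_sq(bits, m) - psi_sq(bits, m - 1)
    d2 = psi_sq(bits, m) - 2 * psi_sq(bits, m - 1) + psi_sq(bits, m - 2)
    return igamc(2 ** (m - 2), d1 / 2.0), igamc(2 ** (m - 3), d2 / 2.0)
```

An alternating sequence contains only the 3-bit patterns 010 and 101 - six of the eight possible patterns never occur - so both p-values collapse to zero.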
6. Approximate Entropy Test
Compares the frequency of overlapping patterns of length m and m+1. The approximate entropy measures the regularity of the sequence - a truly random sequence has maximum entropy (minimum regularity). If the entropy is lower than expected, the sequence contains exploitable patterns.
What it catches: Subtle regularity that isn't visible as simple bias or runs. This test is particularly good at detecting generators whose output appears random to the eye but contains compressible structure.
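A sketch specialised to m = 2, which lets us use the closed form Q(2, x) = (1 + x)·e^(−x) for the p-value instead of a general incomplete-gamma routine. Patterns are counted with wrap-around, as in the Serial test:

```python
import math

def phi(bits: str, m: int) -> float:
    # Sum of pi * ln(pi) over overlapping m-bit patterns (circular).
    n = len(bits)
    ext = bits + bits[:m - 1]
    counts: dict = {}
    for i in range(n):
        pat = ext[i:i + m]
        counts[pat] = counts.get(pat, 0) + 1
    return sum((c / n) * math.log(c / n) for c in counts.values())

def approximate_entropy_p_value(bits: str) -> float:
    """ApEn test with m = 2; p-value via Q(2, x) = (1 + x) * e^-x."""
    n = len(bits)
    ap_en = phi(bits, 2) - phi(bits, 3)
    # A truly random sequence approaches the maximum ApEn of ln 2,
    # driving chi^2 (and the evidence of regularity) toward zero.
    chi_sq = 2.0 * n * (math.log(2.0) - ap_en)
    x = chi_sq / 2.0
    return (1.0 + x) * math.exp(-x)
```

The alternating sequence is perfectly regular - knowing one 2-bit pattern determines the next - so its approximate entropy is 0, far below ln 2, and the test fails decisively.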
7. Cumulative Sums Test
Treats the sequence as a random walk: each 1 is a step forward (+1) and each 0 is a step backward (−1). The test checks whether the cumulative sum stays within the bounds expected for a truly random walk. We run this test in both forward and reverse directions to catch asymmetric patterns.
What it catches: Drift and trend in the sequence. If the generator slowly shifts its bias over time - perhaps due to hardware warming or calibration drift - the Cumulative Sums test detects the resulting trend in the random walk.
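A sketch of the forward/reverse walk and the spec's p-value series, which sums normal-CDF terms over the reachable excursion bands (the summation bounds follow the spec's formula, with floors applied to the non-integer endpoints):

```python
import math

def _norm_cdf(x: float) -> float:
    # Standard normal CDF via erfc.
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def cusum_p_value(bits: str, reverse: bool = False) -> float:
    """Cumulative Sums test; reverse=True runs the walk backwards."""
    seq = bits[::-1] if reverse else bits
    n = len(seq)
    # Random walk: +1 for a one, -1 for a zero; z = max excursion.
    s, z = 0, 0
    for b in seq:
        s += 1 if b == "1" else -1
        z = max(z, abs(s))
    sqrt_n = math.sqrt(n)
    total = 1.0
    for k in range(math.floor((-n / z + 1) / 4),
                   math.floor((n / z - 1) / 4) + 1):
        total -= (_norm_cdf((4 * k + 1) * z / sqrt_n)
                  - _norm_cdf((4 * k - 1) * z / sqrt_n))
    for k in range(math.floor((-n / z - 3) / 4),
                   math.floor((n / z - 1) / 4) + 1):
        total += (_norm_cdf((4 * k + 3) * z / sqrt_n)
                  - _norm_cdf((4 * k + 1) * z / sqrt_n))
    return total
```

An all-ones sequence is a walk that drifts straight to +n and fails in both directions; a balanced sequence whose walk stays near zero passes.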
Real Results
We run the NIST test suite on every batch of entropy, whether generated by the local simulator or IBM Quantum hardware. Our results:
Shots: 4096 · Significance: α = 0.01
PASS Frequency (Monobit)
PASS Block Frequency
PASS Runs
PASS Longest Run of Ones
PASS Serial
PASS Approximate Entropy
PASS Cumulative Sums (Forward)
PASS Cumulative Sums (Reverse)
Result: 7/7 PASS
The same 7/7 PASS result holds on both the local simulator and IBM Quantum hardware.
The Bitstring-Shuffle Fix
One of the most instructive lessons from building this pipeline came during development. Quantum circuit execution with 4,096 shots doesn't return one bitstring per shot - it returns a counts dictionary mapping each observed bitstring to the number of times it appeared. For an 8-qubit Hadamard circuit, you might get 256 distinct bitstrings, each appearing roughly 16 times.
Our initial implementation expanded this counts dictionary by repeating each bitstring according to its count, then concatenated them in order. The global Frequency test passed - the overall ratio of ones to zeros was fine. But Block Frequency and Longest Run tests failed, because identical bitstrings were grouped together, creating artificial block-level patterns.
The fix was simple: shuffle the expanded bitstrings before concatenation (a technique we also discuss in our QuBitLang compiler deep-dive). This eliminates the ordering artifact while preserving the statistical distribution of the original quantum measurements. After applying the shuffle, all 7 tests passed immediately.
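The fix looks roughly like this (a sketch; `counts_to_bitstream` is an illustrative name, not our production function):

```python
import random

def counts_to_bitstream(counts: dict) -> str:
    """Expand a counts dict into a flat bitstream, shuffling first so
    identical measurement outcomes are not concatenated back-to-back."""
    expanded = [bitstring for bitstring, c in counts.items()
                for _ in range(c)]
    # The shuffle destroys only the ordering artifact; the multiset of
    # per-shot outcomes (and so every bit frequency) is unchanged.
    random.shuffle(expanded)
    return "".join(expanded)
```

Because shuffling only permutes whole bitstrings, the global ones/zeros ratio and the distribution of measurement outcomes are exactly what the hardware produced - only the artificial clustering is gone.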
Why Testing Matters
Quantum hardware is not perfect. Qubits decohere. Gates have finite fidelity. Calibration drifts over time. Even a quantum random number generator can produce biased output if the hardware is degraded. Statistical testing is the safety net that catches these failures before they reach your application.
At TrueEntropy, no entropy enters the pool without passing all 7 NIST tests. You can read more in our NIST compliance documentation. If a batch fails, it's rejected and a fresh batch is generated. This is the verification layer that lets us guarantee the quality of every random number we deliver.