This repository provides several implementations of the Number Theoretic Transform (NTT) in Rust.
In particular, it implements the negacyclic version of the NTT over the ring 
Note: This code was created for educational purposes by a learner.
- 
shoup_ntt.rs- A general-purpose NTT for up to $64$ -bit primes, using Shoup multiplication.
- On macOS, its performance is roughly comparable to concrete-ntt.
 
- A general-purpose NTT for up to 
- 
mont_ntt.rs- A general-purpose NTT for up to $64$ -bit primes, using Montgomery multiplication.
- Currently has some overhead due to converting to/from the Montgomery form.
 
- A general-purpose NTT for up to 
- 
goldilocks_ntt.rs- Specialized for the so-called Goldilocks prime $2^{64} - 2^{32} + 1$ amd used PZT22 reduction.
- Its performance is roughly on par with the plonky2 implementation.
 
- Specialized for the so-called Goldilocks prime 
- 
barrett_scalar_ntt.rs- A general-purpose NTT for up to $32$ -bit primes, using Barrett multiplication.
 
- A general-purpose NTT for up to 
- 
barrett_vector_ntt.rs- A general-purpose NTT for up to $32$ -bit primes, partially vectorized with NEON instructions (on AArch64) for addition and subtraction.
- Currently, its performance is similar to the scalar version because barrett_mulis still done in scalar form. There is a plan to vectorizebarrett_mulas well.
 
- A general-purpose NTT for up to 
cargo bench
Note that barrett_vector.rs code runs only on target_arch = "aarch64".
Here is a speed comparison of forward_inplace.
| log_n | mont_ntt.rs | shoup_ntt.rs | concrete_ntt | 
|---|---|---|---|
| 11 | 50.197µs | 10.585µs | 10.623µs | 
| 12 | 112.19µs | 23.880µs | 22.586µs | 
| 13 | 225.49µs | 49.489µs | 48.661µs | 
| 14 | 475.96µs | 106.66µs | 106.25µs | 
| 15 | 996.50µs | 240.23µs | 230.52µs | 
| 16 | 2.0449ms | 477.64µs | 477.72µs | 
(from benches/ntt.rs)
| log_n | goldilocks_ntt.rs | plonky2 | 
|---|---|---|
| 11 | 23.192µs | 31.813µs | 
| 12 | 50.804µs | 63.902µs | 
| 13 | 108.68µs | 128.01µs | 
| 14 | 251.59µs | 281.26µs | 
| 15 | 510.63µs | 591.31µs | 
| 16 | 1.1101ms | 1.2720ms | 
(from benches/ntt.rs)
| log_n | barrett_scalar_ntt.rs(un-optimized add/sub) | barrett_scalar_ntt.rs(optimized add/sub) | barrett_vector_ntt.rs | concrete_ntt | 
|---|---|---|---|---|
| 11 | 17.583µs | 10.027µs | 10.403µs | 9.5152µs | 
| 12 | 37.852µs | 24.04µs | 23.507µs | 20.331µs | 
| 13 | 82.653µs | 54.333µs | 49.696µs | 39.564µs | 
| 14 | 177.18µs | 96.052µs | 102.73µs | 82.942µs | 
| 15 | 432.07µs | 208.71µs | 215.82µs | 166.73µs | 
| 16 | 878.98µs | 427.84µs | 457.99µs | 347.83µs | 
(from benches/vectorized.rs)
cargo test
This command runs unit tests for various finite field arithmetic, as well as the NTT round-trip and polynomial multiplication tests.
- Apply parallelization to the butterfly algorithm.