2026-03-18|35 min read|[QAT, OpenSSL, Hardware Acceleration, TLS, Qdrant, Rustls, C, Cryptography]

QAT Engine: Hardware Crypto Acceleration with OpenSSL & Qdrant

░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

When we think about TLS performance, we usually accept that cryptographic operations are just "expensive" — a cost of doing business for secure connections. But Intel's QuickAssist Technology (QAT) changes that equation entirely. This is the story of building, patching, and deploying the QAT Engine to offload crypto operations from CPU to dedicated hardware — achieving a 56x speedup on RSA-2048 signing.

The Problem

Every TLS handshake requires the server to perform a private key signing operation. For RSA-2048, this is a modular exponentiation (m^d mod n). On a standard CPU, a single core manages about 1,000–1,700 signs per second. For high-throughput services — think a vector database like Qdrant serving thousands of concurrent connections — this becomes the bottleneck.
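To make the cost concrete, here is a toy sketch of the square-and-multiply loop behind m^d mod n. The function name toy_modpow is hypothetical, and the code only works for moduli below 2^32 so that intermediate products fit in 64 bits. Real RSA-2048 operates on 2048-bit integers with constant-time big-number arithmetic, which is where the per-signature cost comes from.

```c
#include <stdint.h>

/* Toy square-and-multiply: returns base^exp mod mod.
 * Assumption: mod < 2^32, so (x * y) never overflows uint64_t.
 * Not constant-time and not suitable for real cryptography. */
uint64_t toy_modpow(uint64_t base, uint64_t exp, uint64_t mod)
{
    if (mod == 1)
        return 0;

    uint64_t result = 1;
    base %= mod;

    while (exp > 0) {
        if (exp & 1)                      /* multiply step for each set bit */
            result = (result * base) % mod;
        base = (base * base) % mod;       /* square step for every bit */
        exp >>= 1;
    }
    return result;
}
```

With the textbook key n = 3233, e = 17, d = 2753, signing m = 65 is toy_modpow(65, 2753, 3233), and raising that signature to e recovers 65. The loop runs once per exponent bit, so a 2048-bit private exponent means roughly two thousand squarings of 2048-bit numbers per signature.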

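Intel QAT is a dedicated crypto and compression accelerator. Depending on the generation it ships as a PCIe add-in card, as part of the chipset, or, for the QAT 4xxx devices used here, integrated into the Xeon processor package. Instead of the CPU cores doing the signing math, you send the numbers to QAT hardware, it computes the result in silicon, and sends the answer back.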

Data Flow Visualization

How a crypto operation flows from app to hardware:

Application (HAProxy / Rustls): TLS handshake requests
    → OpenSSL 3.x: crypto dispatch layer
        → QAT Engine / Provider: intercepts and redirects to HW
            → qatlib (VFIO): kernel driver interface
                → QAT 4xxx Hardware: crypto in silicon

The Codebase: QAT Engine v2.0.0b

The QAT Engine repository (~60,000 lines of C + docs + patches) builds two outputs from the same codebase:

  • qatengine.so — OpenSSL Engine API (legacy, used by HAProxy/NGINX)
  • qatprovider.so — OpenSSL 3.x Provider API (modern, used by Rustls)

The build flag --enable-qat_provider switches which output you get. Both share the same core C files for QAT hardware interaction.

Phase 1: Building from Source — The Linker Error

Building against the system's in-tree qatlib 24.02.0:

./autogen.sh
./configure --with-qat_hw_dir=/usr
make -j$(nproc)

# Result: undefined symbol: icp_sal_AsymGetInflightRequests

The function icp_sal_AsymGetInflightRequests() is used for congestion management: it reports how many crypto operations are in flight on the hardware. It was added in qatlib 24.09.0, but our system had 24.02.0. The header declared the function, but libqat.so did not provide an implementation, so the link failed.

The Fix: Weak Symbol Stubs

We added weak symbol fallbacks in e_qat.c:

#ifdef QAT_HW_INTREE
__attribute__((weak))
CpaStatus icp_sal_AsymGetInflightRequests(
    CpaInstanceHandle instanceHandle,
    Cpa32U *maxInflightRequests,
    Cpa32U *numInflightRequests)
{
    if (maxInflightRequests) *maxInflightRequests = 1;
    if (numInflightRequests) *numInflightRequests = 0;
    return CPA_STATUS_SUCCESS;
}
#endif

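The __attribute__((weak)) stub is a fallback. The intent is that when the installed qatlib does provide the real symbol, that definition is used instead of ours, and when it does not, the stub reports "no congestion" so the engine still loads. Which definition actually gets bound in a shared-library setup depends on link and symbol lookup order, so it is worth confirming on a newer qatlib (for example with glibc's LD_DEBUG=bindings). Falling back to the stub is safe because it only disables congestion-based throttling.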
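The same pattern can be tried outside the QAT codebase with a minimal, self-contained sketch. The names demo_get_inflight, demo_should_throttle, status_t and STATUS_SUCCESS are hypothetical stand-ins, not QAT Engine or qatlib APIs, and the throttling check is an assumption about how such counters could be consumed, not how the engine actually does it. It assumes GCC or Clang on an ELF platform.

```c
#include <stdint.h>

typedef int status_t;        /* hypothetical stand-in for CpaStatus */
#define STATUS_SUCCESS 0     /* hypothetical stand-in for CPA_STATUS_SUCCESS */

/* Weak fallback definition: when this object is linked together with
 * another object that defines the same symbol strongly, the linker
 * picks the strong one. With no other definition, this stub is used. */
__attribute__((weak))
status_t demo_get_inflight(uint32_t *max_inflight, uint32_t *num_inflight)
{
    if (max_inflight) *max_inflight = 1;   /* capacity of one request */
    if (num_inflight) *num_inflight = 0;   /* nothing in flight */
    return STATUS_SUCCESS;
}

/* Hypothetical congestion check built on top of the counters. */
int demo_should_throttle(void)
{
    uint32_t max = 0, num = 0;

    if (demo_get_inflight(&max, &num) != STATUS_SUCCESS)
        return 0;            /* on query failure, do not throttle */

    return num >= max;       /* throttle once the device is full */
}
```

Compiled on its own, demo_should_throttle() always returns 0 because the stub reports zero requests in flight against a capacity of one. That is the "no congestion" behaviour the stub in e_qat.c relies on.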

Phase 2: Qdrant + HAProxy Integration

In this deployment Qdrant does no cryptography itself: it serves plain HTTP, and HAProxy sits in front of it as a TLS termination proxy:

Client (HTTPS) → HAProxy :8443 → Qdrant :6333 (HTTP)
                     │
                     └── ssl-engine qatengine algo RSA
                         (offloads TLS signing to QAT HW)

HAProxy was rebuilt from source with USE_ENGINE=1 since the system package didn't include engine support. Qdrant required CPU limiting (--cpus=8) because its Actix-rt framework panics on 256-core systems.

Phase 3: Rustls PoC — Replicating Intel's Benchmark

The benchmark path uses patched Rust crates for async QAT offload:

crypto_bench_async (Rust/Tokio binary)
    → rustls-openssl (patched crate)
        → qatprovider.so (OpenSSL 3.x Provider)
            → OpenSSL 3.5.0 (built from source)
                → qatlib (VFIO)
                    → QAT 4xxx Hardware

This required building OpenSSL 3.5.0 from source, rebuilding the QAT code as a Provider (not Engine), and applying two patches from the rustls-poc directory:

  • rustls.patch — Adds async non-blocking sign support to rustls v0.23.33
  • rustls-openssl.patch — Adds async-jobs feature, batched worker loop, and the benchmark binary

Phase 4: Device Optimization

The system had 4 QAT Gen4 devices across 2 NUMA nodes, but only 2 were configured for crypto (sym;asym) — the other 2 were set to data compression (dc). Reconfiguring all 4 to sym;asym via sysfs doubled the crypto capacity:

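# Bring down, reconfigure, bring up (XXXX is the device's PCI address)
# To enable both services as described above, write the quoted value "sym;asym" instead of asym
echo down > /sys/bus/pci/devices/XXXX/qat/state
echo asym > /sys/bus/pci/devices/XXXX/qat/cfg_services
echo up   > /sys/bus/pci/devices/XXXX/qat/state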

RSA-2048 Sign Performance

Operations per second (higher is better)

Software (Ring) | 1,035 ops/s
AWS-LC          | 1,743 ops/s
OpenSSL SW      | 1,744 ops/s
QAT HW          | 58,629 ops/s

56.6x speedup: QAT HW vs Ring software baseline

The Driver Architecture — In-Tree vs Out-of-Tree

Understanding the two driver types is critical:

  • In-Tree (qatlib): The kernel driver ships with the Linux kernel, and the qatlib userspace library accesses the device through VFIO on Virtual Functions, managed by qatmgr. Requires IOMMU enabled. Uses a shared memory allocator, which causes contention under multi-threading.
  • Out-of-Tree (OOT): Downloaded separately from Intel, uses UIO/USDM with thread-specific DMA allocators (zero contention). Requires IOMMU disabled in PF mode.

We attempted the OOT driver for its thread-specific USDM allocator but hit: "Cannot use PF with IOMMU enabled and SVM off". The system's IOMMU couldn't be disabled without a reboot, so we proceeded with the in-tree driver.

Final Results

Algorithm         | Software (Ring) | QAT HW        | Speedup
RSA-2048 Sign     | 1,035 ops/s     | 58,629 ops/s   | 56.6x
ECDSA P-384 Sign  | 896 ops/s       | 57,881 ops/s   | 64.6x

vs Intel Chart Reference:
RSA-2048:   58,629 → 84% of ~70,000 target
ECDSA P-384: 57,881 → 118% of ~49,000 (exceeded!)
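
The ratios in the table can be reproduced with a couple of one-line helpers. This is only a worked check of the arithmetic. The function names speedup and percent_of_target are hypothetical, and the inputs are the measured and reference figures quoted above.

```c
/* Ratio of hardware throughput to software throughput. */
double speedup(double hw_ops_per_s, double sw_ops_per_s)
{
    return hw_ops_per_s / sw_ops_per_s;
}

/* Measured throughput expressed as a percentage of a reference target. */
double percent_of_target(double measured_ops_per_s, double target_ops_per_s)
{
    return 100.0 * measured_ops_per_s / target_ops_per_s;
}
```

For RSA-2048, speedup(58629, 1035) is about 56.6 and percent_of_target(58629, 70000) is about 84. For ECDSA P-384, speedup(57881, 896) is about 64.6 and percent_of_target(57881, 49000) is about 118.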

Key Files and Their Roles

  • e_qat.c — Engine entry point, the only file we modified (weak symbol stubs)
  • qat_hw_rsa.c — RSA sign/verify via QAT hardware CPA API
  • qat_hw_ec.c — ECDSA sign/verify and ECDH key agreement
  • qat_hw_polling.c — Background thread polling for completed QAT operations
  • qat_prov_init.c — Provider entry point for OpenSSL 3.x
  • configure.ac — Autoconf detection of in-tree vs OOT driver

Lessons Learned

  • Weak symbols are a powerful C technique for cross-version library compatibility.
  • The OpenSSL Engine vs Provider API distinction matters — Engine is legacy but widely used, Provider is modern but requires OpenSSL 3.x.
  • IOMMU configuration is a hard constraint for OOT driver PF mode — plan for this at the infrastructure level.
  • Sysfs device reconfiguration is volatile — it resets on reboot. Production deployments need a systemd unit or udev rule.
  • Multi-thread scaling under the in-tree driver degrades due to shared allocator contention — 2 threads performed worse than 1 thread on both RSA and ECDSA.
