Computer Vision Workload Analysis: Memory Leaks & System Debugging
Production benchmark systems fail in ways that unit tests never catch. This is a deep-dive into debugging a computer vision benchmark platform processing concurrent video streams across CPU (AMX and non-AMX) and GPU backends — where we uncovered three interconnected bugs that caused disk exhaustion, infinite task loops, and memory explosions.
The System Under Test
The CV benchmark platform runs Dockerized, with a Celery coordinator dispatching benchmark scenarios to specialized workers:
┌─────────────────────────────────────────────────┐
│ API Server (FastAPI) │
│ └── Celery Coordinator │
│ ├── CV Worker (AMX) ─── OpenVINO Async │
│ ├── CV Worker (CPU) ─── OpenVINO Direct │
│ └── CV Worker (GPU) ─── HTTP → GPU Srv │
│ │
│ Infrastructure: Redis │ PostgreSQL │ Prometheus │
└─────────────────────────────────────────────────┘

Each scenario runs at increasing concurrency levels (1, 2, 4, 8, 16, 32, 64 streams), processing 210 video frames through YOLO-based detection pipelines.
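The sweep over concurrency levels can be sketched roughly as follows; `run_scenario` and `FRAMES_PER_SCENARIO` are hypothetical names for illustration, not the platform's real API:

```python
# Hypothetical sketch of the coordinator's benchmark sweep.
CONCURRENCY_LEVELS = [2 ** i for i in range(7)]  # [1, 2, 4, 8, 16, 32, 64]
FRAMES_PER_SCENARIO = 210

def sweep(run_scenario):
    """Run one benchmark scenario at each concurrency level in turn."""
    results = {}
    for streams in CONCURRENCY_LEVELS:
        results[streams] = run_scenario(streams=streams,
                                        frames=FRAMES_PER_SCENARIO)
    return results
```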
Bug 1: The Memory Leak — 5000-Frame Replication
The first symptom: at 64 concurrency, the system's disk would fill up completely. Investigation revealed the root cause wasn't disk — it was memory.
Root Cause Analysis
In openvino_async.py, the inference-only path replicates frames to ensure a minimum benchmark duration:
```python
# openvino_async.py - The culprit
preprocessed = [self._preprocess_frame(frame) for frame in frames]
if inference_only:
    min_frames_for_benchmark = 5000
    iterations = max(1, min_frames_for_benchmark // len(preprocessed))
    if iterations > 1:
        preprocessed = preprocessed * iterations  # Replicate frames
```

Each preprocessed frame occupies 4.9 MB (shape [1, 3, 640, 640] float32). With a 300-frame video, this replicates 16x to 4800 frames = 23.6 GB per stream.
Each preprocessed frame: 4.9 MB (shape [1, 3, 640, 640] float32)
× ~5000-frame replication × N concurrent streams = roughly 23.6 GB per stream, multiplied by every concurrent stream
The multi-stream CPU/AMX path launches num_streams threads, each creating its own replicated set independently. At just 2 streams: 47 GB. At 4 streams: 94 GB. These numpy arrays need physical RAM, and when the system swaps, it fills the disk.
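The per-stream footprint can be sanity-checked with plain arithmetic (a back-of-the-envelope check, not project code):

```python
# Back-of-the-envelope check of the leak's footprint.
# float32 = 4 bytes per element, frame shape [1, 3, 640, 640].
frame_bytes = 1 * 3 * 640 * 640 * 4                      # 4,915,200 ≈ 4.9 MB

iterations = max(1, 5000 // 300)                          # 16x for a 300-frame video
replicated_frames = 300 * iterations                      # 4800 frames
per_stream_gb = replicated_frames * frame_bytes / 1e9     # ≈ 23.6 GB

for streams in (1, 2, 4):
    print(f"{streams} stream(s): {streams * per_stream_gb:.1f} GB")
```

Running this reproduces the numbers above: 23.6 GB at one stream, 47.2 GB at two, 94.4 GB at four.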
The Fix
Added explicit memory cleanup at four levels:
```python
# 1. benchmark.py - Between concurrency levels
del video_results
gc.collect()

# 2. openvino_async.py - After distributing to threads
del preprocessed  # Free 23GB before inference starts
gc.collect()

# 3. vvl_gpu_http.py - After HTTP POST
del payload  # Free serialized frames
del preprocessed

# 4. openvino_direct.py - After batch distribution
del all_preprocessed
gc.collect()
```

The Python Scoping Gotcha
After deploying the fix, we got `cannot access local variable 'gc'`. The root cause: Python treats a name as local for the entire function if there's an `import gc` statement anywhere inside it:
```python
def _run_single_stream_level(self):
    gc.collect()  # Line 1465 - UnboundLocalError!
    # ... 50 lines later ...
    import gc     # Line 1521 - Python sees this and makes gc "local"
    gc.collect()  # This one works fine
```

Fix: removed all 4 local `import gc` statements, relying on the top-level import.
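A minimal repro of the scoping rule, independent of the benchmark code:

```python
import gc  # module-level import

def broken():
    # Python scans the whole function body at compile time; the
    # `import gc` below makes `gc` a local name for the ENTIRE function,
    # so this line raises UnboundLocalError before the import ever runs.
    gc.collect()
    import gc

def fixed():
    # No local import: `gc` resolves to the module-level name.
    gc.collect()

try:
    broken()
except UnboundLocalError as e:
    print("broken():", e)

fixed()  # runs fine
```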
Bug 2: Disk Exhaustion — oneDNN Verbose Logging
Even after the memory fix, disk kept filling. Monitoring revealed the memory fix was working for RAM, but container logs were eating the disk.
The Docker Compose configuration had ONEDNN_VERBOSE=1 set for the workers. This dumps a log line for every single oneDNN kernel execution — at 30+ concurrent streams with 210 frames and 50+ layers per frame, that generates millions of log lines flowing into Docker's JSON log files.
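A rough estimate of the log volume, assuming one verbose line per kernel per frame and ~120 bytes per line (both figures are assumptions, not measurements):

```python
# Rough log-volume estimate; per-line size and one-line-per-kernel
# are illustrative assumptions.
streams, frames_per_stream, kernels_per_frame = 32, 210, 50
bytes_per_line = 120

lines = streams * frames_per_stream * kernels_per_frame            # 336,000 at one level
levels = [2 ** i for i in range(7)]                                # 1..64 streams
total_lines = sum(levels) * frames_per_stream * kernels_per_frame  # 1,333,500 per sweep

print(f"{lines:,} lines (~{lines * bytes_per_line / 1e6:.0f} MB) at 32 streams alone")
print(f"{total_lines:,} lines over a full concurrency sweep")
```

And that is a single sweep; repeated benchmark runs multiply it, which is how the Docker JSON logs filled the disk.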
The Fix
```yaml
# docker-compose.yaml
environment:
  - ONEDNN_VERBOSE=0  # Was 1

# Also added log rotation
logging:
  driver: "json-file"
  options:
    max-size: "50m"
    max-file: "3"
```

Bug 3: The Infinite Loop — Celery Visibility Timeout
The most subtle bug: after a CV batch completed successfully, it would immediately restart and run again — indefinitely:
```
05:40:15 - Task succeeded (took ~2h 37m)
05:40:15 - Task received   ← redelivered copy!
08:17:20 - Task succeeded (took ~2h 37m)
08:17:20 - Task received   ← redelivered AGAIN
```

The three bugs at a glance:
- Memory leak: no gc.collect() between concurrency levels, so the 5000-frame replication per stream was never freed.
- Disk exhaustion: ONEDNN_VERBOSE=1 dumped millions of kernel trace lines into Docker's JSON logs.
- Infinite loop: Celery's visibility_timeout (1 hour by default) was shorter than the batch runtime (~3 hours), so Redis redelivered already-completed tasks.
Root Cause
The Celery configuration used `task_acks_late=True` (tasks are not acknowledged until they finish). But Celery's Redis broker has a default `visibility_timeout` of 3600 seconds (1 hour). A CV batch with 3 scenarios takes ~3 hours.
The flow:
- Coordinator task starts, runs 3 CV scenarios (~3 hours total)
- After 1 hour, Redis thinks the message was lost (not acked within visibility timeout)
- Redis redelivers the message back to the queue
- When the original run finishes and acks, the worker picks up the redelivered copy
- The cycle repeats infinitely
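The failure condition reduces to a one-line predicate (a toy model of the broker's behavior, not Celery internals):

```python
def will_redeliver(task_runtime_s: float, visibility_timeout_s: float,
                   acks_late: bool = True) -> bool:
    """Toy model: with task_acks_late, the message stays unacked for the
    whole run, so Redis restores it once the visibility timeout elapses."""
    return acks_late and task_runtime_s > visibility_timeout_s

THREE_HOURS = 3 * 3600
print(will_redeliver(THREE_HOURS, 3600))    # True  -> infinite loop
print(will_redeliver(THREE_HOURS, 14400))   # False -> fixed
```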
The Fix
```python
# celery_app.py
broker_transport_options={"visibility_timeout": 14400},  # 4 hours
```

GPU Utilization Analysis
Another investigation revealed why GPU utilization showed as 0% most of the time. The pattern was clear when monitoring over time:
```
16 streams (3360 frames):
  Preprocessing: 31.8s (CPU) — GPU idle
  Inference:      3.7s (GPU at 97%)
  → GPU busy only ~10% of the time

32 streams (6720 frames):
  Preprocessing: 61.5s (CPU) — GPU idle
  Inference:      ~5s (GPU at 99%)
  → GPU busy only ~7% of the time
```

The root cause: preprocessing + serialization is entirely CPU-bound and dominates wall-clock time. The GPU model server only receives frames after the worker finishes preprocessing and HTTP serialization. The actual YOLO inference runs at ~3500 FPS combined — so it blasts through work in seconds and sits idle waiting for the next batch.
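The utilization figures fall straight out of the phase timings; `gpu_busy_fraction` is an illustrative helper, assuming the two phases run strictly back to back:

```python
def gpu_busy_fraction(preprocess_s: float, inference_s: float) -> float:
    """Fraction of wall-clock time the GPU has work, assuming the
    CPU preprocessing and GPU inference phases run back to back."""
    return inference_s / (preprocess_s + inference_s)

print(f"16 streams: {gpu_busy_fraction(31.8, 3.7):.1%}")  # ~10%
print(f"32 streams: {gpu_busy_fraction(61.5, 5.0):.1%}")  # ~7.5%
```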
Docker Volume Path Bug
A separate issue caused OVMS (OpenVINO Model Server) containers to fail with `is a directory: permission denied`. Docker had created empty directories instead of mounting the actual entrypoint script file.
Root cause: ./ relative paths in docker-compose.yaml resolve based on the working directory when docker compose was run. If run from a different directory, Docker creates the bind mount targets as empty directories instead of using the actual files.
Fix: auto-detect the host project directory from the Docker socket API as a fallback, eliminating the need for manual path configuration on new servers.
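One way to implement that fallback, sketched under assumptions: the Docker Engine API endpoint `GET /containers/{id}/json` and the default socket path are real, but the `/app` mount destination and the helper names are illustrative:

```python
import http.client
import json
import socket

class UnixHTTPConnection(http.client.HTTPConnection):
    """http.client connection that talks to the Docker Unix socket."""
    def __init__(self, sock_path="/var/run/docker.sock"):
        super().__init__("localhost")
        self._sock_path = sock_path

    def connect(self):
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.connect(self._sock_path)
        self.sock = s

def host_source_for(mounts, container_dest):
    """Find the host-side Source of the bind mount at `container_dest`."""
    for m in mounts:
        if m.get("Destination") == container_dest:
            return m.get("Source")
    return None

def detect_project_dir(container_dest="/app"):
    """Ask the Docker API which host directory is mounted at
    `container_dest`. Inside a container, the hostname defaults to
    the container ID, so we can inspect ourselves."""
    conn = UnixHTTPConnection()
    conn.request("GET", f"/containers/{socket.gethostname()}/json")
    info = json.loads(conn.getresponse().read())
    return host_source_for(info.get("Mounts", []), container_dest)
```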
Key Takeaways
- Python's garbage collector doesn't guarantee immediate reclamation of large numpy/tensor arrays — explicit `del` + `gc.collect()` is essential for memory-critical workloads.
- Python's scoping rules can bite you: any `import` statement inside a function makes that name local for the entire function body, not just after the import line.
- Celery + Redis: `visibility_timeout` must exceed the maximum task runtime when using `task_acks_late=True` — this is a well-documented but easy-to-miss pitfall.
- Container log rotation is not optional for verbose workloads — a single `ONEDNN_VERBOSE=1` flag can generate gigabytes of logs per hour.
- GPU utilization metrics can be misleading — low utilization might mean the CPU preprocessing pipeline is the actual bottleneck, not the GPU.
- Docker bind mount paths with `./` are fragile — use absolute paths or auto-detection for portable deployments.