Computer Vision Workload Analysis: Memory Leaks & System Debugging
Production benchmark systems fail in ways that unit tests never catch. This is a deep-dive into debugging a computer vision benchmark platform processing concurrent video streams across CPU (AMX and non-AMX) and GPU backends — where we uncovered three interconnected bugs that caused disk exhaustion, infinite task loops, and memory explosions.
The System Under Test
The CV benchmark platform runs Dockerized, with a Celery coordinator dispatching benchmark scenarios to specialized workers:
┌─────────────────────────────────────────────────┐
│ API Server (FastAPI) │
│ └── Celery Coordinator │
│ ├── CV Worker (AMX) ─── OpenVINO Async │
│ ├── CV Worker (CPU) ─── OpenVINO Direct │
│ └── CV Worker (GPU) ─── HTTP → GPU Srv │
│ │
│ Infrastructure: Redis │ PostgreSQL │ Prometheus │
└─────────────────────────────────────────────────┘

Each scenario runs at increasing concurrency levels (1, 2, 4, 8, 16, 32, 64 streams), processing 210 video frames through YOLO-based detection pipelines.
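The sweep over concurrency levels can be sketched roughly as follows; `run_scenario` and `FRAMES_PER_SCENARIO` are hypothetical names for illustration, not the platform's real API:

```python
# Hypothetical sketch of the coordinator's benchmark sweep.
CONCURRENCY_LEVELS = [2 ** i for i in range(7)]  # [1, 2, 4, 8, 16, 32, 64]
FRAMES_PER_SCENARIO = 210

def sweep(run_scenario):
    """Run one benchmark scenario at each concurrency level in turn."""
    results = {}
    for streams in CONCURRENCY_LEVELS:
        results[streams] = run_scenario(streams=streams,
                                        frames=FRAMES_PER_SCENARIO)
    return results
```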
Bug 1: The Memory Leak — 5000-Frame Replication
The first symptom: at 64 concurrency, the system's disk would fill up completely. Investigation revealed the root cause wasn't disk — it was memory.
Root Cause Analysis
In openvino_async.py, the inference-only path replicates frames to ensure a minimum benchmark duration:
```python
# openvino_async.py - The culprit
preprocessed = [self._preprocess_frame(frame) for frame in frames]
if inference_only:
    min_frames_for_benchmark = 5000
    iterations = max(1, min_frames_for_benchmark // len(preprocessed))
    if iterations > 1:
        preprocessed = preprocessed * iterations  # Replicate frames
```

Each preprocessed frame occupies 4.9 MB (shape [1, 3, 640, 640] float32). With a 300-frame video, this replicates 16x to 4800 frames = 23.6 GB per stream.
Each preprocessed frame: 4.9 MB (shape [1, 3, 640, 640] float32)
× ~5000-frame replication × N concurrent streams = roughly 23.6 GB per stream, multiplied by every concurrent stream
The multi-stream CPU/AMX path launches num_streams threads, each creating its own replicated set independently. At just 2 streams: 47 GB. At 4 streams: 94 GB. These numpy arrays need physical RAM, and when the system swaps, it fills the disk.
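The per-stream footprint can be sanity-checked with plain arithmetic (a back-of-the-envelope check, not project code):

```python
# Back-of-the-envelope check of the leak's footprint.
# float32 = 4 bytes per element, frame shape [1, 3, 640, 640].
frame_bytes = 1 * 3 * 640 * 640 * 4                      # 4,915,200 ≈ 4.9 MB

iterations = max(1, 5000 // 300)                          # 16x for a 300-frame video
replicated_frames = 300 * iterations                      # 4800 frames
per_stream_gb = replicated_frames * frame_bytes / 1e9     # ≈ 23.6 GB

for streams in (1, 2, 4):
    print(f"{streams} stream(s): {streams * per_stream_gb:.1f} GB")
```

Running this reproduces the numbers above: 23.6 GB at one stream, 47.2 GB at two, 94.4 GB at four.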
The Fix
Added explicit memory cleanup at four levels:
```python
# 1. benchmark.py - Between concurrency levels
del video_results
gc.collect()

# 2. openvino_async.py - After distributing to threads
del preprocessed  # Free 23GB before inference starts
gc.collect()

# 3. vvl_gpu_http.py - After HTTP POST
del payload  # Free serialized frames
del preprocessed

# 4. openvino_direct.py - After batch distribution
del all_preprocessed
gc.collect()
```

The Python Scoping Gotcha
After deploying the fix, we got `cannot access local variable 'gc'`. The root cause: Python treats a name as local for the entire function if there's an `import gc` statement anywhere inside it:
```python
def _run_single_stream_level(self):
    gc.collect()  # Line 1465 - UnboundLocalError!
    # ... 50 lines later ...
    import gc     # Line 1521 - Python sees this and makes gc "local"
    gc.collect()  # This one works fine
```

Fix: removed all 4 local `import gc` statements, relying on the top-level import.
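A minimal repro of the scoping rule, independent of the benchmark code:

```python
import gc  # module-level import

def broken():
    # Python scans the whole function body at compile time; the
    # `import gc` below makes `gc` a local name for the ENTIRE function,
    # so this line raises UnboundLocalError before the import ever runs.
    gc.collect()
    import gc

def fixed():
    # No local import: `gc` resolves to the module-level name.
    gc.collect()

try:
    broken()
except UnboundLocalError as e:
    print("broken():", e)

fixed()  # runs fine
```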
Bug 2: Disk Exhaustion — oneDNN Verbose Logging
Even after the memory fix, disk kept filling. Monitoring revealed the memory fix was working for RAM, but container logs were eating the disk.
The Docker Compose configuration had ONEDNN_VERBOSE=1 set for the workers. This dumps a log line for every single oneDNN kernel execution — at 30+ concurrent streams with 210 frames and 50+ layers per frame, that generates millions of log lines flowing into Docker's JSON log files.
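A rough estimate of the log volume, assuming one verbose line per kernel per frame and ~120 bytes per line (both figures are assumptions, not measurements):

```python
# Rough log-volume estimate; per-line size and one-line-per-kernel
# are illustrative assumptions.
streams, frames_per_stream, kernels_per_frame = 32, 210, 50
bytes_per_line = 120

lines = streams * frames_per_stream * kernels_per_frame            # 336,000 at one level
levels = [2 ** i for i in range(7)]                                # 1..64 streams
total_lines = sum(levels) * frames_per_stream * kernels_per_frame  # 1,333,500 per sweep

print(f"{lines:,} lines (~{lines * bytes_per_line / 1e6:.0f} MB) at 32 streams alone")
print(f"{total_lines:,} lines over a full concurrency sweep")
```

And that is a single sweep; repeated benchmark runs multiply it, which is how the Docker JSON logs filled the disk.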
The Fix
```yaml
# docker-compose.yaml
environment:
  - ONEDNN_VERBOSE=0  # Was 1

# Also added log rotation
logging:
  driver: "json-file"
  options:
    max-size: "50m"
    max-file: "3"
```

Bug 3: The Infinite Loop — Celery Visibility Timeout
The most subtle bug: after a CV batch completed successfully, it would immediately restart and run again — indefinitely:
```
05:40:15 - Task succeeded (took ~2h 37m)
05:40:15 - Task received   ← redelivered copy!
08:17:20 - Task succeeded (took ~2h 37m)
08:17:20 - Task received   ← redelivered AGAIN
```

The three bugs at a glance:
- Memory leak: no gc.collect() between concurrency levels, so the 5000-frame replication per stream was never freed.
- Disk exhaustion: ONEDNN_VERBOSE=1 dumped millions of kernel trace lines into Docker's JSON logs.
- Infinite loop: Celery's visibility_timeout (1 hour by default) was shorter than the batch runtime (~3 hours), so Redis redelivered already-completed tasks.
Root Cause
The Celery configuration used `task_acks_late=True` (tasks are not acknowledged until they finish). But Celery's Redis broker has a default `visibility_timeout` of 3600 seconds (1 hour). A CV batch with 3 scenarios takes ~3 hours.
The flow:
- Coordinator task starts, runs 3 CV scenarios (~3 hours total)
- After 1 hour, Redis thinks the message was lost (not acked within visibility timeout)
- Redis redelivers the message back to the queue
- When the original run finishes and acks, the worker picks up the redelivered copy
- The cycle repeats infinitely
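The failure condition reduces to a one-line predicate (a toy model of the broker's behavior, not Celery internals):

```python
def will_redeliver(task_runtime_s: float, visibility_timeout_s: float,
                   acks_late: bool = True) -> bool:
    """Toy model: with task_acks_late, the message stays unacked for the
    whole run, so Redis restores it once the visibility timeout elapses."""
    return acks_late and task_runtime_s > visibility_timeout_s

THREE_HOURS = 3 * 3600
print(will_redeliver(THREE_HOURS, 3600))    # True  -> infinite loop
print(will_redeliver(THREE_HOURS, 14400))   # False -> fixed
```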
The Fix
```python
# celery_app.py
broker_transport_options={"visibility_timeout": 14400},  # 4 hours
```

GPU Utilization Analysis
Another investigation revealed why GPU utilization showed as 0% most of the time. The pattern was clear when monitoring over time:
```
16 streams (3360 frames):
  Preprocessing: 31.8s (CPU) — GPU idle
  Inference:      3.7s (GPU at 97%)
  → GPU busy only ~10% of the time

32 streams (6720 frames):
  Preprocessing: 61.5s (CPU) — GPU idle
  Inference:      ~5s (GPU at 99%)
  → GPU busy only ~7% of the time
```

The root cause: preprocessing + serialization is entirely CPU-bound and dominates wall-clock time. The GPU model server only receives frames after the worker finishes preprocessing and HTTP serialization. The actual YOLO inference runs at ~3500 FPS combined — so it blasts through work in seconds and sits idle waiting for the next batch.
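The utilization figures fall straight out of the phase timings; `gpu_busy_fraction` is an illustrative helper, assuming the two phases run strictly back to back:

```python
def gpu_busy_fraction(preprocess_s: float, inference_s: float) -> float:
    """Fraction of wall-clock time the GPU has work, assuming the
    CPU preprocessing and GPU inference phases run back to back."""
    return inference_s / (preprocess_s + inference_s)

print(f"16 streams: {gpu_busy_fraction(31.8, 3.7):.1%}")  # ~10%
print(f"32 streams: {gpu_busy_fraction(61.5, 5.0):.1%}")  # ~7.5%
```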
Docker Volume Path Bug
A separate issue caused OVMS (OpenVINO Model Server) containers to fail with `is a directory: permission denied`. Docker had created empty directories instead of mounting the actual entrypoint script file.
Root cause: ./ relative paths in docker-compose.yaml resolve based on the working directory when docker compose was run. If run from a different directory, Docker creates the bind mount targets as empty directories instead of using the actual files.
Fix: auto-detect the host project directory from the Docker socket API as a fallback, eliminating the need for manual path configuration on new servers.
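One way to implement that fallback, sketched under assumptions: the Docker Engine API endpoint `GET /containers/{id}/json` and the default socket path are real, but the `/app` mount destination and the helper names are illustrative:

```python
import http.client
import json
import socket

class UnixHTTPConnection(http.client.HTTPConnection):
    """http.client connection that talks to the Docker Unix socket."""
    def __init__(self, sock_path="/var/run/docker.sock"):
        super().__init__("localhost")
        self._sock_path = sock_path

    def connect(self):
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.connect(self._sock_path)
        self.sock = s

def host_source_for(mounts, container_dest):
    """Find the host-side Source of the bind mount at `container_dest`."""
    for m in mounts:
        if m.get("Destination") == container_dest:
            return m.get("Source")
    return None

def detect_project_dir(container_dest="/app"):
    """Ask the Docker API which host directory is mounted at
    `container_dest`. Inside a container, the hostname defaults to
    the container ID, so we can inspect ourselves."""
    conn = UnixHTTPConnection()
    conn.request("GET", f"/containers/{socket.gethostname()}/json")
    info = json.loads(conn.getresponse().read())
    return host_source_for(info.get("Mounts", []), container_dest)
```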
Key Takeaways
- Python's garbage collector doesn't guarantee immediate reclamation of large numpy/tensor arrays — explicit `del` + `gc.collect()` is essential for memory-critical workloads.
- Python's scoping rules can bite you: any `import` statement inside a function makes that name local for the entire function body, not just after the import line.
- Celery + Redis: `visibility_timeout` must exceed the maximum task runtime when using `task_acks_late=True` — this is a well-documented but easy-to-miss pitfall.
- Container log rotation is not optional for verbose workloads — a single `ONEDNN_VERBOSE=1` flag can generate gigabytes of logs per hour.
- GPU utilization metrics can be misleading — low utilization might mean the CPU preprocessing pipeline is the actual bottleneck, not the GPU.
- Docker bind mount paths with `./` are fragile — use absolute paths or auto-detection for portable deployments.