2026-01-28|8 min read|[RunPod, GPU Cloud, Docker, vLLM, CUDA, DevOps, Infrastructure]

RunPod: GPU Cloud Deployment Mental Model

░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

When deploying large language models on GPU cloud infrastructure, understanding the mental model of how services like RunPod work is essential. This guide captures the architecture, the APIs, the pitfalls, and the debugging patterns from deploying a 27B parameter model on GPU cloud.

Architecture: Template → Pod → Endpoint

RunPod follows a three-layer abstraction model. A Template is a reusable configuration blueprint — it defines the Docker image, environment variables, ports, and disk sizes. A Pod is a running GPU instance created from a template. An Endpoint is the proxy URL that lets you access your service from the outside.

Template (reusable config blueprint)
   └── Pod (running GPU instance from template)
        └── Endpoint (proxy URL to access your service)

The Two APIs

RunPod exposes two ways to interact: a CLI (runpodctl) that wraps the GraphQL API, and a REST API at https://rest.runpod.io/v1/pods that gives full control over pod configuration. The CLI has known bugs — for instance, it doesn't pass disk sizes from templates — so the REST API is more reliable for production use.

The REST API Pod Create Call

The key fields in a pod creation request and what they mean:

POST https://rest.runpod.io/v1/pods
Authorization: Bearer <API_KEY>
Content-Type: application/json

{
  "name": "model-deployment",
  "imageName": "vllm/vllm-openai:latest",
  "gpuTypeIds": ["NVIDIA B200"],
  "gpuCount": 1,
  "containerDiskInGb": 100,
  "volumeInGb": 100,
  "volumeMountPath": "/root/.cache/huggingface",
  "ports": ["8000/http", "22/tcp"],
  "cloudType": "SECURE"
}
  • containerDiskInGb — Ephemeral disk (wiped on restart). Must fit the Docker image + model weights during download. Default is only 20-50GB — this was the first critical bug we hit.
  • volumeInGb — Persistent disk (survives restarts). Mounted at volumeMountPath.
  • volumeMountPath — Set to /root/.cache/huggingface so model downloads persist across restarts.
  • ports — http ports are accessible via the RunPod proxy URL. tcp ports provide direct IP access (SSH).
  • gpuTypeIds — Array of GPU type strings. Can list fallbacks.
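For scripting deployments, the same request can be built with Python's standard library. This is a minimal sketch — the pod name and image mirror the JSON above, and the API key is assumed to come from your environment:

```python
import json
import urllib.request

RUNPOD_API = "https://rest.runpod.io/v1/pods"

def build_pod_request(api_key: str, name: str, image: str,
                      gpu_type: str = "NVIDIA B200",
                      disk_gb: int = 100, volume_gb: int = 100):
    """Build the pod-create request, setting disk sizes explicitly
    (the CLI's template bug makes relying on defaults risky)."""
    payload = {
        "name": name,
        "imageName": image,
        "gpuTypeIds": [gpu_type],
        "gpuCount": 1,
        "containerDiskInGb": disk_gb,   # must fit image + weights during download
        "volumeInGb": volume_gb,
        "volumeMountPath": "/root/.cache/huggingface",
        "ports": ["8000/http", "22/tcp"],
        "cloudType": "SECURE",
    }
    return urllib.request.Request(
        RUNPOD_API,
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )

# To actually create the pod:
# req = build_pod_request(os.environ["RUNPOD_API_KEY"],
#                         "model-deployment", "vllm/vllm-openai:latest")
# with urllib.request.urlopen(req) as resp:
#     pod = json.load(resp)
```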

Docker Entrypoint vs CMD — Key Concept

Understanding Docker's ENTRYPOINT and CMD interaction is critical when deploying custom configurations. The final command a container runs is ENTRYPOINT + CMD.

# Normal vLLM image:
ENTRYPOINT: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
CMD: (args from dockerStartCmd)
Final: python3 -m vllm.entrypoints.openai.api_server --model ...

# Override pattern (for injecting env fixes):
ENTRYPOINT: ["bash", "-c"]           ← override
CMD: ["export LD_LIBRARY_PATH=... && python3 -m vllm..."]
Final: bash -c "export LD_LIBRARY_PATH=... && python3 -m vllm..."
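The override pattern can be sanity-checked locally without Docker, since `bash -c` simply executes whatever CMD string it receives. Here the echo stands in for the real vLLM launch:

```shell
# The string that would be passed as dockerStartCmd; echo stands in for the
# real server launch so this runs anywhere.
CMD_STRING='export LD_LIBRARY_PATH=/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH && echo "would exec: python3 -m vllm.entrypoints.openai.api_server"'

# ENTRYPOINT ["bash", "-c"] + CMD [<string>] is equivalent to:
bash -c "$CMD_STRING"
```

The export runs inside the same shell as the server command, which is exactly why this pattern works for injecting environment fixes.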

The Two Bugs We Encountered

Bug 1: Disk Too Small

runpodctl pod create --template-id=X ignores the template's containerDiskInGb and defaults to 20GB. A 27B parameter model needs ~50GB for weights alone. The container filled up and crash-looped.

Fix: Use the REST API with explicit containerDiskInGb: 100.

Bug 2: CUDA-Compat Library Conflict

The CUDA 13.0 container image ships cuda-compat libraries that shadow the host driver's libcuda.so. Even though the host driver (580) and CUDA 13.0 are compatible, the container's /usr/local/cuda/compat/libcuda.so takes priority in the linker search path, causing driver/library mismatch failures at runtime.

Fix:

export LD_LIBRARY_PATH=/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH

This forces the dynamic linker to find the host's real driver libraries first before falling back to the container's compat libraries.
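Entries in LD_LIBRARY_PATH are searched left to right, before the linker's cache, which is why prepending the host library directory wins. A quick sanity check of the ordering:

```shell
export LD_LIBRARY_PATH=/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH

# The first entry is searched first, so the host driver directory beats
# /usr/local/cuda/compat:
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | head -1

# Inside the container, you can also list which libcuda.so candidates the
# linker knows about:
#   ldconfig -p | grep libcuda
```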

Endpoint URL Pattern

Once a pod is running, the proxy URL follows a predictable pattern:

https://<POD_ID>-<PORT>.proxy.runpod.net

The service behind this URL speaks the OpenAI API, so any OpenAI SDK can use it by setting base_url. This makes integration with existing LLM pipelines seamless.
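A minimal sketch of wiring the proxy URL into the OpenAI SDK — the pod ID here is a hypothetical placeholder, and 8000 is the http port exposed in the pod config above:

```python
POD_ID = "abc123"  # hypothetical pod ID from the create-pod response
PORT = 8000        # the http port exposed in the pod's "ports" list

base_url = f"https://{POD_ID}-{PORT}.proxy.runpod.net/v1"

# With the openai package installed, the pod is a drop-in replacement:
# from openai import OpenAI
# client = OpenAI(base_url=base_url, api_key="not-needed")
# resp = client.chat.completions.create(
#     model="<served-model-name>",
#     messages=[{"role": "user", "content": "hello"}],
# )
```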

Useful Operations Commands

# List available GPU types and stock
runpodctl gpu list

# List your running pods
runpodctl pod list

# Delete a pod
runpodctl pod remove <ID>

# SSH connection details
runpodctl ssh info <ID>

Monitoring via GraphQL

For programmatic monitoring, RunPod's GraphQL API provides GPU utilization and uptime:

curl https://api.runpod.io/graphql \
  -H "Authorization: Bearer $KEY" \
  -H "Content-Type: application/json" \
  -d '{"query": "{ pod(input: {podId: \"<ID>\"}) { runtime { uptimeInSeconds gpus { memoryUtilPercent } } } }"}'
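Parsing the response in a monitoring script is straightforward. The sample payload below is illustrative (not a captured API response), but it follows the shape of the query above:

```python
import json

def gpu_mem_util(response_json: str) -> list:
    """Pull per-GPU memory utilization out of the GraphQL pod query response."""
    data = json.loads(response_json)
    gpus = data["data"]["pod"]["runtime"]["gpus"]
    return [g["memoryUtilPercent"] for g in gpus]

# Illustrative sample, matching the query shape above:
sample = '{"data":{"pod":{"runtime":{"uptimeInSeconds":3600,"gpus":[{"memoryUtilPercent":87}]}}}}'
print(gpu_mem_util(sample))  # → [87]
```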

Critical Limitation: No Log API

RunPod has no API for container logs. Logs are only available via the web console. This is a significant limitation for automated debugging and monitoring. Plan your logging strategy to push logs to an external service (e.g., a logging endpoint within your application) rather than relying on container stdout.
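One way to implement that strategy is a logging handler that batches records and ships them to an external collector. This is a sketch under assumptions — the endpoint URL is hypothetical, and the actual POST is left as a comment:

```python
import json
import logging

class ExternalLogHandler(logging.Handler):
    """Buffer log records and flush them in batches to an external endpoint,
    since RunPod exposes no API for container logs."""

    def __init__(self, endpoint: str, batch_size: int = 10):
        super().__init__()
        self.endpoint = endpoint      # hypothetical collector URL
        self.batch_size = batch_size
        self.buffer = []

    def emit(self, record):
        self.buffer.append({"level": record.levelname,
                            "msg": record.getMessage()})
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            payload = json.dumps(self.buffer)
            # Real shipping would POST here, e.g.:
            # urllib.request.urlopen(self.endpoint, data=payload.encode())
            self.buffer.clear()

# Usage: logging.getLogger("vllm-deploy").addHandler(
#     ExternalLogHandler("https://logs.example.invalid/ingest"))
```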

Key Takeaways

  • Always use the REST API for production deployments — the CLI has known bugs with parameter passing.
  • Set containerDiskInGb explicitly — never rely on defaults for large models.
  • Mount the HuggingFace cache to a persistent volume to avoid re-downloading models on restart.
  • Watch for CUDA-compat library shadowing — use LD_LIBRARY_PATH overrides when needed.
  • The OpenAI-compatible endpoint pattern makes RunPod pods drop-in replacements for any OpenAI SDK-based workflow.

Signup for Updates:

I promise to only email you cool shit. Draft chapters, progress updates, sneak peeks at illustrations I'm working on. Stuff like that.

░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░