KEDA Jobs Monorepo: Event-Driven Job Processing on Kubernetes
- The Problem
- What You Get
- Architecture Overview
- Add a Job = Add a Directory
- Job Chaining via Events
- Dead Letter Queue & Retries
- Scale to Zero
- CI/CD: Build Only What Changed
- Monitoring
- Quick Start
TL;DR: A monorepo template where adding a job = adding a directory. KEDA scales workers 0→N based on NATS JetStream consumer lag. Includes automatic retries with linear backoff, dead letter queues, job chaining via events, and Grafana dashboards out of the box.
GitHub: github.com/atas/keda-jobs-monorepo-template
The Problem
Running background jobs on Kubernetes usually means either: heavyweight frameworks (Celery, Sidekiq) that bring their own complexity, or hand-rolling everything from scratch — retry logic, scaling, monitoring, dead letter handling. You want something in between: a simple pattern that gives you production-grade job processing without a framework.
That's what this template is. No framework, no runtime dependency beyond NATS and KEDA. Just a shared Python package that abstracts the consumer loop, and a directory convention that makes adding a new job trivial.
What You Get
- Scale-to-zero with KEDA — 0→N pods based on queue lag, back to 0 when idle
- Add a job = add a directory — CI handles build + deploy automatically
- Chain jobs into pipelines via NATS subject-based events
- Dead letter queue with linear backoff (15 retries over ~3 days)
- Prometheus metrics + Grafana dashboards — NATS JetStream, KEDA scaling, application metrics
- Full control — your images, your runtime, your timeout, no vendor lock-in
Architecture Overview
The event flow is straightforward:
- A message lands in NATS JetStream (the persistent message store)
- KEDA polls the NATS monitoring endpoint every 10 seconds, watches consumer lag
- When lag > 0, KEDA scales the target Deployment from 0→N
- Job pods pull messages from their durable consumer, process, ack
- Optionally publish a new event to trigger the next job in the chain
Everything runs on a single stream (keda-jobs-events) with subject-based routing. Each job has its own durable consumer that filters on a specific subject. Work-queue retention means messages are removed after acknowledgement.
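For reference, here is a minimal sketch of that stream setup using nats-py. The template's setup scripts create this for you; the server URL and the exact subject list are assumptions based on the example pipeline below.

# Illustrative sketch of the single work-queue stream (nats-py); subjects are assumptions.
import asyncio

import nats
from nats.js.api import RetentionPolicy, StreamConfig


async def create_stream():
    nc = await nats.connect("nats://localhost:4222")
    js = nc.jetstream()
    await js.add_stream(StreamConfig(
        name="keda-jobs-events",
        subjects=["image-download", "image-downloaded", "dead-letter"],
        retention=RetentionPolicy.WORK_QUEUE,  # messages are deleted once acknowledged
    ))
    await nc.close()


asyncio.run(create_stream())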
Add a Job = Add a Directory
Every job lives in jobs/<name>/ with a predictable structure:
jobs/image-download/
├── main.py           # Handler logic
├── Dockerfile        # Multi-stage: base → test → production
├── service.yaml      # Deployment + ScaledObject
├── pyproject.toml    # Job-specific dependencies
└── tests/            # Pytest tests (run during Docker build)
The handler itself is minimal. Here's the entire image-download job — downloads an image from a URL, uploads to Cloudflare R2, and publishes an event for the next job in the chain:
from shared_py.nats_consumer import run_consumer
from shared_py.r2 import upload_to_r2


async def handle_event(data: dict, publish):
    url = data.get("url")
    headers = data.get("headers", {})

    # Download
    image_bytes, content_type = download_image(url, headers)

    # Upload to R2
    r2_key = build_r2_key(url)
    upload_to_r2(r2_key, image_bytes, content_type)

    # Publish event for next job in chain
    await publish("image-downloaded", {"r2_key": r2_key})


if __name__ == "__main__":
    run_consumer(handler=handle_event, job_name="image-download")
The run_consumer() call from the shared package handles everything else: NATS connection, message pulling, JSON decoding, ack/nak, retries, dead letter routing, health checks (/health on port 8080), and graceful shutdown on SIGTERM.
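For intuition, here is a rough sketch of what such a loop can look like with nats-py. It is not the shared_py implementation: the health endpoint, metrics, and dead-letter routing are left out, and the durable-name convention (<job_name>-consumer) is an assumption borrowed from the consumer setup shown later.

# Rough nats-py sketch of a run_consumer()-style loop (not the real shared_py code).
import asyncio
import json
import os
import signal

import nats
from nats.errors import TimeoutError as NatsTimeoutError


def run_consumer(handler, job_name: str):
    asyncio.run(_consume(handler, job_name))


async def _consume(handler, job_name: str):
    nc = await nats.connect(os.getenv("NATS_URL", "nats://nats.nats.svc.cluster.local:4222"))
    js = nc.jetstream()

    async def publish(subject: str, data: dict):
        await js.publish(subject, json.dumps(data).encode())

    # Bind to the pre-created durable consumer (assumed naming: "<job_name>-consumer")
    sub = await js.pull_subscribe_bind(
        durable=f"{job_name}-consumer", stream="keda-jobs-events"
    )

    stop = asyncio.Event()
    loop = asyncio.get_running_loop()
    for sig in (signal.SIGINT, signal.SIGTERM):
        loop.add_signal_handler(sig, stop.set)  # graceful shutdown on SIGTERM

    while not stop.is_set():
        try:
            msgs = await sub.fetch(batch=1, timeout=5)
        except NatsTimeoutError:
            continue  # queue is empty; KEDA scales us back down after the cooldown
        for msg in msgs:
            try:
                await handler(json.loads(msg.data.decode()), publish)
                await msg.ack()
            except Exception:
                # Ask for redelivery after 30s (mirrors the shared consumer's failure path)
                await msg.nak(delay=30)

    await nc.close()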
Job Chaining via Events
Jobs form pipelines by publishing events on NATS subjects. Each consumer only sees messages matching its subject filter.
Example pipeline with two jobs:
- Publish an image-download message with {"url": "https://..."}
- The image-download job picks it up, downloads the image, uploads it to R2 as images/{uuid}.{ext}, and publishes image-downloaded with {"r2_key": "images/abc.jpg"}
- The image-resize job picks up the image-downloaded event, downloads from R2, resizes to max 200px, and uploads to images_resized/{uuid}.{ext}
One stream, subject-based filtering. Adding a step to the pipeline means adding a directory and publishing to the right subject.
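The Quick Start below kicks this pipeline off with the nats CLI; from application code, a minimal nats-py publisher would look like the sketch below (the server URL is an assumption):

# Minimal one-off publisher to start the pipeline (nats-py; URL is an assumption).
import asyncio
import json

import nats


async def main():
    nc = await nats.connect("nats://localhost:4222")
    js = nc.jetstream()
    ack = await js.publish(
        "image-download",
        json.dumps({"url": "https://picsum.photos/2000/1000"}).encode(),
    )
    print(f"Stored in stream {ack.stream} as sequence {ack.seq}")
    await nc.close()


asyncio.run(main())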
Dead Letter Queue & Retries
Consumers are configured with 15 delivery attempts and linear backoff from 1 minute to 24 hours:
nats consumer add keda-jobs-events image-download-consumer \
--filter="image-download" \
--ack=explicit \
--max-deliver=15 \
--backoff=linear --backoff-min=1m --backoff-max=24h --backoff-steps=15 \
--pull
After exhausting all retries, the shared consumer code sends the message to a dead-letter subject with the original subject, data, and delivery count — then acks the original message so it doesn't block the consumer:
import json


async def _handle_failure(msg, js, max_deliveries):
    meta = msg.metadata  # nats-py exposes JetStream metadata as a property
    if meta.num_delivered >= max_deliveries:
        dead_letter = {
            "subject": msg.subject,
            "data": json.loads(msg.data.decode()),
            "num_delivered": meta.num_delivered,
        }
        await js.publish("dead-letter", json.dumps(dead_letter).encode())
        await msg.ack()
    else:
        await msg.nak(delay=30)
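To see what ended up in the dead letter queue, you can drain the dead-letter subject with a small nats-py script. This is a hedged sketch, not part of the template: the durable name is made up, and it assumes no other consumer already filters on "dead-letter".

# Sketch: inspect (and optionally replay) dead-lettered messages with nats-py.
import asyncio
import json

import nats
from nats.errors import TimeoutError as NatsTimeoutError


async def main():
    nc = await nats.connect("nats://localhost:4222")
    js = nc.jetstream()
    sub = await js.pull_subscribe(
        "dead-letter", durable="dead-letter-inspector", stream="keda-jobs-events"
    )
    try:
        for msg in await sub.fetch(batch=10, timeout=5):
            entry = json.loads(msg.data.decode())
            print(entry["subject"], entry["num_delivered"], entry["data"])
            # To replay once the underlying bug is fixed:
            # await js.publish(entry["subject"], json.dumps(entry["data"]).encode())
            await msg.ack()
    except NatsTimeoutError:
        print("dead-letter queue is empty")
    await nc.close()


asyncio.run(main())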
Scale to Zero
The ScaledObject configuration is where KEDA meets NATS:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
spec:
  scaleTargetRef:
    name: image-download
  minReplicaCount: 0
  maxReplicaCount: 5
  pollingInterval: 10   # Check NATS every 10 seconds
  cooldownPeriod: 60    # Wait 60s before scaling to zero
  triggers:
    - type: nats-jetstream
      metadata:
        natsServerMonitoringEndpoint: "nats.nats.svc.cluster.local:8222"
        stream: "keda-jobs-events"
        consumer: "image-download-consumer"
        lagThreshold: "5"
        activationLagThreshold: "0"  # Any pending message wakes a pod
activationLagThreshold: 0 means a single pending message is enough to wake a pod from zero. lagThreshold: 5 controls scaling beyond that: KEDA targets roughly one replica per 5 pending messages, so a lag of 12 works out to ceil(12 / 5) = 3 replicas, capped at maxReplicaCount: 5.
CI/CD: Build Only What Changed
The CI pipeline uses atas/actions/changed-dirs to detect which jobs/ subdirectories have changed. Only changed jobs get built and deployed via a matrix strategy:
- uses: atas/actions/changed-dirs@main
  with:
    path: jobs
    trigger_all: shared-py  # If shared-py changes, rebuild all jobs
Docker builds are multi-stage: base installs dependencies, test runs pytest (build fails if tests fail), production creates the final lean image with a non-root user. Tests run inside the Docker build — no separate test step needed.
On main branch, images are pushed to GHCR with SHA-based tags and deployed via kubectl apply + rollout status.
Monitoring
Three areas of observability, all wired to Prometheus via ServiceMonitors:
- NATS JetStream — stream messages, consumer lag, ack pending, redeliveries (NATS exporter sidecar)
- KEDA — scaling events, active/idle scalers, trigger metrics
- Application — processed message counts, error rates, pod counts
Grafana dashboards are deployed as ConfigMaps. The monitoring setup script applies ServiceMonitors and dashboard definitions in one shot.
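On the application side, per-job counters can be exported with prometheus_client from inside the consumer process. The sketch below is illustrative only; the metric names and port are assumptions, not the template's actual series.

# Illustrative application metrics with prometheus_client (names and port are assumptions).
from prometheus_client import Counter, start_http_server

MESSAGES_PROCESSED = Counter(
    "job_messages_processed_total", "Messages processed successfully", ["job"]
)
MESSAGES_FAILED = Counter(
    "job_messages_failed_total", "Messages that raised an exception", ["job"]
)

# Expose /metrics for the ServiceMonitor to scrape
start_http_server(9090)

# Inside the consumer loop:
#   MESSAGES_PROCESSED.labels(job="image-download").inc()
#   MESSAGES_FAILED.labels(job="image-download").inc()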
Quick Start
# Clone the template
git clone https://github.com/atas/keda-jobs-monorepo-template
# Setup infrastructure (NATS, KEDA, namespace, streams)
./k8s/setup-scripts/setup-all.sh
# Build and deploy
make build-all && make push-all
kubectl apply -f jobs/image-download/service.yaml
kubectl apply -f jobs/image-resize/service.yaml
# Test the pipeline
nats pub image-download '{"url":"https://picsum.photos/2000/1000"}'