Building Enterprise Computer Vision Systems: From Research to Production
Introduction
Deploying computer vision systems into production is a fundamentally different challenge from training a model that performs well on a benchmark. At PT Bagus Harapan Tritunggal, I had the opportunity to build two distinct production CV systems: an Enterprise Face Recognition API for biometric verification and a Fish Recognition System for KNMP (National Fisheries Management Council). This post captures the key lessons and architectural decisions from both projects.
The Gap Between Research and Production
A model with 99% accuracy on your test set can fail spectacularly in production due to:
- Distribution shift — real-world images differ from training data (lighting, angle, occlusion)
- Adversarial inputs — deliberate attempts to fool the system (especially critical for biometrics)
- Latency constraints — users expect responses in milliseconds, not seconds
- Throughput requirements — handling concurrent requests at scale
Face Recognition: Security-Critical Deployment
The face recognition system required not just high accuracy, but anti-spoofing capabilities — the ability to distinguish a live face from a photograph or video replay attack. This added an entire layer of complexity to the pipeline.
```python
import numpy as np
from insightface.app import FaceAnalysis
from ultralytics import YOLO

# Decision thresholds (example values; tuned on a held-out validation set)
LIVENESS_THRESHOLD = 0.8
SIMILARITY_THRESHOLD = 0.6


class FaceVerificationPipeline:
    def __init__(self):
        # Detection: Ultralytics YOLO for face detection
        self.detector = YOLO('face_detection.pt')
        # Recognition: InsightFace for face embeddings
        self.recognizer = FaceAnalysis(providers=['CUDAExecutionProvider'])
        self.recognizer.prepare(ctx_id=0, det_size=(640, 640))
        # Liveness: MediaPipe landmark detection + custom anti-spoof model
        self.liveness_checker = LivenessDetector()

    def verify(self, image: np.ndarray, enrolled_embedding: np.ndarray) -> dict:
        # Step 1: Detect faces; results[0].boxes holds detections for this image
        detections = self.detector(image)[0].boxes
        if len(detections) == 0:
            return {"status": "no_face_detected"}

        # Step 2: Liveness check — reject spoofing attempts
        liveness_score = self.liveness_checker.predict(image, detections[0])
        if liveness_score < LIVENESS_THRESHOLD:
            return {"status": "spoof_detected", "confidence": liveness_score}

        # Step 3: Extract embedding and compare via cosine similarity
        face_embedding = self.recognizer.get(image)[0].embedding
        similarity = np.dot(face_embedding, enrolled_embedding) / (
            np.linalg.norm(face_embedding) * np.linalg.norm(enrolled_embedding)
        )
        return {
            "status": "verified" if similarity > SIMILARITY_THRESHOLD else "not_matched",
            "similarity": float(similarity),
            "liveness_score": float(liveness_score),
        }
```

Fish Recognition: Domain-Specific Data Challenges
The fish classification system presented different challenges. The primary hurdle was data scarcity — getting labeled images of specific fish species in Indonesian waters at sufficient volume and quality.
Our approach:
- Data collection — partnered with KNMP to gather images from field surveys
- Data augmentation — extensive augmentation pipeline (rotation, flipping, color jitter, Cutout) to artificially expand the dataset; a sketch follows this list
- Transfer learning — fine-tuned a pre-trained EfficientNet-B4 on our domain-specific dataset
- Active learning — deployed a confidence-threshold-based system to flag uncertain predictions for human review
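As a concrete reference for the augmentation item above, here is a minimal sketch of what such a pipeline could look like with torchvision. The transform choices and parameter values are illustrative assumptions rather than our exact production settings, and torchvision's RandomErasing stands in for Cutout:

```python
from torchvision import transforms

# Illustrative augmentation pipeline; parameters are assumptions,
# not the exact production values
train_transforms = transforms.Compose([
    transforms.RandomRotation(degrees=20),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
    # RandomErasing is torchvision's built-in Cutout-style occlusion
    transforms.RandomErasing(p=0.25, scale=(0.02, 0.2)),
])
```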
```python
from collections import Counter

import timm
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


# Training with class imbalance handling
class FishClassificationTrainer:
    def __init__(self, num_classes: int):
        self.model = timm.create_model(
            'efficientnet_b4',
            pretrained=True,
            num_classes=num_classes
        )

    def compute_class_weights(self, dataset) -> torch.Tensor:
        """Handle class imbalance with inverse-frequency loss weights."""
        # Assumes integer labels 0..num_classes-1 on the dataset
        class_counts = Counter(dataset.labels)
        total = sum(class_counts.values())
        weights = [total / (len(class_counts) * class_counts[i])
                   for i in range(len(class_counts))]
        return torch.FloatTensor(weights)

    def train(self, train_loader, val_loader, epochs=50):
        class_weights = self.compute_class_weights(train_loader.dataset)
        criterion = nn.CrossEntropyLoss(weight=class_weights.to(device))
        optimizer = torch.optim.AdamW(
            self.model.parameters(), lr=1e-4, weight_decay=0.01
        )
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
            optimizer, T_max=epochs
        )
        # ... training loop
```

Serving Architecture
Both systems needed to be served as REST APIs with low latency. The architecture we landed on:
```
                    ┌─────────────────────────┐
                    │      Load Balancer      │
                    └────────────┬────────────┘
                                 │
          ┌──────────────────────┼──────────────────────┐
          ▼                      ▼                      ▼
┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐
│    Django API    │   │    Django API    │   │    Django API    │
│    (Worker 1)    │   │    (Worker 2)    │   │    (Worker 3)    │
└─────────┬────────┘   └─────────┬────────┘   └─────────┬────────┘
          │                      │                      │
          └──────────────────────┼──────────────────────┘
                                 │
          ┌──────────────────────┼──────────────────────┐
          ▼                      ▼                      ▼
┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐
│   Model Server   │   │    PostgreSQL    │   │      Redis       │
│   (TorchServe)   │   │   (Embeddings)   │   │     (Cache)      │
└──────────────────┘   └──────────────────┘   └──────────────────┘
```

Key decisions:
- TorchServe for dedicated model inference — separates the ML runtime from the Django app, enabling independent scaling
- Redis for caching face embeddings — embedding lookup on every request would be prohibitively slow
- PostgreSQL with pgvector for scalable similarity search across enrolled face embeddings (query sketch below)
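To illustrate the pgvector piece, here is a hedged sketch of a similarity query. The table and column names (face_embeddings, embedding, user_id) are hypothetical, not our actual schema; `<=>` is pgvector's cosine-distance operator.

```python
import numpy as np
import psycopg2

def find_nearest_embeddings(conn, query_embedding: np.ndarray, top_k: int = 5):
    """Return the enrolled identities closest to the query vector."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT user_id, embedding <=> %s::vector AS distance
            FROM face_embeddings      -- hypothetical table name
            ORDER BY distance
            LIMIT %s
            """,
            (str(query_embedding.tolist()), top_k),
        )
        return cur.fetchall()
```

An IVFFlat or HNSW index on the embedding column keeps this query fast as the number of enrolled users grows.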
Performance Optimizations
Model Quantization
For the face recognition model, INT8 quantization reduced model size by ~4x and inference time by ~2x with minimal accuracy degradation:
```python
import torch.quantization

# Post-training static quantization (eager mode); assumes `model` wraps
# its quantized region in QuantStub/DeQuantStub modules
model.eval()
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
torch.quantization.prepare(model, inplace=True)

# Calibration: run representative batches so the observers inserted by
# prepare() can record activation ranges
with torch.no_grad():
    for images, _ in calibration_loader:
        model(images)

torch.quantization.convert(model, inplace=True)
```

Batch Processing
For the fish recognition API, requests are batched using an async queue to amortize GPU overhead:
```python
import asyncio
import time

import numpy as np


class BatchInferenceQueue:
    def __init__(self, max_batch_size=32, max_wait_ms=50):
        self.queue = asyncio.Queue()
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms

    async def predict(self, image: np.ndarray) -> dict:
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((image, future))
        return await future

    async def process_batches(self):
        while True:
            # Block until the first request arrives, then keep collecting
            # until the batch is full or the wait deadline expires
            batch = [await self.queue.get()]
            deadline = time.monotonic() + self.max_wait_ms / 1000
            while len(batch) < self.max_batch_size:
                timeout = deadline - time.monotonic()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(
                        self.queue.get(), timeout=timeout
                    ))
                except asyncio.TimeoutError:
                    break

            images, futures = zip(*batch)
            # `model` is assumed to expose a batched predict; one forward
            # pass serves every request in the batch
            results = model.predict_batch(np.stack(images))
            for future, result in zip(futures, results):
                future.set_result(result)
```

Monitoring in Production
Both systems are monitored via:
- Prometheus + Grafana for latency percentiles (p50, p95, p99) and throughput; instrumentation is sketched after this list
- Confidence score distribution tracking — a drop in average confidence often signals data drift
- Rejection rate monitoring — for the face recognition system, a spike in anti-spoof rejections can indicate an attack attempt
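A minimal sketch of that instrumentation with the official prometheus_client library; metric names and bucket boundaries here are illustrative, not our production configuration:

```python
from prometheus_client import Counter, Histogram, start_http_server

# Metric names and buckets are illustrative
INFERENCE_LATENCY = Histogram(
    'inference_latency_seconds',
    'End-to-end inference latency',
    buckets=(0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0),
)
CONFIDENCE_SCORE = Histogram(
    'prediction_confidence',
    'Model confidence; a leftward shift in this distribution suggests drift',
    buckets=tuple(i / 10 for i in range(1, 11)),
)
SPOOF_REJECTIONS = Counter(
    'antispoof_rejections_total',
    'Requests rejected by the liveness check',
)

start_http_server(9100)  # exposes /metrics for Prometheus to scrape

# Inside a request handler (hypothetical pipeline object):
# with INFERENCE_LATENCY.time():
#     result = pipeline.verify(image, enrolled_embedding)
# CONFIDENCE_SCORE.observe(result.get('similarity', 0.0))
# if result['status'] == 'spoof_detected':
#     SPOOF_REJECTIONS.inc()
```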
Lessons Learned
- Liveness detection is non-negotiable for biometrics — without it, a printed photo can fool even a highly accurate recognition system
- Data quality beats model complexity — 1000 high-quality, diverse training images outperform 10,000 noisy ones
- Separate your ML runtime from your API — Django is excellent for the business logic; TorchServe handles the model serving
- Cache aggressively — face embeddings, model outputs for known images, and preprocessing results can all be cached (see the sketch after this list)
- Build with confidence thresholds, not binary outputs — returning a confidence score enables the application layer to handle edge cases gracefully
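On the caching point, a minimal sketch of embedding caching with Redis; the key scheme and TTL here are assumptions for illustration:

```python
import numpy as np
import redis

cache = redis.Redis(host='localhost', port=6379)

CACHE_TTL_SECONDS = 3600  # illustrative TTL


def cache_embedding(user_id: str, embedding: np.ndarray) -> None:
    """Store the embedding as raw float32 bytes under a hypothetical key scheme."""
    cache.set(
        f'embedding:{user_id}',
        embedding.astype(np.float32).tobytes(),
        ex=CACHE_TTL_SECONDS,
    )


def get_cached_embedding(user_id: str) -> np.ndarray | None:
    """Fetch an enrolled embedding, returning None on a cache miss."""
    raw = cache.get(f'embedding:{user_id}')
    if raw is None:
        return None
    return np.frombuffer(raw, dtype=np.float32)
```

This avoids a database round trip on every verification request; a cache miss falls back to PostgreSQL and repopulates the cache.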
This post is based on my work at PT Bagus Harapan Tritunggal building production computer vision systems.