Real-Time Computer Vision Pipeline: Facial Landmark Detection, Transform Calculation, and Projection Mapping

Executive summary: Maedcore developed a real-time computer vision pipeline for dynamic projection mapping onto a moving subject. It combines three parts: a facial landmark detection model that runs at sub-frame latency; a projective transform layer that compensates frame by frame for head rotation, expression changes, and translational movement; and a live compositing system that renders and projects effects synchronised to facial motion with no perceptible lag. The system processes camera input, runs landmark inference, calculates the deformation transform, composites the output frame, and drives the projector, all within a single deterministic loop at 30+ fps. The deployment application is live performance and interactive art, but the architecture applies directly to industrial quality control inspection, robotics guidance systems, and other real-time human-machine vision interfaces.

Client: Built by Maedcore as engineering partner for artist Filip Ćustić, commissioned under the patronage of Espacio SOLO and Onkaos.

The Computer Vision Challenge

Projecting a static image onto a static surface is a solved problem. Projecting dynamically generated visual content onto a moving, deforming surface — a human face — in real time introduces three hard requirements that compound each other:

Sub-frame landmark detection latency. The facial landmark model must complete inference and return updated keypoint coordinates within a single frame cycle (~33 ms at 30 fps). If landmark detection takes longer than one frame, the projection falls behind the subject’s movement and the visual alignment breaks.

Projective transform accuracy under motion. The transform matrix that maps from screen-space effect coordinates to the physical surface of the face must be recalculated on every frame, accounting for:

Translation (the person moves closer, further, or sideways).
Rotation (head tilts, turns, nods).
Non-rigid deformation (facial expression changes the shape of the surface).

A single global homography models a planar surface, so it cannot capture non-rigid deformation. The system requires per-region transforms that follow the 3D curvature of facial geometry.

Single-loop determinism. Camera capture, landmark inference, transform calculation, compositing, and projector output must all complete within one frame cycle. If any stage is delayed by OS scheduling, network latency, or inference variance, the output lags and the visual coherence collapses.
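
The budget arithmetic behind these requirements can be sketched as follows. This is a minimal illustration, not the project's code: the function names are hypothetical, and the per-stage timings are made-up placeholders apart from the 15 ms inference ceiling quoted later in this document.

```python
from typing import Dict


def frame_budget_ms(fps: float) -> float:
    """Time available for one full loop iteration, in milliseconds."""
    return 1000.0 / fps


def remaining_budget_ms(fps: float, stage_ms: Dict[str, float]) -> float:
    """Slack left after all stages; a negative value means a missed deadline."""
    return frame_budget_ms(fps) - sum(stage_ms.values())


# Hypothetical per-stage timings in milliseconds (illustrative, not measured).
stages = {"capture": 4.0, "landmarks": 15.0, "transform": 5.0, "composite": 4.0}

slack = remaining_budget_ms(30.0, stages)  # about 5.3 ms of headroom at 30 fps
```

With these placeholder numbers the loop has roughly 5 ms of slack per frame; any stage that overruns by more than that pushes the output into the next frame cycle.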

System Architecture

The system is structured as a single deterministic processing loop with four stages executing sequentially on each frame:
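
A single sequential loop iteration with a deadline check might look like the sketch below. The stage functions are hypothetical stand-ins that only pass data along; the real stages, their interfaces, and how the system reacts to a missed deadline are not described in this document.

```python
import time
from typing import Any, Callable, Sequence, Tuple

Stage = Callable[[Any], Any]


def run_frame(stages: Sequence[Stage], frame: Any, budget_s: float) -> Tuple[Any, float, bool]:
    """Run every stage in order on one frame.

    Returns (output, elapsed_seconds, deadline_met).
    """
    start = time.perf_counter()
    data = frame
    for stage in stages:
        data = stage(data)  # each stage consumes the previous stage's output
    elapsed = time.perf_counter() - start
    return data, elapsed, elapsed <= budget_s


# Hypothetical placeholder stages: each just records that it ran.
def capture(frame):
    return {"frame": frame, "trace": ["capture"]}


def detect_landmarks(data):
    data["trace"].append("landmarks")
    return data


def compute_transform(data):
    data["trace"].append("transform")
    return data


def composite_and_output(data):
    data["trace"].append("output")
    return data


PIPELINE = [capture, detect_landmarks, compute_transform, composite_and_output]
output, elapsed_s, on_time = run_frame(PIPELINE, frame=0, budget_s=1.0 / 30.0)
```

Because the stages run strictly in order with no queue between them, the measured `elapsed_s` is the end-to-end latency for that frame.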

Stage 1 — Camera Capture and Pre-Processing

A high-frame-rate camera captures the subject at 60 fps, so a fresh frame is available roughly every 16.7 ms, half the ~33 ms output period at 30 fps. Frames are pre-processed (colour space conversion, exposure normalisation) before being passed to the landmark model. Pre-processing is implemented in hardware-accelerated code to stay within the frame budget.
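
As a rough illustration of the two pre-processing steps, the sketch below converts RGB pixels to luma and rescales the frame towards a target mean brightness. It is plain Python for readability, whereas the production code is hardware-accelerated. The BT.601 luma weights, the mean-based gain, and the target value of 128 are assumptions, since the document does not say which conversion or normalisation method was used.

```python
from typing import List, Tuple

RGBFrame = List[List[Tuple[float, float, float]]]
LumaFrame = List[List[float]]


def to_luma(rgb_frame: RGBFrame) -> LumaFrame:
    """Convert RGB pixels to luma using the BT.601 weights."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_frame]


def normalise_exposure(luma: LumaFrame, target_mean: float = 128.0) -> LumaFrame:
    """Scale the frame so its mean brightness approaches target_mean.

    Values are clipped to 255, so very bright pixels can saturate.
    """
    pixels = [p for row in luma for p in row]
    mean = sum(pixels) / len(pixels)
    gain = target_mean / mean if mean > 0 else 1.0
    return [[min(255.0, p * gain) for p in row] for row in luma]


# A tiny, dim 1x2 test frame.
dim_frame = [[(50.0, 50.0, 50.0), (100.0, 100.0, 100.0)]]
normalised = normalise_exposure(to_luma(dim_frame))
```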

Stage 2 — Facial Landmark Detection

The facial landmark model processes each captured frame and returns a set of 60+ keypoints representing the geometric structure of the face: eye corners, nose bridge, nostril positions, lip contour, jaw line, and forehead boundary.

The model runs on a dedicated inference accelerator (GPU or NPU) to isolate its latency from the CPU-bound transform and compositing stages. The model was selected and optimised for:

Inference time < 15 ms on the target hardware.
Keypoint stability — low jitter between consecutive frames under normal head motion.
Robustness — consistent performance across different skin tones, lighting conditions, and face sizes in frame.
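
One simple way to quantify keypoint stability is the mean displacement of landmarks between two consecutive frames while the subject holds still. The sketch below is an illustrative metric only; the document does not state how jitter was measured during model selection, and `mean_jitter` is a hypothetical name.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def mean_jitter(prev: List[Point], curr: List[Point]) -> float:
    """Mean Euclidean displacement (in pixels) between matching keypoints."""
    if not prev or len(prev) != len(curr):
        raise ValueError("keypoint lists must be non-empty and equal in length")
    total = sum(math.hypot(cx - px, cy - py)
                for (px, py), (cx, cy) in zip(prev, curr))
    return total / len(prev)


# Two keypoints: one moved by (3, 4) pixels, one did not move.
jitter = mean_jitter([(0.0, 0.0), (10.0, 10.0)], [(3.0, 4.0), (10.0, 10.0)])
```

On a moving subject the same number also includes real motion, so it is only meaningful as a jitter measure when the face is static or the motion is known.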

Stage 3 — Projective Transform Calculation

Using the 60+ landmark positions from Stage 2, the transform calculation stage computes the mapping from effect-coordinate-space to projector-output-space:

Triangulation: the 60+ keypoints define a mesh of triangles covering the face surface.
Per-triangle affine transform: for each triangle, a local affine transform is calculated mapping from effect coordinates to projector output coordinates. Three vertex correspondences determine an affine transform exactly, so no full homography is needed per triangle.
Warp application: the effect image is warped using the per-triangle transforms, producing a distorted output image in which each region of the effect is aligned to the corresponding region of the face.

This piecewise-linear approach handles the non-rigid deformation of facial expression changes without requiring a full 3D mesh reconstruction.
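
The per-triangle step can be sketched as solving for the six affine coefficients from three vertex correspondences, here with Cramer's rule in plain Python. The function names are hypothetical and this is not the project's implementation; it only shows the underlying calculation for one triangle.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Affine = Tuple[Tuple[float, float, float], Tuple[float, float, float]]


def det3(m: Sequence[Sequence[float]]) -> float:
    """Determinant of a 3x3 matrix."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def _solve_row(src: Sequence[Point], values: Sequence[float]) -> Tuple[float, float, float]:
    """Solve [x y 1] . (p, q, r) = value for the three vertices (Cramer's rule)."""
    base = [[x, y, 1.0] for (x, y) in src]
    d = det3(base)
    if abs(d) < 1e-12:
        raise ValueError("degenerate source triangle (collinear vertices)")
    coeffs: List[float] = []
    for col in range(3):
        m = [row[:] for row in base]
        for r in range(3):
            m[r][col] = values[r]
        coeffs.append(det3(m) / d)
    return coeffs[0], coeffs[1], coeffs[2]


def affine_from_triangles(src: Sequence[Point], dst: Sequence[Point]) -> Affine:
    """Affine transform mapping the three src vertices onto the dst vertices."""
    row_x = _solve_row(src, [p[0] for p in dst])
    row_y = _solve_row(src, [p[1] for p in dst])
    return row_x, row_y


def apply_affine(t: Affine, pt: Point) -> Point:
    """Map one point from effect coordinates to projector coordinates."""
    (a, b, c), (d, e, f) = t
    x, y = pt
    return a * x + b * y + c, d * x + e * y + f


# Hypothetical triangle in effect space and its tracked position on the face.
effect_tri = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
face_tri = [(10.0, 20.0), (12.0, 20.0), (10.0, 23.0)]
T = affine_from_triangles(effect_tri, face_tri)
mapped = apply_affine(T, (0.5, 0.5))  # a point inside the triangle
```

In an OpenCV-based pipeline the same two steps are commonly done with `cv2.getAffineTransform` and `cv2.warpAffine` applied per triangle with a mask; whether this project used those calls is not stated here.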

Stage 4 — Compositing and Projection Output

The warped effect frame is composited against any ambient layer, brightness-corrected for the ambient light level in the environment, and sent to the projector output buffer. The projector renders at 30 fps, half the camera’s 60 fps capture rate.
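
A minimal per-pixel sketch of this stage is shown below, assuming a simple alpha blend between the effect and the ambient layer followed by a scalar brightness gain. Both the blend model and the single gain factor are assumptions made for illustration; the document does not specify how the correction is computed.

```python
from typing import List

Frame = List[List[float]]


def composite_pixel(effect: float, ambient: float, alpha: float, gain: float) -> float:
    """Blend effect over ambient, apply brightness gain, clip to [0, 255]."""
    blended = alpha * effect + (1.0 - alpha) * ambient
    return max(0.0, min(255.0, blended * gain))


def composite_frame(effect: Frame, ambient: Frame, alpha: float, gain: float) -> Frame:
    """Apply composite_pixel to every pixel of two equally sized frames."""
    return [[composite_pixel(e, a, alpha, gain) for e, a in zip(e_row, a_row)]
            for e_row, a_row in zip(effect, ambient)]


# Hypothetical 1x2 greyscale frames; a gain above 1 compensates for a brighter room.
out = composite_frame([[200.0, 50.0]], [[100.0, 50.0]], alpha=0.5, gain=1.2)
```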

Hardware Configuration

The physical system consists of three hardware components positioned relative to each other at calibrated distances:

Component	Specification
Camera	60 fps high-frame-rate, low-motion-blur, hardware-triggered
Projector	High-luminance (3500+ lm) for visibility under ambient light; short-throw lens for close-range operation
Processing unit	GPU-equipped embedded system; dedicated inference accelerator

Camera and projector are co-mounted to maintain a fixed geometric relationship, simplifying the projective transform calibration. System calibration uses a checkerboard target to establish the camera-to-projector homography baseline, which is applied as a pre-correction to every output frame.
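
To illustrate what the calibration baseline represents, the sketch below estimates a 3x3 homography from exactly four point correspondences by solving the standard 8x8 linear system, then applies it to a point. All names and coordinates are hypothetical. A real checkerboard calibration would typically use many detected corners and a least-squares or robust fit (for example OpenCV's `cv2.findHomography`) rather than this four-point solve.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (degenerate point configuration)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def homography_from_4_points(src: Sequence[Point], dst: Sequence[Point]) -> List[List[float]]:
    """Homography H (with H[2][2] fixed to 1) mapping four src points to dst."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        rhs.append(v)
    h = solve_linear(rows, rhs)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(h: List[List[float]], pt: Point) -> Point:
    """Map a camera-space point into projector space (perspective divide)."""
    x, y = pt
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


# Hypothetical corner correspondences between camera and projector images.
cam_pts = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
proj_pts = [(10.0, 10.0), (30.0, 12.0), (28.0, 35.0), (8.0, 30.0)]
H = homography_from_4_points(cam_pts, proj_pts)
```

Because the camera and projector are rigidly co-mounted, a matrix like `H` only needs to be estimated once per installation and can then be reused on every frame.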

Performance Results

Metric	Result
Landmark detection keypoints	60+ per frame
Output frame rate	30 fps sustained
End-to-end pipeline latency	< 33 ms (sub-frame)
Perceptible lag	None under normal subject movement
Lighting robustness	Validated under controlled stage lighting, ambient gallery lighting, and mixed natural light
Subject variability	Validated across multiple face geometries, skin tones, and distances

Technology Applications

The computer vision pipeline Maedcore built for this project follows a pattern common to real-time vision-driven control systems, one that recurs across industrial and enterprise contexts:

Industrial quality control inspection. The landmark-detection-to-transform pipeline has the same structure as a vision system that detects feature positions on a manufactured component and checks whether they fall within tolerance. The real-time frame-rate constraint and the per-region transform calculation are the same engineering problems.

Robotics guidance systems. A robot that tracks a moving target — product on a conveyor, a human collaborator’s hand, a docking port on a vehicle — requires the same sub-frame perception-to-action latency that this projection system achieves.

Augmented reality overlays. Any system that renders virtual content aligned to a real-world surface in real time (AR headsets, industrial AR maintenance guidance, in-situ measurement overlays) uses the same projective transform and compositing architecture.

Real-time human-machine interfaces. The deterministic single-loop architecture — camera to inference to output with no buffering — is the pattern required for any vision-based HMI where response time to human movement is a design constraint.

Technologies Used

Project developed with: Computer Vision — Real-Time AI — Facial Landmark Detection — Projective Transform — Projection Mapping — OpenCV — Embedded Vision — GPU Inference — Mechatronics

Building a Real-Time Computer Vision System?

Maedcore engineered a complete real-time CV pipeline — perception, inference, transform, output — meeting sub-frame latency requirements on embedded hardware. If you need a computer vision system for inspection, guidance, HMI, or any real-time visual application, request a technical consultation.
