Real-Time Head Pose Estimation with MobileNet and ResNet
This project estimates head orientation from images, videos, or webcam input. For each detected face, the model predicts three Euler angles: yaw, pitch, and roll. The code is available at github.com/yakhyo/head-pose-estimation.
The implementation builds on 6DRepNet and extends it with more pretrained backbones, updated weights, SCRFD-based face detection, ONNX export, ONNX Runtime inference, and distributed training support.
What the Model Predicts
Head pose is represented with three angles:
| Angle | Meaning |
|---|---|
| Yaw | left-right head rotation |
| Pitch | up-down head rotation |
| Roll | sideways head tilt |
The output is useful when a system needs a coarse estimate of attention or orientation. Common examples include driver monitoring, classroom attention analysis, video meeting tools, AR/VR interaction, and accessibility interfaces.
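To make the three angles concrete, the sketch below combines them into a single 3x3 rotation matrix. The axis assignment and composition order used here (pitch about x, yaw about y, roll about z, applied in that order) are assumptions chosen for illustration; the repository may use a different convention, and `euler_to_rotation_matrix` is a hypothetical helper name.

```python
import numpy as np


def euler_to_rotation_matrix(yaw_deg, pitch_deg, roll_deg):
    """Build a 3x3 rotation matrix from Euler angles given in degrees.

    Assumed convention (not taken from the repository): pitch rotates about
    the x axis, yaw about the y axis, roll about the z axis, and they are
    applied in the order pitch, then yaw, then roll.
    """
    pitch, yaw, roll = np.radians([pitch_deg, yaw_deg, roll_deg])

    rot_x = np.array([
        [1.0, 0.0, 0.0],
        [0.0, np.cos(pitch), -np.sin(pitch)],
        [0.0, np.sin(pitch), np.cos(pitch)],
    ])
    rot_y = np.array([
        [np.cos(yaw), 0.0, np.sin(yaw)],
        [0.0, 1.0, 0.0],
        [-np.sin(yaw), 0.0, np.cos(yaw)],
    ])
    rot_z = np.array([
        [np.cos(roll), -np.sin(roll), 0.0],
        [np.sin(roll), np.cos(roll), 0.0],
        [0.0, 0.0, 1.0],
    ])
    # The rightmost matrix is applied first: pitch, then yaw, then roll.
    return rot_z @ rot_y @ rot_x


rotation = euler_to_rotation_matrix(yaw_deg=30.0, pitch_deg=-10.0, roll_deg=5.0)
# A valid rotation matrix is orthonormal, so R @ R.T should be the identity.
print(np.allclose(rotation @ rotation.T, np.eye(3)))
```

Under this convention, a pure yaw of 90 degrees maps the forward axis (0, 0, 1) onto (1, 0, 0), which is a quick way to sanity-check sign conventions when comparing against another implementation.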
Model Families
The repository provides both ResNet and MobileNet backbones.
| Backbone | Main tradeoff |
|---|---|
| ResNet-18 | smallest ResNet option; highest error among the ResNets |
| ResNet-34 | better accuracy than ResNet-18 at nearly twice the size |
| ResNet-50 | strongest reported accuracy, largest model overall |
| MobileNet V2 | compact model for lower-resource inference; best accuracy among the MobileNets |
| MobileNet V3 small | smallest model overall, lowest accuracy |
| MobileNet V3 large | largest MobileNet option; more accurate than V3 small but less accurate than V2 |
The README provides both PyTorch and ONNX weights for each of these models.
Evaluation on AFLW2000
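The reported results are evaluated on AFLW2000. The Yaw, Pitch, and Roll columns are per-angle mean absolute errors in degrees, and the MAE column is the mean of those three values. Lower is better.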
| Model | Size | Yaw | Pitch | Roll | MAE |
|---|---|---|---|---|---|
| ResNet-18 | 43 MB | 4.5027 | 5.8261 | 4.2188 | 4.8492 |
| ResNet-34 | 81.6 MB | 4.4538 | 5.2690 | 3.8855 | 4.5361 |
| ResNet-50 | 91.3 MB | 3.5529 | 4.9962 | 3.4986 | 4.0159 |
| MobileNet V2 | 9.59 MB | 5.6880 | 6.0391 | 4.4433 | 5.3901 |
| MobileNet V3 small | 6 MB | 8.6926 | 7.7089 | 6.0035 | 7.4683 |
| MobileNet V3 large | 17 MB | 5.6068 | 6.6022 | 4.9959 | 5.7350 |
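ResNet-50 gives the best reported MAE (4.0159) at 91.3 MB. MobileNet V2 is roughly a tenth of that size (9.59 MB), but its MAE rises to 5.3901. That is the central tradeoff in this repository: accuracy versus runtime and model size. Note that a larger model is not always more accurate: MobileNet V3 large (17 MB) reports a higher MAE than MobileNet V2.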
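The MAE column and the size tradeoff can be checked directly from the table above. The sketch below recomputes each MAE as the mean of the three per-angle errors and then picks the most accurate model that fits a size budget. The numbers are copied from the table; `REPORTED`, `mean_angle_error`, and `best_under_budget` are hypothetical names introduced for this example.

```python
# name -> (size in MB, yaw error, pitch error, roll error), copied from the table.
REPORTED = {
    "ResNet-18": (43.0, 4.5027, 5.8261, 4.2188),
    "ResNet-34": (81.6, 4.4538, 5.2690, 3.8855),
    "ResNet-50": (91.3, 3.5529, 4.9962, 3.4986),
    "MobileNet V2": (9.59, 5.6880, 6.0391, 4.4433),
    "MobileNet V3 small": (6.0, 8.6926, 7.7089, 6.0035),
    "MobileNet V3 large": (17.0, 5.6068, 6.6022, 4.9959),
}


def mean_angle_error(yaw, pitch, roll):
    """Average the three per-angle errors into a single MAE value."""
    return (yaw + pitch + roll) / 3.0


def best_under_budget(budget_mb):
    """Return the lowest-MAE model whose size fits the budget, or None."""
    candidates = [
        (mean_angle_error(yaw, pitch, roll), name)
        for name, (size_mb, yaw, pitch, roll) in REPORTED.items()
        if size_mb <= budget_mb
    ]
    return min(candidates)[1] if candidates else None


for name, (size_mb, yaw, pitch, roll) in REPORTED.items():
    print(f"{name:<20} {size_mb:>6.2f} MB  MAE {mean_angle_error(yaw, pitch, roll):.4f}")

print(best_under_budget(20.0))
print(best_under_budget(100.0))
```

Model size is only a proxy here. The table does not report latency, so a real selection would also measure inference time on the target hardware.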
Training and Data
The models are trained on 300W-LP and evaluated on AFLW2000. The project also supports multi-GPU training through PyTorch distributed execution.
SCRFD is used as the face detector in the inference pipeline. That keeps face localization separate from pose estimation: first detect a face, then estimate head orientation on the crop.
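The detect-then-estimate split described above can be sketched as a small function that takes the two stages as callables. This is an illustration of the control flow only, not the repository's code: `estimate_head_poses`, `detect_faces`, and `estimate_pose` are hypothetical names, and the assumed box format is `(x1, y1, x2, y2)` in pixels.

```python
import numpy as np


def estimate_head_poses(frame, detect_faces, estimate_pose):
    """Run a two-stage pipeline: face detection, then pose on each crop.

    Assumptions for this sketch:
      - frame is an H x W x 3 image array
      - detect_faces(frame) yields (x1, y1, x2, y2) pixel boxes
      - estimate_pose(crop) returns (yaw, pitch, roll)
    """
    height, width = frame.shape[:2]
    results = []
    for x1, y1, x2, y2 in detect_faces(frame):
        # Clamp the box to the image so slicing never goes out of bounds.
        x1, y1 = max(0, int(x1)), max(0, int(y1))
        x2, y2 = min(width, int(x2)), min(height, int(y2))
        if x2 <= x1 or y2 <= y1:
            continue  # skip boxes that are empty after clamping
        crop = frame[y1:y2, x1:x2]
        results.append(((x1, y1, x2, y2), estimate_pose(crop)))
    return results


# Stand-in stages so the sketch runs without any model weights.
def fake_detector(frame):
    return [(40, 30, 120, 130)]


def fake_estimator(crop):
    return (0.0, 0.0, 0.0)


frame = np.zeros((240, 320, 3), dtype=np.uint8)
print(estimate_head_poses(frame, fake_detector, fake_estimator))
```

Keeping the stages behind plain callables means either one can be swapped, for example a different detector, without touching the pose model.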
PyTorch and ONNX
The repository includes PyTorch weights for training and experimentation, plus ONNX weights for lighter deployment. ONNX is especially useful when the application does not need a full PyTorch runtime, because the exported model can be served with ONNX Runtime alone.
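As a rough illustration of what PyTorch-free inference looks like, the sketch below loads an ONNX file with ONNX Runtime and runs one face crop through it. Several details are assumptions rather than facts about this repository: the input is taken to be a 224x224 RGB crop, normalized with ImageNet mean and standard deviation, in NCHW layout. `preprocess` and `run_onnx` are hypothetical helper names, and the meaning of the output tensor depends on how the model was exported, so check the README before interpreting it.

```python
import numpy as np

# Assumed normalization constants (standard ImageNet statistics).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def preprocess(face_rgb):
    """Convert an H x W x 3 uint8 RGB crop to a 1 x 3 x H x W float32 tensor.

    The caller is expected to have resized the crop already (224x224 is
    assumed here, not confirmed by the repository).
    """
    if face_rgb.ndim != 3 or face_rgb.shape[2] != 3:
        raise ValueError("expected an H x W x 3 image")
    scaled = face_rgb.astype(np.float32) / 255.0
    normalized = (scaled - IMAGENET_MEAN) / IMAGENET_STD
    # HWC -> CHW, then add the batch dimension.
    batched = np.transpose(normalized, (2, 0, 1))[np.newaxis, ...]
    return np.ascontiguousarray(batched, dtype=np.float32)


def run_onnx(model_path, face_rgb):
    """Run a single crop through an ONNX model and return its first output."""
    # Imported here so preprocessing can be used without onnxruntime installed.
    import onnxruntime as ort

    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    input_name = session.get_inputs()[0].name
    outputs = session.run(None, {input_name: preprocess(face_rgb)})
    return outputs[0]


dummy_crop = np.zeros((224, 224, 3), dtype=np.uint8)
print(preprocess(dummy_crop).shape)
```

With a downloaded weight file, a call would look like `run_onnx("path/to/model.onnx", crop)`, where the path is a placeholder. In a real application the session should be created once and reused across frames rather than rebuilt on every call.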
Available released weight formats:
| Model | PyTorch | ONNX |
|---|---|---|
| ResNet-18 | yes | yes |
| ResNet-34 | yes | yes |
| ResNet-50 | yes | yes |
| MobileNet V2 | yes | yes |
| MobileNet V3 small | yes | yes |
| MobileNet V3 large | yes | yes |
The repository README has the current weight links and usage details.