Real-Time Gaze Estimation Using Lightweight Deep Learning Models
This project focuses on predicting gaze direction using lightweight deep learning models optimized for real-time performance on mobile devices. The implementation combines classification and regression techniques to create an efficient and accurate solution suitable for deployment on resource-constrained hardware.
Applications and Use Cases
Gaze estimation technology enables a wide range of applications across multiple domains:
- Mobile User Experience: Hands-free navigation and attention-aware interfaces
- Virtual and Augmented Reality: Natural interaction through eye tracking in VR/AR systems
- Accessibility: Assistive technologies for users with limited mobility
- Automotive Safety: Driver attention monitoring and drowsiness detection
- Human-Computer Interaction: Intuitive control mechanisms for various devices
- Market Research: Understanding user attention patterns and visual behavior
Model Architecture and Design
The project implements multiple lightweight architectures, each optimized for different deployment scenarios:
ResNet Variants
ResNet variants employ residual learning to train deeper networks without degradation. The skip connections give gradients a direct path through every block during training, improving accuracy with little additional computational overhead.
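The core idea can be sketched in a few lines of numpy. This is an illustrative toy, not the repository's model code: the matrix multiply stands in for a convolutional layer, and `F` is the learned residual function.

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_block(x, weight):
    """Compute y = x + F(x), where F stands in for conv + ReLU.

    Because of the identity skip, dy/dx = I + dF/dx, so gradients
    always have a direct path back through the block.
    """
    fx = np.maximum(weight @ x, 0.0)  # toy "learned" transform F(x)
    return x + fx

x = rng.standard_normal(8)
w = rng.standard_normal((8, 8)) * 0.1
y = residual_block(x, w)

# With zero weights the block reduces to the identity mapping, which is
# why stacking extra residual blocks cannot make the network worse than
# its shallower counterpart.
assert np.allclose(residual_block(x, np.zeros((8, 8))), x)
```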
MobileNet v2
Specifically designed for mobile deployment, MobileNet v2 introduces inverted residual structures and linear bottlenecks. This architecture achieves an optimal balance between model size, inference speed, and accuracy, making it ideal for on-device gaze estimation.
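The efficiency gain is easy to see by counting weights. The sketch below compares a standard convolution against a depthwise-separable one and a MobileNet v2 inverted-residual block (expansion factor `t=6`); biases and batch-norm parameters are omitted for simplicity. Note the inverted-residual block is not cheaper than a plain 3x3 conv at equal width; its savings come from being applied at narrow bottleneck channel counts.

```python
def standard_conv_params(k, c_in, c_out):
    # A k x k standard convolution mixes every input channel at every tap.
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # Depthwise k x k (one filter per channel) + 1x1 pointwise mixing.
    return k * k * c_in + c_in * c_out

def inverted_residual_params(c_in, c_out, t=6, k=3):
    # MobileNet v2 block: 1x1 expand -> k x k depthwise -> 1x1 linear project.
    c_mid = t * c_in
    return c_in * c_mid + k * k * c_mid + c_mid * c_out

c_in = c_out = 64
std = standard_conv_params(3, c_in, c_out)        # 36,864 weights
sep = depthwise_separable_params(3, c_in, c_out)  # 4,672 weights (~8x fewer)
inv = inverted_residual_params(c_in, c_out)
```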
MobileOne (s0-s4)
The MobileOne family represents the state of the art in mobile-optimized architectures. With variants ranging from s0 to s4, it offers flexibility in trading off speed against accuracy. Its train-time multi-branch blocks are re-parameterized into plain convolutions for inference, and the architecture is tuned for mobile CPUs, delivering real-time performance without GPU acceleration.
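MobileOne's re-parameterization trick relies on a simple linear-algebra fact: parallel linear branches summed together are equivalent to a single layer whose weights are the sum of the branch weights. The numpy sketch below demonstrates the fusion on stand-in matrices (real MobileOne also folds batch-norm statistics into the fused weights, which is omitted here).

```python
import numpy as np

rng = np.random.default_rng(1)
c = 16
w1 = rng.standard_normal((c, c))   # train-time branch 1 (stand-in for a conv)
w2 = rng.standard_normal((c, c))   # train-time branch 2
identity = np.eye(c)               # skip connection

def multi_branch(x):
    # Training-time forward pass: three parallel branches, summed.
    return w1 @ x + w2 @ x + identity @ x

# At export time, fold every branch into one inference-time weight.
w_fused = w1 + w2 + identity

x = rng.standard_normal(c)
# The fused single-layer model is numerically identical to the
# multi-branch one, but runs as a single matrix multiply.
assert np.allclose(multi_branch(x), w_fused @ x)
```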
Face Detection Integration
The system integrates SCRFD (Sample and Computation Redistribution for Efficient Face Detection) for robust face localization. SCRFD provides:
- Fast inference suitable for real-time applications
- High accuracy across various face scales and poses
- Efficient resource utilization for mobile deployment
- Reliable performance in challenging lighting conditions
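Detectors like SCRFD emit many overlapping candidate boxes per face, which are pruned with non-maximum suppression before the pipeline continues. The SCRFD-specific anchor decoding lives in the repository; the sketch below shows only the generic greedy NMS step in numpy.

```python
import numpy as np

def iou(box, boxes):
    """IoU of one (x1, y1, x2, y2) box against an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.4):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]       # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Drop remaining boxes that overlap the kept one too much.
        order = rest[iou(boxes[i], boxes[rest]) <= iou_thresh]
    return keep

# Two near-duplicate detections of one face plus one distinct face.
boxes = np.array([[10, 10, 50, 50], [12, 11, 52, 49], [100, 100, 140, 150]], float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)  # → [0, 2]
```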
Technical Implementation
The gaze estimation pipeline consists of several stages:
1. Face Detection: SCRFD localizes faces in the input frame
2. Face Alignment: Detected faces are normalized to a standard pose
3. Eye Region Extraction: Precise localization of eye regions for gaze prediction
4. Gaze Prediction: Deep learning model estimates gaze direction as pitch and yaw angles
5. Temporal Smoothing: Optional filtering to reduce jitter in video streams
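The last two stages can be sketched concisely. The pitch/yaw-to-vector convention below is an assumption (camera looking along -z); check it against the convention the trained model actually uses. The smoother is a plain exponential moving average, one common choice for the optional filtering step.

```python
import numpy as np

def gaze_vector(pitch, yaw):
    """Convert pitch/yaw (radians) into a unit 3D gaze direction.

    Assumed convention: camera looks along -z, yaw rotates left/right,
    pitch rotates up/down. Verify against your model's output spec.
    """
    return np.array([
        -np.cos(pitch) * np.sin(yaw),
        -np.sin(pitch),
        -np.cos(pitch) * np.cos(yaw),
    ])

class EmaSmoother:
    """Exponential moving average to damp frame-to-frame jitter."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha   # higher alpha -> less smoothing, less lag
        self.state = None

    def update(self, value):
        value = np.asarray(value, dtype=float)
        if self.state is None:
            self.state = value
        else:
            self.state = self.alpha * value + (1 - self.alpha) * self.state
        return self.state

v = gaze_vector(0.1, -0.2)
assert np.isclose(np.linalg.norm(v), 1.0)  # always a unit vector
```

Feeding each frame's `gaze_vector` through one `EmaSmoother` instance trades a small amount of latency for a visibly steadier gaze cursor.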
Performance Characteristics
The implementation achieves:
- Real-time inference (30+ FPS) on modern mobile devices
- Low latency suitable for interactive applications
- Minimal battery impact through efficient computation
- Robust performance across different lighting conditions and head poses
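A throughput claim like "30+ FPS" is easy to verify on a target device with a small timing harness. The sketch below is generic (not the repository's benchmark script); replace the stand-in callable with the full detection-plus-gaze pipeline.

```python
import time

def measure_fps(infer, n_warmup=10, n_iters=100):
    """Time an inference callable and report throughput in frames per second."""
    for _ in range(n_warmup):      # warm up caches and lazy initialization
        infer()
    start = time.perf_counter()
    for _ in range(n_iters):
        infer()
    elapsed = time.perf_counter() - start
    return n_iters / elapsed

# Stand-in workload; swap in your pipeline's per-frame forward pass.
fps = measure_fps(lambda: sum(range(1000)))
```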
The complete implementation, including training scripts, pre-trained models, and deployment examples, is available on GitHub.