As a Research Engineer focused on Multi-Modal Understanding, you will develop advanced algorithms that integrate computer vision with other modalities such as language, audio, and sensor data. You will also drive the curation of multi-modal datasets and ground truth annotation pipelines to support model training and evaluation. You will work closely with our research team to bring innovative multi-modal solutions to production, bridging the gap between visual perception and holistic contextual understanding for immersive applications.
Responsibilities
• Design and implement multi-modal understanding systems that combine vision, language, and other sensory inputs to enable richer contextual awareness
• Develop algorithms for cross-modal learning, fusion, and reasoning to improve human-AI interaction
• Lead the curation and management of multi-modal datasets, ensuring data quality and diversity across vision, language, and sensor modalities
• Design and oversee ground truth annotation workflows and quality assurance processes for multi-modal data
• Complete medium to large features spanning multiple tasks independently with minimal to no guidance
• Collaborate with researchers and engineers across computer vision and machine learning teams to drive multi-modal innovation
• Develop well-organized code with proper testing and documentation, building production-ready multi-modal systems
Minimum Qualifications
• Currently has, or is in the process of obtaining, a Bachelor's degree in Computer Science, Computer Engineering, a relevant technical field, or equivalent practical experience. Degree must be completed prior to joining Meta
• Proven experience with C++ and/or Python, including modern language features
• Experience working with deep learning frameworks such as PyTorch and TensorFlow
• Demonstrated experience working collaboratively in cross-functional teams
Preferred Qualifications
• Master's degree in Computer Science, Computer Vision, Machine Learning, or related field
• Experience with vision-language models or multi-modal transformers
• Publications or contributions to multi-modal understanding research
• Familiarity with large language models and their integration with visual understanding systems
• Experience with data curation, annotation tools, or ground truth labeling pipelines