This role focuses on developing models and systems that can reason across multiple modalities, including text, images, video, and audio. You will conduct cutting-edge research to enable AI systems to perceive, interpret, and generate content across diverse data types, contributing to products that impact billions of users worldwide.
Responsibilities
• Conduct research on multi-modal learning, including vision-language models, audio-visual understanding, and cross-modal reasoning
• Develop novel architectures and training methodologies for models that integrate and reason across multiple modalities
• Design and execute experiments to evaluate multi-modal model capabilities and identify areas for improvement
• Publish research findings at top-tier conferences and contribute to Meta's research community
• Collaborate with cross-functional teams to translate research innovations into product applications
• Mentor and guide other researchers on multi-modal AI projects
Minimum Qualifications
• Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
• PhD in Computer Science, Machine Learning, Artificial Intelligence, or a related field
• Experience with multi-modal learning, vision-language models, or cross-modal representation learning demonstrated through publications or projects
• Experience programming in Python and using deep learning frameworks such as PyTorch
• Experience with large-scale model training and distributed computing
Preferred Qualifications
• Experience building end-to-end multi-modal systems from research to production
• Experience with video understanding or audio-visual learning
• Publications at venues such as NeurIPS, ICML, ICLR, CVPR, ACL, or EMNLP focused on multi-modal learning
• Experience with large language models, vision transformers, or foundation models