Staff Machine Learning Engineer - Computer Vision & Multi-Modal AI

Unity Technologies • $130K — $180K *

San Francisco, CA 94112 • In-Person

Consumer Technology

5 - 7 years of experience

Today

Job Overview by Ladders

Qualifications

6+ years in ML engineering, focusing on computer vision and/or multi-modal modeling.
Proven experience with transformer-based and diffusion-based vision models.
Strong command of the entire model lifecycle: data curation, training, fine-tuning, evaluation, and serving.
Familiarity with efficient attention and vision-language alignment techniques.
Expertise in Python and modern deep-learning frameworks like PyTorch.
Proven leadership in technical roles, steering cross-functional teams and mentoring engineers.

Responsibilities

Set the technical vision and roadmap for computer vision and multi-modal AI models.
Drive design and implementation of models for image and video understanding.
Make architectural decisions balancing quality, capability, latency, and cost.
Own the transition from research prototypes to production systems.
Collaborate with research scientists to implement deployable models.
Design scalable systems for processing diverse multimodal inputs.
Define KPIs for model performance and ensure team accountability.

Benefits

Comprehensive health, life, and disability insurance
Commute subsidy
Employee stock ownership
Competitive retirement/pension plans
Generous vacation and personal days
Support for new parents through leave and family-care programs
Mental health and wellbeing programs and support
Training and development programs
Volunteering and donation matching program

Full Job Description

The opportunity
We are building the next generation of AI-driven game experiences - generative world models, neural rendering, and multi-modal understanding that turn images, text, and 3D primitives into interactive worlds. As our Staff Machine Learning Engineer, you will be a core technical leader bringing state-of-the-art computer vision and multi-modal models - transformers, diffusion networks, vision-language models (VLMs), and JEPA-style architectures - from research into robust, production-grade systems.

This is a deeply hands-on, high-impact role. You will help define the modeling and deployment strategy, drive architectural decisions across the ML stack, and mentor a team of senior and mid-level engineers. Your work will directly shape the quality, capability, and performance of AI features experienced by billions of players - across cloud, server, and on-device targets.

What you'll be doing

Technical Leadership

Help set the technical vision and roadmap for computer vision and multi-modal AI models, spanning transformers, diffusion models, vision-language models, and JEPA-style generative architectures.
Drive design and implementation of models for image and video understanding, generation, segmentation, detection, and dense prediction, as well as multi-modal reasoning over images, text, and 3D inputs.
Make sound decisions on model architecture, training strategy, data pipelines, and evaluation - balancing quality, capability, latency, and cost across deployment targets.
Own the path from research prototype to production: training, fine-tuning, distillation, export, and serving, with deployment spanning cloud GPUs through to efficient on-device inference where the product requires it.

Architecture & Research Translation

Collaborate directly with research scientists to translate novel CV and multi-modal model architectures into deployable, well-engineered implementations.
Design scalable systems for multi-modal inference that process diverse inputs - images, video, text, primitives, and metadata - and produce rich outputs from semantic predictions to pixel-level generation.
Track and rapidly adopt breakthroughs across the field: vision-language pretraining and alignment, efficient diffusion (e.g., consistency models, flow matching), efficient attention (e.g., FlashAttention, linear-attention variants), and tokenization/representation learning for vision.
Where latency or device constraints demand it, apply compression, quantization, pruning, and knowledge distillation, and work with appropriate runtimes (e.g., TensorRT, ONNX Runtime, CoreML, TFLite) to meet performance budgets.

Team & Cross-Functional Leadership

Lead and mentor a team of ML engineers; define engineering best practices, code review standards, and rigorous benchmarking and evaluation methodology.
Partner with research, platform engineers, product managers, and runtime teams to align ML capabilities with product roadmaps and target-platform constraints.
Champion a culture of measurement: define KPIs for model quality, accuracy, latency, memory, and cost, and ensure the team tracks them rigorously.

What we're looking for

6+ years in ML engineering, with significant depth in computer vision and/or multi-modal modeling.
Proven production experience with transformer-based and diffusion-based vision models (e.g., ViT, CLIP/SigLIP-style encoders, Stable Diffusion, DETR/SAM-style architectures).
Strong command of the full model lifecycle: data curation, training and fine-tuning, evaluation, and serving at scale.
Familiarity with efficient attention, diffusion samplers, multi-modal fusion, and vision-language alignment techniques.
Strong Python and modern deep-learning tooling (PyTorch); solid software engineering fundamentals.
Track record of technical leadership: setting direction, influencing cross-functional partners, and growing engineers.

You might also have

Experience with world-model, video-generation, or neural rendering pipelines (NeRF, 3DGS, or similar).
Experience deploying models to constrained or on-device targets, including quantization (INT8/INT4/FP16), pruning, distillation, and runtimes such as CoreML, TFLite, ONNX.
Familiarity with mobile SoC accelerators (Apple Neural Engine, Qualcomm Hexagon/Adreno, ARM Mali) or compiler stacks such as MLIR, TVM, or XLA.
Contributions to open-source ML frameworks or peer-reviewed CV/ML research publications.
Background in real-time graphics or game engine pipelines (Metal, Vulkan, OpenGL ES).

Additional information

Relocation support is not available for this position
Work visa/immigration sponsorship is not available for this position

Benefits
At Unity, we want our team members to thrive. We offer a wide range of benefits designed to support well-being and work-life balance.

Please note: Benefits eligibility, specific offerings, and coverage vary based on the country and employment status.

While specific benefits vary, here are some of the ways we strive to take care of our eligible team members globally: Comprehensive health, life, and disability insurance | Commute subsidy | Employee stock ownership | Competitive retirement/pension plans | Generous vacation and personal days | Support for new parents through leave and family-care programs | Office food and snacks | Mental health and wellbeing programs and support | Employee Resource Groups | Global Employee Assistance Program | Training and development programs | Volunteering and donation matching program

About Unity Technologies

Unity Technologies is a software company that provides a platform for creating and operating interactive, real-time 3D content. The company's platform is used by game developers, architects, automotive designers, filmmakers, and other creators to build and distribute interactive experiences. Unity Technologies was founded in 2004 and is headquartered in San Francisco, California. The company has offices in North America, Europe, and Asia.

Size

4,000 employees

Industry

Information Technology

Founded

2004

* Ladders Estimates
