Research Scientist - Vision-Language Modeling

Epsilon Health

$130K - $180K *
Healthcare
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • 6+ years of experience in vision-language modeling or multimodal learning
  • Expertise in training large VLMs (e.g., LLaVA, Flamingo)
  • Strong background in post-training techniques like DPO and RLHF
  • Proven ability to adapt complex models for new applications
  • Proficiency in PyTorch or JAX, experience with distributed training
  • Experience with medical imaging for radiology report generation
  • Solid software engineering skills for production-quality code

Responsibilities

  • Design and train multimodal foundation models for radiology
  • Implement advanced post-training strategies to enhance model accuracy
  • Research inference-time compute scaling techniques to improve diagnostic performance
  • Develop capabilities for grounded report generation in medical imaging
  • Create evaluation frameworks for assessing medical text quality
  • Engage in all aspects of model development from curation to deployment
  • Stay updated on research in vision-language modeling and medical AI

Benefits

  • Opportunity to work with one of the largest medical imaging datasets
  • Collaborative environment fostering research and technical excellence
  • Engagement with cutting-edge AI developments in healthcare
  • Contribution to publications and best practices in the field
  • Focus on impactful work in clinical settings
Full Job Description
Role Overview

We're seeking a Research Scientist with deep expertise in vision-language modeling to join our ML team. You'll be at the forefront of developing and deploying state-of-the-art multimodal models for clinical use in radiology settings. This role focuses on training and fine-tuning vision-language models (VLMs) that can generate accurate and grounded radiology reports across multiple imaging modalities, including X-ray, CT, and MRI. You'll work with one of the largest and most diverse medical imaging datasets in the industry, advancing the state of the art in grounded medical report generation, model alignment, and inference-time reasoning while maintaining the clinical rigor required for healthcare deployment.

Key Responsibilities
  • Design, train, and scale vision-language foundation models for radiology applications.
  • Develop and implement advanced post-training strategies including preference optimization (DPO, IPO, KTO), reinforcement learning from human feedback (RLHF), and other alignment techniques to improve clinical accuracy and reduce hallucinations.
  • Research and deploy inference-time compute scaling techniques such as chain-of-thought reasoning, self-refinement, and test-time training to enhance model performance on complex diagnostic cases.
  • Pioneer grounded report generation capabilities, enabling models to spatially localize findings within medical images using bounding boxes or segmentation masks.
  • Design rigorous evaluation frameworks that assess generated reports for medical accuracy and writing style.
  • Contribute hands-on to all stages of model development including dataset curation, architecture design, distributed training, post-training optimization, and production deployment.
  • Stay current with cutting-edge research in vision-language modeling, medical AI, and model alignment techniques.
  • Drive research and technical excellence through conference publications and technical blog posts, establishing best practices for training robust medical VLMs at scale.


Qualifications
  • 6+ years of academic or industry experience in vision-language modeling, multimodal learning, or related fields
  • Deep expertise in training and fine-tuning large vision-language models (e.g., LLaVA, Flamingo, CogVLM, Qwen-VL, or similar architectures)
  • Strong foundation in modern post-training techniques including:
    • Preference optimization methods (DPO, IPO, ORPO, KTO)
    • RLHF and reward modeling
    • Inference-time compute scaling and reasoning strategies
    • Constitutional AI and other alignment techniques
  • Track record of implementing complex models from research papers and adapting them to new domains
  • Proficiency in PyTorch or JAX, with experience training large models on multi-GPU/distributed systems
  • Experience with autoregressive language modeling and instruction tuning
  • Hands-on experience with medical imaging applications, particularly radiology report generation
  • Strong software engineering skills and ability to write production-quality code

Preferred Qualifications
  • Publications at top-tier conferences (NeurIPS, ICML, ICLR, CVPR, ACL, EMNLP, MICCAI)
  • Experience with grounded generation tasks (visual grounding, referring expression comprehension)
  • Knowledge of evaluation methodologies for long-form generation, including factuality assessment and hallucination detection
  • Experience with 3D medical image processing and temporal modeling
  • Familiarity with clinical NLP and medical knowledge representation
  • Experience with model interpretability, explainability, and uncertainty quantification in safety-critical applications
