About the role
We are looking for a Senior Data Scientist who will own some of the most consequential diagnostic AI in rare disease: building, validating, and operationalizing the models that help us find and diagnose patients who have never had a name for their disease, powering the analytical rigor behind our testing programs, and shaping how we use data to make smarter product decisions.
What you will do
- Own the end-to-end development, validation, and operationalization of PG's predictive diagnostic AI models - from feature engineering through production deployment - that power program eligibility decisions and clinical decisions for patients
- Run prospective testing experiments: apply diagnostic models to undiagnosed patients, coordinate testing, and track outcomes to continuously improve model performance
- Build and maintain PG's synthetic patient data pipeline, a critical deliverable for our research programs and a key input to our own model development lifecycle
- Optimize our patient intake experience using NLP and multimodal data analysis to determine which questions to ask, in what order, to maximize data quality and conversion
- Own API usage and cost optimization across PG's AI stack, including prompt engineering, model evaluation, and ongoing performance monitoring
- Conduct ad hoc strategic analyses that inform product prioritization, support causality assessment, and generate customer-facing program insights
- Establish MLOps infrastructure: model monitoring, drift detection, API observability, and lightweight but durable operational processes
- Have the freedom to conduct blue sky research initiatives aimed at creating value from our data
- Work with Data Engineering to build a robust, scalable data foundation that supports all of the above
Who you are
We are looking for a few specific things that will help you succeed in this role:
- 7+ years of experience in data science, machine learning engineering, or a closely related field
- Strong Python proficiency and fluency across the core data science stack: pandas, NumPy, scikit-learn, PySpark, and SQL
- Demonstrated end-to-end ML experience: you have taken models from problem definition through feature engineering, validation, deployment, and monitoring in a production environment
- Experience with NLP techniques and applying language models to real-world problems
- Comfort with prompt engineering and evaluating external AI API performance (e.g., OpenAI)
- A track record of operating with high ownership in lean, fast-moving environments where you have had to build structure as much as execute within it
- Strong analytical communication skills - you can translate complex model outputs and data findings into clear, actionable narratives for technical and non-technical audiences alike
Some things that are not required, but that you will learn on the job:
- Experience with Databricks or similar lakehouse/ML platform environments
- Familiarity with synthetic data generation techniques
- Domain knowledge in healthcare, rare disease, genomics, or clinical research
- Experience with MLOps tooling and building observability infrastructure from scratch
- Exposure to biopharma or insurance analytics use cases
What we offer at Probably Genetic:
- An engaging and supportive team all on a mission to improve lives
- Fair and equitable compensation with competitive early-stage equity grants
- Generous flexible time-off policy that we actually use
- Parental leave benefits (12 weeks for both birthing and non-birthing parents)
- Hybrid, flexible work with high-trust and autonomy
- A bright, inviting, pet-friendly office in Downtown SF near transit
- A "work from anywhere" policy, up to 4 weeks a year
- Regular team retreats in exciting destinations
- Health Benefits including medical, dental, vision, therapy, FSA, and 401k
- And so much more!