About the Role

A leading AI initiative is seeking experienced MLOps professionals to support the development and evaluation of next-generation large-scale AI systems. This opportunity focuses on machine learning infrastructure, model training environments, distributed computing, and framework-level optimization for advanced AI applications.

The role is ideal for engineers with deep expertise in ML systems, production-scale training pipelines, and modern machine learning frameworks. Candidates should be comfortable working across infrastructure, performance optimization, and technical evaluation, and should be able to communicate complex engineering concepts clearly.

The work involves designing challenging ML systems tasks, developing high-quality technical solutions, evaluating model outputs, and contributing to the creation of robust training and assessment frameworks. Success in this role requires strong expertise in distributed systems, training infrastructure, and performance-critical machine learning workloads.

What You'll Do

Design and evaluate complex MLOps and ML infrastructure challenges
Develop accurate technical solutions for machine learning systems problems
Assess model-generated outputs and provide structured technical feedback
Create evaluation rubrics for training pipelines, infrastructure design, and optimization tasks
Guide technical efforts related to ML systems architecture and operational best practices
Analyze distributed training workflows and infrastructure performance bottlenecks
Review machine learning platform designs for scalability, reliability, and efficiency
Support the development of high-quality datasets used for AI model improvement
Collaborate with technical experts to maintain consistency and accuracy across evaluations
Contribute expertise in framework-level optimization and model training operations

Requirements

2+ years of professional experience in MLOps, ML infrastructure, or machine learning systems engineering
Hands-on production experience with JAX and/or PyTorch
Experience designing, operating, or optimizing large-scale machine learning training environments
Strong understanding of distributed systems and machine learning infrastructure architecture
Experience developing or optimizing custom GPU kernels using Pallas (JAX), Triton, or similar technologies
Knowledge of model training workflows, orchestration systems, and performance tuning techniques
Ability to evaluate technical solutions and provide detailed written feedback
Strong written communication skills with the ability to explain complex technical concepts clearly
Demonstrated career progression and increasing technical responsibility
Availability to support a full-time weekday engagement (40 hours per week)
Ability to work effectively within structured project environments and cross-functional teams
Experience with infrastructure automation, cloud platforms, or ML platform engineering preferred
Familiarity with large language model training, evaluation, or optimization workflows preferred
Experience contributing to AI research infrastructure or frontier AI initiatives preferred
Knowledge of GPU performance optimization, compiler technologies, or kernel-level programming preferred

MLOps Engineer – AI Model Training Infrastructure
