Artificial Intelligence / Machine Learning Infrastructure

MLOps Engineer – AI Model Training Infrastructure

United States (Remote) | Full-time | Platform: Mercor

About the Role

A leading AI initiative is seeking experienced MLOps professionals to support the development and evaluation of next-generation large-scale AI systems. This opportunity focuses on machine learning infrastructure, model training environments, distributed computing, and framework-level optimization for advanced AI applications.

The role suits engineers with deep expertise in ML systems, production-scale training pipelines, and modern machine learning frameworks. Candidates should be comfortable working across infrastructure, performance optimization, and technical evaluation, and able to communicate complex engineering concepts clearly.

The work involves designing challenging ML systems tasks, developing high-quality technical solutions, evaluating model outputs, and contributing to the creation of robust training and assessment frameworks. Success in this role requires strong expertise in distributed systems, training infrastructure, and performance-critical machine learning workloads.

What You'll Do

  • Design and evaluate complex MLOps and ML infrastructure challenges
  • Develop accurate technical solutions for machine learning systems problems
  • Assess model-generated outputs and provide structured technical feedback
  • Create evaluation rubrics for training pipelines, infrastructure design, and optimization tasks
  • Guide technical efforts related to ML systems architecture and operational best practices
  • Analyze distributed training workflows and infrastructure performance bottlenecks
  • Review machine learning platform designs for scalability, reliability, and efficiency
  • Support the development of high-quality datasets used for AI model improvement
  • Collaborate with technical experts to maintain consistency and accuracy across evaluations
  • Contribute expertise in framework-level optimization and model training operations

Requirements

  • 2+ years of professional experience in MLOps, ML infrastructure, or machine learning systems engineering
  • Hands-on production experience with JAX and/or PyTorch
  • Experience designing, operating, or optimizing large-scale machine learning training environments
  • Strong understanding of distributed systems and machine learning infrastructure architecture
  • Experience developing or optimizing custom GPU kernels using Pallas (JAX), Triton, or similar technologies
  • Knowledge of model training workflows, orchestration systems, and performance tuning techniques
  • Ability to evaluate technical solutions and provide detailed written feedback
  • Strong written communication skills with the ability to explain complex technical concepts clearly
  • Demonstrated career progression and increasing technical responsibility
  • Availability to support a full-time weekday engagement (40 hours per week)
  • Ability to work effectively within structured project environments and cross-functional teams
  • Experience with infrastructure automation, cloud platforms, or ML platform engineering preferred
  • Familiarity with large language model training, evaluation, or optimization workflows preferred
  • Experience contributing to AI research infrastructure or frontier AI initiatives preferred
  • Knowledge of GPU performance optimization, compiler technologies, or kernel-level programming preferred

Application Note: When you submit your profile for this partnered position, our team can quickly review your background and reach out to present this specific opportunity or match you with similar AI training projects.