About the Role

A technology-focused AI research initiative is seeking experienced professionals to help develop benchmark datasets used to evaluate advanced AI systems. The role focuses on creating realistic, high-quality assessment tasks that measure document understanding, instruction following, technical reasoning, and problem-solving capabilities within software engineering and data science domains.

This opportunity is ideal for professionals with hands-on experience in software development, data science, analytics, or related technical disciplines. Candidates should be comfortable working with technical documentation, codebases, APIs, architecture materials, and analytical workflows while applying expert judgment to define accurate evaluation standards.

The work involves designing complex, multi-step tasks grounded in real-world technology scenarios, establishing objective evaluation criteria, and producing high-quality reference outputs. Success depends on technical depth, attention to detail, and the ability to translate practical industry knowledge into structured AI evaluation frameworks.

What You'll Do

Design and author benchmark tasks that assess AI performance in technology-related domains
Create realistic scenarios based on technical specifications, architecture documentation, API references, and code repositories
Develop multi-step instructions that evaluate reasoning, document comprehension, and task execution capabilities
Produce accurate ground-truth outputs for benchmark evaluations
Define objective scoring rubrics and evaluation criteria for AI-generated responses
Review technical content to ensure clarity, accuracy, and consistency
Conduct research using technical documentation and relevant resources to support task development
Evaluate task quality and identify opportunities to improve benchmark reliability
Document methodologies, assumptions, and expected outcomes for evaluation projects
Collaborate with project teams to maintain quality standards across datasets
Contribute domain expertise to strengthen AI assessment frameworks and testing methodologies

Requirements

Minimum of 3 years of hands-on experience in software engineering, data science, analytics, or a related technical field
Strong understanding of software development processes, technical documentation, and system architecture concepts
Ability to interpret and work with API documentation, codebases, and engineering specifications
Experience analyzing complex technical information and translating it into structured outputs
Strong written communication skills with exceptional attention to detail
Ability to design objective evaluation criteria and quality standards
Experience working independently in remote, project-based environments
Strong analytical thinking and problem-solving capabilities
Comfort reviewing and validating technical content for accuracy and completeness
Ability to commit approximately 15–20 hours per week to project work
Experience with data analysis, statistical reasoning, or analytical workflows preferred
Familiarity with programming languages, software development tools, or data science platforms preferred
Experience creating technical documentation, assessments, benchmarks, or evaluation frameworks preferred
Knowledge of AI systems, large language models, or AI evaluation methodologies preferred
Experience using AI-assisted productivity tools, including ChatGPT, preferred

Additional Considerations

Independent contractor engagement
Flexible remote schedule with self-managed working hours
Project duration may vary based on business requirements and performance outcomes
Compensation is based on services rendered under contract terms
Candidates must meet applicable work authorization and engagement requirements

AI Evaluation Specialist – Software Engineering & Data Science
