AI Evaluation Specialist – Software Engineering & Data Science
About the Role
A technology-focused AI research initiative is seeking experienced professionals to help develop benchmark datasets used to evaluate advanced AI systems. The role focuses on creating realistic, high-quality assessment tasks that measure document understanding, instruction following, technical reasoning, and problem-solving capabilities within software engineering and data science domains.
This opportunity is ideal for professionals with hands-on experience in software development, data science, analytics, or related technical disciplines. Candidates should be comfortable working with technical documentation, codebases, APIs, architecture materials, and analytical workflows while applying expert judgment to define accurate evaluation standards.
The work involves designing complex, multi-step tasks grounded in real-world technology scenarios, establishing objective evaluation criteria, and producing high-quality reference outputs. Success depends on technical depth, attention to detail, and the ability to translate practical industry knowledge into structured AI evaluation frameworks.
What You'll Do
- Design and author benchmark tasks that assess AI performance in technology-related domains
- Create realistic scenarios based on technical specifications, architecture documentation, API references, and code repositories
- Develop multi-step instructions that evaluate reasoning, document comprehension, and task execution capabilities
- Produce accurate ground-truth outputs for benchmark evaluations
- Define objective scoring rubrics and evaluation criteria for AI-generated responses
- Review technical content to ensure clarity, accuracy, and consistency
- Conduct research using technical documentation and relevant resources to support task development
- Evaluate task quality and identify opportunities to improve benchmark reliability
- Document methodologies, assumptions, and expected outcomes for evaluation projects
- Collaborate with project teams to maintain quality standards across datasets
- Contribute domain expertise to strengthen AI assessment frameworks and testing methodologies
Requirements
- Minimum of 3 years of hands-on experience in software engineering, data science, analytics, or a related technical field
- Strong understanding of software development processes, technical documentation, and system architecture concepts
- Ability to interpret and work with API documentation, codebases, and engineering specifications
- Experience analyzing complex technical information and translating it into structured outputs
- Strong written communication skills with exceptional attention to detail
- Ability to design objective evaluation criteria and quality standards
- Experience working independently in remote, project-based environments
- Strong analytical thinking and problem-solving capabilities
- Comfort reviewing and validating technical content for accuracy and completeness
- Ability to commit approximately 15–20 hours per week to project work
- Experience with data analysis, statistical reasoning, or analytical workflows preferred
- Familiarity with programming languages, software development tools, or data science platforms preferred
- Experience creating technical documentation, assessments, benchmarks, or evaluation frameworks preferred
- Knowledge of AI systems, large language models, or AI evaluation methodologies preferred
- Experience using AI-assisted productivity tools, including ChatGPT, preferred
Additional Considerations
- Independent contractor engagement
- Flexible remote schedule with self-managed working hours
- Project duration may vary based on business requirements and performance outcomes
- Compensation is based on services rendered under contract terms
- Candidates must meet applicable work authorization and engagement requirements