Evaluation Scenario Writer - AI Agent Testing Specialist
OpenTrain AI · Remote · Worldwide · Posted Jun 10, 2026
About OpenTrain
OpenTrain is a central job board for AI-training and data-labeling work that aggregates roles from many AI companies and labeling platforms. Creating an OpenTrain account is free and applying to listings typically takes only a few minutes.
About AI Training Work
Human evaluators and annotators provide the examples, judgments, and corrections that modern AI systems learn from. For LLM agents this includes designing tests, rating outputs, and defining ideal behavior so models perform accurately and safely in real tasks.
This role focuses on evaluation design and QA-style analysis for generative systems: you will create structured test scenarios, document expected (gold-standard) responses, and score agent behavior to improve overall model quality.
The Role
You will be an Evaluation Scenario Writer focused on LLM agent testing and evaluation design. This is a remote, part-time contractor role aimed at intermediate-level contributors.
You will work with text data and perform evaluation and rating tasks through the project's labeling workflow (labeling software: other).
- Hours: 20+ hours per week
- Pay: $18–$24 USD per hour
- Employment type: Contractor, Part-time
- Location: Worldwide / fully remote
What You'll Do
You will design realistic, reusable evaluation scenarios that simulate real-world tasks for LLM-based agents. Your work will help define how agents should behave in typical and edge-case situations.
- Create structured evaluation scenarios and document them in JSON and/or YAML formats.
- Define the golden path (gold-standard agent behavior) and acceptable variations.
- Annotate task steps, expected outputs, edge cases, and scoring logic.
- Review agent outputs and rate them according to defined scoring rubrics.
- Iterate on scenarios to improve clarity, coverage, and reproducibility.
- Collaborate with developers and other contributors to refine evaluation frameworks.
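To illustrate the kind of artifact described above, here is a minimal sketch of a scenario with a golden path, acceptable variations, and scoring logic. The field names (scenario_id, golden_path, acceptable_variations, scoring), the refund task, and the score_steps helper are all hypothetical and chosen only for illustration; they are not a schema used by OpenTrain or any particular project, which would supply its own format and rubric.

```python
import json

# Hypothetical scenario document; every field name here is an illustrative
# assumption, not a real project schema.
SCENARIO_JSON = """
{
  "scenario_id": "refund-request-001",
  "task": "Customer asks for a refund on an order delivered 10 days ago.",
  "golden_path": ["look_up_order", "check_refund_policy", "issue_refund"],
  "acceptable_variations": [
    ["check_refund_policy", "look_up_order", "issue_refund"]
  ],
  "scoring": {"golden": 2, "acceptable": 1, "other": 0}
}
"""


def score_steps(scenario: dict, agent_steps: list[str]) -> int:
    """Return the rubric score for the sequence of steps an agent took."""
    scoring = scenario["scoring"]
    if agent_steps == scenario["golden_path"]:
        return scoring["golden"]
    if agent_steps in scenario["acceptable_variations"]:
        return scoring["acceptable"]
    return scoring["other"]


scenario = json.loads(SCENARIO_JSON)
# Score one agent run that follows the golden path exactly.
print(score_steps(scenario, ["look_up_order", "check_refund_policy", "issue_refund"]))
```

Keeping the scoring values inside the scenario file, rather than in the script, is one possible design choice: it lets reviewers adjust the rubric without touching code.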
Requirements
You must meet the educational, technical, and experience requirements listed below.
- Bachelor’s and/or Master’s degree in CS, Software Engineering, Data Science/Analytics, AI/ML, Computational Linguistics/NLP, Information Systems, or a related field
- Prior experience in QA, software testing, test case design, data analysis, or NLP annotation
- Demonstrated ability to design reproducible test scenarios with strong coverage and edge cases
- Comfortable reading and using structured formats like JSON and/or YAML to describe scenarios
- Able to define gold-standard agent behavior, acceptable variations, and clear scoring logic
- Basic working experience with Python and JavaScript (reading/editing simple scripts)
- Strong written English skills for producing clear, unambiguous documentation
- Comfortable working with AI-generated outputs, agent logs, and prompt-based behaviors
- Able to switch between topics quickly and follow complex guidelines accurately
- Ready for fully remote work: reliable laptop, stable internet connection, and consistent availability
Who Should Apply
This role is a good match for people with QA or test-case design backgrounds, data analysts, experienced annotators, or engineers who enjoy clearly documenting behavior and edge cases.
- Intermediate-level contributors who can work independently and communicate clearly in writing
- People comfortable with structured data formats (JSON/YAML) and basic scripting
- Detail-oriented testers who like designing reproducible scenarios and scoring logic
- Candidates who prefer flexible, remote, part-time contract work
How It Works
Apply through OpenTrain: create a free account and submit your application (it typically takes only a few minutes). If selected, you will be onboarded to the project’s workflow and labeling tools.
Typical workflow includes authoring scenarios in JSON/YAML, running them against agent outputs, rating results, and iterating with engineering teams. You will use provided guidelines and scoring rubrics for evaluation tasks.
- Data type: TEXT; label type: EVALUATION_RATING; labeling software: OTHER
- Contract details: part-time contractor, 20+ hours/week, paid $18–$24/hr
- Worldwide applicants accepted; make sure you meet the remote-work requirements listed under Requirements
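The rate-and-iterate part of the workflow can be sketched roughly as follows, assuming ratings are collected per scenario and then averaged to flag scenarios that may need another pass. The Rating record, the summarize and needs_review helpers, the 0.5 threshold, and the sample ratings are all hypothetical and made up for illustration; real projects will define their own rubrics and tooling.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Rating:
    scenario_id: str
    score: int      # rubric score assigned by the evaluator
    max_score: int  # highest score the rubric allows


def summarize(ratings: list[Rating]) -> dict[str, float]:
    """Average normalized score per scenario, in the range 0.0 to 1.0."""
    by_scenario: dict[str, list[float]] = {}
    for r in ratings:
        by_scenario.setdefault(r.scenario_id, []).append(r.score / r.max_score)
    return {sid: mean(vals) for sid, vals in by_scenario.items()}


def needs_review(summary: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Scenario IDs whose average falls below the assumed threshold."""
    return sorted(sid for sid, avg in summary.items() if avg < threshold)


# Made-up sample ratings, for illustration only.
ratings = [
    Rating("refund-request-001", 2, 2),
    Rating("refund-request-001", 1, 2),
    Rating("address-change-002", 0, 2),
    Rating("address-change-002", 1, 2),
]
summary = summarize(ratings)
print(summary)
print(needs_review(summary))
```

A low average can mean either that the agent performs poorly or that the scenario itself is ambiguous, which is why flagged scenarios would go back for review rather than being treated as agent failures outright.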