OpenTrain AI

Evaluation Scenario Writer - AI Agent Testing Specialist

OpenTrain AI · Remote · Worldwide · Posted Jun 10, 2026

Hourly · $18–$24/hr

About OpenTrain

OpenTrain is a central job board for AI-training and data-labeling work that aggregates roles from many AI companies and labeling platforms. Creating an OpenTrain account is free and applying to listings typically takes only a few minutes.

About AI Training Work

Human evaluators and annotators provide the examples, judgments, and corrections that modern AI systems learn from. For LLM agents this includes designing tests, rating outputs, and defining ideal behavior so models perform accurately and safely in real tasks.

This role focuses on evaluation design and QA-style analysis for generative systems: you will create structured test scenarios, document expected (gold-standard) responses, and score agent behavior to improve overall model quality.

The Role

You will be an Evaluation Scenario Writer focused on LLM agent testing and evaluation design. This is a remote, part-time contractor role aimed at intermediate-level contributors.

You will work with text data and perform evaluation-rating tasks through a labeling workflow. The listing does not name a specific labeling tool (the software is listed as "Other").

  • Hours: 20+ hours per week
  • Pay: $18–$24 USD per hour
  • Employment type: Contractor, Part-time
  • Location: Worldwide / fully remote

What You'll Do

You will design realistic, reusable evaluation scenarios that simulate real-world tasks for LLM-based agents. Your work will help define how agents should behave in typical and edge-case situations.

  • Create structured evaluation scenarios and document them in JSON and/or YAML formats.
  • Define the golden path (gold-standard agent behavior) and acceptable variations.
  • Annotate task steps, expected outputs, edge cases, and scoring logic.
  • Review agent outputs and rate them according to defined scoring rubrics.
  • Iterate on scenarios to improve clarity, coverage, and reproducibility.
  • Collaborate with developers and other contributors to refine evaluation frameworks.
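
As a rough illustration of the kind of structured scenario described above, the sketch below defines a scenario in Python and serializes it to JSON. The listing does not specify a schema, so every field name here (scenario_id, golden_path, acceptable_variations, scoring, and so on) and the refund example itself are assumptions made for illustration, not the project's actual format.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Step:
    """One step of the golden path (hypothetical structure)."""
    action: str
    expected_output: str
    acceptable_variations: list[str] = field(default_factory=list)


@dataclass
class Scenario:
    """A reusable evaluation scenario (hypothetical structure)."""
    scenario_id: str
    task: str
    golden_path: list[Step]
    edge_cases: list[str] = field(default_factory=list)
    scoring: dict[str, int] = field(default_factory=dict)


def to_json(scenario: Scenario) -> str:
    """Serialize a scenario, including nested steps, to indented JSON."""
    return json.dumps(asdict(scenario), indent=2)


def validate(scenario: Scenario) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if not scenario.golden_path:
        problems.append("golden path has no steps")
    if not scenario.scoring:
        problems.append("no scoring rubric defined")
    for i, step in enumerate(scenario.golden_path, start=1):
        if not step.expected_output.strip():
            problems.append(f"step {i} has no expected output")
    return problems


# Illustrative scenario; the task and wording are invented for this sketch.
example = Scenario(
    scenario_id="refund-001",
    task="Customer asks for a refund on a delivered order",
    golden_path=[
        Step("look up order", "Order 1234 found",
             acceptable_variations=["Found order 1234"]),
        Step("issue refund", "Refund issued"),
    ],
    edge_cases=["order id does not exist", "refund window has expired"],
    scoring={"gold_match": 2, "acceptable_variation": 1, "miss": 0},
)

if __name__ == "__main__":
    print(to_json(example))
    print(validate(example))
```

Keeping the golden path, edge cases, and scoring weights in a single record is one way to make a scenario reproducible: another contributor can rerun it and apply the same rubric without extra context.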

Requirements

Applicants must meet the educational, technical, and experience requirements listed below.

  • Bachelor’s and/or Master’s degree in CS, Software Engineering, Data Science/Analytics, AI/ML, Computational Linguistics/NLP, Information Systems, or a related field
  • Prior experience in QA, software testing, test case design, data analysis, or NLP annotation
  • Demonstrated ability to design reproducible test scenarios with strong coverage and edge cases
  • Comfortable reading and using structured formats like JSON and/or YAML to describe scenarios
  • Able to define gold-standard agent behavior, acceptable variations, and clear scoring logic
  • Basic working experience with Python and JavaScript (reading/editing simple scripts)
  • Strong written English skills for producing clear, unambiguous documentation
  • Comfortable working with AI-generated outputs, agent logs, and prompt-based behaviors
  • Able to switch between topics quickly and follow complex guidelines accurately
  • Remote-work readiness: a reliable laptop, a stable internet connection, and consistent availability

Who Should Apply

This role is a good match for people with QA or test-case design backgrounds, data analysts, experienced annotators, or engineers who enjoy clearly documenting behavior and edge cases.

  • Intermediate-level contributors who can work independently and communicate clearly in writing
  • People comfortable with structured data formats (JSON/YAML) and basic scripting
  • Detail-oriented testers who like designing reproducible scenarios and scoring logic
  • Candidates who prefer flexible, remote, part-time contract work

How It Works

Apply through OpenTrain: create a free account and submit your application (it typically takes only a few minutes). If selected, you will be onboarded to the project’s workflow and labeling tools.

A typical workflow includes authoring scenarios in JSON/YAML, running them against agents, rating the resulting outputs, and iterating with engineering teams. You will use the provided guidelines and scoring rubrics for evaluation tasks.
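
The rating step of that workflow could look roughly like the sketch below, which scores an agent's outputs against a golden path. The 2/1/0 scale, the dictionary keys (expected, variations), and the exact-match comparison are all assumptions chosen to keep the example small. Real rubrics for this project are supplied by the guidelines and will likely rely on human judgment rather than string matching.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def score_step(output: str, expected: str, variations=()) -> int:
    """Return 2 for a gold-standard match, 1 for an acceptable variation, else 0."""
    out = normalize(output)
    if out == normalize(expected):
        return 2
    if any(out == normalize(v) for v in variations):
        return 1
    return 0


def score_run(outputs: list[str], golden_path: list[dict]) -> dict:
    """Score each golden-path step; missing outputs are treated as empty strings."""
    per_step = []
    for i, step in enumerate(golden_path):
        output = outputs[i] if i < len(outputs) else ""
        per_step.append(
            score_step(output, step["expected"], step.get("variations", ()))
        )
    return {
        "per_step": per_step,
        "total": sum(per_step),
        "max": 2 * len(golden_path),
    }


# Hypothetical golden path and agent outputs, invented for this sketch.
golden = [
    {"expected": "Order 1234 found", "variations": ["Found order 1234"]},
    {"expected": "Refund issued"},
]
agent_outputs = ["found  order 1234", "Refund denied"]

if __name__ == "__main__":
    print(score_run(agent_outputs, golden))
```

Here the first output matches an acceptable variation and the second misses the expected result, so the run scores 1 out of a possible 4.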

  • Data type: text; task type: evaluation rating; labeling software: other (not specified)
  • Contract details: part-time contractor, 20+ hours/week, paid $18–$24/hr
  • Worldwide applicants accepted; make sure you meet the remote-readiness items listed in Requirements