Data Science Expert (Python, SQL, GenAI)
OpenTrain AI · Remote · Worldwide · Posted Apr 5, 2026
**Job Name:** Data Science Expert (Python, SQL, GenAI)
**Dataset Description (5–8 words):** Real-world data science problem authoring
**Data Type (select one):** Text
**Subject Matter/Industry (5–8 words):** Applied data science across multiple industries
**Pre-labeled Data (Yes/No):** No
**Labeling Software:** Other
**Label Types (select 1+):**
* Prompt + Response Writing (SFT)
* Text Generation
* Question Answering
* Evaluation/Rating
* Computer Programming/Coding
* Text Summarization
---
## Labeling Overview
**Qualifications / requirements:**
We’re seeking experienced data scientists with expert Python and SQL skills to design and verify computational, business-realistic data science problems. Ideal contributors have 5+ years of hands-on data science experience with measurable business impact, solid statistical and ML foundations, and strong written English (C1+). You should be comfortable producing deterministic, reproducible solutions (including fixed random seeds when needed) and writing clear documentation.
**What you’ll do:**
You’ll design original, end-to-end computational data science problems that reflect real analytical workflows across industries (e.g., telecom, finance, government, e-commerce, healthcare). Each problem is Python-based and may span ingestion, cleaning, EDA, feature engineering, modeling, validation, and deployment considerations. Every problem should be computationally intensive (not solvable by hand in a reasonable timeframe), set in a realistic business context (fraud, forecasting, optimization, risk, customer analytics), and reproducible, with its correct answer verified using standard data science libraries.
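
To illustrate what "deterministic and reproducible" might mean in practice, here is a minimal sketch, not a project template. It uses only the Python standard library so it runs anywhere; actual problems would more likely seed NumPy, pandas, or scikit-learn in the same way. The seed value, the function names `make_transactions` and `flag_outliers`, the log-normal parameters, and the z-score threshold are all hypothetical choices made for this example.

```python
import random
import statistics

SEED = 42  # hypothetical fixed seed; any documented constant would do


def make_transactions(n: int, seed: int = SEED) -> list[float]:
    """Generate n synthetic transaction amounts from a seeded generator."""
    # A local Random instance avoids depending on global random state.
    rng = random.Random(seed)
    return [round(rng.lognormvariate(3.0, 1.0), 2) for _ in range(n)]


def flag_outliers(amounts: list[float], z: float = 3.0) -> list[int]:
    """Return indices of amounts more than z sample std devs above the mean."""
    mean = statistics.fmean(amounts)
    std = statistics.stdev(amounts)
    return [i for i, a in enumerate(amounts) if (a - mean) / std > z]


amounts = make_transactions(10_000)
outliers = flag_outliers(amounts)
print(f"{len(outliers)} of {len(amounts)} transactions flagged")
```

Because the generator is seeded and no step depends on global state, rerunning the script in the same environment should yield the same flagged count, which is what allows a single verified answer to be attached to a problem.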
---
**Required Locations:** Global - Any Location
**Required English Level:** Fluent
---
## Other Qualifications & Requirements (for screening)
* 5+ years of hands-on data science experience with demonstrated business impact
* Expert Python for data science: Pandas, NumPy, SciPy, scikit-learn, statsmodels
* Comfortable using visualization libraries for EDA/communication (Matplotlib; Seaborn is a plus)
* Strong ability to design deterministic, reproducible problems (e.g., fixed seeds, no stochastic ambiguity)
* Deep statistical analysis + ML knowledge (feature engineering, model selection, evaluation, error analysis)
* Expert SQL skills (complex joins, aggregations, window functions) and database operations
* Experience designing end-to-end DS workflows (ingestion → cleaning → EDA → modeling → validation)
* Familiarity with big data/scalable processing concepts (partitioning, performance considerations, memory constraints)
* Experience with GenAI technologies (LLMs, RAG, prompt engineering, vector databases)
* Understanding of MLOps/model deployment workflows (packaging, reproducibility, monitoring basics)
* Experience with modern frameworks (TensorFlow or PyTorch; bonus: LangChain)
* Written English proficiency at C1+ level (or equivalent), able to write clear business problem statements
* Availability to contribute ~10–20 hours/week during active project phases (project-based; not permanent)