Working Student (m/f/d) LLM Agent Evaluation & Benchmarking

Agile Robots SE

2 hours ago

Tech

Working Student

On-site

English

Required skills

Inspect AI, Anthropic API, LangSmith, lm-evaluation-harness, numpy, pandas, Ragas, LangChain, pytest, OpenAI API, Plotly, Python, Ollama, scikit-learn, LangGraph

Job description

We are looking for a Working Student (m/f/d) for LLM Agent Evaluation & Benchmarking. In this role, you will design and build an agent-agnostic benchmarking harness, run comparative evaluations across frontier and local models, and translate findings into prompt, guard, and tool-schema improvements.

Your Responsibilities

Harness Development: Design and build an agent-agnostic benchmarking harness that executes versioned task suites against frontier and local models with reproducible, version-controlled runs.
Task Suite Design: Define and maintain evaluation task suites that measure task success, grounding accuracy, latency, and cost across the agent portfolio.
Model Evaluation: Run periodic head-to-head evaluations across models to produce structured comparisons that support model selection decisions.
Eval Reporting: Analyze evaluation outputs and produce reports and visualizations that communicate findings clearly to agent owners.
Improvement Feedback: Translate evaluation findings into concrete changes to prompts, guard logic, and tool schemas in collaboration with agent owners.

Essential Skills

Academic Background: Currently enrolled in a Master's programme in Computer Science, Machine Learning, or Data Science.
Python Engineering: Ability to write well-structured Python for tooling and automation, including test frameworks such as pytest, dependency management, reproducible execution, and basic CI pipeline configuration.
Eval Frameworks: Working familiarity with LLM evaluation frameworks such as LangSmith, Ragas, Inspect AI, or lm-evaluation-harness.
Agent Concepts: Working understanding of LLM APIs (OpenAI, Anthropic, Ollama), prompt structure, and how multi-step agent systems are built and instrumented.
Experimental Design: Ability to design controlled comparisons, define success metrics, and interpret results across multiple evaluation conditions.

Beneficial Skills

Data Analysis: Familiarity with statistical comparison methods and data handling using numpy, pandas, and scikit-learn.
Reporting Tools: Familiarity with data visualization and reporting using tools such as Plotly, Streamlit, or notebooks.
Agent Frameworks: Familiarity with agent orchestration frameworks such as LangChain or LangGraph.

What we offer

Practical learning opportunities to complement your studies.
A dynamic high-tech company with sound finances and world-class investors.
Join an interdisciplinary, international team with 60+ different nationalities in a collaborative work environment.
A Corporate Benefits Program covering health, mobility, and learning, with 100 € net per month.
Modern office facilities with a rooftop terrace overlooking Munich, free drinks and fruit, and regular company events that contribute to a good working environment.

Agile Robots SE is an international high-tech company based in Munich, Germany with a production site in Kaufbeuren and more than 2300 employees worldwide. Our mission is to bridge the gap between artificial intelligence and robotics by developing systems that combine state-of-the-art force-moment-sensing and world-leading image-processing technology. This unique combination of technologies allows us to provide user-friendly and affordable robotic solutions that enable intelligent precision assembly.

This is made possible by our employees, who bring their best every day with creativity and enthusiasm. Become part of this team and shape the future of robotics with us!

We are proud of our diversity and welcome your application regardless of gender and sexual identity, nationality, ethnicity, religion, age, or disability.
