
About the Agentic AI and LLM Evaluation Lab

The rise of agentic AI

Recent advances in large language models have enabled a new class of artificial intelligence systems that can perform complex reasoning tasks, interact with software tools, and coordinate multi-step workflows. These systems are increasingly described as agentic AI because they can operate as autonomous or semi-autonomous agents capable of planning actions, retrieving information, invoking external tools, and collaborating with humans.

Agentic AI is rapidly emerging as a new paradigm for building intelligent software. Instead of relying on traditional rule-based automation, such systems use large language models to interpret tasks, generate plans, and coordinate actions across digital environments.


Why systematic evaluation is needed

While large language models have demonstrated remarkable capabilities, their behaviour can vary significantly across tasks, domains, and deployment contexts. For applications in complex domains such as the energy sector, it is therefore essential to understand how different models perform and how they interact with external systems.

Reliable deployment of agentic AI requires systematic evaluation of aspects such as reasoning quality, robustness, consistency, error behaviour, tool use, and the ability to operate within structured workflows. Without rigorous evaluation frameworks, it is difficult to determine which models are suitable for specific applications or how they should be integrated into real-world systems.
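As a minimal illustration of what evaluating one such aspect can look like in practice, the sketch below repeats the same task across several runs and measures answer consistency and success rate. It is a sketch only, not the lab's actual tooling: the `query_model` function, the model identifier, and the two metrics are illustrative assumptions.

```python
# Minimal sketch: measuring run-to-run consistency and success rate of an
# LLM on a fixed task. `query_model` is a hypothetical stand-in for
# whatever client library a given deployment uses.
from collections import Counter

def query_model(model: str, task: str) -> str:
    """Hypothetical placeholder: send `task` to `model` and return its answer."""
    raise NotImplementedError

def evaluate(model: str, task: str, expected: str, runs: int = 10) -> dict:
    # Repeat the same task to expose run-to-run variability.
    answers = [query_model(model, task) for _ in range(runs)]
    # Consistency: fraction of runs agreeing with the most common answer.
    _, count = Counter(answers).most_common(1)[0]
    # Success rate: fraction of runs matching the expected answer.
    success = sum(a == expected for a in answers) / runs
    return {"model": model, "consistency": count / runs, "success": success}
```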

The Agentic AI and LLM Evaluation Lab addresses this challenge by providing a research environment dedicated to studying how large language models behave in realistic agentic workflows.


Evaluating agentic AI systems

The lab focuses on developing methods and experimental environments for evaluating large language models when they are used as components of agent-based systems. This includes studying how models perform when interacting with data sources, software tools, simulation environments, and digital twins.

Key research questions include how different models perform across complex reasoning tasks, how agent orchestration patterns influence system behaviour, and how AI agents interact with human users in decision support settings. The lab also investigates reproducibility, reliability, and robustness when LLM-based agents operate in multi-step workflows.

Through benchmarking experiments, workflow testing, and comparative analysis of different models and orchestration strategies, the lab aims to establish more systematic approaches for evaluating agentic AI systems.
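In sketch form, a comparative benchmark of this kind can be organised as a grid of model and orchestration-strategy combinations, each scored on the same task suite so that differences in pass rate are attributable to the combination. The `run_workflow` harness, the model identifiers, and the strategy names below are hypothetical placeholders, not the lab's published methodology.

```python
# Illustrative sketch: comparing model/orchestration combinations on a
# shared task suite. `run_workflow` is a hypothetical harness entry point.
from itertools import product

MODELS = ["model_a", "model_b"]                    # placeholder identifiers
STRATEGIES = ["single_agent", "planner_executor"]  # placeholder patterns

def run_workflow(model: str, strategy: str, task: dict) -> bool:
    """Hypothetical: run one multi-step task end to end, report success."""
    raise NotImplementedError

def benchmark(tasks: list[dict]) -> dict:
    # Score every model/strategy pair on the same tasks and record pass rates.
    results = {}
    for model, strategy in product(MODELS, STRATEGIES):
        passed = sum(run_workflow(model, strategy, t) for t in tasks)
        results[(model, strategy)] = passed / len(tasks)
    return results
```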


An experimental platform for agentic AI research

The Agentic AI and LLM Evaluation Lab provides an experimental platform where researchers can design, test, and analyse agent-based AI systems in realistic digital environments. These environments allow AI agents to interact with datasets, software platforms, digital twins, and domain-specific tools that are relevant to energy informatics.

By combining benchmarking, experimentation, and system-level analysis, the lab contributes to developing reliable methods for integrating agentic AI into digital solutions for the energy sector.

Through this work, the lab supports the development of trustworthy AI technologies that can contribute to the green transition by enabling more intelligent and adaptive digital systems.

Last Updated 18.03.2026