FermBench: A new benchmark for measuring the capabilities of LLMs on fermentation knowledge

Research output: Contribution to journal › Journal article › Research › peer-review

Abstract

Generative Artificial Intelligence (GenAI) chatbots continue to amaze users worldwide with their rapid improvements. These tools possess vast general knowledge and can thus be used in various fields, including education. However, before rolling out these models in pedagogical applications, it is fundamental to understand whether the information they provide is reliable and whether any of the currently available chatbots is best suited for domain-specific tasks. The objective of this study is to investigate these aspects thoroughly in a specific domain, fermentation, with the overarching goal of providing guidelines that help students and teachers select the best GenAI assistant. To achieve this goal, we introduce FermBench, a dataset specifically designed for fermentation processes. We use the collected data to benchmark five large language models (LLMs) powering commercially available GenAI chatbots: ChatGPT, Gemini, DeepSeek, Claude and Le Chat. To evaluate the responses of these models, we propose a robust experimental framework that includes automated metrics, human annotations, and the LLM-as-a-Judge approach. Given the high baseline and the fact that the judges were unable to agree on an overall best model, the results suggest that the current knowledge embedded within these models is adequate, but that the results alone cannot provide pedagogical guidelines regarding which chatbot should be used in education. They further suggest that the choice of which GenAI chatbot to use should be supported by institutional or government guidance, as well as individual preferences, perhaps informed by the parameters identified in our analysis. An interesting finding of the study is that curated answers are not necessarily better than generated ones.
Original language: English
Article number: 100577
Journal: Computers and Education: Artificial Intelligence
Volume: 10
Number of pages: 17
DOIs
Publication status: Published - 2026

Keywords

  • Artificial intelligence
  • Large language models
  • Fermentation
  • Chatbots in education
