FB-Bench: Evaluating How Well LLMs Actually Listen to Human Feedback

The true measure of an AI assistant is not just how it handles a first impression, but how it responds when corrected. Traditional static benchmarks struggle to capture this interactive dynamic. To address this, researchers introduced FB-Bench, a fine-grained, multi-task benchmark designed to evaluate how responsive large language models (LLMs) are to diverse human feedback. Published at the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), FB-Bench bridges the gap between static model testing and real-world, iterative human-AI collaboration.

The Architecture: A Three-Dimensional Taxonomy

While most benchmarks evaluate independent, single-turn prompts, human-AI teamwork relies on an ongoing dialogue. FB-Bench standardises this interaction with a structured taxonomy organised along three dimensions:

User Queries: Initial multi-task prompts spanning eight distinct operational domains.

Model Responses: Initial model outputs paired with five predefined “deficiency types” (e.g., factual errors, logical gaps, or tone issues).

User Feedback: Nine types of explicit human intervention that prompt the model to adjust its response.

Dual Evaluation Scenarios

FB-Bench categorises human-model text interactions into two primary functional challenges:

1. Error Correction

This scenario measures a model’s capacity to absorb negative feedback, identify its own previous errors, and deploy fixes. It tests whether an LLM can rewrite text, fix buggy code, or alter tone without fracturing the context of the conversation.

2. Response Maintenance

Critically, users are not always right. Sometimes human feedback is ambiguous, misguided, or explicitly incorrect. Response Maintenance evaluates an LLM’s behavioural resilience: whether the model can stand its ground, politely defend a correct answer, and avoid being misled by flawed feedback.

Checklist-Driven Evaluation Framework

To provide precise, reproducible scoring across hundreds of curated multi-turn dialogue samples, FB-Bench abandons generic grading. Instead, it uses a sample-specific weighted checklist.
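As a rough illustration of how a sample-specific weighted checklist might turn into a score, the sketch below assumes the score is the weight-normalised share of checklist items that the follow-up response satisfies. The article does not spell out the exact formula, so this is an assumption. The class names, field names, criteria, and weights are all hypothetical and are not taken from the FB-Bench dataset.

```python
from dataclasses import dataclass
from enum import Enum


class Scenario(Enum):
    # The two evaluation scenarios described in the article.
    ERROR_CORRECTION = "error_correction"
    RESPONSE_MAINTENANCE = "response_maintenance"


@dataclass
class ChecklistItem:
    criterion: str  # what the follow-up response must satisfy
    weight: float   # relative importance of this criterion (hypothetical)


@dataclass
class Sample:
    scenario: Scenario
    checklist: list[ChecklistItem]


def weighted_score(sample: Sample, verdicts: list[bool]) -> float:
    """Return the weight-normalised share of checklist items that passed.

    Assumed scoring rule: sum of weights of satisfied items divided by
    the sum of all weights.
    """
    if len(verdicts) != len(sample.checklist):
        raise ValueError("need exactly one verdict per checklist item")
    total = sum(item.weight for item in sample.checklist)
    if total <= 0:
        raise ValueError("checklist weights must sum to a positive value")
    earned = sum(
        item.weight for item, ok in zip(sample.checklist, verdicts) if ok
    )
    return earned / total


# Hypothetical error-correction sample with made-up criteria and weights.
sample = Sample(
    scenario=Scenario.ERROR_CORRECTION,
    checklist=[
        ChecklistItem("Acknowledges the error the user pointed out", 0.25),
        ChecklistItem("Provides the corrected answer", 0.5),
        ChecklistItem("Keeps the rest of the response intact", 0.25),
    ],
)

# In practice the per-item verdicts would come from the judge model;
# they are hard-coded here to keep the sketch self-contained.
print(weighted_score(sample, [True, True, False]))
```

Because each sample carries its own checklist, two samples in the same scenario can reward different behaviours. A Response Maintenance sample could, for instance, weight "keeps the original correct answer" most heavily.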

[Human-Curated Sample Prompt]
        │
        ▼
[LLM Follow-up Response]
        │
        ▼
[GPT-4o Evaluation Judge] ──► Evaluates against unique Weighted Checklist
        │
        ▼
[Final Granular Accuracy Score (>90% Human Agreement)]

When an LLM generates a follow-up response, an automated “LLM-as-a-Judge” engine (powered by GPT-4o) grades the output against that sample’s unique checklist. This granular evaluation protocol achieves an agreement rate of over 90% with human experts, which suggests that the resulting leaderboard scores closely track human judgement.

Why FB-Bench Matters for the Future of AI

The initial findings from FB-Bench reveal large performance gaps between popular open-source and proprietary frontier LLMs. While many models excel at straightforward instruction-following, their performance degrades when they must both correct genuine errors and resist misleading feedback.
