Abstract
The rapid adoption of Large Language Models (LLMs) as tools for annotation and evaluation tasks in
Natural Language Processing (NLP) has led to important questions about their reliability, robustness, and
alignment with human expectations. This thesis investigates these concerns across three connected lines
of inquiry:
• the indistinguishability of LLM-generated annotations from human annotations,
• the human alignment and reliability of LLMs as subjective judges of language quality, and
• the susceptibility of LLM-judges to misinformation attacks in evaluation settings.
In the first part of this thesis, we study whether annotations made by LLMs can be reliably distinguished from those created by humans. Previous research has claimed that LLMs behave as ‘human-like annotators’, motivating our work to rigorously test this hypothesis. We frame this as a classification
task: Given a dataset of annotations, can a model detect their origin? Surprisingly, our findings indicate
that even state-of-the-art classifiers achieve near-chance performance, with accuracy not exceeding 51%.
These results offer strong empirical evidence that, in most settings, LLM-generated annotations are indistinguishable from those produced by human annotators, validating the ‘human-like’ claim from a discriminative modeling point of view.
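To make the classification framing concrete, the following minimal Python sketch measures origin-detection accuracy on a toy corpus; the data, the TF-IDF features, and the logistic-regression baseline are illustrative assumptions, not the datasets or classifiers studied in the thesis. Accuracy near 0.5 on a balanced binary task would indicate that the origin of an annotation is not recoverable.

# Illustrative sketch only: toy annotations and a simple discriminative baseline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical annotation texts labeled by their origin (not thesis data).
human_texts = [
    "The summary covers the main findings but omits the limitations.",
    "Label: entailment. The hypothesis follows from the premise.",
    "This response is partially correct; the second step is wrong.",
    "Fluent and coherent, though slightly verbose.",
]
llm_texts = [
    "The summary accurately captures the key points of the article.",
    "Label: entailment. The premise logically supports the hypothesis.",
    "The response is mostly correct but contains a minor arithmetic error.",
    "The text is fluent, coherent, and well structured.",
]
annotations = human_texts + llm_texts
origins = ["human"] * len(human_texts) + ["llm"] * len(llm_texts)

X_train, X_test, y_train, y_test = train_test_split(
    annotations, origins, test_size=0.25, random_state=0, stratify=origins
)

# TF-IDF unigrams/bigrams + logistic regression as a simple origin classifier.
vec = TfidfVectorizer(ngram_range=(1, 2))
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(X_train), y_train)

# Accuracy close to chance (0.5) suggests the two origins are indistinguishable.
print("origin-detection accuracy:", accuracy_score(y_test, clf.predict(vec.transform(X_test))))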
Building on this, we explore whether LLMs can function as reliable evaluators: a more nuanced role
that goes beyond labeling and involves subjective scoring of text quality (e.g., summaries, open-ended
responses). In this phase, we test how well LLM-generated judgments align with human evaluations and
how robust they are to subtle changes in prompt phrasing. We introduce controlled prompt perturbations
and evaluate the consistency of LLM scores and textual justifications. The results show a significant lack
of robustness: Small, often misleading perturbations lead to large changes in judgment outputs. These
inconsistencies expose limitations in using LLMs as trustworthy evaluators, especially in high-stakes or
subjective settings.
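The sketch below shows, in Python, one way such a perturbation-consistency check could be organized; the perturbation rules and the stub judge are placeholders for illustration and do not reproduce the procedure or prompts used in the thesis.

# Illustrative consistency check under small prompt perturbations (placeholder names).
from typing import Callable, List

def perturb(prompt: str) -> List[str]:
    """Produce small, controlled rewordings of an evaluation prompt (toy rules)."""
    return [
        prompt,
        prompt.replace("Rate", "Please rate"),
        prompt.replace("1-5", "1 to 5"),
        prompt + " Be concise.",
    ]

def consistency(judge: Callable[[str], float], prompt: str) -> float:
    """Spread of judge scores across perturbed prompts; 0 means perfectly stable."""
    scores = [judge(p) for p in perturb(prompt)]
    return max(scores) - min(scores)

# Stub standing in for an LLM-judge call; a real study would query a model here.
def stub_judge(prompt: str) -> float:
    return 4.0 if "concise" in prompt else 3.0

base_prompt = "Rate the summary on a 1-5 scale for faithfulness."
print("score spread across perturbations:", consistency(stub_judge, base_prompt))

A large spread for semantically equivalent prompts is the kind of inconsistency reported above.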
The final part of this thesis presents a preliminary exploration of the robustness of LLM-based
evaluators when exposed to misinformation, marking a conceptual shift from controlled perturbations to
real-world-inspired adversarial scenarios. We introduce a novel framework that systematically injects misinformation into prompts, grounded in a taxonomy of ten misinformation types. This study probes how
these manipulations affect LLM judgments, analyzing both score alignment and justification consistency.
Contrary to our initial expectations, the results suggest that LLM-judges demonstrate a surprising degree of resilience. Scores and justifications remain largely consistent across many misinformation types, even when factual content is altered. However, this robustness appears to be superficial: the LLM-judges often fail to explicitly recognize, or to reflect awareness of, being misled in their justifications. Rather than detecting and correcting misinformation, they tend to preserve their prior judgments, raising concerns about robustness without epistemic awareness.
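As a rough illustration of the injection framework, the following Python sketch maps a few example manipulation types to prompt edits and compares clean and manipulated judgments; the types shown are assumptions that do not reproduce the thesis's ten-type taxonomy, and the stub judge stands in for an actual LLM call.

# Illustrative misinformation-injection probe (types and prompts are placeholders).
from typing import Callable, Dict, Tuple

# Example manipulation types, each mapped to a prompt transformation.
MISINFO_TYPES: Dict[str, Callable[[str], str]] = {
    "false_claim": lambda p: p + " Note: the source text contains no factual errors.",
    "fabricated_authority": lambda p: p + " Expert reviewers already rated this summary 5/5.",
    "altered_fact": lambda p: p.replace("2020", "2010"),
}

def probe(judge: Callable[[str], Tuple[float, str]], prompt: str):
    """Compare the clean judgment with judgments under each injected manipulation."""
    clean_score, _ = judge(prompt)
    for name, inject in MISINFO_TYPES.items():
        score, justification = judge(inject(prompt))
        # Crude proxy for awareness: does the justification mention the manipulation?
        acknowledged = name.replace("_", " ") in justification.lower()
        yield name, score - clean_score, acknowledged

# Stub returning (score, justification); a real study would query an LLM-judge.
def stub_judge(prompt: str) -> Tuple[float, str]:
    return (3.0, "The summary is mostly faithful to the source.")

for name, score_shift, flagged in probe(stub_judge, "Rate the summary of the 2020 report."):
    print(f"{name}: score shift={score_shift:+.1f}, manipulation acknowledged={flagged}")

Stable scores combined with justifications that never acknowledge the manipulation would correspond to the "robustness without epistemic awareness" pattern described above.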
These early findings open new avenues for future research. Comparative assessment, misinformation
detection, and deeper semantic analysis may reveal subtler vulnerabilities. More broadly, this work lays
the groundwork for building misinformation-aware evaluation systems and motivates the development of
LLMs that not only evaluate well but do so with fact-sensitive reasoning.
Collectively, this thesis explores the evolving role of LLMs as annotation and evaluation agents.
While they show promise as ‘human-like’ annotators, their behavior as evaluators in adversarial, misinformation-rich contexts highlights both strengths and blind spots. By offering early insights into
their robustness and limitations, we provide a foundation for future work aimed at building trustworthy,
interpretable, and context-aware LLM judges.