ParliaBench: An Evaluation and Benchmarking Framework for LLM-Generated Parliamentary Speech

UK Parliamentary Proceedings Dataset (2015-2022)

Project Overview

This project explores political dialogue generation using large language models fine-tuned on UK parliamentary speeches. It encompasses data processing, model selection, fine-tuning with QLoRA, speech generation, and comprehensive evaluation across linguistic, semantic, and political dimensions.

  • 📊 Data Processing: 447K speeches, 1.9K speakers
  • 🤖 Models: 5 LLMs fine-tuned with QLoRA
  • 💬 Generation: 2.7K speeches per model
  • 📈 Evaluation: 12+ metrics across 4 dimensions

Dataset Overview

The ParlaMint-GB dataset version 5.0 from CLARIN contains structured UK parliamentary proceedings with comprehensive metadata including speaker information, political affiliations, gender, and complete speech transcripts.

  • Total Speeches: 447,778
  • Unique Speakers: 1,901
  • Political Parties: 11
  • Total Words: ~99.94M
  • Time Period: January 5, 2015 to July 21, 2022
  • Houses: House of Commons & House of Lords
  • Mean Words per Speech: 223.2 | Median: 99.0

Data Cleaning Criteria

  • Kept only parties with more than 1,000 speeches
  • Removed speeches with fewer than 35 words (5th percentile)
  • Removed speeches with over 1,580 words (99th percentile)
  • Filtered out "Unknown" party affiliation
  • Removed "Business of the House" and "Point of Order" sections
  • Standardized quotation marks to regular double quotes

Party Distribution

| Party | Political Orientation | Total Speeches | Percentage |
|---|---|---|---|
| Conservative | Centre-right | 263,513 | 58.85% |
| Labour | Centre-left | 108,831 | 24.31% |
| Scottish National Party | Centre-left | 23,562 | 5.26% |
| Liberal Democrats | Centre to centre-left | 23,517 | 5.25% |
| Crossbench | Unknown | 11,878 | 2.65% |
| Democratic Unionist Party | Right | 6,610 | 1.48% |
| Others | Various | 9,867 | 2.20% |

Data Processing Pipeline

1. XML Parsing & Metadata Extraction

  • Parse speaker information from listPerson.xml
  • Extract political affiliations with temporal bounds
  • Extract speech content from dated session XML files
  • Filter out procedural elements

2. Temporal Alignment

  • Match speeches to correct political party at time of delivery
  • Handle party changes and role transitions
  • Use temporal validity ranges (@from and @to attributes)
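
A minimal sketch of the temporal-alignment idea, assuming affiliation records parsed from listPerson.xml carry optional @from/@to dates; the class and function names are illustrative, not the project's actual code.

```python
from datetime import date
from typing import List, Optional

class Affiliation:
    """One affiliation record from listPerson.xml (@from/@to temporal bounds)."""
    def __init__(self, party: str, start: Optional[date], end: Optional[date]):
        self.party = party
        self.start = start  # @from attribute; None if unbounded
        self.end = end      # @to attribute; None if still current

def party_at(affiliations: List[Affiliation], speech_date: date) -> Optional[str]:
    """Return the party that was valid on the day the speech was delivered."""
    for aff in affiliations:
        starts_ok = aff.start is None or aff.start <= speech_date
        ends_ok = aff.end is None or speech_date <= aff.end
        if starts_ok and ends_ok:
            return aff.party
    return None  # no valid affiliation on that date -> treated as unknown

# Example: a member who changed affiliation mid-period.
affs = [
    Affiliation("Labour", date(2015, 5, 7), date(2019, 2, 18)),
    Affiliation("Independent", date(2019, 2, 19), None),
]
print(party_at(affs, date(2017, 6, 1)))  # -> "Labour"
```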

3. Prompt Extraction

  • Identify and separate question prompts from speeches
  • Store prompts as list of strings
  • Clean prompts by removing number and letter prefixes

4. Political Orientation Classification

  • Extract political orientation codes from ParlaMint-listOrg.xml
  • Map codes to orientation labels (Left, Centre, Right, etc.)
  • 13 distinct orientation categories

5. Topic Categorization (EuroVoc)

  • Direct mapping for 13 categories with clear CAP-EuroVoc correspondence
  • Automated classification using KEVLar for complex categories
  • 21 EuroVoc thematic categories
  • Highest confidence score used for final topic assignment

EuroVoc Topic Categories

Speeches were classified into 21 thematic categories using the EuroVoc taxonomy:

  • International Relations
  • Law
  • Social Questions
  • Politics
  • Education and Communications
  • Geography
  • Economics
  • Employment and Working Conditions
  • European Union
  • Transport
  • Trade
  • Environment
  • Production, Technology and Research
  • Energy
  • Agriculture, Forestry and Fisheries
  • Finance
  • Industry
  • Business and Competition
  • Agri-foodstuffs
  • International Organisations
  • Science

Training Data Structure

Each training instance contains the following features:

  • speech: The parliamentary speech text
  • section: Debate section context
  • party: Political party affiliation
  • prompts: Associated question prompts
  • house: House of Commons or House of Lords
  • political_orientation_label: Political orientation classification
  • eurovoc_topic: Thematic category
Train-Test Split: 80% training / 20% test (random seed: 42)
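
A one-line sketch of the split, assuming the processed speeches sit in a pandas DataFrame; the file path and column layout are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_parquet("parlamint_gb_processed.parquet")  # hypothetical path

# 80/20 split with the fixed seed reported above
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
print(len(train_df), len(test_df))
```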

Key Statistics

| Statistic | Value |
|---|---|
| Mean words per speech | 223.2 |
| Median words per speech | 99.0 |
| Standard deviation | 278.7 |
| Minimum words | 36 |
| Maximum words | 1,579 |

Model Selection

Five large language models were selected based on their architecture, performance, and compatibility with the Unsloth fine-tuning framework. All models use 4-bit quantization for memory efficiency.

| Model | Memory Reduction | Inference Speed |
|---|---|---|
| mistral-7b-v0.3-bnb-4bit | 62% | 2.2× |
| Meta-Llama-3.1-8B-bnb-4bit | 58% | 2.4× |
| gemma-2-9b-bnb-4bit | 58% | 2.2× |
| Qwen2-7B-bnb-4bit | N/A | N/A |
| Yi-1.5-6b-bnb-4bit | N/A | N/A |

Fine-Tuning Methodology

QLoRA (Quantized Low-Rank Adaptation)

Parameter-efficient fine-tuning using 4-bit quantization with low-rank matrix adaptation, enabling efficient model customization without massive computational resources.

QLoRA Configuration

| Parameter | Value | Rationale |
|---|---|---|
| LoRA Rank (r) | 16 | Optimal balance for fast fine-tuning |
| LoRA Alpha | 16 | Set equal to rank (α/r = 1) as a baseline |
| Target Modules | 7 modules | All linear projection layers |
| LoRA Dropout | 0 | Enables Unsloth optimizations |
| Bias Configuration | none | Faster training, reduced memory |
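
A minimal sketch of this configuration using Unsloth's PEFT wrapper. The repository id, seed, and gradient-checkpointing setting are assumptions for illustration; the rank, alpha, dropout, bias, and seven projection modules follow the table above.

```python
from unsloth import FastLanguageModel

# Load one of the five 4-bit base models (repo id illustrative)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=1024,
    load_in_4bit=True,
)

# Attach LoRA adapters with the configuration from the table
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,              # alpha/r = 1
    lora_dropout=0,             # 0 enables Unsloth's fast path
    bias="none",
    target_modules=[            # the 7 linear projection modules
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    use_gradient_checkpointing="unsloth",  # assumption
    random_state=42,                        # assumption
)
```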

Training Configuration

| Parameter | Value | Justification |
|---|---|---|
| Batch Size | 64 | GPU memory optimization |
| Learning Rate | 2e-4 | Standard for LoRA fine-tuning |
| Max Steps | 11,194 | 2 epochs |
| Warmup Steps | 336 | Warmup for early-training stability |
| Optimizer | AdamW | Memory-efficient |
| Weight Decay | 0.01 | Prevents overfitting on political data |
| Max Sequence Length | 1024 | Suited to the dataset's speech-length distribution |
| Scheduler | Linear | Linear learning-rate decay |
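
A sketch of the corresponding trainer setup with TRL's SFTTrainer, continuing from the QLoRA sketch above and assuming a Hugging Face Dataset of formatted prompt/response strings. The 8-bit AdamW variant, output directory, and logging cadence are assumptions; newer TRL versions move dataset_text_field and max_seq_length into SFTConfig.

```python
from transformers import TrainingArguments
from trl import SFTTrainer

args = TrainingArguments(
    output_dir="outputs",                 # assumption
    per_device_train_batch_size=64,
    learning_rate=2e-4,
    max_steps=11_194,                     # ~2 epochs over the training split
    warmup_steps=336,
    optim="adamw_8bit",                   # assumption: memory-efficient AdamW variant
    weight_decay=0.01,
    lr_scheduler_type="linear",
    logging_steps=50,                     # assumption
    seed=42,
)

trainer = SFTTrainer(
    model=model,                          # QLoRA-wrapped model from the sketch above
    tokenizer=tokenizer,
    train_dataset=train_dataset,          # assumed dataset with a "text" column
    dataset_text_field="text",
    max_seq_length=1024,
    args=args,
)
trainer.train()
```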

System Prompt

You are a seasoned UK parliamentary member. Use proper British parliamentary language appropriate for the specified House. The speech should reflect the political orientation and typical positions of the specified party on the given topic.

Context Fields (Input to Model)

  • PARTY: Political party affiliation (e.g., Conservative)
  • EUROVOC TOPIC: Thematic classification (e.g., TRADE)
  • SECTION: Parliamentary debate section
  • POLITICAL ORIENTATION: Orientation label (e.g., Centre-right)
  • HOUSE: House of Commons or House of Lords
  • INSTRUCTION: Question prompt or generic instruction
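
A minimal sketch of how these fields might be assembled into a chat-style prompt around the system prompt above; the field layout, dictionary keys, and use of the tokenizer's chat template are illustrative assumptions, not the project's exact template.

```python
SYSTEM_PROMPT = (
    "You are a seasoned UK parliamentary member. Use proper British parliamentary "
    "language appropriate for the specified House. The speech should reflect the "
    "political orientation and typical positions of the specified party on the given topic."
)

def build_messages(row: dict) -> list:
    """Assemble the context fields into chat messages (illustrative layout)."""
    context = (
        f"PARTY: {row['party']}\n"
        f"EUROVOC TOPIC: {row['eurovoc_topic']}\n"
        f"SECTION: {row['section']}\n"
        f"POLITICAL ORIENTATION: {row['political_orientation_label']}\n"
        f"HOUSE: {row['house']}\n"
        f"INSTRUCTION: {row['instruction']}"
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": context},
    ]

messages = build_messages({
    "party": "Conservative", "eurovoc_topic": "TRADE",
    "section": "Trade Bill", "political_orientation_label": "Centre-right",
    "house": "House of Commons",
    "instruction": "What steps is the Government taking to support exporters?",
})
# prompt_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```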

Tools & Environment

  • Framework: Unsloth + HuggingFace
  • Backend: PyTorch
  • Hardware: AWS A100 GPU
  • Training Method: Supervised Fine-Tuning

Speech Generation Pipeline

A systematic speech generation system that loads trained models and creates political speeches based on structured inputs.

Input Distribution

All models received identical generation tasks with realistic distributional characteristics:

House Distribution

  • House of Commons: 78%
  • House of Lords: 22%

Top Parties by Weight

  • Conservative: 59%
  • Labour: 24%
  • Scottish National Party: 5%
  • Liberal Democrats: 5%

Generation Parameters

| Parameter | Value | Purpose |
|---|---|---|
| Speeches per Model | 2,700 | Comprehensive evaluation dataset |
| Temperature | 0.7 | Balances coherence and variation |
| Top-p (Nucleus Sampling) | 0.85 | Focused yet diverse outputs |
| Repetition Penalty | 1.2 | Prevents redundant phrasing |
| Batch Size | 32 | 3× speed improvement |
| Min Word Count | 43 | P10 threshold for quality |
| Max Word Count | 635 | P90 threshold for quality |
| Max New Tokens | 850 | 1.33× P90 speech length |
Decoding Strategy: Nucleus sampling (top-p) chosen over greedy/beam search to avoid repetitive or incoherent text
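
A sketch of single-prompt generation with these sampling parameters, continuing from the fine-tuning and prompt-building sketches above (prompt_text, model, and tokenizer are assumed from there); the real pipeline batches 32 prompts at a time.

```python
import torch
from unsloth import FastLanguageModel

FastLanguageModel.for_inference(model)   # enable Unsloth's fast generation path

inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        do_sample=True,                   # nucleus sampling, not greedy/beam search
        temperature=0.7,
        top_p=0.85,
        repetition_penalty=1.2,
        max_new_tokens=850,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens
speech = tokenizer.decode(
    output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
```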

Speech Validation Process

9-step validation procedure to ensure quality, coherence, and relevance of generated speeches:

1. Template Marker Detection
Detects 27 template artifacts (role markers, context labels, special tokens)
2. Unicode Corruption Detection
Identifies 14 corruption patterns and checks 11 forbidden Unicode ranges (CJK, Cyrillic, Arabic, etc.)
3. Language Detection
Uses spacy-langdetect to flag non-English text (>85% confidence threshold)
4. Repetition Detection
Three patterns: (1) Same word >3 times, (2) Sequences of 3-7 words >3 times, (3) Counting patterns
5. Semantic Relevance Check
Cosine similarity between speech and context (threshold: <0.08 flagged as off-topic)
6. Length Constraints
Validates word count (43-635 words)
7. Concatenation Detection
Detects multiple opening phrases (≥4 instances of "My Lords", "Mr Speaker", etc.)
8. Corrupted Endings Detection
Identifies nonsensical endings
9. Refusal Pattern Matching
Catches AI refusal patterns ("I cannot generate", "I'm sorry but...")
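
For illustration, a minimal sketch of four of these checks (steps 4, 6, 7, and 9); the phrase lists, regexes, and thresholds shown are simplified stand-ins for the full rule set.

```python
import re

OPENINGS = ["My Lords", "Mr Speaker", "Madam Deputy Speaker", "Mr Deputy Speaker"]
REFUSALS = [r"\bI cannot generate\b", r"\bI'm sorry,? but\b", r"\bAs an AI\b"]

def word_count_ok(text: str, lo: int = 43, hi: int = 635) -> bool:
    """Step 6: length constraints (P10-P90 word-count window)."""
    return lo <= len(text.split()) <= hi

def too_many_openings(text: str, limit: int = 4) -> bool:
    """Step 7: flag likely concatenated speeches (>= 4 opening phrases)."""
    return sum(text.count(phrase) for phrase in OPENINGS) >= limit

def repeated_ngram(text: str, n: int = 3, max_repeats: int = 3) -> bool:
    """Step 4 (pattern 2): any n-gram occurring more than max_repeats times."""
    words = text.lower().split()
    counts = {}
    for i in range(len(words) - n + 1):
        gram = tuple(words[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
    return any(c > max_repeats for c in counts.values())

def is_refusal(text: str) -> bool:
    """Step 9: catch common AI refusal patterns."""
    return any(re.search(p, text, flags=re.IGNORECASE) for p in REFUSALS)
```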

Evaluation Framework

Comprehensive multi-dimensional assessment system evaluating linguistic quality, semantic coherence, political alignment, and overall effectiveness.

1. Linguistic Quality & Diversity Metrics

Perplexity

Measures how natural the text appears to a language model. Lower scores indicate more human-like, predictable text.

  • Model: GPT-2 base (117M parameters)
  • Processing: Max 512 words per speech, batch size 8
  • Interpretation: Lower = more natural
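
A minimal sketch of the perplexity computation with GPT-2; batching and exact truncation details are simplified relative to the description above.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str, max_words: int = 512) -> float:
    """Perplexity of a speech under GPT-2; lower = more natural."""
    text = " ".join(text.split()[:max_words])         # cap at 512 words
    enc = tok(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        out = lm(**enc, labels=enc["input_ids"])      # mean token cross-entropy
    return torch.exp(out.loss).item()

print(perplexity("I beg to move that the Bill be now read a Second time."))
```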

Distinct-N (N=1,2,3,4)

Evaluates lexical diversity by measuring the ratio of unique n-grams to total tokens.

  • Distinct-1: Unique unigrams (basic lexical diversity)
  • Distinct-2: Unique bigrams (phrase-level variety)
  • Distinct-3/4: Multi-word patterns (sophisticated language use)
  • Interpretation: Higher = more diverse vocabulary
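
A minimal sketch of Distinct-N, normalising by total tokens as described above (some implementations divide by the total n-gram count instead).

```python
def distinct_n(text: str, n: int) -> float:
    """Unique n-grams divided by total tokens; higher = more diverse."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return 0.0
    ngrams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return len(ngrams) / len(tokens)

speech = "the house must act and the house must act now"
print(distinct_n(speech, 1), distinct_n(speech, 2))
```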

Self-BLEU

Measures similarity between generated texts from the same model to detect repetitive content.

  • Method: Each speech compared to all others from same model
  • Interpretation: Lower = higher diversity (desirable)
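
A minimal Self-BLEU sketch using NLTK; the smoothing method and uniform 4-gram weights are assumptions, and the all-pairs comparison is O(N²), so large speech sets are typically subsampled.

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

def self_bleu(speeches, n: int = 4) -> float:
    """Mean BLEU of each speech against all others from the same model (lower = more diverse)."""
    smooth = SmoothingFunction().method1
    weights = tuple([1.0 / n] * n)
    tokenised = [s.lower().split() for s in speeches]
    scores = []
    for i, hyp in enumerate(tokenised):
        refs = tokenised[:i] + tokenised[i + 1:]
        scores.append(sentence_bleu(refs, hyp, weights=weights, smoothing_function=smooth))
    return sum(scores) / len(scores)

print(self_bleu([
    "Mr Speaker, this Bill protects working families.",
    "My Lords, the amendment strengthens environmental safeguards.",
    "Mr Speaker, this Bill protects working families across the country.",
]))
```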

2. Semantic Coherence & Text Quality

GRUEN Score

Comprehensive quality metric combining grammaticality, non-redundancy, focus, structure, and coherence.

  • Grammaticality: BERT perplexity + CoLA classifier (0-1 scale)
  • Non-Redundancy: LCS, Edit Distance, Word Overlap between sentences
  • Focus: Word Mover's Distance + SpaCy semantic similarity
  • Formula: GRUEN = min(1, max(0, G + R_penalty + F_penalty))

BERTScore

Measures semantic similarity between generated and real speeches using contextualized embeddings.

  • Model: RoBERTa-large (auto-selected for English)
  • References: N=6 most semantically similar speeches from ParlaMint-GB
  • Metrics: Precision, Recall, F1
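
A minimal BERTScore sketch with the bert-score package; the candidate and reference strings are placeholders for a generated speech and its six retrieved ParlaMint-GB neighbours.

```python
from bert_score import score

cands = ["Mr Speaker, this Bill strengthens our trading relationships with growing markets."]
refs = [[
    "Mr Speaker, the Government is committed to free and fair trade with our partners.",
    "The Bill before the House today supports British exporters in global markets.",
    # ... up to six retrieved reference speeches per candidate
]]

P, R, F1 = score(cands, refs, lang="en", verbose=False)   # lang="en" selects roberta-large
print(F1.mean().item())
```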

MoverScore

Computes optimal transport cost between embedding distributions using Earth Mover's Distance.

  • Model: DistilBERT-base-uncased
  • Method: IDF-weighted embeddings with POT library
  • References: N=6 most semantically similar speeches from ParlaMint-GB
  • Score Range: 0-1 (higher = better alignment)

3. Political Alignment Metrics

Political Spectrum Alignment (PSA)

Measures ideological alignment with expected political orientation (13-point scale from Far-left to Far-right).

  • Model: all-mpnet-base-v2 sentence transformer
  • Method: Cosine similarity to orientation centroids
  • Formula: PSA = cosine_similarity × max(0, 100 - d/12 × 100), where d is the distance from the expected orientation on the 13-point scale
  • Range: 0-1 (higher = better alignment)

Party Alignment

Assesses whether speech captures party-specific linguistic characteristics.

  • Model: all-mpnet-base-v2 sentence transformer
  • Method: Cosine similarity to party-specific centroids
  • Range: 0-1 (higher = better party alignment)
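
A minimal sketch of the centroid-similarity machinery behind Party Alignment (PSA uses the same embeddings against orientation centroids, with the distance penalty above); the corpus passed to build_centroids and the normalisation choices are illustrative.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-mpnet-base-v2")

def build_centroids(speeches_by_party: dict) -> dict:
    """Mean embedding of real ParlaMint-GB speeches per party."""
    centroids = {}
    for party, speeches in speeches_by_party.items():
        emb = encoder.encode(speeches, normalize_embeddings=True)
        centroids[party] = emb.mean(axis=0)
    return centroids

def party_alignment(speech: str, centroid: np.ndarray) -> float:
    """Cosine similarity between a generated speech and its target party centroid."""
    v = encoder.encode([speech], normalize_embeddings=True)[0]
    c = centroid / np.linalg.norm(centroid)
    return float(np.dot(v, c))
```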

4. LLM-as-a-Judge Evaluation

Automated assessment using Flow-Judge-v0.1 (3.8B parameters, 4-bit quantization) across six dimensions on a 10-point scale with rubrics for each metric:

| Metric | Evaluation Criteria |
|---|---|
| Coherence | Logical flow, argument connectivity, parliamentary structure |
| Conciseness | Efficient message delivery without excessive verbosity (parliamentary context) |
| Relevance | Direct addressing of the prompt/question with complete coverage |
| Authenticity | Natural Westminster discourse vs. AI-generated patterns |
| Political Appropriateness | Alignment with the party's typical positions and rhetoric |
| Overall Quality | Effectiveness as political communication, argumentation sophistication |
Configuration: Batch size 32, Temperature 0.3, Max new tokens 2000, Default score -1 for errors
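
A heavily simplified, single-metric sketch of the judge loop; the Hugging Face repository id, rubric text, and score-parsing logic are assumptions for illustration (the actual setup uses per-metric rubrics and allows up to 2,000 new tokens of feedback).

```python
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

JUDGE_ID = "flowaicom/Flow-Judge-v0.1"   # assumed repo id; check the Flow-Judge release

tok = AutoTokenizer.from_pretrained(JUDGE_ID)
judge = AutoModelForCausalLM.from_pretrained(JUDGE_ID, torch_dtype=torch.float16, device_map="auto")

RUBRIC = (
    "Rate the following parliamentary speech for COHERENCE on a 1-10 scale "
    "(1-3: disjointed; 4-6: mostly logical flow; 7-10: clear, well-structured argumentation).\n"
    "Speech:\n{speech}\n\nEnd your answer with 'Score: <integer>'."
)

def judge_coherence(speech: str) -> int:
    messages = [{"role": "user", "content": RUBRIC.format(speech=speech)}]
    ids = tok.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(judge.device)
    out = judge.generate(ids, max_new_tokens=256, do_sample=True, temperature=0.3)
    reply = tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
    match = re.search(r"Score:\s*(\d+)", reply)
    return int(match.group(1)) if match else -1   # default -1 on parse errors, as above
```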

Evaluation Summary

  • Linguistic Metrics: Perplexity, Distinct-N, Self-BLEU
  • Semantic Metrics: GRUEN, BERTScore, MoverScore
  • Political Metrics: PSA, Party Alignment
  • Judge Metrics: 6 dimensions (1-10 scale)

Key Findings & Results

Fine-Tuning Impact

The horizontal bar chart displays percentage improvements from baseline to fine-tuned models across multiple evaluation metrics. Yi 6B shows the most impressive gains, with J_Auth improving by 106.4% and several other metrics rising by 30-70%. For most models, PSA and Party Align improve significantly, whereas some metrics, notably GRUEN and BERTScore, decline. These mixed results show that fine-tuning optimizes certain dimensions while potentially compromising others, underscoring the importance of multi-metric evaluation.

Fine-Tuning Improvements by Metric Category

The heatmap displays absolute changes in performance from baseline to fine-tuned models, separated by computational and LLM-judge metrics. Yi 6B demonstrates the strongest improvements across both measurement categories, showing particularly dramatic gains in LLM-judge metrics (0.237). Llama 3.1 8B also shows substantial positive changes (0.116 LLM-judge, 0.050 computational). In contrast, Qwen2 7B exhibits slight negative changes in LLM-judge metrics (-0.026), while Gemma 2 9B shows minimal improvement. The results indicate that LLM-judge metrics are generally more sensitive to fine-tuning effects than computational metrics.

Stability Performance

The bar chart presents stability scores calculated as 100/(1 + CV) across three dimensions: Party Stability, Topic Stability, and Orientation Stability for five model architectures. All models achieve remarkably high topic and orientation stability scores (> 91), indicating consistent performance across different subject matters and political orientations. Party stability shows the greatest variation among models, suggesting this dimension is most sensitive to model architecture differences.
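
A tiny sketch of the stability score used here, where CV is the coefficient of variation (std/mean) of a model's scores across contexts; the example numbers are invented.

```python
import numpy as np

def stability(scores) -> float:
    """Stability = 100 / (1 + CV), with CV = std / mean across contexts (e.g. parties)."""
    scores = np.asarray(scores, dtype=float)
    cv = scores.std() / scores.mean()
    return 100.0 / (1.0 + cv)

# e.g. one model's party-alignment scores across parties
print(round(stability([0.61, 0.63, 0.59, 0.62]), 1))
```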

Party Performance

Party alignment performance varied substantially across models. Major parties (Conservative, Labour) achieved stable performance across models, benefiting from substantial training data (58.9% and 24.3% of speeches, respectively). Minor parties exhibited greater variability. Mistral struggled with heterogeneous groups (Non-Affiliated: 0.436), while Qwen excelled with ideologically coherent minorities (Bishops: 0.664). Yi demonstrated robust cross-party performance (0.614-0.633). Both new political authenticity metrics (PSA and Party Align) successfully discriminate their target political dimensions: Party Align distinguishes parties while PSA distinguishes orientations (both p < 0.001). Our analysis reveals that Party Align performance depends primarily on data abundance and ideological coherence rather than party size alone. Models successfully learn party-specific language patterns when training data provides clear stylistic signals, indicating that targeted data collection for under-represented parties could improve coverage.

Party Difficulty

Applying cross-context stability analysis, party difficulty scores ranged narrowly (0.382-0.456), with no statistically significant differences. This suggests relatively consistent modeling challenges across parties regardless of size or ideological composition.

Orientation Performance

Performance across political orientations showed expected patterns. Centrist positions dominated the dataset and achieved higher scores. Model-specific strengths also emerged: Gemma, Yi, Llama, and Qwen all achieved their highest scores on Right positions, while Mistral underperformed consistently, indicating architectural rather than ideological limitations.

Orientation Difficulty

No statistically significant differences in difficulty were found between political orientations.

Topic Performance

The figure shows model performance across topic domains. Science achieved the lowest average score (0.516), while Economics (0.610) and European Union (0.606) showed the highest performance.

Topic Difficulty

Science and Geography ranked as the most difficult topics, while Finance, Business, and Economics ranked as the least difficult. Technical and natural-science domains display higher cross-model disagreement than economic and political topics, consistent with greater terminological specialization and rapidly evolving concepts. In contrast, economic and political discussion employs more stable conceptual frameworks aligned with core parliamentary functions.

Research Objective: Investigate the generation of authentic political discourse through domain-specific fine-tuning of large language models on UK parliamentary speeches.

1. Architectural Design & Context Window Effects

Best Overall Performer: Llama 3.1 8B

  • 128,000-token context window enabled superior performance across multiple dimensions
  • Enhanced instruction-following capabilities
  • Better captured argumentative structure and rhetorical patterns
  • Successfully referenced prior statements and built cumulative cases

Most Stable Performer: Gemma 2 9B

  • Consistent scores across all political parties
  • Performed well regardless of party affiliation or training data abundance
  • Strong cross-context stability

Second Best: Yi 1.5 6B

  • Outperformed larger models despite fewer parameters
  • Bilingual pretraining advantage
  • 3-trillion-token corpus exposure conferred generalization benefits

Weakest Performer: Mistral 7B v0.3

  • Scored below 0.50 on technical topics (Science: 0.483, Agri-foodstuffs: 0.475)
  • Struggled with ideologically diverse parties (Non-Affiliated: 0.436, Independent: 0.482)
  • 8,000-token sliding window insufficient for extended contextual dependencies
  • Poor at capturing nuanced ideological positioning

2. Domain-Specific Fine-Tuning Impact

  • 45/70 metrics showed statistically significant improvement
  • 64% success rate

Statistically Robust Findings (p < 0.05):

  • Self-BLEU decreased: Reduced formulaic repetition in favor of contextually appropriate variation
  • Political Spectrum Alignment (PSA) improved significantly: Better captured ideological positioning of different political orientations
  • Party Alignment scores increased: Enhanced fidelity to party-specific rhetoric, policy positions, and argumentative strategies
  • Improvements held across all model architectures: Domain adaptation through supervised fine-tuning is transferable and reliable
Key Insight: Fine-tuning improvements were not uniformly distributed: political authenticity metrics showed the strongest gains, demonstrating that domain adaptation is particularly effective for capturing ideological authenticity.

3. Novel Political Authenticity Metrics Validation

Methodological Contribution: Introduction and validation of Political Spectrum Alignment (PSA) and Party Alignment metrics extends beyond conventional NLP evaluation approaches.

  • Statistical significance: p < 0.001
  • Discriminative testing: high confidence

Validation Results:

  • Party Alignment: Successfully discriminates between parties, achieving differentiation even for ideologically proximate parties (e.g., Labour vs. Liberal Democrats)
  • PSA: Successfully distinguishes political orientations across the left-right spectrum
  • Both metrics capture their intended political dimensions with high statistical confidence
Impact: These metrics provide a validated framework for evaluating political authenticity in generated text, applicable to future research in political discourse generation.

Summary of Principal Findings

  1. Architecture matters: Extended context windows and advanced attention mechanisms are crucial for capturing complex political discourse
  2. Fine-tuning works: Domain-specific adaptation significantly improves political authenticity across diverse model architectures
  3. Novel metrics validated: PSA and Party Alignment successfully capture political dimensions beyond traditional NLP metrics

📚 Citation

If you use ParliaBench in your research, please cite:

@misc{ParliaBench2025,
    title={ParliaBench: An Evaluation and Benchmarking Framework 
           for LLM-Generated Parliamentary Speech},
    author={Marios Koniaris and Argyro Tsipi and Panayiotis Tsanakas},
    year={2025},
    eprint={2511.08247},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2511.08247}
}
Paper: arXiv:2511.08247