ParliaBench: An Evaluation and Benchmarking Framework for LLM-Generated Parliamentary Speech

UK Parliamentary Proceedings Dataset (2015-2022)

Project Overview

This project explores political dialogue generation using large language models fine-tuned on UK parliamentary speeches. It encompasses data processing, model selection, fine-tuning with QLoRA, speech generation, and comprehensive evaluation across linguistic, semantic, and political dimensions.

  • 📊 Data Processing: 447K speeches, 1.9K speakers
  • 🤖 Models: 5 LLMs fine-tuned with QLoRA
  • 💬 Generation: 2.7K speeches per model
  • 📈 Evaluation: 12+ metrics across 4 dimensions

Dataset Overview

The ParlaMint-GB dataset version 5.0 from CLARIN contains structured UK parliamentary proceedings with comprehensive metadata including speaker information, political affiliations, gender, and complete speech transcripts.

  • Total Speeches: 447,778
  • Unique Speakers: 1,901
  • Political Parties: 11
  • Total Words: ~99.94M
  • Time Period: January 5, 2015 to July 21, 2022
  • Houses: House of Commons & House of Lords
  • Mean Words per Speech: 223.2 | Median: 99.0

Data Cleaning Criteria

  • Kept only parties with more than 1,000 speeches
  • Removed speeches with fewer than 35 words (5th percentile)
  • Removed speeches with over 1,580 words (99th percentile)
  • Filtered out "Unknown" party affiliation
  • Removed "Business of the House" and "Point of Order" sections
  • Standardized quotation marks to regular double quotes

Party Distribution

| Party | Political Orientation | Total Speeches | Percentage |
|---|---|---|---|
| Conservative | Centre-right | 263,513 | 58.85% |
| Labour | Centre-left | 108,831 | 24.31% |
| Scottish National Party | Centre-left | 23,562 | 5.26% |
| Liberal Democrats | Centre to centre-left | 23,517 | 5.25% |
| Crossbench | Unknown | 11,878 | 2.65% |
| Democratic Unionist Party | Right | 6,610 | 1.48% |
| Others | Various | 9,867 | 2.20% |

Data Processing Pipeline

1. XML Parsing & Metadata Extraction

  • Parse speaker information from listPerson.xml
  • Extract political affiliations with temporal bounds
  • Extract speech content from dated session XML files
  • Filter out procedural elements

2. Temporal Alignment

  • Match speeches to correct political party at time of delivery
  • Handle party changes and role transitions
  • Use temporal validity ranges (@from and @to attributes)
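
A minimal sketch of the temporal-alignment idea, assuming affiliation records parsed from listPerson.xml carry optional @from/@to dates; the class and function names are illustrative, not the project's actual code.

```python
from datetime import date
from typing import List, Optional

class Affiliation:
    """One affiliation record from listPerson.xml (@from/@to temporal bounds)."""
    def __init__(self, party: str, start: Optional[date], end: Optional[date]):
        self.party = party
        self.start = start  # @from attribute; None if unbounded
        self.end = end      # @to attribute; None if still current

def party_at(affiliations: List[Affiliation], speech_date: date) -> Optional[str]:
    """Return the party that was valid on the day the speech was delivered."""
    for aff in affiliations:
        starts_ok = aff.start is None or aff.start <= speech_date
        ends_ok = aff.end is None or speech_date <= aff.end
        if starts_ok and ends_ok:
            return aff.party
    return None  # no valid affiliation on that date -> treated as unknown

# Example: a member who changed affiliation mid-period.
affs = [
    Affiliation("Labour", date(2015, 5, 7), date(2019, 2, 18)),
    Affiliation("Independent", date(2019, 2, 19), None),
]
print(party_at(affs, date(2017, 6, 1)))  # -> "Labour"
```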

3. Prompt Extraction

  • Identify and separate question prompts from speeches
  • Store prompts as list of strings
  • Clean prompts by removing number and letter prefixes

4. Political Orientation Classification

  • Extract political orientation codes from ParlaMint-listOrg.xml
  • Map codes to orientation labels (Left, Centre, Right, etc.)
  • 13 distinct orientation categories

5. Topic Categorization (EuroVoc)

  • Direct mapping for 13 categories with clear CAP-EuroVoc correspondence
  • Automated classification using KEVLar for complex categories
  • 21 EuroVoc thematic categories
  • Highest confidence score used for final topic assignment

EuroVoc Topic Categories

Speeches were classified into 21 thematic categories using the EuroVoc taxonomy:

  • International Relations
  • Law
  • Social Questions
  • Politics
  • Education and Communications
  • Geography
  • Economics
  • Employment and Working Conditions
  • European Union
  • Transport
  • Trade
  • Environment
  • Production, Technology and Research
  • Energy
  • Agriculture, Forestry and Fisheries
  • Finance
  • Industry
  • Business and Competition
  • Agri-foodstuffs
  • International Organisations
  • Science

Training Data Structure

Each training instance contains the following features:

  • speech: The parliamentary speech text
  • section: Debate section context
  • party: Political party affiliation
  • prompts: Associated question prompts
  • house: House of Commons or House of Lords
  • political_orientation_label: Political orientation classification
  • eurovoc_topic: Thematic category
Train-Test Split: 80% training / 20% test (random seed: 42)
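
A one-line sketch of the split, assuming the processed speeches sit in a pandas DataFrame; the file path and column layout are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_parquet("parlamint_gb_processed.parquet")  # hypothetical path

# 80/20 split with the fixed seed reported above
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
print(len(train_df), len(test_df))
```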

Key Statistics

| Statistic | Value |
|---|---|
| Mean words per speech | 223.2 |
| Median words per speech | 99.0 |
| Standard deviation | 278.7 |
| Minimum words | 36 |
| Maximum words | 1,579 |

Model Selection

Five large language models were selected based on their architecture, performance, and compatibility with the Unsloth fine-tuning framework. All models use 4-bit quantization for memory efficiency.

| Model | Memory Reduction | Inference Speed |
|---|---|---|
| mistral-7b-v0.3-bnb-4bit | 62% | 2.2× |
| Meta-Llama-3.1-8B-bnb-4bit | 58% | 2.4× |
| gemma-2-9b-bnb-4bit | 58% | 2.2× |
| Qwen2-7B-bnb-4bit | N/A | N/A |
| Yi-1.5-6b-bnb-4bit | N/A | N/A |

Fine-Tuning Methodology

QLoRA (Quantized Low-Rank Adaptation)

Parameter-efficient fine-tuning using 4-bit quantization with low-rank matrix adaptation, enabling efficient model customization without massive computational resources.

QLoRA Configuration

| Parameter | Value | Rationale |
|---|---|---|
| LoRA Rank (r) | 16 | Optimal balance for fast fine-tuning |
| LoRA Alpha | 16 | Set equal to rank (α/r = 1) as a baseline |
| Target Modules | 7 modules | All linear projection layers |
| LoRA Dropout | 0 | Enables Unsloth optimizations |
| Bias Configuration | none | Faster training, reduced memory |
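
A minimal sketch of this configuration using Unsloth's PEFT wrapper. The repository id, seed, and gradient-checkpointing setting are assumptions for illustration; the rank, alpha, dropout, bias, and seven projection modules follow the table above.

```python
from unsloth import FastLanguageModel

# Load one of the five 4-bit base models (repo id illustrative)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=1024,
    load_in_4bit=True,
)

# Attach LoRA adapters with the configuration from the table
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,              # alpha/r = 1
    lora_dropout=0,             # 0 enables Unsloth's fast path
    bias="none",
    target_modules=[            # the 7 linear projection modules
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    use_gradient_checkpointing="unsloth",  # assumption
    random_state=42,                        # assumption
)
```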

Training Configuration

| Parameter | Value | Justification |
|---|---|---|
| Batch Size | 64 | GPU memory optimization |
| Learning Rate | 2e-4 | Standard for LoRA fine-tuning |
| Max Steps | 11,194 | 2 epochs |
| Warmup Steps | 336 | Warmup for early-training stability |
| Optimizer | AdamW | Memory-efficient |
| Weight Decay | 0.01 | Prevents overfitting on political data |
| Max Sequence Length | 1024 | Suited to the dataset's speech-length distribution |
| Scheduler | Linear | Linear learning-rate decay |
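
A sketch of the corresponding trainer setup with TRL's SFTTrainer, continuing from the QLoRA sketch above and assuming a Hugging Face Dataset of formatted prompt/response strings. The 8-bit AdamW variant, output directory, and logging cadence are assumptions; newer TRL versions move dataset_text_field and max_seq_length into SFTConfig.

```python
from transformers import TrainingArguments
from trl import SFTTrainer

args = TrainingArguments(
    output_dir="outputs",                 # assumption
    per_device_train_batch_size=64,
    learning_rate=2e-4,
    max_steps=11_194,                     # ~2 epochs over the training split
    warmup_steps=336,
    optim="adamw_8bit",                   # assumption: memory-efficient AdamW variant
    weight_decay=0.01,
    lr_scheduler_type="linear",
    logging_steps=50,                     # assumption
    seed=42,
)

trainer = SFTTrainer(
    model=model,                          # QLoRA-wrapped model from the sketch above
    tokenizer=tokenizer,
    train_dataset=train_dataset,          # assumed dataset with a "text" column
    dataset_text_field="text",
    max_seq_length=1024,
    args=args,
)
trainer.train()
```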

System Prompt

You are a seasoned UK parliamentary member. Use proper British parliamentary language appropriate for the specified House. The speech should reflect the political orientation and typical positions of the specified party on the given topic.

Context Fields (Input to Model)

  • PARTY: Political party affiliation (e.g., Conservative)
  • EUROVOC TOPIC: Thematic classification (e.g., TRADE)
  • SECTION: Parliamentary debate section
  • POLITICAL ORIENTATION: Orientation label (e.g., Centre-right)
  • HOUSE: House of Commons or House of Lords
  • INSTRUCTION: Question prompt or generic instruction
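
A minimal sketch of how these fields might be assembled into a chat-style prompt around the system prompt above; the field layout, dictionary keys, and use of the tokenizer's chat template are illustrative assumptions, not the project's exact template.

```python
SYSTEM_PROMPT = (
    "You are a seasoned UK parliamentary member. Use proper British parliamentary "
    "language appropriate for the specified House. The speech should reflect the "
    "political orientation and typical positions of the specified party on the given topic."
)

def build_messages(row: dict) -> list:
    """Assemble the context fields into chat messages (illustrative layout)."""
    context = (
        f"PARTY: {row['party']}\n"
        f"EUROVOC TOPIC: {row['eurovoc_topic']}\n"
        f"SECTION: {row['section']}\n"
        f"POLITICAL ORIENTATION: {row['political_orientation_label']}\n"
        f"HOUSE: {row['house']}\n"
        f"INSTRUCTION: {row['instruction']}"
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": context},
    ]

messages = build_messages({
    "party": "Conservative", "eurovoc_topic": "TRADE",
    "section": "Trade Bill", "political_orientation_label": "Centre-right",
    "house": "House of Commons",
    "instruction": "What steps is the Government taking to support exporters?",
})
# prompt_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```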

Tools & Environment

  • Framework: Unsloth + HuggingFace
  • Backend: PyTorch
  • Hardware: AWS A100 GPU
  • Training Method: Supervised Fine-Tuning

Speech Generation Pipeline

A systematic speech generation system that loads trained models and creates political speeches based on structured inputs.

Input Distribution

All models received identical generation tasks with realistic distributional characteristics:

House Distribution

  • House of Commons: 78%
  • House of Lords: 22%

Top Parties by Weight

  • Conservative: 59%
  • Labour: 24%
  • Scottish National Party: 5%
  • Liberal Democrats: 5%

Generation Parameters

| Parameter | Value | Purpose |
|---|---|---|
| Speeches per Model | 2,700 | Comprehensive evaluation dataset |
| Temperature | 0.7 | Balances coherence and variation |
| Top-p (Nucleus Sampling) | 0.85 | Focused yet diverse outputs |
| Repetition Penalty | 1.2 | Prevents redundant phrasing |
| Batch Size | 32 | 3× speed improvement |
| Min Word Count | 43 | P10 threshold for quality |
| Max Word Count | 635 | P90 threshold for quality |
| Max New Tokens | 850 | 1.33× P90 speech length |
Decoding Strategy: Nucleus sampling (top-p) chosen over greedy/beam search to avoid repetitive or incoherent text
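
A sketch of single-prompt generation with these sampling parameters, continuing from the fine-tuning and prompt-building sketches above (prompt_text, model, and tokenizer are assumed from there); the real pipeline batches 32 prompts at a time.

```python
import torch
from unsloth import FastLanguageModel

FastLanguageModel.for_inference(model)   # enable Unsloth's fast generation path

inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        do_sample=True,                   # nucleus sampling, not greedy/beam search
        temperature=0.7,
        top_p=0.85,
        repetition_penalty=1.2,
        max_new_tokens=850,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens
speech = tokenizer.decode(
    output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
```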

Speech Validation Process

9-step validation procedure to ensure quality, coherence, and relevance of generated speeches:

1. Template Marker Detection
Detects 27 template artifacts (role markers, context labels, special tokens)
2. Unicode Corruption Detection
Identifies 14 corruption patterns and checks 11 forbidden Unicode ranges (CJK, Cyrillic, Arabic, etc.)
3. Language Detection
Uses spacy-langdetect to flag non-English text (>85% confidence threshold)
4. Repetition Detection
Three patterns: (1) Same word >3 times, (2) Sequences of 3-7 words >3 times, (3) Counting patterns
5. Semantic Relevance Check
Cosine similarity between speech and context (threshold: <0.08 flagged as off-topic)
6. Length Constraints
Validates word count (43-635 words)
7. Concatenation Detection
Detects multiple opening phrases (≥4 instances of "My Lords", "Mr Speaker", etc.)
8. Corrupted Endings Detection
Identifies nonsensical endings
9. Refusal Pattern Matching
Catches AI refusal patterns ("I cannot generate", "I'm sorry but...")
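
For illustration, a minimal sketch of four of these checks (steps 4, 6, 7, and 9); the phrase lists, regexes, and thresholds shown are simplified stand-ins for the full rule set.

```python
import re

OPENINGS = ["My Lords", "Mr Speaker", "Madam Deputy Speaker", "Mr Deputy Speaker"]
REFUSALS = [r"\bI cannot generate\b", r"\bI'm sorry,? but\b", r"\bAs an AI\b"]

def word_count_ok(text: str, lo: int = 43, hi: int = 635) -> bool:
    """Step 6: length constraints (P10-P90 word-count window)."""
    return lo <= len(text.split()) <= hi

def too_many_openings(text: str, limit: int = 4) -> bool:
    """Step 7: flag likely concatenated speeches (>= 4 opening phrases)."""
    return sum(text.count(phrase) for phrase in OPENINGS) >= limit

def repeated_ngram(text: str, n: int = 3, max_repeats: int = 3) -> bool:
    """Step 4 (pattern 2): any n-gram occurring more than max_repeats times."""
    words = text.lower().split()
    counts = {}
    for i in range(len(words) - n + 1):
        gram = tuple(words[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
    return any(c > max_repeats for c in counts.values())

def is_refusal(text: str) -> bool:
    """Step 9: catch common AI refusal patterns."""
    return any(re.search(p, text, flags=re.IGNORECASE) for p in REFUSALS)
```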

Evaluation Framework

Comprehensive multi-dimensional assessment system evaluating linguistic quality, semantic coherence, political alignment, and overall effectiveness.

1. Linguistic Quality & Diversity Metrics

Perplexity

Measures how natural the text appears to a language model. Lower scores indicate more human-like, predictable text.

  • Model: GPT-2 base (117M parameters)
  • Processing: Max 512 words per speech, batch size 8
  • Interpretation: Lower = more natural
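
A minimal sketch of the perplexity computation with GPT-2; batching and exact truncation details are simplified relative to the description above.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str, max_words: int = 512) -> float:
    """Perplexity of a speech under GPT-2; lower = more natural."""
    text = " ".join(text.split()[:max_words])         # cap at 512 words
    enc = tok(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        out = lm(**enc, labels=enc["input_ids"])      # mean token cross-entropy
    return torch.exp(out.loss).item()

print(perplexity("I beg to move that the Bill be now read a Second time."))
```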

Distinct-N (N=1,2,3,4)

Evaluates lexical diversity by measuring the ratio of unique n-grams to total tokens.

  • Distinct-1: Unique unigrams (basic lexical diversity)
  • Distinct-2: Unique bigrams (phrase-level variety)
  • Distinct-3/4: Multi-word patterns (sophisticated language use)
  • Interpretation: Higher = more diverse vocabulary
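
A minimal sketch of Distinct-N, normalising by total tokens as described above (some implementations divide by the total n-gram count instead).

```python
def distinct_n(text: str, n: int) -> float:
    """Unique n-grams divided by total tokens; higher = more diverse."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return 0.0
    ngrams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return len(ngrams) / len(tokens)

speech = "the house must act and the house must act now"
print(distinct_n(speech, 1), distinct_n(speech, 2))
```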

Self-BLEU

Measures similarity between generated texts from the same model to detect repetitive content.

  • Method: Each speech compared to all others from same model
  • Interpretation: Lower = higher diversity (desirable)
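
A minimal Self-BLEU sketch using NLTK; the smoothing method and uniform 4-gram weights are assumptions, and the all-pairs comparison is O(N²), so large speech sets are typically subsampled.

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

def self_bleu(speeches, n: int = 4) -> float:
    """Mean BLEU of each speech against all others from the same model (lower = more diverse)."""
    smooth = SmoothingFunction().method1
    weights = tuple([1.0 / n] * n)
    tokenised = [s.lower().split() for s in speeches]
    scores = []
    for i, hyp in enumerate(tokenised):
        refs = tokenised[:i] + tokenised[i + 1:]
        scores.append(sentence_bleu(refs, hyp, weights=weights, smoothing_function=smooth))
    return sum(scores) / len(scores)

print(self_bleu([
    "Mr Speaker, this Bill protects working families.",
    "My Lords, the amendment strengthens environmental safeguards.",
    "Mr Speaker, this Bill protects working families across the country.",
]))
```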

2. Semantic Coherence & Text Quality

GRUEN Score

Comprehensive quality metric combining grammaticality, non-redundancy, focus, structure, and coherence.

  • Grammaticality: BERT perplexity + CoLA classifier (0-1 scale)
  • Non-Redundancy: LCS, Edit Distance, Word Overlap between sentences
  • Focus: Word Mover's Distance + SpaCy semantic similarity
  • Formula: GRUEN = min(1, max(0, G + R_penalty + F_penalty))

BERTScore

Measures semantic similarity between generated and real speeches using contextualized embeddings.

  • Model: RoBERTa-large (auto-selected for English)
  • References: N=6 most semantically similar speeches from ParlaMint-GB
  • Metrics: Precision, Recall, F1
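
A minimal BERTScore sketch with the bert-score package; the candidate and reference strings are placeholders for a generated speech and its six retrieved ParlaMint-GB neighbours.

```python
from bert_score import score

cands = ["Mr Speaker, this Bill strengthens our trading relationships with growing markets."]
refs = [[
    "Mr Speaker, the Government is committed to free and fair trade with our partners.",
    "The Bill before the House today supports British exporters in global markets.",
    # ... up to six retrieved reference speeches per candidate
]]

P, R, F1 = score(cands, refs, lang="en", verbose=False)   # lang="en" selects roberta-large
print(F1.mean().item())
```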

MoverScore

Computes optimal transport cost between embedding distributions using Earth Mover's Distance.

  • Model: DistilBERT-base-uncased
  • Method: IDF-weighted embeddings with POT library
  • References: N=6 most semantically similar speeches from ParlaMint-GB
  • Score Range: 0-1 (higher = better alignment)

3. Political Alignment Metrics

Political Spectrum Alignment (PSA)

Measures ideological alignment with expected political orientation (13-point scale from Far-left to Far-right).

  • Model: all-mpnet-base-v2 sentence transformer
  • Method: Cosine similarity to orientation centroids
  • Formula: PSA = cosine_similarity × max(0, 100 - d/12 × 100), where d is the distance from the expected orientation on the 13-point scale
  • Range: 0-1 (higher = better alignment)

Party Alignment

Assesses whether speech captures party-specific linguistic characteristics.

  • Model: all-mpnet-base-v2 sentence transformer
  • Method: Cosine similarity to party-specific centroids
  • Range: 0-1 (higher = better party alignment)
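
A minimal sketch of the centroid-similarity machinery behind Party Alignment (PSA uses the same embeddings against orientation centroids, with the distance penalty above); the corpus passed to build_centroids and the normalisation choices are illustrative.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-mpnet-base-v2")

def build_centroids(speeches_by_party: dict) -> dict:
    """Mean embedding of real ParlaMint-GB speeches per party."""
    centroids = {}
    for party, speeches in speeches_by_party.items():
        emb = encoder.encode(speeches, normalize_embeddings=True)
        centroids[party] = emb.mean(axis=0)
    return centroids

def party_alignment(speech: str, centroid: np.ndarray) -> float:
    """Cosine similarity between a generated speech and its target party centroid."""
    v = encoder.encode([speech], normalize_embeddings=True)[0]
    c = centroid / np.linalg.norm(centroid)
    return float(np.dot(v, c))
```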

4. LLM-as-a-Judge Evaluation

Automated assessment using Flow-Judge-v0.1 (3.8B parameters, 4-bit quantization) across six dimensions on a 10-point scale with rubrics for each metric:

| Metric | Evaluation Criteria |
|---|---|
| Coherence | Logical flow, argument connectivity, parliamentary structure |
| Conciseness | Efficient message delivery without excessive verbosity (parliamentary context) |
| Relevance | Direct addressing of the prompt/question with complete coverage |
| Authenticity | Natural Westminster discourse vs. AI-generated patterns |
| Political Appropriateness | Alignment with the party's typical positions and rhetoric |
| Overall Quality | Effectiveness as political communication, argumentation sophistication |
Configuration: Batch size 32, Temperature 0.3, Max new tokens 2000, Default score -1 for errors
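
A heavily simplified, single-metric sketch of the judge loop; the Hugging Face repository id, rubric text, and score-parsing logic are assumptions for illustration (the actual setup uses per-metric rubrics and allows up to 2,000 new tokens of feedback).

```python
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

JUDGE_ID = "flowaicom/Flow-Judge-v0.1"   # assumed repo id; check the Flow-Judge release

tok = AutoTokenizer.from_pretrained(JUDGE_ID)
judge = AutoModelForCausalLM.from_pretrained(JUDGE_ID, torch_dtype=torch.float16, device_map="auto")

RUBRIC = (
    "Rate the following parliamentary speech for COHERENCE on a 1-10 scale "
    "(1-3: disjointed; 4-6: mostly logical flow; 7-10: clear, well-structured argumentation).\n"
    "Speech:\n{speech}\n\nEnd your answer with 'Score: <integer>'."
)

def judge_coherence(speech: str) -> int:
    messages = [{"role": "user", "content": RUBRIC.format(speech=speech)}]
    ids = tok.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(judge.device)
    out = judge.generate(ids, max_new_tokens=256, do_sample=True, temperature=0.3)
    reply = tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
    match = re.search(r"Score:\s*(\d+)", reply)
    return int(match.group(1)) if match else -1   # default -1 on parse errors, as above
```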

Evaluation Summary

  • Linguistic Metrics: Perplexity, Distinct-N, Self-BLEU
  • Semantic Metrics: GRUEN, BERTScore, MoverScore
  • Political Metrics: PSA, Party Alignment
  • Judge Metrics: 6 dimensions (1-10 scale)

Key Findings & Results

Fine-Tuning Impact

The horizontal bar chart displays percentage improvements from baseline to fine-tuned models across multiple evaluation metrics. Yi 6B shows the most impressive gains, with J_Auth improving by 106.4% and several other metrics rising by 30-70%. For most models, PSA and Party Align improve significantly, whereas some metrics, notably GRUEN and BERTScore, decline. These mixed results show that fine-tuning optimizes certain dimensions while potentially compromising others, underscoring the importance of multi-metric evaluation.

Fine-Tuning Improvements by Metric Category

The heatmap displays absolute changes in performance from baseline to fine-tuned models, separated by computational and LLM-judge metrics. Yi 6B demonstrates the strongest improvements across both measurement categories, showing particularly dramatic gains in LLM-judge metrics (0.237). Llama 3.1 8B also shows substantial positive changes (0.116 LLM-judge, 0.050 computational). In contrast, Qwen2 7B exhibits slight negative changes in LLM-judge metrics (-0.026), while Gemma 2 9B shows minimal improvement. The results indicate that LLM-judge metrics are generally more sensitive to fine-tuning effects than computational metrics.

Stability Performance

The bar chart presents stability scores calculated as 100/(1 + CV) across three dimensions: Party Stability, Topic Stability, and Orientation Stability for five model architectures. All models achieve remarkably high topic and orientation stability scores (> 91), indicating consistent performance across different subject matters and political orientations. Party stability shows the greatest variation among models, suggesting this dimension is most sensitive to model architecture differences.
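
A tiny sketch of the stability score used here, where CV is the coefficient of variation (std/mean) of a model's scores across contexts; the example numbers are invented.

```python
import numpy as np

def stability(scores) -> float:
    """Stability = 100 / (1 + CV), with CV = std / mean across contexts (e.g. parties)."""
    scores = np.asarray(scores, dtype=float)
    cv = scores.std() / scores.mean()
    return 100.0 / (1.0 + cv)

# e.g. one model's party-alignment scores across parties
print(round(stability([0.61, 0.63, 0.59, 0.62]), 1))
```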

Party Performance

Party alignment performance varied substantially across models. Major parties (Conservative, Labour) achieved stable performance across models, benefiting from substantial training data (58.9% and 24.3% of speeches, respectively). Minor parties exhibited greater variability. Mistral struggled with heterogeneous groups (Non-Affiliated: 0.436), while Qwen excelled with ideologically coherent minorities (Bishops: 0.664). Yi demonstrated robust cross-party performance (0.614-0.633). Both new political authenticity metrics (PSA and Party Align) successfully discriminate their target political dimensions: Party Align distinguishes parties while PSA distinguishes orientations (both p < 0.001). Our analysis reveals that Party Align performance depends primarily on data abundance and ideological coherence rather than party size alone. Models successfully learn party-specific language patterns when training data provides clear stylistic signals, indicating that targeted data collection for under-represented parties could improve coverage.

Party Difficulty

Applying cross-context stability analysis, party difficulty scores ranged narrowly (0.382-0.456), with no statistically significant differences. This suggests relatively consistent modeling challenges across parties regardless of size or ideological composition.

Orientation Performance

Performance across political orientations showed expected patterns. Centrist positions dominated the dataset and achieved higher scores. Model-specific strengths also emerged: Gemma, Yi, Llama, and Qwen all achieved their highest scores on Right positions, while Mistral underperformed consistently, indicating architectural rather than ideological limitations.

Orientation Difficulty

No statistically significant differences in difficulty were found between political orientations.

Topic Performance

The figure shows model performance across topic domains. Science achieved the lowest average score (0.516), while Economics (0.610) and European Union (0.606) showed the highest performance.

Topic Difficulty

Science and Geography ranked as the most difficult topics, while Finance, Business, and Economics ranked as the least difficult. Technical and natural-science domains display higher cross-model disagreement than economic and political topics, consistent with greater terminological specialization and rapidly evolving concepts. In contrast, economic and political discussion employs more stable conceptual frameworks aligned with core parliamentary functions.

Research Objective: Investigate the generation of authentic political discourse through domain-specific fine-tuning of large language models on UK parliamentary speeches.

1. Architectural Design & Context Window Effects

Best Overall Performer: Llama 3.1 8B

  • 128,000-token context window enabled superior performance across multiple dimensions
  • Enhanced instruction-following capabilities
  • Better captured argumentative structure and rhetorical patterns
  • Successfully referenced prior statements and built cumulative cases

Most Stable Performer: Gemma 2 9B

  • Consistent scores across all political parties
  • Performed well regardless of party affiliation or training data abundance
  • Strong cross-context stability

Second Best: Yi 1.5 6B

  • Outperformed larger models despite fewer parameters
  • Bilingual pretraining advantage
  • 3-trillion-token corpus exposure conferred generalization benefits

Weakest Performer: Mistral 7B v0.3

  • Scored below 0.50 on technical topics (Science: 0.483, Agri-foodstuffs: 0.475)
  • Struggled with ideologically diverse parties (Non-Affiliated: 0.436, Independent: 0.482)
  • 8,000-token sliding window insufficient for extended contextual dependencies
  • Poor at capturing nuanced ideological positioning

2. Domain-Specific Fine-Tuning Impact

  • 45/70 metrics showed statistically significant improvement
  • 64% success rate

Statistically Robust Findings (p < 0.05):

  • Self-BLEU decreased: Reduced formulaic repetition in favor of contextually appropriate variation
  • Political Spectrum Alignment (PSA) improved significantly: Better captured ideological positioning of different political orientations
  • Party Alignment scores increased: Enhanced fidelity to party-specific rhetoric, policy positions, and argumentative strategies
  • Improvements held across all model architectures: Domain adaptation through supervised fine-tuning is transferable and reliable
Key Insight: Fine-tuning improvements were not uniformly distributed: political authenticity metrics showed the strongest gains, demonstrating that domain adaptation is particularly effective for capturing ideological authenticity.

3. Novel Political Authenticity Metrics Validation

Methodological Contribution: Introduction and validation of Political Spectrum Alignment (PSA) and Party Alignment metrics extends beyond conventional NLP evaluation approaches.

  • Statistical significance: p < 0.001
  • Discriminative testing: high confidence

Validation Results:

  • Party Alignment: Successfully discriminates between parties, achieving differentiation even for ideologically proximate parties (e.g., Labour vs. Liberal Democrats)
  • PSA: Successfully distinguishes political orientations across the left-right spectrum
  • Both metrics capture their intended political dimensions with high statistical confidence
Impact: These metrics provide a validated framework for evaluating political authenticity in generated text, applicable to future research in political discourse generation.

Summary of Principal Findings

  1. Architecture matters: Extended context windows and advanced attention mechanisms are crucial for capturing complex political discourse
  2. Fine-tuning works: Domain-specific adaptation significantly improves political authenticity across diverse model architectures
  3. Novel metrics validated: PSA and Party Alignment successfully capture political dimensions beyond traditional NLP metrics

📚 Citation

If you use ParliaBench in your research, please cite:

@misc{ParliaBench2025,
    title={ParliaBench: An Evaluation and Benchmarking Framework 
           for LLM-Generated Parliamentary Speech},
    author={Marios Koniaris and Argyro Tsipi and Panayiotis Tsanakas},
    year={2025},
    eprint={2511.08247},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2511.08247}
}
Paper: arXiv:2511.08247