Transformer-Based Emotion Recognition from Song Lyrics

November 2025 · Technical Deep Dive

Can transformers understand the emotions in song lyrics? In this research, developed at CIIC/CISUC (University of Coimbra), we fine-tuned 8 encoder models and built an ensemble that achieves a 77.43% F1-score on Russell's emotional quadrants, 5.5 percentage points above the previous benchmark.


Why Lyrics Matter for Emotion Recognition

Music Emotion Recognition (MER) aims to computationally identify the affective states conveyed through musical content. While most research has focused on acoustic features — tempo, key, timbre — there's a rich source of emotional information that's often underutilized: lyrics.

Think about it: when you hear "I will always love you," the melody carries emotion, but so do those words. Lyric-based Music Emotion Recognition (LMER) leverages the semantic and narrative content embedded in song lyrics to capture emotional expressions that audio alone might miss.

Figure: A typical MER workflow. Diverse input modalities (audio features, lyrics, MIDI, video) undergo feature extraction and are mapped onto an emotion space such as Russell's Valence-Arousal model.

But LMER faces significant challenges:

  • Manual feature engineering — traditional approaches relied on hand-crafted features like sentiment lexicons
  • Metaphors and cultural references — "I'm on fire" doesn't mean what it literally says
  • Long lyrics — many songs exceed the 512-token limit of standard transformers

This study evaluates whether modern transformer architectures can overcome these limitations through automated feature extraction and better contextual understanding.


The Evolution of LMER

LMER has evolved through three distinct phases:

Phase 1: Classical Approaches

Early methods combined text representations (Bag-of-Words, TF-IDF) with traditional ML algorithms (SVM, k-NN). Part-of-Speech tagging identified grammatical patterns associated with emotional expression; sentiment lexicons like ANEW (Affective Norms for English Words)[7] provided direct polarity mapping.

The limitations were significant: heavy reliance on manual feature engineering introduced bias and limited generalization across genres, and these approaches struggled to capture contextual emotional expressions, particularly when confronted with rhetorical devices, culturally embedded references, and situational context.

Phase 2: Deep Learning

Word embeddings (Word2Vec, GloVe) transformed LMER by introducing automated feature learning — encoding emotional relationships in dense vector spaces without manual feature engineering. Recurrent Neural Networks, particularly Long Short-Term Memory (LSTM) networks, captured temporal dependencies within lyrics.

However, LSTMs have a practical limitation: although they impose no hard input limit, their effective context is roughly 200 tokens. This creates problems for songs with lengthy lyrics, where emotional narratives may span hundreds of words. Multimodal approaches combining audio and lyrics emerged during this period, showing that fusing complementary signals improves classification performance.

Phase 3: Transformers

Pre-trained language models like BERT and RoBERTa marked a paradigm shift. Self-attention mechanisms enable truly bidirectional context encoding, and encoder-only architectures are particularly well-suited for classification tasks.

Why Encoders? For discriminative tasks like emotion classification, encoder-only models (BERT, RoBERTa) offer superior performance with reduced computational demands compared to decoder-only architectures (GPT-style models).

Russell's Circumplex Model

How do we represent emotions computationally? We use Russell's Circumplex Model of Affect, which maps emotions using two dimensions:

  • Valence: negative ← → positive (how pleasant the emotion is)
  • Arousal: low ← → high (the energy or intensity level)

Figure: Russell's Circumplex Model. Emotions are mapped onto a 2D space defined by Valence (pleasant-unpleasant) and Arousal (activation-deactivation), creating four distinct emotional quadrants.

This quadrant approach preserves the dimensional model's depth while enabling discrete categorization — essential for supervised learning.
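As a minimal sketch of that discretization, the function below maps a (valence, arousal) point to a quadrant label. The function name, the zero midpoint, and the handling of points lying exactly on an axis are illustrative assumptions, not the dataset's annotation protocol.

```python
def quadrant(valence: float, arousal: float, midpoint: float = 0.0) -> str:
    """Map a (valence, arousal) point to one of Russell's four quadrants."""
    positive = valence >= midpoint  # assumption: ties count as positive valence
    high = arousal >= midpoint      # assumption: ties count as high arousal
    if positive and high:
        return "Q1"  # positive valence, high arousal (happy/excited)
    if not positive and high:
        return "Q2"  # negative valence, high arousal (angry/tense)
    if not positive and not high:
        return "Q3"  # negative valence, low arousal (sad/depressed)
    return "Q4"      # positive valence, low arousal (calm/relaxed)
```

The classifiers in this study predict one of these four labels directly from the lyrics; they never see the underlying valence and arousal values.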


Methodology & Model Selection

Dataset: MERGE Lyrics

We used the MERGE Lyrics dataset[1]: 2,568 English songs annotated according to Russell's quadrants. This dataset was specifically designed for Music Emotion Recognition research and provides high-quality annotations:

  • Manually validated annotations — no noisy automatic labels; human annotators assigned quadrant labels
  • Balanced across quadrants — 600 samples each in the balanced version, addressing class imbalance issues common in emotion datasets
  • Genre diversity — Rock, Pop, R&B, Country, Rap, Metal, Folk, ensuring the models learn generalizable patterns
  • Publicly available — accessible on Zenodo (DOI: 10.5281/zenodo.10873009) for reproducibility

Why this dataset? Previous LMER studies suffered from small sample sizes and inconsistent annotations. MERGE Lyrics provides a standardized benchmark with rigorous quality control, enabling fair comparison across different approaches.

Model Selection

We selected 8 encoder-based transformers based on three criteria:

  1. Computational efficiency: models under ~450M parameters
  2. Established baselines: BERT, RoBERTa
  3. Extended context & architectural novelty: Longformer, DeBERTaV3, ModernBERT

Model            | Parameters | Max Tokens | Key Feature
BERT             | 340M       | 512        | Bidirectional baseline
RoBERTa          | 355M       | 512        | Dynamic masking, larger pretraining
DeBERTaV3        | 435M       | 1024       | Disentangled attention
XLNet            | 340M       | 1024       | Permutation language modeling
BigBird RoBERTa  | 400M       | 1024       | Sparse attention
Longformer       | 435M       | 2048       | Sliding window + global attention
ModernBERT       | 395M       | 2048       | Novel positional encoding
ERNIE 3.0        | 296M       | 2048       | Knowledge integration

Training Setup

Each model was fine-tuned with:

  • Optimizer: AdamW with weight decay 0.01
  • Learning rate: Optimized via Optuna (range: 10⁻⁶ to 10⁻⁴)
  • Epochs: 15 max, with early stopping (4 epochs patience)
  • Scheduler: Cosine with warmup
  • Runs: 10 per configuration (for statistical robustness)
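
Two items in this list, the cosine schedule with warmup and early stopping, can be sketched in plain Python. This is an illustration under assumptions rather than the training code used in the study: the function names, the number of warmup steps, and the choice of a validation score as the monitored quantity are all hypothetical.

```python
import math


def cosine_with_warmup(step: int, total_steps: int, peak_lr: float,
                       warmup_steps: int) -> float:
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


def stopping_epoch(val_scores: list[float], patience: int = 4) -> int:
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation score has failed to improve for
    `patience` consecutive epochs, or when the scores run out.
    """
    best = float("-inf")
    stale = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_scores)
```

With a patience of 4 and a 15-epoch cap, a run whose best validation score arrives at epoch 2 and is never beaten would stop at epoch 6 instead of running all 15 epochs.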

The Truncation Problem

One critical challenge: many song lyrics exceed model token limits. When a song has 800 tokens but your model only accepts 512, what happens to the truncated content?
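
To make the question concrete: the 800-token song in the example above would lose 288 tokens, 36% of its lyrics, to a 512-token limit. The sketch below summarises that loss for a list of per-song token counts. The helper name and the sample counts are hypothetical, and it assumes the counts already include any special tokens the tokenizer adds.

```python
def truncation_report(token_counts: list[int], max_tokens: int) -> dict:
    """Summarise how much text a fixed token limit would cut off."""
    truncated = [n for n in token_counts if n > max_tokens]
    dropped = sum(n - max_tokens for n in truncated)
    share = len(truncated) / len(token_counts) if token_counts else 0.0
    return {
        "truncated_songs": len(truncated),
        "share_truncated": share,
        "tokens_dropped": dropped,
    }


# Hypothetical token counts for five songs (not taken from MERGE Lyrics).
counts = [310, 480, 800, 512, 1500]
report = truncation_report(counts, max_tokens=512)
```

Running the same report with each model's own tokenizer and limit is what yields a different number of truncated instances per model.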

Figure: Distribution of token counts per song. Notice how many exceed the 512-token limit of BERT/RoBERTa.

We analyzed truncation effects on the 70-15-15 split:

Model                    | Truncated Instances | Error Rate on Truncated
RoBERTa                  | 61                  | 6.56%
BERT                     | 56                  | 8.93%
DeBERTaV3 (most robust)  | 9                   | 0%
XLNet (degraded)         | 10                  | 50%
Longformer               | 0                   | N/A
ModernBERT               | 0                   | N/A

Key Insight: Architecture Matters More Than Truncation Severity

Within each model, the degree of truncation doesn't predict errors (all p-values > 0.05). Which architecture you use matters far more: DeBERTaV3 misclassified none of its 9 truncated instances, while XLNet misclassified 5 of its 10.


Results & Analysis

Individual Model Performance

Macro F1-scores on the validation set (averaged over 10 runs):

Model                      | F1 (70-15-15 Balanced) | F1 (70-15-15 Complete) | Std Dev
RoBERTa (best individual)  | 76.20%                 | 75.86%                 | ±1.47-2.02%
ModernBERT                 | 75.95%                 | 76.06%                 | ±1.09-1.50%
Longformer                 | 75.93%                 | 75.94%                 | ±1.36-1.37%
BERT                       | 74.46%                 | 75.52%                 | ±1.36-1.38%
XLNet                      | 74.90%                 | 74.35%                 | ±1.15-2.23%
BigBird RoBERTa            | 73.65%                 | 73.62%                 | ±1.57-2.62%
DeBERTaV3                  | 73.18%                 | 71.24%                 | ±1.46-7.18%
ERNIE 3.0                  | 69.08%                 | 71.82%                 | ±1.42-1.47%
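
For readers unfamiliar with the metric, macro F1 is the unweighted mean of the per-class F1 scores, so each quadrant counts equally regardless of how many songs it contains. The sketch below is a plain-Python illustration with a hypothetical function name and made-up labels, not the evaluation code used for the table.

```python
def macro_f1(y_true: list[str], y_pred: list[str], labels: list[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances contributes 0.0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels for eight songs, two per quadrant.
truth = ["Q1", "Q1", "Q2", "Q2", "Q3", "Q3", "Q4", "Q4"]
preds = ["Q1", "Q2", "Q2", "Q2", "Q3", "Q3", "Q4", "Q1"]
score = macro_f1(truth, preds, labels=["Q1", "Q2", "Q3", "Q4"])
```

In this toy example the per-quadrant F1 scores are 0.5, 0.8, 1.0 and about 0.67, so the two Q1 confusions pull the macro average down to roughly 0.74.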

Key observations:

  • RoBERTa maintained superior stability across all splits
  • ModernBERT achieved competitive performance with extended context
  • Longformer showed minimal variance (±1.36%) — very consistent
  • ERNIE 3.0 underperformed by ~6-7% — its Chinese pretraining likely hurt English lyrics understanding

Quadrant-Level Performance

Figure: Per-quadrant F1-scores for each model reveal consistent patterns: Q2 (angry/tense) is easiest; Q1 (happy/excited) is hardest.

Interestingly, all models struggled most with Q1 (high arousal, positive valence) — the "excited/happy" quadrant. This might be because:

  • Happy lyrics often use more indirect or metaphorical language
  • Excitement can be confused with anger (both high arousal)
  • Positive emotions may be expressed more subtly in lyrics

Ensemble: Better Together

Individual models have different strengths. Can we combine them?

We implemented a weighted soft-voting ensemble that combines probability distributions rather than discrete predictions. The weight for each model is computed using softmax over their validation F1-scores:

$$w_i = \frac{\exp(f_i)}{\sum_{j} \exp(f_j)}$$

Then the ensemble prediction combines probabilities:

$$P(\text{class} \mid x) = \sum_{i} w_i \, P_i(\text{class} \mid x)$$
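
The two formulas translate into a few lines of Python. The sketch below is illustrative only: the function names, the three F1-scores, and the class probabilities are hypothetical, and it assumes F1-scores are expressed as fractions in [0, 1], not as percentages.

```python
import math


def softmax_weights(f1_scores: list[float]) -> list[float]:
    """w_i = exp(f_i) / sum_j exp(f_j), shifted by the max for stability."""
    top = max(f1_scores)
    exps = [math.exp(f - top) for f in f1_scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_vote(probabilities: list[list[float]], weights: list[float]) -> list[float]:
    """P(class|x) = sum_i w_i * P_i(class|x) for a single input x."""
    n_classes = len(probabilities[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probabilities))
        for c in range(n_classes)
    ]


# Hypothetical validation F1-scores and per-quadrant probabilities
# from three models for one song; the numbers are illustrative only.
f1 = [0.76, 0.75, 0.74]
weights = softmax_weights(f1)
probs = [
    [0.50, 0.30, 0.10, 0.10],
    [0.20, 0.60, 0.10, 0.10],
    [0.40, 0.40, 0.10, 0.10],
]
combined = soft_vote(probs, weights)
predicted = max(range(len(combined)), key=combined.__getitem__)  # index of the winning class
```

Under the fraction assumption, F1-scores that differ by a point or two produce nearly equal softmax weights, so the ensemble behaves much like a plain average of the models' probabilities.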

Ensemble Results

Figure: Ensemble F1-score improvements over the previous best-performing approach (RoBERTa + SVM).

Configuration             | Accuracy | Precision | Recall | F1-Score
40-30-30 Complete (best)  | 77.73%   | 77.53%    | 77.42% | 77.43%
40-30-30 Balanced         | 76.34%   | 76.48%    | 76.15% | 75.76%
70-15-15 Complete         | 77.08%   | 76.97%    | 76.54% | 76.68%
70-15-15 Balanced         | 76.12%   | 76.35%    | 75.58% | 75.80%

+5.51 percentage points! The ensemble achieved a 77.43% F1-score on the 40-30-30 complete configuration, 5.51 percentage points above the previous benchmark of 71.92% (RoBERTa embeddings + SVM classifier), or roughly 7.7% in relative terms.

Which Models Made the Cut?

The optimal ensemble composition varied by configuration, but some patterns emerged:

  • RoBERTa was included in all ensembles (weight ~0.25)
  • Longformer appeared in all configurations
  • Weights were remarkably uniform (max difference < 0.06 within each ensemble)
  • ERNIE 3.0 was never selected; its weaker individual scores apparently outweighed any diversity it would have added

The uniform weight distribution suggests that architectural diversity, not individual model dominance, drives ensemble improvements.


Key Takeaways

1. Transformers Work for LMER

Fine-tuned encoder transformers significantly outperform traditional approaches for lyric-based emotion recognition. RoBERTa emerged as the strongest individual model (F1: 75.57%).

2. Extended Context Helps (Sometimes)

Longformer and ModernBERT effectively handle long lyrics without truncation. But having more context doesn't guarantee better performance; model architecture matters more.

3. Truncation Robustness Varies Wildly

DeBERTaV3 handles truncation gracefully (0% error rate on truncated instances). XLNet fails catastrophically (50% error rate). Choose your architecture carefully if your lyrics are long.

4. Ensembles Beat Individuals

A weighted soft-voting ensemble achieves a 77.43% F1-score, 5.5 percentage points above the previous benchmark. Architectural diversity, not individual dominance, drives the improvement.

5. Happy Songs Are Hard

All models struggled most with Q1 (high arousal, positive valence). Detecting nuanced emotional intensity in "excited/happy" lyrics remains challenging.


Limitations & Future Work

This study focused on English lyrics from predominantly Western music. While the MERGE Lyrics dataset provides robust annotations, its focus on Western music may limit cross-cultural generalizability. Future directions include:

  • Multimodal fusion: Combining lyrics with audio features for richer emotional understanding
  • Multilingual models: Testing on non-English datasets (Portuguese, Spanish, Chinese lyrics)
  • Decoder models: Evaluating GPT-style architectures with instruction tuning for emotion classification
  • Explainability: Understanding which words, phrases, or linguistic patterns trigger emotion predictions using attention visualization
  • Fine-grained emotions: Moving beyond 4 quadrants to more nuanced emotional categories

References

  1. Louro, P., et al. (2024). "MERGE: A Bimodal Dataset for Static Music Emotion Recognition." Zenodo. DOI: 10.5281/zenodo.10873009
  2. Russell, J.A. (1980). "A Circumplex Model of Affect." Journal of Personality and Social Psychology, 39(6), 1161-1178.
  3. Devlin, J., et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL-HLT.
  4. Liu, Y., et al. (2019). "RoBERTa: A Robustly Optimized BERT Pretraining Approach." arXiv:1907.11692.
  5. Beltagy, I., Peters, M.E., & Cohan, A. (2020). "Longformer: The Long-Document Transformer." arXiv:2004.05150.
  6. He, P., et al. (2023). "DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training." ICLR.
  7. Malheiro, R., et al. (2017). "Emotion-based Analysis and Classification of Music Lyrics." International Journal of Multimedia Information Retrieval.
  8. Matos, B., et al. (2022). "Lyric-based Music Emotion Recognition using MERGE dataset." Proceedings of CMMR.

Acknowledgments

This work was developed at CIIC — Centre for Informatics and Intelligent Computing in collaboration with CISUC/LASI — Centre for Informatics and Systems of the University of Coimbra, Department of Informatics Engineering.

This research was supported by the Portuguese Foundation for Science and Technology (FCT) under the project UIDB/00326/2020.


Citation

If you reference this blog post, you may use the following citation:

@misc{Ribeiro2025TEMO,
    author       = {Ribeiro, Tiago F. R.},
    title        = {Transformer-Based Emotion Recognition from Song Lyrics},
    year         = {2025},
    month        = {nov},
    howpublished = {\url{https://tiago1ribeiro.github.io/blog_posts/12_transformer_emotion_lyrics.html}},
    note         = {Blog post}
}

This blog post is based on research developed as part of ongoing work in Music Information Retrieval at CIIC. Questions or collaboration ideas? Reach me at tiago.r.ribeiro@gmail.com.