Multimodal Architecture Using EEG (Electroencephalogram) and Eye-Tracking Features (SEED-V)
Model architecture diagram by torchgreedy (Last updated: 2025-08-09)
This project presents a Transformer-based multimodal architecture for classifying human emotions using EEG and eye-tracking signals from the SEED-V dataset.
It incorporates novel attention mechanisms, domain adaptation, and interpretable gating to outperform previous benchmarks under rigorous evaluation.
- Positional Encoding: Sinusoidal encodings enable the model to capture temporal relationships in sequential data.
- Feature Importance Module (FIM): Learns modality-specific importance weights to adaptively emphasize stronger signals.
- Cross-Modal Attention: Bidirectional attention mechanism allowing EEG features to attend to eye-movement features and vice versa (a minimal sketch follows this list).
- Self-Attention Transformer Encoder: Processes each modality to capture intra-modal temporal dynamics.
- Global Pooling and Fusion: Integrates information from both modalities into a unified representation.
- Domain Adaptation Layer: Employs gradient reversal to learn subject-invariant representations, improving cross-subject generalization (see the gradient-reversal sketch below).
- Classification Head: Multi-layer perceptron with GELU activations for final emotion classification.
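As a concrete illustration of the cross-modal attention block, here is a minimal PyTorch sketch of bidirectional cross-attention between an EEG token sequence and an eye-movement token sequence. The class name, dimensions, and residual/normalization layout are illustrative assumptions, not the exact implementation in `src/modules/attention.py`.

```python
import torch
import torch.nn as nn


class BidirectionalCrossModalAttention(nn.Module):
    """Illustrative sketch: EEG tokens attend to eye-movement tokens and vice versa.

    Dimensions and the residual/norm layout are assumptions for demonstration only.
    """

    def __init__(self, d_model: int = 128, n_heads: int = 4, dropout: float = 0.1):
        super().__init__()
        # EEG queries attend over eye-movement keys/values, and the reverse direction.
        self.eeg_to_eye = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.eye_to_eeg = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm_eeg = nn.LayerNorm(d_model)
        self.norm_eye = nn.LayerNorm(d_model)

    def forward(self, eeg: torch.Tensor, eye: torch.Tensor):
        # eeg: (batch, T_eeg, d_model), eye: (batch, T_eye, d_model)
        eeg_attended, _ = self.eeg_to_eye(query=eeg, key=eye, value=eye)
        eye_attended, _ = self.eye_to_eeg(query=eye, key=eeg, value=eeg)
        # Residual connections preserve each modality's own information.
        return self.norm_eeg(eeg + eeg_attended), self.norm_eye(eye + eye_attended)


# Toy usage with random inputs
eeg = torch.randn(8, 10, 128)  # batch of 8, 10 EEG time steps
eye = torch.randn(8, 10, 128)  # batch of 8, 10 eye-movement time steps
eeg_out, eye_out = BidirectionalCrossModalAttention()(eeg, eye)
print(eeg_out.shape, eye_out.shape)  # torch.Size([8, 10, 128]) twice
```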
For detailed mathematical formulations and implementation specifics, see the technical documentation under docs/.
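Since the domain adaptation layer hinges on gradient reversal, here is a common, minimal implementation of a gradient reversal layer (GRL) in PyTorch. This is a generic sketch rather than the exact contents of `src/modules/grl.py`; the `lambd` value and the subject-discriminator usage in the comments are assumptions.

```python
from torch.autograd import Function


class GradReverse(Function):
    """Identity in the forward pass; multiplies the gradient by -lambd in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the shared encoder;
        # the second return value is the (non-existent) gradient w.r.t. lambd.
        return -ctx.lambd * grad_output, None


def grad_reverse(x, lambd: float = 1.0):
    return GradReverse.apply(x, lambd)


# Usage idea (names are hypothetical): the fused features feed the emotion
# classifier directly and a subject discriminator through the reversal,
# pushing the encoder toward subject-invariant representations.
# subject_logits = subject_discriminator(grad_reverse(fused_features, lambd=0.5))
```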
Achieved Accuracy: 75.42% on SEED-V (LOSO evaluation)
Our Cross-Modal Transformer achieves 75.42% ± 11.15% mean accuracy on the SEED-V dataset using rigorous Leave-One-Subject-Out (LOSO) cross-validation.
| Subject ID | Accuracy |
|---|---|
| Subject 1 | 88.89% |
| Subject 2 | 84.44% |
| Subject 3 | 82.22% |
| Subject 4 | 60.00% |
| Subject 5 | 75.56% |
| Subject 6 | 84.44% |
| Subject 7 | 64.44% |
| Subject 8 | 86.67% |
| Subject 9 | 55.56% |
| Subject 10 | 64.44% |
| Subject 11 | 75.56% |
| Subject 12 | 75.56% |
| Subject 13 | 80.00% |
| Subject 14 | 57.78% |
| Subject 15 | 80.00% |
| Subject 16 | 91.11% |
| Average | 75.42% |
The variance in performance across subjects (standard deviation: 11.15%) highlights the challenge of cross-subject generalization in physiological emotion recognition, which our domain adaptation approach helps address.
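Both reported figures can be reproduced directly from the per-subject table above; the 11.15% corresponds to the population standard deviation of the 16 fold accuracies:

```python
import numpy as np

# Per-subject LOSO accuracies from the table above (subjects 1-16, in %).
acc = np.array([88.89, 84.44, 82.22, 60.00, 75.56, 84.44, 64.44, 86.67,
                55.56, 64.44, 75.56, 75.56, 80.00, 57.78, 80.00, 91.11])

print(f"mean = {acc.mean():.2f}%")       # 75.42%
print(f"std  = {acc.std(ddof=0):.2f}%")  # 11.15% (population standard deviation)
```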
Our model outperforms recent state-of-the-art approaches on the SEED-V dataset:
| Method | Description | SEED-V Accuracy |
|---|---|---|
| Our Cross-Modal Transformer (2025) | Transformer-based architecture with bidirectional cross-modal attention and domain adaptation | 75.42% |
| Attention-based Multimodal Fusion [1] (2023) | Attention mechanism to fuse EEG and eye movement features | 72.3% |
| DFSAN [2] (2025) | Dual filtration subdomain adaptation for cross-subject emotion recognition | 65.57% |
| RHPRNet [3] (2024) | Hybrid physiological representation network | 68.44% |
References:
[1] Mina et al., "Multimodal Deep Learning for Subject-Independent Emotion Recognition Using EEG and Eye Movement Data," IEEE, 2023.
[2] Zheng et al., "Dual filtration subdomain adaptation network for cross-subject emotion recognition," Neurocomputing, 2025.
[3] Tang et al., "Hierarchical multimodal-fusion of physiological signals for emotion recognition with scenario adaption and contrastive alignment," Information Fusion, 2024.
├── Cross-Modal-Transformer-for-Robust-Emotion-Recognition/
│ ├── checkpoints/ # 16 model files
│ │ ├── model_fold_1.pth
│ │ ├── model_fold_2.pth
│ │ ├── ...
│ │ └── model_fold_16.pth
│ ├── docs/ # Technical documentation
│ │ ├── technical_overview.md # Overall architecture overview
│ │ └── modules/ # Detailed module descriptions
│ │ ├── classification_head.md # Classification component details
│ │ │ ├── cross_modal_attention.md # Bidirectional attention mechanism
│ │ ├── domain_adaptation.md # Subject-invariant learning approach
│ │ ├── feature_importance.md # Adaptive modality weighting
│ │ ├── global_pooling.md # Feature fusion methods
│ │ ├── linear_projection.md # Input dimension alignment
│ │ ├── positional_encoding.md # Temporal information encoding
│ │ └── self_attention.md # Intra-modal attention mechanism
│ ├── SEED-V/
│ │ ├── EEG_DE_features/ # EEG differential entropy (DE) features (for reference only)
│ │ └── Eye_movement_features/ # Eye movement features (for reference only)
│ ├── interpretability_results/
│ │ ├── attention_analysis/
│ │ │ ├── attention_heatmaps.png
│ │ │ └── temporal_attention_patterns.png
│ │ ├── confidence_analysis/
│ │ │ └── confidence_analysis.png
│ │ ├── emotion_patterns/
│ │ │ └── emotion_specific_attention.png
│ │ ├── frequency_bands_analysis/ # New folder for frequency band analyses
│ │ │ ├── subject_versus_time/ # Subject-specific frequency band analysis
│ │ │ │ ├── subject_01_time_band_heatmap.png
│ │ │ │ ├── subject_02_time_band_heatmap.png
│ │ │ │ ├── ...
│ │ │ │ └── subject_16_time_band_heatmap.png
│ │ │ └── emotion_versus_time/ # Emotion-specific frequency band analysis
│ │ │ ├── emotion_0_Disgust_time_band_heatmap.png
│ │ │ ├── emotion_1_Fear_time_band_heatmap.png
│ │ │ ├── emotion_2_Sad_time_band_heatmap.png
│ │ │ ├── emotion_3_Neutral_time_band_heatmap.png
│ │ │ ├── emotion_4_Happy_time_band_heatmap.png
│ │ │ └── overall_time_band_heatmap.png
│ │ ├── subject_analysis/
│ │ │ └── subject_variability_analysis.png
│ │ ├── T-SNE/ # New folder for T-SNE visualizations
│ │ │ ├── tsne_by_subject.png # T-SNE visualization colored by subject
│ │ │ ├── tsne_by_emotion.png # T-SNE visualization colored by emotion
│ │ │ └── tsne_by_correctness.png # T-SNE visualization colored by prediction correctness
│ │ └── confusion_matrix.png
│ ├── src/
│ │ ├── model.py # Main multimodal Transformer model
│ │ ├── modules/
│ │ │ ├── projections.py # Modality projection layers
│ │ │ ├── attention.py # Cross-modal and self-attention modules
│ │ │ ├── grl.py # Domain adaptation (Gradient Reversal Layer)
│ │ │ ├── fusion.py # Gating and final fusion logic
│ │ │ └── __init__.py # Module exports
│ │ ├── dataset.py # SEED-V preprocessing & Dataloader
│ │ ├── utils.py # Masking, normalization, evaluation helpers
│ │ └── __init__.py # Package exports
│ ├── train.py # Training script with LOSO evaluation
│ ├── evaluate.py # Standalone evaluation
│ ├── config.yaml # Model and training hyperparameters
│ ├── requirements.txt # Project dependencies
│ ├── LICENSE # Project license
│ └── README.md # Project documentation
- Clone the repository:
git clone https://github.com/torchgreedy/Cross-Modal-Transformer-for-Robust-Emotion-Recognition.git
cd Cross-Modal-Transformer-for-Robust-Emotion-Recognition
- Create a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
This project uses the SEED-V dataset containing EEG and eye-tracking data for emotion recognition. The dataset includes:
- 5 emotion categories: happy, sad, fear, disgust, neutral
- 16 subjects with 3 sessions each
- 62-channel EEG data and eye movement features
You'll need to request access to the dataset from the BCMI lab.
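As a rough sketch of the feature layout the model consumes, the snippet below uses assumed dimensionalities based on common SEED-V feature extraction (62 EEG channels × 5 frequency bands of differential entropy, plus a low-dimensional eye-movement vector); verify the exact shapes against the files provided by the BCMI lab.

```python
import numpy as np

# Hypothetical per-sample layout (verify against the actual SEED-V files):
#   EEG DE features : 62 channels x 5 frequency bands -> 310 values per time step
#   Eye features    : one low-dimensional vector per time step (dimension assumed here)
n_timesteps = 10
eeg_features = np.random.randn(n_timesteps, 62 * 5)  # (T, 310)
eye_features = np.random.randn(n_timesteps, 33)      # (T, 33) -- assumed dimensionality

# SEED-V label indices as used in the interpretability outputs above:
# 0 = Disgust, 1 = Fear, 2 = Sad, 3 = Neutral, 4 = Happy
label = 4  # e.g., "Happy"

print(eeg_features.shape, eye_features.shape, label)
```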
To train the model with Leave-One-Subject-Out cross-validation:
python train.py --config config.yaml
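Conceptually, LOSO training loops over the 16 subjects, holding each one out as the test fold and saving one checkpoint per fold. The sketch below illustrates the idea only; `train_one_fold` and `evaluate_fold` are placeholder stubs and do not reflect the actual logic in `train.py`.

```python
import random


def train_one_fold(train_subjects):
    """Placeholder for the real training routine; train.py is the authoritative implementation."""
    return object()  # stands in for a trained model


def evaluate_fold(model, held_out_subject):
    """Placeholder per-subject evaluation returning a dummy accuracy."""
    return random.uniform(50, 95)


subject_ids = list(range(1, 17))  # SEED-V has 16 subjects
results = {}
for held_out in subject_ids:
    train_subjects = [s for s in subject_ids if s != held_out]
    model = train_one_fold(train_subjects)  # one checkpoint per fold, e.g. model_fold_<id>.pth
    results[held_out] = evaluate_fold(model, held_out)

mean_acc = sum(results.values()) / len(results)
print(f"LOSO mean accuracy over {len(results)} folds: {mean_acc:.2f}%")
```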
To evaluate a trained model:
python evaluate.py --model_path checkpoints/model_fold_1.pth --subject_id 1
This project is licensed under the AGPL-3.0 license - see the LICENSE file for details.