This repository provides a comprehensive suite of reproducible analysis pipelines for processing YouTube videos and extracting features for downstream scientific research. It includes modules for audio, visual, motion, linguistic, and metadata analysis.
This repository is part of a larger project, "YouTube Video Classification", headed by David Wegmann under the ARTS Social Media Influence project of DATALAB – Center for Digital Social Research, Aarhus University. The processed YouTube videos come from participants in the project "Data donation as a method for investigating trends and challenges in digital media landscapes at national scale", which investigates how digital platforms influence public discourse and develops ethical, legally compliant methods for collecting and processing user-contributed data.
YouTube-Video-Processing-Pipelines/
├── Audio/
│   ├── Get_1st_minute.ipynb              # Extract first minute of audio
│   └── Audio_Analysis.ipynb              # Audio feature extraction
├── Visual/
│   ├── visual_features_csv_combiner.ipynb
│   └── video_analysis_utils.py           # Visual analysis utilities
├── Motion/
│   └── Merge_Sub-DataFrames.ipynb        # Motion data consolidation
├── Linguistic/
│   ├── CPU_VM_folder/                    # Transcription workspace
│   ├── Language_Detection_Script.ipynb
│   ├── Text_Descriptors.ipynb
│   └── Validation_of_Transcription.ipynb
└── Metadata/
    └── create_video_metadata_df.ipynb
Audio pipeline:
- First-minute extraction from videos
- Audio feature analysis, including:
  - Volume contour analysis
  - Frequency characteristics
  - Temporal features
  - Advanced audio metrics (MFCC, ZCR)
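A minimal sketch of the kind of first-minute audio features listed above, using librosa. The exact parameters, feature set, and file handling in Audio_Analysis.ipynb may differ; the input path below is a placeholder.

```python
import numpy as np
import librosa

# Illustrative only: load the first minute of audio and compute a few of
# the features listed above. The path and parameters are placeholders.
y, sr = librosa.load("video_audio.wav", sr=None, duration=60.0)

features = {
    # Volume contour: frame-wise RMS energy, summarised by its mean
    "rms_mean": float(np.mean(librosa.feature.rms(y=y))),
    # Frequency characteristics: spectral centroid
    "spectral_centroid_mean": float(np.mean(librosa.feature.spectral_centroid(y=y, sr=sr))),
    # Advanced metrics: MFCCs and zero-crossing rate (ZCR)
    "mfcc_means": np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13), axis=1).tolist(),
    "zcr_mean": float(np.mean(librosa.feature.zero_crossing_rate(y))),
}
print(features)
```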
Visual pipeline:
- Color analysis
- Texture features
- Composition metrics
- Object detection
- Face and person detection
- Distributed processing support
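A sketch of frame-level visual features of the kind listed above, using OpenCV. The repository's video_analysis_utils.py may use different detectors (for example, a DNN-based object/person detector rather than a Haar cascade) and different parameters; the video path is a placeholder.

```python
import cv2

# Illustrative only: read one frame and compute simple visual descriptors.
cap = cv2.VideoCapture("video.mp4")   # placeholder path
ok, frame = cap.read()
cap.release()

if ok:
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Color analysis: mean hue/saturation/value of the frame
    hue_mean, sat_mean, val_mean = hsv.reshape(-1, 3).mean(axis=0)

    # Rough texture proxy: variance of the Laplacian (amount of fine detail)
    texture = cv2.Laplacian(gray, cv2.CV_64F).var()

    # Face detection with a Haar cascade (an assumption; the notebooks may differ)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    print(hue_mean, sat_mean, val_mean, texture, len(faces))
```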
Motion pipeline:
- Optical flow analysis
- Scene detection
- Motion direction statistics
- Shot analysis
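A sketch of per-frame-pair motion features such as those listed above (optical flow magnitude and direction), using OpenCV's Farneback dense optical flow. The actual pipeline's parameters and its scene/shot detection logic may differ; the video path is a placeholder.

```python
import cv2
import numpy as np

# Illustrative only: dense optical flow between two consecutive frames.
cap = cv2.VideoCapture("video.mp4")   # placeholder path
ok1, prev = cap.read()
ok2, curr = cap.read()
cap.release()

if ok1 and ok2:
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr, cv2.COLOR_BGR2GRAY)

    # Farneback dense optical flow: per-pixel (dx, dy) displacement
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])

    # Motion direction statistics: mean magnitude and circular mean of direction
    mean_magnitude = float(magnitude.mean())
    mean_direction = float(np.arctan2(np.sin(angle).mean(), np.cos(angle).mean()))
    print(mean_magnitude, mean_direction)
```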
Linguistic pipeline:
- Language detection (24 languages)
- Video transcription using Whisper
- Text feature extraction
- Multi-language support
- Validation tools
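A minimal sketch of the transcription and text-descriptor steps, using OpenAI Whisper and textdescriptives as a spaCy component. The Whisper model size, spaCy language model, and audio path below are assumptions; the notebooks may use other models, components, or batching.

```python
import whisper
import spacy
import textdescriptives as td

# Transcribe audio with Whisper; Whisper also reports the detected language.
model = whisper.load_model("base")                    # model size is an assumption
result = model.transcribe("video_audio.wav")          # placeholder path
print(result["language"])                             # detected language code
print(result["text"][:200])                           # transcript excerpt

# Compute text descriptors from the transcript with spaCy + textdescriptives.
nlp = spacy.load("en_core_web_sm")                    # language model is an assumption
nlp.add_pipe("textdescriptives/descriptive_stats")
doc = nlp(result["text"])
print(td.extract_df(doc))                             # one row of text features
```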
Metadata pipeline:
- Video metadata extraction
- Structured DataFrame creation
- Data validation
- CSV export functionality
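A sketch of building a structured metadata DataFrame and exporting it to CSV. The helper function, field names, validation check, and file paths below are hypothetical; create_video_metadata_df.ipynb may collect different fields.

```python
import cv2
import pandas as pd

def probe_video(path: str) -> dict:
    """Read basic technical metadata from a video file (hypothetical helper)."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frames = cap.get(cv2.CAP_PROP_FRAME_COUNT)
    meta = {
        "video_path": path,
        "fps": fps,
        "n_frames": int(frames),
        "duration_s": frames / fps if fps else None,
        "width": int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
        "height": int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)),
    }
    cap.release()
    return meta

# Placeholder paths; in practice these would be the donated video files.
df = pd.DataFrame([probe_video(p) for p in ["video1.mp4", "video2.mp4"]])
assert df["duration_s"].notna().all()      # simple validation check
df.to_csv("video_metadata.csv", index=False)
```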
Required packages include:
- Jupyter
- OpenAI Whisper
- OpenCV
- spaCy (with language models)
- FFmpeg
- textdescriptives
- pandas
- numpy
- scikit-image
- librosa
Each pipeline is contained in its respective Jupyter notebook. Follow these steps:
- Metadata Extraction
  jupyter notebook Metadata/create_video_metadata_df.ipynb
- Language Detection
  jupyter notebook Linguistic/Language_Detection_Script.ipynb
- Transcription Processing
  Navigate to CPU_VM_folder and run the transcription pipeline.
- Feature Extraction
  Run the respective notebooks for audio, visual, and motion analysis.
DATALAB – Center for Digital Social Research is an interdisciplinary research center at the School of Communication and Culture, Aarhus University. The center is founded on the vision that technology and data systems should keep people and society in focus, supporting the principles of democracy, human rights, and ethics.
All research and activities of the center focus on three contemporary challenges facing the digital society: 1) preserving conditions for privacy, autonomy, and trust among individuals and groups; 2) sustaining the provision of and access to high-quality content online to safeguard democracy; and 3) maintaining a suitable and meaningful balance between algorithmic and human control in connection with automation.
For more information, visit DATALAB's website.
