A comprehensive voice persona dataset for character consistency in voice synthesis, generated using advanced audio-language models.
VoicePersona Dataset serves as the training foundation for VoiceForge - an AI architecture that generates character voices from pure text descriptions.
The Connection:
- VoicePersona provides detailed voice characteristics and personality profiles
- VoiceForge uses this data to learn textโvoice mapping for character consistency
- Together, they enable voice synthesis from natural language descriptions alone
VoiceForge Applications:
- ๐ฎ Game developers creating unique NPCs
- ๐ Interactive storytelling applications
- ๐ฌ Content creators needing character voices
- ๐ฌ Researchers in voice synthesis
This dataset bridges the gap between voice analysis and synthesis, providing the structured training data needed for consistent character voice generation without audio samples or voice actors.
Dataset Size:
- Total Samples: 15,082 voice recordings
- Unique Speakers: 10,179 individual speakers
- Total Duration: 48.7 hours of audio
- Average Duration: 11.6 seconds per sample
- Unique Accents: 702 different accent variations
| Dataset | Description | Samples | Link |
|---|---|---|---|
| Laions Got Talent | Emotional speech synthesis | 7,937 | laion/laions_got_talent |
| GLOBE_V2 | Global accents, 52 accents ร 3 genders | 3,146 | MushanW/GLOBE_V2 |
| AniSpeech | Anime speech synthesis | 2,000 | ShoukanLabs/AniSpeech |
| AnimeVox | Anime character voices | 1,999 | taresh18/AnimeVox |
Qwen2-Audio-7B-Instruct: Alibaba's multimodal audio-language model
- 7B parameters optimized for audio understanding
- Supports voice chat and audio analysis
- Multilingual capabilities (8+ languages)
This pipeline processes audio from multiple voice datasets and generates detailed character profiles using Qwen2-Audio-7B-Instruct. The system:
- Extracts Voice Characteristics: Analyzes pitch, tone, timbre, resonance, and speaking patterns
- Identifies Demographics: Estimates gender, age range, and accent
- Profiles Personality: Determines character traits and suitable roles
- Maintains Consistency: Focuses on "how" speakers talk rather than "what" they say
voicepersona_dataset/
โโโ globe_v2/
โ โโโ audio/ # Original audio files (.wav)
โ โโโ globe_v2_descriptions.json
โ โโโ globe_v2_hf_dataset/ # HuggingFace format
โโโ laions/
โ โโโ audio/
โ โโโ laions_descriptions.json
โ โโโ laions_hf_dataset/
โโโ animevox/
โ โโโ audio/
โ โโโ animevox_descriptions.json
โ โโโ animevox_hf_dataset/
โโโ anispeech/
โโโ audio/
โโโ anispeech_descriptions.json
โโโ anispeech_hf_dataset/
{
"index": 0,
"dataset": "globe_v2",
"speaker_id": "S_000658",
"transcript": "each member has one share and one vote.",
"audio_path": "/path/to/audio.wav",
"duration": 2.9,
"gender": "female",
"age": "thirties",
"accent": "New Zealand English",
"voice_description": "Detailed voice profile including vocal qualities, speaking style, emotional undertones, character impression, and distinctive features...",
"processing_timestamp": "2025-07-17T01:57:41.590598"
}git clone https://github.com/PranavMishra17/VoicePersona-Dataset
cd voicepersona-dataset
pip install -r requirements.txt# List available datasets
python main.py list
# Test processing
python main.py test globe_v2 --samples 5
# Process full dataset
python main.py process laions --max 1000
# Analyze results
python main.py analyze animevoxKey settings in src/config.py:
USE_QUANTIZATION: Enable 4-bit quantization for 6GB VRAMUSE_STREAMING: Stream datasets without full downloadCHECKPOINT_INTERVAL: Auto-save frequency
- Total Samples: 15,082 voice samples across 4 datasets
- Languages: 8+ languages and 52+ accent variations
- Demographics: Balanced gender and age distributions
- Domains: Conversational, emotional, anime, and synthetic speech
Gender Distribution:
- Female: 9,448 samples (62.6%)
- Male: 5,294 samples (35.1%)
- Unknown: 275 samples (1.8%)
- Other: 65 samples (0.4%)
Age Group Distribution:
- Twenties: 11,481 samples (76.1%)
- Teens: 1,950 samples (12.9%)
- Thirties: 545 samples (3.6%)
- Forties: 432 samples (2.9%)
- Fifties+: 181 samples (1.2%)
- Other/Unknown: 493 samples (3.3%)
Top 10 Accent Variations:
- General American: 3,481 samples (23.1%)
- United States English: 2,278 samples (15.1%)
- Unknown: 792 samples (5.3%)
- American English: 544 samples (3.6%)
- British RP: 461 samples (3.1%)
- US accent: 458 samples (3.0%)
- English: 452 samples (3.0%)
- German: 416 samples (2.8%)
- Australian English: 392 samples (2.6%)
- Valley girl accent: 368 samples (2.4%)
Data Completeness: 96.8%
- Complete demographic data: 14,807 samples (98.2%)
- Valid audio files: 15,082 samples (100%)
- Non-empty transcripts: 15,082 samples (100%)
- Voice descriptions: 15,082 samples (100%)
- Average description length: ~500 characters
Minimum:
- GPU: 6GB VRAM (RTX 3060+)
- RAM: 16GB
- Storage: 50GB free space
- CUDA 11.8+
Recommended:
- GPU: 12GB+ VRAM
- RAM: 32GB
- Storage: 100GB+ SSD
This dataset was created and maintained by:
Pranav Mishra
Pranav Vasist
Research Interests:
- Voice synthesis and character consistency
- Multimodal AI applications
- Audio-language model development
Contributions welcome! Areas for improvement:
Datasets:
- Additional voice datasets integration
- Multilingual voice collections
- Emotional speech datasets
Technical:
- Model optimization for lower VRAM
- Faster processing pipelines
- Better voice characteristic extraction
Analysis:
- Voice similarity metrics
- Character consistency evaluation
- Demographic bias analysis
- Fork the repository
- Create feature branch (
git checkout -b feature/improvement) - Commit changes (
git commit -am 'Add improvement') - Push branch (
git push origin feature/improvement) - Open Pull Request
This project is licensed under the CC0 1.0 Universal License - see the LICENSE file for details.
CC0 1.0 Universal Summary:
- โ Commercial use
- โ Modification
- โ Distribution
- โ Private use
- โ No warranties or liability
- Qwen Team for the Qwen2-Audio model
- Dataset Contributors: GLOBE_V2, Laions, AnimeVox, AniSpeech teams
- HuggingFace for dataset hosting and tools
- Open Source Community for supporting libraries
If you use this dataset in your research, please cite:
@misc{pranav_mishra_2025,
author = { Pranav Mishra },
title = { VoicePersona (Revision 431e3b5) },
year = 2025,
url = { https://huggingface.co/datasets/Paranoiid/VoicePersona },
doi = { 10.57967/hf/6085 },
publisher = { Hugging Face }
}