A comprehensive voice persona dataset for character consistency in voice synthesis, generated using advanced audio-language model Qwen2-Audio-7B, with a GPU-optimized pipeline

PranavMishra17/VoicePersona-Dataset

VoicePersona Dataset

Hugging Face License: CC0 Python 3.8+

A comprehensive voice persona dataset for character consistency in voice synthesis, generated using advanced audio-language models.

📋 Overview

VoicePersona Dataset serves as the training foundation for VoiceForge - an AI architecture that generates character voices from pure text descriptions.

The Connection:

  • VoicePersona provides detailed voice characteristics and personality profiles
  • VoiceForge uses this data to learn text→voice mapping for character consistency
  • Together, they enable voice synthesis from natural language descriptions alone

VoiceForge Applications:

  • 🎮 Game developers creating unique NPCs
  • 📚 Interactive storytelling applications
  • 🎬 Content creators needing character voices
  • 🔬 Researchers in voice synthesis

This dataset bridges the gap between voice analysis and synthesis, providing the structured training data needed for consistent character voice generation without audio samples or voice actors.

📊 Dataset Statistics

Dataset Size:

  • Total Samples: 15,082 voice recordings
  • Unique Speakers: 10,179 individual speakers
  • Total Duration: 48.7 hours of audio
  • Average Duration: 11.6 seconds per sample
  • Unique Accents: 702 different accent variations
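The average duration follows from the totals above. A quick arithmetic check, using only the figures quoted in this list (the variable names are illustrative):

```python
# Figures quoted in the statistics above.
total_hours = 48.7
total_samples = 15_082
unique_speakers = 10_179

# Convert hours to seconds, then divide by the number of recordings.
average_seconds = total_hours * 3600 / total_samples
print(f"{average_seconds:.1f} seconds per sample")  # prints "11.6 seconds per sample"

# Average number of recordings contributed by each speaker.
samples_per_speaker = total_samples / unique_speakers
print(f"{samples_per_speaker:.2f} samples per speaker")  # prints "1.48 samples per speaker"
```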

🗃️ Source Datasets

| Dataset | Description | Samples | Link |
|---|---|---|---|
| Laions Got Talent | Emotional speech synthesis | 7,937 | laion/laions_got_talent |
| GLOBE_V2 | Global accents, 52 accents × 3 genders | 3,146 | MushanW/GLOBE_V2 |
| AniSpeech | Anime speech synthesis | 2,000 | ShoukanLabs/AniSpeech |
| AnimeVox | Anime character voices | 1,999 | taresh18/AnimeVox |
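As a sanity check, the four source counts add up to the 15,082 total reported above. The sketch below also prints each source's share of the dataset; the dictionary is simply the table retyped.

```python
# Sample counts per source, copied from the table above.
source_samples = {
    "laion/laions_got_talent": 7_937,
    "MushanW/GLOBE_V2": 3_146,
    "ShoukanLabs/AniSpeech": 2_000,
    "taresh18/AnimeVox": 1_999,
}

total = sum(source_samples.values())
assert total == 15_082  # matches the reported total

# Share of the combined dataset contributed by each source.
for name, count in source_samples.items():
    print(f"{name}: {count:,} samples ({count / total:.1%})")
```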

🤖 Model Used

Qwen2-Audio-7B-Instruct: Alibaba's multimodal audio-language model

  • 7B parameters optimized for audio understanding
  • Supports voice chat and audio analysis
  • Multilingual capabilities (8+ languages)

🎯 What We Do

This pipeline processes audio from multiple voice datasets and generates detailed character profiles using Qwen2-Audio-7B-Instruct. The system:

  1. Extracts Voice Characteristics: Analyzes pitch, tone, timbre, resonance, and speaking patterns
  2. Identifies Demographics: Estimates gender, age range, and accent
  3. Profiles Personality: Determines character traits and suitable roles
  4. Maintains Consistency: Focuses on "how" speakers talk rather than "what" they say
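One way to picture how these four goals could be turned into an instruction for the model is a small prompt builder. This is a hypothetical sketch, not the prompt the pipeline actually uses: `ANALYSIS_ASPECTS` and `build_analysis_prompt` are illustrative names, and the wording is assumed.

```python
# Hypothetical prompt builder; the repository's real prompt may differ.
ANALYSIS_ASPECTS = [
    ("Voice characteristics", "pitch, tone, timbre, resonance, and speaking patterns"),
    ("Demographics", "estimated gender, age range, and accent"),
    ("Personality", "character traits and suitable roles"),
]


def build_analysis_prompt(aspects=ANALYSIS_ASPECTS) -> str:
    """Assemble a text instruction that covers each analysis aspect."""
    # The opening line encodes the consistency goal: focus on "how", not "what".
    lines = ["Describe how this speaker talks, not what they say.", "Cover:"]
    for number, (title, details) in enumerate(aspects, start=1):
        lines.append(f"{number}. {title}: {details}")
    return "\n".join(lines)


print(build_analysis_prompt())
```

Keeping the aspects in a list makes it easy to add or reorder analysis goals without touching the formatting code.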

📊 Dataset Structure

voicepersona_dataset/
├── globe_v2/
│   ├── audio/                    # Original audio files (.wav)
│   ├── globe_v2_descriptions.json
│   └── globe_v2_hf_dataset/      # HuggingFace format
├── laions/
│   ├── audio/
│   ├── laions_descriptions.json
│   └── laions_hf_dataset/
├── animevox/
│   ├── audio/
│   ├── animevox_descriptions.json
│   └── animevox_hf_dataset/
└── anispeech/
    ├── audio/
    ├── anispeech_descriptions.json
    └── anispeech_hf_dataset/
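Because every source follows the same naming pattern, the paths can be derived from the dataset name. A minimal sketch, assuming the layout shown above; `dataset_paths` is an illustrative helper, not part of the repository:

```python
from pathlib import Path

# Sub-directory names taken from the tree above.
DATASET_NAMES = ["globe_v2", "laions", "animevox", "anispeech"]


def dataset_paths(root, name):
    """Return the expected audio, description, and HF-dataset paths for one source."""
    base = Path(root) / name
    return {
        "audio_dir": base / "audio",
        "descriptions": base / f"{name}_descriptions.json",
        "hf_dataset": base / f"{name}_hf_dataset",
    }


for name in DATASET_NAMES:
    print(dataset_paths("voicepersona_dataset", name)["descriptions"])
```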

Sample Output Format

{
  "index": 0,
  "dataset": "globe_v2",
  "speaker_id": "S_000658",
  "transcript": "each member has one share and one vote.",
  "audio_path": "/path/to/audio.wav",
  "duration": 2.9,
  "gender": "female",
  "age": "thirties",
  "accent": "New Zealand English",
  "voice_description": "Detailed voice profile including vocal qualities, speaking style, emotional undertones, character impression, and distinctive features...",
  "processing_timestamp": "2025-07-17T01:57:41.590598"
}
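A record in this format maps directly onto a typed structure. The sketch below parses the sample above into a dataclass; `VoiceRecord` is an illustrative name, and the field types are inferred from the single example rather than from a published schema.

```python
import json
from dataclasses import dataclass


@dataclass
class VoiceRecord:
    """One entry of a descriptions file; field types inferred from the sample."""
    index: int
    dataset: str
    speaker_id: str
    transcript: str
    audio_path: str
    duration: float  # seconds
    gender: str
    age: str
    accent: str
    voice_description: str
    processing_timestamp: str  # ISO 8601 string


# The sample record from above, with the description shortened.
sample_json = """
{
  "index": 0,
  "dataset": "globe_v2",
  "speaker_id": "S_000658",
  "transcript": "each member has one share and one vote.",
  "audio_path": "/path/to/audio.wav",
  "duration": 2.9,
  "gender": "female",
  "age": "thirties",
  "accent": "New Zealand English",
  "voice_description": "Detailed voice profile...",
  "processing_timestamp": "2025-07-17T01:57:41.590598"
}
"""

# Unpacking fails loudly if a key is missing or unexpected.
record = VoiceRecord(**json.loads(sample_json))
print(record.speaker_id, record.accent, f"{record.duration:.1f}s")
# prints "S_000658 New Zealand English 2.9s"
```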

🚀 Usage

Installation

git clone https://github.com/PranavMishra17/VoicePersona-Dataset
cd VoicePersona-Dataset
pip install -r requirements.txt

Quick Start

# List available datasets
python main.py list

# Test processing
python main.py test globe_v2 --samples 5

# Process a dataset, limited to 1000 samples by --max
python main.py process laions --max 1000

# Analyze results
python main.py analyze animevox

Configuration

Key settings in src/config.py:

  • USE_QUANTIZATION: Enable 4-bit quantization so the model fits in 6GB of VRAM
  • USE_STREAMING: Stream datasets without full download
  • CHECKPOINT_INTERVAL: Auto-save frequency
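For orientation, the settings might look like the fragment below. Only the three names come from this README; the values, the assumed unit of `CHECKPOINT_INTERVAL` (processed samples), and the `should_checkpoint` helper are hypothetical and may not match src/config.py.

```python
# Hypothetical values; only the three setting names come from the README.
USE_QUANTIZATION = True    # load the model in 4-bit to fit a 6GB GPU
USE_STREAMING = True       # iterate over datasets without a full download
CHECKPOINT_INTERVAL = 100  # assumed unit: processed samples between saves


def should_checkpoint(samples_done: int, interval: int = CHECKPOINT_INTERVAL) -> bool:
    """Return True when progress should be saved after `samples_done` samples."""
    return samples_done > 0 and samples_done % interval == 0


# Show which positions in a 250-sample run would trigger a save.
print([n for n in range(1, 251) if should_checkpoint(n)])  # prints "[100, 200]"
```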

📈 Dataset Statistics

  • Total Samples: 15,082 voice samples across 4 datasets
  • Languages and Accents: 8+ languages; 702 distinct accent labels, including the 52 accents from GLOBE_V2
  • Demographics: Skewed toward female speakers (62.6%) and speakers in their twenties (76.1%)
  • Domains: Conversational, emotional, anime, and synthetic speech

Demographic Analysis

Gender Distribution:

  • Female: 9,448 samples (62.6%)
  • Male: 5,294 samples (35.1%)
  • Unknown: 275 samples (1.8%)
  • Other: 65 samples (0.4%)

Age Group Distribution:

  • Twenties: 11,481 samples (76.1%)
  • Teens: 1,950 samples (12.9%)
  • Thirties: 545 samples (3.6%)
  • Forties: 432 samples (2.9%)
  • Fifties+: 181 samples (1.2%)
  • Other/Unknown: 493 samples (3.3%)
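The percentages are each group's count divided by the 15,082 total. A short sketch that reproduces the age figures (the same helper applies to the gender counts); `share` is an illustrative name:

```python
# Age-group counts copied from the list above.
age_counts = {
    "Twenties": 11_481,
    "Teens": 1_950,
    "Thirties": 545,
    "Forties": 432,
    "Fifties+": 181,
    "Other/Unknown": 493,
}

total = sum(age_counts.values())
assert total == 15_082  # the groups cover every sample


def share(count: int, total: int) -> float:
    """Percentage of `total` represented by `count`, rounded to one decimal."""
    return round(100 * count / total, 1)


for group, count in age_counts.items():
    print(f"{group}: {count:,} samples ({share(count, total)}%)")
```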

Top 10 Accent Variations:

  1. General American: 3,481 samples (23.1%)
  2. United States English: 2,278 samples (15.1%)
  3. Unknown: 792 samples (5.3%)
  4. American English: 544 samples (3.6%)
  5. British RP: 461 samples (3.1%)
  6. US accent: 458 samples (3.0%)
  7. English: 452 samples (3.0%)
  8. German: 416 samples (2.8%)
  9. Australian English: 392 samples (2.6%)
  10. Valley girl accent: 368 samples (2.4%)
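Several of these labels appear to describe the same accent under different names. The sketch below merges four of them to estimate their combined share. Treating them as aliases is an assumption made here for illustration, not a mapping defined by the dataset.

```python
from collections import Counter

TOTAL_SAMPLES = 15_082

# Top 10 accent labels and counts, copied from the list above.
top_accents = {
    "General American": 3_481,
    "United States English": 2_278,
    "Unknown": 792,
    "American English": 544,
    "British RP": 461,
    "US accent": 458,
    "English": 452,
    "German": 416,
    "Australian English": 392,
    "Valley girl accent": 368,
}

# Assumption: these four labels all refer to US English.
US_ALIASES = {"General American", "United States English", "American English", "US accent"}

merged = Counter()
for label, count in top_accents.items():
    key = "US English (merged)" if label in US_ALIASES else label
    merged[key] += count

us_share = merged["US English (merged)"] / TOTAL_SAMPLES
print(f"{merged['US English (merged)']:,} samples ({us_share:.1%})")
# prints "6,761 samples (44.8%)"
```

Under that assumption, US English accounts for roughly 45% of all samples, which is worth keeping in mind when sampling for accent diversity.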

Data Quality Metrics

Data Completeness: 96.8%

  • Complete demographic data: 14,807 samples (98.2%)
  • Valid audio files: 15,082 samples (100%)
  • Non-empty transcripts: 15,082 samples (100%)
  • Voice descriptions: 15,082 samples (100%)
  • Average description length: ~500 characters
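The README does not state how "complete demographic data" is computed. One plausible definition, sketched below as an assumption, counts a record as complete when gender, age, and accent are all present and not "unknown". The function names and the two example records are illustrative.

```python
# Assumed definition of completeness; the pipeline's own rule may differ.
DEMOGRAPHIC_FIELDS = ("gender", "age", "accent")
MISSING_VALUES = {"", "unknown"}


def has_complete_demographics(record: dict) -> bool:
    """True when every demographic field holds a usable value."""
    for field in DEMOGRAPHIC_FIELDS:
        value = str(record.get(field) or "").strip().lower()
        if value in MISSING_VALUES:
            return False
    return True


def completeness(records) -> float:
    """Fraction of records with complete demographics (0.0 for empty input)."""
    records = list(records)
    if not records:
        return 0.0
    return sum(has_complete_demographics(r) for r in records) / len(records)


# Two made-up records: one complete, one with an unknown gender.
records = [
    {"gender": "female", "age": "thirties", "accent": "New Zealand English"},
    {"gender": "unknown", "age": "twenties", "accent": "German"},
]
print(f"{completeness(records):.1%}")  # prints "50.0%"
```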

🔧 System Requirements

Minimum:

  • GPU: 6GB VRAM (RTX 3060+)
  • RAM: 16GB
  • Storage: 50GB free space
  • CUDA 11.8+

Recommended:

  • GPU: 12GB+ VRAM
  • RAM: 32GB
  • Storage: 100GB+ SSD

Developers

This dataset was created and maintained by:

Pranav Mishra

Pranav Vasist

Research Interests:

  • Voice synthesis and character consistency
  • Multimodal AI applications
  • Audio-language model development

🤝 Contributing

Contributions welcome! Areas for improvement:

Datasets:

  • Additional voice datasets integration
  • Multilingual voice collections
  • Emotional speech datasets

Technical:

  • Model optimization for lower VRAM
  • Faster processing pipelines
  • Better voice characteristic extraction

Analysis:

  • Voice similarity metrics
  • Character consistency evaluation
  • Demographic bias analysis

How to Contribute

  1. Fork the repository
  2. Create feature branch (git checkout -b feature/improvement)
  3. Commit changes (git commit -am 'Add improvement')
  4. Push branch (git push origin feature/improvement)
  5. Open Pull Request

📄 License

This project is licensed under the CC0 1.0 Universal License - see the LICENSE file for details.

CC0 1.0 Universal Summary:

  • ✅ Commercial use
  • ✅ Modification
  • ✅ Distribution
  • ✅ Private use
  • ❌ No warranties or liability

🙏 Acknowledgments

  • Qwen Team for the Qwen2-Audio model
  • Dataset Contributors: GLOBE_V2, Laions, AnimeVox, AniSpeech teams
  • HuggingFace for dataset hosting and tools
  • Open Source Community for supporting libraries

📞 Citation

If you use this dataset in your research, please cite:

@misc{pranav_mishra_2025,
	author       = { Pranav Mishra },
	title        = { VoicePersona (Revision 431e3b5) },
	year         = 2025,
	url          = { https://huggingface.co/datasets/Paranoiid/VoicePersona },
	doi          = { 10.57967/hf/6085 },
	publisher    = { Hugging Face }
}
