Open
81 commits
6ef5064
add current architecture
cyruszhang May 21, 2025
a2326af
more graphs
cyruszhang May 22, 2025
43f39de
Merge branch 'main' into feat/cyrusz/better-gpu-utilization
cyruszhang May 28, 2025
a15987f
Merge branch 'main' into feat/cyrusz/optimization-framework
cyruszhang Jun 10, 2025
3c3f743
Merge branch 'main' into feat/cyrusz/optimization-framework
cyruszhang Jun 11, 2025
5483b5d
Merge branch 'main' into feat/cyrusz/optimization-framework
cyruszhang Jun 13, 2025
037e596
ignore logs directory
cyruszhang Jun 16, 2025
f7bef69
[nit] .gitignore
cyruszhang Jun 16, 2025
73878cc
optimizer boilerplate and op fusion
cyruszhang Jun 17, 2025
7f92745
remove docs
cyruszhang Jun 17, 2025
a25b9f5
Merge branch 'main' into feat/cyrusz/optimization-framework
cyruszhang Jun 18, 2025
4e43bed
make filter fusion work
cyruszhang Jun 18, 2025
75f7d5c
remove optimization related logic from pipeline_ast
cyruszhang Jun 19, 2025
2211db9
remove optimization related logic from pipeline_ast
cyruszhang Jun 19, 2025
fe8a11a
remove optimization related logic from pipeline_ast
cyruszhang Jun 19, 2025
5583877
add performance benchmarks
cyruszhang Jun 19, 2025
53ea5a1
rid of excessive comments
cyruszhang Jun 20, 2025
68df788
revert pyproject yapf and isort settings
cyruszhang Jun 20, 2025
5de1c8a
use black for code formatting
cyruszhang Jun 20, 2025
22f6575
Revert "use black for code formatting"
cyruszhang Jun 21, 2025
bc8dcb0
Revert "revert pyproject yapf and isort settings"
cyruszhang Jun 21, 2025
3839a38
merge main
cyruszhang Jun 26, 2025
7ab1fd2
Merge branch 'main' into feat/cyrusz/optimization-framework
cyruszhang Jun 27, 2025
45c1282
use analyzer to guide optimization; more comprehensive performance be…
cyruszhang Jul 1, 2025
0d9b817
fix logging
cyruszhang Jul 1, 2025
5e0db32
fix precommit
cyruszhang Jul 1, 2025
f6afa92
remove accessory code
cyruszhang Jul 2, 2025
b8b9a6a
add fusion-analysis mode for fusion decision making test
cyruszhang Jul 2, 2025
f07eeff
remove optimizer mode as optimizer is used by default
cyruszhang Jul 2, 2025
e792631
remove lenient filter setup
cyruszhang Jul 2, 2025
9c38485
add more complex filters; clean up filter setup
cyruszhang Jul 2, 2025
ea6efa3
tune filters; add full test printouts; add result exporting
cyruszhang Jul 2, 2025
b63c1a3
fully use FusionStrategy
cyruszhang Jul 2, 2025
5825812
fix minor AND logic
cyruszhang Jul 3, 2025
d8e19d0
beef up op complexity logic; add more robust benchmark validation logic
cyruszhang Jul 8, 2025
f7cf0b2
add recipe mode in performance benchmark suite
cyruszhang Jul 8, 2025
3e1748f
use model_zoo caching properly in prepare_model
cyruszhang Jul 8, 2025
01d17d5
remove fused mode; tune up individual mode by removing redundant stat…
cyruszhang Jul 8, 2025
9099b71
add real dataset support to performance benchmark suite
cyruszhang Jul 9, 2025
092cf06
full result validation and bug fix; use faker for synthetic data gene…
cyruszhang Jul 10, 2025
b86a323
use debug instead of info for debugging info
cyruszhang Jul 10, 2025
81ffeb4
fix load_ops bug
cyruszhang Jul 10, 2025
541f8b4
fix nested fused_mapper bug
cyruszhang Jul 10, 2025
9ea5a53
fix text field issue; fix fusedfilter arg list issue
cyruszhang Jul 10, 2025
69e8db4
use more comprehensive filters
cyruszhang Jul 11, 2025
4528ccc
solve stats key missing issue in parallel mode
cyruszhang Jul 11, 2025
9e45229
isolated perf test
cyruszhang Jul 15, 2025
f1acf48
move optimizer test
cyruszhang Jul 16, 2025
808b827
refactor data generation routines
cyruszhang Jul 16, 2025
12fabc6
update fused_op for huggingface datasets compatibility; slim down perfo…
cyruszhang Jul 17, 2025
1e1de76
add outputs in both mode
cyruszhang Jul 17, 2025
ffc806b
add clear cache utility; modify filter list
cyruszhang Jul 17, 2025
97b638c
update stats time; more logging in model utils
cyruszhang Jul 18, 2025
c47e92a
deep copy for test dataset
cyruszhang Jul 18, 2025
07f8a63
use optimizer instead of fused_filter for performance benchmarking
cyruszhang Jul 23, 2025
74186e7
Merge branch 'main' into feat/cyrusz/optimization-framework
cyruszhang Sep 22, 2025
c622473
add perf benchmark framework for optimization strategies
cyruszhang Oct 3, 2025
649ccf4
ignore perf_bench_data
cyruszhang Oct 6, 2025
936547f
use existing workload present in github workflows
cyruszhang Oct 6, 2025
20d8d30
add support for sample rate and prefixed modality/config setup
cyruszhang Oct 6, 2025
d9d76b3
use existing optimization strategies
cyruszhang Oct 6, 2025
0bcc290
fix metrics: retention and throughput
cyruszhang Oct 7, 2025
020af12
fix metrics counting logic; add text-c4 configs
cyruszhang Oct 8, 2025
22600e1
add c4 workloads
cyruszhang Oct 9, 2025
40720f7
add c4 workloads
cyruszhang Oct 9, 2025
06ef848
add baseline param for ab test; fix reporting bug
cyruszhang Oct 9, 2025
bada847
remove old perf test code
cyruszhang Oct 9, 2025
0b64ff2
update default configs for optimization
cyruszhang Oct 23, 2025
ce83845
add op reorder strategy and demo yaml; add optimization manager
cyruszhang Oct 23, 2025
741b39d
update config.py for optimizer related configs
cyruszhang Oct 23, 2025
fc1e8f5
add registry support for optimization strategies
cyruszhang Oct 23, 2025
a272a59
enable optimization for default executor
cyruszhang Oct 23, 2025
0124ac3
enable optimization for ray executor
cyruszhang Oct 23, 2025
04538b7
update __init__ to include optimization related classes and functions
cyruszhang Oct 23, 2025
d9f0ed4
enable reorder
cyruszhang Oct 23, 2025
f6eee01
use registry for filter and mapper fusion
cyruszhang Oct 23, 2025
91b0514
add op reorder to workloads
cyruszhang Oct 23, 2025
2e8b305
add op_reorder to benchmark
cyruszhang Oct 23, 2025
595eab7
expose ABTestConfig
cyruszhang Oct 23, 2025
04f72c3
update benchmark runner for new strategies and config overriding
cyruszhang Oct 23, 2025
9f8495f
Merge branch 'main' into feat/cyrusz/optimization-framework
cyruszhang Oct 23, 2025
7 changes: 7 additions & 0 deletions .gitignore
@@ -3,6 +3,9 @@
outputs/
assets/

# logs
**/logs/*

# setup
data_juicer.egg-info/
py_data_juicer.egg-info/
@@ -16,6 +19,7 @@ wandb/
__pycache__
.vscode/
.ipynb_checkpoints/
performance_test_results*.json

# label studio related
label_studio_data/
@@ -31,3 +35,6 @@ tests/ops/data/*dup*
tests/tools/tmp_*/
tests/ops/deduplicator/chinese_dedup/
tests/ops/deduplicator/english_dedup/

# perf bench data
perf_bench_data
5 changes: 5 additions & 0 deletions configs/config_all.yaml
@@ -68,6 +68,11 @@ eoc_special_token: '<|__dj__eoc|>' # the special token
executor_type: default # type of executor, support "default" or "ray" for now.
ray_address: auto # the address of the Ray cluster.

# Core optimizer configuration
enable_optimizer: false # enable/disable core optimizer
optimizer_strategies: ['op_reorder'] # list of optimization strategies to apply
# available strategies: op_reorder, filter_fusion, mapper_fusion

# only for data analysis
percentiles: [0.25, 0.5, 0.75] # percentiles to analyze the dataset distribution
export_original_dataset: false # whether to export the original dataset with stats. If you only need the stats of the dataset, setting it to false could speed up the exporting.
5 changes: 5 additions & 0 deletions configs/config_min.yaml
@@ -11,3 +11,8 @@ executor_type: default # type of executor,
ray_address: auto # the address of the Ray cluster.
suffixes: null
add_suffix: false

# Core optimizer configuration
enable_optimizer: false # enable/disable core optimizer
optimizer_strategies: ['op_reorder'] # list of optimization strategies to apply
# available strategies: op_reorder, filter_fusion, mapper_fusion
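
Reviewer note: the two keys above imply a gate (`enable_optimizer`) followed by a name-based lookup of each entry in `optimizer_strategies`. The snippet below is a minimal sketch of that control flow only, not the code in this PR. `STRATEGIES`, `register_strategy`, and `apply_optimizer` are hypothetical names, and the toy `op_reorder` body (a stable sort that places filters first) is a stand-in for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical registry: strategy name -> function that rewrites an op list.
STRATEGIES: Dict[str, Callable[[List[str]], List[str]]] = {}


def register_strategy(name: str):
    """Decorator that records a strategy function under `name`."""
    def decorator(fn: Callable[[List[str]], List[str]]):
        STRATEGIES[name] = fn
        return fn
    return decorator


@register_strategy("op_reorder")
def toy_op_reorder(ops: List[str]) -> List[str]:
    # Toy stand-in: stable sort that puts ops ending in "_filter" first.
    # A real strategy would also have to respect data dependencies.
    return sorted(ops, key=lambda op: not op.endswith("_filter"))


def apply_optimizer(cfg: dict, ops: List[str]) -> List[str]:
    """Apply the configured strategies in order, or return ops unchanged."""
    if not cfg.get("enable_optimizer", False):
        return list(ops)
    result = list(ops)
    for name in cfg.get("optimizer_strategies", []):
        if name not in STRATEGIES:
            raise ValueError(f"unknown optimization strategy: {name}")
        result = STRATEGIES[name](result)
    return result


pipeline = ["clean_html_mapper", "text_length_filter", "words_num_filter"]
disabled = apply_optimizer({"enable_optimizer": False}, pipeline)
enabled = apply_optimizer(
    {"enable_optimizer": True, "optimizer_strategies": ["op_reorder"]}, pipeline
)
```

With the default `enable_optimizer: false`, the pipeline comes back untouched, which matches the opt-in default shown in both config files.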
92 changes: 92 additions & 0 deletions configs/demo/fused_operations_demo.yaml
@@ -0,0 +1,92 @@
# Fused Operations Demo Configuration
# This config demonstrates how to use fused operations for optimal performance

project_name: 'fused_operations_demo'
dataset_path: 'path/to/your/dataset.jsonl' # Replace with your dataset path
export_path: 'output/fused_processed_dataset.jsonl'
export_shard_size: 0
export_in_parallel: false
np: 4
text_keys: 'text'
suffixes: []
turbo: false
skip_op_error: true
use_cache: true
ds_cache_dir: null
open_monitor: true
use_checkpoint: false
temp_dir: null
open_tracer: false
op_list_to_trace: []
trace_num: 10

# Enable fused operations for optimal performance
op_fusion: true
fusion_strategy: 'probe' # Use probe strategy for optimal ordering
cache_compress: null
keep_stats_in_res_ds: false
keep_hashes_in_res_ds: false
adaptive_batch_size: false

# For multimodal data processing
image_key: 'images'
image_special_token: '<__dj__image>'
audio_key: 'audios'
audio_special_token: '<__dj__audio>'
video_key: 'videos'
video_special_token: '<__dj__video>'
eoc_special_token: '<|__dj__eoc|>'

# Executor configuration
executor_type: default
ray_address: auto

# Process pipeline with operations that will be automatically fused
process:
  # Phase 1: Text cleaning mappers (these run first)
  - clean_html_mapper: {}  # Remove HTML tags
  - clean_links_mapper: {}  # Remove URLs
  - clean_email_mapper: {}  # Remove email addresses
  - clean_copyright_mapper: {}  # Remove copyright notices

  # Phase 2: Text quality filters (these will be fused automatically)
  # Basic text characteristics
  - text_length_filter:  # Filter by text length
      min_len: 50
      max_len: 2000
  - words_num_filter:  # Filter by word count
      min_num: 10
      max_num: 500
  - character_repetition_filter:  # Filter repetitive characters
      repetition_ratio: 0.8
  - word_repetition_filter:  # Filter repetitive words
      min_ratio: 0.0
      max_ratio: 0.5
  - special_characters_filter:  # Filter special character ratio
      min_ratio: 0.0
      max_ratio: 0.3
  - alphanumeric_filter:  # Filter alphanumeric ratio
      min_ratio: 0.3
  - average_line_length_filter:  # Filter by average line length
      min_len: 10
      max_len: 100
  - maximum_line_length_filter:  # Filter by maximum line length
      min_len: 10
      max_len: 200

  # Phase 3: Content quality filters (these will also be fused)
  - perplexity_filter:  # Filter by language model perplexity
      max_ppl: 1500
  - stopwords_filter:  # Filter by stopword ratio
      min_ratio: 0.1
  - flagged_words_filter:  # Filter by flagged word ratio
      max_ratio: 0.05
  - language_id_score_filter:  # Filter by language confidence
      lang: 'en'
      min_score: 0.5
      max_score: 1.0

  # Phase 4: Text transformation mappers (these run after filtering)
  - expand_macro_mapper: {}  # Expand LaTeX macros
  - chinese_convert_mapper:  # Convert Chinese text
      mode: 's2t'  # Simplified to Traditional
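
Reviewer note: the idea behind fusing the Phase 2 and Phase 3 filters can be sketched as follows. Instead of one pass over the dataset per filter, a fused filter evaluates every predicate on a sample in a single pass and keeps the sample only if all of them accept it. This is a simplified illustration under assumed names (`SimpleFilter`, `fuse`, `run_fused`, `run_unfused` are hypothetical), not the fused-filter implementation in this PR.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SimpleFilter:
    """Hypothetical range filter: keep a text if lo <= stat(text) <= hi."""
    name: str
    stat: Callable[[str], float]
    lo: float
    hi: float

    def keep(self, text: str) -> bool:
        return self.lo <= self.stat(text) <= self.hi


def fuse(filters: List[SimpleFilter]) -> Callable[[str], bool]:
    """Combine filters into one predicate that ANDs their decisions."""
    def fused(text: str) -> bool:
        # all() short-circuits, so later filters are skipped once one rejects.
        return all(f.keep(text) for f in filters)
    return fused


def run_fused(texts: List[str], filters: List[SimpleFilter]) -> List[str]:
    """Single pass over the data with the fused predicate."""
    keep = fuse(filters)
    return [t for t in texts if keep(t)]


def run_unfused(texts: List[str], filters: List[SimpleFilter]) -> List[str]:
    """Baseline: one full pass over the surviving data per filter."""
    out = list(texts)
    for f in filters:
        out = [t for t in out if f.keep(t)]
    return out


demo_filters = [
    SimpleFilter("text_length_filter", lambda t: len(t), 5, 20),
    SimpleFilter("words_num_filter", lambda t: len(t.split()), 2, 5),
]
demo_texts = ["hi", "hello world", "a b c d e f g h"]
fused_result = run_fused(demo_texts, demo_filters)
unfused_result = run_unfused(demo_texts, demo_filters)
```

Both paths keep the same samples; the fused path only changes how many times the data is traversed, which is why fused and unfused results can be compared directly when validating a benchmark run.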
53 changes: 53 additions & 0 deletions configs/optimization/op_reorder_showcase.yaml
@@ -0,0 +1,53 @@
# Configuration to showcase operation reordering optimization
# This config has a suboptimal order that should be reordered by the optimizer
# GOAL: Show dramatic performance difference by putting expensive operations first

project_name: 'op-reorder-showcase'
dataset_path: 'perf_bench_data/text/wiki-10k.jsonl'
export_path: 'outputs/op_reorder_showcase/res.jsonl'
np: 4
use_cache: false

process:
  # VERY EXPENSIVE OPERATIONS (should be moved after filtering)
  # These are resource-intensive operations that waste computation on filtered data
  - text_chunk_mapper:
      chunk_size: 500  # Smaller chunks = more processing
      text_key: 'text'
      mem_required: '2GB'

  - text_entity_dependency_filter:
      min_score: 0.9  # Very strict filtering
      text_key: 'text'
      mem_required: '3GB'

  - text_pair_similarity_filter:
      min_score: 0.8
      text_key: 'text'
      mem_required: '2GB'

  # LIGHT FILTERS (should be moved to front)
  # These are fast filters that should run early to reduce data volume
  - text_length_filter:
      min_len: 50  # Less restrictive to keep more data
      max_len: 5000
      text_key: 'text'

  - text_action_filter:
      action_types: ['question', 'command', 'statement']  # Keep all types
      text_key: 'text'

  # DEPENDENCY CHAIN (must stay in order)
  # language_id must come before perplexity
  - language_id_score_filter:
      lang: 'en'
      min_score: 0.5  # Much less strict to keep more data
      text_key: 'text'

  - perplexity_filter:
      lang: 'en'
      min_score: 0.1  # Much less strict to keep more data
      text_key: 'text'

# ADDITIONAL EXPENSIVE OPERATIONS
# text_pair_similarity_filter moved up to replace text_embd_similarity_filter
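
Reviewer note: the reordering this showcase is meant to trigger can be sketched as a cost-aware topological sort. Among the ops whose prerequisites have already run, the cheapest one goes next. The function `reorder_ops`, the cost numbers, and the dependency map below are all assumptions made up for illustration; the PR's strategy may estimate cost and detect dependencies differently. The sketch also does not model the fact that moving a filter across a mapper that rewrites the text can change which samples survive, unless that relationship is declared in `deps`.

```python
import heapq
from typing import Dict, List, Set


def reorder_ops(
    ops: List[str], cost: Dict[str, int], deps: Dict[str, Set[str]]
) -> List[str]:
    """Order ops cheapest-first while running every prerequisite earlier.

    Assumes op names are unique. Ties on cost keep the original order.
    """
    index = {op: i for i, op in enumerate(ops)}
    # Unmet prerequisites per op, restricted to ops present in this pipeline.
    remaining = {op: set(deps.get(op, ())) & set(ops) for op in ops}
    ready = [(cost[op], index[op], op) for op in ops if not remaining[op]]
    heapq.heapify(ready)

    ordered: List[str] = []
    while ready:
        _, _, op = heapq.heappop(ready)
        ordered.append(op)
        # Release any op that was waiting only on the one just scheduled.
        for other in ops:
            if op in remaining[other]:
                remaining[other].discard(op)
                if not remaining[other]:
                    heapq.heappush(ready, (cost[other], index[other], other))

    if len(ordered) != len(ops):
        raise ValueError("dependency cycle detected")
    return ordered


showcase_ops = [
    "text_chunk_mapper",
    "text_entity_dependency_filter",
    "text_pair_similarity_filter",
    "text_length_filter",
    "text_action_filter",
    "language_id_score_filter",
    "perplexity_filter",
]
# Hypothetical relative costs; smaller means cheaper.
showcase_cost = {
    "text_length_filter": 1,
    "text_action_filter": 2,
    "perplexity_filter": 3,
    "language_id_score_filter": 4,
    "text_chunk_mapper": 5,
    "text_pair_similarity_filter": 8,
    "text_entity_dependency_filter": 9,
}
# From the config comment: language_id must come before perplexity.
showcase_deps = {"perplexity_filter": {"language_id_score_filter"}}

reordered = reorder_ops(showcase_ops, showcase_cost, showcase_deps)
```

Under these made-up costs, `perplexity_filter` is cheaper than `language_id_score_filter` but is still scheduled after it, which is the behavior the "DEPENDENCY CHAIN" comment in the config asks for.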
32 changes: 32 additions & 0 deletions data_juicer/benchmark/__init__.py
@@ -0,0 +1,32 @@
"""
Data-Juicer Performance Benchmark Framework

A comprehensive framework for A/B testing optimization strategies
across different workloads, modalities, and operation complexities.
"""

from .core.benchmark_runner import BenchmarkConfig, BenchmarkRunner
from .core.metrics_collector import MetricsCollector
from .core.report_generator import ReportGenerator
from .core.result_analyzer import ResultAnalyzer
from .strategies.ab_test import ABTestConfig, StrategyABTest
from .strategies.strategy_library import STRATEGY_LIBRARY, OptimizationStrategy
from .utils.config_manager import ConfigManager
from .workloads.workload_suite import WORKLOAD_SUITE, WorkloadDefinition, WorkloadSuite

__version__ = "1.0.0"
__all__ = [
    "BenchmarkRunner",
    "BenchmarkConfig",
    "MetricsCollector",
    "ResultAnalyzer",
    "ReportGenerator",
    "OptimizationStrategy",
    "STRATEGY_LIBRARY",
    "StrategyABTest",
    "ABTestConfig",
    "WorkloadSuite",
    "WorkloadDefinition",
    "WORKLOAD_SUITE",
    "ConfigManager",
]
15 changes: 15 additions & 0 deletions data_juicer/benchmark/core/__init__.py
@@ -0,0 +1,15 @@
"""Core benchmark framework components."""

from .benchmark_runner import BenchmarkRunner
from .metrics_collector import BenchmarkMetrics, MetricsCollector
from .report_generator import ReportGenerator
from .result_analyzer import ComparisonResult, ResultAnalyzer

__all__ = [
    "BenchmarkRunner",
    "MetricsCollector",
    "BenchmarkMetrics",
    "ResultAnalyzer",
    "ComparisonResult",
    "ReportGenerator",
]