modelscope · cyruszhang · Jul 16, 2025 · Jul 23, 2025 · Jul 23, 2025 · Jul 23, 2025
diff --git a/README.md b/README.md
@@ -38,6 +38,7 @@ Data-Juicer is being actively updated and maintained. We will periodically enhan
 
 
 ## News
+- 🛠️ [2025-08-09] **New Job Management & Monitoring Features**: We've added comprehensive job monitoring capabilities including a [Processing Snapshot Utility](data_juicer/utils/job/snapshot.py) that provides detailed JSON analysis of job status, progress tracking, and checkpointing information. Also introduced [Resource-Aware Partitioning](data_juicer/core/executor/partition_size_optimizer.py) for automatic optimization of distributed processing resources, and enhanced logging with centralized rotation and retention policies.
 - 🛠️ [2025-06-04] How to process feedback data in the "era of experience"? We propose [Trinity-RFT: A General-Purpose and Unified Framework for Reinforcement Fine-Tuning of LLMs](https://arxiv.org/abs/2505.17826), which leverages Data-Juicer for its data pipelines tailored for RFT scenarios.
 - 🎉 [2025-06-04] Our [Data-Model Co-development Survey](https://ieeexplore.ieee.org/document/11027559) has been accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (**TPAMI**)! Welcome to explore and contribute the [awesome-list](https://modelscope.github.io/data-juicer/en/main/docs/awesome_llm_data.html).
 - 🔎 [2025-06-04] We introduce [DetailMaster: Can Your Text-to-Image Model Handle Long Prompts?](https://www.arxiv.org/abs/2505.16915) A synthetic benchmark revealing notable performance drops despite large models' proficiency with short descriptions.
@@ -111,6 +112,7 @@ Besides, our paper is also updated to [v3](https://arxiv.org/abs/2309.02033).
   - [How-to Guide for Developers](docs/DeveloperGuide.md)
   - [Distributed Data Processing in Data-Juicer](docs/Distributed.md)
   - [Sandbox](docs/Sandbox.md)
+  - [Job Management & Monitoring](docs/JobManagement.md)
 - Demos
   - [demos](demos/README.md)
 - Tools
@@ -127,6 +129,10 @@ Besides, our paper is also updated to [v3](https://arxiv.org/abs/2309.02033).
   - [Postprocess Tools](tools/postprocess/README.md)
   - [Preprocess Tools](tools/preprocess/README.md)
   - [Data Scoring](tools/quality_classifier/README.md)
+- Job Management & Monitoring
+  - [Processing Snapshot Utility](data_juicer/utils/job/snapshot.py) - Comprehensive job status analysis with JSON output
+  - [Job Management Tools](data_juicer/utils/job/) - Monitor and manage DataJuicer processing jobs
+  - [Resource-Aware Partitioning](data_juicer/core/executor/partition_size_optimizer.py) - Automatic resource optimization for distributed processing
 - Third-party
   - [LLM Ecosystems](thirdparty/LLM_ecosystems/README.md)
   - [Third-party Model Library](thirdparty/models/README.md)

diff --git a/README_ZH.md b/README_ZH.md
@@ -34,7 +34,8 @@ Data-Juicer正在积极更新和维护中，我们将定期强化和新增更多
 ----
 
 ## 新消息
-- 🛠️ [2025-06-04] 如何在“经验时代”处理反馈数据？我们提出了 [Trinity-RFT: A General-Purpose and Unified Framework for Reinforcement Fine-Tuning of LLMs](https://arxiv.org/abs/2505.17826)，该框架利用 Data-Juicer 为 RFT 场景量身定制数据处理管道。
+- 🛠️ [2025-08-09] **新增作业管理与监控功能**：我们新增了全面的作业监控功能，包括[处理快照工具](data_juicer/utils/job/snapshot.py)，提供详细的JSON格式作业状态分析、进度跟踪和检查点信息。同时引入了[资源感知分区](data_juicer/core/executor/partition_size_optimizer.py)功能，用于分布式处理资源的自动优化，以及增强的日志系统，提供集中化的日志轮转和保留策略。
+- 🛠️ [2025-06-04] 如何在"经验时代"处理反馈数据？我们提出了 [Trinity-RFT: A General-Purpose and Unified Framework for Reinforcement Fine-Tuning of LLMs](https://arxiv.org/abs/2505.17826)，该框架利用 Data-Juicer 为 RFT 场景量身定制数据处理管道。
 - 🎉 [2025-06-04] 我们的 [Data-Model Co-development 综述](https://ieeexplore.ieee.org/document/11027559) 已被 IEEE Transactions on Pattern Analysis and Machine Intelligence（**TPAMI**）接收！欢迎探索并贡献[awesome-list](https://modelscope.github.io/data-juicer/en/main/docs/awesome_llm_data.html)。
 - 🔎 [2025-06-04] 我们推出了 [DetailMaster: Can Your Text-to-Image Model Handle Long Prompts?](https://www.arxiv.org/abs/2505.16915) 一项合成基准测试，揭示了大模型虽擅长处理短描述，但在长提示下性能显著下降的问题。
 - 🎉 [2025-05-06] 我们的 [Data-Juicer Sandbox](https://arxiv.org/abs/2407.11784) 已被接收为 **ICML'25 Spotlight**（处于所有投稿中的前 2.6%）！
@@ -107,6 +108,7 @@ Data-Juicer正在积极更新和维护中，我们将定期强化和新增更多
   - [开发者指南](docs/DeveloperGuide_ZH.md)
   - [Data-Juicer分布式数据处理](docs/Distributed_ZH.md)
   - [沙盒实验室](docs/Sandbox_ZH.md)
+  - [作业管理与监控](docs/JobManagement_ZH.md)
 - Demos
   - [演示](demos/README_ZH.md)
     - [自动化评测：HELM 评测及可视化](demos/auto_evaluation_helm/README_ZH.md)
@@ -125,6 +127,10 @@ Data-Juicer正在积极更新和维护中，我们将定期强化和新增更多
   - [后处理工具](tools/postprocess/README_ZH.md)
   - [预处理工具](tools/preprocess/README_ZH.md)
   - [给数据打分](tools/quality_classifier/README_ZH.md)
+- 作业管理与监控
+  - [处理快照工具](data_juicer/utils/job/snapshot.py) - 提供JSON格式的全面作业状态分析
+  - [作业管理工具](data_juicer/utils/job/) - 监控和管理DataJuicer处理作业
+  - [资源感知分区](data_juicer/core/executor/partition_size_optimizer.py) - 分布式处理的自动资源优化
 - 第三方
   - [大语言模型生态](thirdparty/LLM_ecosystems/README_ZH.md)
   - [第三方模型库](thirdparty/models/README_ZH.md)

diff --git a/configs/demo/partition-checkpoint-eventlog.yaml b/configs/demo/partition-checkpoint-eventlog.yaml
@@ -0,0 +1,154 @@
+# =============================================================================
+# COMPREHENSIVE DATAJUICER DEMO: Checkpointing, Event Logging & Job Management
+# =============================================================================
+# This demo showcases:
+# 1. Configurable checkpointing strategies
+# 2. Event logging with job-specific directories
+# 3. Flexible storage architecture
+# 4. Job resumption capabilities
+# 5. Real DataJuicer operations
+# =============================================================================
+
+# Data location configuration (Mandatory)
+dataset_path: './demos/data/demo-dataset.jsonl'
+
+# Work directory configuration
+# IMPORTANT: If using {job_id} placeholder, it MUST be the last part of the path
+# Examples:
+#   ✅ work_dir: "./outputs/my_project/{job_id}"     # Valid
+#   ✅ work_dir: "/data/experiments/{job_id}"        # Valid
+#   ❌ work_dir: "./outputs/{job_id}/results"        # Invalid - {job_id} not at end
+#   ❌ work_dir: "./{job_id}/outputs/data"           # Invalid - {job_id} not at end
+#
+# If no {job_id} is specified, job_id will be automatically appended:
+#   work_dir: "./outputs/my_project" → job_dir: "./outputs/my_project/20250804_143022_abc123"
+work_dir: "./outputs/partition-checkpoint-eventlog/{job_id}"
+export_path: '{work_dir}/processed.jsonl'
+
+# Executor configuration
+executor_type: "ray_partitioned"  # Use our enhanced partitioned executor
+ray_address: "auto"
+# np will be auto-configured based on available cluster resources when partition.auto_configure: true
+# np: 2  # Number of Ray workers (auto-configured when partition.auto_configure: true)
+
+# Separate storage configuration
+# Partition directory (Optional) is used to store the partitions of the dataset if using ray_partitioned executor
+partition_dir: "{work_dir}/partitions"
+
+# Event logs: Fast storage (SSD, local disk) - small files, frequent writes (Optional)
+event_log_dir: "{work_dir}/event_logs"  # Optional: separate fast storage for event logs
+
+# Checkpoints: Large storage (HDD, network storage) - large files, infrequent writes (Optional)
+checkpoint_dir: "{work_dir}/checkpoints"  # Optional: separate large storage for checkpoints
+
+
+# Resource optimization configuration
+resource_optimization:
+  auto_configure: true  # Enable automatic optimization of partition size, worker count, and other resource-dependent settings
+  # Manual configuration (used when auto_configure: false)
+  # partition:
+  #   size: 10000  # Number of samples per partition
+  #   max_size_mb: 128  # Maximum partition size in MB
+  # np: 2  # Number of Ray workers
+
+# Partition configuration (used when resource_optimization.auto_configure: false)
+partition:
+  # size: 10000  # Number of samples per partition
+  # max_size_mb: 128  # Maximum partition size in MB
+
+# Checkpoint configuration
+checkpoint:
+  enabled: false
+  # strategy: "every_op"  # every_op, every_partition, every_n_ops, manual, disabled
+  # n_ops: 1  # Number of operations between checkpoints (for every_n_ops strategy)
+  # op_names: []  # Specific operation names to checkpoint after (for manual strategy)
+
+# Intermediate storage configuration (includes file lifecycle management)
+intermediate_storage:
+  format: "parquet"  # parquet, arrow, jsonl
+  parquet_batch_size: 10000  # Number of rows per parquet file
+
+# Event logging configuration
+event_logging:
+  enabled: true
+
+# Process pipeline with real DataJuicer operations
+process:
+  # Text cleaning operations
+  - clean_links_mapper:
+      text_key: "text"
+      min_links: 0
+      max_links: 10
+
+  - clean_email_mapper:
+      text_key: "text"
+      min_emails: 0
+      max_emails: 5
+
+  - whitespace_normalization_mapper:
+      text_key: "text"
+
+  - fix_unicode_mapper:
+      text_key: "text"
+
+  # Text filtering operations
+  - text_length_filter:
+      text_key: "text"
+      min_len: 5
+      max_len: 10000
+
+  - alphanumeric_filter:
+      text_key: "text"
+      min_ratio: 0.1
+
+  # Quality filtering
+  - character_repetition_filter:
+      text_key: "text"
+      min_ratio: 0.0
+      max_ratio: 0.5
+
+  - word_repetition_filter:
+      text_key: "text"
+      min_ratio: 0.0
+      max_ratio: 0.5
+
+# Export configuration
+export_in_parallel: true
+keep_stats_in_res_ds: true
+keep_hashes_in_res_ds: true
+
+# =============================================================================
+# COMPLETE USER EXPERIENCE:
+# =============================================================================
+# 1. Start job:
+#    dj-process --config configs/demo/partition-checkpoint-eventlog.yaml
+#    # Output shows: Job ID (timestamp_configname_suffix), job directory, resumption command
+#    # Example: 20241201_143022_partition-checkpoint-eventlog_abc123
+#
+# 2. If job fails, resume with:
+#    dj-process --config configs/demo/partition-checkpoint-eventlog.yaml --job_id <job_id>
+#    # System validates job_id and shows previous status
+#
+# 3. Directory structure (flexible storage):
+#    outputs/partition-checkpoint-eventlog/{job_id}/
+#    ├── partitions/           # Dataset partitions (large files)
+#    ├── checkpoints/          # Operation checkpoints (large files)
+#    ├── event_logs/           # Event logs (small files, frequent writes)
+#    ├── metadata/             # Job metadata and mapping
+#    ├── results/              # Final processed dataset
+#    └── processed.jsonl       # Final output file
+#
+# 4. Resource Optimization:
+#    - resource_optimization.auto_configure: true automatically optimizes:
+#      * Partition size based on data characteristics and available memory
+#      * Worker count (np) based on available CPU cores
+#      * Processing efficiency based on data modality (text, image, audio, video)
+#    - No manual tuning required - system adapts to your hardware and data
+#
+# 5. Monitoring and Debugging:
+#    - Real-time event logs in event_logs/ directory
+#    - Processing summary with statistics and timing
+#    - Checkpoint recovery for fault tolerance
+#    - Detailed resource utilization analysis
+#
+# =============================================================================