This project fine-tunes the DeepSeek-R1-Distill-Qwen-1.5B model with LoRA on articles from ByTheStream magazine, producing a model that can answer questions about the magazine's content.
Standard Training Pipeline:
  Full Data (All Volumes) → prepare_data.py → train.py → Full Model

Optimized Training Pipeline:
  Reduced Data Set (10 Volumes) → prepare_data_small.py → train_small.py → Optimized Model

Data Preparation Pipeline:
  Raw Article Content → Data Filter (10 Volumes) → Data Augmentation Techniques → Training Samples

Data Augmentation Techniques:
  Synonym Replacement | Sentence Transformation | Context Expansion

Training Optimization:
  Base Model → Layer Freezing → LoRA Configuration → Training Process

Layer Freezing Strategy:
  Frozen Layers (1-22) | Partially Frozen Layers (23-27) | Unfrozen Layers (LoRA)

LoRA Configuration:
  Reduced Rank (16) | Adjusted Alpha (32) | Reduced Target Modules | Increased Dropout (0.1)

Performance Optimization:
  Gradient Checkpointing | Mixed Precision (FP16) | Parallel Processing | Early Stopping

Memory Usage Reduction:
  Reduced Parameters (0.12%) | Optimized Batch Size (8) | Efficient Data Loading
- Collect article content from the ByTheStream magazine website
- Save articles in JSON format with the following fields:
{ "title": "Article Title", "author": "Author", "volume": "Issue Number", "content": "Article Content", "date": "Publication Date" }
We provide two data preprocessing scripts:
python prepare_data.py
This script:
- Cleans HTML tags
- Standardizes text format
- Generates Q&A pairs for training
- Creates training and validation sets
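The exact logic lives in prepare_data.py; as a rough illustration of these steps, the minimal sketch below cleans HTML, standardizes whitespace, and splits the articles. The helper names, input file name, and split ratio are assumptions for illustration, not the script's actual code.

import json
import random
import re

def clean_html(raw: str) -> str:
    """Strip HTML tags and collapse whitespace (hypothetical helper)."""
    text = re.sub(r"<[^>]+>", " ", raw)        # remove HTML tags
    return re.sub(r"\s+", " ", text).strip()   # standardize text format

def build_splits(articles, val_ratio=0.1, seed=42):
    """Shuffle articles and split them into training and validation sets."""
    random.seed(seed)
    random.shuffle(articles)
    cut = int(len(articles) * (1 - val_ratio))
    return articles[:cut], articles[cut:]

if __name__ == "__main__":
    with open("articles.json", encoding="utf-8") as f:   # assumed input file
        articles = json.load(f)
    for article in articles:
        article["content"] = clean_html(article["content"])
    train_set, val_set = build_splits(articles)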
python prepare_data_small.py
- Data Volume Reduction
  - Processes only the first 10 volumes (reducing data size by ~50%)
  - Reason: focuses on core content while maintaining quality
  - Benefit: faster training and reduced memory requirements
- Advanced Data Augmentation (see the synonym-replacement sketch after this list)
  - Synonym replacement with protected keywords
    - Preserves spiritual terms while enhancing vocabulary
    - Uses a custom synonym dictionary for domain-specific terms
  - Sentence structure transformation
    - Converts statements to questions
    - Adds modifiers for context variation
  - Context expansion
    - Adds spiritual background information
    - Includes explanatory content
  - Benefit: increases training data diversity without manual effort
- Intelligent Key Point Extraction (see the keyword-extraction sketch after this list)
  - Improved keyword extraction using TF-IDF and TextRank
    - Combines multiple algorithms for better accuracy
    - Filters out stopwords and common terms
  - Enhanced sentence scoring mechanism
    - Considers sentence length, position, and content
    - Weights spiritual terms higher
  - Core teaching focus
    - Prioritizes paragraphs with spiritual content
    - Maintains theological accuracy
  - Benefit: better-quality training samples
- Structured Training Data Generation
  - Creates targeted questions based on content
    - Generates multiple question types
    - Maintains context relevance
  - Provides structured answers
    - Includes article metadata
    - Organizes content hierarchically
  - Balances question types
    - Mixes different question formats
    - Ensures comprehensive coverage
  - Benefit: more effective model training
- Error Handling and Logging
  - Comprehensive error tracking
    - Logs processing errors by file
    - Maintains error statistics
  - Data validation
    - Ensures data integrity
    - Handles missing or malformed content
  - Benefit: reliable data processing
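As an illustration of synonym replacement with protected keywords, here is a minimal sketch. The synonym dictionary, the protected-term list, and the replacement probability are hypothetical placeholders, not the values used in prepare_data_small.py.

import random

# Hypothetical domain-specific synonym dictionary and protected spiritual terms.
SYNONYMS = {"恩典": ["恩惠"], "喜乐": ["欢喜", "快乐"]}
PROTECTED = {"耶稣", "圣经", "福音"}

def synonym_replace(text: str, prob: float = 0.3) -> str:
    """Replace non-protected words with a random synonym; protected terms stay intact."""
    out = text
    for word, candidates in SYNONYMS.items():
        if word in PROTECTED:
            continue                      # never touch protected spiritual terms
        if word in out and random.random() < prob:
            out = out.replace(word, random.choice(candidates))
    return out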
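For the keyword-extraction step, one common way to combine TF-IDF and TextRank on Chinese text is to merge the two rankings and prefer terms that appear in both, for example with jieba. Treat this as a sketch of the general idea rather than the exact logic in the script.

import jieba.analyse

def extract_keywords(text: str, top_k: int = 10) -> list[str]:
    """Merge TF-IDF and TextRank keyword rankings, preferring terms found by both."""
    tfidf_terms = jieba.analyse.extract_tags(text, topK=top_k * 2)   # TF-IDF ranking
    textrank_terms = jieba.analyse.textrank(text, topK=top_k * 2)    # TextRank ranking
    both = [w for w in tfidf_terms if w in textrank_terms]           # agreement first
    merged = both + [w for w in tfidf_terms if w not in both]
    return merged[:top_k]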
The processed data format is as follows:
{
  "question": "Question content",
  "answer": {
    "文章信息": {
      "标题": "Article Title",
      "作者": "Author",
      "卷期": "Volume Number",
      "类别": "Category"
    },
    "主要内容": {
      "概述": "Article overview",
      "关键段落": ["Key paragraph 1", "Key paragraph 2", ...]
    },
    "关键词解释": {
      "keyword1": "Explanation of keyword1",
      "keyword2": "Explanation of keyword2"
    },
    "关键句子解释": {
      "sentence1": "Explanation of sentence1",
      "sentence2": "Explanation of sentence2"
    }
  }
}
- Install dependencies:
pip install -r requirements.txt
- Ensure GPU environment:
- CUDA 11.8+
- PyTorch 2.0+
- At least 12GB VRAM
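A quick way to confirm the environment meets these requirements is a short check like the one below; the 12 GB threshold mirrors the requirement above.

import torch

assert torch.cuda.is_available(), "CUDA GPU not detected"
props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024**3
print(f"PyTorch {torch.__version__}, CUDA {torch.version.cuda}, "
      f"{props.name} with {vram_gb:.1f} GB VRAM")
assert vram_gb >= 12, "At least 12GB VRAM is recommended"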
# LoRA Configuration
from peft import LoraConfig

lora_config = LoraConfig(
r=32, # LoRA rank
lora_alpha=64, # LoRA alpha value
target_modules=[
"q_proj",
"k_proj",
"v_proj",
"o_proj",
"gate_proj",
"up_proj",
"down_proj",
],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
# Training Parameters
from transformers import TrainingArguments

training_args = TrainingArguments(
num_train_epochs=2,
per_device_train_batch_size=4,
gradient_accumulation_steps=8,
eval_steps=200,
save_steps=200,
learning_rate=2e-4,
warmup_steps=100,
logging_steps=50
)
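In train.py these two objects are combined with the base model and a Trainer roughly as follows. This is a sketch: the dataset variables are placeholders for whatever tokenized splits the script actually builds.

import torch
from peft import get_peft_model
from transformers import AutoModelForCausalLM, Trainer

# Load the base model and attach the LoRA adapters defined above.
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    trust_remote_code=True,
    torch_dtype=torch.float16,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # sanity-check the trainable-parameter ratio

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,   # placeholder: tokenized training split
    eval_dataset=eval_dataset,     # placeholder: tokenized validation split
)
trainer.train()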
# LoRA Configuration
lora_config = LoraConfig(
r=16, # Reduced rank for efficiency
lora_alpha=32, # Adjusted alpha for balance
target_modules=[
"q_proj",
"v_proj", # Reduced target modules
],
lora_dropout=0.1, # Increased dropout for regularization
bias="none",
task_type="CAUSAL_LM"
)
# Training Parameters
training_args = TrainingArguments(
num_train_epochs=1, # Reduced epochs
per_device_train_batch_size=8, # Increased batch size
gradient_accumulation_steps=8,
eval_steps=50, # More frequent evaluation
save_steps=50, # More frequent saving
learning_rate=5e-4, # Adjusted learning rate
warmup_steps=25, # Adjusted warmup
logging_steps=10 # More frequent logging
)
- Transfer Learning Improvements (see the layer-freezing sketch after this list):
  - Freezes more layers to reduce trainable parameters
  - Only unfreezes the last few layers (23-27) for fine-tuning
  - Reduces trainable parameters from 2.08% to 0.12%
- LoRA Configuration Optimization:
  - Reduces rank from 32 to 16
  - Decreases target modules from 7 to 2
  - Adjusts alpha from 64 to 32
  - Increases dropout from 0.05 to 0.1
- Training Process Optimization (reflected in the TrainingArguments sketch after this list):
  - Reduces training epochs from 2 to 1
  - Increases batch size from 4 to 8
  - More frequent evaluation and saving (every 50 steps)
  - Adds early stopping with a patience of 3
- Memory Efficiency (also shown in the TrainingArguments sketch below):
  - Enables gradient checkpointing
  - Uses mixed precision training (FP16)
  - Optimizes data loading with parallel processing
- Training Time Reduction:
  - Estimated training time reduced from 9-11 hours to ~4 hours
  - Each step takes approximately 44 seconds
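A minimal sketch of the layer-freezing step, assuming the Qwen2-style `model.model.layers` module list on the base model (28 decoder layers, indices 0-27) before the PEFT wrapper is applied; train_small.py may implement this differently.

# Freeze everything, then re-enable gradients only for the last few decoder layers.
for param in model.parameters():
    param.requires_grad = False

for layer in model.model.layers[23:]:      # layers 23-27 stay trainable
    for param in layer.parameters():
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {100 * trainable / total:.2f}%")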
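The early-stopping and memory-efficiency settings map onto standard transformers options roughly as shown below; the exact argument values and dataset variables in train_small.py may differ.

from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results_small",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,
    evaluation_strategy="steps",        # evaluate every eval_steps
    eval_steps=50,
    save_steps=50,
    learning_rate=5e-4,
    fp16=True,                          # mixed precision training
    gradient_checkpointing=True,        # trade compute for lower memory use
    dataloader_num_workers=4,           # parallel data loading
    load_best_model_at_end=True,        # required for early stopping
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,                        # placeholder: PEFT-wrapped model
    args=training_args,
    train_dataset=train_dataset,        # placeholder datasets
    eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)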
# Standard training
python train.py
# Optimized training
python train_small.py
- Load model and LoRA weights:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.float16
)

# Load LoRA weights
model = PeftModel.from_pretrained(model, "./results_small")
tokenizer = AutoTokenizer.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    trust_remote_code=True
)
- Generate responses:
def generate_response(prompt, model, tokenizer):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_length=512,
        temperature=0.6,
        top_p=0.85,
        repetition_penalty=1.3,
        num_beams=3,
        length_penalty=0.8,
        no_repeat_ngram_size=3
    )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response
# Test questions
test_questions = [
    "What articles has Yang Lei published in ByTheStream magazine?",
    "What articles has Pastor Huang Zhiqi published in ByTheStream magazine?",
    "What is the history of ByTheStream magazine's founding?"
]

# Generate responses
for question in test_questions:
    prompt = f"<think>Please answer the question based on ByTheStream magazine content. Keep the answer concise and cite the source.</think>\n\nQuestion: {question}\n\nAnswer:"
    response = generate_response(prompt, model, tokenizer)
    print(f"Question: {question}")
    print(f"Answer: {response}\n")
| Metric | Standard Training | Optimized Training |
|---|---|---|
| Training Time | ~9-11 hours | ~4 hours |
| Memory Usage | ~12GB | ~8GB |
| Trainable Parameters | 2.08% | 0.12% |
| Batch Size | 4 | 8 |
| Steps per Epoch | 337 | 337 |
| Time per Step | ~90 seconds | ~44 seconds |
- The optimized training script (train_small.py) is designed for faster training with minimal quality loss
- Early stopping mechanism prevents overfitting
- Checkpoints are saved every 50 steps for better recovery options
- Training progress is monitored through detailed logging