Research Topic: Investigating the causal impact of user sentiment on rating behavior in Amazon beauty reviews Dataset: 701,528 reviews, 631,986 users (large-scale analysis) Duration: 4-6 weeks (from data processing to theoretical modeling) Key Finding: Discovered a sentiment threshold effect at -0.483, with sentiment above the threshold having 10.7x stronger impact on ratings
code/ # Data preprocessing and analysis scripts
data/ # Dataset description
results/ # Analysis results and visualizations
docs/ # Methodology and research findings
requirements.txt # Python dependencies
.gitignore # Git ignore rules
README.md # Project overview
- Data cleaning, VADER sentiment analysis, cross-sectional statistics
- Findings: Sentiment-rating correlation r=0.61
- Technical highlight: Batch processing to handle memory limits
- Rolling window analysis, regression modeling, user stratification
- Findings: Regression Rยฒ=0.35, time-series correlation r=0.54
- Methodology: Robust multi-validation framework
- User trajectory visualization, K-means clustering, radar plots
- Findings: Identified 5 user segments with distinct patterns
- Incorporated Weber's Law, sliding threshold modeling
- Contribution: Applied psychophysics concepts to digital behavior for the first time
Module | Key Results |
---|---|
Sentiment vs. Rating | Correlation r=0.61, Rยฒ=0.35 |
User Clustering | 5 distinct user segments |
Threshold Effect | Threshold at -0.483, 10.7x impact |
Feature Importance | Sentiment 60% > Change 48% > Activity 3% |
- Custom Algorithms: Overcame pandas memory bottlenecks
- Robust Validation: Cross-sectional, time-series, stratified validation
- Theory Integration: Modeled non-linear threshold effects via psychophysics
- Engineering Scale: Processed 700K+ reviews efficiently
- Business Relevance: Clear user segmentation, quantifiable ROI
# Clone the repository
git clone https://github.com/your-username/amazon-sentiment-analysis.git
cd amazon-sentiment-analysis
# Install dependencies
pip install -r requirements.txt
# Run analysis
python code/01_data_preprocessing.py
This project is licensed under the MIT License.
If you'd like, I can also provide you a clean .md file version ready to upload directly. Just say "yes".
This project conducts a comprehensive analysis of Amazon beauty product reviews to investigate the causal impact of user sentiment on rating behavior. Through advanced statistical modeling and machine learning techniques, we discovered significant sentiment threshold effects and identified distinct user behavioral patterns.
- Large-scale Analysis: 701,528 reviews from 631,986 users
- Strong Effect Size: Rยฒ = 0.35-0.378 (large effect in social sciences)
- Novel Discovery: Sentiment threshold effect at -0.483 with 10.7x impact difference
- User Segmentation: 5 distinct user behavioral clusters identified
- Theoretical Innovation: First application of Weber's Law to digital sentiment analysis
๐ฏ Optimal Threshold: -0.483 ๐ Below Threshold: r = 0.163, Rยฒ = 0.027 ๐ Above Threshold: r = 0.538, Rยฒ = 0.290 โก Effect Magnitude: 10.7x stronger above threshold
Cluster | Size | Type | Avg Sentiment | Volatility | Avg Rating |
---|---|---|---|---|---|
0 | 65 (8.1%) | Stable Moderate | 0.366 | 0.289 | 3.286 |
1 | 85 (10.6%) | Stable Negative | -0.167 | 0.134 | 1.716 |
2 | 135 (16.9%) | Volatile Positive | 0.346 | 0.430 | 3.548 |
3 | 64 (8.0%) | Highly Volatile | 0.206 | 0.592 | 3.837 |
4 | 451 (56.4%) | Consistently Positive | 0.695 | 0.073 | 4.704 |
- Sentiment Level: 59.9% (Primary factor)
- Sentiment Change: 48.5% (Secondary factor)
- User Activity: 2.9% (Moderate factor)
- Volatility: 0.1% (Minor factor)
- Python 3.8+ - Primary development language
- Pandas - Data manipulation and analysis
- NumPy - Numerical computing
- Matplotlib/Seaborn - Data visualization
- Scikit-learn - Machine learning algorithms
- VADER - Sentiment analysis toolkit
- Correlation Analysis - Cross-sectional and time-series
- Regression Modeling - Linear and piecewise regression
- Clustering - K-means with optimal cluster selection
- Statistical Testing - t-tests, Chow tests, effect size analysis
- Threshold Detection - Sliding window analysis
- Rolling Window Analysis - Temporal pattern detection
- Piecewise Regression - Threshold effect modeling
- Feature Engineering - Multi-dimensional user profiling
- Memory Optimization - Custom algorithms for large datasets
Python 3.8+
pip or conda package manager
# Clone the repository
git clone https://github.com/yourusername/amazon-sentiment-analysis.git
cd amazon-sentiment-analysis
# Install dependencies
pip install -r requirements.txt
# Download sample data (if not included)
# Follow instructions in data/README.md
# Basic sentiment analysis
python code/01_data_preprocessing.py
# Run threshold analysis
python code/06_threshold_analysis.py
# Generate visualizations
python code/create_visualizations.py
- Cross-sectional correlation: r = 0.61
- Time-series correlation: r = 0.54
- Threshold detection: Optimal at -0.483
- Statistical significance: p < 0.001
- Rolling window patterns: 2, 3, 5-period windows
- Temporal stability: 50% positive trends in 5-period windows
- Volatility patterns: Increasing with window size
- ROI Coefficient: 1.72 (direct business value)
- User Segmentation: 5 actionable clusters
- Prediction Accuracy: Rยฒ = 0.35-0.378
- Large-scale data preprocessing (701K reviews)
- VADER sentiment analysis implementation
- Basic statistical analysis and correlation
- Time-series analysis across user timelines
- Statistical modeling with multiple validation
- User stratification and sampling strategies
- Professional-grade visualizations
- K-means clustering with optimal selection
- Comprehensive statistical testing
- Weber's Law application to digital behavior
- Threshold effect analysis and validation
- Theoretical framework development
- Memory-efficient algorithms for large-scale analysis
- Multi-validation framework ensuring robustness
- Sliding threshold detection for non-linear effects
- First application of Weber's Law to digital sentiment
- Threshold effect discovery in user behavior
- Psychophysical principles in e-commerce analytics
- User segmentation for targeted marketing
- Sentiment monitoring systems
- Predictive modeling for customer behavior
Metric | Value | Interpretation |
---|---|---|
Sample Size | 631,986 users | Large-scale analysis |
Effect Size | Rยฒ = 0.35 | Large effect (Cohen's standards) |
Correlation | r = 0.61 | Strong relationship |
Threshold Impact | 10.7x difference | Significant non-linearity |
Cluster Validity | 5 distinct groups | Clear segmentation |
- Deep learning sentiment models (BERT, RoBERTa)
- Real-time sentiment monitoring system
- Cross-platform validation (other e-commerce sites)
- Causal inference with instrumental variables
- Temporal dynamics of threshold effects
- Cross-cultural sentiment analysis
- Recommendation system integration
- A/B testing framework
- Customer lifetime value prediction
We welcome contributions! Please see our Contributing Guidelines for details.
- Algorithm optimization
- Additional statistical tests
- Visualization enhancements
- Documentation improvements
If you use this work in your research, please cite:
@software{amazon_sentiment_analysis,
author = {[Your Name]},
title = {Amazon Sentiment Analysis: Large-Scale User Behavior Research},
year = {2024},
url = {https://github.com/yourusername/amazon-sentiment-analysis}
}
This project is licensed under the MIT License - see the LICENSE file for details.
- Amazon for providing the review dataset
- VADER sentiment analysis toolkit
- Scientific Python community
- Statistical methodology references
โญ If you find this project useful, please give it a star!
๐ง Contact: [your.email@example.com] ๐ LinkedIn: [Your LinkedIn Profile] ๐ฑ Twitter: [@yourusername]