This project presents an anime recommendation system developed as a course project. The system uses a hybrid filtering approach that combines content-based analysis with user-centric features to provide personalized, high-quality anime recommendations. The entire research process, from data exploration to model implementation, is documented in a series of Jupyter notebooks.
The project is divided into two phases: an in-depth exploratory data analysis (EDA) and the development of a hybrid recommendation model.
A comprehensive EDA was conducted to understand the underlying patterns within the anime and user datasets. This phase addressed several key research questions (RQs):
- RQ1 & RQ2: Analysis of the distribution of anime ratings and their correlation with fundamental features such as genre, type, premiere year, and episode count. Key findings indicate that specific genres and anime formats (e.g., TV, OVA) exhibit distinct rating patterns.
- RQ3 & RQ4: Investigation into user rating behavior, exploring rating distributions and potential biases related to user demographics (gender, age) and viewing habits (e.g., total days watched, number of completed series). The analysis revealed discernible differences in rating tendencies across different user groups.
These insights were instrumental in shaping the feature engineering and filtering criteria for the recommendation model.
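As a rough illustration of the kind of grouping behind RQ1 and RQ2, the sketch below computes the mean rating and sample size per anime format. The notebooks use Pandas for this; the version here is kept dependency-free. The column name `Type`, the helper `mean_score_by`, and the sample rows are all invented for illustration and are not taken from the dataset.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by(records, key):
    """Return {group: (mean Score, number of rated titles)} for a grouping column."""
    groups = defaultdict(list)
    for row in records:
        # Skip unrated titles instead of counting them as zero.
        if row.get("Score") is not None:
            groups[row[key]].append(row["Score"])
    return {group: (round(mean(scores), 2), len(scores))
            for group, scores in groups.items()}


# Invented sample rows; real rows would come from the anime dataset.
sample = [
    {"Type": "TV", "Score": 8.0},
    {"Type": "TV", "Score": 7.0},
    {"Type": "OVA", "Score": 6.5},
    {"Type": "OVA", "Score": None},
]
by_type = mean_score_by(sample, "Type")
print(by_type)
```

Reporting the group size next to the mean makes it easier to judge whether a difference between formats rests on enough titles to be meaningful.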
To address the primary research objective (RQ5: How to effectively recommend high-quality animations?), a hybrid recommendation model was implemented. The model architecture integrates content-based filtering with collaborative and quality-based elements.
- Content-Based Filtering:
  - The core of the model relies on semantic understanding of anime content. We used a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model (`bert-base-uncased`) to generate dense vector embeddings from the textual synopses of anime.
  - Cosine similarity is then used to quantify the content-based similarity between different anime based on these embeddings.
- Hybridization and Personalization:
  - User Demographics: The model incorporates user `age` and `gender` for personalization. An age-based filter excludes mature (R-rated) content for younger users.
  - Collaborative Element: Gender-based preferences are integrated by weighting similarity scores. The system adjusts recommendations based on the historical average ratings given by a user's gender to specific anime genres.
  - Quality-Based Ranking: To ensure the quality of recommendations, a final `weighted_score` is calculated for each candidate anime. This score is a composite of its official `Score`, `Favorites` count, and `Popularity` rank, effectively prioritizing critically acclaimed and popular titles.
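The content-based step in the list above can be sketched roughly as follows. This is a minimal sketch, not the notebook's code: the function names (`embed_synopses`, `cosine_similarity`, `most_similar`) are hypothetical, and mean pooling over token vectors is an assumption, since the pooling strategy is not stated here. The similarity helpers are written in plain Python so they run without PyTorch installed; the notebook itself likely relies on Scikit-learn for this.

```python
import math


def embed_synopses(synopses, model_dir="src/bert-base-uncased"):
    """Return one embedding (a list of floats) per synopsis in the input list.

    Assumption: token vectors are mean-pooled; the notebook may instead
    use another strategy such as the [CLS] vector.
    """
    # Imported lazily so the similarity helpers below work without PyTorch.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModel.from_pretrained(model_dir)
    model.eval()
    inputs = tokenizer(synopses, padding=True, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (batch, tokens, hidden)
    # Zero out padding positions before averaging over tokens.
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return pooled.tolist()


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(embeddings, query, k=5):
    """Return up to k (title, similarity) pairs, best first, excluding the query."""
    scores = [(title, cosine_similarity(embeddings[query], vector))
              for title, vector in embeddings.items() if title != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional vectors standing in for real 768-dimensional BERT embeddings.
toy = {"A": [1.0, 0.0, 0.0], "B": [0.9, 0.1, 0.0], "C": [0.0, 1.0, 0.0]}
print(most_similar(toy, "A", k=2))
```

With real data, `embed_synopses` would be run once over all synopses and the resulting vectors cached, since the BERT forward pass dominates the cost.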
The final recommendation list is generated by consolidating candidates, removing duplicates and previously watched anime, and sorting them by the final weighted score.
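One possible reading of the filtering, weighting, and ranking steps is sketched below. Everything specific in it is an assumption rather than the notebook's actual logic: the `Candidate` fields, the rule that scales similarity by a gender group's average genre rating, the age threshold of 18, the min-max normalization, and the 0.5/0.25/0.25 weights are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    anime_id: int
    genres: tuple      # genre names, e.g. ("Action", "Comedy")
    rating: str        # age-rating label, e.g. "PG-13" or "R"
    similarity: float  # content similarity to titles the user liked
    score: float       # official Score, assumed to be on a 0-10 scale
    favorites: int     # Favorites count
    popularity: int    # Popularity rank, where 1 is the most popular


def gender_adjusted_similarity(c, genre_avg, neutral=5.0):
    """Scale similarity by the mean rating (0-10) that the user's gender
    group gave to the candidate's genres (hypothetical weighting rule)."""
    ratings = [genre_avg.get(g, neutral) for g in c.genres] or [neutral]
    return c.similarity * (sum(ratings) / len(ratings)) / 10.0


def _min_max(value, low, high):
    """Map value into [0, 1]; a constant column maps to 0."""
    return (value - low) / (high - low) if high > low else 0.0


def rank_candidates(candidates, watched_ids, age, genre_avg,
                    top_n=5, pool_size=50, adult_age=18,
                    weights=(0.5, 0.25, 0.25)):
    """Return [(anime_id, weighted_score)] sorted best first."""
    unique = {}
    for c in candidates:
        if c.anime_id in watched_ids or c.anime_id in unique:
            continue  # drop watched titles and duplicates
        if age < adult_age and c.rating.startswith("R"):
            continue  # age filter for R-rated content
        unique[c.anime_id] = c

    # Keep the candidates whose gender-adjusted similarity is highest.
    pool = sorted(unique.values(),
                  key=lambda c: gender_adjusted_similarity(c, genre_avg),
                  reverse=True)[:pool_size]
    if not pool:
        return []

    fav_lo, fav_hi = min(c.favorites for c in pool), max(c.favorites for c in pool)
    pop_lo, pop_hi = min(c.popularity for c in pool), max(c.popularity for c in pool)
    w_score, w_fav, w_pop = weights

    def weighted_score(c):
        return (w_score * c.score / 10.0
                + w_fav * _min_max(c.favorites, fav_lo, fav_hi)
                # A lower Popularity rank is better, so invert it.
                + w_pop * (1.0 - _min_max(c.popularity, pop_lo, pop_hi)))

    ranked = sorted(pool, key=weighted_score, reverse=True)[:top_n]
    return [(c.anime_id, round(weighted_score(c), 3)) for c in ranked]


# Invented candidates; the second entry for ID 1 is a deliberate duplicate.
demo = [
    Candidate(1, ("Action",), "PG-13", 0.90, 8.0, 1000, 10),
    Candidate(2, ("Drama",), "R", 0.95, 9.0, 5000, 1),
    Candidate(3, ("Comedy",), "G", 0.50, 7.0, 100, 100),
    Candidate(1, ("Action",), "PG-13", 0.90, 8.0, 1000, 10),
    Candidate(4, ("Action",), "G", 0.80, 9.5, 9000, 2),
]
genre_avg = {"Action": 8.0, "Drama": 7.0, "Comedy": 6.0}
print(rank_candidates(demo, watched_ids={4}, age=15, genre_avg=genre_avg))
```

In this sketch the gender-adjusted similarity only decides which candidates enter the pool, while the quality-based `weighted_score` decides the final order, which matches the description of sorting by the final weighted score.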
The project's source code is organized into several Jupyter notebooks:
- `src/data_exploration_and_RQ1-3.ipynb`: Contains the complete process of data cleaning, preprocessing, and exploratory data analysis corresponding to Research Questions 1-3.
- `src/RQ4.ipynb`: Focuses on the analysis of Research Question 4, exploring the relationship between user behavior metrics and rating patterns.
- `src/anime_recommendation.ipynb`: Implements the core hybrid recommendation system (RQ5), including BERT embedding generation and the final recommendation logic.
- `report/`: Contains the detailed project report in PDF format.
To generate recommendations for a specific user, run the `recommend_anime` function in the `anime_recommendation.ipynb` notebook:

    # Example: generate 5 recommendations for the user with ID 20
    recommend_anime(user_id=20, num_recommendations=5)

Prerequisites:
- Ensure all required libraries specified in the notebooks are installed.
- The `bert-base-uncased` model and tokenizer files should be located in the `src/bert-base-uncased/` directory.
- The datasets should be placed in the `src/data/` directory.
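Before running the notebooks, a quick check along these lines can confirm that the expected directories are in place. It is only a convenience sketch: the helper name `missing_dirs` is hypothetical, and it checks that the two directories exist, not that they contain the right files.

```python
from pathlib import Path

# Directories named in the prerequisites, relative to the repository root.
REQUIRED_DIRS = ("src/bert-base-uncased", "src/data")


def missing_dirs(root=".", required=REQUIRED_DIRS):
    """Return the required directories that do not exist under root."""
    base = Path(root)
    return [name for name in required if not (base / name).is_dir()]


missing = missing_dirs()
if missing:
    print("Missing directories:", ", ".join(missing))
else:
    print("All required directories are present.")
```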
Technologies used:
- Data Manipulation and Analysis: Pandas, NumPy
- Machine Learning and NLP: Scikit-learn, PyTorch, Transformers (Hugging Face)
- Data Visualization: Plotly
- Development Environment: Jupyter Notebook