I analyzed the best-selling books dataset (from Kaggle) to extract meaningful insights and inform data-driven decisions. The tools, steps, and findings from the analysis are summarized below.
Tools used:
- Jupyter Notebook (Code and Markdown)
- Draw.io (To create the flowchart)
- Data cleaning: Handled missing values and addressed data anomalies to prepare the dataset for analysis.
- Exploratory Data Analysis (EDA):
  - Performed preliminary data analysis to discover valuable information regarding book sales patterns.
  - Used data visualization methods to identify essential attributes, including book title, author, genre, publication year, and metrics related to sales.
- Top-selling authors: Identified the authors with the highest number of best-selling books. J.K. Rowling emerged as the top-selling author, with 500 million in sales across her books.
- Top-selling books: Identified the books with the highest sales. "A Tale of Two Cities" stood out as the top performer, with 200 million in sales.
- Genre Analysis: Determined which genre performs best in terms of sales. Fantasy outperforms the other genres.
- Sales distribution by language: Investigated the percentage distribution of sales by language. English-language books consistently achieve the highest sales figures.
- Sales Analysis: Explored the metrics related to sales, such as total sales by book and genre.
- Correlation Analysis: Examined the correlation between sales and publication year. The approximate sales range falls between 22 and 50 million units, and sales remained consistently high from 1950 to 2000.
- Visualization: Used data visualization techniques, such as bar charts, histograms, scatter plots, and pie charts, to present findings effectively.
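
The language-share and correlation steps listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names (`language`, `year`, `sales_millions`) and the toy rows are assumptions made up for the example, and the real Kaggle dataset may use different names and values.

```python
import pandas as pd

# Toy rows with made-up numbers, only to make the sketch runnable.
# Column names are assumptions, not the Kaggle dataset's actual schema.
books = pd.DataFrame(
    {
        "language": ["English", "English", "French", "Chinese"],
        "year": [1950, 1990, 1970, 2000],
        "sales_millions": [60.0, 20.0, 15.0, 5.0],
    }
)


def sales_share_by_language(df):
    """Return each language's share of total sales, in percent."""
    totals = df.groupby("language")["sales_millions"].sum()
    return (totals / totals.sum() * 100).sort_values(ascending=False)


def sales_year_correlation(df):
    """Return the Pearson correlation between publication year and sales."""
    return df["year"].corr(df["sales_millions"])


share = sales_share_by_language(books)
corr = sales_year_correlation(books)

print(share)
print(f"Pearson r (year vs. sales): {corr:.2f}")
```

The percentage series returned by `sales_share_by_language` could be passed directly to a pie or bar chart, and the correlation value complements a scatter plot of sales against year.
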
I used the following Python libraries and features for data analysis:
- Pandas: For data manipulation and cleaning. The key functions include `read_csv` (for loading the dataset), `head` (for viewing the first few rows), and functions for filtering, aggregating, and transforming data.
- NumPy: For calculations and statistical analysis of the data.
- Matplotlib and Seaborn: For data visualization. Created various types of plots, such as bar charts, histograms, scatter plots, pie charts, and line charts to visualize trends and patterns in the data.
- Regular Expressions (re module): For removing stray characters from book titles.
- Apply Functions (Pandas): For applying custom functions to the Book column.
- GroupBy (Pandas): For aggregating and summarizing data, such as finding the total sales per genre or author.
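
As a rough sketch of how the regular-expression cleanup, `apply`, and `groupby` steps fit together, consider the example below. The `clean_title` helper, the regex patterns, the column names (`Book`, `Genre`, `Sales`), and the toy rows are all hypothetical; the notebook's actual patterns and schema may differ.

```python
import re

import pandas as pd

# Toy rows with made-up titles and numbers; the column names "Book",
# "Genre", and "Sales" are assumptions for illustration only.
books = pd.DataFrame(
    {
        "Book": ["First Title[a]", "Second  Title **", "  Third Title##", "Fourth Title"],
        "Genre": ["Fantasy", "Fantasy", "Mystery", "Mystery"],
        "Sales": [10.0, 5.0, 3.0, 1.0],
    }
)


def clean_title(title):
    """Drop bracketed notes and stray symbols, then tidy whitespace."""
    title = re.sub(r"\[.*?\]", "", title)          # footnote markers such as [a]
    title = re.sub(r"[^\w\s'.,:!?&-]", "", title)  # remaining stray symbols
    return re.sub(r"\s+", " ", title).strip()      # collapse repeated spaces


# Apply the custom function to every value in the Book column.
books["Book"] = books["Book"].apply(clean_title)

# Aggregate total sales per genre, largest first.
genre_totals = books.groupby("Genre")["Sales"].sum().sort_values(ascending=False)

print(books["Book"].tolist())
print(genre_totals)
```

Grouping by an author column instead of `Genre` would give total sales per author in the same way.
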
I used technical writing principles to document data analysis steps and explain the project workflow.
- Code documentation: Explained the Python code so that others can reproduce and use the code when creating a data-analysis project. I included comments at relevant points within the code to clarify the rationale behind the logic.