
Performed an analysis on the best-selling books dataset to gain valuable insights about the books and their sales trends.


kalpanapathak16/Data-Analysis-of-best-selling-books-dataset


Data Analysis of best selling books dataset

I performed a comprehensive analysis of the best-selling books dataset (from Kaggle) to extract meaningful insights and support data-driven decisions. This README summarizes the insights derived from that analysis.

Tools used:

  • Jupyter Notebook (Code and Markdown)
  • Draw.io (to create the flowchart)

Summary of the project

  • Data cleaning: Cleaned the data by handling missing values and addressing anomalies to prepare the dataset for analysis.
  • Exploratory Data Analysis (EDA):
    • Performed preliminary data analysis to discover valuable information regarding book sales patterns.
    • Used data visualization methods to identify essential attributes, including book title, author, genre, publication year, and metrics related to sales.
  • Top-selling authors: Identified the authors with the highest number of best-selling books. J. K. Rowling emerged as the top-selling author, with 500 million sales of her books.
  • Top-selling books: Identified the books with the highest sales. "A Tale of Two Cities" stood out as the top performer, with 200 million in sales.
  • Genre Analysis: Determined which genre performs best in terms of sales. The Fantasy genre outperforms the other genres.
  • Sales distribution by language: Investigated the percentage distribution of sales by language. English-language books consistently achieve the highest sales figures.
  • Sales Analysis: Explored the metrics related to sales, such as total sales by book and genre.
  • Correlation Analysis: Examined correlations between Sales and Year. The approximate sales range falls between 22 and 50 million units. Sales remained consistently high from 1950 to 2000.
  • Visualization: Used data visualization techniques, such as bar charts, histograms, scatter plots, and pie charts, to present findings effectively.
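The aggregation steps summarized above (top-selling authors, genre totals, and sales share by language) can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names (`Author`, `Genre`, `Language`, `Sales`), the helper functions, and the toy rows are all assumptions, and the values are placeholders rather than figures from the Kaggle dataset.

```python
import pandas as pd

# Toy rows with placeholder values; NOT taken from the Kaggle dataset.
# Column names are assumed for illustration and may differ from the real file.
books = pd.DataFrame({
    "Book": ["Book 1", "Book 2", "Book 3", "Book 4"],
    "Author": ["Author A", "Author A", "Author B", "Author C"],
    "Genre": ["Fantasy", "Fantasy", "Mystery", None],
    "Language": ["English", "English", "French", "English"],
    "Sales": [50.0, 30.0, 20.0, 100.0],  # hypothetical sales in millions
})


def total_sales_by(df, column):
    """Sum Sales for each value of `column`, largest total first."""
    return df.groupby(column)["Sales"].sum().sort_values(ascending=False)


def sales_share_percent(df, column):
    """Express each group's total sales as a percentage of all sales."""
    totals = total_sales_by(df, column)
    return (totals / totals.sum() * 100).round(1)


# One possible way to handle a missing genre: label it explicitly.
cleaned = books.fillna({"Genre": "Unknown"})

author_totals = total_sales_by(cleaned, "Author")
genre_totals = total_sales_by(cleaned, "Genre")
language_share = sales_share_percent(cleaned, "Language")

print(author_totals.head(3))
print(genre_totals)
print(language_share)
```

The resulting Series can be passed straight to a plotting call such as `author_totals.plot(kind="bar")` or `language_share.plot(kind="pie")` to produce bar and pie charts like those mentioned above.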

Python functions and features

I used the following Python functions and features for data analysis:

  • Pandas: For data manipulation and cleaning. The key functions include read_csv (for loading the dataset), head (for viewing the first few rows), and functions for filtering, aggregating, and transforming data.
  • NumPy: For calculations and statistical analysis of the data.
  • Matplotlib and Seaborn: For data visualization. Created various types of plots, such as bar charts, histograms, scatter plots, pie charts, and line charts to visualize trends and patterns in the data.
  • Regular Expressions (re module): For removing stray characters from book titles.
  • Apply Functions (Pandas): For applying custom functions to the Book column.
  • GroupBy (Pandas): For aggregating and summarizing data, such as finding the total sales per genre or author.
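As a rough sketch of how the `re` module and `apply` can work together on the Book column: the README does not say which characters were removed, so the patterns below (bracketed markers such as `[1]` and a few stray symbols) and the helper name `clean_title` are assumptions chosen for illustration, and the sample titles are made up.

```python
import re

import pandas as pd


def clean_title(title):
    """Remove bracketed markers and stray symbols, then tidy whitespace."""
    # Drop bracketed fragments such as "[1]" (assumed to be leftover markers).
    title = re.sub(r"\[[^\]]*\]", "", title)
    # Drop characters other than word characters, whitespace, and common punctuation.
    title = re.sub(r"[^\w\s'&:,.\-]", "", title)
    # Collapse repeated whitespace and trim both ends.
    return re.sub(r"\s+", " ", title).strip()


# Hypothetical messy titles, not rows from the dataset.
titles = pd.DataFrame({"Book": ["  Example Title[1] ", "Another ## Title**"]})

# Apply the custom function to every value in the Book column.
titles["Book"] = titles["Book"].apply(clean_title)
print(titles)
```

Keeping the cleaning logic in a small named function makes it easy to test on individual strings before applying it to the whole column.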

Documentation

I used technical writing principles to document data analysis steps and explain the project workflow.

  • Code documentation: Explained the Python code so that others can reproduce and use the code when creating a data-analysis project. I included comments at relevant points within the code to clarify the rationale behind the logic.
