
Creating a baseline recommendation system to classify wines based on their description and points. KNN technique with KD Tree algorithm was used.


ipshitag/Wine-Reviews-EDA-and-Recommender-System


WINE REVIEWS EDA AND RECOMMENDER SYSTEM


Background

Wine has been intertwined with human culture for a long time, and the wine industry has blossomed along the way. According to Statista, revenue in the Wine segment amounts to US$323,501m in 2020, and the market is expected to grow annually by 9.8% (CAGR 2020-2023). Wine reviews have become correspondingly important, because people don't want to spend money blindly. This analysis therefore looks at trends in the reviewed wines and at how a company can benefit from them.

Questions

The questions I will try to answer are:

  1. Which are the most reviewed country and most reviewed variety?
  2. Is there any relationship between the price and points received?
  3. What are some characteristics of wine, country-wise?
  4. What are some common terms appearing in the lowest-rated and highest-rated wines?
  5. Can we create a recommender system?
  6. What should a company keep in mind to get good reviews?
  7. Is there any relationship between points and any other attributes?
  8. Which variety of grape will be best to make wine?
  9. Which reviewer is it most beneficial to refer to?
  10. Should we go with what is familiar or with something less standard in terms of variety?

OK, so let's get into it.

Data

The dataset consists of 129,971 rows and 13 columns. It was scraped from Winemag, a well-known wine magazine, and lists different wines along with their names, provinces, tasters' names, varieties, points, prices, and other variables.

Here are a few examples of descriptions:

"Fragrant notes of tangerine and yuzu peel abound on this citrusy dry Riesling. The palate is cutting and fresh, full of juicy white grapefruit and lime flavors. Light-bodied yet satisfyingly thirst-quenching, it finishes long with invigorating minerality."

  • Von Schleinitz 2015 Apollo Dry Riesling (Mosel)

"An earthy, nutty aroma and flavor come through the intense sweetness and full body of this dessert-style wine. It goes for earthy complexity rather than obvious fruit flavors, and tastes high in sugar and alcohol."

  • Terre Rouge 2013 Vin Doux Naturel Muscat Blanc à Petits Grains (Shenandoah Valley (CA))

The attributes given are:

  1. Title: Name of wine
  2. Variety: Type of grape that is used in the wine.
  3. Country: Country of origin of the wine.
  4. Province: The province or state within the country in which the wine was produced. The specificity of the areas ranged widely.
  5. Region 1 and 2: More specific information about where the wine was produced.
  6. Price: The cost of the wine.
  7. Points: The rating of the wine, ranging from 80 to 100.
  8. Taster Name: Name of the reviewer who reviewed the wine

Data Cleaning

Missing Values

The dataset had missing values, most heavily in 'region_2'.

Missing values

Dealing With Missing Values

  1. I dropped the unwanted columns, namely:
  • Unnamed: 0
  • designation
  • region_1
  • region_2
  • taster_twitter_handle
  2. For the price column, I used the median of the column to fill the null values.
  3. I dropped the remaining nulls, which were in taster_name.
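
The cleaning steps above could be sketched in pandas roughly as follows. This is a minimal illustration, not the repository's actual code: the function name `clean_wine_data` and the tiny sample frame are made up for the example, and the column names are assumed to match the dataset's CSV headers.

```python
import numpy as np
import pandas as pd

# Columns the README lists as dropped (names assumed to match the CSV headers).
DROPPED_COLUMNS = [
    "Unnamed: 0", "designation", "region_1", "region_2", "taster_twitter_handle",
]


def clean_wine_data(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the three cleaning steps to a copy of df (hypothetical helper)."""
    # 1. Drop the unwanted columns; errors="ignore" skips any that are absent.
    cleaned = df.drop(columns=DROPPED_COLUMNS, errors="ignore")
    # 2. Fill missing prices with the median price.
    cleaned["price"] = cleaned["price"].fillna(cleaned["price"].median())
    # 3. Drop the rows that still contain nulls (e.g. a missing taster_name).
    return cleaned.dropna().reset_index(drop=True)


# Made-up sample rows standing in for the real data.
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "country": ["US", "France", "Italy", "Spain"],
    "designation": [None, "Reserve", None, None],
    "points": [87, 90, 85, 88],
    "price": [20.0, np.nan, 30.0, 10.0],
    "region_2": [None, None, None, None],
    "taster_name": ["Roger Voss", "Roger Voss", None, "Michael Schachner"],
})
cleaned_sample = clean_wine_data(sample)
print(cleaned_sample)
```

Filling the price before calling `dropna()` matters here: otherwise every row with a missing price would be discarded rather than imputed.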

Outliers

To look for outliers, I plotted boxplots of price and points.

Price

Outlier in price

We can see that there are a lot of outliers. But they are not impossible values and may be useful later in the analysis, so I did not drop them.

Points

Outlier in points

Here we can see two outliers, one at ~98 and one at 100. They too are not impossible values: some wines may well have scored 100, so I have not dropped them either.
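
For reference, the points a boxplot draws as outliers are conventionally those lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below reproduces that rule numerically; the helper `iqr_outliers` and the price values are invented for illustration, and the 1.5 factor is assumed to be the plotting default rather than something the README states.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, whisker: float = 1.5) -> pd.Series:
    """Return the values lying outside the boxplot whiskers (hypothetical helper)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return values[(values < lower) | (values > upper)]


# Made-up prices: one very expensive bottle among ordinary ones.
prices = pd.Series([10, 12, 15, 18, 20, 22, 25, 30, 500], name="price")
flagged = iqr_outliers(prices)
print(flagged.tolist())
```

Flagging values this way only identifies them; whether to drop them is a separate judgment, and here they were kept because they are plausible prices and scores.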

Exploratory Data Analysis

Here is a brief description of the features:

  • country: 43 distinct values
  • description: 94984 distinct values
  • points: 21 distinct values
  • price: 381 distinct values
  • province: 420 distinct values
  • taster_name: 19 distinct values
  • title: 94090 distinct values
  • variety: 664 distinct values
  • winery: 14559 distinct values
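
Counts like these can be obtained with `DataFrame.nunique()`. The snippet below is a small sketch on made-up rows, so its numbers are not the ones listed above.

```python
import pandas as pd

# Made-up rows; the real dataset has many more columns and values.
sample = pd.DataFrame({
    "country": ["US", "US", "France", "Italy"],
    "points": [87, 90, 87, 88],
    "variety": ["Pinot Noir", "Merlot", "Pinot Noir", "Pinot Noir"],
})

# Number of distinct (non-null) values per column.
distinct_counts = sample.nunique()
print(distinct_counts)
```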

Country Feature

Analyzing the country feature, I found that the USA has been the most reviewed country.

Highest reviewed country

In terms of price, France has both the most expensive and the cheapest wines.

Country X Price

In terms of point distribution, England, India, and Austria come out on top, but this seemed biased because they account for far fewer wines than the other countries. So I plotted a strip plot, and the results were different: Italy, France, the US, Portugal, and Spain were the top scorers once the number of wines was taken into account.

Country X Point

Country X Point
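
One way to make this bias visible without a plot is to put the number of reviews next to the mean score for each country. The sketch below assumes that approach; the helper `country_point_summary`, the `min_reviews` threshold, and the sample scores are all invented for illustration and are not taken from the repository.

```python
import pandas as pd


def country_point_summary(df: pd.DataFrame, min_reviews: int = 1) -> pd.DataFrame:
    """Mean points and review count per country, best mean first (hypothetical helper)."""
    summary = (
        df.groupby("country")["points"]
        .agg(mean_points="mean", n_reviews="count")
        .sort_values("mean_points", ascending=False)
    )
    # Keep only countries with enough reviews for the mean to be meaningful.
    return summary[summary["n_reviews"] >= min_reviews]


# Made-up scores: one country with a single, very high review.
sample = pd.DataFrame({
    "country": ["England", "US", "US", "US", "France", "France"],
    "points": [95, 88, 92, 90, 91, 93],
})
all_countries = country_point_summary(sample)
well_covered = country_point_summary(sample, min_reviews=2)
print(all_countries)
print(well_covered)
```

In this toy sample the single-review country tops the unfiltered ranking and disappears once a minimum review count is required, which is the same effect described above.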

The most expensive wines of the different countries were:

Most expensive

Variety Feature

The most common variety was Pinot Noir.

Most reviewed variety

The most expensive wines were made from Bordeaux-style Red Blend, followed by Pinot Noir.

most expensive variety

The highest-rated wines were made from Port, Prugnolo Gentile, Merlot, etc.

Highest Rated wine

To find out whether any varieties both scored well and were cheap, I took the intersection of the two groups. The result was:

best and cheap

Thus we can say that Merlot, Cabernet Sauvignon, Chardonnay, Portuguese Red, Syrah, and Shiraz are good choices for wine.
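
The intersection step might look something like the sketch below. How "good points" and "cheap" were actually defined is not stated here, so ranking varieties by mean points and mean price and keeping the top `top_n` of each is an assumption, as are the helper name and the sample numbers.

```python
import pandas as pd


def good_and_cheap_varieties(df: pd.DataFrame, top_n: int = 3) -> list:
    """Varieties among both the top_n best rated and the top_n cheapest (hypothetical helper)."""
    by_variety = df.groupby("variety").agg(
        mean_points=("points", "mean"),
        mean_price=("price", "mean"),
    )
    best = set(by_variety.nlargest(top_n, "mean_points").index)
    cheapest = set(by_variety.nsmallest(top_n, "mean_price").index)
    # Set intersection keeps only varieties present in both groups.
    return sorted(best & cheapest)


# Made-up points and prices, one row per variety for brevity.
sample = pd.DataFrame({
    "variety": ["Port", "Merlot", "Syrah", "Riesling"],
    "points": [95, 92, 90, 84],
    "price": [80.0, 15.0, 12.0, 10.0],
})
picks = good_and_cheap_varieties(sample, top_n=3)
print(picks)
```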

Taster Feature

The most frequent taster in the dataset was Roger Voss, who was 10,000 wines ahead of Michael Schachner in second place.

top 20 taster

The next step was to see whether any taster generally gave higher or lower points than the others. After plotting a box plot and finding the range, I could see that almost everyone gave points in the same range. A few tasters did have lower values, but that could be because they reviewed fewer wines than others.

point distribution
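
The same comparison can be made numerically by summarizing each taster's points alongside how many wines they reviewed. This is only a sketch: the helper `taster_point_ranges`, the taster labels, and the scores are invented for the example.

```python
import pandas as pd


def taster_point_ranges(df: pd.DataFrame) -> pd.DataFrame:
    """Review count and points spread per taster, busiest taster first (hypothetical helper)."""
    stats = df.groupby("taster_name")["points"].agg(["count", "min", "median", "max"])
    stats["range"] = stats["max"] - stats["min"]
    return stats.sort_values("count", ascending=False)


# Made-up reviews from two placeholder tasters.
sample = pd.DataFrame({
    "taster_name": ["Taster A", "Taster A", "Taster A", "Taster B", "Taster B"],
    "points": [85, 90, 95, 88, 89],
})
ranges = taster_point_ranges(sample)
print(ranges)
```

Keeping the `count` column next to the range makes it easier to tell a genuinely strict taster from one who simply reviewed few wines.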

Description Feature

To understand the descriptions of the various wines, I plotted their wordclouds.

The wordcloud of the lowest-rated wines:

desc lowest rated

We can see words like bitter, burnt, sour, etc.

The wordcloud of the highest-rated wines:

desc highest rated

We can see words like magnificent and full body. The words age, vintage, and aged also appear more than once, which suggests that age is an important attribute of a wine.

The wordcloud of expensive wines:

desc expensive

This seems to be a mix of all words.

The wordcloud of cheap wines:

desc cheap

This also looks like a mix of all words.
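
A wordcloud is essentially a picture of word frequencies, so the terms behind it can also be listed directly. The sketch below counts words with the standard library instead of a wordcloud package; the helper `top_terms`, the short stopword list, and the two sample descriptions are made up for illustration.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "and", "a", "of", "with", "this", "is", "on", "wine"}


def top_terms(descriptions, n=3):
    """Return the n most frequent non-stopword terms (hypothetical helper)."""
    counts = Counter()
    for text in descriptions:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(word for word in words if word not in STOPWORDS)
    return counts.most_common(n)


# Made-up descriptions in the style of low-rated reviews.
low_rated = ["Bitter and burnt on the finish.", "Sour, bitter and thin."]
print(top_terms(low_rated, n=1))
```

Running the same function on the descriptions of the highest-rated, expensive, and cheap wines would give a comparable list for each group.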
