WINE REVIEWS EDA AND RECOMMENDER SYSTEM

Background

Wine has been intertwined with human culture for a long time, and the wine industry has blossomed along the way. According to Statista, revenue in the Wine segment amounts to US$323,501m in 2020, and the market is expected to grow annually by 9.8% (CAGR 2020-2023). Wine reviews have become correspondingly important, because people do not want to buy blindly. This analysis therefore looks at trends in reviewed wines and at how a company can benefit from them.

Questions

The questions I will try to answer are:

Which are the most reviewed country and most reviewed variety?
Is there any relationship between the price and points received?
What are some characteristics of wine, country-wise?
What are some common terms appearing in the lowest-rated and highest-rated wines?
Can we create a recommender system?
What should a company keep in mind to get good reviews?
Is there any relationship between points and any other attributes?
Which variety of grape will be best to make wine?
Referring to which reviewer will be beneficial?
Should we go with what is familiar or with something less standard in terms of variety?

OK, so let's get into it.

Data

The dataset consists of 129971 rows and 13 columns. It was scraped from a famous wine magazine named Winemag. It lists different wines with their names, province, taster's name, variety, points received, price, and other variables.

Here are a few examples of descriptions:

"Fragrant notes of tangerine and yuzu peel abound on this citrusy dry Riesling. The palate is cutting and fresh, full of juicy white grapefruit and lime flavors. Light-bodied yet satisfyingly thirst-quenching, it finishes long with invigorating minerality."

Von Schleinitz 2015 Apollo Dry Riesling (Mosel)

"An earthy, nutty aroma and flavor come through the intense sweetness and full body of this dessert-style wine. It goes for earthy complexity rather than obvious fruit flavors, and tastes high in sugar and alcohol."

Terre Rouge 2013 Vin Doux Naturel Muscat Blanc à Petits Grains (Shenandoah Valley (CA))

The attributes given are:

Title: Name of wine
Variety: Type of grape that is used in the wine.
Country: Country of origin of the wine.
Province: The province or state in which the wine was produced. The specificity of the areas ranges widely.
Region 1 and 2: More specific information about where the wine was produced.
Price: The cost of the wine.
Points: The rating of the wine, ranging from 80 to 100.
Taster Name: Name of the reviewer who reviewed the wine.

Data Cleaning

Missing Values

The dataset had missing values, most heavily in 'region_2'.

Dealing With Missing Values

I dropped the unwanted columns, namely:

Unnamed: 0
designation
region_1
region_2
taster_twitter_handle

For the price column, I used the median of the column to fill the null values.
I dropped the remaining nulls, which were in taster_name.
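
The cleaning steps above could be sketched in pandas roughly as follows. This is a minimal sketch, not the notebook's actual code: the toy `raw` frame and the `clean_reviews` helper are hypothetical, and only the column names are taken from the dataset.

```python
import pandas as pd

# Columns dropped in the cleaning step described above.
UNWANTED = ["Unnamed: 0", "designation", "region_1", "region_2",
            "taster_twitter_handle"]


def clean_reviews(df: pd.DataFrame) -> pd.DataFrame:
    """Drop unwanted columns, fill missing prices, drop remaining nulls."""
    # errors="ignore" lets the sketch run even if a column is absent.
    out = df.drop(columns=UNWANTED, errors="ignore").copy()
    # Fill missing prices with the column median.
    out["price"] = out["price"].fillna(out["price"].median())
    # Drop rows that still contain nulls (e.g. a missing taster_name).
    return out.dropna().reset_index(drop=True)


# Hypothetical toy data standing in for the real CSV.
raw = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "country": ["US", "France", "Italy", "Spain"],
    "price": [20.0, None, 40.0, 10.0],
    "points": [88, 92, 90, 85],
    "region_2": [None, None, None, None],
    "taster_name": ["A", "B", None, "C"],
})
cleaned = clean_reviews(raw)
```

Filling the price before calling `dropna` matters here: reversing the order would discard every row with a missing price instead of imputing it.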

Outliers

To look for outliers, I plotted a boxplot for price and points.

Price

We can see that there are a lot of outliers. But they are not impossible values and may help in further analysis, so I did not drop them.

Points

Here we can see two outliers, one at ~98 and one at 100. But these are not impossible values either: there may well be wines that scored 100, so I have not dropped them.
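
A boxplot usually marks as outliers the values lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule directly, assuming the default whisker setting was used for the plots; `iqr_outliers` and the sample prices are hypothetical and not taken from the notebook.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values a standard boxplot would draw as outliers."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Made-up prices: one very expensive bottle among ordinary ones.
prices = pd.Series([10, 12, 15, 18, 20, 22, 25, 30, 500])
flagged = iqr_outliers(prices)
```

Flagging a value this way only says it is far from the bulk of the data. Whether to drop it is a separate judgment, which is why the expensive wines are kept here.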

Exploratory Data Analysis

Let us look at a brief description of the features:

country: 43 distinct values
description: 94984 distinct values
points: 21 distinct values
price: 381 distinct values
province: 420 distinct values
taster_name: 19 distinct values
title: 94090 distinct values
variety: 664 distinct values
winery: 14559 distinct values
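
Counts like the ones above can be obtained with `DataFrame.nunique()`. A minimal sketch on a hypothetical toy frame (the real figures come from the full dataset):

```python
import pandas as pd

# Hypothetical toy data; column names follow the dataset.
reviews = pd.DataFrame({
    "country": ["US", "US", "France", "Italy"],
    "points": [88, 90, 88, 95],
    "variety": ["Pinot Noir", "Merlot", "Pinot Noir", "Pinot Noir"],
})

# One distinct-value count per column.
distinct_counts = reviews.nunique()
```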

Country Feature

Analyzing the country feature, I found that the USA has been the most reviewed country.

In terms of price, France has both the most expensive and the cheapest wines.

In terms of point distribution, England, India, and Austria appear at the top, but this seemed biased because they have far fewer wines in the dataset than other countries. So I plotted a strip plot, and the results were different: Italy, France, the US, Portugal, and Spain were the top scorers once the number of wines was taken into account.
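
The small-sample bias can also be checked numerically. The sketch below is an alternative to the strip plot, not the method used in the notebook: it ranks countries by mean points but keeps only those with a minimum number of reviews. The helper name, the threshold, and the toy data are all assumptions.

```python
import pandas as pd


def mean_points_by_country(df: pd.DataFrame, min_reviews: int) -> pd.DataFrame:
    """Mean points per country, restricted to countries with enough reviews."""
    stats = df.groupby("country")["points"].agg(
        mean_points="mean", n_reviews="count"
    )
    stats = stats[stats["n_reviews"] >= min_reviews]
    return stats.sort_values("mean_points", ascending=False)


# Hypothetical toy data: England has a single, highly rated review.
reviews = pd.DataFrame({
    "country": ["England", "US", "US", "US", "Italy", "Italy", "Italy"],
    "points": [95, 88, 90, 92, 91, 93, 92],
})
all_countries = mean_points_by_country(reviews, min_reviews=1)
well_reviewed = mean_points_by_country(reviews, min_reviews=3)
```

With no threshold England tops the ranking on the strength of one review. With a threshold of three it drops out and Italy leads.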

The most expensive wines of the different countries were,

Variety Feature

The most common variety was Pinot Noir.

The most expensive wines were made of Bordeaux-style Red Blend, followed by Pinot Noir.

The highest-rated wines were made of Port, Prugnolo Gentile, Merlot, etc.

To find out whether any varieties both scored well and were cheap, I took the intersection of the two groups. The result was,

Thus we can say that Merlot, Cabernet Sauvignon, Chardonnay, Portuguese Red, Syrah, and Shiraz are good choices for wine.
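
One way to compute such an intersection is sketched below. How "good points" and "cheap" were defined in the notebook is not stated here, so the top-N cutoff on per-variety means is an assumption, as are the `good_and_cheap` helper and the toy numbers.

```python
import pandas as pd


def good_and_cheap(df: pd.DataFrame, top_n: int) -> set:
    """Varieties that are in both the top-N by points and the N cheapest."""
    by_variety = df.groupby("variety").agg(
        points=("points", "mean"), price=("price", "mean")
    )
    best = set(by_variety.nlargest(top_n, "points").index)
    cheapest = set(by_variety.nsmallest(top_n, "price").index)
    return best & cheapest


# Hypothetical toy data, one row per variety for brevity.
reviews = pd.DataFrame({
    "variety": ["Merlot", "Port", "Syrah", "Gamay"],
    "points": [92, 95, 91, 84],
    "price": [15.0, 80.0, 12.0, 10.0],
})
picks = good_and_cheap(reviews, top_n=3)
```

In this toy example Port scores highest but is too expensive, and Gamay is cheapest but scores too low, so only Merlot and Syrah remain.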

Taster Feature

The most frequent taster in the dataset was Roger Voss, who was 10,000 wines ahead of Michael Schachner in second place.

The next step was to see whether any tasters tended to give higher or lower points than others. After plotting a box plot and finding the range, I could see that almost everyone gave points in the same range. A few tasters did have lower values, but that could be because they reviewed fewer wines than the others.
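
The range per taster can be tabulated alongside the review count, which makes it easier to tell a genuinely narrow scorer from one who simply reviewed few wines. A minimal sketch with hypothetical tasters and scores:

```python
import pandas as pd

# Hypothetical toy data; taster names are placeholders.
reviews = pd.DataFrame({
    "taster_name": ["A", "A", "A", "B", "B", "C"],
    "points": [82, 90, 97, 85, 95, 88],
})

# Lowest score, highest score, and number of reviews per taster.
summary = reviews.groupby("taster_name")["points"].agg(["min", "max", "count"])
summary["range"] = summary["max"] - summary["min"]
```

Taster C has a range of zero here, but with a single review that says nothing about how they score in general.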

Description Feature

To understand the descriptions of the various wines, I plotted their wordclouds.

The wordcloud of lowest rated wines

We can see words like bitter, burnt, sour, etc.

The wordcloud of highest rated wines

We can see words like magnificent and full body. But the words age, vintage, and aged can be seen more than once. Hence we can conclude that age is an important attribute for wine.

The wordcloud of expensive wines

This seems to be a mix of all words.

The wordcloud of cheap wines

This also looks like a mix of all words.
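
A wordcloud is a visual rendering of term frequencies, so the same comparison can be made with plain counts. The sketch below uses only the standard library rather than a wordcloud package. The stopword list, the `top_terms` helper, and the two sample descriptions are assumptions for illustration.

```python
import re
from collections import Counter

# A tiny, hand-picked stopword list; a real analysis would use a fuller one.
STOPWORDS = {"a", "and", "the", "of", "with", "this", "is", "on", "it",
             "in", "wine"}


def top_terms(descriptions, n=3):
    """Return the n most frequent non-stopword terms across descriptions."""
    counts = Counter()
    for text in descriptions:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)


# Made-up descriptions in the style of low-rated reviews.
low_rated = [
    "Bitter and burnt on the finish.",
    "Sour, bitter fruit with a burnt edge.",
]
common = top_terms(low_rated, n=2)
```

Running the same function over the highest-rated and lowest-rated subsets and comparing the two lists gives a numeric counterpart to the wordclouds above.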
