Wine has been intertwined with human culture for a very long time, and the wine industry has blossomed along the way. According to Statista, revenue in the Wine segment amounts to US$323,501m in 2020, and the market is expected to grow annually by 9.8% (CAGR 2020-2023). Wine reviews have become correspondingly important, because people don't want to invest blindly. This analysis looks at the trends in reviewed wines and at how a company can benefit from them.
The questions I will try to answer are:
- Which country and which variety are the most reviewed?
- Is there any relationship between the price and points received?
- What are some characteristics of wine, country-wise?
- What are some common terms appearing in the lowest-rated and highest-rated wines?
- Can we create a recommender system?
- What should a company keep in mind to get good reviews?
- Is there any relationship between points and any other attributes?
- Which variety of grape will be best to make wine?
- Referring to which reviewer will be beneficial?
- Should we go with what is familiar or with something less standard in terms of variety?
OK, so let's get into it.
The dataset consists of 129,971 rows and 13 columns. It was scraped from a well-known wine magazine named Winemag. It lists different wines with their names, province, taster's name, variety, points received, price, and other variables.
Here are a few examples of descriptions:
"Fragrant notes of tangerine and yuzu peel abound on this citrusy dry Riesling. The palate is cutting and fresh, full of juicy white grapefruit and lime flavors. Light-bodied yet satisfyingly thirst-quenching, it finishes long with invigorating minerality."
- Von Schleinitz 2015 Apollo Dry Riesling (Mosel)
"An earthy, nutty aroma and flavor come through the intense sweetness and full body of this dessert-style wine. It goes for earthy complexity rather than obvious fruit flavors, and tastes high in sugar and alcohol."
- Terre Rouge 2013 Vin Doux Naturel Muscat Blanc à Petits Grains (Shenandoah Valley (CA))
The attributes given are:
- Title: Name of the wine.
- Variety: Type of grape used in the wine.
- Country: Country of origin of the wine.
- Province: The province or state in which the wine was produced. The specificity of the areas ranges widely.
- Region 1 and 2: More specific information about where the wine was produced.
- Price: The cost of the wine.
- Points: The rating of the wine, ranging from 80 to 100.
- Taster Name: Name of the reviewer who reviewed the wine.
The dataset had many missing values, most heavily in 'region_2'.
- I dropped the unwanted columns, namely:
  - Unnamed: 0
  - designation
  - region_1
  - region_2
  - taster_twitter_handle
- For the price column, I filled the null values with the column median.
- I dropped the remaining rows with null values, which were in taster_name.
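The cleaning steps above can be sketched with pandas roughly as follows. This is a minimal illustration on a tiny made-up table, not the original notebook code; the column names are assumed to match the ones listed above, and the sample values are invented.

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-in for the scraped data (the real file has ~130k rows).
df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "country": ["US", "France", "Italy", "Spain"],
    "designation": [None, "Reserve", None, None],
    "points": [87, 90, 85, 88],
    "price": [20.0, np.nan, 15.0, 40.0],
    "region_1": ["Napa", None, None, None],
    "region_2": [None, None, None, None],
    "taster_name": ["A", "B", None, "C"],
    "taster_twitter_handle": [None, "@b", None, None],
})

# 1. Drop the columns that are not used in the analysis.
unwanted = ["Unnamed: 0", "designation", "region_1", "region_2",
            "taster_twitter_handle"]
clean = df.drop(columns=unwanted)

# 2. Fill missing prices with the median of the price column.
clean["price"] = clean["price"].fillna(clean["price"].median())

# 3. Drop the rows that still contain nulls (here: the missing taster_name).
clean = clean.dropna().reset_index(drop=True)

print(clean)
```

The median is used for price rather than the mean because a few very expensive bottles would pull the mean upward.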
To look for outliers, I plotted box plots of price and points.
For price, we can see that there are a lot of outliers. But they are not impossible values and may help in further analysis, so I did not drop them.
For points, we can see two outliers, one at ~98 and one at 100. These are not impossible values either: there may well be wines that scored 100, so I have not dropped them.
Let us look at a brief description of the features:
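For readers who want the outliers as numbers rather than as dots on a plot, here is a small sketch of the rule a box plot conventionally uses: values more than 1.5 times the interquartile range beyond the quartiles are flagged. The helper name `iqr_outliers` and the sample prices are made up for illustration.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Made-up prices for illustration only.
prices = pd.Series([10, 12, 15, 18, 20, 22, 25, 30, 35, 500])
flagged = iqr_outliers(prices)
print(flagged.tolist())
```

Flagging a value this way only says it is unusual, not that it is wrong, which is why the outliers were kept.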
- country: 43 distinct values
- description: 94984 distinct values
- points: 21 distinct values
- price: 381 distinct values
- province: 420 distinct values
- taster_name: 19 distinct values
- title: 94090 distinct values
- variety: 664 distinct values
- winery: 14559 distinct values
Analyzing the country feature, I found that the US is the most reviewed country.
In terms of price, France has both the most expensive and the cheapest wines.
In terms of point distribution, England, India, and Austria come out on top, but this seemed biased because they have far fewer reviewed wines than the others. So I plotted a strip plot, and the results were different: taking the number of wines into account, Italy, France, the US, Portugal, and Spain were the top scorers.
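The bias described above can also be seen in a table, by putting each country's average score next to its number of reviews. The sketch below is a tabular stand-in for the strip plot, not the original code; the data and the `min_reviews` cutoff are invented for illustration.

```python
import pandas as pd

# Made-up reviews: "England" has a single high score, the others have several.
reviews = pd.DataFrame({
    "country": ["England", "US", "US", "US", "US", "Italy", "Italy", "Italy"],
    "points": [92, 85, 88, 90, 95, 86, 89, 91],
})

summary = (
    reviews.groupby("country")["points"]
    .agg(mean_points="mean", n_reviews="count", max_points="max")
    .sort_values("mean_points", ascending=False)
)
print(summary)

# Hypothetical cutoff: ignore countries with fewer than 3 reviews.
min_reviews = 3
reliable = summary[summary["n_reviews"] >= min_reviews]
print(reliable)
```

A country with a single review tops the ranking by mean alone, and drops out once a minimum number of reviews is required.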
I also looked at the most expensive wine of each country.
The most common variety was Pinot Noir.
The most expensive wines were made of Bordeaux-style Red Blend, followed by Pinot Noir.
The highest-rated wines were made of Port, Prugnolo Gentile, Merlot, etc.
To see whether any varieties both scored well and were cheap, I took the intersection of the two groups.
From the result, we can say that Merlot, Cabernet Sauvignon, Chardonnay, Portuguese Red, Syrah, and Shiraz are good choices for wine.
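One way to take such an intersection is to build a set of well-rated varieties and a set of cheap varieties, and then keep the varieties that appear in both. The thresholds and the data below are assumptions made for this sketch; the cutoffs actually used in the analysis are not stated.

```python
import pandas as pd

# Made-up reviews for illustration only.
reviews = pd.DataFrame({
    "variety": ["Merlot", "Merlot", "Port", "Port",
                "Syrah", "Syrah", "Riesling", "Riesling"],
    "points": [92, 94, 95, 96, 91, 93, 84, 85],
    "price": [18.0, 22.0, 120.0, 150.0, 25.0, 15.0, 12.0, 14.0],
})

by_variety = reviews.groupby("variety").agg(
    mean_points=("points", "mean"),
    mean_price=("price", "mean"),
)

# Hypothetical thresholds, chosen only to make the example readable.
high_rated = set(by_variety.index[by_variety["mean_points"] >= 90])
cheap = set(by_variety.index[by_variety["mean_price"] <= 30])

# Varieties that are both well rated and cheap.
good_value = sorted(high_rated & cheap)
print(good_value)
```

In this toy data Port scores highest but is too expensive, and Riesling is cheap but scores too low, so neither appears in the result.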
The most frequent taster in the dataset was Roger Voss, who was 10,000 wines ahead of Michael Schachner in second place.
The next step was to see whether any tasters generally gave higher or lower points than the rest. On plotting a box plot and looking at the ranges, I could see that almost everyone gave points in the same range. A few tasters did have lower values, but that could be because they reviewed fewer wines than the others.
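The per-taster comparison can be summarized numerically as well. The sketch below computes each taster's review count and score range on made-up data; the taster names are placeholders, not people from the dataset.

```python
import pandas as pd

# Made-up scores; "Taster C" has reviewed only one wine.
reviews = pd.DataFrame({
    "taster_name": ["Taster A"] * 5 + ["Taster B"] * 4 + ["Taster C"],
    "points": [84, 88, 90, 93, 97, 85, 87, 91, 94, 86],
})

stats = reviews.groupby("taster_name")["points"].agg(
    n_reviews="count", lowest="min", median="median", highest="max"
)
stats["range"] = stats["highest"] - stats["lowest"]
print(stats.sort_values("n_reviews", ascending=False))
```

Keeping the review count next to the range helps here: a taster with very few reviews will naturally show a narrow range, which says little about how strict they are.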
To understand the descriptions of the various wines, I plotted word clouds of them.
The word cloud of the lowest-rated wines
We can see words like bitter, burnt, sour, etc.
The word cloud of the highest-rated wines
We can see words like magnificent and full body. The words age, vintage, and aged also appear more than once, which suggests that age is an important attribute for wine.
The word cloud of expensive wines
This seems to be a mix of all words.
This also looks like a mix of all words.
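A word cloud is essentially a picture of term frequencies, so the same comparison can be made by counting words directly. The sketch below uses only the standard library as a stand-in for the word cloud plots; the stop-word list and the sample descriptions are made up for illustration.

```python
import re
from collections import Counter

# A tiny hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "this", "of", "with",
              "is", "it", "on", "in", "wine"}

def top_terms(descriptions, n=3):
    """Return the n most frequent non-stop-words across the descriptions."""
    counts = Counter()
    for text in descriptions:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

# Made-up descriptions for illustration only.
low_rated = [
    "A bitter and sour wine with burnt notes.",
    "Bitter on the finish, sour in the middle.",
    "Burnt oak and a bitter aftertaste.",
]
print(top_terms(low_rated))
```

Running the same function on the highest-rated and the lowest-rated descriptions and comparing the two lists gives a quick check on what the word clouds show.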