# Exploring the Red Wine Quality Dataset

knitr::opts_chunk$set(message = FALSE, warning = FALSE, echo = FALSE)

========================================================

# Load all of the packages that you end up using
# in your analysis in this code chunk.

# Notice that the parameter "echo" was set to FALSE for this code chunk.
# This prevents the code from displaying in the knitted HTML output.
# You should set echo=FALSE for all code chunks in your file.

library(ggplot2)
library(dplyr)
library(GGally)
library(scales)
library(memisc)
library(reshape)
library(gridExtra)
# Load the Data
wq <- read.csv('wineQualityReds.csv')

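Introduction

This project explores the red wine quality dataset. The goal is to find out which chemical properties influence the quality of red wine. The data are explored with the statistical software R.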

Univariate Plots Section

Here are some basic summary statistics for the dataset:

# Summary statistics
str(wq)
summary(wq)

Since quality is the main subject of this exploration, here is a separate look at quality:

summary(wq$quality)

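Some initial observations:

  • There are 1,599 observations, each with 13 variables.
  • X appears to be an identifier.
  • quality is the outcome. It is graded on a 0-10 scale by at least 3 experts; the observed values run from 3 to 8, with a mean of 5.6 and a median of 6.
# Convert quality to an ordered factor so that it is easier to plot later
wq$quality <- factor(wq$quality, ordered = TRUE)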

table(wq$quality)

Below are histograms of the 12 variables, to get an intuitive feel for the data:

qplot(wq$quality)
             

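Because the main goal is to analyze what influences wine quality, the first step is to look at how quality itself is distributed.
Quality scores range from 3 to 8, the vast majority are 5 or 6, and the overall shape is roughly normal.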
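The statement that most wines score 5 or 6 can be backed with proportions rather than read off the histogram. The sketch below is only an illustration: `quality_shares` is a hypothetical helper name and the toy scores are made up, not taken from wineQualityReds.csv.

```r
# Share of wines at each quality score, as proportions that sum to 1.
# `quality_shares` is a hypothetical helper, not part of the original analysis.
quality_shares <- function(quality) {
  prop.table(table(quality))
}

# Toy scores, made up for illustration only.
toy_quality <- c(3, 5, 5, 5, 6, 6, 6, 6, 7, 8)
shares <- quality_shares(toy_quality)
print(round(shares, 2))

# Combined share of the two middle scores, 5 and 6.
middle_share <- sum(shares[c("5", "6")])
print(middle_share)
```

With the real data the call would presumably be `quality_shares(wq$quality)`, which works whether quality is numeric or a factor.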

qplot(wq$fixed.acidity)

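The fixed.acidity (fixed acidity) histogram is also roughly normal, but it differs noticeably from the quality histogram: most values are bunched on the left with a tail to the right (right-skewed).
It may have some relationship with quality.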

qplot(wq$volatile.acidity)

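The volatile.acidity (volatile acidity) histogram is also roughly normal, but it differs noticeably from the quality histogram and is right-skewed.
It is probably closely related to fixed acidity and may also be related to quality; this needs further investigation.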

qplot(wq$citric.acid)

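The citric.acid (citric acid) histogram shows no clear pattern. It is heavily concentrated on the left and has gaps, and it looks quite different from the other two acid variables, fixed acidity and volatile acidity.
Some of the data may be missing or unreliable.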

qplot(wq$residual.sugar)

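The residual.sugar (residual sugar) values are tightly concentrated, mostly between 1 and 3, with some large outliers. The shape differs a lot from the quality histogram, so the relationship is probably weak.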

qplot(wq$chlorides)

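The chlorides values are tightly concentrated, mostly between 0.05 and 0.12, with some large outliers. The shape differs a lot from the quality histogram, so the relationship is probably weak.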

qplot(wq$free.sulfur.dioxide)

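The free.sulfur.dioxide (free sulfur dioxide) values are bunched on the left (right-skewed) and the histogram has gaps. The shape differs a lot from the quality histogram, so the relationship is probably weak.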

qplot(wq$total.sulfur.dioxide)

The total.sulfur.dioxide (total sulfur dioxide) histogram is strongly right-skewed. The shape differs a lot from the quality histogram, so the relationship is probably weak.

qplot(wq$density)

The density histogram is approximately normal and looks very similar to the quality histogram. It deserves a closer look and is probably strongly related to quality.

qplot(wq$pH)

The pH histogram is approximately normal and looks very similar to the quality histogram. It deserves a closer look and is probably strongly related to quality.

qplot(wq$sulphates)

The sulphates histogram is roughly bell-shaped but strongly right-skewed, with a long tail. It probably has some relationship with quality, though not a strong one; this needs further investigation.

qplot(wq$alcohol)

The alcohol histogram is right-skewed and differs a lot from the quality histogram, so the relationship is probably weak.

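Summary:
  • In these histograms, density and pH have shapes very similar to quality, which hints at some relationship.
  • Most of the other histograms are right-skewed.
  • This is only a first guess and needs further investigation.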
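Descriptions such as "right-skewed" can be checked with a number instead of by eye. A minimal sketch, assuming the simple moment-based definition of skewness; `sample_skewness` is a hypothetical helper and both vectors are made-up toy data.

```r
# Moment-based sample skewness: mean cubed deviation divided by the
# cubed (population) standard deviation.
# Positive values indicate a longer right tail, negative a longer left tail.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Made-up toy vectors, not values from the wine data.
symmetric_toy <- c(1, 2, 3, 4, 5)
right_tailed_toy <- c(1, 1, 1, 2, 2, 3, 10)

print(sample_skewness(symmetric_toy))     # zero for a symmetric sample
print(sample_skewness(right_tailed_toy))  # positive for a long right tail
```

Applied to a column such as `wq$residual.sugar`, a clearly positive value would support the visual impression of a long right tail.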

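Univariate Analysis

What is the structure of your dataset?

The dataset has 1,599 observations of 13 variables.
X is a unique identifier.
The other 12 variables are: fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality.

What is/are the main feature(s) of interest in your dataset?

The main variable is quality, which is also the target of this analysis.
Scores can range from 0 to 10; most observations are 5 or 6, and the distribution is close to normal.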

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The histograms show that density and pH have shapes very similar to quality.
The distributions of fixed and volatile acidity, free and total sulfur dioxide, sulphates and alcohol are skewed and long-tailed.

Did you create any new variables from existing variables in the dataset?

A new variable, rating, was created. It groups wines by quality score into "good" (7-10), "average" (5-6) and "bad" (0-4).

wq$rating <- ifelse(wq$quality < 5, 'bad', ifelse(
  wq$quality < 7, 'average', 'good'))

wq$rating <- ordered(wq$rating,
                     levels = c('bad', 'average', 'good'))
summary(wq$rating)

qplot(wq$rating)
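The same three-way grouping can also be written with `cut()`, which states the cut points explicitly. This is an alternative sketch rather than the code used above; `rate_quality` is a hypothetical helper and the scores are toy values.

```r
# Map quality scores to an ordered rating: bad (<= 4), average (5-6), good (>= 7).
# `rate_quality` is a hypothetical helper, shown only as an alternative.
rate_quality <- function(quality) {
  # Accept numeric scores or a factor whose labels are the scores.
  scores <- as.numeric(as.character(quality))
  cut(scores,
      breaks = c(-Inf, 4, 6, Inf),   # intervals (-Inf, 4], (4, 6], (6, Inf)
      labels = c("bad", "average", "good"),
      ordered_result = TRUE)
}

# Toy scores covering the range seen in the data (3 to 8).
toy_scores <- factor(c(3, 4, 5, 6, 7, 8), ordered = TRUE)
toy_rating <- rate_quality(toy_scores)
print(table(toy_rating))
```

Converting through `as.character()` first means the function reads the score labels, not the internal factor codes.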

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

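The distribution of citric acid is unusual. It does not match the other two acidity variables, fixed acidity and volatile acidity, whose distributions are roughly normal like that of pH.
A large share of the citric acid data may be missing or unusable; see the plots below: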

grid.arrange(ggplot(aes(fixed.acidity), data = wq) +
               geom_histogram() + scale_x_log10(),
             ggplot(aes(volatile.acidity), data = wq) +
               geom_histogram() + scale_x_log10(),
             ggplot(aes(citric.acid), data = wq) +
               geom_histogram() + scale_x_log10(),
             ncol = 1 )
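One way to probe the missing-data suspicion is to count exact zeros, since log10(0) is -Inf and such values cannot be drawn on a log-scaled axis. A small sketch with made-up numbers; `zero_share` is a hypothetical helper.

```r
# Fraction of values that are exactly zero.
# `zero_share` is a hypothetical helper, not part of the original analysis.
zero_share <- function(x) {
  mean(x == 0, na.rm = TRUE)
}

# Made-up citric acid values for illustration only.
toy_citric <- c(0, 0, 0.1, 0.25, 0.5)
print(zero_share(toy_citric))

# Zeros become -Inf on a log10 scale, so only the positive values remain.
print(log10(0))
finite_on_log_axis <- sum(is.finite(log10(toy_citric)))
print(finite_on_log_axis)
```

If `zero_share(wq$citric.acid)` turned out to be large, the log-scaled histogram above would be hiding that part of the data.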

Bivariate Plots Section

Box plots


ggplot(aes(rating, fixed.acidity), data = wq) +
  geom_jitter( alpha = 0.3)  +
  geom_boxplot( alpha = 0.5)

The plot shows that fixed acidity rises as the wine rating rises, which suggests that fixed acidity is positively correlated with quality.


ggplot(aes(rating, volatile.acidity), data = wq) +
  geom_jitter( alpha = 0.3)  +
  geom_boxplot( alpha = 0.5)

The plot shows that volatile acidity falls steadily as the wine rating rises, which suggests that volatile acidity is negatively correlated with quality.


ggplot(aes(rating, citric.acid), data = wq) +
  geom_jitter( alpha = 0.3)  +
  geom_boxplot( alpha = 0.5)+
  coord_cartesian(ylim = c(0, 0.85))

The plot shows that citric acid rises as the wine rating rises, which suggests that citric acid is positively correlated with quality.


ggplot(aes(rating, residual.sugar), data = wq) +
  geom_jitter( alpha = 0.3)  +
  geom_boxplot( alpha = 0.5)+
  coord_cartesian(ylim = c(1, 6))

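With the large outliers cropped out of view, the plot shows that residual sugar also rises with the wine rating. The trend is weak and many outliers remain, so residual sugar may be positively correlated with quality, but the correlation is probably not strong.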


ggplot(aes(rating, chlorides), data = wq) +
  geom_jitter( alpha = 0.3)  +
  geom_boxplot( alpha = 0.5)+
  coord_cartesian(ylim = c(0.025, 0.12))

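With the large outliers cropped out of view, the plot shows that chlorides first rise and then fall as the wine rating rises. There is no clear pattern, which suggests that chlorides are not strongly correlated with quality.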


ggplot(aes(rating, free.sulfur.dioxide), data = wq) +
  geom_jitter( alpha = 0.3)  +
  geom_boxplot( alpha = 0.5)+
  coord_cartesian(ylim = c(0, 60))

The plot shows that free sulfur dioxide first rises and then falls as the wine rating rises. There is no consistent pattern, which suggests that free sulfur dioxide is not correlated with quality.


ggplot(aes(rating, total.sulfur.dioxide), data = wq) +
  geom_jitter( alpha = 0.3)  +
  geom_boxplot( alpha = 0.5)+
  coord_cartesian(ylim = c(0, 180))

The plot shows that total sulfur dioxide first rises and then falls as the wine rating rises. There is no consistent pattern, which suggests that total sulfur dioxide is not correlated with quality.


ggplot(aes(rating, density), data = wq) +
  geom_jitter( alpha = 0.3)  +
  geom_boxplot( alpha = 0.5)

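The plot shows that density first rises and then falls as the wine rating rises. There is no consistent pattern, which suggests that density is not correlated with quality.
The univariate analysis suggested that density was strongly related to quality; that no longer appears to be the case.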


ggplot(aes(rating, pH), data = wq) +
  geom_jitter( alpha = 0.3)  +
  geom_boxplot( alpha = 0.5)

The plot shows that pH falls steadily as the wine rating rises, which suggests that pH is negatively correlated with quality.


ggplot(aes(rating, sulphates), data = wq) +
  geom_jitter( alpha = 0.3)  +
  geom_boxplot( alpha = 0.5)+
  coord_cartesian(ylim = c(0.25, 1.8))

The plot shows that sulphates rise as the wine rating rises, which suggests that sulphates are positively correlated with quality.


ggplot(aes(rating, alcohol), data = wq) +
  geom_jitter( alpha = 0.3)  +
  geom_boxplot( alpha = 0.5)

The plot shows that alcohol rises as the wine rating rises, which suggests that alcohol is positively correlated with quality.

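Summary:
  • fixed acidity, citric acid, sulphates and alcohol appear to be positively correlated with quality.
  • volatile acidity and pH appear to be negatively correlated with quality.
  • density behaves differently from what the univariate analysis suggested and seems to have no effect.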
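The direction of each box-plot trend can be confirmed by comparing group medians. The following is a sketch under stated assumptions: `median_by_rating` is a hypothetical helper and the data frame holds made-up values, not rows from the wine data.

```r
# Median of one column within each rating group, in factor-level order.
# `median_by_rating` is a hypothetical helper, not part of the original analysis.
median_by_rating <- function(df, value_col, group_col = "rating") {
  tapply(df[[value_col]], df[[group_col]], median)
}

# Made-up toy data for illustration only.
toy <- data.frame(
  rating = factor(c("bad", "bad", "average", "average", "good", "good"),
                  levels = c("bad", "average", "good")),
  alcohol = c(9.0, 9.4, 9.8, 10.2, 11.0, 12.0)
)

medians <- median_by_rating(toy, "alcohol")
print(medians)

# Positive differences between consecutive groups mean a rising trend.
print(diff(medians))
```

A variable whose differences change sign, as described for chlorides or density, would show up here as a mix of positive and negative values.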

Correlation

corr <- NULL

for (i in names(wq)){
  corr[i] <- cor.test(as.numeric(wq[,i]), 
                      as.numeric(wq$quality))$estimate
  }

corr

The following variables have the highest correlation coefficients with wine quality:

  • alcohol: 0.476
  • sulphates: 0.251
  • citric acid: 0.226
  • fixed acidity: 0.124
  • volatile acidity: -0.391
  • density: -0.175
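Because quality is an ordinal score, a rank-based (Spearman) correlation is a common companion to the Pearson coefficient. The sketch below computes both for every numeric predictor; `quality_correlations` is a hypothetical helper, the toy data frame is made up, and it assumes quality is stored as a number, not as a factor.

```r
# Correlation of every numeric column with the target column.
# `quality_correlations` is a hypothetical helper, not part of the original analysis.
quality_correlations <- function(df, target = "quality", method = "pearson") {
  numeric_cols <- names(df)[sapply(df, is.numeric)]
  predictors <- setdiff(numeric_cols, target)
  sapply(predictors, function(col) {
    cor(df[[col]], df[[target]], method = method)
  })
}

# Made-up toy data for illustration only.
toy <- data.frame(
  alcohol = c(9, 10, 11, 12, 13),
  volatile.acidity = c(0.9, 0.7, 0.6, 0.4, 0.3),
  quality = c(3, 4, 5, 6, 8)
)

pearson <- quality_correlations(toy)
spearman <- quality_correlations(toy, method = "spearman")
print(round(pearson, 3))
print(spearman)
```

If quality has already been converted to a factor, `as.numeric(as.character(wq$quality))` recovers the original scores before calling the helper.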

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The box plots suggest that fixed acidity, citric acid, sulphates and alcohol are positively correlated with wine quality, while volatile acidity and pH are negatively correlated with it.
The correlation tests show a similar trend.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The sulfur dioxide variables are interesting. Total and free sulfur dioxide are highly correlated with each other, which is easy to understand, yet neither is correlated with sulphates.

What was the strongest relationship you found?

The correlation coefficients with quality are listed below. Alcohol has the strongest relationship, followed in magnitude by volatile acidity:

  • alcohol: 0.476
  • sulphates: 0.251
  • citric acid: 0.226
  • fixed acidity: 0.124
  • volatile acidity: -0.391
  • density: -0.175

Multivariate Plots Section

The focus here is on four features related to quality: alcohol, sulphates, citric.acid and volatile.acidity.

  
ggplot(aes(x = citric.acid, y = volatile.acidity, color = factor(quality)), data = wq) +
  geom_jitter(alpha = 0.2) +
  scale_color_brewer(palette = "Blues") +
  geom_smooth(method = "lm", se = FALSE,size=1) +
  labs(y = 'volatile.acidity',x = 'citric.acid') +
  ggtitle("Volatile.acidity  VS  Citric.acid VS  quality") 

ggplot(aes(x = alcohol, y = log10(sulphates), color = factor(quality)), data = wq) +
  geom_jitter(alpha = 0.2) +
  scale_color_brewer(palette = "Blues") +
  geom_smooth(method = "lm", se = FALSE,size=1) +
  labs(y = 'log10(sulphates)',x = 'alcohol') +
  ggtitle("Log10(sulphates)  VS  Alcohol VS  quality")

ggplot(aes(x = pH, y = alcohol, color = factor(quality)), data = wq) +
  geom_jitter(alpha = 0.2) +
  scale_color_brewer(palette = "Blues") +
  geom_smooth(method = "lm", se = FALSE,size=1) +
  labs(y = 'alcohol',x = 'pH') +
  ggtitle("Alcohol  VS  PH VS  quality")
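Each `geom_smooth(method = "lm")` line above is a separate straight-line fit within one quality level. The slopes can be extracted numerically to compare the groups. A minimal sketch, assuming base R's `lm()`; `slopes_by_group` is a hypothetical helper and the data frame is made up.

```r
# Slope of a simple linear regression of y on x, fitted separately per group.
# `slopes_by_group` is a hypothetical helper, not part of the original analysis.
slopes_by_group <- function(df, x_col, y_col, group_col) {
  groups <- split(df, df[[group_col]], drop = TRUE)
  sapply(groups, function(g) {
    fit <- lm(g[[y_col]] ~ g[[x_col]])
    unname(coef(fit)[2])  # second coefficient is the slope
  })
}

# Made-up toy data: two quality levels with opposite trends.
toy <- data.frame(
  quality = rep(c(5, 7), each = 3),
  alcohol = c(9, 10, 11, 10, 11, 12),
  pH = c(3.2, 3.3, 3.4, 3.5, 3.4, 3.3)
)

slopes <- slopes_by_group(toy, x_col = "pH", y_col = "alcohol",
                          group_col = "quality")
print(slopes)
```

Slopes that keep the same sign across quality levels would indicate that the relationship between the two plotted variables does not depend much on quality.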

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

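The multivariate plots break the data down by quality score, and the analysis also uses the three rating categories. They show that higher alcohol, sulphates, citric acid and fixed acidity, together with lower volatile acidity, tend to go with higher-quality wine.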

Were there any interesting or surprising interactions between features?

pH has very little effect on quality, which was unexpected.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

no


Final Plots and Summary

Plot One: How acids affect wine quality

grid.arrange(ggplot(data = wq, aes(x = quality, y = fixed.acidity, fill = quality)) +
               scale_fill_brewer(palette = "Blues")+
               xlab('Quality') +
               ylab('Fixed Acidity') +
               geom_boxplot(),
             ggplot(data = wq, aes(x = quality, y = volatile.acidity, fill = quality)) +
               scale_fill_brewer(palette = "Blues")+
               xlab('Quality') +
               ylab('Volatile Acidity') +
               geom_boxplot(),
             ggplot(data = wq, aes(x = quality, y = citric.acid, fill = quality)) +
               scale_fill_brewer(palette = "Blues")+
               xlab('Quality') +
               ylab('Citric Acid') +
               geom_boxplot(),
             ggplot(data = wq, aes(x = quality, y = pH, fill = quality)) +
               scale_fill_brewer(palette = "Blues")+
               xlab('Quality') +
               ylab('pH') +
               geom_boxplot())
               

Description One

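  • These plots illustrate how the acids and pH relate to wine quality.
  • In general, higher-quality wines have higher acid levels or lower pH.
  • Citric acid has the larger effect; fixed acidity has a smaller one.
  • Volatile acidity, however, is detrimental to wine quality.

Plot Two: How alcohol affects wine quality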

ggplot(data = wq, aes( x = quality, y = alcohol, fill = rating)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Blues")+
  xlab('Quality') +
  ylab('Alcohol')
  
         

Description Two

  • This plot illustrates how alcohol relates to wine quality.
  • In general, higher-quality wines have a higher alcohol content, but alcohol on its own is not a very strong predictor of quality.
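The correlation reported earlier puts a number on "not very strong". For a straight-line fit with a single predictor, the squared correlation equals the share of variance explained. The value 0.476 is the alcohol coefficient from the correlation section; the variable names are only illustrative.

```r
# Correlation between alcohol and quality, as reported in the correlation section.
r_alcohol <- 0.476

# Squared correlation: share of the variance in quality that a
# straight-line fit on alcohol alone would account for.
r_squared <- r_alcohol^2
print(round(r_squared, 3))  # 0.227
```

In other words, alcohol alone would account for roughly 23% of the variation in the quality scores, leaving most of it unexplained.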

Plot Three: What really affects wine quality

ggplot(data = subset(wq, rating != 'average'),
       aes(x = alcohol, y = volatile.acidity, color = rating)) +
  geom_point() +
  xlab('Alcohol') +
  ylab('Volatile Acidity')


Description Three

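The average wines were left out of this plot, and the gradient palette used in the earlier plots was dropped because its shades were too faint to tell apart. The pattern is now much easier to see:

  • A high alcohol content combined with low volatile acidity tends to go with better wine.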
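The visual impression from the scatter plot can be turned into a count of how many wines of each rating fall in the high-alcohol, low-volatile-acidity region. The sketch below splits both variables at their medians, which is an assumption made here for illustration and not a threshold from the analysis; `quadrant_counts` is a hypothetical helper and the data are made up.

```r
# Cross-tabulate rating against membership in the "favourable" region:
# alcohol above its median AND volatile acidity below its median.
# The median split is an illustrative assumption, not the author's method.
quadrant_counts <- function(df) {
  high_alcohol <- df$alcohol > median(df$alcohol)
  low_volatile <- df$volatile.acidity < median(df$volatile.acidity)
  table(rating = df$rating, favourable = high_alcohol & low_volatile)
}

# Made-up toy data for illustration only.
toy <- data.frame(
  rating = c("bad", "bad", "bad", "good", "good", "good"),
  alcohol = c(9.0, 9.5, 10.0, 11.5, 12.0, 12.5),
  volatile.acidity = c(0.9, 0.8, 0.7, 0.4, 0.3, 0.35)
)

counts <- quadrant_counts(toy)
print(counts)
```

On the real data, a table in which good wines dominate the TRUE column and bad wines the FALSE column would support the reading of the plot.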

Reflection

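The exploratory analysis of the red wine quality dataset was interesting. The dataset is well suited to this kind of work: it is moderate in size and its patterns are fairly clear.
In the univariate analysis every variable was explored, including pH, density, fixed acidity, volatile acidity, sulphates and alcohol.
The picture became clearer during the bivariate analysis.
Finally, the multivariate analysis made it clear that a high alcohol content and low volatile acidity are strongly associated with wine quality.

Struggles and successes: in the univariate analysis, density looked as if it would be strongly related to wine quality, but the bivariate analysis showed no real relationship.
Looking at a problem from one more dimension gives a more objective view and gets closer to the truth.

Suggestion: data on the same factors could be collected for white wine and other alcoholic beverages, to see whether the same factors matter for all of them. That would give a fuller understanding and could also help guide production.

One more point needs to be made: the quality ratings are subjective, so the conclusions here are only an interesting perspective.
They do not necessarily reflect the real causes of wine quality.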
