Data visualization with ggplot2

2020-04-25

1.1. First step: a ggplot2 plot is built from layers, and each function call added with + contributes one layer

library(ggplot2)  # provides ggplot(), the geoms, and the mpg and diamonds datasets
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))


ggplot() creates a coordinate system that you can add layers to. The first argument of ggplot() is the dataset to use in the graph, so ggplot(data = mpg) on its own creates an empty graph.
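The layer idea can be checked directly. The following is a minimal sketch, assuming ggplot2 is installed; it inspects the layers element of a plot object, and the names base_plot and with_points are made up for illustration.

```r
library(ggplot2)

# ggplot() alone sets up the data and coordinate system but draws nothing.
base_plot <- ggplot(data = mpg)

# Each geom added with + becomes one more layer on the plot.
with_points <- base_plot +
  geom_point(mapping = aes(x = displ, y = hwy))

n_layers_base <- length(base_plot$layers)
n_layers_points <- length(with_points$layers)
print(c(base = n_layers_base, with_points = n_layers_points))
```

Printing base_plot shows only an empty grey panel, while with_points shows the scatterplot.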

1.2. The function geom_point() adds a layer of points to your plot, which creates a scatterplot.

The mapping argument is always paired with aes(), and the x and y arguments of aes() specify which variables to map to the x and y axes.
Terminology: ggplot() and geom_point() are functions; mapping is an argument.
Mapping one more variable to an aesthetic:
ggplot(data=iris)+geom_point(mapping = aes(x=Sepal.Width,y=Sepal.Length,color=Species))

ggplot(data=iris)+geom_point(mapping = aes(x=Species,y=Sepal.Length,color=Sepal.Width))

Several aesthetics can be mapped in the same call:

ggplot(data=iris)+geom_point(mapping = aes(x=Sepal.Width,y=Sepal.Length,size=Species,color=Species))
Warning message:
Using size for a discrete variable is not advised. 

1.3. Aesthetics can also be set manually

ggplot(data=iris) + geom_point(mapping = aes(x=Sepal.Width,y=Sepal.Length,color="grey"))


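Here color is set inside aes(), so the string "grey" is treated as a data value mapped to color. The points are drawn in the default color for that single value, not in grey, and a legend is added.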

ggplot(data=iris) + geom_point(mapping = aes(x=Sepal.Width,y=Sepal.Length),color="grey")


Here color is set outside aes(). It conveys no information about a variable; it only changes the appearance of the points drawn by geom_point().
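The difference between the two calls shows up in the data that ggplot2 actually draws. This is a rough sketch, assuming ggplot2 is installed; layer_data() is a ggplot2 helper that returns the computed data for a layer, and the variable names are illustrative.

```r
library(ggplot2)

# "grey" inside aes(): treated as a one-value variable and sent through the color scale.
p_inside <- ggplot(data = iris) +
  geom_point(mapping = aes(x = Sepal.Width, y = Sepal.Length, color = "grey"))

# "grey" outside aes(): a fixed setting applied to every point.
p_outside <- ggplot(data = iris) +
  geom_point(mapping = aes(x = Sepal.Width, y = Sepal.Length), color = "grey")

# The colour column holds the final color of each drawn point.
colours_inside <- unique(layer_data(p_inside)$colour)
colours_outside <- unique(layer_data(p_outside)$colour)
print(colours_inside)
print(colours_outside)
```

Only the second plot ends up with grey points; the first gets whatever color the default scale assigns to its single value.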

One common problem when creating ggplot2 graphics is to put the + in the wrong place: it has to come at the end of the line, not the start. In other words, make sure you haven’t accidentally written code like this:

ggplot(data = mpg) 
+ geom_point(mapping = aes(x = displ, y = hwy))

1.4. Facets

Note: facet_wrap() is added to the plot with +, at the same level as the geom functions; it does not go inside aes().

     ggplot(data=iris)+geom_point(mapping = aes(x=Sepal.Width,y=Sepal.Length))+facet_wrap(~Species,nrow=2)

Note: Species is a discrete variable. Faceting on the continuous variable Sepal.Width gives one panel per distinct value:

>     ggplot(data=iris)+geom_point(mapping = aes(x=Sepal.Width,y=Sepal.Length))+facet_wrap(~Sepal.Width,nrow=4)
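To see why faceting on a continuous variable gets crowded, the panels can be counted. This is a sketch under the assumption that ggplot2 is installed. The binned variant with cut_width() and a bin width of 0.5 is an illustrative alternative, not something taken from these notes.

```r
library(ggplot2)

p_raw <- ggplot(data = iris) +
  geom_point(mapping = aes(x = Sepal.Width, y = Sepal.Length)) +
  facet_wrap(~ Sepal.Width, nrow = 4)

# One panel is created for every distinct value of the faceting variable.
n_values <- length(unique(iris$Sepal.Width))
n_panels_raw <- length(unique(layer_data(p_raw)$PANEL))

# Alternative (an assumption, not from the notes): bin the variable before faceting.
p_binned <- ggplot(data = iris) +
  geom_point(mapping = aes(x = Sepal.Width, y = Sepal.Length)) +
  facet_wrap(vars(cut_width(Sepal.Width, 0.5)))
n_panels_binned <- length(unique(layer_data(p_binned)$PANEL))

print(c(values = n_values, raw = n_panels_raw, binned = n_panels_binned))
```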

Counting the distinct values in iris:

> library(dplyr)
> p<-iris
> distinct(p)                  # distinct rows across all columns
> distinct(p,Sepal.Length)     # show the distinct values of Sepal.Length
   Sepal.Length
1           5.1
2           4.9
3           4.7
4           4.6
5           5.0
6           5.4
7           4.4
8           4.8
9           4.3
10          5.8
11          5.7
12          5.2
13          5.5
14          4.5
15          5.3
16          7.0
17          6.4
18          6.9
19          6.5
20          6.3
21          6.6
22          5.9
23          6.0
24          6.1
25          5.6
26          6.7
27          6.2
28          6.8
29          7.1
30          7.6
31          7.3
32          7.2
33          7.7
34          7.4
35          7.9
> count(p,Sepal.Length)    # count how often each distinct value occurs
# A tibble: 35 x 2
   Sepal.Length     n
          <dbl> <int>
 1          4.3     1
 2          4.4     3
 3          4.5     1
 4          4.6     4
 5          4.7     2
 6          4.8     5
 7          4.9     6
 8          5      10
 9          5.1     9
10          5.2     4
# … with 25 more rows

1.5. Comparing facet_grid() layouts: the variable with more unique values usually goes in the columns

> ggplot(data=mpg)+
+     geom_point(mapping = aes(x=drv,y=cyl))
> ggplot(data=mpg)+
+     geom_point(mapping = aes(x=drv,y=cyl))+
+     facet_grid(drv~cyl)
> ggplot(data=mpg)+
+     geom_point(mapping = aes(x=drv,y=cyl))+
+     facet_grid(cyl~drv)
> ggplot(data=mpg)+
+     geom_point(mapping = aes(x=drv,y=cyl))+
+     facet_grid(drv~.)
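The rule of thumb in the heading can be applied by counting unique values first. The sketch below assumes ggplot2 is installed and plots displ against hwy purely for illustration; the variable names are made up.

```r
library(ggplot2)

n_drv <- length(unique(mpg$drv))   # drive train types
n_cyl <- length(unique(mpg$cyl))   # cylinder counts

# facet_grid(rows ~ cols): cyl has more unique values, so it goes in the columns.
p_wide <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)

# Every row/column combination gets a panel, even if it holds no points.
n_panels <- n_drv * n_cyl
print(c(rows = n_drv, cols = n_cyl, panels = n_panels))
```

Putting the longer variable in the columns uses the width of a typical landscape plot area; swapping to facet_grid(cyl ~ drv) gives taller, narrower panels instead.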
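About stroke:
> ggplot(data=iris)+
+     geom_point(mapping = aes(x=Sepal.Length,y=Sepal.Width,color=Species),shape=21,stroke=1,fill="lightpink")

Zooming in shows that the inside of each outlined point is filled with lightpink. stroke and fill are constants here, so they are set outside aes(); shape 21 is a circle with separate outline and fill colors.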

1.6. Geometric objects

> ggplot(data=iris)+
+     geom_smooth(mapping = aes(x=Sepal.Length,y=Sepal.Width,color=Species))
`geom_smooth()` using method = 'loess' and formula 'y ~ x'

When several geoms use the same data and mappings, these can be set once in ggplot():

> ggplot(data = iris, mapping = aes(x=Sepal.Length,y=Sepal.Width))+
+     geom_point()+
+     geom_smooth()

(The longhand version repeats the mapping in every geom:)

> ggplot(data = iris)+
+     geom_point(mapping = aes(x=Sepal.Length,y=Sepal.Width))+
+     geom_smooth(mapping = aes(x=Sepal.Length,y=Sepal.Width))
Both versions produce the same plot, as expected.

A mapping can also be added to a single layer only:

> ggplot(data = iris, mapping = aes(x=Sepal.Length,y=Sepal.Width))+
+     geom_point(mapping = aes(color=Species))+
+     geom_smooth()

Likewise, a layer can be given its own data; a local setting overrides the global one:

> ggplot(data = iris, mapping = aes(x=Sepal.Length,y=Sepal.Width))+
+     geom_point(mapping = aes(color=Species),show.legend = FALSE)+
+     geom_smooth(data=dplyr::filter(iris,Species=="setosa"))
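That the smoother really sees only the setosa rows can be verified from the layer data. This is a minimal sketch, assuming ggplot2 is installed. It uses base subset() in place of filter() and method = "lm" only to keep the example simple; both are substitutions made for this illustration.

```r
library(ggplot2)

setosa <- subset(iris, Species == "setosa")   # base R equivalent of filter()

p <- ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(mapping = aes(color = Species)) +                   # uses the global data
  geom_smooth(data = setosa, method = "lm", formula = y ~ x)     # uses its own data

points_drawn <- layer_data(p, 1)   # layer 1: geom_point()
smooth_drawn <- layer_data(p, 2)   # layer 2: geom_smooth()

print(nrow(points_drawn))
print(range(smooth_drawn$x))
print(range(setosa$Sepal.Length))
```

The fitted line spans only the x range of the setosa flowers, while the points still cover all of iris.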

Exercise

p1 <- ggplot(data = mpg, mapping = aes(displ, hwy)) +
      geom_point(size = 2.5) +
      geom_smooth(se = F, size = 1.5)

p2 <- ggplot(data = mpg, mapping = aes(displ, hwy)) +
      geom_point(size = 2.5) +
      geom_smooth(se = F, size = 1.5, mapping = aes(group = drv))

p3 <- ggplot(data = mpg, mapping = aes(displ, hwy, color = drv)) +
      geom_point(size = 2.5) +
      geom_smooth(se = F, size = 1.5, mapping = aes(group = drv, color = drv))

p4 <- ggplot(data = mpg, mapping = aes(displ, hwy)) +
      geom_point(size = 2.5, mapping = aes(color = drv)) +
      geom_smooth(se = F, size = 1.5)

p5 <- ggplot(data = mpg, mapping = aes(displ, hwy)) +
      geom_point(size = 2.5, mapping = aes(color = drv)) +
      geom_smooth(se = F, size = 1.5, mapping = aes(group = drv, linetype = drv))

p6 <- ggplot(data = mpg, mapping = aes(displ, hwy)) +
      geom_point(size = 2.5, mapping = aes(color = drv))

library(gridExtra)     # arrange several plots on one page
grid.arrange(p1, p2, p3, p4, p5, p6, ncol= 2, nrow = 3)

1.7. Statistical transformations

geom_bar
view(diamonds)
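The default stat of geom_bar() is stat_count(), which computes two new variables: count and prop (the proportion within each group).

By default the y axis of a bar chart is the count at each x value. In this example x is cut, which has five quality levels, and the bar chart counts the diamonds at each level automatically. To show the share of each level instead of the count, use prop.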

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))

group = 1 treats all diamonds as a single group, so each bar shows the share of that cut among all diamonds. Without it, each cut level forms its own group, and every proportion is 100%.
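The effect of group = 1 can be read off the computed prop values. The sketch below assumes ggplot2 3.3.0 or later, where after_stat(prop) is the newer spelling of ..prop..; the variable names are illustrative.

```r
library(ggplot2)

# Without group = 1, each cut level is its own group.
p_per_bar <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

# With group = 1, all diamonds form one group.
p_one_group <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

prop_per_bar <- layer_data(p_per_bar)$prop
prop_one_group <- layer_data(p_one_group)$prop
print(prop_per_bar)
print(round(prop_one_group, 3))
```

In the first plot every bar has height 1; in the second the five heights add up to 1.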

> ggplot(data = diamonds) + 
+     stat_summary(
+         mapping = aes(x = cut, y = depth),
+         fun.min = min,
+         fun.max = max,
+         fun = median
+     )
stat_summary(
  mapping = NULL,
  data = NULL,
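  geom = "pointrange",    # default geom of stat_summary()
  position = "identity",    # position adjustment (this is not a stat)

The pairing does not work in reverse: the default stat of geom_pointrange() is stat_identity(), not stat_summary(). To draw the same plot with the geom function instead of the stat function, the stat has to be named explicitly: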

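ggplot(data = diamonds) +
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary",
    fun.min = min,
    fun.max = max,
    fun = median
  )

(If fun.min, fun.max, and fun are left out, stat_summary() falls back to mean_se(), and the plot shows the mean with a range of one standard error on each side, which is barely visible for this data. That is why the range bars seem to be missing.)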
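As a check that the stat-first and geom-first forms describe the same plot, the summarized data of both can be compared. This is a sketch assuming ggplot2 3.3.0 or later (for the fun, fun.min, and fun.max arguments); the object names are made up for illustration.

```r
library(ggplot2)

# Stat-first: stat_summary() draws with its default geom, pointrange.
p_stat <- ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min, fun.max = max, fun = median
  )

# Geom-first: geom_pointrange() needs stat = "summary" and the same functions.
p_geom <- ggplot(data = diamonds) +
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary",
    fun.min = min, fun.max = max, fun = median
  )

cols <- c("x", "y", "ymin", "ymax")
summary_stat <- as.data.frame(layer_data(p_stat))[, cols]
summary_geom <- as.data.frame(layer_data(p_geom))[, cols]
print(summary_stat)
```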

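  • geom_col() is for the most common kind of bar chart, where both x and y are mapped in ggplot() (x is usually a factor, which is why the values are drawn as bars rather than as a curve).
    For example: ggplot(data, aes(x = x, y = y)) +
    geom_col()

  • geom_bar() is for bar charts of counts: only x is mapped (again usually a factor). It counts the observations at each level of x and puts that count on the y axis.
    The difference between the two is whether y is mapped.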
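The two geoms meet in the middle when the counts are computed by hand. The following sketch assumes ggplot2 is installed and uses base table() to pre-count the diamonds; cut_counts and the other names are illustrative.

```r
library(ggplot2)

# geom_bar(): raw rows in, counts computed by stat_count().
p_bar <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# geom_col(): counts computed beforehand and mapped to y.
cut_counts <- as.data.frame(table(cut = diamonds$cut))   # columns: cut, Freq
p_col <- ggplot(data = cut_counts) +
  geom_col(mapping = aes(x = cut, y = Freq))

heights_bar <- as.numeric(layer_data(p_bar)$y)
heights_col <- as.numeric(layer_data(p_col)$y)
print(heights_bar)
```

Both plots draw bars of the same heights; the only difference is who did the counting.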

Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?

Complementary geoms and stats

geom stat
geom_bar() stat_count()
geom_bin2d() stat_bin_2d()
geom_boxplot() stat_boxplot()
geom_contour() stat_contour()
geom_count() stat_sum()
geom_density() stat_density()
geom_density_2d() stat_density_2d()
geom_hex() stat_hex()
geom_freqpoly() stat_bin()
geom_histogram() stat_bin()
geom_qq_line() stat_qq_line()
geom_qq() stat_qq()
geom_quantile() stat_quantile()
geom_smooth() stat_smooth()
geom_violin() stat_violin()
geom_sf() stat_sf()
geom_pointrange() stat_identity()

They tend to have their names in common, as with stat_smooth() and geom_smooth(). However, this is not always the case: geom_bar() and stat_count(), and geom_histogram() and stat_bin(), are notable counter-examples.
If you want the heights of the bars to represent values in the data, use geom_col() instead. geom_bar() uses stat_count() by default: it counts the number of cases at each x position. geom_col() uses stat_identity(): it leaves the data as is.

ggplot2 geom layers and their default stats

geom default stat (a trailing x marks a geom and stat that share a documentation page)
geom_abline() -
geom_hline() -
geom_vline() -
geom_bar() stat_count()
geom_col() -
geom_bin2d() stat_bin_2d()
geom_blank() -
geom_boxplot() stat_boxplot()
geom_contour() stat_contour()
geom_count() stat_sum()
geom_density() stat_density()
geom_density_2d() stat_density_2d()
geom_dotplot() -
geom_errorbarh() -
geom_hex() stat_hex()
geom_freqpoly() stat_bin() x
geom_histogram() stat_bin() x
geom_crossbar() -
geom_errorbar() -
geom_linerange() -
geom_pointrange() -
geom_map() -
geom_point() -
geom_map() -
geom_path() -
geom_line() -
geom_step() -
geom_point() -
geom_polygon() -
geom_qq_line() stat_qq_line() x
geom_qq() stat_qq() x
geom_quantile() stat_quantile() x
geom_ribbon() -
geom_area() -
geom_rug() -
geom_smooth() stat_smooth() x
geom_spoke() -
geom_label() -
geom_text() -
geom_raster() -
geom_rect() -
geom_tile() -
geom_violin() stat_ydensity() x
geom_sf() stat_sf() x

ggplot2 stat layers and their default geoms

stat default geom
stat_ecdf() geom_step()
stat_ellipse() geom_path()
stat_function() geom_path()
stat_identity() geom_point()
stat_summary_2d() geom_tile()
stat_summary_hex() geom_hex()
stat_summary_bin() geom_pointrange()
stat_summary() geom_pointrange()
stat_unique() geom_point()
stat_count() geom_bar()
stat_bin_2d() geom_tile()
stat_boxplot() geom_boxplot()
stat_contour() geom_contour()
stat_sum() geom_point()
stat_density() geom_area()
stat_density_2d() geom_density_2d()
stat_bin_hex() geom_hex()
stat_bin() geom_bar()
stat_qq_line() geom_path()
stat_qq() geom_point()
stat_quantile() geom_quantile()
stat_smooth() geom_smooth()
stat_ydensity() geom_violin()
stat_sf() geom_rect()
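Entries in these tables can be looked up from R instead of from the documentation. This is a hedged sketch, assuming ggplot2 is installed: it relies on each layer object carrying its stat and geom as ggproto objects, and proto_name() is a hypothetical helper defined here, not a ggplot2 function.

```r
library(ggplot2)

# Helper (hypothetical name): class name of a ggproto object, e.g. "StatCount".
proto_name <- function(obj) class(obj)[1]

default_stats <- c(
  geom_bar = proto_name(geom_bar()$stat),
  geom_col = proto_name(geom_col()$stat),
  geom_histogram = proto_name(geom_histogram()$stat),
  geom_violin = proto_name(geom_violin()$stat)
)
default_geoms <- c(
  stat_count = proto_name(stat_count()$geom),
  stat_bin = proto_name(stat_bin()$geom),
  stat_summary = proto_name(stat_summary()$geom)
)
print(default_stats)
print(default_geoms)
```

The class names map to the functions in the tables, for example StatCount to stat_count() and GeomPointrange to geom_pointrange().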

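About geom_smooth(): three of the fitting methods it accepts
glm fits generalized linear models; with its default Gaussian family it also fits an ordinary linear regression
lm fits linear models; it cannot fit generalized linear models
loess fits a local polynomial regression, which gives a smooth curve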

> p1<-ggplot(mpg, aes(displ, hwy, colour = class)) +
+     geom_point() +
+     geom_smooth( method = glm,se=FALSE)
> p2<-ggplot(mpg, aes(displ, hwy, colour = class)) +
+     geom_point() +
+     geom_smooth( method = lm,se=FALSE)
> p3<-ggplot(mpg, aes(displ, hwy, colour = class)) +
+     geom_point() +
+     geom_smooth( method = loess,se=FALSE)
> library(gridExtra)
> grid.arrange(p1,p2,p3,ncol=2,nrow=2)
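With no family argument, glm and lm should give the same fit, which would explain why the first two panels look identical. The sketch below checks this on mpg, assuming ggplot2 is installed; smooth_y() is a hypothetical helper written for this example.

```r
library(ggplot2)

# glm() with its default gaussian family is the same model as lm().
fit_lm <- lm(hwy ~ displ, data = mpg)
fit_glm <- glm(hwy ~ displ, data = mpg)
same_coefs <- isTRUE(all.equal(coef(fit_lm), coef(fit_glm)))

# Helper (hypothetical name): y values of the line drawn by geom_smooth().
smooth_y <- function(method) {
  p <- ggplot(mpg, aes(displ, hwy)) +
    geom_smooth(method = method, formula = y ~ x, se = FALSE)
  layer_data(p)$y
}
same_lines <- isTRUE(all.equal(smooth_y("lm"), smooth_y("glm")))

print(c(same_coefs = same_coefs, same_lines = same_lines))
```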

About group = 1

> p1=ggplot(data = diamonds) +
+     geom_bar(mapping = aes(x = cut, y = ..prop..))
> p2=ggplot(data = diamonds) +
+     geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..))
> p3=ggplot(data = diamonds) +
+     geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..,group=1))
> grid.arrange(p1,p2,p3,ncol=2,nrow=2)

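The y axis is ..prop.., the share of each category within its group. group = 1 puts all categories into a single group, so each category's share is computed against the whole dataset; this is why group = 1 is needed.
Without it, each category is its own group by default. For counts that still gives an ordinary bar chart, but for proportions every category is 100% of its own group, so all bars are equally tall and reach the top. This is the plot drawn by the first snippet.
If there is also a fill mapping such as fill = color, each level of color gets a segment of height 1 in every bar, and the 7 colors are stacked, so every bar tops out at 7. This is the plot drawn by the second snippet.
Author: 咕嚕咕嚕轉的ATP合酶
Link: http://www.lxweimin.com/p/f36c3f8cfb24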
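The claim about bars topping out at 7 can be checked on the computed data. This is a minimal sketch, assuming ggplot2 3.3.0 or later (after_stat(prop) is the newer spelling of ..prop..); the variable names are illustrative.

```r
library(ggplot2)

p_fill <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = color, y = after_stat(prop)))

drawn <- layer_data(p_fill)

# Each cut/color combination is its own group, so every prop is 1 ...
props <- unique(drawn$prop)
# ... and the stacked segments in each bar reach the number of color levels.
bar_tops <- tapply(drawn$ymax, as.numeric(drawn$x), max)

print(props)
print(bar_tops)
print(nlevels(diamonds$color))
```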

1.8. Position adjustments

> ggplot(data=iris)+
+ geom_bar(mapping = aes(x=Sepal.Width,y=Sepal.Length,fill=Species),stat="identity")
> p1=ggplot(data=iris)+  
+ geom_bar(mapping = aes(x=Sepal.Width,fill=Species),alpha=3/5,position = "identity")
>  p2=ggplot(data=iris)+    
+ geom_bar(mapping = aes(x=Sepal.Width,fill=Species),alpha=3/5)
> p3=ggplot(data=iris)+    
+  geom_bar(mapping = aes(x=Sepal.Width,color=Species),fill=NA,position = "identity")
> p4=ggplot(data=iris)+    
+ geom_bar(mapping = aes(x=Sepal.Width,fill=Species),position = "fill")
> grid.arrange(p1,p2,p3,p4,ncol=2,nrow=2)
> p5=ggplot(data = iris)+
+ geom_bar(mapping=aes(x=Sepal.Width,fill=Species),position="dodge")
> grid.arrange(p1,p2,p3,p4,p5,ncol=2,nrow=3)
Comparing the plots with and without position = "identity" shows that with it the bars overlap one another instead of being stacked.
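The four position adjustments differ in where each bar starts and how wide it is, which can be seen in the rectangle coordinates. This is a sketch assuming ggplot2 is installed; bars_for() is a hypothetical helper defined only for this comparison.

```r
library(ggplot2)

# Helper (hypothetical name): bar rectangles drawn for a given position adjustment.
bars_for <- function(position) {
  p <- ggplot(data = iris) +
    geom_bar(mapping = aes(x = Sepal.Width, fill = Species), position = position)
  layer_data(p)
}

stacked  <- bars_for("stack")      # the default for geom_bar()
overlaid <- bars_for("identity")
filled   <- bars_for("fill")
dodged   <- bars_for("dodge")

# identity and dodge: every bar starts at 0; stack: some bars start on top of others.
print(c(stack = max(stacked$ymin), identity = max(overlaid$ymin),
        dodge = max(dodged$ymin), fill_top = max(filled$ymax)))
```

With "fill" the stacks are rescaled so that each reaches 1, and with "dodge" bars that share an x value are narrowed and placed side by side.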

About overplotting:
The recorded values are rounded, so many points share exactly the same coordinates and hide one another.

> p6=ggplot(data=iris)+
+     geom_point(mapping = aes(x=Sepal.Width,y=Sepal.Length),position="jitter")
> p7=ggplot(data=iris)+
+     geom_point(mapping = aes(x=Sepal.Width,y=Sepal.Length))
> grid.arrange(p6,p7,ncol=2,nrow=2)
position = "jitter" adds a small amount of random noise to each point.
> ggplot(data=iris,mapping = aes(x=Sepal.Width,y=Sepal.Length))+
+ geom_jitter()

geom_jitter() is a shorthand that gives the same kind of result.

Fine-tuning jitter

> p8=ggplot(data=mpg,mapping=aes(x=cty,y=hwy))+
+     geom_jitter(aes(color=class))
> p <- ggplot(mpg, aes(cyl, hwy))
> p9 <- p+geom_jitter(aes(color=class))
> grid.arrange(p8,p9,ncol=2,nrow=2)
> p10=ggplot(data=mpg,mapping=aes(x=cyl,y=hwy))+
+     geom_jitter(aes(color=class))
> grid.arrange(p8,p9,p10,ncol=2,nrow=2)
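The amount of jitter itself can be tuned with the width and height arguments of geom_jitter(). The sketch below assumes ggplot2 is installed; the value 0.25 and the seed are arbitrary choices for illustration.

```r
library(ggplot2)

set.seed(1)  # jitter is random; fixing the seed makes the result repeatable

# width and height give the maximum shift in each direction, in data units.
p_small <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.25, height = 0.25)

drawn <- layer_data(p_small)
shift_x <- drawn$x - mpg$cty   # how far each point moved horizontally
shift_y <- drawn$y - mpg$hwy   # how far each point moved vertically
print(range(shift_x))
print(range(shift_y))
```

Smaller values keep points closer to their true positions at the cost of more overlap; setting one of them to 0 jitters along a single axis only.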

Compare and contrast geom_jitter() with geom_count().

The geom geom_jitter() adds random variation to the locations of the points in the graph. In other words, it “jitters” the locations of points slightly. This method reduces overplotting since two points with the same location are unlikely to have the same random variation.

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter()

However, the reduction in overlapping comes at the cost of slightly changing the x and y values of the points.

The geom geom_count() sizes the points relative to the number of observations. Combinations of (x, y) values with more observations will be larger than those with fewer observations.

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_count()

The geom_count() geom does not change the x and y coordinates of the points. However, if the points are close together and counts are large, the size of some points can itself create overplotting. In the following example, a third variable mapped to color is added to the plot. In this case, geom_count() is less readable than geom_jitter().

ggplot(data = mpg, mapping = aes(x = cty, y = hwy, color = class)) +
  geom_jitter()
ggplot(data = mpg, mapping = aes(x = cty, y = hwy, color = class)) +
  geom_count()

As that example shows, there is unfortunately no universal solution to overplotting. The costs and benefits of different approaches will depend on the structure of the data and the goal of the data scientist.
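What geom_count() does with overlapping observations can be inspected in its computed data. This is a minimal sketch, assuming ggplot2 is installed; it reads the n variable computed by stat_sum(), the default stat of geom_count().

```r
library(ggplot2)

p_count <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_count()

drawn <- layer_data(p_count)

# One point per distinct (cty, hwy) pair; n is the number of cars at that position.
n_points <- nrow(drawn)
n_cars <- sum(drawn$n)
print(c(points = n_points, cars = n_cars, rows_in_mpg = nrow(mpg)))
```

No observation is lost: the counts add up to the number of rows, but far fewer points are drawn than with geom_point().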

1.9. Coordinate systems

coord_flip(): swaps the x and y axes
coord_quickmap(): sets a suitable aspect ratio for maps
coord_polar(): uses polar coordinates

usa<-map_data("usa")
nz<-map_data("nz")
ggplot(usa, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", color = "black") +
  coord_quickmap()
ggplot(iris, aes(x = factor(1), fill = Species)) +
  geom_bar()
ggplot(iris, aes(x = factor(1), fill = Species)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y")

The argument theta = "y" maps y to the angle of each section. If coord_polar() is specified without theta = "y", then the resulting plot is called a bulls-eye chart.
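The pie and bulls-eye variants differ only in the theta argument. The sketch below assumes ggplot2 is installed and reads the theta setting back from the plot's coordinate object; the names bar, pie, and bullseye are illustrative.

```r
library(ggplot2)

bar <- ggplot(iris, aes(x = factor(1), fill = Species)) +
  geom_bar(width = 1)

pie <- bar + coord_polar(theta = "y")   # count is mapped to the angle: pie chart
bullseye <- bar + coord_polar()         # default theta = "x": count becomes the radius

print(c(pie = pie$coordinates$theta, bullseye = bullseye$coordinates$theta))
```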
