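First, the data used in this article was downloaded from the IMDB 5000 Movie Dataset on Kaggle.
The original author scraped more than 5,000 observations from IMDB and then used regression to model the IMDB score of each movie. The author's article is:
Predict Movie Rating - NYC Data Science Academy Blog
This article uses that dataset to complete the assignment for Lecture 4 of the big data analysis course, "Complex Data and Analysis", as hands-on practice with the lecture's content and key points.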
Importing the packages
library(ggplot2)
library(stringr)
library(dplyr)
Importing the data
# Root path of the current project
# For example: G:/DataCruiser/workspace/IMDB Analysis
projectPath <- getwd()
# Path to movie_metadata.csv
# For example: G:/DataCruiser/workspace/IMDB Analysis/data/movie_metadata.csv
servicePath <- str_c(projectPath, "data", "movie_metadata.csv", sep = "/")
# Import the data
movies <- read.csv(servicePath, header = TRUE, stringsAsFactors = FALSE)
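Since the later steps drop rows with missing values, it can help to first count how many values are missing in each column. The following is only a minimal sketch: `toy_movies` is a hypothetical stand-in for `movies`, and its values are invented for illustration.

```r
# Toy stand-in for `movies`; the values are invented for illustration only.
toy_movies <- data.frame(
  title_year = c(2010, NA, 2012, 2012),
  imdb_score = c(7.1, 6.5, NA, 8.0),
  director_facebook_likes = c(100, 0, 250, NA),
  actor_1_facebook_likes = c(1000, 500, 20000, 300)
)

# Count the missing values in each column.
na_counts <- colSums(is.na(toy_movies))
print(na_counts)
```

Running the same `colSums(is.na(...))` call on `movies` would show how many rows the later filtering step removes at most.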
Processing the director and movie score data
disDirector <- function(){
  # Select the columns that are needed
  mymovies <- select(movies,
                     title_year,
                     imdb_score,
                     director_facebook_likes,
                     actor_1_facebook_likes)
  # Rename the columns: the new name is on the left of the equals sign, the old name on the right
  mymovies <- rename(mymovies,
                     year = title_year,
                     scores = imdb_score,
                     directorlikes = director_facebook_likes,
                     actorlikes = actor_1_facebook_likes)
  # Drop rows with missing values
  mymovies <- filter(mymovies,
                     !is.na(year),
                     !is.na(scores),
                     !is.na(directorlikes),
                     !is.na(actorlikes))
  # Sort by year, newest first
  mymovies <- arrange(mymovies, desc(year))
  # Relate director Facebook likes to the IMDB scores of the movies they directed:
  # per year, compute the movie count, the mean score and the mean director likes
  result <- mymovies %>%
    group_by(year) %>%
    summarise(count = n(),
              mean_scores = mean(scores, na.rm = TRUE),
              mean_likes = mean(directorlikes, na.rm = TRUE)) %>%
    # count > 0 keeps every year; raise the threshold to drop years with few movies
    filter(count > 0)
  return(result)
}
Processing the actor and movie score data
disActor <- function(){
  # Select the columns that are needed
  mymovies <- select(movies,
                     title_year,
                     imdb_score,
                     director_facebook_likes,
                     actor_1_facebook_likes)
  # Rename the columns: the new name is on the left of the equals sign, the old name on the right
  mymovies <- rename(mymovies,
                     year = title_year,
                     scores = imdb_score,
                     directorlikes = director_facebook_likes,
                     actorlikes = actor_1_facebook_likes)
  # Drop rows with missing values
  mymovies <- filter(mymovies,
                     !is.na(year),
                     !is.na(scores),
                     !is.na(directorlikes),
                     !is.na(actorlikes))
  # Sort by year, newest first
  mymovies <- arrange(mymovies, desc(year))
  # Relate the lead (first-billed) actor's Facebook likes to the IMDB scores of the movies:
  # per year, compute the movie count, the mean score and the mean actor likes
  result <- mymovies %>%
    group_by(year) %>%
    summarise(count = n(),
              mean_scores = mean(scores, na.rm = TRUE),
              mean_likes = mean(actorlikes, na.rm = TRUE)) %>%
    # count > 0 keeps every year; raise the threshold to drop years with few movies
    filter(count > 0)
  return(result)
}
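The two functions above differ only in which likes column they average, so the shared steps could be folded into one helper. The sketch below is one possible way to do that, not the original author's code: `summarise_by_year` and `toy_movies` are hypothetical names, the toy values are invented, and unlike the functions above the helper only drops rows where the chosen likes column is missing.

```r
library(dplyr)

# Hypothetical helper: per-year movie count, mean score and mean likes
# for the likes column named in `likes_col`.
summarise_by_year <- function(data, likes_col) {
  data %>%
    select(year = title_year, scores = imdb_score, likes = all_of(likes_col)) %>%
    filter(!is.na(year), !is.na(scores), !is.na(likes)) %>%
    group_by(year) %>%
    summarise(count = n(),
              mean_scores = mean(scores),
              mean_likes = mean(likes)) %>%
    arrange(desc(year))
}

# Toy stand-in for `movies`; the values are invented for illustration only.
toy_movies <- data.frame(
  title_year = c(2010, 2010, 2011, NA),
  imdb_score = c(6, 8, 7, 5),
  director_facebook_likes = c(100, 300, NA, 50),
  actor_1_facebook_likes = c(1000, 3000, 500, 200)
)

director_summary <- summarise_by_year(toy_movies, "director_facebook_likes")
actor_summary <- summarise_by_year(toy_movies, "actor_1_facebook_likes")
```

With the real data, the same two calls on `movies` would produce yearly summaries in the same shape as those returned by `disDirector()` and `disActor()`.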
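Plotting director likes against score
# Scatter plot of director likes against score
# Note: disDirector is a function, so it must be called to obtain the data frame
directorView <- ggplot(data = disDirector()) +
  geom_point(mapping = aes(x = mean_likes, y = mean_scores)) +
  geom_smooth(mapping = aes(x = mean_likes, y = mean_scores))
The result is shown below:
Figure: movieScore vs direcetorLikes.jpg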
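Plotting actor likes against score
# Scatter plot of actor likes against score
# Note: disActor is a function, so it must be called to obtain the data frame
actorView <- ggplot(data = disActor()) +
  geom_point(mapping = aes(x = mean_likes, y = mean_scores)) +
  geom_smooth(mapping = aes(x = mean_likes, y = mean_scores))
The result is shown below:
Figure: movieScore vs actorLikes.jpg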
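Saving the results
# Create the output directory if it does not exist yet
dir.create(str_c(projectPath, "output", sep = "/"), showWarnings = FALSE)
# Save the director plot
outputpath <- str_c(projectPath, "output", "movieScore vs direcetorLikes.jpg", sep = "/")
ggsave(filename = outputpath, plot = directorView)
# Save the actor plot
outputpath <- str_c(projectPath, "output", "movieScore vs actorLikes.jpg", sep = "/")
ggsave(filename = outputpath, plot = actorView)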
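Analysis of the results
Assuming that a movie's IMDB score reflects how good it is, the analysis of the 5,000+ IMDB records leads to the following preliminary conclusions:
- Overall, the number of Facebook likes a director has is positively correlated with movie quality, while the number of Facebook likes the lead actor has is negatively correlated with it. Judging a movie by its director therefore tends to be more reliable.
- Some less mainstream directors have few Facebook likes, but that does not rule out their making good movies.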
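The direction of such a relationship can also be checked numerically instead of only by reading the smoothed curve. The sketch below uses a hypothetical `yearly` data frame with invented values in the shape of the yearly summaries, so its numbers say nothing about the real dataset; with the real data, the summary returned by `disDirector()` or `disActor()` would take its place.

```r
# Toy yearly summary; the values are invented for illustration only.
yearly <- data.frame(
  mean_likes = c(100, 200, 400, 800, 1600),
  mean_scores = c(6.0, 6.2, 6.1, 6.6, 6.9)
)

# Pearson correlation measures the linear association.
pearson_r <- cor(yearly$mean_likes, yearly$mean_scores, method = "pearson")

# Spearman correlation uses ranks, so it is less sensitive to extreme like counts.
spearman_rho <- cor(yearly$mean_likes, yearly$mean_scores, method = "spearman")

print(c(pearson = pearson_r, spearman = spearman_rho))
```

A positive value indicates that years with more likes also tend to have higher mean scores, and a negative value indicates the opposite.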
Note that years with a small count were not removed here. Setting a different noise threshold changes the conclusions slightly, and the actor trend in particular can change considerably.
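How much the threshold matters can be explored by recomputing the correlation after dropping years with few movies. This is a minimal sketch under stated assumptions: `cor_above` is a hypothetical helper, and `yearly` is a toy summary with invented values chosen so that two sparse years pull the trend the other way.

```r
# Toy yearly summary; the values are invented for illustration only.
yearly <- data.frame(
  count = c(1, 2, 40, 60, 80, 100),
  mean_likes = c(9000, 7000, 300, 500, 700, 900),
  mean_scores = c(5.0, 5.5, 6.0, 6.3, 6.4, 6.8)
)

# Hypothetical helper: correlation between likes and scores,
# using only the years with at least `min_count` movies.
cor_above <- function(data, min_count) {
  kept <- data[data$count >= min_count, ]
  cor(kept$mean_likes, kept$mean_scores)
}

# Compare a threshold that keeps every year with one that drops sparse years.
thresholds <- c(1, 10)
results <- sapply(thresholds, function(k) cor_above(yearly, k))
names(results) <- paste0("count>=", thresholds)
print(results)
```

In this toy data the sign of the correlation flips once the two sparse years are excluded, which is the kind of sensitivity described above.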
The source code and output for this article have been uploaded to:
jijiwhywhy/IMDB-Analysis