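Original article. Last updated: 2018-05-3
Introduction: about Series
To make this Series example easier to follow, I have shared the file fandango_score_comparison.csv via Baidu Netdisk. Link: https://pan.baidu.com/s/1U6z7OvXK75L1AGm1vYlN4w Password: qe1a
Course source: python數據分析與機器學習實戰 (Python Data Analysis and Machine Learning in Practice), 唐宇迪
A DataFrame is comparable to a matrix, and a Series is comparable to a single row or column of that matrix. A Series consists of a set of data together with the index labels associated with it.
For example, here is a small case: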
import pandas as pd
a=pd.Series([9,8,7,6])
a
Out[19]:
0 9
1 8
2 7
3 6
dtype: int64
Below is a data set of movie ratings and related data. Let's see whether there is anything special about the Series structure.
import pandas as pd
fandango=pd.read_csv('fandango_score_comparison.csv')
series_film = fandango['FILM']
type(series_film)
Out[85]: pandas.core.series.Series
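As shown above, fandango is a DataFrame. Taking out one of its columns, fandango['FILM'], gives a Series.
How does selecting data from a Series differ from selecting data from a DataFrame?
The usage is essentially the same: indexing and slicing.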
series_film = fandango['FILM']
series_film[0:5]
Out[84]:
0 Avengers: Age of Ultron (2015)
1 Cinderella (2015)
2 Ant-Man (2015)
3 Do You Believe? (2015)
4 Hot Tub Time Machine 2 (2015)
Name: FILM, dtype: object
series_rt = fandango['RottenTomatoes']
series_rt[0:5]
Out[87]:
0 74
1 85
2 80
3 18
4 14
Name: RottenTomatoes, dtype: int64
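How do we create a new Series?
First, look at series.values: the result is an ndarray. In other words, taking the values out of a Series gives a NumPy ndarray. This shows that a DataFrame is made up of Series, and a Series is built on an ndarray. pandas is in fact built on top of NumPy.
Many pandas operations are convenient combinations of NumPy functionality, and many operations work the same way in both libraries.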
film_names=series_film.values
type(film_names)
Out[89]: numpy.ndarray
The next step creates a Series, so first import Series from pandas.
from pandas import Series
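The string representation of a Series shows the index on the left and the values on the right. If no index is specified for the data, an integer index from 0 to N-1 (where N is the length of the data) is created automatically. The values and index attributes of a Series return its array representation and its index object.
Unlike a plain NumPy array, a Series lets you select a single value or a group of values by index label.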
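The two points above (the default integer index, and selection by label) can be sketched as follows. The scores and the labels "a", "b", "c" are made up for illustration and are not taken from the Fandango file.

```python
import numpy as np
import pandas as pd

# Hypothetical scores, used only for illustration.
scores = pd.Series([74, 85, 80])

# With no index given, pandas creates the integer index 0..N-1.
print(scores.index)    # RangeIndex(start=0, stop=3, step=1)
print(scores.values)   # [74 85 80]

# With an explicit index, values can be selected by label,
# either one at a time or as a group.
labeled = pd.Series([74, 85, 80], index=["a", "b", "c"])
single = labeled["b"]          # one value
subset = labeled[["a", "c"]]   # a smaller Series
print(single)
print(subset)
```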
Example: create a Series in which each movie name maps to the score given by one of the review sites.
from pandas import Series
rt_scores = series_rt.values
series_custom = Series(rt_scores , index=film_names)
series_custom[['Minions (2015)', 'Leviathan (2014)']]
Out[96]:
Minions (2015) 54
Leviathan (2014) 99
dtype: int64
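How do we sort a Series?
reindex is less about changing the index of a pandas object than about rearranging it into a new order. If a requested label does not exist in the index, that row is filled with the default missing value, NaN. reindex does not modify the original object; to keep the result, assign it to a variable.
First, extract the movie names, that is, convert the index to a list.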
original_index = series_custom.index.tolist()
original_index
Out[110]:
['Avengers: Age of Ultron (2015)',
'Cinderella (2015)',
'Ant-Man (2015)',
'Do You Believe? (2015)',
'Hot Tub Time Machine 2 (2015)',
....
'Mr. Holmes (2015)',
"'71 (2015)",
'Two Days, One Night (2014)',
'Gett: The Trial of Viviane Amsalem (2015)',
'Kumiko, The Treasure Hunter (2015)']
Sort the movie names. The sorted result is as follows:
sorted_index = sorted(original_index)
sorted_index
Out[112]:
["'71 (2015)",
'5 Flights Up (2015)',
'A Little Chaos (2015)',
'A Most Violent Year (2014)',
'About Elly (2015)',
....
'What We Do in the Shadows (2015)',
'When Marnie Was There (2015)',
"While We're Young (2015)",
'Wild Tales (2014)',
'Woman in Gold (2015)']
Use reindex to reorder series_custom according to the sorted movie names:
sorted_by_index = series_custom.reindex(sorted_index)
sorted_by_index
Out[114]:
'71 (2015) 97
5 Flights Up (2015) 52
A Little Chaos (2015) 40
A Most Violent Year (2014) 90
About Elly (2015) 97
....
When Marnie Was There (2015) 89
While We're Young (2015) 83
Wild Tales (2014) 96
Woman in Gold (2015) 52
Length: 146, dtype: int64
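A minimal sketch of how reindex treats a label that is not in the index, and of the fact that it returns a new object. The Series `s` and the label "z" are hypothetical.

```python
import pandas as pd

# Hypothetical data, not taken from the Fandango file.
s = pd.Series([3, 1, 2], index=["c", "a", "b"])

# Reorder by a sorted list of labels; "z" does not exist in s.
reordered = s.reindex(["a", "b", "c", "z"])
print(reordered)

# The missing label is filled with NaN, which also makes the dtype float64.
print(reordered.isna()["z"])   # True

# The original Series keeps its own order.
print(s.index.tolist())        # ['c', 'a', 'b']
```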
How do we sort a Series by its index and by its values?
Use sort_index() to sort by index, giving sc2
sc2 = series_custom.sort_index()
sc2
Out[116]:
'71 (2015) 97
5 Flights Up (2015) 52
A Little Chaos (2015) 40
A Most Violent Year (2014) 90
About Elly (2015) 97
....
What We Do in the Shadows (2015) 96
When Marnie Was There (2015) 89
While We're Young (2015) 83
Wild Tales (2014) 96
Woman in Gold (2015) 52
Length: 146, dtype: int64
Use sort_values() to sort by value, giving sc3
sc3 = series_custom.sort_values()
sc3
Out[118]:
Paul Blart: Mall Cop 2 (2015) 5
Hitman: Agent 47 (2015) 7
Hot Pursuit (2015) 8
Fantastic Four (2015) 9
Taken 3 (2015) 9
....
Song of the Sea (2014) 99
Phoenix (2015) 99
Selma (2014) 99
Seymour: An Introduction (2015) 100
Gett: The Trial of Viviane Amsalem (2015) 100
Length: 146, dtype: int64
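Both sorting methods return a new Series and accept an `ascending` argument. A small sketch on made-up data:

```python
import pandas as pd

# Hypothetical scores keyed by made-up labels.
s = pd.Series([52, 97, 40], index=["b", "c", "a"])

by_label = s.sort_index()                        # labels a, b, c
by_value_desc = s.sort_values(ascending=False)   # highest score first

print(by_label)
print(by_value_desc)
```

Neither call changes `s` itself, so the result has to be assigned, as sc2 and sc3 are above.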
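How do we add two Series?
Adding two Series produces a new Series. The values are aligned by index label and added pairwise. If a label appears in only one of the two Series, the result for that label is NaN.
Here we use NumPy's add function to add series_custom to itself.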
series_custom
Out[123]:
Avengers: Age of Ultron (2015) 74
Cinderella (2015) 85
Ant-Man (2015) 80
Do You Believe? (2015) 18
Hot Tub Time Machine 2 (2015) 14
....
Mr. Holmes (2015) 87
'71 (2015) 97
Two Days, One Night (2014) 97
Gett: The Trial of Viviane Amsalem (2015) 100
Kumiko, The Treasure Hunter (2015) 87
Length: 146, dtype: int64
np.add(a, b) is equivalent to a + b. The result is as follows:
import numpy as np
np.add(series_custom, series_custom)  # equivalent to series_custom + series_custom
Out[124]:
Avengers: Age of Ultron (2015) 148
Cinderella (2015) 170
Ant-Man (2015) 160
Do You Believe? (2015) 36
Hot Tub Time Machine 2 (2015) 28
....
Mr. Holmes (2015) 174
'71 (2015) 194
Two Days, One Night (2014) 194
Gett: The Trial of Viviane Amsalem (2015) 200
Kumiko, The Treasure Hunter (2015) 174
Length: 146, dtype: int64
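The example above adds a Series to itself, so every label matches. A minimal sketch of what happens when the two indexes only partly overlap, using the hypothetical Series `left` and `right`:

```python
import pandas as pd

# Two hypothetical Series that share only the labels "b" and "c".
left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Addition aligns on labels; labels found on one side only give NaN.
total = left + right
print(total)    # a NaN, b 12.0, c 23.0, d NaN

# Series.add with fill_value treats a missing side as 0 instead.
filled = left.add(right, fill_value=0)
print(filled)   # a 1.0, b 12.0, c 23.0, d 30.0
```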
Use np.sin() to compute the sine of each value in the Series
np.sin(series_custom)
Out[126]:
Avengers: Age of Ultron (2015) -0.985146
Cinderella (2015) -0.176076
Ant-Man (2015) -0.993889
Do You Believe? (2015) -0.750987
Hot Tub Time Machine 2 (2015) 0.990607
....
Mr. Holmes (2015) -0.821818
'71 (2015) 0.379608
Two Days, One Night (2014) 0.379608
Gett: The Trial of Viviane Amsalem (2015) -0.506366
Kumiko, The Treasure Hunter (2015) -0.821818
Length: 146, dtype: float64
Find the maximum of series_custom with np.max()
np.max(series_custom)
Out[127]: 100
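As a sketch of the pattern in the last two examples: an element-wise NumPy function returns a Series with the same index, while a reduction such as np.max returns a single value. The data here is hypothetical.

```python
import numpy as np
import pandas as pd

s = pd.Series([4, 9, 16], index=["x", "y", "z"])

roots = np.sqrt(s)    # element-wise: still a Series, index preserved
largest = np.max(s)   # reduction: a single scalar

print(roots)
print(largest, s.max())   # the Series method gives the same value
```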
Test which values in series_custom are greater than 50
series_custom > 50
Out[128]:
Avengers: Age of Ultron (2015) True
Cinderella (2015) True
Ant-Man (2015) True
Do You Believe? (2015) False
Hot Tub Time Machine 2 (2015) False
....
Mr. Holmes (2015) True
'71 (2015) True
Two Days, One Night (2014) True
Gett: The Trial of Viviane Amsalem (2015) True
Kumiko, The Treasure Hunter (2015) True
Length: 146, dtype: bool
Select the values in series_custom that are greater than 50
series_greater_than_50 = series_custom[series_custom > 50]
series_greater_than_50
Out[130]:
Avengers: Age of Ultron (2015) 74
Cinderella (2015) 85
Ant-Man (2015) 80
The Water Diviner (2015) 63
Top Five (2014) 86
....
Mr. Holmes (2015) 87
'71 (2015) 97
Two Days, One Night (2014) 97
Gett: The Trial of Viviane Amsalem (2015) 100
Kumiko, The Treasure Hunter (2015) 87
Length: 94, dtype: int64
Select the values in series_custom that are greater than 50 and less than 75
criteria_one = series_custom > 50
criteria_two = series_custom < 75
both_criteria = series_custom[criteria_one & criteria_two]
both_criteria
Out[134]:
Avengers: Age of Ultron (2015) 74
The Water Diviner (2015) 63
Unbroken (2014) 51
Southpaw (2015) 59
Insidious: Chapter 3 (2015) 59
The Man From U.N.C.L.E. (2015) 68
....
Woman in Gold (2015) 52
The Last Five Years (2015) 60
Jurassic World (2015) 71
Minions (2015) 54
Spare Parts (2015) 52
dtype: int64
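Boolean masks can also be combined inline. The sketch below uses made-up scores; note that each comparison needs its own parentheses, because `&` and `|` bind more tightly than `>` and `<` in Python.

```python
import pandas as pd

# Hypothetical scores.
s = pd.Series([74, 18, 63, 99], index=["w", "x", "y", "z"])

in_range = s[(s > 50) & (s < 75)]     # both conditions hold
outside = s[(s <= 50) | (s >= 75)]    # either condition holds

# Series.between is inclusive on both ends by default,
# so 51..74 matches the strict 50 < value < 75 test for integers.
between = s[s.between(51, 74)]

print(in_range)
print(outside)
```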
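How do we give two Series the same index, and how is the calculation done?
When the indexes are the same, the values at matching labels are combined, producing a new Series.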
rt_critics = Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])
rt_users = Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM'])
rt_mean = (rt_critics + rt_users)/2
rt_mean
Out[138]:
FILM
Avengers: Age of Ultron (2015) 80.0
Cinderella (2015) 82.5
Ant-Man (2015) 85.0
Do You Believe? (2015) 51.0
Hot Tub Time Machine 2 (2015) 21.0
....
Inside Out (2015) 94.0
Mr. Holmes (2015) 82.5
'71 (2015) 89.5
Two Days, One Night (2014) 87.5
Gett: The Trial of Viviane Amsalem (2015) 90.5
Kumiko, The Treasure Hunter (2015) 75.0
Length: 146, dtype: float64
How do we specify an index?
More on the set_index function:
A DataFrame can be given a single index or a compound (multi-level) index with the set_index method.
DataFrame.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
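append=True adds the new keys to the existing index instead of replacing it. drop=False keeps the column that becomes the index as a regular column too. inplace=True modifies the DataFrame itself instead of returning a new one.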
The index of fandango runs from 0 to 145 (a RangeIndex whose stop value, 146, is exclusive).
fandango=pd.read_csv('fandango_score_comparison.csv')
fandango.index
Out[149]: RangeIndex(start=0, stop=146, step=1)
With set_index, replace that integer index with the values of the 'FILM' column. The result is as follows:
fandango_films = fandango.set_index('FILM', drop=False)
fandango_films.index
Out[140]:
Index(['Avengers: Age of Ultron (2015)', 'Cinderella (2015)', 'Ant-Man (2015)',
'Do You Believe? (2015)', 'Hot Tub Time Machine 2 (2015)',
'The Water Diviner (2015)', 'Irrational Man (2015)', 'Top Five (2014)',
'Shaun the Sheep Movie (2015)', 'Love & Mercy (2015)',
...
'The Woman In Black 2 Angel of Death (2015)', 'Danny Collins (2015)',
'Spare Parts (2015)', 'Serena (2015)', 'Inside Out (2015)',
'Mr. Holmes (2015)', "'71 (2015)", 'Two Days, One Night (2014)',
'Gett: The Trial of Viviane Amsalem (2015)',
'Kumiko, The Treasure Hunter (2015)'],
dtype='object', name='FILM', length=146)
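A minimal sketch of the effect of the drop argument, on a tiny hypothetical DataFrame (the titles and the "score" column are made up):

```python
import pandas as pd

# Hypothetical two-row table.
df = pd.DataFrame({"FILM": ["A (2015)", "B (2014)"], "score": [74, 85]})

kept = df.set_index("FILM", drop=False)   # FILM is the index and still a column
moved = df.set_index("FILM")              # default drop=True: FILM is only the index
restored = moved.reset_index()            # turn the index back into a column

print(kept.columns.tolist())      # ['FILM', 'score']
print(moved.columns.tolist())     # ['score']
print(restored.columns.tolist())  # ['FILM', 'score']
```

None of these calls changes `df`; each returns a new DataFrame.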
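Slicing with the specified index
Just as a numeric index can be sliced, string labels can be sliced with a colon. The slice follows the order of the labels in the index and includes both endpoints, so in an index ordered a, b, c, the slice a:c selects a, b, and c. All the data in the rows for those labels is returned. The usage is similar to slicing with a numeric index.
Example: slice the rows from "Avengers: Age of Ultron (2015)" to "Hot Tub Time Machine 2 (2015)".
fandango_films["Avengers: Age of Ultron (2015)":"Hot Tub Time Machine 2 (2015)"] is equivalent to fandango_films.loc["Avengers: Age of Ultron (2015)":"Hot Tub Time Machine 2 (2015)"]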
fandango_films["Avengers: Age of Ultron (2015)":"Hot Tub Time Machine 2 (2015)"]
Out[147]:
FILM \
FILM
Avengers: Age of Ultron (2015) Avengers: Age of Ultron (2015)
Cinderella (2015) Cinderella (2015)
Ant-Man (2015) Ant-Man (2015)
Do You Believe? (2015) Do You Believe? (2015)
Hot Tub Time Machine 2 (2015) Hot Tub Time Machine 2 (2015)
RT_user_norm ... IMDB_norm \
FILM ...
Avengers: Age of Ultron (2015) 4.3 ... 3.90
Cinderella (2015) 4.0 ... 3.55
Ant-Man (2015) 4.5 ... 3.90
Do You Believe? (2015) 4.2 ... 2.70
Hot Tub Time Machine 2 (2015) 1.4 ... 2.55
RT_norm_round RT_user_norm_round \
Fandango_Difference
FILM
Avengers: Age of Ultron (2015) 0.5
Cinderella (2015) 0.5
Ant-Man (2015) 0.5
Do You Believe? (2015) 0.5
Hot Tub Time Machine 2 (2015) 0.5
[5 rows x 22 columns]
Similar small exercises:
# Look up the row for one index label
fandango_films.loc['Kumiko, The Treasure Hunter (2015)']
# Look up the rows for three index labels
movies = ['Kumiko, The Treasure Hunter (2015)', 'Do You Believe? (2015)', 'Ant-Man (2015)']
fandango_films.loc[movies]
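A small sketch contrasting label-based and position-based selection, using a hypothetical DataFrame with single-letter labels. A label slice with .loc includes its end label, while a positional slice with .iloc excludes its end position.

```python
import pandas as pd

# Hypothetical table whose index is deliberately not in alphabetical order.
df = pd.DataFrame({"score": [74, 85, 80, 18]}, index=["d", "a", "c", "b"])

by_label = df.loc["a":"c"]     # rows "a" and "c": index order, end label included
by_position = df.iloc[1:3]     # the same two rows: end position excluded

one_row = df.loc["b"]          # a single label returns a Series
several = df.loc[["b", "d"]]   # a list of labels returns a DataFrame

print(by_label)
print(several)
```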
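How do we inspect data types and select columns by type?
Use the dtypes attribute to query the data type of each column of the DataFrame. The result is as follows: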
import numpy as np
types = fandango_films.dtypes
types
Out[158]:
FILM object
RottenTomatoes int64
RottenTomatoes_User int64
Metacritic int64
Metacritic_User float64
....
IMDB_norm_round float64
Metacritic_user_vote_count int64
IMDB_user_vote_count int64
Fandango_votes int64
Fandango_Difference float64
dtype: object
Get the names of the columns whose data type is float64
float_columns = types[types.values == 'float64'].index
float_columns
Out[160]:
Index(['Metacritic_User', 'IMDB', 'Fandango_Stars', 'Fandango_Ratingvalue',
'RT_norm', 'RT_user_norm', 'Metacritic_norm', 'Metacritic_user_nom',
'IMDB_norm', 'RT_norm_round', 'RT_user_norm_round',
'Metacritic_norm_round', 'Metacritic_user_norm_round',
'IMDB_norm_round', 'Fandango_Difference'],
dtype='object')
Use those float64 column names to select all rows of the corresponding columns
float_df = fandango_films[float_columns]
float_df
Out[162]:
Metacritic_User IMDB \
FILM
Avengers: Age of Ultron (2015) 7.1 7.8
Cinderella (2015) 7.5 7.1
Ant-Man (2015) 8.1 7.8
Do You Believe? (2015) 4.7 5.4
Hot Tub Time Machine 2 (2015) 3.4 5.1
The Water Diviner (2015) 6.8 7.2
Irrational Man (2015) 7.6 6.9
Top Five (2014) 6.8 6.5
Shaun the Sheep Movie (2015) 8.8 7.4
Love & Mercy (2015) 8.5 7.8
Far From The Madding Crowd (2015) 7.5 7.2
Black Sea (2015) 6.6 6.4
Leviathan (2014) 7.2 7.7
Unbroken (2014) 6.5 7.2
The Imitation Game (2014) 8.2 8.1
Taken 3 (2015) 4.6 6.1
Ted 2 (2015) 6.5 6.6
Southpaw (2015) 8.2 7.8
Night at the Museum: Secret of the Tomb (2014) 5.8 6.3
Pixels (2015) 5.3 5.6
McFarland, USA (2015) 7.2 7.5
Insidious: Chapter 3 (2015) 6.9 6.3
The Man From U.N.C.L.E. (2015) 7.9 7.6
Run All Night (2015) 7.3 6.6
Trainwreck (2015) 6.0 6.7
Selma (2014) 7.1 7.5
Ex Machina (2015) 7.9 7.7
Still Alice (2015) 7.8 7.5
Wild Tales (2014) 8.8 8.2
The End of the Tour (2015) 7.5 7.9
...
Clouds of Sils Maria (2015) 0.1
Testament of Youth (2015) 0.1
Infinitely Polar Bear (2015) 0.1
Phoenix (2015) 0.1
The Wolfpack (2015) 0.1
The Stanford Prison Experiment (2015) 0.1
Tangerine (2015) 0.1
Magic Mike XXL (2015) 0.1
Home (2015) 0.1
The Wedding Ringer (2015) 0.1
Woman in Gold (2015) 0.1
The Last Five Years (2015) 0.1
Mission: Impossible – Rogue Nation (2015) 0.1
Amy (2015) 0.1
Jurassic World (2015) 0.0
Minions (2015) 0.0
Max (2015) 0.0
Paul Blart: Mall Cop 2 (2015) 0.0
The Longest Ride (2015) 0.0
The Lazarus Effect (2015) 0.0
The Woman In Black 2 Angel of Death (2015) 0.0
Danny Collins (2015) 0.0
Spare Parts (2015) 0.0
Serena (2015) 0.0
Inside Out (2015) 0.0
Mr. Holmes (2015) 0.0
'71 (2015) 0.0
Two Days, One Night (2014) 0.0
Gett: The Trial of Viviane Amsalem (2015) 0.0
Kumiko, The Treasure Hunter (2015) 0.0
[146 rows x 15 columns]
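As an alternative sketch, pandas also provides DataFrame.select_dtypes, which picks columns by type in one call. The tiny DataFrame below is hypothetical; both approaches should select the same columns.

```python
import pandas as pd

# Hypothetical table with one text, one integer, and one float column.
df = pd.DataFrame({"FILM": ["A", "B"], "votes": [10, 20], "stars": [4.5, 3.0]})

# Approach used above: filter the dtypes Series, then index by column names.
float_cols = df.dtypes[df.dtypes == "float64"].index
via_mask = df[float_cols]

# Alternative: let pandas select by dtype directly.
via_select = df.select_dtypes(include=["float64"])

print(via_mask.columns.tolist())     # ['stars']
print(via_select.columns.tolist())   # ['stars']
```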
Use apply with np.std() to compute the standard deviation of each column (each rating metric)
deviations = float_df.apply(lambda x: np.std(x))
deviations
Out[165]:
Metacritic_User 1.505529
IMDB 0.955447
Fandango_Stars 0.538532
Fandango_Ratingvalue 0.501106
RT_norm 1.503265
RT_user_norm 0.997787
Metacritic_norm 0.972522
Metacritic_user_nom 0.752765
IMDB_norm 0.477723
RT_norm_round 1.509404
RT_user_norm_round 1.003559
Metacritic_norm_round 0.987561
Metacritic_user_norm_round 0.785412
IMDB_norm_round 0.501043
Fandango_Difference 0.152141
dtype: float64
A similar small exercise (axis=1 applies the function to each row instead of each column):
rt_mt_user = float_df[['RT_user_norm', 'Metacritic_user_nom']]
rt_mt_user.apply(lambda x: np.std(x), axis=1)
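One detail worth keeping in mind when comparing results: np.std defaults to the population standard deviation (ddof=0), while the pandas std method defaults to the sample standard deviation (ddof=1). A sketch on hypothetical numbers, using the built-in std method with an explicit ddof:

```python
import numpy as np
import pandas as pd

# Hypothetical values.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [2.0, 4.0, 6.0]})

col_pop = df.std(ddof=0)           # per column, population formula
col_sample = df.std()              # per column, pandas default (ddof=1)
row_pop = df.std(axis=1, ddof=0)   # per row, population formula

print(col_pop)
print(col_sample)   # x 1.0, y 2.0
print(row_pop)      # 0.5, 1.0, 1.5
```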