1. Gapminder data
female_completion_rate:
Primary completion rate is the percentage of female students completing the
last year of primary school. It is calculated by taking the total number of
female students in the last grade of primary school, minus the number of
repeaters in that grade, divided by the total number of children of
official graduation age. The ratio can exceed 100% due to over-aged and
under-aged children who enter primary school late/early and/or repeat
grades. United Nations Educational, Scientific, and Cultural Organization
(UNESCO) Institute for Statistics.
male_completion_rate:
Same as female_completion_rate, but for male students.
life_expectancy:
The average number of years a newborn child would live if current mortality
patterns were to stay the same.
gdp_per_capita:
Gross Domestic Product per capita in constant 2000 US$. Inflation, but not
differences in the cost of living between countries, has been taken into
account.
employment_above_15:
Percentage of total population, age above 15, that has been employed during
the given year.
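As a rough illustration of the completion-rate formula described above, the sketch below uses made-up numbers (not Gapminder data); the function name completion_rate and its parameters are hypothetical. It shows how over-aged students in the last grade can push the ratio above 100%.

```python
def completion_rate(last_grade_students, repeaters, graduation_age_population):
    """Completion rate in percent, following the definition above."""
    return 100.0 * (last_grade_students - repeaters) / graduation_age_population

# Hypothetical numbers: more students finish the last grade than there are
# children of official graduation age, so the rate exceeds 100%.
rate = completion_rate(last_grade_students=1150,
                       repeaters=50,
                       graduation_age_population=1000)
print(rate)  # 110.0
```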
questions I thought of:
- how has employment in the US varied over time?
- what are the highest and lowest employment levels?
-- which countries have them?
-- where is the US on the spectrum?
- same questions for other variables
- how do these variables relate to each other?
- are there consistent trends across countries?
2. One-dimensional arrays in NumPy and pandas
reading a csv file with pandas
import pandas as pd
daily_engagement = pd.read_csv('daily-engagement.csv')
len(daily_engagement['acct'].unique())
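A self-contained sketch of the same idea, since daily-engagement.csv is not included here. The small table below is made up; only the column name acct comes from the code above, and total_minutes_visited is an assumed second column.

```python
import pandas as pd

# Made-up stand-in for daily-engagement.csv: one row per account per day.
daily_engagement = pd.DataFrame({
    'acct': ['0', '0', '1', '2', '2', '2'],
    'total_minutes_visited': [11.7, 37.3, 0.0, 53.6, 33.5, 4.2],
})

# unique() returns the distinct values, so its length is the number of accounts.
num_accounts = len(daily_engagement['acct'].unique())
print(num_accounts)  # 3
```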
3. NumPy arrays
one-dimensional data structures
pandas        | numpy (numerical python)
series        | array (a series is built on an array)
more features | simpler
numpy arrays and python lists
numpy array: a = np.array(['AL','AK','AZ','AR','CA'])
similarities:
- access elements by position
  a[0] → 'AL'
- access a range of elements
  a[1:3] → np.array(['AK','AZ'])
- use loops
  for x in a:
differences:
- each element should have the same type (string, int, boolean, etc.)
- convenient functions
  mean(), std()
- can be multi-dimensional
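One consequence of the same-type rule, sketched with made-up values: when a list mixes types, NumPy converts every element to a common type. The last lines illustrate a multi-dimensional array.

```python
import numpy as np

mixed = np.array([1, 2.5, 3])        # ints and a float -> all floats
print(mixed.dtype)                   # float64

as_text = np.array([1, 'AL'])        # a number and a string -> all strings
print(as_text[0] == '1')             # True: the 1 was stored as a string

matrix = np.array([[1, 2], [3, 4]])  # arrays can be multi-dimensional
print(matrix.shape)                  # (2, 2)
print(matrix.mean())                 # 2.5
```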
import numpy as np
countries = np.array([
'Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',
'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas',
'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium',
'Belize', 'Benin', 'Bhutan', 'Bolivia',
'Bosnia and Herzegovina'
])
employment = np.array([
55.70000076, 51.40000153, 50.5 , 75.69999695,
58.40000153, 40.09999847, 61.5 , 57.09999847,
60.90000153, 66.59999847, 60.40000153, 68.09999847,
66.90000153, 53.40000153, 48.59999847, 56.79999924,
71.59999847, 58.40000153, 70.40000153, 41.20000076
])
indexing
print countries[0]
print countries[3]
print countries[0:3]
print countries[:3]
print countries[17:]
print countries[:]
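checking the element type of a numpy array
print countries.dtype # S22: S means string, 22 means the longest string in the array has 22 characters
print employment.dtype # float64: f means float, 64 means the values are stored in 64-bit format
print np.array([0, 1, 2, 3]).dtype # check the element type of a numpy array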
print np.array([1.0, 1.5, 2.0, 2.5]).dtype
print np.array([True, False, True]).dtype
print np.array(['AL', 'AK', 'AZ', 'AR', 'CA']).dtype
python string format function
for country in countries:
    print 'Examining country {}'.format(country) # python string format function
for i in range(len(countries)):
    country = countries[i]
    country_employment = employment[i]
    print 'Country {} has employment {}'.format(country,
                                                country_employment)
array statistics
print employment.mean()
print employment.std()
print employment.max()
print employment.sum()
maximum employment
def max_employment(countries, employment):
    max_position = employment.argmax() # returns the position of the maximum value
    max_value = employment[max_position]
    max_country = countries[max_position]
    return (max_country, max_value)
max_employment(countries, employment)
4. Vectorized operations
vectorized operations
a vector is a list of numbers
vector 1: [1,2,3]
vector 2: [4,5,6]
adding 2 vectors is called vector addition
possible results:
- [5,7,9]
  vector addition in linear algebra and numpy
- [1,2,3,4,5,6]
  list concatenation in python
- error: you can't add vectors, only numbers
  very common
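The first two outcomes can be checked directly. A minimal sketch:

```python
import numpy as np

list_sum = [1, 2, 3] + [4, 5, 6]                       # Python lists: concatenation
array_sum = np.array([1, 2, 3]) + np.array([4, 5, 6])  # NumPy arrays: element-wise addition

print(list_sum)   # [1, 2, 3, 4, 5, 6]
print(array_sum)  # [5 7 9]
```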
5. Multiplying by a scalar
vector [1,2,3] * scalar 3 =
possible results:
- [1,2,3,1,2,3,1,2,3]
  python
- [3,6,9]
  linear algebra and numpy
- error: you can't multiply a vector by a single number
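The same comparison for multiplication by a scalar, as a minimal sketch:

```python
import numpy as np

list_product = [1, 2, 3] * 3             # Python list: the list is repeated 3 times
array_product = np.array([1, 2, 3]) * 3  # NumPy array: each element is multiplied by 3

print(list_product)   # [1, 2, 3, 1, 2, 3, 1, 2, 3]
print(array_product)  # [3 6 9]
```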
6. Calculating the overall completion rate
more vectorized operations
math operations:
add +
subtract -
multiply *
divide /
exponentiate **
logical operations:
and &
or |
not ~
make sure your arrays contain booleans!
otherwise these operators perform bitwise and, bitwise or, and bitwise not
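To see why the arrays need to contain booleans, the sketch below (values chosen for illustration) applies & to an integer array: the result is a bitwise and of each element, not a logical and.

```python
import numpy as np

ints = np.array([1, 2, 3])

bitwise = ints & 2                # bitwise and: 1&2=0, 2&2=2, 3&2=2
logical = (ints > 1) & (ints < 3) # logical and on two boolean arrays

print(bitwise)  # [0 2 2]
print(logical)  # [False  True False]
```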
comparison operations:
greater >
greater or equal >=
less <
less or equal <=
equal ==
not equal !=
operations between two arrays
import numpy as np
a = np.array([1, 2, 3, 4])
b = np.array([1, 2, 1, 2])
print a
print a + b #[2 4 4 6]
print a - b #[0 0 2 2]
print a * b #[1 4 3 8]
print a / b #[1 1 3 2]
print a ** b #[ 1 4 3 16]
operations between an array and a number
a = np.array([1, 2, 3, 4])
b = 2
print a + b #[3 4 5 6]
print a - b #[-1 0 1 2]
print a * b #[2 4 6 8]
print a / b #[0 1 1 2]
print a ** b #[ 1 4 9 16]
operations between boolean arrays
a = np.array([True, True, False, False])
b = np.array([True, False, True, False])
print a & b #[ True False False False]
print a | b #[ True True True False]
print ~a #[False False True True]
operations between a boolean array and a single boolean
print a & True #[ True True False False]
print a & False # [False False False False]
print a | True # [ True True True True]
print a | False # [ True True False False]
comparing two arrays is a vectorized operation that returns a boolean array
a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 4, 3, 2, 1])
print a > b #[False False False True True]
print a >= b #[False False True True True]
print a < b #[ True True False False False]
print a <= b #[ True True True False False]
print a == b #[False False True False False]
print a != b #[ True True False True True]
comparing an array with a single number is a vectorized operation that returns a boolean array
a = np.array([1, 2, 3, 4])
b = 2
print a > b #[False False True True]
print a >= b #[False True True True]
print a < b #[ True False False False]
print a <= b #[ True True False False]
print a == b #[False True False False]
print a != b #[ True False True True]
average education level
countries = np.array([
'Algeria', 'Argentina', 'Armenia', 'Aruba', 'Austria','Azerbaijan',
'Bahamas', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Bolivia',
'Botswana', 'Brunei', 'Bulgaria', 'Burkina Faso', 'Burundi',
'Cambodia', 'Cameroon', 'Cape Verde'
])
female_completion = np.array([
97.35583, 104.62379, 103.02998, 95.14321, 103.69019,
98.49185, 100.88828, 95.43974, 92.11484, 91.54804,
95.98029, 98.22902, 96.12179, 119.28105, 97.84627,
29.07386, 38.41644, 90.70509, 51.7478 , 95.45072
])
male_completion = np.array([
95.47622, 100.66476, 99.7926 , 91.48936, 103.22096,
97.80458, 103.81398, 88.11736, 93.55611, 87.76347,
102.45714, 98.73953, 92.22388, 115.3892 , 98.70502,
37.00692, 45.39401, 91.22084, 62.42028, 90.66958
])
def overall_completion_rate(female_completion, male_completion):
    return (female_completion + male_completion) / 2
overall_completion_rate(female_completion, male_completion)
7. Standardizing data
standardizing data
how does one data point compare to the rest?
e.g. employment in US vs other countries
to answer, convert each data point to the number of standard deviations it lies away from the mean; this is called standardizing the data
in 2007,
mean employment rate: 58.6%
standard deviation: 10.5%
united states: 62.3%, diff: 3.7% or 0.35 sd
mexico: 57.9%, diff: -0.7% or -0.067 sd
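The arithmetic above can be reproduced as follows. This is a minimal sketch using the 2007 figures quoted in these notes; the helper name num_std_devs is hypothetical.

```python
mean_employment = 58.6  # percent, 2007
std_employment = 10.5   # percentage points, 2007

def num_std_devs(value):
    """Number of standard deviations a value lies from the mean."""
    return (value - mean_employment) / std_employment

us = num_std_devs(62.3)
mexico = num_std_devs(57.9)
print(round(us, 2))      # 0.35
print(round(mexico, 3))  # -0.067
```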
import numpy as np
countries = np.array([
'Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',
'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas',
'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium',
'Belize', 'Benin', 'Bhutan', 'Bolivia',
'Bosnia and Herzegovina'
])
employment = np.array([
55.70000076, 51.40000153, 50.5 , 75.69999695,
58.40000153, 40.09999847, 61.5 , 57.09999847,
60.90000153, 66.59999847, 60.40000153, 68.09999847,
66.90000153, 53.40000153, 48.59999847, 56.79999924,
71.59999847, 58.40000153, 70.40000153, 41.20000076
])
def standardize_data(values):
    return (values - values.mean()) / values.std()
8. NumPy index arrays
numpy index arrays
suppose you have two arrays of the same length, and the second contains booleans
a = [1,2,3,4,5]
b = [F,F,T,T,T]   index array: it tells you which elements of the first array to keep
b = a > 2
a[b] = [3,4,5] = a[a > 2]
import numpy as np
a = np.array([1, 2, 3, 4])
b = np.array([True, True, False, False])
print a[b] #[1,2]
print a[np.array([True, False, True, False])] #[1,3]
a = np.array([1, 2, 3, 2, 1])
b = (a >= 2) #[False,True,True,True,False]
print a[b]
print a[a >= 2] #[2,3,2]
a = np.array([1, 2, 3, 4, 5])
b = np.array([1, 2, 3, 2, 1])
print b == 2 #[False True False True False]
print a[b == 2] #[2 4]
write a function that takes in time_spent
and days_to_cancel
and returns mean time spent for students who stay at least 7 days
time_spent = np.array([
12.89697233, 0. , 64.55043217, 0. ,
24.2315615 , 39.991625 , 0. , 0. ,
147.20683783, 0. , 0. , 0. ,
45.18261617, 157.60454283, 133.2434615 , 52.85000767,
0. , 54.9204785 , 26.78142417, 0.
])
days_to_cancel = np.array([
4, 5, 37, 3, 12, 4, 35, 38, 5, 37, 3, 3, 68,
38, 98, 2, 249, 2, 127, 35
])
def mean_time_for_paid_students(time_spent, days_to_cancel):
    return (time_spent[days_to_cancel >= 7]).mean()
9. + vs +=
+= modifies the existing array
+ creates a new array
code snippet 1
import numpy as np
a=np.array([1,2,3,4])
b=a
a+=np.array([1,1,1,1])
print b # output:[2,3,4,5]
code snippet 2
import numpy as np
a=np.array([1,2,3,4])
b=a
a= a+np.array([1,1,1,1])
print b # output:[1,2,3,4]
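10. In-place vs not in-place
in-place operation: += stores all the new values where the original values were, instead of creating a new array
not in-place operation: +
another example of an in-place operation: slicing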
import numpy as np
a = np.array([1,2,3,4,5])
slice=a[:3]
slice[0]=100
print a #[100 2 3 4 5]
slice refers to what is called a view of the original array. It looks like an array, but if you modify it, the original array is modified as well. This makes slicing a numpy array very fast, since no new array is created and no values are copied, but it means you should be very careful any time you modify a slice of an array.
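If the original array must stay unchanged, one option is to copy the slice before modifying it. A minimal sketch (the variable name first_three is illustrative):

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
first_three = a[:3].copy()  # copy() allocates a new array instead of a view
first_three[0] = 100

print(first_three)  # [100   2   3]
print(a)            # [1 2 3 4 5], the original is untouched
```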
11. pandas Series
import pandas as pd
countries = ['Albania', 'Algeria', 'Andorra', 'Angola', 'Antigua and Barbuda',
'Argentina', 'Armenia', 'Australia', 'Austria', 'Azerbaijan',
'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus',
'Belgium', 'Belize', 'Benin', 'Bhutan', 'Bolivia']
life_expectancy_values = [74.7, 75. , 83.4, 57.6, 74.6, 75.4, 72.3, 81.5, 80.2,
70.3, 72.1, 76.4, 68.1, 75.2, 69.8, 79.4, 70.8, 62.7,
67.3, 70.6]
gdp_values = [ 1681.61390973, 2155.48523109, 21495.80508273, 562.98768478,
13495.1274663 , 9388.68852258, 1424.19056199, 24765.54890176,
27036.48733192, 1945.63754911, 21721.61840978, 13373.21993972,
483.97086804, 9783.98417323, 2253.46411147, 25034.66692293,
3680.91642923, 366.04496652, 1175.92638695, 1132.21387981]
life_expectancy = pd.Series(life_expectancy_values)
gdp = pd.Series(gdp_values)
print life_expectancy
###
0 74.7
1 75.0
2 83.4
3 57.6
4 74.6
5 75.4
6 72.3
7 81.5
8 80.2
9 70.3
10 72.1
11 76.4
12 68.1
13 75.2
14 69.8
15 79.4
16 70.8
17 62.7
18 67.3
19 70.6
###
print life_expectancy[0] #74.7
print gdp[3:6]
for country_life_expectancy in life_expectancy:
    print 'Examining life expectancy {}'.format(country_life_expectancy)
print life_expectancy.mean()
print life_expectancy.std()
print gdp.max()
print gdp.sum()
a = pd.Series([1, 2, 3, 4])
b = pd.Series([1, 2, 1, 2])
print a + b
print a * 2
print a >= 3
print a[a >= 3]
a series is similar to a numpy array, but with extra functionality
e.g. s.describe() prints the mean, standard deviation, median, and other statistics
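A short sketch of describe() on the first five life expectancy values from above. The result is itself a Series indexed by statistic name, so a single statistic can be looked up by its label.

```python
import pandas as pd

s = pd.Series([74.7, 75.0, 83.4, 57.6, 74.6])

summary = s.describe()  # count, mean, std, min, 25%, 50%, 75%, max
print(summary)
print(summary['50%'])   # the median: 74.7
```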
similarities between numpy arrays and pandas series:
- accessing elements
  s[0], s[3:7]
- looping
  for x in s
- convenient functions
  s.mean(), s.max()
- vectorized operations
  s1 + s2
- implemented in C (fast!)
exercise using series
write a function that takes in 2 series (e.g. life expectancy and GDP in 2007)
when a country has a life expectancy above the mean, is its GDP above the mean as well (or vice versa)?
return 2 numbers:
1. number of countries where both values are above or both are below the mean
2. number of countries where one value is above and one is below the mean
hint: you can add booleans in python
True + True = 2
# solution 1
def variable_correlation(variable1, variable2):
    mean1 = variable1.mean()
    mean2 = variable2.mean()
    same_direction = ((variable1 > mean1) & (variable2 > mean2)) | ((variable1 < mean1) & (variable2 < mean2))
    different_direction = ((variable1 > mean1) & (variable2 < mean2)) | ((variable1 < mean1) & (variable2 > mean2))
    num_same_direction = sum(same_direction)
    num_different_direction = sum(different_direction)
    return (num_same_direction, num_different_direction)
# solution 2 (alternative)
def variable_correlation(variable1, variable2):
    both_above = (variable1 > variable1.mean()) & (variable2 > variable2.mean())
    both_below = (variable1 < variable1.mean()) & (variable2 < variable2.mean())
    is_same_direction = both_above | both_below
    num_same_direction = sum(is_same_direction)
    num_different_direction = len(variable1) - num_same_direction
    return (num_same_direction, num_different_direction)
variable_correlation(life_expectancy, gdp)
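12. Series index
a numpy array is an enhanced version of the python list
a pandas series is like a combination of a python list (elements are ordered and accessed by position) and a python dictionary (values are looked up by key)
a pandas series has an index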
import pandas as pd
countries = ['Albania', 'Algeria', 'Andorra', 'Angola', 'Antigua and Barbuda',
'Argentina', 'Armenia', 'Australia', 'Austria', 'Azerbaijan',
'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus',
'Belgium', 'Belize', 'Benin', 'Bhutan', 'Bolivia']
life_expectancy_values = [74.7, 75. , 83.4, 57.6, 74.6, 75.4, 72.3, 81.5, 80.2,
70.3, 72.1, 76.4, 68.1, 75.2, 69.8, 79.4, 70.8, 62.7,
67.3, 70.6]
life_expectancy = pd.Series(life_expectancy_values,index = countries)
Albania 74.7
Algeria 75.0
Andorra 83.4
Angola 57.6
Antigua and Barbuda 74.6
Argentina 75.4
Armenia 72.3
Australia 81.5
Austria 80.2
Azerbaijan 70.3
Bahamas 72.1
Bahrain 76.4
Bangladesh 68.1
Barbados 75.2
Belarus 69.8
Belgium 79.4
Belize 70.8
Benin 62.7
Bhutan 67.3
Bolivia 70.6
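life_expectancy[0] # access a value by position (newer pandas versions prefer iloc for this)
life_expectancy.loc['Angola'] # access a value by index label
life_expectancy.iloc[0] # access a value by position
life_expectancy_noindex = pd.Series(life_expectancy_values) # when no index is specified, the numbers 0, 1, 2, 3, ... serve as the index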
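A small sketch of the two access styles, using a three-country subset of the data above. It also shows that a Series created without an explicit index still has one (0, 1, 2, ...). The name no_index is illustrative.

```python
import pandas as pd

life_expectancy = pd.Series([74.7, 75.0, 83.4],
                            index=['Albania', 'Algeria', 'Andorra'])

print(life_expectancy.loc['Andorra'])  # 83.4, looked up by index label
print(life_expectancy.iloc[0])         # 74.7, looked up by position

no_index = pd.Series([74.7, 75.0, 83.4])
print(list(no_index.index))            # [0, 1, 2]
```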
country with the highest employment
import pandas as pd
countries = [
'Afghanistan', 'Albania', 'Algeria', 'Angola',
'Argentina', 'Armenia', 'Australia', 'Austria',
'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh',
'Barbados', 'Belarus', 'Belgium', 'Belize',
'Benin', 'Bhutan', 'Bolivia', 'Bosnia and Herzegovina',
]
employment_values = [
55.70000076, 51.40000153, 50.5 , 75.69999695,
58.40000153, 40.09999847, 61.5 , 57.09999847,
60.90000153, 66.59999847, 60.40000153, 68.09999847,
66.90000153, 53.40000153, 48.59999847, 56.79999924,
71.59999847, 58.40000153, 70.40000153, 41.20000076,
]
employment = pd.Series(employment_values, index=countries)
def max_employment(employment):
    max_country = employment.idxmax() # returns the index label of the maximum value (argmax returns a position in newer pandas)
    max_value = employment.loc[max_country]
    return (max_country, max_value)
13. Vectorized operations and Series indexes
import pandas as pd
s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print s1 + s2 # add two series that have the same index values
output:
a 11
b 22
c 33
d 44
dtype: int64
import pandas as pd
s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['b', 'd', 'a', 'c'])
print s1 + s2
a 31
b 12
c 43
d 24
dtype: int64
s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['c', 'd', 'e', 'f'])
print s1 + s2
a NaN
b NaN
c 13.0
d 24.0
e NaN
f NaN
dtype: float64
s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['e', 'f', 'g', 'h'])
print s1 + s2
a NaN
b NaN
c NaN
d NaN
e NaN
f NaN
g NaN
h NaN
dtype: float64
vectorized operations on series match values by index, not by position
14. Filling in missing values
s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['c', 'd', 'e', 'f'])
sum_result = s1+s2
sum_result.dropna() # drop the missing values
output:
c 13.0
d 24.0
dtype: float64
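Dropping is not the only option. As a sketch of actually filling the gaps, Series.add accepts a fill_value that stands in for a label missing from one of the two series. The name filled_sum is illustrative.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['c', 'd', 'e', 'f'])

# Treat a value missing from either series as 0 before adding.
filled_sum = s1.add(s2, fill_value=0)
print(filled_sum)
```

With fill_value=0, labels present in only one series keep their own value instead of becoming NaN.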
15. pandas Series apply()
non built-in calculations
so far, you've used built-in functions (e.g. mean()) and operations (e.g. +)
but when there is no built-in function and a simple vectorized operation will not do, consider the following approaches:
- treat the series as a list (for loops, etc.)
- use the function apply()
apply() takes a series and a function, and returns a new series
import pandas as pd
s = pd.Series([1, 2, 3, 4, 5])
def add_one(x):
    return x + 1
print s.apply(add_one)
###
0 2
1 3
2 4
3 5
4 6
dtype: int64
###
states = pd.Series([
'California',
'OH',
'Michigan',
'NY'])
def clean_state(state):
    if len(state) == 2:
        return state
    elif state == 'California':
        return 'CA'
    elif state == 'Michigan':
        return 'MI'
clean_states = states.apply(clean_state)
print clean_states
compared with a loop, the advantage of apply() is that the code is more concise, and it can also run faster
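To make the comparison concrete, the sketch below writes the add_one example both ways and checks that the results agree. It only illustrates the difference in code length and does not measure speed; the names looped and applied are illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

def add_one(x):
    return x + 1

# Loop version: collect the results one element at a time.
results = []
for x in s:
    results.append(add_one(x))
looped = pd.Series(results, index=s.index)

# apply() version: one line.
applied = s.apply(add_one)

print(looped.tolist() == applied.tolist())  # True
```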
names = pd.Series([
'Andre Agassi',
'Barry Bonds',
'Christopher Columbus',
'Daniel Defoe',
'Emilio Estevez',
'Fred Flintstone',
'Greta Garbo',
'Humbert Humbert',
'Ivan Ilych',
'James Joyce',
'Keira Knightley',
'Lois Lane',
'Mike Myers',
'Nick Nolte',
'Ozzy Osbourne',
'Pablo Picasso',
'Quirinus Quirrell',
'Rachael Ray',
'Susan Sarandon',
'Tina Turner',
'Ugueth Urbina',
'Vince Vaughn',
'Woodrow Wilson',
'Yoji Yamada',
'Zinedine Zidane'
])
def reverse_name(name):
    splited_name = name.split(' ')
    firstname = splited_name[0]
    lastname = splited_name[1]
    return lastname + ', ' + firstname
def reverse_names(names):
    return names.apply(reverse_name)
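16. Plotting in pandas
If the variable data is a NumPy array or a pandas Series, then, just as if it were a list,
the code
import matplotlib.pyplot as plt
plt.hist(data)
will create a histogram of the data.
pandas also has built-in plotting functions that use matplotlib behind the scenes, so if data is a Series, you can use data.hist()
to create a histogram.
In this case there is no difference between the two, but sometimes the pandas wrapper is more convenient. For example, you can use data.plot() to create a line plot of a Series. The Series index is used for the x axis and the values for the y axis.
If you run the plotting code locally, you may need to add the line plt.show().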
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# this line is needed when using a jupyter notebook
%pylab inline
path = '/datasets/ud170/gapminder/'
employment = pd.read_csv(path + 'employment_above_15.csv', index_col='Country')
female_completion = pd.read_csv(path + 'female_completion_rate.csv', index_col='Country')
male_completion = pd.read_csv(path + 'male_completion_rate.csv', index_col='Country')
life_expectancy = pd.read_csv(path + 'life_expectancy.csv', index_col='Country')
gdp = pd.read_csv(path + 'gdp_per_capita.csv', index_col='Country')
# The following code creates a Pandas Series for each variable for the United States.
# You can change the string 'United States' to a country of your choice.
employment_us = employment.loc['United States']
female_completion_us = female_completion.loc['United States']
male_completion_us = male_completion.loc['United States']
life_expectancy_us = life_expectancy.loc['United States']
gdp_us = gdp.loc['United States']
employment_us.plot() # draws a line plot; the x axis is the index of employment_us