I. Background
1. Significance
Concept Learning, as one of the earliest machine learning methods, "appears" to have little practical value today. It does, however, have considerable theoretical value: before moving on to complex methods such as neural networks, it helps beginners grasp some basic ideas and notions of machine learning (e.g. instance space, hypothesis space).
2. Key points
Learning from examples
General-to-specific ordering over hypotheses (inductive learning)
Version spaces and the Candidate-Elimination algorithm!! (the core)
Picking new examples
The need for inductive bias
Note: we assume no noise; the goal is to illustrate the key concepts.
The final output is generally a binary result: 1 or 0.
II. Problem and Task
1. Problem
Automatically inferring the general definition of some concept, given examples labeled as members or nonmembers of the concept.
2. Concept Learning
Inferring a boolean-valued function from training examples of its input and output.
This is actually very similar to Logistic Regression: given input data, the model answers 1 or 0.
3. Task: EnjoySport Case
Input data format:
sky     temp   humid    wind     water   forecast   enjoysport(Y)
sunny   warm   normal   strong   warm    same       Y
rainy   cold   high     strong   warm    change     N
As you can see, the EnjoySport column plays the role of the y-value column.
We hope that, after some training, the machine can answer Y or N for any given set of attribute values, i.e. whether to enjoy the sport.
III. Representing hypotheses (formal representation)
1. The placeholder ?
We adopt the convention that each attribute value in a hypothesis is either a specific value or ?. Here ? means the value is unconstrained at this position, so it can be any allowed value; for example, if Temp only allows the two values {Warm, Cold}, then a ? at that position means either Warm or Cold. The original slides also used the symbol ∅, but I find it very easy to misread (it does not mean a missing value; it means the hypothesis evaluates to 0/No for every instance), so it is best not to use it at all.
2. Notation (very important)
1) X: the set of instances
The instance set: it holds the data row by row, and each row has attributes 1..p.
2) c: target concept
c is essentially a function: given a record x[i] from the set X, it returns a value y[i], i.e. y[i] = c(x[i])
3) training example
an instance x from X, along with its target concept value c(x).
In other words, take a record x from the set X together with its value y = c(x); the resulting pair (x, y) (like a LabeledVector in Scala) is a training example. Examples can be positive or negative (Positive/Negative Example), i.e. the y=1 or y=0 cases.
4) D: the set of available training examples
The training set: each example is an (x, y) pair with its attributes and its y value.
5) H: the set of all possible hypotheses
All possible hypotheses, i.e. the set of all candidate model functions.
6) h: a hypothesis, i.e. a boolean-valued function / model
each hypothesis h in H represents a boolean-valued function defined over X; that is, h: X -> {0, 1}.
Therefore, the goal of the learner is to find a hypothesis h such that h(x) = c(x) for all x; in other words, to find a model function that fits all the training data as perfectly as possible.
3. Learning task representation
1) Instances X
We have multiple records, each representing a day. Together they make up the set of instances called X.
Attributes: Sky, AirTemp, ...
2) target concept c
We try to learn a concept, a model of EnjoySport, based on our data.
target concept c: EnjoySport : X -> {0, 1}
3) training example
4) D
This is the data we plan to use as the training set; here we might take 70% of the data for training (together with their y values, of course) and keep the rest for validation.
D = {<x1, c(x1)>, <x2, c(x2)>, ...}
5) H, h
Hypothesis space H: conjunctions of literals (constraints on attribute values)
For example, one hypothesis h might be <?, Cold, High, ?, ?, ?> -> Y
Note: in machine learning, a hypothesis h is essentially a synonym for a concrete model / function formula. A minimal sketch of this representation follows below.
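To make this concrete, here is a minimal Python sketch of the representation (the function name matches is my own, not from the slides): a hypothesis is a tuple of attribute constraints, and h(x) = 1 iff every constraint is satisfied.

def matches(h, x):
    """h(x) = 1 iff every attribute constraint in h is satisfied by instance x."""
    return all(c == '?' or c == v for c, v in zip(h, x))

h = ('?', 'Cold', 'High', '?', '?', '?')
x = ('Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change')
print(matches(h, x))  # True -> this day would be classified as Y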
IV. Concept learning as search
First, let's review the idea behind this method once more.
Idea: any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function well over other unobserved examples.
1. Instance space X
Let's first look at how many possible values each attribute has:
Sky: 3 possible values
AirTemp: 2 possible values
Humid: 2
Wind: 2
Water: 2
Forecast: 2
The instance space X therefore contains 3*2*2*2*2*2 = 96 distinct instances.
2. Hypothesis space H
If we allow both placeholders ? and ∅, the space contains 5*4*4*4*4*4 = 5120 hypotheses; this is the number of syntactically distinct hypotheses. But as discussed above, ∅ is a very easy-to-misunderstand notion, and we can largely ignore it.
What really matters is this:
If we drop ∅ and keep only ?, which stands for "any value", the truly meaningful H space has 1 + 4*3*3*3*3*3 = 973 hypotheses. The leading 1 stands for all the ∅-containing hypotheses we dropped: they all classify every instance as negative, so semantically they collapse into a single hypothesis. Whether or not you include this 1 hardly matters; 973 is the number of semantically distinct hypotheses, i.e. the effective size of the H space.
Note that the X space (96) < the H space (973); this is because H allows some attributes to be replaced by ?, which stands for any value. This increases the generality (breadth of applicability) of the hypotheses.
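These counts are easy to verify in code; a quick sketch using the per-attribute value counts above:

from math import prod

values = [3, 2, 2, 2, 2, 2]             # possible values per attribute
print(prod(values))                     # 96   distinct instances
print(prod(v + 2 for v in values))      # 5120 syntactically distinct hypotheses (? and the null constraint added per attribute)
print(1 + prod(v + 1 for v in values))  # 973  semantically distinct hypotheses (only ? added, plus the single always-No hypothesis)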
3. Generality (how broadly a hypothesis applies)
h1=< Sunny, ?, ?, Strong, ?, ? >
h2=< Sunny, ?, ?, ?, ?, ? >
h2 is more general than h1.
As you can see in the diagram, the lower an h sits, the greater its generality.
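For this conjunctive representation, the more-general-than-or-equal relation reduces to a simple per-attribute test (a sketch; the function name is mine):

def more_general_or_equal(h2, h1):
    """True iff every instance that satisfies h1 also satisfies h2 ('?' = any value)."""
    return all(c2 == '?' or c2 == c1 for c2, c1 in zip(h2, h1))

h1 = ('Sunny', '?', '?', 'Strong', '?', '?')
h2 = ('Sunny', '?', '?', '?', '?', '?')
print(more_general_or_equal(h2, h1))  # True: h2 is more general than h1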
V. The basics: the Find-S algorithm (positive examples only)
1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x:
       For each attribute constraint a[i] in h:
           If the constraint a[i] in h is satisfied by x:
               then do nothing
           Else:
               replace a[i] in h by the next more general constraint that is satisfied by x
3. Output hypothesis h
In essence, this is a process of progressively relaxing the hypothesis.
As you can see, the Else branch, "replace a[i] by the next more general constraint", is exactly the step that relaxes Normal to ?, or relaxes (Warm, Same) to (?, ?). A runnable sketch follows below.
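Here is a runnable Find-S sketch (my own Python rendering of the pseudocode above; 'NULL' stands for the maximally specific empty constraint, and a made-up third positive row is added so the relaxation step actually fires):

def find_s(examples):
    """examples: list of (x, y) pairs; x is a tuple of attribute values, y is 'Y' or 'N'."""
    h = ['NULL'] * len(examples[0][0])    # the most specific hypothesis in H
    for x, y in examples:
        if y != 'Y':
            continue                      # Find-S simply ignores negative examples
        for i, v in enumerate(x):
            if h[i] == 'NULL':
                h[i] = v                  # first positive example: adopt its values
            elif h[i] != v:
                h[i] = '?'                # relax the contradicted constraint to ?
    return tuple(h)

data = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'),   'Y'),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), 'N'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'),   'Y'),  # hypothetical extra row
]
print(find_s(data))  # ('Sunny', 'Warm', '?', 'Strong', 'Warm', 'Same')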
As a basic algorithm, Find-S has many problems.
It looks only at positive examples and ignores all the negative ones, which can seriously harm the correctness of its generalization.
VI. Version Spaces and the improved Candidate-Elimination algorithm
1. Consistent & VersionSpace
Consistent (correct on the whole training set): a hypothesis h is consistent if and only if h(x) = c(x) for each training example <x, c(x)> in D
Version Space: VS_{H,D} = {h ∈ H | Consistent(h, D)}, i.e. every hypothesis h in the version space satisfies Consistent (correctness on D).
Or, by a simpler alternative definition, it is just a list of all (still viable) hypotheses h.
Note that the version space, viewed as a list, changes dynamically: as the algorithm runs, the h's inside it are repeatedly tested, and the wrong ones are thrown away. A small consistency check in code follows below.
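The Consistent predicate translates directly into code (a sketch reusing the matches test from the representation section):

def matches(h, x):
    return all(c == '?' or c == v for c, v in zip(h, x))

def consistent(h, examples):
    """True iff h(x) = c(x) for every training example <x, c(x)> in D."""
    return all(matches(h, x) == (y == 'Y') for x, y in examples)

D = [(('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'),   'Y'),
     (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), 'N')]
print(consistent(('Sunny', '?', '?', '?', '?', '?'), D))  # True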
2. The List-Then-Eliminate algorithm
In a nutshell: generate a list, then repeatedly try to eliminate items from it, and finally return the leftover list (see the runnable sketch after the pseudocode below).
1. VersionSpace <- a list containing every hypothesis in H
2. For each training example <x, c(x)>:
remove from Version Space any hypothesis h for which h(x) != c(x)
3. Output the list of hypotheses in Version Space
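A brute-force sketch of List-Then-Eliminate (assuming the standard EnjoySport value domains from Mitchell's book, e.g. Cloudy for Sky, which do not all appear in the two sample rows above):

from itertools import product

DOMAINS = [('Sunny', 'Cloudy', 'Rainy'), ('Warm', 'Cold'), ('Normal', 'High'),
           ('Strong', 'Weak'), ('Warm', 'Cool'), ('Same', 'Change')]

def matches(h, x):
    return all(c == '?' or c == v for c, v in zip(h, x))

def list_then_eliminate(examples):
    # enumerate every ?/value combination: 4*3*3*3*3*3 = 972 hypotheses
    # (the single always-negative hypothesis is omitted for brevity)
    version_space = list(product(*[('?',) + d for d in DOMAINS]))
    for x, y in examples:
        version_space = [h for h in version_space if matches(h, x) == (y == 'Y')]
    return version_space

D = [(('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'),   'Y'),
     (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), 'N')]
print(len(list_then_eliminate(D)))  # number of hypotheses that survive both examples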
3. Algorithm analysis
This algorithm guarantees correctness, because every positive and negative example is used for checking: any model h that makes even a single mistake is discarded.
However, as you can see, the algorithm has to check every h ∈ H (nearly 1000 in this case), which is extremely time-consuming. In real applications, once the data grows, the H space is often in the hundreds of millions, and the algorithm runs very slowly.
VII. The Candidate-Elimination algorithm (a compact version space representation)
1. The G and S boundaries
G: the general boundary, the set of maximally general members
S: the specific boundary, the set of maximally specific members
So, within the version space, every h satisfies S <= h <= G in terms of generality.
2. The algorithm (a runnable sketch follows the pseudocode)
define: G <- the maximally general hypotheses in H
define: S <- the maximally specific hypotheses in H
For each training example d:
    If d is a positive example:
        Remove from G any hypothesis inconsistent with d
        For each hypothesis s in S that is not consistent with d:
            Remove s from S  // not a mistake: this formulation first deletes s, then re-adds new s's reflecting what all the d's seen so far have in common; a more efficient variant would instead just relax the attributes that contradict the newly introduced d
            Add to S all minimal generalizations h of s such that  // keeps S maximally specific
                1. h is consistent with d and the previous d's, and
                2. some member of G is more general than h
        Remove from S any hypothesis that is more general than another hypothesis in S  // remove redundancy: if s2 subsumes s1, s2 should be dropped
    If d is a negative example:
        Remove from S any hypothesis inconsistent with d
        For each hypothesis g in G that is not consistent with d:  // keeps G maximally general
            Remove g from G
            Add to G all minimal specializations h of g such that
                1. h is consistent with d, and
                2. some member of S is more specific than h  // keeps G and S compatible
        Remove from G any hypothesis that is less general than another hypothesis in G  // remove redundancy: if g2 subsumes g1, g1 should be dropped
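Below is a runnable Python sketch of this algorithm for the conjunctive EnjoySport representation (the value domains and the four-example dataset are the standard ones from Mitchell's book, not just the two sample rows above; 'NULL' is the empty constraint):

DOMAINS = [('Sunny', 'Cloudy', 'Rainy'), ('Warm', 'Cold'), ('Normal', 'High'),
           ('Strong', 'Weak'), ('Warm', 'Cool'), ('Same', 'Change')]

def matches(h, x):
    return all(c == '?' or c == v for c, v in zip(h, x))

def more_general_or_equal(h2, h1):
    """h2 >= h1 in generality; 'NULL' is more specific than anything."""
    return all(c2 == '?' or c1 == 'NULL' or c2 == c1 for c2, c1 in zip(h2, h1))

def candidate_elimination(examples):
    n = len(DOMAINS)
    S = {('NULL',) * n}                  # maximally specific boundary
    G = {('?',) * n}                     # maximally general boundary
    for x, y in examples:
        if y == 'Y':
            G = {g for g in G if matches(g, x)}
            new_S = set()
            for s in S:
                if matches(s, x):
                    new_S.add(s)
                    continue
                # the unique minimal generalization of s that covers x
                h = tuple(v if c == 'NULL' else (c if c == v else '?')
                          for c, v in zip(s, x))
                if any(more_general_or_equal(g, h) for g in G):
                    new_S.add(h)
            S = {s for s in new_S
                 if not any(t != s and more_general_or_equal(s, t) for t in new_S)}
        else:
            S = {s for s in S if not matches(s, x)}
            new_G = set()
            for g in G:
                if not matches(g, x):
                    new_G.add(g)
                    continue
                # minimal specializations: pin one '?' slot to a value that rules x out
                for i, domain in enumerate(DOMAINS):
                    if g[i] != '?':
                        continue
                    for v in domain:
                        if v != x[i]:
                            h = g[:i] + (v,) + g[i + 1:]
                            if any(more_general_or_equal(h, s) for s in S):
                                new_G.add(h)
            G = {g for g in new_G
                 if not any(t != g and more_general_or_equal(t, g) for t in new_G)}
    return S, G

data = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'),   'Y'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'),   'Y'),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), 'N'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), 'Y'),
]
S, G = candidate_elimination(data)
print(S)  # {('Sunny', 'Warm', '?', 'Strong', '?', '?')}
print(G)  # {('Sunny', '?', '?', '?', '?', '?'), ('?', 'Warm', '?', '?', '?', '?')}

The printed S and G agree with the S4 and G4 boundaries in the book's illustrated trace.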
3. Process illustration
As you can see in the trace, S2 is the least general h permitted under the conditions so far.
The algorithm's behavior here is interesting: it first generates the set G3 of hypotheses that rule out this negative example, and then deletes those that cannot stay consistent with S2.
Note that on the next (positive) example we first delete every non-matching g ∈ G, which is why the third g is removed.
Then we check each s ∈ S; clearly Water and Forecast need to be relaxed here, so the original attribute values (Warm, Same) are relaxed to (?, ?).
The remaining version space constitutes the final H, where S represents the specific boundary and G the general boundary.