PAC is used to formalize and address following questions: - What can
be learned efficiently? - What is inherently hard to learn? - How many
examples are needed to learn successfully? - Is there a general model of
learning?
We denote be the set
of all possible examples, input space. The set of all posible labels or
targets is . A
concept is a mapping from to . A concept class is
a set of concepts is denoted by . The learner considers a
fixed set of possible concepts , called a hypothesis
set, which might not necessarily coincide with . It
Generalization error Given a hypothesis , a target concept , and an underlying
distribution , the
generalization error or risk of
is defined by
where is the indicator
function of the event . The
generalization error of a hypothesis is not directly accessible to the
learner, but learner can measure the empirical error of a
hypothesis on the labeled sample .
Empirical error Given a hypothesis , a target concept , and a sample , the empirical
error or empirical risk of is
defined by:
Thus, the empirical error of is its average error over the sample , while the generalization error is its
expected error based on the distribution . For a fixed , the expectation of the
empirical error on is equal to
the generalization error:
Indeed, by the linearity of the expectation and the fact that the
sample is drawn i.i.d., we can write
for any in sample . Thus
PAC-learning A concept class is said to be PAC-learnable
if there exists an algorithm and a polynomial funcion
such that for any
and , for all
distributions on and for any target concept
, the following
holds for any sample size :
If futher runs in
, then is said to be efficiently
PAC-learnable. When such an algorithm exists, it is called a
PAC-learning algorithm for .
Probably approximately correct (PAC) learning theory helps analyze
whether and under what conditions a learner will probably output an approximately
correct classifier.
First, let's define "approximate." A hypothesis is approximately correct
if its error over the distribution of inputs is bounded by some , . I.e.,
,
where is the distribution over
inputs.
Next, "probably." If will
output such a classifier with probability , with , we call
that classifier probably approximately correct.
Knowing that a target concept is PAC-learnable allows you to bound
the sample size necessary to probably learn an approximately correct
classifier, which is what's shown in the formula you've reproduced:
Shattering, VC-dimension
When the hypothesis spaces are infinite, the VC-dimension provides
such a measure.
Definition: Shattering (1) - the set of splittings of dataset
using concepts from .
shatters if .
A set of points is shattered
by is there are hypotheses in
that split in all of the possible ways; i.e.,
all possible ways of classifying points in are achievable using concepts in .
Definition: Shattering (2) A set of size of is shattered by if ,
i.e. all possible labelings of the set are realized by function in .
Definition: VC-dimension (1) The
VC-dimension of a hypothesis space is the cardinality of the largest set
that can be shattered by . If arbitraily large finite sets can be
shattered by , then .
Definition: VC-Dimension (2) VC-dim() = cardinality of the largest
set shattered by .
References
[1]@book{mohri2018foundations, title={Foundations
of machine learning}, author={Mohri, Mehryar and Rostamizadeh, Afshin
and Talwalkar, Ameet}, year={2018} }