




The Generalization Performance of ERM Algorithm with Strongly Mixing Observations

Bin Zou, Luoqing Li
Faculty of Mathematics and Computer Science, Hubei University, Wuhan 430062, China

November 23, 2007

Abstract

The generalization performance is the main purpose of machine learning theoretical research. The previous main bounds describing the generalization ability of the ERM algorithm are based on independent and identically distributed (i.i.d.) samples. When the complexity of the given function set is high, the problem of solving the ERM algorithm is usually ill-posed and overfitting may happen. In order to study the generalization performance of the ERM algorithm with dependent observations, in this paper we first decompose the given function set into different compact subsets, and then we establish an exponential bound on the rate of relative uniform convergence on these compact subsets for the ERM algorithm with strongly mixing observations. In the end, we obtain bounds on the generalization ability of the ERM algorithm with strongly mixing observations over the given function set, which extend the previous results for i.i.d. observations to the case of strongly mixing observations.

Keywords: generalization performance, principle of ERM, relative uniform convergence, exponentially strongly mixing

Supported in part by NSFC under Grant 10771053 and by the National Research Foundation for the Doctoral Program of Higher Education of China (SRFDP) under Grant 200605120019.

1 Introduction

Recently there has been a large increase of interest in theoretical issues in the machine learning community. This is mainly due to the fact that statistical learning theory has demonstrated its usefulness by providing the ground for developing successful and well-founded learning algorithms such as support vector machines (SVM) [21]. This renewed interest in theory naturally boosted the development of performance bounds [4, 5, 6, 7, 16, 17, 27, 29].

The setting of learning theory is that, given a set of data consisting of labeled objects, the goal is to find a function that assigns labels to objects such that, if new objects are given, this function will label them correctly. We assume that all the data is generated by the same process, that is, the data is sampled from a fixed unknown probability distribution, and the only knowledge we have about it comes from a sample of observations. Typically, we choose a class of possible functions that corresponds to the possible ways in which the labels can be related to the objects, and then we choose a function in that class which agrees as much as possible with the given data. Therefore the problems that learning from random sampling should address are how to choose an appropriate class of functions and how to choose a function in that class [4]. In the sequel, we are interested in how to choose a function from a given function set.

Since we assume that the data is sampled from a certain distribution, it is possible to relate the observed behavior of a function on the data to its expected behavior on future data by means of probability theory. More precisely, for each given function in the class of interest, one can obtain confidence intervals for the expected mislabel error (expected error) in terms of the observed mislabel error (empirical error). This is not enough to guarantee that, in a given class, a function which has a small empirical error will have a small expected error [4].
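To make the distinction between empirical and expected error concrete, the following minimal Python sketch (not from the paper; the distribution, the threshold class, and all numerical values are illustrative assumptions) picks the empirical-risk minimizer from a small finite class of threshold classifiers and compares its empirical error on the training sample with a Monte Carlo estimate of its expected error.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Draw n labeled points: x ~ U(0,1), y = 1[x > 0.3], with 10% label noise."""
    x = rng.uniform(0, 1, n)
    y = (x > 0.3).astype(int)
    flip = rng.uniform(0, 1, n) < 0.1
    y[flip] = 1 - y[flip]
    return x, y

# A small class of candidate functions: threshold classifiers f_t(x) = 1[x > t].
thresholds = np.linspace(0, 1, 21)

def empirical_error(t, x, y):
    """Empirical risk of f_t under the 0-1 loss."""
    return np.mean((x > t).astype(int) != y)

# ERM: pick the function in the class with the smallest empirical risk.
x_train, y_train = sample(50)
t_hat = min(thresholds, key=lambda t: empirical_error(t, x_train, y_train))

# Monte Carlo estimate of the expected risk of the chosen function.
x_test, y_test = sample(100_000)
print("chosen threshold:", t_hat)
print("empirical risk :", empirical_error(t_hat, x_train, y_train))
print("expected risk  :", empirical_error(t_hat, x_test, y_test))
```

The gap between the two printed risks is exactly the deviation that generalization bounds aim to control uniformly over the class.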
Considering our sample as a random variable, we see that the empirical error of each function in our class is a random variable, so that we have a collection of random variables. If we want a bound on the expected error of the learning algorithm, then, since the particular function that the algorithm will pick after seeing the data is not known in advance, one has to bound uniformly the deviations of this collection of random variables. Therefore the notion of the size of a collection of random variables (indexed by a class of functions) is crucial in learning theory. Measures of the size of a collection of random variables include covering numbers [28], packing numbers [1], entropy numbers [26], the VC-dimension for indicator functions [10, 21], the $V_\gamma$-dimension (or $P_\gamma$-dimension) for real-valued functions [1], etc.

In order to measure the generalization ability of algorithms that minimize the empirical risk with i.i.d. observations, Vapnik [21] established exponential bounds on the rates of uniform convergence and relative uniform convergence for i.i.d. observations. Bousquet [4] obtained a generalization of the Vapnik and Chervonenkis bounds by using a measure of the size of function classes, the local Rademacher average. Cucker and Smale [6] and Smale and Zhou [16, 17] considered the least squares error, decomposed the difference between the achieved risk and the minimal possible risk into two parts, the sample error and the approximation error, and then bounded the sample error and the approximation error based on i.i.d. observations over a compact subset of the hypothesis space. Chen et al. [5] obtained a bound on the excess expected risk with i.i.d. observations for pattern recognition by introducing a projection operator.

However, independence is a very restrictive concept in several ways [23]. First, it is often an assumption, rather than a deduction on the basis of observations. Second, it is an all-or-nothing property, in the sense that two random variables are either independent or they are not; the definition does not permit an intermediate notion of being nearly independent. As a result, many of the proofs based on the assumption that the underlying stochastic sequence is i.i.d. are rather "fragile". The notion of mixing allows one to put the notion of "near independence" on a firm mathematical foundation and, moreover, permits one to derive a robust rather than a "fragile" theory. Many machine learning applications, such as market prediction, system diagnosis, and speech recognition, are inherently temporal in nature and consequently not i.i.d. processes [20].
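The following short simulation (an illustration added here, not part of the paper; the AR(1) model and its parameters are assumptions) shows why dependence matters for empirical averages: a strongly dependent but stationary sequence and an i.i.d. sequence with the same marginal variance are compared through the spread of their empirical means.

```python
import numpy as np

rng = np.random.default_rng(1)

def ar1(n, phi=0.9):
    """Stationary AR(1): z_t = phi*z_{t-1} + e_t, scaled so Var(z_t) = 1.
    Such a process is strongly mixing with exponentially decaying
    coefficients, yet far from independent."""
    sigma_e = np.sqrt(1 - phi**2)   # keeps the marginal variance equal to 1
    z = np.empty(n)
    z[0] = rng.normal()
    for t in range(1, n):
        z[t] = phi * z[t - 1] + rng.normal(scale=sigma_e)
    return z

n, trials = 200, 2000
iid_means = [rng.normal(size=n).mean() for _ in range(trials)]
ar_means = [ar1(n).mean() for _ in range(trials)]

# Both empirical means are centered at 0, but the dependent sequence's mean
# fluctuates much more for the same nominal sample size n.
print("std of empirical mean, i.i.d.:", np.std(iid_means))
print("std of empirical mean, AR(1) :", np.std(ar_means))
```

On a sample of the same nominal size, the dependent sequence behaves roughly like a much smaller i.i.d. sample, which is why uniform convergence results must be re-derived, rather than simply reused, for mixing processes.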
Therefore, Yu [3] established rates of uniform convergence of the empirical means to their means based on stationary mixing sequences. White [24] considered cross-validated regression estimators for strongly mixing processes and established convergence, without rates, of these estimators. Modha and Masry [18] established minimum complexity regression estimation with m-dependent observations and strongly mixing observations, respectively. Vidyasagar [23] considered the notions of mixing and proved that most of the desirable properties (e.g., the PAC property or the UCEMUP property) of i.i.d. sequences are preserved when the underlying sequence is a mixing sequence. Nobel and Dembo [15] proved that, if a family of functions has the property that empirical means based on an i.i.d. sequence converge uniformly to their expected values as the number of samples approaches infinity, then the family of functions continues to have the same property if the i.i.d. sequence is replaced by a β-mixing sequence. Karandikar and Vidyasagar [14] extended this result to the case where the underlying probability measure is itself not fixed but varies over a family of measures. Vidyasagar [23] obtained the rate of uniform convergence of the empirical means to their means with α-mixing sequences for a finite family of measurable functions. Steinwart, Hush and Scovel [20] studied the learnability of SVMs for α-mixing (not necessarily stationary) processes for both classification and regression. Zou et al. [30] established bounds on the rate of uniform convergence of learning machines with strongly mixing observations.

In this paper we first obtain a bound on the relative difference between the empirical risks and their expected risks by using an argument similar to that used by Modha and Masry in their proof of the Bernstein inequality for strongly mixing sequences [18]; we then extend the previous bounds [4, 21] on the rate of relative uniform convergence to the case where the i.i.d. observations are replaced by strongly mixing observations, and improve the results of [30, 23]. The rest of this paper is organized as follows. In Section 2 we introduce some notation. In Section 3 we obtain bounds on the rate of relative uniform convergence for learning machines with strongly mixing observations. We establish bounds on the generalization ability of the ERM algorithm with strongly mixing sequences for the given function set in Section 4.

2 Preliminaries

We introduce some notation and make some preparations in this section.

Let $z = \{z_i = (x_i, y_i)\}_{i=1}^{\infty}$ be a stationary real-valued sequence on a probability space $(\Omega, \mathcal{B}, P)$. For $i \geq 1$, denote by $\alpha(i)$ the $\alpha$-mixing (strongly mixing) coefficients of the sequence,
$$\alpha(i) = \sup_{k \geq 1}\ \sup\big\{\,|P(A \cap B) - P(A)P(B)| : A \in \sigma(z_1, \ldots, z_k),\ B \in \sigma(z_{k+i}, z_{k+i+1}, \ldots)\,\big\}.$$

Assumption 1. The sequence $z$ is exponentially strongly mixing, that is, there exists a constant $\gamma > 0$ such that
$$\alpha(i) \leq c\,\exp(-\beta\, i^{\gamma}) \qquad \text{for } i \geq 1,\ \beta > 0,\ c > 0,$$
where the constants $\beta$ and $c$ are assumed to be known.

Remark 1 [18]. Assumption 1 is satisfied by a large class of processes. For example, certain linear processes (which include certain ARMA processes) satisfy the assumption with $\gamma = 1$ [25], and certain aperiodic, Harris-recurrent Markov processes (which include certain bilinear processes, nonlinear ARX processes, and ARH processes) satisfy the assumption [9]. As a trivial example, i.i.d. random variables satisfy the assumption with $\gamma = \infty$.

Denote by $S$ the set of $n$ observations
$$S = \{z_1, z_2, \ldots, z_n\}$$
drawn from the exponentially strongly mixing sequence $z$. The goal of machine learning from random sampling is to find a function $f$ that assigns values to objects such that, if new objects are given, the function $f$ will forecast them correctly. Let
$$\mathcal{E}(f) = E\,\ell(f, z) = \int_Z \ell(f, z)\, dP \qquad (1)$$
be the expected error (or expected risk) of the function $f$, where the function $\ell(f, z)$, which is integrable for any $f$ and depends on $f$ and $z$, is called the loss function. In this paper we would like to establish a general framework which includes both pattern recognition and regression estimation; the important feature of the regression estimation problem is that the loss function $\ell(f, z)$ can take arbitrary non-negative values, whereas in the pattern recognition problem it can take only two values. Throughout the article we require that, for all $z \in Z$,
$$|\ell(f, z)| \leq M.$$
Let
$$\mathcal{F} = \{f : a \leq \mathcal{E}(f) \leq b\}$$
be the compact set of admissible functions, where $a, b$ are positive constants and $b$ satisfies $b \leq M$. A learning task is to find the minimizer of the expected risk over the function set $\mathcal{F}$.
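As a purely illustrative instance of these definitions (not taken from the paper: the data distribution, the clipped squared loss, and the constants a, b, M below are assumptions chosen for the example), the following sketch approximates the expected risk (1) by Monte Carlo integration and checks which candidate functions fall in the admissible set F.

```python
import numpy as np

rng = np.random.default_rng(2)
M = 16.0         # uniform bound on the loss, |ell(f, z)| <= M
a, b = 0.5, 4.0  # admissible risk range defining F = {f : a <= E(f) <= b}, b <= M

def loss(f, z):
    """Bounded (clipped) squared loss ell(f, z) = min((f(x) - y)^2, M), z = (x, y)."""
    x, y = z
    return np.minimum((f(x) - y) ** 2, M)

def expected_risk(f, n_mc=200_000):
    """Monte Carlo approximation of E(f) = ∫ ell(f, z) dP for an assumed toy
    distribution: x ~ U(0,1), y = sin(2*pi*x) + Gaussian noise."""
    x = rng.uniform(0, 1, n_mc)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n_mc)
    return loss(f, (x, y)).mean()

# A few candidate functions; only those with a <= E(f) <= b belong to F.
candidates = {
    "zero":      lambda x: 0.0 * x,
    "linear":    lambda x: 1.0 - 2.0 * x,
    "too-large": lambda x: 3.0 + 0.0 * x,
}
for name, f in candidates.items():
    risk = expected_risk(f)
    print(f"{name:10s} E(f) ~ {risk:.3f}  in F: {a <= risk <= b}")
```

The constants a and b play the role of the risk levels used below to slice the admissible set into shells of comparable expected risk.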
Since one knows only the set $S$ of random samples instead of the distribution $P$, the minimizer of the expected risk cannot be computed directly. According to the principle of empirical risk minimization (ERM), we minimize, instead of the expected risk (1), the so-called empirical risk (or empirical error)
$$\mathcal{E}_n(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(f, z_i).$$
Let $f^*$ be a function minimizing the expected risk $\mathcal{E}(f)$ over $f \in \mathcal{F}$, i.e.,
$$f^* = \arg\min_{f \in \mathcal{F}} \mathcal{E}(f) = \arg\min_{f \in \mathcal{F}} \int_Z \ell(f, z)\, dP.$$
We define $f_z$ to be a function minimizing the empirical risk $\mathcal{E}_n(f)$ over $f \in \mathcal{F}$, i.e.,
$$f_z = \arg\min_{f \in \mathcal{F}} \mathcal{E}_n(f) = \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \ell(f, z_i).$$
According to the principle of ERM, we shall consider the function $f_z$ as an approximation of the target function $f^*$.

To measure the generalization performance of the ERM algorithm, Vapnik [21], Bousquet [4], and Cucker and Smale [6] obtained bounds on the rate at which the empirical risks converge uniformly to their expected risks on a given function set $\mathcal{H}$ based on i.i.d. sequences; that is, for any $\varepsilon > 0$, they bounded the term
$$\mathrm{Prob}\Big\{\sup_{f \in \mathcal{H}} |\mathcal{E}(f) - \mathcal{E}_n(f)| \geq \varepsilon\Big\}. \qquad (2)$$
Yu [3] and Vidyasagar [23] also obtained convergence rates for the term (2) based on α-mixing and β-mixing sequences.

However, if the complexity of the set $\mathcal{H}$ is high, the problem of solving the ERM algorithm is usually ill-posed and overfitting may happen. Moreover, the term (2) fails to capture the phenomenon that, for those functions $f \in \mathcal{H}$ for which $\mathcal{E}(f)$ is small, the deviation $\mathcal{E}_n(f) - \mathcal{E}(f)$ is also small with large probability. In order to extend the results in [4, 6, 21] to the case where the i.i.d. sequence is replaced by a strongly mixing sequence, improve the estimates in [3, 30, 23], and address the ill-posedness and overfitting of the ERM algorithm on function sets of high complexity, in this paper we decompose the given target function set $\mathcal{F}$ into disjoint parts by making use of the idea of [12]. That is, for the given $a, b$ with $a \geq 1$, suppose that $b = a q^m$ for some $q > 1$ and $m \in \mathbb{N}$, so that
$$m = \log_q \frac{b}{a}.$$
Let $t_i = a q^i$, $i = 0, 1, \ldots, m$ (with $t_0 = a$, $t_m = b$). We set
$$\mathcal{F}(t_{i-1}) = \{f \in \mathcal{F} : \mathcal{E}(f) \leq t_{i-1}\},$$
and
$$\mathcal{F}(t_{i-1}, t_i] = \mathcal{F}(t_i) \setminus \mathcal{F}(t_{i-1}).$$
Then we have
$$\mathcal{F} = \bigcup_{i=1}^{m} \mathcal{F}(t_{i-1}, t_i], \qquad (3)$$
where $m$ is finite because $\mathcal{F}$ is a compact set.

For the sake of simplicity, we denote $\mathcal{F}(t_{i-1}, t_i]$ by $\mathcal{F}(r, s]$ in the sequel. Thus we first bound, for any $\varepsilon > 0$, the term
$$\mathrm{Prob}\Big\{\sup_{f \in \mathcal{F}(r, s]} \frac{\mathcal{E}(f) - \mathcal{E}_n(f)}{\sqrt{\mathcal{E}(f)}} \geq \varepsilon\Big\} \qquad (4)$$
for the ERM algorithm with the strongly mixing sequence $z$; we then establish bounds on the generalization ability over the set $\mathcal{F}(r, s]$ by making use of the bound on the term (4). To bound the term (4) for the ERM algorithm with the strongly mixing sequence $z$, we need some assumptions on the loss function. For any $t > 0$, define
$$L(t) = \sup_{z \in Z}\ \max_{|g_1| \leq t,\ |g_2| \leq t} \frac{|\ell(g_1, z) - \ell(g_2, z)|}{|g_1 - g_2|}.$$
We assume that $L(t)$ is finite if $t$ is finite.

Our approach is based on the Bernstein moment condition [8] and a covariance inequality for α-mixing sequences [23], used to establish a new bound on the relative difference between the empirical risks and their expected risks, following an idea similar to that of [18]. We end this section by introducing the following useful lemmas.

Lemma 1 [8]. Let $W$ be a random variable such that $E(W) = 0$, and let $W$ satisfy the Bernstein moment condition, that is, for some $K_1 > 0$,
$$E|W|^k \leq \frac{k!}{2}\,\mathrm{Var}(W)\,K_1^{k-2} \qquad \text{for all } k \geq 2.$$

Suppose now that the sequence $z$ satisfies Assumption 1 with $\beta > 0$, $\gamma > 0$ and $c > 0$. Assume that the variance $D[\ell(f, z_i)] \leq \sigma^2$ for all $1 \leq i \leq n$ and for all functions in $\mathcal{F}(r, s]$, and let the effective number of observations $n(\gamma)$ be defined by (5). Then for all $\varepsilon > 0$ the inequality
$$\mathrm{Prob}\Big\{\frac{\mathcal{E}(f) - \mathcal{E}_n(f)}{\sqrt{\mathcal{E}(f)}} \geq \varepsilon\Big\} \leq (1 + 4e^{-2})\exp\Big\{\frac{-\,n(\gamma)\, r\, \varepsilon^2}{2(2\sigma^2 + sM\varepsilon/3)}\Big\} \qquad (6)$$
holds true.

Proof: In this proof we use an argument similar to that used by [18] in their proof of the Bernstein inequality for strongly mixing sequences.
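As a purely numerical illustration (not part of the paper), the following sketch evaluates the right-hand side of a bound of the form (6) as reproduced above. Since the definition (5) of the effective number of observations n(γ) is not reproduced in this excerpt, the sketch simply takes n(γ) as an input; all concrete parameter values are assumptions chosen for illustration.

```python
import math

def relative_deviation_bound(n_gamma, eps, r, s, sigma2, M):
    """Right-hand side of a bound of the form (6):
    (1 + 4e^{-2}) * exp(-n(gamma) * r * eps^2 / (2 * (2*sigma2 + s*M*eps/3))).

    n_gamma : effective number of observations (defined by (5), not shown here)
    eps     : deviation level epsilon
    r, s    : endpoints of the risk shell F(r, s]
    sigma2  : variance bound sigma^2 on the loss over F(r, s]
    M       : uniform bound on the loss
    """
    exponent = -n_gamma * r * eps**2 / (2.0 * (2.0 * sigma2 + s * M * eps / 3.0))
    return (1.0 + 4.0 * math.exp(-2.0)) * math.exp(exponent)

# Illustrative values only: the bound tightens as n(gamma) grows, and shells
# with a larger lower endpoint r give a sharper exponent.
for n_gamma in (50, 200, 1000):
    p = relative_deviation_bound(n_gamma, eps=0.2, r=0.5, s=1.0, sigma2=0.25, M=1.0)
    print(f"n(gamma) = {n_gamma:5d}  ->  bound on the tail probability: {p:.2e}")
```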