Chapter 10: Unsupervised Learning
9/13/2022 · 数据挖掘与统计计算 (Data Mining and Statistical Computing)

This chapter focuses on unsupervised learning, a set of statistical tools intended for the setting in which we have only a set of features X1, X2, ..., Xp measured on n observations. We are not interested in prediction, because we do not have an associated response variable Y. Rather, the goal is to discover interesting things about the measurements on X1, X2, ..., Xp. The two main tools covered are principal components analysis and clustering.

10.1 The Challenge of Unsupervised Learning
If we fit a predictive model using a supervised learning technique, we can check our work by seeing how well the model predicts the response Y on observations not used in fitting the model. In unsupervised learning, however, there is no way to check our work, because we do not know the true answer: the problem is unsupervised.

10.2 Principal Components Analysis
When faced with a large set of correlated variables, principal components allow us to summarize this set with a smaller number of representative variables that collectively explain most of the variability in the original set. PCA is an unsupervised approach, since it

involves only a set of features X1, X2, ..., Xp, and no associated response Y. Apart from producing derived variables for use in supervised learning problems, PCA also serves as a tool for data visualization (visualization of the observations or of the variables).
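To make the summarization idea concrete, here is a minimal sketch of PCA on a toy set of correlated features. The deck's own lab code (at the end) is in R; this cross-check uses Python/NumPy, and the data values are purely illustrative:

```python
import numpy as np

# Toy data: 6 observations on p = 3 strongly correlated features (illustrative).
X = np.array([
    [2.5, 2.4, 1.0],
    [0.5, 0.7, 0.3],
    [2.2, 2.9, 1.1],
    [1.9, 2.2, 0.9],
    [3.1, 3.0, 1.4],
    [2.3, 2.7, 1.2],
])

# Center (and here also scale) each feature before extracting components.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# The principal component loading vectors are the right singular vectors of Z.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
phi1 = Vt[0]         # first loading vector (phi_11, ..., phi_p1), unit length
scores1 = Z @ phi1   # first principal component scores z_11, ..., z_n1

# Share of total variance captured by the first component alone.
pve1 = s[0] ** 2 / np.sum(s ** 2)
```

Because the three features move together, pve1 comes out close to 1: a single derived variable summarizes most of the variability in the original set, which is exactly the point of PCA.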

10.2.1 What Are Principal Components?
Suppose we wish to visualize n observations with measurements on a set of p features X1, X2, ..., Xp. We could do this by examining two-dimensional scatterplots of the data, each of which contains the n observations' measurements on two of the features. However, there are p(p−1)/2 such scatterplots.

Low-dimensional representation
Clearly, a better method is required to visualize the n observations when p is large. In particular, we would like to find a low-dimensional representation of the data that captures as much of the information

as possible. For instance, if we can obtain a two-dimensional representation that captures most of the information, then we can plot the observations in this low-dimensional space.

PCA provides a tool to do just this. It finds a low-dimensional representation of a data set that contains as much of the variation as possible. The idea is that each of the n observations lives in p-dimensional space, but not all of these dimensions are equally interesting. PCA seeks a small number of dimensions that are as interesting as possible, where the concept of "interesting" is

measured by the amount that the observations vary along each dimension.

Geometric interpretation of PCA
There is a nice geometric interpretation of the first principal component. The loading vector φ1, with elements φ11, φ21, ..., φp1, defines the direction in feature space along which the data vary the most. If we project the n data points x1, ..., xn onto this direction, the projected values are the principal component scores z11, ..., zn1 themselves.

Low-dimensional views of the data
Once we have computed the principal components, we can plot them against each other to produce low-dimensional views of the data. For instance, we can plot the score vector Z1 against Z2, Z1 against Z3, Z2 against Z3, and so forth. Geometrically, this amounts to projecting the original data onto the subspace spanned by φ1, φ2, and φ3, and plotting the projected points.

10.2.2 Another Interpretation of Principal Components
Principal components provide low-dimensional linear surfaces that are closest to the observations. The first principal component loading vector has a special property: it defines the line in p-dimensional space that is closest to the n observations (using average squared Euclidean distance as the measure of closeness). The appeal of this interpretation is clear: we seek the single dimension that lies as close as possible to all of the data points, since such a line will likely provide a good summary of the data.

10.2.3 More on PCA
Scaling the variables: the results obtained when we perform PCA also depend on whether the variables have been individually scaled (each multiplied by a different constant).
Uniqueness of the principal components: two different software packages will yield the same principal component loading vectors, although the signs of those loading vectors may differ.
The proportion of variance explained: a natural question is how much of the information in a given data set is lost by projecting the observations onto the first few principal components. That is, how much of the variance in the data is not contained in the first few principal components?
Deciding how many principal components to use: we would like the smallest number of principal components required to get a good understanding of the data. How many are needed? Unfortunately, there is no single (or simple!) answer to this question.

10.3 Clustering Methods
Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a data set. When we cluster the observations of a data set, we seek to partition them into distinct groups so that the observations within each group are quite similar to each other, while observations in different groups are quite different from each other. Two approaches are covered: K-means clustering and hierarchical clustering.

10.3.1 K-Means Clustering
Let C1, ..., CK denote sets containing the indices of the observations in each cluster. These sets satisfy two properties:
1. C1 ∪ C2 ∪ ... ∪ CK = {1, ..., n}. In other words, each observation belongs to at least one of the K clusters.
2. Ck ∩ Ck′ = ∅ for all k ≠ k′. In other words, the clusters are non-overlapping: no observation belongs to more than one cluster.

Code: Principal Components Analysis
library(ISLR)
states = row.names(USArrests)
states
names(USArrests)
apply(USArrests, 2, mean)
apply(USArrests, 2, var)
pr.out = prcomp(USArrests, scale = TRUE)
names(pr.out)
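The two K-means properties above say that the algorithm produces a partition: every observation receives exactly one cluster label. A minimal sketch of the standard iterative algorithm follows (Python/NumPy; the toy data, function name, and choice of K are illustrative, and a real analysis would use a library routine such as R's kmeans()):

```python
import numpy as np

def kmeans(X, K, n_iter=20, seed=0):
    """Toy K-means: random initial assignment, then alternate centroid
    updates and nearest-centroid reassignment for a fixed number of passes."""
    rng = np.random.default_rng(seed)
    # Randomly assign each observation to one of the K clusters.
    labels = rng.integers(0, K, size=len(X))
    for _ in range(n_iter):
        centroids = []
        for k in range(K):
            members = X[labels == k]
            # Guard: if a cluster is empty, reseed its centroid at a data point.
            centroids.append(members.mean(axis=0) if len(members)
                             else X[rng.integers(len(X))])
        centroids = np.array(centroids)
        # Reassign each observation to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
    return labels

# Two well-separated blobs of 5 points each; K-means should recover them.
X = np.vstack([np.zeros((5, 2)), np.ones((5, 2)) * 5.0])
labels = kmeans(X, K=2)
```

Note that the result is exactly a partition in the sense of properties 1 and 2: the labels array gives each of the n observations one and only one cluster index in {0, ..., K−1}.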

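The R lab inspects the column means and variances of USArrests before calling prcomp precisely because the variables sit on very different scales, and it then faces the Section 10.2.3 question of how many components to keep. A hedged cross-check of that pipeline in Python/NumPy (synthetic stand-in data, since USArrests ships with R; the proportion-of-variance-explained values are what a scree plot would display):

```python
import numpy as np

# Stand-in data: n = 50 observations, p = 4 features on unequal scales,
# three of them driven by a common factor (the values are synthetic).
rng = np.random.default_rng(1)
base = rng.normal(size=(50, 1))
X = np.hstack([
    base * 80 + 170,                    # large-scale feature
    base * 3 + 8,                       # small-scale feature, same driver
    rng.normal(size=(50, 1)) * 15 + 65, # independent feature
    base * 20 + 20,                     # mid-scale feature, same driver
])

# Equivalent of prcomp(..., scale = TRUE): center and scale, then SVD.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)

# Proportion of variance explained by each component, and its running total;
# plotting pve (a scree plot) is one common way to pick how many PCs to keep.
pve = s ** 2 / np.sum(s ** 2)
cum_pve = np.cumsum(pve)
```

Here the first component absorbs roughly the share contributed by the three co-moving features, and cum_pve shows how quickly the remaining variance is exhausted, which is the information an analyst uses when deciding how many components are enough.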