




Chapter 10 Unsupervised Learning

This chapter will instead focus on unsupervised learning, a set of statistical tools intended for the setting in which we have only a set of features X1, X2, . . . , Xp measured on n observations. We are not interested in prediction, because we do not have an associated response variable Y. Rather, the goal is to discover interesting things about the measurements on X1, X2, . . . , Xp. Two such tools are principal components analysis and clustering.

10.1 The Challenge of Unsupervised Learning

If we fit a predictive model using a supervised learning technique, then it is possible to check our work by seeing how well our model predicts the response Y on observations not used in fitting the model. However, in unsupervised learning, there is no way to check our work, because we do not know the true answer: the problem is unsupervised.
10.2 Principal Components Analysis

When faced with a large set of correlated variables, principal components allow us to summarize this set with a smaller number of representative variables that collectively explain most of the variability in the original set. PCA is an unsupervised approach, since it involves only a set of features X1, X2, . . . , Xp, and no associated response Y. Apart from producing derived variables for use in supervised learning problems, PCA also serves as a tool for data visualization (visualization of the observations or visualization of the variables).
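The slides do not show this visualization step, but a minimal R sketch (assuming the USArrests data used in the lab at the end of this chapter) is a biplot, which displays the observations and the variables on the first two components at once:

pr.out <- prcomp(USArrests, scale = TRUE)  # PCA on the standardized variables
biplot(pr.out, scale = 0)                  # scale = 0: arrows represent the loadings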
10.2.1 What Are Principal Components?

Suppose that we wish to visualize n observations with measurements on a set of p features, X1, X2, . . . , Xp. We could do this by examining two-dimensional scatterplots of the data, each of which contains the n observations' measurements on two of the features. However, there are p(p-1)/2 such scatterplots.

Low-dimensional representation

Clearly, a better method is required to visualize the n observations when p is large. In particular, we would like to find a low-dimensional representation of the data that captures as much of the information as possible. For instance, if we can obtain a two-dimensional representation of the data that captures most of the information, then we can plot the observations in this low-dimensional space.

PCA provides a tool to do just this. It finds a low-dimensional representation of a data set that contains as much as possible of the variation. The idea is that each of the n observations lives in p-dimensional space, but not all of these dimensions are equally interesting. PCA seeks a small number of dimensions that are as interesting as possible, where the concept of "interesting" is measured by the amount that the observations vary along each dimension.
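To make "interesting" precise: the first principal component is the normalized linear combination Z1 = φ11 X1 + . . . + φp1 Xp with the largest sample variance. In ISLR's notation (assuming the variables have been centered to have mean zero), the loading vector φ1 solves

\max_{\phi_{11},\ldots,\phi_{p1}} \left\{ \frac{1}{n}\sum_{i=1}^{n}\left(\sum_{j=1}^{p}\phi_{j1}x_{ij}\right)^{2} \right\}
\quad \text{subject to} \quad \sum_{j=1}^{p}\phi_{j1}^{2}=1.

The normalization constraint is needed because otherwise the variance could be made arbitrarily large by inflating the loadings.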
Geometric interpretation for PCA

There is a nice geometric interpretation for the first principal component. The loading vector φ1 with elements φ11, φ21, . . . , φp1 defines a direction in feature space along which the data vary the most. If we project the n data points x1, . . . , xn onto this direction, the projected values are the principal component scores z11, . . . , zn1 themselves.

Low-dimensional views of the data

Once we have computed the principal components, we can plot them against each other in order to produce low-dimensional views of the data. For instance, we can plot the score vector Z1 against Z2, Z1 against Z3, Z2 against Z3, and so forth. Geometrically, this amounts to projecting the original data down onto the subspace spanned by φ1, φ2, and φ3, and plotting the projected points.
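As a minimal R sketch of such pairwise score plots (again assuming the USArrests data; prcomp() returns the score vectors as the columns of its x component):

pr.out <- prcomp(USArrests, scale = TRUE)  # as in the sketch above
pairs(pr.out$x[, 1:3],                     # score vectors Z1, Z2, Z3
      labels = c("Z1", "Z2", "Z3"))        # Z1 vs Z2, Z1 vs Z3, Z2 vs Z3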
10.2.2 Another Interpretation of Principal Components

Principal components provide low-dimensional linear surfaces that are closest to the observations. The first principal component loading vector has a very special property: it is the line in p-dimensional space that is closest to the n observations (using average squared Euclidean distance as a measure of closeness). The appeal of this interpretation is clear: we seek a single dimension of the data that lies as close as possible to all of the data points, since such a line will likely provide a good summary of the data.

10.2.3 More on PCA

Scaling the Variables

The results obtained when we perform PCA will also depend on whether the variables have been individually scaled (each multiplied by a different constant).
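A small R illustration of why scaling matters, assuming USArrests, where Assault has by far the largest variance and therefore dominates the unscaled analysis:

apply(USArrests, 2, var)                       # Assault's variance dwarfs the others
pr.scaled   <- prcomp(USArrests, scale = TRUE) # PCA on standardized variables
pr.unscaled <- prcomp(USArrests)               # default: centered but not scaled
pr.scaled$rotation[, 1]                        # first loadings: weight spread across variables
pr.unscaled$rotation[, 1]                      # first loadings: dominated by Assault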
16、tsThis means that two different software packages will yield the same principal component loading vectors, although the signs of those loading vectors may differ. 9/13/2022數(shù)據(jù)挖掘與統(tǒng)計(jì)計(jì)算13The Proportion of Variance ExplainedWe can now ask a natural question: how much of the information in a given data se
17、t is lost by projecting the observations onto the first few principal components? That is, how much of the variance in the data is not contained in the first few principal components? Deciding How Many Principal Components to UseIn fact, we would like to use the smallest number of principal componen
18、ts required to get a good understanding of the data. How many principal components are needed? Unfortunately, there is no single (or simple!) answer to this question.9/13/2022數(shù)據(jù)挖掘與統(tǒng)計(jì)計(jì)算149/13/2022數(shù)據(jù)挖掘與統(tǒng)計(jì)計(jì)算1510.3 Clustering MethodsClustering refers to a very broad set of techniques for finding subgrou
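One common aid is to compute the proportion of variance explained (PVE) by each component and look for an elbow in the cumulative plot; a minimal sketch, continuing from pr.out in the sketches above:

pr.var <- pr.out$sdev^2         # variance explained by each component
pve    <- pr.var / sum(pr.var)  # proportion of variance explained
pve                             # on scaled USArrests: roughly 0.62, 0.25, 0.09, 0.04
plot(cumsum(pve), type = "b",
     xlab = "Principal Component", ylab = "Cumulative PVE", ylim = c(0, 1))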
10.3 Clustering Methods

Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a data set. When we cluster the observations of a data set, we seek to partition them into distinct groups so that the observations within each group are quite similar to each other, while observations in different groups are quite different from each other. Two of the best-known approaches are K-means clustering and hierarchical clustering.
10.3.1 K-Means Clustering

Let C1, . . . , CK denote sets containing the indices of the observations in each cluster. These sets satisfy two properties:

1. C1 ∪ C2 ∪ . . . ∪ CK = {1, . . . , n}. In other words, each observation belongs to at least one of the K clusters.

2. Ck ∩ Ck′ = ∅ for all k ≠ k′. In other words, the clusters are non-overlapping: no observation belongs to more than one cluster.

Code: Principal Components Analysis

library(ISLR)
states = row.names(USArrests)             # observation labels: the 50 states
states
names(USArrests)                          # the four variables
apply(USArrests, 2, mean)                 # column means differ greatly...
apply(USArrests, 2, var)                  # ...and so do the variances
pr.out = prcomp(USArrests, scale = TRUE)  # PCA on the standardized variables
names(pr.out)                             # components returned by prcomp()
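The slides' code stops at the PCA lab; a matching minimal K-means sketch on simulated two-cluster data, in the style of the ISLR lab (kmeans() with nstart > 1 runs multiple random starts and keeps the best solution):

set.seed(2)                           # reproducible simulated data
x <- matrix(rnorm(50 * 2), ncol = 2)  # 50 observations, 2 features
x[1:25, 1] <- x[1:25, 1] + 3          # shift the first 25 observations...
x[1:25, 2] <- x[1:25, 2] - 4          # ...so the data have two true clusters
km.out <- kmeans(x, 2, nstart = 20)   # K = 2; keep the best of 20 random starts
km.out$cluster                        # cluster assignment for each observation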