Data Mining, Chapter 8: Standards, Tools, and Trends

Chapter contents
- 8.1 Data mining standards
- 8.2 Data mining tools
- 8.3 Research trends in data mining

Basic requirement: understand the standards relevant to applying data mining, and the future research trends.

8.1 Data Mining Standards

A data mining process model is key to keeping a data mining project on track. Typical process models include:
- SPSS's 5A model: Assess, Access, Analyze, Act, Automate.
- SAS's SEMMA model: Sample, Explore, Modify, Model, Assess.
- CRISP-DM (Cross Industry Standard Process for Data Mining).
- The Two Crows data mining process model, which has much in common with CRISP-DM, then still being established.

CRISP-DM

CRISP-DM stands for Cross Industry Standard Process for Data Mining. SPSS, NCR, and DaimlerChrysler, three companies with extensive experience in data mining, formed a consortium with the goal of standardizing data mining methods and processes.

CRISP-DM is a widely accepted process model for data mining. It provides a framework for describing the modeling process in detail and is regarded as "best practice". Its phases and their deliverables:
- Project Objectives (Business Understanding): background; requirements, assumptions, constraints; terminology; data mining goals and success criteria; project plan.
- Data Understanding: initial data collection report; data description report; data exploration report; data quality report.
- Data Preparation: data description report; data pre-processing steps.
- Modeling: modeling assumption; test design; model description; model assessment (including validation).
- Evaluation: assessment of data mining results with respect to objectives.
- Reporting (Deployment): final report, with summary (objectives, data mining process, data mining results, data mining assessment), conclusions, and future work.

Business Understanding phase
- Understand the business objectives
  - What is the status quo?
  - Understand business processes
  - Associated costs/pain
- Define the success criteria
- Develop a glossary of terms: speak the language
- Cost/benefit analysis
- Current systems assessment
  - Identify the key actors (minimum: the sponsor and the key user)
  - What forms should the output take?
  - Integration of output with the existing technology landscape
  - Understand market norms and standards
- Task decomposition
  - Break down the objective into sub-tasks
  - Map sub-tasks to data mining problem definitions
- Identify constraints
  - Resources
  - Law, e.g. data protection
- Build a project plan
  - List assumptions and risk factors (technical, financial, business, organisational)

Data Understanding phase
- Collect data
  - What are the data sources? Internal and external sources (e.g. Axiom, Experian)
  - Document reasons for inclusion/exclusion
  - Depend on a domain expert
  - Accessibility issues: is the data distributed across different databases or legacy systems? Where are the disconnects?
- Data description
  - Document data quality issues
  - Compute basic statistics
- Data exploration
  - Simple univariate data plots/distributions
  - Investigate attribute interactions
- Data quality issues
  - Missing values: understand their source
  - Strange distributions

Data Preparation phase
- Integrate data
  - Joining multiple data tables
  - Summarisation/aggregation of data
- Select data
  - Attribute subset selection
  - Rationale for inclusion/exclusion
  - Data sampling
  - Training/validation and test sets
- Data transformation
  - Using functions such as log
  - Factor/principal components analysis
  - Normalization/discretisation/binarisation
- Clean data
  - Handling missing values and outliers
- Data construction
  - Derived attributes

Modeling phase
- Build model
  - Choose initial parameter settings
  - Study model behaviour: sensitivity analysis
- Assess the model
  - Beware of over-fitting
  - Investigate the error distribution: identify segments of the state space where the model is less effective
  - Iteratively adjust parameter settings

Evaluation phase
- Validate model
  - Human evaluation of results by domain experts
  - Evaluate usefulness of results from a business perspective
  - Define control groups
  - Calculate lift curves
  - Expected return on investment
- Review process
- Determine next steps
  - Potential for deployment
  - Deployment architecture
  - Metrics for success of deployment

PMML

PMML is the Predictive Model Markup Language. Data mining applications often need several kinds of data mining software and algorithms to work together, which requires that mined models can be readily inherited, reused, and integrated. PMML was proposed by the Data Mining Group (DMG) for this purpose. The latest PMML version cited here is 4.1, which supports 16 kinds of data mining model:
- AssociationModel: association rules
- BaselineModel: baseline model
- ClusteringModel: clustering model
- GeneralRegressionModel: general regression model
- MiningModel: composite (ensemble) model
- NaiveBayesModel: naive Bayes
- NearestNeighborModel: nearest-neighbor model
- NeuralNetwork: neural network
- RegressionModel: linear, polynomial, and logistic regression
- RuleSetModel: rule set
- SequenceModel: sequence patterns
- Scorecard
- TimeSeriesModel
- SupportVectorMachineModel: support vector machine
- TextModel: text model
- TreeModel: decision tree

A PMML model definition consists of the following parts.

Header: the header element contains general information about the PMML document, such as copyright information for the model, its description, and information about the application used to generate the model, such as its name and version. Example: the root element carries the PMML version, e.g. PMML version=3.2.

Data dictionary: records information about the data fields from which the model was built. Example: a DataField element with name=Species.

Data transformations: transformations map user data into a form better suited to the mining model. PMML defines several kinds of simple data transformation:
- Normalization: map values to numbers; the input can be continuous or discrete.
- Discretization: map continuous values to discrete values.
- Value mapping: map discrete values to discrete values.
- Functions (custom and built-in): derive a value by applying a function to one or more parameters.
- Aggregation: summarize or collect groups of values.

Model: contains the definition of the data mining model, with attributes such as:
- Model name (attribute modelName)
- Algorithm name (attribute algorithmName)
- Number of layers (attribute numberOfLayers)

Mining schema: lists all fields used in the model.
- Name: must refer to a field in the data dictionary.
- Usage type: defines how a field is used in the model. Typical values are active, predicted, and supplementary. Predicted fields are those whose values are predicted by the model.
- Outlier treatment: defines the outlier treatment to be used.
- Missing value replacement policy: if this attribute is specified, a missing value is automatically replaced by the given value.
- Missing value treatment: indicates how the missing value replacement was derived.

Targets: allow post-processing of the predicted value, in the form of scaling, if the output of the model is continuous.

PMML example: association rules

Transactions:
- t1: Cracker, Coke, Water
- t2: Cracker, Water
- t3: Cracker, Water
- t4: Cracker, Coke, Water

The first part of the corresponding PMML document covers the model attributes and the items.
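To make the association-rule example concrete, the following is a minimal Python sketch that computes support and confidence for the four transactions t1 to t4. It does not read or write PMML; the function names and the minimum-support thresholds are assumptions chosen for illustration, and the brute-force enumeration is only suitable for a toy data set like this one.

```python
from itertools import combinations

# The four transactions from the example (t1..t4).
transactions = [
    {"Cracker", "Coke", "Water"},
    {"Cracker", "Water"},
    {"Cracker", "Water"},
    {"Cracker", "Coke", "Water"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent).

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration of all itemsets meeting `min_support`."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            s = support(combo, transactions)
            if s >= min_support:
                result[combo] = s
    return result

if __name__ == "__main__":
    # Hypothetical threshold of 0.5, chosen only for this illustration.
    for itemset, s in frequent_itemsets(transactions, 0.5).items():
        print(itemset, s)
    print(confidence({"Coke"}, {"Water"}, transactions))
```

Coke appears in two of the four transactions, so its support is 0.5, while every transaction containing Coke also contains Water, so the rule Coke → Water has confidence 1.0.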
The remaining parts of the PMML document for the association-rule example (transactions t1 to t4 above) list the item sets and the association rules.

JDM

JDM is the Java Data Mining API. It aims to provide a standard API for accessing data mining tools. It supports building and applying data mining models, as well as creating, storing, accessing, and maintaining data and metadata, so that Java applications can easily integrate data mining technology.

Semantic Web standards

Tim Berners-Lee first presented the layered model of the Semantic Web, the "Layer Cake", in his talk at the XML 2000 conference. Its defining idea is to build ontologies and logical inference rules on top of XML and RDF/RDFS, so that knowledge can be represented and reasoned about on the basis of its semantics and thus understood and processed by computers.

Layer 1 is Unicode and URI (Uniform Resource Identifier). Unicode became an international standard of the ISO, ISO/IEC 10646, in 1993; its goal is a unified encoding for all of the world's scripts. A URI has three parts: the naming scheme used to access the resource, the name of the machine hosting the resource, and the name of the resource itself, given as a path. Unlike a URL, a URI is only an identifier and does not directly provide a way to access the resource.
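The three URI parts described for Layer 1 (naming scheme, host name, resource path) can be illustrated with Python's standard `urllib.parse` module. This is only a sketch: the two URIs are hypothetical examples that do not come from the original slides, and `describe_uri` is a helper name introduced here.

```python
from urllib.parse import urlsplit

def describe_uri(uri):
    """Split a URI into the three parts named in the text."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,   # naming scheme used to access the resource
        "host": parts.hostname,   # machine hosting the resource (None if absent)
        "path": parts.path,       # name of the resource, in path form
    }

# Hypothetical URI that also happens to be a locator (URL).
located = describe_uri("http://www.example.org/people/alice")

# Hypothetical URI that only names a resource: it has a scheme but no host,
# so it identifies something without saying where to fetch it.
named = describe_uri("urn:isbn:0451450523")

if __name__ == "__main__":
    print(located)
    print(named)
```

The second case shows the distinction drawn in the text: an identifier can be well-formed without carrying any information about how to retrieve the resource.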
Layer 2 is XML (eXtensible Markup Language). XML is simple, self-describing, and extensible, and it separates content, structure, and presentation, which makes it well suited to data representation and exchange. The constraints in an XML Schema are mainly used to validate the structure of XML documents.

Layer 3 is RDF (Resource Description Framework), the metadata layer. RDF is a framework for describing and exchanging metadata, built on XML. It describes objects in the form "resource, property, property value".

[Figure: an RDF example]

Layer 4 is RDF-S (RDF Schema). RDF-S extends RDF; it is RDF's vocabulary description language, used to define the vocabulary that appears in RDF resource descriptions.

Layer 5 is ontology and rules, the domain knowledge layer. OWL is used to represent explicitly the terms of a vocabulary and the relationships between them, and it is more expressive with respect to meaning and semantics. Rules describe the premises and conclusions in domain knowledge.

SPARQL (SPARQL Protocol and RDF Query Language) is the W3C-recommended language and protocol for querying RDF data.
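The "resource, property, property value" form of RDF statements, and the pattern-matching idea behind querying them, can be sketched in plain Python. This is a toy illustration under stated assumptions: it is not RDF/XML, not SPARQL syntax, and not any real RDF library. The `ex:` names are hypothetical identifiers, and the facts are taken from the tool descriptions in section 8.2.

```python
# Each statement is a (resource, property, value) triple.
triples = [
    ("ex:Weka", "ex:writtenIn", "Java"),
    ("ex:Orange", "ex:writtenIn", "Python"),
    ("ex:Weka", "ex:developedBy", "University of Waikato"),
]

def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard.

    This loosely mirrors a single triple pattern in a query language,
    where unbound positions play the role of variables.
    """
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

if __name__ == "__main__":
    # "Which resources are written in Java?"
    print(match(triples, p="ex:writtenIn", o="Java"))
    # "What is stated about ex:Weka?"
    print(match(triples, s="ex:Weka"))
```

A real query engine combines several such patterns and joins them on shared variables; the sketch covers only the single-pattern case.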
8.2 Data Mining Tools

Free open-source data mining software and applications
- GATE: a natural language processing and language engineering tool.
- Orange: a component-based data mining and machine learning software suite written in the Python language.
- R: a programming language and software environment for statistical computing, data mining, and graphics.
- RapidMiner: an environment for machine learning and data mining experiments.
- UIMA: the Unstructured Information Management Architecture, a component framework for analyzing unstructured content such as text, audio, and video, originally developed by IBM.
- Weka: a suite of machine learning software applications written in the Java programming language.

Commercial data mining software and applications
- IBM SPSS Modeler: data mining software provided by IBM.
- Microsoft Analysis Services: data mining software provided by Microsoft.
- Oracle Data Mining: data mining software by Oracle.
- SAS Enterprise Miner: data mining software provided by the SAS Institute.
- STATISTICA Data Miner: data mining software provided by StatSoft.

WEKA: Waikato Environment for Knowledge Analysis

WEKA is a data mining and machine learning tool developed by the Department of Computer Science, University of Waikato, New Zealand. The weka is also a bird found only on the islands of New Zealand.

Download and install WEKA
- Website: cs.waikato.ac.nz/ml/weka/index.html
- Supports multiple platforms (written in Java): Windows, Mac OS X, and Linux

Main features
- 49 data preprocessing tools
- 76 classification/regression algorithms
- 8 clustering algorithms
- 3 algorithms for finding association rules
- 15 attribute/subset evaluators and 10 search algorithms for feature selection

Main GUI
- "The Explorer" (exploratory data analysis)
- "The Experimenter" (experimental environment)
- "The KnowledgeFlow" (new process-model-inspired interface)

WEKA only deals with "flat files", for example:

@relation heart-disease-simplified
@attribute age numeric
@attribute se
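The flat-file fragment above is the start of an ARFF header, in which a relation line names the data set and each attribute line declares one column. As a rough sketch of how such a file is laid out, the following Python reads a tiny ARFF text. Only the relation name and the `age` attribute come from the source; the second attribute and the data rows are hypothetical, and the parser deliberately ignores quoting, sparse data, and other parts of the real format.

```python
# Hypothetical ARFF text: only the relation name and the `age` attribute
# appear in the original example; `chest_pain` and the rows are made up.
arff_text = """\
% toy data set for illustration
@relation heart-disease-simplified
@attribute age numeric
@attribute chest_pain {typical, atypical}
@data
63,typical
41,atypical
"""

def parse_arff(text):
    """Parse a minimal ARFF text into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):   # skip blanks and comments
            continue
        if in_data:                            # after @data: one row per line
            rows.append([value.strip() for value in line.split(",")])
            continue
        keyword, _, rest = line.partition(" ")
        keyword = keyword.lower()              # keywords are case-insensitive
        if keyword == "@relation":
            relation = rest.strip()
        elif keyword == "@attribute":
            name, _, type_spec = rest.strip().partition(" ")
            attributes.append((name, type_spec.strip()))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows

relation, attributes, rows = parse_arff(arff_text)

if __name__ == "__main__":
    print(relation)
    print(attributes)
    print(rows)
```

Values are kept as strings; converting them according to the declared attribute types is left out to keep the sketch short.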