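Data Mining, Chapter 8: Standards, Tools, and Trends

Chapter contents
- 8.1 Data mining standards and specifications
- 8.2 Data mining tools
- 8.3 Research trends in data mining

Basic requirement: understand the standards and specifications relevant to applied data mining, and the future research trends.

8.1 Data Mining Standards and Specifications

A data mining process model is key to keeping a data mining project on track. Typical process models include:
- The SPSS 5A model: Assess, Access, Analyze, Act, Automate.
- The SAS SEMMA model: Sample, Explore, Modify, Model, Assess.
- CRISP-DM (Cross-Industry Standard Process for Data Mining), a cross-industry process standard.
- The Two Crows data mining process model, which has much in common with CRISP-DM (described in the source as still being established).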
Data mining standards

CRISP-DM (Cross-Industry Standard Process for Data Mining). SPSS, NCR, and DaimlerChrysler, three companies with extensive data mining experience, founded a consortium to establish a standard for data mining methods and processes.

CRISP-DM is a widely accepted process model for data mining. It provides a framework for describing the modeling process in detail and is regarded as best practice.

CRISP-DM phases and their deliverables:
- Project Objectives (Business Understanding): background; requirements, assumptions, constraints; terminology; data mining goals and success criteria; project plan.
- Data Understanding: initial data collection report; data description report; data exploration report; data quality report.
- Data Preparation: data description report; data pre-processing steps.
- Modeling: modeling assumptions; test design; model description; model assessment (including validation).
- Evaluation: assessment of data mining results with respect to objectives.
- Reporting (Deployment): final report, comprising a summary (objectives, data mining process, data mining results, data mining assessment), conclusions, and future work.

Business Understanding Phase
- Understand the business objectives
  - What is the status quo?
  - Understand business processes
  - Associated costs/pain
- Define the success criteria
- Develop a glossary of terms: speak the language
- Cost/benefit analysis
- Current systems assessment
- Identify the key actors
  - Minimum: the sponsor and the key user
- What forms should the output take?
- Integration of output with the existing technology landscape
- Understand market norms and standards
- Task decomposition
  - Break down the objective into sub-tasks
  - Map sub-tasks to data mining problem definitions
- Identify constraints
  - Resources
  - Law, e.g. data protection
- Build a project plan
  - List assumptions and risk factors (technical/financial/business/organisational)

Data Understanding Phase
- Collect data
  - What are the data sources? Internal and external sources (e.g. Acxiom, Experian)
  - Document reasons for inclusion/exclusion
  - Depend on a domain expert
  - Accessibility issues
  - Are there issues regarding data distribution across different databases/legacy systems? Where are the disconnects?
- Data description
  - Document data quality issues
  - Compute basic statistics
- Data exploration
  - Simple univariate data plots/distributions
  - Investigate attribute interactions
- Data quality issues
  - Missing values: understand their source
  - Strange distributions

Data Preparation Phase
- Integrate data
  - Joining multiple data tables
  - Summarisation/aggregation of data
- Select data
  - Attribute subset selection
  - Rationale for inclusion/exclusion
  - Data sampling
  - Training/validation and test sets
- Data transformation
  - Using functions such as log
  - Factor/principal components analysis
  - Normalization/discretisation/binarisation
- Clean data
  - Handling missing values/outliers
- Data construction
  - Derived attributes
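Some of the data preparation steps can be illustrated with a small sketch. The code below is not part of CRISP-DM itself; it assumes plain Python lists as the data, and the function names, the 60/20/20 split ratio, and the sample values are all hypothetical choices made for illustration. It shows a log transformation, min-max normalization, and a split into training, validation, and test sets.

```python
import math
import random


def log_transform(values):
    """Apply log(1 + x) to compress a right-skewed, non-negative attribute."""
    return [math.log1p(v) for v in values]


def min_max_normalize(values):
    """Rescale values linearly to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def split_dataset(rows, train=0.6, validation=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


# Hypothetical attribute values, chosen only to show a skewed distribution.
incomes = [1200.0, 3400.0, 560.0, 98000.0, 4100.0]
scaled = min_max_normalize(log_transform(incomes))

# Ten hypothetical row identifiers stand in for a real data table.
train_set, validation_set, test_set = split_dataset(range(10))
print(len(train_set), len(validation_set), len(test_set))
```

Fixing the random seed keeps the split reproducible, which helps when the same partition has to be reused across the modeling and evaluation phases.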
The Modeling Phase
- Build model
  - Choose initial parameter settings
  - Study model behaviour: sensitivity analysis
- Assess the model
  - Beware of over-fitting
  - Investigate the error distribution: identify segments of the state space where the model is less effective
  - Iteratively adjust parameter settings

The Evaluation Phase
- Validate model
  - Human evaluation of results by domain experts
  - Evaluate usefulness of results from a business perspective
  - Define control groups
  - Calculate lift curves
  - Expected return on investment
- Review process
- Determine next steps
  - Potential for deployment
  - Deployment architecture
  - Metrics for success of deployment

PMML (Predictive Model Markup Language). Data mining applications often need several kinds of data mining software and algorithms to work together, which requires that mined models can be inherited, reused, and integrated easily. The Data Mining Group (DMG) proposed PMML for this purpose. The latest version at the time of writing, PMML 4.1, supports 16 kinds of data mining model:
- AssociationModel (association rules)
- BaselineModel (baseline model)
- ClusteringModel (clustering model)
- GeneralRegressionModel (general regression model)
- MiningModel (ensemble model)
- NaiveBayesModel (naive Bayes)
- NearestNeighborModel (nearest-neighbour model)
- NeuralNetwork (neural network)
- RegressionModel (linear, polynomial, and logistic regression)
- RuleSetModel (rule set)
- SequenceModel (sequential patterns)
- Scorecard
- TimeSeriesModel
- SupportVectorMachineModel (support vector machine)
- TextModel (text model)
- TreeModel (decision tree)

A PMML model definition consists of the following parts.

Header: the header element contains general information about the PMML document, such as copyright information for the model, its description, and information about the application used to generate the model, such as its name and version.

    <PMML version="3.2" ...
      <Header copyright="Copyright (c) 2009 Togaware" description="RPart Decision Tree">
        <Extension name="timestamp" value="2009-02-15 06:51:50" extender="Rattle"/>
        <Extension name="description" value="iris tree" extender="Rattle"/>
        <Application name="Rattle/PMML" version="1.2.7"/>
      </Header>

Data dictionary: records information about the data fields from which the model was built.

    <DataDictionary numberOfFields="5">
      <DataField name="Species" ...
        <Value value="setosa"/>
        <Value value="versicolor"/>
        <Value value="virginica"/>
      </DataField>
      <DataField name="Sepal.Length" optype="continuous" dataType="double"/>

Data transformations: transformations map user data into a form more suitable for use by the mining model. PMML defines several kinds of simple data transformation:
- Normalization: map values to numbers; the input can be continuous or discrete.
- Discretization: map continuous values to discrete values.
- Value mapping: map discrete values to discrete values.
- Functions (custom and built-in): derive a value by applying a function to one or more parameters.
- Aggregation: summarize or collect groups of values.

Model: contains the definition of the data mining model. Its attributes include the model name (modelName), the algorithm name (algorithmName), and the number of layers (numberOfLayers).

Mining schema: lists all fields used in the model. For each field it specifies:
- Name: must refer to a field in the data dictionary.
- Usage type: defines how the field is used in the model. Typical values are active, predicted, and supplementary. Predicted fields are those whose values are predicted by the model.
- Outlier treatment: defines the outlier treatment to be used.
- Missing value replacement policy: if this attribute is specified, a missing value is automatically replaced by the given value.
- Missing value treatment: indicates how the missing value replacement was derived.

Targets: allow post-processing of the predicted value, in the form of scaling, if the output of the model is continuous.
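Because PMML is plain XML, its header and data dictionary can be read with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree module on a simplified, hypothetical document: the xmlns declaration that real PMML files carry is omitted, and the attributes that are elided in the fragments above (such as optype and dataType for Species) are filled with illustrative values that are assumptions, not taken from the source.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified PMML document. The namespace declaration is left
# out and elided attributes are filled with illustrative values.
PMML_TEXT = """
<PMML version="3.2">
  <Header copyright="Copyright (c) 2009 Togaware" description="RPart Decision Tree">
    <Application name="Rattle/PMML" version="1.2.7"/>
  </Header>
  <DataDictionary numberOfFields="2">
    <DataField name="Species" optype="categorical" dataType="string">
      <Value value="setosa"/>
      <Value value="versicolor"/>
      <Value value="virginica"/>
    </DataField>
    <DataField name="Sepal.Length" optype="continuous" dataType="double"/>
  </DataDictionary>
</PMML>
"""

root = ET.fromstring(PMML_TEXT.strip())

# The Header element identifies the application that generated the model.
application = root.find("Header").find("Application")

# Collect each data field with its operational type, data type, and the
# list of allowed values (empty for continuous fields).
fields = {}
for field in root.find("DataDictionary").findall("DataField"):
    allowed = [value.get("value") for value in field.findall("Value")]
    fields[field.get("name")] = (field.get("optype"), field.get("dataType"), allowed)

print(root.get("version"), application.get("name"), application.get("version"))
for name, (optype, data_type, allowed) in fields.items():
    print(name, optype, data_type, allowed)
```

With a real PMML file the element names would need to be qualified with the document's namespace, for example by passing a namespace mapping to find and findall.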
PMML example: association rules

Transactions:

    t1: Cracker, Coke, Water
    t2: Cracker, Water
    t3: Cracker, Water
    t4: Cracker, Coke, Water

The PMML document for an association model consists of model attributes, items, item sets, and association rules. The item sets and association rules for these transactions, which close the document, are:

    <AssocItemset id="1" support="1.0" numberOfItems="1">
      <AssocItemRef itemRef="1"/>
    </AssocItemset>
    <AssocItemset id="2" support="1.0" numberOfItems="1">
      <AssocItemRef itemRef="3"/>
    </AssocItemset>
    <!-- and one frequent itemset with two items. -->
    <AssocItemset id="3" support="1.0" numberOfItems="2">
      <AssocItemRef itemRef="1"/>
      <AssocItemRef itemRef="3"/>
    </AssocItemset>
    <!-- Two rules satisfy the requirements -->
    <AssocRule support="1.0" confidence="1.0" antecedent="1" consequent="2"/>
    <AssocRule support="1.0" confidence="1.0" antecedent="2" consequent="1"/>
    </AssociationModel>
    </PMML>

JDM (Java Data Mining API). JDM aims to provide a standard API for accessing data mining tools. It supports building and using data mining models, and creating, storing, accessing, and maintaining data and metadata, so that Java applications can integrate data mining technology easily.

Semantic Web standards

Tim Berners-Lee first presented the layered model of the Semantic Web (the "layer cake") in his talk at the XML 2000 conference. Its distinguishing feature is that it builds ontologies and logical inference rules on top of XML and RDF/RDFS to achieve semantics-based knowledge representation and reasoning, so that content can be understood and processed by computers.

- Layer 1: Unicode and URI (Uniform Resource Identifier). Unicode is aligned with ISO/IEC 10646, which became an ISO international standard in 1993; its aim is a single encoding for all the world's scripts. A URI has three parts: the naming scheme used to access the resource, the name of the machine hosting the resource, and the name of the resource itself in path form. Unlike a URL, a URI is only an identifier and does not directly provide a way to access the resource.
- Layer 2: XML (Extensible Markup Language). XML is simple, self-describing, and extensible, and it separates content, structure, and presentation, which makes it well suited to data representation and exchange. Constraints in XML Schema are mainly used to validate the structure of XML documents.
- Layer 3: RDF (Resource Description Framework), the metadata layer. RDF is a framework for describing and exchanging metadata built on XML. It describes objects in the form "resource, property, property value".
- Layer 4: RDF-S (RDF Schema). RDF-S extends RDF. It is RDF's vocabulary description language, used to define the vocabulary that appears in RDF resource descriptions.
- Layer 5: ontology and rules, the domain knowledge layer. OWL (Web Ontology Language) explicitly represents the terms in a vocabulary and the relationships between them, and is more expressive than RDF-S in conveying meaning and semantics. Rules describe the premises and conclusions in domain knowledge.

SPARQL (SPARQL Protocol and RDF Query Language) is the W3C-recommended query language and protocol for RDF data.

8.2 Data Mining Tools

Free open-source data mining software and applications:
- GATE: a natural language processing and language engineering tool.
- Orange: a component-based data mining and machine learning software suite written in the Python language.
- R: a programming language and software environment for statistical computing, data mining, and graphics.
- RapidMiner: an environment for machine learning and data mining experiments.
- UIMA: the Unstructured Information Management Architecture is a component framework for analyzing unstructured content such as text, audio, and video, originally developed by IBM.
- Weka: a suite of machine learning software applications written in the Java programming language.

Commercial data mining software and applications:
- IBM SPSS Modeler: data mining software provided by IBM.
- Microsoft Analysis Services: data mining software provided by Microsoft.
- Oracle Data Mining: data mining software by Oracle.
- SAS Enterprise Miner: data mining software provided by the SAS Institute.
- STATISTICA Data Miner: data mining software provided by StatSoft.

WEKA main features:
- 49 data preprocessing tools
- 76 classification/regression algorithms
- 8 clustering algorithms
- 3 algorithms for finding association rules
- 15 attribute/subset evaluators plus 10 search algorithms for feature selection

WEKA main GUIs:
- "The Explorer" (exploratory data analysis)
- "The Experimenter" (experimental environment)
- "The Knowledge Flow" (new process-model-inspired interface)

WEKA only deals with "flat" files, for example:

    @relation heart-disease-simplified
    @attribute age numeric
    @attribu
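WEKA's flat files use the ARFF format, in which a header of @relation and @attribute lines precedes the data rows. As a rough sketch of that header structure, the Python function below (a hypothetical helper, not part of WEKA) reads the relation name and the attribute declarations. In the sample text only the relation name and the age attribute come from the chapter; the other attributes and the data row are invented for illustration. The parser is deliberately simplified and does not handle quoted attribute names or tab separators.

```python
def parse_arff_header(text):
    """Return (relation_name, [(attribute_name, attribute_type), ...])."""
    relation = None
    attributes = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):      # skip blanks and comments
            continue
        keyword, _, rest = line.partition(" ")
        keyword = keyword.lower()                 # ARFF keywords are case-insensitive
        if keyword == "@relation":
            relation = rest.strip()
        elif keyword == "@attribute":
            name, _, kind = rest.strip().partition(" ")
            attributes.append((name, kind.strip()))
        elif keyword == "@data":                  # the header ends here
            break
    return relation, attributes


# Hypothetical file contents: everything after the "age" attribute is made up.
SAMPLE = """\
@relation heart-disease-simplified
@attribute age numeric
@attribute sex {female, male}
@attribute disease {yes, no}
@data
63,male,no
"""

relation, attributes = parse_arff_header(SAMPLE)
print(relation)
print(attributes)
```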