
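Data Mining, Chapter 8: Standards, Tools, and Development Trends

Chapter contents
8.1 Data mining standards and specifications
8.2 Data mining tools
8.3 Research trends in data mining

Basic requirement: understand the standards and specifications relevant to applied data mining, and the directions of future research.

8.1 Data Mining Standards and Specifications

A data mining process model is key to keeping data mining work on track. Typical process models include:
- SPSS's 5A model: Assess, Access, Analyze, Act, Automate.
- SAS's SEMMA model: Sample, Explore, Modify, Model, Assess.
- CRISP-DM (Cross Industry Standard Process for Data Mining), the cross-industry data mining process standard.
- The Two Crows data mining process model, which has much in common with the CRISP-DM standard under development.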

Related data mining standards

CRISP-DM (Cross Industry Standard Process for Data Mining). SPSS, NCR, and DaimlerChrysler, three companies with extensive data mining experience, founded a consortium to establish a standard for data mining methods and processes.

CRISP-DM project phases and the report sections each one produces:

Project Objectives (Business Understanding)
  - Background
  - Requirements, assumptions, constraints
  - Terminology
  - Data mining goals & success criteria
  - Project plan
Data Understanding
  - Initial data collection report
  - Data description report
  - Data exploration report
  - Data quality report
Data Preparation
  - Data description report
  - Data pre-processing steps
Modeling
  - Modeling assumptions
  - Test design
  - Model description
  - Model assessment (inc. validation)
Evaluation
  - Assessment of data mining results with respect to objectives
Reporting (Deployment)
  - Final report: summary (objectives, data mining process, data mining results, data mining assessment), conclusions, future work

CRISP-DM is:
- A widely accepted process model for data mining
- A framework for describing the modeling process in detail
- "Best practice"

Business Understanding Phase
  Understand the business objectives
    - What is the status quo? Understand business processes and the associated costs/pain
    - Define the success criteria
    - Develop a glossary of terms: speak the language
    - Cost/benefit analysis
  Current systems assessment
    - Identify the key actors (minimum: the sponsor and the key user)
    - What forms should the output take?
    - Integration of output with the existing technology landscape
    - Understand market norms and standards
  Task decomposition
    - Break down the objective into sub-tasks
    - Map sub-tasks to data mining problem definitions
  Identify constraints
    - Resources
    - Law, e.g. data protection
  Build a project plan
    - List assumptions and risk factors (technical/financial/business/organisational)

Data Understanding Phase
  Collect data
    - What are the data sources? Internal and external sources (e.g. Acxiom, Experian)
    - Document reasons for inclusion/exclusion
    - Depend on a domain expert
    - Accessibility issues
    - Are there issues regarding data distribution across different databases/legacy systems? Where are the disconnects?
  Data description
    - Document data quality issues
    - Compute basic statistics
  Data exploration
    - Simple univariate data plots/distributions
    - Investigate attribute interactions
  Data quality issues
    - Missing values: understand their source
    - Strange distributions

Data Preparation Phase
  Integrate data
    - Joining multiple data tables
    - Summarisation/aggregation of data

  Select data
    - Attribute subset selection
    - Rationale for inclusion/exclusion
    - Data sampling
    - Training/validation and test sets
  Data transformation
    - Using functions such as log
    - Factor/principal components analysis
    - Normalization/discretisation/binarisation
  Clean data
    - Handling missing values/outliers
  Data construction
    - Derived attributes
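
Several of the Data Preparation steps listed above (log transformation, normalization, discretisation, and the training/validation/test split) can be illustrated with a minimal sketch in plain Python. The function names, the 60/20/20 split ratio, and the choice of min-max scaling and equal-width bins are assumptions made for illustration; the slides do not prescribe any particular method.

```python
import math
import random


def log_transform(values):
    """Apply the natural log to each value (all values must be positive)."""
    return [math.log(v) for v in values]


def min_max_normalize(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, n_bins):
    """Discretise continuous values into n_bins equal-width bins (0-based)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would fall into bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def split_rows(rows, train=0.6, validation=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


if __name__ == "__main__":
    ages = [20, 35, 50]  # hypothetical attribute values
    print(min_max_normalize(ages))
    print(equal_width_bins(ages, 3))
    print([len(part) for part in split_rows(range(10))])
```

Fixing the random seed keeps the split reproducible, which matters when the rationale for inclusion/exclusion and sampling has to be documented.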

The Modeling Phase
  Build model
    - Choose initial parameter settings
    - Study model behaviour: sensitivity analysis
  Assess the model
    - Beware of over-fitting
    - Investigate the error distribution: identify segments of the state space where the model is less effective
    - Iteratively adjust parameter settings

The Evaluation Phase
  Validate model
    - Human evaluation of results by domain experts
    - Evaluate usefulness of results from a business perspective
    - Define control groups
    - Calculate lift curves
    - Expected return on investment
  Review process
  Determine next steps
    - Potential for deployment
    - Deployment architecture
    - Metrics for success of deployment

PMML (Predictive Model Markup Language). Data mining applications often need several kinds of data mining software and algorithms to work together, which requires that mined models can be readily inherited, reused, and integrated. The Data Mining Group (DMG) proposed the PMML language for this purpose. At the time of writing, the latest PMML version is 4.1, which supports 16 kinds of data mining model:

- AssociationModel (association rules)
- BaselineModel (baseline model)
- ClusteringModel (clustering)
- GeneralRegressionModel (general regression)
- MiningModel (combined model)
- NaiveBayesModel (naive Bayes)
- NearestNeighborModel (nearest neighbor)
- NeuralNetwork (neural network)
- RegressionModel (linear, polynomial, and logistic regression)
- RuleSetModel (rule set)
- SequenceModel (sequence patterns)
- Scorecard
- TimeSeriesModel
- SupportVectorMachineModel (support vector machine)
- TextModel (text model)
- TreeModel (decision tree)

A PMML model definition consists of the following parts.

Header: the Header element contains general information about the PMML document, such as copyright information for the model, its description, and information about the application used to generate the model, such as its name and version.

  <PMML version="3.2" ...
    <Header copyright="Copyright (c) 2009 Togaware" description="RPart Decision Tree">
      <Extension name="timestamp" value="2009-02-15 06:51:50" extender="Rattle"/>
      <Extension name="description" value="iris tree" extender="Rattle"/>
      <Application name="Rattle/PMML" version="1.2.7"/>
    </Header>

Data Dictionary: the data dictionary records information about the data fields from which the model was built.

  <DataDictionary numberOfFields="5">
    <DataField name="Species" ...
      <Value value="setosa"/>
      <Value value="versicolor"/>
      <Value value="virginica"/>
    </DataField>
    <DataField name="Sepal.Length" optype="continuous" dataType="double"/>

Data Transformations: transformations allow user data to be mapped into a more desirable form for use by the mining model. PMML defines several kinds of simple data transformation:
- Normalization: map values to numbers; the input can be continuous or discrete.
- Discretization: map continuous values to discrete values.
- Value mapping: map discrete values to discrete values.
- Functions (custom and built-in): derive a value by applying a function to one or more parameters.
- Aggregation: summarize or collect groups of values.

Model: contains the definition of the data mining model. Its attributes include:
- Model name (attribute modelName)
- Algorithm name (attribute algorithmName)
- Number of layers (attribute numberOfLayers)

Mining Schema: lists all fields used in the model.
- Name: must refer to a field in the data dictionary.
- Usage type: defines the way a field is to be used in the model. Typical values are active, predicted, and supplementary. Predicted fields are those whose values are predicted by the model.
- Outlier treatment: defines the outlier treatment to be used.
- Missing value replacement policy: if this attribute is specified, a missing value is automatically replaced by the given value.
- Missing value treatment: indicates how the missing value replacement was derived.

Targets: allow for post-processing of the predicted value in the form of scaling, if the output of the model is continuous.
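
To make the roles of the data dictionary and the mining schema concrete, the following sketch reads both from a small PMML-like fragment using Python's standard `xml.etree.ElementTree` module. The fragment is a hypothetical, trimmed illustration loosely modelled on the iris example: it omits the XML namespace that real PMML files declare, the `optype`/`dataType` of `Species` and the `TreeModel` attributes are assumed, and the tree nodes are left out, so it is not a complete, valid PMML model. The helper names are also made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed fragment (no namespace, no tree nodes); illustration only.
PMML_TEXT = """
<PMML version="3.2">
  <Header description="RPart Decision Tree"/>
  <DataDictionary numberOfFields="2">
    <DataField name="Species" optype="categorical" dataType="string">
      <Value value="setosa"/>
      <Value value="versicolor"/>
      <Value value="virginica"/>
    </DataField>
    <DataField name="Sepal.Length" optype="continuous" dataType="double"/>
  </DataDictionary>
  <TreeModel modelName="iris_tree" functionName="classification">
    <MiningSchema>
      <MiningField name="Species" usageType="predicted"/>
      <MiningField name="Sepal.Length"/>
    </MiningSchema>
  </TreeModel>
</PMML>
"""


def read_data_dictionary(root):
    """Return {field name: {optype, dataType, values}} from the DataDictionary."""
    fields = {}
    for field in root.findall("./DataDictionary/DataField"):
        fields[field.get("name")] = {
            "optype": field.get("optype"),
            "dataType": field.get("dataType"),
            "values": [v.get("value") for v in field.findall("Value")],
        }
    return fields


def read_mining_schema(root):
    """Return {field name: usage type}; a missing usageType is treated as 'active'."""
    schema = {}
    for mining_field in root.findall(".//MiningSchema/MiningField"):
        schema[mining_field.get("name")] = mining_field.get("usageType", "active")
    return schema


def undeclared_fields(dictionary, schema):
    """List mining fields that do not refer to a field in the data dictionary."""
    return sorted(name for name in schema if name not in dictionary)


root = ET.fromstring(PMML_TEXT.strip())
dictionary = read_data_dictionary(root)
schema = read_mining_schema(root)

if __name__ == "__main__":
    print(dictionary["Species"]["values"])
    print(schema)
    print(undeclared_fields(dictionary, schema))
```

The last helper checks the rule stated above that every mining field name must refer to a field in the data dictionary; an empty list means the fragment is consistent in that respect.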

PMML example: association rules

Transactions:
  t1: Cracker, Coke, Water
  t2: Cracker, Water
  t3: Cracker, Water
  t4: Cracker, Coke, Water

The PMML document for this example has four parts: model attributes, items, itemsets, and association rules. The itemsets and association rules are:

  <AssocItemset id="1" support="1.0" numberOfItems="1">
    <AssocItemRef itemRef="1"/>
  </AssocItemset>
  <AssocItemset id="2" support="1.0" numberOfItems="1">
    <AssocItemRef itemRef="3"/>
  </AssocItemset>
  <!-- and one frequent itemset with two items. -->
  <AssocItemset id="3" support="1.0" numberOfItems="2">
    <AssocItemRef itemRef="1"/>
    <AssocItemRef itemRef="3"/>
  </AssocItemset>
  <!-- Two rules satisfy the requirements -->
  <AssocRule support="1.0" confidence="1.0" antecedent="1" consequent="2"/>
  <AssocRule support="1.0" confidence="1.0" antecedent="2" consequent="1"/>
  </AssociationModel>
  </PMML>

JDM (Java Data Mining API). JDM aims to provide a standard API for accessing data mining tools. It supports building and using data mining models, and creating, storing, accessing, and maintaining data and metadata, so that Java applications can easily integrate data mining technology.

Semantic Web standards

Tim Berners-Lee first proposed the layered model of the Semantic Web (the "layer cake") in his talk at the XML 2000 conference. Its defining feature is that ontologies and logical inference rules are built on top of XML and RDF/RDFS to support semantics-based knowledge representation and reasoning, so that content can be understood and processed by computers.

Layer 1: Unicode and URI (Uniform Resource Identifier). In 1993 the Unicode character repertoire was adopted in the ISO international standard ISO/IEC 10646; its goal is a single unified encoding for all the world's writing systems. A URI has three parts: the naming scheme used to access the resource, the name of the machine hosting the resource, and the name of the resource in path form. Unlike a URL, a URI is only an identifier and does not directly provide a way to access the resource.

Layer 2: XML (eXtensible Markup Language). XML is simple, self-describing, and extensible, and it separates content, structure, and presentation, which makes it well suited to data representation and exchange. The constraints in XML Schema are mainly used to validate the structure of XML documents.

Layer 3: RDF (Resource Description Framework), the metadata layer. RDF is a framework for describing and exchanging metadata, built on XML. It describes objects in the form "resource, property, property value".

[Figure: an RDF example]

Layer 4: RDF-S (RDF Schema). RDF-S is an extension of RDF. It is RDF's vocabulary description language, used to define the vocabulary that appears in RDF resource description files.

Layer 5: ontology and rules, the domain knowledge layer. OWL is used to state explicitly the terms in a vocabulary and the relationships among them, and it offers greater expressive power for word meaning and semantics. Rules describe the premises and conclusions in domain knowledge.

SPARQL (SPARQL Protocol and RDF Query Language) is the query language and protocol recommended by the W3C for querying RDF data.

8.2 Data Mining Tools

Free open-source data mining software and applications
- GATE: a natural language processing and language engineering tool.
- Orange: a component-based data mining and machine learning software suite written in the Python language.
- R: a programming language and software environment for statistical computing, data mining, and graphics.
- RapidMiner: an environment for machine learning and data mining experiments.
- UIMA: the Unstructured Information Management Architecture is a component framework for analyzing unstructured content such as text, audio, and video, originally developed by IBM.
- Weka: a suite of machine learning software applications written in the Java programming language.

Commercial data mining software and applications
- IBM SPSS Modeler: data mining software provided by IBM.
- Microsoft Analysis Services: data mining software provided by Microsoft.
- Oracle Data Mining: data mining software by Oracle.
- SAS Enterprise Miner: data mining software provided by the SAS Institute.
- STATISTICA Data Miner: data mining software provided by StatSoft.

WEKA: main features
- 49 data preprocessing tools
- 76 classification/regression algorithms
- 8 clustering algorithms
- 3 algorithms for finding association rules
- 15 attribute/subset evaluators + 10 search algorithms for feature selection

WEKA: main GUIs
- "The Explorer" (exploratory data analysis)
- "The Experimenter" (experimental environment)
- "The Knowledge Flow" (new process-model-inspired interface)

WEKA only deals with "flat" files:

  @relation heart-disease-simplified
  @attribute age numeric
  @attribu
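
The `@relation` and `@attribute` lines above follow WEKA's ARFF flat-file format, in which a header declares the relation and its attributes and an `@data` section lists one comma-separated row per instance. As a rough sketch of what "flat" means in practice, the plain-Python function below writes a small table in that layout. The relation name `example`, the `label` attribute, and the rows are hypothetical; they are not the remaining attributes of the truncated file above. The sketch also assumes names and values contain no spaces or commas, so no quoting is handled.

```python
def format_value(value):
    """ARFF writes a missing value as a question mark."""
    return "?" if value is None else str(value)


def to_arff(relation, attributes, rows):
    """Render a flat table as ARFF text.

    attributes is a list of (name, kind) pairs, where kind is either a type
    keyword such as "numeric" or a list of allowed nominal values.
    """
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            lines.append("@attribute " + name + " " + kind)
        else:
            # Nominal attributes enumerate their values inside braces.
            lines.append("@attribute " + name + " {" + ",".join(kind) + "}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(format_value(v) for v in row))
    return "\n".join(lines) + "\n"


# Hypothetical attributes and rows, for illustration only.
ATTRIBUTES = [("age", "numeric"), ("label", ["yes", "no"])]
ROWS = [(63, "yes"), (41, "no"), (None, "no")]

if __name__ == "__main__":
    print(to_arff("example", ATTRIBUTES, ROWS))
```

Each row is a single fixed-length record, which is why data spread across several joined tables has to be integrated into one table before WEKA can use it.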
