




版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)
文檔簡介
_
MachineLearningForAbsoluteBeginners
OliverTheobald
SecondEdition
Copyright?2017byOliverTheobald
Allrightsreserved.Nopartofthispublicationmaybereproduced,distributed,ortransmittedinanyformorbyanymeans,includingphotocopying,recording,orotherelectronicormechanicalmethods,withoutthepriorwrittenpermissionofthepublisher,exceptinthecaseofbriefquotationsembodiedincriticalreviewsandcertainothernon-commercialusespermittedbycopyrightlaw.
Contents
INTRODUCTION
WHATISMACHINELEARNING?MLCATEGORIES
THEMLTOOLBOXDATASCRUBBING
SETTINGUPYOURDATAREGRESSIONANALYSISCLUSTERING
BIAS&VARIANCE
ARTIFICIALNEURALNETWORKSDECISIONTREES
ENSEMBLEMODELINGBUILDINGAMODELINPYTHONMODELOPTIMIZATIONFURTHERRESOURCESDOWNLOADINGDATASETSFINALWORD
INTRODUCTION
MachineshavecomealongwaysincetheIndustrialRevolution.Theycontinuetofillfactoryfloorsandmanufacturingplants,butnowtheircapabilitiesextendbeyondmanualactivitiestocognitivetasksthat,untilrecently,onlyhumanswerecapableofperforming.Judgingsongcompetitions,drivingautomobiles,andmoppingthefloorwithprofessionalchessplayersarethreeexamplesofthespecificcomplextasksmachinesarenowcapableofsimulating.
Buttheirremarkablefeatstriggerfearamongsomeobservers.Partofthisfearnestlesontheneckofsurvivalistinsecurities,whereitprovokesthedeep-seatedquestionofwhatif?Whatifintelligentmachinesturnonusinastruggleofthefittest?Whatifintelligentmachinesproduceoffspringwithcapabilitiesthathumansneverintendedtoimparttomachines?Whatifthelegendofthesingularityistrue?
Theothernotablefearisthethreattojobsecurity,andifyou’reatruckdriveroranaccountant,thereisavalidreasontobeworried.AccordingtotheBritishBroadcastingCompany’s(BBC)interactiveonlineresourceWillarobottakemyjob?,professionssuchasbarworker(77%),waiter(90%),charteredaccountant(95%),receptionist(96%),andtaxidriver(57%)eachhaveahighchanceofbecomingautomatedbytheyear2035.
[1]
Butresearchonplannedjobautomationandcrystalballgazingwithrespecttothefutureevolutionofmachinesandartificialintelligence(AI)shouldbereadwithapinchofskepticism.AItechnologyismovingfast,butbroadadoptionisstillanuncharteredpathfraughtwithknownandunforeseenchallenges.Delaysandotherobstaclesareinevitable.
NorismachinelearningasimplecaseofflickingaswitchandaskingthemachinetopredicttheoutcomeoftheSuperBowlandserveyouadeliciousmartini.Machinelearningisfarfromwhatyouwouldcallanout-of-the-boxsolution.
Machinesoperatebasedonstatisticalalgorithmsmanagedandoverseenbyskilledindividuals—knownasdatascientistsandmachinelearningengineers.Thisisonelabormarketwherejobopportunitiesaredestinedfor
growthbutwhere,currently,supplyisstrugglingtomeetdemand.IndustryexpertslamentthatoneofthebiggestobstaclesdelayingtheprogressofAIistheinadequatesupplyofprofessionalswiththenecessaryexpertiseandtraining.
AccordingtoCharlesGreen,theDirectorofThoughtLeadershipatBelatrixSoftware:
“It’sahugechallengetofinddatascientists,peoplewithmachinelearningexperience,orpeoplewiththeskillstoanalyzeandusethedata,aswellasthosewhocancreatethealgorithmsrequiredformachinelearning.Secondly,whilethetechnologyisstillemerging,therearemanyongoingdevelopments.It’sclearthatAIisalongwayfromhowwemightimagineit.”
[2]
Perhapsyourownpathtobecominganexpertinthefieldofmachinelearningstartshere,ormaybeabaselineunderstandingissufficienttosatisfyyourcuriosityfornow.Inanycase,let’sproceedwiththeassumptionthatyouarereceptivetotheideaoftrainingtobecomeasuccessfuldatascientistormachinelearningengineer.
Tobuildandprogramintelligentmachines,youmustfirstunderstandclassicalstatistics.Algorithmsderivedfromclassicalstatisticscontributethemetaphoricalbloodcellsandoxygenthatpowermachinelearning.Layeruponlayeroflinearregression,k-nearestneighbors,andrandomforestssurgethroughthemachineanddrivetheircognitiveabilities.Classicalstatisticsisattheheartofmachinelearningandmanyofthesealgorithmsarebasedonthesamestatisticalequationsyoustudiedinhighschool.Indeed,statisticalalgorithmswereconductedonpaperwellbeforemachinesevertookonthetitleofartificialintelligence.
Computerprogrammingisanotherindispensablepartofmachinelearning.Thereisn’taclick-and-dragorWeb2.0solutiontoperformadvancedmachinelearninginthewayonecanconvenientlybuildawebsitenowadayswithWordPressorStrikingly.Programmingskillsarethereforevitaltomanagedataanddesignstatisticalmodelsthatrunonmachines.
Somestudentsofmachinelearningwillhaveyearsofprogrammingexperiencebuthaven’ttouchedclassicalstatisticssincehighschool.Others,perhaps,neverevenattemptedstatisticsintheirhighschoolyears.Butnottoworry,manyofthemachinelearningalgorithmswediscussinthisbookhaveworkingimplementationsinyourprogramminglanguageofchoice;noequationwritingnecessary.Youcanusecodetoexecutetheactualnumber
crunchingforyou.
Ifyouhavenotlearnedtocodebefore,youwillneedtoifyouwishtomakefurtherprogressinthisfield.Butforthepurposeofthiscompactstarter’scourse,thecurriculumcanbecompletedwithoutanybackgroundincomputerprogramming.Thisbookfocusesonthehigh-levelfundamentalsofmachinelearningaswellasthemathematicalandstatisticalunderpinningsofdesigningmachinelearningmodels.
Forthosewhodowishtolookattheprogrammingaspectofmachinelearning,Chapter13walksyouthroughtheentireprocessofsettingupasupervisedlearningmodelusingthepopularprogramminglanguagePython.
WHATISMACHINELEARNING?
In1959,IBMpublishedapaperintheIBMJournalofResearchandDevelopmentwithan,atthetime,obscureandcurioustitle.AuthoredbyIBM’sArthurSamuel,thepaperinvestedtheuseofmachinelearninginthegameofcheckers“toverifythefactthatacomputercanbeprogrammedsothatitwilllearntoplayabettergameofcheckersthancanbeplayedbythepersonwhowrotetheprogram.”
[3]
Althoughitwasnotthefirstpublicationtousetheterm“machinelearning”perse,ArthurSamueliswidelyconsideredasthefirstpersontocoinanddefinemachinelearningintheformwenowknowtoday.Samuel’slandmarkjournalsubmission,SomeStudiesinMachineLearningUsingtheGameofCheckers,isalsoanearlyindicationofhomosapiens’determinationtoimpartourownsystemoflearningtoman-mademachines.
Figure1:Historicalmentionsof“machinelearning”inpublishedbooks.Source:GoogleNgramViewer,2017
ArthurSamuelintroducesmachinelearninginhispaperasasubfieldofcomputersciencethatgivescomputerstheabilitytolearnwithoutbeingexplicitlyprogrammed.
[4]
Almostsixdecadeslater,thisdefinitionremainswidelyaccepted.
AlthoughnotdirectlymentionedinArthurSamuel’sdefinition,akeyfeatureofmachinelearningistheconceptofself-learning.Thisreferstotheapplicationofstatisticalmodelingtodetectpatternsandimprove
performancebasedondataandempiricalinformation;allwithoutdirectprogrammingcommands.ThisiswhatArthurSamueldescribedastheabilitytolearnwithoutbeingexplicitlyprogrammed.Buthedoesn’tinferthatmachinesformulatedecisionswithnoupfrontprogramming.Onthecontrary,machinelearningisheavilydependentoncomputerprogramming.Instead,Samuelobservedthatmachinesdon’trequireadirectinputcommandtoperformasettaskbutratherinputdata.
Figure2:ComparisonofInputCommandvsInputData
Anexampleofaninputcommandistyping“2+2”intoaprogramminglanguagesuchasPythonandhitting“Enter.”
>>>2+2
4
>>>
Thisrepresentsadirectcommandwithadirectanswer.
Inputdata,however,isdifferent.Dataisfedtothemachine,analgorithmisselected,hyperparameters(settings)areconfiguredandadjusted,andthemachineisinstructedtoconductitsanalysis.Themachineproceedstodecipherpatternsfoundinthedatathroughtheprocessoftrialanderror.Themachine’sdatamodel,formedfromanalyzingdatapatterns,canthenbeusedtopredictfuturevalues.
Althoughthereisarelationshipbetweentheprogrammerandthemachine,theyoperatealayerapartincomparisontotraditionalcomputerprogramming.Thisisbecausethemachineisformulatingdecisionsbasedonexperienceandmimickingtheprocessofhuman-baseddecision-making.
Asanexample,let’ssaythatafterexaminingtheYouTubeviewinghabitsofdatascientistsyourmachineidentifiesastrongrelationshipbetweendata
scientistsandcatvideos.Later,yourmachineidentifiespatternsamongthephysicaltraitsofbaseballplayersandtheirlikelihoodofwinningtheseason’sMostValuablePlayer(MVP)award.Inthefirstscenario,themachineanalyzedwhatvideosdatascientistsenjoywatchingonYouTubebasedonuserengagement;measuredinlikes,subscribes,andrepeatviewing.Inthesecondscenario,themachineassessedthephysicalfeaturesofpreviousbaseballMVPsamongvariousotherfeaturessuchasageandeducation.However,inneitherofthesetwoscenarioswasyourmachineexplicitlyprogrammedtoproduceadirectoutcome.Youfedtheinputdataandconfiguredthenominatedalgorithms,butthefinalpredictionwasdeterminedbythemachinethroughself-learninganddatamodeling.
Youcanthinkofbuildingadatamodelassimilartotrainingaguidedog.Throughspecializedtraining,guidedogslearnhowtorespondinvarioussituations.Forexample,thedogwilllearntoheelataredlightortosafelyleaditsmasteraroundobstacles.Ifthedoghasbeenproperlytrained,then,eventually,thetrainerwillnolongerberequired;theguidedogwillbeabletoapplyitstraininginvariousunsupervisedsituations.Similarly,machinelearningmodelscanbetrainedtoformdecisionsbasedonpastexperience.
Asimpleexampleiscreatingamodelthatdetectsspamemailmessages.Themodelistrainedtoblockemailswithsuspicioussubjectlinesandbodytextcontainingthreeormoreflaggedkeywords:dearfriend,free,invoice,PayPal,Viagra,casino,payment,bankruptcy,andwinner.Atthisstage,though,wearenotyetperformingmachinelearning.Ifwerecallthevisualrepresentationofinputcommandvsinputdata,wecanseethatthisprocessconsistsofonlytwosteps:Command>Action.
Machinelearningentailsathree-stepprocess:Data>Model>Action.
Thus,toincorporatemachinelearningintoourspamdetectionsystem,weneedtoswitchout“command”for“data”andadd“model”inordertoproduceanaction(output).Inthisexample,thedatacomprisessampleemailsandthemodelconsistsofstatistical-basedrules.Theparametersofthemodelincludethesamekeywordsfromouroriginalnegativelist.Themodelisthentrainedandtestedagainstthedata.
Oncethedataisfedintothemodel,thereisastrongchancethatassumptionscontainedinthemodelwillleadtosomeinaccuratepredictions.Forexample,undertherulesofthismodel,thefollowingemailsubjectlinewouldautomaticallybeclassifiedasspam:“PayPalhasreceivedyourpaymentforCasinoRoyalepurchasedoneBay.”
AsthisisagenuineemailsentfromaPayPalauto-responder,thespamdetectionsystemisluredintoproducingafalsepositivebasedonthenegativelistofkeywordscontainedinthemodel.Traditionalprogrammingishighlysusceptibletosuchcasesbecausethereisnobuilt-inmechanismtotestassumptionsandmodifytherulesofthemodel.Machinelearning,ontheotherhand,canadaptandmodifyassumptionsthroughitsthree-stepprocessandbyreactingtoerrors.
Training&TestData
Inmachinelearning,dataissplitintotrainingdataandtestdata.Thefirstsplitofdata,i.e.theinitialreserveofdatayouusetodevelopyourmodel,providesthetrainingdata.Inthespamemaildetectionexample,falsepositivessimilartothePayPalauto-responsemightbedetectedfromthetrainingdata.Newrulesormodificationsmustthenbeadded,e.g.,emailnotificationsissuedfromthesendingaddress“
payments@
”shouldbeexcludedfromspamfiltering.
Afteryouhavesuccessfullydevelopedamodelbasedonthetrainingdataandaresatisfiedwithitsaccuracy,youcanthentestthemodelontheremainingdata,knownasthetestdata.Onceyouaresatisfiedwiththeresultsofboththetrainingdataandtestdata,themachinelearningmodelisreadytofilterincomingemailsandgeneratedecisionsonhowtocategorizethoseincomingmessages.
Thedifferencebetweenmachinelearningandtraditionalprogrammingmayseemtrivialatfirst,butitwillbecomeclearasyourunthroughfurtherexamplesandwitnessthespecialpowerofself-learninginmorenuancedsituations.
Thesecondimportantpointtotakeawayfromthischapterishowmachinelearningfitsintothebroaderlandscapeofdatascienceandcomputerscience.Thismeansunderstandinghowmachinelearninginterrelateswithparentfieldsandsisterdisciplines.Thisisimportant,asyouwillencountertheserelatedtermswhensearchingforrelevantstudymaterials—andyouwillhearthemmentionedadnauseaminintroductorymachinelearningcourses.Relevantdisciplinescanalsobedifficulttotellapartatfirstglance,suchas“machinelearning”and“datamining.”
Let’sbeginwithahigh-levelintroduction.Machinelearning,datamining,computerprogramming,andmostrelevantfields(excludingclassical
statistics)derivefirstfromcomputerscience,whichencompasseseverythingrelatedtothedesignanduseofcomputers.Withintheall-encompassingspaceofcomputerscienceisthenextbroadfield:datascience.Narrowerthancomputerscience,datasciencecomprisesmethodsandsystemstoextractknowledgeandinsightsfromdatathroughtheuseofcomputers.
Figure3:ThelineageofmachinelearningrepresentedbyarowofRussianmatryoshkadolls
Poppingoutfromcomputerscienceanddatascienceasthethirdmatryoshkadollisartificialintelligence.Artificialintelligence,orAI,encompassestheabilityofmachinestoperformintelligentandcognitivetasks.ComparabletothewaytheIndustrialRevolutiongavebirthtoaneraofmachinesthatcouldsimulatephysicaltasks,AIisdrivingthedevelopmentofmachinescapableofsimulatingcognitiveabilities.
Whilestillbroadbutdramaticallymorehonedthancomputerscienceanddatascience,AIcontainsnumeroussubfieldsthatarepopulartoday.Thesesubfieldsincludesearchandplanning,reasoningandknowledgerepresentation,perception,naturallanguageprocessing(NLP),andofcourse,machinelearning.MachinelearningbleedsintootherfieldsofAI,includingNLPandperceptionthroughtheshareduseofself-learningalgorithms.
Figure4:Visualrepresentationoftherelationshipbetweendata-relatedfields
ForstudentswithaninterestinAI,machinelearningprovidesanexcellentstartingpointinthatitoffersamorenarrowandpracticallensofstudycomparedtotheconceptualambiguityofAI.Algorithmsfoundinmachinelearningcanalsobeappliedacrossotherdisciplines,includingperceptionandnaturallanguageprocessing.Inaddition,aMaster’sdegreeisadequatetodevelopacertainlevelofexpertiseinmachinelearning,butyoumayneedaPhDtomakeanytrueprogressinAI.
Asmentioned,machinelearningalsooverlapswithdatamining—asisterdisciplinethatfocusesondiscoveringandunearthingpatternsinlargedatasets.Popularalgorithms,suchask-meansclustering,associationanalysis,andregressionanalysis,areappliedinbothdataminingandmachinelearningtoanalyzedata.Butwheremachinelearningfocusesontheincrementalprocessofself-learninganddatamodelingtoformpredictionsaboutthefuture,dataminingnarrowsinoncleaninglargedatasetstogleanvaluableinsightfromthepast.
Thedifferencebetweendataminingandmachinelearningcanbeexplainedthroughananalogyoftwoteamsofarchaeologists.Thefirstteamismadeupofarchaeologistswhofocustheireffortsonremovingdebristhatliesinthewayofvaluableitems,hidingthemfromdirectsight.Theirprimarygoalsaretoexcavatethearea,findnewvaluablediscoveries,andthenpackuptheirequipmentandmoveon.Adaylater,theywillflytoanotherexoticdestinationtostartanewprojectwithnorelationshiptothesitethey
excavatedthedaybefore.
Thesecondteamisalsointhebusinessofexcavatinghistoricalsites,butthesearchaeologistsuseadifferentmethodology.Theydeliberatelyreframefromexcavatingthemainpitforseveralweeks.Inthattime,theyvisitotherrelevantarchaeologicalsitesintheareaandexaminehoweachsitewasexcavated.Afterreturningtothesiteoftheirownproject,theyapplythisknowledgetoexcavatesmallerpitssurroundingthemainpit.
Thearchaeologiststhenanalyzetheresults.Afterreflectingontheirexperienceexcavatingonepit,theyoptimizetheireffortstoexcavatethenext.Thisincludespredictingtheamountoftimeittakestoexcavateapit,understandingvarianceandpatternsfoundinthelocalterrainanddevelopingnewstrategiestoreduceerrorandimprovetheaccuracyoftheirwork.Fromthisexperience,theyareabletooptimizetheirapproachtoformastrategicmodeltoexcavatethemainpit.
Ifitisnotalreadyclear,thefirstteamsubscribestodataminingandthesecondteamtomachinelearning.Atamicro-level,bothdataminingandmachinelearningappearsimilar,andtheydousemanyofthesametools.Bothteamsmakealivingexcavatinghistoricalsitestodiscovervaluableitems.Butinpractice,theirmethodologyisdifferent.Themachinelearningteamfocusesondividingtheirdatasetintotrainingdataandtestdatatocreateamodel,andimprovingfuturepredictionsbasedonpreviousexperience.Meanwhile,thedataminingteamconcentratesonexcavatingthetargetareaaseffectivelyaspossible—withouttheuseofaself-learningmodel—beforemovingontothenextcleanupjob.
MLCATEGORIES
Machinelearningincorporatesseveralhundredstatistical-basedalgorithmsandchoosingtherightalgorithmorcombinationofalgorithmsforthejobisaconstantchallengeforanyoneworkinginthisfield.Butbeforeweexaminespecificalgorithms,itisimportanttounderstandthethreeoverarchingcategoriesofmachinelearning.Thesethreecategoriesaresupervised,unsupervised,andreinforcement.
SupervisedLearning
Asthefirstbranchofmachinelearning,supervisedlearningconcentratesonlearningpatternsthroughconnectingtherelationshipbetweenvariablesandknownoutcomesandworkingwithlabeleddatasets.
Supervisedlearningworksbyfeedingthemachinesampledatawithvariousfeatures(representedas“X”)andthecorrectvalueoutputofthedata(representedas“y”).Thefactthattheoutputandfeaturevaluesareknownqualifiesthedatasetas“l(fā)abeled.”Thealgorithmthendecipherspatternsthatexistinthedataandcreatesamodelthatcanreproducethesameunderlyingruleswithnewdata.
Forinstance,topredictthemarketrateforthepurchaseofausedcar,asupervisedalgorithmcanformulatepredictionsbyanalyzingtherelationshipbetweencarattributes(includingtheyearofmake,carbrand,mileage,etc.)andthesellingpriceofothercarssoldbasedonhistoricaldata.Giventhatthesupervisedalgorithmknowsthefinalpriceofothercardssold,itcanthenworkbackwardtodeterminetherelationshipbetweenthecharacteristicsofthecaranditsvalue.
Figure1:Carvaluepredictionmodel
Afterthemachinedecipherstherulesandpatternsofthedata,itcreateswhatisknownasamodel:analgorithmicequationforproducinganoutcomewithnewdatabasedontherulesderivedfromthetrainingdata.Oncethemodelisprepared,itcanbeappliedtonewdataandtestedforaccuracy.Afterthemodelhaspassedboththetrainingandtestdatastages,itisreadytobeappliedandusedintherealworld.
InChapter13,wewillcreateamodelforpredictinghousevalueswhereyistheactualhousepriceandXarethevariablesthatimpacty,suchaslandsize,location,andthenumberofrooms.Throughsupervisedlearning,wewillcreatearuletopredicty(housevalue)basedonthegivenvaluesofvariousvariables(X).
Examplesofsupervisedlearningalgorithmsincluderegressionanalysis,decisiontrees,k-nearestneighbors,neuralnetworks,andsupportvectormachines.Eachofthesetechniqueswillbeintroducedlaterinthebook.
UnsupervisedLearning
Inthecaseofunsupervisedlearning,notallvariablesanddatapatternsareclassified.Instead,themachinemustuncoverhiddenpatternsandcreatelabelsthroughtheuseofunsupervisedlearningalgorithms.Thek-meansclusteringalgorithmisapopularexampleofunsupervisedlearning.ThissimplealgorithmgroupsdatapointsthatarefoundtopossesssimilarfeaturesasshowninFigure1.
Figure1:Exampleofk-meansclustering,apopularunsupervisedlearningtechnique
IfyougroupdatapointsbasedonthepurchasingbehaviorofSME(SmallandMedium-sizedEnterprises)andlargeenterprisecustomers,forexample,youarelikelytoseetwoclustersemerge.ThisisbecauseSMEsandlargeenterprisestendtohavedisparatebuyinghabits.Whenitcomestopurchasingcloudinfrastructure,forinstance,basiccloudhostingresourcesandaContentDeliveryNetwork(CDN)mayprovesufficientformostSMEcustomers.Largeenterprisecustomers,though,aremorelikelytopurchaseawiderarrayofcloudproductsandentiresolutionsthatincludeadvancedsecurityandnetworkingproductslikeWAF(WebApplicationFirewall),adedicatedprivateconnection,andVPC(VirtualPrivateCloud).Byanalyzingcustomerpurchasinghabits,unsupervisedlearningiscapableofidentifyingthesetwogroupsofcustomerswithoutspecificlabelsthatclassifythecompanyassmall,mediumorlarge.
Theadvantageofunsupervisedlearningisitenablesyoutodiscoverpatternsinthedatathatyouwereunawareexisted—suchasthepresenceoftwomajorcustomertypes.Clusteringtechniquessuchask-meansclusteringcanalsoprovidethespringboardforconductingfurtheranalysisafterdiscretegroupshavebeendiscovered.
Inindustry,unsupervisedlearningisparticularlypowerfulinfrauddetection
—wherethemostdangerousattacksareoftenthoseyettobeclassified.Onereal-worldexampleisDataVisor,whoessentiallybuilttheirbusinessmodelbasedonunsupervisedlearning.
Foundedin2013inCalifornia,DataVisorprotectscustomersfromfraudulent
onlineactivities,includingspam,fakereviews,fakeappinstalls,andfraudulenttransactions.Whereastraditionalfraudprotectionservicesdrawonsupervisedlearningmodelsandruleengines,DataVisorusesunsupervisedlearningwhichenablesthemtodetectunclassifiedcategoriesofattacksintheirearlystages.
Ontheirwebsite,DataVisorexplainsthat"todetectattacks,existingsolutionsrelyonhumanexperiencetocreaterulesorlabeledtrainingdatatotunemodels.Thismeanstheyareunabletodetectnewattacksthathaven’talreadybeenidentifiedbyhumansorlabeledintrainingdata."
[5]
Thismeansthattraditionalsolutionsanalyzethechainofactivityforaparticularattackandthencreaterulestopredictarepeatattack.Underthisscenario,thedependentvariable(y)istheeventofanattackandtheindependentvariables(X)arethecommonpredictorvariablesofanattack.Examplesofindependentvariablescouldbe:
Asuddenlargeorderfromanunknownuser.I.E.establishedcustomersgenerallyspendlessthan$100perorder,butanewuserspends$8,000inoneorderimmediatelyuponregisteringtheiraccount.
Asuddensurgeofuserratings.I.E.AsatypicalauthorandbookselleronA,it’suncommonformyfirstpublishedworktoreceivemorethanonebookreviewwithinthespaceofonetotwodays.Ingeneral,approximately1in200Amazonreadersleaveabookreviewandmostbooksgoweeksormonthswithoutareview.However,Icommonlyseecompetitorsinthiscategory(datascience)attracting20-50reviewsinoneday!(Unsurprisingly,IalsoseeAmazonremovingthesesuspiciousreviewsweeksormonthslater.)
Identicalorsimilaruserreviewsfromdifferentusers.FollowingthesameAmazonanalogy,Ioftenseeuserreviewsofmybookappearonotherbooksseveralmonthslater(sometimeswithareferencetomynameastheauthorstillincludedinthereview!).Again,Amazoneventuallyremovesthesefakereviewsandsuspendstheseaccountsforbreakingtheirtermsofservice.
Suspiciousshippingaddress.I.E.Forsmallbusinessesthatroutinelyshipproductstolocalcustomers,anorderfromadistantlocation(wheretheydon'tadvertisetheirproducts)caninrarecasesbeanindicatoroffraudulentormaliciousactivity.
Standaloneactivitiessuchasasuddenlargeorderoradistantshippingaddressmayprovetoolittleinformationtopredictsophisticated
cybercriminalactivityandmorelikelytoleadtomanyfalsepositives.Butamodelthatmonitorscombinationsofindependentvariables,suchasasuddenlargepurchaseorderfromtheothersideoftheglobeoralandslideofbookreviewsthatreuseexistingcontentwillgenerallyleadtomoreaccuratepredictions.Asupervisedlearning-basedmodelcoulddeconstructandclassifywhatthesecommonindependentvariablesareanddesignadetectionsystemtoidentifyandpreventrepeatoffenses.
Sophisticatedcybercriminals,though,learntoevadeclassification-basedruleenginesbymodifyingtheirtactics.Inaddition,leadinguptoanattack,attackersoftenregisterandoperatesingleormultipleaccountsandincubatetheseaccountswithactivitiesthatmimiclegitimateusers.Theythenutilizetheirestablishedaccounthistorytoevadedetectionsystems,whicharetrigger-heavyagainstrecentlyregisteredaccounts.Supervisedlearning-basedsolutionsstruggletodetectsleepercellsuntiltheactualdamagehasbeenmadeandespeciallywithregardtonewcategoriesofattacks.
DataVisorandotheranti-fraudsolutionprovidersthereforeleverageunsupervisedlearningtoaddressthelimitationsofsupervisedlearningbyanalyzingpatternsacrosshundredsofmillionsofaccountsandidentifyi
溫馨提示
- 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
- 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
- 5. 人人文庫網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
- 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
- 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。
最新文檔
- 臨時(shí)司機(jī)招聘合同樣本
- 社區(qū)團(tuán)購項(xiàng)目商業(yè)模式創(chuàng)新與發(fā)展規(guī)劃
- 臨期鋪面轉(zhuǎn)讓合同樣本
- 農(nóng)產(chǎn)品批發(fā)市場發(fā)展前景與可行性分析報(bào)告
- 個(gè)人攤位出租合同標(biāo)準(zhǔn)文本
- 交旅融合趨勢(shì)與市場前景深度解析
- 《工具與技術(shù):7 信息的交流傳播》教學(xué)設(shè)計(jì)-2024-2025學(xué)年教科版科學(xué)六年級(jí)上冊(cè)
- 公司與酒店合同樣本
- Unit 2 Expressing yourself Part A Lets learn Listen and chant(教學(xué)設(shè)計(jì))-2024-2025學(xué)年人教PEP版(2024)英語三年級(jí)下冊(cè)
- 個(gè)人正規(guī)購房合同標(biāo)準(zhǔn)文本
- 現(xiàn)場6S管理的基本要素
- 人工智能知識(shí)競賽題庫(含答案)
- 危機(jī)管理的步驟與危機(jī)處理
- 巖土工程勘察服務(wù)投標(biāo)方案(技術(shù)方案)
- 重慶汽車產(chǎn)業(yè)“走出去”問題研究
- 幼兒園PPT課件之大班繪本《小老鼠的探險(xiǎn)日記》
- 咖啡師培訓(xùn)講義-PPT
- 員工親屬住宿申請(qǐng)表
- 道德講堂:明禮誠信
- 《蔬菜種植》校本教材-學(xué)
- 自我評(píng)價(jià)主要學(xué)術(shù)貢獻(xiàn)、創(chuàng)新成果及其科學(xué)價(jià)值或社會(huì)經(jīng)濟(jì)意義
評(píng)論
0/150
提交評(píng)論