2022 Data Summit, Data Lake: Modern Data Lake Essentials

Building the Open Data Lakehouse

Mark Lyons

Confidential - Do Not Share or Distribute

Mark
Product @ Dremio
Product Management @ Vertica
Product Manager @ EnerNOC

Agenda

● Brief history of data platforms
● Open data architecture
● Table formats

Scale & Performance

Asked to do the impossible

What You Want to Accomplish
● Data Democratization
● Support New Initiatives and Projects
● Fast Time to Value
● Security & Governance

But It's Getting Harder by the Day
● Data Is Rapidly Increasing
● Access Requests Are Rapidly Increasing
● Your Budget Is Flat
● Data Talent Is Scarce

Why data lakes & cloud?

Past & Present

● Schema on read
● Flexible w/ data files/types
● Cost effective storage
● Expensive to maintain/upgrade
● Specialized talent

● Schema on write
● JSON data type
● Consumption pricing
● Separate storage & compute
● Managed

Open Data Architecture Evolution

Data Lake 1.0

Open Compute
  SQL: Dremio, Presto, etc.
  Streaming: Databricks, Flink
  Spark: Databricks, EMR, etc.

Open Data
  Metastore: Hive Metastore, AWS Glue
  File Format: Apache Parquet
  Storage: S3, ADLS, GCS

● The data tier of the data lake was very basic, requiring significant engineering work from customers
● Analytics/access performance suffered
● Maintenance and engineering costs became overwhelming

Making the Lakehouse

Users & Applications
Interfaces: Arrow Flight, ODBC/JDBC
Data Lake Engines: Dremio, Spark, Hive, Athena
Open Table Formats: Iceberg, Delta Lake, Hudi
File Formats: Parquet, ORC, CSV, JSON
Data Lake Storage: S3, HDFS, ADLS

Going further than the EDW

Users & Applications
Interfaces: Arrow Flight, ODBC/JDBC
Data Lake Engines: Dremio, Spark, Hive, Athena
Modern Metastore: Nessie (New)
Open Table Formats: Iceberg, Delta Lake, Hudi
File Formats: Parquet, ORC, CSV, JSON
Data Lake Storage: S3, HDFS, ADLS

Transactional Tables on the Data Lake

Record-level data mutations with SQL DML
    INSERT INTO t1 ...
    UPDATE t1 SET col1 = ...
    DELETE FROM t1 WHERE state = 'CA'

Automatic partitioning
    CREATE TABLE t1 PARTITIONED BY (month(date), bucket[1000](user))

Instant schema and partition evolution
    ALTER TABLE t1 ADD/DROP/MODIFY/RENAME COLUMN c1 ...
    ALTER TABLE t1 ADD/DROP PARTITION FIELD ...

Time travel
    SELECT * FROM t1 AT/BEFORE <timestamp>
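
The automatic partitioning statement above derives partition values from column values instead of asking writers to supply them. A minimal Python sketch of that idea follows. It assumes that the month transform counts months since 1970-01 and uses CRC32 as a stand-in hash for the bucket transform; the function names are hypothetical and not taken from any engine.

```python
import zlib
from datetime import date


def month_transform(d: date) -> int:
    """One partition value per calendar month (assumed: months since 1970-01)."""
    return (d.year - 1970) * 12 + (d.month - 1)


def bucket_transform(value: str, num_buckets: int) -> int:
    """Stable hash modulo the bucket count.

    Assumption: CRC32 is only a stand-in; a real table format defines its own hash.
    """
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def partition_of(row: dict) -> tuple:
    """Partition tuple for PARTITIONED BY (month(date), bucket[1000](user))."""
    return (month_transform(row["date"]), bucket_transform(row["user"], 1000))


rows = [
    {"date": date(2020, 10, 26), "user": "alice", "state": "CA"},
    {"date": date(2020, 10, 3), "user": "alice", "state": "NY"},
    {"date": date(2020, 11, 1), "user": "alice", "state": "CA"},
]
partitions = [partition_of(r) for r in rows]
```

The first two rows land in the same partition because they share a month and a user, while the third row moves to the next month. Because the transform is part of the table definition, a query that filters on `date` can be mapped to partitions without the user naming a partition column.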

Iceberg
● Created by Netflix, Apple and other big tech
● INSERT/UPDATE/DELETE with any engine
● Strong momentum in OSS community

Delta Lake
● Created by Databricks
● INSERT/UPDATE with Spark, SELECT with any engine
● Primarily used in conjunction with Databricks

Nessie: A modern metastore

Data Branching
Transactions on steroids
– Multi-table consistency/transactions
– Experimentation (isolated dev/test)
– Pre-prod verification (stage → prod)

    CREATE BRANCH etl
    [Kafka] Data Ingestion
    [Spark] Transformation 1
    [Spark] Transformation 2
    [Dremio] Reflection Refresh 1
    [Dremio] Reflection Refresh 2
    USE BRANCH main
    MERGE BRANCH etl

Data Version Control
Time travel on steroids
– Reproducibility
– Compliance
– Historical comparisons

    USE BRANCH 'main'
    SELECT * FROM t1    // main implicit
    SELECT * FROM t1@etl
    SELECT * FROM t1 AT '2020-10-26'
    SELECT * FROM t1@etl AT '2020-10-26'
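
The four SELECT forms above combine a branch reference with an optional point in time. The following Python sketch shows one way such a lookup could resolve. It is an in-memory toy, not Nessie's implementation: the `Catalog` and `Commit` classes and their methods are hypothetical, and `AT` is assumed to mean "the latest commit at or before the timestamp".

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(frozen=True)
class Commit:
    timestamp: datetime
    tables: dict  # table name -> snapshot of its rows (a tuple)


@dataclass
class Catalog:
    # branch name -> list of commits, oldest first
    branches: dict = field(default_factory=lambda: {"main": []})

    def create_branch(self, name, source="main"):
        # A new branch shares the source's existing commits; nothing is rewritten.
        self.branches[name] = list(self.branches[source])

    def commit(self, branch, when, table, rows):
        history = self.branches[branch]
        tables = dict(history[-1].tables) if history else {}
        tables[table] = tuple(rows)
        history.append(Commit(when, tables))

    def read(self, table, branch="main", at=None):
        # Latest commit on the branch, optionally no later than `at`.
        visible = [c for c in self.branches[branch]
                   if at is None or c.timestamp <= at]
        if not visible:
            raise LookupError(f"no commit on {branch!r} at or before {at}")
        return visible[-1].tables[table]


catalog = Catalog()
catalog.commit("main", datetime(2020, 10, 25), "t1", [1, 2])
catalog.create_branch("etl")
catalog.commit("etl", datetime(2020, 10, 26), "t1", [1, 2, 3])
catalog.commit("etl", datetime(2020, 10, 27), "t1", [1, 2, 3, 4])

on_main = catalog.read("t1")                      # SELECT * FROM t1
on_etl = catalog.read("t1", branch="etl")         # SELECT * FROM t1@etl
main_at = catalog.read("t1", at=datetime(2020, 10, 26))
etl_at = catalog.read("t1", branch="etl", at=datetime(2020, 10, 26))
```

Work committed on `etl` stays invisible to readers of `main` until it is merged, and every earlier state of either branch remains addressable by timestamp.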

Real-World Transactions vs. Data Warehouse Transactions

Real-World Transactions
● Multiple sessions
● Multiple users
● Multiple engines/services
● Built for data engineers

    CREATE BRANCH etl
    [Kafka] Ingest data source 1
    [RDBMS] Ingest data source 2
    [Spark] Transformation step 1
    [Dremio] Transformation step 2
    [Spark] Insert into table
    [Dremio] Build reflections
    [Spark] Verify data
    USE BRANCH main
    MERGE BRANCH etl

Data Warehouse Transactions
● One session
● One user
● SQL-only
● Built for application developers

    BEGIN TRANSACTION etl;
    INSERT INTO t1 ...
    UPDATE t1 ...
    UPDATE t1 ...
    COMMIT;

Safe, real-world transactions

    CREATE BRANCH etl_1897914;
    USE BRANCH etl_1897914;

    MERGE INTO orders o USING orders_stage o_s
      ON o.order_id = o_s.order_id
    WHEN MATCHED THEN
      UPDATE ...
    WHEN NOT MATCHED THEN
      INSERT *

    MERGE INTO lineitem li USING li_stage li_s
      ON li.order_id = li_s.order_id
      AND li.line_number = li_s.line_number
    WHEN MATCHED THEN
      UPDATE ...
    WHEN NOT MATCHED THEN
      INSERT *

    SELECT SUM(CASE WHEN order_amount < 0 THEN 1 ELSE 0 END) FROM orders;
    SELECT SUM(CASE WHEN quantity < 0 THEN 1 ELSE 0 END) FROM lineitem;
    -- additional data quality checks by non-SQL tools

    -- if checks pass:
    REFRESH REFLECTION sales_dashboard_main;
    -- final quality checks
    USE BRANCH production;
    MERGE BRANCH etl_1897914;

    -- if checks don't pass:
    -- don't merge, alert
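
The control flow of this branch, check, and merge pattern can be sketched in a few lines of Python. This is a toy under stated assumptions: branches are plain dictionaries of tables, "merge" is modeled as a single pointer swap, and `upsert`, `count_negative`, and `run_etl` are hypothetical helpers rather than any engine's API.

```python
import copy


def count_negative(rows, column):
    """Counterpart of SELECT SUM(CASE WHEN col < 0 THEN 1 ELSE 0 END)."""
    return sum(1 for row in rows if row[column] < 0)


def upsert(target, staged, key):
    """Minimal MERGE INTO: update rows whose key matches, insert the rest."""
    by_key = {tuple(r[k] for k in key): r for r in target}
    for r in staged:
        by_key[tuple(r[k] for k in key)] = r
    return list(by_key.values())


def run_etl(branches, orders_stage, lineitem_stage):
    # CREATE BRANCH: work on an isolated copy of production.
    etl = copy.deepcopy(branches["production"])
    etl["orders"] = upsert(etl["orders"], orders_stage, key=("order_id",))
    etl["lineitem"] = upsert(etl["lineitem"], lineitem_stage,
                             key=("order_id", "line_number"))

    # Quality checks run against the branch, never against production.
    bad = (count_negative(etl["orders"], "order_amount")
           + count_negative(etl["lineitem"], "quantity"))
    if bad:
        return False  # don't merge, alert

    # MERGE BRANCH: both tables become visible in one step.
    branches["production"] = etl
    return True


branches = {"production": {
    "orders": [{"order_id": 1, "order_amount": 10}],
    "lineitem": [{"order_id": 1, "line_number": 1, "quantity": 2}],
}}
ok = run_etl(branches,
             [{"order_id": 2, "order_amount": 5}],
             [{"order_id": 2, "line_number": 1, "quantity": 1}])
rejected = run_etl(branches, [{"order_id": 3, "order_amount": -1}], [])
```

The second batch contains a negative amount, so it never reaches production, and consumers never observe `orders` updated without the matching `lineitem` rows.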

More on table formats

Hive table format

The de-facto standard
A table is an 'ls' of 1 or more directories

Pros
● Single, central answer to “what data is in this table” for the whole ecosystem
● Works with basically every engine since it's been the de-facto standard

Cons
● If everything in a directory is a table's contents, how do I update and delete data?
● If I need to add multiple files as a single operation, how do I make sure a consumer doesn't see only some of my additions?
● All of the directory listings needed for large tables take too long from <start listing> to <end listing>
  ○ Planning time
  ○ Consistency problems can occur - multiple calls to list contents of the set of partitions

How can we resolve these issues?
→ We need a new table format

Iceberg's goals
● ACID transactions on S3
● Table evolution
● Table correctness/consistency
● Faster query planning and execution
● Decouple partitioning from layout
● Accomplish all of these at scale

Old way
    dir → F, F, F
A table is an 'ls' of a directory

New way
    Manifest (list of files) → F, F, F
A table is a canonical list of files
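
The difference between the two definitions shows up when a writer is halfway through adding several files. The Python sketch below contrasts them on a local temporary directory. The `manifest.json` name and layout are hypothetical, and the atomic rename of a local filesystem stands in for whatever atomic pointer swap a real table format or catalog provides.

```python
import json
import os
import tempfile


def read_table_old(table_dir):
    """Old way: whatever a directory listing returns is the table."""
    return sorted(f for f in os.listdir(table_dir) if f.endswith(".parquet"))


def read_table_new(table_dir):
    """New way: only files named in the current manifest belong to the table."""
    with open(os.path.join(table_dir, "manifest.json")) as fh:
        return json.load(fh)["files"]


def commit_manifest(table_dir, files):
    """Publish a new canonical file list with one atomic rename."""
    tmp = os.path.join(table_dir, "manifest.json.tmp")
    with open(tmp, "w") as fh:
        json.dump({"files": sorted(files)}, fh)
    os.replace(tmp, os.path.join(table_dir, "manifest.json"))


def touch(table_dir, name):
    open(os.path.join(table_dir, name), "w").close()


with tempfile.TemporaryDirectory() as table_dir:
    touch(table_dir, "a.parquet")
    commit_manifest(table_dir, ["a.parquet"])

    # A writer adds two files as one logical operation; only one exists so far.
    touch(table_dir, "b.parquet")
    old_mid = read_table_old(table_dir)  # already sees the partial write
    new_mid = read_table_new(table_dir)  # still the last committed list

    touch(table_dir, "c.parquet")
    commit_manifest(table_dir, ["a.parquet", "b.parquet", "c.parquet"])
    new_after = read_table_new(table_dir)
```

A reader of the manifest sees either the old file list or the new one, never a mixture, and it does not need to list directories to plan a query.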

High-level comparison to Databricks Delta Lake

SIMILARITIES
● Extensive integration in Spark
● Very scalable
● Metadata stored alongside data in the data lake
● Immediate data freshness
● Fast planning, data pruning via stats

DIFFERENCES

Table format
  Iceberg: Hierarchical with pointer-files; snapshot creation on write
  Delta Lake: Non-hierarchical; sequentially log each change on write, inline snapshot creation every 10 changes

Who can read and write
  Iceberg: Anyone can read, anyone can write
  Delta Lake: Anyone can read, Databricks chooses who can write (Spark and probably SQL analytics)

Write behavior
  Iceberg: V1: copy on write; V2: copy on write + merge on read
  Delta Lake: Copy on write

File formats
  Iceberg: Parquet, ORC, and Avro, and is intentionally extensible
  Delta Lake: Parquet only

OSS & governance
  Iceberg: Fully OS, Apache Governance
  Delta Lake: Mostly OS, but key capab
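
The "Write behavior" row contrasts two ways of applying a delete. The Python sketch below illustrates the general idea only: data files are modeled as lists of rows in a dictionary, the `.v2` suffix and the position-based delete set are hypothetical, and neither function reflects the actual on-disk layout of Iceberg or Delta Lake.

```python
def delete_copy_on_write(data_files, predicate):
    """Rewrite every file that holds a matching row; readers need no extra work."""
    rewritten = {}
    for name, rows in data_files.items():
        if any(predicate(r) for r in rows):
            rewritten[name + ".v2"] = [r for r in rows if not predicate(r)]
        else:
            rewritten[name] = rows  # untouched files are reused as-is
    return rewritten


def delete_merge_on_read(data_files, predicate):
    """Leave data files untouched; record deleted (file, position) pairs instead."""
    return {(name, i)
            for name, rows in data_files.items()
            for i, r in enumerate(rows)
            if predicate(r)}


def read_all(data_files):
    return [r for rows in data_files.values() for r in rows]


def read_merge_on_read(data_files, deletes):
    """Readers filter out deleted positions while scanning."""
    return [r
            for name, rows in data_files.items()
            for i, r in enumerate(rows)
            if (name, i) not in deletes]


files = {
    "f1": [{"id": 1, "state": "CA"}, {"id": 2, "state": "NY"}],
    "f2": [{"id": 3, "state": "TX"}],
}


def is_ca(row):
    return row["state"] == "CA"


cow_files = delete_copy_on_write(files, is_ca)
deletes = delete_merge_on_read(files, is_ca)
cow_result = read_all(cow_files)
mor_result = read_merge_on_read(files, deletes)
```

Both paths return the same rows. Copy on write pays at write time by rewriting `f1`, while merge on read writes only a small delete record and shifts the cost to readers.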
