




LightLDA: Big Topic Models on Modest Computer Clusters
Jinhui Yuan, Fei Gao, Qirong Ho, Wei Dai, Jinliang Wei, Xun Zheng, Eric P. Xing, Tie-Yan Liu and Wei-Ying Ma
Microsoft Research Asia / Carnegie Mellon University

Topic Models
- A statistical method for understanding the latent topics in texts.
- Topics are distributions over words.
- Bag-of-Words -> Bag-of-Topics: compact features for matching.
- (figure courtesy of David Blei)

Latent Dirichlet Allocation (LDA)
- Blei, et al. 2003

Industrial-Scale LDA Models
- (figures: model RAM (TB) vs. #parameters, and #cores vs. #tokens, for Peacock and AliasLDA)

Big Data, Big Model
- 1B documents, 200B tokens, 10M words, 1M topics

System                  #Tokens   #Words   #Topics   #Cores   Note
AliasLDA (CMU/Google)   50B       5M       2,000     60,000   KDD/OSDI 2014
Peacock (Tencent)       5B        0.2M     100,000   3,000    ACM TIST 2014

Our Work: LightLDA
- Target: 1M topics; 1B docs; 10M words in vocabulary.
- Highly scalable system: develop a parallel machine learning platform to scale out computations using a modest cluster of machines.

Standard LDA Training and Inference
- Training maintains a word-topic table and a document-topic table.
- Per-token sampling complexity is proportional to the number of topics, O(K), which makes it hard to scale to a very large number of topics (see the sketch below).
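A minimal sketch (not the authors' code) of the standard collapsed-Gibbs per-token update, making the O(K) cost explicit: the full conditional over all K topics must be materialized before each draw. Function and variable names here are illustrative only.

```cpp
// One collapsed-Gibbs token update for standard LDA; illustrative only.
// The O(K) loop over all topics is what LightLDA's factorized sampler avoids.
#include <random>
#include <vector>

int SampleTokenTopic(int word, int doc, int old_topic,
                     std::vector<std::vector<int>>& word_topic,  // n_{w,k}
                     std::vector<int>& topic_total,              // n_k
                     std::vector<std::vector<int>>& doc_topic,   // n_{d,k}
                     double alpha, double beta, int vocab_size,
                     std::mt19937& rng) {
  const int K = static_cast<int>(topic_total.size());
  // Exclude the token's current assignment from the counts.
  --word_topic[word][old_topic]; --topic_total[old_topic]; --doc_topic[doc][old_topic];

  // Build the unnormalized full conditional over all K topics: O(K) work per token.
  std::vector<double> p(K);
  for (int k = 0; k < K; ++k)
    p[k] = (doc_topic[doc][k] + alpha) *
           (word_topic[word][k] + beta) / (topic_total[k] + vocab_size * beta);

  // Draw the new topic and restore the counts under it.
  int new_topic = std::discrete_distribution<int>(p.begin(), p.end())(rng);
  ++word_topic[word][new_topic]; ++topic_total[new_topic]; ++doc_topic[doc][new_topic];
  return new_topic;
}
```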
Reduce Sampling Complexity

Reusability of the Alias Table
- An alias table is built once per word and can be reused across all documents, so its O(K) construction cost is amortized to O(1) per sample.
- Only word-dependent terms of the sampling distribution can be reused this way; document-dependent terms change with every token, so their construction cost cannot be amortized.

SparseLDA and AliasLDA: Partial Amortization
- SparseLDA (Yao, et al. KDD 2009) and AliasLDA (Li, et al. KDD 2014) each make part of the per-token cost amortizable to O(1), but a non-amortizable term remains.

LightLDA: Full Amortization
- Multiplicative factorization of the sampling distribution enables full amortization of the time complexity: every factor becomes amortizable to O(1), as written out below.
- Leverage sparsity to further reduce space complexity.
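The factorization idea, written out as math for reference. This is my rendering of the scheme named on the slide, using standard collapsed-Gibbs notation; counts are assumed to exclude the current token.

```latex
% Full conditional for the topic of one token (word w in document d):
\[
  p(k \mid \text{rest}) \;\propto\;
  \underbrace{(n_{dk} + \alpha)}_{\text{document-dependent}}
  \cdot
  \underbrace{\frac{n_{wk} + \beta}{n_{k} + V\beta}}_{\text{word-dependent}}
\]
% Alternate Metropolis-Hastings proposals, one per factor, each drawable in O(1):
\[
  q_{\mathrm{doc}}(k) \propto n_{dk} + \alpha,
  \qquad
  q_{\mathrm{word}}(k) \propto \frac{n_{wk} + \beta}{n_{k} + V\beta}
  \quad \text{(per-word alias table, reused across documents)}
\]
% A move from topic s to topic t proposed by the word-proposal is accepted with probability
\[
  \min\!\left(1,\;
    \frac{(n_{dt} + \alpha)\,(n_{wt} + \beta)\,(n_{s} + V\beta)\; q_{\mathrm{word}}(s)}
         {(n_{ds} + \alpha)\,(n_{ws} + \beta)\,(n_{t} + V\beta)\; q_{\mathrm{word}}(t)}
  \right),
\]
% and symmetrically for the doc-proposal; each acceptance test is O(1).
```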
Experimental Results (Single-core)
- (figures: time per iteration, and likelihood vs. number of iterations)
- Almost no loss in per-iteration convergence quality, and much less time per iteration.

Experimental Results (Single-core)
- NYTimes dataset: 300K documents, 99M tokens.
- PubMed dataset: 8.2M documents, 738M tokens.
- LightLDA achieves better log-likelihood than SparseLDA and AliasLDA in much shorter time.
- With a single core, LightLDA takes 20 hours to train 10K topics from 1B tokens (PubMed); with a commodity machine of 20 cores, it finishes in 1.5 hours. This single-machine capability is equivalent to (if not beyond) a medium-size cluster running SparseLDA or AliasLDA.

Our Work: LightLDA
- Target: 1M topics; 1B docs; 10M words in vocabulary.
- Highly scalable system: develop a parallel machine learning platform to scale out computations using a modest cluster of machines.
System Architecture
- Each worker runs the LightLDA sampler with model scheduling; a hybrid data structure and dual buffering are used on every worker.
- The distributed shared model is held in a parameter server under an ASP/SSP consistency model.

Key ingredients:
- Petuum parameter-server-based data parallelism to handle big data.
- A new model scheduling scheme to handle big models.
- A hybrid data structure to trade off memory usage against access speed.
- Intensive system optimization (double buffering, pipelining, memory management, etc.).

Model Scheduling vs. Model Parallelism in Peacock

Model parallelism (Peacock):
- The model is cut into pieces and placed on different machines; data is sent to the model pieces.
- To finish training, intermediate results for every data sample have to be transferred between machines.
- Communication cost includes both the word-topic table (small) and the doc-topic table (large).

Model scheduling (LightLDA):
- The model is cut into slices, and model slices are sent to the data instead of sending data to the model.
- Model slices are pulled from the parameter server and trained in a round-robin fashion, analogous to block coordinate descent (BCD); data samples and intermediate results are stored locally.
- Lower communication cost, since the data is far larger than the model and the model becomes sparse during training (see the sketch below).
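A minimal sketch of how such a round-robin sweep over model slices could look on one worker. The parameter-server calls (PullSlice, PushUpdates) and the vocabulary sharding rule are assumptions for illustration, not the actual LightLDA interfaces.

```cpp
// Round-robin model-slice scheduling on one worker; interfaces are hypothetical.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Token { int32_t word; int32_t topic; };
// word id -> per-topic counts for the words owned by one slice
using ModelSlice = std::unordered_map<int32_t, std::vector<int32_t>>;

// Hypothetical hooks onto the parameter server and the sampler.
ModelSlice PullSlice(int slice_id);                        // fetch one slice of the word-topic table
void PushUpdates(int slice_id, const ModelSlice& delta);   // send the accumulated count deltas back
void SampleToken(Token& t, const ModelSlice& slice, ModelSlice& delta);

void TrainOneBlock(std::vector<std::vector<Token>>& docs, int num_slices) {
  for (int s = 0; s < num_slices; ++s) {        // round-robin over model slices
    ModelSlice slice = PullSlice(s);            // only ~1/num_slices of the model is resident
    ModelSlice delta;                           // local updates, pushed once per slice
    for (auto& doc : docs)
      for (auto& tok : doc)
        if (tok.word % num_slices == s)         // assumed sharding rule: vocabulary split by word id
          SampleToken(tok, slice, delta);
    PushUpdates(s, delta);                      // intermediate doc-topic state never leaves the worker
  }
}
```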
Hybrid Data Structure
- The model is simply too big (10 trillion parameters).
- Naive dense storage (an array of counts per word) requires a prohibitive amount of memory; pure sparse storage (a hash map per word) hurts sampling efficiency.
- Leverage the power-law distribution of words: use dense arrays for frequent words to achieve high throughput, and sparse hash tables for long-tail words to reduce memory usage (by roughly 100x).
- An array serves as the index over rows, e.g.:

Row   Tag      Size   Ptr
0     dense    -      ->
1     sparse   20     ->
2     sparse   10     ->
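A minimal sketch of what such a hybrid row store could look like; the type names and the open-addressing details are assumptions for illustration, not the actual LightLDA layout.

```cpp
// Hybrid word-topic rows: dense arrays for head words, small open-addressing hash
// tables for tail words, with an index array recording which representation each row uses.
#include <cstdint>
#include <utility>
#include <vector>

struct DenseRow  { std::vector<int32_t> counts; };                     // counts[k] = n_{w,k}
struct SparseRow { std::vector<std::pair<int32_t, int32_t>> slots; };  // (topic, count), topic = -1 if empty

struct RowIndex { uint8_t tag; int32_t offset; };                      // tag: 0 = dense, 1 = sparse

struct HybridWordTopicTable {
  std::vector<RowIndex>  index;        // one entry per word in the vocabulary
  std::vector<DenseRow>  dense_pool;   // frequent (head) words
  std::vector<SparseRow> sparse_pool;  // long-tail words

  int32_t Get(int32_t word, int32_t topic) const {
    const RowIndex& r = index[word];
    if (r.tag == 0) return dense_pool[r.offset].counts[topic];         // O(1) array lookup
    const auto& slots = sparse_pool[r.offset].slots;                   // linear probing in a tiny table
    const size_t m = slots.size();
    for (size_t i = 0, h = static_cast<size_t>(topic) % m; i < m; ++i) {
      const auto& s = slots[(h + i) % m];
      if (s.first == topic) return s.second;
      if (s.first == -1) return 0;                                     // empty slot: topic not present
    }
    return 0;
  }
};
```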
Experimental Settings
- Computer cluster: 24 machines (20 cores and 256GB RAM per machine).
- Network connection: 1Gbps between machines.
- Parameter server deployment: each machine physically hosts both parameter server and workers (16 cores for workers, 4 cores for the parameter server).
- Bing web data: 1.2 billion documents, 200 billion tokens.
Experimental Results
- System throughput w.r.t. multi-core and multi-node scaling (figures).
- 20M words, 1M topics, 20T parameters.
- Nearly linear speed-up is achieved; after the first 10 iterations, the communication cost is almost entirely masked by computation (further improvable with a higher-speed network).

Conclusions
- LightLDA is able to learn super large topic models on modest clusters.
- With 24 commodity machines, we trained one million topics (20 trillion parameters) from Bing web data (200 billion tokens) in just two days.
- Machine learning insights vs. brute-force parallelization: compared to other systems, we use far fewer cores to learn many more topics from much larger data!
- (figures: model size vs. RAM capacity, and data size vs. CPU cores, for Peacock (Tencent), AliasLDA (CMU/Google) and LightLDA)

Future Work
- Beyond topic models!
- Multiplicative-factorization-based fast sampling may have general implications for many graphical model problems.
- The model scheduling scheme and other system innovations may be used to handle other large machine learning tasks as well.
- We are in the process of open-sourcing LightLDA; stay tuned if you want to use it to train your own big topic models!
Thanks!

(Backup slides)

Alias Method: Constructing the Alias Table
- Construction turns a non-uniform distribution over K outcomes into a uniform choice over K bins, with O(K) complexity in time and space (Walker, 1977).
- Reuse for future samplings: each draw is a uniform sample from the alias table, O(1) time.
- The O(K) initialization time is amortized by reusing the table across many draws.
- Construction: separate the bins into two sets, L = {C, G} for bins below the average probability and H = {A, T} for bins above the average; repeatedly top up a low bin with mass from a high bin and record that high bin as the low bin's alias.

Looking Up the Alias Table
- Pick one of the K bins uniformly at random, then return either the bin's own outcome or its alias, according to the bin's stored threshold.
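A minimal, self-contained sketch of Walker's alias method (the textbook algorithm, not taken from the LightLDA code): O(K) construction, O(1) per sample afterwards.

```cpp
#include <random>
#include <vector>

struct AliasTable {
  std::vector<double> prob;   // probability of keeping bin i itself
  std::vector<int>    alias;  // donor bin used when bin i is not kept

  explicit AliasTable(const std::vector<double>& w) {
    const int K = static_cast<int>(w.size());
    prob.assign(K, 0.0); alias.assign(K, 0);
    double sum = 0.0; for (double x : w) sum += x;
    std::vector<double> scaled(K);
    for (int i = 0; i < K; ++i) scaled[i] = w[i] * K / sum;   // average bin mass becomes 1
    std::vector<int> low, high;
    for (int i = 0; i < K; ++i) (scaled[i] < 1.0 ? low : high).push_back(i);
    while (!low.empty() && !high.empty()) {
      int l = low.back();  low.pop_back();
      int h = high.back(); high.pop_back();
      prob[l] = scaled[l]; alias[l] = h;          // top up the low bin with mass from the high bin
      scaled[h] -= (1.0 - scaled[l]);
      (scaled[h] < 1.0 ? low : high).push_back(h);
    }
    for (int i : low)  prob[i] = 1.0;             // leftovers are exactly full (up to rounding)
    for (int i : high) prob[i] = 1.0;
  }

  int Sample(std::mt19937& rng) const {           // O(1) per draw
    std::uniform_int_distribution<int> bin(0, static_cast<int>(prob.size()) - 1);
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    int i = bin(rng);
    return coin(rng) < prob[i] ? i : alias[i];
  }
};
```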
System Optimization

Dual Model and Cross-thread Aggregation
- Each sampler thread works on its own data partition against an immutable model copy in RAM, so reads need no locking.
- Updates from the sampler threads are collected by a local aggregator thread into a mutable model copy in RAM, which is synchronized with the parameter server over the network; the two copies are then switched.
- Benefits: removes multi-thread contention, reduces communication cost, and overlaps computation with communication (the next model slice is pre-fetched).
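A minimal sketch of the dual-model pattern described above; the types and the aggregation policy are assumptions for illustration, not the actual LightLDA implementation.

```cpp
// Dual model with per-thread update buffers: sampler threads read an immutable copy
// lock-free and record deltas locally; one aggregator thread merges the deltas into a
// mutable copy that is synced with the parameter server before the two copies switch roles.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Update { int32_t word; int32_t topic; int32_t delta; };
using Model = std::unordered_map<int64_t, int32_t>;      // key = (word, topic), value = count

struct DualModel {
  const Model* read_model;                               // immutable copy used by all samplers
  Model write_model;                                     // mutable copy, touched only by the aggregator
  std::vector<std::vector<Update>> thread_updates;       // one buffer per sampler thread

  DualModel(const Model* m, int num_threads) : read_model(m), thread_updates(num_threads) {}

  // Called by sampler thread `tid`: no locks, no contention between samplers.
  void Record(int tid, int32_t word, int32_t topic, int32_t delta) {
    thread_updates[tid].push_back({word, topic, delta});
  }

  // Called by the local aggregator thread once per round.
  void AggregateAndSwitch() {
    for (auto& buf : thread_updates) {
      for (const Update& u : buf)
        write_model[(static_cast<int64_t>(u.word) << 32) | static_cast<uint32_t>(u.topic)] += u.delta;
      buf.clear();
    }
    // Push the aggregated deltas to the parameter server, pull fresh values into
    // write_model, then switch: write_model becomes the next round's read model (omitted).
  }
};
```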
Pipelining
- Size of data block: 32GB, holding 4B tokens; about 2K seconds for disk reading (at 150MB I/O bandwidth).
- Sampling throughput: 2M tokens/worker/sec, i.e. about 2K seconds to sample 4B tokens.
- Size of a model slice: 20GB, chosen so that its transfer can be masked by the 2K seconds of computation on a 1Gbps network.

Memory Layout
- Pre-allocated memory layout: increase efficiency by avoiding dynamic memory allocation.
- Per-machine budget (216GB in total):
  Server table: 30GB
  Data block of current iteration: 32GB
  Data block of next iteration: 32GB
  Model slice of current iteration: 36GB
  Model slice of next iteration: 36GB
  Alias table of current iteration: 36GB
  Delta arrays for appending updates from the samplers: 10GB
  Delta table for aggregating local updates of a particular model slice: 4GB
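Two quick sanity checks on these numbers (my arithmetic, not from the slides):

```latex
% Per-machine memory budget adds up to the stated 216GB, within the 256GB of RAM per machine:
\[
  30 + 32 + 32 + 36 + 36 + 36 + 10 + 4 \;=\; 216\ \text{GB} \;<\; 256\ \text{GB}
\]
% Pipelining: sampling one block takes about
\[
  \frac{4\times 10^{9}\ \text{tokens}}{2\times 10^{6}\ \text{tokens/s}} = 2000\ \text{s},
\]
% while shipping a 20GB model slice over a 1Gbps link takes about
\[
  \frac{20\ \text{GB} \times 8\ \text{bit/byte}}{1\ \text{Gbit/s}} \approx 160\ \text{s} \;\ll\; 2000\ \text{s},
\]
% so the network transfer of the next slice is easily hidden behind computation.
```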
Each Memory Quota is Delicately Allocated
- Storing a 1M-word x 1M-topic model with the hybrid data structure requires about 700GB of RAM, so each of the 24 machines needs 700/24, roughly 29GB, for its share of the server table.
- Each data block contains 4B tokens and each token requires 8 bytes, so each data block is 32GB.
- The block size of 4B tokens is matched to the sampling throughput of 2M tokens/machine/sec, so one block takes about 2,000 seconds to sample.
- To sweep a data block we need 20 model slices, so each slice is sampled for about 2000/20 = 100 seconds.

Each Memory Quota is Delicately Allocated (cont.)
- In the initial stage the whole model is about 700GB and the maximum size of a model slice is 36GB, so we split the model into 20 slices.
- Close to convergence the model becomes very sparse and is much smaller than 700GB.
- We need to construct an alias table for each active model slice. Although the alias table is already sparsified, in the initial stage it has roughly the same size as a model slice.
- The number of updates cannot exceed twice the number of tokens sampled in a particular slice, so the delta table and delta arrays are rather small.

Details about the Hybrid Structure
- With the hybrid data structure we reduce memory usage by over 100x while maintaining efficient random access.
- An array serves as the index, e.g.:

Row   Tag      Size   Ptr
0     dense    -      ->
1     sparse   20     ->
2     sparse   10     ->
- A pre-allocated memory pool reduces the overhead caused by dynamic memory allocation.

Average Probes for a Given Load Factor
(values from http://www.augustana.ca/mohrj/courses/1999.fall/csc210/lecture_notes/hashing.html)

Load factor           0.10   0.50   0.75   0.80   0.90   0.99
Successful search
  Chaining            1.05   1.25   1.38   1.40   1.45   1.50
  Linear probing      1.06   1.50   2.50   3.00   5.50   50.5
  Quadratic probing   1.05   1.44   2.01   2.21   2.85   5.11
  Double hashing      1.05   1.39   1.85   2.01   2.56   4.6
Unsuccessful search
  Chaining            0.10   0.50   0.75   0.80   0.90   0.99
  Linear probing      1.12   2.50   8.50   13.0   50.5   5000
  Quadratic probing   1.12   2.19   4.64   5.81   11.4   103.6
  Double hashing      1.11   2.0    4.0    5.0    10     100
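For reference, the standard expected-probe formulas behind rows like these, with load factor alpha, S the expected probes for a successful search and U for an unsuccessful one (textbook results stated here by me, not taken from the slides):

```latex
\begin{align*}
  \text{Chaining:}        \quad & S = 1 + \tfrac{\alpha}{2},                             & U &= \alpha \\
  \text{Linear probing:}  \quad & S = \tfrac{1}{2}\left(1 + \tfrac{1}{1-\alpha}\right),  & U &= \tfrac{1}{2}\left(1 + \tfrac{1}{(1-\alpha)^{2}}\right) \\
  \text{Double hashing:}  \quad & S = \tfrac{1}{\alpha}\ln\tfrac{1}{1-\alpha},           & U &= \tfrac{1}{1-\alpha}
\end{align*}
```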
Out-of-Core and Fault Tolerance

Out-of-Core
- Binary serialization: efficient block-wise I/O with high disk read/write throughput, and no marshaling time.
- Each token and its topic are stored adjacently (CPU-cache friendly), as sketched below.
- Data size is easy to estimate: 4B tokens require 4B * 8 bytes = 32GB.
- Block layout: Doc 1: (w11, t11)(w12, t12) ... | Doc 2: (w21, t21)(w22, t22) ...
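A minimal sketch of this block format; the struct and function names are illustrative, not the actual LightLDA serializer.

```cpp
// Out-of-core block layout: each token and its current topic are adjacent, so a block
// can be read/written with one sequential I/O call and re-sampled with good cache locality.
#include <cstdint>
#include <cstdio>
#include <vector>

#pragma pack(push, 1)
struct TokenTopic { int32_t word; int32_t topic; };   // 8 bytes per token: 4B tokens -> 32GB
#pragma pack(pop)

// Write a whole data block with one sequential call; no per-record marshaling.
bool WriteBlock(const char* path, const std::vector<TokenTopic>& block) {
  std::FILE* f = std::fopen(path, "wb");
  if (!f) return false;
  size_t n = std::fwrite(block.data(), sizeof(TokenTopic), block.size(), f);
  std::fclose(f);
  return n == block.size();
}

// The caller sizes `block` in advance, which is easy because the token count is known.
bool ReadBlock(const char* path, std::vector<TokenTopic>& block) {
  std::FILE* f = std::fopen(path, "rb");
  if (!f) return false;
  size_t n = std::fread(block.data(), sizeof(TokenTopic), block.size(), f);
  std::fclose(f);
  return n == block.size();
}
```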
Zero-Cost Fault Tolerance by Warm-Start
- Data flow: data block on disk -> data block in RAM (before sampling) -> Gibbs sampler -> data block in RAM (after sampling) -> data block on disk (atomic file overwrite).
- We don't need any checkpoint: instead of dumping the model, we dump the (token, topic) data, which is done automatically by the out-of-core module.
- If anything fails, the system supports warm start.
- Atomic file overwrite avoids file corruption in case the system crashes while writing the file (see the sketch below).
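A minimal sketch of an atomic overwrite in the write-temp-then-rename style (my illustration of the pattern named on the slide; file names are hypothetical).

```cpp
// Write the sampled block to a temporary file, flush it, then rename() it over the old
// block in one step, so readers see either the old block or the new one, never a partial file.
#include <cstddef>
#include <cstdio>
#include <string>

bool AtomicOverwrite(const std::string& path, const void* data, size_t bytes) {
  const std::string tmp = path + ".tmp";          // hypothetical temporary name
  std::FILE* f = std::fopen(tmp.c_str(), "wb");
  if (!f) return false;
  bool ok = std::fwrite(data, 1, bytes, f) == bytes;
  ok = (std::fflush(f) == 0) && ok;               // make sure the bytes leave userspace
  std::fclose(f);
  if (!ok) { std::remove(tmp.c_str()); return false; }
  // On POSIX, rename() atomically replaces the target path.
  return std::rename(tmp.c_str(), path.c_str()) == 0;
}
```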
Warm Start vs. Cold Start
- Cold start: the topic of each token is initialized by a uniform random number generator; the system then counts the (word, topic) occurrences to get the initial model.
- Warm start: the topic of each token is persisted from the previous training run; the system counts the (word, topic) occurrences to get the initial model. This is equivalent to failover by checkpoint.

The Disadvantages of Dumping the Model
- If we dump the model in a checkpoint, we need an additional burn-in stage before continuing the training procedure based on the dumped model.