An Introduction to IR, 11th Course
Chapter 12: Language Models for Information Retrieval

Languages, grammars, and language models
- A "(formal) language" is the set of strings (called sentences or clauses) generated by a grammar G = (V_T, V_N, S, R), where:
  V_T: terminal symbols (words, tokens)
  V_N: non-terminal symbols (syntax nodes)
  S ∈ V_N: start (top-most) symbol
  R = {α → β | α, β ∈ (V_T ∪ V_N)*}: the rule set
- A "(probabilistic) language model" M_L further gives each sentence σ of a language L a probability:
  M_L = {P(σ) | σ ∈ L}
- Note: for any string α ∈ (V_T ∪ V_N)+, P(α) can be computed from the probabilities of the terminals t ∈ V_T and the non-terminals A ∈ V_N.

Language models for IR
- The general idea: given a query q, rank the documents d_i by P(d_i | q). The rest is how to estimate P(d | q).
- d_max = argmax_d P(d | q) = argmax_d P(q | d) P(d)
- P(d): prior probability of doc d, e.g. PageRank(d) multiplied by f(ClickRate).

Standard Probabilistic IR
[Diagram: an information need is expressed as a query, which is matched against documents d1, d2, ..., dn in the collection.]

IR based on Language Model (LM)
[Diagram: each document d1, d2, ..., dn in the collection is viewed as generating the query that expresses the information need.]

- A common search heuristic is to use words that you expect to find in matching documents as your query. Why, I saw Sergey Brin advocating that strategy on late night TV one night in my hotel room, so it must be good!
- The LM approach directly exploits that idea!
Formal Language (Model)
- Traditional generative model: generates strings.
- Finite state machines or regular grammars, etc.
- Example (a machine that loops on "I wish"):
  I wish
  I wish I wish
  I wish I wish I wish
  I wish I wish I wish I wish
  ...
  but not: *wish I wish

Stochastic Language Models
- Model the probability of generating strings in the language (commonly all strings over an alphabet Σ).
- Model M: the 0.2, a 0.1, man 0.01, woman 0.01, said 0.03, likes 0.02
- s = "the man likes the woman": multiply the per-word probabilities
  P(s | M) = 0.2 × 0.01 × 0.02 × 0.2 × 0.01 = 0.00000008

Stochastic Language Models
- Model the probability of generating any string.
- Model M1: the 0.2, class 0.01, sayst 0.0001, pleaseth 0.0001, yon 0.0001, maiden 0.0005, woman 0.01
- Model M2: the 0.2, class 0.0001, sayst 0.03, pleaseth 0.02, yon 0.1, maiden 0.01, woman 0.0001
- s = "the class pleaseth yon maiden":
  P(s | M1) = 0.2 × 0.01 × 0.0001 × 0.0001 × 0.0005
  P(s | M2) = 0.2 × 0.0001 × 0.02 × 0.1 × 0.01
  so P(s | M2) > P(s | M1)
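The two-model comparison is easy to reproduce; here is a minimal Python sketch (not from the slides) that scores the example string under each unigram model, with unseen words getting probability zero:

```python
# Unigram tables copied from the slide above.
m1 = {"the": 0.2, "class": 0.01, "sayst": 0.0001, "pleaseth": 0.0001,
      "yon": 0.0001, "maiden": 0.0005, "woman": 0.01}
m2 = {"the": 0.2, "class": 0.0001, "sayst": 0.03, "pleaseth": 0.02,
      "yon": 0.1, "maiden": 0.01, "woman": 0.0001}

def p_string(s, model):
    """Unigram probability of s: the product of per-word probabilities."""
    p = 1.0
    for w in s.split():
        p *= model.get(w, 0.0)  # unseen word -> probability 0
    return p

s = "the class pleaseth yon maiden"
print(p_string(s, m1))  # ~1e-14
print(p_string(s, m2))  # ~4e-10, so P(s|M2) > P(s|M1)
```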
Stochastic Language Models
- A statistical model for generating text: a probability distribution over strings in a given language.
- For a model M and a string w1 w2 w3 w4, the chain rule gives:
  P(w1 w2 w3 w4 | M) = P(w1 | M) × P(w2 | M, w1) × P(w3 | M, w1 w2) × P(w4 | M, w1 w2 w3)

Unigram and higher-order models
- Unigram Language Models: P(w1) P(w2) P(w3) P(w4). Easy. Effective!
- Bigram (generally, n-gram) Language Models: P(w1) P(w2 | w1) P(w3 | w2) P(w4 | w3)
- Other Language Models: grammar-based models (PCFGs), etc. Probably not the first thing to try in IR.
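To make the two factorizations concrete, here is a small illustrative sketch (the probability tables are invented, not from the slides) scoring the same string under a unigram and a bigram model:

```python
# Invented probability tables for the "I wish" toy language.
unigram = {"I": 0.1, "wish": 0.05}
bigram = {("I", "wish"): 0.5, ("wish", "I"): 0.3}

def p_unigram(words):
    # P(w1) P(w2) ... P(wn)
    p = 1.0
    for w in words:
        p *= unigram.get(w, 0.0)
    return p

def p_bigram(words):
    # P(w1) P(w2|w1) ... P(wn|w_{n-1})
    p = unigram.get(words[0], 0.0)
    for prev, w in zip(words, words[1:]):
        p *= bigram.get((prev, w), 0.0)
    return p

s = "I wish I wish".split()
print(p_unigram(s))  # 0.1 * 0.05 * 0.1 * 0.05 = 2.5e-05
print(p_bigram(s))   # 0.1 * 0.5  * 0.3 * 0.5  = 7.5e-03
```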
Using Language Models in IR
- Treat each document as the basis for a model (e.g., unigram sufficient statistics).
- Rank document d based on P(d | q):
  P(d | q) = P(q | d) × P(d) / P(q)
- P(q) is the same for all documents, so ignore it.
- P(d), the prior, is often treated as the same for all d, but we could use criteria like authority, length, genre.
- P(q | d) is the probability of q given d's model.
- Very general formal approach.

The fundamental problem of LMs
- Usually we don't know the model M, but we have a sample of text representative of that model.
- So: estimate a language model from the sample, then compute the observation probability P(s | M(sample)).

Language Models for IR
- Language modeling approaches attempt to model the query generation process.
- Documents are ranked by the probability that a query would be observed as a random sample from the respective document model.
- Multinomial approach.

Retrieval based on probabilistic LM
- Treat the generation of queries as a random process.
- Approach:
  1. Infer a language model for each document.
  2. Estimate the probability of generating the query according to each of these models.
  3. Rank the documents according to these probabilities.
- Usually a unigram estimate of words is used; there is some work on bigrams, paralleling van Rijsbergen.

Retrieval based on probabilistic LM
- Intuition: users have a reasonable idea of terms that are likely to occur in documents of interest, and will choose query terms that distinguish these documents from others in the collection.
- Collection statistics are integral parts of the language model, not used heuristically as in many other approaches.
- In theory. In practice, there's usually some wiggle room for empirically set parameters.

Query generation probability (1)
- Ranking formula: P(q, d) = P(d) P(q | d)
- The probability of producing the query given the language model of document d, using MLE, is:
  P(q | M_d) = ∏_{t ∈ q} P_mle(t | M_d) = ∏_{t ∈ q} tf_{t,d} / dl_d
  where M_d is the language model of document d, tf_{t,d} is the raw tf of term t in document d, and dl_d is the total number of tokens in document d.
- Unigram assumption: given a particular language model, the query terms occur independently.

Insufficient data
- Zero probability: P(t | M_d) = 0.
  We may not wish to assign a probability of zero to a document that is missing one or more of the query terms: that gives conjunction semantics.
- General approach: a non-occurring term is possible, but no more likely than would be expected by chance in the collection.
  If tf_{t,d} = 0, use P(t | M_d) = cf_t / cs,
  where cf_t is the raw count of term t in the collection and cs is the raw collection size (total number of tokens in the collection).
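A minimal Python sketch (helper names are assumptions, not from the slides) of the MLE query likelihood; note how a single missing query term zeroes the whole score, i.e. the conjunction semantics just described:

```python
from collections import Counter

def mle_query_likelihood(query_terms, doc_tokens):
    """P(q|M_d) = product over query terms of tf(t,d) / dl_d (MLE, unsmoothed)."""
    tf = Counter(doc_tokens)
    dl = len(doc_tokens)
    p = 1.0
    for t in query_terms:
        p *= tf[t] / dl  # 0 if t never occurs in d: conjunction semantics
    return p

d = "xerox reports a profit but revenue is down".split()
print(mle_query_likelihood(["revenue", "down"], d))  # (1/8)*(1/8) = 0.015625
print(mle_query_likelihood(["revenue", "loss"], d))  # 0.0, "loss" is missing
```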
Insufficient data
- Zero probabilities spell disaster: we need to smooth probabilities.
  - Discount nonzero probabilities.
  - Give some probability mass to unseen things.
- There's a wide space of approaches to smoothing probability distributions to deal with this problem, such as adding 1 (or a small ε) to counts, Dirichlet priors, discounting, and interpolation. (See FSNLP ch. 6 or CS224N if you want more.)
- A simple idea that works well in practice is to use a mixture between the document multinomial and the collection multinomial distribution.

Mixture model
- P(w | d) = λ P_mle(w | M_d) + (1 − λ) P_mle(w | M_c)
- Mixes the probability from the document with the general collection frequency of the word.
- Correctly setting λ is very important:
  - A high value of λ makes the search "conjunctive-like", suitable for short queries.
  - A low value is more suitable for long queries.
- Can tune λ to optimize performance, perhaps making it dependent on document size (cf. Dirichlet prior or Witten-Bell smoothing).

Basic mixture model summary
- General formulation of the LM for IR:
  P(q, d) = P(d) ∏_{t ∈ q} ((1 − λ) P(t) + λ P(t | M_d))
  where P(t) = P_mle(t | M_collection) is the general language model and P(t | M_d) is the individual-document model.
- The user has a document in mind, and generates the query from this document.
- The equation represents the probability that the document that the user had in mind was in fact this one.
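A minimal sketch of this mixture (function and variable names are assumptions) with λ = 1/2; thanks to the collection component, a query term absent from the document no longer zeroes the score:

```python
from collections import Counter

def mixture_likelihood(query_terms, doc_tokens, coll_tf, coll_size, lam=0.5):
    """P(w|d) = lam * P_mle(w|M_d) + (1 - lam) * P_mle(w|M_c), multiplied over q."""
    tf = Counter(doc_tokens)
    dl = len(doc_tokens)
    p = 1.0
    for t in query_terms:
        p_doc = tf[t] / dl
        p_coll = coll_tf.get(t, 0) / coll_size
        p *= lam * p_doc + (1 - lam) * p_coll
    return p

doc = "the man likes the woman".split()
coll = Counter(doc + "the man said".split())
print(mixture_likelihood(["man", "said"], doc, coll, sum(coll.values())))
# "said" is not in doc, but its collection probability keeps the score nonzero
```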
Example
- Document collection (2 documents):
  d1: Xerox reports a profit but revenue is down
  d2: Lucent narrows quarter loss but revenue decreases further
- Model: MLE unigram from documents; λ = 1/2.
- Query: revenue down
  P(Q | d1) = [(1/8 + 2/16)/2] × [(1/8 + 1/16)/2] = 1/8 × 3/32 = 3/256
  P(Q | d2) = [(1/8 + 2/16)/2] × [(0 + 1/16)/2] = 1/8 × 1/32 = 1/256
- Ranking: d1 > d2
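The example's arithmetic can be checked exactly with rational numbers; a small sketch (not from the slides), using the slide's counts of 8 tokens per document and 16 in the collection:

```python
from fractions import Fraction as F

def p_term(tf_d, dl, cf, cs):
    # (P_mle(t|M_d) + P_mle(t|M_c)) / 2, i.e. the mixture with lambda = 1/2
    return (F(tf_d, dl) + F(cf, cs)) / 2

# "revenue": tf 1 in each doc, 2 in the collection; "down": 1 in d1, 0 in d2
p_q_d1 = p_term(1, 8, 2, 16) * p_term(1, 8, 1, 16)
p_q_d2 = p_term(1, 8, 2, 16) * p_term(0, 8, 1, 16)
print(p_q_d1, p_q_d2)  # 3/256 1/256 -> d1 ranks above d2
```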
Ponte and Croft Experiments
- Data:
  - TREC topics 202-250 on TREC disks 2 and 3: natural language queries consisting of one sentence each.
  - TREC topics 51-100 on TREC disk 3, using the concept fields: lists of good terms.
- Example topic:
  Number: 054
  Domain: International Economics
  Topic: Satellite Launch Contracts
  Description:
  Concept(s): Contract, agreement; Launch vehicle, rocket, payload, satellite; Launch services, ...

[Tables: precision/recall results for topics 202-250 and for topics 51-100.]

LM vs. Prob. Model for IR
- The main difference is whether "relevance" figures explicitly in the model or not: the LM approach attempts to do away with modeling relevance.
- The LM approach assumes that documents and expressions of information problems are of the same type.
- Computationally tractable, intuitively appealing.

LM vs. Prob. Model for IR
- Problems of the basic LM approach:
  - The assumption of equivalence between document and information problem representation is unrealistic.
  - Very simple models of language.
  - Relevance feedback is difficult to integrate, as are user preferences and other general issues of relevance.
  - Can't easily accommodate phrases, passages, Boolean operators.
- Current extensions focus on putting relevance back into the model, etc.

Extension: 3-level model
- Whole-collection model (M_c)
- Specific-topic model; relevant-documents model (M_t)
- Individual-document model (M_d)
- Relevance hypothesis:
  - A request (query; topic) is generated from a specific-topic model {M_c, M_t}.
  - Iff a document is relevant to the topic, the same model will apply to the document: it will replace part of the individual-document model in explaining the document.
  - The probability of relevance of a document is the probability that this model explains part of the document, i.e. the probability that the {M_c, M_t, M_d} combination is better than the {M_c, M_d} combination.

3-level model
[Diagram: the information need gives rise to a query; the documents d1, d2, ..., dn of the collection are generated from the collection, topic, and individual-document models.]

Alternative Models of Text Generation
[Diagram: a searcher's query model generates the query; a writer's doc model generates the doc. Is this the same model?]

Retrieval Using Language Models
[Diagram: retrieval can relate the two models in three ways: (1) query likelihood, (2) document likelihood, (3) model comparison.]

Query Likelihood
- P(Q | D_m).
- The major issue is estimating the document model, i.e. smoothing techniques instead of tf.idf weights.
- Good retrieval results, e.g. UMass, BBN, Twente, CMU.
- Problems dealing with relevance feedback, query expansion, structured queries.

Document Likelihood
- Rank by the likelihood ratio P(D | R) / P(D | NR); treat it as a generation problem.
  - P(w | R) is estimated by P(w | Q_m), where Q_m is the query or relevance model.
  - P(w | NR) is estimated by collection probabilities P(w).
- The issue is estimation of the query model:
  - Treat the query as generated by a mixture of topic and background.
  - Estimate a relevance model from related documents (query expansion).
  - Relevance feedback is easily incorporated.
- Good retrieval results, e.g. UMass at SIGIR 01.
- Inconsistent with heterogeneous document collections.
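A hypothetical sketch of this ranking criterion (the names and the smoothing floor are assumptions): score a document by the log likelihood ratio of a relevance model against collection statistics:

```python
import math
from collections import Counter

def log_likelihood_ratio(doc_tokens, p_rel, p_coll, floor=1e-9):
    """log[P(D|R)/P(D|NR)]: P(w|R) from the query/relevance model,
    P(w|NR) from collection probabilities."""
    score = 0.0
    for w, n in Counter(doc_tokens).items():
        score += n * math.log(p_rel.get(w, floor) / p_coll.get(w, floor))
    return score
```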
Model Comparison
- Estimate query and document models and compare them.
- A suitable measure is the KL divergence D(Q_m || D_m).
- Equivalent to the query-likelihood approach if a simple empirical distribution is used for the query model.
- A more general risk minimization framework has been proposed (Zhai and Lafferty 2001).
- Better results than the query-likelihood or document-likelihood approaches.
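For concreteness, a minimal sketch of the comparison (the model numbers are illustrative, not from the slides); documents whose models diverge less from the query model rank higher:

```python
import math

def kl_divergence(qm, dm):
    """D(Qm || Dm) = sum_w Qm(w) * log(Qm(w) / Dm(w)).
    Dm must give nonzero (smoothed) probability to every query-model word."""
    return sum(p * math.log(p / dm[w]) for w, p in qm.items() if p > 0)

qm = {"revenue": 0.5, "down": 0.5}                        # empirical query model
dm = {"revenue": 0.125, "down": 0.09375, "the": 0.78125}  # smoothed doc model
print(kl_divergence(qm, dm))  # rank documents by ascending divergence
```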
Two-stage smoothing: Another Reason for Smoothing
- Query = "the algorithms for data mining"
  d1: 0.04 0.001 0.02 0.002 0.003
  d2: 0.02 0.001 0.01 0.003 0.004
- p("algorithms" | d1) = p("algorithms" | d2)
  p("data" | d1) < p("data" | d2)
  p("mining" | d1) < p("mining" | d2)
  and yet p(q | d1) > p(q | d2)!
- We should make p("the") and p("for") less different for all docs.

Two-stage Smoothing
- P(w | d) = (1 − λ) · [c(w, d) + μ · p(w | C)] / (|d| + μ) + λ · p(w | U)
- Stage 1 (μ): explain unseen words; Dirichlet prior (Bayesian).
- Stage 2 (λ): explain noise in the query; a 2-component mixture with a background/user model p(w | U).
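A minimal sketch of the two-stage formula (the parameter values, and reusing the collection model as the user model, are assumptions for illustration):

```python
from collections import Counter

def two_stage_p(w, doc_tokens, p_coll, p_user, mu=1000, lam=0.1):
    """P(w|d) = (1-lam) * (c(w,d) + mu*p(w|C)) / (|d| + mu) + lam * p(w|U).
    Stage 1 (mu): Dirichlet prior explains unseen words.
    Stage 2 (lam): mixture with a user/background model explains query noise."""
    c = Counter(doc_tokens)
    dirichlet = (c[w] + mu * p_coll(w)) / (len(doc_tokens) + mu)
    return (1 - lam) * dirichlet + lam * p_user(w)

p_c = lambda w: {"the": 0.2, "data": 0.01, "mining": 0.005}.get(w, 1e-6)
print(two_stage_p("data", "mining the data".split(), p_c, p_c))
```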
How can one do relevance feedback in the language modeling approach?
- Introduce a query model and treat feedback as query model updating:
  - Retrieval function: query-likelihood ⇒ KL-divergence.
  - Feedback: expansion-based ⇒ model-based.

Expansion-based vs. Model-based
[Diagram: in expansion-based feedback, the feedback docs modify the query itself, which is scored against the doc model by query likelihood; in model-based feedback, the feedback docs modify the query model, which is compared with the doc model by KL-divergence.]

Feedback as Model Interpolation
[Diagram: the query Q yields a query model; the feedback docs F = {d1, d2, ..., dn} drawn from the results yield a feedback model via a generative model; the two are interpolated:]
- θ_q' = (1 − α) θ_q + α θ_F
- α = 0: no feedback. α = 1: full feedback.
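A minimal sketch of the interpolation (the MLE feedback model is an assumed stand-in for the slide's generative model; all names are illustrative):

```python
from collections import Counter

def interpolate_query_model(theta_q, feedback_docs, alpha=0.5):
    """theta_q' = (1 - alpha) * theta_q + alpha * theta_F, with theta_F
    estimated here by simple MLE over the feedback documents' tokens."""
    tokens = [w for d in feedback_docs for w in d]
    theta_f = {w: c / len(tokens) for w, c in Counter(tokens).items()}
    vocab = set(theta_q) | set(theta_f)
    return {w: (1 - alpha) * theta_q.get(w, 0.0) + alpha * theta_f.get(w, 0.0)
            for w in vocab}

theta_q = {"revenue": 0.5, "down": 0.5}
fb = [["revenue", "falls"], ["profit", "down"]]
print(interpolate_query_model(theta_q, fb, alpha=0.2))
# alpha=0 leaves theta_q unchanged; alpha=1 returns the pure feedback model
```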
Translation model (Berger and Lafferty)
- Basic LMs do not address issues of synonymy, or any deviation in the expression of the information need from the language of the documents.
- A translation model lets you generate query words not in the document via "translation" to synonyms etc., or do cross-language IR, or multimedia IR.
  - Basic LM: P(q | M_d) = ∏_i P(q_i | M_d)
  - Translation: P(q | M_d) = ∏_i Σ_v P(q_i | v) P(v | M_d)
- Need to learn a translation model (using a dictionary or via statistical machine translation).
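A hypothetical sketch of the translation-model likelihood (the translation table is invented for illustration); the point is that a query word can be scored through document words it translates from:

```python
# Invented translation probabilities P(query_word | doc_word).
t = {("down", "down"): 0.8, ("down", "decreases"): 0.2}

def translation_likelihood(query_terms, p_doc):
    """P(q|M_d) = prod_i sum_v P(q_i|v) * P(v|M_d)."""
    p = 1.0
    for q in query_terms:
        p *= sum(prob * p_doc.get(v, 0.0)
                 for (qw, v), prob in t.items() if qw == q)
    return p

p_d2 = {"decreases": 0.125, "revenue": 0.125}  # part of d2's unigram model
print(translation_likelihood(["down"], p_d2))  # 0.2 * 0.125 = 0.025
# nonzero even though "down" never occurs in d2, via its synonym "decreases"
```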
Language models: pro & con
- A novel way of looking at the problem of text retrieval, based on probabilistic language modeling:
  - Conceptually simple and explanatory.
  - Formal mathematical model.
  - Natural use of collection statistics, not heuristics (almost).
- LMs provide effective retrieval, and can be improved to the extent that the following conditions can be met:
  - Our language models are accurate representations of the data.
  - Users have some sense of term distribution.*
  (* Or we get more sophisticated, with translation models.)

Comparison With Vector Space
- There's some relation to traditional tf.idf models:
  - (Unscaled) term frequency is directly in the model.
  - The probabilities do length normalization of term frequencies.
  - The effect of mixing with overall collection frequencies is a little like idf: terms rare in the general collection but common in some documents will have a greater influence on the ranking.