# An Introduction to IR, 6th Course

Chapters 6 and 7: term weighting, relevance ranking (scoring), the vector space model (VSM), and methods for fast ranking.

## This lecture

- Parametric and field searches
- Zones in documents
- Scoring documents: zone weighting
- Index support for scoring
- Term weighting
- Vector space scoring
- Efficient score computation (with huge indexes)

## Parametric search

Most documents have, in addition to text, some "meta-data" in fields, e.g.:

- Language = French
- Format = pdf
- Subject = Physics
- Date = Feb 2000

A parametric search interface allows the user to combine a full-text query with selections on these
field values, e.g., language, date range, etc.

## Parametric/field search

- In these examples, we select field values
- Values can be hierarchical, e.g., Geography: Continent > Country > State > City
- A paradigm for navigating through the document collection: e.g., "Aerospace companies in Brazil" can be arrived at first by selecting Geography then Line of Business, or vice versa
- Filter the docs in contention, and run text searches scoped to that subset

## Index support for parametric search

- Must be able to support queries of the form *find pdf documents that contain "stanford university"*: a field selection (on doc format) and a phrase query
- Field selection: use an inverted index of field values to docIDs
  - Organized by field name
  - Use compression etc. as before

## Parametric index support

- Optionally, provide richer search on field values, e.g. wildcards: *find books whose Author field contains s\*trup*
- Range search: find docs authored between
September and December
  - An inverted index doesn't work (as well); use techniques from database range search
  - See for instance /btree/ for a summary of B-trees
  - Use query optimization heuristics as before

## Field retrieval

- In some cases, we must retrieve field values, e.g., the ISBN numbers of books by s\*trup
- Maintain a "forward" index: for each doc, those field values that are "retrievable"
- An indexing control file specifies which fields are retrievable (and can be updated)
- We are storing primary data here, not just an index (as opposed to "inverted")

## Zones

- A zone is an identified region within a doc, e.g., Title, Abstract, Bibliography
- Generally culled from marked-up input or document metadata (e.g., PowerPoint)
- The contents of a zone are free text, not a "finite" vocabulary
- Indexes for each zone allow queries like *sorting in Title AND smith in Bibliography AND recur\* in Body*
- But not queries like "all papers whose authors cite themselves" (why?)

## Zone indexes: simple view

Separate indexes for Title, Author, Body, etc.

## So we have a database now?

Not really:

- Databases do lots of things we don't need: transactions, recovery (our index is not the system of record; if it breaks, simply reconstruct it from the original source)
- Indeed, we never have to store the text in a search engine, only indexes
- We're focusing on optimized indexes for text-oriented queries, not an SQL engine

## Scoring and ranking: "ranked Boolean retrieval"

### Scoring

- Thus far, our queries have all been Boolean: docs either match or not
- Good for expert users with a precise understanding of their needs and the corpus
- Applications can
consume 1000s of results
- Not good for (the majority of) users with a poor Boolean formulation of their needs
- Most users don't want to wade through 1000s of results; cf. the use of web search engines

### Scoring (cont.)

- We wish to return, in order, the documents most likely to be useful to the searcher
- How can we rank-order the docs in the corpus with respect to a query?
- Assign a score, say in [0, 1], to each doc on each query
- Begin with a perfect world with no spammers: nobody stuffing keywords into a doc to make it match queries (more on "adversarial IR" under web search)

## Linear zone combinations

First generation of scoring methods: use
a linear combination of Booleans, e.g.:

Score = 0.6·⟨query in zone₁⟩ + 0.3·⟨query in zone₂⟩ + 0.05·⟨query in zone₃⟩ + 0.05·⟨query in zone₄⟩

- Each bracketed expression such as ⟨query in zone⟩ takes on a value in {0, 1}; the overall score is then in [0, 1]
- For this example the scores can only take on a finite set of values: what are they?

### Linear zone combinations (cont.)

- In fact, the bracketed expressions on the
last slide could be any Boolean query
- Who generates the Score expression (with weights such as 0.6 etc.)?
  - In uncommon cases: the user, in the UI
  - Most commonly: a query parser that takes the user's Boolean query and runs it on the indexes for each zone

### Exercise

On the query *bill OR rights*, suppose that the zone indexes return sets of docIDs for *bill* and for *rights* in each of the Author, Title, and Body zones (shown as a table on the original slide). Compute the score for each doc based on the weightings 0.6, 0.3, 0.1.

## General idea

- We are given a weight vector whose components sum up to 1; there is one weight per zone/field
- Given a Boolean query, we assign a score to each doc by adding up the weighted contributions of the zones/fields
- Typically, users want to see the K highest-scoring docs

## Index support for zone combinations

- In the simplest version, we have a separate inverted index for each zone
- Variant: have a single index with a
separate dictionary entry for each term and zone, e.g. entries bill.author, bill.title, bill.body, each with its own postings list
- Of course, compress zone names like author/title/body

## Zone combinations index

- The above scheme is still wasteful: each term is potentially replicated for each zone
- In a slightly better scheme, we encode the zone in the postings; at query time, accumulate contributions to the total score of a document from the various postings, e.g.:
  - bill: 1.author, 1.body; 2.author, 2.body; 3.title
  - rights: 3.title, 3.body; 5.title, 5.body
- As before, the zone names get compressed

## Score accumulation

- As we walk the postings for the query *bill OR rights*, we accumulate scores for each doc in a linear merge as before
- Note: we get both *bill* and *rights* in the Title zone of doc 3, but score it no higher. Should we give more weight to more hits?
- Where do these weights come from?

## Machine-learned relevance

- Given: a test corpus, a suite of test queries, and a set of relevance judgments
- Learn a set of weights such that the relevance judgments are matched
- Can be formulated as ordinal regression
- More in next week's lecture
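The weighted zone scoring just described can be sketched in a few lines. This is an illustrative sketch, not the lecture's exercise: the weights follow the 0.6/0.3/0.1 weighting, but the toy zone indexes and the OR-matching rule are assumptions.

```python
# Weighted zone scoring: each zone contributes its weight if the
# Boolean query matches in that zone. Zone weights sum to 1.
# All data below is a toy example, not the lecture's exercise table.

ZONE_WEIGHTS = {"author": 0.6, "title": 0.3, "body": 0.1}

# Toy zone indexes: term -> set of docIDs containing the term in that zone.
zone_indexes = {
    "author": {"bill": {1, 2}},
    "title": {"bill": {3, 5}, "rights": {3, 5}},
    "body": {"bill": {1, 2, 5}, "rights": {3, 9}},
}

def weighted_zone_score(terms):
    """Score docs for an OR query: a zone contributes its full weight
    if any query term occurs in that zone of the doc."""
    scores = {}
    for zone, weight in ZONE_WEIGHTS.items():
        matching_docs = set()
        for t in terms:
            matching_docs |= zone_indexes[zone].get(t, set())
        for d in matching_docs:
            scores[d] = scores.get(d, 0.0) + weight
    return scores

print(weighted_zone_score(["bill", "rights"]))
```

Note that, as on the slide, a doc matching in Title gets 0.3 whether one or both query terms occur there; rewarding more hits per zone requires a non-Boolean per-zone score.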
## Full text queries

- We just scored the Boolean query *bill OR rights*, but most users are more likely to type *bill rights* or *bill of rights*
- How do we interpret these full-text queries?
  - No Boolean connectives
  - Of several query terms, some may be missing in a doc
  - Only some query terms may occur in the title, etc.

### Full text queries (cont.)

To use zone combinations for free-text queries, we need:

- A way of assigning a score to a ⟨free-text query, zone⟩ pair
- Zero query terms in the zone should mean a zero score; more query terms in the zone should mean a higher score
- Scores don't have to be Boolean
- We will look at some alternatives now

## Incidence matrices

- Recall: a document (or a zone in it) is a binary vector X in {0, 1}^v; the query is a vector Y
- Score: the overlap measure |X ∩ Y|

### Example

- On the query *ides of march*, Shakespeare's Julius Caesar has a score of 3
- All other Shakespeare plays have a score of 2 (because they contain *march*) or 1
- Thus in a rank order, Julius Caesar would come out on top

## Overlap matching

What's wrong with the overlap measure? It doesn't consider:

- Term frequency in the document
- Term scarcity in the collection (document mention frequency): *of* is more common than *ides* or *march*
- The length of documents (and of queries: the score is not normalized)

One can normalize in various ways:

- Jaccard coefficient: |X ∩ Y| / |X ∪ Y|
- Cosine measure: |X ∩ Y| / √(|X| · |Y|)

What documents would score best using Jaccard against a typical query? Does the cosine measure fix this problem?
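The overlap, Jaccard, and binary-cosine measures are easy to state on term sets. A minimal sketch; the example document and query below are made up:

```python
import math

# Set-based matching scores for binary (incidence) vectors, where a
# document (or zone) and the query are each represented as a term set.

def overlap(x, y):
    """Overlap measure: number of shared terms, |X intersect Y|."""
    return len(x & y)

def jaccard(x, y):
    """Jaccard coefficient: shared terms / size of the union."""
    return len(x & y) / len(x | y) if x | y else 0.0

def binary_cosine(x, y):
    """Cosine for binary vectors: shared terms / sqrt(|X| * |Y|)."""
    return len(x & y) / math.sqrt(len(x) * len(y)) if x and y else 0.0

doc = {"ides", "of", "march", "caesar", "rome"}
query = {"ides", "of", "march"}

print(overlap(doc, query), jaccard(doc, query), binary_cosine(doc, query))
```

Both normalized measures penalize large documents, which is why very short docs can score best under Jaccard against a typical query.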
## Scoring: density-based

- Thus far: position and overlap of terms in a doc (title, author, etc.)
- The obvious next idea: if a document talks about a topic more, then it is a better match
- This applies even when we only have a single query term: a document is relevant if it has many occurrences of the term
- This leads to the idea of term weighting, a more general form of "ranked Boolean retrieval"

## Term-document count matrices

- Consider the number of occurrences of a term in a document: the bag of words model
- A document is now a vector of counts, one column of the term-document matrix

### Bag of words view of a doc

- Thus the doc *John is quicker than Mary.* is indistinguishable from the doc *Mary is quicker than John.*
- Which of the indexes discussed so far distinguish these two docs?

## Counts vs. frequencies

- Consider again the *ides of march* query
- Julius Caesar has 5 occurrences of *ides*; no other play has *ides*
- march
occurs in over a dozen plays, and all the plays contain *of*
- By this scoring measure, the top-scoring play is likely to be the one with the most *of*s

## Digression: terminology

- WARNING: in a lot of IR literature, "frequency" is used to mean "count"
- Thus *term frequency* in IR literature means the number of occurrences of a term in a doc, not divided by document length (which would actually make it a frequency)
- We will conform to this misnomer: by term frequency we mean the number of occurrences of a term in a document

## Term frequency tf

- Long docs are favored because they're more likely to contain query terms
- Can fix this
to some extent by normalizing for document length
- But is raw tf the right measure?

## Weighting term frequency: tf

- What is the relative importance of 0 vs. 1 occurrence of a term in a doc? 1 vs. 2 occurrences? 2 vs. 3 occurrences?
- Unclear: while it seems that more is better, a lot isn't proportionally better than a few
- Can just use raw tf
- Another option commonly used in practice is sublinear scaling: wf(t, d) = 1 + log tf(t, d) if tf(t, d) > 0, and 0 otherwise

## Score computation

- Score for a query q: Score(q, d) = Σ over terms t in q of tf(t, d)
- Note: the score is 0 if no query terms occur in the document
- This score can be zone-combined, and we can use wf instead of tf in the above
- Still doesn't consider term scarcity in the collection (*ides* is rarer than *of*)

## Weighting should depend on the term overall

- Which of these tells you more about a doc: 10 occurrences of *hernia*, or 10 occurrences of *the*?
- We would like to attenuate the weight of a common term; but what is "common"?
- Suggestion: look at collection frequency (cf), the total number of occurrences of the term in the entire collection of documents

## Document frequency

- But document frequency (df) may be better: df = number of docs in the corpus containing the term

| Word | cf | df |
| --- | --- | --- |
| ferrari | 10422 | 17 |
| insurance | 10440 | 3997 |

- Document/collection frequency weighting is only possible in a known (static) collection
- So how do we make use of df?
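A small sketch of the sublinear wf weight and the summed score above, assuming base-10 logarithms (the base only rescales scores) and a toy term-frequency table:

```python
import math

# Sublinear tf scaling and the summed score from the slides above.
# Base-10 log is a common choice; the exact base only rescales scores.

def wf(tf):
    """Sublinear term-frequency weight: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def score(query_terms, doc_tf):
    """Score(q, d) = sum over query terms t of wf(tf(t, d))."""
    return sum(wf(doc_tf.get(t, 0)) for t in query_terms)

doc = {"ides": 5, "of": 100, "march": 2}  # toy per-term counts for one doc
print(score(["ides", "of", "march"], doc))
```

A doc containing none of the query terms scores exactly 0, as required.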
## tf × idf term weights

The tf × idf measure combines:

- term frequency (tf), or wf: some measure of term density in a doc
- inverse document frequency (idf): a measure of the informativeness of a term, i.e., its rarity across the whole corpus
  - Could just use the reciprocal of the number of documents the term occurs in (idfᵢ = 1/dfᵢ)
  - But by far the most commonly used version is idfᵢ = log(n / dfᵢ), where n is the number of documents in the corpus
  - See Kishore Papineni, NAACL 2, 2002 for a theoretical justification

## Summary: tf × idf (or tf.idf)

- Assign a tf.idf weight to each term i in each document d: w(i, d) = tf(i, d) × log(n / dfᵢ)
- The weight increases with the number of occurrences within a doc, and with the rarity of the term across the whole corpus
- What is the weight of a term that occurs in all of the docs?

## Real-valued term-document matrices

- A function (scaling) of the count of a word in a document: still the bag of words model
- Each document is now a vector in ℝ^v; here, log-scaled tf.idf
- Note: entries can be > 1!

## Documents as vectors

- Each doc j can now be viewed as a vector of wf·idf values, one component for each term
- So we have a vector space: terms are axes, and docs live in this space
- Even with stemming, we may have 20,000+ dimensions
- (The corpus of documents gives us a matrix, which we could also view as a vector space in which words live: transposable data)

## Recap

- We began by looking at zones in scoring, and parametric and field searches
- We ended up viewing documents as vectors in a vector space: tf·idf and (term-dimensional) vector spaces
- We will pursue this view further: a vector space model for scoring

## Documents as vectors (of terms)

- At the end of Lecture 6 we said: each doc d can now be viewed as a vector of wf·idf values, one component for each term
- So we have a vector space: terms are axes, docs live in this space; even with stemming, we may have 50,000+ dimensions

## Why turn docs into vectors?

- First application: query-by-example. Given a doc d, find others "like" it
- Now that d is a vector, find vectors (docs) "near" it
- Intuition/postulate: documents that are "close together" in the vector space talk about the same things
- (Figure: docs d1–d5 as points in a space with term axes t1, t2, t3)

## Desiderata for proximity

- If d1 is near d2, then d2 is near d1
- If d1 is near d2, and d2 near d3, then d1 is not far from d3
- No doc is closer to d than d itself
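The docs-as-vectors view can be made concrete with a small sketch: build sparse tf·idf vectors over an assumed toy corpus, using idf = log10(n/df). As the bag-of-words slide predicted, the two John/Mary docs come out identical:

```python
import math
from collections import Counter

# Building tf-idf document vectors. The corpus below is a toy example;
# idf_i = log10(n / df_i), and weights are tf * idf.

corpus = {
    "d1": "john is quicker than mary".split(),
    "d2": "mary is quicker than john".split(),
    "d3": "the ides of march".split(),
}

n = len(corpus)
df = Counter()
for terms in corpus.values():
    df.update(set(terms))  # df counts docs containing the term, not occurrences

def tfidf_vector(terms):
    """Map a doc (list of tokens) to a sparse {term: tf * idf} vector."""
    tf = Counter(terms)
    return {t: c * math.log10(n / df[t]) for t, c in tf.items()}

# d1 and d2 contain the same bag of words, so their vectors are equal.
print(tfidf_vector(corpus["d1"]) == tfidf_vector(corpus["d2"]))
```

A term occurring in every doc gets idf = log(n/n) = 0, answering the quiz on the summary slide: its weight is zero everywhere.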
## First cut: Euclidean distance

- Idea: the distance between d1 and d2 is the length of the vector |d1 − d2|, the Euclidean distance
- Why is this not a great idea? We still haven't dealt with the issue of length normalization: short documents would be more similar to each other by virtue of length, not topic
- However, we can implicitly normalize by looking at angles instead

## Cosine similarity

- The proximity of vectors d1 and d2 is captured by the cosine of the angle θ between them
- Note: this is a similarity, not a distance; there is no triangle inequality for similarity

### Normalization

- A vector can be normalized (given a length of 1) by dividing each of its components by its length; here we use the L2 norm
- This maps vectors onto the unit sphere, so longer documents don't get more weight

### Cosine of the angle between two vectors

- cos(d1, d2) = (d1 · d2) / (|d1| · |d2|); the denominator involves the lengths (norms) of the vectors
- For normalized vectors, the cosine
is simply the dot product: cos(d1, d2) = d1 · d2

### Example

Docs: Austen's Sense and Sensibility (SAS) and Pride and Prejudice (PAP); Brontë's Wuthering Heights (WH), with normalized tf weights:

- cos(SAS, PAP) = 0.996 × 0.993 + 0.087 × 0.120 + 0.017 × 0.0 ≈ 0.999
- cos(SAS, WH) = 0.996 × 0.847 + 0.087 × 0.466 + 0.017 × 0.254 ≈ 0.889

### Cosine similarity exercises

- Exercise: rank the following by decreasing cosine similarity, assuming tf-idf weighting:
  - Two docs that have only frequent words (the, a, an, of) in common
  - Two docs that have no words in common
  - Two docs that have many rare words in common (wingspan, tailfin)
- Exercise: show that, for normalized vectors, Euclidean distance gives the same proximity ordering as the cosine measure

## Queries in the vector space model

- Central idea: the query as a vector
- We regard the query as a short document
- We return the documents ranked by the closeness of their vectors to the query, also represented as a vector.
Note that the query vector is very sparse!

## Summary: what's the point of using vector spaces?

- A well-formed algebraic space for retrieval
- Key: a user's query can be viewed as a (very) short document
- The query becomes a vector in the same space as the docs, and we can measure each doc's proximity to it
- A natural measure of scores/ranking, no longer Boolean
- Queries are expressed as bags of words
- Other similarity measures: see /strehl/diss/node52.html for a survey

## Digression: spamming indices

- This was all invented before the days when people were in the business of spamming web search engines. Consider:
  - indexing a sensible, passive document collection, vs.
  - an active document collection, where people (and indeed, service companies) are shaping documents in order to maximize scores
- Vector space similarity may not be as useful in this context
- A "virtual doc" of propagated anchor texts may help

## Interaction: vectors and phrases

- Scoring phrases doesn't fit naturally into the vector space world: "tangerine trees", "marmalade skies"
- Positional indexes don't calculate or store tf.idf information for "tangerine trees"
- Biword indexes treat certain phrases as terms; for these, we can pre-compute tf.idf
  - Theoretical problem: correlated dimensions
  - Practical problem: we cannot expect an end user formulating queries to know what phrases are indexed
- We can use a positional index to boost or ensure phrase occurrence

## Vectors and Boolean queries

- Vectors and Boolean queries really don't work together very well
- In the space of terms, vector proximity selects by spheres:
e.g., all docs having cosine similarity ≥ 0.5 to the query
- Boolean queries, on the other hand, select by (hyper-)rectangles and their unions/intersections
- Round peg, square hole

## Vectors and wildcards

- How about the query *tan\* marm\**? Can we view this as a bag of words?
- Thought: expand each wildcard into the matching set of dictionary terms
- Danger: unlike the Boolean case, we now have tfs and idfs to deal with
- Net: not a good idea

## Vector spaces and other operators

- Vector space queries are apt for no-syntax, bag-of-words queries: a clean metaphor for similar-document queries
- Not a good combination with Boolean, wildcard, or positional query operators
- But…

## Query language vs. scoring

- We may allow the user a certain query language, say free-text basic queries, with phrase, wildcard etc. in Advanced Queries
- For scoring (oblivious to the user) we may use all of the above, e.g. for a free-text query:
  - Highest-ranked hits have the query as a phrase
  - Next, docs that have all query terms near each other
  - Then, docs that have some query terms, or all of them spread out, with tf × idf weights for scoring

## Exercises

- How would you augment the inverted index built in Lectures 1–3 to support cosine ranking computations?
- Walk through the steps of serving a
query
- The math of the vector space model is quite straightforward, but being able to do cosine ranking efficiently at runtime is nontrivial

## Efficient cosine ranking

- Find the k docs in the corpus "nearest" to the query, i.e., the k largest query-doc cosines
- Efficient ranking means: computing a single cosine efficiently, and choosing the k largest cosine values efficiently
- Can we do this without computing all n cosines? (n = number of documents in the collection)
- What we're doing in effect: solving the k-nearest-neighbor problem for a query vector
- In general, we do not know how to do this efficiently for high-dimensional spaces, but it is solvable for short queries, and standard indexes are optimized to do this

## Computing a single cosine

- For every term i and each doc j, store the term frequency tfᵢⱼ
  - Some tradeoffs on whether to store the term count, the term weight, or the weight scaled by idfᵢ
- At query time, use an array of accumulators Aⱼ to accumulate the component-wise sums
- If you're indexing 5 billion documents (web search), an array of accumulators is infeasible. Ideas?

## Encoding document frequencies

- Add tf(t, d) to the postings lists; now almost always stored as a frequency, scaled at runtime
- Unary code is quite effective here; the γ code (Lecture 3) is an even better choice
- Overall, this requires little additional space, e.g. with (docID, tf) postings:
  - abacus, df 8: (1,2) (7,3) (83,1) (87,2) …
  - aargh, df 10: (1,1) (5,1) (13,1) (17,1) …
  - acacia, df 35: (7,1) (8,2) (40,1) (97,3) …
- Why?

## Computing the k largest cosines: selection vs. sorting

- Typically we want to retrieve the top k docs (in the cosine ranking for the query), not to totally order all docs in the corpus