Introduction to Cross-Language Information Retrieval
Hsin-Hsi Chen (陳信希)
Department of Computer Science and Information Engineering
National Taiwan University

Outline
- Multilingual Environments
- What is Cross-Language Information Retrieval?
- Major Problems in CLIR
- Major Approaches in CLIR
- Case Study: CLIR in NPDM
- Summary

Multilingual Collections
- There are 6,703 languages listed in the Ethnologue.
- Digital libraries
  - OCLC Online Computer Library Center serves more than 17,000 libraries in 52 countries and contains over 30 million bibliographic records, with over 500 million ownership records attached, in more than 370 languages.
- World Wide Web
  - Around
    40% of Internet users do not speak English; however, 80% of Web sites are still in English.

[Figure: bar chart of real-world speaker populations, in millions (axis 0 to 800), for Chinese, English, Hindi-Urdu, Spanish, Portuguese, Bengali, Russian, Arabic, and Japanese. Source: g11n/faq.htm; statistics from Euro-Marketing Associates, 1998.]

[Figure: chart covering Spanish, German, Japanese, French, Chinese, Dutch, Portuguese, Italian, Swedish, and Korean; Chinese-speaking population share: 6.1%. Source: glreach/globstats/; statistics from Euro-Marketing Associates, 1999.]

... 南非, Südafrika)
- Coverage of the vocabulary
- There is not a one-to-one mapping between two languages
- Translating queries automatically (lack of syntax)
- Translating documents automatically (performance, ...)
- Computing mixed result lists

Cross-Language Information Retrieval
- Controlled Vocabulary
- Free Text
  - Knowledge-based
    - Thesaurus-based
    - Ontology-based
    - Dictionary-based
  - Corpus-based
    - Term-aligned
    - Sentence-aligned
    - Document-aligned
      - Parallel
      - Comparable
    - Unaligned
  - Hybrid

Cross-Language Information Retrieval
- Query Translation
- Document Translation
  - Text Translation
  - Vector Translation
- No Translation

Query Translation Based CLIR
English Query -> Translation Device -> Chinese Query -> Monolingual Chinese Retrieval System -> Retrieved Chinese Documents

Translating the 400 Million non-English Pages of the WWW
- ... would take 100,000 days (300 years) on one fast PC, or 1 month on 3,600 PCs.

Knowledge-Based
- Examples
  - Subject Thesaurus
    - Hierarchical and associative relations.
    - Unique term assigned to each node.
  - Concept List
    - Term space partitioned into concept spaces.
  - Term List
    - List of cross-language synonyms.
  - Lexicon
    - Machine readable syntax and/or semantics.

Ontology-Based Approaches
- Exploit complex knowledge representations, e.g., EuroWordNet
- A Proposal for Conceptual Indexing using EuroWordNet

Dictionary-Based Approaches
- Exploit machine-readable dictionaries.
- Problems
  - translation ambiguity + target
    polysemy
  - coverage (unknown words, abbreviations, ...)

Dictionary-Based Approaches (Continued)
- Issue 1: selection strategy
  - Select all.
  - Select N randomly.
  - Select best N.
- Issue 2: which level
  - word
  - phrase

Selection Strategy: Select All
- Hull and Grefenstette 1996
- Take concatenation of all term translations.
  E: politically motivated civil disturbances
  F: troubles civils à caractère politique
  trouble - turmoil, discord, trouble, unrest, disturbance, disorder
  civil - civil, civilian, courteous
  caractère - character, nature
  politique - political, diplomatic, politician, policy
- Original English (0.393) vs. automatic word-based transfer dictionary (0.235): 59.8%.
- errors: multi-word expressions and ambiguity

Selection Strategy: Select All (Continued)
- Davis 1997 (TREC5)
- Replace each English query term with all of its Spanish equivalent terms from the Collins bilingual dictionary.
- Monolingual (0.2895) vs. all-equivalent substitution (0.1422): 49.12%

Evaluation Method
- Average Precision (5-, 9-, 11-points)
- Model
  [Figure: three retrieval pipelines over the TREC Spanish Corpus.
  (1) Spanish Query -> Mono IR Engine -> TREC Spanish Corpus
  (2) English Query -> Bilingual Dictionary -> Spanish Equivalents -> Mono IR Engine -> TREC Spanish Corpus
  (3) English Query -> POS -> Bilingual Dictionary -> Spanish Equivalents by POS -> Mono IR Engine -> TREC Spanish Corpus]

Selection Strategy: Select N
- Simple word-by-word translation
  - Each query term is replaced by the word or group of words given for the first sense of the term's definition.
  - 50-60% drop in performance (average precision)

Selection Strategy: Select N (Continued)
- word/phrase translation
  - Take at most
    three translations of each word, one from each of the first three senses. Take the phrase translation if it appears in the dictionary.
  - 30-50% worse than good translation
  - Well-translated phrases can greatly improve effectiveness, but poorly translated phrases may negate the improvements.
  - WBW (0.0244), phrasal (0.0148, -39.3%), good phrasal (0.0610, +150.3%)

Selection Strategy: Select Best N
- Hayashi, Kikui and Susaki 1997
  - search for a dictionary entry corresponding to the longest sequence of words from left to right
  - choose the most frequently used word (or phrase) in a text corpus collected from the WWW
  - no results reported for this query translation approach
- Davis 1997 (TREC5)
  - POS disambiguation
  - Monolingual (0.2895) vs. all-equivalent substitution (0.1422) vs. POS disambiguation (0.1949): near 67.3%

Corpus-Based Approaches
- Categorization
  - Term-Aligned
  - Sentence-Aligned
  - Document-Aligned (Parallel, Comparable)
  - Unaligned
- Usage
  - Setup Thesaurus
  - Vector Mapping

Term-Aligned Corpora
- Fine-grained alignment in parallel corpora
- Oard 1996
- Term alignment is a challenging problem.
[Figure labels: Parallel Bilingual Corpus, Cooccurrence Statistics, Translation Tables, Machine Translation System, English Query, Spanish Query]

Sentence-Aligned Corpora
- Davis & Dunning 1996 (TREC4)
- High-frequency Terms

Brief Summary
- dictionary-based methods
  - Specialized vocabulary not in the dictionaries will not be translated.
  - Ambiguities will add extraneous terms to the query.
- parallel/comparable corpora-based methods
  - Parallel corpora are not always available.
  - Available corpora tend to be relatively small or to cover only a small number of subjects.
  - Performance is dependent on how well the corpora are aligned.

Brief Summary (Continued)
- Dictionaries are very useful.
  - Achieve 50% on their own
- Parallel corpora have limitations.
  - Domain shifts
  - Term alignment accuracy
- Dictionaries and corpora are complementary.
  - Dictionaries provide broad and shallow coverage.
  - Corpora provide narrow (domain-specific) but deep (more terminology) coverage of the language.

Hybrid Methods
- What knowledge can be employed?
  - lexical knowledge
  - corpus knowledge
  - ...

Hybrid Methods (Continued)
- Query Expansion
- Issue 1: context
  - pseudo relevance feedback (local feedback): A query is modified by the addition of terms found in the top retrieved documents.
  - local context analysis: Queries are expanded by the addition of the top ranked concepts from the top passages.

Hybrid Methods (Continued)
- Issue 2: when
  - before query translation
  - after query translation

Hybrid Methods (Continued)
- Ballesteros & Croft 1997
[Figure labels: Original Spanish TREC Queries, human translation, English (BASE) Queries, automatic dictionary translation, Spanish Queries, query expansion, English Queries, INQUERY]

Hybrid Methods (Continued)
- Performance Evaluation
  - pre-translation: MRD (0.0823) vs. LF (0.1099, +33.5%) vs. LCA10 (0.1, +38.5%)
  - post-translation: MRD (0.0823) vs. LF (0.0916, +11.3%) vs. LCA20 (0.1022, +24.1%)
  - combined pre- and post-translation: MRD (0.0823) vs. LF (0.1242, +51.0%) vs. LCA20 (0.8, +65.0%)
  - 32% below a monolingual baseline

Cross-Language Evaluation Forum
- A collaboration between the DELOS Network of Excellence for Digital Libraries and the US National Institute for Standards and Technology (NIST)
- Extension of CLIR track at TREC (1997-1999)

Main Goals
- Promote research in cross-language system development for European languages by providing an appropriate infrastructure for:
  - CLIR system evaluation, testing and tuning
  - Comparison and discussion of results

CLEF 2000 Task Description
- Four evaluation tracks in CLEF 2000
  - multilingual information retrieval
  - bilingual information retrieval
  - monolingual (non-English) information retrieval
  - domain-specific IR

Case Study: CLIR for NPDM

3M in Digital Libraries/Museums
- Multi-media
  - Selecting suitable media to represent contents
- Multi-linguality
  - Decreasing the language barriers
- Multi-culture
  - Integrating multiple cultures

NPDM Project
- Palace
  Museum, Taipei, one of the famous museums in the world
- NSC supports a pioneer study of a digital museum project NPDM starting from 2000
  - Enamels from the Ming and Ching Dynasties
  - Famous Album Leaves of the Sung Dynasty
  - Illustrations in Buddhist Scriptures with Relative Drawings

Design Issues
- Standardization
  - A standard metadata protocol is indispensable for the interchange of resources with other museums.
- Multimedia
  - A suitable presentation scheme is required.
- Internationalization
  - to share the valuable resources of NPDM with users of different languages
  - to utilize knowledge presented in a foreign language

Translingual Issue
- CLIR
  - to allow users to issue queries in one language to access documents in another language
  - the query language is English and the document language is Chinese
- Two common approaches
  - Query translation
  - Document translation

Resources in NPDM pilot
- an enamel, a calligraphy, a painting, or an illustration
- MICI-DC
  - Metadata Interchange for Chinese Information
  - Accessible fields to users
    - Short descriptions vs. full texts
    - Bilingual versions vs. Chinese only
  - Fields for maintenance only

Search Modes
- Free search
  - users describe their information need using natural languages
    (Chinese or English)
- Specific topic search
  - users fill in specific fields denoting authors, titles, dates, and so on

Example
- Information need
  - Retrieve "Travelers Among Mountains and Streams" by Fan Kuan (范寬谿山行旅圖)
- Possible queries
  - Author: Fan Kuan; Kuan, Fan
  - Time: Sung Dynasty
  - Title: Mountains and Streams; Travel among mountains; Travel among streams; Mountain and stream painting
  - Free search: landscape painting; travelers, huge mountain, Nature; scenery; Shensi province

[Figure: ECIR in NPDM. Component labels: English Names, Chinese Names, Machine Transliteration, English Titles, Chinese Titles, Document Translation, Name Search, Title Search, English Query, Query Disambiguation, Specific Bilingual Dictionary, Generic Bilingual Dictionary, Chinese Query, Query Translation, Chinese IR System, NPDM Collection, Results]

Specific Topic Search
- proper names are important query terms
  - Creators such as 林逋 (Lin Pu), 李建中 (Li Chien-chung), 歐陽脩 (Ou-yang Hsiu), etc.
  - Emperors such as 康熙 (Kang-hsi), 乾隆 (Chien-lung), 徽宗 (Hui-tsung), etc.
  - Dynasties such as 宋 (Sung), 明 (Ming), 清 (Ching), etc.

Name Transliteration
- The writing systems of Chinese and English are totally different
- Wade-Giles (WG) and Pinyin are two well-known systems used in libraries to romanize Chinese
- backward transliteration
  - Transliterate target language terms back to source language ones
  - Chen, Huang, and Tsai (COLING, 1998)
  - Lin and Chen (ROCLING, 2000)

Name Mapping Table
- Divide a name into a sequence of Chinese characters, and transform each character into phonemes
- Look up the phoneme-to-WG (Pinyin) mapping table,
  and derive a canonical form for the name
- Example
  - 林逋 -> "Lin Pu" (WG)

Name Similarity
- Extract the named entity from the query
- Select the most similar named entity from the name mapping table
- Naming sequence/scheme
  - LastName FirstName1, e.g., Chu Hsi (朱熹)
  - FirstName1 LastName, e.g., Hsi Chu (朱熹)
  - LastName FirstName1-FirstName2, e.g., Hsu Tao-ning (許道寧)
  - FirstName1-FirstName2 LastName, e.g., Tao-ning Hsu (許道寧)
  - Any order, e.g., Tao Ning Hsu (許道寧)
  - Any transliteration, e.g., Ju Shi (朱熹)

Title
- 谿山行旅圖 "Travelers among Mountains and Streams"
- travelers, mountains, and streams are basic components
- Users can express their information need through the descriptions of a desired art
- System will measure the similarity of art titles (descriptions) and a query

Free Search
- A query is composed of several concepts.
- Concepts are either transliterated or translated.
- The query translation is similar to a small-scale IR system.
- Resources
  - Name-mapping
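
The naming sequences listed under Name Similarity (surname first or last, hyphenated or spaced given names, alternative romanizations) suggest an order-insensitive comparison against the name mapping table. The following is only a minimal sketch under stated assumptions, not the method of the cited papers: the table contents, the function names `syllables`, `similarity`, and `best_match`, and the use of character-level matching from Python's `difflib` in place of phoneme-based similarity are all illustrative choices.

```python
from difflib import SequenceMatcher

# Hypothetical name-mapping table: canonical Wade-Giles form -> Chinese name.
NAME_TABLE = {
    "Chu Hsi": "朱熹",
    "Hsu Tao-ning": "許道寧",
    "Lin Pu": "林逋",
    "Fan Kuan": "范寬",
}


def syllables(name):
    """Lowercase a romanized name, split on spaces and hyphens, and sort."""
    return sorted(name.lower().replace("-", " ").split())


def similarity(a, b):
    """Order-insensitive similarity between two romanized names, in [0, 1]."""
    # Sorting the syllables makes "Hsi Chu" and "Chu Hsi" compare as equal.
    # Character-level matching is an assumed stand-in for phoneme similarity.
    key_a = " ".join(syllables(a))
    key_b = " ".join(syllables(b))
    return SequenceMatcher(None, key_a, key_b).ratio()


def best_match(query_name, table=NAME_TABLE):
    """Return (canonical form, Chinese name, score) of the closest table entry."""
    canonical = max(table, key=lambda entry: similarity(query_name, entry))
    return canonical, table[canonical], similarity(query_name, canonical)


if __name__ == "__main__":
    for query in ["Hsi Chu", "Tao Ning Hsu", "Ju Shi"]:
        print(query, "->", best_match(query))
```

With this sketch, reordered or re-hyphenated forms such as "Hsi Chu" and "Tao Ning Hsu" match their canonical entries with a score of 1.0, while a different romanization such as "Ju Shi" only gets a partial score, so a real system would need a phoneme-aware measure and a threshold for rejecting poor matches.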