




Introduction to Cross-Language Information Retrieval
Hsin-Hsi Chen (陳信希)
Department of Computer Science and Information Engineering
National Taiwan University

Outline
- Multilingual Environments
- What is Cross-Language Information Retrieval?
- Major Problems in CLIR
- Major Approaches in CLIR
- Case Study: CLIR in NPDM
- Summary

Multilingual Collections
- There are 6,703 languages listed in the Ethnologue.
- Digital libraries
  - OCLC Online Computer Library Center serves more than 17,000 libraries in 52 countries and contains over 30 million bibliographic records, with over 500 million ownership records attached, in more than 370 languages.
- World Wide Web
  - Around 40% of Internet users do not speak English; however, 80% of Web sites are still in English.

[Figure: bar chart of speakers (millions, axis 0 to 800) by language: Chinese, Hindi-Urdu, Portuguese, Russian, Japanese, among others.]

[Figure: real-world population by language in use (source: g11n/faq.htm): Chinese, English, Hindi, Spanish, Portuguese, Bengali, Russian, Arabic, Japanese. Statistics from Euro-Marketing Associates, 2019.]

[Figure: language shares (source: glreach/globstats/): Spanish, German, Japanese, French, Chinese, Dutch, Portuguese, Italian, Swedish, Korean. Statistics from Euro-Marketing Associates, 2019. Chinese-speaking share: 6.1%.]

Major Problems in CLIR
- [...] 南非 (South Africa), Südafrika
- Coverage of the vocabulary
- There is no one-to-one mapping between two languages.
- Translating queries automatically (queries lack syntax)
- Translating documents automatically (performance, ...)
- Computing mixed result lists

Cross-Language Information Retrieval
[Figure: taxonomy of CLIR approaches. Node labels: Controlled Vocabulary; Free Text; Knowledge-based (Thesaurus-based, Ontology-based, Dictionary-based); Corpus-based (Parallel, Comparable, Unaligned; Term-aligned, Sentence-aligned, Document-aligned); Hybrid; Query Translation; Text Translation; Vector Translation; Document Translation; No Translation.]

Query Translation Based CLIR
[Figure: English Query -> Translation Device -> Chinese Query -> Monolingual Chinese Retrieval System -> Retrieved Chinese Documents]

Translating the 400 Million non-English Pages of the WWW
- ... would take 100,000 days (300 years) on one fast PC, or 1 month on 3,600 PCs.

Knowledge-Based
- Examples
  - Subject Thesaurus
    - Hierarchical and associative relations.
    - Unique term assigned to each node.
  - Concept List
    - Term space partitioned into concept spaces.
  - Term List
    - List of cross-language synonyms.
  - Lexicon
    - Machine-readable syntax and/or semantics.

Ontology-Based Approaches
- Exploit complex knowledge representations, e.g., EuroWordNet.
- A Proposal for Conceptual Indexing using EuroWordNet

Dictionary-Based Approaches
- Exploit machine-readable dictionaries.
- Problems
  - translation ambiguity + target polysemy
  - coverage (unknown words, abbreviations, ...)

Dictionary-Based Approaches (Continued)
- Issue 1: selection strategy
  - Select all.
  - Select N randomly.
  - Select best N.
- Issue 2: which level
  - word
  - phrase

Selection Strategy: Select All
- Hull and Grefenstette 1996
- Take the concatenation of all term translations.
  E: politically motivated civil disturbances
  F: troubles civils à caractère politique
  trouble - turmoil, discord, trouble, unrest, disturbance, disorder
  civil - civil, civilian, courteous
  caractère - character, nature
  politique - political, diplomatic, politician, policy
- Original English (0.393) vs. automatic word-based transfer dictionary (0.235): 59.8% of the original's average precision.
- Errors: multi-word expressions and ambiguity

Selection Strategy: Select All (Continued)
- Davis 1996 (TREC5)
- Replace each English query term with all of its Spanish equivalent terms from the Collins bilingual dictionary.
- Monolingual (0.2895) vs. all-equivalent substitution (0.1422): 49.12% of monolingual average precision.

Evaluation Method
- Average Precision (5-, 9-, 11-points)
- Model
[Figure: three retrieval pipelines over the TREC Spanish Corpus.
 (1) Spanish Query -> Mono IR Engine
 (2) English Query -> Bilingual Dictionary -> Spanish Equivalents -> Mono IR Engine
 (3) English Query -> POS -> Bilingual Dictionary -> Spanish Equivalents by POS -> Mono IR Engine]

Selection Strategy: Select N
- Simple word-by-word translation
- Each query term is replaced by the word or group of words given for the first sense of the term's definition.
- 50-60% drop in performance (average precision)

Selection Strategy: Select N (Continued)
- Word/phrase translation
- Take at most three translations of each word, one from each of the first three senses. Take the phrase translation if the phrase appears in the dictionary.
- 30-50% worse than good translation
- Well-translated phrases can greatly improve effectiveness, but poorly translated phrases may negate the improvements.
- WBW (0.0244), phrasal (0.0148), good phrasal (0.0610): -39.3% and +150.3% relative to WBW.

Selection Strategy: Select Best N
- Hayashi, Kikui and Susaki 1997
  - Search for a dictionary entry corresponding to the longest sequence of words, from left to right.
  - Choose the most frequently used word (or phrase) in a text corpus collected from the WWW.
  - No results were reported for this query translation approach.
- Davis 1996 (TREC5)
  - POS disambiguation
  - Monolingual (0.2895) vs. all-equivalent substitution (0.1422) vs. POS disambiguation (0.1949): nearly 67.3% of monolingual average precision.

Corpus-Based Approaches
- Categorization
  - Term-Aligned
  - Sentence-Aligned
  - Document-Aligned (Parallel, Comparable)
  - Unaligned
- Usage
  - Setup Thesaurus
  - Vector Mapping

Term-Aligned Corpora
- Fine-grained alignment in parallel corpora
- Oard 2019
- Term alignment is a challenging problem.
[Figure: Parallel Bilingual Corpus -> Co-occurrence Statistics -> Translation Tables -> Machine Translation System; English Query -> Machine Translation System -> Spanish Query]

Sentence-Aligned Corpora
- Davis & Dunning 1995 (TREC4)
- High-frequency Terms

Brief Summary
- Dictionary-based methods
  - Specialized vocabulary not in the dictionaries will not be translated.
  - Ambiguities will add extraneous terms to the query.
- Parallel/comparable corpora-based methods
  - Parallel corpora are not always available.
  - Available corpora tend to be relatively small or to cover only a small number of subjects.
  - Performance depends on how well the corpora are aligned.

Brief Summary (Continued)
- Dictionaries are very useful.
  - Achieve 50% on their own
- Parallel corpora have limitations.
  - Domain shifts
  - Term alignment accuracy
- Dictionaries and corpora are complementary.
  - Dictionaries provide broad and shallow coverage.
  - Corpora provide narrow (domain-specific) but deep (more terminology) coverage of the language.

Hybrid Methods
- What knowledge can be employed?
  - lexical knowledge
  - corpus knowledge
  - ...

Hybrid Methods (Continued)
- Query Expansion
- Issue 1: context
  - Pseudo relevance feedback (local feedback): a query is modified by the addition of terms found in the top retrieved documents.
  - Local context analysis: queries are expanded by the addition of the top-ranked concepts from the top passages.

Hybrid Methods (Continued)
- Issue 2: when
  - before query translation
  - after query translation

Hybrid Methods (Continued)
- Ballesteros & Croft 1997
[Figure: experimental setup. Original Spanish TREC Queries -> human translation -> English (BASE) Queries -> automatic dictionary translation -> Spanish Queries -> INQUERY. Query expansion is applied to the English queries before the automatic dictionary translation, to the Spanish queries after it, or both.]

Hybrid Methods (Continued)
- Performance Evaluation
  - pre-translation: MRD (0.0823) vs. LF (0.1099) vs. LCA10 (0.1): +33.5%, +38.5%
  - post-translation: MRD (0.0823) vs. LF (0.0916) vs. LCA20 (0.1022): +11.3%, +24.1%
  - combined pre- and post-translation: MRD (0.0823) vs. LF (0.1242) vs. LCA20 (0.8): +51.0%, +65.0%
  - 32% below a monolingual baseline

Cross-Language Evaluation Forum
- A collaboration between the DELOS Network of Excellence for Digital Libraries and the US National Institute of Standards and Technology (NIST)
- Extension of the CLIR track at TREC (1997-1999)

Main Goals
- Promote research in cross-language system development for European languages by providing an appropriate infrastructure for:
  - CLIR system evaluation, testing and tuning
  - Comparison and discussion of results

CLEF 2000 Task Description
- Four evaluation tracks in CLEF 2000
  - multilingual information retrieval
  - bilingual information retrieval
  - monolingual (non-English) information retrieval
  - domain-specific IR

Case Study: CLIR for NPDM

3M in Digital Libraries/Museums
- Multi-media
  - Selecting suitable media to represent contents
- Multi-linguality
  - Decreasing the language barriers
- Multi-culture
  - Integrating multiple cultures

NPDM Project
- Palace Museum, Taipei, one of the famous museums in the world
- NSC supports a pioneer study of a digital museum project, NPDM, starting from 2000
  - Enamels from the Ming and Ching Dynasties
  - Famous Album Leaves of the Sung Dynasty
  - Illustrations in Buddhist Scriptures with Relative Drawings

Design Issues
- Standardization
  - A standard metadata protocol is indispensable for the interchange of resources with other museums.
- Multimedia
  - A suitable presentation scheme is required.
- Internationalization
  - to share the valuable resources of NPDM with users of different languages
  - to utilize knowledge presented in a foreign language

Translingual Issue
- CLIR
  - allows users to issue queries in one language to access documents in another language
  - here the query language is English and the document language is Chinese
- Two common approaches
  - Query translation
  - Document translation

Resources in NPDM pilot
- An enamel, a calligraphy, a painting, or an illustration
- MICI-DC
  - Metadata Interchange for Chinese Information
  - Fields accessible to users
    - Short descriptions vs. full texts
    - Bilingual versions vs. Chinese only
  - Fields for maintenance only

Search Modes
- Free search
  - Users describe their information need in natural language (Chinese or English).
- Specific topic search
  - Users fill in specific fields denoting authors, titles, dates, and so on.

Example
- Information need
  - Retrieve "Travelers Among Mountains and Streams", Fan Kuan (范寬 谿山行旅圖)
- Possible queries
  - Author: Fan Kuan; Kuan, Fan
  - Time: Sung Dynasty
  - Title: Mountains and Streams; Travel among mountains; Travel among streams; Mountain and stream painting
  - Free search: landscape painting; travelers, huge mountain, Nature; scenery; Shensi province

ECIR in NPDM
[Figure: system architecture.
 Name Search: English Names -> Machine Transliteration -> Chinese Names
 Title Search: Chinese Titles -> Document Translation -> English Titles
 Free search: English Query -> Query Disambiguation (Specific Bilingual Dictionary, Generic Bilingual Dictionary) -> Query Translation -> Chinese Query -> Chinese IR System (NPDM Collection) -> Results]

Specific Topic Search
- Proper names are important query terms.
  - Creators such as 林逋 (Lin Pu), 李建中 (Li Chien-chung), 歐陽脩 (Ou-yang Hsiu), etc.
  - Emperors such as 康熙 (Kang-hsi), 乾隆 (Chien-lung), 徽宗 (Hui-tsung), etc.
  - Dynasties such as 宋 (Sung), 明 (Ming), 清 (Ching), etc.

Name Transliteration
- The writing systems of Chinese and English are totally different.
- Wade-Giles (WG) and Pinyin are two well-known systems used in libraries to romanize Chinese.
- Backward transliteration
  - Transliterate target-language terms back to source-language ones.
  - Chen, Huang, and Tsai (COLING, 1998)
  - Lin and Chen (ROCLING, 2000)

Name Mapping Table
- Divide a name into a sequence of Chinese characters, and transform each character into phonemes.
- Look up the phoneme-to-WG (Pinyin) mapping table, and derive a canonical form for the name.
- Example
  - 林逋 -> "Lin Pu" (WG)

Name Similarity
- Extract the named entity from the query.
- Select the most similar named entity from the name mapping table.
- Naming sequence/scheme
  - LastName FirstName1, e.g., Chu Hsi (朱熹)
  - FirstName1 LastName, e.g., Hsi Chu (朱熹)
  - LastName FirstName1-FirstName2, e.g., Hsu Tao-ning (許道寧)
  - FirstName1-FirstName2 LastName, e.g., Tao-ning Hsu (許道寧)
  - Any order, e.g., Tao Ning Hsu (許道寧)
  - Any transliteration, e.g., Ju Shi (朱熹)

Title
- 谿山行旅圖, "Travelers among Mountains and Streams"
- "travelers", "mountains", and "streams" are the basic components.
- Users can express their information need by describing the desired artwork.
- The system measures the similarity between art titles (descriptions) and a query.

Free Search
- A query is composed of several concepts.
- Concepts are either transliterated or translated.
- Query translation works like a small-scale IR system.
- Resources
  - Name-mapping
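
The name-similarity step described above (extract a named entity from the query, then select the most similar entry in the name mapping table while tolerating different name orders, hyphenation, and transliterations) can be sketched in Python. This is only an illustrative sketch, not the NPDM system's actual method: the table contents, the function names `syllables`, `similarity`, and `best_match`, and the use of the standard library's `difflib.SequenceMatcher` as the similarity measure are all assumptions made here.

```python
from difflib import SequenceMatcher
from itertools import permutations

# Hypothetical name-mapping table: canonical Wade-Giles form -> Chinese name.
NAME_TABLE = {
    "chu hsi": "朱熹",
    "hsu tao-ning": "許道寧",
    "lin pu": "林逋",
    "fan kuan": "范寬",
}


def syllables(name):
    """Lower-case a romanized name and split it into syllables,
    treating hyphens like spaces (Tao-ning == Tao Ning)."""
    return name.lower().replace("-", " ").split()


def similarity(query_name, table_name):
    """Best character-level similarity over every ordering of the
    query syllables, so that name order does not matter."""
    target = " ".join(syllables(table_name))
    return max(
        SequenceMatcher(None, " ".join(order), target).ratio()
        for order in permutations(syllables(query_name))
    )


def best_match(query_name):
    """Return (Chinese name, score) of the most similar table entry."""
    canonical = max(NAME_TABLE, key=lambda entry: similarity(query_name, entry))
    return NAME_TABLE[canonical], similarity(query_name, canonical)


if __name__ == "__main__":
    for query in ["Chu Hsi", "Hsi Chu", "Tao-ning Hsu", "Tao Ning Hsu", "Ju Shi"]:
        print(query, "->", best_match(query))
```

Trying every syllable order is cheap for personal names of two or three syllables, and it handles the "any order" scheme directly. Plain string similarity is only a rough stand-in for the "any transliteration" case; a phoneme-level comparison, as the Name Mapping Table slide suggests, would likely be more robust.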