Introduction to Cross-Language Information Retrieval

Hsin-Hsi Chen (陳信希)
Department of Computer Science and Information Engineering
National Taiwan University

Outline
- Multilingual Environments
- What is Cross-Language Information Retrieval?
- Major Problems in CLIR
- Major Approaches in CLIR
- Case Study: CLIR in NPDM
- Summary

Multilingual Collections
- There are 6,703 languages listed in the Ethnologue
- Digital libraries
  - OCLC Online Computer Library Center serves more than 17,000 libraries in 52 countries and contains over 30 million bibliographic records, with over 500 million ownership records attached, in more than 370 languages
- World Wide Web
  - Around 40% of Internet users do not speak English; however, 80% of Web sites are still in English

[Figure: real-world language speaker populations (source: g11n/faq.htm). Axis: speakers in millions, 0 to 800. Languages: Chinese, English, Hindi-Urdu, Spanish, Portuguese, Bengali, Russian, Arabic, Japanese. Statistics from Euro-Marketing Associates, 2019.]
[Figure: Spanish, German, Japanese, French, Chinese, Dutch, Portuguese, Italian, Swedish, Korean; Chinese-speaking share of the population 6.1% (source: glreach/globstats/). Statistics from Euro-Marketing Associates, 2019.]

[...] (南非, Südafrika)
- Coverage of the vocabulary
- There is not a one-to-one mapping between two languages
- Translating queries automatically (lack of syntax)
- Translating documents automatically (performance, ...)
- Computing mixed result lists

Cross-Language Information Retrieval
[Figure: taxonomy of CLIR approaches. Labels: Controlled Vocabulary, Thesaurus-based, Ontology-based, Dictionary-based, Knowledge-based, Term-aligned, Sentence-aligned, Parallel, Comparable, Document-aligned, Unaligned, Corpus-based, Hybrid, Free Text, Query Translation, Text Translation, Vector Translation, Document Translation, No Translation, Cross-Language Information Retrieval]

Query Translation Based CLIR
[Figure: English Query -> Translation Device -> Chinese Query -> Monolingual Chinese Retrieval System -> Retrieved Chinese Documents]

Translating the 400 Million non-English Pages of the WWW
- ... would take 100000 days (300 years) on one fast PC. Or, 1 month on 3600 PCs.

Knowledge-Based
- Examples
  - Subject Thesaurus
    - Hierarchical and associative relations.
    - Unique term assigned to each node.
  - Concept List
    - Term space partitioned into concept spaces.
  - Term List
    - List of cross-language synonyms.
  - Lexicon
    - Machine readable syntax and/or semantics.

Ontology-Based Approaches
- Exploit complex knowledge representations, e.g., EuroWordNet
- A Proposal for Conceptual Indexing using EuroWordNet

Dictionary-Based Approaches
- Exploit machine-readable dictionaries.
- Problems
  - translation ambiguity + target
    polysemy
  - coverage (unknown words, abbreviations, ...)

Dictionary-Based Approaches (Continued)
- Issue 1: selection strategy
  - Select all.
  - Select N randomly.
  - Select best N.
- Issue 2: which level
  - word
  - phrase

Selection Strategy: Select All
- Hull and Grefenstette 2019
  - Take the concatenation of all term translations.
    E: politically motivated civil disturbances
    F: troubles civils a caractere politique
    trouble - turmoil, discord, trouble, unrest, disturbance, disorder
    civil - civil, civilian, courteous
    caractere - character, nature
    politique - political, diplomatic, politician, policy
  - Original English (0.393) vs. automatic word-based transfer dictionary (0.235): 59.8%.
  - errors: multi-word expressions and ambiguity

Selection Strategy: Select All (Continued)
- Davis 2019 (TREC5)
  - Replace each English query term with all of its Spanish equivalent terms from the Collins bilingual dictionary.
  - Monolingual (0.2895) vs. all-equivalent substitution (0.1422): 49.12%

Evaluation Method
- Average Precision (5-, 9-, 11-points)
- Model
[Figure: evaluation model. Labels: Spanish Query, Mono IR Engine, English Query, Bilingual Dictionary, Mono IR Engine, TREC Spanish Corpus, Spanish Equivalents, English Query, Mono IR Engine, TREC Spanish Corpus, Spanish Equivalents by POS, POS, Bilingual Dictionary, TREC Spanish Corpus]

Selection Strategy: Select N
- Simple word-by-word translation
  - Each query term is replaced by the word or group of words given for the first sense of the term's definition.
  - 50-60% drop in performance (average precision)

Selection Strategy: Select N (Continued)
- word/phrase translation
  - Take at most three translations of each word, one from each of the first three senses. Take the phrase translation if the phrase appears in the dictionary.
  - 30-50% worse than good translation
  - Well-translated phrases can greatly improve effectiveness, but poorly translated phrases may negate the improvements.
  - WBW (0.0244) vs. phrasal (0.0148, -39.3%) vs. good phrasal (0.0610, +150.3%)

Selection Strategy: Select Best N
- Hayashi, Kikui and Susaki 2019
  - search for a dictionary entry corresponding to the longest sequence of words from left to right
  - choose the most frequently used word (or phrase) in a text corpus collected from the WWW
  - no report for this query translation approach
- Davis 2019 (TREC5)
  - POS disambiguation
  - Monolingual (0.2895) vs. all-equivalent substitution (0.1422) vs. POS disambiguation (0.1949): near 67.3%

Corpus-Based Approaches
- Categorization
  - Term-Aligned
  - Sentence-Aligned
  - Document-Aligned (Parallel, Comparable)
  - Unaligned
- Usage
  - Setup Thesaurus
  - Vector Mapping

Term-Aligned Corpora
- Fine-grained alignment in parallel corpora
- Oard 2019
  - Term alignment is a challenging problem.
[Figure: labels: Parallel Bilingual Corpus, Cooccurrence Statistics, Translation Tables, Machine Translation System, English Query, Spanish Query]

Sentence-Aligned Corpora
- Davis & Dunning 2019 (TREC4)
- High-frequency Terms

Brief Summary
- dictionary-based methods
  - Specialized vocabulary not in the dictionaries will not be translated.
  - Ambiguities will add extraneous terms to the query.
- parallel/comparable corpora-based methods
  - Parallel corpora are not always available.
  - Available corpora tend to be relatively small or to cover only a small number of subjects.
  - Performance is dependent on how well the corpora are aligned.

Brief Summary (Continued)
- Dictionaries are very useful.
  - Achieve 50% on their own
- Parallel corpora have limitations.
  - Domain shifts
  - Term alignment accuracy
- Dictionaries and corpora are complementary.
  - Dictionaries provide broad and shallow coverage.
  - Corpora provide narrow (domain-specific) but deep (more terminology) coverage of the language.

Hybrid Methods
- What knowledge can be employed?
  - lexical knowledge
  - corpus knowledge
  - ...

Hybrid Methods (Continued)
- Query Expansion
- Issue 1: context
  - pseudo relevance feedback (local feedback): a query is modified by the addition of terms found in the top retrieved documents.
  - local context analysis: queries are expanded by the addition of the top ranked concepts from the top passages.

Hybrid Methods (Continued)
- Issue 2: when
  - before query translation
  - after query translation

Hybrid Methods (Continued)
- Ballesteros & Croft 2019
[Figure: experiment flow. Labels: Original Spanish TREC Queries, human translation, English (BASE) Queries, Spanish Queries, automatic dictionary translation, English Queries, query expansion, Spanish Queries, query expansion, Spanish Queries, automatic dictionary translation, INQUERY]

Hybrid Methods (Continued)
- Performance Evaluation
  - pre-translation: MRD (0.0823) vs. LF (0.1099) vs. LCA10 (0.1); +33.5%, +38.5%
  - post-translation: MRD (0.0823) vs. LF (0.0916) vs. LCA20 (0.1022); +11.3%, +24.1%
  - combined pre- and post-translation: MRD (0.0823) vs. LF (0.1242) vs. LCA20 (0.8); +51.0%, +65.0%
  - 32% below a monolingual baseline

Cross-Language Evaluation Forum
- A collaboration between the DELOS Network of Excellence for Digital Libraries and the US National Institute for Standards and Technology (NIST)
- Extension of CLIR track at TREC (2019-2019)

Main Goals
- Promote research in cross-language system development for European languages by providing an appropriate infrastructure for:
  - CLIR system evaluation, testing and tuning
  - Comparison and discussion of results

CLEF 2000 Task Description
- Four evaluation tracks in CLEF 2000
  - multilingual information retrieval
  - bilingual information retrieval
  - monolingual (non-English) information retrieval
  - domain-specific IR

Case Study: CLIR for NPDM

3M in Digital Libraries/Museums
- Multi-media
  - Selecting suitable media to represent contents
- Multi-linguality
  - Decreasing the language barriers
- Multi-culture
  - Integrating multiple cultures

NPDM Project
- Palace Museum, Taipei, one of the famous museums in the world
- NSC supports a pioneer study of a digital museum project NPDM starting from 2000
  - Enamels from the Ming and Ching Dynasties
  - Famous Album Leaves of the Sung Dynasty
  - Illustrations in Buddhist Scriptures with Relative Drawings

Design Issues
- Standardization
  - A standard metadata protocol is indispensable for the interchange of resources with other museums.
- Multimedia
  - A suitable presentation scheme is required.
- Internationalization
  - to share the valuable resources of NPDM with users of different languages
  - to utilize knowledge presented in a foreign language

Translingual Issue
- CLIR
  - to allow users to issue queries in one language to access documents in another language
  - the query language is English and the document language is Chinese
- Two common approaches
  - Query translation
  - Document translation

Resources in NPDM pilot
- an enamel, a
  calligraphy, a painting, or an illustration
- MICI-DC
  - Metadata Interchange for Chinese Information
- Accessible fields to users
  - Short descriptions vs. full texts
  - Bilingual versions vs. Chinese only
- Fields for maintenance only

Search Modes
- Free search
  - users describe their information need using natural languages (Chinese or English)
- Specific topic search
  - users fill in specific fields denoting authors, titles, dates, and so on

Example
- Information need
  - Retrieve “Travelers Among Mountains and Streams, Fan Kuan” (“范寬谿山行旅圖”)
- Possible queries
  - Author: Fan Kuan; Kuan, Fan
  - Time: Sung Dynasty
  - Title: Mountains and Streams; Travel among mountains; Travel among streams; Mountain and stream painting
  - Free search: landscape painting; travelers, huge mountain, Nature; scenery; Shensi province

ECIR in NPDM
[Figure: system architecture. Labels: English Names, Chinese Names, Machine Transliteration, English Titles, Chinese Titles, Document Translation, Name Search, Title Search, English Query, Query Disambiguation, Specific Bilingual Dictionary, Generic Bilingual Dictionary, Chinese Query, Query Translation, Chinese IR System, NPDM Collection, Results]

Specific Topic Search
- proper names are important query terms
  - Creators such as “林逋” (Lin Pu), “李建中” (Li Chien-chung), “歐陽脩” (Ou-yang Hsiu), etc.
  - Emperors such as “康熙” (Kang-hsi), “乾隆” (Chien-lung), “徽宗” (Hui-tsung), etc.
  - Dynasties such as “宋” (Sung), “明” (Ming), “清” (Ching), etc.

Name Transliteration
- The alphabets of Chinese and English are totally different
- Wade-Giles (WG) and Pinyin are two famous systems to romanize Chinese in libraries
- backward transliteration
  - Transliterate target language terms back to source language ones
  - Chen, Huang, and Tsai (COLING, 2019)
  - Lin and Chen (ROCLING, 2000)

Name Mapping Table
- Divide a name into a sequence of Chinese characters, and transform each character into phonemes
- Look up the phoneme-to-WG (Pinyin) mapping table, and derive a canonical form for the name
- Example
  - “林逋” -> “Lin Pu” (WG)

Name Similarity
- Extract the named entity from the query
- Select the most similar named entity from the name mapping table
- Naming sequence/scheme
  - LastName FirstName1, e.g., Chu Hsi (朱熹)
  - FirstName1 LastName, e.g., Hsi Chu (朱熹)
  - LastName FirstName1-FirstName2, e.g., Hsu Tao-ning (許道寧)
  - FirstName1-FirstName2 LastName, e.g., Tao-ning Hsu (許道寧)
  - Any order, e.g., Tao Ning Hsu (許道寧)
  - Any transliteration, e.g., Ju Shi (朱熹)

Title
- 谿山行旅圖 “Travelers among Mountains and Streams”
- travelers, mountains, and streams are basic components
- Users can express their information need through the descriptions of a desired art
- System will measure the similarity of art titles (descriptions) and a query

Free Search
- A query is composed of several concepts.
- Concepts are either transliterated or translated.
- The query translation is similar to a small-scale IR system
- Resources
  - Name-mapping
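
Sketch: Name Similarity

The Name Similarity slide asks for the most similar entry in the name mapping table, whatever the order, hyphenation, or romanization of the query name. The following is a minimal sketch of that lookup, not the NPDM implementation. `NAME_TABLE`, `syllables`, `name_similarity`, and `most_similar` are hypothetical names, the table holds only names quoted in the slides, and plain string similarity from Python's `difflib` stands in for the phoneme-based comparison the slides describe.

```python
from difflib import SequenceMatcher
from itertools import permutations

# Hypothetical name mapping table: canonical Wade-Giles form -> Chinese name.
NAME_TABLE = {
    "Chu Hsi": "朱熹",
    "Hsu Tao-ning": "許道寧",
    "Lin Pu": "林逋",
    "Fan Kuan": "范寬",
}


def syllables(name):
    """Lowercase a romanized name and split it on spaces and hyphens."""
    return name.lower().replace("-", " ").split()


def name_similarity(query, candidate):
    """Order-insensitive similarity between two romanized names, in [0, 1].

    Assumption: names with different syllable counts never match.
    """
    q, c = syllables(query), syllables(candidate)
    if len(q) != len(c):
        return 0.0
    best = 0.0
    # Try every ordering of the candidate syllables (names are short).
    for perm in permutations(c):
        total = sum(SequenceMatcher(None, a, b).ratio() for a, b in zip(q, perm))
        best = max(best, total / len(q))
    return best


def most_similar(query):
    """Return (canonical form, Chinese name) of the best-matching table entry."""
    canonical = max(NAME_TABLE, key=lambda cand: name_similarity(query, cand))
    return canonical, NAME_TABLE[canonical]


if __name__ == "__main__":
    for q in ["Hsi Chu", "Tao Ning Hsu", "Ju Shi"]:
        print(q, "->", most_similar(q))
```

Trying every syllable order covers the "any order" scheme, and splitting on hyphens covers "Tao-ning" vs. "Tao Ning". The fuzzy per-syllable score is only a rough proxy for "any transliteration"; a phoneme-level comparison would be the closer match to the approach in the slides.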