Algorithms for Nearest Neighbor Search
Piotr Indyk, MIT

Nearest Neighbor Search
Given: a set P of n points in R^d
Goal: a data structure which, given a query point q, finds the nearest neighbor p of q in P

Outline of this talk
- Variants
- Motivation
- Main memory algorithms: quadtrees, k-d-trees, Locality-Sensitive Hashing
- Secondary storage algorithms: R-tree (and its variants), VA-file

Variants of nearest neighbor
- Near neighbor (range search): find one/all points in P within distance r from q
- Spatial join: given two sets P, Q, find all pairs p in P, q in Q, such that p is within distance r from q
- Approximate near neighbor: find one/all points p' in P whose distance to q is at most (1+e) times the distance from q to its nearest neighbor

Motivation
Depends on the value of d:
- low d: graphics, vision, GIS, etc.
- high d:
  - similarity search in databases (text, images)
  - finding pairs of similar objects (e.g., copyright violation detection)
  - a useful subroutine for clustering

Algorithms
Main memory (Computational Geometry):
- linear scan
- tree-based: quadtree, k-d-tree
- hashing-based: Locality-Sensitive Hashing
Secondary storage (Databases):
- R-tree (and numerous variants)
- Vector Approximation File (VA-file)

Quadtree
Simplest spatial structure on Earth!

Quadtree ctd.
Split the space into 2^d equal subsquares.
Repeat until done:
- only one pixel left
- only one point left
- only a few points left
Variants: split only one dimension at a time (k-d-trees, in a moment)
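The splitting rule above can be made concrete with a minimal 2-D quadtree build (a sketch, not the talk's code; the `Node` layout, the unit root square, and the depth cap are illustrative assumptions):

```python
class Node:
    """A 2-D quadtree cell: lower corner (x0, y0), side length `size`."""
    def __init__(self, x0, y0, size, points, depth=0, max_depth=10):
        self.x0, self.y0, self.size = x0, y0, size
        self.points = points
        self.children = []
        # Stop splitting once only one point is left (or the depth cap is hit).
        if len(points) > 1 and depth < max_depth:
            half = size / 2
            for dx in (0, 1):          # 2^d = 4 equal subsquares for d = 2
                for dy in (0, 1):
                    cx, cy = x0 + dx * half, y0 + dy * half
                    sub = [p for p in points
                           if cx <= p[0] < cx + half and cy <= p[1] < cy + half]
                    if sub:            # keep only non-empty subsquares
                        self.children.append(
                            Node(cx, cy, half, sub, depth + 1, max_depth))

root = Node(0.0, 0.0, 1.0, [(0.1, 0.2), (0.8, 0.9), (0.81, 0.88)])
```

Here the "only a few points left" variant would just relax the `len(points) > 1` test to a larger threshold.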

Range search
Near neighbor (range search):
- put the root on the stack
- repeat:
  - pop the next node T from the stack
  - for each child C of T:
    - if C is a leaf, examine point(s) in C
    - if C intersects with the ball of radius r around q, add C to the stack

Near neighbor ctd.

Nearest neighbor
- Start range search with r = infinity
- Whenever a point is found, update r
- Only investigate nodes with respect to the current r
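The stack-based traversal with a shrinking radius can be sketched as follows (Python; the minimal `Node` layout here is a hypothetical stand-in for whatever quadtree representation is used):

```python
import math

class Node:
    """Minimal quadtree cell; a node is a leaf when `children` is empty."""
    def __init__(self, x0, y0, size, points=(), children=()):
        self.x0, self.y0, self.size = x0, y0, size
        self.points, self.children = list(points), list(children)

def nearest_neighbor(root, q):
    best, r = None, math.inf          # start range search with r = infinity
    stack = [root]
    while stack:
        T = stack.pop()
        if not T.children:            # leaf: examine its point(s)
            for p in T.points:
                d = math.dist(p, q)
                if d < r:             # point found: update r
                    best, r = p, d
        else:
            for C in T.children:
                # Distance from q to the cell C (0 if q lies inside C).
                dx = max(C.x0 - q[0], 0.0, q[0] - (C.x0 + C.size))
                dy = max(C.y0 - q[1], 0.0, q[1] - (C.y0 + C.size))
                # Visit C only if it intersects the ball of radius r around q.
                if math.hypot(dx, dy) <= r:
                    stack.append(C)
    return best
```

With r fixed instead of shrinking, the same loop answers the near-neighbor (range search) query from the previous slide.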

Quadtree ctd.
- Simple data structure
- Versatile, easy to implement
So why doesn't this talk end here?
- Empty spaces: if the points form a sparse cloud, it takes a while to reach them
- Space exponential in the dimension
- Time exponential in the dimension, e.g., points on the hypercube
Space issues: example

K-d-trees [Bentley'75]
Main ideas:
- only one-dimensional splits
- instead of splitting in the middle, choose the split "carefully" (many variations)
- near(est) neighbor queries: as for quadtrees
Advantages:
- no (or less) empty space
- only linear space
Exponential query time is still possible.
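One common "careful" choice is to split at the median of the points along the current axis; a minimal build sketch (this median-with-cycled-axes rule is just one of the many variations, not the specific one the talk has in mind):

```python
def build_kdtree(points, depth=0):
    """Build a k-d tree using one-dimensional median splits.

    Splitting at the median (rather than the geometric middle) keeps the
    tree balanced, which is how k-d trees avoid the quadtree's empty-space
    problem while using only linear space.
    """
    if len(points) <= 1:
        return {"leaf": True, "point": points[0] if points else None}
    axis = depth % len(points[0])        # cycle through the d dimensions
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {
        "leaf": False,
        "axis": axis,
        "split": pts[mid][axis],         # median coordinate on this axis
        "left": build_kdtree(pts[:mid], depth + 1),
        "right": build_kdtree(pts[mid:], depth + 1),
    }

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
```

Near(est) neighbor queries then proceed exactly as for quadtrees, pruning subtrees whose half-space lies farther than the current r.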

Exponential query time
What does it mean exactly?
- Unless we do something really stupid, query time is at most dn
- Therefore, the actual query time is Min[dn, exponential(d)]
- This is still quite bad, though, when the dimension is around 20-30
- Unfortunately, it seems inevitable (both in theory and in practice)

Approximate nearest neighbor
- Can do it using (augmented) k-d trees, by interrupting the search earlier [Arya et al.]
- Still exponential time (in the worst case)
- Try a different approach:
  - for exact queries, we can use binary search trees or hashing
  - can we adapt hashing to nearest neighbor search?

Locality-Sensitive Hashing [Indyk-Motwani'98]
Hash functions are locality-sensitive if, for a random hash function h and any pair of points p, q, we have:
- Pr[h(p)=h(q)] is "high" if p is "close" to q
- Pr[h(p)=h(q)] is "low" if p is "far" from q
Do such functions exist?

Consider the hypercube, i.e., points from {0,1}^d.
Hamming distance D(p,q) = # positions on which p and q differ.
Define the hash function h by choosing a set I of k random coordinates, and setting h(p) = projection of p on I.

Example: take d=10, p=0101110010, k=2, I={2,5}. Then h(p)=11.

The h's are locality-sensitive: Pr[h(p)=h(q)] = (1 - D(p,q)/d)^k.
We can vary the probability by changing k.
(Plot: Pr[h(p)=h(q)] vs. distance, for k=1 and k=2.)
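The projection hash is tiny in code; this sketch reproduces the slide's own example (coordinate positions are 1-based, as in I={2,5}):

```python
import random

def make_hash(d, k, rng=random):
    """Choose a set I of k random coordinates; h(p) = projection of p on I."""
    I = sorted(rng.sample(range(1, d + 1), k))   # 1-based positions
    return lambda p: "".join(p[i - 1] for i in I)

# The slide's concrete example: d=10, p=0101110010, k=2, I={2,5}.
h = lambda p: "".join(p[i - 1] for i in (2, 5))
print(h("0101110010"))  # -> 11
```

Each of the k coordinates agrees with probability 1 - D(p,q)/d, and the coordinates are chosen independently, which gives the stated Pr[h(p)=h(q)] = (1 - D(p,q)/d)^k.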

How can we use LSH?
- Choose several hash functions h1..hl
- Initialize a hash array for each hi
- Store each point p in the bucket hi(p) of the i-th hash array, i=1..l
- In order to answer query q:
  - for each i=1..l, retrieve the points in the bucket hi(q)
  - return the closest point found
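The bucket scheme above can be sketched for Hamming space as follows (Python; dictionaries stand in for the hash arrays, and the parameters d=8, k=3, l=5 are illustrative choices, not from the talk):

```python
import random

def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

class LSHIndex:
    def __init__(self, d, k, l, seed=0):
        rng = random.Random(seed)
        # l independent hash functions, each projecting onto k random coordinates.
        self.coords = [rng.sample(range(d), k) for _ in range(l)]
        self.tables = [{} for _ in range(l)]

    def _h(self, i, p):
        return "".join(p[j] for j in self.coords[i])

    def insert(self, p):
        # Store p in the bucket h_i(p) of the i-th hash array, i = 1..l.
        for i, table in enumerate(self.tables):
            table.setdefault(self._h(i, p), []).append(p)

    def query(self, q):
        # Retrieve the points in buckets h_1(q)..h_l(q); return the closest.
        best, best_d = None, float("inf")
        for i, table in enumerate(self.tables):
            for p in table.get(self._h(i, q), []):
                d = hamming(p, q)
                if d < best_d:
                    best, best_d = p, d
        return best

idx = LSHIndex(d=8, k=3, l=5)
for p in ["00000000", "11111111", "00001111"]:
    idx.insert(p)
```

Note the query inspects only the l buckets that q hashes to, never the whole data set.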

What does this algorithm do?
By proper choice of the parameters k and l, we can make, for any p, the probability that hi(p)=hi(q) for some i look like this:
(Plot: collision probability vs. distance.)
We can control:
- the position of the slope
- how steep it is
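Concretely (a derivation from the earlier collision formula, not spelled out on the slide): if one table collides with probability s = (1 - D(p,q)/d)^k, then independence of the l tables gives Pr[some i collides] = 1 - (1 - s)^l; k sharpens the slope and l shifts where it falls.

```python
def collide_prob(D, d, k, l):
    """Pr[h_i(p)=h_i(q) for some i in 1..l], at Hamming distance D."""
    s = (1 - D / d) ** k        # one k-coordinate projection collides
    return 1 - (1 - s) ** l     # at least one of the l tables collides

# Larger k makes the curve drop faster; larger l keeps it near 1 for small D.
close = collide_prob(D=1, d=100, k=20, l=10)
far = collide_prob(D=50, d=100, k=20, l=10)
```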

The LSH algorithm
- Therefore, we can solve (approximately) the near neighbor problem with given parameter r
- Worst-case analysis guarantees dn^(1/(1+e)) query time
- Practical evaluation indicates much better behavior [GIM'99, HGI'00, Buh'00, BT'00]
- Drawbacks:
  - works best for Hamming distance (although it can be generalized to Euclidean space)
  - requires the radius r to be fixed in advance

Secondary storage
- Seek time is the same as the time needed to transfer hundreds of KBs
- Grouping the data is crucial
- A different approach is required:
  - in main memory, any reduction in the number of inspected points was good
  - on disk, this is not the case!

Disk-based algorithms
R-tree [Guttman'84]:
- departing point for many variations
- over 600 citations! (according to CiteSeer)
- "optimistic" approach: try to answer queries in logarithmic time
Vector Approximation File [WSB'98]:
- "pessimistic" approach: if we need to scan the whole data set, we had better do it fast
LSH works also on disk.

R-tree
"Bottom-up" approach (the k-d-tree was "top-down"):
- Start with a set of points/rectangles
- Partition the set into groups of small cardinality
- For each group, find the minimum rectangle containing the objects from this group
- Repeat
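One bottom-up round can be sketched in a few lines (Python; the fixed group size and grouping by sorted x-coordinate are toy assumptions — real R-tree variants choose groups much more carefully):

```python
def mbr(points):
    """Minimum bounding rectangle of 2-D points: (xmin, ymin, xmax, ymax)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

def build_level(points, group_size=3):
    """Partition into groups of small cardinality; wrap each in its MBR."""
    pts = sorted(points)                 # crude grouping: by x, then y
    groups = [pts[i:i + group_size] for i in range(0, len(pts), group_size)]
    return [(mbr(g), g) for g in groups]

level = build_level([(1, 1), (2, 2), (1, 3), (8, 8), (9, 7), (8, 9)])
```

Repeating `build_level` on the resulting rectangles (treated as objects themselves) yields the next level of the tree, up to a single root.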

R-tree ctd.
Advantages:
- Supports near(est) neighbor search (similar to before)
- Works for points and rectangles
- Avoids empty spaces
- Many variants: X-tree, SS-tree, SR-tree, etc.
Works well for low dimensions; not so great for high dimensions.

VA-file [Weber, Schek, Blott'98]
Approach:
- In high-dimensional spaces, all tree-based indexing structures examine a large fraction of the leaves
- If we need to visit so many nodes anyway, it is better to scan the whole data set and avoid performing seeks altogether
- 1 seek = transfer of a few hundred KB

VA-file ctd.
Natural question: how to speed up the linear scan?
Answer: use approximation.
- Use only i bits per dimension (and speed up the scan by a factor of 32/i)
- Identify all points which could be returned as an answer
- Verify those points using the original data set
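The filter-and-verify idea can be sketched as follows (Python; the uniform grid over [0,1) coordinates and the cell-based lower bound are illustrative assumptions about the approximation, not the paper's exact encoding):

```python
def quantize(x, bits):
    """Approximate a coordinate in [0, 1) by its cell index on a 2^bits grid."""
    return min(int(x * (1 << bits)), (1 << bits) - 1)

def cell_bounds(cell, bits):
    w = 1.0 / (1 << bits)
    return cell * w, (cell + 1) * w          # [lower, upper] of the cell

def candidates(data, q, r, bits=4):
    """Filter step: scan only the i-bit approximations, keeping a point
    when its cell could contain something within distance r of q.
    Verify step: check the survivors against the original coordinates."""
    out = []
    for p in data:
        lb2 = 0.0                            # squared lower bound on dist(p, q)
        for x, qx in zip(p, q):
            lo, hi = cell_bounds(quantize(x, bits), bits)
            gap = max(lo - qx, 0.0, qx - hi)
            lb2 += gap * gap
        if lb2 <= r * r:                     # cannot rule it out: candidate
            out.append(p)
    return [p for p in out
            if sum((x - qx) ** 2 for x, qx in zip(p, q)) <= r * r]
```

The filter touches only the compact approximations (hence the 32/i speed-up for 32-bit coordinates); only the few surviving candidates cost a lookup in the full data set.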

Time to sum up
- The "curse of dimensionality" is indeed a curse
- In main memory, we can perform sublinear-time search using trees or hashing
- In secondary storage, linear scan is pretty much all we can do
