版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請進(jìn)行舉報(bào)或認(rèn)領(lǐng)
文檔簡介
Algorithms
for
Nearest
NeighborSearchPiotr
IndykMITNearest
Neighbor
SearchGiven:
a
set
P
of
n
points
in
Rd
Goal:
a
data
structure,
which
given
a
quepoint
q,
finds
the
nearest
neighbor
p
ofin
PpqOutline
of
this
talkVariantsMotivationMain
memory
algorithms:quadtreeskd-treesLocality
Sensitive
HashingSecondary
storage
algorithms:R-tree
(and
its
variants)VA-fileVariants
of
nearest
neighbor
Near
neighbor
(range
search):
find
one/alpoints
in
P
within
distance
r
from
q
Spatial
join:
given
two
sets
P,Q,
find
allpairs
p
in
P,
q
in
Q,
such
that
p
is
withindistance
r
from
q
Approximate
near
neighbor:
find
one/allpoints
p’
in
P,
whose
distance
to
q
is
atmost
(1+e)
times
the
distance
from
q
to
itsnearest
neighborMotivationDepends
on
the
value
of
d:low
d:
graphics,
vision,
GIS,
etchigh
d:similarity
search
in
databases
(text,
imagesfinding
pairs
of
similar
objects
(e.g.,
copyrviolation
detection)useful
subroutine
for
clusteringAlgorithmsMain
memory
(Computational
Geometry)linear
scantree-based:quadtreekd-treehashing-based:
Locality-Sensitive
HashingSecondary
storage
(Databases)R-tree
(and
numerous
variants)Vector
Approximation
File
(VA-file)QuadtreeSimplest
spatial
structure
on
Earth
!Quadtree
ctd.Split
the
space
into
2d
equal
subsquaresRepeat
until
done:only
one
pixel
leftonly
one
point
leftonly
a
few
points
leftVariants:split
only
one
dimension
at
a
timek-d-trees
(in
a
moment)Range
searchNear
neighbor
(range
search):put
the
root
on
the
stackrepeatpop
the
next
node
T
from
the
stackfor
each
child
C
of
T:if
C
is
a
leaf,
examine
point(s)
in
Cif
C
intersects
with
the
ball
of
radius
r
around
q,
add
Cthe
stackNear
neighbor
ctdNearest
neighborStart
range
search
with
r
=Whenever
a
point
is
found,
update
r
Only
investigate
nodes
with
respect
tocurrent
rQuadtree
ctd.Simple
data
structureVersatile,
easy
to
implementSo
why
doesn’t
this
talk
end
here
?Empty
spaces:
if
the
points
form
sparse
cloudit
takes
a
while
to
reach
themSpace
exponential
in
dimensionTime
exponential
in
dimension,
e.g.,
points
othe
hypercubeSpace
issues:
exampleK-d-trees
[Bentley’75]Main
ideas:only
one-dimensional
splitsinstead
of
splitting
in
the
middle,
choose
thsplit
“carefully”
(many
variations)near(est)
neighbor
queries:
as
for
quadtreesAdvantages:no
(or
less)
empty
spacesonly
linear
spaceExponential
query
time
still
possibleExponential
query
timeWhat
does
it
mean
exactly
?Unless
we
do
something
really
stupid,
query
time
ismost
dnTherefore,
the
actual
query
time
isMin[
dn,
exponential(d)
]
This
is
still
quite
bad
though,
when
the
dimensiois
around
20-30
Unfortunately,
it
seems
inevitable
(both
in
theoand
practice)Approximate
nearest
neighbor
Can
do
it
using
(augmented)
k-d
trees,
byinterrupting
search
earlier
[Arya
et
al’Still
exponential
time
(in
the
worst
caseTry
a
different
approach:for
exact
queries,
we
can
use
binary
searchtrees
or
hashingcan
we
adapt
hashing
to
nearest
neighborsearch
?Locality-Sensitive
Hashing[Indyk-Motwani’98]
Hash
functions
are
locality-sensitive,
ia
random
hash
random
function
h,
for
anypair
of
points
p,q
we
have:Pr[h(p)=h(q)]
is
“high”
if
p
is
“close”
tqPr[h(p)=h(q)]
is
“l(fā)ow”
if
p
is”far”
fromqDo
such
functions
exist
?Consider
the
hypercube,
i.e.,pointsfrom{0,1}dHamming
distance
D(p,q)=
#
positions
onwhich
p
and
q
differDefine
hash
function
h
by
choosing
a
set
Iof
k
random
coordinates,
and
settingh(p)
=projection
of
p
onIExampleTake–
d=10,
p=0101110010–
k=2,
I={2,5}Then
h(p)=11h’s
are
locality-sensitivePr[h(p)=h(q)]=(1-D(p,q)/d)kWe
can
vary
the
probability
by
changing
kk=1k=2distancedistancePrPrHow
can
we
use
LSH
?Choose
several
h1..hlInitialize
a
hash
array
for
each
hiStore
each
point
p
in
the
bucket
hi(p)
of
ti-th
hash
array,
i=1...lIn
order
to
answer
query
qfor
each
i=1..l,
retrieve
points
in
a
bucket
hreturn
the
closest
point
foundWhat
does
this
algorithm
do
?
By
proper
choice
of
parameters
k
and
l,
we
canmake,
for
any
p,
the
probability
thathi(p)=hi(q)
for
some
ilook
like
this:Can
control:Position
of
the
slopeHow
steep
it
isdistanceThe
LSH
algorithm
Therefore,
we
can
solve
(approximately)
the
nearneighbor
problem
with
given
parameter
rWorst-case
analysis
guarantees
dn1/(1+e)
query
ti
Practical
evaluation
indicates
much
better
beha[GIM’99,HGI’00,Buh’00,BT’00]Drawbacks:
works
best
for
Hamming
distance
(although
can
be
generalizeto
Euclidean
space)requires
radius
r
to
be
fixed
in
advanceSecondary
storage
Seek
time
same
as
time
needed
to
transferhundreds
of
KBsGrouping
the
data
is
crucialDifferent
approach
required:in
main
memory,
any
reduction
in
the
numberof
inspected
points
was
goodon
disk,
this
is
not
the
case
!Disk-based
algorithmsR-tree
[Guttman’84]departing
point
for
many
variationsover
600
citations
!
(according
to
CiteSeer)“optimistic”
approach:
try
to
answer
queries
inlogarithmic
timeVector
Approximation
File
[WSB’98]“pessimistic”
approach:
if
we
need
to
scan
the
whdata
set,
we
better
do
it
fastLSH
works
also
on
diskR-tree
“Bottom-up”
approach
(k-d-tree
was“top-down”)
:Start
with
a
set
of
points/rectanglesPartition
the
set
into
groups
of
small
cardinFor
each
group,
find
minimum
rectanglecontaining
objects
from
this
groupRepeatR-tree
ctd.R-tree
ctd.Advantages:Supports
near(est)
neighbor
search
(similarbefore)Works
for
points
and
rectanglesAvoids
empty
spacesMany
variants:
X-tree,
SS-tree,
SR-tree
etcWorks
well
for
low
dimensionsNot
so
great
for
high
dimensionsVA-file
[Weber,
Schek,Blott’98]Approach:In
high-dimensional
spaces,
all
tree-basedindexing
structures
examine
large
fraction
ofleavesIf
we
need
to
visit
so
many
nodes
anyway,
it
isbetter
to
scan
the
whole
data
set
and
avoidperforming
seeks
altogether1
seek
=
transfer
of
few
hundred
KBVA-file
ctd.
Natural
question:
how
to
speed-up
linearscan
?Answer:
use
approximationUse
only
i
bits
per
dimension
(and
speed-up
thscan
by
a
factor
of
32/i)Identify
all
points
which
could
be
returned
aan
answerVerify
the
points
using
original
data
setTime
to
sum
up“Curse
of
dimensionality”
is
indeed
a
curse
In
main
memory,
we
can
perform
sublinear-timesearch
using
trees
or
hashing
In
secondary
storage,
linear
scan
is
p
溫馨提示
- 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
- 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
- 5. 人人文庫網(wǎng)僅提供信息存儲空間,僅對用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對任何下載內(nèi)容負(fù)責(zé)。
- 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請與我們聯(lián)系,我們立即糾正。
- 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時也不承擔(dān)用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。
最新文檔
- 工程監(jiān)理勞務(wù)分包協(xié)議
- 車站附近人行道改造合同
- 游泳池電工招聘合同模板
- 家電銷售經(jīng)紀(jì)人合作協(xié)議
- 政府宣傳片編劇招聘協(xié)議
- 清潔能源高速公路合同管理辦法
- 社區(qū)活動中心球場施工合同
- 紡織生產(chǎn)電動工具租賃協(xié)議
- 污水處理廠改造圍擋施工合同
- 皮膚病醫(yī)院聘用協(xié)議樣本
- OGS工藝介紹(for聞泰20140715)
- 2023年上海英語高考卷及答案完整版
- 上海層廠房造價指標(biāo)
- 2023年復(fù)旦大學(xué)博士研究生入學(xué)考試專家推薦信模板
- 危險源風(fēng)險告知及控制措施(維修電工)
- 自動控制理論的早期發(fā)展歷史課件
- 國家開放大學(xué)《機(jī)械設(shè)計(jì)基礎(chǔ)》機(jī)考試題001-009參考答案
- 電氣二次系統(tǒng)簡介課件
- 《碗中日月》:作家丁立梅親自示范中考、高考真題作文60篇
- 大班科學(xué)《奇妙的信》課件
- 考古繪圖課件
評論
0/150
提交評論