🎉 Converted from 330 into a WashU help manual!

CSE347

Lecture 9

Randomized Algorithms

Hashing

Hashing with chaining:

Input: We have integers in the range $[1, N-1] = U$, where $N$ is the size of the universe. We want to map them into a hash table $T$ with $m$ slots.

Hash function: $h: U \rightarrow [m]$

Goal: Hash a set $S \subseteq U$ with $|S| = n$ into $T$ such that the number of elements in each slot is at most $1$.

Collisions

When multiple keys map to the same slot, we call it a collision. With chaining, we keep a linked list of all the keys that map to the same slot.

Runtime of insert, query, and delete $= \Theta(\text{length of the chain})$

Worst-case runtime of insert, query, and delete $= \Theta(n)$

Therefore, we want chains to be short, ideally $\Theta(1)$, as long as $|S|$ is reasonably sized. Equivalently, we want the elements of any set $S$ to hash roughly uniformly across all slots.
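A minimal sketch of hashing with chaining (the class and method names are ours, not from the lecture):

class ChainedHashTable:
    """Hash table with chaining: each slot holds a Python list (the chain)."""

    def __init__(self, m):
        self.m = m                                # number of slots
        self.table = [[] for _ in range(m)]

    def _h(self, x):
        return x % self.m                         # placeholder hash function

    def insert(self, x):
        chain = self.table[self._h(x)]
        if x not in chain:                        # cost: Theta(chain length)
            chain.append(x)

    def query(self, x):
        return x in self.table[self._h(x)]        # cost: Theta(chain length)

    def delete(self, x):
        chain = self.table[self._h(x)]
        if x in chain:
            chain.remove(x)                       # cost: Theta(chain length)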

Simple Uniform Hashing Assumptions

The $n$ elements we want to hash (the set $S$) are picked uniformly at random from $U$. Under this assumption, the following simple hash function works fine:

$$h(x) = x \bmod m$$

Question: What happens if an adversary knows this function and designs $S$ to make the worst-case runtime happen?

Answer: The adversary can make the runtime of each operation $\Theta(n)$ by simply making all the elements hash to the same slot, e.g., by choosing every element of $S$ to be a multiple of $m$.
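For instance, with $h(x) = x \bmod m$ and $m = 10$ (numbers ours), the adversary just picks multiples of $m$:

m = 10
h = lambda x: x % m

S = [10 * i for i in range(1, 8)]   # adversarial set: all multiples of m
print([h(x) for x in S])            # [0, 0, 0, 0, 0, 0, 0] -> one chain holds everything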

Randomization to the rescue

We don't want the adversary to be able to determine the hash function just by reading the code.

Idea: Randomize the choice of the hash function.

Randomized Algorithm

Definition

A randomized algorithm is an algorithm that makes internal random choices.

2 kinds of randomized algorithms:

  1. Las Vegas: The runtime is random, but the output is always correct.
  2. Monte Carlo: The runtime is fixed, but the output is sometimes incorrect.

We will focus on Las Vegas algorithms in this course.

For Las Vegas algorithms, we analyze the expected runtime, e.g., showing $E[T(n)] = O(n\log n)$, or some other probabilistic quantity.
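A classic toy illustration of the two kinds (this example is ours, not from the lecture): finding an 'a' in an array that is half 'a's and half 'b's.

import random

def las_vegas_find_a(A):
    # Las Vegas: always returns a correct index, but the number of
    # probes is random (expected 2 probes when half the entries are 'a').
    while True:
        i = random.randrange(len(A))
        if A[i] == 'a':
            return i

def monte_carlo_find_a(A, k=10):
    # Monte Carlo: at most k probes (fixed runtime), but it misses every
    # 'a' with probability (1/2)^k and then returns a wrong answer.
    for _ in range(k):
        i = random.randrange(len(A))
        if A[i] == 'a':
            return i
    return -1

A = ['a', 'b'] * 50
print(las_vegas_find_a(A), monte_carlo_find_a(A))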

Randomization can help

Idea: Randomize the choice of the hash function $h$ from a family of hash functions, $H$.

If we randomly pick a hash function from this family, then the probability that the hash function is bad on any particular set $S$ is small.

Intuitively, the adversary cannot pick a bad input, since most hash functions are good for any particular input $S$.

Universal Hashing: Goal

We want to design a universal family of hash functions, $H$, such that the probability that the hash table behaves badly on any input $S$ is small.

Universal Hashing: Definition

Suppose we have $m$ buckets in the hash table. We also have two inputs $x \neq y$ with $x, y \in U$. We want $x$ and $y$ to be unlikely to hash to the same bucket.

$H$ is a universal family of hash functions if for any two elements $x \neq y$,

$$\Pr_{h \in H}[h(x) = h(y)] \leq \frac{1}{m}$$

where $h$ is picked uniformly at random from the family $H$.
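The lecture does not give a concrete family here, but one standard construction (the Carter–Wegman family $h_{a,b}(x) = ((ax + b) \bmod p) \bmod m$, with $p$ a prime at least the universe size) is universal; a sketch:

import random

def random_universal_hash(m, p=2**31 - 1):
    # Pick h(x) = ((a*x + b) mod p) mod m uniformly from the
    # Carter-Wegman family; p = 2^31 - 1 is prime, assumed >= |U|.
    a = random.randint(1, p - 1)
    b = random.randint(0, p - 1)
    return lambda x: ((a * x + b) % p) % m

h = random_universal_hash(100)
print(h(42), h(43))   # collide with probability <= 1/100 over the choice of h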

Universal Hashing: Analysis

Claim: If we choose $h$ randomly from a universal family of hash functions $H$, then the hash table will exhibit good behavior on any set $S$ of size $n$ with high probability.

Question: What are these good properties, and what does "with high probability" mean?

Claim: Given a universal family of hash functions $H$ and $S = \{a_1, a_2, \cdots, a_n\} \subset \mathbb{N}$: for any probability $0 \leq \delta \leq 1$, if $n \leq \sqrt{2m\delta}$, the chance that no two keys hash to the same slot is $\geq 1 - \delta$.

Example: If we pick $\delta = \frac{1}{2}$: as long as $n \leq \sqrt{m}$, the chance that no two keys hash to the same slot is $\geq \frac{1}{2}$.

If we pick $\delta = \frac{1}{3}$: as long as $n \leq \sqrt{\frac{2}{3}m}$, the chance that no two keys hash to the same slot is $\geq \frac{2}{3}$.
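A quick empirical sanity check of this claim (a sketch, assuming the Carter–Wegman family from above; all parameters are arbitrary choices of ours):

import random

def no_collision_trial(n, m, p=2**31 - 1):
    # Draw a random hash from the Carter-Wegman family and report
    # whether n fixed keys land in n distinct slots.
    a, b = random.randint(1, p - 1), random.randint(0, p - 1)
    slots = {((a * x + b) % p) % m for x in range(1, n + 1)}
    return len(slots) == n

m, delta = 1000, 0.5
n = int((2 * m * delta) ** 0.5)                  # n <= sqrt(2*m*delta) ~ 31
trials = 10_000
ok = sum(no_collision_trial(n, m) for _ in range(trials))
print(ok / trials)                               # should be >= 1 - delta = 0.5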

Proof Strategy:

  1. Compute the expected number of collisions. Note that a collision occurs when two different keys hash to the same slot. (Indicator random variables)
  2. Apply a "tail" bound that converts the expected value into a probability. (Markov's inequality)

Compute the expected number of collisions

Let $m$ be the size of the hash table, $n$ the number of keys in the set $S$, and $N$ the size of the universe.

For inputs $x, y \in S$, $x \neq y$, we define a random variable

$$C_{xy} = \begin{cases} 1 & \text{if } h(x) = h(y) \\ 0 & \text{otherwise} \end{cases}$$

$C_{xy}$ is called an indicator random variable; it takes value $0$ or $1$.

The expected value of $C_{xy}$ (the probability that $x$ and $y$ collide) is

$$E[C_{xy}] = 1 \times \Pr[C_{xy} = 1] + 0 \times \Pr[C_{xy} = 0] = \Pr[C_{xy} = 1] \leq \frac{1}{m}$$

Define $C_x$: the random variable representing the cost of inserting/searching/deleting $x$ in the hash table.

$C_x \leq$ total number of elements that collide with $x$ (= the number of elements $y \neq x$ such that $h(x) = h(y)$):

$$C_x \leq \sum_{\substack{y \in S,\, y \neq x \\ h(x) = h(y)}} 1 = \sum_{y \in S,\, y \neq x} C_{xy}$$

By linearity of expectation,

$$E[C_x] \leq \sum_{y \in S,\, y \neq x} E[C_{xy}] \leq \sum_{y \in S,\, y \neq x} \frac{1}{m} = \frac{n-1}{m}$$

So $E[C_x] = O(1)$ if $n = O(m)$, and by linearity of expectation the total cost of $k$ insert/search operations is $O(k)$.

Say $C$ is the total number of collisions. Then

$$C = \frac{1}{2}\sum_{x \in S}\sum_{y \in S,\, y \neq x} C_{xy}$$

because each colliding pair is counted twice, and by linearity of expectation,

$$E[C] = \frac{1}{2}\sum_{x \in S}\sum_{y \in S,\, y \neq x} E[C_{xy}] \leq \frac{1}{2}\sum_{x \in S}\frac{n-1}{m} = \frac{n(n-1)}{2m}$$

If we want $E[C] \leq \delta$, it suffices that $\frac{n(n-1)}{2m} \leq \delta$, which holds whenever $n \leq \sqrt{2m\delta}$.
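For concreteness (numbers ours): with $m = 1000$ slots and $\delta = \frac{1}{2}$, the bound allows $n \leq \sqrt{2 \cdot 1000 \cdot \frac{1}{2}} = \sqrt{1000} \approx 31$ keys, and indeed $E[C] \leq \frac{31 \cdot 30}{2 \cdot 1000} \approx 0.47 \leq \frac{1}{2}$.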

The probability of no collisions ($C = 0$)

We know that the expected number of collisions is now $\leq \delta$, but what about the probability of NO collisions?

Markov's inequality: for a non-negative random variable $X$ and any $k > 0$,

$$\Pr[X \geq k] \leq \frac{E[X]}{k},$$

or equivalently, $\Pr[X \geq k \cdot E[X]] \leq \frac{1}{k}$.

Apply this to $C$, using $E[C] \leq \delta$ (so $\frac{1}{\delta}E[C] \leq 1$):

$$\Pr[C \geq 1] \leq \Pr\left[C \geq \frac{1}{\delta}E[C]\right] \leq \delta$$

So if $n \leq \sqrt{2m\delta}$, then $\Pr[C = 0] \geq 1 - \delta$: with probability at least $1 - \delta$, no two keys hash to the same slot.

More general conclusion

Claim: For a universal family of hash functions $H$, if the number of keys $n \leq \sqrt{Bm\delta}$, then with probability $> 1 - \delta$, at most $B + 1$ keys hash to any single slot.

Example: Quicksort

Based on partitioning [assume all elements are distinct]: Partition($A[p \cdots r]$)

  • Rearranges $A$ into $A[p \cdots q-1], A[q], A[q+1 \cdots r]$, where every element of $A[p \cdots q-1]$ is less than the pivot $A[q]$ and every element of $A[q+1 \cdots r]$ is at least $A[q]$

Runtime: $O(r - p)$, linear time.

def partition(A, p, r):
    x = A[r]                       # pivot: the last element
    lo = p                         # invariant: A[p..lo-1] < x
    for i in range(p, r):
        if A[i] < x:               # move elements smaller than the pivot left
            A[lo], A[i] = A[i], A[lo]
            lo += 1
    A[lo], A[r] = A[r], A[lo]      # place the pivot in its final position
    return lo

def quicksort(A, p, r):
    if p < r:
        q = partition(A, p, r)     # pivot ends up at index q
        quicksort(A, p, q - 1)     # sort the left part
        quicksort(A, q + 1, r)     # sort the right part
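Usage on a small array (values ours); note that $p$ and $r$ are inclusive indices:

A = [5, 2, 9, 1, 7, 3]
quicksort(A, 0, len(A) - 1)
print(A)   # [1, 2, 3, 5, 7, 9]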

Runtime analysis

Let the number of elements in $A_{\text{low}}$ (the part of the array smaller than the pivot) be $k$.

$$T(n) = \Theta(n) + T(k) + T(n - k - 1)$$

Under an even-split assumption, $k = \frac{n}{2}$:

$$T(n) = T\left(\frac{n}{2}\right) + T\left(\frac{n}{2} - 1\right) + \Theta(n) = \Theta(n\log n)$$

which is approximately the same as merge sort. (Without that assumption, the worst case is $k = 0$ or $k = n - 1$ at every step, giving $T(n) = \Theta(n^2)$.)

Average case analysis is always suspicious: real inputs need not be random, and an adversary can still supply a worst-case input.

Randomized Quicksort

  • Pick a random pivot element.
  • Analyze the expected runtime over the random choices of pivot.
 
import random

def randomized_partition(A, p, r):
    ix = random.randint(p, r)      # pick a pivot index uniformly from [p, r]
    x = A[ix]                      # pivot value
    A[r], A[ix] = A[ix], A[r]      # move the pivot to the end, then partition as before
    lo = p
    for i in range(p, r):
        if A[i] < x:
            A[lo], A[i] = A[i], A[lo]
            lo += 1
    A[lo], A[r] = A[r], A[lo]
    return lo

def randomized_quicksort(A, p, r):
    if p < r:
        q = randomized_partition(A, p, r)
        randomized_quicksort(A, p, q - 1)
        randomized_quicksort(A, q + 1, r)
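A small experiment (ours, using instrumented copies of the code above) contrasting the two pivot rules on an already-sorted input, where the deterministic last-element pivot produces the worst split at every step:

import random
import sys

def count_comparisons(A, randomized):
    # Sort a copy of A with quicksort and count element comparisons.
    A, count = list(A), 0

    def part(p, r):
        nonlocal count
        if randomized:
            ix = random.randint(p, r)        # random pivot
            A[r], A[ix] = A[ix], A[r]
        x, lo = A[r], p
        for i in range(p, r):
            count += 1
            if A[i] < x:
                A[lo], A[i] = A[i], A[lo]
                lo += 1
        A[lo], A[r] = A[r], A[lo]
        return lo

    def qs(p, r):
        if p < r:
            q = part(p, r)
            qs(p, q - 1)
            qs(q + 1, r)

    qs(0, len(A) - 1)
    return count

sys.setrecursionlimit(10_000)                # the deterministic run recurses n deep
A = list(range(1000))                        # sorted input: worst case for last-element pivot
print(count_comparisons(A, randomized=False))   # ~ n^2/2 ~ 500,000
print(count_comparisons(A, randomized=True))    # ~ 2n ln n ~ 14,000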
$$E[T(n)] = E[T(n - k - 1) + T(k) + cn] = E[T(n - k - 1)] + E[T(k)] + cn$$

by linearity of expectation.

$$\Pr[\text{pivot has rank } k] = \frac{1}{n}$$

So,

$$\begin{aligned} E[T(n)] &= cn + \sum_{k=0}^{n-1} \Pr[\text{pivot has rank } k]\left(E[T(k)] + E[T(n-k-1)]\right) \\ &= cn + \frac{1}{n}\sum_{k=0}^{n-1} E[T(k)] + \frac{1}{n}\sum_{k=0}^{n-1} E[T(n-k-1)] \\ &= cn + \frac{2}{n}\sum_{k=0}^{n-1} E[T(k)] \end{aligned}$$

where the last step uses the fact that the second sum is the first sum in reverse order.
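We can sanity-check numerically that this recurrence grows like $n\log n$ (a sketch; $c = 1$ is an arbitrary choice of ours):

import math

# E[T(n)] = c*n + (2/n) * sum_{k=0}^{n-1} E[T(k)], with c = 1 and E[T(0)] = 0.
c, N = 1, 2000
ET = [0.0] * N
prefix = 0.0                                    # running sum of ET[0..n-1]
for n in range(1, N):
    ET[n] = c * n + 2.0 * prefix / n
    prefix += ET[n]
# The ratio to n*log2(n) settles near a constant (about 2*ln(2) ~ 1.39).
print(ET[N - 1] / ((N - 1) * math.log2(N - 1)))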

Claim: the solution to this recurrence is $E[T(n)] = O(n\log n)$; specifically, $E[T(n)] \leq c'n\log n + 1$ for a suitable constant $c'$. (Below we write $T(n)$ for $E[T(n)]$.)

Proof:

We prove by induction.

Base case: $n = 1$, $T(n) = T(1) = c$.

Inductive step: Assume that $T(k) \leq c'k\log k + 1$ for all $k < n$.

Then,

$$\begin{aligned} T(n) &= cn + \frac{2}{n}\sum_{k=0}^{n-1} T(k) \\ &\leq cn + \frac{2}{n}\sum_{k=0}^{n-1}\left(c'k\log k + 1\right) \\ &= cn + \frac{2c'}{n}\sum_{k=0}^{n-1} k\log k + \frac{2}{n}\sum_{k=0}^{n-1} 1 \end{aligned}$$

Then we use the fact that $\sum_{k=0}^{n-1} k\log k \leq \frac{n^2\log n}{2} - \frac{n^2}{8}$ (can be proved by induction).

$$\begin{aligned} T(n) &\leq cn + \frac{2c'}{n}\left(\frac{n^2\log n}{2} - \frac{n^2}{8}\right) + \frac{2}{n}\cdot n \\ &= c'n\log n - \frac{1}{4}c'n + cn + 2 \\ &= (c'n\log n + 1) - \left(\frac{1}{4}c'n - cn - 1\right) \end{aligned}$$

We need to show that $\frac{1}{4}c'n - cn - 1 \geq 0$, i.e., we choose $c'$ and $c$ such that $\frac{1}{4}c'n \geq cn + 1$ for all $n \geq 2$.

If $c' \geq 8c$ (with $c \geq 1$), then $T(n) \leq c'n\log n + 1$.

$$E[T(n)] \leq c'n\log n + 1 = O(n\log n)$$

EOP

A more elegant proof:

Let $X_{ij}$ be an indicator random variable that is $1$ if the element of rank $i$ is compared to the element of rank $j$.

The total number of comparisons is $X = \sum_{i=0}^{n-2}\sum_{j=i+1}^{n-1} X_{ij}$, and the running time of randomized quicksort is proportional to it.

For each pair, the expected value of the indicator is just the probability of the comparison:

$$E[X_{ij}] = \Pr[X_{ij} = 1]\times 1 + \Pr[X_{ij} = 0]\times 0 = \Pr[X_{ij} = 1]$$

So the expected number of comparisons is

$$\begin{aligned} E[X] &= E\left[\sum_{i=0}^{n-2}\sum_{j=i+1}^{n-1} X_{ij}\right] \\ &= \sum_{i=0}^{n-2}\sum_{j=i+1}^{n-1} E[X_{ij}] \\ &= \sum_{i=0}^{n-2}\sum_{j=i+1}^{n-1} \Pr[X_{ij} = 1] \end{aligned}$$

For any two elements $z_i, z_j \in S$ of ranks $i < j$, $z_i$ is compared to $z_j$ exactly when either $z_i$ or $z_j$ is picked as a pivot before any element whose rank is strictly between $i$ and $j$ (otherwise such a middle pivot separates them forever). So

$$\begin{aligned} \Pr[X_{ij} = 1] &= \Pr[z_i \text{ is picked first}] + \Pr[z_j \text{ is picked first}] \\ &= \frac{1}{j-i+1} + \frac{1}{j-i+1} \\ &= \frac{2}{j-i+1} \end{aligned}$$
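Two sanity checks (ours): for adjacent ranks $j = i + 1$, $\Pr[X_{ij} = 1] = \frac{2}{2} = 1$, i.e., elements of adjacent rank are always compared; for the minimum and maximum ($i = 0$, $j = n - 1$), the probability is only $\frac{2}{n}$.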

So, with the harmonic number $H_n = \sum_{k=1}^{n}\frac{1}{k} = O(\log n)$,

$$\begin{aligned} E[X] &= \sum_{i=0}^{n-2}\sum_{j=i+1}^{n-1}\frac{2}{j-i+1} \\ &\leq 2\sum_{i=0}^{n-2}\sum_{k=1}^{n-i-1}\frac{1}{k} \\ &\leq 2\sum_{i=0}^{n-2} c\log n \\ &= 2c\log n\sum_{i=0}^{n-2} 1 \\ &= O(n\log n) \end{aligned}$$

EOP