🎉 Converted from 330 into a WashU help manual!

CSE347

Lecture 9

Randomized Algorithms

Hashing

Hashing with chaining:

Input: We have integers in the range $[1, N-1] = U$, where $N$ is the size of the universe. We want to map them into a hash table $T$ with $m$ slots.

Hash function: $h: U \rightarrow [m]$

Goal: Hash a set $S \subseteq U$ with $|S| = n$ into $T$ such that the number of elements in each slot is at most $1$.

Collisions

When multiple keys map to the same slot, we call it a collision. With chaining, we keep a linked list of all the keys that map to the same slot.

Runtime of insert, query, and delete $= \Theta(\text{length of the chain})$

Worst-case runtime of insert, query, and delete $= \Theta(n)$

Therefore, we want chains to be short, ideally $\Theta(1)$, as long as $|S|$ is reasonably sized. Equivalently, we want the elements of any set $S$ to hash roughly uniformly across all slots.
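A minimal sketch of hashing with chaining (the class and method names are ours, not from the lecture):

class ChainedHashTable:
    """Hash table with chaining: each slot holds a Python list (the chain)."""

    def __init__(self, m):
        self.m = m                                # number of slots
        self.table = [[] for _ in range(m)]

    def _h(self, x):
        return x % self.m                         # placeholder hash function

    def insert(self, x):
        chain = self.table[self._h(x)]
        if x not in chain:                        # cost: Theta(chain length)
            chain.append(x)

    def query(self, x):
        return x in self.table[self._h(x)]        # cost: Theta(chain length)

    def delete(self, x):
        chain = self.table[self._h(x)]
        if x in chain:
            chain.remove(x)                       # cost: Theta(chain length)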

Simple Uniform Hashing Assumptions

The $n$ elements we want to hash (the set $S$) are picked uniformly at random from $U$. Under this assumption, the following simple hash function works fine:

$$h(x) = x \bmod m$$

Question: What happens if an adversary knows this function and designs $S$ to make the worst-case runtime happen?

Answer: The adversary can make the runtime of each operation $\Theta(n)$ by simply making all the elements hash to the same slot, e.g., by choosing every element of $S$ to be a multiple of $m$.
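For instance, with $h(x) = x \bmod m$ and $m = 10$ (numbers ours), the adversary just picks multiples of $m$:

m = 10
h = lambda x: x % m

S = [10 * i for i in range(1, 8)]   # adversarial set: all multiples of m
print([h(x) for x in S])            # [0, 0, 0, 0, 0, 0, 0] -> one chain holds everything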

Randomization to the rescue

We don't want the adversary to be able to determine the hash function just by reading the code.

Idea: Randomize the choice of the hash function.

Randomized Algorithm

Definition

A randomized algorithm is an algorithm that makes internal random choices.

2 kinds of randomized algorithms:

  1. Las Vegas: The runtime is random, but the output is always correct.
  2. Monte Carlo: The runtime is fixed, but the output is sometimes incorrect.

We will focus on Las Vegas algorithms in this course.

For Las Vegas algorithms, we analyze the expected runtime, e.g., showing $E[T(n)] = O(n\log n)$, or some other probabilistic quantity.
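A classic toy illustration of the two kinds (this example is ours, not from the lecture): finding an 'a' in an array that is half 'a's and half 'b's.

import random

def las_vegas_find_a(A):
    # Las Vegas: always returns a correct index, but the number of
    # probes is random (expected 2 probes when half the entries are 'a').
    while True:
        i = random.randrange(len(A))
        if A[i] == 'a':
            return i

def monte_carlo_find_a(A, k=10):
    # Monte Carlo: at most k probes (fixed runtime), but it misses every
    # 'a' with probability (1/2)^k and then returns a wrong answer.
    for _ in range(k):
        i = random.randrange(len(A))
        if A[i] == 'a':
            return i
    return -1

A = ['a', 'b'] * 50
print(las_vegas_find_a(A), monte_carlo_find_a(A))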

Randomization can help

Idea: Randomize the choice of the hash function $h$ from a family of hash functions, $H$.

If we randomly pick a hash function from this family, then the probability that the hash function is bad on any particular set $S$ is small.

Intuitively, the adversary cannot pick a bad input, since most hash functions are good for any particular input $S$.

Universal Hashing: Goal

We want to design a universal family of hash functions, $H$, such that the probability that the hash table behaves badly on any input $S$ is small.

Universal Hashing: Definition

Suppose we have $m$ buckets in the hash table. We also have two inputs $x \neq y$ with $x, y \in U$. We want $x$ and $y$ to be unlikely to hash to the same bucket.

$H$ is a universal family of hash functions if for any two elements $x \neq y$,

$$\Pr_{h \in H}[h(x) = h(y)] \leq \frac{1}{m}$$

where $h$ is picked uniformly at random from the family $H$.
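The lecture does not give a concrete family here, but one standard construction (the Carter–Wegman family $h_{a,b}(x) = ((ax + b) \bmod p) \bmod m$, with $p$ a prime at least the universe size) is universal; a sketch:

import random

def random_universal_hash(m, p=2**31 - 1):
    # Pick h(x) = ((a*x + b) mod p) mod m uniformly from the
    # Carter-Wegman family; p = 2^31 - 1 is prime, assumed >= |U|.
    a = random.randint(1, p - 1)
    b = random.randint(0, p - 1)
    return lambda x: ((a * x + b) % p) % m

h = random_universal_hash(100)
print(h(42), h(43))   # collide with probability <= 1/100 over the choice of h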

Universal Hashing: Analysis

Claim: If we choose $h$ randomly from a universal family of hash functions $H$, then the hash table will exhibit good behavior on any set $S$ of size $n$ with high probability.

Question: What are these good properties, and what does "with high probability" mean?

Claim: Given a universal family of hash functions $H$ and $S = \{a_1, a_2, \cdots, a_n\} \subset \mathbb{N}$: for any probability $0 \leq \delta \leq 1$, if $n \leq \sqrt{2m\delta}$, the chance that no two keys hash to the same slot is $\geq 1 - \delta$.

Example: If we pick $\delta = \frac{1}{2}$: as long as $n \leq \sqrt{m}$, the chance that no two keys hash to the same slot is $\geq \frac{1}{2}$.

If we pick $\delta = \frac{1}{3}$: as long as $n \leq \sqrt{\frac{2}{3}m}$, the chance that no two keys hash to the same slot is $\geq \frac{2}{3}$.
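A quick empirical sanity check of this claim (a sketch, assuming the Carter–Wegman family from above; all parameters are arbitrary choices of ours):

import random

def no_collision_trial(n, m, p=2**31 - 1):
    # Draw a random hash from the Carter-Wegman family and report
    # whether n fixed keys land in n distinct slots.
    a, b = random.randint(1, p - 1), random.randint(0, p - 1)
    slots = {((a * x + b) % p) % m for x in range(1, n + 1)}
    return len(slots) == n

m, delta = 1000, 0.5
n = int((2 * m * delta) ** 0.5)                  # n <= sqrt(2*m*delta) ~ 31
trials = 10_000
ok = sum(no_collision_trial(n, m) for _ in range(trials))
print(ok / trials)                               # should be >= 1 - delta = 0.5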

Proof Strategy:

  1. Compute the expected number of collisions. Note that a collision occurs when two different keys hash to the same slot. (Indicator random variables)
  2. Apply a "tail" bound that converts the expected value into a probability. (Markov's inequality)

Compute the expected number of collisions

Let $m$ be the size of the hash table, $n$ the number of keys in the set $S$, and $N$ the size of the universe.

For inputs $x, y \in S$, $x \neq y$, we define a random variable

$$C_{xy} = \begin{cases} 1 & \text{if } h(x) = h(y) \\ 0 & \text{otherwise} \end{cases}$$

$C_{xy}$ is called an indicator random variable; it takes value $0$ or $1$.

The expected value of $C_{xy}$ (the probability that $x$ and $y$ collide) is

$$E[C_{xy}] = 1 \times \Pr[C_{xy} = 1] + 0 \times \Pr[C_{xy} = 0] = \Pr[C_{xy} = 1] \leq \frac{1}{m}$$

Define $C_x$: the random variable representing the cost of inserting/searching/deleting $x$ in the hash table.

$C_x \leq$ total number of elements that collide with $x$ (= the number of elements $y \neq x$ such that $h(x) = h(y)$):

$$C_x \leq \sum_{\substack{y \in S,\, y \neq x \\ h(x) = h(y)}} 1 = \sum_{y \in S,\, y \neq x} C_{xy}$$

By linearity of expectation,

$$E[C_x] \leq \sum_{y \in S,\, y \neq x} E[C_{xy}] \leq \sum_{y \in S,\, y \neq x} \frac{1}{m} = \frac{n-1}{m}$$

So $E[C_x] = O(1)$ if $n = O(m)$, and by linearity of expectation the total cost of $k$ insert/search operations is $O(k)$.

Say $C$ is the total number of collisions. Then

$$C = \frac{1}{2}\sum_{x \in S}\sum_{y \in S,\, y \neq x} C_{xy}$$

because each colliding pair is counted twice, and by linearity of expectation,

$$E[C] = \frac{1}{2}\sum_{x \in S}\sum_{y \in S,\, y \neq x} E[C_{xy}] \leq \frac{1}{2}\sum_{x \in S}\frac{n-1}{m} = \frac{n(n-1)}{2m}$$

If we want $E[C] \leq \delta$, it suffices that $\frac{n(n-1)}{2m} \leq \delta$, which holds whenever $n \leq \sqrt{2m\delta}$.
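For concreteness (numbers ours): with $m = 1000$ slots and $\delta = \frac{1}{2}$, the bound allows $n \leq \sqrt{2 \cdot 1000 \cdot \frac{1}{2}} = \sqrt{1000} \approx 31$ keys, and indeed $E[C] \leq \frac{31 \cdot 30}{2 \cdot 1000} \approx 0.47 \leq \frac{1}{2}$.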

The probability of no collisions ($C = 0$)

We know that the expected number of collisions is now $\leq \delta$, but what about the probability of NO collisions?

Markov's inequality: for a non-negative random variable $X$ and any $k > 0$,

$$\Pr[X \geq k] \leq \frac{E[X]}{k},$$

or equivalently, $\Pr[X \geq k \cdot E[X]] \leq \frac{1}{k}$.

Apply this to $C$, using $E[C] \leq \delta$ (so $\frac{1}{\delta}E[C] \leq 1$):

$$\Pr[C \geq 1] \leq \Pr\left[C \geq \frac{1}{\delta}E[C]\right] \leq \delta$$

So if $n \leq \sqrt{2m\delta}$, then $\Pr[C = 0] \geq 1 - \delta$: with probability at least $1 - \delta$, no two keys hash to the same slot.

More general conclusion

Claim: For a universal family of hash functions $H$, if the number of keys $n \leq \sqrt{Bm\delta}$, then with probability $> 1 - \delta$, at most $B + 1$ keys hash to any single slot.

Example: Quicksort

Based on partitioning [assume all elements are distinct]: Partition($A[p \cdots r]$)

  • Rearranges $A$ into $A[p \cdots q-1], A[q], A[q+1 \cdots r]$, where every element of $A[p \cdots q-1]$ is less than the pivot $A[q]$ and every element of $A[q+1 \cdots r]$ is at least $A[q]$

Runtime: $O(r - p)$, linear time.

def partition(A, p, r):
    x = A[r]                       # pivot: the last element
    lo = p                         # invariant: A[p..lo-1] < x
    for i in range(p, r):
        if A[i] < x:               # move elements smaller than the pivot left
            A[lo], A[i] = A[i], A[lo]
            lo += 1
    A[lo], A[r] = A[r], A[lo]      # place the pivot in its final position
    return lo

def quicksort(A, p, r):
    if p < r:
        q = partition(A, p, r)     # pivot ends up at index q
        quicksort(A, p, q - 1)     # sort the left part
        quicksort(A, q + 1, r)     # sort the right part
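Usage on a small array (values ours); note that $p$ and $r$ are inclusive indices:

A = [5, 2, 9, 1, 7, 3]
quicksort(A, 0, len(A) - 1)
print(A)   # [1, 2, 3, 5, 7, 9]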

Runtime analysis

Let the number of elements in $A_{\text{low}}$ (the part of the array smaller than the pivot) be $k$.

$$T(n) = \Theta(n) + T(k) + T(n - k - 1)$$

Under an even-split assumption, $k = \frac{n}{2}$:

$$T(n) = T\left(\frac{n}{2}\right) + T\left(\frac{n}{2} - 1\right) + \Theta(n) = \Theta(n\log n)$$

which is approximately the same as merge sort. (Without that assumption, the worst case is $k = 0$ or $k = n - 1$ at every step, giving $T(n) = \Theta(n^2)$.)

Average case analysis is always suspicious: real inputs need not be random, and an adversary can still supply a worst-case input.

Randomized Quicksort

  • Pick a random pivot element.
  • Analyze the expected runtime over the random choices of pivot.
 
import random

def randomized_partition(A, p, r):
    ix = random.randint(p, r)      # pick a pivot index uniformly from [p, r]
    x = A[ix]                      # pivot value
    A[r], A[ix] = A[ix], A[r]      # move the pivot to the end, then partition as before
    lo = p
    for i in range(p, r):
        if A[i] < x:
            A[lo], A[i] = A[i], A[lo]
            lo += 1
    A[lo], A[r] = A[r], A[lo]
    return lo

def randomized_quicksort(A, p, r):
    if p < r:
        q = randomized_partition(A, p, r)
        randomized_quicksort(A, p, q - 1)
        randomized_quicksort(A, q + 1, r)
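A small experiment (ours, using instrumented copies of the code above) contrasting the two pivot rules on an already-sorted input, where the deterministic last-element pivot produces the worst split at every step:

import random
import sys

def count_comparisons(A, randomized):
    # Sort a copy of A with quicksort and count element comparisons.
    A, count = list(A), 0

    def part(p, r):
        nonlocal count
        if randomized:
            ix = random.randint(p, r)        # random pivot
            A[r], A[ix] = A[ix], A[r]
        x, lo = A[r], p
        for i in range(p, r):
            count += 1
            if A[i] < x:
                A[lo], A[i] = A[i], A[lo]
                lo += 1
        A[lo], A[r] = A[r], A[lo]
        return lo

    def qs(p, r):
        if p < r:
            q = part(p, r)
            qs(p, q - 1)
            qs(q + 1, r)

    qs(0, len(A) - 1)
    return count

sys.setrecursionlimit(10_000)                # the deterministic run recurses n deep
A = list(range(1000))                        # sorted input: worst case for last-element pivot
print(count_comparisons(A, randomized=False))   # ~ n^2/2 ~ 500,000
print(count_comparisons(A, randomized=True))    # ~ 2n ln n ~ 14,000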
$$E[T(n)] = E[T(n - k - 1) + T(k) + cn] = E[T(n - k - 1)] + E[T(k)] + cn$$

by linearity of expectation.

$$\Pr[\text{pivot has rank } k] = \frac{1}{n}$$

So,

$$\begin{aligned} E[T(n)] &= cn + \sum_{k=0}^{n-1} \Pr[\text{pivot has rank } k]\left(E[T(k)] + E[T(n-k-1)]\right) \\ &= cn + \frac{1}{n}\sum_{k=0}^{n-1} E[T(k)] + \frac{1}{n}\sum_{k=0}^{n-1} E[T(n-k-1)] \\ &= cn + \frac{2}{n}\sum_{k=0}^{n-1} E[T(k)] \end{aligned}$$

where the last step uses the fact that the second sum is the first sum in reverse order.
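We can sanity-check numerically that this recurrence grows like $n\log n$ (a sketch; $c = 1$ is an arbitrary choice of ours):

import math

# E[T(n)] = c*n + (2/n) * sum_{k=0}^{n-1} E[T(k)], with c = 1 and E[T(0)] = 0.
c, N = 1, 2000
ET = [0.0] * N
prefix = 0.0                                    # running sum of ET[0..n-1]
for n in range(1, N):
    ET[n] = c * n + 2.0 * prefix / n
    prefix += ET[n]
# The ratio to n*log2(n) settles near a constant (about 2*ln(2) ~ 1.39).
print(ET[N - 1] / ((N - 1) * math.log2(N - 1)))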

Claim: the solution to this recurrence is $E[T(n)] = O(n\log n)$; specifically, $E[T(n)] \leq c'n\log n + 1$ for a suitable constant $c'$. (Below we write $T(n)$ for $E[T(n)]$.)

Proof:

We prove by induction.

Base case: $n = 1$, $T(n) = T(1) = c$.

Inductive step: Assume that $T(k) \leq c'k\log k + 1$ for all $k < n$.

Then,

$$\begin{aligned} T(n) &= cn + \frac{2}{n}\sum_{k=0}^{n-1} T(k) \\ &\leq cn + \frac{2}{n}\sum_{k=0}^{n-1}\left(c'k\log k + 1\right) \\ &= cn + \frac{2c'}{n}\sum_{k=0}^{n-1} k\log k + \frac{2}{n}\sum_{k=0}^{n-1} 1 \end{aligned}$$

Then we use the fact that $\sum_{k=0}^{n-1} k\log k \leq \frac{n^2\log n}{2} - \frac{n^2}{8}$ (can be proved by induction).

$$\begin{aligned} T(n) &\leq cn + \frac{2c'}{n}\left(\frac{n^2\log n}{2} - \frac{n^2}{8}\right) + \frac{2}{n}\cdot n \\ &= c'n\log n - \frac{1}{4}c'n + cn + 2 \\ &= (c'n\log n + 1) - \left(\frac{1}{4}c'n - cn - 1\right) \end{aligned}$$

We need to show that $\frac{1}{4}c'n - cn - 1 \geq 0$, i.e., we choose $c'$ and $c$ such that $\frac{1}{4}c'n \geq cn + 1$ for all $n \geq 2$.

If $c' \geq 8c$ (with $c \geq 1$), then $T(n) \leq c'n\log n + 1$.

$$E[T(n)] \leq c'n\log n + 1 = O(n\log n)$$

EOP

A more elegant proof:

Let $X_{ij}$ be an indicator random variable that is $1$ if the element of rank $i$ is compared to the element of rank $j$.

The total number of comparisons is $X = \sum_{i=0}^{n-2}\sum_{j=i+1}^{n-1} X_{ij}$, and the running time of randomized quicksort is proportional to it.

For each pair, the expected value of the indicator is just the probability of the comparison:

$$E[X_{ij}] = \Pr[X_{ij} = 1]\times 1 + \Pr[X_{ij} = 0]\times 0 = \Pr[X_{ij} = 1]$$

So the expected number of comparisons is

$$\begin{aligned} E[X] &= E\left[\sum_{i=0}^{n-2}\sum_{j=i+1}^{n-1} X_{ij}\right] \\ &= \sum_{i=0}^{n-2}\sum_{j=i+1}^{n-1} E[X_{ij}] \\ &= \sum_{i=0}^{n-2}\sum_{j=i+1}^{n-1} \Pr[X_{ij} = 1] \end{aligned}$$

For any two elements $z_i, z_j \in S$ of ranks $i < j$, $z_i$ is compared to $z_j$ exactly when either $z_i$ or $z_j$ is picked as a pivot before any element whose rank is strictly between $i$ and $j$ (otherwise such a middle pivot separates them forever). So

$$\begin{aligned} \Pr[X_{ij} = 1] &= \Pr[z_i \text{ is picked first}] + \Pr[z_j \text{ is picked first}] \\ &= \frac{1}{j-i+1} + \frac{1}{j-i+1} \\ &= \frac{2}{j-i+1} \end{aligned}$$
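Two sanity checks (ours): for adjacent ranks $j = i + 1$, $\Pr[X_{ij} = 1] = \frac{2}{2} = 1$, i.e., elements of adjacent rank are always compared; for the minimum and maximum ($i = 0$, $j = n - 1$), the probability is only $\frac{2}{n}$.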

So, with the harmonic number $H_n = \sum_{k=1}^{n}\frac{1}{k} = O(\log n)$,

$$\begin{aligned} E[X] &= \sum_{i=0}^{n-2}\sum_{j=i+1}^{n-1}\frac{2}{j-i+1} \\ &\leq 2\sum_{i=0}^{n-2}\sum_{k=1}^{n-i-1}\frac{1}{k} \\ &\leq 2\sum_{i=0}^{n-2} c\log n \\ &= 2c\log n\sum_{i=0}^{n-2} 1 \\ &= O(n\log n) \end{aligned}$$

EOP