The Set Cover Problem

Given

  • A ground set of \(n\) elements: \(E=\{e_1,\cdots,e_n\}\)
  • A collection of subsets of \(E\): \(S_1,\cdots,S_m\subseteq E\)
  • A non-negative weight \(w_j\geq0\) for each \(S_j,1\leq j\leq m\).

Find an \(I\subseteq\{1,\cdots,m\}\) s.t.

  • \(\bigcup_{j\in I}S_j=E\)
  • \(\sum_{j\in I}w_j\) is minimum

If \(w_j=1\) for each \(j\), the problem is called unweighted set cover.
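
To make the definitions concrete, here is a minimal sketch in Python of a small invented instance, together with a feasibility check; all names and numbers are illustrative:

```python
# A small set cover instance (illustrative values, not from the text).
E = {1, 2, 3, 4, 5}                                      # ground set, n = 5
S = {1: {1, 2}, 2: {2, 3, 4}, 3: {4, 5}, 4: {1, 3, 5}}   # subsets S_1..S_4
w = {1: 1.0, 2: 2.0, 3: 1.0, 4: 3.0}                     # non-negative weights

def is_cover(I, S, E):
    """True iff the union of S_j for j in I equals the ground set E."""
    covered = set()
    for j in I:
        covered |= S[j]
    return covered == E

def cost(I, w):
    """Total weight of the chosen index set I."""
    return sum(w[j] for j in I)

print(is_cover({1, 2, 3}, S, E), cost({1, 2, 3}, w))  # True 4.0
```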

Integer Programming Formulation

To formulate a problem as an IP, we need to define:

  • Decision variables: the decisions to be made
  • Constraints: linear inequalities on the decision variables
  • Objective function: a linear function of the decision variables

For set cover:

  • Decision variables: \(x_j=1\) if we include set \(S_j\) in the solution, \(x_j=0\) otherwise.
  • Constraints: \(\sum_{j:e_i\in S_j}x_j\geq1\) for each \(i=1,\cdots,n\), to ensure that each element \(e_i\) is covered by at least one subset.
  • Objective function: \(\sum_{j=1}^mw_jx_j\).

The IP formulation of the set cover problem is: $$ \min\sum_{j=1}^mw_jx_j $$ such that $$ \sum_{j:e_i\in S_j}x_j\geq1 \quad(i=1,\cdots,n),\qquad x_j\in\{0,1\} \quad(j=1,\cdots,m) $$
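
For a toy instance, the IP can be solved exactly by enumerating all index sets; the sketch below is a brute-force stand-in for an actual IP solver (only practical for tiny \(m\)) that mirrors the covering constraints and objective above:

```python
from itertools import combinations

def solve_ip_bruteforce(S, w, E):
    """Exact set cover by enumerating all index sets (O(2^m): tiny m only)."""
    best_I, best_cost = None, float("inf")
    indices = list(S)
    for r in range(1, len(indices) + 1):
        for I in combinations(indices, r):
            covered = set().union(*(S[j] for j in I))
            if covered == E:                      # covering constraints hold
                c = sum(w[j] for j in I)          # objective value
                if c < best_cost:
                    best_I, best_cost = set(I), c
    return best_I, best_cost

E = {1, 2, 3, 4}
S = {1: {1, 2}, 2: {3, 4}, 3: {1, 2, 3, 4}}
w = {1: 1.0, 2: 1.0, 3: 1.5}
print(solve_ip_bruteforce(S, w, E))  # ({3}, 1.5)
```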

Linear Programming Relaxation

Solving IPs is NP-hard in general. We can relax the integrality constraint to obtain a polynomial-time-solvable LP that still gives useful information: $$ \min\sum_{j=1}^mw_jx_j $$ such that $$ \sum_{j:e_i\in S_j}x_j\geq1 \quad(i=1,\cdots,n),\qquad x_j\in[0,1] \quad(j=1,\cdots,m) $$ Observations:

  • The constraint \(x_j\leq1\) is redundant
  • Any feasible solution for the IP is a feasible solution for the LP
  • The value of an optimal solution of LP \(Z_{LP}^*\) is a lower bound for the value of an optimal solution for the IP \(Z_{IP}^*\), thus \(Z_{LP}^*\leq Z_{IP}^*=OPT\).

Rounding

Deterministic Rounding

Let \(f_i\) be the number of subsets in which the element \(e_i\) appears: $$ f_i=|\{j\mid e_i\in S_j\}| $$ Let \(f=\max_{1\leq i\leq n}f_i\).

Rounding Algorithm 1

Solve LP, let \(x^*\) be an optimal solution to LP.

Define a solution \(\hat x\) for the IP as: $$ \hat x_j=\begin{cases} 1 & \text{if }x_j^*\geq\frac{1}{f}\\ 0 & \text{otherwise} \end{cases} $$ for each \(j=1,\cdots,m\).

Let \(I\) be the set of \(j\) s.t. \(\hat x_j=1\).
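
The rounding step itself is easy to sketch. Below, `x_star` is a hand-picked feasible fractional solution standing in for an actual LP optimum (no LP solver is invoked), and the instance is invented:

```python
def round_lp(x_star, S, E):
    """Deterministic rounding: include S_j exactly when x*_j >= 1/f."""
    # f = max over elements of the number of subsets containing that element
    f = max(sum(1 for Sj in S.values() if e in Sj) for e in E)
    return {j for j, xj in x_star.items() if xj >= 1.0 / f}

E = {1, 2, 3}
S = {1: {1, 2}, 2: {2, 3}, 3: {1, 3}}
# Feasible fractional solution: each element is covered to total weight 1.
x_star = {1: 0.5, 2: 0.5, 3: 0.5}
I = round_lp(x_star, S, E)
print(sorted(I), set().union(*(S[j] for j in I)) == E)  # [1, 2, 3] True
```

Here \(f=2\) (every element lies in two sets), so every set with \(x_j^*\geq\frac12\) is kept.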

Lemma: The collection of sets \(S_j\), for \(j\in I\) is a set cover.

For each element \(e_i\),

\(\sum_{j:e_i\in S_j}x_j^*\geq1\), because \(x^*\) is a solution of LP.

The sum has \(f_i\leq f\) terms, by definition of \(f_i\) and \(f\); at least one of them is \(\geq\frac{1}{f_i}\geq\frac{1}{f}\), otherwise the sum would be \(<1\).

Hence, for some \(j\) s.t. \(e_i\in S_j\), it holds that \(x_j^*\geq\frac{1}{f}\).

Therefore, \(j\in I\) and element \(e_i\) is covered.

Theorem: Rounding Algorithm 1 produces an \(f\)-approximation for set cover.

The proof consists of 3 points:

  1. It runs in polynomial time
  2. It produces a feasible solution
  3. The ratio between the value of the solution found and that of the optimal one is at most \(f\).

Points 1 and 2 are easy to see. Proof of point 3:

For each \(j\in I\), \(x_j^*\geq\frac{1}{f}\Rightarrow1\leq f\cdot x_j^*\Rightarrow w_j\leq w_j\cdot(f\cdot x_j^*)\) $$ \Rightarrow \sum_{j\in I}w_j\leq\sum_{j\in I}w_j\cdot(f\cdot x_j^*)\leq\sum_{j=1}^mw_j\cdot(f\cdot x_j^*)=f\cdot\sum_{j=1}^mw_jx_j^* $$

\[ =f\cdot Z_{LP}^*\leq f\cdot Z_{IP}^*=f\cdot OPT \]

Thus, $$ \frac{\sum_{j\in I}w_j}{OPT}\leq f. $$

Dual Rounding

Dual of the LP:

  • We pay \(y_i\) to cover element \(e_i\) (\(y_i\) is the price of \(e_i\))
  • Intuitively, we want that the elements that require high-weight subsets have a high price
  • The overall price for the elements on a set \(S_j\) cannot be more than the weight of \(S_j\): \(\sum_{i:e_i\in S_j}y_i\leq w_j\)
  • We want to find the highest total price for the elements

We can solve this problem as a linear program: $$ \max\sum_{i=1}^ny_i $$ s.t. $$ \sum_{i:e_i\in S_j}y_i\leq w_j \quad(j=1,\cdots,m),\qquad y_i\geq0 \quad(i=1,\cdots,n). $$ This linear program is the dual of the set cover LP relaxation.

The dual program has

  • a variable \(y_i\) for each constraint of the primal
  • a constraint for each variable \(x_j\) of the primal

Properties of the Dual

Weak duality: any feasible solution of the dual linear program has a value no greater than any feasible solution of the primal linear program. In set cover:

\[ \sum_{i=1}^ny_i\leq\sum_{i=1}^ny_i\sum_{j:e_i\in S_j}x_j=\sum_{j=1}^mx_j\sum_{i:e_i\in S_j}y_i\leq\sum_{j=1}^mw_jx_j \]
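
The chain above can be checked numerically on any pair of feasible primal and dual solutions; this sketch builds one such pair by hand on an invented instance:

```python
E = {1, 2, 3}
S = {1: {1, 2}, 2: {2, 3}}
w = {1: 2.0, 2: 3.0}

x = {1: 1.0, 2: 1.0}           # feasible primal: every element is covered
y = {1: 1.0, 2: 1.0, 3: 1.0}   # candidate dual prices

# Dual feasibility: total price inside each S_j is at most w_j.
for j, Sj in S.items():
    assert sum(y[i] for i in Sj) <= w[j]

dual_value = sum(y.values())                  # sum_i y_i
primal_value = sum(w[j] * x[j] for j in S)    # sum_j w_j x_j
print(dual_value, "<=", primal_value)         # 3.0 <= 5.0 (weak duality)
assert dual_value <= primal_value
```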

Strong duality: If there exists a feasible solution for both the primal and the dual linear programs, then their optimal values are equal. In set cover, for each pair of primal and dual optimal solution \(x^*\) and \(y^*\):

\[ \sum_{j=1}^mw_jx_j^*=\sum_{i=1}^ny_i^* \]

Dual Rounding Algorithm

Solve the dual LP. Let \(y^*\) be an optimal solution for the dual LP.

Choose the subsets for which the dual constraints are tight: We choose \(S_j\) if $$ \sum_{i:e_i\in S_j}y_i^*=w_j $$ Let \(I'=\{j|\sum_{i:e_i\in S_j}y_i^*=w_j\}\)

Such an algorithm also produces an \(f\)-approximation for set cover.

Lemma: The collection of sets \(S_j\), for \(j\in I'\), is a set cover.

Assume towards a contradiction that some element \(e_k\) is uncovered. Then for each \(S_j\) with \(e_k\in S_j\), \(\sum_{i:e_i\in S_j}y_i^*<w_j\).

Let \(\epsilon=\min_{j:e_k\in S_j}(w_j-\sum_{i:e_i\in S_j}y_i^*)\).

Consider a dual solution \(y'\) s.t.

\[ y_i'=\begin{cases} y_i^*+\epsilon\text{ if }i=k\\ y_i^*\text{ Otherwise} \end{cases} \]

\(y'\) is feasible for the dual program: for each \(j\) s.t. \(e_k\in S_j,\sum_{i:e_i\in S_j}y_i'=\sum_{i:e_i\in S_j}y_i^*+\epsilon\leq w_j\).

The value of \(y'\) is greater than that of \(y^*\): \(\sum_{i=1}^ny_i'=\sum_{i=1}^ny_i^*+\epsilon>\sum_{i=1}^ny_i^*\), contradicting the optimality of \(y^*\). Hence every element is covered.

Theorem: The dual rounding is an \(f\)-approximation algorithm for the set cover problem.

By the choosing condition of the algorithm:

\[ \sum_{j\in I'}w_j=\sum_{j\in I'}\sum_{i:e_i\in S_j}y_i^*(1) \]

By definition of \(f_i\):

\[ (1)=\sum_{i=1}^n|\{j\in I'|e_i\in S_j\}|\cdot y_i^*\leq\sum_{i=1}^nf_iy_i^*(2) \]

By definition of \(f\):

\[ (2)\leq f\cdot\sum_{i=1}^ny_i^*(3) \]

By strong duality and the relation between LP and IP:

\[ (3)=f\cdot\sum_{j=1}^mw_jx_j^*\leq f\cdot OPT \]

Complementary Slackness

By strong duality, we have $$ \sum_{j=1}^mw_jx_j^*=\sum_{i=1}^ny_i^*. $$ Therefore, in the weak duality chain above, the inequalities must hold with equality for \(x^*\) and \(y^*\).

  • If \(y_i^*>0\), then \(\sum_{j:e_i\in S_j}x_j^*=1\).
  • If \(x_j^*>0\), then \(\sum_{i:e_i\in S_j}y_i^*=w_j\)

Whenever a primal/dual linear variable is non-zero, the corresponding dual/primal constraint is tight.

By complementary slackness, every set chosen by the primal rounding algorithm has \(x_j^*\geq\frac1f>0\), hence a tight dual constraint, so \(I\subseteq I'\): the solution produced by the dual rounding algorithm is not better than the one produced by the primal rounding algorithm.

Primal-Dual Method

Primal-Dual Algorithm

\(y=0,I=\emptyset\)

while there exists \(e_i\not\in\bigcup_{j\in I}S_j\) do

  • Increase \(y_i\) until, for some \(\ell\) s.t. \(e_i\in S_\ell\), \(\sum_{k:e_k\in S_\ell}y_k=w_\ell\)
  • for all such sets \(S_\ell\) do: \(I=I\cup\{\ell\}\).

By using the proof of the previous lemma, we can increase \(y_i\) in the while loop by $$ \min_{j:e_i\in S_j}(w_j-\sum_{k:e_k\in S_j}y_k), $$ the constraint is tight for the \(j\) that gives the minimum.

The running time is \(O(mn)\) and the approximation factor is \(f\).
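
The loop above can be sketched as follows (a minimal implementation on an invented instance; `min(E - covered)` is an arbitrary but deterministic choice of uncovered element):

```python
def primal_dual_set_cover(S, w, E):
    """Primal-dual f-approximation: raise the price y_i of an uncovered
    element until some dual constraint goes tight, then add all tight sets."""
    y = {i: 0.0 for i in E}
    I, covered = set(), set()
    while covered != E:
        e = min(E - covered)                     # pick any uncovered element
        with_e = [j for j, Sj in S.items() if e in Sj]
        # Largest increase keeping every constraint that contains e feasible:
        delta = min(w[j] - sum(y[i] for i in S[j]) for j in with_e)
        y[e] += delta
        for j in with_e:
            if abs(sum(y[i] for i in S[j]) - w[j]) < 1e-9:  # tight constraint
                I.add(j)
                covered |= S[j]
    return I, y

E = {1, 2, 3, 4}
S = {1: {1, 2}, 2: {2, 3}, 3: {3, 4}}
w = {1: 1.0, 2: 1.0, 3: 1.0}
I, y = primal_dual_set_cover(S, w, E)
print(sorted(I))  # [1, 2, 3]
```

Note that the algorithm needs no LP solver at all, unlike the two rounding algorithms.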

Greedy Algorithm

Greedy algorithms find locally optimal solutions, which might not be globally optimal.

An advantage of greedy algorithms is that they are easy to implement.

Greedy Algorithm for Set Cover

\(I\gets\emptyset\)

for \(j=1\to m\) do

  • \(\hat S_j=S_j\)

while \(I\) is not a set cover do

  • \(\ell=\arg\min_{j:\hat S_j\neq\emptyset}\frac{w_j}{|\hat S_j|}\)
  • \(I=I\cup\{\ell\}\)
  • for \(j=1\to m\) do \(\hat S_j=\hat S_j\backslash S_\ell\)

return \(I\)

Running time \(O(m^2)\).

In each round, we choose the most cost-effective subset: the one minimizing the ratio of its weight to the number of elements it covers among those still uncovered.
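
The pseudocode above translates directly into a short sketch (instance invented; ties in the arg min are broken by dictionary order):

```python
def greedy_set_cover(S, w, E):
    """Greedy H_n-approximation: repeatedly take the subset minimizing
    weight per newly covered element."""
    uncovered, I = set(E), set()
    while uncovered:
        # arg min of w_j / |S_j intersect uncovered| over useful sets
        j = min((j for j in S if S[j] & uncovered),
                key=lambda j: w[j] / len(S[j] & uncovered))
        I.add(j)
        uncovered -= S[j]
    return I

E = set(range(1, 7))
S = {1: {1, 2, 3, 4}, 2: {4, 5, 6}, 3: {5, 6}}
w = {1: 2.0, 2: 3.0, 3: 1.0}
print(sorted(greedy_set_cover(S, w, E)))  # [1, 3]
```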

Analysis of Greedy Algorithm

Some Math Tools

Harmonic series:

  • \(H_k=\sum_{i=1}^k\frac{1}{i}\)
  • \(\lim_{k\to\infty}(H_k-\ln k)=\gamma<0.578\)
  • \(H_k\approx \ln k\)
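
A quick numerical sketch of the two facts above:

```python
import math

def H(k):
    """k-th harmonic number H_k = 1 + 1/2 + ... + 1/k."""
    return sum(1.0 / i for i in range(1, k + 1))

for k in (10, 100, 1000):
    # H_k - ln k converges to the Euler-Mascheroni constant (~0.5772)
    print(k, round(H(k) - math.log(k), 4))
```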

Given two sequences of positive numbers \(a_1,\cdots,a_k\) and \(b_1,\cdots,b_k\): $$ \min_{i\in[k]}\frac{a_i}{b_i}\leq\frac{\sum_{i=1}^ka_i}{\sum_{i=1}^kb_i}\leq\max_{i\in[k]}\frac{a_i}{b_i} $$

Theorem: The greedy algorithm is an \(H_n\)-approximation algorithm for the set cover problem.

Proof idea:

  • The optimal solution covers \(n\) elements with cost \(OPT\)
  • Therefore, there must be a subset that covers its elements with an average weight of at most \(OPT/n\).
  • After that \(k\) elements are covered, the optimal solution covers \(n-k\) elements with cost \(OPT\).
  • Therefore, there must be a subset that covers its elements with an average weight of at most \(OPT/(n-k)\).
  • The greedy algorithm pays at most \(OPT/(n-k+1)\) to cover the \(k\)-th element.
  • Approximation ratio \(\sum_{k=1}^n\frac{1}{n-k+1}=H_n\).

Formally, let the total number of iterations be \(\ell\).

At iteration \(k\):

  • \(n_k\): uncovered elements (\(n_1=n,n_{\ell+1}=0\))
  • \(I_k\): indices chosen in iterations \(1,2,\cdots,k-1\) (\(I=I_{\ell+1}\))
  • \(\hat S_j\): uncovered elements of \(S_j\) at the beginning of the iteration \(\hat S_j= S_j\backslash\cup_{p\in I_k}S_p\)

For the \(j\) chosen at iteration \(k\), we will prove that

\[ w_j\leq\frac{n_k-n_{k+1}}{n_k}\cdot OPT. \]

Let \(O\) be the set of indices of an optimal solution:

\[ \min_{j:\hat S_j\neq\emptyset}\frac{w_j}{|\hat S_j|}\leq\frac{\sum_{j\in O}w_j}{\sum _{j\in O}|\hat S_j|}=\frac{OPT}{\sum_{j\in O}|\hat S_j|} \]

\(O\) is a set cover \(\Rightarrow\cup_{j\in O}\hat S_j\) covers all the \(n_k\) uncovered elements

\[ \sum_{j\in O}|\hat S_j|\geq n_k\Rightarrow\frac{OPT}{\sum_{j\in O}|\hat S_j|}\leq\frac{OPT}{n_k}. \]

\(j\) is the index that minimizes the ratio

\[ w_j\leq\frac{|\hat S_j|OPT}{n_k}=\frac{n_k-n_{k+1}}{n_k}OPT \]

Thus,

\[ \sum_{j\in I}w_j\leq\sum_{k=1}^\ell\frac{n_k-n_{k+1}}{n_k}OPT\leq OPT\cdot\sum_{k=1}^\ell(\frac{1}{n_k}+\frac{1}{n_k-1}+\cdots+\frac{1}{n_{k+1}+1})=H_n\cdot OPT \]

Q.E.D.

Dual Fitting

We can improve the above analysis by using dual fitting.

Let \(g=\max_j|S_j|\).

Theorem: The greedy algorithm is an \(H_g\)-approximation algorithm for the set cover problem.

We construct a (possibly infeasible) dual solution \(y\) s.t. $$ \sum_{j\in I}w_j=\sum_{i=1}^ny_i $$ Suppose we choose set \(S_j\) at iteration \(k\). For each \(e_i\in\hat S_j\), we set \(y_i=w_j/|\hat S_j|\).

Variable \(y_i\) is set only once: each \(e_i\in\hat S_j\) is uncovered before iteration \(k\) and covered from iteration \(k\) on. Moreover, by definition of \(y_i\), \(w_j=\sum_{i:e_i\in\hat S_j}y_i\). Therefore, $$ \sum_{j\in I}w_j=\sum_{i=1}^ny_i. $$

To prove that \(y'=\frac{y}{H_g}\) is feasible, we must show that for each \(S_j\), $$ \sum_{i:e_i\in S_j}y_i'\leq w_j. $$ Consider a subset \(S_j\); let \(a_k\) be the number of elements in \(S_j\) that are uncovered at the beginning of the \(k\)-th iteration, so \(a_1=|S_j|\) and \(a_{\ell+1}=0\).

Let \(A_k\) be the uncovered elements in \(S_j\) that are covered in the \(k\)-th iteration, \(|A_k|=a_k-a_{k+1}\).

If a set \(S_p\) is chosen in the \(k\)-th iteration, then it is the one with the minimum ratio, therefore $$ y_i'=\frac{w_p}{H_g|\hat S_p|}\leq\frac{w_j}{H_ga_k}\forall i\in A_k $$

\[ \sum_{i:e_i\in S_j}y_i'=\sum_{k=1}^\ell\sum_{i:e_i\in A_k}y_i'\leq\sum_{k=1}^\ell(a_k-a_{k+1})\frac{w_j}{H_ga_k}\leq\frac{w_j}{H_g}\sum_{k=1}^\ell\frac{a_k-a_{k+1}}{a_k} \]
\[ \leq\frac{w_j}{H_g}\sum_{k=1}^\ell(\frac{1}{a_k}+\frac{1}{a_k-1}+\cdots+\frac{1}{a_{k+1}+1})\leq\frac{w_j}{H_g}\sum_{i=1}^{|S_j|}\frac{1}{i}=\frac{w_j}{H_g}H_{|S_j|}\leq w_j \]

Inapproximability of Set Cover

If there exists a \(c\ln n\)-approximation algorithm for the unweighted set cover problem for some constant \(c<1\), then every NP-complete problem has a \(DTIME(n^{O(\log\log n)})\)-time algorithm. (A 2014 result strengthens the conclusion to \(P=NP\).)

Randomized Rounding

Monte Carlo and Las Vegas

Two types of randomized algorithms.

Las Vegas: Returns a solution with a deterministic performance guarantee, the running time is a random variable.

Monte Carlo: Returns a solution with a performance guarantee with high probability, the running time is deterministic.

Randomized Rounding of IP

  1. Model the problem with an IP
  2. Solve LP relaxation of the IP
  3. Compute a solution to the IP by setting each variable with a random process whose probability distribution depends on the LP solution.

The solution might be infeasible or not a good approximation.

Randomized Rounding for Set Cover

Solve LP, let \(x^*\) be an optimal solution to LP.

for each \(S_j\) do

  • Flip a biased coin with bias \(x_j^*\)
  • Include \(S_j\) in the solution if the result is head // include each \(S_j\) with probability \(x_j^*\)

Random variable \(X_j=1\) if \(S_j\) is in the solution, \(0\) otherwise.

\(\Pr\{X_j=1\}=x_j^*,\Pr\{X_j=0\}=1-x_j^*\)

The approximation ratio is very good in expectation: $$ \mathbb E\left[\sum_{j=1}^mw_jX_j\right]=\sum_{j=1}^mw_j\Pr\{X_j=1\}=\sum_{j=1}^mw_jx_j^*=Z_{LP}^*\leq OPT. $$ However, the solution is likely to be infeasible (not a set cover).

Probability of Infeasible Solution

For a set \(S_j\) that contains \(e_i\):

  • \(\Pr\{S_j\text{ is included}\}=x_j^*\)
  • \(\Pr\{S_j\text{ is not included}\}=1-x_j^*\)

The probability that \(e_i\) is not covered is the probability that none of the sets containing \(e_i\) are included in the solution.

The sets \(S_j\) that contain \(e_i\) are included independently, so

\[ \Pr\{e_i\text{ is not covered}\}=\prod_{j:e_i\in S_j}(1-x_j^*)\leq\prod_{j:e_i\in S_j}e^{-x_j^*}=e^{-\sum_{j:e_i\in S_j}x_j^*}\leq e^{-1} \]

This bound is essentially tight: the probability that \(e_i\) is not covered can approach \(1/e\) arbitrarily closely.

High Probability

A family of randomized algorithms is correct with high probability if, for any constant \(c\), its chance of failure is at most an inverse polynomial \(O(n^{-c})\).

If a randomized algorithm is s.t.

\[ \Pr\{e_i\text{ is not covered}\}\leq\frac{1}{n^c} \]

for some constant \(c\),

\[ \Pr\{\exists\text{ an uncovered element}\}\leq\sum_{i=1}^n\Pr\{e_i\text{ is not covered}\}\leq\frac{1}{n^{c-1}}. \]

We produce a set cover with high probability.

Refined Randomized Rounding

For each \(S_j\), we run the coin-flipping process \(c\ln n\) times, including \(S_j\) if at least one flip comes up heads.

Since the repetitions are independent, the probability that an element \(e_i\) is not covered becomes \(\leq(e^{-1})^{c\ln n}=\frac{1}{n^c}\).
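
A sketch of the refined scheme on an invented instance; as before, a hand-picked feasible fractional solution stands in for a real LP optimum, and the RNG is seeded for reproducibility:

```python
import math
import random

def randomized_rounding(x_star, S, E, c=2, rng=None):
    """Include each S_j with probability x*_j, repeating the whole pass
    ceil(c * ln n) times so each element is missed with prob. <= 1/n^c."""
    rng = rng or random.Random(0)
    rounds = max(1, math.ceil(c * math.log(len(E))))
    I = set()
    for _ in range(rounds):
        for j, xj in x_star.items():
            if rng.random() < xj:      # biased coin with bias x*_j
                I.add(j)
    return I

E = {1, 2, 3, 4}
S = {1: {1, 2}, 2: {2, 3}, 3: {3, 4}, 4: {1, 4}}
x_star = {j: 0.5 for j in S}           # feasible: each element covered twice
I = randomized_rounding(x_star, S, E, rng=random.Random(42))
print(sorted(I), set().union(*(S[j] for j in I)) == E)
```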

Theorem: The randomized rounding algorithm is a randomized \(O(\ln n)\)-approximation algorithm that produces a set cover with high probability.

We prove 2 points:

  1. The expected value of the solution produced is within a factor \(O(\ln n)\) of the optimum.
  2. The expected value of the solution produced, conditioned on it being a set cover, is within a factor \(O(\ln n)\) of the optimum.

For the first point:

Let \(p_j(x_j^*)\) be the probability that a given subset \(S_j\) is included in the solution as a function of \(x_j^*\).

By construction \(p_j(x_j^*)=1-(1-x_j^*)^{c\ln n}\)

Derivative of \(p_j\) at \(x_j^*\):

\[ p_j'(x_j^*)=(c\ln n)(1-x_j^*)^{c\ln n-1}\leq c\ln n, \]

since \(x_j^*\in[0,1]\) and \(c\ln n\geq 1\).

\(p_j(x_j^*)\leq (c\ln n)x_j^*\), since \(p_j(0)=0\) and the slope of \(p_j\) is at most \(c\ln n\).

Let the indicator variable \(X_j=1\) if \(S_j\) is in the solution, \(0\) otherwise.

\[ \mathbb E[\sum_{j=1}^mw_jX_j]=\sum_{j=1}^mw_j\Pr\{X_j=1\}\leq\sum_{j=1}^mw_j(c\ln n)x_j^*=(c\ln n)\sum_{j=1}^mw_jx_j^*=(c\ln n)Z_{LP}^* \]

For the second point:

Let \(F\) be the event that the solution is a set cover and let \(\bar F\) be the complement of this event.

\(\Pr\{\bar F\}\leq\frac{1}{n^{c-1}}\) and \(\Pr\{F\}\ge1-\frac{1}{n^{c-1}}\).

By properties of conditional expectation

\[ \mathbb E[\sum_{j=1}^mw_jX_j]=\mathbb E[\sum_{j=1}^mw_jX_j|F]\cdot\Pr\{F\}+\mathbb E[\sum_{j=1}^mw_jX_j|\bar F]\cdot\Pr\{\bar F\}. \]

Thus,

\[ \mathbb E[\sum_{j=1}^mw_jX_j|F]=\frac{1}{\Pr\{F\}}(\mathbb E[\sum_{j=1}^mw_jX_j]-\mathbb E[\sum_{j=1}^mw_jX_j|\bar F]). \]

We have \(\Pr\{F\}\geq1-\frac{1}{n^{c-1}}\) and \(\mathbb E[\sum_{j=1}^mw_jX_j|\bar F]\geq0\); for \(c\geq2\) (and \(n\geq2\)), \(1-\frac{1}{n^{c-1}}\geq\frac{1}{2}\):

\[ \mathbb E[\sum_{j=1}^mw_jX_j|F]\leq\frac{1}{\Pr\{F\}}(\mathbb E[\sum_{j=1}^mw_jX_j])\leq\frac{(c\ln n)Z_{LP}^*}{1-\frac{1}{n^{c-1}}}\leq(2c\ln n)Z_{LP}^*. \]

Q.E.D.

Summary

The randomized algorithm is slower than the greedy algorithm.

Its approximation guarantee holds in expectation, and feasibility holds with high probability.

Why randomization?

  • Sometimes randomized algorithms are easier to describe and analyze.
  • Often a randomized algorithm can be derandomized
  • Sometimes we can prove that the performance guarantee holds with high probability.