Algorithmica (2002) 33: 201–226 DOI: 10.1007/s00453-001-0110-y
Algorithmica ©
2002 Springer-Verlag New York Inc.
Exact and Approximation Algorithms for Clustering1 P. K. Agarwal2 and C. M. Procopiuc2 ) -time algorithm for solving the k-center problem in Rd , Abstract. In this paper we present an n O(k under L ∞ - and L 2 -metrics. The algorithm extends to other metrics, and to the discrete k-center problem. We also describe a simple (1 + ε)-approximation algorithm for the k-center problem, with running time 1−1/d ) 1−1/d ) O(n log k) + (k/ε) O(k . Finally, we present an n O(k -time algorithm for solving the L-capacitated 1−1/d ) or L = O(1). k-center problem, provided that L = Ä(n/k 1−1/d
Key Words. k-Center clustering, Capacitated clustering, Approximation algorithms.
1. Introduction. Clustering a set of points into a few groups is frequently used for statistical analysis and classification in numerous applications, including information retrieval [10], [11], [12], [31], facility location [15], [34], data mining [2], [33], spatial data bases [13], [25], [30], data compression [28], image processing [24], [32], astrophysics [27], and scientific computing [9]. Because of such a diversity of applications, several variants of clustering problems have been proposed and widely studied. However, they all require a partition of a given set of points into clusters that optimizes a given objective function. In this paper we present efficient algorithms for several clustering problems. We restrict our attention to the case in which the data points are embedded in a metric space. While not all real-life applications can be modeled in this manner, a large number of them do consider their input to be points in a metric space (Rd , L p ) (L p is usually L 1 or L 2 ), and use clustering criteria similar to the ones we consider below [10], [15], [27], [30], [34]. Problem Statement and Our Model. Let S be a set of n points in a d-dimensional metric space (Rd , ρ), and let k ≤ n be a positive integer. A k-clustering of S is a partition 6 of S into k subsets S1 , . . . , Sk . Each Si is called a cluster and k is called the number of clusters. We define the size of a cluster Si to be the maximum distance (under the ρ-metric) between a fixed point ci , called the center of the cluster, and a point of Si . The size of a k-clustering 6 is the maximum size of a cluster in 6. Such a clustering is called center k-clustering.3 The underlying metric ρ depends on the application. 1
The work on this paper was partially supported by National Science Foundation Grant CCR-93-01259, by Army Research Office MURI Grant DAAH04-96-1-0013, by a Sloan fellowship, by an NYI award and matching funds from the Xerox Corporation, and by a grant from the U.S.–Israeli Binational Science Foundation. A preliminary version of this paper appeared in the Proceedings of the Ninth Annual ACM–SIAM Symposium on Discrete Algorithms, 1998. 2 Center for Geometric Computing, Department of Computer Science, Box 90129, Duke University, Durham, NC 27708-0129, USA.
[email protected],
[email protected]. 3 Another commonly used definition of the size of a cluster S is the maximum distance between any two i points of Si . Such a k-clustering is called a pairwise k-clustering. The size of a k-clustering is the maximum size of a cluster in 6. Received July 25, 2000; revised April 6, 2001. Communicated by B. Chazelle. Online publication March 4, 2002.
202
P. K. Agarwal and C. M. Procopiuc
The so-called k-center problem is defined as follows: Given a set S of n points in a d-dimensional metric space (Rd , ρ) and an integer k, compute a k-clustering of S of the smallest possible size. The k-center problem can be formulated as covering S by k congruent disks (under the ρ-metric) of the smallest possible size. If the centers of the clusters are required to be a subset of the input points, the problem is called the discrete k-center problem. In some applications (e.g., facility location [15], [34], astrophysics [27]), not only the size of a cluster is important but also the number of points in the cluster. We can generalize the notion of k-clustering as follows. Given an integer L > 0, an L-capacitated k-clustering is a k-clustering in which there are at most L points in each cluster. Given a capacity L, the L-capacitated k-center problem is to compute a k-clustering of the smallest size so that each cluster has at most L points. Previous Results. There is a vast literature on clustering problems, see, for example, the books [3], [15], [22], the survey paper [1], and the references therein. Even the simplest clustering problems are known to be NP-hard, including the Euclidean k-center problem in the plane [17], [29]. In fact, it is NP-hard to approximate the two-dimensional k-center problem within a factor smaller than 2 even under the L ∞ -metric [16]. The greedy algorithm by Gonzalez [18] gives a 2-approximation algorithm for the k-center problem in any metric space. This algorithm requires O(kn) distance computations. The running time was improved by Feder and Greene [16] to O(n log k) for any L p -metric. Several efficient algorithms have been developed for Euclidean and rectilinear k-center problems when k is small. See the survey by Agarwal and Sharir [1] for a summary of such results. The algorithm described by Gonzalez [19] can solve the planar k-center problem under the L ∞ metric in n O(l) time, when the points lie in a horizontal√strip of height l; it can be extended to higher dimensions. Hwang et al. [21] gave an n O( k) -time algorithm for the Euclidean k-center problem in the plane. Their algorithm is based on the following property: if S can be covered by k congruent disks of radius r , then there exist a set D of k disks of radius r and a line ` so that √ D covers S, each disk of D contains two points of S on its boundary, ` intersects O( k) disks of D, and at most 2k/3 disks of D are contained in each of the halfplanes bounded by `. Using this property, they develop a √ O( k) dynamic-programming-based algorithm that determines in n time whether S can be covered by k disks of radius r . By doing a binary search, √they compute an optimal k-clustering. In another paper, Hwang et al. [20] gave an n O( k) -time algorithm which also works for the Euclidean discrete k-center problem. Not much is known about the capacitated k-center problem. Bar-Ilan et al. [8] gave the first polynomial time approximation algorithm for the capacitated k-center problem, achieving an approximation factor of 10. They also proposed an approximation algorithm for the capacitated problem with fixed cluster size. The algorithm approximates the minimum number of clusters by a factor of dln ne. Khuller and Sussmann [26] improved the approximation factor for capacitated k-center to 6 (5 for a slightly different version of the problem in which the k centers need not be distinct). Approximation results for other variants of clustering problems have recently been obtained, both for their capacitated and uncapacitated versions. In the k-median prob-
Exact and Approximation Algorithms for Clustering
203
lem one must locate k facility points in the space and assign each of the original n points to a facility, so that the sum of distances is minimized. The problem has an equivalent integer programming instance, and its approximate solutions are obtained using relaxation techniques; see [23] and references therein. A different approach is presented in [6], building on the technique introduced by Arora in [4] and [5]: if the points lie in the Euclidean plane, then, with high probability, a (1 + ε)-approximation for the k-median problem is obtained in time n O(1/ε) . Yet another approach is to start with a random solution and greedily improve it, via a local search heuristic; see [7]. A generalization of the k-median problem, usually referred to as the facility location problem, considers that opening a facility has an associated cost. The goal is to minimize the overall cost for opening facilities and connecting points to facilities. Shmoys et al. [34] presented constant factor approximation algorithms for the uncapacitated and capacitated versions of this problem. The approximation factors were improved in subsequent papers; see [7] for an overview and recent results. ) Our Results. One of the main results of this paper is an n O(k -time algorithm for d the k-center problem in R , under any L p -metric (Section 2.1). In fact our algorithm works for more general metrics, as discussed at the end of Section 2.1. For the sake of simplicity, we describe the algorithm for the L ∞ -metric in the plane. In this case our algorithm is based on the following observation: if S can √ be covered by a set C of k unit squares, then either S lies in a√ horizontal strip of width k +2 or there exists a horizontal line ` that intersects at most k squares of C. Using this observation and the algorithm by Gonzalez [19], we develop a dynamic-programming-based algorithm for finding an 1−1/d ) -time algorithm for the discrete optimal solution. Our approach also yields an n O(k k-center problem in Rd . √ Recall that the algorithm by Hwang et al. [21] also runs in time n O( k) for d = 2. Our algorithm is however not only considerably simpler than theirs, but the constant hidden in the big-Oh notation is also smaller. Moreover, our algorithm is straightforward to extend to higher dimensions. In Section 3 we describe a simple (1 + ε)-approximation algorithm for the k-center 1−1/d ) . It combines the greedy problem, whose running time is O(n log k) + (k/ε) O(k algorithm by Gonzalez [18] with our algorithm for computing an optimal k-center. Finally, we study the capacitated k-center problem in Section 4. Our exact algorithm does not extend to the capacitated k-center problem for all ranges of capacity (see √ Section 4.2 for a more detailed discussion), but we present an n O( k) -time algorithm for computing an L-capacitated √ k-center under any L p -metric in the plane, provided that L = O(1) or L = Ä(n/ k). Our argument can be generalized to higher dimensions too. 1−1/d
2. The Exact Algorithm. In this section we describe a subexponential algorithm for computing the optimal k-center for a set S of n points in (Rd , L p ). For simplicity, we first describe the algorithm for the L ∞ -metric and then extend it to other metrics. As noted before, the k-center problem under the L ∞ -metric can be formulated as covering S by k congruent (axis-parallel) hypercubes of the smallest possible size. The following is a well-known result.
204
P. K. Agarwal and C. M. Procopiuc
LEMMA 2.1. Let S be a set of points, and let σ ∗ be the size of the optimal k-clustering of S under the L ∞ -metric. Then σ ∗ = d∞ ( pi , p j )/2 for some pi , p j ∈ S. PROOF. If C is an optimal cover of S, then there exist a hypercube C ∈ C and two opposite facets f 1 , f 2 of C so that each of f 1 and f 2 contains at least one point of S. Otherwise, we can shrink all hypercubes of size σ ∗ and obtain a cover of S of smaller size. Hence, 2σ ∗ = d∞ ( pi , p j ) for some pi , p j ∈ S. By the above lemma, the size of an optimal k-clustering is half of one of the O(n 2 ) distances (under the L ∞ -metric) between pairs of input points. The algorithm performs a binary search on the set of these distances, testing at each step whether S can be covered by k congruent hypercubes of size σ, for some given σ > 0. A similar lemma holds for any L p -metric. For example, if the metric is L 2 , then the clusters are balls and there exists a k-clustering of S of smallest size so that each ball in the cluster is defined by at most d + 1 points of S [14]. Then σ ∗ is the radius of a ball defined by at most d + 1 input points. There are O(n d+1 ) such radii, and we can compute each one in time O(d 3 ). It thus suffices to describe an algorithm for the decision problem. We first present the decision procedure for the planar case under the L ∞ -metric (Section 2.1), and then discuss how to extend our algorithm to any L p -metric and to higher dimensions (Section 2.2). Even for moderate values of n, the decision procedure of Section 2.1 is not practical. However, we can obtain a faster algorithm if we accept an approximate solution for the k-center problem. We present such an algorithm in Section 3. 2.1. The Decision Procedure in (R2 , L ∞ ). Assume without loss of generality that σ = 12 . In this case the problem is to decide whether S can be covered by at most k axis-parallel unit squares (i.e., squares of side length 1), and if so, to return such a cover. In general, S can have more than one cover by at most k unit squares, and the decision procedure may return any one of them. It will always be convenient to compute a “canonical” cover, called the optimal cover of S, which we define below. DEFINITION 2.2. 2n-vector
The index of a set of squares C = {C1 , . . . , Cq }, for q ≤ n, is the
I(C) = (∞, . . . , ∞, x 1 , . . . , xq , ∞, . . . , ∞, y1 , . . . , yq ), | {z } | {z } n
n
where (xi , yi ) is the center of the square Ci , 1 ≤ i ≤ q, and x 1 ≤ x2 ≤ · · · ≤ xq . If xi = xi+1 , then yi < yi+1 . The multisets {xi }i=1,q and {yi }i=1,q are called the x-set and y-set of C, respectively. Given a set S of n points, the optimal cover of S by unit squares is the cover C ∗ whose index is maximal in the lexicographic order. An immediate observation is that the optimal cover of S has the minimum number of squares among all the covers of S. By definition, the optimal cover is unique. The following lemma will be useful later: LEMMA 2.3. Let C be a set of at most n squares. For any subset A ⊆ C and for any set of squares B, if I(B) >lex I(A), then I((C\A) ∪ B) >lex I(C).
Exact and Approximation Algorithms for Clustering
205
Fig. 1. An anchored cover of S.
PROOF. Let C 0 = (C\A) ∪ B. The assumption I(B) >lex I(A) implies that |B| ≤ |A|, so |C 0 | ≤ |C| ≤ n. If |C 0 | < |C|, then clearly I(C 0 ) >lex I(C). Assume that |C 0 | = |C|, which implies |A| = |B|. Let X A and X B denote the x-sets of A and B, respectively, sorted in a nondecreasing order. The condition I(A)
The optimal cover of S is an anchored cover.
Given an integer k, our algorithm either computes the optimal cover (if the number of squares in the optimal cover is at most k) or returns that S cannot be covered by k unit squares. The algorithm is based on the following property of the optimal cover, which we prove later in the section: there exists a set of horizontal lines that √ divide the plane into strips so that any strip that contains input points has height at most k + 2 and each √ of these lines intersects at most k squares of the optimal cover. We refer to these strips as “thin” strips. We compute the optimal covers within these strips and then take their union. Extra care is taken to handle squares that intersect the strip boundaries. We show that the lines can be chosen from a finite set of lines, and we use dynamic programming. To compute the optimal cover on a thin strip, we use a modified version of the algorithm by Gonzalez [19], which we refer to as the STRIP ALGORITHM. Strip Algorithm. Let S be a set of n points in R2 lying in a horizontal strip of height l (i.e., the y-coordinates of the points in S differ by at most l). Gonzalez [19] describes
206
P. K. Agarwal and C. M. Procopiuc
a sweep-line algorithm for computing an optimal cover of S in time n O(l) . However, he defines an optimal cover to be any anchored cover of S by the minimum number of squares. We want to be more precise and use our definition of the optimal cover. We show how to modify the algorithm so that it computes the optimal cover defined earlier. The algorithm relies on the following lemma, whose proof follows from a simple packing argument, as shown in [19, Lemma 2]. LEMMA 2.6 [19]. Let S be a set of n points lying in a horizontal strip of height l. Every vertical line intersects at most 2l − 1 squares of the optimal cover of S. We say that a set of squares C is a subcover with respect to a vertical line if all squares in C are anchored and C covers all points of S that lie to the left of or on the vertical line. If we swept a line from left to right and generated all possible subcovers with respect to the sweep line, then, at the end, we would have generated the optimal cover. However, such a method would have exponential running time. Gonzalez [19] makes the observation that, for any subcover C, if a square C ∈ C does not intersect the current sweep line, then C can be ignored for the remainder of the procedure since it cannot cover any of the remaining points. It thus suffices to “identify” a subcover with the subset of its squares that intersect the sweep line. Formally, we define the signature of a subcover C with respect to a vertical line ` to be the set of squares in C that intersect `. If two subcovers have the same signature with respect to the sweep line, it suffices to maintain only the one that has the larger index. We show below that such an approach still ensures that at the end we generate the optimal cover. The algorithm sweeps the plane from left to right and maintains a family F = {C1 , . . . , Cu } of all possible subcovers with respect to the sweep line, so that each Ci satisfies the following properties: (i) At most 2l − 1 squares of Ci are intersected by the sweep line. (ii) Let 6i be the signature of Ci with respect to the sweep line. Then 6i 6= 6 j for any i 6= j. Moreover, Ci has the largest index among all subcovers with signature 6i with respect to the sweep line. REMARK 2.7. In [19] invariant (ii) is different. The subcover Ci is required to have the smallest size among all subcovers whose signature with respect to the sweep line is 6i . We have modified this condition to ensure that the algorithm generates the optimal cover according to our definition. The event points are the x-coordinates of the points in S, at which the sweep line stops and updates F so that (i) and (ii) are satisfied. For the sake of convenience, we add x = −∞ as an event point. Let x 0 = −∞ < x1 = x( p1 ) ≤ · · · ≤ xn = x( pn ) be the event points ( p1 , . . . , pn are the points of S sorted in a nondecreasing order of their x-coordinates). Initially, F = {∅} and the sweep line is in x0 . Whenever the sweep line reaches a new event xi (hence, it passes through point pi ), the algorithm updates the set F by executing,
Exact and Approximation Algorithms for Clustering
207
for each C j ∈ F, the following two steps: (a) If pi is covered by C j , keep C j in F. (b) If the sweep line intersects less than 2l − 1 squares of C j , then do the following. For each q ∈ S, let Cq be the unit square having pi as its left anchor and q as its bottom anchor (assuming this is possible). Add C j ∪ {Cq } to F. After these steps are executed for each C j ∈ F, we execute a filtering phase. For any two subcovers in F that have identical signatures with respect to the sweep line, throw away the subcover with the smaller index. Thus, after F is updated, all its elements are subcovers with respect to the current position of the sweep line and satisfy invariants (i) and (ii). At the end of the algorithm, we report the element of F with the maximum index. Let C ∗ denote the optimal cover of S, and let F f be the final set F. We prove that ∗ C ∈ Ff . Because all elements of F f are covers of S, it follows that the procedure returns C ∗ . LEMMA 2.8.
C ∗ ∈ Ff .
PROOF. For 0 ≤ i ≤ n, let Ci∗ ⊆ C ∗ be the set of squares that intersect the halfplane x ≤ xi , and let Fi be the set F after the algorithm processed the point pi (thus, Fn = F f ). We prove by induction on i (0 ≤ i ≤ n) that Ci∗ ∈ Fi , and the claim follows. Since C0∗ = ∅, the induction hypothesis holds for i = 0. Suppose that C j∗ ∈ F j for all ∗ 0 ∈ Fi+1 . Let Fi+1 be the set of subcovers generated from j ≤ i. We now prove that Ci+1 0 after the filtering phase). Fi by applying rules (a)–(b) (hence, Fi+1 is obtained from Fi+1 ∗ 0 ∗ ∗ 0 We first prove that Ci+1 ∈ Fi+1 . There are two cases. If Ci+1 = Ci∗ , then Ci+1 ∈ Fi+1 by ∗ ∗ update rule (a). Otherwise, Ci+1 = Ci ∪ {C}, where C is an anchored square having pi as its left anchor. Thus, C is one of the squares Cq considered in update rule (b) and so ∗ 0 ∈ Fi+1 . Ci+1 ∗ ∗ ∈ Fi+1 , i.e., Ci+1 is not one of the sets eliminated by filtering. We now prove that Ci+1 0 ∗ Let C ∈ Fi+1 , C 6= Ci+1 , be a subcover with respect to the line `: x = xi+1 , so that C ∗ ∗ and Ci+1 have identical signatures with respect to `. We argue that I(C)
lex I(Ci+1 ). Both Ci+1 and C cover the points p1 , . . . , pi+1 , and they have identical signatures with respect to `. It ∗ is also covered by C. Hence, the set Cnew = then follows that any point covered by Ci+1 ∗ ∗ (C \Ci+1 ) ∪ C is a cover of S. By Lemma 2.3, I(Cnew ) >lex I(C ∗ ), a contradiction. Condition (i) implies that |F| = O(n 4l−2 ) for each position of the sweep line. The set F is updated at most n times. At each event point, steps (a) and (b) for all elements of F, together with the filtering phase, can be executed in time O(n · |F|). See the original paper [19] for details. Even though our filtering phase uses a different criterion than [19], its implementation is very similar. We thus conclude with the following. LEMMA 2.9. Given a set S of n points lying in a strip of height l, the STRIP ALGORITHM described above computes the optimal cover of S in time O(n 4l ).
208
P. K. Agarwal and C. M. Procopiuc
COROLLARY 2.10. n points in R2 that lie in a strip of √ √ The optimal cover of a set 4of k+8 height at most k + 2 can be computed in O(n ) time. Handling Thick Strips. We now describe the algorithm for the general case in which S √ may not lie in a horizontal strip of height at most k +2. Let Y be the set of y-coordinates of the horizontal sides of anchored squares, i.e., Y = {y( p), y( p) + 1 | p ∈ S}. We choose λ to be a small constant so that λ < min 0
y,y ∈Y, y6= y 0
1 |y − y 0 | ≤ , 2 2
and we define a set H of horizontal lines as follows: (∗)
H = {y = yi + λ, y = yi − λ | yi ∈ Y }.
LEMMA 2.11. Let S be a set of n points in R2 lying in a horizontal strip w, and let C be √ an anchored cover of S by at most k unit squares. If the height of w is greater than √k + 2, then there exists a line h ∈ H lying in the interior of w that intersects at most k squares of C. PROOF. Let `1 and `2 denote the lower (resp. upper) boundary of w, and let `01 , `02 ∈ H be the lines that lie immediately above `1 and immediately below `2 , respectively. If the distance between `1 and `01 is more than 1, we claim that `01 does not intersect any square of C. Suppose to the contrary that `01 intersects a square C ∈ C. Let the equation of `01 be y = ya , and let yb be the y-coordinate of the bottom side of C. Since C is anchored, a point of S lies on the bottom side of C, and therefore yb ∈ Y and the lines y = yb −λ and y = yb + λ are in H. Consequently, ya ≤ yb + λ and yb − λ ≥ ya − 2λ > ya − 1, which implies that the line y = yb − λ lies in w. This contradicts the fact that `01 is the first line in H lying above `1 . Hence, `01 does not intersect any square of C and we choose h = `01 . Similarly, if the distance between `2 and `02 is more than 1, we can choose h = `02 . So assume that the distance between `i and `i0 is less than 1. Then the horizontal strip √ √ w 0 bounded by `01 and `02 has height k +δ for some δ > 0. Fix a parameter ε < δ/(2 k) and divide w 0 by equidistant horizontal lines h 1 , . . . , h r , so that the distance between h i and h i+1 is 1 + ε, the distance between `01 and h 1 is ε, and the distance between h r and `02 is at most 1 + ε. See Figure√ 2. Each square in C intersects at most √ one √ h i . Suppose that each h i intersects more than k√squares of C.√Then √ r < |C|/ k ≤ k and the height of w 0 is at most ε + (1 + ε)r < k + ε(1 + k) <√ k + δ, a contradiction. Hence, there exists an h i that intersects at most k squares of C. Let y − (resp. y + ) be the predecessor (resp. successor) in Y of the y-coordinate of h i . By construction, H + . The line h intersects the same set of contains a line h: y = γ so that y − < γ < y√ squares of C as h i , so h also intersects at most k squares of C. DEFINITION 2.12. (i) For any h ∈ H and D ⊆ 0, we say that (h, D) is a canonical √ pair if |D| ≤ k and all squares in D intersect h.
Exact and Approximation Algorithms for Clustering
209 `2 0 `2
hr
h3 h2 h1
0
`1 `1
Fig. 2. Finding line h in Lemma 2.11.
(ii) Let w be a horizontal strip with lower boundary `1 and upper boundary `2 , and let (`1 , A) and (`2 , B) be canonical pairs. We define a c-strip to be the triple τ = (w, A, B). Let Sτ ⊆ S be the set of input points that lie in w, but not in any square of A ∪ B. We define Cτ , the optimal cover associated with τ, to be the optimal cover of Sτ . (iii) A c-strip τ = (w, A, B) is called legal if no square of Cτ intersects the boundary of w, and |Cτ ∪ A ∪ B| ≤ k. Otherwise, τ is illegal. Using Lemma 2.11, we obtain the following result. √ LEMMA 2.13. Let τ = (w, A, B) be a legal c-strip of height greater than k + 2. If Sτ 6= ∅, then there exists a canonical pair (h, D) so that h divides w into two strips w− , w+ , each of height strictly less than the height of w, and so that τ− = (w− , A, D) and τ+ = (w+ , D, B) are legal c-strips. Moreover, Cτ = Cτ− ∪ Cτ+ ∪ (D\(A ∪ B)). √ PROOF. By Lemma 2.11, there exists a line h ∈ H that intersects at most k squares of Cτ ∪ A ∪ B. Let D be the set of squares in Cτ ∪ A ∪ B intersected by h. Then τ− = (w− , A, D) and τ+ = (w+ , D, B) are c-strips. Let C1 (resp. C2 ) be the set of squares of Cτ that lie strictly below (resp. above) h. Then C1 (resp. C2 ) is the optimal cover of Sτ− (resp. Sτ+ ). Indeed, if some other set C10 6= C1 were the optimal cover of Sτ− , then (Cτ \C1 ) ∪ C10 would be a cover of Sτ and, by Lemma 2.3, I((Cτ \C1 ) ∪ C10 ) >lex I(Cτ ), contradicting the optimality of Cτ . A similar proof holds for C2 . Since τ is a legal c-strip, no square of C1 ∪ C2 intersects the boundaries of τ and |C1 ∪ A ∪ D|, |C2 ∪ D ∪ B| ≤ |Cτ ∪ A ∪ B| ≤ k. Therefore, τ+ and τ− are legal c-strips. From the discussion above, Cτ = C1 ∪ C2 ∪ (D\(A ∪ B)) = Cτ− ∪ Cτ+ ∪ (D\(A ∪ B)). The Decision√Algorithm. In the following we say that a c-strip τ is thin if the height of w is at most k + 2. Otherwise, τ is thick. Lemma 2.13 suggests the following recursive algorithm. Start with a legal strip τ B that contains all input points. If τ B is a thin strip, compute Cτ B using the STRIP ALGORITHM. Otherwise, divide τ B into two smaller legal strips by a canonical pair (h, D), and solve the problem recursively for each of the two strips. Instead of considering all possible ways of dividing a thick legal c-strip, we use dynamic programming and work our way up the recursion tree, starting from the leaves that correspond to thin strips. We now describe the algorithm in more detail.
210
P. K. Agarwal and C. M. Procopiuc
The algorithm builds a table, each of whose entries is indexed by a c-strip τ = (w, A, B). If τ is legal, then the entry stores k(τ ) = |Cτ | and the index I (τ ) of Cτ ; otherwise, k(τ ) is set to ∞ and I (τ ) is undefined. Fill the entry τ as follows: • If Sτ = ∅, then k(τ ) = 0 and I (τ ) = (∞, . . . , ∞). • If τ is a thin strip, compute Cτ using the STRIP ALGORITHM for Sτ . If any square of Cτ intersects a boundary of w or if |Cτ ∪ A ∪ B| > k, then k(τ ) = ∞ and I (τ ) is undefined. Otherwise, store k(τ ) and I (τ ). • If τ is a thick strip, compute all canonical pairs (h, D) for which h lies inside w. For each canonical pair (h, D), let w− and w+ be the portions of w lying below and above h, respectively, and let τ− = (w− , A, D) and τ+ = (w+ , D, B). Compute µτ = min (k(τ− ) + k(τ+ ) + |D\(A ∪ B)|), (h,D)
where the minimum is taken over all canonical pairs. Let ( if µτ + |A ∪ B| ≤ k, µτ k(τ ) = ∞ otherwise. If k(τ ) = ∞, then I (τ ) is undefined. Otherwise, let (h ∗ , D∗ ) be the pair for which the minimum is attained. Then I (τ ) is the index of Cτ−∗ ∪ Cτ+∗ ∪ (D∗ \(A ∪ B)), where τ−∗ and τ+∗ are the two c-strips determined by (h ∗ , D∗ ) (as described in the proof of Lemma 2.13). This index can be computed in O(k) time from I (τ−∗ ) and I (τ+∗ ). If more than one canonical pair attains the minimum, then choose the canonical pair that maximizes I (τ ). Let hw1 , . . . , w M i be the sequence of horizontal strips determined by pairs of lines in H, sorted in a nondecreasing order of their heights, M = O(n 2 ). Fill out all entries of the table having wi as the first component of the index before computing the entries with wi+1 in their index. Let w B be the strip bounded by the lowest and highest lines of H. We claim that if S can be covered by at most k unit squares, then the entry corresponding to τ B = (w B , ∅, ∅) gives the size and the index of the optimal cover of S. If k(τ B ) ≤ k, the algorithm returns YES together with k(τ B ) and I (τ B ). Otherwise, the algorithm returns NO. By induction on i, we can prove that if τ = (wi , A, B) is a legal c-strip, then the algorithm sets k(τ ) = |Cτ | and I (τ ) = I(Cτ ). If τ is an illegal c-strip, then the algorithm sets k(τ ) = ∞. This implies the claim above and proves that the decision procedure works correctly. To analyze the running time, we need an estimate on the number of possible c-strips. √ There are O(n 2 ) different choices for w. There are O(n 2 k ) different choices for each of A and B (because the squares are√ anchored, we need two input points to specify a square), and so the table size is O(n 4 k+2 ). By Corollary 2.10, the running time of STRIP √ 4 k+8 ALGORITHM is O(n ). For a thick c-strip, the√time to fill out its entry is proportional to the√number of canonical pairs, which is O(n 2 k+1 ). Hence, the total running time is O(n 8 k+10 ). We obtain the following. LEMMA 2.14. Let S be a set of n points in the metric space (R2 , L ∞ ), and let k √> 0 be an integer. We can decide whether S can be covered by k unit squares in time O(n 8 k+10 ).
Exact and Approximation Algorithms for Clustering
211
By a binary search over the set of L ∞ -distances between pairs of points in S we compute the optimal cover of S by k congruent squares. Thus, we conclude with the following result. THEOREM 2.15. Let S be a set of n points in the metric space (R2 , L ∞ ),√and let k > 0 be an integer. We can compute the optimal k-clustering of S in time O(n 8 k+10 log n). 2.2. The Decision Procedure in (Rd , L p ). We first extend the previous algorithm to (R2 , L p ), and then extend it to (Rd , L ∞ ). The extension to (Rd , L p ) is a combination of the two. The Decision Procedure in (R2 , L p ). If the underlying metric is L p , the decision problem becomes whether S can be covered by k L p -disks of radius 12 (by an L p -disk we mean a disk in the L p -metric). The index of a set of L p -disks is defined similarly to the index of a set of squares (see Definition 2.2). A disk in a cover is anchored if it has two input points on its boundary that are not contained in any other disk of the cover. Any cover can be transformed into a cover with only anchored disks, without increasing the number of disks in it. We again define the optimal cover of S to be the index-maximal set of anchored disks that cover S. This implies that the optimal cover has the minimum number of disks among all covers of S. We now define the set Y as follows: for every pair of input points, compute the (at most) two L p -disks of radius 12 passing through those points; for each of the disks, the y-coordinates of the lowest and highest point on the disk boundary are added to the set Y. Let H be the set of O(n 2 ) lines as defined in (∗). Lemma 2.11 remains the same, the STRIP ALGORITHM can be extended to any L p -metric [19], and the decision procedure can be modified in a straightforward manner. The size ALGORITHM are computed in a similar manner, of the table and the running time of STRIP √ with only the constants hidden in the O( k) notation changing. We conclude with the following. and let k > 0 THEOREM 2.16. Let S be a set of n points in the metric space (R2 , L p ), √ be an integer. We can compute the optimal k-clustering of S in time n O( k) . REMARK 2.17. The algorithm can be modified to solve the discrete k-center problem, as well. The techniques presented in this section extend to any metric ρ for which the following conditions are met: the disk of radius 12 under the metric ρ has constant complexity and constant aspect ratio, and a unit square can be covered by a constant number of disks of radius 12 under the metric ρ. The Decision Procedure in (Rd , L ∞ ). The main ingredients we need for computing the exact cover in (Rd , L ∞ ) are the STRIP ALGORITHM in Rd and an extension of Lemma 2.11 to higher dimensions. We define a d-dimensional strip to be the intersection of 2d − 2 halfspaces of the form xi ≥ ai and xi ≤ bi , for 1 ≤ i ≤ d − 1, where ai , bi ∈ R. We denote by xi ( p) the ith coordinate of the point p. A d-dimensional unit hypercube is defined as [ai , bi ] × · · · × [ad , bd ], where bi − ai = 1 for all 1 ≤ i ≤ d. Let C = {C1 , . . . , Cq } be a set of unit hypercubes, and let pi be the center of Ci , 1 ≤ i ≤ q. We define the index
212
P. K. Agarwal and C. M. Procopiuc
of C, denoted I(C), to be the dn-vector obtained by concatenating the n-vectors (∞, . . . , ∞, xi ( p j1 ), . . . , xi ( p jq )),
1 ≤ i ≤ d,
where ( j1 , . . . , jq ) is a permutation of (1, . . . , q), satisfying the following condition: for any s, 1 ≤ s ≤ q, (x1 ( p js ), . . . , xd ( p js )) 0 1−1/d ) . be an integer. We can compute the optimal k-clustering of S in time n O(k
3. The Approximation Algorithm. We now present an approximation algorithm for the k-center problem that returns a (1 + ε)-approximation of the optimal cover. More precisely, we want to solve the following problem: Given a set S of n points in the metric space (R2 , L p ), an integer k ≤ n, and a real parameter ε > 0, compute a k-clustering of S of size at most (1 + ε)σ ∗ , where σ ∗ is the size of an optimal k-clustering of S. For ε ≥ 1, we can use the algorithm by Feder and Greene [16] to compute a 2-approximation in O(n log k) time, so we assume 0 < ε < 1. We present the algorithm for the metric space (R2 , L p ). The extension to higher dimensions is natural. First, compute a real number σ0 so that σ ∗ ≤ σ0 ≤ 2σ ∗ , using
Exact and Approximation Algorithms for Clustering
213
Fig. 3. The sets S and P. Black circles are points of S and white circles are points of P.
the algorithm by Feder and Greene [16] in time O(n log k). Let Z be a grid of size δ=
(∗∗)
σ0 ε , 3 · 21/ p
i.e., Z = {(iδ, jδ) | i, j ∈ Z}. Let P be the set of grid points v so that at least one of the grid cells adjacent to v contains a point of S. See Figure 3. Next, compute the optimal cover of P in (R2 , L p ), using the algorithm described in the previous section. Suppose the algorithm returns C = {C1 , . . . , Ck } as the optimal cover of P, and that the size of each L p -disk Ci is σ1 . For each Ci ∈ C, let Ci0 be the L p disk with the same center as Ci and of size σ2 = σ1 + σ0 ε/6. Return C 0 = {C10 , . . . , Ck0 }. The proof of correctness of the above algorithm follows from the following lemma. LEMMA 3.1. (i) C 0 is a cover of S. (ii) σ2 ≤ (1 + ε)σ ∗ . PROOF. (i) For any q ∈ S, there exists a point v ∈ P so that d∞ (q, v) ≤ δ/2. Indeed, q lies in a grid cell of size δ, so at least one vertex v of the cell is within L ∞ -distance δ/2 from q. Then, by (∗∗), d p (q, v) ≤ 21/ p d∞ (q, v) ≤ σ0 ε/6. If Ci is an L p -disk of C that covers v, then Ci0 covers p. (ii) We first prove that σ1 ≤ σ ∗ + σ0 ε/3. Let C ∗ be the optimal cover of S. For each L p -disk Ci∗ ∈ C ∗ , let Ci` be the L p -disk of size σ ∗ + 21/ p δ and with the same center as Ci∗ . We claim that C ` = {C1` , . . . , Ck` } is a cover of P. Indeed, for any v ∈ P, there exists q ∈ S so that q lies in a cell adjacent to v (by construction). Then d∞ (q, v) ≤ δ and so d p (q, v) ≤ 21/ p δ. If Ci∗ is an L p -disk of C ∗ that covers q, then Ci` covers v. Because C is the optimal cover of P, we obtain σ1 ≤ σ ∗ + 21/ p δ ≤ σ ∗ + σ0 ε/3. Hence, σ2 = σ1 + σ0 ε/6 ≤ σ ∗ + σ0 ε/2 ≤ (1 + ε)σ ∗ .
214
P. K. Agarwal and C. M. Procopiuc
An easy calculation shows√that |P| = O(k/ε2 ). Hence, the optimal cover for P can be computed in time (k/ε) O( k) . We thus obtain the following. THEOREM 3.2. Let S be a set of n points in the metric space (R2 , L p ), and let k > 0 be an integer. For any ε > 0, we can compute a k-clustering of S of size within (1 + ε) √ O( k) . of the optimum in time O(n log k) + (k/ε) Extending this approach to higher dimensions, we obtain the following result. THEOREM 3.3. Let S be a set of n points in the metric space (Rd , L p ), and let k > 0 be an integer. For any ε > 0, we can compute a k-clustering of S of size within (1 + ε) 1−1/d ) . of the optimum in time O(dn log k) + (k/ε) O(k
4. The Capacitated k-Center Problem. In this section we propose a subexponential algorithm for the L-capacitated k-center problem in Rd , provided that L, the maximum number of points allowed in a cluster, is either Ä(n/k 1−1/d ) or O(1). To simplify the exposition, we assume that the points in S have distinct x- and y-coordinates. The overall approach is similar to that in Section 2. We describe the algorithm only for R2 under the L ∞ -metric. The algorithm can be extended to other metrics and to higher dimensions following the same ideas as in Section 2. Under the L ∞ -metric, the decision problem for the L-capacitated k-center can be formulated as follows: Let S be a set of n points in the plane, and let k and L be two positive integers. Do there exist a set C of k unit squares and an assignment of the points in S to C so that each point p of S is assigned to exactly one of the squares containing p and each square is assigned at most L points? If the answer is “yes,” the algorithm should return a set C of at most k squares and an assignment function χ: S → C that satisfies the required conditions. We therefore denote the output as (C, χ). Throughout this section, by a cover of S we mean an L-capacitated cover of S by unit squares, together with an assignment function satisfying the required conditions. We use the term smallest cover to denote a cover with the minimum number of squares. Due to some technical details, we are forced to abandon the definition of optimal cover from the previous section. However, we guarantee that the decision algorithm described below computes a smallest capacitated cover by unit squares, which in turn guarantees that the overall algorithm computes an L-capacitated k-clustering of minimum size. We define an anchored square as in Definition 2.4. By our assumption on the coordinates of points in S, the anchors of an anchored square are unique. We therefore identify an anchored square C with its anchors ( pi , p j ), where pi is the left anchor, and p j is the bottom anchor of C. With a slight abuse of notation, we will not distinguish between an anchored square and its anchors. A cover (C, χ) is called an anchored cover if each square C ∈ C is anchored and the anchors of C are assigned to C itself. The following analog of Lemma 2.5, whose proof is omitted here, suggests that it suffices to compute a smallest anchored cover. LEMMA 4.1.
There exists a smallest cover (C, χ) of S so that (C, χ) is anchored.
Exact and Approximation Algorithms for Clustering
215
In the remainder of this section, whenever we refer to a square, we assume it to be anchored. 4.1. Decision Algorithm. We now describe √ our decision algorithm for the L-capacitated k-center problem when L = Ä(n/ k) or L = O(1). The restriction on L comes from the fact that we use a modified version of STRIP ALGORITHM, called CAPACITATED STRIP ALGORITHM, whose running time is n O(`) , where ` is the height of the smallest horizontal strip containing S, only when L lies in the above range (see Sections 4.2 and 4.3). The general approach is similar to that of Section 2: the plane is divided into horizontal strips bounded by lines in H , we compute a smallest anchored cover for each strip (assuming there exists a cover of size k), and then take the union of these covers. Extra care is taken to handle squares that intersect the strip boundaries, and to ensure that the load constraint is not violated. We define the set of horizontal lines H as in Section 2.1 and extend Definition 2.12 as follows. DEFINITION 4.2. (i) We define a c-strip to be the tuple τ = (w, A, B, L), where w is a horizontal strip whose bottom (resp. √ top) boundary is a line `1 (resp. `2 ) in H ; A ⊆ 0 (resp. B ⊆ 0) is a set of at most k anchored squares intersecting `1 (resp. `2 ); and L: A ∪ B → {0, . . . , L} is a load function for A ∪ B. For each C ∈ A ∪ B, L(C) indicates the maximum number of points in the strip w (other than its anchors) that can be assigned to C. Let Sτ ⊆ S ∩ w be the set of input points inside w that are not anchors of any square in A ∪ B. An L-capacitated cover (Cτ , χτ ) of Sτ is called compatible with τ if A ∪ B ⊆ Cτ and each square C ∈ A ∪ B is assigned at most L(C) points of Sτ . (ii) A c-strip τ is legal if it has at least one compatible cover (Cτ , χτ ) so that |Cτ | ≤ k. Otherwise, τ is illegal. (iii) For any h ∈ H, D ⊆ √ 0, and LD : D → {0, . . . , L}, we say that (h, (D, LD )) is a canonical pair if |D| ≤ k and every square in D intersects h. Given a c-strip τ as above and a canonical pair (h, (D, LD )) so that h lies in w, we say that (h, (D, LD )) is compatible with τ if LD (C) ≤ L(C) for all C ∈ D ∩ (A ∪ B). Given a c-strip τ = (w, A, B, L), the subproblem on τ requires computing a smallest cover (Cτ , χτ ) for Sτ compatible with τ, provided that there exists such a cover of size at most k. In all the subproblems, the load function L will reflect the condition that the anchors of a square C ∈ A ∪ B are assigned to C itself. Therefore, if the left and bottom anchors of C are different (resp. the same), then L(C) will be at most L − 2 (resp. L − 1), because Sτ excludes the anchors of A ∪ B. In general, L(C) will be less than L − 2 because some points lying outside w will also be assigned to C. Note that, unlike Section 2.1, Sτ now contains points lying inside the squares of A ∪ B, and a compatible cover (Cτ , χτ ) is not unique. If τ is thin, (Cτ , χτ ) is computed using the CAPACITATED STRIP ALGORITHM described in Section 4.3. If τ is thick, the computation of a smallest cover (Cτ , χτ ) is based on the following lemma. √ LEMMA 4.3. Let τ = (w, A, B, L) be a legal c-strip of height greater than k + 2. If Sτ 6= ∅, there exists a canonical pair (h, (D, LD )) compatible with τ so that h divides
216
P. K. Agarwal and C. M. Procopiuc
w into two strips w− , w+ , each of height strictly less than the height of w, and so that the c-strips τ− = (w− , A, D, L− ) and
τ+ = (w+ , D, B, L+ )
are legal, where L− , L+ are load functions for A ∪ D and D ∪ B, respectively, defined as follows (#C is the number of distinct anchors of C): ( if C ∈ D, LD (C) L− (C) = L(C) if C ∈ A\D, if C ∈ D ∩ (A ∪ B), L(C) − LD (C) if C ∈ D\(A ∪ B), L+ (C) = L − LD (C) − #C L(C) if C ∈ B\D. Moreover, if (Cτ− , χτ− ) (resp. (Cτ+ , χτ+ )) is a smallest cover compatible with τ− (resp. τ+ ), then (Cτ , χτ ) is a smallest cover compatible with τ, where Cτ = Cτ− ∪ Cτ+ and for every p ∈ Sτ , if p is an anchor of C ∈ D, C if p ∈ w− , p not an anchor in D, χτ ( p) = χτ− ( p) if p ∈ w+ , p not an anchor in D. χτ+ ( p) Before giving the formal proof, we refer the reader to Figure 4 for an example which will make the above definitions of τ− and τ+ more intuitive. A strip (w, A, B, L) is shown in part (a), where A = {A, B} and B = {D, E}. In our example L = 11. The load constraints for A and B are shown inside the squares. Thus, L(A) = 5, L(B) = 1, L(D) = 2, and L(E) = 9. In part (b) we show a canonical pair (h, D), where D = {A, C}. For the squares of D, two constraints are shown: the one below represents L− = LD , and the one above represents L+ . The fact that (h, (D, LD )) is compatible with τ can be interpreted as follows: if square A can be assigned at most 5 points from w, then it can be assigned no more than 5 points from w− . In our example, we decided that A and C can be assigned at most 3 and 6 points, respectively, i.e.,
`2
2
D
5
`1
A (a)
`2
2
E
9
h `1
1
B
D 3
2
6
3
A (b)
Fig. 4. An instance of strips τ− and τ+ in Lemma 4.3.
E
9
C 1
B
Exact and Approximation Algorithms for Clustering
217
LD (A) = 3 and LD (C) = 6. Therefore, at most L(A) − LD (A) = 2 points from w+ are assigned to A. Similarly, at most L − LD (C) − 2 = 3 points from w+ are assigned to C, assuming that the left and bottom anchors of C are distinct; recall that the anchors of C are assigned to C itself. with τ. By Proof of Lemma 4.3. Let (C ∗ , χ ∗ ) be a fixed smallest cover compatible √ Lemma 2.11, there exists a line h ∈ H that intersects at most k squares of C ∗ . We denote by w− , w+ , the regions of w that lies below and above h, respectively. Let D ⊆ C ∗ be the set of squares intersected by h. For any C ∈ D, we define LD (C) to be the number of points of Sτ that lie inside w− and that are assigned to C by χ ∗ . Then (h, (D, LD )) is a canonical pair compatible with τ (the compatibility follows immediately from the definition of LD and the fact that (C ∗ , χ ∗ ) is compatible with τ ). If C−∗ (resp. C+∗ ) is the set of squares in C ∗ that intersect w− (resp. w+ ), and if χ−∗ (resp. χ+∗ ) is the restriction of χ ∗ to the largest subset of Sτ ∩ w− (resp. Sτ ∩ w+ ) that does not contain the anchors of any square in D, then (C−∗ , χ−∗ ) and (C+∗ , χ+∗ ) are compatible with τ− and τ+ , respectively. Hence, τ− and τ+ are legal c-strips. For the second part of the claim, the fact that (Cτ− , χτ− ) and (Cτ+ , χτ+ ) are compatible with τ− and τ+ , respectively, implies that (Cτ , χτ ) is compatible with τ. Because Cτ− is a smallest cover compatible with τ− , we obtain |Cτ− | ≤ |C−∗ |. Similarly, |Cτ+ | ≤ |C+∗ |. Hence, |Cτ | = |Cτ− | + |Cτ+ | − |D| ≤ |C−∗ | + |C+∗ | − |D| = |C ∗ |. Since C ∗ is a smallest cover compatible with τ, this implies |Cτ | = |C ∗ | and so Cτ is a smallest cover. We can now again use a dynamic-programming approach to compute a smallest cover. We proceed as in Section 2.1. If a c-strip is thin, we use the procedure described √ in Section 4.3. Note that this procedure assumes that either L = O(1) or L = Ä(n/ k). If a c-strip is thick, we use Lemma 4.3 to compute a smallest cover. √ √ The table size is n O( k) , and CAPACITATED STRIP ALGORITHM runs in n O( k) time (see Theorem 4.15, using the fact that n 0 , L ≤ n), so the running time of our algorithm √ O( k) is n . Putting it all together, we obtain the following result. THEOREM 4.4. Given a set S of n input points in the plane, and integers k and L, the optimal solution of the L-capacitated k-center problem for S can be obtained in time √ √ n O( k) if one of the following is true: L = Ä(n/ k) or L = O(1). In the next two subsections we describe CAPACITATED STRIP ALGORITHM. Section 4.2 proves some properties of a capacitated cover needed for the algorithm. 4.2. Properties of Capacitated Covers. Recall that Lemma 2.6 lies at the heart of STRIP ALGORITHM. Unfortunately, this lemma does not hold for capacitated covers. In particular, we can construct a counterexample of a set S of n points lying in a horizontal strip of height l so that for any smallest cover (C, χ) of S, there are vertical lines that intersect Ä(L ·l) squares of C (see Figure 5): S consists of L ·l/2 clusters, each containing L points and lying in a unit square. Each such unit square has one point in its upper left
218
P. K. Agarwal and C. M. Procopiuc
` Fig. 5. Counterexample for Lemma 2.6 in the capacitated case.
corner and L − 1 points very close to its lower right corner (represented as the small black rectangle). The squares form l/2 groups, each consisting of L squares. All groups lie in a rectangle of height l and width 2, and they are suffficiently far from each other so that no unit square can cover points from two distinct groups. The clustering in the figure is the only smallest L-capacitated clustering of the points. The vertical line ` intersects Ä(L · l) squares. We can, however, prove an analogue of Lemma 2.6 if L = Ä(n/l) and a slightly weaker result if L = O(1). LEMMA 4.5. Let S ⊆ R2 be a set of n points lying in a horizontal strip of height l, and let L be a natural number. If L = Ä(n/l), then any vertical line intersects O(l) squares of any smallest cover of S. PROOF. Let L ≥ a · n/l, for some constant a ≥ 0, and let C be a smallest cover of S. We claim that any vertical line intersects at most (2 + 1/a)l squares of C. Suppose to the contrary that there exists a vertical line ` that intersects a set C` ⊆ C of more than (2 + 1/a)l squares. Note that C` lies in a rectangle R of width 2 and height l, centered at `. All points of S lying in R, and thus all points assigned to squares in C` , can be covered by at most 2l squares. If some square covers L 0 > L points, we assign L of the points to that square, and create a new square that covers the remaining L 0 − L points. We repeat this procedure until all squares satisy the load constraint, and then we translate the squares so that they are anchored. Because L ≥ a · n/l, we create at most l/a new squares, for a total of (2 + 1/a)l. By replacing C ` with these at most (2 + 1/a)l squares, we obtain a smaller cover of S, thereby contradicting the assumption that C is a smallest cover.
Exact and Approximation Algorithms for Clustering
219
A more interesting situation occurs when L = O(1). In general, a vertical line may intersect Ä(n) squares of a smallest cover, but we will prove a slightly weaker result that is sufficient to extend STRIP ALGORITHM. DEFINITION 4.6. Let (C, χ) be a cover of S, and let C ∈ C be a square. Then C is active at a vertical line `: x = x¯ if there exist two points p, q ∈ S assigned to C so that x( p) ≤ x¯ and x(q) > x. ¯ LEMMA 4.7. Let S ⊆ R2 be a set of n points lying in a horizontal strip of height l, and let L be a constant. There exists a smallest cover (C, χ) of S in which at most O(l) squares of C are active at any vertical line. We prove the lemma using the following two results. LEMMA 4.8. Let S be a set of n points lying in a unit square. There exists a smallest cover (C, χ) of S in which at most one square of C is active at any vertical line. PROOF. Let the points of S be ordered in the increasing order of their x-coordinates. Assign the first L points to a unit square C1 , the next L points to a unit square C2 , and so on, choosing the squares Ci so that they are anchored. Since S lies in a unit square, it is always possible to construct the squares Ci . Let (C, χ) be the cover of S obtained by this method. Clearly, (C, χ) is a smallest cover, and at most one square of C is active at any vertical line (recall that the points are in general position). LEMMA 4.9. Let S1 , . . . , Su be a family of pairwise-disjoint sets of points Su in the plane Si is called so that each Si lies in a unit square Ui . A square C in a cover of S = i=1 monochromatic if all the points assigned to C belong to one Si ; otherwise it is called polychromatic. There exists a smallest anchored cover (C, χ) of S in which at most L u squares are polychromatic. PROOF. We first prove the lemma for u = 2. Let (C, χ) be a smallest cover of S with the minimum number of polychromatic squares. For any pair i, j of positive integers, with i + j = L , denote by Mij ⊆ C the set of squares C so that at most i points of S1 and at most j points of S2 are assigned to C. We claim that |Mij | < L. Indeed, if |Mij | ≥ L, then let C1 , . . . , C L be L squares of Mij . Let X i ⊆ Si , i ∈ {1, 2}, be the set of points assigned to the squares C1 , . . . , C L . Hence, |X 1 | ≤ iL and |X 2 | ≤ jL. Replace C1 , . . . , C L in C by i copies of the unit square U1 and j copies of the unit square U2 , and assign the points of X 1 (resp. X 2 ) to these copies of U1 (resp. U2 ), so that the load constraint is satisfied. Translate each copy of U1 (resp. U2 ) so that it becomes an anchored square. Since i + j = L, we still have a smallest cover of S, but with fewer polychromatic squares, a contradiction. The number of polychromatic squares in C is upper bounded by X |Mij | < L 2 . 1≤i, j≤L−1, i+ j=L
220
P. K. Agarwal and C. M. Procopiuc
For larger values of u, we consider all tuples (i 1 , . . . , i u ) so that 0 ≤ i 1 , . . . , i u ≤ L −1 and i 1 + · · · + i u = L . For each such tuple, let Mi1 ···iu ⊆ C be the set of squares C so that at most i j points of Sj are assigned to C, 1 ≤ j ≤ u. An argument similar to the one above proves that |Mi1 ···iu | < L . A simple counting shows that there are at most L u−1 distinct tuples (i 1 , . . . , i u ) with the specified properties. Hence, the number of polychromatic squares in C is at most L u . Proof of Lemma 4.7. Let U be a family of pairwise-disjoint unit squares containing all points of S. For each Ui ∈ U, let Si ⊆ S be the set of points contained in Ui . We say that two squares Ui , U j ∈ U are neighbors if the L ∞ -distance between Si and Sj is at most one. Since the squares in U are pairwise disjoint, any square Ui has at most eight neighbors. Consider a unit square C in a capacitated cover of S. If C is assigned points from the set Si , then all points assigned to C are either contained in Ui or in some neighbor of Ui . This observation and Lemma 4.9 imply that there exists a smallest cover (C, χ) of S with the following property: for each set Si , there are at most L 9 polychromatic squares of C that are assigned at least one point of Si . We call it the polychromatic property. We construct another L-capacitated cover (C 0 , χ 0 ) as follows: Let C ∗ ⊆ C be the set of polychromatic squares of C. Let Si0 denote the points of Si that are not assigned S to any square in C ∗ , and let (Ci , χi ) be a cover of Si0 satisfying Lemma 4.8. Set C 0 = ( Ci )∪C ∗ . Let ( χ( p) if χ( p) ∈ C ∗ , 0 χ ( p) = if p ∈ Si0 . χi ( p) It is easy to see that (C 0 , χ 0 ) is a smallest L-capacitated cover of S. Moreover, C 0 has the same set of polychromatic squares as C and thus satisfies the polychromatic property. We show that there are at most O(l) active squares of (C 0 , χ 0 ) at any vertical line. Indeed, let ` be a vertical line, and let R` be the rectangle of height l and width 2 centered at `. We define U` ⊆ U as the subset of squares that intersect R` . Since the squares in U are pairwise disjoint, |U` | = O(l). All squares of C 0 intersected by ` lie in R` , and they are assigned only points contained in U` . Hence, by the polychromatic property, there are at most L 9 ·|U` | polychromatic squares of C 0 that are intersected by `. From the construction of (C 0 , χ 0 ), it follows that at most |U` | monochromatic squares of C 0 are active at `. The total number of squares that are active at ` is at most (L 9 + 1) · |U` | = O(l). Lemmas 4.5 and 4.7 imply the following. COROLLARY 4.10. Let S ⊆ R2 be a set of n points lying in a horizontal strip of height l, and let L be a positive integer so that L = Ä(n/l) or L = O(1). Then there exists a constant α and a smallest cover (C, χ) of S in which at most α · l squares of C are active at any vertical line. In the next subsection we compute a smallest cover of a set of points S lying in a strip of bounded height, under the additional condition that the points can also be assigned to a given set of squares A with given load constraints LA . In this case a cover (C, χ)
Exact and Approximation Algorithms for Clustering
221
of S must satisfy the following properties: A ⊆ C, |χ −1 (C)| ≤ LA (C), if C ∈ A, and |χ −1 (C)| ≤ L, if C ∈ C\A. We say that (C, χ) is a cover of S with support (A, LA ). We generalize Corollary 4.10 as follows. COROLLARY 4.11. Let S ⊆ R2 be a set of n points lying in a horizontal strip of height l, and let L be a positive integer so that L = Ä(n/l) or L = O(1). Let A be a set of squares and let LA be a load function for A. Then there exists a constant α and a smallest cover (C, χ) of S with support (A, LA ), so that at most α · l squares of C\A are active at any vertical line. PROOF. Let (C 0 , χ 0 ) be a smallest cover of S with support (A, LA ). Assume that (C 0 , χ 0 ) does not have the required property. By Corollary 4.10 applied to the set of points S1 = S\{ p | χ 0 ( p) ∈ A}, there exists a constant α and a smallest cover (B, χ 00 ) of S1 that has at most α ·l squares active at any vertical line. Define C = A∪B, χ( p) = χ 00 ( p), if p ∈ S1 , and χ ( p) = χ 0 ( p), otherwise. Then (C, χ) is a smallest cover of S with support (A, LA ), and at most α · l squares of C\A are active at any vertical line. 4.3. The Capacitated Strip Algorithm. Let S 0 ⊆ S be a subset of n 0 points lying in a horizontal strip w of height l. Let A ⊆ 0 be a set of anchored squares (with respect to S), and let LA be a load function for A. For any square C ∈ 0, let ( LA (C) if C ∈ A, LC = L otherwise. We want to compute a smallest cover (C, χ) of S 0 with support (A, LA ). Recall that a cover (C, χ) of S 0 has support (A, LA ) if A ⊆ C and |χ −1 (C)| ≤ L C for each square C ∈ C. Moreover, any square of C\A must be anchored with respect to S 0 . The algorithm we describe assumes L = Ä(n/l) or L = O(1). It can be adapted to work for any value of L , but then the running time increases to n O(k) . By Corollary 4.11, it suffices to consider only covers (C, χ) with the property that C\A has at most α · l active squares at any vertical line. Therefore, for the remainder of this section, we assume that a cover has this property. DEFINITION 4.12. A cover (C ∗ , χ ∗ ) of S 0 with support (A, LA ) is optimal if I(C ∗ ) is maximal in the lexicographic order (i.e., for any cover (C, χ) of S 0 with support (A, LA ), I(C ∗ ) ≥lex I(C)). Since we did not impose any restriction on the assignment function, there may exist more than one optimal cover, but for any two optimal covers (C1 , χ1 ) and (C2 , χ2 ), C1 = C2 . Clearly, any optimal cover is also a smallest cover. Let C ∗ denote the set of squares so that all optimal covers of S 0 are of the form (C ∗ , χ). The algorithm we describe below will compute one such optimal cover. The overall approach is similar to that of STRIP ALGORITHM. We sweep a vertical line from left to right and maintain a family F of “subcovers” for the points to the left of the sweep line. Let (C ∗ , χ ∗ ) be a fixed optimal cover. For a given vertical line
222
P. K. Agarwal and C. M. Procopiuc
`: x = x, ¯ let C`∗ = A ∪ {χ ∗ ( p) | x( p) ≤ x}. ¯ Intuitively, we maintain the invariant that, for each position of the sweep line `, C`∗ is one of the subcovers. We use Corollary 4.11 to keep the size of F small. Corollary 4.11 and the algorithm in Section 2.1 suggest that we should maintain a family F of “subcovers” so that any two subcovers in F have different sets of active squares. Whether a square C is active at a given line ` not only depends on whether C intersects `, but also depends on whether any points lying to the right of ` are assigned to C. Therefore, we have to maintain some kind of assignment information along with each subcover. An obvious approach is to maintain the assignment function explicitly, but then we would have to maintain too many subcovers. However, as we will see below, it suffices to know how many points lying to the right of the sweep line are assigned to every square within a subcover. This reduces the number of subcovers to be maintained to n O(l) . Special consideration has to be given to the squares in A. They will not be classified as active or inactive with respect to the sweep line (recall that the bound on the number of active squares from Corollary 4.11 applies only to squares of C\A). However, we also maintain the number of points lying to the right of the sweep line that are assigned to every square of A, and use this information in the invariants of our algorithm. With these observations in mind, we now define subcovers formally and state the invariants maintained by the sweep-line algorithm. DEFINITION 4.13. Let 4 = {(C1 , λ1 ), . . . , (Cr , λr )}, where C h = ( pih , p jh ) ∈ 0 is an anchored unit square, λh ∈ {0, . . . , L}, 1 ≤ h ≤ r , and C1 , . . . , Cr are in the increasing order of the x-coordinates of their left sides. Let C4 = {C1 , . . . , Cr }. For any vertical line `: x = x, ¯ we say that 4 is a subcover with respect to ` if and only if the following conditions hold: (S1) A ⊆ C4 . (S2) For all C h ∈ C4 \A, pih , p jh ∈ S 0 . ¯ be the set of points in the left halfplane of `. There (S3) Let S` = { p ∈ S 0 | x( p) ≤ x} exists an assignment function χ: S` ∪ { pih , p jh | C h ∈ C4 \A} → C4 so that, for all 1 ≤ h ≤ r, λh + |χ −1 (C h )| ≤ L Ch . Intuitively, for each 1 ≤ h ≤ r , the value λh represents our guess on the number of points to the right of ` that are assigned to C h , excluding the anchors. Thus, condition (S3) ensures that it is possible to assign all points lying in the left halfplane of ` to C4 , and leave enough room in each square for points from the right halfplane, according to our guesses. Recall that a square C is active at ` if and only if there exist two points p, q ∈ S 0 so that x( p) ≤ x¯ < x(q) and χ( p) = χ(q) = C. In our setup, this condition will be true ¯ We therefore call for a square C h ∈ C4 \A if C h intersects `, and λh > 0 or x( p jh ) > x. a square of C4 \A active if these two conditions hold. If Ch is active, we call (Ch , λh ) an active pair of 4. We say that a pair (C h , λh ) of 4 is alive with respect to a vertical line ` if either (Ch , λh ) is active with respect to ` or Ch ∈ A. Let A(4, `) denote the subset of pairs of 4 that are alive with respect to `. If ` is obvious from the context, we simply denote it by A(4). We are now ready to describe our algorithm. During the sweep, we maintain a family F = {41 , . . . , 4u } of all possible subcovers with respect to the sweep line ` so that the
Exact and Approximation Algorithms for Clustering
223
following conditions are satisfied: (C1) For 1 ≤ i ≤ u, |A(4i , `)| ≤ α · l + |A|, where α is the constant defined in Corollary 4.11. (C2) For all 1 ≤ i 6= j ≤ u, A(4i , `) 6= A(4 j , `). (C3) For any subcover 4i0 of S 0 with respect to ` so that A(4i0 , `) = A(4i , `), I(C4i ) >lex I(C4i0 ). Let q1 = (x1 , y1 ), . . . , qn 0 = (xn 0 , yn 0 ) be the points of S sorted in the increasing order of their x-coordinates. The event points at which the sweep line stops and updates the family F are x0 = −∞, x1 , . . . , xn 0 . Initially, at x = x0 , F is computed as follows. Let A1 , . . . , Am be the sequence of the squares of A in the increasing order of the x-coordinates of their left sides. Then F = {{(A1 , λ1 ), . . . , (Am , λm )} | λi ∈ {0, . . . , L Ai }, 1 ≤ i ≤ m}. When the sweep line reaches the event point xi (i.e., it passes through qi ), F is updated as follows. For each 4 j ∈ F, we delete 4 j from F and replace it with a family of subcovers of S 0 with respect to `: x = xi . For simplicity, assume that qi is not an anchor of 4 j , and that in any cover (C, χ) of S 0 , |χ −1 (χ (qi ))| > 1 (i.e., at least one other point is assigned to the same square as qi ). If at least one of these assumptions fails, a constant number of additional subcovers must be added to F. (1) For each pair (C h , λh ) ∈ 4 j , so that C h covers qi and λh > 0, add to F the set 4 j,h = (4 j \{(C h , λh )}) ∪ {(C h , λh − 1)}. (2) If |A(4 j , `)| < α ·l +|A| (α is the constant in Corollary 4.11), then add the following sets to F: (a) 4 j ∪ {((qi , qi ), λ)} for all 1 ≤ λ ≤ L − 1. (b) 4 j ∪ {((qi , q), λ)} for all 0 ≤ λ ≤ L − 2 and for all q 6= qi such that (qi , q) defines an anchored unit square. Roughly speaking, Step (1) assigns qi to an active square that contains qi , and Step (2) creates a new active square to cover qi . After these steps are executed for all 4 j ∈ F, we execute a filtering phase as follows: for any two sets 4 j , 4h ∈ F, if A(4 j ) = A(4h ) and I(C4j )
Let 4∗ be the set computed by the above sweep-line algorithm. Then
PROOF. Let (C ∗ , χ ∗ ) be a fixed optimal cover of S 0 . For 0 ≤ i ≤ n 0 , let Ci∗ = A ∪ {χ ∗ ( p) | x( p) ≤ xi }. Assume that C1∗ , . . . , Cr∗ are the squares of Ci∗ in the increasing order of the x-coordinates of their left sides. For each square C h∗ ∈ Ci∗ , let λ∗h be
224
P. K. Agarwal and C. M. Procopiuc
the number of points p, excluding the anchors of Ch∗ , so that χ ∗ ( p) = Ch∗ and x( p) > xi . We prove by induction on i that 4i∗ = {(C1∗ , λ∗1 ), . . . , (Cr∗ , λr∗ )} ∈ F after processing the point qi . This implies that 4∗n 0 ∈ F at the end of the sweep. Moreover, I(C4∗n0 ) = I(C ∗ ) is maximal over all sets 4 ∈ F at the end of the sweep. This follows from the maximality of I(C ∗ ) and from the fact that any set 4 in the final set F is a cover of S 0 . ∗ ∈ F afThe basis case for the induction is readily verified. Assume now that 4i−1 ∗ (the ter processing point qi−1 . For simplicity, assume that qi is not an anchor of 4i−1 argument easily extends to the case when this is not true). ∗ , then λ∗h > 0. Hence, There are two possible cases. If χ ∗ (qi ) = Ch∗ ∈ Ci−1 ∗ \{(C h∗ , λ∗h )}) ∪ {(C h∗ , λ∗h − 1)}, 4i∗ = (4i−1
and therefore by Step (1) 4i∗ is added to F. ∗ ∗ , then Ci∗ = Ci−1 ∪ {(qi , q)} for some point q ∈ S 0 (not necessarily If χ ∗ (qi ) 6∈ Ci−1 ∗ −1 distinct from qi ). We assume that (χ ) ((qi , q)) > 1 (the argument can be extended to handle the case when this is not true). This implies that (qi , q) is active at `: x = xi . Let λ∗ be the number of points p ∈ S 0 \{qi , q} so that χ ∗ ( p) = (qi , q) and x( p) > xi . Then ∗ ∪ {((qi , q), λ∗ )}. 4i∗ = 4i−1 ∗ . Thus, 4i∗ is added to We prove below that the condition in Step (2) holds for 4i−1 F in Step (2). Let Ai∗ be the set of active squares of (C ∗ , χ ∗ ) with respect to the line `: x = xi . As noted above, (qi , q) ∈ Ai∗ . Since |Ai∗ | ≤ α · l, we deduce that ∗ , `)| < α · l + |A|. |Ai∗ \{(qi , q)}| ≤ α · l − 1 < α · l. This implies that |A(4i−1 ∗ It remains to prove that 4i is not deleted from F during the filtering phase. Let 4i be another subcover with respect to ` so that A(4i , `) = A(4i∗ , `). Using an argument similar to the one in the proof of Lemma 2.8, we show that I(C4i )
By invariants (C1) and (C2), at the end of each update operation F has at most L |A| ·(n 0 · L) = (n 0 · L) O(l+|A|) subcovers. An update operation can be executed in time polynomial in |F|, so the overall running time is (n 0 ·L) O(l+|A|) . We conclude with the following. O(l)
THEOREM 4.15. Let S 0 ⊆ S be a set of n 0 input points in a horizontal strip of height l, let A be a set of anchored unit squares with respect to S, and let LA be a load function for A. If L is an integer so that L = Ä(n 0 /l) or L = O(1), we can compute in time (n 0 · L) O(l+|A|) a smallest L-capacitated anchored cover (C, χ) of S 0 so that A ⊆ C and |χ −1 (C)| ≤ LA (C), ∀C ∈ A.
5. Conclusions and Open Problems. We have provided a general method for solving a large class of clustering problems in subexponential time in any fixed dimension. Our method can be adapted to handle some additional restrictions on the clustering (e.g., discrete k-center), and works for any metric satisfying a small set of properties (see Remark 2.17), including any L p -metric.
Exact and Approximation Algorithms for Clustering
225
Our subexponential exact algorithm for k-center problems extends to only a subclass of L-capacitated problems. An interesting open problem is to design a subexponential algorithm that works for any value of L. We believe that the approximation algorithm described in Section 3 could be further improved by choosing a nonuniform grid Z . A priori knowledge on the distribution of points could help us decide how to choose Z . We would like to have some test criteria that we can use on a given input S to decide how to compute an approximate k-clustering of S efficiently.
References [1] [2]
[3] [4] [5] [6] [7] [8] [9] [10] [11]
[12]
[13]
[14] [15] [16] [17] [18] [19]
P. K. Agarwal and M. Sharir, Efficient algorithms for geometric optimization, ACM Comput. Surv., 30 (1998), 412–458. R. Agrawal, A. Ghosh, T. Imielinski, B. Iyer, and A. Swami, An interval classifier for database mining applications, Proc. 18th Internat. Conf. Very Large Databases, Morgan Kauffman, San Mateo, CA, August 1992, pp. 560–573. M. R. Anderberg, Cluster Analysis for Applications, Academic Press, New York, 1973. S. Arora, Polynomial time approximation schemes for Euclidean TSP and other geometric problems, Proc. 37th Annu. IEEE Sympos. Found. Comput. Sci., 1996, pp. 2–11. S. Arora, Nearly linear time approximation schemes for Euclidean TSP and other geometric problems, Proc. 38th Annu. IEEE Sympos. Found. Comput. Sci., 1997, pp. 554–563. S. Arora, P. Raghavan, and S. Rao, Approximation schemes for Euclidean k-median and related problems, Proc. 30th Annu. ACM Sympos. Theory Comput., 1998, pp. 106–113. V. Arya, N. Garg, R. Khanderkar, A. Meyerson, K. Munagala, and V. Pandit, Local search heuristics for kmedian and facility location problems, Proc 33rd Annu. ACM Sympos. Theory Comput., 2001, pp. 21–29. J. Bar-Ilan, G. Kortsarz, and D. Peleg, How to allocate network centers, J. Algorithms, 15 (1993), 385–415. M. Berger and I. Rigoutsos, An algorithm for point clustering and grid generation, IEEE Trans. Systems Man Cybernet., 21 (1991), 1278–1286. M. Charikar, C. Chekuri, T. Feder, and R. Motwani, Incremental clustering and dynamic information retrieval, Proc. 29th Annu. ACM Sympos. Theory Comput., 1997, pp. 626–635. D. R. Cutting, D. R. Karger, and J. O. Pedersen, Constant interaction-time scatter/gather browsing of very large document collections, Proc. 16th Annu. Internat. ACM SIGIR Conf. Res. Devel. Inform. Retrieval, 1993, pp. 126–134. D. R. Cutting, D. R. Karger, J. O. Pedersen, and J. W. Tukey, Scatter/gather: a cluster-based approach to browsing large document collections, Proc. 16th Annu. Internat. ACM SIGIR Conf. Res. Devel. Inform. Retrieval, 1992, pp. 318–329. A. A. Diwan, S. Rane, S. Seshadri, and S. Sudarshan, Clustering techniques for minimizing external path length, Proc. 22nd Internat. Conf. Very Large Databases, Morgan Kauffman, San Mateo, CA, 1996, pp. 342–353. Z. Drezner, The p-centre problems—heuristic and optimal algorithms, J. Oper. Res. Soc., 35 (1984), 741–748. Z. Drezner (ed.), Facility Location, Springer-Verlag, New York, 1995. T. Feder and D. H. Greene, Optimal algorithms for approximate clustering, Proc. 20th Annu. ACM Sympos. Theory Comput., 1988, pp. 434–444. R. J. Fowler, M. S. Paterson, and S. L. Tanimoto, Optimal packing and covering in the plane are NP-complete, Inform. Process. Lett., 12 (1981), 133–137. T. Gonzalez, Clustering to minimize the maximum intercluster distance, Theoret. Comput. Sci., 38 (1985), 293–306. T. Gonzalez, Covering a set of points in multidimensional space, Inform. Process. Lett., 40 (1991), 181–188.
226
P. K. Agarwal and C. M. Procopiuc
[20]
R. Z. Hwang, R. C. Chang, and R. C. T. Lee, The generalized searching over separators strategy to solve some NP-hard problems in subexponential time, Algorithmica, 9 (1993), 398–423. R. Z. Hwang, R. C. T. Lee, and R. C. Chang, The slab dividing approach to solve the Euclidean p-center problem, Algorithmica, 9 (1993), 1–22. A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice-Hall, Englewood Cliffs, NJ, 1988. K. Jain and V. V. Vazirani, Primal-dual approximation algorithms for metric facility location and kmedian problems, Proc. 40th Annu. Sympos. Found. Comput. Sci., 1999, pp. 2–13. J. Jolion, P. Meer, and S. Batauche, Robust clustering with applications in computer vision, IEEE Trans. Pattern Anal. Mach. Intell., 13 (1991), 791–802. L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, New York, 1990. S. Khuller and Y. J. Sussmann, The capacitated k-center problem, Proc. 4th Annu. European Sympos. Algorithms, Lecture Notes in Computer Science, Vol. 1136, Springer-Verlag, Berlin, 1996, pp. 152–166. R. Lupton, F. M. Maley, and N. Young, Data collection for the sloan digital sky survey: a network-flow heuristic, Proc. 7th Annu. ACM–SIAM Sympos. Discrete Algorithms, 1996, pp. 296–303. J. Makhoul, S. Roucos, and H. Gish, Vector quantization in speech coding, Proc. IEEE, 73 (1985), 1551–1588. N. Megiddo and K. J. Supowit, On the complexity of some common geometric location problems, SIAM J. Comput., 13 (1984), 182–196. R. T. Ng and J. Han, Efficient and effective clustering methods for spatial data mining, Proc. 20th Internat. Conf. Very Large Databases, 1994, pp. 144–155. P. Raghavan, Information retrieval algorithms: a survey, Proc. 18th ACM–SIAM Sympos. Discrete Algorithms, 1997, pp. 11–18. P. Schroeter and J. Big¨un, Hierarchical image segmentation by multi-dimensional clustering and orientation-adaptive boundary refinement, Pattern Recognition, 28 (1995), 695–709. J. Shafer, R. Agrawal, and M. Mehta, Sprint: a scalable parallel classifier for data mining, Proc. 22nd Internat. Conf. Very Large Databases, Morgan Kauffman, San Mateo, CA, 1996, pp. 544–555. D. Shmoys, E. Tardos, and K. Aardel, Approximation algorithms for facility location problems, Proc. 29th Annu. ACM Sympos. Theory Comput., 1997, pp. 265–274.
[21] [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] [33] [34]