J Comb Optim https://doi.org/10.1007/s10878-018-0278-6
Local search approximation algorithms for the k-means problem with penalties
Dongmei Zhang¹ · Chunlin Hao² · Chenchen Wu³ · Dachuan Xu⁴ · Zhenning Zhang²
© Springer Science+Business Media, LLC, part of Springer Nature 2018
Abstract In this paper, we study the k-means problem with (nonuniform) penalties (k-MPWP), a natural generalization of the classic k-means problem. In the k-MPWP, we are given an n-client set D ⊂ ℝ^d, a penalty cost p_j > 0 for each j ∈ D, and an integer k ≤ n. The goal is to open a center subset F ⊂ ℝ^d with |F| ≤ k and to choose a client subset P ⊆ D as the penalized client set such that the total cost (including the sum of squared distances from each client in D\P to its nearest open center and the sum of penalty costs over the clients in P) is minimized. We offer a local search (81 + ε)-approximation algorithm for the k-MPWP by using the single-swap operation. We further improve this approximation ratio to (25 + ε) by using the multi-swap operation.

Keywords Approximation algorithm · k-means · Penalty · Local search

A preliminary version of this paper appeared in Proceedings of the 23rd International Computing and Combinatorics Conference, pp. 568–574, 2017.

Corresponding author: Dachuan Xu, [email protected]
Dongmei Zhang: [email protected]
Chunlin Hao: [email protected]
Chenchen Wu: [email protected]
Zhenning Zhang: [email protected]

1 School of Computer Science and Technology, Shandong Jianzhu University, Jinan 250101, People's Republic of China
2 Department of Information and Operations Research, College of Applied Sciences, Beijing University of Technology, 100 Pingleyuan, Chaoyang District, Beijing 100124, People's Republic of China
3 College of Science, Tianjin University of Technology, Tianjin 300384, People's Republic of China
4 Beijing Institute for Scientific and Engineering Computing, Beijing University of Technology, 100 Pingleyuan, Chaoyang District, Beijing 100124, People's Republic of China
1 Introduction

The k-means problem is a classic problem in machine learning. It is defined formally as follows: given an n-client set D in ℝ^d and an integer k, the goal is to partition D into k clusters X₁, …, X_k and to assign each cluster a center such that the sum of squared distances from each client to its center is minimized. Since the k-means problem is NP-hard (Aloise et al. 2009; Dasgupta 2007; Mahajan et al. 2009), one natural way to deal with it is to design efficient heuristic algorithms; among these, Lloyd's algorithm is the most popular (Lloyd 1957, 1982). Another important approach is to design approximation algorithms. Most approximation algorithms for the k-means problem are based on the important observation of Matoušek (2000), which points out the connection between the k-means and k-median problems via the construction of an approximate centroid set.

For the metric k-median problem, Charikar et al. (1999) give the first constant-factor (6⅔)-approximation algorithm by using an LP rounding technique. Arya et al. (2004) present a local search (3 + ε)-approximation algorithm. Li and Svensson (2016) present a novel (2.732 + ε)-approximation algorithm by exploiting the power of pseudo-approximation. Byrka et al. (2017) improve the Li–Svensson approximation ratio from 2.732 + ε to 2.675 + ε. Zhang (2007) gives a local search (2 + √3 + ε)-approximation algorithm for the general k-facility location problem.

For the k-means problem, Jain and Vazirani (2001) give the first constant-factor 108-approximation algorithm. Kanungo et al. (2004) present a local search (9 + ε)-approximation algorithm. The best approximation ratio of 6.357 is due to Ahmadian et al. (2017), using a primal-dual technique. Makarychev et al. (2016) present a bi-criteria approximation algorithm based on LP rounding and local search, which attains a guarantee α(β) depending on the number βk of clusters.
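For concreteness, Lloyd's heuristic mentioned above alternates an assignment step and a centroid step. The following minimal sketch is illustrative only (it is not from the paper; the function name, the toy data, and the iteration cap are our own assumptions):

```python
import random

def lloyd(points, k, iters=100):
    """Lloyd's heuristic for k-means: alternately assign each point to its
    nearest center and move each center to the centroid of its cluster.
    Converges to a local optimum, not necessarily a global one."""
    centers = random.sample(points, k)
    for _ in range(iters):
        # Assignment step: clusters[i] collects the points nearest to centers[i].
        clusters = [[] for _ in range(k)]
        for x in points:
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centers[i])))
            clusters[i].append(x)
        # Update step: each nonempty cluster's center moves to its centroid.
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # fixed point reached
            break
        centers = new_centers
    return centers
```

Note that no approximation guarantee comes with this heuristic, which is exactly the gap the approximation algorithms surveyed here address.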
There are many practical and interesting variants of the k-means problem (cf. Bandyapadhyay and Varadarajan 2016; Georgogiannis 2016; Tseng 2007). In this paper, we introduce the k-means problem with (nonuniform) penalties (k-MPWP), a natural generalization of the classic k-means problem. In the k-MPWP, we are given an n-client set D ⊂ ℝ^d, a penalty cost p_j > 0 for each j ∈ D, and an integer k ≤ n. The goal is to open a center subset F ⊂ ℝ^d with |F| ≤ k and to choose a client subset P ⊆ D as the penalized client set such that the total cost (including the sum of squared distances from each client in D\P to its nearest open center and the sum of penalty costs over the clients in P) is minimized. We remark that Tseng (2007) introduces the penalized and weighted k-means problem, which is similar to the k-MPWP but with uniform penalties. The penalty versions of
the classic facility location problem and the k-median problem are proposed by Charikar et al. (2001). The currently best approximation ratios for the facility location problems with linear/submodular penalties are given by Li et al. (2015). Wang et al. (2016) give a local search (2 + √3 + ε)-approximation algorithm for the k-facility location problem with penalties.

The contributions of our paper are summarized as follows.
– We introduce the k-means problem with (nonuniform) penalties, which generalizes the classic k-means problem;
– We offer (81 + ε)- and (25 + ε)-approximation algorithms for the k-MPWP by using single-swap and multi-swap local search techniques, respectively. To the best of our knowledge, no constant approximation algorithm was previously known for this problem, even for the uniform-penalty version.

The organization of this paper is as follows. In Sect. 2, we present a single-swap local search (81 + ε)-approximation algorithm for the k-MPWP. We improve this approximation ratio to 25 + ε by using multi-swap in Sect. 3. We give some discussions in Sect. 4.
2 A local search (81 + ε)-approximation algorithm

2.1 Notations

The squared distance between any two points a, b ∈ ℝ^d is defined as Δ(a, b) := ||a − b||². Given a client subset U ⊆ D and a point c ∈ ℝ^d, we define the total sum of squared distances of U with respect to c, and the centroid of U, as follows:
\[
\Delta(c, U) := \sum_{j \in U} \Delta(c, j) = \sum_{j \in U} \|c - j\|^2,
\qquad
cen(U) := \frac{1}{|U|} \sum_{j \in U} j.
\]
The following lemma gives an important property of the centroid.

Lemma 1 (Kanungo et al. 2004) For any subset U ⊆ D and any point c ∈ ℝ^d, we have
\[
\Delta(c, U) = \Delta(cen(U), U) + |U| \cdot \Delta(cen(U), c). \tag{1}
\]

Let S be a feasible solution (center subset) to the k-MPWP with penalized client set P. We observe that P = {j ∈ D | p_j < min_{s∈S} Δ(j, s)}; that is, the penalized client set P is determined by the center subset S. Let O be a global optimal solution to the k-MPWP with penalized set P* = {j ∈ D | p_j < min_{o∈O} Δ(j, o)}. For each client j ∈ D\P, we denote S_j := min_{s∈S} Δ(j, s) and s_j := arg min_{s∈S} Δ(j, s). For each client j ∈ D\P*, we denote O_j := min_{o∈O} Δ(j, o) and
o_j := arg min_{o∈O} Δ(j, o). For each center o ∈ O, we denote s_o := arg min_{s∈S} Δ(o, s) and say that s_o captures o. Furthermore, we denote
– N_S(s) := {j ∈ D\P | s_j = s}, ∀s ∈ S;
– C_s := Σ_{j∈D\P} S_j = Σ_{s∈S} Δ(s, N_S(s)), C_p := Σ_{j∈P} p_j, cost(S) := C_s + C_p;
– N_O(o) := {j ∈ D\P* | o_j = o}, ∀o ∈ O;
– C_s* := Σ_{j∈D\P*} O_j = Σ_{o∈O} Δ(o, N_O(o)), C_p* := Σ_{j∈P*} p_j, cost(O) := C_s* + C_p*.

We remark that o = cen(N_O(o)) for each o ∈ O. Matoušek (2000) introduces the following concept of an approximate centroid set.

Definition 2 A set C ⊂ ℝ^d is an ε̂-approximate centroid set for D ⊂ ℝ^d if, for any subset U ⊆ D, we have
\[
\min_{c \in C} \sum_{j \in U} \Delta(c, j) \le (1 + \hat{\varepsilon}) \min_{c \in \mathbb{R}^d} \sum_{j \in U} \Delta(c, j).
\]
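As a quick numeric sanity check of Lemma 1 (not part of the paper; the toy points are arbitrary), one can verify the centroid identity directly:

```python
def delta(a, b):
    # Squared Euclidean distance Δ(a, b) = ||a - b||^2.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def delta_set(c, U):
    # Δ(c, U): total squared distance from c to every point of U.
    return sum(delta(c, j) for j in U)

def centroid(U):
    # cen(U): coordinate-wise mean of U.
    return tuple(sum(coord) / len(U) for coord in zip(*U))

U = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
c = (4.0, -1.0)
cen = centroid(U)
# Lemma 1: Δ(c, U) = Δ(cen(U), U) + |U| * Δ(cen(U), c)
lhs = delta_set(c, U)
rhs = delta_set(cen, U) + len(U) * delta(cen, c)
assert abs(lhs - rhs) < 1e-9  # both sides equal 47.0 for this instance
```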
One can compute an ε̂-approximate centroid set C for D of polynomial size in polynomial time (Makarychev et al. 2016; Matoušek 2000). Equipped with this result, we assume that all candidate centers are chosen from C.

2.2 Local search algorithm

For any feasible solution S, we define the single-swap operation swap(a, b) as follows: given a ∈ S and b ∈ C\S, the center a is deleted from S and the center b is added to S. The neighborhood of S associated with this operation is
\[
\mathrm{Ngh}_1(S) := \{ S \setminus \{a\} \cup \{b\} \mid a \in S,\ b \in C \setminus S \}.
\]
For any given ε > 0, we present our single-swap local search algorithm as follows.

Algorithm 1 (ALG1(ε))
Step 0 (Initialization) Set
\[
\hat{\varepsilon} := 72 + \varepsilon - \sqrt{72^2 + 64\varepsilon}.
\]
Construct an ε̂-approximate centroid set C. Arbitrarily choose a feasible solution S from C.
Step 1 (Local search) Compute
\[
S_{\min} := \arg\min_{S' \in \mathrm{Ngh}_1(S)} \mathrm{cost}(S').
\]
Step 2 (Stop criterion) If cost(S_min) ≥ cost(S), output S along with the corresponding penalized client set P = {j ∈ D | p_j < min_{s∈S} Δ(j, s)}. Otherwise, set S := S_min and go to Step 1.
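A minimal sketch of ALG1 in Python follows. It is illustrative only: for brevity the candidate set C defaults to the client set itself rather than a true ε̂-approximate centroid set, the helper names are our own, and the naive neighborhood scan simply evaluates every swap(a, b):

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cost(S, clients, penalties):
    """cost(S) = Cs + Cp: each client pays the smaller of its nearest-center
    squared distance and its penalty; clients paying the penalty form P."""
    return sum(min(min(sq_dist(j, s) for s in S), penalties[j]) for j in clients)

def single_swap_local_search(clients, penalties, k, candidates=None):
    # Illustrative stand-in: use the clients themselves as candidate centers
    # instead of an eps-hat-approximate centroid set.
    C = list(candidates) if candidates is not None else list(clients)
    S = set(C[:k])  # arbitrary feasible initial solution (Step 0)
    improved = True
    while improved:  # Steps 1-2: move to the best neighbor until local optimality
        improved = False
        best, best_cost = None, cost(S, clients, penalties)
        for a in S:                      # try every swap(a, b)
            for b in C:
                if b in S:
                    continue
                T = (S - {a}) | {b}
                c = cost(T, clients, penalties)
                if c < best_cost:
                    best, best_cost = T, c
        if best is not None:
            S, improved = best, True
    P = {j for j in clients if penalties[j] < min(sq_dist(j, s) for s in S)}
    return S, P
```

On a toy instance with one distant outlier and a small penalty, the outlier ends up in the penalized set P while the two clusters each receive a center.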
2.3 Analysis

To proceed with the analysis, we need the following technical lemma.

Lemma 3 Let S and O be a local optimal solution and a global optimal solution to the k-MPWP, respectively. Then
\[
\sum_{j \in D\setminus(P \cup P^*)} \sqrt{S_j}\,\sqrt{O_j}
\le \sqrt{\sum_{j \in D\setminus(P \cup P^*)} S_j}\,\cdot\sqrt{\sum_{j \in D\setminus(P \cup P^*)} O_j}, \tag{2}
\]
\[
\sqrt{\Delta(s_{o_j}, j)} \le \sqrt{S_j} + 2\sqrt{O_j}, \quad \forall j \in D\setminus(P \cup P^*). \tag{3}
\]

Proof Inequality (2) follows immediately from the well-known Cauchy–Schwarz inequality. Now we prove (3). For each j ∈ D\(P ∪ P*), it follows from the triangle inequality for √Δ(·, ·) and the definition of s_o that
\[
\sqrt{\Delta(s_{o_j}, j)}
\le \sqrt{\Delta(s_{o_j}, o_j)} + \sqrt{\Delta(o_j, j)}
\le \sqrt{\Delta(s_j, o_j)} + \sqrt{\Delta(o_j, j)}
\le \sqrt{\Delta(s_j, j)} + 2\sqrt{\Delta(o_j, j)}
= \sqrt{S_j} + 2\sqrt{O_j}.
\]
In order to proceed with the analysis, we introduce a center ô ∈ C associated with each o ∈ O. For each o ∈ O, define
\[
\hat{o} := \arg\min_{c \in C} \Delta(c, N_O(o)). \tag{4}
\]
From Definition 2 and (4), we have
\[
\sum_{j \in N_O(o)} \Delta(\hat{o}, j) = \Delta(\hat{o}, N_O(o))
= \min_{c \in C} \Delta(c, N_O(o))
\le (1 + \hat{\varepsilon}) \min_{c \in \mathbb{R}^d} \Delta(c, N_O(o))
= (1 + \hat{\varepsilon}) \Delta(o, N_O(o))
= (1 + \hat{\varepsilon}) \sum_{j \in N_O(o)} \Delta(o, j). \tag{5}
\]
We remark that the idea of choosing ô satisfying inequality (5) is essentially due to Ward (2017). We now estimate the cost of S in the following theorem.
Theorem 4 For any given ε > 0, ALG1(ε) produces a local optimal solution S satisfying
\[
C_s + C_p \le (81 + \varepsilon)\, C_s^* + 18\, C_p^*.
\]

Proof If s ∈ S captures some center o ∈ O, we call s a bad center; otherwise, we call s a good center. All the bad centers constitute a subset Bad(S). Denote m := |Bad(S)| and list Bad(S) = {s₁, …, s_m}. For each i ∈ {1, …, m}, let S_i := {s_i} and O_i := {o ∈ O | o is captured by s_i}. For each i, arbitrarily add |O_i| − 1 good centers from S\∪_{i=1,…,m} S_i to S_i. Thus, we partition S and O into two collections of groups S₁, S₂, …, S_m and O₁, O₂, …, O_m with |S_i| = |O_i| for each i ∈ {1, 2, …, m}. We construct (s, o) pairs as follows (cf. Fig. 1).

Case 1 For each i with |S_i| = |O_i| = 1, we construct the pair (s_i, o_i) with S_i = {s_i} and O_i = {o_i}.
Case 2 For each i with |S_i| = |O_i| = m_i ≥ 2, denote S_i := {s_i, s_{i2}, …, s_{im_i}} and O_i := {o_{i1}, o_{i2}, …, o_{im_i}}. We construct the pairs (s_{i2}, o_{i1}), (s_{i2}, o_{i2}), (s_{i3}, o_{i3}), …, (s_{im_i}, o_{im_i}).

Fig. 1 Partition of S and O

We consider each constructed pair (s, o) in the following three cases.

Case 1 ô ∉ S. Consider the operation swap(s, ô). We have the corresponding inequality as follows (cf. Fig. 2):

Fig. 2 Partition of N_S(s) ∪ N_O(o)

\[
0 \le \mathrm{cost}(S\setminus\{s\} \cup \{\hat{o}\}) - \mathrm{cost}(S)
\]
\[
\le \sum_{j \in N_S(s) \cap P^*} (p_j - S_j)
+ \sum_{j \in N_S(s) \setminus (N_O(o) \cup P^*)} \bigl(\Delta(s_{o_j}, j) - S_j\bigr)
+ \sum_{j \in N_O(o) \setminus P} \bigl(\Delta(\hat{o}, j) - S_j\bigr)
+ \sum_{j \in N_O(o) \cap P} \bigl(\Delta(\hat{o}, j) - p_j\bigr)
\]
\[
\le \sum_{j \in N_S(s) \cap P^*} (p_j - S_j)
+ \sum_{j \in N_S(s) \setminus (N_O(o) \cup P^*)} \bigl(\Delta(s_{o_j}, j) - S_j\bigr)
+ \sum_{j \in N_O(o) \setminus P} \bigl((1 + \hat{\varepsilon})\Delta(o, j) - S_j\bigr)
+ \sum_{j \in N_O(o) \cap P} \bigl((1 + \hat{\varepsilon})\Delta(o, j) - p_j\bigr)
\]
\[
\le \sum_{j \in N_S(s) \cap P^*} (p_j - S_j)
+ \sum_{j \in N_S(s) \setminus P^*} \bigl(\Delta(s_{o_j}, j) - S_j\bigr)
+ \sum_{j \in N_O(o) \setminus P} \bigl((1 + \hat{\varepsilon})O_j - S_j\bigr)
+ \sum_{j \in N_O(o) \cap P} \bigl((1 + \hat{\varepsilon})O_j - p_j\bigr), \tag{6}
\]
where the second inequality follows from (5), and the third holds since Δ(o, j) = O_j for j ∈ N_O(o) and since Δ(s_{o_j}, j) − S_j ≥ 0 allows enlarging the second sum to N_S(s)\P*.
Case 2 ô ∈ S and ô = s. We have S\{s} ∪ {ô} = S, so cost(S\{s} ∪ {ô}) = cost(S), which implies (6).
Case 3 ô ∈ S and ô ≠ s. We have S\{s} ∪ {ô} = S\{s}, so cost(S\{s} ∪ {ô}) = cost(S\{s}) ≥ cost(S), which implies (6).

Summing over all the (s, o) pairs, and noting that each s ∈ S appears in at most two pairs while each o ∈ O appears in exactly one, we have
\[
0 \le 2\sum_{s \in S} \sum_{j \in N_S(s) \cap P^*} (p_j - S_j)
+ 2\sum_{s \in S} \sum_{j \in N_S(s) \setminus P^*} \bigl(\Delta(s_{o_j}, j) - S_j\bigr)
+ \sum_{o \in O} \sum_{j \in N_O(o) \setminus P} \bigl((1 + \hat{\varepsilon})O_j - S_j\bigr)
+ \sum_{o \in O} \sum_{j \in N_O(o) \cap P} \bigl((1 + \hat{\varepsilon})O_j - p_j\bigr)
\]
\[
= 2\sum_{j \in (D\setminus P) \cap P^*} (p_j - S_j)
+ 2\sum_{j \in D\setminus(P \cup P^*)} \bigl(\Delta(s_{o_j}, j) - S_j\bigr)
+ \sum_{j \in D\setminus(P \cup P^*)} \bigl((1 + \hat{\varepsilon})O_j - S_j\bigr)
+ \sum_{j \in (D\setminus P^*) \cap P} \bigl((1 + \hat{\varepsilon})O_j - p_j\bigr)
+ \sum_{j \in P^* \cap P} (p_j - p_j).
\]
Combining with Lemma 3 — using (3) in the squared form Δ(s_{o_j}, j) ≤ S_j + 4√(S_j)√(O_j) + 4O_j — we have
\[
0 \le 2\sum_{j \in (D\setminus P) \cap P^*} (p_j - S_j)
+ 8\sum_{j \in D\setminus(P \cup P^*)} \sqrt{S_j}\,\sqrt{O_j}
+ 8\sum_{j \in D\setminus(P \cup P^*)} O_j
+ \sum_{j \in D\setminus(P \cup P^*)} \bigl((1 + \hat{\varepsilon})O_j - S_j\bigr)
+ \sum_{j \in (D\setminus P^*) \cap P} \bigl((1 + \hat{\varepsilon})O_j - p_j\bigr)
+ \sum_{j \in P^* \cap P} (p_j - p_j)
\]
\[
\le (9 + \hat{\varepsilon})\sum_{j \in D\setminus P^*} O_j
- \sum_{j \in D\setminus P} S_j
+ 8\sqrt{\sum_{j \in D\setminus P} S_j}\,\sqrt{\sum_{j \in D\setminus P^*} O_j}
+ 2\sum_{j \in P^*} p_j - \sum_{j \in P} p_j
\]
\[
= (9 + \hat{\varepsilon})C_s^* - C_s + 8\sqrt{C_s^*\, C_s} + 2C_p^* - C_p
\]
\[
\le \bigl((9 + \hat{\varepsilon})C_s^* + 2C_p^*\bigr) - (C_s + C_p)
+ \frac{8}{\sqrt{9 + \hat{\varepsilon}}}\sqrt{\bigl((9 + \hat{\varepsilon})C_s^* + 2C_p^*\bigr)(C_s + C_p)}
\]
\[
= \frac{1}{\sqrt{9 + \hat{\varepsilon}}}
\left(\sqrt{\frac{9 + \hat{\varepsilon}}{\sqrt{25 + \hat{\varepsilon}} - 4}}\,\sqrt{(9 + \hat{\varepsilon})C_s^* + 2C_p^*}
- \sqrt{\sqrt{25 + \hat{\varepsilon}} - 4}\,\sqrt{C_s + C_p}\right)
\times
\left(\sqrt{\sqrt{25 + \hat{\varepsilon}} - 4}\,\sqrt{(9 + \hat{\varepsilon})C_s^* + 2C_p^*}
+ \sqrt{\frac{9 + \hat{\varepsilon}}{\sqrt{25 + \hat{\varepsilon}} - 4}}\,\sqrt{C_s + C_p}\right). \tag{7}
\]
Here the second inequality uses (2), the bound 2(p_j − S_j) ≤ 2p_j − S_j, the partitions D\P = (D\(P ∪ P*)) ∪ ((D\P) ∩ P*) and D\P* = (D\(P ∪ P*)) ∪ ((D\P*) ∩ P), and (1 + ε̂) ≤ (9 + ε̂); the last equality can be checked by expanding the product, using (√(25 + ε̂) − 4)(√(25 + ε̂) + 4) = 9 + ε̂.
Since
\[
\sqrt{\sqrt{25 + \hat{\varepsilon}} - 4}\,\sqrt{(9 + \hat{\varepsilon})C_s^* + 2C_p^*}
+ \sqrt{\frac{9 + \hat{\varepsilon}}{\sqrt{25 + \hat{\varepsilon}} - 4}}\,\sqrt{C_s + C_p} \ge 0,
\]
together with (7), we have
\[
\sqrt{\sqrt{25 + \hat{\varepsilon}} - 4}\,\sqrt{C_s + C_p}
\le \sqrt{\frac{9 + \hat{\varepsilon}}{\sqrt{25 + \hat{\varepsilon}} - 4}}\,\sqrt{(9 + \hat{\varepsilon})C_s^* + 2C_p^*},
\]
which indicates
\[
C_s + C_p \le \left(\frac{\sqrt{9 + \hat{\varepsilon}}}{\sqrt{25 + \hat{\varepsilon}} - 4}\right)^2 \bigl((9 + \hat{\varepsilon})C_s^* + 2C_p^*\bigr)
= \frac{\bigl(\sqrt{25 + \hat{\varepsilon}} + 4\bigr)^2}{9 + \hat{\varepsilon}} \bigl((9 + \hat{\varepsilon})C_s^* + 2C_p^*\bigr)
\]
\[
= \bigl(\sqrt{25 + \hat{\varepsilon}} + 4\bigr)^2 C_s^*
+ \frac{2\bigl(\sqrt{25 + \hat{\varepsilon}} + 4\bigr)^2}{9 + \hat{\varepsilon}} C_p^*
= \Bigl(81 + \hat{\varepsilon} + 8\bigl(\sqrt{25 + \hat{\varepsilon}} - 5\bigr)\Bigr) C_s^*
+ 2\left(9 - \frac{8\bigl(5 + \hat{\varepsilon} - \sqrt{25 + \hat{\varepsilon}}\bigr)}{9 + \hat{\varepsilon}}\right) C_p^*
\]
\[
\le \Bigl(81 + \hat{\varepsilon} + 8\bigl(\sqrt{25 + \hat{\varepsilon}} - 5\bigr)\Bigr) C_s^* + 18\, C_p^*, \tag{8}
\]
where the first equality uses 9 + ε̂ = (√(25 + ε̂) − 4)(√(25 + ε̂) + 4) and the last inequality uses 5 + ε̂ ≥ √(25 + ε̂).

Recalling the definition of ε̂ in Step 0 of ALG1(ε), we have
\[
\varepsilon = \hat{\varepsilon} + 8\bigl(\sqrt{25 + \hat{\varepsilon}} - 5\bigr).
\]
Combining with (8), we have C_s + C_p ≤ (81 + ε)C_s* + 18C_p*.
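As a numeric sanity check of this relation (not part of the paper), the closed form ε̂ = 72 + ε − √(72² + 64ε) from Step 0 of ALG1(ε) can be verified to invert ε = ε̂ + 8(√(25 + ε̂) − 5):

```python
import math

def eps_hat(eps):
    # Step 0 of ALG1: closed-form solution of eps = eps_hat + 8*(sqrt(25 + eps_hat) - 5).
    return 72 + eps - math.sqrt(72 ** 2 + 64 * eps)

for eps in (0.001, 0.1, 1.0, 10.0):
    eh = eps_hat(eps)
    recovered = eh + 8 * (math.sqrt(25 + eh) - 5)
    assert abs(recovered - eps) < 1e-9  # the relation in the text holds
    assert 0 < eh < eps                 # eps_hat lies strictly between 0 and eps
```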
Using the standard technique of Arya et al. (2004) and Charikar and Guha (1999), we can obtain a polynomial-time local search algorithm that loses only an arbitrarily small additional ε > 0 in the approximation ratio.
3 Improved local search (25 + ε)-approximation algorithm

For any feasible solution S, we define the multi-swap operation swap(A, B) as follows. We are given two subsets A ⊆ S and B ⊆ C\S with |A| = |B| ≤ p, where p is a fixed integer. All centers in A are deleted from S and, meanwhile, all centers in B are added to S. The neighborhood of S with respect to this operation is
\[
\mathrm{Ngh}_p(S) := \{ S\setminus A \cup B \mid A \subseteq S,\ B \subseteq C\setminus S,\ |A| = |B| \le p \}.
\]
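The p-swap neighborhood can be enumerated directly; the small sketch below is ours, not the paper's (note that |Ngh_p(S)| grows as O(|S|^p |C|^p), which is polynomial only for fixed p):

```python
from itertools import combinations

def ngh_p(S, C, p):
    """Enumerate Ngh_p(S): all solutions obtained by deleting a subset
    A of S and adding a subset B of C\\S with |A| = |B| <= p."""
    S = set(S)
    outside = [b for b in C if b not in S]
    for q in range(1, p + 1):
        for A in combinations(S, q):
            for B in combinations(outside, q):
                yield (S - set(A)) | set(B)

# With |S| = 3 and |C \ S| = 2, there are 3*2 + 3*1 = 9 neighbors for p = 2.
neighbors = list(ngh_p({1, 2, 3}, [1, 2, 3, 4, 5], 2))
assert len(neighbors) == 9
```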
For any given ε > 0, we present our multi-swap local search algorithm as follows.

Algorithm 2 (ALG2(ε))
Step 0 (Initialization) Set
\[
p := \left\lceil \frac{5}{\sqrt{25 + \varepsilon} - 5} \right\rceil,
\qquad
\hat{\varepsilon} := \frac{6}{p} + \frac{5}{p^2}.
\]
Construct an ε̂-approximate centroid set C. Arbitrarily choose a feasible solution S from C.
Step 1 (Local search) Compute
\[
S_{\min} := \arg\min_{S' \in \mathrm{Ngh}_p(S)} \mathrm{cost}(S').
\]
Step 2 (Stop criterion) If cost(S_min) ≥ cost(S), output S along with the corresponding penalized client set P = {j ∈ D | p_j < min_{s∈S} Δ(j, s)}. Otherwise, set S := S_min and go to Step 1.

Theorem 5 For any given ε > 0, ALG2(ε) produces a local optimal solution S satisfying
\[
C_s + C_p \le (25 + \varepsilon)\, C_s^* + \left(5 + \frac{\varepsilon}{5}\right) C_p^*.
\]

Proof Recall the partitions of S and O in the proof of Theorem 4. We construct (S_i, O_i) or (s, o) pairs as follows (cf. Fig. 3 for p = 3).
Fig. 3 Multi-swap
Case 1 For each i with |S_i| = |O_i| ≤ p, we construct the pair (S_i, O_i).
Case 2 For each i with |S_i| = |O_i| = m_i > p, we construct (m_i − 1)m_i pairs (s, o) with s ∈ S_i\{s_i} and o ∈ O_i.

Let us further consider the above two cases along with the corresponding multi-swaps as follows. Denote Ô_i := {ô | o ∈ O_i}.

Case 1 For each i with |S_i| = |O_i| ≤ p, we consider the pair (S_i, O_i) in the following three cases.

Case 1.1 Ô_i ⊆ C\S. Consider swap(S_i, Ô_i). We have
\[
0 \le \mathrm{cost}(S\setminus S_i \cup \hat{O}_i) - \mathrm{cost}(S)
\]
\[
\le \sum_{s \in S_i} \sum_{j \in N_S(s) \cap P^*} (p_j - S_j)
+ \sum_{s \in S_i} \sum_{j \in N_S(s) \setminus (\bigcup_{o \in O_i} N_O(o) \cup P^*)} \bigl(\Delta(s_{o_j}, j) - S_j\bigr)
+ \sum_{o \in O_i} \sum_{j \in N_O(o) \setminus P} \bigl(\Delta(\hat{o}, j) - S_j\bigr)
+ \sum_{o \in O_i} \sum_{j \in N_O(o) \cap P} \bigl(\Delta(\hat{o}, j) - p_j\bigr)
\]
\[
\le \sum_{s \in S_i} \sum_{j \in N_S(s) \cap P^*} (p_j - S_j)
+ \sum_{s \in S_i} \sum_{j \in N_S(s) \setminus (\bigcup_{o \in O_i} N_O(o) \cup P^*)} \bigl(\Delta(s_{o_j}, j) - S_j\bigr)
+ \sum_{o \in O_i} \sum_{j \in N_O(o) \setminus P} \bigl((1 + \hat{\varepsilon})\Delta(o, j) - S_j\bigr)
+ \sum_{o \in O_i} \sum_{j \in N_O(o) \cap P} \bigl((1 + \hat{\varepsilon})\Delta(o, j) - p_j\bigr)
\]
\[
\le \sum_{s \in S_i} \sum_{j \in N_S(s) \cap P^*} (p_j - S_j)
+ \sum_{s \in S_i} \sum_{j \in N_S(s) \setminus P^*} \bigl(\Delta(s_{o_j}, j) - S_j\bigr)
+ \sum_{o \in O_i} \sum_{j \in N_O(o) \setminus P} \bigl((1 + \hat{\varepsilon})O_j - S_j\bigr)
+ \sum_{o \in O_i} \sum_{j \in N_O(o) \cap P} \bigl((1 + \hat{\varepsilon})O_j - p_j\bigr), \tag{9}
\]
where the second inequality follows from (5), and the third from Δ(o, j) = O_j for j ∈ N_O(o) together with Δ(s_{o_j}, j) − S_j ≥ 0.

Case 1.2 Ô_i ⊆ S. We have S\S_i ∪ Ô_i ⊆ S, so cost(S\S_i ∪ Ô_i) ≥ cost(S), which implies (9).

Case 1.3 Ô_i ∩ S ≠ ∅ and Ô_i\S ≠ ∅. Since
\[
|S_i \setminus \hat{O}_i| = |S_i| - |S_i \cap \hat{O}_i| = |\hat{O}_i| - |S_i \cap \hat{O}_i|
\ge |\hat{O}_i| - |S \cap \hat{O}_i| = |\hat{O}_i \setminus S|,
\]
we have
\[
\mathrm{cost}(S) \le \mathrm{cost}\bigl(S \setminus (S_i \setminus \hat{O}_i) \cup (\hat{O}_i \setminus S)\bigr)
= \mathrm{cost}\bigl(S \setminus S_i \cup \hat{O}_i\bigr),
\]
which implies (9).
Case 2 For each i with |S_i| = |O_i| = m_i > p, we consider swap(s, ô) for each pair (s, o) ∈ (S_i\{s_i}) × O_i, and obtain inequality (6).

Summing inequality (9) with weight 1 for each pair (S_i, O_i) with |S_i| = |O_i| ≤ p, and inequality (6) with weight 1/(m_i − 1) for each pair (s, o) ∈ (S_i\{s_i}) × O_i with |S_i| = |O_i| = m_i > p, together with the inequality m_i/(m_i − 1) ≤ (p + 1)/p and Lemma 3, we have
\[
0 \le \left(1 + \frac{1}{p}\right)\sum_{s \in S} \sum_{j \in N_S(s) \cap P^*} (p_j - S_j)
+ \left(1 + \frac{1}{p}\right)\sum_{s \in S} \sum_{j \in N_S(s) \setminus P^*} \bigl(\Delta(s_{o_j}, j) - S_j\bigr)
+ \sum_{o \in O} \sum_{j \in N_O(o) \setminus P} \bigl((1 + \hat{\varepsilon})O_j - S_j\bigr)
+ \sum_{o \in O} \sum_{j \in N_O(o) \cap P} \bigl((1 + \hat{\varepsilon})O_j - p_j\bigr)
\]
\[
\le \left(1 + \frac{1}{p}\right)\sum_{j \in (D\setminus P) \cap P^*} (p_j - S_j)
+ 4\left(1 + \frac{1}{p}\right)\sum_{j \in D\setminus(P \cup P^*)} \Bigl(\sqrt{S_j}\,\sqrt{O_j} + O_j\Bigr)
+ \sum_{j \in D\setminus(P \cup P^*)} \bigl((1 + \hat{\varepsilon})O_j - S_j\bigr)
+ \sum_{j \in (D\setminus P^*) \cap P} \bigl((1 + \hat{\varepsilon})O_j - p_j\bigr)
+ \sum_{j \in P^* \cap P} (p_j - p_j)
\]
\[
\le \left(5 + \frac{4}{p} + \hat{\varepsilon}\right)\sum_{j \in D\setminus P^*} O_j
- \sum_{j \in D\setminus P} S_j
+ 4\left(1 + \frac{1}{p}\right)\sqrt{\sum_{j \in D\setminus P} S_j}\,\sqrt{\sum_{j \in D\setminus P^*} O_j}
+ \left(1 + \frac{1}{p}\right)\sum_{j \in P^*} p_j - \sum_{j \in P} p_j
\]
\[
= \left(5 + \frac{4}{p} + \hat{\varepsilon}\right)C_s^* - C_s
+ 4\left(1 + \frac{1}{p}\right)\sqrt{C_s^*\, C_s}
+ \left(1 + \frac{1}{p}\right)C_p^* - C_p
\]
\[
\le \left(5 + \frac{4}{p} + \hat{\varepsilon}\right)C_s^* + \left(1 + \frac{1}{p}\right)C_p^* - (C_s + C_p)
+ \frac{4\left(1 + \frac{1}{p}\right)}{\sqrt{5 + \frac{4}{p} + \hat{\varepsilon}}}
\sqrt{\left(\left(5 + \frac{4}{p} + \hat{\varepsilon}\right)C_s^* + \left(1 + \frac{1}{p}\right)C_p^*\right)(C_s + C_p)}. \tag{10}
\]
Here the second inequality uses (3) in the squared form Δ(s_{o_j}, j) ≤ S_j + 4√(S_j)√(O_j) + 4O_j, and the third uses (2) together with the same partitions of D\P and D\P* as in the proof of Theorem 4.
Denote
\[
t(p, \hat{\varepsilon}) := \sqrt{9 + \frac{12}{p} + \frac{4}{p^2} + \hat{\varepsilon}} - 2\left(1 + \frac{1}{p}\right). \tag{11}
\]
Using the factorization technique, together with (10)–(11), we have
\[
0 \le \frac{1}{5 + \frac{4}{p} + \hat{\varepsilon}}
\left(\sqrt{\frac{5 + \frac{4}{p} + \hat{\varepsilon}}{t(p, \hat{\varepsilon})}}\sqrt{\left(5 + \frac{4}{p} + \hat{\varepsilon}\right)C_s^* + \left(1 + \frac{1}{p}\right)C_p^*}
- \sqrt{t(p, \hat{\varepsilon})}\,\sqrt{C_s + C_p}\right)
\times
\left(\sqrt{t(p, \hat{\varepsilon})}\,\sqrt{\left(5 + \frac{4}{p} + \hat{\varepsilon}\right)C_s^* + \left(1 + \frac{1}{p}\right)C_p^*}
+ \sqrt{\frac{5 + \frac{4}{p} + \hat{\varepsilon}}{t(p, \hat{\varepsilon})}}\sqrt{C_s + C_p}\right).
\]
Since
\[
\sqrt{t(p, \hat{\varepsilon})}\,\sqrt{\left(5 + \frac{4}{p} + \hat{\varepsilon}\right)C_s^* + \left(1 + \frac{1}{p}\right)C_p^*}
+ \sqrt{\frac{5 + \frac{4}{p} + \hat{\varepsilon}}{t(p, \hat{\varepsilon})}}\sqrt{C_s + C_p} \ge 0,
\]
we have
\[
\sqrt{t(p, \hat{\varepsilon})}\,\sqrt{C_s + C_p}
\le \sqrt{\frac{5 + \frac{4}{p} + \hat{\varepsilon}}{t(p, \hat{\varepsilon})}}\sqrt{\left(5 + \frac{4}{p} + \hat{\varepsilon}\right)C_s^* + \left(1 + \frac{1}{p}\right)C_p^*},
\]
which indicates
\[
C_s + C_p
\le \frac{5 + \frac{4}{p} + \hat{\varepsilon}}{t(p, \hat{\varepsilon})^2}
\left(\left(5 + \frac{4}{p} + \hat{\varepsilon}\right)C_s^* + \left(1 + \frac{1}{p}\right)C_p^*\right)
= \left(\sqrt{9 + \frac{12}{p} + \frac{4}{p^2} + \hat{\varepsilon}} + 2\left(1 + \frac{1}{p}\right)\right)^2 C_s^*
+ \frac{\left(1 + \frac{1}{p}\right)\left(\sqrt{9 + \frac{12}{p} + \frac{4}{p^2} + \hat{\varepsilon}} + 2\left(1 + \frac{1}{p}\right)\right)^2}{5 + \frac{4}{p} + \hat{\varepsilon}}\, C_p^*, \tag{12}
\]
where we used
\[
\frac{5 + \frac{4}{p} + \hat{\varepsilon}}{t(p, \hat{\varepsilon})}
= \sqrt{9 + \frac{12}{p} + \frac{4}{p^2} + \hat{\varepsilon}} + 2\left(1 + \frac{1}{p}\right),
\]
which follows from (11) and the identity 5 + 4/p + ε̂ = (9 + 12/p + 4/p² + ε̂) − 4(1 + 1/p)².
From the definitions of ε̂ and p, we have
\[
\left(\sqrt{9 + \frac{12}{p} + \frac{4}{p^2} + \hat{\varepsilon}} + 2\left(1 + \frac{1}{p}\right)\right)^2
= \left(\sqrt{9 + \frac{18}{p} + \frac{9}{p^2}} + 2\left(1 + \frac{1}{p}\right)\right)^2
= \left(3\left(1 + \frac{1}{p}\right) + 2\left(1 + \frac{1}{p}\right)\right)^2
= 25\left(1 + \frac{1}{p}\right)^2
\le 25\left(1 + \frac{\sqrt{25 + \varepsilon} - 5}{5}\right)^2
= 25 + \varepsilon, \tag{13}
\]
and
\[
\frac{\left(1 + \frac{1}{p}\right)\left(\sqrt{9 + \frac{12}{p} + \frac{4}{p^2} + \hat{\varepsilon}} + 2\left(1 + \frac{1}{p}\right)\right)^2}{5 + \frac{4}{p} + \hat{\varepsilon}}
= \frac{\left(1 + \frac{1}{p}\right) \cdot 25\left(1 + \frac{1}{p}\right)^2}{5\left(1 + \frac{1}{p}\right)^2}
= 5\left(1 + \frac{1}{p}\right)
\le 5 + \frac{\varepsilon}{5}, \tag{14}
\]
where (14) uses 5 + 4/p + ε̂ = 5(1 + 1/p)², and the last inequality holds since p ≥ 25/ε (indeed, p ≥ 5/(√(25 + ε) − 5) = 5(√(25 + ε) + 5)/ε ≥ 50/ε). From (12)–(14), we complete the proof.
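Again as a numeric check (not part of the paper) that the choices of p and ε̂ in Step 0 of ALG2(ε) give the coefficients claimed in (13) and (14):

```python
import math

def alg2_params(eps):
    # Step 0 of ALG2: p = ceil(5 / (sqrt(25 + eps) - 5)), eps_hat = 6/p + 5/p^2.
    p = math.ceil(5 / (math.sqrt(25 + eps) - 5))
    return p, 6 / p + 5 / p ** 2

for eps in (0.01, 0.5, 2.0):
    p, eh = alg2_params(eps)
    lhs = (math.sqrt(9 + 12 / p + 4 / p ** 2 + eh) + 2 * (1 + 1 / p)) ** 2
    assert abs(lhs - 25 * (1 + 1 / p) ** 2) < 1e-9  # the simplification in (13)
    assert lhs <= 25 + eps + 1e-12                  # Cs* coefficient, (13)
    assert 5 * (1 + 1 / p) <= 5 + eps / 5           # Cp* coefficient, (14)
```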
4 Discussions

In this paper, we study the k-MPWP and present (81 + ε)- and (25 + ε)-approximation algorithms using single-swap and multi-swap operations, respectively. Since there is a local search (9 + ε)-approximation algorithm for the classic k-means problem (Kanungo et al. 2004), it is natural to ask whether our (25 + ε)-approximation can be further improved. Another research direction is to design approximation algorithms for this problem using LP rounding or primal-dual techniques.

Acknowledgements The research of the first author is supported by the Higher Educational Science and Technology Program of Shandong Province (No. J15LN23) and the Science and Technology Development Plan Project of Jinan City (No. 201401211). The second author is supported by the Ri-Xin Talents Project of Beijing University of Technology. The third author is supported by the Natural Science Foundation of China (No. 11501412). The fourth author is supported by the Natural Science Foundation of China (No. 11531014). The fifth author is supported by Beijing Excellent Talents Funding (No. 2014000020124G046).
References

Ahmadian S, Norouzi-Fard A, Svensson O, Ward J (2017) Better guarantees for k-means and Euclidean k-median by primal-dual algorithms. In: Proceedings of FOCS, pp 61–72
Aloise D, Deshpande A, Hansen P, Popat P (2009) NP-hardness of Euclidean sum-of-squares clustering. Mach Learn 75:245–249
Arya V, Garg N, Khandekar R, Meyerson A, Munagala K, Pandit V (2004) Local search heuristics for k-median and facility location problems. SIAM J Comput 33:544–562
Bandyapadhyay S, Varadarajan K (2016) On variants of k-means clustering. In: Proceedings of SoCG, article no. 14, pp 14:1–14:15
Byrka J, Pensyl T, Rybicki B, Srinivasan A, Trinh K (2017) An improved approximation for k-median, and positive correlation in budgeted optimization. ACM Trans Algorithms 13(2), article no. 23
Charikar M, Guha S (1999) Improved combinatorial algorithms for the facility location and k-median problems. In: Proceedings of FOCS, pp 378–388
Charikar M, Guha S, Tardos É, Shmoys DB (1999) A constant-factor approximation algorithm for the k-median problem. In: Proceedings of STOC, pp 1–10
Charikar M, Khuller S, Mount DM, Narasimhan G (2001) Algorithms for facility location problems with outliers. In: Proceedings of SODA, pp 642–651
Dasgupta S (2007) The hardness of k-means clustering. Technical report CS2007-0890, University of California, San Diego
Georgogiannis A (2016) Robust k-means: a theoretical revisit. In: Proceedings of NIPS, pp 2883–2891
Jain K, Vazirani VV (2001) Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation. J ACM 48:274–296
Kanungo T, Mount DM, Netanyahu NS, Piatko CD, Silverman R, Wu AY (2004) A local search approximation algorithm for k-means clustering. Comput Geom Theory Appl 28:89–112
Li Y, Du D, Xiu N, Xu D (2015) Improved approximation algorithms for the facility location problems with linear/submodular penalties. Algorithmica 73:460–482
Li S, Svensson O (2016) Approximating k-median via pseudo-approximation. SIAM J Comput 45:530–547
Lloyd S (1957) Least squares quantization in PCM. Technical report, Bell Laboratories
Lloyd S (1982) Least squares quantization in PCM. IEEE Trans Inf Theory 28:129–137
Mahajan M, Nimbhorkar P, Varadarajan K (2009) The planar k-means problem is NP-hard. In: Proceedings of WALCOM, pp 274–285
Makarychev K, Makarychev Y, Sviridenko M, Ward J (2016) A bi-criteria approximation algorithm for k-means. In: Proceedings of APPROX/RANDOM, article no. 14, pp 14:1–14:20
Matoušek J (2000) On approximate geometric k-clustering. Discrete Comput Geom 24:61–84
Tseng GC (2007) Penalized and weighted k-means for clustering with scattered objects and prior information in high-throughput biological data. Bioinformatics 23:2247–2255
Wang Y, Xu D, Du D, Wu C. An approximation algorithm for k-facility location problem with linear penalties using local search scheme. J Comb Optim. https://doi.org/10.1007/s10878-016-0080-2
Ward J (2017) Private communication
Zhang P (2007) A new approximation algorithm for the k-facility location problem. Theor Comput Sci 384:126–135