1 Introduction
Clustering is a central problem in computer science, with many applications in data science, machine learning, etc. One of the most famous and best-studied problems in this area is Euclidean $k$-Means: given a set $P$ of $n$ points (or demands) in $\mathbb{R}^d$ and an integer $k$, select $k$ points (centers) $C$ so as to minimize $\sum_{p\in P} d(p,C)^2$. Here $d(p,q)$ denotes the Euclidean distance between points $p$ and $q$ and, for a set of points $Q$, $d(p,Q) := \min_{q\in Q} d(p,q)$. In other words, we wish to select $k$ centers so as to minimize the sum of the squared Euclidean distances between each demand and the closest center. Equivalently, a feasible solution is given by a partition of the demands into $k$ subsets (clusters) $C_1,\dots,C_k$. The cost of a cluster $C_i$ is $cost(C_i) := \sum_{p\in C_i} d(p,\mu_i)^2$, where $\mu_i := \frac{1}{|C_i|}\sum_{p\in C_i} p$ is the center of mass of $C_i$. We recall that $cost(C_i)$ can also be expressed as $\frac{1}{2|C_i|}\sum_{p,q\in C_i} d(p,q)^2$. Our goal is to minimize the total cost of these clusters.

Euclidean $k$-Means is well studied in terms of approximation algorithms. It is known to be APX-hard: it is hard to approximate $k$-Means below a factor $1.0013$ in polynomial time unless $P=NP$ [6, 16], and this hardness was improved to $1.07$ under the Unique Games Conjecture [9]. Some heuristics are known to perform very well in practice; however, their approximation factor is $\Omega(\log k)$ or worse on general instances [3, 4, 17, 21]. Constant-factor approximation algorithms are known. A local-search algorithm by Kanungo et al. [15] provides a $(9+\varepsilon)$-approximation (throughout this paper, by $\varepsilon$ we mean an arbitrarily small positive constant, and w.l.o.g. we assume $\varepsilon \leq 1$). The authors also show that natural local-search-based algorithms cannot perform better than this. This ratio was improved to $6.357$ by Ahmadian et al. [1, 2] using a primal-dual approach. They also prove a $(9+\varepsilon)$-approximation for general (possibly non-Euclidean) metrics. Better approximation factors are known under reasonable restrictions on the input [5, 7, 10, 20]. A PTAS is known for constant $k$ [19] or for constant dimension $d$ [10, 12]. Notice that $d$ can always be assumed to be $O(\varepsilon^{-2}\log n)$ by a standard application of the Johnson-Lindenstrauss transform [14]. This was recently improved to $O(\varepsilon^{-6}(\log k + \log\log n))$ [8] and finally to $O(\varepsilon^{-2}\log(k/\varepsilon))$ [18].

In this paper we describe a simple modification of the analysis of Ahmadian et al. [2] which leads to a slightly improved approximation for Euclidean $k$-Means (see Section 2).
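As a quick numerical sanity check of the cluster-cost identity recalled above ($cost(C) = \sum_{p\in C} d(p,\mu)^2 = \frac{1}{2|C|}\sum_{p,q\in C} d(p,q)^2$), the following small Python snippet (our illustration, not part of the paper) compares the two expressions on a random cluster:

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def cost_centroid(C):
    # sum of squared distances to the center of mass
    mu = [sum(p[k] for p in C) / len(C) for k in range(len(C[0]))]
    return sum(sq_dist(p, mu) for p in C)

def cost_pairwise(C):
    # (1 / (2|C|)) * sum of squared distances over ordered pairs
    return sum(sq_dist(p, q) for p in C for q in C) / (2 * len(C))

random.seed(0)
C = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(7)]
assert abs(cost_centroid(C) - cost_pairwise(C)) < 1e-9
```

The pairwise form is the one used in the lower-bound computations of Section 3, since it avoids computing centroids explicitly.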
Theorem 1.
There exists a deterministic polynomial-time algorithm for Euclidean $k$-Means with approximation ratio $\rho + \varepsilon$ for any positive constant $\varepsilon$, where $\rho = (1+\sqrt{\delta})^2 \approx 6.12903$ and $\delta \approx 2.1777$ is the solution of $\frac{\delta}{2(\delta-2)} = (1+\sqrt{\delta})^2$.
The above approximation ratio is w.r.t. the optimal fractional solution to a standard LP relaxation for the problem (defined later). As a side result (see Section 3), we prove a lower bound on the integrality gap of this relaxation (we are not aware of any explicit such lower bound in the literature).
Theorem 2.
The integrality gap of the standard LP relaxation of $k$-Means (defined in Section 1.1), even in the Euclidean plane (i.e., for $d = 2$), is at least $1 + \frac{2\varphi}{15} > 1.2157$, where $\varphi = \frac{1+\sqrt{5}}{2}$ is the golden ratio.
1.1 Preliminaries
As mentioned earlier, one can formulate Euclidean $k$-Means in terms of the selection of $k$ centers. In this case, it is convenient to discretize the possible choices for the centers, hence obtaining a polynomial-size set $F$ of candidate centers, at the cost of an extra $(1+\varepsilon)$ factor in the approximation ratio (we will neglect this factor in the approximation ratios since it is absorbed by analogous factors in the rest of the analysis). In particular, we will use the construction in [11] (Lemma 24), which chooses $F$ as the centers of mass of any collection of up to $O(1/\varepsilon)$ points, with repetitions. In particular, $|F| = n^{O(1/\varepsilon)}$ in this case.
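For intuition, a naive implementation of such a candidate-center set can look as follows (a sketch under our own naming; `m` plays the role of the $O(1/\varepsilon)$ bound on the multiset size):

```python
from itertools import combinations_with_replacement

def candidate_centers(points, m):
    """Centers of mass of all multisets of at most m input points
    (m is a stand-in for the O(1/eps) bound; the names are ours)."""
    centers = set()
    for size in range(1, m + 1):
        for multiset in combinations_with_replacement(points, size):
            d = len(multiset[0])
            centers.add(tuple(sum(p[k] for p in multiset) / size
                              for k in range(d)))
    return centers

pts = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
F = candidate_centers(pts, 2)
assert all(p in F for p in pts)   # singletons give back the points
assert (1.0, 0.0) in F            # midpoint of the first two points
```

The number of candidates is at most $\sum_{s\le m}\binom{n+s-1}{s} = n^{O(m)}$, matching the $n^{O(1/\varepsilon)}$ bound above.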
Let $c(p,q)$ be an abbreviation for $d(p,q)^2$. Then a standard LP relaxation for $k$-Means is as follows:
\[
\begin{array}{lll}
\min\ \sum_{i\in F,\, j\in P} c(i,j)\, x_{ij} & &\\
\text{s.t.}\quad \sum_{i\in F} x_{ij} \geq 1 & \forall j\in P\\
\phantom{\text{s.t.}}\quad x_{ij} \leq y_i & \forall i\in F,\ \forall j\in P\\
\phantom{\text{s.t.}}\quad \sum_{i\in F} y_i \leq k & \\
\phantom{\text{s.t.}}\quad x_{ij},\, y_i \geq 0 & \forall i\in F,\ \forall j\in P
\end{array}
\]
In an integral solution, we interpret $y_i = 1$ as $i$ being a selected center ($i$ is open), and $x_{ij} = 1$ as demand $j$ being assigned to center $i$ (technically each demand is automatically assigned to the closest open center; however, it is convenient to allow also suboptimal assignments in the LP relaxation). The first family of constraints states that each demand has to be assigned to some center, the second one that a demand can only be assigned to an open center, and the third one that we can open at most $k$ centers.
For any parameter $\lambda \geq 0$ (Lagrangian multiplier), the Lagrangian relaxation $\mathrm{LP}(\lambda)$ of the above LP (w.r.t. the last constraint) and its dual $\mathrm{DUAL}(\lambda)$ are as follows:
\[
\mathrm{LP}(\lambda):\quad \min\ \sum_{i\in F,\,j\in P} c(i,j)\,x_{ij} + \lambda\Big(\sum_{i\in F} y_i - k\Big)
\quad\text{s.t.}\quad \sum_{i\in F} x_{ij}\geq 1\ \ \forall j\in P;\quad x_{ij}\leq y_i\ \ \forall i\in F,\, j\in P;\quad x,y\geq 0.
\]
\[
\mathrm{DUAL}(\lambda):\quad \max\ \sum_{j\in P}\alpha_j - \lambda k
\quad\text{s.t.}\quad \sum_{j\in P}\max\{\alpha_j - c(i,j),\,0\}\leq \lambda\ \ \forall i\in F;\quad \alpha_j\geq 0\ \ \forall j\in P. \tag{1}
\]
Above, $\max\{\alpha_j - c(i,j), 0\}$ replaces the dual variable $\beta_{ij}$ corresponding to the second constraint in the primal in the standard formulation of the dual LP. Notice that, by removing the fixed term $-\lambda k$ in the objective functions of $\mathrm{LP}(\lambda)$ and $\mathrm{DUAL}(\lambda)$, one obtains the standard LP relaxation for the Facility Location problem (FL) with uniform facility cost $\lambda$, and its dual.
We say that a $\rho$-approximation algorithm for a FL instance of the above type is Lagrangian Multiplier Preserving (LMP) if it returns a set $S$ of open facilities that satisfies
\[ \sum_{j\in P} c(j,S) \leq \rho\,\big(\mathrm{opt}(\lambda) - \lambda\,|S|\big), \]
where $c(j,S) := \min_{i\in S} c(j,i)$ and $\mathrm{opt}(\lambda)$ is the value of the optimal solution to the FL relaxation above (i.e., to $\mathrm{LP}(\lambda)$ without the fixed term $-\lambda k$).
2 A Refined Approximation for Euclidean k-Means
In this section we present our refined approximation for Euclidean $k$-Means. We start by presenting the LMP approximation algorithm of [2] for the FL instances arising from $k$-Means (Section 2.1). We then present the analysis of that algorithm as in [2] (Section 2.2). In Section 2.3 we describe our refined analysis of the same algorithm. Finally, in Section 2.4 we sketch how to use this refinement to approximate $k$-Means.
2.1 A Primal-Dual LMP Algorithm for Euclidean Facility Location
We consider an instance of Euclidean FL induced by a $k$-Means instance in the mentioned way, for a given Lagrangian multiplier $\lambda$.
We consider exactly the same Lagrangian Multiplier Preserving (LMP) primal-dual algorithm $\mathrm{JV}(\delta)$ as in [2]. In more detail, let $\delta > 2$ be a parameter to be fixed later. The algorithm consists of a dual-growth phase and a pruning phase. The dual-growth phase is exactly as in the classical primal-dual algorithm $\mathrm{JV}$ by Jain and Vazirani [13]. We start with all the dual variables $\alpha_j$ set to $0$ and an empty set $T$ of tentatively open facilities. The clients $j$ such that $\alpha_j \geq c(j,i)$ for some $i\in T$ are frozen, and the other clients are active. We grow the dual variables of active clients at uniform rate until one of the following two events happens. The first event is that some constraint of type (1) becomes tight for a facility $i\notin T$. At that point $i$ is added to $T$, and all active clients $j$ with $\alpha_j \geq c(j,i)$ are set to frozen. The second event is that $\alpha_j = c(j,i)$ for some active client $j$ and some $i\in T$. In that case $j$ is set to frozen. In either case, the facility $i$ that causes $j$ to become frozen is called the witness of $j$. The phase halts when all clients are frozen.
In the pruning phase we will close some facilities in $T$, hence obtaining the final set $S\subseteq T$ of open facilities. Here $\mathrm{JV}(\delta)$ deviates from $\mathrm{JV}$. For each client $j$, let $N(j)$ be the set of facilities $i$ such that $j$ contributed with a positive amount $\alpha_j - c(j,i) > 0$ to the opening of $i$. Symmetrically, for $i\in T$, let $N(i)$ be the clients that contributed with a positive amount to the opening of $i$. For $i\in T$, we let $t_i := \max_{j\in N(i)} \alpha_j$, where the $\alpha_j$ values are considered at the end of the dual-growth phase (we set conventionally $t_i := 0$ for $N(i) = \emptyset$). Intuitively, $t_i$ is the "time" when facility $i$ is tentatively open (at which point all the dual variables of contributing clients stop growing). We define a conflict graph $H$ over tentatively open facilities as follows. The node set of $H$ is $T$. We place an edge between $i, i'\in T$ iff the following two conditions hold: (1) for some client $j$, $j\in N(i)\cap N(i')$ (in words, $j$ contributes to the opening of both $i$ and $i'$); and (2) one has $c(i,i') \leq \delta\cdot\min\{t_i, t_{i'}\}$. In this graph we compute a maximal independent set $S$, which provides the desired solution to the facility location problem (where each client is assigned to the closest facility in $S$).
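To make the pruning rule concrete, here is a small Python sketch (our illustration, with made-up toy data; the real algorithm would produce the opening times and contributor sets during the dual-growth phase). It tests the edge condition of the conflict graph and extracts a maximal independent set greedily:

```python
def conflict_edge(i, j, contributors, t, dist2, delta):
    """Edge of the conflict graph: some client contributes to both
    facilities AND their squared distance is at most delta*min(t_i, t_j)."""
    shared = contributors[i] & contributors[j]
    return bool(shared) and dist2[i][j] <= delta * min(t[i], t[j])

def prune(T, contributors, t, dist2, delta):
    """Return a maximal independent set S of the conflict graph, scanning
    facilities by increasing opening time (any maximal independent set
    would do for the analysis)."""
    S = []
    for i in sorted(T, key=lambda f: t[f]):
        if all(not conflict_edge(i, j, contributors, t, dist2, delta)
               for j in S):
            S.append(i)
    return S

# Toy data (hypothetical): facilities 0 and 1 share a contributor and are
# close, so they conflict; facility 2 shares no contributor with anyone.
contributors = {0: {"a"}, 1: {"a"}, 2: {"b"}}
t = {0: 1.0, 1: 2.0, 2: 1.5}
dist2 = {0: {1: 1.0, 2: 9.0}, 1: {0: 1.0, 2: 9.0}, 2: {0: 9.0, 1: 9.0}}
S = prune([0, 1, 2], contributors, t, dist2, delta=2.0)
assert S == [0, 2]
```

Note how facility 1 is closed: it conflicts with the earlier-opened facility 0, exactly the situation the analysis in Section 2.2 exploits.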
We remark that the pruning phase of $\mathrm{JV}(\delta)$ differs from the one of $\mathrm{JV}$ only in the definition of the conflict graph, where condition (2) is not required to hold (or, equivalently, $\mathrm{JV}$ behaves like $\mathrm{JV}(\delta)$ for $\delta = +\infty$).
2.2 The Analysis in [2]
The general goal is to show that
\[ \sum_{j\in P} c(j,S) \;\leq\; \rho\,\Big(\sum_{j\in P}\alpha_j - \lambda\,|S|\Big) \]
for some $\rho$ as small as possible. This shows that the algorithm is an LMP $\rho$-approximation for the problem (by weak duality, $\sum_{j\in P}\alpha_j$ is at most the optimal value of the FL relaxation). Since constraint (1) is tight for each $i\in S$, one has $\lambda|S| = \sum_{i\in S}\sum_{j\in N(i)}(\alpha_j - c(j,i))$, and hence it is sufficient to prove that, for each client $j$, one has
\[ c(j,S) \;\leq\; \rho\,\Big(\alpha_j - \sum_{i\in S\cap N(j)}\big(\alpha_j - c(j,i)\big)\Big). \]
Let $S_j := S\cap N(j)$ and $s := |S_j|$. We distinguish cases depending on the value of $s$:
Case A: $s = 1$.
Let $i$ be the unique facility in $S_j$. Then, for any $\rho \geq 1$,
\[ c(j,S) \leq c(j,i) = \alpha_j - (\alpha_j - c(j,i)) = \alpha_j - \sum_{i'\in S_j}(\alpha_j - c(j,i')). \]
Case B: $s \geq 2$.
Here we use the properties of Euclidean metrics. The sum $\sum_{i\in S_j} c(j,i)$ is the sum of the squared distances from $j$ to the facilities in $S_j$. This quantity is lower bounded by the sum of the squared distances from the facilities in $S_j$ to their centroid $\mu := \frac{1}{s}\sum_{i\in S_j} i$. Recall that $\sum_{i\in S_j} c(i,\mu) = \frac{1}{2s}\sum_{i,i'\in S_j} c(i,i')$. We also observe that, by construction, for any two distinct $i,i'\in S_j$ one has
\[ c(i,i') > \delta\cdot\min\{t_i, t_{i'}\} \geq \delta\cdot\alpha_j, \]
where the first inequality holds since $S$ is an independent set of the conflict graph, and the last inequality follows from the fact that $j$ is contributing to the opening of both $i$ and $i'$ (hence $\alpha_j \leq \min\{t_i,t_{i'}\}$). Altogether one obtains
\[ \sum_{i\in S_j} c(j,i) \;\geq\; \sum_{i\in S_j} c(i,\mu) \;=\; \frac{1}{2s}\sum_{i\neq i'\in S_j} c(i,i') \;>\; \frac{s(s-1)}{2s}\,\delta\,\alpha_j \;=\; \frac{\delta(s-1)}{2}\,\alpha_j. \]
Thus
\[ \alpha_j - \sum_{i\in S_j}(\alpha_j - c(j,i)) \;=\; \sum_{i\in S_j} c(j,i) - (s-1)\,\alpha_j \;>\; \Big(\frac{\delta}{2}-1\Big)(s-1)\,\alpha_j. \]
Using the fact that $c(j,i) < \alpha_j$ for all $i\in S_j$, hence $c(j,S) < \alpha_j$, one gets
\[ \frac{c(j,S)}{\alpha_j - \sum_{i\in S_j}(\alpha_j - c(j,i))} \;<\; \frac{\alpha_j}{\big(\frac{\delta}{2}-1\big)(s-1)\,\alpha_j} \;\leq\; \frac{2}{\delta-2}. \]
We conclude that
\[ c(j,S) \;\leq\; \frac{2}{\delta-2}\,\Big(\alpha_j - \sum_{i\in S_j}(\alpha_j - c(j,i))\Big). \]
This gives the desired inequality assuming that $\rho \geq \frac{2}{\delta-2}$.
Case C: $s = 0$.
Consider the witness $i\in T$ of $j$. Notice that $\alpha_j \geq t_i$ and $\alpha_j \geq c(j,i)$. Hence, for any $i'$ adjacent to $i$ in the conflict graph,
\[ c(i,i') \leq \delta\cdot\min\{t_i, t_{i'}\} \leq \delta\, t_i \leq \delta\,\alpha_j. \]
If $i\in S$, then $c(j,S) \leq c(j,i) \leq \alpha_j$. Otherwise, by the maximality of the independent set $S$, there exists $i'\in S$ adjacent to $i$ in the conflict graph. Thus $d(j,i') \leq d(j,i) + d(i,i') \leq \sqrt{\alpha_j} + \sqrt{\delta\,\alpha_j}$. In both cases one has $c(j,S) \leq (1+\sqrt{\delta})^2\,\alpha_j$, hence
\[ c(j,S) \;\leq\; (1+\sqrt{\delta})^2\,\Big(\alpha_j - \sum_{i''\in S_j}(\alpha_j - c(j,i''))\Big), \]
where the sum is empty since $s = 0$.
This gives the desired inequality for $\rho \geq (1+\sqrt{\delta})^2$.
Fixing $\delta$.
Altogether we can set $\rho = \max\big\{\frac{2}{\delta-2},\, (1+\sqrt{\delta})^2\big\}$. The best choice for $\delta$ (namely, the one that minimizes $\rho$) is the solution of $\frac{2}{\delta-2} = (1+\sqrt{\delta})^2$. This is achieved for $\delta \approx 2.3146$ and gives $\rho \approx 6.357$.
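These constants are easy to reproduce numerically. The following snippet (ours) solves $\frac{2}{\delta-2} = (1+\sqrt{\delta})^2$ by bisection, exploiting that the left-hand side is decreasing and the right-hand side increasing in $\delta$:

```python
from math import sqrt

def gap(delta):
    # Case-B requirement on rho minus Case-C requirement on rho
    return 2.0 / (delta - 2.0) - (1.0 + sqrt(delta)) ** 2

lo, hi = 2.05, 4.0          # gap(lo) > 0 > gap(hi)
for _ in range(100):
    mid = (lo + hi) / 2.0
    if gap(mid) > 0:
        lo = mid
    else:
        hi = mid

delta = (lo + hi) / 2.0
rho = (1.0 + sqrt(delta)) ** 2
assert abs(delta - 2.3146) < 1e-3
assert abs(rho - 6.357) < 1e-2
```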
2.3 A Refined Analysis
We refine the analysis in Case B as follows. Let $D := \sum_{i\in S_j} c(j,i)$. We already proved that, for $s \geq 2$, $D > \frac{\delta(s-1)}{2}\alpha_j$. Hence it is sufficient to upper bound
\[ \frac{c(j,S)}{\alpha_j - \sum_{i\in S_j}(\alpha_j - c(j,i))} \;=\; \frac{c(j,S)}{D - (s-1)\,\alpha_j}. \]
Instead of using the upper bound $c(j,S) \leq \alpha_j$, we use the average
\[ c(j,S) \;\leq\; \min_{i\in S_j} c(j,i) \;\leq\; \frac{D}{s}. \]
Then it is sufficient to upper bound
\[ \frac{D/s}{D - (s-1)\,\alpha_j}. \]
The derivative in $D$ of the above function is $\frac{-(s-1)\,\alpha_j}{s\,(D - (s-1)\alpha_j)^2} < 0$. Hence the maximum is achieved for the smallest possible value of $D$. Recall that we already showed that $D > \frac{\delta(s-1)}{2}\alpha_j$. Hence a valid upper bound is
\[ \frac{\frac{\delta(s-1)}{2}\alpha_j\,/\,s}{\frac{\delta(s-1)}{2}\alpha_j - (s-1)\,\alpha_j} \;=\; \frac{\delta}{s\,(\delta-2)} \;\leq\; \frac{\delta}{2\,(\delta-2)}. \]
This imposes $\rho \geq \frac{\delta}{2(\delta-2)}$ rather than $\rho \geq \frac{2}{\delta-2}$ in Case B. Notice that this is an improvement for $\delta < 4$. The best choice of $\delta$ is now obtained by imposing $\frac{\delta}{2(\delta-2)} = (1+\sqrt{\delta})^2$. This gives $\delta \approx 2.1777$ and $\rho \approx 6.12903$.
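The refined constants can also be reproduced numerically; the following snippet (ours) solves $\frac{\delta}{2(\delta-2)} = (1+\sqrt{\delta})^2$ by bisection (the left-hand side is decreasing and the right-hand side increasing in $\delta$):

```python
from math import sqrt

def gap(delta):
    # refined Case-B requirement on rho minus Case-C requirement on rho
    return delta / (2.0 * (delta - 2.0)) - (1.0 + sqrt(delta)) ** 2

lo, hi = 2.05, 4.0          # gap(lo) > 0 > gap(hi)
for _ in range(100):
    mid = (lo + hi) / 2.0
    if gap(mid) > 0:
        lo = mid
    else:
        hi = mid

delta = (lo + hi) / 2.0
rho = (1.0 + sqrt(delta)) ** 2
assert abs(delta - 2.1777) < 1e-3
assert abs(rho - 6.12903) < 1e-3
```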
2.4 From Facility Location to k-Means
We can use the refined approximation for Euclidean Facility Location from the previous section to derive a $(\rho+\varepsilon)$-approximation for Euclidean k-Means, for any constant $\varepsilon > 0$, with $\rho \approx 6.12903$ as in Theorem 1. Here we follow the approach of [2] with only minor changes. In more detail, the authors consider a variant of the FL algorithm described before, whose approximation factor is slightly weaker than the one of the LMP algorithm. A careful use of this algorithm leads to a solution opening precisely $k$ facilities, which leads to the desired approximation factor. In their analysis the authors use slight modifications of the inequality $c(j,S) \leq (1+\sqrt{\delta})^2\,\alpha_j$ (coming from Case C, which is the same in their and our analysis). The goal is to prove that the modified algorithm is $(\rho+\varepsilon)$-approximate. Here $\delta$ and $\rho$ are used as parameters. Therefore it is sufficient to replace their values of these parameters with the ones coming from our refined analysis. The rest of the analysis is identical.
3 Lower Bound on the Integrality Gap
In this section we describe our lower bound on the integrality gap of the standard LP relaxation of Section 1.1. It is convenient to consider first the following slightly different relaxation, based on clusters (with $cost(\cdot)$ as defined in Section 1):
\[ \min\ \sum_{C\in\mathcal{C}} cost(C)\,x_C \quad\text{s.t.}\quad \sum_{C\in\mathcal{C}:\, p\in C} x_C \geq 1\ \ \forall p\in P;\qquad \sum_{C\in\mathcal{C}} x_C \leq k;\qquad x_C \geq 0\ \ \forall C\in\mathcal{C}. \]
Here $\mathcal{C}$ denotes the set of possible clusters, i.e. the possible subsets of points. In an integral solution, $x_C = 1$ means that cluster $C$ is part of our solution.
Our instance is on the Euclidean plane, and its points are the $n = 10$ vertices of two regular pentagons of side length $1$. These pentagons are placed so that any two vertices of distinct pentagons are at distance at least $\Delta$, for a large enough $\Delta$ to be fixed later. Here $k = 5$. We remark that our argument can be easily extended to an arbitrarily large number $n = 10h$ of points by taking $2h$ such pentagons for any integer $h \geq 1$, so that the pairwise distance between vertices of distinct pentagons is at least $\Delta$, and setting $k = 5h$.
A feasible fractional solution is obtained by setting $x_C = \frac{1}{2}$ for every $C$ consisting of a pair of consecutive vertices of the same pentagon (so we are considering $10$ fractional clusters in total). Obviously this solution is feasible. The cost of each such cluster is $\frac{1}{2}$. Hence the cost of this fractional solution is $10\cdot\frac{1}{2}\cdot\frac{1}{2} = \frac{5}{2}$.
Next consider the optimal integral solution, consisting of $k = 5$ clusters. Recall that the radius of each pentagon (i.e. the distance from a vertex to its center) is $r = \frac{1}{2\sin(\pi/5)} \approx 0.851$, and the distance between two non-consecutive vertices of the same pentagon is the golden ratio $\varphi = \frac{1+\sqrt{5}}{2}$. A solution with two clusters consisting of the vertices of each pentagon costs $2\cdot 5r^2 = 10r^2$. Any cluster involving vertices of distinct pentagons costs at least $\Omega(\Delta^2)$, hence for $\Delta$ large enough the optimal solution forms clusters only with vertices of the same pentagon. In more detail, the optimal solution consists of $q$ clusters containing the vertices of one pentagon and $5-q$ clusters containing the vertices of the remaining pentagon, for some $q\in\{1,2,3,4\}$. Let $cost(q)$ be the minimum cost associated with one pentagon assuming that we form $q$ clusters with its vertices. Clearly $cost(5) = 0$ and $cost(1) = 5r^2 = 2+\varphi$. Regarding $cost(4)$, it is obviously convenient to choose two consecutive vertices in the unique cluster of size $2$. Thus $cost(4) = \frac{1}{2}$. For $q\in\{2,3\}$, we note, as is easy to verify, that clusters with consecutive vertices are less expensive than the alternatives. For $q = 2$, one might form one cluster of size $4$ and one of size $1$. This would cost $\frac{3+3\varphi^2}{4}$. Alternatively, one might form one cluster of size $3$ and one of size $2$, at smaller cost $\frac{2+\varphi^2}{3}+\frac{1}{2} = \frac{3}{2}+\frac{\varphi}{3}$ (using $\varphi^2 = \varphi+1$). Thus $cost(2) = \frac{3}{2}+\frac{\varphi}{3}$. For $q = 3$, one might form two clusters of size $2$ and one of size $1$, or two clusters of size $1$ and one of size $3$. The associated cost in the two cases is $1$ and $\frac{2+\varphi^2}{3}$, resp. Hence $cost(3) = 1$. So the overall cost of the optimal integral solution is $\min_q\{cost(q)+cost(5-q)\} = cost(2)+cost(3) = \frac{5}{2}+\frac{\varphi}{3}$. Thus the integrality gap of the considered relaxation is at least $\frac{5/2+\varphi/3}{5/2} = 1+\frac{2\varphi}{15} > 1.2157$.
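The values $cost(q)$ and the resulting gap can be double-checked by brute force over all partitions of the five pentagon vertices; the following script is our verification aid, not part of the paper:

```python
from math import cos, sin, pi, sqrt

r = 1 / (2 * sin(pi / 5))  # circumradius of a regular pentagon with side 1
V = [(r * cos(2 * pi * i / 5), r * sin(2 * pi * i / 5)) for i in range(5)]

def cluster_cost(C):
    # sum of squared distances to the centroid
    mx = sum(p[0] for p in C) / len(C)
    my = sum(p[1] for p in C) / len(C)
    return sum((p[0] - mx) ** 2 + (p[1] - my) ** 2 for p in C)

def partitions(items):
    # all set partitions of a list (Bell(5) = 52 for our instance)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        for idx in range(len(part)):
            yield part[:idx] + [part[idx] + [first]] + part[idx + 1:]
        yield part + [[first]]

def cost(q):
    # cheapest way to split the pentagon vertices into q clusters
    return min(sum(cluster_cost(C) for C in part)
               for part in partitions(V) if len(part) == q)

phi = (1 + sqrt(5)) / 2
assert abs(cost(4) - 0.5) < 1e-9
assert abs(cost(3) - 1.0) < 1e-9
assert abs(cost(2) - (1.5 + phi / 3)) < 1e-9
gap = (cost(2) + cost(3)) / 2.5
assert abs(gap - (1 + 2 * phi / 15)) < 1e-9
```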
Consider next the standard LP relaxation of Section 1.1. Here a technical complication comes from the definition of the set $F$ of candidate centers, which is not part of the input instance of k-Means. The same construction as above works if we let $F$ contain the centers of mass of any set of $1$ or $2$ points. Notice that this is automatically guaranteed by the construction in [11] for $\varepsilon \leq \frac{1}{2}$. In this case the optimal integral solutions to the two relaxations are the same in the considered example. Furthermore, one obtains a feasible fractional solution of cost $\frac{5}{2}$ by setting $y_i = \frac{1}{2}$ for the center of mass of each pair of consecutive vertices of the same pentagon, and setting $x_{pi} = \frac{1}{2}$ for each point $p$ and the two closest centers $i$ with positive $y_i$. This concludes the proof of Theorem 2.
Acknowledgments
Work supported in part by the NSF grant 1909972 and the SNF Excellence Grant 200020B_182865/1.
References
 [1] (2017) Better guarantees for k-means and Euclidean k-median by primal-dual algorithms. In 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, Berkeley, CA, USA, October 15-17, 2017, C. Umans (Ed.), pp. 61–72. External Links: Link, Document Cited by: §1.
 [2] (2020) Better guarantees for k-means and Euclidean k-median by primal-dual algorithms. SIAM J. Comput. 49 (4). External Links: Link, Document Cited by: §1, §2.1, §2.2, §2.4, §2.
 [3] (2006) How slow is the k-means method?. In Proceedings of the 22nd ACM Symposium on Computational Geometry, Sedona, Arizona, USA, June 5-7, 2006, N. Amenta and O. Cheong (Eds.), pp. 144–153. External Links: Link, Document Cited by: §1.
 [4] (2007) k-means++: the advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, New Orleans, Louisiana, USA, January 7-9, 2007, N. Bansal, K. Pruhs, and C. Stein (Eds.), pp. 1027–1035. External Links: Link Cited by: §1.
 [5] (2010) Stability yields a PTAS for k-median and k-means clustering. In 51st Annual IEEE Symposium on Foundations of Computer Science, FOCS 2010, October 23-26, 2010, Las Vegas, Nevada, USA, pp. 309–318. External Links: Link, Document Cited by: §1.
 [6] (2015) The hardness of approximation of Euclidean k-means. In 31st International Symposium on Computational Geometry, SoCG 2015, June 22-25, 2015, Eindhoven, The Netherlands, L. Arge and J. Pach (Eds.), LIPIcs, Vol. 34, pp. 754–767. External Links: Link, Document Cited by: §1.
 [7] (2009) Approximate clustering without the approximation. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2009, New York, NY, USA, January 4-6, 2009, C. Mathieu (Ed.), pp. 1068–1077. External Links: Link Cited by: §1.

 [8] (2019) Oblivious dimension reduction for k-means: beyond subspaces and the Johnson-Lindenstrauss lemma. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, M. Charikar and E. Cohen (Eds.), pp. 1039–1050. External Links: Link, Document Cited by: §1.
 [9] (2019) Inapproximability of clustering in Lp metrics. In 60th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2019, Baltimore, Maryland, USA, November 9-12, 2019, D. Zuckerman (Ed.), pp. 519–539. External Links: Link, Document Cited by: §1.
 [10] (2019) Local search yields approximation schemes for k-means and k-median in Euclidean and minor-free metrics. SIAM J. Comput. 48 (2), pp. 644–667. External Links: Link, Document Cited by: §1.
 [11] (2003) Approximation schemes for clustering problems. In Proceedings of the 35th Annual ACM Symposium on Theory of Computing, June 9-11, 2003, San Diego, CA, USA, L. L. Larmore and M. X. Goemans (Eds.), pp. 50–58. External Links: Link, Document Cited by: §1.1, §3.
 [12] (2019) Local search yields a PTAS for k-means in doubling metrics. SIAM J. Comput. 48 (2), pp. 452–480. External Links: Link, Document Cited by: §1.
 [13] (2001) Approximation algorithms for metric facility location and kmedian problems using the primaldual schema and Lagrangian relaxation. J. ACM 48 (2), pp. 274–296. Cited by: §2.1.
 [14] (1984) Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics 26, pp. 189–206. Cited by: §1.
 [15] (2004) A local search approximation algorithm for k-means clustering. Comput. Geom. 28 (2-3), pp. 89–112. External Links: Link, Document Cited by: §1.
 [16] (2017) Improved and simplified inapproximability for k-means. Inf. Process. Lett. 120, pp. 40–43. External Links: Link, Document Cited by: §1.
 [17] (1982) Least squares quantization in PCM. IEEE Trans. Inf. Theory 28 (2), pp. 129–136. External Links: Link, Document Cited by: §1.
 [18] (2019) Performance of Johnson-Lindenstrauss transform for k-means and k-medians clustering. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, M. Charikar and E. Cohen (Eds.), pp. 1027–1038. External Links: Link, Document Cited by: §1.
 [19] (2000) On approximate geometric k-clustering. Discrete Comput. Geom. 24 (1), pp. 61–84. External Links: Link, Document Cited by: §1.
 [20] (2012) The effectiveness of Lloyd-type methods for the k-means problem. J. ACM 59 (6), pp. 28:1–28:22. External Links: Link, Document Cited by: §1.
 [21] (2011) k-means requires exponentially many iterations even in the plane. Discrete Comput. Geom. 45 (4), pp. 596–616. External Links: Link, Document Cited by: §1.