Publication Type

Conference Proceeding Article

Version

submittedVersion

Publication Date

12-2022

Abstract

We teach K-Means clustering in introductory data analytics courses because it is one of the simplest and most widely used unsupervised machine learning algorithms. However, one drawback of the algorithm is that it has no built-in mechanism for K selection, that is, for determining the appropriate number of clusters. The solution usually taught for the K selection problem is the so-called elbow method, where we look at the incremental changes in some quality metric (usually the sum of squared errors, SSE) as K increases, trying to find a sudden change. Besides SSE, many other metrics and methods can be found in the literature. In this paper, we survey several of them and conclude that the Variance Ratio Criterion (VRC) is an appropriate metric we should consider teaching for K selection. From a pedagogical perspective, VRC has desirable mathematical properties, which help emphasize the statistical underpinnings of the algorithm, thereby reinforcing the students’ understanding through experiential learning. We also list the key concepts targeted by the VRC approach and provide ideas for assignments.
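As an illustration of the metric named in the abstract, the following is a minimal sketch of how VRC might be computed for a given cluster assignment. It assumes the standard Calinski-Harabasz definition, VRC = (SSB / (K - 1)) / (SSE / (N - K)), where SSB is the between-cluster sum of squares; the abstract itself does not state the formula, and the function names `centroid`, `sq_dist`, and `vrc` are hypothetical, not taken from the paper.

```python
from statistics import fmean


def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    return [fmean(coord) for coord in zip(*points)]


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def vrc(points, labels):
    """Variance Ratio Criterion under the assumed Calinski-Harabasz definition.

    points: list of numeric tuples; labels: one cluster label per point.
    """
    n = len(points)
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("VRC needs 2 <= K < N")

    overall = centroid(points)
    sse = 0.0  # within-cluster sum of squared errors
    ssb = 0.0  # between-cluster sum of squares
    for members in clusters.values():
        c = centroid(members)
        sse += sum(sq_dist(p, c) for p in members)
        ssb += len(members) * sq_dist(c, overall)

    if sse == 0.0:
        return float("inf")  # every point sits exactly on its centroid
    return (ssb / (k - 1)) / (sse / (n - k))


# Two tight, well-separated pairs of points.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = vrc(points, [0, 0, 1, 1])  # labels follow the natural grouping
bad = vrc(points, [0, 1, 0, 1])   # labels cut across the natural grouping
```

In a K selection exercise, one would run K-Means for each candidate K, pass the resulting labels to `vrc`, and prefer the K with the highest value, in contrast to the elbow method's search for a sudden change in SSE.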

Keywords

K-Means Clustering, Quality Metrics, K Selection, Variance Ratio Criterion

Discipline

Databases and Information Systems | Higher Education

Research Areas

Data Science and Engineering

Publication

2022 IEEE International Conference on Teaching, Assessment, and Learning for Engineering (TALE): Hong Kong, December 4-7: Proceedings

First Page

46

Last Page

53

ISBN

9781665491174

Identifier

10.1109/TALE54877.2022.00016

Publisher

IEEE

City or Country

Piscataway, NJ

Copyright Owner and License

Authors

Additional URL

https://doi.org/10.1109/TALE54877.2022.00016
