Data Science

How to cluster noisy data sets

Series: K-Means and Its Variants

Real-world data sets often contain outliers that you cannot remove completely during the data cleanup phase. If you have run into this problem, I want to introduce you to the k-medians algorithm. Because it uses the median instead of the mean, together with a more robust dissimilarity metric, it is much less sensitive to outliers than k-means.
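To make the idea concrete, here is a minimal sketch of k-medians in NumPy. It is an illustration, not the implementation from the article: it assumes the Manhattan (L1) distance as the robust dissimilarity metric, and it assumes the caller supplies the starting centers. The function name `k_medians` and its parameters are made up for this example.

```python
import numpy as np

def k_medians(X, init_centers, n_iter=100):
    """Illustrative k-medians sketch (assumes Manhattan distance, n_iter >= 1)."""
    centers = np.asarray(init_centers, dtype=float).copy()
    for _ in range(n_iter):
        # Manhattan distance from every point to every center, shape (n, k)
        dist = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                # coordinate-wise median instead of the mean used by k-means
                new_centers[j] = np.median(members, axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Two tight groups plus one extreme outlier at (100, 100)
X = np.array([[1, 1], [2, 1], [1, 2], [2, 2],
              [9, 9], [10, 9], [9, 10], [10, 10],
              [100, 100]], dtype=float)

centers, labels = k_medians(X, init_centers=X[[0, 4]])
print(centers)                  # median-based centers
print(X[labels == 1].mean(0))   # mean of the same cluster, for comparison
```

In this toy data the outlier is assigned to the second cluster. The median-based center of that cluster stays at (10, 10), while the mean of the same points is pulled out to (27.6, 27.6).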

A simple framework for performance metrics

The list of performance metrics seems never-ending. Especially if you are new to data science, you can easily feel stranded in an ocean of choices. Learn how the metrics connect to each other and how you can use those connections to choose the best metric for your problem and model.