QUICK REVIEW

[Paper Review] ProtoDash: Fast Interpretable Prototype Selection

Karthik S. Gurumoorthy, Amit Dhurandhar|arXiv (Cornell University)|Jul 5, 2017

Machine Learning and Data Classification|14 citations

TL;DR

ProtoDash is a fast algorithm for selecting weighted, interpretable prototypes from complex datasets. It casts selection as a weakly submodular problem and generalizes prior work (L2C) so that prototypes and criticisms are found within one framework under any symmetric positive definite kernel. The paper derives approximation guarantees for the algorithm and validates it quantitatively and qualitatively on retail, MNIST, and public health datasets.

ABSTRACT

In this paper we propose an efficient algorithm ProtoDash for selecting prototypical examples from complex datasets. Our work builds on top of the learn to criticize (L2C) work by Kim et al. (2016) and generalizes it to not only select prototypes for a given sparsity level $m$ but also to associate non-negative weights with each of them indicative of the importance of each prototype. Unlike in the case of L2C, this extension provides a single coherent framework under which both prototypes and criticisms (i.e. lowest weighted prototypes) can be found. Furthermore, our framework works for any symmetric positive definite kernel thus addressing one of the open questions laid out in Kim et al. (2016). Our additional requirement of learning non-negative weights introduces technical challenges as the objective is no longer submodular as in the previous work. However, we show that the problem is weakly submodular and derive approximation guarantees for our fast ProtoDash algorithm. Moreover, ProtoDash can not only find prototypical examples for a dataset $X$, but it can also find (weighted) prototypical examples from $X^{(2)}$ that best represent another dataset $X^{(1)}$, where $X^{(1)}$ and $X^{(2)}$ belong to the same feature space. We demonstrate the efficacy of our method on diverse domains namely; retail, digit recognition (MNIST) and on the latest publicly available 40 health questionnaires obtained from the Center for Disease Control (CDC) website maintained by the US Dept. of Health. We validate the results quantitatively as well as qualitatively based on expert feedback and recently published scientific studies on public health.

Motivation & Objective

To address the limitations of existing prototype selection methods by enabling both prototype and criticism (low-weighted prototypes) selection in a unified framework.
To generalize the Learn to Criticize (L2C) framework to work with any symmetric positive definite kernel, overcoming a key open question from prior work.
To introduce non-negative weights for prototypes to reflect their importance, enhancing interpretability and representativeness.
To provide theoretical approximation guarantees despite the non-submodular nature of the weighted objective function.
To enable cross-dataset prototype selection, where prototypes from one dataset best represent another dataset in the same feature space.

Proposed method

Extends the L2C framework by introducing non-negative weights for prototypes, transforming the selection problem into a weakly submodular optimization task.
Uses a greedy forward selection algorithm with a novel objective function that balances prototype representativeness and weight-based importance.
Employs a kernelized similarity measure based on any symmetric positive definite kernel to compute affinities between data points.
Derives theoretical approximation bounds for the greedy selection process under weak submodularity, ensuring near-optimal performance.
Supports both in-domain prototype selection (from a dataset $X$) and cross-dataset prototype selection (from $X^{(2)}$ to represent $X^{(1)}$), provided both datasets share the same feature space.
Implements a fast, scalable algorithm by leveraging efficient kernel computation and iterative refinement of prototype sets.
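
The steps above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation. It assumes the objective takes the kernel-mean-matching form $w^\top \mu - \tfrac{1}{2} w^\top K w$ with $w \ge 0$, which this review does not spell out, and that each greedy step adds the candidate with the largest gradient entry. The Gaussian kernel, the coordinate-ascent weight solver, and the names gaussian_kernel, nonneg_qp, and greedy_prototypes are all hypothetical choices made for the example.

```python
import numpy as np


def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))


def nonneg_qp(K, mu, sweeps=500, tol=1e-10):
    """Maximize w @ mu - 0.5 * w @ K @ w subject to w >= 0.

    Plain coordinate ascent with clipping at zero (an assumed solver,
    chosen for brevity); K is assumed to be positive definite.
    """
    w = np.zeros(len(mu))
    for _ in range(sweeps):
        w_old = w.copy()
        for i in range(len(mu)):
            # Exact maximizer for coordinate i with the others held fixed.
            residual = mu[i] - K[i] @ w + K[i, i] * w[i]
            w[i] = max(0.0, residual / K[i, i])
        if np.max(np.abs(w - w_old)) < tol:
            break
    return w


def greedy_prototypes(X_target, X_candidates, m, sigma=1.0):
    """Pick up to m weighted prototypes from X_candidates to represent X_target."""
    K = gaussian_kernel(X_candidates, X_candidates, sigma)
    # mu[j] is the mean kernel similarity of candidate j to the target set.
    mu = gaussian_kernel(X_candidates, X_target, sigma).mean(axis=1)
    available = np.ones(len(mu), dtype=bool)
    selected, weights = [], np.zeros(0)
    grad = mu.copy()  # gradient of the objective at w = 0
    while len(selected) < m:
        masked = np.where(available, grad, -np.inf)
        j = int(np.argmax(masked))
        if masked[j] <= 0:  # no remaining candidate has a positive gradient
            break
        selected.append(j)
        available[j] = False
        # Re-fit non-negative weights on the current prototype set.
        weights = nonneg_qp(K[np.ix_(selected, selected)], mu[selected])
        grad = mu - K[:, selected] @ weights
    return np.array(selected), weights


# Toy data: a large cluster (30 points) and a small one (10 points).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (30, 2)), rng.normal(3.0, 0.1, (10, 2))])
idx, w = greedy_prototypes(X, X, m=2)
```

On this toy data the first prototype comes from the larger cluster and carries the larger weight, which is the sense in which the weights indicate importance. Passing two different arrays as X_target and X_candidates gives the cross-dataset variant, and sorting the selected prototypes by weight is one way to read off low-weight candidates for criticisms.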

Experimental results

Research questions

RQ1: Can a unified framework be developed to simultaneously select prototypes and criticisms (low-weighted prototypes) with interpretable, non-negative weights?
RQ2: How can prototype selection be generalized to work with any symmetric positive definite kernel, rather than being restricted to specific kernel types?
RQ3: What theoretical guarantees can be provided for prototype selection when the objective function is no longer submodular due to non-negative weights?
RQ4: Can ProtoDash effectively select representative examples from one dataset to best represent another dataset in the same feature space?
RQ5: How does ProtoDash perform in real-world applications across diverse domains such as retail, digit recognition, and public health?

Key findings

ProtoDash successfully generalizes the L2C framework to support both prototype and criticism selection with non-negative weights, enabling a more coherent and interpretable representation.
The method achieves theoretical approximation guarantees despite the non-submodular objective, by proving the problem is weakly submodular.
ProtoDash demonstrates strong performance on MNIST, achieving high-quality prototype selection with minimal computational cost and consistent interpretability.
In the public health domain, ProtoDash is applied to 40 publicly available CDC health questionnaires, and its selections are consistent with recently published scientific studies on public health.
Expert feedback confirmed that the selected prototypes were semantically meaningful and representative of key health conditions and behaviors.
The algorithm scales efficiently to large datasets, enabling fast prototype selection even in high-dimensional feature spaces.

This review was created by AI and reviewed by human editors.