[Paper Review] Self-Supervised Feature Learning of 1D Convolutional Neural Networks with Contrastive Loss Using In-Ear Microphone Audio for Eating Detection
This paper proposes a self-supervised feature learning approach using 1D convolutional neural networks with contrastive loss on in-ear microphone audio to detect eating episodes. By leveraging unlabeled audio data from wearables and adapting the SimCLR framework from computer vision, the method achieves performance comparable to supervised and state-of-the-art methods, significantly reducing reliance on costly manual annotations.
The importance of automated and objective monitoring of dietary behavior is becoming increasingly accepted. The advancements in sensor technology along with recent achievements in machine-learning-based signal-processing algorithms have enabled the development of dietary monitoring solutions that yield highly accurate results. A common bottleneck for developing and training machine learning algorithms is obtaining labeled data for training supervised algorithms, and in particular ground truth annotations. Manual ground truth annotation is laborious, cumbersome, can sometimes introduce errors, and is sometimes impossible in free-living data collection. As a result, there is a need to decrease the labeled data required for training. Additionally, unlabeled data, gathered in-the-wild from existing wearables (such as Bluetooth earbuds) can be used to train and fine-tune eating-detection models. In this work, we focus on training a feature extractor for audio signals captured by an in-ear microphone for the task of eating detection in a self-supervised way. We base our approach on the SimCLR method for image classification, proposed by Chen et al. from the domain of computer vision. Results are promising as our self-supervised method achieves similar results to supervised training alternatives, and its overall effectiveness is comparable to current state-of-the-art methods. Code is available at https://github.com/mug-auth/ssl-chewing.
Motivation & Objective
- To reduce dependency on expensive and error-prone manual labeling for eating detection models.
- To leverage unlabeled audio data collected from in-ear wearables such as Bluetooth earbuds for pre-training.
- To adapt self-supervised contrastive learning from computer vision (SimCLR) to audio signals for dietary monitoring.
- To develop a robust feature extractor for eating detection using only audio from in-ear microphones.
- To evaluate whether self-supervised training can match or approach the performance of supervised learning in eating detection.
Proposed method
- Adapts the SimCLR contrastive learning framework to 1D audio signals captured by in-ear microphones.
- Uses data augmentation techniques such as time cropping and noise injection to generate positive sample views for contrastive learning.
- Employs a 1D convolutional neural network as a feature encoder to learn discriminative representations from audio.
- Applies contrastive loss to maximize agreement between augmented views of the same audio sample while pushing apart views from different samples.
- Pre-trains the feature extractor in a self-supervised manner on unlabeled audio before transferring it to the classification task.
- Fine-tunes the pre-trained model on a small amount of labeled data for downstream eating detection.
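The contrastive objective described in the list above can be illustrated with a minimal NumPy sketch of the NT-Xent loss used by SimCLR. This is not the authors' implementation (their code is in the linked repository). The function name `nt_xent_loss`, the temperature of 0.5, and the toy embedding shapes are assumptions chosen for illustration. The two inputs stand in for the encoder outputs of two augmented views of the same batch of audio windows.

```python
import numpy as np


def nt_xent_loss(z_a, z_b, temperature=0.5):
    """NT-Xent loss for paired embeddings z_a, z_b of shape (N, D).

    Row i of z_a and row i of z_b are two views of the same sample
    (a positive pair); every other row in the batch is a negative.
    The temperature value is an illustrative assumption.
    """
    n = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)             # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit norm, so dot = cosine
    sim = (z @ z.T) / temperature                      # (2N, 2N) scaled similarities
    np.fill_diagonal(sim, -np.inf)                     # a view is never its own negative

    # Row i is paired with row i + N, and row i + N with row i.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])

    # Numerically stable log-sum-exp over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))

    log_prob_pos = sim[np.arange(2 * n), pos] - log_denom
    return float(-log_prob_pos.mean())


# Toy check with random vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
anchor = rng.normal(size=(8, 16))
aligned = nt_xent_loss(anchor, anchor + 0.01 * rng.normal(size=(8, 16)))
unrelated = nt_xent_loss(anchor, rng.normal(size=(8, 16)))
print(f"aligned views: {aligned:.3f}, unrelated views: {unrelated:.3f}")
```

In this toy check the loss is lower when the two views of each sample nearly coincide than when they are unrelated, which is the behavior the pre-training stage relies on to shape the encoder's feature space.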
Experimental results
Research questions
- RQ1: Can self-supervised contrastive learning on in-ear microphone audio achieve comparable performance to supervised learning in eating detection?
- RQ2: How effective is the transfer of features learned via self-supervised pre-training to the downstream eating detection task?
- RQ3: To what extent can unlabeled in-the-wild audio data reduce the need for manual annotation in dietary monitoring systems?
- RQ4: How does the performance of the proposed method compare to state-of-the-art eating detection models?
- RQ5: What data augmentation strategies are most effective for audio-based self-supervised learning in this context?
Key findings
- The self-supervised model achieves eating detection performance comparable to supervised training baselines, demonstrating the viability of self-supervised pre-training.
- The method significantly reduces reliance on manually annotated data, addressing a key bottleneck in dietary monitoring.
- The transfer performance of the self-supervised feature extractor is competitive with state-of-the-art methods in the field.
- The approach effectively learns discriminative audio representations from in-ear microphone signals without human-annotated labels.
- The use of contrastive learning with data augmentation leads to robust and generalizable features for eating detection.
- Code is publicly available at https://github.com/mug-auth/ssl-chewing, enabling reproducibility and further research in self-supervised dietary monitoring.
This review was created by AI and reviewed by human editors.