[Paper Review] Applied Federated Learning: Improving Google Keyboard Query Suggestions
The paper demonstrates end-to-end use of federated learning to train, evaluate, and deploy a triggering model on mobile devices to filter Google Keyboard query suggestions without accessing raw user data, improving CTR while preserving privacy.
Federated learning is a distributed form of machine learning where both the training data and model training are decentralized. In this paper, we use federated learning in a commercial, global-scale setting to train, evaluate and deploy a model to improve virtual keyboard search suggestion quality without direct access to the underlying user data. We describe our observations in federated training, compare metrics to live deployments, and present resulting quality increases. In whole, we demonstrate how federated learning can be applied end-to-end to both improve user experiences and enhance user privacy.
Motivation & Objective
- Demonstrate an end-to-end FL workflow for a commercial mobile keyboard feature.
- Assess privacy benefits and performance of on-device FL training and aggregation.
- Show how a triggering model can improve query suggestion quality without central data access.
Proposed method
- Two-stage recommendation system: server-trained baseline model plus FL-trained triggering model.
- On-device data collection of features and labels (impressions/clicks) for FL tasks.
- Federated Averaging for aggregating client updates into global model without central data access.
- On-device evaluation and monitoring to guide model convergence and deployment.
- Threshold-based triggering to balance CTR and retained impressions.
- Logistic regression as the FL model in initial experiments, with potential extension to neural models.
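The combination of Federated Averaging and logistic regression listed above can be sketched in a few lines. This is a minimal, illustrative sketch and not the paper's production code: the function names (`client_update`, `federated_average`, `run_round`), the learning rate, the number of local epochs, and the number of clients per round are all assumptions chosen for readability. Each client runs local SGD on its own examples, and the server only sees the resulting weights, which it averages in proportion to each client's example count.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def client_update(global_weights, examples, lr=0.1, epochs=1):
    """Local SGD for logistic regression on one client's data.

    `examples` is a list of (features, label) pairs that never leaves
    the client; only the updated weights and the example count are returned.
    """
    w = list(global_weights)
    for _ in range(epochs):
        for x, y in examples:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for i, xi in enumerate(x):
                w[i] -= lr * (p - y) * xi  # gradient of the log loss
    return w, len(examples)


def federated_average(updates):
    """Average client weights, weighted by each client's example count."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(w[i] * n for w, n in updates) / total for i in range(dim)]


def run_round(global_weights, clients, rng, clients_per_round=3):
    """One round: sample clients, train locally, aggregate on the server."""
    selected = rng.sample(clients, clients_per_round)
    updates = [client_update(global_weights, c) for c in selected]
    return federated_average(updates)
```

In a real deployment the sampled clients would be devices that meet eligibility conditions, and the aggregation would run on a server; here both are simulated in one process to show the data flow only.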
Experimental results
Research questions
- RQ1: Can federated learning on mobile devices improve the quality of Gboard's query suggestions without accessing raw user data?
- RQ2: What are the practical training dynamics, constraints, and privacy implications when deploying FL end-to-end in production?
- RQ3: How does the FL-trained triggering model affect click-through rate and retained impressions compared to a traditional baseline?
- RQ4: What challenges arise from diurnal device availability and population skew in FL for on-device privacy-preserving training?
Key findings
- FL-trained triggering model improves click-through rate (CTR) compared to the baseline in live deployments at selected thresholds.
- Training exhibits diurnal patterns: most rounds occur at night, when devices are more likely to be charging and connected to unmetered networks.
- Evaluation shows training and live metrics can diverge due to population skew and environmental constraints.
- Threshold tuning affects the balance between triggering rate and user experience, influencing retained impressions and clicks.
- Logistic regression provided an interpretable and effective starting point for FL in this setting; later iterations incorporated more complex features including LSTM-based text featurization.
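The threshold trade-off described in the findings above can be made concrete with a small sweep. This is a hedged sketch under stated assumptions: the input is a hypothetical list of `(score, clicked)` pairs, where `score` is the triggering model's output for a suggestion and `clicked` is 1 or 0, and the function name `threshold_sweep` and the output field names are illustrative rather than taken from the paper. A suggestion is shown only if its score meets the threshold, so raising the threshold tends to raise CTR among shown suggestions while lowering the share of impressions and clicks that are retained.

```python
def threshold_sweep(scored_impressions, thresholds):
    """Compute CTR and retention for each candidate triggering threshold.

    scored_impressions: list of (score, clicked) pairs, clicked in {0, 1}.
    Returns one dict per threshold with the CTR of the suggestions that
    would still be shown and the fraction of impressions and clicks kept.
    """
    total_impressions = len(scored_impressions)
    total_clicks = sum(clicked for _, clicked in scored_impressions)
    rows = []
    for t in thresholds:
        kept = [clicked for score, clicked in scored_impressions if score >= t]
        kept_clicks = sum(kept)
        rows.append({
            "threshold": t,
            "ctr": kept_clicks / len(kept) if kept else 0.0,
            "retained_impressions": (
                len(kept) / total_impressions if total_impressions else 0.0
            ),
            "retained_clicks": (
                kept_clicks / total_clicks if total_clicks else 0.0
            ),
        })
    return rows


# Hypothetical toy data, for illustration only.
example = [(0.9, 1), (0.8, 0), (0.4, 1), (0.2, 0), (0.1, 0)]
for row in threshold_sweep(example, [0.0, 0.5]):
    print(row)
```

Picking an operating point then amounts to choosing the threshold whose CTR gain is worth the impressions and clicks it gives up.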
This review was created by AI and reviewed by human editors.