[Paper Review] Disentangling Factors of Variation via Generative Entangling
This paper proposes a higher-order spike-and-slab restricted Boltzmann machine (hossRBM) that disentangles factors of variation in data through generative entangling of multiple binary latent variables. By modeling multiplicative interactions among latent factors, the model learns to infer and disentangle underlying sources of variation, such as identity and facial expression, in an unsupervised manner. It achieves competitive performance on facial expression classification without using any labels for the latent factors when training the generative model.
Here we propose a novel model family with the objective of learning to disentangle the factors of variation in data. Our approach is based on the spike-and-slab restricted Boltzmann machine which we generalize to include higher-order interactions among multiple latent variables. Seen from a generative perspective, the multiplicative interactions emulate the entangling of factors of variation. Inference in the model can be seen as disentangling these generative factors. Unlike previous attempts at disentangling latent factors, the proposed model is trained using no supervised information regarding the latent factors. We apply our model to the task of facial expression classification.
Motivation & Objective
- To develop a deep generative model capable of disentangling multiple, entangled factors of variation in data without requiring labeled supervision.
- To address the limitation of traditional pooling-based methods, which discard detailed feature information and therefore yield an incomplete representation of the data.
- To explore whether higher-order interactions among binary latent variables can model complex generative entanglement and enable effective disentangling through inference.
- To evaluate the utility of the disentangled representation for downstream tasks like facial expression classification using only unsupervised pretraining.
- To demonstrate that disentangled representations can outperform standard pooling-based or non-disentangled models in classification accuracy.
Proposed method
- Extends the spike-and-slab restricted Boltzmann machine (ssRBM) by introducing higher-order multiplicative interactions among multiple groups of binary latent (spike) variables, including the g and h units, which jointly gate the real-valued slab variables.
- Models the generative process as an entangling mechanism where the multiplicative interaction of latent factors (e.g., identity and expression) produces complex data patterns.
- Uses a structured weight tensor W with dimensions corresponding to spike variables and two pooling groups (g and h), enabling spatially coherent feature learning across blocks.
- Employs unsupervised approximate maximum likelihood learning to train the model parameters without requiring labels for the disentangled factors.
- Performs inference by computing the posterior distribution over latent variables, effectively disentangling the contributions of each factor to the observed data.
- Evaluates the learned representation by using it as input to a linear SVM for facial expression classification, comparing factored and unfactored representations.
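The multiplicative entangling described in the list above can be illustrated with a toy sketch. This is not the paper's energy function or training procedure: the function name `entangle`, the dimensions, and the random weights are assumptions made purely for illustration. The sketch only shows that a filter contributes to the visible vector when both of its binary factors are active.

```python
import random


def entangle(W, s, g, h):
    """Toy generative entangling (illustrative assumption, not the paper's model).

    v[d] = sum over i, j of W[d][i][j] * s[i][j] * g[i] * h[j]
    where g and h are binary factor vectors and s holds real-valued slabs.
    """
    return [
        sum(
            W[d][i][j] * s[i][j] * g[i] * h[j]
            for i in range(len(g))
            for j in range(len(h))
        )
        for d in range(len(W))
    ]


random.seed(0)
D, M, N = 4, 3, 3  # hypothetical visible size and factor group sizes

# Random filters W[d][i][j] and slab values s[i][j], for illustration only.
W = [[[random.gauss(0, 1) for _ in range(N)] for _ in range(M)] for _ in range(D)]
s = [[random.gauss(0, 1) for _ in range(N)] for _ in range(M)]

g = [1, 0, 0]  # only the first g unit is active
h = [0, 1, 0]  # only the second h unit is active

# Only the filter indexed by (i=0, j=1) survives the product gate g[i] * h[j].
v = entangle(W, s, g, h)
print(v)
```

Because each filter is gated by the product `g[i] * h[j]`, switching off either factor removes that filter's contribution entirely. Inference in the actual model runs in the opposite direction, recovering the factors from the observed data.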
Experimental results
Research questions
- RQ1: Can higher-order interactions among binary latent variables effectively model the entanglement of multiple factors of variation in data?
- RQ2: Does unsupervised learning of such a model lead to disentangled representations that are useful for downstream classification tasks?
- RQ3: How does the performance of the disentangled representation compare to standard pooling-based or non-disentangled models in facial expression recognition?
- RQ4: Can the model learn meaningful, interpretable groupings of features (e.g., identity vs. expression) without any supervision on the factors?
- RQ5: Does the factored representation (post-disentangling) yield better classification accuracy than the full, unfactored representation?
Key findings
- The hossRBM achieved a test accuracy of 77.4% on the Toronto Face Dataset using the factored representation, outperforming the pixel-level SVM and MLP baselines.
- The model with K=330, M=3, N=3 achieved the highest test accuracy (77.4%) among all configurations tested, demonstrating the effectiveness of higher-order disentangling.
- The factored representation consistently outperformed the unfactored representation across all model sizes, confirming that disentangling leads to more informative features.
- The learned filters within each block showed global cohesion and specialized to subsets of identities and emotions, with g-units encoding emotions and h-units encoding identity.
- The model's performance (77.4%) surpassed the pixel-level SVM (71.5%) and the MLP (72.72%), and approached, without matching, more complex deep models such as mPoT (82.4%).
- The results validate the hypothesis that disentangling via generative entangling of latent factors improves representation quality for classification tasks in the absence of label supervision.
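To make the comparison in the findings above concrete, the short sketch below computes the gap in percentage points between the hossRBM and each other model. The accuracies are the ones quoted in this review; the variable names are illustrative assumptions.

```python
# Test accuracies (%) on the Toronto Face Dataset, as quoted in the findings above.
reported = {"pixel-level SVM": 71.5, "MLP": 72.72, "mPoT": 82.4}
hossrbm = 77.4

# Gap in percentage points: a positive value means the hossRBM is ahead.
gaps = {name: round(hossrbm - acc, 2) for name, acc in reported.items()}

for name, gap in gaps.items():
    print(f"hossRBM vs {name}: {gap:+.2f} points")
```

The hossRBM is ahead of the pixel-level SVM by 5.9 points and of the MLP by 4.68 points, and it trails mPoT by 5.0 points.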
This review was created by AI and reviewed by human editors.