[Paper Review] Modular Networks: Learning to Decompose Neural Computation
The paper introduces modular networks that learn to decompose neural computation into reusable modules using a generalized EM training framework, enabling deterministic module selection without regularization and showing gains in language modeling and image classification.
Scaling model capacity has been vital in the success of deep learning. For a typical network, necessary compute resources and training time grow dramatically with model size. Conditional computation is a promising way to increase the number of parameters with a relatively small increase in resources. We propose a training algorithm that flexibly chooses neural modules based on the data to be processed. Both the decomposition and modules are learned end-to-end. In contrast to existing approaches, training does not rely on regularization to enforce diversity in module use. We apply modular networks both to image recognition and language modeling tasks, where we achieve superior performance compared to several baselines. Introspection reveals that modules specialize in interpretable contexts.
Motivation & Objective
- Scale neural networks through conditional computation by decomposing computation into reusable modules.
- Develop a probabilistic, end-to-end trainable framework that learns both modules and their decomposition.
- Enable deterministic module selection to reduce computation and improve training stability.
- Demonstrate the approach on language modeling and image classification with interpretable module specialization.
Proposed method
- Represent the network as a set of M modules and a controller that selects K modules per layer.
- Model module selection a as a latent variable and maximize a variational lower bound on the likelihood.
- Use generalized EM with a partial E-step (Viterbi-style) to keep q(a) deterministic (q(a)=delta(a,a*)).
- Compute gradients for θ (module parameters) and φ (controller) via E_q[log p(y,a|x,θ,φ)], which reduces to log p(y,a*|x,θ,φ) because q(a) is deterministic.
- In the partial E-step, sample S candidate module compositions from the controller and update a* to the best-scoring candidate, retaining the previous a* if no sample improves on it.
- Support deterministic, shared-module usage across layers, enabling dynamic parameter sharing and reuse.
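The partial E-step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a single layer with K = 1 (one module chosen per input), and the names `partial_e_step`, `controller_logits`, and `log_lik` are hypothetical stand-ins for the controller p(a|x,φ) and the module likelihood p(y|x,a,θ).

```python
import math
import random


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def partial_e_step(a_star, controller_logits, log_lik, num_samples, rng):
    """Pick the best module among S controller samples and the previous a*.

    a_star: previously selected module index, or None on the first visit.
    controller_logits: unnormalized scores standing in for p(a | x, phi).
    log_lik: callable a -> log p(y | x, a, theta) (hypothetical stand-in).
    """
    log_prior = log_softmax(controller_logits)

    def joint(a):
        # log p(y, a | x, theta, phi) = log p(y | x, a, theta) + log p(a | x, phi)
        return log_lik(a) + log_prior[a]

    weights = [math.exp(lp) for lp in log_prior]
    candidates = rng.choices(range(len(weights)), weights=weights, k=num_samples)
    if a_star is not None:
        # Keeping the old choice among the candidates means the joint
        # score of a* can never decrease from one E-step to the next.
        candidates.append(a_star)
    return max(candidates, key=joint)


if __name__ == "__main__":
    rng = random.Random(0)
    toy_log_lik = lambda a: [-3.0, -0.5, -2.0][a]  # made-up likelihoods
    print(partial_e_step(None, [0.0, 0.0, 0.0], toy_log_lik, 10, rng))
```

The M-step would then take a gradient step on log p(y,a*|x,θ,φ) with respect to θ and φ using the selected a*; it is omitted here because it depends on the model and the autodiff framework.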
Experimental results
Research questions
- RQ1: Can a neural network learn to decompose computation into reusable modules without explicit regularization?
- RQ2: Does end-to-end learning of module choices and module parameters yield competitive performance on language modeling and image classification?
- RQ3: Do modular networks exhibit interpretable specialization of modules to context or data subsets?
- RQ4: How does the proposed training compare to REINFORCE and noisy top-k gating in terms of stability and efficiency?
Key findings
- Modular networks achieve competitive perplexities on Penn Treebank compared to baselines and RL-based methods, with lower training noise.
- Language modeling modules specialize in grammatical/semantic contexts, indicating interpretable usage patterns.
- On CIFAR-10, modular networks improve training accuracy versus a non-modular baseline, though generalization benefits vary with controller design.
- The training method succeeds in using all modules by the end of training, with higher batch module selection entropy indicating diverse usage.
- Compared to REINFORCE and noisy top-k, the EM-based method yields lower perplexities and more deterministic module selection.
- The approach avoids explicit regularizers for diversity, relying on partial EM updates to prevent module collapse.
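The batch module selection entropy mentioned above can be illustrated with a short sketch. It reflects one plausible reading of the metric, namely the entropy of the controller's selection distribution averaged over a batch; the exact definition used in the paper may differ, and the function names here are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def batch_selection_entropy(batch_probs):
    """Entropy of the batch-averaged module selection distribution.

    batch_probs: one row per example, each row a distribution over the
    M modules (assumed to stand in for p(a | x_n, phi)).
    """
    n = len(batch_probs)
    m = len(batch_probs[0])
    mean = [sum(row[j] for row in batch_probs) / n for j in range(m)]
    return entropy(mean)


if __name__ == "__main__":
    collapsed = [[1.0, 0.0], [1.0, 0.0]]  # every example picks module 0
    diverse = [[1.0, 0.0], [0.0, 1.0]]    # examples split across modules
    print(batch_selection_entropy(collapsed))
    print(batch_selection_entropy(diverse))
```

Under this reading, a value near zero signals module collapse (the whole batch routed to one module), while a value near log M signals that the batch spreads evenly over all M modules, even if each individual selection is deterministic.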
This review was created by AI and reviewed by human editors.