[Paper Review] Learning to Collocate Neural Modules for Image Captioning
This paper proposes Learning to Collocate Neural Modules (CNM), a novel image captioning framework that mimics human-like sentence pattern generation by dynamically composing function and content-specific neural modules (for nouns, adjectives, verbs). By using soft module fusion, multi-step reasoning, and a linguistic loss to enforce part-of-speech collocations, CNM achieves state-of-the-art performance, attaining 127.9 CIDEr-D on the Karpathy split and 126.0 c40 on the official MS-COCO test server, while remaining robust under low-data settings.
We do not speak word by word from scratch; our brain quickly structures a pattern like "sth do sth at someplace" and then fills in the detailed descriptions. To equip existing encoder-decoder image captioners with such human-like reasoning, we propose a novel framework: learning to Collocate Neural Modules (CNM), to generate the "inner pattern" connecting visual encoder and language decoder. Unlike the widely-used neural module networks in visual Q&A, where the language (i.e., the question) is fully observable, CNM for captioning is more challenging as the language is being generated and thus is only partially observable. To this end, we make the following technical contributions for CNM training: 1) compact module design: one module for function words and three for visual content words (e.g., noun, adjective, and verb), 2) soft module fusion and multi-step module execution, robustifying the visual reasoning under partial observation, 3) a linguistic loss that keeps the module controller faithful to part-of-speech collocations (e.g., an adjective comes before a noun). Extensive experiments on the challenging MS-COCO image captioning benchmark validate the effectiveness of our CNM image captioner. In particular, CNM achieves a new state-of-the-art 127.9 CIDEr-D on the Karpathy split and a single-model 126.0 c40 on the official server. CNM is also robust to few training samples, e.g., when trained on only one sentence per image, CNM can halve the performance loss compared to a strong baseline.
Motivation & Objective
- To address the lack of inductive bias in existing image captioners, which leads to dataset bias and poor generalization.
- To imitate human-like sentence pattern formation—structuring a template before filling in visual concepts—thereby disentangling captioning from spurious co-occurrence patterns.
- To develop a modular, differentiable framework that can reason over visual and linguistic elements in a structured, step-by-step manner despite partial observability of the generated sentence.
- To improve robustness under low-data regimes, such as one caption per image, by leveraging structured reasoning over modules.
Proposed method
- CNM employs four distinct neural modules: one for function words (e.g., 'a'), and three for visual content words—nouns, adjectives, and verbs—each responsible for generating specific part-of-speech types.
- At each decoding step, the model uses soft attention to fuse outputs from all four modules based on the current hidden state, enabling dynamic and robust module selection under partial observation.
- Multi-step reasoning is implemented by stacking modules sequentially, allowing the model to generate complex phrases through iterative refinement of the sentence structure.
- A linguistic loss is introduced to enforce that module attention aligns with part-of-speech collocations—e.g., adjectives must precede nouns—thereby improving grammatical correctness.
- The framework is trained end-to-end using cross-entropy loss, with additional ablation studies to validate the contribution of each component.
- CNM is further enhanced by combining it with SGAE (Scene Graph Auto-Encoder), which improves performance by preserving language bias and enhancing semantic representation.
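The soft fusion, multi-step execution, and linguistic loss described in the list above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the module order, the function names (`soft_fuse`, `multi_step`, `linguistic_loss`), the toy vectors, and the choice of a cross-entropy against a one-hot part-of-speech label are all assumptions made for illustration, and real modules would be learned attention networks rather than plain functions.

```python
import math

# Module order is an assumption made for this sketch.
MODULES = ["function", "noun", "adjective", "verb"]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_fuse(module_outputs, controller_scores):
    """Blend the module output vectors using softmax weights."""
    weights = softmax(controller_scores)
    dim = len(module_outputs[0])
    fused = [
        sum(w * out[i] for w, out in zip(weights, module_outputs))
        for i in range(dim)
    ]
    return fused, weights


def multi_step(features, modules, scores_per_step):
    """Apply all modules and fuse them once per entry in scores_per_step."""
    x = features
    for scores in scores_per_step:
        outputs = [module(x) for module in modules]
        x, _ = soft_fuse(outputs, scores)
    return x


def linguistic_loss(weights, target_module):
    """Cross-entropy between the soft weights and a one-hot module label."""
    return -math.log(weights[MODULES.index(target_module)])


# Toy usage: made-up 2-d module outputs and uniform controller scores.
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
fused, weights = soft_fuse(outputs, [0.0, 0.0, 0.0, 0.0])
loss = linguistic_loss(weights, "noun")
```

Because the weights are soft rather than a hard argmax, a wrong guess by the controller early in generation only down-weights the right module instead of excluding it, which is the intuition behind the robustness claim under partial observation.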
Experimental results
Research questions
- RQ1: Can a modular, pattern-based approach to image captioning reduce reliance on dataset-specific biases such as high-co-occurrence word pairs?
- RQ2: How do soft module fusion and multi-step reasoning improve robustness when the language output is only partially observed during generation?
- RQ3: To what extent can enforcing linguistic constraints (e.g., part-of-speech order) improve grammatical accuracy and fluency in generated captions?
- RQ4: Can the proposed module collocation framework generalize effectively under low-data training regimes, such as one caption per image?
- RQ5: How does the integration of commonsense reasoning modules affect performance, and can it resolve limitations in generating contextually appropriate adjectives?
Key findings
- CNM achieves a new state-of-the-art CIDEr-D score of 127.9 on the MS-COCO Karpathy split, outperforming prior methods including strong baselines and models with larger architectures.
- On the official MS-COCO test server, CNM achieves a single-model CIDEr-D c40 score of 126.0, demonstrating strong generalization and competitive performance without ensemble methods.
- When trained with only one caption per image, CNM halves the performance degradation relative to a strong baseline, indicating superior data efficiency.
- The linguistic loss significantly improves grammatical correctness, as shown by reduced overfitting to high-co-occurrence pairs like 'man standing' in favor of more accurate descriptions.
- On the official server, CNM+SGAE reports CIDEr-D scores of 123.8 and 126.0, the latter being the c40 figure quoted in the abstract, showing that integrating language bias modeling further enhances performance.
- Ablation studies confirm that soft module fusion and multi-step reasoning are critical for robustness, especially under partial observability during generation.
This review was created by AI and reviewed by human editors.