QUICK REVIEW

[Paper Review] Blackbird Language Matrices: A Framework to Investigate the Linguistic Competence of Language Models

Paola Merlo, Chunyang Jiang|arXiv (Cornell University)|Feb 24, 2026
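Explainable Artificial Intelligence (XAI)|Cited by 0
One-sentence summary

TL;DR: Introduces Blackbird Language Matrices (BLMs), a structured, multilingual, linguistically grounded, multi-level collection of multiple-choice tasks for probing the linguistic competence and systematicity of language models, and shows how BLMs can be used to examine representations, generalisation, and interpretability in LLMs.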

ABSTRACT

This article describes a novel language task, the Blackbird Language Matrices (BLM) task, inspired by intelligence tests, and illustrates the BLM datasets, their construction and benchmarking, and targeted experiments on chunking and systematicity. BLMs are multiple-choice problems, structured at multiple levels: within each sentence, across the input sequence, within each candidate answer. Because of their rich structure, these curated, but naturalistic datasets are key to answering some core questions about current large language models' abilities: do LLMs detect linguistic objects and their properties? Do they detect and use systematic patterns across sentences? Are they more prone to linguistic or reasoning errors, and how do these interact? We show that BLMs, while challenging, can be solved at good levels of performance, in more than one language, with simple baseline models or, at better performance levels, with more tailored models. We show that their representations contain the grammatical objects and attributes relevant to solve a linguistic task. We also show that these solutions are reached by detecting systematic patterns across sentences. The paper supports the point of view that curated, structured datasets support multi-faceted investigations of properties of language and large language models. Because they present a curated, articulated structure, because they comprise both learning contexts and expected answers, and because they are partly built by hand, BLMs fall in the category of datasets that can support explainability investigations, and be useful to ask why large language models behave the way they do.

Research motivation and objectives

  • Motivate the need for tasks that probe linguistic abstraction and generalisation in LLMs beyond fluency and factual accuracy.
  • Present BLMs as curated, structured, multi-level linguistic puzzles inspired by Raven's Progressive Matrices.
  • Show how BLMs support analysis of linguistic objects, systematic patterns, and information encoded in internal representations.
  • Demonstrate the data generation workflow and the applicability of BLMs across multiple languages and phenomena.

Proposed method

  • Define the BLM task and formal framework with concepts such as linguistic phenomenon LP, context C, answer set A, and augmentation Aug.
  • Describe multiple BLM templates (Agr, CoS, OD, Spray/Load, Roll) and their language-specific adaptations across English, French, Italian, Romanian, Turkish, and Hebrew.
  • Use semi-automatic data construction with seed sentences, hand validation, and controlled augmentation to generate contexts and distractors.
  • Investigate object induction, structure dependency, and compositionality through targeted experiments and decoder-derived sentence embeddings.
  • Examine internal representations and embedding spaces to assess whether LLMs encode constituents, semantic roles, and long-distance dependencies.
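
The task structure listed above (a linguistic phenomenon LP, a context C, and an answer set A) can be pictured as a small data structure. The following is a minimal sketch, not the authors' implementation: the class name `BLMInstance`, its field names, the `accuracy` helper, and the toy sentences are all assumptions made for illustration and are not taken from the BLM datasets.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class BLMInstance:
    """One multiple-choice problem (all field names are hypothetical)."""
    phenomenon: str      # linguistic phenomenon LP, e.g. "Agr"
    context: List[str]   # context C: ordered sequence of sentences
    answers: List[str]   # answer set A: correct answer plus distractors
    correct_index: int   # position of the correct answer in `answers`

    def __post_init__(self) -> None:
        # Reject instances whose gold label does not point at a candidate.
        if not 0 <= self.correct_index < len(self.answers):
            raise ValueError("correct_index must point into answers")


def accuracy(instances: Sequence[BLMInstance],
             choose: Callable[[BLMInstance], int]) -> float:
    """Fraction of instances for which `choose` returns the gold index."""
    if not instances:
        raise ValueError("need at least one instance")
    hits = sum(choose(inst) == inst.correct_index for inst in instances)
    return hits / len(instances)


# Invented toy item for illustration only; not a real BLM problem.
toy = BLMInstance(
    phenomenon="Agr",
    context=["The cat sleeps.", "The cats sleep.", "The dog sleeps."],
    answers=["The dogs sleeps.", "The dogs sleep.", "The dog sleep."],
    correct_index=1,
)


def always_first(inst: BLMInstance) -> int:
    """Trivial baseline that always picks the first candidate."""
    return 0
```

Any model under study would plug in as the `choose` callable, so the same evaluation loop can compare a trivial baseline such as `always_first` with a more tailored system.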
Figure 1: Example of a Raven’s Progressive Matrix (RPM) from visual intelligence tests. This instance is generated with two generative rules: (i) the red dot moves one place clockwise when traversing the matrix left to right; (ii) the blue square moves one place anticlockwise when traversing the mat

Experimental results

Research questions

  • RQ1: Do LLMs detect linguistic objects and their properties beyond tokens?
  • RQ2: Do LLMs detect and exploit systematic patterns across sentences and languages?
  • RQ3: How do linguistic and reasoning errors interact in BLM solving?
  • RQ4: What do internal representations in LLMs reveal about chunks, constituents, and semantic roles?
  • RQ5: Do abstractions supporting systematicity hold across languages and tasks?

Main findings

  • BLMs can be solved by models at good performance levels across multiple languages using simple baselines or more tailored models.
  • The models' representations contain the grammatical objects and attributes relevant to solving the tasks.
  • Solutions arise from detecting systematic patterns across sentences, not just surface cues.
  • BLMs support explainability investigations by structuring learning contexts, expected answers, and hand-built stimuli.
  • The framework enables multi-faceted probing of language models, including object induction, structure dependency, and compositional generalisation.
Figure 13: Data flow for the automatic creation of the BLM structured datasets.
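
As a rough illustration of what controlled augmentation from seed sentences could look like, the sketch below expands a sentence template by substituting lexical variants into named slots. This is an assumption-laden toy, not the pipeline shown in the figure: the `augment` function, the slot names, and the example words are all hypothetical.

```python
from itertools import product
from typing import Dict, List


def augment(template: str, slots: Dict[str, List[str]]) -> List[str]:
    """Fill every slot of `template` with every combination of variants.

    `template` uses str.format placeholders such as "{noun}"; `slots`
    maps each placeholder name to its list of lexical variants.
    """
    names = list(slots)
    sentences = []
    for combo in product(*(slots[name] for name in names)):
        # Pair each slot name with the variant chosen for this combination.
        sentences.append(template.format(**dict(zip(names, combo))))
    return sentences


# Hypothetical seed template and variants, for illustration only.
seed_template = "The {noun} {verb}."
variants = {"noun": ["cat", "dog"], "verb": ["sleeps", "runs"]}
generated = augment(seed_template, variants)
```

In a semi-automatic workflow of the kind the summary describes, output like `generated` would still go through hand validation before being used as contexts or distractors.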

This review was generated by AI and reviewed by a human editor.