QUICK REVIEW

[Paper Review] Blackbird Language Matrices: A Framework to Investigate the Linguistic Competence of Language Models

Paola Merlo, Chunyang Jiang|arXiv (Cornell University)|2026. 02. 24.

Explainable Artificial Intelligence (XAI) | Citations: 0

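One-Line Summary

The paper introduces Blackbird Language Matrices (BLMs), a multilingual set of structured, linguistically grounded, multi-level multiple-choice tasks for probing the linguistic competence and systematicity of language models. It demonstrates how BLMs can be used to examine representations, generalization, and explainability in LLMs.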

ABSTRACT

This article describes a novel language task, the Blackbird Language Matrices (BLM) task, inspired by intelligence tests, and illustrates the BLM datasets, their construction and benchmarking, and targeted experiments on chunking and systematicity. BLMs are multiple-choice problems, structured at multiple levels: within each sentence, across the input sequence, within each candidate answer. Because of their rich structure, these curated, but naturalistic datasets are key to answer some core questions about current large language models abilities: do LLMs detect linguistic objects and their properties? Do they detect and use systematic patterns across sentences? Are they more prone to linguistic or reasoning errors, and how do these interact? We show that BLMs, while challenging, can be solved at good levels of performance, in more than one language, with simple baseline models or, at better performance levels, with more tailored models. We show that their representations contain the grammatical objects and attributes relevant to solve a linguistic task. We also show that these solutions are reached by detecting systematic patterns across sentences. The paper supports the point of view that curated, structured datasets support multi-faceted investigations of properties of language and large language models. Because they present a curated, articulated structure, because they comprise both learning contexts and expected answers, and because they are partly built by hand, BLMs fall in the category of datasets that can support explainability investigations, and be useful to ask why large language models behave the way they do.

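Motivation and Objectives

Motivate the need for tasks that probe linguistic abstraction and generalization in LLMs, beyond fluency and factual accuracy.
Present BLMs as curated, structured, multi-level language puzzles inspired by Raven's Progressive Matrices.
Show how BLMs help analyze linguistic objects, systematic patterns, and the information encoded in internal representations.
Demonstrate the data-generation workflow and the applicability of BLMs across multiple languages and linguistic phenomena.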

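Proposed Method

Define the BLM task and its formal framework, including the linguistic phenomenon LP, the context C, the answer set A, and the augmentation Aug.
Describe several BLM templates (Agr, CoS, OD, Spray/Load, Roll) and their language-specific adaptations for English, French, Italian, Romanian, Turkish, and Hebrew.
Use a semi-automatic data-construction method that generates contexts and incorrect answers from seed sentences, manual validation, and controlled augmentation.
Investigate object induction, structure dependence, and compositionality through targeted experiments and sentence embeddings derived from decoders.
Examine internal representations and embedding spaces to assess whether LLMs encode constituents, semantic roles, and long-distance dependencies.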
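
The formal components listed above can be pictured as a small data structure. The following is a minimal sketch, not the authors' implementation: the class name `BLMInstance`, its fields, and the helper `shuffle_answers` are hypothetical, chosen only to mirror the LP / C / A vocabulary of this summary.

```python
from dataclasses import dataclass
import random


@dataclass(frozen=True)
class BLMInstance:
    """One multiple-choice BLM problem (hypothetical layout, for illustration)."""
    phenomenon: str            # LP: the linguistic phenomenon under study
    context: tuple[str, ...]   # C: ordered sentences that exhibit the pattern
    answers: tuple[str, ...]   # A: candidate sentences, exactly one is correct
    correct_index: int         # position of the correct answer within `answers`

    def is_correct(self, choice: int) -> bool:
        """True if the chosen answer index is the correct one."""
        return choice == self.correct_index


def shuffle_answers(item: BLMInstance, seed: int) -> BLMInstance:
    """Return a copy with the answers permuted, keeping track of the correct one."""
    rng = random.Random(seed)
    order = list(range(len(item.answers)))
    rng.shuffle(order)
    return BLMInstance(
        phenomenon=item.phenomenon,
        context=item.context,
        answers=tuple(item.answers[i] for i in order),
        # The correct answer moves to wherever its old index landed in `order`.
        correct_index=order.index(item.correct_index),
    )


# Placeholder strings stand in for real sentences; they are not from the datasets.
example = BLMInstance(
    phenomenon="Agr",
    context=("context sentence 1", "context sentence 2"),
    answers=("candidate a", "candidate b", "candidate c"),
    correct_index=1,
)
shuffled = shuffle_answers(example, seed=0)
```

Shuffling with a fixed seed keeps the answer position from becoming a cue while leaving each item reproducible. The augmentation step (Aug) is left out here because this summary does not specify how it operates.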

Figure 1: Example of a Raven’s Progressive Matrix (RPM) from visual intelligence tests. This instance is generated with two generative rules: (i) the red dot moves one place clockwise when traversing the matrix left to right; (ii) the blue square moves one place anticlockwise when traversing the mat

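Experimental Results

Research Questions

RQ1: Do LLMs detect linguistic objects and their properties, beyond tokens?
RQ2: Do LLMs detect and use systematic patterns across sentences and languages?
RQ3: How do linguistic errors and reasoning errors interact when solving BLMs?
RQ4: What do the internal representations of LLMs reveal about chunks, constituents, and semantic roles?
RQ5: Do the abstractions that support systematicity hold across languages and tasks?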
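
A question like RQ3 could be approached by tagging each incorrect candidate with the kind of violation it contains and tallying which kind a model selects. The sketch below illustrates that idea under stated assumptions: the labels `"correct"`, `"linguistic"`, and `"reasoning"` and the functions `tally_choices` and `accuracy` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def tally_choices(answer_labels, choices):
    """Count how often each answer label was selected.

    answer_labels: one list of labels per item, aligned with that item's answers
                   (hypothetical labels: "correct", "linguistic", "reasoning").
    choices:       the answer index the model picked for each item.
    """
    counts = Counter()
    for labels, choice in zip(answer_labels, choices):
        counts[labels[choice]] += 1
    return counts


def accuracy(counts):
    """Share of items where the selected answer was labelled "correct"."""
    total = sum(counts.values())
    return counts["correct"] / total if total else 0.0


# Toy data, invented for illustration only.
labels = [
    ["correct", "linguistic", "reasoning"],
    ["reasoning", "correct", "linguistic"],
    ["linguistic", "reasoning", "correct"],
    ["correct", "reasoning", "linguistic"],
]
picked = [0, 2, 2, 1]
counts = tally_choices(labels, picked)
```

Comparing the per-label counts across models or languages would show whether mistakes cluster on one error type. Studying how the two types interact would need candidates that carry more than one label, which this sketch does not model.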

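Key Findings

BLMs can be solved at good levels of performance across multiple languages, using simple baselines or, with better results, more tailored models.
The models' representations contain the grammatical objects and attributes relevant to solving the task.
Solutions come from detecting systematic patterns across sentences, not only from surface cues.
Because BLMs combine structured learning contexts, expected answers, and partly hand-built stimuli, they support explainability research.
The framework enables multi-faceted investigation of language models, including object induction, structure dependence, and compositional generalization.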

Figure 13: Data flow for the automatic creation of the BLM structured datasets.

This review was generated by AI and reviewed by a human editor.