QUICK REVIEW

[Paper Review] BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages

Junho Myung, Nayeon Lee | arXiv (Cornell University) | 2024-06-14

Library Science and Information Systems | Citations: 5

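ONE-LINE SUMMARY

BLEnD is a hand-crafted benchmark for LLMs that evaluates everyday cultural knowledge across 16 regions and 13 languages; its 52.6k question-answer pairs reveal substantial performance gaps for underrepresented cultures and languages.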

ABSTRACT

Large language models (LLMs) often lack culture-specific knowledge of daily life, especially across diverse regions and non-English languages. Existing benchmarks for evaluating LLMs' cultural sensitivities are limited to a single language or collected from online sources such as Wikipedia, which do not reflect the mundane everyday lifestyles of diverse regions. That is, information about the food people eat for their birthday celebrations, spices they typically use, musical instruments youngsters play, or the sports they practice in school is common cultural knowledge but uncommon in easily collected online sources, especially for underrepresented cultures. To address this issue, we introduce BLEnD, a hand-crafted benchmark designed to evaluate LLMs' everyday knowledge across diverse cultures and languages. BLEnD comprises 52.6k question-answer pairs from 16 countries/regions, in 13 different languages, including low-resource ones such as Amharic, Assamese, Azerbaijani, Hausa, and Sundanese. We construct the benchmark to include two formats of questions: short-answer and multiple-choice. We show that LLMs perform better for cultures that are highly represented online, with a maximum 57.34% difference in GPT-4, the best-performing model, in the short-answer format. For cultures represented by mid-to-high-resource languages, LLMs perform better in their local languages, but for cultures represented by low-resource languages, LLMs perform better in English than the local languages. We make our dataset publicly available at: https://github.com/nlee0212/BLEnD.

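MOTIVATION AND GOALS

Evaluate how well LLMs capture everyday cultural common sense across diverse regions and languages.
Provide a multilingual, culturally diverse dataset that reflects everyday life beyond English-centric sources.
Enable cross-cultural evaluation in both short-answer and multiple-choice formats.
Identify biases and performance gaps tied to language resource level and regional representation.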

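PROPOSED METHOD

Build 52.6k question-answer pairs across 16 regions and 13 languages in six categories (food, sports, family, education, holidays/leisure, work-life).
Create 500 short-answer question (SAQ) templates per region and generate the corresponding multiple-choice question (MCQ) items, with 1,942–3,699 MCQ items per region where applicable.
Collect answer annotations from native speakers in each region, tally the votes, and translate the annotations into English.
Evaluate 16 mainstream and region-centric LLMs on SAQ (local language and English) and MCQ (English only) to compare cross-cultural performance.
When scoring, normalize multilingual answers with lemmatization/stemming and accent removal.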
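
The vote tallying and answer normalization steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' scoring code: the function names, the `min_votes` threshold, and the sample answers are all hypothetical, and the lemmatization/stemming step is left out because it needs language-specific tooling. Only accent removal, case folding, and whitespace cleanup are shown. Note that stripping combining marks also affects some non-Latin scripts, which is harmless here only because both sides are normalized the same way.

```python
import unicodedata
from collections import Counter


def normalize(text: str) -> str:
    """Casefold, strip combining marks (accents), and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_marks.split())


def aggregate_answers(annotations: list[str]) -> Counter:
    """Tally annotator answers after normalization (one vote per annotation)."""
    return Counter(normalize(a) for a in annotations)


def is_correct(model_answer: str, votes: Counter, min_votes: int = 1) -> bool:
    """Hypothetical rule: accept an answer that at least `min_votes` annotators gave."""
    return votes.get(normalize(model_answer), 0) >= min_votes


# Made-up annotations for a single question, for illustration only.
votes = aggregate_answers(["Paella", "paella", "Páella", "Tortilla de patatas"])
print(votes.most_common())
print(is_correct("PAELLA", votes, min_votes=2))
```

Matching on normalized strings keeps spelling variants of the same answer from being counted as different answers, which matters most for languages where accents are written inconsistently.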

Figure 1: The overall framework of dataset construction and LLM evaluation on BLEnD. BLEnD is built through 4 steps: question collection, question filtering & translation, answer annotation, and answer aggregation. The dataset includes the same questions in 13 different languages, answered from

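EXPERIMENTAL RESULTS

Research questions

RQ1: How well do large language models (LLMs) know everyday cultural knowledge across diverse languages and regions?
RQ2: Do LLMs perform better in the local language for highly represented cultures and in English for underrepresented cultures?
RQ3: How does the prompt language affect LLM performance across cultures and languages?
RQ4: Which cultural domains (food, holidays, education, etc.) are harder for LLMs to answer accurately?
RQ5: To what extent do language resource level and regional representation correlate with LLM performance?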

Key results

Country/Region	Language (SAQ)	SAQ Count	Language (MCQ)	MCQ Count
US	English	500	English	1,942
GB	English	500	English	2,167
CN	English (en)	1,000	English (en)	1,929
ES	English (en)	1,000	English (en)	1,931
MX	English (en)	1,000	English (en)	1,899
ID	English (en)	1,000	English (en)	1,995
KR	English (en)	1,000	Korean (ko)	2,512
GR	English (en)	1,000	Greek (el)	2,734
IR	English (en)	1,000	Persian (fa)	3,699
DZ	English (en)	1,000	Arabic (ar)	2,600
AZ	English (en)	1,000	Azerbaijani (az)	2,297
KP	English (en)	1,000	Korean (ko)	2,185
JB	English (en)	1,000	Sundanese (su)	2,345
AS	English (en)	1,000	Assamese (as)	2,451
NG	English (en)	1,000	Hausa (ha)	2,008
ET	English (en)	1,000	Amharic (am)	2,863
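
As a quick sanity check, the per-region counts in the table can be summed and compared with the 52.6k figure quoted in the abstract. The sketch below copies the SAQ and MCQ counts from the table as they appear here; it assumes (this is not stated explicitly in the review) that the 52.6k total counts SAQ and MCQ items together.

```python
# (region, SAQ count, MCQ count), copied from the table above.
COUNTS = [
    ("US", 500, 1942), ("GB", 500, 2167), ("CN", 1000, 1929), ("ES", 1000, 1931),
    ("MX", 1000, 1899), ("ID", 1000, 1995), ("KR", 1000, 2512), ("GR", 1000, 2734),
    ("IR", 1000, 3699), ("DZ", 1000, 2600), ("AZ", 1000, 2297), ("KP", 1000, 2185),
    ("JB", 1000, 2345), ("AS", 1000, 2451), ("NG", 1000, 2008), ("ET", 1000, 2863),
]

saq_total = sum(saq for _, saq, _ in COUNTS)
mcq_total = sum(mcq for _, _, mcq in COUNTS)
total = saq_total + mcq_total

print(f"regions={len(COUNTS)} SAQ={saq_total} MCQ={mcq_total} total={total}")
# prints: regions=16 SAQ=15000 MCQ=37557 total=52557
```

Under that assumption the table is consistent with the headline number: 52,557 rounds to 52.6k.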

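LLMs show a substantial performance gap between highly represented and underrepresented cultures.
Average SAQ performance is 79.22% for US culture (English) but drops to 12.18% for ET culture (Amharic).
Region-centric models tend to outperform general-purpose models on their home regions (e.g., GPT-4 vs. North Korea for KP; HyperCLOVA-X for KR).
Local-language prompts improve performance for cultures represented by mid-to-high-resource languages, whereas for cultures represented by low-resource languages English prompts outperform local-language prompts.
MCQ accuracy is generally higher than SAQ accuracy, and SAQ and MCQ results are strongly correlated across languages and regions.
The food and holiday categories are harder for LLMs than work-life or education.
Some responses show traces of cultural bias and stereotyping, especially for underrepresented regions.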
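
To make the size of the US versus ET gap concrete, the two average SAQ scores quoted above can be compared directly. This is simple arithmetic on the figures in this review; the variable names are illustrative only.

```python
# Average SAQ scores quoted in the results above (percent).
us_avg = 79.22  # US culture, English
et_avg = 12.18  # ET culture, Amharic

gap_pp = round(us_avg - et_avg, 2)  # absolute gap in percentage points
ratio = round(us_avg / et_avg, 2)   # how many times higher the US score is

print(f"gap: {gap_pp} percentage points, ratio: {ratio}x")
```

The absolute gap comes to 67.04 percentage points. This is an average across the evaluated models, so it is a different quantity from the 57.34% maximum gap that the abstract reports for GPT-4 alone.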

Figure 2: Heatmap showing the average number of common lemmas within each question between all country/region pairs. Pairs from the same countries/regions are shown in white. Higher numbers of shared lemmas indicate that those countries/regions provide more similar answers compared to other countrie

This review was generated by AI and reviewed by a human editor.