QUICK REVIEW

[Paper Review] Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain

Ömer Uğur, Mahmut Göksu|arXiv (Cornell University)|2026. 01. 22.

Topic Modeling | Citations: 0

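One-line summary

The paper presents Mecellem: (1) ModernBERT-based Turkish encoders for legal NLP, pre-trained from scratch on 112.7B tokens and paired with a checkpoint selection strategy driven by downstream performance; (2) Qwen decoder models adapted through continual pre-training (CPT) that achieve a 36.2% perplexity reduction on Turkish legal text.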

ABSTRACT

This paper presents Mecellem models, a framework for developing specialized language models for the Turkish legal domain through domain adaptation strategies. We make two contributions: (1) Encoder Model Pre-trained from Scratch: ModernBERT-based bidirectional encoders pre-trained on a Turkish-dominant corpus of 112.7 billion tokens. We implement a checkpoint selection strategy that evaluates downstream retrieval performance throughout training, revealing that optimal checkpoints achieve best retrieval scores before pre-training loss reaches its minimum. Our encoder models achieve top-3 rankings on the Turkish retrieval leaderboard, with smaller models (155M parameters) achieving comparable performance to larger reference models (307M-567M parameters). Our approach achieves 92.36% production efficiency compared to state-of-the-art models (embeddinggemma-300m: 100.00%, BAAI/bge-m3: 99.54%, newmindai/bge-m3-stsb: 94.38%), ranking fourth overall despite requiring less computational resources. SOTA models rely on multi-stage, computationally intensive training pipelines, making our single-stage pre-training followed by efficient post-training approach a cost-effective alternative; (2) Decoder Model with Continual Pre-training (CPT): Qwen3-1.7B and Qwen3-4B models adapted to Turkish legal domain through controlled curriculum learning. Four-phase CPT with optimal sample ratios enables gradual transition from general language knowledge to specialized legal terminology and long-context reasoning. This approach achieves 36.2% perplexity reduction on Turkish legal text, demonstrating domain adaptation gains.

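Research motivation and goals

Develop a Turkish legal NLP encoder by pre-training from scratch on a large Turkish-dominant corpus.
Show that downstream retrieval performance can guide the selection of effective pre-training checkpoints.
Adapt decoder models to the Turkish legal domain through continual pre-training with curriculum learning.
Evaluate and compare embedding and retrieval performance against state-of-the-art embedding models.
Provide a scalable, cost-effective alternative to multi-stage training pipelines for domain adaptation.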

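Proposed method

Pre-train ModernBERT-based bidirectional encoders from scratch on 112.7B Turkish-dominant tokens with a masked language modeling (MLM) objective.
Implement a checkpoint selection strategy that monitors downstream retrieval performance to pick the best pre-training checkpoint.
Post-train the encoders for embedding tasks with several contrastive learning methods (InfoNCE variants and GISTEmbed with a cached guide).
Apply CPT with a four-phase curriculum focused on Turkish legal content to the Qwen3-1.7B and Qwen3-4B decoders.
Run ablation studies to identify effective initialization and data-ratio configurations for CPT and curriculum learning.
Curate and preprocess a large Turkish legal and general corpus with SemHash-based deduplication and FineWeb quality filtering.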
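
The checkpoint selection step listed above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the `Checkpoint` record, its field names, and every number below are hypothetical. The only point is to show how selecting by downstream retrieval score can pick an earlier checkpoint than selecting by minimum pre-training loss, which is the behavior the abstract reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical record of one saved pre-training checkpoint."""
    step: int
    mlm_loss: float         # pre-training (MLM) loss at this step
    retrieval_score: float  # downstream retrieval metric, higher is better


def select_by_retrieval(checkpoints):
    """Return the checkpoint with the highest downstream retrieval score."""
    return max(checkpoints, key=lambda c: c.retrieval_score)


def select_by_loss(checkpoints):
    """Return the checkpoint with the lowest pre-training loss."""
    return min(checkpoints, key=lambda c: c.mlm_loss)


# Made-up values for illustration only; they are not results from the paper.
history = [
    Checkpoint(step=10_000, mlm_loss=2.10, retrieval_score=0.41),
    Checkpoint(step=20_000, mlm_loss=1.85, retrieval_score=0.47),
    Checkpoint(step=30_000, mlm_loss=1.72, retrieval_score=0.52),
    Checkpoint(step=40_000, mlm_loss=1.66, retrieval_score=0.50),
    Checkpoint(step=50_000, mlm_loss=1.63, retrieval_score=0.48),
]

best_retrieval = select_by_retrieval(history)
best_loss = select_by_loss(history)
print("best by retrieval:", best_retrieval.step)
print("best by loss:", best_loss.step)
```

In this toy history the loss keeps falling through the last step while the retrieval score peaks earlier, so the two criteria disagree. In practice the retrieval metric would come from evaluating each saved checkpoint on a held-out retrieval benchmark.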

Figure 1: Natural completion rate over a 6.5-hour extraction run.

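Experimental results

Research questions

RQ1: Can a Turkish encoder trained from scratch achieve competitive legal retrieval performance on Turkish data?
RQ2: Does evaluating downstream retrieval during pre-training yield better checkpoints than selecting the point of minimum pre-training loss?
RQ3: How does continual pre-training of decoders with a four-phase curriculum affect Turkish legal terminology use and long-context reasoning?
RQ4: Which dataset composition, deduplication, and filtering strategies best balance domain adaptation against preservation of general language ability?
RQ5: How do model scale and training strategy compare with existing SOTA approaches to Turkish legal NLP?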

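Key results

The encoder models rank in the top 3 on the Turkish retrieval leaderboard.
The smaller encoder (155M parameters) achieves performance comparable to larger reference models (307M-567M parameters).
Contrastive post-training of the encoders yields competitive Turkish legal embeddings on retrieval benchmarks.
Decoder CPT on Turkish legal data yields a 36.2% perplexity reduction.
Four-phase CPT with optimized sample ratios enables gradual domain adaptation while preserving general language ability.
The method offers a cost-effective alternative to multi-stage training pipelines and reaches 92.36% production efficiency relative to the top-ranked model (embeddinggemma-300m at 100.00%), placing fourth overall.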
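
To make the perplexity figure concrete, the sketch below shows how a relative perplexity reduction is typically computed. It assumes the standard definition of perplexity as the exponential of the mean per-token negative log-likelihood. The function names and the baseline perplexity of 10.0 are hypothetical, since this review does not give the paper's raw perplexity values.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def relative_reduction(base_ppl, adapted_ppl):
    """Fractional perplexity reduction of the adapted model vs. the base model."""
    return (base_ppl - adapted_ppl) / base_ppl


# Hypothetical baseline perplexity; not a value reported in the paper.
base_ppl = 10.0
# A 36.2% reduction means the adapted model keeps 63.8% of the base perplexity.
adapted_ppl = base_ppl * (1 - 0.362)

reduction = relative_reduction(base_ppl, adapted_ppl)
print(f"relative reduction: {reduction:.1%}")
```

The reduction is relative rather than absolute, so the same 36.2% applies whatever the base model's starting perplexity is.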

Figure 2: Token Count Distribution Analysis Across All Threshold Combinations.

This review was generated by AI and reviewed by a human editor.