QUICK REVIEW

[Paper Review] Reasoning Language Models for complex assessments tasks: Evaluating parental cooperation from child protection case reports

Dragan Stoll, Brian E. Perron|arXiv (Cornell University)|2026. 02. 15.

Language Development and Disorders | Citations: 0

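One-Line Summary

This paper uses reasoning language models (RLMs) to assess parental cooperation from child protection case reports and compares models of different sizes against human-validated classifications.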

ABSTRACT

Purpose: Reasoning language models (RLMs) have demonstrated significant advances in solving complex reasoning tasks. We examined their potential to assess parental cooperation during CPS interventions using case reports, a case factor characterized by ambiguous and conflicting information. Methods: A four stage workflow comprising (1) case reports collection, (2) reasoning-based assessment of parental cooperation, (3) automated category extraction, and (4) case labeling was developed. The performance of RLMs with different parameter sizes (255B, 32B, 4B) was compared against human validated data. Two expert human reviewers (EHRs) independently classified a weighted random sample of reports. Results: The largest RLM achieved the highest accuracy (89%), outperforming the initial approach (80%). Classification accuracy was higher for mothers (93%) than for fathers (85%), and EHRs exhibited similar differences. Conclusions: RLMs' reasoning can effectively assess complex case factors such as parental cooperation. Lower accuracy in assessing fathers' cooperation supports the argument of a stronger professional focus on mothers in CPS interventions.

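Research Motivation and Objectives

Examine the value of reasoning language models for complex CPS assessment tasks characterized by ambiguous and conflicting information.
Develop a four-stage workflow for processing CPS case reports to assess parental cooperation.
Compare RLM performance across model sizes against human-validated classifications.
Identify possible gender-related bias in the assessment of parental cooperation.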

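Proposed Method

Four-stage workflow: (1) case report collection, (2) reasoning-based assessment of parental cooperation, (3) automated category extraction, and (4) case labeling.
RLMs with 255B, 32B, and 4B parameters are compared against human-validated data to evaluate accuracy.
Two expert human reviewers independently classify a weighted random sample of reports.
RLM accuracy is quantified and compared with human performance.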
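The four stages above can be sketched as a small pipeline. This is a minimal illustration, not the authors' implementation: the `CaseReport` type, the `toy_reasoner` stand-in for a reasoning model call, the three category names, and the "Final assessment:" output convention are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical label set; the paper's actual categories are not listed in this review.
CATEGORIES = ("cooperative", "uncooperative", "unclear")


@dataclass
class CaseReport:
    case_id: str
    parent: str  # "mother" or "father"
    text: str


def collect_reports(raw: List[Dict[str, str]]) -> List[CaseReport]:
    """Stage 1: turn raw records into CaseReport objects, skipping empty texts."""
    return [
        CaseReport(r["case_id"], r["parent"], r["text"].strip())
        for r in raw
        if r.get("text", "").strip()
    ]


def toy_reasoner(report: CaseReport) -> str:
    """Stage 2 stand-in: a real system would call a reasoning language model here.

    This keyword rule only exists so the sketch runs end to end.
    """
    text = report.text.lower()
    if "refused" in text or "missed" in text:
        verdict = "uncooperative"
    elif "attended" in text or "agreed" in text:
        verdict = "cooperative"
    else:
        verdict = "unclear"
    return f"Report on the {report.parent} reviewed. Final assessment: {verdict}"


def extract_category(reasoning: str) -> str:
    """Stage 3: pull the final category out of the free-text reasoning."""
    match = re.search(r"final assessment:\s*(\w+)", reasoning, flags=re.IGNORECASE)
    if match and match.group(1).lower() in CATEGORIES:
        return match.group(1).lower()
    return "unclear"  # fall back when no recognizable verdict is present


def label_cases(
    reports: List[CaseReport], reasoner: Callable[[CaseReport], str]
) -> Dict[Tuple[str, str], str]:
    """Stage 4: attach one label to each (case_id, parent) pair."""
    return {(r.case_id, r.parent): extract_category(reasoner(r)) for r in reports}
```

Passing the reasoner in as a callable keeps stages 3 and 4 independent of which model size is used, so the same extraction and labeling code could be rerun for each model being compared.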

Experimental Results

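Research Questions

RQ1: Can reasoning language models reliably assess parental cooperation from CPS case reports that contain ambiguous information?
RQ2: How does model size affect the accuracy of parental cooperation classification?
RQ3: Do human and model assessments differ in accuracy between mothers and fathers?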

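Key Findings

The largest RLM achieved the highest accuracy (89%), outperforming the initial approach (80%).
Classification accuracy was higher for mothers (93%) than for fathers (85%).
The expert human reviewers showed a similar accuracy gap between mothers and fathers.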
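Per-group accuracy figures like those above come from comparing predicted labels with human-validated labels separately for each parent. A minimal sketch of that calculation follows; the `accuracy_by_group` helper is hypothetical, and the rows are made-up toy values, not the study's data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_group(rows: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """Compute accuracy per group from (group, predicted_label, reference_label) rows."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for group, predicted, reference in rows:
        total[group] += 1
        correct[group] += int(predicted == reference)
    return {group: correct[group] / total[group] for group in total}


# Toy rows for illustration only (not results from the paper).
toy_rows = [
    ("mother", "cooperative", "cooperative"),
    ("mother", "uncooperative", "uncooperative"),
    ("father", "cooperative", "uncooperative"),
    ("father", "cooperative", "cooperative"),
]

print(accuracy_by_group(toy_rows))  # {'mother': 1.0, 'father': 0.5}
```

The sketch weights every report equally. Because the reviewed reports were drawn as a weighted random sample, the study's own figures may have been computed differently.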


This review was generated by AI and reviewed by a human editor.