QUICK REVIEW

[Paper Review] Reasoning Language Models for complex assessments tasks: Evaluating parental cooperation from child protection case reports

Dragan Stoll, Brian E. Perron|arXiv (Cornell University)|Feb 15, 2026

Language Development and Disorders | Cited by 0

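One-Sentence Summary

This paper evaluates how well reasoning language models (RLMs) can assess parental cooperation from child protection case reports, comparing models of different sizes against expert human reviewers.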

ABSTRACT

Purpose: Reasoning language models (RLMs) have demonstrated significant advances in solving complex reasoning tasks. We examined their potential to assess parental cooperation during CPS interventions using case reports, a case factor characterized by ambiguous and conflicting information. Methods: A four stage workflow comprising (1) case reports collection, (2) reasoning-based assessment of parental cooperation, (3) automated category extraction, and (4) case labeling was developed. The performance of RLMs with different parameter sizes (255B, 32B, 4B) was compared against human validated data. Two expert human reviewers (EHRs) independently classified a weighted random sample of reports. Results: The largest RLM achieved the highest accuracy (89%), outperforming the initial approach (80%). Classification accuracy was higher for mothers (93%) than for fathers (85%), and EHRs exhibited similar differences. Conclusions: RLMs' reasoning can effectively assess complex case factors such as parental cooperation. Lower accuracy in assessing fathers' cooperation supports the argument of a stronger professional focus on mothers in CPS interventions.

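Research Motivation and Objectives

Motivate the use of reasoning language models to handle the ambiguous and conflicting information found in CPS assessment tasks.
Develop a four-stage workflow that processes CPS case reports to assess parental cooperation.
Compare RLM performance across model sizes against human-validated classifications.
Identify possible gender-related bias in the assessment of parental cooperation.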

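Proposed Method

Four-stage workflow: (1) case report collection, (2) reasoning-based assessment of parental cooperation, (3) automated category extraction, (4) case labeling.
Compare RLMs with 255B, 32B, and 4B parameters against human-validated data.
Have two expert human reviewers (EHRs) independently classify a weighted random sample of reports.
Quantify RLM accuracy and compare it with human performance.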
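The four stages listed above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the category names, the prompt wording, the `CaseLabel` record, and the `fake_reason` stand-in for a reasoning model are all assumptions made for the example, since the summary does not give those details.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical label set; the summary does not list the paper's categories.
CATEGORIES = ("cooperative", "uncooperative", "unclear")


@dataclass
class CaseLabel:
    case_id: str
    parent: str               # "mother" or "father"
    category: Optional[str]   # None when no category could be extracted
    reasoning: str            # free-text output kept for later review


def assess(report: str, parent: str, reason: Callable[[str], str]) -> str:
    """Stage 2: ask a reasoning model for a free-text assessment."""
    prompt = (
        f"Assess the {parent}'s cooperation in this case report. "
        f"End with one of: {', '.join(CATEGORIES)}.\n\n{report}"
    )
    return reason(prompt)


def extract_category(reasoning: str) -> Optional[str]:
    """Stage 3: take the last category named in the model output."""
    pattern = r"\b(" + "|".join(CATEGORIES) + r")\b"
    matches = re.findall(pattern, reasoning.lower())
    return matches[-1] if matches else None


def label_cases(reports: Dict[str, str],
                reason: Callable[[str], str]) -> List[CaseLabel]:
    """Stages 1 and 4: iterate over collected reports and label each parent."""
    labels = []
    for case_id, report in reports.items():
        for parent in ("mother", "father"):
            reasoning = assess(report, parent, reason)
            labels.append(
                CaseLabel(case_id, parent, extract_category(reasoning), reasoning)
            )
    return labels


def fake_reason(prompt: str) -> str:
    """Stand-in for a real reasoning model (assumption, for demonstration only)."""
    if "the mother's" in prompt:
        return "She attended all meetings. Verdict: cooperative"
    return "He missed visits and refused contact. Verdict: uncooperative"


labels = label_cases({"case-1": "placeholder report text"}, fake_reason)
```

Keeping the free-text reasoning next to the extracted category is one possible design choice: it lets a human reviewer check why a label was assigned when the model and the reviewer disagree.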

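Experimental Results

Research Questions

RQ1: Can reasoning language models reliably assess parental cooperation from CPS case reports that contain ambiguous information?
RQ2: How does model size affect accuracy in classifying parental cooperation?
RQ3: Is there an observable difference in accuracy between mothers and fathers, for both the models and the human reviewers?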

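Key Findings

The largest RLM achieved the highest accuracy (89%), outperforming the initial approach (80%).
Classification accuracy was higher for mothers (93%) than for fathers (85%).
The expert human reviewers showed a similar mother-father difference in accuracy.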
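Accuracy figures of this kind, overall and split by parent, can be computed from labeled records as in the sketch below. The record layout and the `accuracy_by_group` helper are assumptions for illustration, and the toy records are invented purely to exercise the function; they are not data from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (parent, model_label, reference_label); layout is an assumption
Record = Tuple[str, str, str]


def accuracy_by_group(records: Iterable[Record]) -> Dict[str, float]:
    """Return accuracy per parent group plus an 'overall' entry."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for parent, predicted, reference in records:
        # Count each record once for its group and once for the overall score.
        for key in (parent, "overall"):
            totals[key] += 1
            hits[key] += int(predicted == reference)
    return {key: hits[key] / totals[key] for key in totals}


# Invented toy records, not taken from the study.
toy = [
    ("mother", "cooperative", "cooperative"),
    ("mother", "uncooperative", "uncooperative"),
    ("father", "cooperative", "uncooperative"),
    ("father", "uncooperative", "uncooperative"),
]
scores = accuracy_by_group(toy)
```

Running the same function on the human reviewers' labels against the same reference would show whether a mother-father gap appears for both the model and the reviewers.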


This review was generated by AI and reviewed by human editors.