QUICK REVIEW

[Paper Review] The Czech Court Decisions Corpus (CzCDC): Availability as the First Step

Tereza Novotná, Jakub Harašta|arXiv (Cornell University)|Oct 21, 2019

European and International Law Studies | Cited by 2

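One-Sentence Summary

This paper introduces the Czech Court Decisions Corpus (CzCDC), a freely available corpus of 237,723 plain-text decisions from the Czech Constitutional Court, Supreme Administrative Court, and Supreme Court (1993–2018). The data were obtained through freedom-of-information requests and web scraping. The corpus gives natural language processing and legal research bulk access to these decisions, overcoming earlier problems of limited availability, inconsistent formats, and the restrictions of commercial channels.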

ABSTRACT

In this paper, we describe the Czech Court Decision Corpus (CzCDC). CzCDC is a dataset of 237,723 decisions published by the Czech apex (or top-tier) courts, namely the Supreme Court, the Supreme Administrative Court and the Constitutional Court. All the decisions were published between 1st January 1993 and 30th September 2018. Court decisions are available on the webpages of the respective courts or via commercial databases of legal information. This often leads researchers interested in these decisions to reach either to respective court or to commercial provider. This leads to delays and additional costs. These are further exacerbated by a lack of inter-court standard in the terms of the data format in which courts provide their decisions. Additionally, courts' databases often lack proper documentation. Our goal is to make the dataset of court decisions freely available online in consistent (plain) format to lower the cost associated with obtaining data for future research. We believe that simplified access to court decisions through the CzCDC could benefit other researchers. In this paper, we describe the processing of decisions before their inclusion into CzCDC and basic statistics of the dataset. This dataset contains plain texts of court decisions and these texts are not annotated for any grammatical or syntactical features.

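Motivation and Objectives

Address the lack of bulk, standardized access to decisions of the Czech apex courts, which are essential for legal research and natural language processing.
Overcome barriers such as proprietary databases, inconsistent data formats, and court websites that restrict bulk access.
Provide a free, consistent, and well-documented corpus to lower research costs and speed up the development of legal natural language processing in the Czech context.
Establish a foundational resource for future annotation, metadata enrichment, and comparative legal research.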

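Proposed Method

Collect decisions from the official court websites: the Constitutional Court (nalus.usoud.cz), the Supreme Administrative Court (nssoud.cz), and the Supreme Court (nsoud.cz).
Request data through freedom-of-information requests; complete datasets were obtained from the Constitutional Court (RTF format) and the Supreme Administrative Court (PDF format), while Supreme Court decisions were scraped from the web in batches.
Convert all documents to plain text to ensure accessibility and consistency across the corpus.
Retain basic metadata for each decision (e.g. case number, date, court) to support future analysis and indexing.
Upload the final corpus to the LINDAT/CLARIN repository and obtain a persistent identifier (hdl.handle.net/11372/LRT-3052).
Check completeness by comparison with commercial legal databases; estimated coverage is 99.5% for the Constitutional Court, 91% for the Supreme Court, and 99.9% for the Supreme Administrative Court.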
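
The coverage comparison in the last step can be illustrated with a minimal sketch. The summary does not say how the authors matched decisions against the commercial databases, so matching on normalized case identifiers is an assumption here, and the function names `normalize` and `coverage` and the sample identifiers are hypothetical.

```python
def normalize(case_id: str) -> str:
    """Collapse runs of whitespace and ignore letter case in a case identifier."""
    return " ".join(case_id.split()).casefold()


def coverage(corpus_ids, reference_ids) -> float:
    """Share of reference decisions that also appear in the corpus (0.0 to 1.0)."""
    reference = {normalize(i) for i in reference_ids}
    if not reference:
        raise ValueError("reference list is empty")
    corpus = {normalize(i) for i in corpus_ids}
    return len(reference & corpus) / len(reference)


if __name__ == "__main__":
    # Hypothetical identifiers, not taken from the corpus.
    reference = ["1 As 1/2010", "2 As 5/2011", "3 As 9/2012", "4 As 2/2013"]
    corpus = ["1 As 1/2010", "2 as  5/2011", "9 As 7/2015"]
    print(f"estimated coverage: {coverage(corpus, reference):.1%}")
```

Coverage is computed relative to the reference database, so decisions that appear only in the corpus do not raise the estimate.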

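Results

Research Questions

RQ1: Can a centralized, standardized, and freely available dataset significantly improve bulk access to Czech court decisions?
RQ2: In a non-English legal system, how does the availability of plain-text, unannotated court decisions affect research in legal natural language processing and computational linguistics?
RQ3: What limitations do the Czech apex courts have in data availability and format consistency, and how can these limitations be mitigated?
RQ4: Can a publicly available corpus of court decisions serve as a foundation for future annotation, metadata enrichment, and advanced natural language processing tasks in the Czech legal domain?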

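Key Findings

CzCDC contains 237,723 court decisions from the Czech Supreme Court, Supreme Administrative Court, and Constitutional Court, covering 1 January 1993 to 30 September 2018.
The corpus totals 460,524,867 words, of which the Constitutional Court accounts for 21.4% (73,086 decisions), the Supreme Court for 48.65% (111,977 decisions), and the Supreme Administrative Court for 29.93% (52,660 decisions).
Coverage estimates show that, compared with commercial databases, the corpus includes 99.5% of Constitutional Court decisions, 91% of Supreme Court decisions, and 99.9% of Supreme Administrative Court decisions.
The dataset is provided as plain text with minimal metadata, which keeps it broadly accessible and compatible with natural language processing tools and research workflows.
The corpus is hosted in the LINDAT/CLARIN repository under a persistent identifier (hdl.handle.net/11372/LRT-3052), ensuring long-term access and citability.
The authors note that the current corpus is only a first step, with potential for extended metadata, annotation, and integration into legal natural language processing workflows.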
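
The figures above can be cross-checked with a short sketch. It confirms that the per-court decision counts add up to the reported total and derives each court's share of decisions. Reading the reported percentages as shares of the word count, rather than of the decision count, is an assumption based on the sentence structure; the constant and function names are illustrative.

```python
# Figures as reported in the findings above.
DECISIONS = {
    "Constitutional Court": 73_086,
    "Supreme Court": 111_977,
    "Supreme Administrative Court": 52_660,
}
TOTAL_DECISIONS = 237_723
TOTAL_WORDS = 460_524_867
# Assumption: these percentages describe each court's share of the words.
WORD_SHARE_PCT = {
    "Constitutional Court": 21.4,
    "Supreme Court": 48.65,
    "Supreme Administrative Court": 29.93,
}


def decision_shares(counts):
    """Percentage of all decisions contributed by each court."""
    total = sum(counts.values())
    return {court: n / total * 100 for court, n in counts.items()}


def approx_words(total_words, share_pct):
    """Approximate per-court word counts implied by the percentage shares."""
    return {court: round(total_words * pct / 100) for court, pct in share_pct.items()}


if __name__ == "__main__":
    # The per-court counts should reproduce the reported corpus size.
    assert sum(DECISIONS.values()) == TOTAL_DECISIONS
    for court, pct in decision_shares(DECISIONS).items():
        print(f"{court}: {pct:.2f}% of decisions")
    for court, words in approx_words(TOTAL_WORDS, WORD_SHARE_PCT).items():
        print(f"{court}: about {words:,} words")
```

The decision shares differ noticeably from the reported percentages, which is consistent with the percentages describing words rather than decisions.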

This review was generated by AI and reviewed by a human editor.