QUICK REVIEW

[Paper Review] Errors and secret data in the Italian research assessment exercise. A comment to a reply

Alberto Baccini, Giuseppe De Nicolao | arXiv (Cornell University) | Jul 21, 2017
scientometrics and bibliometrics research · 9 references · 5 citations
TL;DR

This paper critically examines the Italian research assessment (VQR) experiment comparing peer review and bibliometric evaluation, revealing statistical errors, undisclosed data, biased sampling, and non-replicable results. It argues that the lack of data transparency undermines the credibility of numerous scholarly papers relying on ANVUR's data, calling for urgent disclosure to ensure scientific reproducibility.

ABSTRACT

Italy adopted a performance-based system for funding universities that is centered on the results of a national research assessment exercise, realized by a governmental agency (ANVUR). ANVUR evaluated papers by using 'a dual system of evaluation', that is by informed peer review or by bibliometrics. In view of validating that system, ANVUR performed an experiment for estimating the agreement between informed review and bibliometrics. Ancaiani et al. (2015) presents the main results of the experiment. Baccini and De Nicolao (2017) documented in a letter, among other critical issues, that the statistical analysis was not realized on a random sample of articles. A reply to the letter has been published by Research Evaluation (Benedetto et al. 2017). This note highlights that in the reply there are (1) errors in data, (2) problems with 'representativeness' of the sample, (3) unverifiable claims about weights used for calculating kappas, (4) undisclosed averaging procedures; (5) a statement about 'same protocol in all areas' contradicted by official reports. Last but not least: the data used by the authors continue to be undisclosed. A general warning concludes: many recently published papers use data originating from Italian research assessment exercise. These data are not accessible to the scientific community and consequently these papers are not reproducible. They can be hardly considered as containing sound evidence at least until authors or ANVUR disclose the data necessary for replication.

Motivation & Objective

  • To challenge the validity of the Italian VQR research assessment experiment that compared peer review and bibliometric evaluation.
  • To highlight critical methodological flaws in the statistical analysis of agreement between peer review and bibliometrics.
  • To expose data inconsistencies, undisclosed sampling procedures, and lack of transparency in ANVUR’s official reports and subsequent publications.
  • To warn the scholarly community that numerous papers relying on ANVUR’s unpublished data are not reproducible.
  • To advocate for the disclosure of raw data to ensure scientific rigor, replicability, and trust in research evaluation systems.

Proposed method

  • Analyzing discrepancies between reported data in the reply by Benedetto et al. (2017) and official ANVUR reports (ANVUR 2013).
  • Identifying inconsistencies in population sizes across tables (e.g., 99,005 vs. 86,998 articles) and typographical errors in reported figures (e.g., "4,7583" printed instead of "47,583").
  • Assessing the impact of non-random subsampling, where articles with uncertain bibliometric classifications were excluded.
  • Investigating the undisclosed averaging procedure used to calculate the peer review score (P), which influenced kappa statistics.
  • Comparing the protocol used in economics and statistics with other areas, revealing methodological differences that contradict claims of uniformity.
  • Evaluating the logical inconsistency of comparing F vs. P agreement with P1 vs. P2 agreement, given that P is derived from P1 and P2.
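The last two points can be illustrated with a minimal sketch of Cohen's weighted kappa. Everything specific here is an assumption made for illustration: the ratings are invented, the four ordered classes are hypothetical, and the rule that combines P1 and P2 into P is a stand-in, since the actual averaging procedure is reported as undisclosed. The sketch only shows that a score derived from P1 and P2 tends to agree with P1 more closely than P2 does, and that the choice of weights changes the resulting kappa.

```python
def weighted_kappa(r1, r2, k, power=1):
    """Cohen's weighted kappa for two equal-length rating lists with classes 0..k-1.

    power=1 uses linear disagreement weights, power=2 quadratic ones.
    """
    n = len(r1)

    def weight(i, j):
        # Disagreement weight: 0 on the diagonal, 1 for the most distant classes.
        return (abs(i - j) / (k - 1)) ** power

    # Observed weighted disagreement across the rated items.
    observed = sum(weight(a, b) for a, b in zip(r1, r2)) / n

    # Expected weighted disagreement if the two raters were independent,
    # computed from each rater's marginal class frequencies.
    m1 = [r1.count(c) / n for c in range(k)]
    m2 = [r2.count(c) / n for c in range(k)]
    expected = sum(weight(i, j) * m1[i] * m2[j] for i in range(k) for j in range(k))

    return 1 - observed / expected


# Hypothetical ratings for ten articles on four ordered classes (0 = best, 3 = worst).
p1 = [0, 0, 1, 1, 2, 2, 3, 3, 0, 3]
p2 = [0, 1, 1, 2, 2, 3, 3, 2, 1, 2]

# ASSUMPTION: P is the mean of P1 and P2, with halves rounded toward the worse class.
# The real procedure is not disclosed; this rule is only a placeholder.
p = [(a + b + 1) // 2 for a, b in zip(p1, p2)]

kappa_p1_p2 = weighted_kappa(p1, p2, k=4)
kappa_p1_p = weighted_kappa(p1, p, k=4)
kappa_p1_p2_quadratic = weighted_kappa(p1, p2, k=4, power=2)

print(f"P1 vs P2 (linear):    {kappa_p1_p2:.3f}")
print(f"P1 vs P  (linear):    {kappa_p1_p:.3f}")
print(f"P1 vs P2 (quadratic): {kappa_p1_p2_quadratic:.3f}")
```

On this toy data the kappa for P1 vs. P exceeds the kappa for P1 vs. P2, and switching from linear to quadratic weights changes the P1 vs. P2 value. This is why both the weights and the averaging rule would need to be disclosed before the published kappas could be checked.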

Experimental results

Research questions

  • RQ1: What are the statistical and data inconsistencies in the reply by Benedetto et al. (2017) to the original critique?
  • RQ2: How does the non-random exclusion of articles with uncertain bibliometric classifications affect the validity of the agreement statistics?
  • RQ3: Why is the averaging procedure for deriving the final peer review score (P) from two panelists' evaluations (P1, P2) not disclosed, and how might it bias results?
  • RQ4: To what extent do official reports contradict the claim that the same protocol was used across all research areas in the VQR assessment?
  • RQ5: How does the lack of data transparency undermine the replicability and scientific credibility of papers citing ANVUR’s VQR experiment?

Key findings

  • The reply by Benedetto et al. (2017) contains inconsistent data, including a population drop from 99,005 to 86,998 articles across tables, with percentages based on incorrect totals.
  • The subsample size is inconsistently reported as 7,598 in some tables and 7,597 in others, with no resolution of the discrepancy.
  • The data used in the analysis remain undisclosed, preventing replication and verification of results by the scientific community.
  • The averaging method for deriving the final peer review score (P) from two panelists’ evaluations is not disclosed, raising concerns about potential bias in kappa statistics.
  • The claim of a uniform protocol across all research areas is contradicted by official ANVUR reports, which show that economics and statistics used a different, more favorable method.
  • The comparison of F vs. P agreement with P1 vs. P2 agreement is logically flawed, as P is derived from P1 and P2, making P1 vs. P agreement inherently higher.
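The numerical discrepancies listed above can be checked with simple arithmetic. The sketch below uses only the figures quoted in this review; the table labels are hypothetical placeholders, and using the subsample count as the numerator is an illustrative choice, since the review does not say which percentages were computed on which total.

```python
# Figures quoted in the review; the dictionary keys are hypothetical labels.
reported_population = {"table_x": 99_005, "table_y": 86_998}
reported_subsample = {"some_tables": 7_598, "other_tables": 7_597}


def spread(values):
    """Difference between the largest and smallest reported value."""
    values = list(values)
    return max(values) - min(values)


def share(part, total):
    """Percentage of `total` represented by `part`."""
    return 100.0 * part / total


population_gap = spread(reported_population.values())
subsample_gap = spread(reported_subsample.values())

# Illustrative only: the same count yields a different percentage
# depending on which population total is used as the denominator.
count = reported_subsample["some_tables"]
share_large = share(count, reported_population["table_x"])
share_small = share(count, reported_population["table_y"])

print(f"Population gap between tables: {population_gap:,} articles")
print(f"Subsample gap between tables:  {subsample_gap} article(s)")
print(f"Share using larger total:  {share_large:.2f}%")
print(f"Share using smaller total: {share_small:.2f}%")
```

Without the underlying data, a check like this can flag that the reported totals disagree but cannot determine which figure is correct.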

This review was created by AI and reviewed by human editors.