QUICK REVIEW

[Paper Review] Consensus and Subjectivity of Skin Tone Annotation for ML Fairness

Candice Schumann, Gbolahan O. Olanubi|arXiv (Cornell University)|May 16, 2023

Infection Control and Ventilation | Cited by 14

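One-sentence summary

This paper examines how skin tone annotations on the Monk Skin Tone (MST) scale vary with annotator type and geography, introduces the MST-E dataset for training and evaluation, and offers best practices for diverse, reproducible annotation in fairness research.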

ABSTRACT

Understanding different human attributes and how they affect model behavior may become a standard need for all model creation and usage, from traditional computer vision tasks to the newest multimodal generative AI systems. In computer vision specifically, we have relied on datasets augmented with perceived attribute signals (e.g., gender presentation, skin tone, and age) and benchmarks enabled by these datasets. Typically labels for these tasks come from human annotators. However, annotating attribute signals, especially skin tone, is a difficult and subjective task. Perceived skin tone is affected by technical factors, like lighting conditions, and social factors that shape an annotator's lived experience. This paper examines the subjectivity of skin tone annotation through a series of annotation experiments using the Monk Skin Tone (MST) scale, a small pool of professional photographers, and a much larger pool of trained crowdsourced annotators. Along with this study we release the Monk Skin Tone Examples (MST-E) dataset, containing 1515 images and 31 videos spread across the full MST scale. MST-E is designed to help train human annotators to annotate MST effectively. Our study shows that annotators can reliably annotate skin tone in a way that aligns with an expert in the MST scale, even under challenging environmental conditions. We also find evidence that annotators from different geographic regions rely on different mental models of MST categories resulting in annotations that systematically vary across regions. Given this, we advise practitioners to use a diverse set of annotators and a higher replication count for each image when annotating skin tone for fairness research.

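Motivation and objectives

Evaluate how annotator type (expert vs. crowdsourced) affects MST skin tone annotations.
Examine how geographic region affects MST annotations and annotator behavior.
Provide a dataset and training resources to improve the consistency of MST annotations.
Offer practical recommendations for designing skin tone annotation tasks in fairness research.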

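Proposed method

Introduces the Monk Skin Tone (MST) scale and the MST-E dataset, which contains 1,515 images and 31 videos spanning the 10 points of the MST scale.
Runs two annotation experiments: a small study with professional photographers and a larger study with crowdsourced annotators across five regions.
Compares annotators' median annotations against a gold-standard oracle provided by Dr. Ellis Monk, the creator of the MST scale.
Measures inter-rater reliability with the intraclass correlation coefficient (ICC), and evaluates agreement with the oracle using two metrics: the share of annotations within 1 point of the oracle and the mean median distance.
Extends the annotation experiments to in-the-wild Open Images data to test how well annotator behavior generalizes.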
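
The two oracle-agreement metrics described above can be illustrated with a minimal sketch. It assumes that "mean median distance" means the average, over images, of the absolute difference between the median annotation and the oracle label. The paper may define it differently. The function names, image IDs, and toy labels below are all hypothetical and are not taken from the paper or from MST-E.

```python
from statistics import median


def consensus_labels(annotations):
    """Return the median MST label per image from replicated annotations."""
    return {image_id: median(labels) for image_id, labels in annotations.items()}


def oracle_agreement(annotations, oracle):
    """Compare median (consensus) labels with oracle labels.

    Returns a pair:
      - share of images whose consensus is within 1 MST point of the oracle
      - mean absolute distance between consensus and oracle
    """
    consensus = consensus_labels(annotations)
    distances = [abs(consensus[image_id] - oracle[image_id]) for image_id in oracle]
    within_one = sum(d <= 1 for d in distances) / len(distances)
    mean_distance = sum(distances) / len(distances)
    return within_one, mean_distance


# Hypothetical toy data: five replicated MST labels (1-10) per image.
annotations = {
    "img_a": [3, 4, 4, 5, 4],
    "img_b": [7, 8, 8, 9, 6],
    "img_c": [1, 2, 2, 2, 3],
    "img_d": [5, 5, 6, 9, 9],
}
# Hypothetical oracle labels for the same images.
oracle = {"img_a": 4, "img_b": 8, "img_c": 1, "img_d": 4}

within_one, mean_distance = oracle_agreement(annotations, oracle)
print(f"within 1 point: {within_one:.2f}, mean distance: {mean_distance:.2f}")
```

Taking the median rather than the mean as the consensus keeps a single outlying annotation from shifting the label, which is one reason a higher replication count per image helps.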

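Experimental results

Research questions

RQ1: Who can reliably annotate skin tone on the MST scale (experts vs. crowdsourced annotators), and under what conditions?
RQ2: Does an annotator's geographic region affect MST annotations, and how should that be managed?
RQ3: Can trained annotators reach a consensus close to the MST creator's intent across lighting conditions?
RQ4: What practical annotation design recommendations improve reliability and fairness analysis?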

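Key findings

Both expert and crowdsourced annotators produce reliable annotations that match the MST creator's intent under different lighting conditions.
Large regional differences were observed: for the same subjects, photographers in India tended to assign lighter MST values, while photographers in the US tended to assign darker ones.
In both the expert and crowdsourced studies, most consensus annotations were within 1 MST point of the oracle (88.9% for experts in India and 83.4% for experts in the US; the crowdsourced groups showed high ICC).
Crowdsourced annotators across the five regions showed strong inter-rater reliability (ICC 0.86–0.94 per subject; 0.90–0.96 on the Golden Images), and their difference from the oracle stayed below 1 point on average (0.78–0.84).
The MST-E dataset supports training and evaluating annotators and models for fairness across the full MST scale and a range of lighting conditions, and a regionally diverse annotator pool produces annotations in line with the MST scale creator's intent.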
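
To make the ICC figures above concrete, here is a minimal sketch of one common ICC variant: ICC(2,1) in the Shrout and Fleiss convention (two-way random effects, absolute agreement, single rater). This review does not state which variant the paper used, so the choice is an assumption, and the function name and toy ratings are hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1) for a complete table of ratings.

    ratings: list of rows, one row per image, one column per annotator.
    Every image must be rated by every annotator.
    """
    n = len(ratings)      # number of images (subjects)
    k = len(ratings[0])   # number of annotators (raters)
    grand = sum(sum(row) for row in ratings) / (n * k)

    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares from a two-way ANOVA without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical toy data: 5 images, each rated by the same 3 annotators (MST 1-10).
toy_ratings = [
    [2, 3, 2],
    [5, 5, 6],
    [8, 9, 8],
    [4, 4, 5],
    [10, 9, 10],
]
print(f"ICC(2,1) = {icc_2_1(toy_ratings):.2f}")
```

Values near 1 mean most of the variance comes from real differences between images rather than from disagreement between annotators. Because ICC(2,1) measures absolute agreement, a systematic offset between annotators, such as one group consistently labeling lighter, lowers the score.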


This review was generated by AI and checked by a human editor.