Machine Learning Data Practices through a Data Curation Lens: An Evaluation Framework
ACM Conference on Fairness, Accountability and Transparency (2024)
Abstract
Studies of dataset development in machine learning call for greater attention
to the data practices that make model development possible and shape its
outcomes. Many argue that the adoption of theory and practices from archives
and data curation fields can support greater fairness, accountability,
transparency, and more ethical machine learning. In response, this paper
examines data practices in machine learning dataset development through the
lens of data curation. We evaluate data practices in machine learning as data
curation practices. To do so, we develop a framework for evaluating machine
learning datasets using data curation concepts and principles through a rubric.
Through a mixed-methods analysis of evaluation results for 25 ML datasets, we
study the feasibility of adopting data curation principles for machine
learning data work in practice and explore how data curation is currently
performed. We find that researchers in machine learning, which often emphasizes
model development, struggle to apply standard data curation principles. Our
findings illustrate difficulties at the intersection of these fields, such as
evaluating dimensions that have shared terms in both fields but non-shared
meanings, a high degree of interpretative flexibility in adapting concepts
without prescriptive restrictions, obstacles in limiting the depth of data
curation expertise needed to apply the rubric, and challenges in scoping the
extent of documentation dataset creators are responsible for. We propose ways
to address these challenges and develop an overall framework for evaluation
that outlines how data curation concepts and methods can inform machine
learning data practices.