Title: An Approach for Validating the Quality of Datasets for Machine Learning
There are two basic ways to improve the accuracy of machine learning: building suitable machine learning models, and providing high-quality datasets for training them. Significant effort has gone into designing powerful machine learning models, and many open-source datasets have been created for machine learning research. However, the question of how the quality of a dataset affects the accuracy of a machine learning system has received little attention. In this paper, we present an experimental study showing how dataset quality affects the accuracy of machine learning models. We discovered a common problem in datasets that can greatly reduce the accuracy of machine learning. The same problem may exist in many other machine learning systems, especially those developed using crowd-sourced datasets, and it is difficult to detect with traditional validation approaches. We propose a novel technique based on metamorphic testing for validating a machine learning system together with its training and testing data. The key to metamorphic testing is to create tests that adequately exercise the system, and we propose an approach for creating such tests. The effectiveness of the proposed approach is demonstrated through a case study of automated classification of biological cell images.
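For readers unfamiliar with metamorphic testing, the general idea can be illustrated with a minimal sketch. It is not the technique presented in the talk: the nearest-centroid classifier, the two metamorphic relations (shuffling and duplicating the training set), and all function names are assumptions chosen only for illustration. A metamorphic relation states how the output should change, or stay the same, when the input is transformed in a known way, so a system can be checked without knowing the correct label for each input.

```python
import random


def train_centroids(samples):
    """Fit a toy nearest-centroid classifier: one mean vector per label."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Return the label whose centroid is closest to the given features."""
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(features, centroids[label]))
    # Iterate labels in sorted order so ties are broken deterministically.
    return min(sorted(centroids), key=sq_dist)


def check_relation(train, transformed_train, test_inputs):
    """Return the test inputs whose prediction differs between the two models."""
    base = train_centroids(train)
    follow_up = train_centroids(transformed_train)
    return [x for x in test_inputs if predict(base, x) != predict(follow_up, x)]


def shuffled(train, seed=0):
    """Hypothetical relation 1: the order of training samples should not matter."""
    copy = train[:]
    random.Random(seed).shuffle(copy)
    return copy


def duplicated(train):
    """Hypothetical relation 2: repeating every sample should not matter."""
    return train + train


# Tiny illustrative dataset (made up for this sketch).
train = [([0, 0], "a"), ([0, 2], "a"), ([10, 10], "b"), ([10, 12], "b")]
test_inputs = [[1, 1], [9, 11]]

violations = (check_relation(train, shuffled(train), test_inputs)
              + check_relation(train, duplicated(train), test_inputs))
print("violations:", violations)
```

An empty list of violations means both relations hold on this data. In a real study the relations would be chosen to reflect properties of the specific model and dataset under test, and a violation would point to a defect in the system or its data.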
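Junhua Ding (丁俊华): Professor in the Department of Information Science at the University of North Texas. He graduated from China University of Geosciences in 1994, received a master's degree in computer science from Nanjing University in 1997, and received a master's degree and a Ph.D. in computer science from Florida International University in 2000 and 2004, respectively. He previously worked at Beckman Coulter and Johnson & Johnson. Since 2007, he has carried out research and teaching in the Department of Computer Science at East Carolina University and at the University of North Texas. He has published more than 70 papers, written 2 books, and led 6 projects funded by the U.S. National Science Foundation, and he has extensive experience in data and information quality analysis, software security, computational law, and information retrieval. He currently serves as a reviewer for journals including IEEE Transactions on Software Engineering and IEEE TSMC-A, as an editor of Information and Software Technology, and as a reviewer for the U.S. National Science Foundation.
All faculty and students across the university who are interested in machine learning are welcome to attend.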