Data are being generated, collected and analyzed today at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. As the use of big data has grown, so too have concerns that poor quality data, prevalent in large data sets, can have serious adverse consequences on data-driven decision making. Responsible data science thus requires a recognition of the importance of veracity, the fourth “V” of big data. In this talk, we present a vision of high-quality big data and highlight the substantial challenges that the first three “V”s, volume, velocity and variety, bring to dealing with veracity in big data. Due to the volume and velocity of data, one needs to understand and possibly repair poor quality data in a scalable and timely manner. With the variety of data, often from a diversity of sources, data quality rules cannot be specified a priori; one needs to let the “data to speak for itself.” We conclude with some recent results relevant to big data quality that are cause for optimism.
Divesh Srivastava is the Head of Database Research at AT&T Labs-Research. He is a Fellow of the Association for Computing Machinery (ACM), the Vice President of the VLDB Endowment, on the Board of Directors of the Computing Research Association (CRA), on the ACM Publications Board and an associate editor of the ACM Transactions on Data Science (TDS). He has served as the managing editor of the Proceedings of the VLDB Endowment (PVLDB), as associate editor of the ACM Transactions on Database Systems (TODS), and as associate Editor-in-Chief of the IEEE Transactions on Knowledge and Data Engineering (TKDE). He has presented keynote talks at several international conferences, and his research interests and publications span a variety of topics in data management. He received his Ph.D. from the University of Wisconsin, Madison, USA, and his Bachelor of Technology from the Indian Institute of Technology, Bombay, India.