Responsible and transparent data science
Data science is used in almost every academic discipline. This presents plenty of new opportunities, but new dilemmas too. Reason enough for the symposium on 5 March: ‘Fairness and Transparency, towards responsible data science.’ Keynote speaker Ricardo Baeza-Yates gives a sneak preview.
‘Fairness and transparency are important themes in data science because they are important themes to society. They are basic to human rights as well as to ethics in society. Transparency implies knowledge about why an automatic system took a decision. Then the person affected by that decision can analyze the reasons behind it and decide to contest it or not. This is laid down in Article 22 of the GDPR [privacy regulation, ed.]. So transparency in data science means being able to explain why the analysis produced a certain result and why that analysis was chosen.
‘What makes data science “fair” is a very hard question and there is no consensus. In fact, just defining what is fair is complicated because the answer can change from culture to culture. On the other hand, we can have a functional definition by saying that if the consequences are unintended and hurt people, then the data science used was probably unfair.
‘I will talk about bias, and about realizing that every human is biased, consciously or unconsciously. This is important in data science, or science in general, for many reasons. First, because if you do not analyze for bias in your data, your result may have unintended consequences. Second, many biases may be unknown, so awareness is the first step toward solving any problem of harmful bias. Third, biases can be subtle and being aware of them is not trivial.
‘I guard against bias in my own work by analyzing the data I use for bias, as well as verifying that my system does not add any bias of its own. This is no different for other scientists, except that when you do a controlled experiment, the initial assumptions are usually designed to avoid any bias in the resulting experimental data. However, in many cases we cannot control the environment generating the data.
‘Not all scientists need to learn about data science, but those who use data or design experiments do. For example, in many user studies, if you do not do the correct statistical checks, the results may be completely wrong. Another example is how in many sciences it is usual to use ANOVA for the analysis of human experiments, but this is only valid for “homogeneous” data (groups with similar variance) that follows a “normal distribution.” Many people do not do the homogeneity and normality tests.’
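For readers who want to see what the homogeneity and normality tests mentioned above can look like in practice, here is a minimal sketch in Python using SciPy. It is an illustration under stated assumptions, not a method described by Baeza-Yates: the function names, the 0.05 threshold, the sample numbers and the choice to fall back to the Kruskal-Wallis test are all hypothetical. It also checks normality per group, a common simplification; a stricter analysis would inspect the residuals of the fitted model.

```python
from scipy import stats


def check_anova_assumptions(groups, alpha=0.05):
    """Run normality and homogeneity checks on a list of samples."""
    # Shapiro-Wilk: the null hypothesis is that a sample is normally distributed.
    normality_p = [stats.shapiro(g)[1] for g in groups]
    # Levene: the null hypothesis is that all groups have equal variances.
    levene_p = stats.levene(*groups)[1]
    # Assumption (illustrative): treat p > alpha as "no evidence against".
    ok = bool(all(p > alpha for p in normality_p) and levene_p > alpha)
    return {"normality_p": normality_p, "levene_p": levene_p, "assumptions_ok": ok}


def compare_groups(groups, alpha=0.05):
    """Use one-way ANOVA if the checks pass, else the rank-based Kruskal-Wallis test."""
    report = check_anova_assumptions(groups, alpha)
    if report["assumptions_ok"]:
        name, result = "one-way ANOVA", stats.f_oneway(*groups)
    else:
        name, result = "Kruskal-Wallis", stats.kruskal(*groups)
    return name, result[1], report


if __name__ == "__main__":
    # Hypothetical task-completion times (seconds) for three interface designs.
    design_a = [12.1, 13.4, 11.8, 12.9, 13.0, 12.5, 11.9, 13.2]
    design_b = [14.0, 13.1, 14.6, 13.8, 14.9, 13.5, 14.2, 13.9]
    design_c = [12.0, 12.3, 11.7, 12.8, 12.1, 45.0, 12.4, 11.9]  # one extreme value
    name, p_value, report = compare_groups([design_a, design_b, design_c])
    print(name, p_value, report)
```

Choosing the test based on the outcome of these checks is itself a simplification, so the sketch is best read as a reminder to look at the assumptions before trusting an ANOVA result, not as a complete analysis procedure.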
Ricardo Baeza-Yates is a world-famous computer scientist, whose areas of expertise include web search, data mining and data science. He is a professor at Northeastern University, Silicon Valley Campus.
Want to find out more about data science and how to use it responsibly in your research? Come to the ‘Fairness and Transparency, towards responsible data science’ symposium on 5 March, from 13.00 to 17.00 in PLNT, Leiden. Free entry. Please register in advance.