Dino Pedreschi, Professor of Computer Science at the University of Pisa and co-lead of the KDD Lab (Knowledge Discovery and Data Mining Laboratory), talks about data commons, network effects, the GDPR, and our right to explanation.
A NEW DEAL FOR DATA
What we need today is really a new deal on data, a new deal on personal information. We need new, gentle digital technologies that help each of us individually collect, organize, make sense of, and use the personal information we generate every day. Even individual data are big data. Over months, the quantity of digital breadcrumbs we generate across all the different digital services we use is amazingly large; even without counting photos and videos, we create an enormous record of our activities. What do we do with this information? Like the replicant in Blade Runner, I would say that most of it is lost in the rain: it is simply forgotten.
Only a small part of this information ends up on the big players’ servers. Most of our personal information is used by nobody. We could use it instead, provided we have ways to collect it, aggregate it, make it meaningful to us, and learn how we can use it to improve our lifestyle: to improve the way we use public transportation, to coordinate with our fellow citizens in accessing crowded places, to organize our daily life in a city, or to find out what we really need and get in touch with the services that, in any aspect of our life, matter to our experience. The complexity of our communities, of our society, of our economy, always stems from interaction with others.
Our decisions influence the decisions of others. Consider traffic. Whenever we decide to take our car from A to B, we are also affecting the decisions of all the other citizens who are traveling at the same time, each going from somewhere to somewhere else. If we were able to organize our joint mobility in a smart way, we could probably travel with better timing, without delays, using much less energy, producing less pollution, and enjoying a safer and more livable environment. All of this can only be achieved if we are intelligent subjects, empowered by information and by the ability to exploit it, in interaction and collaboration with others. This is the new deal that we need to create.
ALGORITHMS AND NEWS
The model around which social media and social networking platforms are organized today follows, of course, the main mission of their business, which is (as we already discussed) online advertising and targeted marketing. This means that all these platforms have an interest in maximizing audience, maximizing likes, maximizing visits to the websites, the information, or the products they recommend to users.
The most obvious and most effective means to increase the number of likes or visits is to show people websites, products, or services that they like. If you are exposed to something that feels close to your interests, you are more likely to click the link, like the post, or buy the product. It is a mechanism that obviously works, because we humans like what we like: we are attracted by like-minded people, by like-minded opinions, and, in general, by things that are close to our interests and desires. In itself this does not seem evil; it seems an obvious trick that advertising uses to draw our attention.
But what is the network effect of this mechanism when it is deployed, for instance, in the distribution and diffusion of information? Many scientific studies today show that the so-called "algorithmic bias" of the platforms (the fact that the platform exposes a person to like-minded peers, or to articles about things they are already interested in) creates echo chambers: communities made of people who are very much like-minded, who strongly resist being contaminated by different ideas, and who develop more and more radical opinions on any controversial, or even non-controversial, topic.
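To make this mechanism concrete, here is a minimal toy simulation (a sketch of my own, not a model from the interview: the population size, the bias threshold, and the update rule are all hypothetical choices). Each user holds an opinion in [0, 1], the platform only recommends content from like-minded users, and each exposure pulls the reader toward what they saw:

```python
import random

# Toy echo-chamber simulation: a similarity-biased recommender only shows
# a user content from people whose opinion is within SIMILARITY_BIAS of
# their own, and each exposure pulls the reader toward that content.
random.seed(42)
N, STEPS, SIMILARITY_BIAS = 200, 20000, 0.1

opinions = [random.random() for _ in range(N)]

for _ in range(STEPS):
    reader = random.randrange(N)
    # Recommender step: filter to like-minded content only.
    candidates = [o for o in opinions
                  if abs(o - opinions[reader]) < SIMILARITY_BIAS]
    shown = random.choice(candidates)  # never empty: the reader's own opinion qualifies
    # Exposure step: the reader drifts halfway toward the content shown.
    opinions[reader] += 0.5 * (shown - opinions[reader])

# Opinions end up in a few tight, mutually distant clusters rather than
# spread over [0, 1].
print(sorted(set(round(o, 2) for o in opinions)))
```

Raising SIMILARITY_BIAS, i.e. deliberately showing people more distant content, keeps the opinions spread out instead of clustered, which is exactly the diversity point made next.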
This is, of course, not healthy for democracy, and not healthy for the diversity of opinion in a society. Diversity of opinion is another important phenomenon that big data allow us to understand: the more diverse a crowd of people is, the more intelligent that crowd is in making collective decisions. If a few radical ideas somehow hold monopolies over segregated parts of society, this kind of influence is very bad for the overall intelligence of the crowd, of the population, which can actually be very stupid and form quite unwise opinions on many controversial issues.
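There is a standard way to make this claim precise, which the interview does not spell out: the diversity prediction theorem (due to Scott Page) states that the squared error of the crowd's average estimate equals the average individual squared error minus the diversity (variance) of the estimates. A quick numeric check with made-up numbers:

```python
# Numeric check of the diversity prediction theorem:
#   crowd_error = average_individual_error - diversity
# where diversity measures how much individual estimates differ
# from the crowd's mean estimate.
estimates = [12.0, 30.0, 41.0, 55.0, 72.0]   # hypothetical individual guesses
truth = 40.0

crowd = sum(estimates) / len(estimates)
crowd_error = (crowd - truth) ** 2
avg_individual_error = sum((s - truth) ** 2 for s in estimates) / len(estimates)
diversity = sum((s - crowd) ** 2 for s in estimates) / len(estimates)

# The two printed numbers match (up to float rounding).
print(crowd_error, avg_individual_error - diversity)
```

In other words, a crowd only beats its average member by being diverse; wiping out diversity wipes out exactly that advantage.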
The only way we have to preserve diversity, or actually to boost diversity of opinion, is to go in the opposite direction from the one social media platforms follow today. By proposing like-minded peers and ideas to people, they are actually increasing polarization and radicalization, and decreasing the intelligence and diversity of the crowd. If we want a different ecosystem for public conversation, we need to engineer and design different platforms that are not driven by the marketing goal of maximizing audience and likes.
RIGHT TO EXPLANATION
The GDPR, the General Data Protection Regulation, has a very futuristic and forward-looking clause about the right to explanation. What does it mean? It refers to automated profiling and decision making. Essentially, the right to explanation means, first of all, that fully automated decision making is prohibited for legal or similarly significant decisions concerning people, like obtaining a mortgage or a job. So it is necessary to have human decision makers at the interface between the subject of the decision and the automated system that suggests the decision.
The point is that, in the end, the subject has the right to an explanation of the decision anyway. Why am I being refused the mortgage? Why was I not selected for the job? Why is this connected to AI and to big data? Well, the decision-making systems we are developing are based on artificial intelligence and big data. They learn how to make accurate, intelligent decisions based on experience, and that experience is represented in big data. Most of the new AI models that help humans make decisions are actually black boxes, so they are very hard to understand. They do not really explain why: they are very good at learning the right decision from examples, but not at all good at explaining the reasons for the decision in a logic that is meaningful and comprehensible to humans.
This is precisely what the right to explanation calls for: meaningful explanations of decisions. With this requirement we are calling for a human-centric, explainable AI, able not only to suggest decisions but also to explain the logic behind them. And why is this important? Not only for transparency, which is of course extremely important, but also to be sure that this artificial intelligence has not learned from our biases, our prejudices, our discriminatory behavior.
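To give a feel for how such explanations can be produced, here is a simplified sketch of one common technique, a local surrogate, in the spirit of methods such as LIME (the data, the feature names, and the mortgage framing are all hypothetical): sample cases around the person being decided on, label them with the black box, and fit a small, readable tree whose rules serve as the explanation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical mortgage data: columns are income, debt, years_employed.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(1000, 3))
y = ((X[:, 0] - X[:, 1] + 0.1 * X[:, 2]) > 0.2).astype(int)  # synthetic approval rule

# The "black box": accurate, but its internals are opaque to the applicant.
black_box = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Explain one rejected applicant: sample a neighborhood around the instance,
# label it with the black box, and fit a small interpretable surrogate tree.
applicant = np.array([[0.3, 0.4, 0.5]])
neighborhood = applicant + rng.normal(0, 0.15, size=(500, 3))
surrogate = DecisionTreeClassifier(max_depth=2).fit(
    neighborhood, black_box.predict(neighborhood))

# The printed rules are a human-readable, local account of the decision.
print(export_text(surrogate, feature_names=["income", "debt", "years_employed"]))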
JOBS AND AUTOMATION
A very important chapter is that of big data, AI, and the future of work, the future of our jobs. What can we expect for the future of employment? There is a really wide array of opinions on this issue, ranging from extremely pessimistic to extremely optimistic views, held by experts in different fields and even within the same field. What can certainly be said is that AI and big data can be an exploitative technology. What do I mean by that?
Let’s take one example: consider thousands of doctors around the world, every day making diagnoses for certain diseases by looking at the characteristics of their patients and at their patients’ MRI scans. Imagine collecting all of this into millions of diagnoses from thousands of doctors, and training an artificial neural network, a deep learning model, to learn how to do this job.
It puts together the experience of thousands of different people, so it is not surprising that the end product can be an artifact that solves the diagnosis problem with very high accuracy, probably better than any individual doctor. Now put yourself in the shoes of the doctors. What will this new artifact do? In a sense, it has exploited the knowledge of the doctors, and to some extent it could replace them in at least part of their job. Maybe not the complete job, but part of it. This is an example of how this technology can seriously affect even jobs that are much higher level than we suspected up to only a few years ago.
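A minimal sketch of this pooling idea, with entirely synthetic data (the doctors, case counts, and features here are hypothetical stand-ins; a real diagnostic system would learn from images, not random vectors):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Each "doctor" contributes a handful of labeled cases; pooling them trains
# one model on far more experience than any single doctor ever accumulates.
rng = np.random.default_rng(1)
n_doctors, cases_per_doctor, n_features = 1000, 50, 20

X = rng.normal(size=(n_doctors * cases_per_doctor, n_features))
true_weights = rng.normal(size=n_features)
y = (X @ true_weights > 0).astype(int)   # stand-in for the "true" diagnosis

# Train a small neural network on the pooled experience and test it
# on cases it has never seen.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)
model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=1)
model.fit(X_tr, y_tr)
print("accuracy on unseen cases:", accuracy_score(y_te, model.predict(X_te)))
```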
The discussion in the nineties was that digital technology would disrupt the lower levels of our job spectrum, but with the data science and AI revolution we are actually becoming able to mechanize and automate a substantial part of higher-level jobs: not entirely, but in large part. Think about language translators, or about anything that can somehow be learned from the experience of many people doing what was once perceived as highly skilled work. Certainly there will be a transformative effect on many jobs, on most jobs. Most jobs will probably not be swept away, but they will be deeply transformed. In the medical example, for instance, such a tool will probably not replace the diagnostic activity of doctors.
For instance, if equipped with suitable explanation technology, it could actually help doctors improve their diagnoses, providing better evidence from which the AI will in turn develop even better automated tools. The virtuous loop between machines and humans should run both ways, so that each side of the coin can exploit its own specific abilities in tandem with the other. Humans have the ability to reason and make connections in ways that machines cannot, and machines have an incredible ability to learn from examples, generalizing at a scale that is not feasible for humans, given the sheer size of the data and of the problem.
BECOMING A DATA SCIENTIST
I am a wannabe data scientist. What should I do? How can I get this job, given that, according to many observers, it is the fastest-growing job family, in a time of shrinking employment and economic difficulty? Well, becoming a data scientist is certainly a challenging story, because you need to take on three skill sets that are traditionally thought of as separate. The first is competence in digital information technologies: dealing with data, being able to collect them, integrate them, process them, and make them available to many different algorithms.
The second is statistics, data mining, and machine learning: how to turn messy, not immediately meaningful data into sense, into something that is meaningful to people and can be used to derive information, knowledge, and the ability to make decisions. The third part of the skill set is the humanistic part of the story: how to tell stories, how to use language, images, and visual imagination to convey the information and knowledge extracted from data to the stakeholders, into the tools that will make that knowledge actionable and usable in real life. And there is actually a fourth aspect, which is ethics, responsible data science. As a data scientist you will mostly be dealing with people’s personal data. Managing these data within privacy-preserving frameworks, and also value-preserving frameworks, in which you ask yourself what use will be made of this information, is also a key competence.
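As a toy illustration of how the first three skill sets meet in a single task (the file trips.csv and its column names are hypothetical):

```python
import pandas as pd
import matplotlib.pyplot as plt

# (1) IT skill set: collect the raw data and clean it up.
df = pd.read_csv("trips.csv")                      # hypothetical messy file
df = df.dropna(subset=["duration_min"])            # drop incomplete records
df["hour"] = pd.to_datetime(df["start_time"]).dt.hour

# (2) Statistics / data mining: extract a simple pattern from the mess.
pattern = df.groupby("hour")["duration_min"].mean()

# (3) Storytelling: turn the pattern into a picture a stakeholder can act on.
pattern.plot(kind="bar", title="Average trip duration by hour")
plt.xlabel("hour of day")
plt.ylabel("minutes")
plt.savefig("trip_durations.png")
```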
Essentially, if you want to become a data scientist, you can start from many different undergraduate studies, but you need to pursue graduate studies that mix all these different aspects together. You can, of course, become a data scientist with a specific expertise: more fluent in IT, more fluent in statistics, more fluent in visualization perhaps. But in any case you will need a basic fluency in, and a command of, all these different aspects together. There are more and more programs, some at the undergraduate level but mostly at the graduate, Master’s, and PhD level, that address this need. My personal advice is to prefer those that are not only methodological, that do not only teach the basic tools of computer science, statistics, data mining, AI, and visualization, but that also expose you to interdisciplinary applications.