Navigating the Broader Impacts of Machine Learning Research

Hanna Wallach
Feb 14, 2021

This essay is a (near) transcript of a talk I recently gave at the NeurIPS 2020 workshop on “Navigating the Broader Impacts of AI Research,” organized by Carolyn Ashurst, Solon Barocas, Rosie Campbell, Deb Raji, and Stuart Russell.

I want to start by talking about the Gordian knot. For those of you that don’t know, the Gordian knot is a legend associated with Alexander the Great. The short version is that the Phrygians, an ancient civilization, didn’t have a king, so they turned to Sabazios, one of their gods, to ask for advice. Sabazios told them that the next person to enter the city driving an ox-cart should become their king. A farmer named Gordias drove into town on an ox-cart and was immediately crowned. Out of gratitude, Gordias’s son, Midas, dedicated the ox-cart to Sabazios and tied the cart to a post with a knot made out of cornel bark. This knot — the Gordian knot — consisted of multiple knots, all so tightly entangled that it was impossible to see how they were fastened, let alone untie them. The knot remained there for a very long time, and it was said that anyone who did succeed in untying it would go on to rule all of Asia.

Why am I telling you this? Well, at least as I see it, the situation we have around broader impacts and societal implications of machine learning research is a lot like the Gordian knot. I’ll come back to the ultimate fate of the original Gordian knot at the end of my talk. But for now, I want to describe some of our machine learning Gordian knot. I want to be clear: I don’t have solutions — I don’t know how to untie this knot. And, although others have loosened it a little, it remains as tight as ever. But I hope that by describing some of it, maybe I can motivate people to focus on it, perhaps moving us closer to untying it.

I also want to be clear that when I say “broader impacts and societal implications,” I am talking about downstream (often negative) consequences of research, not about the ethics of research conduct (IRBs, human subjects, and so on). Although these things are often conflated, they are not the same. Research that is conducted unethically can have few downstream consequences, while research that is conducted ethically can end up causing considerable harm. The conflation of these things is its own Gordian knot, and outside the scope of my talk, but it will likely come up in the panel discussions throughout the day.

Twenty Years of ML

For those of you who don’t know me, I’m a machine learning researcher. I’ve been doing machine learning for about twenty years. In fact, I first attended NeurIPS in 2002 and I’ve never missed a year, so this is my 19th NeurIPS. I’ve been a NeurIPS program co-chair, senior program chair, and general chair; and I serve on the NeurIPS board, as well as the ICML board — so I’ve seen all sides of the review process up close. And although I started out developing machine learning methods for text, nowadays I mostly focus on issues of fairness, accountability, transparency, and ethics as they relate to machine learning.

Twenty years ago, machine learning was very different — it was an academic discipline that wasn’t often used in the real world. When I told my friends and relatives that I was studying machine learning, they looked at me blankly. Now, in contrast, machine learning is everywhere, meaning that machine learning researchers are in a position of incredible power. And this, in many ways, is the root of our problems.

Back when machine learning existed in an academic bubble, talking about broader impacts and societal implications of our research wasn’t so important. The vast majority of contributions were unlikely to make their way into the real world, and when we wrote papers, we were talking to one another — to other researchers who were intimately familiar with the technology, its strengths and characteristics, and its weaknesses and limitations, not to mention the terminology that we used to describe it.

But machine learning has grown up — it’s in our devices, our homes, and our cars, and it’s mentioned on billboards and TV shows. And this means that our academic bubble is gone. Tech companies have invested vast amounts of money, accelerating the “success” of machine learning and turning it into something that affects people’s daily lives in myriad ways. Take speech recognition, smart compose, facial identification on our phones, or Tesla’s hands-free driving. These kinds of inventions are achieved by taking research, turning it into products, and then selling those products — often in ways that can make machine learning seem like a magic solution to all of society’s problems, no matter how hard.

But as this landscape changed, we didn’t change our research practices. Our papers are still extremely short; our notion of related work is still typically limited to other machine learning papers; our review processes still focus on theoretical soundness; and we still talk about technical considerations as being of primary importance — often as if machine learning methods are neutral and free from values.

Taken together, this has left us with our Gordian knot, at the center of which lies our research-to-practice pipeline. What I’m about to give is a stylized description of this pipeline — a caricature that omits many important details and messy realities. I will touch on some of these details briefly after my description, but they don’t make the job of untying our Gordian knot easier — only harder.

The Research-to-Practice Pipeline

Our research-to-practice pipeline involves several groups of actors, starting with the researchers themselves and ending with the customers and the general public. As machine learning researchers, many of us were not trained to think about broader impacts and societal implications. Although this has changed in recent years, with many universities offering courses on topics like fairness in ML, ethics of AI, and even one on “calling bullshit,” the vast majority of machine learning researchers have not taken courses on these topics. And, even if they had, our research practices do not incentivize us to prioritize broader impacts and societal implications. For the most part, we therefore write papers that focus on technical considerations and do not critically and thoroughly explore the downstream consequences of our work. We assume (in some cases, hope) that if our contributions are ever turned into products, someone else will do this — someone closer to the other end of the research-to-practice pipeline.

But the thing is, the next group of actors — applied scientists, research engineers, and sometimes others too — also received no training in these topics and, even if they had, they’re also not incentivized to prioritize them. “The researchers didn’t mention anything that we should be aware of, and they are super smart. And, anyway, all we’re doing is implementing something to see if it works in practice.”

But then, the next group does the same thing (this time most likely without reading the original paper). “We should turn this into a product given that people put all this effort into implementing it!”

Unsurprisingly, then, it’s often the case that marketing, comms, and sales teams have no idea that there might be broader impacts and societal implications that they should be aware of. And, anyway, their job is to build a market for the product, so they too are not incentivized to talk about these topics.

So, by the time our contributions get to the end of the research-to-practice pipeline — the customers and the general public — it’s often the case that we’ve kind of lost the plot. At the start, the researchers assume it’s someone else’s job further down the pipeline; at the end, the customers and the general public assume that there’s no way that a product would reach them without someone first considering the downstream consequences. And, at every point in between, we have a group of actors who assume that the group(s) before them did their due diligence and that the group(s) after them will do theirs.

Of course, as I said before, I’m simplifying things here, painting a homogeneous picture of each group of actors and omitting many details — most notably the fact that there are individual actors in every group who do not behave in this way and are trying to effect change. Indeed, many of the people that I work with are those actors. But, currently, they are the exception, not the rule. Moreover, these are industry-wide structural issues — exacerbated by little accountability and little incentive for transparency — and structural issues are incredibly hard to change via individual actions. On top of this, these issues are intimately linked to the fast-paced, competitive nature of the tech industry and its overwhelming emphasis on moving fast and breaking things and on proving your stuff is better than everyone else’s.

An Example

As an example, take facial recognition. Again, what I’m about to describe is a caricature that omits many important details and obscures the work of individual actors, so take all of this with a grain of salt. From a research perspective, facial recognition is a pretty cool problem to work on — a truly “human” challenge that depends on such awesome math! Things like the contents of databases or the conditions in which photos are obtained aren’t the primary focus — these are deployment issues, and research is about what is theoretically possible. Similarly, turning that awesome paper into a functioning, reliable codebase is a fun engineering challenge, regardless of whether anyone uses it, especially at scale! And then when the system’s been built, what about wrapping it in a REST API for anyone to use? It would be pretty cool to have an on-demand facial recognition system just sitting there for anyone to use with their own database of photos! People can make their own photo organization apps and stuff! Plus, getting the system to operate with sufficiently low latency would be another awesome engineering challenge! Once there’s an API, we may as well see if anyone wants to use it! They probably won’t, but just in case! And anyway, it’s not like we’re deploying facial recognition systems ourselves — it’d be the customer, using their own data! And so on. But the next thing you know — BOOM! We’ve got police-operated facial recognition systems misidentifying Black people, leading to wrongful arrests.

Again, I want to emphasize that this is a caricature, but adding in other details just makes things even more complicated. For example, the fact that the entire research-to-practice pipeline is situated within a capitalist society, where companies are driven by profit first and foremost, means that there are strong incentives to pursue work that will generate revenue. Even researchers are not immune — we need funding to support our research, which means that our research agendas are driven by money whether we want to acknowledge this or not. It’s genuinely easier, even within academia, to pursue research that will likely lead to commercial opportunities. And this is true of computer science as a whole, not just machine learning, though it’s particularly salient for machine learning — a discipline that is arguably more closely linked to economic disruption than any other discipline in recent history.

So, how can we change our research-to-practice pipeline to introduce greater transparency and accountability around broader impacts and societal implications? Well, it’s not obvious.

Professionalization?

One option would be to look to other disciplines. Bioengineering, for example, has decided as a discipline that there are some ideas — creating human-animal hybrids, for instance — that are so abhorrent that they should not be pursued even at a research level. And, if you do, you will be shunned by your peers. But these kinds of discipline-wide norms don’t happen overnight or of their own accord — we need mechanisms to get there. One possible mechanism is licensing — both medicine and law require that practicing professionals be licensed, for example. But it’s not clear to me that this is what we need in machine learning, and indeed licensing typically focuses on practice, not research.

Another option would be to rely on one or more professional societies that span the entire research-to-practice pipeline and advance the interests of the field while also establishing norms for their members via codes of ethics, and so on. But I doubt that the machine learning community is amenable to professionalization of this sort — especially machine learning researchers. As a discipline, machine learning places great emphasis on being “home-grown.” Unlike other areas of computer science, most researchers and practitioners do not belong to ACM or IEEE, our conferences are independent events, and even our main journal (JMLR) is an open-access breakaway from traditional academic publishing.

Research Practices?

Of course, there are alternatives to these “top-down” or “outside-in” mechanisms for establishing norms. One of these is to collectively work together from within to change our research practices to reflect the fact that machine learning no longer exists in an academic bubble and that our contributions will make their way into the real world, no matter how theoretical or abstract they seem initially. Then, hopefully, these changes will propagate through the rest of the research-to-practice pipeline.

So, how do we change our research practices? The most obvious way is to change our review processes. In turn, this requires changes on the part of three groups of actors: 1) venues, by which I mean the decision-makers who are responsible for our conferences’ and journals’ policies; 2) reviewers, area chairs, and editors, who make recommendations or decisions that reflect those policies; and 3) authors.

Venues

I’ll now briefly describe these changes, starting with venues. First, venues need to change their policies to factor broader impacts and societal implications into their review processes. NeurIPS took a first step toward this for this year’s conference by requiring every submission to include a broader impact statement (though papers were not rejected for failing to do so, and, indeed, about 9% of submissions did not contain one). Submissions with strong “technical” reviews, but potentially concerning broader impacts and societal implications (regardless of whether these were stated in their broader impact statements) were flagged by reviewers for a further ethical review process. In the end, thirteen submissions went through this process, of which four ended up being rejected. This entire endeavor was a bold move, and I’m grateful to NeurIPS for doing it. But at the same time, it revealed some hurdles. Among others, which I’ll get to in a moment, NeurIPS gave relatively little guidance to authors or to reviewers/ACs, meaning that both groups were often unclear on what might count as broader impacts and societal implications, or what framework should be used to structure discussions around them.

At the other end of the spectrum, other venues have said that they WON’T change their review processes to factor in broader impacts and societal implications. Some have said that this is because these topics are non-quantitative and thus subjective, and subjective opinions do not belong in a scientific review process, while others have said that broader impacts and societal implications depend on values, values vary by geography, and it’s not clear whose values to prioritize. Somewhere in the middle, venues have told ACs/editors that they can reject submissions on the basis of concerning broader impacts and societal implications if they wish, but that no guidance will be provided and that such decisions will be treated as personal judgements that do not reflect venue-wide policies.

I don’t know the best way for venues to factor broader impacts and societal implications into their review processes, but it does seem important for venues to at least a) provide clear guidance to everyone involved in their review processes and b) support ACs/editors in rejecting submissions on the basis of concerning broader impacts and societal implications. I get that the latter is controversial and difficult to implement, especially because these topics have not traditionally played a role in our review processes. But, in that case, venues need to allow people to opt out of handling submissions that they are uncomfortable with — and to do so easily and without penalty. Of course, given the increased number of submissions to machine learning venues, which means that there aren’t enough people to handle them as it is, this seems challenging without substantially overhauling everything.

Reviewers, ACs, and Editors

As for changes on the part of reviewers, area chairs, and editors, things are complicated here too. For brevity, as I describe these changes, I’ll collectively refer to this group of actors as program committee members, even though this isn’t the right terminology for journals. NeurIPS rightly realized that most program committee members don’t have the substantive knowledge needed to critically and thoroughly evaluate broader impacts and societal implications. As a result, submissions with strong “technical” reviews, but potentially concerning broader impacts and societal implications, underwent a separate ethical review process. Many of the reviewers responsible for this process have training in the humanities and the social sciences and therefore have a deep knowledge of history and culture.

But I worry, especially if we double down on this approach and scale it up so that all submissions go through the ethical review process, that we are reinforcing a dangerous division of labor, where we relieve machine learning researchers of the responsibility of doing this work themselves. Or, to put it differently, I worry that we’re taking deep structural issues that we don’t know how to tackle and simply laying them at the feet of people (individuals!) in other, less resourced, disciplines to fix for us. Plus, this approach is adversarial. Other disciplines become the “police,” while machine learning researchers try to evade them. This seems unlikely to foster mutual respect or to take steps toward a meeting of minds.

Authors

Lastly, what about changes on the part of authors? Well, like program committee members, most authors have no training that would enable them to deeply critique their own work so as to surface broader impacts and societal implications. But without appropriate training, we risk ending up with a bunch of broader impact statements that are superficial, speculative, or even outright wrong.

On top of that, most authors have little training in scientific communication — choosing terminology and other words carefully, stating assumptions, getting feedback so as to flag misunderstandings, and suchlike. Indeed, these activities are disincentivized by our current research practices, which reward the fast-paced submission of as many papers as possible. Moreover, in contrast to most other disciplines, we have somehow picked up a culture of immodesty, in which overselling is the norm and talking about limitations and negative results is discouraged. This means that even when we do see potentially concerning broader impacts and societal implications, we aren’t comfortable calling them out.

But to set appropriate expectations for the other groups of actors in the research-to-practice pipeline, researchers need to communicate effectively and honestly about their work, without overselling. For example, I may know that my technique is designed to recognize facial expressions, and cannot measure emotions or other mental states unless they are well proxied by facial expressions, which vary by context. But unless I explicitly state this in my papers, choosing my words carefully, is it realistic to assume that this will be clear to the subsequent groups in the pipeline, including marketing, comms, and sales teams who need to convey this to global customers when selling a system based on my technique?

Sociotechnical Education?

The thing is, all of these changes, regardless of whether they involve venues, program committees, or authors, mean acknowledging at every level that machine learning is always sociotechnical and we cannot isolate technical considerations from the rest. That ship has sailed — or, really, never existed.

But how, then, do we reorient machine learning toward a sociotechnical perspective? Well, I’m fairly convinced that the only effective way is to revamp our education system. Again, what I am about to describe is a caricature that omits important details — most notably, the fact that machine learning researchers come from all over the world, so we don’t even have a single education system. There are multiple education systems to revamp, all situated within very different cultural contexts.

Not only are broader impacts and societal implications absent from many computer science curricula, they are actively downplayed and deprioritized via sentiments like “numbers are neutral,” “computers are objective,” “edge cases and outliers are just a nuisance,” “abstraction away from the messiness of the real world is key,” and “modularity is important so that parts of a problem can be addressed in isolation.” Yet, the reality is that so long as machine learning plays the role it does in society, machine learning researchers and practitioners need to understand society. We need to acknowledge and contend with truly difficult societal topics, like sexism and racism, no matter how uncomfortable it may be to do so. To quote Donald Martin Jr. from Google, “it’s a science-understanding-of-public crisis.”

So, what about adding courses on these topics? Well, for sure I think that anyone who will go on to work in machine learning should take courses in the humanities and social sciences, as well as courses in scientific communication. But which courses and what should they cover? Societal topics center around social constructs, like race or gender, which vary by geography, by culture, and even over time. I don’t have any easy answers here, but I will note that this issue is similar to that faced by venues struggling with whose values to prioritize. On top of that, I’m not sure that simply adding courses is sufficient.

At least as I see it, separate courses reinforce the dangerous narrative that technical considerations CAN be isolated, and that it’s okay to develop algorithms and models without thinking about the broader impacts and societal implications of using them to address real-world problems — that thinking about these topics can be done separately, secondarily, and after the fact. But it’s dangerous to do this in the real world. Machine learning is sociotechnical and, as I argued earlier, broader impacts and societal implications need to be prioritized and considered at every point in the research-to-practice pipeline.

The only way to train people to approach machine learning from a sociotechnical perspective is to make sure that these topics are part of every machine learning course, part of every homework — that technical considerations are never isolated and considered on their own. Great, you’re learning about kernel methods for classification — but what are the societal implications of classifying people in the first place? Who decides what label schema will be used and what values will it reflect? Similarly, the only way to help people develop scientific communication skills is to prioritize them in everything they do.

Will these changes make machine learning less fun? Maybe, for some people. But that’s their privilege talking, and their ethical debt accruing. Machine learning has never been all that fun for people who are involuntarily represented in datasets or subject to uncontestable, life-altering decisions made by machine learning systems. Ultimately, with great power comes great responsibility — and, thanks to the incredible success of machine learning, anyone who works in this space is extremely powerful.

But I don’t see these changes happening anytime soon. They require a fundamental shift from disciplinary training — majors, and so on — to holistic, multidisciplinary training. And this is not a good match for most universities and their disciplinary departments. Add to that the fact that universities are losing people to industry or simply not hiring them in the first place, and none of this seems realistic.

Academic Freedom

Furthermore, and returning to research practices and review processes, even if researchers DO deeply understand the sociotechnical nature of machine learning and they have the skills to communicate carefully and effectively about broader impacts and societal implications, they may not be empowered to do so — or to do so publicly and honestly — especially if they are pre-tenure or work in industry. Take last week’s events involving Timnit Gebru and Google, for example. Academic freedom isn’t the main point of my talk today, but for sure it’s another matted snarl in our machine learning Gordian knot.

What Now?

So, where do we go from here? Well, I don’t know. In the case of the original Gordian knot, though, it never did get untied. Instead, after years of failed attempts, Alexander the Great came along and simply sliced through the whole damn thing with his sword, reasoning that untying it would never be possible. But that’s why I’m giving this talk. I hope that by describing (some of) our machine learning Gordian knot, some of you will be inspired to study it, perhaps eventually realizing that it wasn’t a Gordian knot in the first place and CAN be untied. Or maybe some of you will take up the art of the sword.

Acknowledgements

Thanks to Solon Barocas, Natasha Crampton, Miro Dudík, Andreas Mueller, Jenn Wortman Vaughan, and Tim Vieira for their valuable feedback.

Citation

To cite this article, please use the following BibTeX:

@misc{wallach2021navigating,
  author = {Hanna Wallach},
  title = {Navigating the Broader Impacts of Machine Learning Research},
  month = {February},
  year = {2021},
  url = {https://hannawallach.medium.com/navigating-the-broader-impacts-of-machine-learning-research-f2d72a37a5b?source=friends_link&sk=3b9e2ab57cb142e00cf0b55fe5bc1df3}
}

