GitHub Copilot, Microsoft’s AI pair-programming service, has been out for less than a month now, but it’s already wildly popular. In projects where it’s enabled, GitHub states nearly 40% of code is now being written by Copilot. That’s over a million users and millions of lines of code.
Copilot consists of an editor extension and a back-end service that together suggest code to developers right in their editors. It supports editors and integrated development environments (IDEs) such as Microsoft’s Visual Studio Code, Neovim, and the JetBrains IDEs. Within these, the AI suggests the next line of code as developers type.
The program can suggest complete methods and complex algorithms alongside boilerplate code and assistance with unit testing. For all intents and purposes, the back-end AI acts as a pair-programming assistant. Developers are free to accept, reject, or edit Copilot’s suggestions. If you’re a new programmer, Copilot can interpret simple natural language commands and translate them into one of a dozen programming languages. These include Python, JavaScript, TypeScript, Ruby, and Go.
Microsoft, GitHub, and OpenAI collaborated to build the program. It’s based on OpenAI’s Codex. Codex was trained on billions of lines of publicly available source code — including code in public repositories on GitHub — and on natural language, which means it can understand both programming and human languages.
It sounds like a dream come true, doesn’t it? There’s a rather large fly in the ointment, though. There are legal questions about whether Codex had the right to use the open source code to provide the foundation of a proprietary service. And, even if it is legal, can Microsoft, OpenAI, and GitHub, and thus Copilot’s users, ethically use the code it “writes”?
According to Nat Friedman, GitHub’s CEO when Copilot was released in beta, GitHub is legally in the clear because “training ML systems on public data is fair use.” But, he also noted, “IP [intellectual property] and AI will be an interesting policy discussion around the world in the coming years.” You can say that again.
Others vehemently disagree. The Software Freedom Conservancy (SFC), a non-profit organization that provides legal services for open source software projects, holds the position that OpenAI’s Codex was trained exclusively with GitHub-hosted projects, many of which are licensed under copyleft licenses. As Bradley M. Kuhn, the SFC’s Policy Fellow and Hacker-in-Residence, stated, “Most of those projects are not in the ‘public domain,’ they are licensed under Free and Open Source Software (FOSS) licenses. These licenses have requirements including proper author attribution and, in the case of copyleft licenses, they sometimes require that works based on and/or that incorporate the software be licensed under the same copyleft license as the prior work. Microsoft and GitHub have been ignoring these license requirements for more than a year.”
Therefore, the SFC bites the bullet and urges developers not only to avoid using Copilot but to stop using GitHub completely. They know that won’t be easy. Thanks to Microsoft and GitHub’s “effective marketing, GitHub has convinced Free and Open Source Software (FOSS) developers that GitHub is the best (and even the only) place for FOSS development. However, as a proprietary, trade-secret tool, GitHub itself is the very opposite of FOSS,” added Kuhn.
Other people land between these extremes.
For example, Stefano Maffulli, executive director of the Open Source Initiative (OSI), the organization that oversees open source licenses, understands “why so many open source developers are upset: They have made their source code available for the progress of computer science and humanity. Now that code is being used to train machines to create more code — something the original developers never envisioned nor intended. I can see how it’s infuriating for some.”
That said, Maffulli thinks, “Legally, it appears that GitHub is within its rights.” However, it’s not worth getting “lost in the legal weeds discussing if there is an open source license issue here or a copyright issue. This would miss the wider point. Clearly, there *is* a fairness issue that affects the whole of society, not just open source developers.”
Maffulli argues:
Copilot has exposed developers to one of the quandaries of modern AI: the balance of rights between individuals participating in public activities on the internet and in social networks and the corporations using ‘user-generated content’ to train a new almighty AI. For many years we knew that uploading our pictures, our blog posts, and our code on public internet sites meant we’d be losing some amount of control over our creations. We created norms and licenses (open source and Creative Commons, for example) to balance control and publicity between creators and society as a whole. How many billions of Facebook users realized that their pictures and tags were being used to train a machine that would recognize them in the streets protesting or shopping? How many of those billions would choose to participate in this public activity if they understood that they were training a powerful machine with unknown reach into our private lives?
We can’t expect organizations to use AI in the future with “goodwill” and “good faith,” so it’s time for a broader conversation about AI’s impact on society and on open source.
That’s an excellent point. Copilot is the tip of an iceberg of a much larger issue. The OSI won’t be ignoring it. The organization has been working for several months on building a virtual event called Deep Dive: AI. This, the OSI hopes, will launch a conversation about the legal and ethical implications of AI and what’s acceptable for AI systems to be “open source”. It comprises a podcast series, which will launch soon, and a virtual conference, which will be held in October 2022.
Focusing more on the legal elements, well-known open-source lawyer and OSS Capital General Partner Heather Meeker believes Copilot is legally in the clear.
People get confused when a body of text like software source code — which is a copyrightable work of authorship — is used as data by other software tools. They might think that the results produced by an AI tool are somehow “derivative” of the body of text used to create it. In fact, the licensing terms for the original source code are probably irrelevant. AI tools that do predictive writing are, by definition, suggesting commonly used phrases or statements when the context makes them appropriate. This would likely fall under the fair use or scènes à faire defenses to copyright infringement — if it were infringement in the first place. It’s more likely that these commonly used artifacts are small code snippets that are entirely functional in nature and, therefore, when used in isolation, don’t enjoy copyright protection at all.
Meeker noted that even the Free Software Foundation (FSF) doesn’t claim that what Copilot does is copyright infringement. As John A. Rothchild, Professor of Law at Wayne State University, and Daniel H. Rothchild, Ph.D. candidate at the University of California at Berkeley, said in their FSF paper, “The use of Copilot’s output by its developer-customers is likely not infringing.” That, however, “does not absolve GitHub of wrongdoing, but rather argues that Copilot and its developer-customers likely do not infringe developers’ copyrights.” Instead, the FSF argues that Copilot is immoral because it is a Software as a Service (SaaS).
Open source legal expert and Columbia law professor Eben Moglen thinks Copilot doesn’t face serious legal problems, but GitHub and OpenAI do need to answer some concerns.
That’s because, Moglen said, “like photocopiers, or scissors and paste, code recommendation programs can result in copyright infringement. Therefore, parties offering such recommendation services should proceed in a license-aware fashion so that users incorporating recommended code in their projects will be informed in a granular fashion of any license restrictions on recommended code. Ideally, users should have the ability to filter recommendations automatically to avoid the unintentional incorporation of code with conflicting or undesired license terms.” At this time, Copilot doesn’t do this.
Since many “free software programmers are uncomfortable with code they have contributed to free software projects being incorporated in a GitHub code database through which it is distributed as snippets by the Copilot recommendation engine at a price,” Moglen said, GitHub should provide “a simple, persistent way to sequester their code from Copilot.” If GitHub doesn’t, it will have given programmers a reason to move their projects elsewhere, as the SFC is suggesting. Moglen therefore expects GitHub to offer a way to protect concerned developers from having their code vacuumed into the OpenAI Codex.
So, what happens now? Eventually, the courts will decide. Besides open source and copyright issues, there are still larger legal issues over the use of “public” data by private AI services.
As Maffulli said, “We need to better understand the needs of all actors affected by AI in order to establish a new framework that will embed the value of open source into AI, providing the guardrails for collaboration and fair competition to happen at all levels of society.”
Finally, it should be noted that GitHub isn’t the only company using AI to help programmers. Google’s DeepMind has its own AI developer system AlphaCode, Salesforce has CodeT5, and there’s also the open-source PolyCoder. In short, Copilot isn’t the only AI coder. The issue of how AI fits into programming, open-source, and copyright is much bigger than the simplistic “Microsoft is bad for open source!”