Open Sesame

Debate over “open source AI” term brings new push to formalize definition

Restrictive AI model licenses claimed as "open source" spark a push for a clear standard.

Benj Edwards
A man peers over a glass partition, seeking transparency. Credit: Image Source via Getty Images

The Open Source Initiative (OSI) recently unveiled its latest draft definition for "open source AI," aiming to clarify the ambiguous use of the term in the fast-moving field. The move comes as some companies like Meta release trained AI language model weights and code with usage restrictions while using the "open source" label. This has sparked intense debates among free-software advocates about what truly constitutes "open source" in the context of AI.

For instance, Meta's Llama 3 model, while freely available, doesn't meet the traditional open source criteria as defined by the OSI for software because its license restricts usage based on company size and the type of content produced with the model. The AI image generator Flux is another "open" model that is not truly open source. Because of this type of ambiguity, we've typically described AI models that ship code or weights with restrictions, or that lack accompanying training data, with alternative terms like "open-weights" or "source-available."

To address the issue formally, the OSI—which is well-known for its advocacy for open software standards—has assembled a group of about 70 participants, including researchers, lawyers, policymakers, and activists. Representatives from major tech companies like Meta, Google, and Amazon also joined the effort. The group's current draft (version 0.0.9) definition of open source AI emphasizes "four fundamental freedoms" reminiscent of those defining free software: the freedom to use the AI system for any purpose without asking for permission, to study how it works, to modify it for any purpose, and to share it with or without modifications.

By establishing clear criteria for open source AI, the organization hopes to provide a benchmark against which AI systems can be evaluated. This will likely help developers, researchers, and users make more informed decisions about the AI tools they create, study, or use.

Truly open source AI may also shed light on potential software vulnerabilities of AI systems, since researchers will be able to see how the AI models work behind the scenes. Compare this approach with an opaque system such as OpenAI's ChatGPT, which is more than just a GPT-4o large language model with a fancy interface—it's a proprietary system of interlocking models and filters, and its precise architecture is a closely guarded secret.

OSI's project timeline indicates that a stable version of the "open source AI" definition is expected to be announced in October at the All Things Open 2024 event in Raleigh, North Carolina.

“Permissionless innovation”

In a press release from May, the OSI emphasized the importance of defining what open source AI really means. "AI is different from regular software and forces all stakeholders to review how the Open Source principles apply to this space," said Stefano Maffulli, executive director of the OSI. "OSI believes that everybody deserves to maintain agency and control of the technology. We also recognize that markets flourish when clear definitions promote transparency, collaboration and permissionless innovation."

The organization's most recent draft definition extends beyond just the AI model or its weights, encompassing the entire system and its components.

For an AI system to qualify as open source, it must provide access to what the OSI calls the "preferred form to make modifications." This includes detailed information about the training data, the full source code used for training and running the system, and the model weights and parameters. All these elements must be available under OSI-approved licenses or terms.

Notably, the draft doesn't mandate the release of raw training data. Instead, it requires "data information"—detailed metadata about the training data and methods. This includes information on data sources, selection criteria, preprocessing techniques, and other relevant details that would allow a skilled person to re-create a similar system.

The "data information" approach aims to provide transparency and replicability without necessarily disclosing the actual dataset, ostensibly addressing potential privacy or copyright concerns while sticking to open source principles, though that particular point may be up for further debate.

"The most interesting thing about [the definition] is that they're allowing training data to NOT be released," said independent AI researcher Simon Willison in a brief Ars interview about the OSI's proposal. "It's an eminently pragmatic approach—if they didn't allow that, there would be hardly any capable 'open source' models."

A diverse group with a clear mission

The OSI's work on the "open source AI" definition goes back to 2022, when it first approached organizations about defining the term. So far, the process has involved a series of workshops worldwide that have brought together diverse groups from various backgrounds. According to the OSI, 53 percent of participants in the working groups on Open Source AI have been people of color, and 28 percent were women.

"After spending almost two years gathering voices from all over the world to identify the principles of Open Source suitable for AI systems, we're embarking on a worldwide roadshow to refine and validate the release candidate version of the Open Source AI Definition," Maffulli said when the series of workshops was originally announced in May.

Those workshops are still ongoing, and it's not too late to provide input. The OSI invites broader participation through further public forums, town hall meetings, and opportunities to comment on draft versions of the definition, as detailed on its website.

When it's over and the final definition is unveiled in October, the new open source AI definition may have deep implications for the AI industry, influencing how companies release AI models and shaping future regulation, such as California's controversial SB-1047. Ultimately, the OSI hopes that the new definition will unite participating members of the AI industry under a banner of software transparency.

"Just as the Open Source Definition serves as the globally accepted standard for Open Source software," the OSI wrote, "so will the Open Source AI Definition act as a standard for openness in AI systems and their components."

Benj Edwards Senior AI Reporter
Benj Edwards is Ars Technica's Senior AI Reporter and founder of the site's dedicated AI beat in 2022. He's also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.