Is the gap between perception and reality in ‘Open Source AI’ putting you at risk? The answer may surprise you!
Open source used to be simple, with clear rules, well-known licences and strong community values. Everyone knew what ‘open source’ meant and the ecosystem thrived because of that clarity. Then came AI, bursting onto the scene with hype, marketing buzz and a heavy dose of confusion.
When the terms ‘open source’ and ‘AI’ appear together, what I often find is messy: misleading claims, unclear definitions, and risky assumptions. This is not just a theoretical problem; I have watched startups and global companies alike realise too late that ‘open source AI’ is often not what they thought it was. The gap between perception and reality is growing, and the risks, especially around licensing, compliance, and transparency, are significant.
Why am I talking about this?
I speak about this because I see too many people in the industry gloss over the details. Developers, tech leads, and product managers often scroll past licence agreements, sign terms without reading them, and assume that they are safely operating in an open source environment.
From my experience, ignoring these details can be costly. With AI, the landscape has become more complicated. Definitions have blurred, licensing is trickier, and marketing noise makes it hard to know what ‘open’ really means.
The marketing versus the reality
Today, ‘open source AI’ is one of the hottest marketing terms in technology. But I have noticed that the label is often slapped onto tools, datasets, and models that do not actually meet the established criteria for open source.
I have heard people say things like:
‘I used a model from Hugging Face, so it is open source AI.’
‘I cloned a repository from GitHub, so it is open source AI.’
‘I trained my model on a dataset from Kaggle, so it is open source AI.’
I have learned the hard way that using one open component does not make the entire AI system open source. AI is built from multiple layers, and each one needs to be open for the final product to genuinely qualify.
The four key components of AI
In the early days of open source, things were simpler. You had code, you licensed it under something like the GNU General Public License (GPL) or the Apache License, and the community understood the rules.
AI is different. I think of it as having four main components.
- Code: The application logic I or other developers write.
- Model architecture: The structure that defines how the AI processes and learns from data.
- Training data: The datasets used to teach the model.
- Model weights: The learned parameters that make the model work.
From my perspective, if even one of these is proprietary, the AI system is not fully open source. And I have seen this misunderstanding happen more times than I can count.
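To make this layered view concrete, here is a minimal Python sketch of the idea, not a real tool: the class, the licence allow-list, and the helper are all hypothetical, and a real review would still require reading each licence’s actual terms.

```python
from dataclasses import dataclass

# Hypothetical allow-list of licences treated as open; adjust to your own policy.
OPEN_LICENCES = {"apache-2.0", "mit", "gpl-3.0", "cc-by-4.0"}

@dataclass
class AISystem:
    """The four layers of an AI system, each with its own declared licence."""
    code_licence: str
    architecture_licence: str
    training_data_licence: str
    weights_licence: str

    def closed_layers(self):
        """Return the names of layers whose licence is not on the open allow-list."""
        layers = {
            "code": self.code_licence,
            "model architecture": self.architecture_licence,
            "training data": self.training_data_licence,
            "model weights": self.weights_licence,
        }
        return [name for name, lic in layers.items() if lic.lower() not in OPEN_LICENCES]

# Example: open framework code and architecture, but closed data and weights.
system = AISystem("apache-2.0", "apache-2.0", "proprietary", "custom-restricted")
blockers = system.closed_layers()
print("Fully open" if not blockers else f"Not fully open; closed layers: {blockers}")
```

The point of the sketch is simply that the check is a conjunction: one closed layer is enough to disqualify the whole system.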
Real-world examples
I often point to PyTorch. On its own, it is truly open source under Apache 2.0. But as soon as you combine it with a proprietary model or closed dataset, the resulting application may no longer be considered open source.
Another example is Common Crawl, a huge open dataset used to train models like GPT-3 or Llama. Even though the dataset itself is open, the resulting models are not necessarily open source if their architecture or weights are closed.
Why do traditional rules not always apply?
I have spent years working with open source licences, and I can tell you that they were written for software code, not multi-layered AI systems. In the past, openness meant you could study, modify, and redistribute software freely. It meant you could build tools without hidden restrictions.
AI complicates that. Training datasets may have copyright limits. Model architectures can be secret. Weights can be withheld. Even usage might be capped.
Why do I think developers must pay attention?
If you adopt AI without checking licences carefully, you risk legal trouble, dependence on closed components, and a loss of transparency for your users.
I have seen many tools marketed as ‘open’ that are not. People only realise the limitations after investing significant time and money. That is why I emphasise reading licences carefully, verifying openness at every layer, and questioning marketing claims.
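As one practical example of verifying rather than assuming, the sketch below queries the licence tag a model repository declares on the Hugging Face Hub. It assumes the huggingface_hub package is installed; the allow-list is my own, and the declared tag covers only that one repository, not the training data or any other layer.

```python
from typing import Optional

# Requires: pip install huggingface_hub
from huggingface_hub import model_info

# Assumed allow-list for this example; your organisation's policy may differ.
OPEN_LICENCES = {"apache-2.0", "mit", "bsd-3-clause"}

def declared_licence(repo_id: str) -> Optional[str]:
    """Return the licence tag a public repo declares on the Hub, if any."""
    info = model_info(repo_id)
    for tag in info.tags or []:
        if tag.startswith("license:"):
            return tag.split(":", 1)[1]
    return None

licence = declared_licence("mistralai/Mistral-7B-v0.1")
print(licence, "- looks open" if licence in OPEN_LICENCES else "- read the full terms")
```

Even when the tag reads ‘apache-2.0’, that is only the repository’s own declaration about one layer; the weights’ terms, the training data, and any usage policy still need to be read in full.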
Let’s take some examples of tools that are supposedly ‘open’.
- Gemma: Open architecture and weights, but the training algorithms and datasets are closed. Partial openness.
- Mistral: Initially restrictive licence; switched to Apache 2.0 after community feedback.
- Claude: Fully closed weights, architecture, and datasets, but accessible via an application programming interface (API). Open access, not open source.
- Falcon: Open weights and architecture, but commercial use requires written permission.
- Llama: Open weights but with usage caps. Compliant today, possibly in violation tomorrow if user counts grow.
- Stable Diffusion: Transparent datasets, open architecture, modifiable weights, and ethical safeguards through the Responsible AI Licence (RAIL).
These examples show what I mean when I say ‘openness’ in AI is more nuanced than in traditional software.
I have reviewed AI licences that contain hidden restrictions such as usage caps, commercial bans, and hosting limits. These are not always obvious, and they can change over time. From my experience, designing applications around these ‘moving goalposts’ is extremely challenging.
My take on transparency and security
From my perspective, openness is also about trust and safety. Without full access to training datasets and algorithms, it is nearly impossible to detect issues such as data poisoning or bias.
I am encouraged by efforts from groups like the Open Source Security Foundation (OpenSSF), which are developing tools to detect licensing conflicts and security risks. But the ecosystem still has a long way to go.
What needs to change?
Here is what I believe we must do.
- Define open source AI clearly: We need universal standards for openness.
- Raise awareness: Developers must check licences at every layer.
- Engage in policy: Participate in organisations like the Linux Foundation or Open Source Initiative (OSI).
- Educate users: Informed users drive demand for transparency.
Open source has always been about transparency, collaboration, and trust. These values are worth preserving in the AI era. Calling something ‘open source AI’ when only part of it is open is misleading. AI is made of code, architecture, training data, and weights. If any part is closed, the system is not truly open.
From my experience, misunderstanding openness in AI can lead to legal problems, security risks, and a loss of trust. But if we commit to careful licence review, community engagement, and educating developers and users, we can build AI that is both innovative and genuinely open. That is the future I want to see and the one I work towards every day.
This article is based on a talk at AI DevCon, titled ‘Open Source AI: Dogma and Dilemma’, by Ram Iyengar, chief evangelist at Linux Foundation. It has been transcribed and curated by Vidushi Saxena, journalist at OSFY.