“We Believe GenAI Will Soon Run Itself, Optimising And Scaling Without Humans”

Jithin VG, Founder and CEO of Bud Ecosystem

Bud Ecosystem’s software lets genAI run on regular hardware, doing away with the need for large GPUs or a complicated setup and reducing costs. In a chat with OSFY’s Nidhi Agarwal and Niraj Sahay, Jithin VG, Founder and CEO of Bud Ecosystem, talked about building India’s top genAI model, the challenges of running AI at scale without GPUs, and how open source tools can help more people and companies use AI on their own terms.

Q. When did your firm start, and what does it do?

A. Bud Ecosystem began as an AI research lab in early 2023, focused on open source research and model development. We created the GenZ 70B model—the first model from India to reach the top of the Hugging Face leaderboard—where it remained a leading model for about six months. A few months later, Intel approached us to build a solution for optimising genAI (generative AI) deployment on Intel hardware such as Xeon CPUs and Gaudi accelerators, and we successfully delivered it. Subsequently, we launched Bud Runtime, a software stack that helps organisations adopt genAI more easily by reducing costs and simplifying technical complexities.

Q. Where does electronics fit into your product?

A. We don’t build any electronics, but our software stack, Bud Runtime, is built to optimise the efficiency of hardware infrastructure used for deploying AI applications, be it a CPU, GPU, HPU, NPU or TPU. A major challenge faced by organisations today is that running large AI models is very costly and not scalable. This is due to high costs as well as scarcity of GPU infrastructure. Our software stack solves this by enabling deployment of AI workloads on commodity hardware such as CPUs or low-cost GPUs without compromising on performance. It can even enable deployment of AI models on laptops and edge devices. Think of it like a USB driver: just as a driver lets you use a USB device, our software acts as a driver for AI models, allowing them to run smoothly on any device.

Q. What makes you different from others in the market?

A. We’re not sure how many others offer a product like ours: an enterprise-ready, production-grade, end-to-end genAI runtime software stack. A few months back, we heard about Kompact AI, a startup from IIT Madras, offering CPU-based genAI systems, but haven’t seen much progress from them since. We are the first to build tech that runs genAI at scale on CPUs. In terms of performance, we’re ahead—especially for running these models on CPU-based devices. Also, ours is the only product available in the market that allows heterogeneous hardware parallelism, a feature that lets you mix and use different types of hardware, like AMD and Intel CPUs with an NVIDIA GPU, for AI workloads. This is a big advantage as it solves the problem of GPU scarcity and the high costs of genAI for users.

Q. How do you make genAI run well on different commodity hardware? What are the main challenges?

A. A key challenge was that there is no common software layer to support different hardware platforms like NVIDIA’s CUDA, AMD’s ROCm, and Intel’s systems. OpenAI’s Triton only worked well for NVIDIA, so we had to build much of this ourselves, especially for CPUs, creating computing, communication, and networking libraries from scratch to connect everything. Even with this foundation, we needed to ensure AI models ran efficiently in real-world situations, so we developed a simulation system to help optimise performance for both ourselves and our customers.

Q. Who uses your product?

A. We’re building a better version of Azure AI Foundry that allows enterprises to host optimised AI models on-premises, delivering high accuracy, security, and performance. Several enterprises already use it. Large service providers like Infosys, Capgemini, LTIMindtree and Accenture leverage our platform to quickly create scalable, secure AI solutions for their clients. Cloud companies use our system to move beyond selling GPU hardware and offer AI-powered services like models or tokens as a service, adding more value on top of their infrastructure.

OEMs like Dell and Netweb use our technology to support multiple hardware platforms—including Intel, NVIDIA, AMD, Qualcomm, and ARM—and provide AI solutions alongside their devices, enabling them to sell outcomes rather than just specs. Think of it like selling a PC — instead of just promoting the hardware specs, they sell improved performance and productivity. Semiconductor companies also adopt our stack from day one. These are the main customer groups we serve.

Q. What are the basic needs to run your software on any hardware? Do you also offer it as a SaaS?

A. We plan to launch a SaaS version in the next 6 to 12 months. Right now, we offer on-prem licences for large companies, cloud providers, and systems integrators. As for hardware, our software can run on low-cost CPUs—like a $200 server from AWS. That means you can run genAI applications on basic, low-spec machines, like those used for simple web apps. And it’s not just limited use—we support high-performance, non-quantized LLM inference with many requests at once, even on this kind of hardware.

Q. How do you reduce hardware limits but keep performance high?

A. We’re building a platform to minimise hardware dependency for genAI infrastructure—something that is nearly impossible today. Systems built on NVIDIA CUDA are locked into NVIDIA hardware, and even deploying across different NVIDIA chips is difficult due to varying CUDA versions. Mixing hardware from different vendors like NVIDIA, AMD, and Intel—or combining GPUs, HPUs, and CPUs—is currently infeasible.

We’re the only company addressing this with heterogeneous parallelism and clustering, plus SLO-based management, which abstracts hardware details. This means users don’t have to worry about the underlying hardware. It’s like how operating systems solved the early mainframe problem by providing hardware abstraction and portability. Today, genAI faces a similar issue — models built on NVIDIA can’t easily move elsewhere. We aim to solve this, enabling AI applications to run seamlessly across diverse hardware.
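To illustrate what SLO-based management over heterogeneous hardware means in practice, here is a deliberately simplified Python sketch; it is not Bud Runtime’s actual API, and the device names and throughput figures are invented for the example.

```python
# Illustrative only: greedy, SLO-driven placement over a mixed device pool.
# None of this reflects Bud Runtime's real interfaces; numbers are made up.
from dataclasses import dataclass

@dataclass
class Device:
    name: str                # e.g. "intel-xeon-0", "nvidia-a10-0" (hypothetical)
    tokens_per_sec: float    # measured throughput of the model on this device
    free: bool = True

def place_for_slo(pool: list[Device], target_tokens_per_sec: float) -> list[Device]:
    """Pick free devices (fastest first) until the throughput SLO is met,
    regardless of which vendor they come from."""
    chosen, total = [], 0.0
    for dev in sorted(pool, key=lambda d: d.tokens_per_sec, reverse=True):
        if not dev.free:
            continue
        chosen.append(dev)
        total += dev.tokens_per_sec
        if total >= target_tokens_per_sec:
            return chosen
    raise RuntimeError("pool cannot satisfy the SLO")

pool = [Device("nvidia-a10-0", 900), Device("intel-xeon-0", 250),
        Device("amd-epyc-0", 220), Device("intel-xeon-1", 250)]
print([d.name for d in place_for_slo(pool, 1200)])  # mixes GPU and CPU nodes
```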

Jithin VG receiving the Breakthrough Innovation ISV award from Intel Corporation

Q. How does your system make CPUs run as fast as GPUs?

A. We’ve implemented multiple optimisations to boost performance. Custom kernels improve compute, networking, memory bandwidth, and inter-chip communication, delivering 20-30% better performance. Fused kernels optimise LLM computations on CPUs, and a new CPU-specific attention mechanism, detailed in our paper, enables efficient LLM inference.

Recently, we developed GPU-free genAI models. The main GPU bottleneck is attention, which relies on heavy matrix multiplication needing massive parallelism. Building on Microsoft’s BitNet, we converted parts of attention from multiplication to addition using ternary weights. We also solved CPU memory bandwidth limits with block approximation. These advances enable fast genAI inference on CPUs without GPUs.
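As an editorial illustration of the ternary-weight idea Jithin refers to (building on Microsoft’s BitNet), the NumPy sketch below shows why weights restricted to {-1, 0, +1} let a matrix-vector product be computed with additions and subtractions only. It is not Bud’s actual kernel code; shapes and values are arbitrary.

```python
# Minimal sketch: a ternary-weight matrix-vector product needs no multiplications.
import numpy as np

def ternary_matvec(W_ternary: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Multiply a ternary weight matrix (entries in {-1, 0, +1}) by x.
    Each output element is a sum of some activations minus a sum of others."""
    out = np.empty(W_ternary.shape[0], dtype=x.dtype)
    for i, row in enumerate(W_ternary):
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))             # ternary weights
x = rng.standard_normal(8).astype(np.float32)    # activations
assert np.allclose(ternary_matvec(W, x), W @ x)  # matches the ordinary matmul
```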

Q. How can we make big genAI models use less memory and run faster?

A. There are several techniques to reduce memory usage in genAI. The most common is quantization—converting model weights from FP16 to INT8 or INT4, which reduces memory size linearly. For example, a 16GB model at FP16 drops to 8GB with INT8, or 4GB with INT4. Distillation compresses a large model’s capabilities into a smaller one, cutting memory further. Offloading shifts memory from VRAM to RAM or disk, easing pressure on expensive GPU memory. Pruning removes less critical weights but requires optimised runtimes. We also use ternary weights, converting matrix multiplications into additions, which further shrinks memory usage.
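As a quick check of the quantisation arithmetic above, this back-of-the-envelope sketch computes approximate weight memory at different precisions; it ignores the KV cache, activations, and runtime overheads.

```python
# Rough weight-memory arithmetic: memory shrinks linearly with precision.
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Approximate weight memory in GB for a model of the given size."""
    return params_billion * 1e9 * BYTES_PER_PARAM[precision] / 1e9

for p in ("FP16", "INT8", "INT4"):
    print(f"8B model @ {p}: ~{weight_memory_gb(8, p):.0f} GB")
# -> ~16 GB, ~8 GB and ~4 GB: the linear drop described above.
```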

Beyond model weights, the KV cache is a major memory load in autoregressive LLMs. We apply KV cache compression and explore alternative architectures such as SSMs (state space models) and Liquid models. For scaling, we favour adapters over fine-tuned model replicas—supporting multiple use cases on a single base model. These strategies help address genAI’s core challenges — expensive GPU memory and limited memory bandwidth.
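To make the KV cache point concrete, here is a rough sketch of how its size is commonly estimated; the model shape is illustrative (an 8B-class transformer with grouped-query attention), not any specific Bud deployment.

```python
# Why the KV cache dominates memory at long contexts and high concurrency.
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB: two tensors (K and V) per layer per token."""
    elems = 2 * layers * kv_heads * head_dim * seq_len * batch
    return elems * bytes_per_elem / 1e9

# 32 layers, 8 KV heads of dim 128, 8k context, batch of 16, FP16 cache:
print(f"~{kv_cache_gb(32, 8, 128, 8192, 16):.1f} GB of KV cache alone")
```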

Q. How does your company integrate AI into its software development processes?

A. As an AI-first company, we use AI wherever it’s practical and reliable. Most of us use Cursor for a small boost in speed and efficiency, and we rely on multi-layer AI agents for tasks like PR reviews, code refactoring, documentation, and translation—helping us work across multiple programming languages. Our entire SDLC is AI-driven, supported by various tools, and recently we’ve added Devin, OpenAI’s coding tools, and Anthropic’s agents to further streamline development.

Q. What are the biggest scaling challenges in today’s AI, and how does your tech fix them?

A. The core challenge with genAI today is scalability—it simply isn’t there yet. Frameworks like vLLM are difficult to scale, and only a few experts know how to manage tens of thousands of concurrent requests with proper autoscaling and parallelism across nodes. Optimising tensor and replica parallelism based on specific hardware is complex and still being figured out. Companies like OpenAI, Anthropic, and Google can scale because they focus on a small set of known models and build custom optimisations. But creating a general-purpose inference runtime that works across any model, hardware, and cloud setup is far more difficult.
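As a toy illustration of the capacity arithmetic behind such autoscaling decisions, the sketch below estimates how many replicas a deployment needs for a given concurrency; the per-replica figure is an assumption, since real capacity depends on the model, hardware, and batching strategy.

```python
# Toy capacity planning: replicas needed to hold a target number of
# concurrent requests, assuming a fixed per-replica capacity.
import math

def replicas_needed(concurrent_requests: int, per_replica_capacity: int) -> int:
    return math.ceil(concurrent_requests / per_replica_capacity)

# e.g. 50,000 concurrent requests at 128 concurrent requests per replica:
print(replicas_needed(50_000, 128))   # -> 391 replicas to schedule or autoscale
```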

Q. Is your software runtime open source and built from scratch?

A. Yes, a part of our software is open source. Our base system was built from scratch, but it does include some open source parts like many other projects. We’ve open sourced part of the project—around 20–30%—mainly some tools and utilities we used to help build our system. It’s not fully built on top of another open source project. We’re also starting to grow a community around it and planning to do more outreach soon.

Q. Are the developers from the open source community or part of your team?

A. Right now, all the development is done by our team and a few partner groups. But in the next few months, we want to grow the community and get more outside developers involved. The code is already available on GitHub.

Q. What are the advantages of open sourcing a genAI project?

A. We believe this technology can make generative AI far more affordable and accessible. Today, running large models like LLaMA 8B requires expensive GPUs—often costing tens of thousands of dollars a month—which most small and mid-sized businesses simply can’t afford. While APIs from providers like OpenAI or Anthropic seem cheaper initially, costs quickly rise with scale, and users remain dependent on those platforms without owning any part of the technology.

That’s why open source genAI is crucial. It empowers developers and companies of all sizes to build, run, and improve AI systems on their own terms. An open ecosystem drives innovation, enables problem-solving across diverse use cases, and makes the technology more transparent and trustworthy. With more people contributing, we can push genAI to be faster, more efficient, and widely useful—beyond just those with deep pockets.

Q. What challenges do you face when open sourcing genAI systems?

A. A key challenge with open sourcing is building an active community early on. You need enough contributors to help the project grow and scale. Another issue, especially in India, is the shortage of developers skilled in low-level systems work—like firmware, kernel writing, or programming in C and Assembly. These experts are hard to find, which makes early development difficult.

India’s developer community is great at using open source tools, but not as strong in contributing back. We consume a lot from GitHub, possibly more than any other country, but we produce fewer major open source projects. Growing a culture of contribution is essential, and while it’s improving, it’s still a major challenge—especially for projects being built from here.

Q. Are there any parts of your tech stack that you haven’t open sourced yet, and why?

A. There’s a small part of our stack—a new runtime for non-LLM use cases—that isn’t open sourced yet. We’re still deciding whether to release it and, if so, which parts. Our team is in the middle of that decision-making process.

While we’re a research-focused company, we also need to generate revenue to stay afloat. So, for now, we’re keeping some IP in-house to work directly with large enterprises instead of giving everything away on GitHub. Our tech team strongly supports open source and would prefer to release everything from day one, but we’re taking a case-by-case approach. Most of our work will likely be open sourced over time, though we don’t yet have a fixed process for when or how that will happen.

Q. What open source licence do you use for genAI runtime? How does it affect business use?

A. We use three licences: MIT, Apache 2.0, and our own Bud Community Licence. The Bud licence protects us by stopping companies from just taking the code to offer it as a paid service, since we plan to launch our own SaaS soon. It also limits use by companies valued over $100 million, encouraging them to get an enterprise licence.

Licensing is a balance. Too strict, and fewer people use it. Too open, like MIT or Apache, and it can hurt future business opportunities. For example, LLaMA’s community licence is strict but still widely used. In generative AI, strict licences aren’t a big issue. The key is to pick a licence that fits your goals.

Q. How do you add new features without breaking old ones in open source AI?

A. We’ve prioritised rapid feature development while ensuring backward compatibility. From the start, the codebase was carefully designed with long-term stability in mind—not a quick hack, but a well-planned architecture inspired by mature industry projects like OpenVINO, OpenCL, and ONNX. By studying their architectures and communities, we identified common challenges—especially those causing repeated structural changes and backward incompatibility—and worked to avoid them. Although future requirements may still force major changes, our planning so far has helped maintain stability across multiple iterations.

That said, we’re not focused on releasing features rapidly for the sake of speed. As an optimisation system, our priority is building reliable, high-performance components. We take our time to ensure every line of code is efficient and stable, emphasising quality over quantity. Our goal is to deliver lasting value and performance rather than frequent feature updates.

Q. How do you handle security risks in open source AI runtime code?

A. From the start, we had a clear vision — to build an open source project focused on enterprise needs, especially secure on-premises deployments. Security, along with performance and scalability, has been our top priority, shaping the entire system architecture.

We follow key security standards like ISO, MITRE, CWE, SOC, HIPAA, and AI-specific guidelines from the White House and others. Built for compliance from day one, we also use AI tools to monitor vulnerabilities. Our proprietary BUD SENTRY (Secure Evaluation and Runtime Trust for Your models) is a zero-trust model ingestion framework that ensures no untrusted model enters your production pipeline without rigorous checks.

Q. How do you see India and Indian developers being part of this?

A. India has one of the largest developer communities in the world, so it will play a key role in building and supporting our project. The country also needs a solution like this right now. We’re working to catch up with nations like the US, China, and those in the EU in terms of AI and data centre capabilities, but India still lags behind—while the US has around 21 gigawatts of data centre capacity, India has only about 1 gigawatt. One way to close this gap is by using existing infrastructure more efficiently—like CPUs or commodity hardware—instead of relying solely on costly GPUs. Our software enables this, helping developing countries make faster progress in AI with the hardware they already have. That’s why we believe this will be especially valuable for India and why we expect strong community support from Indian developers.

Q. How will genAI hardware and software evolve in five years?

A. We ask this daily because it drives us. Since genAI aims to automate everything, we asked: shouldn’t it automate itself? That led to Bud Agent—an AI that builds and manages genAI infrastructure, rewrites its code, and acquires hardware as needed. It’s not open source yet, but videos are available on LinkedIn. We believe genAI will soon run itself, optimising and scaling without humans.

GenAI will evolve towards general and maybe superintelligence. These terms are unclear since we don’t fully grasp intelligence yet. But I hope within a few years we’ll create AI as capable as humans and start solving these big questions.

Q. What are your plans and what are you working on now?

A. India lacks fundamental AI research and mostly consumes technology rather than creates it. Most genAI startups rely on models and architectures developed outside the country—mainly in the US, China, and Europe. Less than 0.01% of practical genAI research papers come from India. While regional personalisation is important and many Indian companies focus on it, this doesn’t address the deeper gap in core AI research and innovation.

Our goal is to change this by focusing on fundamental AI research, creating new, efficient model architectures tailored for India and similar developing countries. These models would be optimised for low energy, low compute, and edge device deployment. Rather than just building huge data centres like the US, India should leverage its billions of existing client devices to maximise compute capacity and compete globally. Innovation and deep research are our priorities because we believe growth and success naturally follow when you create real value.
