Introducing the Lambda Inference API: Lowest-Cost Inference Anywhere
Lambda Cloud has announced the general availability (GA) of the Lambda Inference API, the lowest-cost inference anywhere. For a fraction of a cent, you can access the latest LLMs through a serverless API. The Lambda Inference API offers low-cost, scalable AI inference on some of the latest models, such as the recently released Llama 3.3 70B Instruct (FP8), at just $0.20 per million input and output tokens. That's the lowest-priced serverless AI inference available anywhere, at less than half the cost of most competitors.
Choose from “Core” models, which are selected for stability and long-term support, or “Sandbox” models, which provide access to the latest innovations with more frequent updates. The API scales effortlessly to handle workloads of any size and integrates seamlessly with OpenAI-style endpoints, making implementation quick and easy.
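Because the endpoints follow the OpenAI convention, existing OpenAI SDK code can typically be pointed at the service just by swapping the base URL and API key. Here is a minimal sketch, assuming an OpenAI-compatible base URL and a model name from the catalog listed later in this article; check Lambda's docs for the exact values:

```python
# Minimal sketch: calling the Lambda Inference API through the OpenAI
# Python SDK. The base URL and model name are illustrative assumptions;
# verify both against Lambda's documentation.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_LAMBDA_API_KEY",             # key generated from the Lambda dashboard
    base_url="https://api.lambdalabs.com/v1",  # assumed OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="llama3.3-70b",  # one of the models listed later in this article
    messages=[{"role": "user", "content": "Explain serverless inference in one sentence."}],
)
print(response.choices[0].message.content)
```

The only changes from stock OpenAI SDK code are the `base_url` and the key, which is what makes migration quick.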
AI without the complexity
Inference is where trained models prove their worth. It’s where the AI model takes in new data (aka prompts)—text, images, and embeddings—and generates actionable predictions, insights, or even videos of fire-fighting kittens in near real-time.
From powering conversational agents to generating images, inference is at the heart of every AI-driven application.
But let’s face it: deploying AI at scale is no easy feat. It requires massive amounts of compute, significant MLOps expertise to set everything up and tune its performance, and a hefty budget to keep it all running smoothly. If you’ve ever tried deploying an AI model before, you know.
We built the Lambda Inference API to make inference simple, scalable, and accessible. For over a decade, Lambda has been engineering every layer of our stack, from hardware and networking to software, for AI performance and efficiency.
We’ve taken everything we’ve learned in that time and built an Inference API, underpinned by an industry-leading inference stack, that’s purpose-built for AI.
Original article and insights from VentureBeat
Lambda is a 12-year-old San Francisco company best known for offering graphics processing units (GPUs) on demand as a service to machine learning researchers and AI model builders and trainers.
But today it’s taking its offerings a step further with the launch of the Lambda Inference API (application programming interface), which it claims to be the lowest-cost service of its kind on the market. The API allows enterprises to deploy AI models and applications into production for end users without worrying about procuring or maintaining compute.
The launch complements Lambda’s existing focus on providing GPU clusters for training and fine-tuning machine learning models.
“Our platform is fully verticalized, meaning we can pass dramatic cost savings to end users compared to other providers like OpenAI,” said Robert Brooks, Lambda’s vice president of revenue, in a video call interview with VentureBeat. “Plus, there are no rate limits inhibiting scaling, and you don’t have to talk to a salesperson to get started.”
Developers can head over to Lambda’s new Inference API webpage, generate an API key, and get started in less than five minutes.
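The request shape itself is plain HTTP, so you can also skip the SDK entirely. A bare-bones sketch of that five-minute start, assuming the endpoint follows the OpenAI `/chat/completions` convention (the URL and environment variable name here are assumptions; verify them in Lambda's docs):

```python
# Bare-bones quickstart over plain HTTP, no SDK required.
import os
import requests

api_key = os.environ["LAMBDA_API_KEY"]  # hypothetical env var holding your key

resp = requests.post(
    "https://api.lambdalabs.com/v1/chat/completions",  # assumed endpoint
    headers={"Authorization": f"Bearer {api_key}"},
    json={
        "model": "llama3.1-8b-instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```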
Lambda’s Inference API supports leading-edge models such as Meta’s Llama 3.3 and 3.1, Nous’s Hermes-3, and Alibaba’s Qwen 2.5, making it one of the most accessible options for the machine learning community. The full list is available here and includes the following (a sketch for querying the catalog programmatically follows the list):
- deepseek-coder-v2-lite-instruct
- dracarys2-72b-instruct
- hermes3-405b
- hermes3-405b-fp8-128k
- hermes3-70b
- hermes3-8b
- lfm-40b
- llama3.1-405b-instruct-fp8
- llama3.1-70b-instruct-fp8
- llama3.1-8b-instruct
- llama3.2-3b-instruct
- llama3.1-nemotron-70b-instruct
- llama3.3-70b
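Since the catalog changes as Sandbox models are updated, it can be handier to query the list at runtime than to hard-code it. A short sketch using the OpenAI-style `/models` endpoint, under the same assumed base URL as the earlier examples:

```python
# List the models currently served, via the OpenAI-style /models endpoint.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_LAMBDA_API_KEY",
    base_url="https://api.lambdalabs.com/v1",  # assumed endpoint
)

for model in client.models.list():
    print(model.id)  # e.g. "llama3.3-70b", "hermes3-405b", ...
```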
Pricing begins at $0.02 per million tokens for smaller models like Llama-3.2-3B-Instruct, and scales up to $0.90 per million tokens for larger, state-of-the-art models such as Llama 3.1-405B-Instruct.
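Those per-million-token rates make cost estimates simple arithmetic. A quick sketch using the two prices quoted above (prices as stated in the article; actual billing may differ):

```python
# Back-of-the-envelope cost math from the quoted per-million-token prices.
PRICE_PER_MILLION_USD = {
    "llama3.2-3b-instruct": 0.02,        # price quoted in the article
    "llama3.1-405b-instruct-fp8": 0.90,  # price quoted in the article
}

def cost_usd(model: str, tokens: int) -> float:
    """Cost for `tokens` total (input + output) tokens on `model`."""
    return tokens / 1_000_000 * PRICE_PER_MILLION_USD[model]

print(cost_usd("llama3.2-3b-instruct", 10_000_000))        # 0.2 -> $0.20
print(cost_usd("llama3.1-405b-instruct-fp8", 10_000_000))  # 9.0 -> $9.00
```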
As Lambda cofounder and CEO Stephen Balaban put it recently on X, “Stop wasting money and start using Lambda for LLM Inference.” Balaban published a graph showing its per-token cost for serving up AI models through inference compared to rivals in the space.
![](https://venturebeat.com/wp-content/uploads/2024/12/GeEIhsXWUAAjyXG-1.png?w=800)
Furthermore, unlike many other services, Lambda’s pay-as-you-go model ensures customers pay only for the tokens they use, eliminating the need for subscriptions or rate-limited plans.
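Because billing is per token, the `usage` block that OpenAI-style responses carry is the natural place to meter spend. A hedged sketch, reusing the assumed base URL from the earlier examples:

```python
# Meter pay-as-you-go spend from the OpenAI-style `usage` field.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_LAMBDA_API_KEY",
    base_url="https://api.lambdalabs.com/v1",  # assumed endpoint
)

response = client.chat.completions.create(
    model="llama3.3-70b",
    messages=[{"role": "user", "content": "A one-line haiku about GPUs."}],
)

usage = response.usage   # prompt_tokens, completion_tokens, total_tokens
rate = 0.20 / 1_000_000  # $0.20 per million tokens, per the article
print(f"{usage.total_tokens} tokens -> ${usage.total_tokens * rate:.6f}")
```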
Closing the AI loop
Lambda has a decade-plus history of supporting AI advancements with its GPU-based infrastructure.
From its hardware solutions to its training and fine-tuning capabilities, the company has built a reputation as a reliable partner for enterprises, research institutions, and startups.
“Understand that Lambda has been deploying GPUs for well over a decade to our user base, and so we’re sitting on literally tens of thousands of Nvidia GPUs, and some of them can be from older life cycles and newer life cycles, allowing us to still get maximum utility out of those AI chips for the wider ML community, at reduced costs as well,” Brooks explained. “With the launch of Lambda Inference, we’re closing the loop on the full-stack AI development lifecycle. The new API formalizes what many engineers had already been doing on Lambda’s platform — using it for inference — but now with a dedicated service that simplifies deployment.”
Brooks noted that this deep reservoir of GPU resources is one of Lambda’s distinguishing features, reiterating that “Lambda has deployed tens of thousands of GPUs over the past decade, allowing us to offer cost-effective solutions and maximum utility for both older and newer AI chips.”
This GPU advantage enables the platform to support scaling to trillions of tokens monthly, providing flexibility for developers and enterprises alike.
Open and flexible
Lambda is positioning itself as a flexible alternative to cloud giants by offering unrestricted access to high-performance inference.
“We want to give the machine learning community unrestricted access to inference APIs. You can plug and play, read the docs, and scale rapidly to trillions of tokens,” Brooks explained.
The API supports a range of open-source and proprietary models, including popular instruction-tuned Llama models.
The company has also hinted at expanding to multimodal applications, including video and image generation, in the near future.
“Initially, we’re focused on text-based LLMs, but soon we’ll expand to multimodal models,” Brooks said.
Serving devs and enterprises with privacy and security
The Lambda Inference API targets a wide range of users, from startups to large enterprises, in media, entertainment, and software development.
These industries are increasingly adopting AI to power applications like text summarization, code generation, and generative content creation.
“There’s no retention or sharing of user data on our platform. We act as a conduit for serving data to end users, ensuring privacy,” Brooks emphasized, reinforcing Lambda’s commitment to security and user control.
As AI adoption continues to rise, Lambda’s new service is poised to attract attention from businesses seeking cost-effective solutions for deploying and maintaining AI models. By eliminating common barriers such as rate limits and high operating costs, Lambda hopes to empower more organizations to harness AI’s potential.