Z.ai unveils GLM-5.1, enabling AI coding agents to run autonomously for hours 8 Apr 2026, 3:27 am
Chinese AI company Z.ai has launched GLM-5.1, an open-source coding model it says is built for agentic software engineering. The release comes as AI vendors move beyond autocomplete-style coding tools toward systems that can handle software tasks over longer periods with less human input.
Z.ai said GLM-5.1 can sustain performance over hundreds of iterations, an ability it argues sets it apart from models that lose effectiveness in longer sessions.
As one example, the company said GLM-5.1 improved a vector database optimization task over more than 600 iterations and 6,000 tool calls, reaching 21,500 queries per second, about six times the best result achieved in a single 50-turn session.
In a research note, Z.ai said GLM-5.1 outperformed its predecessor, GLM-5, on several software engineering benchmarks and showed particular strength in repo generation, terminal-based problem solving, and repeated code optimization. The company said the model scored 58.4 on SWE-Bench Pro, compared with 55.1 for GLM-5, and above the scores it listed for OpenAI’s GPT-5.4, Anthropic’s Opus 4.6, and Google’s Gemini 3.1 Pro on that benchmark.
GLM-5.1 has been released under the MIT License and is available through its developer platforms, with model weights also published for local deployment, the company said. That may appeal to enterprises looking for more control over how such tools are deployed.
Longer-running coding agents
Z.ai positions long-running performance as a key differentiator, setting GLM-5.1 apart from models that lose effectiveness in extended sessions.
Analysts say this is because many current models still plateau or drift after a relatively small number of turns, limiting their usefulness on extended, multi-step software tasks.
Pareekh Jain, CEO of Pareekh Consulting, said the industry is now moving beyond tools that can answer prompts toward systems that can carry out longer assignments with less supervision.
The question, Jain said, is no longer, “What can I ask this AI?” but, “What can I assign to it for the next eight hours?”
For enterprises, that raises the prospect of assigning an agent a ticket in the morning and receiving an optimized solution by day’s end, after it has run hundreds of experiments and profiled the code.
“This capability aligns with real needs such as large refactors, migration programs, and continuous incident resolution,” said Charlie Dai, VP and principal analyst at Forrester. “It suggests that long‑running autonomous agents are becoming more practical, provided enterprises layer in governance, monitoring, and escalation mechanisms to manage risk.”
Open-source appeal grows
GLM-5.1’s release under the MIT License could be significant, especially for companies in regulated or security-sensitive sectors.
“This matters in four key ways,” Jain said. “First, cost. Pricing is much lower than for premium models, and self-hosting lets companies control expenses instead of paying per use. Second, data governance. Sensitive code and data do not have to be sent to external APIs, which is critical in sectors such as finance, healthcare, and defense. Third, customization. Companies can adapt the model to their own codebases and internal tools without restrictions.”
The fourth factor, according to Jain, is geopolitical risk. Although the model is open source, its links to Chinese infrastructure and entities could still raise compliance concerns for some US companies.
Dai said the MIT license makes it easier for companies to run the model on their own systems while adapting it to internal requirements and governance policies. “For many buyers, this makes GLM‑5.1 a viable strategic option alongside commercial models, especially where regulatory constraints, IP sensitivity, or long‑term platform control matter most,” Dai said.
Benchmark credibility
Z.ai cited three benchmarks: SWE-Bench Pro, which tests complex software engineering tasks; NL2Repo, which measures repository generation; and Terminal-Bench 2.0, which evaluates real-world terminal-based problem solving.
“These benchmarks are designed to test coding agents’ advanced coding capabilities, so topping those benchmarks reflects strong coding performance, such as reliability in planning-to-execution, less prompt rework, and faster delivery,” said Lian Jye Su, chief analyst at Omdia. “However, they are still detached from typical enterprise realities.”
Su said public benchmarks still do not capture the messiness of proprietary codebases, legacy systems, and code review workflows. He added that benchmark results come from controlled settings that differ from production, though the gap is closing as more teams adopt agentic setups.
This article originally appeared in Computerworld.
Microsoft’s new Agent Governance Toolkit targets top OWASP risks for AI agents 8 Apr 2026, 2:38 am
Microsoft has quietly introduced the Agent Governance Toolkit, an open source project designed to monitor and control AI agents during execution as enterprises move them into production workflows.
The toolkit, which is a response to the Open Worldwide Application Security Project’s (OWASP) emerging focus on AI and LLM security risks, adds a runtime security layer that enforces policies to mitigate issues such as prompt injection, and improves visibility into agent behavior across complex, multi-step workflows, Imran Siddique, principal group engineering manager at Microsoft, wrote in a blog post.
More specifically, the toolkit maps to OWASP’s top 10 risks for agentic systems, including goal hijacking, tool misuse, identity abuse, supply chain risks, code execution, memory poisoning, insecure communications, cascading failures, human-agent trust exploitation, and rogue agents.
The rationale behind the toolkit, Siddique wrote, stems from how AI systems increasingly resemble loosely governed distributed environments, where multiple untrusted components share resources, make decisions, and interact externally with minimal oversight.
That prompted Microsoft to apply proven design patterns from operating systems, service meshes, and site reliability engineering to bring structure, isolation, and control to these environments, Siddique added.
Microsoft packaged these principles into a toolkit comprising seven components, available in Python, TypeScript, Rust, Go, and .NET.
The cross-language approach, Siddique explained, is aimed at meeting developers where they are and enabling integration across heterogeneous enterprise stacks.
As for the components, the toolkit includes modules such as a policy enforcement layer named Agent OS, a secure communication and identity framework named Agent Mesh, an execution control environment named Agent Runtime, and additional components, such as Agent SRE, Agent Compliance, and Agent Lightning, covering reliability, compliance, marketplace governance, and reinforcement learning oversight.
Beyond its modular design, Siddique further wrote that the toolkit is built to work with existing development ecosystems: “We designed the toolkit to be framework-agnostic from day one. Each integration hooks into a framework’s native extension points, LangChain’s callback handlers, CrewAI’s task decorators, Google ADK’s plugin system, Microsoft Agent Framework’s middleware pipeline, so adding governance doesn’t require rewriting agent code.”
This approach, the senior executive explained, would reduce integration overhead and risk, allowing developers to introduce governance controls into production systems without disrupting existing workflows or incurring the cost and complexity of rearchitecting applications.
Siddique also cited several framework integrations that are already deployed in production workloads, including LlamaIndex’s TrustedAgentWorker integration.
For those wishing to explore the toolkit, which is currently in public preview, it is available under an MIT license and structured as a monorepo with independently installable components.
Microsoft, in the future, plans to transition the project to a foundation-led model and is already engaging with the OWASP agentic AI community to support broader governance and stewardship, Siddique wrote.
The winners and losers of AI coding 8 Apr 2026, 2:00 am
I don’t need to tell you that agentic coding is changing the world of software development. Things are happening so quickly that it’s hard to keep up. Internet years seem like eons compared to agentic coding years. It seemed like just a few short weeks ago that everyone very suddenly stopped writing code and let Claude Code do all the work because, well, it was a few short weeks ago that it happened.
It seems like new ideas, tools, and frameworks are popping up every day.
Despite things moving like a cheetah sprinting across the savanna, I am going to make a few predictions about where the cheetah is going to end up and what will happen when it gets there.
So long, legacy software
First, legacy software is going to become a thing of the past. You know what I’m talking about—those big balls of mud that have accreted over the last 30 years. The one started by your cousin’s friend who wrote that software for your dad’s laundromat and is now the software recommended by the Coin Laundry Association. The one with seven million lines of hopeless spaghetti code that no one person actually understands, that uses ancient, long-outdated technology, that is impossible to maintain but somehow still works. The one that depends on an entire team of developers and support people to keep running.
Well, someone is going to come along and write a completely fresh, new, unmuddy version of that ball of mud with a coding agent. The perfect example of this is happening in open source with Cloudflare’s EmDash project. Now don’t get me wrong. I have a deep respect for WordPress, the CMS that basically runs the internet. It’s venerable and battle-tested—and bloated and insecure and written in PHP.
EmDash is a “spiritual successor” to WordPress. Cloudflare basically asked, “What would WordPress look like if we started building it today?” Then they started building it using agentic coding, and basically did in a couple of months what WordPress took 24 years to do. Sure, they had WordPress as a template, but it was only because of agentic coding that they were even willing to attempt it. It’s long been thought foolish to say “Let’s rebuild the whole thing from scratch.” Now, with agentic coding, it seems foolish not to.
This is not the last creaky, old-school project that will be re-imagined in the coming days. If your business relies on a big ball of mud, it’s time to start looking at rebuilding it from the ground up before someone else beats you to it.
Ideas, implemented
Second, all those great application ideas you’ve been thinking about but could never find the time to do? Well, now you and millions of other developers can actually do them. I myself am nearing completion on six — six! — of the ideas I’ve been kicking around for years. Yep, I built them all in parallel, with six different agents running at once. (Thank you, Garry Tan and gstack!)
Now, will there be a lot of slop that comes out of that? Sure. But will there be a huge supply of cool new software that will change the world? Yes, definitely.
That project you’ve always wanted to do? You can do it now.
Third, bespoke software will become the norm. Today, a business that needs accounting software will buy a product like QuickBooks or some other off-the-shelf solution and adapt it to their way of doing things. But going forward, those businesses can create their own accounting package designed specifically for the way they do business. No one knows their domain better than the small business owner themselves. Instead of relying on someone who doesn’t understand the nuances of running your particular plumbing business, you can just talk to Claude Code and build your own solution.
This is happening today (the head of finance wrote the solution!). If you aren’t considering becoming more efficient via agentic coding, then you might find yourself dealing with competitors that are.
Legacy apps need rewriting. Those side projects need building. That app you need for your business isn’t going to build itself. Three months ago, it all seemed foolish and impossible. Today? You are either the cheetah or the gazelle.
Get started with Python’s new frozendict type 8 Apr 2026, 2:00 am
Only very rarely does Python add a new standard data type. Python 3.15, when it’s released later this year, will come with one—an immutable dictionary, frozendict.
Dictionaries in Python correspond to hashmaps in Java. They are a way to associate keys with values. The Python dict, as it’s called, is tremendously powerful and versatile. In fact, the dict structure is used by the CPython interpreter to handle many things internally.
But a dict has a big limitation: it’s not hashable. A hashable type in Python has a hash value that never changes during its lifetime. Strings, numerical values (integers and floats), and tuples are all hashable because they are immutable. Container types, like lists, sets, and, yes, dicts, are mutable, so they can’t guarantee they hold the same values over time.
Python has long included a frozenset type—a version of a set that doesn’t change over its lifetime and is hashable. Because sets are basically dictionaries with keys and no values, why not also have a frozendict type? Well, after much debate, we finally got just that. If you download Python 3.15 alpha 7 or later, you’ll be able to try it out.
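You can see the hashability distinction with the built-ins that exist today. This is a generic illustration of the rule the article describes, not code from the article itself:

```python
# Immutable types are hashable and can serve as dictionary keys.
lookup = {}
lookup[(1, 2)] = "tuple key works"
lookup[frozenset({"a", "b"})] = "frozenset key works"

# Mutable containers, like lists, raise TypeError when used as keys.
try:
    lookup[["a", "b"]] = "list key fails"
except TypeError as exc:
    print(f"unhashable: {exc}")

print(len(lookup))  # only the two hashable keys were stored
```

A frozendict slots into the same rule: because its contents can never change, its hash can be computed once and trusted for its lifetime.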
The basics of a frozendict
In many respects, a frozendict behaves exactly like a regular dictionary. The main difference is that you can’t use the conventional dictionary literal (the {} syntax) to make one. You must use the frozendict() constructor:
my_frozendict = frozendict(
    x=1, y=True, z="Hello"
)
You can also take an existing dictionary and give it to the constructor:
my_frozendict = frozendict(
    {"x": 1, "y": True, "z": "Hello", "A string": "Another string"}
)
One big advantage of using a dict as the source is that you have more control over what the keys can be. In the above example, we can’t use "A string" as a key in the first constructor, because that’s not a valid argument name. But we can use any string we like as a dict key.
The new frozendict bears some resemblance to an existing type in the collections module, collections.frozenmap. But frozendict differs in several key ways:
- frozendict is built-in, so it doesn’t need to be imported from a module.
- frozenmap does not preserve insertion order.
- Lookups for keys in a frozenmap are potentially slower (O(log n)) than in a frozendict (O(1)).
Working with frozendicts
A frozendict behaves exactly like a regular dict as long as all you’re doing is reading values from it.
For instance, if you want to get a value using a key, it’s the same: use the syntax the_frozendict[the_key]. If you want to iterate through a frozendict, that works the same way as with a regular dict: for key in the_frozendict:. Likewise for key/value pairs: for key, value in the_frozendict.items(): will work as expected.
Another convenient aspect of frozendicts is that they preserve insertion order. This feature was added to regular dictionaries relatively recently, and can be used to do things like create FIFO queues. That frozendict preserves the same behavior is very useful; it means you can iterate through a frozendict created from a regular dictionary and get the same items in the same sequence.
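The ordering guarantee the frozendict inherits can be demonstrated with today’s regular dicts, which have preserved insertion order since Python 3.7. A generic illustration:

```python
# Dicts preserve insertion order; per the article, a frozendict
# built from a dict keeps the same ordering.
source = {"first": 1, "second": 2, "third": 3}

# Iteration yields keys in the order they were inserted.
assert list(source) == ["first", "second", "third"]

# FIFO-style consumption: next(iter(d)) is always the oldest key.
oldest = next(iter(source))
print(oldest)
```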
What frozendicts don’t let you do
The one big thing you can’t do with a frozendict is change its contents in any way. You can’t add keys, reassign their values, or remove keys. That means all of the following code would be invalid:
# adding a new key
my_frozendict["new_key"] = "some value"
# reassigning an existing key
my_frozendict["z"] = "a new value"
# removing an item
my_frozendict.pop("z")
Each of these would raise an exception. In the case of my_frozendict.pop(), note that the .pop() method doesn’t even exist on a frozendict.
While you can use merge and update operators on a frozendict, the way they work is a little deceptive. They don’t actually change anything; instead, they create a new frozendict object that contains the results of the merge or update. It’s similar to how “changing” a string or tuple really just means constructing a new instance of those types with the changes you want.
# Merge operation: | produces a new frozendict
my_frozendict = frozendict(x=1)
my_other_frozendict = frozendict(y=1)
new_fz = my_frozendict | my_other_frozendict
# Update operation: |= rebinds new_fz to yet another new frozendict;
# the originals are untouched
new_fz |= frozendict(x=2)
Use cases for frozendicts
Since a frozendict can’t be changed, it obviously isn’t a substitute for a regular dictionary, and it isn’t meant to be. The frozendict will come in handy when you want to do things like:
- Store key/value data that is meant to be immutable. For instance, if you collect key/value data from command-line options, you could store them in a frozendict to signal that they should not be altered over the lifetime of the program.
- Use a dictionary in some circumstance where you need a hashable type. For instance, if you want to use a dictionary as a key in another dictionary, or as an element in a set, a frozendict fits the bill.
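Until frozendict is widely available, the dictionary-as-key use case can be approximated in current Python by snapshotting a dict into a frozenset of its items. This is a workaround of my own for illustration, not part of the frozendict API, and it only works when every value is itself hashable:

```python
def freeze(d):
    # Hashable snapshot of a dict's items; a stand-in for frozendict,
    # valid only when all values are hashable.
    return frozenset(d.items())

# Use a dict's contents as a cache key.
cache = {}
config = {"host": "localhost", "port": 8080}
cache[freeze(config)] = "connection A"

# An equal dict, even with a different insertion order, hits the entry.
same = {"port": 8080, "host": "localhost"}
print(cache[freeze(same)])  # connection A
```

A real frozendict would make the snapshot step unnecessary: you could use the mapping itself as the key.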
It might be tempting to think a frozendict will provide better performance than a regular dict, considering it’s read-only. It’s possible, but not guaranteed, that eventual improvements in Python will enable better performance with immutable types. However, right now, that’s far from being a reason to use them.
GitHub Copilot CLI adds Rubber Duck review agent 7 Apr 2026, 4:17 pm
GitHub has introduced an experimental Rubber Duck mode in the GitHub Copilot CLI. The latest addition to the AI-powered coding tool uses a second model from a different AI family to provide a second opinion before enacting the agent’s plan.
The new feature was announced April 6. Introduced in experimental mode, Rubber Duck leverages a second model from a different AI family to act as an independent reviewer, assessing plans and work at the moments where feedback matters most, according to GitHub. Its job is to check the primary agent’s work and present a short, focused list of high-value concerns, including details the primary agent may have missed, assumptions worth questioning, and edge cases to consider.
Developers can use /experimental in the Copilot CLI to access Rubber Duck alongside other experimental features.
Evaluating Rubber Duck on SWE-Bench Pro, a benchmark of real-world coding problems drawn from open-source repositories, GitHub found that Claude Sonnet 4.6 paired with Rubber Duck running GPT-5.4 achieved a resolution rate approaching Claude Opus 4.6 running alone, closing 74.7% of the performance gap between Sonnet and Opus. GitHub said Rubber Duck tends to help more with difficult problems, ones that span three-plus files and would normally take 70-plus steps. On these problems, Sonnet plus Rubber Duck scores 3.8% higher than the Sonnet baseline and 4.8% higher on the hardest problems identified across three trials.
GitHub cited these examples of the kinds of problems Rubber Duck finds:
- Architectural catch (OpenLibrary/async scheduler): Rubber Duck caught that the proposed scheduler would start and immediately exit, running zero jobs—and that even if fixed, one of the scheduled tasks was itself an infinite loop.
- One-liner bug (OpenLibrary/Solr): Rubber Duck caught a loop that silently overwrote the same dict key on every iteration. Three of four Solr facet categories were being dropped from every search query, with no error thrown.
- Cross-file conflict (NodeBB/email confirmation): Rubber Duck caught three files that all read from a Redis key which the new code stopped writing. The confirmation UI and cleanup paths would have been silently broken on deploy.
The Terraform scaling problem: When infrastructure-as-code becomes infrastructure-as-complexity 7 Apr 2026, 5:41 am
Terraform promised us a better world. Define your infrastructure in code, version it, review it, and deploy it with confidence. For small teams running a handful of services, that promise holds up beautifully.
Then your organization grows. Teams multiply. Modules branch and fork. State files balloon. And suddenly, that clean declarative vision starts looking a lot like a sprawling monolith that nobody fully understands and everyone is afraid to touch.
If you’ve ever watched a Terraform plan run for 20 minutes, encountered a corrupted state file at 2 a.m., or inherited a Terraform codebase where half the resources are undocumented and a quarter are unmanaged, you know exactly what we’re talking about. This is the Terraform scaling problem, and it’s affecting engineering organizations of every size.
The numbers confirm it isn’t a niche concern. The 2023 State of IaC Report found that 90% of cloud users are already using infrastructure-as-code, with Terraform commanding 76% market share according to the CNCF 2024 Annual Survey. Yet the HashiCorp State of Cloud Strategy Survey 2024 showed that 64% of organizations report a shortage of skilled cloud and automation staff, creating a dangerous gap between Terraform’s adoption and the expertise required to operate it well at scale.
In this post, we break down where Terraform breaks down, why traditional solutions fall short, and how AI-assisted IaC management is offering a credible path forward.
The root causes of Terraform complexity at scale
Terraform’s design philosophy is fundamentally sound: Declarative infrastructure, idempotent operations and a provider ecosystem that covers nearly every cloud service imaginable. The problem isn’t the tool; it’s the gap between how Terraform was designed to work and how large engineering organizations actually operate.
State management becomes a full-time job
Terraform’s state file is both its greatest strength and its biggest liability at scale. State gives Terraform the ability to track what it has deployed and calculate diffs — but as infrastructure grows, that state file becomes a critical shared resource with no native support for distributed access patterns.
Teams running a monolithic state end up with a single point of contention. Engineers queue up to run plans and apply. Locking mechanisms in backends like S3 with DynamoDB help, but they don’t solve the underlying architectural issue: Everyone is competing for the same resource.
The HashiCorp State of Cloud Strategy Survey consistently places state management issues, corruption, drift and locking failures among the top pain points for Terraform users in organizations with more than 50 engineers. When a state file gets corrupted mid-apply, recovery can take hours and require deep expertise. The problem compounds as infrastructure grows: Organizations running more than 500 managed resources in a single workspace routinely report 15–30 minute plan times, turning what should be a fast feedback loop into a deployment bottleneck.
Module sprawl and dependency hell
Terraform modules are the right answer to code reuse. They’re also the source of some of the most painful debugging sessions in platform engineering.
As organizations scale, module libraries grow organically. Teams fork modules to meet specific requirements. Version pinning gets inconsistent. A security patch in a root module requires coordinated updates across dozens of dependent modules — a task that sounds simple until you’re dealing with circular dependencies, incompatible provider versions and module registries that weren’t designed for enterprise governance.
Adopting semantic versioning for Terraform modules has a measurable impact: According to a Moldstud IaC case study (June 2025), approximately 60% of organizations that enforce semantic versioning on module releases report a decrease in deployment failures over six months. Yet most teams don’t adopt this practice until after they’ve experienced the failure modes firsthand. The same research found that teams using peer reviews for Terraform code experience a 30% improvement in code quality, but this requires process investment that most fast-moving platform teams skip in the early stages.
The pattern is consistent: What starts as a tidy module hierarchy becomes a tangled dependency graph that requires tribal knowledge to navigate.
Plan times and blast radius
At a certain scale, the Terraform plan stops being a quick feedback loop and starts being a liability. Teams managing thousands of resources in a single workspace can wait 15–30 minutes for a plan to complete. More critically, the blast radius of a single application expands proportionally.
A misconfigured security group rule in a small workspace affects a handful of resources. The same mistake in a large monolithic workspace can cascade across hundreds of resources before anyone can intervene. Terraform’s own declarative model means that configuration errors can trigger resource destruction, a risk that grows with workspace size. This reality pushes teams toward increasingly conservative change management processes, which defeats the core value proposition of IaC in the first place.
There’s a meaningful ROI case for solving this. The Moldstud IaC case study indicates that implementing automated IaC solutions can lead to a 70% reduction in deployment times. But capturing that return requires architectural decisions that prevent plan-time bottlenecks before they compound.
Drift: The silent killer
Infrastructure drift — where the actual state of your cloud environment diverges from what Terraform believes it to be — is among the most insidious challenges at scale. It accumulates slowly, through emergency console changes, partially applied runs and resources created outside of Terraform entirely.
The causes are well-documented: An on-call engineer hotfixes a security group at 3 a.m. and forgets to update the code; an autoscaling event modifies a resource configuration that Terraform manages; a third-party integration quietly changes a setting that Terraform has no visibility into. Each of these is a small divergence. Collectively, they erode the reliability of your entire IaC foundation. Terraform Drift Detection Guide documents how teams across industries are consistently caught off guard by drift accumulation in environments they believed were fully under IaC control.
By the time drift becomes visible, it’s often embedded deep enough to make remediation genuinely risky. The DORA 2023 State of DevOps Report found that teams dealing with frequent configuration drift had 2.3× higher change failure rates than teams maintaining consistent IaC hygiene. The compounding effect is significant: Drift erodes confidence in your IaC, which leads to more manual changes, which causes more drift.
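Conceptually, drift detection is a diff between the configuration the code declares and what the cloud API reports. The following is a minimal generic sketch of that idea; the resource attributes shown are made up for illustration and do not reflect any provider’s real schema or any vendor’s tooling:

```python
def find_drift(declared: dict, actual: dict) -> dict:
    """Diff declared IaC attributes against a live resource snapshot."""
    drift = {}
    for key, want in declared.items():
        have = actual.get(key)
        if have != want:
            drift[key] = {"declared": want, "actual": have}
    # Attributes present on the live resource but absent from code
    # (e.g., a tag added by hand in the console) also count as drift.
    for key in actual.keys() - declared.keys():
        drift[key] = {"declared": None, "actual": actual[key]}
    return drift

declared = {"instance_type": "t3.medium", "port": 443}
actual = {"instance_type": "t3.large", "port": 443, "tag": "hotfix"}
print(find_drift(declared, actual))
```

Real tools work from provider APIs and Terraform state rather than plain dicts, but the comparison logic, and the reason small manual changes accumulate unnoticed, is the same.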
Why traditional approaches fall short
The conventional responses to Terraform scaling challenges are well-documented: Workspace decomposition, remote state backends, CI/CD pipelines with policy enforcement and module registries with semantic versioning. These are all necessary practices. They’re also insufficient on their own.
- Workspace decomposition reduces blast radius but multiplies operational overhead. You’re trading one large problem for many smaller ones, each requiring its own state management, access controls and pipeline configuration. Managing 200 workspaces is a full-time engineering effort.
- CI/CD enforcement catches policy violations after the fact. By the time a plan hits your pipeline, an engineer has already spent time writing code that may get rejected. Feedback loops are slow, and the root cause — the complexity of authoring correct IaC at scale — remains unsolved.
- Manual code reviews don’t scale. Platform teams can become bottlenecks when every Terraform change requires expert review to validate correctness, security posture and compliance. The cognitive load required to review infrastructure changes accurately is substantial, and reviewers burn out. This bottleneck is only sharpened by the talent shortage: With 64% of organizations reporting a shortage of skilled cloud and automation staff, the supply of qualified reviewers isn’t growing fast enough to match Terraform’s adoption curve.
The honest assessment: These solutions manage Terraform complexity rather than resolving it. They require ongoing investment in tooling, process and expertise that many organizations struggle to maintain.
This is exactly the friction that StackGen’s Intent-to-Infrastructure Platform was designed to address. Rather than adding more manual process overhead, it introduces an intelligent layer that helps teams author, validate and govern Terraform configurations from the point of intent before complexity accumulates.
Emerging solutions: Where the industry is moving
The Terraform ecosystem is evolving rapidly in response to these challenges. The global IaC market reflects this urgency: Valued at $847 million in 2023, it’s projected to reach $3.76 billion by 2030 at a 24.4% compound annual growth rate, according to Grand View Research’s IaC Market Report. That growth isn’t just adoption — it’s investment in solving the complexity problems that widespread adoption creates.
Workspace automation and orchestration
Tools like Atlantis, StackGen, and Terraform Cloud are moving toward intelligent workspace orchestration: automatically managing dependencies between workspaces, ordering applies correctly, and providing better visibility into cross-workspace impact. This reduces the manual coordination overhead that plagues large-scale Terraform operations.
The key shift is treating your collection of workspaces as a managed system rather than a set of independent units. When a shared networking module changes, an orchestration layer should automatically identify affected workspaces, calculate the propagation order and manage the apply sequence — rather than requiring a human to track and coordinate each dependency manually.
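The "calculate the propagation order" step is a topological sort over the workspace dependency graph. A minimal sketch using Python’s standard library, with hypothetical workspace names:

```python
from graphlib import TopologicalSorter

# Each workspace maps to the set of workspaces it depends on.
deps = {
    "networking": set(),
    "compute": {"networking"},
    "data": {"networking"},
    "app": {"compute", "data"},
}

# static_order() yields dependencies before their dependents, giving
# a safe apply sequence: networking first, app last.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

An orchestration layer would additionally prune this graph to only the workspaces affected by a given change, and could apply independent workspaces (here, compute and data) in parallel.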
Policy-as-code with earlier enforcement
Open Policy Agent (OPA) and HashiCorp Sentinel have matured significantly. More importantly, teams are learning to push policy enforcement left — validating Terraform plans against organizational policies before they hit a CI/CD pipeline, and ideally before they’re even submitted for review.
HashiCorp has reported that teams using Sentinel with pre-plan validation see a 45% reduction in policy violation-related build failures compared to teams running post-plan enforcement only. Earlier feedback means faster iteration and lower engineer frustration.
AI-assisted IaC management: The emerging frontier
This is where the most significant innovation is happening. AI-assisted infrastructure management addresses the problems that automation alone can’t solve: The cognitive complexity of understanding large IaC codebases, identifying drift patterns before they become critical and translating high-level intent into correct, compliant Terraform code.
Platforms like StackGen’s Intent-to-Infrastructure Platform represent a new paradigm here. Rather than requiring platform engineers to manually author and review every Terraform resource definition, StackGen interprets infrastructure intent, expressed in natural language or high-level policy, and generates compliant Terraform configurations, validates them against organizational standards, and surfaces potential issues before they reach production. This directly addresses the bottleneck where expert review becomes a constraint on velocity.
The practical applications are concrete:
- Drift detection and remediation: AI models trained on infrastructure patterns can identify anomalous drift, distinguishing between expected configuration changes and unauthorized modifications, and surface remediation recommendations with context about impact and risk. This is particularly powerful for teams managing hundreds of workspaces where manual drift monitoring isn’t practical.
- Intelligent module recommendations: Rather than requiring engineers to navigate sprawling module registries manually, AI-assisted tooling can analyse an infrastructure request, identify the most appropriate existing modules and flag where new module development is needed. This reduces the “reinvent the wheel” pattern that causes module sprawl.
- Natural language to IaC: For platform teams managing self-service infrastructure portals, AI translation layers allow development teams to request infrastructure in natural language and receive validated Terraform configurations that conform to organizational standards — without requiring deep Terraform expertise from every team consuming platform services.
- Proactive complexity warnings: AI analysis of Terraform codebases can identify emerging complexity patterns before they become critical — detecting circular dependencies forming, state files approaching problematic size thresholds or module versioning patterns that suggest future compatibility issues.
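One of those warnings, spotting a circular dependency forming between modules, is cheap to prototype with a depth-first search over the module graph. A minimal sketch; the module names and graph are hypothetical:

```python
def find_cycle(graph: dict[str, list[str]]):
    """Return one module-dependency cycle as a path, or None."""
    visiting: set[str] = set()   # nodes on the current DFS path
    visited: set[str] = set()    # nodes fully explored
    path: list[str] = []

    def dfs(node):
        visiting.add(node)
        path.append(node)
        for dep in graph.get(node, ()):
            if dep in visiting:               # back edge -> cycle
                return path[path.index(dep):] + [dep]
            if dep not in visited:
                found = dfs(dep)
                if found:
                    return found
        visiting.discard(node)
        visited.add(node)
        path.pop()
        return None

    for n in graph:
        if n not in visited:
            found = dfs(n)
            if found:
                return found
    return None

# Hypothetical module graph: vpc -> dns -> vpc is a cycle in the making.
modules = {"vpc": ["dns"], "dns": ["vpc"], "eks": ["vpc"]}
print(find_cycle(modules))
```

The AI-assisted version of this adds judgment on top, flagging the cycle before it’s merged and suggesting which edge to break, but the underlying graph analysis is this simple.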
Gartner predicts that by 2026, more than 40% of organizations will be using AI-augmented IaC tooling for some portion of their infrastructure management workflow — up from under 10% in 2023. The trajectory is clear, and the window for early-mover advantage is still open.
Practical guidance: Scaling Terraform without losing your mind
While AI-assisted tooling continues to mature, there are concrete architectural and process changes your team can adopt today.
- Decompose by domain, not by team. Workspace boundaries should reflect infrastructure domains (networking, compute, data) rather than organizational team boundaries. Teams change; infrastructure domains are more stable. This reduces the reorganization tax you pay when teams restructure.
- Treat state as infrastructure. Your state backend deserves the same reliability engineering as production systems. Remote state with versioning, automated backup verification and clear recovery runbooks should be non-negotiable before you’re managing more than a few dozen resources. The HashiCorp State of Cloud Strategy Survey shows that over 80% of enterprises already integrate IaC into their CI/CD pipelines — but pipeline integration doesn’t substitute for state backend reliability.
- Invest in a private module registry early. Whether you use Terraform Cloud’s built-in registry or a self-hosted solution, a structured module registry with enforced semantic versioning pays compounding dividends as your module library grows. The cost of retrofitting governance onto an ungoverned module library is significantly higher than building in governance from the start.
- Automate drift detection, not just drift remediation. Drift remediation is expensive; drift detection is cheap. Scheduled Terraform plan runs in CI/CD, combined with alerting on detected drift, give you an early warning system that prevents drift from compounding silently. For teams managing large environments where manual detection becomes impractical, automated drift tooling, whether native to HCP Terraform or third-party solutions, becomes essential infrastructure in its own right.
- Build a paved road for Terraform consumers. If every application team needs to become a Terraform expert to consume platform services, your platform won’t scale. Build opinionated, simplified interfaces, whether that’s a service catalogue, a self-service portal or an AI-assisted request layer that allows development teams to get the infrastructure they need without requiring deep IaC expertise.
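Terraform itself supports the scheduled-detection pattern: `terraform plan -detailed-exitcode` exits 0 when there are no changes, 1 on error, and 2 when the plan detects changes (i.e., drift against state). A minimal scheduled-job sketch; the alerting hook is a placeholder, not a real integration:

```python
import subprocess

def classify_plan_exit(code: int) -> str:
    # Semantics of `terraform plan -detailed-exitcode`:
    #   0 = no changes, 1 = error, 2 = changes present (drift)
    return {0: "clean", 2: "drift"}.get(code, "error")

def alert(workspace: str, status: str, detail: str) -> None:
    """Placeholder: wire this to Slack, PagerDuty, etc."""
    print(f"[{status}] {workspace}\n{detail}")

def check_workspace(path: str) -> str:
    """Run a read-only plan in one workspace directory and classify it."""
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode",
         "-input=false", "-lock=false"],
        cwd=path, capture_output=True, text=True,
    )
    status = classify_plan_exit(proc.returncode)
    if status != "clean":
        alert(path, status, proc.stdout[-2000:])
    return status
```

Run on a schedule across your workspace list, this gives you the cheap early-warning layer; remediation stays a deliberate, human-approved step.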
The strategic inflection point
We’re at an inflection point in how the industry thinks about infrastructure-as-code. The original vision of IaC (infrastructure defined, versioned, and managed like software) was correct. The execution, for large-scale organizations, has accumulated significant complexity debt.
The next wave of IaC tooling isn’t about replacing Terraform. Terraform’s declarative model, provider ecosystem and community are genuine strengths that won’t be supplanted quickly. The opportunity is in the layer above Terraform: Intelligent orchestration, AI-assisted authoring, proactive complexity management and intent-driven infrastructure interfaces that make IaC accessible to the full organization rather than just a specialized subset of platform engineers.
Teams that invest in this layer now, whether through emerging platforms, internal tooling or AI-assisted workflows, will build a meaningful operational advantage. Teams that continue fighting Terraform complexity with more Terraform will find themselves spending an increasing proportion of engineering capacity on infrastructure maintenance rather than product development.
The IaC market’s 24.4% CAGR reflects growing awareness that the tools and processes managing this complexity need to evolve as fast as the infrastructure they govern.
Key takeaways
The Terraform scaling problem is real, but it’s solvable. The path forward involves three parallel tracks: Architectural decisions that manage blast radius and reduce state contention; process investments in policy-as-code and module governance; and tooling that uses AI to address the cognitive complexity that has always been the hardest part of IaC at scale.
Your infrastructure code should accelerate your engineering organization, not constrain it. If it’s doing the latter, the problem isn’t your engineers; it’s the layer of tooling and process sitting between intent and deployed infrastructure.
This article is published as part of the Foundry Expert Contributor Network.
Nvidia’s SchedMD acquisition puts open-source AI scheduling under scrutiny 7 Apr 2026, 5:13 am
Nvidia’s recent acquisition of SchedMD, the company behind the Slurm workload manager, is raising concerns among AI industry executives and supercomputing specialists who fear the chip giant could use its new position to favour its own hardware over competing chips, whether through code prioritization or roadmap decisions.
The concern, as industry sources frame it, is straightforward: Nvidia now controls scheduling software that also runs on hardware from its rivals, including AMD and Intel. A vendor that controls workload scheduling software has significant leverage over how efficiently competing hardware performs within shared computing environments — whether it exercises that leverage or not, Reuters reported, citing five anonymous sources, three of whom work in the AI industry and two with knowledge of supercomputer operations.
Analysts who spoke to InfoWorld said Nvidia’s open-source commitment — the company said during the acquisition announcement that it would “continue to develop and distribute Slurm as open-source, vendor-neutral software” — may not be sufficient protection.
“Slurm’s open-source foundation offers safeguards such as transparent code, forking ability, and community governance, but SchedMD’s control gives Nvidia soft power rather than hard lock-in,” said Manish Rawat, semiconductor analyst at TechInsights. Rawat said Nvidia could subtly shape the roadmap, prioritising GPU-aware scheduling and topology optimisations that favour its own hardware, and that integration timelines already showed faster support for the CUDA ecosystem compared to alternatives such as AMD’s ROCm or Intel’s oneAPI – creating what he described as a “best-supported path effect.”
What is Slurm, and why does it matter?
Slurm, originally developed at Lawrence Livermore National Laboratory, runs on roughly 60% of the world’s supercomputers. The software is in active use at major AI companies, including Meta Platforms, French AI startup Mistral, and Anthropic for elements of AI model training, Reuters reported.
Government supercomputers used for weather forecasting and national security research also depend on it. Nvidia acquired Slurm developer SchedMD in December 2025 and described the deal as a push to strengthen its open-source ecosystem and help users adopt newer AI techniques alongside traditional supercomputing work.
Is the concern valid?
Dr. Danish Faruqui, CEO of Fab Economics, a US-based AI hardware and datacenter advisory, said the risk was real.
“The skepticism that Nvidia may prioritize its own hardware in future software updates, potentially delaying or under-optimizing support for rivals, is a feasible outcome,” he said. As the primary developer, Nvidia now controls Slurm’s official development roadmap and code review process, Faruqui said, “which could influence how quickly competing chips are integrated on new development or continuous improvement elements.”
Owning the control plane alongside GPUs and networking infrastructure such as InfiniBand, he added, allows Nvidia to create a tightly vertically integrated stack that can lead to what he described as “shallow moats, where advanced features are only available or performant on Nvidia hardware.”
One concrete test of that, industry observers say, will be how quickly Nvidia integrates support for AMD’s next-generation chips into Slurm’s codebase compared with how quickly it integrates its own forthcoming hardware and networking technologies, such as InfiniBand.
Does the Bright Computing precedent hold?
Analysts point to Nvidia’s 2022 acquisition of Bright Computing as a reference point, saying the software became optimized for Nvidia chips in ways that disadvantaged users of competing hardware. Nvidia disputed that characterization, saying Bright Computing supports “nearly any CPU or GPU-accelerated cluster.”
Rawat said the comparison was instructive but imperfect. “Nvidia’s acquisition of Bright Computing highlights its preference for vertical integration, embedding Bright tightly into DGX and AI Factory stacks rather than maintaining a neutral, multi-vendor orchestration role,” he said. “This reflects a broader strategic pattern — Nvidia seeks to control the full-stack AI infrastructure experience.”
However, he said Slurm presented a fundamentally different challenge. “Deeply entrenched in supercomputing centers and academia, and effectively community-governed, Slurm carries high switching costs,” Rawat said. “Nvidia may influence but is unlikely to replicate the same tightly integrated control in markets dominated by established, neutral, and community-driven platforms.”
The open-source safety valve and its limits
Faruqui acknowledged that Slurm’s open-source licensing under a GNU GPL v2.0 licence offers some protection, including the community’s right to fork the project if Nvidia’s stewardship is seen as biased. But he cautioned that the option carried its own risks. “Slurm’s open-source status provides a safety valve with its limitations, but it is not a complete shield against vendor-neutrality,” he said.
The acquisition brought many of the world’s leading Slurm developers inside Nvidia, he noted, meaning a community-led fork would struggle to sustain the same pace of development.
Rawat described the situation as “a strategic dependency risk, not a crisis,” and said organisations should diversify GPU procurement, benchmark workloads across multiple vendor ecosystems, and develop internal expertise to modify or switch orchestration tools if needed.
Faruqui recommended that enterprise buyers negotiating Slurm support agreements seek service-level guarantees that apply equally to non-Nvidia hardware, covering response times, bug fixes, and feature parity across heterogeneous clusters. On architecture, he said organisations should consider containerising AI workloads to isolate applications from the underlying scheduler, making migration to alternative schedulers such as Flux or Kubernetes more feasible if required.
Enterprise developers question Claude Code’s reliability for complex engineering 7 Apr 2026, 4:56 am
When a coding assistant starts looking like it’s cutting corners, developers notice. A senior director in AMD’s AI Group has publicly needled Anthropic’s Claude Code for what she calls a tendency to skim the hard bits, offering answers that land but don’t quite stick.
The gripe isn’t about outright failure so much as fading rigor: complex problems draw responses that seem quicker, lighter, and a little too eager to move on. That slippage forced the executive and her team to stop using the pair-programming tool for complex engineering tasks, such as debugging hardware and kernel-level issues.
The concerns were detailed in a GitHub issues ticket that Stella Laurenzo filed, where she claims that a February update of the tool might have resulted in quality regression issues around its reasoning capabilities for complex tasks.
The ticket stems from her quantitative analysis of 17,871 thinking blocks and 234,760 tool calls across 6,852 session files spanning January to March, covering both pre- and post-update periods for comparison.
In her analysis, Laurenzo pointed out that the model gradually stopped reading code before making changes to it, a behavior she attributed to a loss of reasoning capability.
“When thinking is shallow, the model defaults to the cheapest action available: edit without reading, stop without finishing, dodge responsibility for failures, take the simplest fix rather than the correct one,” she wrote in the ticket.
The loss in reasoning, Laurenzo added, is a major hurdle for her team, whose workflow involves over 50 concurrent agent sessions doing systems programming in C and GPU drivers, with autonomous runs of more than 30 minutes making complex multi-file changes.
Laurenzo is not alone in raising these concerns. Several users commented on the ticket saying that they were having similar experiences as Laurenzo and her team.
Another user pointed to multiple subreddits highlighting similar degradation concerns, a comment that itself drew visible support from other developers through upvotes on GitHub.
Capacity crunch meets developer patience
That growing chorus of complaints has not gone unnoticed by analysts, who connected the issue to Anthropic’s ongoing capacity constraints.
“This is primarily a capacity and cost issue. Complex engineering tasks require significantly more compute, including intermediate reasoning steps. As usage increases, the system cannot sustain this level of compute for every request,” said Chandrika Dutt, research director at Avasant.
“As a result, the system limits how long a task runs or how much reasoning depth is applied and how many such tasks can run simultaneously,” Dutt added.
This is not the first instance where Anthropic had to deal with capacity constraints when it comes to Claude Code.
Last month, it started limiting usage across its Claude subscriptions to cope with rising demand that is stretching its compute capacity. The rationale then was that by accelerating how quickly users hit their session limits within these windows, Anthropic would be able to effectively redistribute access to prevent system overloads while still preserving overall weekly usage quotas.
Developers, much like in the case of the reasoning regression, had pushed back sharply against the rate limits imposed on Claude Code, arguing that the restrictions undercut its usefulness.
No exodus, but a slow erosion of trust
Taken together, the twin frustrations over rate limits and perceived reasoning regressions risk denting developer confidence in the platform. Rather than triggering a mass exodus, analysts say, they are likely to slow momentum and nudge enterprise users to hedge their bets with alternatives.
“This is not the kind of moment where users walk away overnight. It is far more subtle and far more dangerous than that. What is happening is a quiet shift in how much developers trust the system when the stakes are high. The loudest complaints are coming from teams that had already begun to rely on the system for serious, multi-step engineering work over extended sessions,” said Sanchit Vir Gogia, chief analyst at Greyhound Research.
“What has changed is not just the quality of outputs, but the way the system behaves while producing them. There is a noticeable drift from careful, step-by-step reasoning toward quicker, more reactive execution. That creates a cycle where engineers step in more often, interrupt more frequently, and end up doing the thinking the system was expected to handle,” Gogia pointed out.
That change, according to the analyst, will force teams to route complex or critical work elsewhere while keeping simpler tasks with Claude, which over time will erode the platform’s role from primary tool to optional tool.
Laurenzo, too, according to her GitHub ticket, is taking the route Gogia predicts: temporarily ditching Claude Code until Anthropic fixes the regression, and switching to an unnamed rival offering for now.
No easy escape hatch in a GPU-constrained world
However, Avasant’s Dutt isn’t hopeful about Laurenzo’s decision in the long run. She pointed out that rivals might start facing similar capacity constraints as Anthropic: “All frontier models operate under similar GPU and cost constraints. As usage scales, all providers will need to introduce throttling mechanisms, tiered access models, and trade-offs between speed, cost, and reasoning depth. This is structurally inevitable.”
That applies even more to reasoning regression, because the analyst sees maintaining deep reasoning at scale as a difficult challenge. She points to recent SWE-EVO 2025 benchmarks of AI coding agents, which show that success rates drop sharply for multi-step tasks, with failure rates often in the 60%–80% range, especially for execution-heavy scenarios.
Pay more, see more: the emerging AI trade-off?
As a fallback, though, Laurenzo is optimistic that Anthropic can course-correct, even suggesting, in her ticket, that the company introduce premium tiers that allow users to pay for greater reasoning capacity.
That might soon become a reality, both Dutt and Gogia said, as the industry is moving toward a consumption model where basic usage is treated differently from heavy, reasoning-intensive workloads.
Analysts also support Laurenzo’s other suggestions to Anthropic, which included transparency around thinking token allocation.
“Users need to understand what the system is doing under the hood. Not every detail, but enough to know whether the system actually reasoned through a problem or simply produced a quick answer. Today, users are forced to infer that from outcomes, which is why you are seeing users analyzing logs and behavior patterns. That should not be necessary,” Gogia said.
For now, though, Anthropic has yet to respond to Laurenzo’s GitHub ticket or assign it to anyone.
However, developers hoping for a quick fix, especially around capacity, may want to lower expectations until at least 2027, when new chips, in the form of Google TPUs manufactured by Broadcom, are due to join Anthropic’s fleet. Until more compute shows up, or the company decides who gets to use it at higher pricing, developers may be left refreshing threads, watching tokens get rationed, and waiting for reasoning to make a comeback.
How to destroy a company quickly 7 Apr 2026, 2:00 am
Too many executives are cutting software engineering teams because they bought into the fantasy that AI can now build and maintain enterprise applications with only a few people around to supervise the machine. That idea isn’t bold. It isn’t visionary. It’s reckless, and the executives who act on it will suffer consequences that go well beyond a bad quarter.
Yes, AI can write code. That much is clear. The problem is that many vendors and leaders have taken this fact and exaggerated it into something absurd: the idea that software engineering has become essentially optional. They believe that if a model can generate application logic, then experienced developers, architects, and performance engineers are suddenly unnecessary expenses. This kind of thinking might seem clever in a boardroom presentation, but it falls apart in real-world production.
How this story unravels
The applications often work, which makes this approach deceptively effective. The demo succeeds, and, at first, the feature seems to function properly. Everyone congratulates themselves. But then the system is deployed at scale and the cloud bill skyrockets. What used to cost $10,000 a month on AWS suddenly jumps to $300,000 or more. In the worst cases, companies face multimillion-dollar monthly cloud costs for systems that should never have been built that way in the first place.
AI can generate code, but it doesn’t grasp efficiency like experienced engineers do. It doesn’t prioritize cost-efficient architecture. It doesn’t instinctively avoid wasteful service calls, excessive data movement, poor caching, bad concurrency patterns, noisy database behavior, or compute-heavy nonsense that might look good in a code sample but fails in real-world use. It produces something plausible. However, it doesn’t deliver something financially responsible.
Then comes my favorite bad argument from the AI hype crowd: “Just optimize it afterward.” Fine. With whom? These companies fired the experts who understood complex systems, leaving behind AI-generated code no one fully understands. The remaining humans didn’t build it, don’t know its structure, and can’t safely modify it. They are trapped with applications they can run at an exorbitant price but not reliably maintain.
That isn’t innovation. That’s self-inflicted technical debt on an industrial scale.
Normally, technical debt creeps in over time. A rushed release here, a shortcut there, an old dependency nobody wants to touch. With AI-generated enterprise software, companies are creating years of technical debt in a matter of months. It’s almost impressive, in the worst possible way. They are compressing entire failure cycles because AI lets them build faster than they can think.
And now the frantic calls begin. Why is the app slow? Why are users complaining? Why are outages harder to diagnose? Why is the cloud bill out of control? Why can’t anyone fix this without causing something else to fail? Why doesn’t the AI coding promise look anything like the sales pitch?
Know the pros and cons of AI
That doesn’t mean AI is useless—far from it. AI can absolutely help software teams move faster. It can help with scaffolding, documentation, repetitive coding tasks, test generation, and even architectural brainstorming. In the hands of strong engineering teams, it is a legitimate accelerator. But somewhere along the way, too many executives decided that “accelerator” meant “replacement,” and the bad decisions began.
Good engineers are not valuable because they can type code into an editor. Good engineers are valuable because they understand systems. They understand trade-offs. They understand why one design choice creates future operational pain and another choice avoids it. They understand how software behaves after launch, under load, across regions, inside complex security and compliance environments, and on top of public cloud pricing models that punish inefficiency. AI does not replace that. It imitates fragments of it.
What makes this even worse is that too many companies incentivize the short term. The market loves a cost-cutting story. Announce layoffs or say “AI transformation” often enough and you may get a nice temporary stock bump. Executives know that. They also know that if the real damage shows up three or four quarters later, they can always blame execution, market conditions, or “unexpected complexities.” Meanwhile, the company’s engineering foundation is being hollowed out.
Don’t be the company that finds out too late that it has painted itself into an AI corner. The old human-built systems will still be around, but the people who understood them are gone. The new AI-built systems are expensive, fragile, and opaque. Rebuilding will cost a fortune. Rehiring talent will be difficult. Some employees will not come back, and I wouldn’t blame them.
I said this before, and it still holds true: AI is nowhere near replacing software engineers at the scale being promised. Not even close. The leaders who think otherwise are gullible, not brave. Worse, they are risking their companies for marketing stories pushed by people who profit from overstating the future.
In the next few years, I anticipate some difficult case studies. Some companies will quietly change direction. Others will spend a lot of money trying to fix issues. A few might shut down entirely because they made a fatal management mistake: They bought into the hype, fired the people who knew what they were doing, and handed control of systems to individuals who couldn’t truly manage them.
If companies want to avoid that outcome, the answer is straightforward. Keep your engineers, use AI to enhance their capabilities, and assign experienced architects to lead, enforce governance, control costs, and ensure maintainability. Treat AI as a tool and not a replacement for human judgment.
It’s easy for hype cycles to make lots of magical claims. Reality is less exciting. Look past the marketing spin to long-term implications, because reality is what pays the cloud bill.
What enterprise devops teams should learn from SaaS 7 Apr 2026, 2:00 am
Many enterprise devops teams struggle to deploy frequently, increase test automation, and ensure reliable releases. What can they learn from SaaS companies, where developing and deploying software for thousands of customers is core to their revenue and business operations?
SaaS companies must have robust testing, observability, deployment, and monitoring capabilities. One bad deployment can disrupt customer operations, unravel sales opportunities, and attract negative media coverage.
What’s most challenging is that many SaaS platforms are configurable and have low-code development capabilities. These platforms require robust test data sets and real-time monitoring to ensure that deployments don’t break functionality for customers. Impacting even a tiny fraction of customers is an unacceptable outcome.
Validating data entry forms and end-to-end workflows is a combinatorial problem that requires building robust data sets and testing a statistically significant sample of input patterns. Further, developing, integrating, and deploying AI agents and language models adds new complexities. In enterprises, testing open-ended AI agents with non-deterministic responses becomes a greater challenge as more organizations use third-party AI agents and move AI experiments into production.
I asked SaaS providers to share some of their devops secrets. As more enterprise devops teams develop, deploy, and support mission-critical apps, integrations, and data pipelines, how can they improve resiliency? Look to the practices of SaaS providers.
Aim for smart ‘customer’ upgrades
If you deploy an upgrade, will end users take advantage of the new capabilities, or will they be frustrated by defects that leaked into production? Enterprises must aim for smart customer upgrades that are seamless for end users, are deployed frequently, have low defect rates, avoid security issues, and drive adoption of new capabilities.
To consistently meet these objectives, enterprise devops teams must embrace a shift in mindset, away from legacy IT norms. They must recognize that:
- End users are customers and disrupting workflows affects business operations. Improving deployment frequency but shipping with defects is not a win for SaaS, nor should it be for enterprise IT.
- Deploying capabilities that few people notice, try out, and adopt is highly problematic. It implies that the team invested time and resources without delivering business value and likely introduced new technical debt.
- Deploying software is never a one-time effort, and agile teams should communicate their release management plans.
“The most successful devops teams realize that their internal platform is actually a specialized SaaS product where the developers are the primary customers,” says Sergio Rodríguez Inclán, senior devops engineer at Jalasoft. “By replacing rigid project deadlines with a commitment to continuous reliability and self-service automation, IT shifts from being a corporate bottleneck to a competitive advantage.”
One transformation many enterprises are undertaking is a shift to product-based IT, which helps group applications into products and assigns product managers to oversee their roadmaps. Most SaaS companies assign product managers to communicate product vision, define user personas, understand customer needs, prioritize features, and measure business outcomes.
Esko Hannula, SVP of robotics at Copado, says, “Modern enterprise IT should adopt the SaaS mindset. Software isn’t a project with an end date but a continuously improving product delivered through frequent, incremental releases.”
Hannula recommends reviewing the devops practices used by SaaS teams, including advanced CI/CD, continuous testing, canary releases, A/B testing, and data-guided product management, to be able to release whenever needed. “These practices matter because they create the confidence, agility, and quality necessary for rapid response to business change—outcomes that naturally follow from treating software as a long-lived product rather than a one-off project,” Hannula says.
Code less and test more
Developers in enterprise IT are using AI code generators and experimenting with vibe coding tools. Research shows that these tools can improve developer productivity by 30% or more. But will productivity translate into devops teams deploying features faster and more reliably?
Enterprise IT has a long history of underfunding testing and targeting big-bang deployments. SaaS companies do the opposite. They apply analytics in test automation, build synthetic test data sets, and use feature flagging to reduce the risks of deploying more frequently. The more advanced SaaS companies adopt continuous deployment, but this may be challenging to implement for many enterprises.
“Test automation may feel like an upfront cost, but it pays off quickly because more resilient services lead to fewer incidents, fewer support tickets, and lower operational overhead,” says Nikhil Mungel, Director of AI R&D at Cribl. “SaaS teams often de-risk launches by releasing features to small groups first and using observability to watch system vitals and user experience before broad release, typically via feature flags and bucketing. IT devops teams can mirror this by enabling ‘power users’ to opt in early, improving satisfaction while reducing support burden.”
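The bucketing Mungel describes is typically a stable hash of a user ID into a rollout percentage, so the same user lands in the same cohort on every request. A minimal sketch; the flag name and power-user list are illustrative:

```python
import hashlib

def in_rollout(flag: str, user_id: str, percent: float) -> bool:
    """Deterministically place a user in the first `percent`% for a flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF * 100  # uniform in 0..100
    return bucket < percent

# Illustrative policy: ship "new-search" to 5% of users,
# with opted-in power users always enabled.
power_users = {"u-42"}

def enabled(flag: str, user_id: str) -> bool:
    return user_id in power_users or in_rollout(flag, user_id, 5.0)
```

Because the hash includes the flag name, a user’s cohort for one feature is independent of their cohort for another, which keeps experiments from overlapping systematically.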
Secure in design, development, and deployment
The productivity improvement from code generators may come at a cost. The same study noted above, which showed improved developer productivity and code maintainability, also found that the generated code contained 23.7% more security vulnerabilities.
Shifting security left sounds straightforward, but in reality, it’s a broad agenda for enterprise devops teams that want security on equal grounds with dev and ops priorities. To become devsecops, agile teams must address security and compliance beyond application security, including cloud infrastructure hardening, identity management, data security, and AI model bias.
“SaaS teams win when they embed security into their codebase with dependencies that upgrade seamlessly,” says David Mytton, CEO of Arcjet. Mytton recommends these four practices:
- Test suites with automated security and privacy checks that flag when dependencies break.
- Standardized observability with structured traces and context-rich logs.
- Privacy-aware data management and in-app PII redaction with CI/CD gates.
- Feature flags and canary rollouts to move fast without breaking customers or compliance.
Priya Sawant, GM and VP of platform and infrastructure at ASAPP, adds that modern SaaS teams shift left by baking security, testing, access control, and observability into the design and CI/CD pipelines rather than patching them in at the end. “Automating permissions, enforcing golden-path pipelines, and delivering built-in observability removes friction, improves quality, and accelerates delivery. IT and devops teams that adopt this model move faster and scale more reliably than those stuck in manual approvals and reactive workflows,” Sawant says.
Start planning for resilient operations from Day 0
Back when I was a CIO, I once was asked what our Day 2 model was for a new application we were building. Day 2 is legacy terminology for when an application is deployed to production and requires support, as opposed to Day 1 (development) and Day 0 (planning).
SaaS teams have a very different mindset around operations, and they start planning for scalability, security, performance, and incident management from the outset in the architecture design.
For example, SaaS companies place a lot of emphasis on their developer experiences. Developers who are too busy tinkering with cloud configurations, manually patching components, or handling data management tasks can lose focus on customer needs.
“SaaS engineers use technologies that don’t overcomplicate things and let them move fast without wrestling with upgrade paths,” says Alejandro Duarte, developer relations engineer at MariaDB.
Duarte recommends choosing infrastructure that doesn’t slow down developers. For example, at the data layer, Duarte prioritizes systems that support native replication, vector storage, fast analytics, and automatic node recovery.
Define an observability strategy, then implement it
Another SaaS-inspired mindset shift is from Day 2 monitoring of applications as black boxes to Day 0 observability, providing ops teams with the details needed to aid incident management and root cause analysis. In enterprises, establishing observability standards is essential because operations teams track alerts and incidents across hundreds to thousands of applications.
“IT devops teams can learn from SaaS developers that observability isn’t just about monitoring systems after deployment—it’s about embedding real context into every stage of development,” says Noam Levy, founding engineer and field CTO at groundcover. “Modern observability tools, especially when paired with AI, help engineers anticipate regressions before they happen in production environments, guiding safer code changes and more reliable releases. This shift from reactive troubleshooting to proactive reliability mirrors how leading SaaS teams continuously refine and reinforce trust in their software.”
The importance of observability was a common theme among SaaS leaders, and many standardize it as a devops non-negotiable. But logging every bit of information can become expensive and complex, especially when AI agents log all interactions.
“As AI-driven systems generate exponentially more logs, metrics, and traces, tightly coupled observability stacks can’t keep enough data hot without driving up costs or offloading it into slow, hard-to-query cold storage,” says Eric Tschetter, chief architect at Imply. “With an observability warehouse as the scalable data layer, teams keep telemetry data accessible at scale without increasing costs.”
Ang Li, director of engineering at Observe, shares a good rule that SaaS teams use to decide what information to include in their standards. “SaaS engineering teams design observability around users and workflows, not just whether systems are up or down. IT devops can apply the same thinking, moving beyond uptime monitoring to instrumenting critical business transactions to better understand user impact, limit blast radius, and recover faster,” says Li.
Key takeaways
We can distill two key takeaways for enterprise devops teams from the recommendations our experts shared above. First, apply product management practices, focus on features that matter, and develop robust testing. Second, shift left in practices, not just culture, by considering observability, security, and resiliency as part of the solution’s architecture.
Rust team warns of WebAssembly change 6 Apr 2026, 9:51 pm
WebAssembly targets for Rust will soon face a change that could risk breaking existing projects, according to an April 4 bulletin in the official Rust blog. The bulletin notes that all WebAssembly targets in Rust have been linked using the --allow-undefined flag to wasm-ld, but this flag is being removed.
Removing --allow-undefined on wasm targets is being done in rust-lang/rust#149868. That change is slated to land in nightly builds soon and will be released with Rust 1.96 on 2026-05-28. The bulletin explains that all WebAssembly binaries in Rust are created by linking with wasm-ld, thus serving a similar purpose to ld, lld, and mold. Since the first introduction of WebAssembly targets in Rust, the --allow-undefined flag has been passed to wasm-ld.
However, by passing --allow-undefined on all WebAssembly targets, rustc introduces diverging behavior between other platforms and WebAssembly, the bulletin says. The main risk of --allow-undefined is that misconfiguration or mistakes in building can result in broken WebAssembly modules being produced, as opposed to compilation errors. The bulletin lists the following example problematic situations:
- If mylibrary_init was mistakenly typed as mylibraryinit, then the final binary would import the mylibraryinit symbol instead of calling the linked mylibrary_init C symbol.
- If mylibrary was mistakenly not compiled and linked into a final application, then the mylibrary_init symbol would end up imported rather than producing a linker error saying it’s undefined.
- If external tools are used to process a WebAssembly module, such as wasm-bindgen or wasm-tools component new, they are likely to provide an error message that isn’t clearly connected back to the original source code from which the symbols were imported.
- Web errors along the lines of 'Uncaught TypeError: Failed to resolve module specifier "env". Relative references must start with either "/", "./", or "../".' can mean that "env" leaked into the final module unexpectedly, and the true error is the undefined symbol, not the lack of "env" items provided.
All native platforms consider undefined symbols to be an error by default. Therefore, by passing --allow-undefined, rustc introduces surprising behavior on WebAssembly targets. The goal of the change is to remove this surprise so that WebAssembly behaves more like native platforms, the bulletin states.
In theory, however, not a lot is expected to break from this change, the bulletin concludes. If the final WebAssembly binary imports unexpected symbols, then it’s likely the binary won’t be runnable in the desired embedding, as the desired embedding probably doesn’t provide the symbol as a definition. Therefore, most of the time this change will not break users, but will instead provide better diagnostics.
Visual Studio Code 1.114 streamlines AI chat 6 Apr 2026, 2:13 pm
Microsoft has released Visual Studio Code 1.114. The update of Microsoft’s popular code editor streamlines the AI chat experience, offering previews of videos in the image carousel for chat attachments, adding a Copy Final Response command to the chat context menu, simplifying semantic searches of codebases by GitHub Copilot, and more.
Introduced April 1, VS Code 1.114 can be downloaded from the project website.
With VS Code 1.114, the image carousel, introduced in version 1.113, now also supports videos. Developers can play and navigate videos from chat attachments or the Explorer context menu. The viewer provides controls and navigation for images and videos using arrows or thumbnails. There is also a new Copy Final Response command in the chat context menu that copies the last Markdown section of the agent’s response, after tool calls have run.
For simplifying workspace searches, the #codebase tool now is used exclusively for semantic searches. Previously, #codebase could fall back to less accurate and less efficient fuzzy text searches. The agent can still do text and fuzzy searches, but Microsoft intends to keep #codebase purely focused on semantic searches. Microsoft also simplified how the codebase index is managed.
Elsewhere in VS Code 1.114:
- A preview feature for troubleshooting previous chat sessions allows developers to reference any previous chat session when troubleshooting. This makes it easier to investigate issues after the fact, without needing to reproduce them, Microsoft said.
- TypeScript and JavaScript support now extends to TypeScript 6.0, which was introduced March 23.
- The Python Environments extension now recommends the community Pixi extension when Pixi environments are detected, and includes Pixi in the environment manager priority order.
- Administrators now can use a group policy to disable Anthropic Claude agent integration in chat. When this policy is applied, the github.copilot.chat.claudeAgent.enabled setting is managed by the organization and users cannot enable the Claude agent.
- A proposed API for fine-grained tool approval allows language model tools with an approval flow to scope approval to a specific combination of arguments, so that users approve each command individually.
VS Code 1.114 is part of a change that began in March, when Microsoft moved from monthly to weekly VS Code releases. Under the new cadence, VS Code 1.115 is likely to be released any day now.
How to choose the best LLM using R and vitals 6 Apr 2026, 9:14 am
Is your generative AI application giving the responses you expect? Are there less expensive large language models—or even free ones you can run locally—that might work well enough for some of your tasks?
Answering questions like these isn’t always easy. Model capabilities seem to change every month. And, unlike conventional computer code, LLMs don’t always give the same answer twice. Running and rerunning tests can be tedious and time consuming.
Fortunately, there are frameworks to help automate LLM tests. These LLM “evals,” as they’re known, are a bit like unit tests on more conventional computer code. But unlike unit tests, evals need to understand that LLMs can answer the same question in different ways, and that more than one response may be correct. In other words, this type of testing often requires the ability to analyze flexible criteria, not simply check if a given response equals a specific value.
The vitals package, based on Python's Inspect framework, brings automated LLM evals to the R programming language. Vitals was designed to integrate with the ellmer R package, so you can use them together to evaluate prompts, AI applications, and how different LLMs affect both performance and cost. In one case, it helped show that AI agents often ignore information in plots when it goes against their expectations, according to package author Simon Couch, a senior software engineer at Posit. Couch said over email that the experiment, done using a set of vitals evaluations dubbed bluffbench, “really hit home for some folks.”
Couch is also using the package to measure how well different LLMs write R code.
Vitals setup
You can install the vitals package from CRAN or, if you want the development version, from GitHub with pak::pak("tidyverse/vitals"). As of this writing, you’ll need the dev version to access several features used in examples for this article, including a dedicated function for extracting structured data from text.
Vitals uses a Task object to create and run evals. Each task needs three pieces: a dataset, a solver, and a scorer.
Dataset
A vitals dataset is a data frame with information about what you want to test. That data frame needs at least two columns:
- input: The request you want to send to the LLM.
- target: How you expect the LLM to respond.
The vitals package includes a sample dataset called are. That data frame has a few more columns, such as id (which is never a bad idea to include in your data), but these are optional.
As Couch told posit::conf attendees a few months ago, one of the easiest ways to create your own input-target pairs for a dataset is to type what you want into a spreadsheet. Set up spreadsheet columns with “input” and “target,” add what you want, then read that spreadsheet into R with a package like googlesheets4 or rio.

Example of a spreadsheet to create a vitals dataset with input and target columns.
Sharon Machlis
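Any tabular source works for this. Below is a minimal, self-contained sketch that builds the two required columns from a CSV file; the file path and prompts are invented for illustration, and the same read-in step would apply to a sheet exported from googlesheets4 or rio.

```r
# Build a vitals-style dataset from a CSV that has "input" and "target" columns.
# The prompts below are made-up examples.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  'input,target',
  '"Write me a haiku about winter","A 5-7-5 syllable poem about winter"',
  '"What is 2 + 2?","4"'
), csv_path)

demo_dataset <- read.csv(csv_path, stringsAsFactors = FALSE)
names(demo_dataset)  # "input" "target"
nrow(demo_dataset)   # 2
```

The resulting data frame can be passed straight to a task as its dataset argument.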
Below is the R code for three simple queries I’ll use to test out vitals. The code creates an R data frame directly, if you’d like to copy and paste to follow along. This dataset asks an LLM to write R code for a bar chart, determine the sentiment of some text, and create a haiku.
my_dataset <- data.frame(
  id = c("barchart", "sentiment-analysis", "haiku"),
  input = c(
    "...",  # bar chart query (truncated in the original)
    "... This desktop computer has a better processor and can handle much more demanding tasks such as running LLMs locally. However, it's also noisy and comes with a lot of bloatware.",
    "Write me a haiku about winter"
  ),
  target = c(
    'Example solution: ```library(ggplot2)\r\nlibrary(scales)\r\nsample_data ...',  # (truncated in the original)
    "Mixed",
    "..."  # haiku target (truncated in the original)
  )
)
Next, I’ll load my libraries and set a logging directory for when I run evals, since the package will suggest you do that as soon as you load it:
library(vitals)
library(ellmer)
vitals_log_dir_set("./logs")
Here’s the start of setting up a new Task with the dataset, although this code will throw an error without the other two required arguments of solver and scorer.
my_task <- Task$new(dataset = my_dataset)
If you’d rather use a ready-made example, you can use dataset = are with its seven R tasks.
It can take some effort to come up with good sample targets. The classification example was simple, since I wanted a single-word response: “Mixed.” But other queries can have more free-form responses, such as writing code or summarizing text. Don’t rush through this part: if you want your automated “judge” to grade accurately, it pays to design your acceptable responses carefully.
Solver
The second part of the task, the solver, is the R code that sends your queries to an LLM. For simple queries, you can usually just wrap an ellmer chat object with the vitals generate() function. If your input is more complex, such as needing to call tools, you may need a custom solver. For this part of the demo, I’ll use a standard solver with generate(). Later, we’ll add a second solver with generate_structured().
It helps to be familiar with the ellmer R package when using vitals. Below is an example of using ellmer without the vitals package, with my_dataset$input[1], the first query in my dataset data frame, as my prompt. This code returns an answer to the question but doesn’t evaluate it.
Note: You’ll need an OpenAI key if you want to run this specific code. Or you can change the model (and API key) to any other LLM from a provider ellmer supports. Make sure to store any needed API keys for other providers. For the LLM, I chose OpenAI’s least expensive current model, GPT-5 nano.
my_chat <- chat_openai(model = "gpt-5-nano")
my_chat$chat(my_dataset$input[1])
You can turn that my_chat ellmer chat object into a vitals solver by wrapping it in the generate() function:
# This code won't run yet without the task's third required argument, a scorer
my_task <- Task$new(
  dataset = my_dataset,
  solver = generate(my_chat)
)
The Task object knows to use the input column from your dataset as the question to send to the LLM. If the dataset holds more than one query, generate() handles processing them.
Scorer
Finally, we need a scorer. As the name implies, the scorer grades the result. Vitals has several different types of scorer. Two of them use an LLM to evaluate results, sometimes referred to as “LLM as a judge.” One of vitals’ LLM-as-a-judge options, model_graded_qa(), checks how well the solver answered a question. The other, model_graded_fact(), “determines whether a solver includes a given fact in its response,” according to the documentation. Other scorers look for string patterns, such as detect_exact() and detect_includes().
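The string-pattern scorers are conceptually simple. As a plain-R illustration of the idea (not vitals’ actual implementation), an “includes”-style check reduces to fixed-string matching:

```r
# Toy "includes" scorer: returns "C" (correct) if the response contains
# the target string, "I" (incorrect) otherwise. Illustration only.
score_includes <- function(response, target) {
  if (grepl(target, response, fixed = TRUE)) "C" else "I"
}

score_includes("The sentiment here is Mixed overall.", "Mixed")  # "C"
score_includes("The sentiment is positive.", "Mixed")            # "I"
```

Because no LLM is involved, scorers like these are fast, free, and deterministic.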
Some research shows that LLMs can do a decent job in evaluating results. However, like most things involving generative AI, I don’t trust LLM evaluations without human oversight.
Pro tip: If you’re testing a small, less capable model in your eval, you don’t want that model also grading the results. Vitals defaults to using the same LLM you’re testing as the scorer, but you can specify another LLM to be your judge. I usually want a top-tier frontier LLM for my judge unless the scoring is straightforward.
Here’s what the syntax might look like if we were using Claude Sonnet as a model_graded_qa() scorer:
scorer = model_graded_qa(scorer_chat = chat_anthropic(model = "claude-sonnet-4-6"))
Note that this scorer defaults to setting partial credit to FALSE—either the answer is 100% accurate or it’s wrong. However, you can choose to allow partial credit if that makes sense for your task, by adding the argument partial_credit = TRUE:
scorer = model_graded_qa(partial_credit = TRUE, scorer_chat = chat_anthropic(model = "claude-sonnet-4-6"))
I started with Sonnet 4.5 as my scorer, without partial credit. It got one of the gradings wrong, giving a correct score to R code that did most things right for my bar chart but didn’t sort by descending order. I also tried Sonnet 4.6, released just this week, but it also got one of the grades wrong.
Opus 4.6 is more capable than Sonnet, but it’s also about 67% pricier at $5 per million tokens input and $25 per million output. Which model and provider you choose depends in part on how much testing you’re doing, how much you like a specific LLM for understanding your work (Claude has a good reputation for writing R code), and how important it is to accurately evaluate your task. Keep an eye on your usage if cost is an issue. If you’d rather not spend any money following the examples in this tutorial, and you don’t mind using less capable LLMs, check out GitHub Models, which has a free tier. ellmer supports GitHub Models with chat_github(), and you can also see available LLMs by running models_github().
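Since these providers all price per million tokens, a rough cost estimate is simple arithmetic. This small helper (my own, not part of vitals) uses the Opus 4.6 prices quoted above:

```r
# Estimate cost in dollars from token counts and per-million-token prices.
est_cost <- function(input_tokens, output_tokens, price_in, price_out) {
  (input_tokens * price_in + output_tokens * price_out) / 1e6
}

# e.g., 40,000 input and 6,000 output tokens at Opus 4.6 rates ($5 in, $25 out):
est_cost(40000, 6000, price_in = 5, price_out = 25)  # 0.35
```

If your provider reports token counts after a run, you can plug real numbers into an estimate like this before scaling up to many epochs.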
Below, I’ve added model_graded_qa() scoring to my_task, and I also included a name for the task. However, I’d suggest not adding a name to your task if you plan to clone it later to try a different model. Cloned tasks keep their original name, and as of this writing, there’s no way to change that.
my_task <- Task$new(
  dataset = my_dataset,
  solver = generate(my_chat),
  scorer = model_graded_qa(scorer_chat = chat_anthropic(model = "claude-sonnet-4-6")),
  name = "my-eval"  # placeholder name
)
Now, my task is ready to use.
Run your first vitals task
You execute a vitals task with the task object’s $eval() method:
my_task$eval()
The eval() method launches five separate methods: $solve(), $score(), $measure(), $log(), and $view(). After it finishes running, a built-in log viewer should pop up. Click on the hyperlinked task to see more details:

Details on a task run in vitals’ built-in viewer. You can click each sample for additional info.
Sharon Machlis
“C” means correct and “I” stands for incorrect; there could also have been a “P” for partially correct had I allowed partial credit.
If you want to see a log file in that viewer later, you can invoke the viewer again with vitals_view("your_log_directory"). The logs are just JSON files, so you can view them in other ways, too.
You’ll probably want to run an eval multiple times, not just once, to feel more confident that an LLM is reliable and didn’t just get lucky. You can set multiple runs with the epochs argument:
my_task$eval(epochs = 10)
The accuracy of bar chart code on one of my 10-epoch runs was 70%—which may or may not be “good enough.” Another time, that rose to 90%. If you want a true measure of an LLM’s performance, especially when it’s not scoring 100% on every run, you’ll want a good sample size; margin of error can be significant with just a few tests. (For a deep dive into statistical analysis of vitals results, see the package’s analysis vignette.)
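To make that margin of error concrete, base R’s binom.test() gives an exact confidence interval for a pass rate (a statistical aside, not a vitals feature):

```r
# Exact (Clopper-Pearson) 95% confidence interval for 7 correct out of 10 runs.
ci <- binom.test(7, 10)$conf.int
round(ci, 2)  # roughly 0.35 to 0.93: a 70% score on 10 runs is very uncertain

# With 100 runs at the same observed rate, the interval tightens considerably.
round(binom.test(70, 100)$conf.int, 2)  # roughly 0.60 to 0.79
```

In other words, a 70% score over 10 epochs is statistically compatible with a true success rate anywhere from about one in three to better than nine in ten.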
It cost about 14 cents to use Sonnet 4.6 as a judge versus 27 cents for Opus 4.6 on 11 total epoch runs of three queries each. (Not all these queries even needed an LLM for evaluation, though, if I were willing to separate the demo into multiple task objects. The sentiment analysis was just looking for “Mixed,” which is simpler scoring.)
The vitals package includes a function that can format the results of a task’s evaluation as a data frame: my_task$get_samples(). If you like this formatting, save the data frame while the task still exists in your R session:
results_df <- my_task$get_samples()
You may also want to save the Task object itself.
If there’s an API glitch while you’re running your input queries, the entire run will fail. If you want to run a test for a lot of epochs, you may want to break it up into smaller groups so as not to risk wasting tokens (and time).
Swap in another LLM
There are several ways to run the same task with a different model. First, create a new chat object with that different model. Here’s the code for checking out Google Gemini 3 Flash Preview:
my_chat_gemini <- chat_google_gemini(model = "gemini-3-flash-preview")
Then you can run the task in one of three ways.
1. Clone an existing task and add the chat as its solver with $set_solver():
my_task_gemini <- my_task$clone()
my_task_gemini$set_solver(generate(my_chat_gemini))
2. Clone an existing task and add the new chat as a solver when you run it:
my_task_gemini <- my_task$clone()
my_task_gemini$eval(solver_chat = my_chat_gemini)
3. Create a new task from scratch, which allows you to include a new name:
my_task_gemini <- Task$new(
  dataset = my_dataset,
  solver = generate(my_chat_gemini),
  scorer = model_graded_qa(scorer_chat = chat_anthropic(model = "claude-sonnet-4-6")),
  name = "gemini-eval"  # placeholder name
)
Make sure you’ve set your API key for each provider you want to test, unless you’re using a platform that doesn’t need them, such as local LLMs with ollama.
View multiple task runs
Once you’ve run multiple tasks with different models, you can use the vitals_bind() function to combine the results:
both_tasks <- vitals_bind(my_task, my_task_gemini)

Example of combined task results running each LLM with three epochs.
Sharon Machlis
This returns an R data frame with columns for task, id, epoch, score, and metadata. The metadata column contains a data frame in each row with columns for input, target, result, solver_chat, scorer_chat, scorer_metadata, and scorer.
To flatten the input, target, and result columns and make them easier to scan and analyze, I un-nested the metadata column with:
library(tidyr)
both_tasks_wide <- both_tasks |>
  unnest_longer(metadata) |>
  unnest_wider(metadata)
I was then able to run a quick script to cycle through each bar-chart result code and see what it produced:
library(dplyr)
# Some results are surrounded by markdown and that markdown code needs to be removed or the R code won't run
extract_code <- ...  # helper that strips the markdown fences (truncated in the original)
barchart_results <- both_tasks_wide |>
  filter(id == "barchart")
# Loop through each result
for (i in seq_len(nrow(barchart_results))) {
  code_to_run <- ...  # (truncated in the original)
}
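For reference, a fence-stripping helper like extract_code can be just a couple of sub() calls. This is one plausible sketch; the regex and the eval step are my assumptions, not the article’s exact code:

```r
# Strip a leading/trailing markdown code fence (three backticks, optionally
# followed by "r") from a string. Assumed implementation, for illustration.
extract_code <- function(x) {
  x <- sub("^\\s*```[rR]?\\s*\\n?", "", x)  # opening fence
  sub("\\n?```\\s*$", "", x)                # closing fence
}

snippet <- "```r\nbarplot(c(3, 1, 2))\n```"
extract_code(snippet)  # "barplot(c(3, 1, 2))"

# The cleaned string could then be executed with, e.g.:
# eval(parse(text = extract_code(snippet)))
```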
Test local LLMs
This is one of my favorite use cases for vitals. Currently, models that fit into my PC’s 12GB of GPU RAM are rather limited. But I’m hopeful that small models will soon be useful for more tasks I’d like to do locally with sensitive data. Vitals makes it easy for me to test new LLMs on some of my specific use cases.
vitals (via ellmer) supports ollama, a popular way of running LLMs locally. To use it, download and install the ollama application, then run it either as the desktop app or from a terminal window. The syntax is ollama pull to download an LLM, or ollama run to both download a model and start a chat, if you’d like to make sure the model works on your system. For example: ollama pull ministral-3:14b.
The rollama R package lets you download a local LLM for ollama within R, as long as ollama is running. The syntax is rollama::pull_model("model-name"). For example, rollama::pull_model("ministral-3:14b"). You can test whether R can see ollama running on your system with rollama::ping_ollama().
I also pulled Google’s gemma3-12b and Microsoft’s phi4, then created tasks for each of them with the same dataset I used before. Note that as of this writing, you need the dev version of vitals to handle LLM names that include colons (the next CRAN version after 0.2.0 should handle that, though):
# Create chat objects
ministral_chat <- chat_ollama(model = "ministral-3:14b")
# (chat objects and tasks for the other local models follow the same pattern; truncated in the original)
All three local LLMs nailed the sentiment analysis, and all did poorly on the bar chart. Some code produced bar charts but not with axes flipped and sorted in descending order; other code didn’t work at all.

Results of one run of my dataset with five local LLMs.
Sharon Machlis
R code for the results table above:
library(dplyr)
library(gt)
library(scales)
# Prepare the data
plot_data <- ... |>  # combined local-LLM task results (truncated in the original)
rename(LLM = task, task = id) |>
group_by(LLM, task) |>
summarize(
pct_correct = mean(score == "C") * 100,
.groups = "drop"
)
color_fn <- ...  # a color palette function over 0-100 (truncated in the original)
plot_data |>
tidyr::pivot_wider(names_from = task, values_from = pct_correct) |>
gt() |>
tab_header(title = "Percent Correct") |>
cols_label(`sentiment-analysis` = html("sentiment-<br>analysis")) |>
data_color(
columns = -LLM,
fn = color_fn
)
It cost me 39 cents for Opus to judge these local LLM runs—not a bad bargain.
Update (April 6, 2026): I used vitals and the same three-item dataset to test several LLMs from Google’s new Gemma 4 open-weight, commercially permissive family announced on April 2.
While the 4b version did about the same as other local LLMs I’ve tried, gemma-4-26b scored a surprising 100% when I ran it six times.
Note that although it ran at an acceptable speed in Ollama, gemma-4-26b was a tight fit in my PC’s memory when running vitals inside RStudio. In fact, it choked when I tried to run multiple epochs at once, so I ended up running only one test at a time.
Also important: I set up the ellmer chat object to turn off the model’s “thinking.” The code:
chat_gemma_26b <- chat_ollama(model = "gemma-4:26b", api_args = list(think = FALSE))
Extract structured data from text
Vitals has a special function for extracting structured data from plain text: generate_structured(). It requires both a chat object and a defined data type you want the LLM to return. As of this writing, you need the development version of vitals to use the generate_structured() function.
First, here’s my new dataset to extract topic, speaker name and affiliation, date, and start time from a plain-text description. The more complex version asks the LLM to convert the time zone to Eastern Time from Central European Time:
extract_dataset <- data.frame(
  input = c(
    # Simpler version (the instruction text is truncated in the original)
    "... R Package Development in Positron\r\nThursday, January 15th, 18:00 - 20:00 CET (Rome, Berlin, Paris timezone) \r\nStephen D. Turner is an associate professor of data science at the University of Virginia School of Data Science. Prior to re-joining UVA he was a data scientist in national security and defense consulting, and later at a biotech company (Colossal, the de-extinction company) where he built and deployed scores of R packages.",
    "Extract the workshop topic, speaker name, speaker affiliation, date in 'yyyy-mm-dd' format, and start time in Eastern Time zone in 'hh:mm ET' format from the text below. (TZ is the time zone). Assume the date year makes the most sense given that today's date is February 7, 2026. Return ONLY those entities in the format {topic}, {speaker name}, {date}, {start_time}. Convert the given time to Eastern Time if required. R Package Development in Positron\r\nThursday, January 15th, 18:00 - 20:00 CET (Rome, Berlin, Paris timezone) \r\nStephen D. Turner is an associate professor of data science at the University of Virginia School of Data Science. Prior to re-joining UVA he was a data scientist in national security and defense consulting, and later at a biotech company (Colossal, the de-extinction company) where he built and deployed scores of R packages. "
),
target = c(
"R Package Development in Positron, Stephen D. Turner, University of Virginia (or University of Virginia School of Data Science), 2026-01-15, 18:00. OR R Package Development in Positron, Stephen D. Turner, University of Virginia (or University of Virginia School of Data Science), 2026-01-15, 18:00 CET.",
"R Package Development in Positron, Stephen D. Turner, University of Virginia (or University of Virginia School of Data Science), 2026-01-15, 12:00 ET."
)
)
Below is an example of how to define a data structure using ellmer’s type_object() function. Each of the arguments gives the name of a data field and its type (string, integer, and so on). I’m specifying I want to extract a workshop_topic, speaker_name, current_speaker_affiliation, date (as a string), and start_time (also as a string):
my_object <- type_object(
  workshop_topic = type_string(),
  speaker_name = type_string(),
  current_speaker_affiliation = type_string(),
  date = type_string(),
  start_time = type_string()
)
Next, I’ll use the chat objects I created earlier in a new structured data task, using Sonnet as the judge since grading is straightforward:
my_task_structured <- Task$new(
  dataset = extract_dataset,
  solver = generate_structured(my_chat, my_object),
  scorer = model_graded_qa(scorer_chat = chat_anthropic(model = "claude-sonnet-4-6"))
)
It cost me 16 cents for Sonnet to judge 15 evaluation runs of two queries and results each.
Here are the results:

How various LLMs fared on extracting structured data from text.
Sharon Machlis
I was surprised that a local model, Gemma, scored 100%. I wanted to see if that was a fluke, so I ran the eval another 17 times for a total of 20. Weirdly, it missed on two of the 20 basic extractions by giving the title as “R Package Development” instead of “R Package Development in Positron,” but scored 100% on the more complex ones. I asked Claude Opus about that, and it said my “easier” task was more ambiguous for a less capable model to understand. Important takeaway: Be as specific as possible in your instructions!
Still, Gemma’s results were good enough on this task for me to consider testing it on some real-world entity extraction tasks. And I wouldn’t have known that without running automated evaluations on multiple local LLMs.
Conclusion
If you’re used to writing code that gives predictable, repeatable responses, a script that generates different answers each time it runs can feel unsettling. While there are no guarantees when it comes to predicting an LLM’s next response, evals can increase your confidence in your code by letting you run structured tests with measurable responses, instead of testing via manual, ad-hoc queries. And, as the model landscape keeps evolving, you can stay current by testing how newer LLMs perform—not on generic benchmarks, but on the tasks that matter most to you.
Learn more about the vitals R package
- Visit the vitals package website.
- Use the are dataset on GitHub to run evaluations on various LLMs to see how they perform writing R code.
- View Simon Couch’s presentation at posit::conf(2025).
Databricks launches AiChemy multi-agent AI for drug discovery 6 Apr 2026, 4:28 am
Databricks has outlined a reference architecture for a multi-agent AI system, named AiChemy, that combines internal enterprise data on its platform with external scientific databases via the Model Context Protocol (MCP) to accelerate drug discovery tasks such as target identification and candidate evaluation.
These early-stage steps are critical in drug development because they help pharma companies determine which biological mechanisms to pursue and which compounds are worth advancing, directly influencing the cost, time, and likelihood of success in later clinical stages.
The multi-agent AI system is built on Databricks components, including its Data Intelligence Platform, Delta Lake, and Mosaic AI (with Agent Bricks), which together manage and govern enterprise data while enabling the creation and orchestration of domain-specific agents and “skills.”
These skills include instructions for querying and summarizing scientific literature, retrieving chemical and molecular data, performing similarity searches across compounds, and synthesizing evidence across sources.
The system combines these agents and skills with external data sources such as OpenTargets, PubMed, and PubChem, accessed via MCP, allowing agents to retrieve and reason over both proprietary and public scientific data.
In doing so, AiChemy brings data access, orchestration, and analysis together in a single, governed environment, which Databricks says will help researchers at pharma companies surface relevant insights from disparate datasets without losing context, in turn accelerating tasks like target identification and candidate evaluation.
Underpinning the entire system is a supervisor agent that coordinates how individual agents and skills are used to fulfill a query.
Databricks describes this supervisor agent not as a prepackaged component, but as a pattern that enterprise teams can implement using its Mosaic AI and Agent Bricks tooling.
Enterprise teams building such a supervisor agent, according to a Databricks blog post, would need to start by defining and implementing domain-specific skills, such as literature search, compound lookup, or data synthesis, and registering them so they can be programmatically invoked.
Developers then would need to configure the supervisor agent with instructions or policies that determine how it selects and sequences these skills in response to a query, including how tasks are decomposed and routed, the company wrote in the blog post.
This setup is typically tied to enterprise and external data sources via MCP, with access controls and governance applied through Databricks’ platform, it added.
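The register-then-route pattern described above can be sketched in a few lines of Python. This is an illustrative simplification under stated assumptions: the skill names and keyword-based routing are hypothetical, and a real implementation would register skills through Mosaic AI and Agent Bricks tooling and use an LLM to plan the decomposition.

```python
# Illustrative sketch of the supervisor-agent pattern: register skills,
# decompose a query into a plan, invoke skills in order. All names and
# routing rules here are hypothetical, not Databricks' actual API.
from typing import Callable

class Supervisor:
    def __init__(self):
        self.skills: dict[str, Callable[[str], str]] = {}

    def register(self, name: str, fn: Callable[[str], str]) -> None:
        """Register a domain-specific skill so it can be invoked programmatically."""
        self.skills[name] = fn

    def route(self, query: str) -> list[str]:
        """Decompose a query into skill names using simple keyword policies."""
        plan = []
        if "literature" in query:
            plan.append("literature_search")
        if "compound" in query:
            plan.append("compound_lookup")
        plan.append("synthesize")  # always finish by synthesizing evidence
        return plan

    def run(self, query: str) -> list[str]:
        return [self.skills[name](query) for name in self.route(query)]

sup = Supervisor()
sup.register("literature_search", lambda q: f"papers for: {q}")
sup.register("compound_lookup", lambda q: f"compounds for: {q}")
sup.register("synthesize", lambda q: f"summary of: {q}")
results = sup.run("find literature on kinase compound X")
```

In a production system the routing step is itself a model call, and each skill would be backed by MCP connections to sources like PubMed or PubChem.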
The AiChemy initiative builds on earlier Databricks efforts in healthcare and drug discovery.
In June 2025, the company partnered with Atropos Health to combine real-world clinical data with its Data Intelligence Platform to support evidence generation and accelerate research workflows.
A month later, in July 2025, it announced a partnership with TileDB focused on integrating multimodal scientific data, such as genomics, imaging, and clinical records, to enable AI-driven analysis for drug discovery and clinical insights.
The AiChemy reference architecture, Databricks said, has been made available through a web application and a GitHub repository, where developers can explore the system and adapt it to their own use cases using its Agent Bricks framework.
Multi-agent AI is the new microservices 6 Apr 2026, 2:00 am
We just can’t seem to help ourselves. Our current infatuation with multi-agent systems risks mistaking a useful pattern for an inevitable future, just as we once did with microservices. Remember those? For some good (and bad) reasons, we took workable applications, broke them into a confusing cloud of services, and then built service meshes, tracing stacks, and platform teams just to manage the complexity we’d created. Yes, microservices offered real advantages, as I’ve argued. But also, you don’t need to “run like Google” unless you actually have Google’s problems. (Spoiler alert: You don’t.)
Now we’re about to make the same mistake with AI.
Every agent demo seems to feature a planner agent, a researcher agent, a coder agent, a reviewer agent, and (why not?) an agent whose sole job is to feel good about the architecture diagram. This doesn’t mean multi-agent systems are bad; they’re simply prescribed more broadly than is wise, just as microservices were.
So when should you embrace a multi-agent approach?
A real pattern, with a hype tax
Even the companies building the frontier models are practically begging developers not to use them promiscuously. In its 2024 guide to building effective agents, Anthropic explicitly recommends finding “the simplest solution possible” and says that might mean not building an agentic system at all. More pointedly, Anthropic says that for many applications, optimizing single LLM calls with retrieval and in-context examples is usually enough. It also warns that frameworks can create layers of abstraction that obscure prompts and responses, make systems harder to debug, and tempt developers to add complexity when a simpler setup would suffice. Santiago Valdarrama put the same idea more bluntly: “Not everything is an agent,” he stresses, and “99% of the time, what you need is regular code.”
That’s not anti-agent. It’s engineering discipline.
OpenAI lands in roughly the same place. Its practical guide recommends maximizing a single agent’s capabilities first because one agent plus tools keeps complexity, evaluation, and maintenance more manageable. It explicitly suggests prompt templates as a way to absorb branching complexity without jumping to a multi-agent framework. Microsoft is similarly blunt: If the use case does not clearly cross security or compliance boundaries, involve multiple teams, or otherwise require architectural separation, start with a single-agent prototype. It even cautions that “planner,” “reviewer,” and “executor” roles do not automatically justify multiple agents, because one agent can often emulate those roles through persona switching, conditional prompting, and tool permissioning. Google, for its part, adds a particularly useful nuance here, warning that the wrong choice between a sub-agent and an agent packaged as a tool can create massive overhead. In other words, sometimes you don’t need another teammate. You need a function with a clean contract.
Microsoft makes one more point that deserves extra attention: Many apparent scale problems stem from retrieval design, not architecture. So, before you add more agents, fix chunking, indexing, reranking, prompt structure, and context selection. That isn’t less ambitious. It is more adult. We learned this the hard way with microservices. Complexity doesn’t vanish when you decompose a system. It relocates. Back then, it moved into the network. Now it threatens to move into hand-offs, prompts, arbitration, and agent state.
Distributed intelligence is still distributed
What could have been one strong model call, retrieval, and a few carefully designed tools can quickly turn into agent routing, context hand-offs, arbitration, permissioning, and observability across a swarm of probabilistic components. That may be worth it when the problem is truly distributed, but often it’s not. Distributed intelligence is still distributed systems, and distributed systems aren’t cheap to build or maintain.
As OpenAI’s evaluation guide warns, triaging and hand-offs in multi-agent systems introduce a new source of nondeterminism. Its Codex documentation says subagents are not automatic and should only be used when you explicitly request parallel agent work, in part because each subagent does its own model and tool work and therefore consumes more tokens than a comparable single-agent run. Microsoft makes the same point in enterprise language: Every agent interaction requires protocol design, error handling, state synchronization, separate prompt engineering, monitoring, debugging, and a broader security surface.
Modularity, yes. But don’t pretend that modularity will be cheap.
This is why I suspect most teams that think they need multiple agents actually have a different problem. Their tools are vague, their retrieval is weak, their permissions are too broad, and their repositories are under-documented. Guess what? Adding more agents doesn’t fix any of that. It exacerbates it. As Anthropic explains, the most successful implementations tend to use simple, composable patterns rather than complex frameworks, and for many applications a single LLM call with retrieval and in-context examples is enough.
This matters even more because AI makes complexity cheap. In the microservices era, a bad architectural idea was at least constrained by the effort required to implement it. In the agent era, the cost of sketching yet another orchestration layer, another specialized persona, another hand-off, or another bit of glue code is collapsing. That can feel liberating even as it destroys our ability to maintain and manage systems over time. As I’ve written, lower production costs don’t automatically translate into higher productivity. They often just make it easier to manufacture fragility at scale.
Earn the extra moving parts
This also brings us back to a point I’ve made for years about hyperscaler architecture. Just because Google, Amazon, Anthropic, or OpenAI do something doesn’t mean you should too, because you don’t have their problems. Anthropic’s research system is impressive precisely because it tackles a hard, open-ended, breadth-first research problem. Anthropic is also candid about the cost. In its data, agents used about four times more tokens than chat interactions, while multi-agent systems used about 15 times more. The company also notes that most coding tasks are not a particularly good fit because they offer fewer truly parallelizable subtasks, and agents are not yet especially good at coordinating with one another in real time.
In other words, even one of the strongest public examples of multi-agent success comes with a warning label attached. It’s not quite “abandon hope, all ye who enter here,” but it’s definitely not “do as I’m doing.”
The better question is “What’s the minimum viable autonomy for this job?” Start with a strong model call. If that isn’t enough, add retrieval. Still not enough? Add better tools. If you need iteration, wrap those tools in a single agent loop. If context pollution becomes real, if independent tasks can truly run in parallel, or if specialization materially improves tool choice, then and only then start “earning” your second agent. If you can’t say which of those three problems you are solving, you probably don’t need another agent. Don’t believe me? All of the top purveyors of agent tools (Anthropic, OpenAI, Microsoft, Google) converge on this same counsel.
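The escalation ladder above tops out, for most teams, at a single agent loop. A minimal sketch of that loop, with the model call mocked out (any real deployment would substitute an LLM API call), shows how one agent plus tools plus a clear exit condition covers a lot of ground before a second agent is earned:

```python
# Minimal single-agent tool loop -- the step before any multi-agent topology.
# The model call is mocked; in practice it would be an LLM API call that
# either answers or requests a tool.

def mock_model(prompt: str, observations: list[str]) -> dict:
    """Stand-in for an LLM call: ask for a tool until an observation exists."""
    if not observations:
        return {"action": "tool", "tool": "search", "input": prompt}
    return {"action": "final", "answer": f"answer based on {observations[-1]}"}

def run_agent(prompt: str, tools: dict, max_turns: int = 5) -> str:
    observations: list[str] = []
    for _ in range(max_turns):  # clear exit condition: bounded iterations
        step = mock_model(prompt, observations)
        if step["action"] == "final":
            return step["answer"]
        # One agent, many tools: route the call instead of adding a teammate.
        observations.append(tools[step["tool"]](step["input"]))
    return "gave up"  # a bounded loop fails loudly instead of spinning

result = run_agent("latest pricing", {"search": lambda q: f"docs about {q}"})
```

Note that the complexity lives in well-defined tools and a bounded loop, not in hand-offs between probabilistic components.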
So yes, multi-agent is the new microservices. That is both a compliment and a warning. Microservices were powerful when you had a problem worth distributing. Multi-agent systems are powerful when you have a problem worth decomposing. Most enterprise teams don’t, at least not yet. Many others never will. Instead, most need one well-instrumented agent, tight permissions, strong evaluations, boring tools, and clear exit conditions. The teams that win with agentic AI won’t be those that reach for the fanciest topology first. Instead, they’ll be disciplined enough to earn every extra moving part and will work hard to avoid additional moving parts for as long as possible. In the enterprise, boring is still what scales.
27 questions to ask when choosing an LLM 6 Apr 2026, 2:00 am
Car buyers kick tires. Horse traders inspect the teeth. What should shoppers for large language models (LLMs) do?
Here are 27 pointed questions that developers are asking before they adopt a particular model. Model capabilities are diverse, and not every application requires the same support. These questions will help you identify the best models for your job.
What is the size of the model?
The number of parameters is a rough estimate of how much information is already encoded in the model. Some problems want to leverage this embedded knowledge: their prompts will be looking for information that might already be in the training corpus.
Some problems won’t need larger models. Perhaps there will be plenty of information added from a retrieval-augmented generation (RAG) database. Perhaps the questions will be simpler. If you can anticipate the general size of the questions, you can choose the smallest model that will satisfy them.
Does the model fit in your hardware?
Anyone who will be hosting their own models needs to pay attention to how well they run on the hardware at hand. Finding more RAM or GPUs is always a chore and sometimes impossible. If the model doesn’t fit or run smoothly on the hardware, it can’t be a solution.
What is the time to first token?
There are multiple ways to measure the speed of an LLM. The time to first token, or TTFT, is important for real-time, interactive applications where the end user will be daydreaming while waiting for some answer to appear on the screen. Some models start the response faster but then poke along; others take longer to begin responding. If you’re going to be using the LLM in the background or as a batch job, this number isn’t as important.
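TTFT is straightforward to measure yourself once you have a streaming response. The sketch below simulates a token stream with sleeps; swapping in a real streaming API iterator would benchmark an actual model:

```python
# Measuring time to first token (TTFT) versus total generation time.
# The stream is simulated here; replace simulated_stream() with a real
# streaming client's token iterator to measure a live model.
import time

def simulated_stream():
    """Yield tokens after a startup delay, mimicking a streaming LLM response."""
    time.sleep(0.05)                  # model "thinking" before the first token
    for tok in ["Hello", ",", " world"]:
        time.sleep(0.01)
        yield tok

def measure(stream):
    start = time.monotonic()
    ttft = None
    tokens = []
    for tok in stream:
        if ttft is None:
            ttft = time.monotonic() - start   # time to first token
        tokens.append(tok)
    total = time.monotonic() - start          # total generation time
    return ttft, total, tokens

ttft, total, tokens = measure(simulated_stream())
```

The gap between `ttft` and `total` is what separates a model that feels responsive from one that merely finishes quickly.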
Are there rate limits?
All combinations of models and hardware have a speed limit. If you’re supplying the hardware, you can establish the maximum load through testing. If you’re using an API, the provider will probably put rate limits on how many tokens it can process for you. If your project needs more, you’ll either need to buy more hardware or look for a different provider.
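When you do hit a provider’s limit, the standard remedy is retry with exponential backoff. Here is a minimal sketch with a stand-in for the API call; real providers typically signal limits with HTTP 429 and often include a Retry-After header worth honoring when present:

```python
# Exponential backoff for rate-limited API calls. RateLimitError and
# flaky_call are stand-ins for a real client's 429 handling.
import time

class RateLimitError(Exception):
    pass

def with_backoff(call, max_retries: int = 4, base_delay: float = 0.01):
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise                            # out of retries: surface it
            time.sleep(base_delay * (2 ** attempt))  # 0.01s, 0.02s, 0.04s, ...

attempts = {"n": 0}
def flaky_call():
    """Simulated endpoint that rejects the first two calls."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError("429: slow down")
    return "ok"

result = with_backoff(flaky_call)
```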
What is the size of the context window?
How big is your question? Some problems, like refactoring a large code base, require feeding millions of tokens into the machine. A smaller model with a limited context window won’t do. It will forget the first part of the prompt before it gets to the end.
If your problem fits into a smaller prompt, then you can get by with a smaller context window and a simpler model.
How does the model balance reasoning with speed?
Model developers can add different stages where the models will attempt to reason or think about the problem on a meta level. This is often considered “reasoning,” although it generally means that the models will iterate through a variety of different approaches until they find an answer that seems good enough. In practice, there’s a tradeoff between reasoning and speed. More iterations means slower responses. Is this “reasoning” worth it? It all depends upon the problem.
How stable is the model?
On certain prompts, some models are more likely to fail than others. They’ll start off with an answer but diverge into some dark statistical madness, spewing random words and gibberish. Most of the time they’ll offer correct answers; the trouble is that the instability appears at random, often once the model is already running in production.
When did training end?
The “knowledge cutoff” is the last day when the training set for the model stopped getting an injection of new information. If you’re going to be relying on the general facts embedded in the model, then you’ll want to know how they age. Not all projects need a current date, though, because some use other documents in a RAG system or vector database to add more details to the prompt.
Is additional training possible?
Some LLM providers support another round of training, usually on the customer’s domain-specific data sets. This fine-tuning can teach a foundation model the details that give it the power to take up a place in some workflow or data assembly line. Fine-tuning is often dramatically cheaper and faster than building an entirely new model from scratch.
Which media types are supported?
Some models only return text. Some return images. Some are trained to do something else entirely. The same goes for input. Some can read a text prompt. Some can examine an image file and parse charts or PDFs. Some are smart enough to unpack strange file types. Just make sure the LLM can listen and speak in the file formats you need.
What is the prompting structure?
The structure of the prompt can make a difference with many models. Some pay particular attention to instructions in the system prompt. Others are moving to a more interactive, Socratic style of prompting that allows the user and the LLM to converge upon the answer. Some encourage the LLM to adopt different personas of famous people. The best way to prompt iterative, agentic thought is still a very active research topic.
Is the model open source?
Some models have been released with open source licenses that bring many of the same freedoms as open source software. Projects that need to run in controlled environments can fire up these models inside their space and avoid trusting online services. Some users will want to fine-tune the models, and open source models allow them to take advantage of access to the model weights.
Is there a guaranteed lifespan?
If the model is not open source, the creators may shut it down at any time. Some services offer assurance that the model will have a set lifespan and will be supported for a predictable amount of time. This allows developers to be sure that the rug won’t be pulled out from beneath their feet soon after integrating the model with their stack.
Whereas earlier versions of open source models remain available, the ongoing availability of proprietary models is determined by their owners. What happens to old versions that have been retired? Most of us are happier with their replacements, but those who have grown to rely on a retired version are out of luck. Some providers of proprietary models have promised to release the model weights on retirement, an option that keeps the model available even though it’s not fully open source.
Does the model have a batch architecture?
If the answer is not needed in real time, some LLMs can process the prompts in delayed batches. Many model hosts will offer large discounts for the option to answer at some later time when demand is lower. Some inference engines can offer continuous batching with techniques like PagedAttention or finer-grained scheduling. All of these techniques can lower costs by boosting the throughput of hardware.
What is the cost?
In some situations, price is very important, especially when some tasks will be repeated many times. While the cost of one answer may be fractions of a cent, those costs add up. On big data assembly lines, downgrading to a cheap option can make the difference between financial success and failure.
In other jobs, the price won’t matter. Maybe the prompt will only be run a few times. Maybe the price is much lower than the value of the job. Scrimping on the LLM makes little sense in these cases because spending extra for a bigger, fancier model won’t break the budget.
Was the model trained on synthetic data?
Some LLMs are trained on synthetic data created by other models. When things go correctly, the model doesn’t absorb any false bits of information. But when things go wrong, the models can lose precision. Some draw an analogy to the way that copies of copies of copies grow blurred and lose detail. Others compare the process to audio feedback between an amplifier and a microphone.
Is the training set copyrighted?
Some LLM creators cut corners when they started building their training set by including pirated books. Anthropic, for example, has announced a settlement to a class action lawsuit for some books that are still under copyright. Other lawsuits are still pending. The claim is that the models may produce something close to the copyrighted material when prompted the right way. If your use cases may end up asking for answers that might turn up plagiarized or pirated material, you should look for some assurances about how the training set was chosen.
Is there a provenance audit?
Some developers are fighting the questions about synthetic data and copyright by offering a third-party audit of their training sets. This can answer questions and alleviate worries about future infringement issues.
Does the model come with indemnification?
Does the contract offer a guarantee that the answers won’t infringe upon copyright or include personal information? Some companies are confident that their training data is clean enough that they’re able to offer contractual indemnification for customers.
Do we know the environmental impacts?
This usually means how much electricity and water are consumed to produce an answer. Some services are offering estimates that they hope will distinguish their services from others that are more wasteful. In general, price is not a bad proxy for environmental impact, because electricity and water are direct costs and often some of the greatest ones. Developers have a natural incentive to use less of both.
Is the hardware powered by renewable energy?
Did the power come from a clean source? Some services are partnering directly with renewable energy providers so that they can promise that the energy used to construct an answer came from solar or wind farms. In some cases, they’re offering batch services that queue up the queries until the renewable sources are online.
Does the model have compliance issues?
Some developers who work in highly regulated environments need to worry about access to their data. These developers will need to review how standards like SOC 2, HIPAA, and GDPR, among others, affect how the model can be used. In many cases, the model needs to be fired up in a controlled environment. In some cases, the problem is more complex: Some regulations require “transparency” in certain decisions, meaning that the model will need to explain how it came to a conclusion. This is often one of the most complicated questions to answer.
Where does the model run?
Some of the regulations are tied directly to location. Some of the GDPR regulations, for instance, require that personal data from Europeans stay in Europe. Geopolitics and national borders also affect legal questions for a number of issues like taxes, libel, or privacy. If your use case strays into these areas, the physical location of the LLM may be important. Some services are setting up regional deployments just to resolve these questions.
Does the model support human help?
Some developers are explicitly building in places for humans inside the reasoning of the LLM. These “human-in-the-loop” solutions make it possible to stop an LLM from delivering a flawed or dangerous answer. Finding the best architectural structure for these hooks can be tricky because they can create too much labor if they’re triggered too often.
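One common shape for such a hook is a risk-gated approval step, so a human is consulted only when an action crosses a threshold and reviewers aren’t flooded. The scoring heuristic and approver below are hypothetical placeholders; a real system might use a classifier or policy rules:

```python
# Sketch of a human-in-the-loop gate: escalate to a reviewer only when an
# action's risk score crosses a threshold. Risk scoring here is a toy
# keyword heuristic, purely for illustration.

def risk_score(action: str) -> float:
    """Toy heuristic: flag destructive-sounding verbs as high risk."""
    risky_words = {"delete", "transfer", "deploy"}
    return 1.0 if set(action.lower().split()) & risky_words else 0.1

def execute(action: str, approver, threshold: float = 0.5) -> str:
    if risk_score(action) >= threshold:
        if not approver(action):       # human review only above threshold
            return f"blocked: {action}"
    return f"done: {action}"

reviewed = []
def conservative_approver(action: str) -> bool:
    reviewed.append(action)            # record what actually reached a human
    return False                       # this reviewer rejects everything

out1 = execute("summarize report", conservative_approver)
out2 = execute("delete production table", conservative_approver)
```

Tuning the threshold is exactly the labor tradeoff the article describes: too low and humans drown in approvals, too high and flawed actions slip through.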
Does the model support tool use?
Some models and services allow their models to use outside features for searching the internet, looking in a database, or calling an arbitrary function. These functions can really help with problems that need to leverage data from outside sources. There is a large collection of tools and interfaces that use APIs like the Model Context Protocol (MCP). It’s worth experimenting with them to determine how stable they are.
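The basic mechanic behind tool use looks something like the sketch below: the model emits a structured tool-call request, the host dispatches it, and the result is fed back as context for the next turn. This mimics the general shape of function-calling APIs and MCP tool invocation but is a simplification, not the actual wire protocol of any provider:

```python
# Sketch of tool dispatch: parse a (hypothetical) JSON tool request from the
# model and route it to a registered tool. Tool names and message fields are
# illustrative, not a real provider's schema.
import json

TOOLS = {
    "lookup_price": lambda args: {"symbol": args["symbol"], "price": 101.5},
}

def handle_model_output(raw: str) -> dict:
    msg = json.loads(raw)
    if msg.get("type") == "tool_call":
        result = TOOLS[msg["name"]](msg["arguments"])
        # In a real loop, this message is appended to the conversation and
        # the model is called again with the tool result in context.
        return {"role": "tool", "name": msg["name"], "content": result}
    return {"role": "assistant", "content": msg.get("content", "")}

model_output = '{"type": "tool_call", "name": "lookup_price", "arguments": {"symbol": "ACME"}}'
reply = handle_model_output(model_output)
```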
Is the model agentic?
There may be no bigger buzzword right now, because everyone is using it to describe how they’re adding more reasoning capabilities to their models. Sometimes this means that a constellation of LLMs work together, often choreographed by some other set of LLMs. Does it mean smarter? Maybe. Better? Only you can tell.
What are the model’s quirks?
Anyone who spends some time with an LLM starts to learn its quirks. It’s almost like they’ve learned everything they know from fallible humans. One model gives different answers if there are two spaces after a period instead of one. Another model sounds pretentious. Most are annoyingly sycophantic. Anyone choosing an LLM must spend some time with the model and get a feel for whether the quirks will end up being endearing, annoying, or worse.
Anthropic cuts OpenClaw access from Claude subscriptions, offers credits to ease transition 6 Apr 2026, 1:40 am
Anthropic has blocked paid Claude subscribers from using the widely used open-source AI agent OpenClaw under their existing subscription plans, a move that took effect April 4 and has drawn pushback from subscribers who question both the cost implications and the company’s stated rationale.
In an email to subscribers reviewed by InfoWorld, Anthropic said access to third-party tools through subscription tokens was being discontinued. “Starting April 4, third-party harnesses like OpenClaw connected to your Claude account will draw from extra usage instead of from your subscription,” the company said. Users accessing Claude through the API are unaffected by the change.
To ease the transition, Anthropic offered each subscriber a one-time credit equal to their monthly subscription price, redeemable by April 17 and valid for 90 days across Claude Code, Claude Cowork, chat, or connected third-party tools. The company also introduced pre-purchase extra usage bundles at discounts of up to 30% for subscribers who want to continue running OpenClaw with Claude as the underlying model.
“If you ever run past your subscription limits, this is the easiest way to keep going,” the company said in the email.
Capacity, not competition
Boris Cherny, head of Claude Code at Anthropic, explained the decision in a post on X. “We’ve been working hard to meet the increase in demand for Claude, and our subscriptions weren’t built for the usage patterns of these third-party tools,” Cherny said. “Capacity is a resource we manage thoughtfully and we are prioritizing our customers using our products and API. We want to be intentional in managing our growth to continue to serve our customers sustainably long-term.”
The token gap between standard subscription usage and third-party agent workloads is substantial. Testing conducted by German technology outlet c’t 3003 in January found that a single day of OpenClaw usage running on Claude’s Opus model consumed $109.55 in AI tokens. Anthropic’s own published benchmarks for Claude Code put the average daily cost for a professional software developer at $6, with 90% of team users staying below $12 per day.
OpenClaw team pushed back — and bought a week
Peter Steinberger, the Austrian developer who created OpenClaw before joining OpenAI, said on X that the original implementation date had been earlier. “Both me and @davemorin tried to talk sense into Anthropic, best we managed was delaying this for a week,” Steinberger wrote. He also drew attention to the sequence of product moves preceding the access cut. “Funny how timings match up, first they copy some popular features into their closed harness, then they lock out open source,” Steinberger said.
When one commenter argued that third-party tools did not belong on flat subscription plans and that any vendor allowing it was being “intellectually dishonest,” Steinberger noted that OpenClaw already supports subscriptions from other AI providers. “Funny how it works for literally any other player in the AI industry, we support subscriptions from MiniMax, Alibaba, OpenCode, GLM, OpenAI,” he replied in the post.
Cherny responded to the open source criticism directly, saying he had personally contributed pull requests to OpenClaw to improve its prompt cache efficiency. “This is more about engineering constraints,” Cherny said. “Our systems are highly optimized for one kind of workload, and to serve as many people as possible with the most intelligent models, we are continuing to optimize that.”
Subscribers weigh the cost
Developer Jared Tate said on X that he intended to cancel his subscriptions over the change. After Cherny’s response, Tate acknowledged the engineering explanation and noted that careful OpenClaw configuration, including a one-hour prompt cache time-to-live and a 55-minute heartbeat, had materially reduced his own token consumption. “OpenClaw dramatically increased usage. But we all became so much more productive,” he wrote.
One subscriber posting as @ashen_one said they were running two OpenClaw instances on a $200-per-month plan. Shifting to API keys or overage bundles, they said, would make continued use financially unworkable. “I’ll probably have to switch over to a different model at this point,” the user wrote.
The user also pointed to Claude Cowork, Anthropic’s own agentic productivity tool, as a direct OpenClaw rival, and suggested the decision served competitive purposes. AI developer Brian Vasquez offered a different read. “Anthropic oversold their server capacity, and this was their response, point blank and simple,” Vasquez wrote on X. “It’s a capacity/bad bet. Time to pay off that bad bet.”
Internet Bug Bounty program hits pause on payouts 3 Apr 2026, 10:16 am
Researchers who identify and report bugs in open-source software will no longer be rewarded by the Internet Bug Bounty team. HackerOne, which administers the program, has said that it is “pausing submissions” while it contemplates ways in which open source security can be handled more effectively.
The Internet Bug Bounty program, funded by a number of leading software companies, has run since 2012 and has awarded more than $1.5 million to researchers who have reported bugs. Up to now, 80% of its payouts have gone to discoveries of new flaws, and 20% to support remediation efforts. But as artificial intelligence makes it easier to find bugs, that balance needs to change, HackerOne said in a statement.
“AI-assisted research is expanding vulnerability discovery across the ecosystem, increasing both coverage and speed. The balance between findings and remediation capacity in open source has substantively shifted,” said HackerOne.
Among the first programs to be affected is the Node.js project, a server-side JavaScript platform for web applications known for its extensive ecosystem. While the project team will continue to accept and triage bug reports through HackerOne, without funding from the Internet Bug Bounty program it will no longer pay out rewards, according to an announcement on its website.
The Internet Bug Bounty Program is not the only bug-hunting project that has struggled with the onset of AI in vulnerability hunting. In January, the Curl program said that it was not taking any more submissions. And just last month, Google also put a halt to AI-generated submissions provided to its Open Source Software Vulnerability Reward Program.
Claude Code is still vulnerable to an attack Anthropic has already fixed 3 Apr 2026, 9:55 am
The leak of Claude Code’s source is already having consequences for the tool’s security. Researchers have spotted a vulnerability documented in the code.
The vulnerability, revealed by AI security company Adversa, is that when Claude Code is presented with a command composed of more than 50 subcommands, for every subcommand after the 50th it skips the compute-intensive security analysis that might otherwise have blocked some of them, and instead simply asks the user whether they want to go ahead. The user, assuming that the block rules are still in effect, may unthinkingly authorize the action.
Incredibly, the vulnerability is documented in the code itself, and Anthropic has already developed a fix for it, a tree-sitter-based parser, which is also present in the code but not enabled in the public builds that customers use, said Adversa.
Adversa outlined how attackers might exploit the vulnerability by distributing a legitimate-looking code repository containing a poisoned CLAUDE.md file. This would contain instructions for Claude Code to build the project, with a sequence of 50 or more legitimate-looking commands, followed by a command to, for example, exfiltrate the victim’s credentials. Armed with those credentials, the attackers could threaten a whole software supply chain.
CERT-EU blames Trivy supply chain attack for Europa.eu data breach 3 Apr 2026, 9:37 am
The European Union’s Computer Emergency Response Team, CERT-EU, has traced last week’s theft of data from the Europa.eu platform to the recent supply chain attack on Aqua Security’s Trivy open-source vulnerability scanner.
The attack on the AWS cloud infrastructure hosting the Europa.eu web hub on March 24 resulted in the theft of 350 GB of data (91.7 GB compressed), including personal names, email addresses, and messages, according to CERT-EU’s analysis.
The compromise of Trivy allowed attackers to access an AWS API key, gaining access to a range of European Commission web data, including data related to “42 internal clients of the European Commission, and at least 29 other Union entities using the service,” it said.
“The threat actor used the compromised AWS secret to create and attach a new access key to an existing user, aiming to evade detection. They then carried out reconnaissance activities,” said CERT-EU. The organization had found no evidence that the attackers had moved laterally to other AWS accounts belonging to the Commission.
Given the timing and involvement of AWS credentials, “the European Commission and CERT-EU have assessed with high confidence that the initial access vector was the Trivy supply-chain compromise, publicly attributed to TeamPCP by Aqua Security,” it said.
In the event, the stolen data became public after the group blamed for the attack, TeamPCP, leaked it to the ShinyHunters extortion group, which published it on the dark web on March 28.
Back door credentials
The Trivy compromise dates to February, when TeamPCP exploited a misconfiguration in Trivy’s GitHub Actions environment, now identified as CVE-2026-33634, to establish a foothold via a privileged access token, according to Aqua Security.
After discovering the intrusion, Aqua Security rotated its credentials, but because old credentials remain valid during the rotation process, the attackers were able to steal the newly rotated credentials as well.
By manipulating trusted Trivy version tags, TeamPCP forced CI/CD pipelines using the tool to automatically pull down credential-stealing malware it had implanted.
This allowed TeamPCP to target a variety of valuable information including AWS, GCP, Azure cloud credentials, Kubernetes tokens, Docker registry credentials, database passwords, TLS private keys, SSH keys, and cryptocurrency wallet files, according to security researchers at Palo Alto Networks. In effect, the attackers had turned a tool used to find cloud vulnerabilities and misconfigurations into a yawning vulnerability of its own.
CERT-EU advised organizations affected by the Trivy compromise to immediately update to a known safe version, rotate all AWS and other credentials, audit Trivy versions in CI/CD pipelines, and, most importantly, ensure GitHub Actions are pinned to immutable full commit SHAs rather than mutable tags.
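The pinning advice looks like this in a workflow file. A tag is a movable pointer: CI re-resolves it on every run, so an attacker who repoints the tag can swap in malicious code, which is exactly the mechanism used against Trivy consumers. A full commit SHA cannot be repointed. (The version tag and SHA below are illustrative placeholders, not vetted releases.)

```yaml
# Mutable tag: re-resolved on every run, can be silently repointed.
- uses: aquasecurity/trivy-action@v0.x.y

# Immutable pin: a full commit SHA always resolves to the same code.
# Keep the human-readable version in a comment for maintainability.
- uses: aquasecurity/trivy-action@a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0  # v0.x.y
```

Tools such as Dependabot can then propose SHA updates through reviewable pull requests rather than having pipelines pick up new code automatically.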
It also recommended looking for indicators of compromise (IoCs) such as unusual Cloudflare tunneling activity or traffic spikes that might indicate data exfiltration.
Extortion boost
The origins and deeper motives of TeamPCP, which emerged in late 2025, remain unclear. The leaking of stolen data suggests it may be styling itself as a sort of initial access broker that sells data and network access to the highest bidder.
However, the fact that stolen data was handed to a major extortion group suggests that affected organizations are likely to face a wave of extortion demands in the coming weeks.
If so, this would be a huge step backwards at a time when ransomware has been under pressure as the proportion of victims willing to pay ransoms has declined.
The compromise of Trivy, estimated to have affected at least 1,000 SaaS environments, is rapidly turning into one of the most consequential supply-chain incidents of recent times.
The number of victims is likely to grow in the coming weeks. Others caught up in the incident include Cisco, which reportedly lost source code, security testing company Checkmarx, and AI gateway company LiteLLM.
This article was first published on CSO.
