The 200ms latency: A developer’s guide to real-time personalization 19 Feb 2026, 2:00 am

For engineers building high-concurrency applications in e-commerce, fintech or media, the “200ms limit” is a hard ceiling. It is the psychological threshold where interaction feels instantaneous. If a personalized homepage, search result or “Up Next” queue takes longer than 200 milliseconds to load, user abandonment spikes. A widely cited Amazon study found that every 100ms of latency cost the company 1% in sales. In the streaming world, that latency translates directly to “churn.”

The problem is that the business always wants smarter, heavier models. They want large language models (LLMs) to generate summaries, deep neural networks to predict churn and complex reinforcement learning agents to optimize pricing. All of these push latency budgets to the breaking point.

As an engineering leader, I often find myself acting as the mediator between data science teams who want to deploy massive parameters and site reliability engineers (SREs) who are watching the p99 latency graphs turn red.

To reconcile the demand for better AI with the reality of sub-second response times, we must rethink our architecture. We need to move away from monolithic request-response patterns and decouple inference from retrieval.

Here is a blueprint for architecting real-time systems that scale without sacrificing speed.

The architecture of the two-pass system

A common mistake I see in early-stage personalization teams is trying to rank every item in the catalog in real-time. If you have 100,000 items (movies, products or songs), running a complex scoring model against all 100,000 for every user request is computationally infeasible within 200ms.

Two-tower architecture for latency

Manoj Yerrasani

To solve this, we implement a two-tower architecture (or a candidate generation and ranking split).

  1. Candidate generation (the retrieval layer): This is a fast, lightweight sweep. We use vector search or simple collaborative filtering to narrow the 100,000 items down to the top 500 candidates. This step prioritizes recall over precision. It needs to happen in under 20ms.
  2. Ranking (the scoring layer): This is where the heavy AI lives. We take those 500 candidates and run them through the sophisticated deep learning model (e.g. XGBoost or a neural network) that considers hundreds of features like user context, time of day and device type.

By splitting the process, we only spend our expensive compute budget on the items that actually have a chance of being shown. This funnel approach is the only way to balance scale with sophistication.
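
To make the funnel concrete, here is a minimal TypeScript sketch of the retrieve-then-rank split. The annIndex and rankingModel objects are hypothetical stand-ins for a real vector index and a deployed scoring model; only the shape of the two-pass flow is shown.

// Sketch of the two-pass funnel: cheap retrieval first, expensive ranking second.
interface Candidate { itemId: string; retrievalScore: number }

interface AnnIndex { search(userVector: number[], k: number): Promise<Candidate[]> }
interface RankingModel { score(userId: string, items: Candidate[]): Promise<Array<{ itemId: string; score: number }>> }

async function personalize(
  userId: string,
  userVector: number[],
  annIndex: AnnIndex,
  rankingModel: RankingModel,
): Promise<string[]> {
  // Pass 1: recall-oriented retrieval, ~20ms budget, narrows 100,000 items to ~500.
  const candidates = await annIndex.search(userVector, 500);

  // Pass 2: precision-oriented ranking; the heavy model runs only on the 500 survivors.
  const ranked = await rankingModel.score(userId, candidates);
  ranked.sort((a, b) => b.score - a.score);

  return ranked.slice(0, 20).map((r) => r.itemId);
}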

Solving the cold start problem

The first hurdle every developer faces is the “cold start.” How do you personalize for a user with no history or an anonymous session?

Traditional collaborative filtering fails here because it relies on a sparse matrix of past interactions. If a user just landed on your site for the first time, that matrix is empty.

To solve this within a 200ms budget, you cannot afford to query a massive data warehouse to look for demographic clusters. You need a strategy based on session vectors.

We treat the user’s current session (clicks, hovers and search terms) as a real-time stream. We deploy a lightweight Recurrent Neural Network (RNN) or a simple Transformer model right at the edge or in the inference service.

As the user clicks “Item A,” the model immediately infers a vector based on that single interaction and queries a Vector Database for “nearest neighbor” items. This allows us to pivot the personalization in real-time. If they click a horror movie, the homepage reshuffles to show thrillers instantly.

The trick to keeping this fast is to use hierarchical navigable small world (HNSW) graphs for indexing. Unlike a brute-force search, which compares the user vector against every item vector, HNSW navigates a graph structure to find the closest matches with logarithmic complexity. This brings query times down from hundreds of milliseconds to single-digit milliseconds.

Crucially, we only compute the delta of the current session. We do not re-aggregate the user’s lifetime history. This keeps the inference payload small and the lookup instant.
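
As a rough illustration, the TypeScript sketch below folds only the current session’s clicks into a vector and queries for nearest neighbors. The HnswIndex interface is a hypothetical wrapper around whatever ANN library or vector database you use.

// Session delta only: average the embeddings of items clicked in this session.
// No lifetime-history re-aggregation, so the payload stays tiny.
interface HnswIndex { searchKnn(vector: number[], k: number): Promise<string[]> }

function sessionVector(clickedItemEmbeddings: number[][]): number[] {
  const dim = clickedItemEmbeddings[0].length;
  const acc = new Array<number>(dim).fill(0);
  for (const emb of clickedItemEmbeddings) {
    for (let i = 0; i < dim; i++) acc[i] += emb[i];
  }
  return acc.map((v) => v / clickedItemEmbeddings.length);
}

async function reshuffleHomepage(index: HnswIndex, clicks: number[][]): Promise<string[]> {
  if (clicks.length === 0) return []; // zero-click cold start: fall back to trending elsewhere
  const v = sessionVector(clicks);
  return index.searchKnn(v, 50); // logarithmic-time graph search, single-digit milliseconds
}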

The decision matrix: Inference vs. pre-compute

Another architectural flaw I frequently encounter is the dogmatic attempt to run everything in real-time. This is a recipe for cloud bill bankruptcy and latency spikes.

You need a strict decision matrix to decide exactly what happens when the user hits “load.” We divide our strategy based on the “Head” and “Tail” of the distribution.

First, look at your head content. For the top 20% of active users or globally trending items (e.g. the Super Bowl stream or a viral sneaker drop), you should pre-compute recommendations. If you have a VIP user who visits daily, run those heavy models in batch mode via Airflow or Spark every hour.

Store the results in a low-latency Key-Value store like Redis, DynamoDB or Cassandra. When the request comes in, it is a simple O(1) fetch that takes microseconds, not milliseconds.

Second, use just-in-time inference for the tail. For niche interests or new users that pre-computing cannot cover, route the request to a real-time inference service.
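
Combined, the head/tail split looks roughly like the TypeScript sketch below, assuming ioredis for the key-value fetch; the recs: key scheme and the inferRealtime function are placeholders for your own batch output and inference service.

// Head: try the pre-computed recommendations first (O(1) fetch).
// Tail: fall back to just-in-time inference for users the batch jobs missed.
import Redis from "ioredis";

const redis = new Redis(); // defaults to localhost:6379

async function getRecommendations(
  userId: string,
  inferRealtime: (userId: string) => Promise<string[]>,
): Promise<string[]> {
  // Batch jobs (Airflow/Spark) wrote these ahead of time for head users.
  const cached = await redis.get(`recs:${userId}`);
  if (cached !== null) return JSON.parse(cached) as string[];

  // Niche interests and new users go to the real-time inference service.
  return inferRealtime(userId);
}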

Finally, optimize aggressively with model quantization. In a research lab, data scientists train models using 32-bit floating-point precision (FP32). In production, you rarely need that level of granularity for a recommendation ranking.

We compress our models to 8-bit integers (INT8) or even 4-bit using techniques like post-training quantization. This reduces the model size by 4x and significantly reduces memory bandwidth usage on the GPU. Often, the accuracy drop is negligible (less than 0.5%), but the inference speed doubles. This is often the difference between staying under the 200ms ceiling or breaking it.
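
The arithmetic behind that compression is simple enough to show in a toy TypeScript sketch: a single scale factor maps FP32 weights to INT8 and back. In practice you would rely on your framework’s post-training quantization tooling rather than hand-rolled code; this only illustrates the idea.

// Symmetric post-training quantization: FP32 weights -> INT8 with one scale factor.
function quantizeInt8(weights: number[]): { q: Int8Array; scale: number } {
  const maxAbs = Math.max(...weights.map(Math.abs), 1e-12);
  const scale = maxAbs / 127; // one FP32 scale covers the whole tensor
  const q = Int8Array.from(weights, (w) => Math.max(-127, Math.min(127, Math.round(w / scale))));
  return { q, scale };
}

function dequantize(q: Int8Array, scale: number): number[] {
  return Array.from(q, (v) => v * scale); // small rounding error, 4x less storage
}

const { q, scale } = quantizeInt8([0.12, -0.53, 0.91, -0.07]);
console.log(dequantize(q, scale)); // values close to the originals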

Resilience and the ‘circuit breaker’

Speed means nothing if the system breaks. In a distributed system, a 200ms timeout is a contract you make with the frontend. If your sophisticated AI model hangs and takes 2 seconds to return, the frontend spins and the user leaves.

We implement strict circuit breakers and degraded modes.

We set a hard timeout on the inference service (e.g., 150ms). If the model fails to return a result within that window, the circuit breaker trips. We do not show an error page. Instead, we fall back to a “safe” default: a cached list of “Popular Now” or “Trending” items.

From the user’s perspective, the page loaded instantly. They might see a slightly less personalized list, but the application remained responsive. Better to serve a generic recommendation fast than a perfect recommendation slow.
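
A minimal TypeScript sketch of that contract races the inference call against a 150ms timer and falls back to a cached list. The fetchRanked and cachedTrending functions are placeholders; a production circuit breaker would also track error rates and stop calling the backend entirely while the circuit is open.

// Hard timeout plus degraded mode: serve something fast, never an error page.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const t = setTimeout(() => reject(new Error("inference timeout")), ms);
    p.then(
      (v) => { clearTimeout(t); resolve(v); },
      (e) => { clearTimeout(t); reject(e); },
    );
  });
}

async function homepageRows(
  userId: string,
  fetchRanked: (userId: string) => Promise<string[]>,
  cachedTrending: () => string[],
): Promise<string[]> {
  try {
    return await withTimeout(fetchRanked(userId), 150); // 150ms contract with the frontend
  } catch {
    return cachedTrending(); // degraded but instant: "Popular Now"
  }
}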

Data contracts as a reliability layer

In a high-velocity environment, upstream data schemas change constantly. A developer adds a field to the user object or changes a timestamp format from milliseconds to nanoseconds. Suddenly, your personalization pipeline crashes because of a type mismatch.

To prevent this, you must implement data contracts at the ingestion layer.

Think of a data contract as an API spec for your data streams. It enforces schema validation before the data ever enters the pipeline. We use Protobuf or Avro schemas to define exactly what the data should look like.

If a producer sends bad data, the contract rejects it at the gate (putting it into a dead letter queue) rather than poisoning the personalization model. This ensures that your runtime inference engine is always fed clean, predictable features. It prevents the “garbage in, garbage out” scenarios that cause silent failures in production.
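
As a self-contained sketch of that gate, the TypeScript below uses zod as a stand-in for a Protobuf or Avro schema; the event shape and dead-letter callback are illustrative.

import { z } from "zod";

// Contract for a click event; Protobuf/Avro would normally define this,
// zod stands in so the sketch runs on its own.
const ClickEvent = z.object({
  userId: z.string().min(1),
  itemId: z.string().min(1),
  // Milliseconds since epoch per the contract; a nanosecond-scale value fails the check.
  timestampMs: z.number().int().refine((t) => t < 1e14, "looks like nanoseconds, not milliseconds"),
});

type ClickEvent = z.infer<typeof ClickEvent>;

function ingest(
  raw: unknown,
  emit: (event: ClickEvent) => void,
  deadLetter: (raw: unknown, reason: string) => void,
): void {
  const parsed = ClickEvent.safeParse(raw);
  if (!parsed.success) {
    // Rejected at the gate: the model never sees malformed features.
    deadLetter(raw, parsed.error.message);
    return;
  }
  emit(parsed.data);
}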

Observability beyond the average

Finally, how do you measure success? Most teams look at “average latency.” This is a vanity metric. It hides the experience of your most important users.

Averages smooth over the outliers. But in personalization systems, the outliers are often your “power users.” The user with 5 years of watch history requires more data processing than a user with 5 minutes of history. If your system is slow only for heavy data payloads, you are specifically punishing your most loyal customers.

We look strictly at p99 and p99.9 latency. This tells us how the system performs for the slowest 1% or 0.1% of requests. If our p99 is under 200ms, then we know the system is healthy.
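
If you want a quick way to see the difference, the TypeScript sketch below computes percentiles from raw latency samples. Real systems use histogram-based metrics in the observability stack rather than sorting samples, but the point stands: the median can look healthy while p99 does not.

// Tail latency from raw samples (illustrative only).
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

const latenciesMs = [42, 51, 48, 180, 47, 320, 55, 49, 61, 44];
console.log(percentile(latenciesMs, 50)); // the median looks healthy
console.log(percentile(latenciesMs, 99)); // the tail tells the real story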

The architecture of the future

We are moving away from static, rule-based systems toward agentic architectures. In this new model, the system does not just recommend a static list of items. It actively constructs a user interface based on intent.

This shift makes the 200ms limit even harder to hit. It requires a fundamental rethink of our data infrastructure. We must move compute closer to the user via edge AI, embrace vector search as a primary access pattern and rigorously optimize the unit economics of every inference.

For the modern software architect, the goal is no longer just accuracy. It is accuracy at speed. By mastering these patterns, specifically two-tower retrieval, quantization, session vectors and circuit breakers, you can build systems that do not just react to the user but anticipate them.

This article is published as part of the Foundry Expert Contributor Network.


Community push intensifies to free MySQL from Oracle’s control amid stagnation fears 19 Feb 2026, 1:36 am

Pressure is building on Oracle to loosen its grip on MySQL, with a group of database veterans, developers, and long-time contributors urging the company to transition the open source database to an independent foundation model.

The call, articulated in an open letter, reflects mounting concern about MySQL’s development velocity, roadmap transparency, and role in an increasingly AI-driven data ecosystem.

The letter itself has received at least 248 signatures so far.

These signatories span database administrators, architects, and developers from MySQL fork providers such as Percona, MariaDB, and PlanetScale, as well as engineers and executives from companies including Zoho, DigitalOcean, Vultr, and Pinterest, among others.

Open letter reflects growing unease over MySQL’s direction

Chief among the signatories’ concerns is how Oracle has managed updates to MySQL’s codebase, which they argue has cost the database significant market share. Developers and enterprises are increasingly gravitating toward PostgreSQL as demand surges for AI-driven workloads, where databases play a critical role in consolidating and serving data.

The letter also argued that the MySQL updates are not only “private” and sparse, but also lack features that are now table stakes for AI-driven workloads and have become standard across most databases, including the enterprise versions offered by Oracle.

In fact, Percona co-founder Vadim Tkachenko, who was one of the co-authors of the letter, told InfoWorld that enterprises’ concern around MySQL’s direction under Oracle had reached a “critical” level.

So much so that enterprises were looking at MySQL fork-providers and cloud providers, such as AWS, for new features and innovation, citing what they perceive as stagnation in the core MySQL project, Tkachenko added.

However, the co-founder pointed out that the interest from enterprises combined with the innovations at the fork- and cloud-provider level does not help move MySQL forward, but rather creates confusion and fragmentation: “Often, forks are not compatible with each other and with upstream (core OSS MySQL), which creates major barriers for adoption and migrations.”

As AI workloads rise, MySQL loses ground to PostgreSQL

Analysts agree with Tkachenko. “The concerns being raised in the open letter about MySQL’s development velocity and governance are consistent with what I have seen,” said Stephanie Walter, leader of the AI stack at HyperFRAME Research.

“The database layer is becoming an AI system dependency. When developers and enterprises feel upstream is slow or opaque on modern requirements, they don’t just complain. They route around it, most likely to something like PostgreSQL,” Walter added.

Echoing Walter, dbinsight chief analyst Tony Baer pointed out that MySQL forks indeed create lock-ins because of their unique individual extensions, resulting in challenges around migration.

The foundation proposal: what signatories want

Nonetheless, Tkachenko and other signatories do see a way to rescue MySQL from the alleged rut: Oracle accepting the proposal to place MySQL under an independent, non-profit foundation.

Under the proposed model outlined in the open letter, MySQL would be governed by a neutral, non-profit foundation with a technical steering committee representing Oracle, fork providers, cloud vendors, and the broader contributor community.

The foundation would oversee roadmap planning, release governance, and contributor access, while allowing Oracle to retain its commercial MySQL offerings and trademarks.

Signatories argue that this structure would protect Oracle’s commercial interests as well as give vendors and enterprises more confidence around the database due to transparent roadmaps, updates, and long-term technical direction while reducing fragmentation across forks.

The foundation may not fix the power dynamics

Analysts, however, didn’t seem too confident about the foundation model proposed in the letter.

“It won’t fully resolve the core power dynamics if Oracle retains the trademark and the effective release pipeline,” Walter said, adding that the proposed model may help with the coordination and contribution process — another concern expressed by the signatories.

Explaining further, Walter pointed out that the proposed structure contrasts with autonomous, community-led projects such as PostgreSQL, whose governance model has played a meaningful role in sustaining contributor trust and accelerating long-term adoption.

PostgreSQL, according to a 2025 Stack Overflow survey, leads MySQL in terms of usage and popularity.

The decline in MySQL’s popularity could be directly linked to a sharp drop in the number of contributors and commits over the last few years.

Julia Vural, software engineering manager at Percona, wrote in a blog post that the pool of active contributors to MySQL had dropped to around 75 by Q3 2025, compared to 135 active contributors in Q4 2017.

Similarly, the total number of commits has declined from 22,360 in 2010 to 4,730 in 2024, Vural added.

Other factors could include recent layoffs at the MySQL division at Oracle, including the departure of Oracle MySQL community manager Frederic Descamps, who moved to the MariaDB Foundation this week.

Oracle did not immediately respond to queries about the open letter.


WinterTC: Write once, run anywhere (for real this time) 19 Feb 2026, 1:00 am

The WinterCG community group was recently promoted to a technical committee, signaling a growing maturity for the standard that aims to unify JavaScript runtimes. Now is a good time to catch up with this key piece of the modern JavaScript and web development landscape.

The WinterTC manifesto

To understand what WinterTC is about, we can begin with the committee’s own manifesto:

The ultimate goal of this committee is to promote runtimes supporting a comprehensive unified API surface that JavaScript developers can rely on, regardless of whether their code will be used in browsers, servers, or edge runtimes.

What is notable here is that it was only very recently that the JavaScript server-side needed unification. For over a decade, this space was just Node. Nowadays, we have a growing abundance of runtime options for JavaScript and TypeScript; options include Node, Deno, Bun, Cloudflare Workers, serverless platforms like Vercel and Netlify, and cloud environments like AWS’s LLRT. While this variety indicates a healthy response to the demands of modern web development, it also leads to fragmentation. As developers, we may find ourselves managing constant mental friction: forced to worry about the where rather than the what.

Also see: The complete guide to Node.js frameworks.

WinterTC proposes to smooth out these hard edges by creating a baseline of guaranteed API surface across all JavaScript runtimes. It’s a project whose time has come.

Ecma TC55: The committee for interoperable web runtimes

WinterTC isn’t just a hopeful suggestion; it’s an official standard that any runtime worth its salt will need to satisfy. WinterTC (officially Ecma TC55) is a technical committee dedicated to interoperable web runtimes. It sits alongside TC39, the committee that standardizes JavaScript itself.

WinterTC is a kind of peace treaty between the major players in the web runtimes space—Cloudflare, Vercel, Deno, and the Node.js core team.

The main insight of TC55, which underpins the solutions it seeks, is simple: The browser is the baseline.

Instead of inventing new server-side standards, like a new way to handle HTTP requests, WinterTC mandates that servers adopt browser standards (an approach that successful APIs like fetch had already driven into de facto standards). It creates a kind of universal standard library for JavaScript that exists outside the browser but provides the same services.

The convergence

To understand what this new standardization means for developers, we can look at the code. For a long time, server-side and client-side code relied on different dialects:

  • Browser: fetch for networking, EventTarget for events, and web streams.
  • Node: http.request, EventEmitter, and Node streams.

The server has gradually absorbed the browser way, and is now standardized by WinterTC:

  • fetch: The universal networking primitive is now standard on the back end.
  • Request / Response: These standard HTTP objects (originally from the Service Worker API) now power server frameworks.
  • Global objects: TextEncoder, URL, Blob, and setTimeout work identically everywhere.

This convergence ultimately leads to the realization of the “isomorphic JavaScript” promise. Isomorphic, meaning the server and client mirror each other. You can now write a validation function using standard URL and Blob APIs and run the exact same file on the client (for UI feedback) and the server (for hard security).
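
A tiny, hypothetical example of that promise: the validator below uses only the standard URL API, so the same file can run in the browser for instant UI feedback and on any standards-following server runtime for enforcement.

// Runs unchanged in browsers, Node, Deno, Bun, and edge workers.
export function isAllowedImageUrl(input: string): boolean {
  let url: URL;
  try {
    url = new URL(input); // URL is a standard global everywhere
  } catch {
    return false;
  }
  return url.protocol === "https:" && url.pathname.endsWith(".png");
}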

I thought isomorphic JavaScript was on the horizon when Node came out, and I was not alone. Better late than never.

The new server battlefields

When every runtime is trending toward supporting the same APIs, how do they continue to distinguish themselves? If code is really portable, the runtimes can no longer compete on API availability (or even worse, on API lock-in). Instead, much like web frameworks, they must compete on the basis of developer experience.

We are seeing distinctive profiles emerge for each runtime:

  • Bun (tooling + speed): Bun isn’t just a runtime; it’s an all-in-one bundler, test runner, and package manager. Its other selling point is raw speed.
  • Deno (security + enterprise): Deno focuses on security (with its opt-in permission system) and a “zero-config” developer experience. It has found a strong niche powering the so-called Enterprise edge. It also has the Deno Fresh framework.
  • Node (familiarity + stability): Node’s asset is its massive legacy ecosystem, reliability, and sheer familiarity. It is catching up by adopting WinterTC standards, but its primary value proposition is boring reliability—a feature that holds considerable weight in the development world.

The cloud operating system

WinterTC also has implications for the deployment landscape. In the past, you chose an operating system; today, you choose a platform.

Platforms like Vercel and Netlify are gradually becoming a new OS layer. WinterTC acts as the POSIX for this emerging cloud OS. Just as POSIX allowed C code to run on Linux, macOS, and Unix, WinterTC allows JavaScript code to run on Vercel, Netlify, and Cloudflare without much finagling.

However, developers should be wary of the new lock-in. Platforms can’t really lock you in with the language anymore (WinterTC makes it easier to swap deployment engines), but they can still trap you with data. Services like Vercel KV, Netlify Blobs, or Cloudflare D1 offer incredible convenience, but they are proprietary. Your compute might be portable, but your state is not. Not that this is anything new—databases, especially managed ones, are inherently a point of lock-in.

The poster child: Hono

If you want to see the standardized server in action today, look no further than Hono. Hono is the Express.js of the WinterTC world. It’s a lightweight web framework that runs natively on Node, Deno, Bun, Cloudflare Workers, and Fastly, or even straight in the browser.

It’s important to note that, while Hono has similarities to Express, it does not use the familiar Express req and res objects. Express objects are wrappers around Node-specific primitives such as the IncomingMessage stream; they are mutable and closely tied to the Node runtime. Hono objects, by contrast, are the standard Fetch API Request and Response objects. They are immutable and universal. Because it is built on these standards, a Hono router looks familiar to anyone who has used Express, but it is far more portable:

import { Hono } from 'hono'
const app = new Hono()

app.get('/', (c) => {
  return c.text('Hello InfoWorld!')
})

export default app

You could deploy this code to a $5 DigitalOcean droplet running Node, move it to a global edge network on Cloudflare, or even run it inside a browser service worker to mock a back end, all without changing anything.

The universal adapter: Nitro

While Hono represents the “pure” approach (writing code that natively adheres to standards), as developers, we often need more power and greater abstraction—things like file-system routing, asset handling, and build pipelines. This is where Nitro comes in.

Nitro, which is part of the UnJS ecosystem, is a kind of universal deployment adapter for server-side JavaScript. It is the engine that powers frameworks like Nuxt and Analog, but it also works as a standalone server toolkit.

Nitro gives you a higher-level layer atop WinterTC, adding extra powers while smoothing out some of the quirks that distinguish runtimes. As an example, say you wanted to use a specific Node utility, but you were deploying to Cloudflare Workers. Nitro would automatically detect the target environment and polyfill the missing features or swap them for platform-specific equivalents during the build process.

With Nitro, you can build complex, feature-rich applications today that are ready for the universal, WinterTC-driven future.

Conclusion

By acknowledging the browser as the baseline, we might finally fulfill the promise of “write once, run anywhere.” We’ll finally get our isomorphic JavaScript and drop the mental overhead of context switching. There will always be a distinction between front-end and back-end developers, with the former being involved with view templates and reactive state and the latter touching the business logic, file system, and datastores. But the reality of the full-stack developer is becoming less divisive at the language level.

This movement is part of an overall maturation in the language, web development in general, and the server-side in particular. It feels like the JavaScript server is finally catching up with the browser.


How to choose the best LLM using R and vitals 19 Feb 2026, 1:00 am

Is your generative AI application giving the responses you expect? Are there less expensive large language models—or even free ones you can run locally—that might work well enough for some of your tasks?

Answering questions like these isn’t always easy. Model capabilities seem to change every month. And, unlike conventional computer code, LLMs don’t always give the same answer twice. Running and rerunning tests can be tedious and time consuming.

Fortunately, there are frameworks to help automate LLM tests. These LLM “evals,” as they’re known, are a bit like unit tests on more conventional computer code. But unlike unit tests, evals need to understand that LLMs can answer the same question in different ways, and that more than one response may be correct. In other words, this type of testing often requires the ability to analyze flexible criteria, not simply check if a given response equals a specific value.

The vitals package, based on Python’s Inspect framework, brings automated LLM evals to the R programming language. Vitals was designed to integrate with the ellmer R package, so you can use them together to evaluate prompts, AI applications, and how different LLMs affect both performance and cost. In one case, it helped show that AI agents often ignore information in plots when it goes against their expectations, according to package author Simon Couch, a senior software engineer at Posit. Couch said over email that the experiment, done using a set of vitals evaluations dubbed bluffbench, “really hit home for some folks.”

Couch is also using the package to measure how well different LLMs write R code.

Vitals setup

You can install the vitals package from CRAN or, if you want the development version, from GitHub with pak::pak("tidyverse/vitals"). As of this writing, you’ll need the dev version to access several features used in examples for this article, including a dedicated function for extracting structured data from text.

Vitals uses a Task object to create and run evals. Each task needs three pieces: a dataset, a solver, and a scorer.

Dataset

A vitals dataset is a data frame with information about what you want to test. That data frame needs at least two columns:

  • input: The request you want to send to the LLM.
  • target: How you expect the LLM to respond.

The vitals package includes a sample dataset called are. That data frame has a few more columns, such as id (which is never a bad idea to include in your data), but these are optional.

As Couch told posit::conf attendees a few months ago, one of the easiest ways to create your own input-target pairs for a dataset is to type what you want into a spreadsheet. Set up spreadsheet columns with “input” and “target,” add what you want, then read that spreadsheet into R with a package like googlesheets4 or rio.


Example of a spreadsheet to create a vitals dataset with input and target columns.

Sharon Machlis

Below is the R code for three simple queries I’ll use to test out vitals. The code creates an R data frame directly, if you’d like to copy and paste to follow along. This dataset asks an LLM to write R code for a bar chart, determine the sentiment of some text, and create a haiku.

my_dataset This desktop computer has a better processor and can handle much more demanding tasks such as running LLMs locally. However, it\U{2019}s also noisy and comes with a lot of bloatware.",
    "Write me a haiku about winter"
  ),
  target = c(
    'Example solution: ```library(ggplot2)\r\nlibrary(scales)\r\nsample_data 

Next, I’ll load my libraries and set a logging directory for when I run evals, since the package will suggest you do that as soon as you load it:

library(vitals)
library(ellmer)
vitals_log_dir_set("./logs")

Here’s the start of setting up a new Task with the dataset, although this code will throw an error without the other two required arguments of solver and scorer.

my_task 

If you’d rather use a ready-made example, you can use dataset = are with its seven R tasks.

It can take some effort to come up with good sample targets. The classification example was simple, since I wanted a single-word response, mixed. But other queries can have more free-form responses, such as writing code or summarizing text. Don’t rush through this part—if you want your automated “judge” to grade accurately, it pays to design your acceptable responses carefully.

Solver

The second part of the task, the solver, is the R code that sends your queries to an LLM. For simple queries, you can usually just wrap an ellmer chat object with the vitals generate() function. If your input is more complex, such as needing to call tools, you may need a custom solver. For this part of the demo, I’ll use a standard solver with generate(). Later, we’ll add a second solver with generate_structured().

It helps to be familiar with the ellmer R package when using vitals. Below is an example of using ellmer without the vitals package, with my_dataset$input[1], the first query in my dataset data frame, as my prompt. This code returns an answer to the question but doesn’t evaluate it.

Note: You’ll need an OpenAI key if you want to run this specific code. Or you can change the model (and API key) to any other LLM from a provider ellmer supports. Make sure to store any needed API keys for other providers. For the LLM, I chose OpenAI’s least expensive current model, GPT-5 nano.

my_chat 

You can turn that my_chat ellmer chat object into a vitals solver by wrapping it in the generate() function:

# This code won't run yet without the tasks's third required argument, a scorer
my_task 

The Task object knows to use the input column from your dataset as the question to send to the LLM. If the dataset holds more than one query, generate() handles processing them.

Scorer

Finally, we need a scorer. As the name implies, the scorer grades the result. Vitals has several different types of scorer. Two of them use an LLM to evaluate results, sometimes referred to as “LLM as a judge.” One of vitals’ LLM-as-a-judge options, model_graded_qa(), checks how well the solver answered a question. The other, model_graded_fact(), “determines whether a solver includes a given fact in its response,” according to the documentation. Other scorers look for string patterns, such as detect_exact() and detect_includes().

Some research shows that LLMs can do a decent job in evaluating results. However, like most things involving generative AI, I don’t trust LLM evaluations without human oversight.

Pro tip: If you’re testing a small, less capable model in your eval, you don’t want that model also grading the results. Vitals defaults to using the same LLM you’re testing as the scorer, but you can specify another LLM to be your judge. I usually want a top-tier frontier LLM for my judge unless the scoring is straightforward.

Here’s what the syntax might look like if we were using Claude Sonnet as a model_graded_qa() scorer:

scorer = model_graded_qa(scorer_chat = chat_anthropic(model = "claude-sonnet-4-6"))

Note that this scorer defaults to setting partial credit to FALSE—either the answer is 100% accurate or it’s wrong. However, you can choose to allow partial credit if that makes sense for your task, by adding the argument partial_credit = TRUE:

scorer = model_graded_qa(partial_credit = TRUE, scorer_chat = chat_anthropic(model = "claude-sonnet-4-6"))

I started with Sonnet 4.5 as my scorer, without partial credit. It got one of the gradings wrong, giving a correct score to R code that did most things right for my bar chart but didn’t sort by descending order. I also tried Sonnet 4.6, released just this week, but it also got one of the grades wrong.

Opus 4.6 is more capable than Sonnet, but it’s also about 67% pricier at $5 per million tokens input and $25 per million output. Which model and provider you choose depends in part on how much testing you’re doing, how much you like a specific LLM for understanding your work (Claude has a good reputation for writing R code), and how important it is to accurately evaluate your task. Keep an eye on your usage if cost is an issue. If you’d rather not spend any money following the examples in this tutorial, and you don’t mind using less capable LLMs, check out GitHub Models, which has a free tier. ellmer supports GitHub Models with chat_github(), and you can also see available LLMs by running models_github().


Below, I’ve added model_graded_qa() scoring to my_task, and I also included a name for the task. However, I’d suggest not adding a name to your task if you plan to clone it later to try a different model. Cloned tasks keep their original name, and as of this writing, there’s no way to change that.

my_task 

Now, my task is ready to use.

Run your first vitals task

You execute a vitals task with the task object’s $eval() method:

my_task$eval()

The eval() method launches five separate methods: $solve(), $score(), $measure(), $log(), and $view(). After it finishes running, a built-in log viewer should pop up. Click on the hyperlinked task to see more details:


Details on a task run in vitals’ built-in viewer. You can click each sample for additional info.

Sharon Machlis

“C” means correct and “I” means incorrect; there would also have been a “P” for partially correct if I had allowed partial credit.

If you want to see a log file in that viewer later, you can invoke the viewer again with vitals_view("your_log_directory"). The logs are just JSON files, so you can view them in other ways, too.

You’ll probably want to run an eval multiple times, not just once, to feel more confident that an LLM is reliable and didn’t just get lucky. You can set multiple runs with the epochs argument:

my_task$eval(epochs = 10)

The accuracy of bar chart code on one of my 10-epoch runs was 70%—which may or may not be “good enough.” Another time, that rose to 90%. If you want a true measure of an LLM’s performance, especially when it’s not scoring 100% on every run, you’ll want a good sample size; margin of error can be significant with just a few tests. (For a deep dive into statistical analysis of vitals results, see the package’s analysis vignette.)

It cost about 14 cents to use Sonnet 4.6 as a judge versus 27 cents for Opus 4.6 on 11 total epoch runs of three queries each. (Not all these queries even needed an LLM for evaluation, though, if I were willing to separate the demo into multiple task objects. The sentiment analysis was just looking for “Mixed,” which is simpler scoring.)

The vitals package includes a function that can format the results of a task’s evaluation as a data frame: my_task$get_samples(). If you like this formatting, save the data frame while the task still exists in your R session:

results_df <- my_task$get_samples()

You may also want to save the Task object itself.

If there’s an API glitch while you’re running your input queries, the entire run will fail. If you want to run a test for a lot of epochs, you may want to break it up into smaller groups so as not to risk wasting tokens (and time).

Swap in another LLM

There are several ways to run the same task with a different model. First, create a new chat object with that different model. Here’s the code for checking out Google Gemini 3 Flash Preview:

my_chat_gemini 

Then you can run the task in one of three ways.

1. Clone an existing task and add the chat as its solver with $set_solver():

my_task_gemini 

2. Clone an existing task and add the new chat as a solver when you run it:

my_task_gemini 

3. Create a new task from scratch, which allows you to include a new name:

my_task_gemini 

Make sure you’ve set your API key for each provider you want to test, unless you’re using a platform that doesn’t need them, such as local LLMs with ollama.

View multiple task runs

Once you’ve run multiple tasks with different models, you can use the vitals_bind() function to combine the results:

both_tasks 

Example of combined task results running each LLM with three epochs.

Sharon Machlis

This returns an R data frame with columns for task, id, epoch, score, and metadata. The metadata column contains a data frame in each row with columns for input, target, result, solver_chat, scorer_chat, scorer_metadata, and scorer.

To flatten the input, target, and result columns and make them easier to scan and analyze, I un-nested the metadata column with:

library(tidyr)
both_tasks_wide <- both_tasks |>
  unnest_longer(metadata) |>
  unnest_wider(metadata)

I was then able to run a quick script to cycle through each bar-chart result code and see what it produced:

library(dplyr)

# Some results are surrounded by markdown and that markdown code needs to be removed or the R code won't run
extract_code 
  filter(id == "barchart")

# Loop through each result
for (i in seq_len(nrow(barchart_results))) {
  code_to_run 

Test local LLMs

This is one of my favorite use cases for vitals. Currently, models that fit into my PC’s 12GB of GPU RAM are rather limited. But I’m hopeful that small models will soon be useful for more tasks I’d like to do locally with sensitive data. Vitals makes it easy for me to test new LLMs on some of my specific use cases.

vitals (via ellmer) supports ollama, a popular way of running LLMs locally. To use ollama, download and install the ollama application, then run it either via the desktop app or a terminal window. The syntax is ollama pull to download an LLM, or ollama run to both download it and start a chat if you’d like to make sure the model works on your system. For example: ollama pull ministral-3:14b.

The rollama R package lets you download a local LLM for ollama within R, as long as ollama is running. The syntax is rollama::pull_model("model-name"). For example, rollama::pull_model("ministral-3:14b"). You can test whether R can see ollama running on your system with rollama::ping_ollama().

I also pulled Google’s gemma3-12b and Microsoft’s phi4, then created tasks for each of them with the same dataset I used before. Note that as of this writing, you need the dev version of vitals to handle LLM names that include colons (the next CRAN version after 0.2.0 should handle that, though):

# Create chat objects
ministral_chat 

All three local LLMs nailed the sentiment analysis, and all did poorly on the bar chart. Some code produced bar charts but not with axes flipped and sorted in descending order; other code didn’t work at all.


Results of one run of my dataset with five local LLMs.

Sharon Machlis

R code for the results table above:

library(dplyr)
library(gt)
library(scales)

# Prepare the data
plot_data 
  rename(LLM = task, task = id) |>
  group_by(LLM, task) |>
  summarize(
    pct_correct = mean(score == "C") * 100,
    .groups = "drop"
  )

color_fn 
  tidyr::pivot_wider(names_from = task, values_from = pct_correct) |>
  gt() |>
  tab_header(title = "Percent Correct") |>
  cols_label(`sentiment-analysis` = html("sentiment-<br>analysis")) |>
  data_color(columns = -LLM, fn = color_fn)

It cost me 39 cents for Opus to judge these local LLM runs—not a bad bargain.

Extract structured data from text

Vitals has a special function for extracting structured data from plain text: generate_structured(). It requires both a chat object and a defined data type you want the LLM to return. As of this writing, you need the development version of vitals to use the generate_structured() function.

First, here’s my new dataset to extract topic, speaker name and affiliation, date, and start time from a plain-text description. The more complex version asks the LLM to convert the time zone to Eastern Time from Central European Time:

extract_dataset R Package Development in Positron\r\nThursday, January 15th, 18:00 - 20:00 CET (Rome, Berlin, Paris timezone) \r\nStephen D. Turner is an associate professor of data science at the University of Virginia School of Data Science. Prior to re-joining UVA he was a data scientist in national security and defense consulting, and later at a biotech company (Colossal, the de-extinction company) where he built and deployed scores of R packages.",

    "Extract the workshop topic, speaker name, speaker affiliation, date in 'yyyy-mm-dd' format, and start time in Eastern Time zone in 'hh:mm ET' format from the text below. (TZ is the time zone). Assume the date year makes the most sense given that today's date is February 7, 2026. Return ONLY those entities in the format {topic}, {speaker name}, {date}, {start_time}. Convert the given time to Eastern Time if required. R Package Development in Positron\r\nThursday, January 15th, 18:00 - 20:00 CET (Rome, Berlin, Paris timezone) \r\nStephen D. Turner is an associate professor of data science at the University of Virginia School of Data Science. Prior to re-joining UVA he was a data scientist in national security and defense consulting, and later at a biotech company (Colossal, the de-extinction company) where he built and deployed scores of R packages."
  ),
  target = c(
    "R Package Development in Positron, Stephen D. Turner, University of Virginia (or University of Virginia School of Data Science), 2026-01-15, 18:00. OR R Package Development in Positron, Stephen D. Turner, University of Virginia (or University of Virginia School of Data Science), 2026-01-15, 18:00 CET.",
    "R Package Development in Positron, Stephen D. Turner, University of Virginia (or University of Virginia School of Data Science), 2026-01-15, 12:00 ET."
  )
)

Below is an example of how to define a data structure using ellmer’s type_object() function. Each of the arguments gives the name of a data field and its type (string, integer, and so on). I’m specifying I want to extract a workshop_topic, speaker_name, current_speaker_affiliation, date (as a string), and start_time (also as a string):

my_object 

Next, I’ll use the chat objects I created earlier in a new structured data task, using Sonnet as the judge since grading is straightforward:

my_task_structured 

It cost me 16 cents for Sonnet to judge 15 evaluation runs of two queries and results each.

Here are the results:


How various LLMs fared on extracting structured data from text.

Sharon Machlis

I was surprised that a local model, Gemma, scored 100%. I wanted to see if that was a fluke, so I ran the eval another 17 times for a total of 20. Weirdly, it missed on two of the 20 basic extractions by giving the title as “R Package Development” instead of “R Package Development in Positron,” but scored 100% on the more complex ones. I asked Claude Opus about that, and it said my “easier” task was more ambiguous for a less capable model to understand. Important takeaway: Be as specific as possible in your instructions!

Still, Gemma’s results were good enough on this task for me to consider testing it on some real-world entity extraction tasks. And I wouldn’t have known that without running automated evaluations on multiple local LLMs.

Conclusion

If you’re used to writing code that gives predictable, repeatable responses, a script that generates different answers each time it runs can feel unsettling. While there are no guarantees when it comes to predicting an LLM’s next response, evals can increase your confidence in your code by letting you run structured tests with measurable responses, instead of testing via manual, ad-hoc queries. And, as the model landscape keeps evolving, you can stay current by testing how newer LLMs perform—not on generic benchmarks, but on the tasks that matter most to you.

Learn more about the vitals R package


What happens when you add AI to SAST 19 Feb 2026, 1:00 am

Nearly a year ago, I wrote an article titled “How to pick the right SAST tool.” It was a look at the pros and cons of two different generations of static application security testing (SAST):

  • Traditional SAST (first generation): Deep scans for the best coverage, but creates massive friction due to long run times.
  • Rules-based SAST (second generation): Prioritized developer experience via faster, customizable rules, but coverage was limited to explicitly defined rules.

At that time, these two approaches were really the only options. And to be honest, neither option was all that great. Basically, both generations were created to alert for code weaknesses that have mostly been solved in other ways (i.e., improvements in compilers and frameworks eliminated whole classes of CWEs), and the tools haven’t evolved at the same pace as modern application development. They rely on syntactic pattern matching, occasionally enhanced with intraprocedural taint analysis. But modern applications are much more complex and often use middleware, frameworks, and infrastructure to address risks.

So while responsibility for weaknesses shifted to other parts of the stack (thanks to memory safety, frameworks, and infrastructure), SAST tools spew out false positives (FPs) found at the granular, code level. Whether you’re using first or second generation SAST, 68% to 78% of findings are FPs. That’s a lot of manual triaging by the security team. Worse, today’s code weaknesses are more likely to come from logic flaws, abuse of legitimate features, and contextual misconfigurations. Unfortunately, those aren’t problems a regex-based SAST can meaningfully understand. So in addition to FPs, you also have high rates of false negatives (FNs). And as organizations adopt AI code assistants at high volumes, we can also expect more logic and architecture flaws that SASTs can’t catch.

Can AI solve the SAST problem?

As the security community started adopting AI to solve previously unsolvable/hard problems, an interesting question was repeatedly posed: Can AI help produce a SAST that actually works?

In fact, it can. And so dawned the third generation of SAST:

  • AI SAST (third generation): Uses AI agents and multi-modal analysis to target business logic flaws and achieve extremely high FP reduction.

Let’s be clear! Good quality AI SAST should be more than just a first or second generation tool with a ChatGPT wrapper around it. For the tool to perform well, it needs the context of your code and architecture. But don’t just dump your entire code bases into a large language model (LLM). That will burn tokens and quickly become prohibitively costly at enterprise scale.

When evaluating AI SAST solutions, I suggest looking for a multi-modal analysis that includes a combination of rules, dataflow analysis, and LLM reasoning. This multi-modal approach replicates the same process security teams use manually: read the code, trace the dataflow, reason about business logic.

Rules for syntax

Rules are dead, long live rules!

Deterministic checks (via rules) are still an excellent way to catch specific patterns at a near-zero runtime cost. To use a security truism, a good AI SAST will leverage a defense-in-depth strategy, with the rules identifying obvious security bugs while AI is used later in the flow. For example, a rule can quickly flag the use of an outdated encryption algorithm or the absence of input validation on a critical API endpoint.
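
As a minimal sketch of that deterministic layer, the TypeScript below flags weak-crypto patterns with plain regular expressions before any AI is involved. The rule set and finding shape are illustrative, not taken from any particular product.

// Illustrative rule set; a real research team would tune and test these against real code.
interface Finding { ruleId: string; line: number; snippet: string }

const rules: Array<{ id: string; pattern: RegExp }> = [
  { id: "weak-hash-md5", pattern: /\bmd5\s*\(/i },                  // outdated hash function
  { id: "weak-cipher-des", pattern: /createCipheriv\(\s*['"]des/i }, // outdated cipher
];

function scan(source: string): Finding[] {
  const findings: Finding[] = [];
  source.split("\n").forEach((text, lineIdx) => {
    for (const rule of rules) {
      if (rule.pattern.test(text)) {
        findings.push({ ruleId: rule.id, line: lineIdx + 1, snippet: text.trim() });
      }
    }
  });
  return findings; // near-zero runtime cost, no tokens spent
}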

When looking at AI SAST products, find out where the rules come from:

  • Are they generic linters, or is there a research team tuning them for accuracy?
  • Are the rules tested against real code?
  • Do the findings include detailed context and remediation guidance?
  • Does the tool let you add natural language rules to the system? (This is key because, well, writing rules is no fun.)

All of these points can really benefit AI-triage-at-scale by reducing the tokens needed to parse a code base.

Dataflow analysis

Let’s suppose a rule flags the usage of a vulnerable encryption function in two different places in the code. Finding those weaknesses doesn’t mean they’re true positives. Here’s where dataflow analysis is useful. The AI SAST follows the dataflow across multiple files and functions, looking through the code in the tested source file to perform a taint analysis, tracing input from sources to sinks. The purpose of this step is to remove or deprioritize findings that aren’t exploitable. (It’s a bit like reachability for software composition analysis, or SCA.) And while AI can do this, it’s also beneficial for the tool to have some non-AI way of conducting program analysis to speed things up.
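
The toy TypeScript sketch below illustrates the idea: values from a source propagate through assignments, and a finding is only kept if the taint reaches a sink without passing through a sanitizer. Real engines do this interprocedurally over an intermediate representation; the statement shapes here are invented for illustration.

// Minimal source-to-sink taint propagation over a flat list of statements.
type Stmt =
  | { kind: "source"; target: string }                  // x = req.query.q
  | { kind: "assign"; target: string; from: string }    // y = x
  | { kind: "sanitize"; target: string; from: string }  // z = escape(y)
  | { kind: "sink"; from: string; sinkName: string };   // db.query(z)

function taintedSinks(stmts: Stmt[]): string[] {
  const tainted = new Set<string>();
  const hits: string[] = [];
  for (const s of stmts) {
    if (s.kind === "source") tainted.add(s.target);
    else if (s.kind === "assign" && tainted.has(s.from)) tainted.add(s.target);
    else if (s.kind === "sanitize") tainted.delete(s.target); // cleansed value
    else if (s.kind === "sink" && tainted.has(s.from)) hits.push(s.sinkName);
  }
  return hits;
}

// "db.query" is reported; a path through a sanitizer would not be.
console.log(taintedSinks([
  { kind: "source", target: "q" },
  { kind: "assign", target: "query", from: "q" },
  { kind: "sink", from: "query", sinkName: "db.query" },
]));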

When you’re evaluating AI SAST products to see how they do dataflow analysis, ask:

  • How is the analysis performed? Is it done by AI or with program analysis, or both?
  • Can the tool handle multi-file, multi-function analysis?
  • What evidence is provided to justify whether the code is exploitable?
  • What percentage of false positives is the analysis able to detect?

You should expect the tool to show the path an attacker could take to exploit a weakness within the context of your application, turning hypothetical issues into actionable knowledge. Dataflow analysis is also a good use case for AI agents, so you might expect to see AI at this step.

Reasoning with LLMs

Not long ago, the combination of rules and analysis might have been considered adequate. But it still generates FPs because the tool is just flagging potential vulnerabilities without understanding what other compensating controls might be in place. The culprit is often a SAST tool’s inability to perform cross-file analysis, and unfortunately adding more rules can backfire. That’s because more patterns yield more findings, but without context, many of those findings will be of low quality. And of course, those older tools can’t catch complex logic flaws.

This is where AI SAST can add more value, by telling you if a finding is high-priority. Using AI-based triage, the tool can review findings in the context of the entire code base and any additional metadata, much like a human security expert would, to make final determinations and prioritizations. This final triage step can identify logic flaws, eliminate more FPs, or potentially downgrade the severity of findings based on specific runtime configurations, the relationships between components, or the nuances of business logic.
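
Below is a simplified sketch of what that triage call might look like, assuming an OpenAI-compatible chat completions endpoint and an API key in the environment; real tools add cross-file context, caching, and structured-output validation, and the model name here is only a placeholder.

// Send one finding plus surrounding evidence to an LLM and ask for a verdict.
interface Finding { ruleId: string; file: string; line: number; snippet: string } // same shape as the rules sketch above

async function triage(finding: Finding, context: string): Promise<string> {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-4o-mini", // placeholder model name
      messages: [
        { role: "system", content: "You are a security triage assistant. Answer with TRUE_POSITIVE, FALSE_POSITIVE, or NEEDS_REVIEW and one sentence of reasoning." },
        { role: "user", content: `Finding ${finding.ruleId} at ${finding.file}:${finding.line}\n${finding.snippet}\n\nSurrounding code and dataflow evidence:\n${context}` },
      ],
    }),
  });
  const data = await res.json() as { choices: Array<{ message: { content: string } }> };
  return data.choices[0].message.content;
}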

Some questions to ask the AI SAST vendor include:

  • Does the tool have cross-file awareness?
  • What kinds of files or documentation is the tool prompted to look at?
  • Can the tool detect complex logic flaws?
  • Can the tool learn from engineer feedback?

How will the vendor handle your data?

Finally, before you decide to try an AI SAST, be sure you understand the vendor’s data handling practices. Ask:

  • How is the analysis scoped?
  • Is my code retained?
  • Is my data used for training?
  • What can I opt out of, and how will that impact accuracy?

You might be tempted to say, well I’ll just bring my own model (BYO LLM). That sounds like an easy fix, but maintaining your own LLM requires a massive infrastructure that is neither easy nor cheap. A potential compromise could be bringing your own API key, even with something simple like AZURE_OPENAI_API_KEY=your_azure_openai_api_key.

Is AI SAST for you?

If SAST has become a painful checkbox in your organization, with developers and security engineers alike bemoaning its existence, then definitely look into whether an AI SAST is right for you. As AI coding tools get better in the future, we’ll get to a world where design, architecture, and logic risks are really the only remaining flaws. Someday (perhaps soon), your first-generation or second-generation SAST may no longer detect the risks that are present in your code. AI SAST could well prepare you for that future.

Here’s a quick reference table to think about the pros and cons of each.

Traditional SAST (first generation)

  • TL;DR: Slow but accurate
  • Pros: Best coverage possible
  • Cons: Slow, not ideal for agile workflows, happens very late in the SDLC; limited customization options; coverage comes at the cost of FPs; can’t detect complex business flaws (FNs); requires separate tools or processes

Rules-based SAST (second generation)

  • TL;DR: Fast but noisy
  • Pros: Fast, CI/CD compatible; highly customizable, tailored rules; developer-oriented, seamless integration
  • Cons: Rule-dependent, may require expertise; requires ensuring rules meet specific use cases (e.g., language support); speed comes at the cost of FNs and FPs; can’t detect complex business flaws (FNs)

AI SAST (third generation)

  • TL;DR: Fast and accurate
  • Pros: Detects complex logic flaws (low FNs); understands code context (low FPs); potential for the agents to learn what matters
  • Cons: Must be comfortable with an LLM having access to code

New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.


Enterprise use of open source AI coding is changing the ROI calculation 18 Feb 2026, 7:06 pm

Coders are understandably complaining about AI coding problems, with the technology often delivering what’s become known as “AI slop,” but their concerns are signaling a more strategic issue about how enterprises calculate coding ROI.

The issues, according to IT analysts and consultants, go far beyond vastly faster code production accompanied by the kinds of errors that AI agents generate because they don’t truly understand the human implications of their code.

Even if the resulting code functions properly, which it often doesn’t, it introduces a wide range of corporate risks: legal (copyright, trademark, or patent infringement), cybersecurity (backdoors and inadvertently introduced malware), and accuracy (hallucinations, as well as models trained or fine-tuned on inaccurate data). Some of those issues stem from poorly worded prompts; others occur because the model misinterpreted proper prompts.

This issue was explored this week in a discussion on the Bluesky social media site initiated by user Rémi Verschelde, a French developer living in Copenhagen, who said that he is the project manager and lead maintainer for @godotengine.org, as well as a co-founder of a gaming firm. 

AI slop impacts enterprises

“AI slop PRs [pull requests] are becoming increasingly draining and demoralizing for Godot maintainers,” he said. “We find ourselves having to second guess every PR from new contributors, multiple times per day.” Questions arise about whether the code was written at least in part by a human, and whether the ‘author’ understands the code they’re sending.

He asked, “did they test it? Are the test results made up? Is this code wrong because it was written by AI or is it an honest mistake from an inexperienced human contributor? What do you do when you ask a PR author if they used AI, because you’re suspicious, and they all reply ‘yes, I used it to write the PR description because I’m bad with English’?”  

These problems with AI coding are impacting executives throughout IT, legal, compliance, and cybersecurity. That is mostly because AI is not merely putting out code thousands of times more quickly; the problems associated with AI and open source are increasing even more rapidly.

There are even reports that AI agents are fighting back against open source maintainers. 

These are especially vexing issues for enterprise executives, because many larger companies are trying to move more AI projects to open source to try to avoid problems such as data leaks and unauthorized uses associated with the major hyperscalers. 

The problem isn’t that the code is bad

Vaclav Vincalek, CTO at personalized web vendor Hiswai, said the problem with much vibe coding is not that the code looks bad. Ironically, the problem is that it looks quite good. 

“The biggest risk with AI-generated code isn’t that it’s garbage, it’s that it’s convincing. It compiles, it passes superficial review and it looks professional, but it may embed subtle logic errors, security flaws, or unmaintainable complexity,” Vincalek said. “AI slop isn’t just a quality issue. It’s a long-term ownership issue. Maintainers aren’t reviewing a patch [as much as they are] adopting a liability they may have to support for years.”

Another irony that Vincalek flagged is that some enterprises have been going to open source to avoid the same issues that AI in open source now delivers. 

“Some enterprises think open source is a refuge from hyperscaler AI risk, but AI-generated code is now flowing into open source itself. If you don’t have strong governance, you’re just shifting the risk upstream,” Vincalek said. “AI has lowered the cost of producing code to near zero, but the cost of reviewing and maintaining it hasn’t changed. That imbalance is crushing maintainers.”

Vincalek argued that the fix for this problem is to push back far more on those submitting the AI-generated code. 

“One of the simplest anti-slop mechanisms is forcing contributors to explain the intent behind the code. AI can generate syntax, but it can’t justify design decisions,” Vincalek said. “Projects need AI contribution policies the same way they need licensing policies. If someone can’t explain or maintain what they submit, it doesn’t belong in the codebase.”

One criticism of AI coding has been that the agents do not actually understand how humans function. For example, on a LinkedIn discussion forum, an AWS executive posted about an AI system that was creating a series of registration pages and extrapolating from other examples how those pages should look and function. But it drew the wrong conclusion. From the username, email address, and phone number fields, it learned that if a submitted value already existed in the system, it should require a different input. It then applied that same logic to a field asking for age, and rejected an answer because “user with this age already exists.”

Workflow changes needed

Jason Andersen, principal analyst at Moor Insights & Strategy, said the AI coding problem is not solely with code creation, but in how enterprises handle the process. 

“What AI really needs these days is a change of workflow [to deal with the] increasing amount of crap that you have to inspect. Where we are with AI right now is that one step in a long process happens very fast, but that doesn’t mean the other steps have caught up,” Andersen said. “A 30% increase in coding productivity delivers strains across the entire process. If it doubles, the system would break down. There are pieces of this that are starting to come together, but it’s going to take a lot longer than people think.”

Andersen, who described these coding agents as “robotic toddlers,” said that IT had been demanding accelerated coding, and then chose to embrace AI-accelerated open source. “But now that the Pandora’s Box has been opened,” they are unhappy with the results. 

Andersen compared this to a large marketing department that begs partners for as many sales leads as they can find and then later complains, “all of these leads suck.”

ROI calculations need revamping

Rock Lambros, CEO of security firm RockCyber, added that the ROI calculations need to be completely reconsidered.

“AI-made code is now almost free to produce, but it did nothing to reduce the cost of reviewing it,” he pointed out. “A contributor can generate a 500-line pull request in 90 seconds. Yet a maintainer still needs 2 hours to determine whether it’s sound. That asymmetry is what’s crushing open source teams right now.”

He noted that this isn’t just a code quality problem, it’s a supply chain security risk. “Nobody is paying attention to context rot, the gradual loss of coherence that happens over long AI generation sessions,” he said, noting that an agent might implement proper validation in one file and silently cease to do so in another. In fact, he said, research from UT San Antonio found that roughly 20% of package names in AI-generated code don’t even exist, and “attackers are already squatting [on] those names.”
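
As an illustration of how cheap one such control can be, a review script might simply verify that AI-suggested dependencies are registered at all before the code goes any further. The Python sketch below is one possible check rather than anything the article’s sources prescribe, and the second package name in it is deliberately fictitious.

    import requests

    def package_exists_on_pypi(name: str) -> bool:
        """Return True if the package name is actually registered on PyPI."""
        resp = requests.get(f"https://pypi.org/pypi/{name}/json", timeout=10)
        return resp.status_code == 200

    # Hypothetical imports pulled from an AI-generated pull request.
    for candidate in ["requests", "fastjsonutilz"]:
        if not package_exists_on_pypi(candidate):
            print(f"Flag for review: '{candidate}' is not a registered PyPI package")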

Degradation of trust

Consultant Ken Garnett, founder of Garnett Digital Strategies, said he sees the problem as a degradation of the trust that open source has historically delivered.

“It’s what I’d call a verification collapse. Rémi Verschelde isn’t simply saying ‘the code is bad.’ He’s describing a system in which maintainers can no longer trust the signals they’ve always relied upon,” he said. “That’s a considerably deeper and more consequential problem than low-quality code alone, because it corrodes the trust infrastructure that open-source contribution has always depended on.”

Accumulating risk

Enterprises have scaled AI generation without redesigning the reviewing process to validate it, he noted. “The submission side of the workflow received, essentially, a ten-times speed multiplier. The human review side received nothing,” Garnett said. “The result is exactly what Godot is experiencing: a small, dedicated group of people drowning under a volume of work the system was never structured for them to handle. This is the entirely predictable consequence of accelerating one half of a workflow without touching the other.”

He added: “For enterprise IT leaders, the more uncomfortable question is whether they’ve built any accountability structure around AI-assisted code at all, or whether they’ve simply handed developers a faster instrument and assumed quality would follow. Consequently, what they’re often dealing with now isn’t an AI problem so much as a governance gap that AI has made impossible to ignore.”

Cybersecurity consultant Brian Levine, executive director of FormerGov, succinctly summed up the issue: “AI slop creates a false sense of velocity. You think you’re shipping faster, but you’re actually accumulating risk faster than your team can pay it down.”

GlassFish 8 Java server boosts data access, concurrency 18 Feb 2026, 4:55 pm

The Eclipse Foundation has released the final version of GlassFish 8, an update of its enterprise Java application server. The new release serves as a compatible implementation of the Jakarta EE 11 Java platform and accommodates Jakarta Data repositories for simplifying data access, according to GlassFish development participant OmniFish. Virtual threads support for scalable concurrency also is featured.

Released February 5, the final version of GlassFish 8 can be downloaded from glassfish.org. The previous milestone build of GlassFish 8 arrived in December 2025, OmniFish said.

With Jakarta Data repositories support, developers can work with both JPA (Java Persistence API) entities and JNoSQL databases using a consistent, repository-pattern-based approach, said Ondre Mihalyi, OmniFish co-founder and engineer. Key benefits of this feature include reduced boilerplate code, flexible repository organization, and flexible pagination. Support for both Jakarta Persistence entities and Jakarta NoSQL entities in Jakarta Data repositories is featured in GlassFish 8, according to release notes.

In addition, GlassFish 8 embraces the future of concurrency in Java with support for virtual threads in its HTTP thread pools and managed executors, Mihalyi said. Virtual threads support enables the server to handle a massive number of concurrent requests with minimal overhead, leading to significant improvements in scalability and performance for I/O-bound applications. Virtual threads represent a paradigm shift in how to think about concurrent programming, enabling developers to write simpler, more maintainable code that scales effortlessly, Mihalyi added.

Other GlassFish 8 highlights:

  • A new version of Jakarta Security provides more flexible authentication options, including integration between MicroProfile JWT and Jakarta Security.
  • Developers can secure REST endpoints with JWT (JSON Web Token) while using other Jakarta Security mechanisms to protect UI pages, providing a comprehensive security solution that adapts to diverse application architectures.
  • Monitoring via JMX (Java Management Extensions) is supported in Embedded Eclipse GlassFish.

GitHub readies agents to automate repository maintenance 18 Feb 2026, 5:04 am

GitHub is readying a new feature to automate some of the most expensive work in DevOps: the invisible housekeeping no one wants to own. Developers would rather be building features than debugging flaky continuous integration (CI) pipelines, triaging low-quality issues, updating outdated documentation, or closing persistent gaps in test coverage.

In order to help developers and enterprises manage the operational drag of maintaining repositories, GitHub is previewing Agentic Workflows, a new feature that uses AI to automate most routine tasks associated with repository hygiene.

It won’t solve maintenance problems all by itself, though.

Developers will still have to describe the automation workflows in natural language that agents can follow, storing the instructions as Markdown files in the repo created either from the terminal via the GitHub CLI or inside an editor such as Visual Studio Code.

Then, they’ll have to connect up whichever large language model (LLM) and vibe coding tool they want the agent to use — available options include GitHub Copilot, Claude, or OpenAI Codex — and set guard rails defining what the agent is allowed to read, what it can propose, and which events (issues, pull requests, scheduled runs) should trigger it.

Once committed, the workflows execute on GitHub Actions like any other automation, with the agents’ decisions and proposed changes surfacing as issue comments, pull requests, and CI logs for developers to review.

These automated workflows should reduce the cognitive tax around maintenance work for developers, GitHub executives wrote in a blog post about GitHub Agentic Workflows.

Can productivity gains scale?

Analysts see immediate productivity gains for developers as well as engineering heads, especially in the form of fewer stalled builds, faster root-cause analysis, and cleaner repositories that quietly improve delivery velocity with the same headcount.

“Mid-sized engineering teams gain immediate productivity benefits because they struggle most with repetitive maintenance work like triage and documentation drift,” said Dion Hinchcliffe, VP of the CIO practice at The Futurum Group.

Another driver of productivity is Agentic Workflows’ use of intent-based Markdown instead of YAML, which makes authoring faster for developers, Hinchcliffe added.

However, Advait Patel, a senior site reliability engineer at Broadcom, cautioned that though intent-based Markdown makes authoring workflows quicker, it can also reduce precision: “YAML is annoying, but it is explicit. Natural language can be interpreted differently by different models or versions.”

Similarly, Hinchcliffe pointed out that there is also a risk of these workflows generating excessive low-value pull requests (PRs) or issue noise, especially if they are unattended or unmanaged.

Compounding compute cost

Patel also warned that beyond precision and signal-to-noise concerns, there is a more prosaic risk teams may underestimate at first: As agentic workflows scale across repositories and run more frequently, the underlying compute and model-inference costs can quietly compound, turning what looks like a productivity boost into a growing operational line item if left unchecked.

This can become a boardroom issue for engineering heads and CIOs because they must justify return on investment, especially at a time when they are grappling with what it really means to let software agents operate inside production workflows, Patel added.

In addition, Shelly DeMotte Kramer, principal analyst at Kramer & Company, warned that GitHub’s approach could also deepen platform dependence for both developers and CIOs, effectively nudging teams toward tighter lock-in with Agentic Workflows.

“By embedding agents natively into GitHub Actions rather than bolting them on externally they’re creating switching costs that go beyond tooling familiarity. This is, and will be challenging, and create a bit of a lock-in situation, as you can’t easily port a Markdown-based agentic workflow to GitLab because the execution engine, permissions model, and safe outputs architecture are GitHub-native,” Kramer said.

A push for greater control

That strategic play, Kramer added, reflects GitHub’s push to exert greater control over developer workflows, betting that ownership of the automation layer of the software development lifecycle will shape how software teams operate, giving it an edge over rivals.

However, the analyst expects rivals like GitLab and Atlassian to respond soon with similar offerings: “The interesting question is whether they build native agentic runtimes or simply become MCP-compatible surfaces that third-party agents can drive.”

Given that MCP just moved to the Linux Foundation, that second path may “actually accelerate faster than GitHub’s proprietary approach”, Kramer added.

Analysts also see issues around security, especially in the context of regulated industries, despite Agentic Workflows offering security capabilities such as least privilege and sandboxed execution.

“To begin with, GitHub describes network isolation but doesn’t specify whether workflow execution environments are FedRAMP-authorized, which is critical for US government-related work, or whether audit logs meet HIPAA’s required retention and access control standards, which is critical for healthcare in the US,” Kramer said.

Security concerns

GitHub also does not specify whether the agent’s access to repository content, including potentially sensitive code, secrets, or customer data embedded in repos, is governed by data residency requirements, Kramer added.

“For financial services, a full lineage layer is needed, not just a ‘this workflow created this PR’ but a complete record of every API call the agent made, every file it read, and every decision it made. These are all things that need to be addressed,” the analyst noted further.

Although GitHub leaves it to developers and individual teams to decide what automation to write in Agentic Workflows and how far to take it, including planning for autonomous CI/CD, analysts suggest enterprises treat the technical preview as a controlled testing window to evaluate whether the new feature can be absorbed into production environments without breaking governance, security, or cost discipline.

“For CIOs, this is the learning phase: Establish controlled pilots in non-critical repositories, develop governance patterns early, and prepare for broader adoption once auditability and operational predictability stabilize,” Hinchcliffe said.

To rein in costs and quantify ROI, CIOs can set budget caps, tier LLM choices, and closely track ‘run’ frequency and AI request volumes, then benchmark those costs against reclaimed developer time and reduced operational delays, Hinchcliffe added.
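
As one rough way to put numbers behind that ‘run’ frequency, a team could poll the standard GitHub Actions REST API, which lists workflow runs per repository. The sketch below assumes a personal access token in a GITHUB_TOKEN environment variable and a placeholder my-org/my-repo repository; it illustrates the kind of tracking Hinchcliffe describes rather than any feature of Agentic Workflows itself.

    import os
    import requests

    # Count recent workflow runs for one repository; "my-org/my-repo" is a
    # placeholder. Agentic workflows execute on GitHub Actions, so their runs
    # appear in the same listing as any other workflow.
    OWNER, REPO = "my-org", "my-repo"
    resp = requests.get(
        f"https://api.github.com/repos/{OWNER}/{REPO}/actions/runs",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        params={"per_page": 100},
    )
    resp.raise_for_status()
    runs = resp.json()["workflow_runs"]
    print(f"{len(runs)} workflow runs in the most recent page of results")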

For developers, analysts said, this could mark a shift in culture and performance metrics.

“Developer culture will evolve toward supervising automation rather than executing routine tasks. This may shift developers toward architecture, design decisions, and higher-value problem solving,” Hinchcliffe said.

“Team structures will place greater emphasis on platform engineering and automation stewardship, while performance metrics move away from activity measures toward outcomes such as cycle time, reliability, and engineering effectiveness per developer,” the analyst added.

Flaws in four popular VS Code extensions left 128 million installs open to attack 18 Feb 2026, 4:37 am

Critical and high-severity vulnerabilities were found in four widely used Visual Studio Code extensions with a combined 128 million downloads, exposing developers to file theft, remote code execution, and local network reconnaissance.

Application security company OX Security published the findings this week, saying it had begun notifying vendors in June 2025 but received no response from three of the four maintainers.

Three CVEs, CVE-2025-65717, CVE-2025-65715, and CVE-2025-65716, were formally assigned and published on February 16.

VS Code extensions are add-ons that expand the functionality of Microsoft’s widely used code editor, adding capabilities such as language support, debugging tools, live preview, and code execution. They run with broad access to local files, terminals, and network resources, which is what made these vulnerabilities consequential.

Unlike the rogue extensions that threat actors have repeatedly planted in the VS Code marketplace, these flaws resided in legitimate, widely installed tools, meaning developers had no reason to suspect them, OX Security said in an advisory.

“Our research demonstrates that a hacker needs only one malicious extension, or a single vulnerability within one extension, to perform lateral movement and compromise entire organizations,” the advisory added.

The vulnerabilities also affected Cursor and Windsurf, the AI-powered IDEs built on VS Code’s extension infrastructure.

OX Security published individual advisories for each flaw, detailing how each could be exploited and what an attacker could achieve.

How the attacks worked

The most severe flaw, CVE-2025-65717 (critical), was in Live Server, a 72-million-download extension that launches a local HTTP server for real-time browser previews. OX Security found the server was reachable from any web page a developer visited while it was running, not just their own browser.

“Attackers only need to send a malicious link to the victim while Live Server is running in the background,” OX Security researchers Moshe Siman Tov Bustan and Nir Zadok wrote in an advisory.

CVE-2025-65715 (high severity) affected Code Runner, with 37 million downloads. The extension reads execution commands from a global configuration file, and OX Security found that a single crafted entry was enough to trigger arbitrary code execution, including reverse shells. An attacker could place it by phishing a developer into pasting a malicious snippet, or through a compromised extension that modified the file silently.

CVE-2025-65716 (CVSS 8.8) affected Markdown Preview Enhanced, with 8.5 million downloads. Simply opening an untrusted Markdown file was enough to trigger it. “A malicious Markdown file could trigger scripts or embedded content that collects information about open ports on the victim’s machine,” the researchers noted.

Microsoft quietly patched its own extension

The fourth vulnerability played out differently. Microsoft’s Live Preview extension, with 11 million downloads, contained a cross-site scripting flaw that, according to OX Security, let a malicious web page enumerate files in the root of a developer’s machine and exfiltrate credentials, access keys, and other secrets.

The researchers reported the issue to Microsoft on August 7. Microsoft initially rated it as low severity, citing required user interaction.

“However, on September 11, 2025 — without notifying us — Microsoft quietly released a patch addressing the XSS security issues we reported. We only recently discovered that this patch had been deployed,” the researchers added.

No CVE was assigned to this vulnerability. “Users with Live Preview installed should update to version 0.4.16 or later immediately,” the researchers suggested.

Microsoft did not immediately respond to a request for comment.

Taken together, the four flaws pointed to a broader problem with how developer tools are secured and maintained.

What security teams should do

“These vulnerabilities confirm that IDEs are the weakest link in an organization’s supply chain security,” the researchers at OX Security said in the advisory.

Developer workstations routinely hold API keys, cloud credentials, database connection strings, and SSH keys. OX Security warned that a successful exfiltration from a single machine could give an attacker access to an organization’s broader infrastructure and that the risks extended to lateral movement and full system takeover.

The researchers advised developers to disable extensions not actively in use and avoid browsing untrusted sites while localhost servers are running. They also cautioned against applying configuration snippets from unverified sources to VS Code’s global settings.

Mistral AI deepens compute ambitions with Koyeb acquisition 18 Feb 2026, 3:50 am

Mistral AI has acquired Paris-based cloud startup Koyeb, marking the model-maker’s first acquisition and entry into the enterprise infrastructure market.

This suggests a strategic shift for the French company, which has built its reputation on frontier models but is now investing heavily in compute capabilities and expanded deployment options.

The acquisition folds Koyeb’s serverless deployment platform into Mistral Compute, the company’s AI cloud offering launched last year, as Mistral shapes up to be a sovereign European alternative for enterprises running AI workloads at scale. Mistral has been betting on its “open weight” large language models as a point of differentiation. In a recent interview with Bloomberg, Mistral CEO Arthur Mensch said Europe is betting “actively and heavily” on open source.

Mistral recently pledged to invest 1.2 billion euros in AI data center infrastructure in Sweden, underscoring its broader push into compute and digital infrastructure.

In a LinkedIn post, the company said the move “strengthens our Compute capabilities and accelerates our mission to build a full-stack AI champion.”

The move also signals a wider market trend of model providers racing to control more of the stack, from infrastructure and inference to deployment and optimization, to lock in enterprise customers and capture higher margins.

For enterprise IT leaders, the question is whether this marks the emergence of a viable alternative to US cloud giants for AI workloads, or simply a tighter vertical integration play aimed at improving margins and performance.

Full-stack AI push

Analysts say the acquisition reflects a deliberate shift toward vertical integration, with Mistral seeking greater control over key layers of the AI stack, from infrastructure and middleware to models. That positioning brings the company closer to what some of them describe as an “AI hyperscaler,” though with a narrower focus.

“Mistral gets a step-up in its progress toward full-stack capabilities,” said Prabhu Ram, VP of the industry research group at Cybermedia Research. “The Koyeb acquisition bolsters Mistral Compute, enabling better on-premises deployments, GPU optimization, and AI inference scaling. Koyeb elevates Mistral’s hybrid support, appealing to regulated US and European enterprises.”

For enterprise buyers, hybrid and on-premises flexibility is increasingly important, particularly in regulated sectors where data residency and latency requirements limit full reliance on public cloud providers.

Still, analysts caution that Mistral remains more specialized than general-purpose cloud providers such as Microsoft, Google, or Amazon Web Services. Its infrastructure footprint and capital expenditure profile are significantly smaller, shaping how it competes.

“Mistral AI’s modest CAPEX compared with the big AI hyperscalers makes Koyeb’s acquisition important, as it adds the capability to offer more efficient and cost-effective inference scaling for enterprises focused on specialized AI tasks,” said Neil Shah, VP for research at Counterpoint Research. “Whether Mistral AI can expand this capability to compete with general-purpose AI inference from hyperscale providers across enterprise and consumer markets seems unlikely at this point.”

Shah added that Mistral’s European roots position it strongly in sovereign AI deployments for enterprises and public sector organizations, where serverless architecture and localized control can be differentiators.

At the same time, structural challenges also remain. Ram noted that ecosystem maturity, GPU access, execution depth, and cost efficiency are still areas where Mistral trails larger hyperscalers. For CIOs evaluating long-term AI infrastructure bets, those factors may weigh as heavily as model performance.

Let a million apps bloom 18 Feb 2026, 1:00 am

Remember the good old days when we had “Internet time”? It was back in the late 1990s, when the dot-coms moved so fast that building businesses and fortunes would take only months instead of years.

Seems so quaint now.

I think we are now living in AI time. It seems like today, things that used to take months are now taking weeks—or even days. Shoot, you might even say hours.

My head is spinning with how much change has happened in—literally—the last few weeks. It almost feels like we woke up one morning and just stopped writing code. And of course, this has everyone freaking out. Developers are worried about massive layoffs and what junior developers are going to do, and if there will even be any junior developers at all.

It all seemed to culminate in this article by Matt Shumer, “Something Big is Happening,” which went viral last week. (It’s not often that an article on the .dev domain makes it to the top of the Drudge Report.)

Shumer tapped into those fears. And he’s right to. The fear isn’t irrational—it’s the natural response to watching a job that you’ve spent years mastering get handed off to a machine. That’s a real loss, even if what comes next turns out to be better.

Last month I wrote about how code is no longer the bottleneck. I pointed out how choosing what to build will become even harder when we can build 20 things instead of just two.

If you are hesitant to let AI write your code for you because it makes mistakes or doesn’t write code exactly the way you like it written, consider this: The same is true of any other developer on your team. No other person will write code that is mistake-free and architected exactly like you want things done. You don’t hesitate to delegate tasks to fellow developers, so why do you hesitate to pass a task to a tireless coder who builds things hundreds(!) of times faster than any human?

And it’s the “hundreds of times” faster that has everyone nervous. And they have a right to be. Many companies—Salesforce, Amazon, Microsoft, and more—are laying off people, citing automation and AI efficiencies among the reasons. Things are starting to change, and there are many people whose lives will be affected. There is reason for concern.

But there is reason for optimism in the turmoil. Garry Tan, the CEO of YCombinator, recently said, “Our fear of the future is directly proportional to how small our ambitions are.” I don’t want to minimize the trauma of getting laid off, but the thing that is causing people to be laid off is also the thing that will cause an explosion of new ideas. And those new ideas will lead to new jobs.

For a while now, I’ve been keeping a list of silly, goofy, and maybe brilliant ideas for apps and websites. Some are simple, and some are ambitious, but every one of them required more time to implement than I had when the idea struck. Now? I built one of them this past weekend. Actually, I built it on Sunday afternoon.

Now, I’m not saying I’m ready to quit my day job, but imagine if the barrier to implementing an idea goes from “It will take me six months of evenings and weekends to build this” to “It will be done before I can cook dinner.” Lots of people out there are going to have an idea that will enable them to quit their day job and hire people to make the business a reality. Instead of 10 new digital companies starting up every week, will there be 100, or 1,000? The opportunity seems endless.

And not only will we have new job titles and new things to do, but we’ll be creating things that no one has thought up yet. “Let a hundred flowers bloom” will turn into “Let a million flowers bloom.”

Now that we are moving in AI time, ideas won’t wait for permission anymore. They can bloom almost the moment you think of them.

What is Docker? The spark for the container revolution 18 Feb 2026, 1:00 am

Docker is a software platform for building applications based on containers—small and lightweight execution environments that make shared use of the operating system kernel but otherwise run in isolation from one another. While containers have been used in Linux and Unix systems for some time, Docker, an open source project launched in 2013, helped popularize the technology by making it easier than ever for developers to package their software to “build once and run anywhere.”

A brief history of Docker

Founded as DotCloud in 2008 by Solomon Hykes in Paris, what we now know as Docker started out as a platform as a service (PaaS) before pivoting in 2013 to focus on democratizing the underlying software containers its platform was running on.

Hykes first demoed Docker at PyCon in March 2013, explaining that Docker was created because developers kept asking for the underlying technology powering the DotCloud platform. “We did always think it would be cool to be able to say, ‘Yes, here is our low-level piece. Now you can do Linux containers with us and go do whatever you want, go build your platform.’ So that’s what we are doing.”

And so, Docker was born, with the open source project quickly picking up traction with developers and attracting the attention of high-profile technology providers like Microsoft, IBM, and Red Hat, as well as venture capitalists willing to pump millions of dollars into the innovative startup. The container revolution had begun.

What are containers?

As Hykes described it in his PyCon talk, containers are “self-contained units of software you can deliver from a server over there to a server over there, from your laptop to EC2 to a bare-metal giant server, and it will run in the same way because it is isolated at the process level and has its own file system.”

The components for doing this have long existed in operating systems like Linux. By simplifying their use and giving these bits a common interface, Docker quickly became close to a de facto industry standard for containers. Docker let developers deploy, replicate, move, and back up a workload in a single, streamlined way, using a set of reusable images to make workloads more portable and flexible than previously possible.

Also see: Why you should use Docker and OCI containers.

In the virtual machine (VM) world, something similar could be achieved by keeping applications separate while running on the same hardware. But each VM requires its own operating system, meaning VMs are typically large, slow to start up, difficult to move around, and cumbersome to maintain and upgrade.

Containers represent a decisive shift from the VM era, in that they isolate execution environments while sharing the underlying OS kernel. As a result, they are speedier and far more lightweight than VMs.

Virtual machines vs. containers: stacking up the virtualization and container infrastructure stacks.

Docker: The component parts

Docker took off with software developers as a novel way to package the tools required to build and launch a container. It was more streamlined and simplified than anything previously possible. Broken down into its component parts, Docker consists of the following:

  • Dockerfile: Each Docker container starts with a Dockerfile. This text file provides a set of instructions to build a Docker image, including the operating system, languages, environmental variables, file locations, network ports, and any other components it needs to run. Provide someone with a Dockerfile and they can recreate the Docker image wherever they please, although the build process takes time and system resources.
  • Docker image: Like a snapshot in the VM world, a Docker image is a portable, read-only executable file. It contains the instructions for creating a container and the specifications for which software components to run and how the container will run them. Docker images are far larger than Dockerfiles but require no build step: They can boot and run as-is.
  • Docker run utility: Docker’s run utility is the command that launches a container. Each container is an instance of an image, and multiple instances of the same image can be run simultaneously.
  • Docker Hub: Docker Hub is a repository where container images can be stored, shared, and managed. Think of it as Docker’s own version of GitHub, but specifically for containers.
  • Docker Engine: Docker Engine is the core of Docker. It is the underlying client-server technology that creates and runs the containers. The Docker Engine includes a long-running daemon process called dockerd for managing containers, APIs that allow programs to communicate with the Docker daemon, and a command-line interface.
  • Docker Compose: Docker Compose is a command-line tool that uses YAML files to define and run multicontainer Docker applications. It allows you to create, start, stop, and rebuild all the services from your configuration and view the status and log output of all running services.
  • Docker Desktop: All of these component parts are wrapped in Docker’s Desktop application, providing a user-friendly way to build and share containerized applications and microservices.
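
To see how these pieces fit together in practice, here is a minimal sketch using the Docker SDK for Python (the docker package). It assumes Docker Engine is running locally and that the current directory holds a Dockerfile for a hypothetical myapp image; the docker build and docker run CLI commands cover the same ground from a terminal.

    import docker

    # Connect to the local Docker Engine.
    client = docker.from_env()

    # Build an image from the Dockerfile in the current directory
    # ("myapp:latest" is a placeholder tag used for illustration).
    image, build_logs = client.images.build(path=".", tag="myapp:latest")

    # Launch a container instance from that image, mapping container
    # port 8000 to port 8000 on the host.
    container = client.containers.run(
        "myapp:latest",
        detach=True,
        ports={"8000/tcp": 8000},
    )

    print(container.logs())  # stdout/stderr captured from the container so far

    # Stop and remove the instance when finished; the image itself remains.
    container.stop()
    container.remove()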

Advantages of Docker

Docker containers provide a way to build applications that are easier to assemble, maintain, and move around than previous methods allowed. That provides several advantages to software developers:

  • Docker containers are minimalistic and enable portability: Docker helps to keep applications and their environments clean and minimal by isolating them, which allows for more granular control and greater portability.
  • Docker containers enable composability: Containers make it easier for developers to compose the building blocks of an application into a modular unit with easily interchangeable parts, which can speed up development cycles, feature releases, and bug fixes.
  • Docker containers make orchestration and scaling easier: Because containers are lightweight, developers can launch many of them for better scaling of services, and each container instance launches many times faster than a VM. These clusters of containers do then need to be orchestrated, which is where a platform like Kubernetes typically comes in.

Also see: How to succeed with Kubernetes.

Drawbacks of Docker

Containers solve a great many problems, but they don’t solve them all. Common complaints about Docker include the following:

  • Docker containers are not virtual machines: Unlike virtual machines, containers use controlled portions of the host operating system’s resources, which means elements aren’t as strictly isolated as they would be on a VM.
  • Docker containers don’t provide bare-metal speed: Containers are significantly more lightweight and closer to the metal than virtual machines, but they do incur some performance overhead. If your workload requires bare-metal speed, a container will get you close but not all the way there.
  • Docker containers are stateless and immutable: Containers boot and run from an image that describes their contents. That image is immutable by default—once created, it doesn’t change. But a container instance is transient. Once removed from system memory, it’s gone forever. If you want your containers to persist state across sessions, like a virtual machine, you need to design for that persistence.
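
One way to design for that persistence, sticking with the Docker SDK for Python used above, is to attach a named volume so data outlives any single container instance. The pgdata volume name and the Postgres settings below are placeholders for illustration.

    import docker

    client = docker.from_env()

    # Mount the named volume "pgdata" at Postgres's data directory so the
    # database files survive after this container is stopped and removed.
    container = client.containers.run(
        "postgres:16",
        detach=True,
        environment={"POSTGRES_PASSWORD": "example"},
        volumes={"pgdata": {"bind": "/var/lib/postgresql/data", "mode": "rw"}},
    )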

Docker today

Container usage has continued to grow in tandem with cloud-native development, now the dominant model for building and running software. But these days, Docker is only a part of that puzzle.

Docker grew popular because it made it easy to move the code for an application and its dependencies from the developer’s laptop to a server. But the rise of containers led to a shift in the way applications were built—from monolithic stacks to networks of microservices. Soon, many users needed a way to orchestrate and manage groups of containers at scale.

Launched at Google, the Kubernetes open source project quickly emerged as the best way to orchestrate containers, superseding Docker’s own attempts to solve this problem with Docker Swarm (RIP). Amidst increasing funding trouble, Docker eventually sold its enterprise business to Mirantis in 2019, which has since absorbed Docker Enterprise into the Mirantis Kubernetes Engine.

The remains of Docker—which includes the original open source Docker Engine container runtime, Docker Hub image repository, and Docker Desktop application—live on under the leadership of company veteran Scott Johnston, who is looking to reorient the business around its core customer base of software developers.

The Docker Business subscription service, and the revised Docker Desktop product, both reflect those new goals: Docker Business offers tools for managing and rapidly deploying secure Docker instances, and Docker Desktop requires paid usage for organizations with more than $10 million in annual revenue and 250 or more employees. But there’s also the Docker Personal subscription tier, for individuals and companies that fall below those thresholds, so end users still have access to many of Docker’s offerings.

Docker has other offerings suited to the changing times. Docker Hardened Images, available in both free and enterprise tiers, provide application images with smaller attack surfaces and checked software components for better security. And, in step with the AI revolution, the Docker MCP Catalog and Toolkit provide Dockerized versions of tools that give AI applications broader functionality (such as by allowing access to the file system), making it easier to deploy AI apps with less risk to the surrounding environment.

Claude Sonnet 4.6 improves coding skills 17 Feb 2026, 4:25 pm

Anthropic has launched Claude Sonnet 4.6, an update to the company’s hybrid reasoning model that brings improvements in coding consistency and instruction following, Anthropic said.

Introduced February 17, Claude Sonnet 4.6 is a full upgrade of the model’s skills across coding, computer use, long-context reasoning, agent planning, design, and knowledge work, according to Anthropic. The model also features a 1M-token context window in beta.

With Claude Sonnet 4.6, improvements in consistency, instruction following, and other areas have made developers with early access prefer this release to its predecessor, Claude Sonnet 4.5, by a wide margin, according to Anthropic. Early Sonnet 4.6 users are seeing human-level capability in tasks such as navigating a complex spreadsheet or filling out a multi-step web form, before pulling it all together across multiple browser tabs, said Anthropic. Performance that previously would have required an Anthropic Opus-class model—including on real-world, economically viable office tasks—now is available with Sonnet 4.6. The model also shows a major improvement in computer use skills compared to prior Sonnet models, the company said.

Claude Sonnet 4.6 is available on all Claude plans, Claude Cowork, Claude Code, the Anthropic API, and all major cloud platforms. On the Claude Developer Platform, Sonnet 4.6 supports adaptive thinking and extended thinking, as well as context compaction in beta. Context compaction automatically summarizes older context as conversations approach limits, increasing effective context length, according to the company.

The model still lags behind the most skilled humans at using computers, Anthropic said, but the rate of progress means that computer use is much more useful for a range of work tasks, and substantially more capable models are within reach. Developers can get started by using claude-sonnet-4-6 via the Claude API.
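
As a rough sketch of that getting-started path, the following uses the Anthropic Python SDK (the anthropic package) and assumes the package is installed and an ANTHROPIC_API_KEY environment variable is set; the prompt itself is just a placeholder.

    import anthropic

    # The client reads ANTHROPIC_API_KEY from the environment by default.
    client = anthropic.Anthropic()

    # Request a completion from the new model by its API identifier.
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[
            {"role": "user", "content": "Summarize what changed in this release."}
        ],
    )

    # The response content is a list of blocks; print the first text block.
    print(response.content[0].text)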

Anthropic also announced updates to its API. Claude’s web search and fetch tools now automatically write and execute code to filter and process search results, keeping only relevant content in context, improving both response quality and token efficiency. Also, code execution, memory, programmatic tool calling, tool search, and tool use examples now are generally available.

WebMCP API extends web apps to AI agents 17 Feb 2026, 1:16 pm

World Wide Web Consortium (W3C) participants including Google and Microsoft have launched the WebMCP API, a JavaScript interface that allows web applications to provide client-side “tools” to AI agents. The API would enable agents to interact directly with web pages and participate in collaborative workflows with human users within the same web interface.

WebMCP is available for early preview at Google, said Google’s Andre Cipriani Bandarra, developer relations engineer for Chrome and the web, in a February 10 blog post. “WebMCP aims to provide a standard way for exposing structured tools, ensuring AI agents can perform actions on your site with increased speed, reliability, and precision,” Bandarra said.

A draft community group report on WebMCP was published on February 12 by the W3C Web Machine Learning Community Group. The WebMCP API is described in the report as a JavaScript interface that lets web developers expose web application functionality as “tools,” meaning JavaScript functions with natural language descriptions and structured schemas that can be invoked by agents, browser agents, and assistive technologies. Web pages that use WebMCP can be viewed as Model Context Protocol (MCP) servers that implement tools in client-side script instead of on the back end, enabling collaborative workflows where users and agents work together within the same web interface, according to the report. Editors of the report include Khusal Sagar and Dominic Farolino of Google and Brandon Walderman of Microsoft. The specification is neither a W3C standard nor on the W3C Standards Track, the report says.

Bandarra cited use cases including customer support, ecommerce, and travel, in which agents help users fill out customer support tickets, shop for products, and book flights. Bandarra cited two proposed APIs as part of WebMCP that allow browser agents to act on behalf of the user: a declarative API that performs standard actions that can be defined directly in HTML forms and an imperative API that performs complex and more dynamic interactions that require JavaScript execution. “These APIs serve as a bridge, making your website ‘agent-ready’ and enabling more reliable and performant agent workflows compared to raw DOM actuation,” said Bandarra.

Alibaba’s Qwen3.5 targets enterprise agent workflows with expanded multimodal support 17 Feb 2026, 3:11 am

Alibaba has unveiled Qwen3.5, a new multimodal AI model that the company says is intended to serve as a foundation for digital agents capable of advanced reasoning and tool use across applications.

The release reflects the ongoing shift from standalone chatbot deployments toward AI systems that can execute multi-step workflows and operate with minimal human prompting.

In a blog post, Alibaba highlighted gains across selected benchmarks, claiming Qwen3.5 outperformed earlier versions and competing frontier systems such as GPT-5.2, Claude 4.5 Opus, and Gemini 3 Pro.

The company is releasing the open-weight Qwen3.5-397B-A17B model for developers, while a hosted version, Qwen3.5-Plus, will be available through Alibaba Cloud’s Model Studio platform. The hosted version includes built-in tool capabilities and an expanded context window of up to one million tokens, aimed at enterprise developers building more complex, multi-step applications.
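
For developers who want to try the hosted model, a minimal sketch might look like the following. Note that the OpenAI-compatible base URL, the DASHSCOPE_API_KEY variable name, and the qwen3.5-plus model identifier are all assumptions to be checked against Alibaba Cloud’s Model Studio documentation, not details confirmed by the announcement.

    import os
    from openai import OpenAI

    # Endpoint, key variable, and model id below are assumptions; confirm the
    # exact values in Model Studio's documentation for Qwen3.5-Plus.
    client = OpenAI(
        api_key=os.environ["DASHSCOPE_API_KEY"],
        base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    )

    response = client.chat.completions.create(
        model="qwen3.5-plus",
        messages=[{"role": "user", "content": "Summarize this supplier onboarding request."}],
    )

    print(response.choices[0].message.content)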

Alibaba also emphasized expanded multilingual support, increasing coverage from 119 to 201 languages and dialects, a move that could appeal to global enterprises operating across diverse markets.

Enterprise AI implications

The release comes amid intensifying competition within China’s AI market.

Last week, ByteDance introduced Doubao 2.0, an upgrade to its chatbot platform that the company also positioned around agent-style capabilities. DeepSeek, whose rapid global rise last year unsettled US tech investors, is widely expected to release its next-generation model soon.

Analysts say Qwen3.5’s improvements in reasoning and other benchmarks are significant, particularly for enterprise use cases.

“In pilot settings, these features help teams explore new interactions and validate feasibility,” said Tulika Sheel, senior vice president at Kadence International. “But in production environments, enterprises will still require robust performance metrics, reliability guarantees, and governance controls before fully trusting these capabilities.”

Sanchit Vir Gogia, chief analyst at Greyhound Research, pointed out that Qwen3.5 is not simply a stronger language model but a workflow-capable system.

“When those capabilities are combined, the system stops behaving like a conversational assistant and starts behaving like an execution layer,” Gogia said. “That is precisely where opportunity and risk converge.”

CIOs considering adoption would look at how consistently the model performs at scale and how smoothly it fits within established governance and infrastructure frameworks.

If the conditions are met, Qwen3.5’s multimodal and agent-oriented capabilities could improve how enterprises automate support functions and manage information across systems where text, images, and structured data interact.

“The value is most tangible in environments that are structured, repetitive, and measurable,” Gogia said. “For instance, procurement validation, invoice to contract matching, supplier onboarding triage, and similar areas where workflows have volume and defined rules.”

Trust and risks

Analysts suggest the biggest hurdle may not be technological advancement but ecosystem maturity and trust, with security concerns continuing to limit global adoption.

“Qwen3.5 excels in multimodal capabilities and offers extensive model selection, including open model options for easier access and customization,” said Anushree Verma, senior director analyst at Gartner. “However, the main challenge for Qwen is its global adoption, which is limited due to restricted commercial availability, distrust of Chinese‑origin models, and a less mature partner ecosystem outside China.”

Gogia added that the evaluation of Qwen3.5 by a US enterprise cannot be reduced to model performance metrics.

“It must be framed as a durability assessment,” Gogia said. “Can this platform remain viable, compliant, and operationally stable across policy volatility?”

Sheel said that compliance with regional regulations, including data residency mandates and privacy laws, must be assessed before deployment. CIOs must also determine who can access or process enterprise data, and whether contractual safeguards and audit mechanisms align with internal governance standards.

Working with the Windows App Development CLI 17 Feb 2026, 1:00 am

In Vernor Vinge’s science fiction novel A Deepness in the Sky, one of the characters works as a software archaeologist, mining thousands of years of code and libraries to find the solutions to development problems. In that fictional far future, every problem has been solved at least once, often in many ways with different interfaces for different programming languages and processor families.

Today’s software development world isn’t quite that complex, as we only have a few decades of programming to build on. Working with a platform as old as Windows, with all its layers and history and all the possible ways to build code on it, we’re still left to look for the right SDK and API to use in our toolchains. Microsoft’s layers of abstractions try to help, but they can cause confusion and make it hard to migrate code from older frameworks to the state-of-the-art ones for today.

All those layers are one reason why it’s unfashionable to write native Windows applications, as it can be hard to find the right prerequisites for developing with your choice of tools.

Separating software development from the OS

It’s also been hard to get the right SDKs and APIs to reach all your users. Microsoft used to ship new developer bits only with new OS releases, slowing progress and making it hard to get the latest features to users. Things began to change with the shift to rolling Windows releases with Windows 10, in the guise of “Windows as a service.” The intent was not so much that Microsoft controlled your PCs, but that developers would be able to address the widest possible audience with the latest features, allowing Windows developers to give their users the same experience as on iOS or Android.

Microsoft could finally separate the developer platform from the OS, shipping new developer tools and SDKs on their own schedule and in their own separate packages. That process has led to today’s Windows App SDK, a set of cross-language tools that provide native access to key Windows functions and APIs, including devices’ built-in neural processing units. Even with this and the ability for tools like Visual Studio to automatically install dependencies, it can still be hard to put all the necessary pieces together, let alone for other development tools and environments.

With Microsoft long having had the ambition to put its tools where developers are, there’s a need to put the necessary scaffolding and toolchain in place for alternative development environments and languages, especially where it comes to supporting cross-platform development in languages like Dart or Swift.

What’s needed is a way to automate the process of bringing the necessary libraries and environments to your choice of development tools. Windows development is as much about systems programming as it is building graphical user interfaces. Low-level code in Rust or C++ needs the same level of support as a .NET C# application, even if you’re using Vim or Eclipse.

Microsoft rediscovers the command line

A key element of Microsoft’s current developer strategy is its rediscovery of the command line. Spurred on by both the complete reworking of the Windows terminal and by the release of Windows Subsystem for Linux, Microsoft has delivered a whole suite of command-line tools, covering everything from Azure to .NET, as well as the GitHub Copilot agent orchestration tools. By using the same terminal tool inside your programming editor, you’re able to code, manage, and run without losing context and flow.

One of the latest CLI tools works with the Windows App SDK, simplifying the process of creating, building, and publishing Windows applications without using Visual Studio and encompassing most toolchains used for Windows development. It’s an open source project, hosted on GitHub.

Getting the tool on a development PC requires installing it via WinGet. Open a Windows Terminal and run an install command, restarting your terminal to ensure that you’re working with updated environment variables. In addition to the CLI tool, there is an experimental GUI that can quickly add debug identities to existing code.

Bootstrap Windows development with winapp

With winapp installed, you’re ready to put together a Windows development environment with a single command. Running winapp init on a development PC puts you into an interactive environment where a few simple questions take you from an application name to a ready-to-go environment, complete with Windows’ developer mode enabled (if you’re starting on a brand-new PC). If you’ve already set up developer mode and downloaded the SDKs, the tool will use what you have ready—after checking that you have the latest versions.

A handful of minutes later and the SDK will have created all the necessary files for your application, downloaded and installed the required SDKs and libraries, and set up the required manifests and configuration files—even configuring the certificates needed for code signing and application packaging. One downside is that it doesn’t create a new directory for your code. You need to do this first and then run the CLI tool from a terminal running in your new directory.

Winapp CLI commands replace key steps in the software development life cycle, with the intent of simply leaving you to write code. The first step, init, sets up the necessary scaffolding for building a Windows application. If you’re using Node.js, the next step, node create-addon, puts the templates in place to link your JavaScript code to native Windows functions.

Once you’ve written your code, it’s time to go back to winapp to set up the tools for testing and packaging. This includes the ability to generate signing certificates and build the required identity for debugging without packaging your code as MSIX bundles. Finally, it simplifies the process of packaging your code for distribution, either using your own platform or via the Microsoft Store. There’s no need to worry about whether you’re building x64 or Arm code; winapp will work for both.

As the documentation notes, this process takes out between 10 and 12 steps (depending on your choice of platform), turning them into four simple CLI commands. Your code also gains the benefits of integration with the Windows platform, using the Windows App SDK APIs.

Working with winapp and Rust

Usefully Microsoft provides documentation for working with alternate toolchains beyond the familiar .NET and C++. One key development environment being addressed by winapp is Rust development, helping you build memory-safe Windows applications and reducing the risk of security breaches in your code.

Working with Rust requires a basic Rust toolchain, which can be installed with the familiar rustup command. Once Rust is installed on your development PC, create a new Rust application and open a terminal in its directory. With winapp installed, run its init command to add the scaffolding of a Windows application around the default Rust code. However, when prompted to install the Windows App SDKs, choose the option to not install an SDK. You will need to install the Windows crate to provide access to Windows APIs.

Winapp creates the temporary identity used to test your code, as well as the manifest and certificates used to package Rust code as MSIX. From there, you can use your choice of editor to build the application, compile a release with the Rust compiler, and then use winapp to install the developer certificate created when you initialized the application before packaging the resulting binary.

Using winapp in Node.js and Electron

Running the CLI tool in JavaScript environments like Electron or Node.js is slightly different from directly using it in Windows’ own terminal, as you’re working inside another CLI environment. Here you need to install it using npm and run it with npx. However, once you’re inside the winapp CLI environment, you can use the same commands to manage your JavaScript Windows application’s life cycle.

You’re not limited to using the CLI on a Windows PC: You can use it as part of a GitHub runner or inside Azure DevOps, allowing it to become part of your CI/CD pipeline. Microsoft provides actions for both tools to quickly install it in runners and agents, simplifying automated builds and tests.

Tools such as the Windows App SDK’s CLI are increasingly important. We’re now spending most of our time inside our development environments, and having key utilities a few keystrokes away inside a terminal gives us the necessary shortcuts to be more productive. Having it in your development toolchain should save time and let you concentrate on code instead of worrying about all the steps to package and sign your applications.

Why cloud outages are becoming normal 17 Feb 2026, 1:00 am

The Microsoft Azure outage that dragged out for 10 hours in early February serves as another stark reminder that the cloud, for all its promise, is not immune to failure. At precisely 19:46 UTC on February 2, the Azure cloud platform began experiencing cascading issues stemming from an initial misconfiguration of a policy affecting Microsoft-managed storage accounts. This seemingly minor error ballooned outwards, knocking out two of the most critical layers underpinning enterprise cloud success: virtual machine operations and managed identities.

By the time the dust began to settle, more than 10 hours later at 06:05 UTC the next morning, customers across multiple regions were unable to deploy or scale virtual machines. Mission-critical development pipelines ground to a halt, and hundreds of organizations struggled to execute even the simplest tasks on Azure. The ripple effect spread across production systems and workflows central to developer productivity, including CI/CD pipelines that run through Azure DevOps and GitHub Actions. Compounding the issue, managed identity services faltered, especially in the eastern and western United States, disrupting authentication and access to cloud resources across a swath of essential Azure offerings, from Kubernetes clusters to analytics platforms and AI operations.

The after-action report is all too familiar: an initial fix triggers a surge in service traffic, further overwhelming already-struggling platforms. Mitigation efforts, such as scaling up infrastructure or temporarily taking services offline, eventually restore order, but not before damage is done. Disrupted operations lead to lost productivity, delayed deployments, and, perhaps most insidiously, a reinforcement of the growing sense that major cloud outages are simply part of the territory of modern enterprise IT.

As the headlines become more frequent and the incidents themselves start to blur together, we have to ask: Why are these outages becoming a monthly, sometimes even weekly, story? What’s changed in the world of cloud computing to usher in this new era of instability? In my view, several trends are converging to make these outages not only more common but also more disruptive and more challenging to prevent.

Human error creeps in

It’s no secret that the economic realities of cloud computing have shifted. The days of unchecked growth are over. Headcounts no longer expand to keep pace with surging demand. Hyperscalers such as Microsoft, AWS, Google, and others have announced substantial layoffs in recent years, many of which have disproportionately affected operational, support, and engineering teams—the very people responsible for ensuring that platforms run smoothly and errors are caught before they reach production.

The predictable outcome is that when experienced engineers and architects leave, they are often replaced by staff who lack deep institutional knowledge and adequate experience in platform operations, troubleshooting, and crisis response. While capable, these “B Team” employees may not have the skills or knowledge to anticipate how minor changes affect massive, interconnected systems like Azure.

The recent Azure outage resulted from precisely this type of human error, in which a misapplied policy blocked access to storage resources required for VM extension packages. This change was likely rushed or misunderstood by someone unfamiliar with prior issues. The resulting widespread service failures were inevitable. Human errors like this are common and likely to recur given current staffing trends.

Damage is greater than before

Another trend amplifying the impact of these outages is the relative complacency about resilience. For years, organizations have been content to “lift and shift” workloads to the cloud, reaping the benefits of agility and scalability without necessarily investing in the levels of redundancy and disaster recovery that such migrations require.

There is growing cultural acceptance among enterprises that cloud outages are unavoidable and that mitigating their effects should be left to providers. This is both an unrealistic expectation and a dangerous abdication of responsibility. Resilience cannot be entirely outsourced; it must be deliberately built into every aspect of a company’s application architecture and deployment strategy.

However, what I’m seeing in my consulting work, and what many CIOs and CTOs will privately admit, is that resilience is too often an afterthought. The impact of even brief outages on Azure, AWS, or Google Cloud now ricochets far beyond the IT department. Entire revenue streams grind to a halt, and support queues overflow. Customer trust erodes, and recovery costs skyrocket, both financial and reputational. Yet investment in multicloud strategies, hybrid redundancies, and failover contingencies lags behind the pace of risk. We’re paying the price for that oversight, and as cloud adoption deepens, the costs will only increase.

Systems at the breaking point

Hyperscale cloud operations are inherently complex. As these platforms become more successful, they grow larger and more complicated, supporting a wide range of services such as AI, analytics, security, and the Internet of Things. Their layered control planes are interconnected; a single misconfiguration, as happened with Microsoft Azure, can quickly lead to a major disaster.

The size of these environments makes them hard to operate without error. Automated tools help, but each new code change, feature, and integration increases the likelihood of mistakes. As companies move more data and logic to the cloud, even minor disruptions can have significant effects. Providers face pressure to innovate, cut costs, and scale, often sacrificing simplicity to achieve these goals.

Enterprises and vendors must act

As we analyze the recent Azure outage, it’s obvious that change is necessary. Cloud providers must recognize that cost-cutting measures, such as layoffs or reduced investment in platform reliability, will ultimately have consequences. They should focus more on improving training, automating processes, and increasing operational transparency.

Enterprises, for their part, cannot afford to treat outages as inevitable or unavoidable. Investment in architectural resilience, ongoing testing of failover strategies, and diversification across multiple clouds are not just best practices; they’re survival strategies.

The cloud continues to be the engine of innovation, but unless both sides of this partnership raise their game, we’re destined to see these outages repeat like clockwork. Each time, the fallout will spread a little further and cut a little deeper.


Cloud Cloning: A new approach to infrastructure portability 17 Feb 2026, 1:00 am

When it comes to cloud infrastructure portability, the reigning solutions just don’t live up to their promise. Infrastructure as code (IaC) solutions like Terraform shoehorn nuanced infrastructure into too-broad terms. Cloud provider offerings like Azure Migrate, AWS Migration Services, and Google Cloud Migrate generally don’t translate native workloads into competitors’ clouds. And governance tools are often excellent at alerting users to finops and drift issues, but rarely help users actually fix the issues they flag.

I have outlined these issues in a companion article—I invite you to start there. In this article, I provide an overview of an infrastructure replication methodology, Cloud Cloning, which my colleagues and I at FluidCloud have created to address the challenges above. Below, I’ll explain how Cloud Cloning achieves all this by walking you through how our solution works. 

Capturing the full scope of public cloud setups

As a first step toward enabling cloud portability, Cloud Cloning starts with a complete snapshot of the source cloud infrastructure. Cloud Cloning calls the cloud provider’s APIs to scan and capture the complete cloud infrastructure footprint, including the many resources, dependencies, and nuances, such as VPCs, subnets, firewall rules, and IAM (identity and access management) permissions, that the reigning multicloud tools tend to overlook.

To give some context as to what this snapshot achieves, it helps to understand where multicloud tools evolved from. From the beginning, the multicloud field emerged to support migrations from on-prem private clouds up to the public ones. As such, these tools were focused on recreating the two most fundamental primitives of private clouds: VMs and storage. To a large extent, that focus continues today, with leading offerings like AWS Migrate, Azure Migrate, Google Cloud Migrate, Nutanix Move, Veeam, and Zerto focused largely on just these two areas.

The problem is that, when it comes to migrating across public clouds, VMs and storage are only a small part of the picture. Public cloud environments rely on complex architectures with hundreds of services spanning databases, storage buckets, IAM users and permissions, subnets, routing, firewalls, Kubernetes clusters and associated control planes, and a lot more.

By starting with a snapshot of the full infrastructure ecosystem, Cloud Cloning captures a vast amount of the critical cloud elements that conventional tools don’t. In our experience, those other tools tend to capture between 10% and 30% of the source cloud setup, whereas Cloud Cloning captures 60% or more.
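To make the snapshot idea concrete, here is a minimal Python sketch of what such an inventory capture might look like, using AWS and the boto3 SDK purely as an illustration. The resource types, pagination handling, and coverage in our actual scanner are far more extensive, and the function names here are invented for the example.

# A minimal sketch of the snapshot idea, using AWS and boto3 purely as an
# illustration. Pagination and most resource types are omitted for brevity.
import json

import boto3


def snapshot_aws_network_and_iam(region: str = "us-east-1") -> dict:
    """Capture a partial inventory of networking and IAM resources."""
    ec2 = boto3.client("ec2", region_name=region)
    iam = boto3.client("iam")

    return {
        "vpcs": ec2.describe_vpcs()["Vpcs"],
        "subnets": ec2.describe_subnets()["Subnets"],
        "security_groups": ec2.describe_security_groups()["SecurityGroups"],
        "route_tables": ec2.describe_route_tables()["RouteTables"],
        "iam_roles": iam.list_roles()["Roles"],
    }


if __name__ == "__main__":
    inventory = snapshot_aws_network_and_iam()
    # Persist the snapshot so later runs can be diffed against it.
    with open("snapshot.json", "w") as fh:
        json.dump(inventory, fh, default=str, indent=2)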

Translating from Cloud A to Cloud B

Once Cloud Cloning has captured the full picture of the source cloud infrastructure, the next step is mapping that infrastructure onto the target cloud’s services and configuration model. Because each cloud’s API specifications are so radically different, this translation work is no small feat. Consider just a few examples from three key areas:

  • Compute and resource management: Each cloud provider exposes similar compute and networking building blocks, but with different semantics. AWS Auto Scaling Groups, Azure VM Scale Sets, and GCP Instance Groups, for instance, behave differently in how they handle availability, placement, and scaling. The same applies to security: AWS security groups are allow-only; Azure uses ordered allow/deny rules with priorities; and GCP defines separate ingress and egress firewall rules. These and other differences make it difficult to reproduce deployments exactly without re-interpreting the underlying intent into the target cloud, as sketched in the example after this list.
  • Storage and data: Storage and data services are not interchangeable across clouds. Block volumes and file systems differ in performance, snapshot behavior, and consistency guarantees. Meanwhile, managed databases such as AWS RDS, Azure SQL / PostgreSQL, and GCP Cloud SQL share engines but diverge in extensions, limits, backup semantics, and failover models. As a result, storage and data layers often need to be re-architected rather than directly replicated.
  • Identity and access: IAM is one of the least portable layers across clouds. AWS uses policy-driven roles and users; Azure ties permissions to subscriptions and role assignments; and GCP enforces hierarchical IAM with service accounts at multiple levels. As a result, access models and automation workflows rarely translate directly and must be re-thought for each cloud.
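As a concrete illustration of the security-group example in the first bullet, here is a toy Python sketch that re-interprets one AWS ingress rule as a GCP-style ingress firewall rule. The field names are simplified and the mapping is deliberately naive; a real mapping engine must also handle rule priorities, targets, egress defaults, and far more.

# A toy illustration of the "re-interpret the intent" step: translating one
# AWS security group ingress rule (allow-only) into a GCP-style ingress
# firewall rule. Simplified field names; not a production mapping.
def sg_ingress_to_gcp_firewall(sg: dict, vpc_network: str) -> list[dict]:
    rules = []
    for i, perm in enumerate(sg.get("IpPermissions", [])):
        rules.append({
            "name": f"{sg['GroupName'].lower()}-ingress-{i}",
            "network": vpc_network,
            "direction": "INGRESS",
            "allowed": [{
                "IPProtocol": perm.get("IpProtocol", "all"),
                "ports": [f"{perm['FromPort']}-{perm['ToPort']}"]
                if "FromPort" in perm else [],
            }],
            "sourceRanges": [r["CidrIp"] for r in perm.get("IpRanges", [])],
        })
    return rules


example_sg = {
    "GroupName": "web",
    "IpPermissions": [{
        "IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
    }],
}
print(sg_ingress_to_gcp_firewall(example_sg, "projects/demo/global/networks/default"))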

Because the leading cloud migration tools aren’t focused on infrastructure, they’re simply not set up to make these complex translations. As I discuss here, while infrastructure-as-code solutions seem like the perfect cloud-agnostic tool to work around this lack of interoperability, IaC doesn’t really solve this portability problem either. Notably, hyperscaler-native IaC like CloudFormation (AWS) and Bicep (Azure) were developed with particular clouds in mind, and even cloud-agnostic options like Terraform must be custom-tailored to each particular provider. As a result, theoretically cloud-neutral IaC ends up being highly cloud-specific in practice.

As IT professionals well know, the standard path to this translation work is grueling. It involves painstakingly reverse-engineering infrastructure from the source into the target cloud—and often requires separate experts in each cloud to map out the infrastructures together.

Cloud Cloning solves this translation problem—converting configuration fully from one cloud provider into another through its patented cloud mapping technology. Thus, for one representative example, it can start with an active cloud-native AWS infrastructure—using services such as EC2 instances, VPCs, subnets, security groups, Kubernetes, IAM, or databases—and convert that environment into an equivalent one in Azure or GCP. In each case, the workloads retain their core functionality while adapting to the target cloud’s specific APIs, semantics, and guardrails. The result is applications that are truly multicloud, without added engineering overhead.

Importantly, Cloud Cloning reverse‑engineers IaC from the live environment to the target cloud in a way that’s fully automated, eliminating the need for remediation or other manual rework. Given that each VM destination requires its own complex, often opaque network setups and configurations, this automation can be particularly welcome. Cloud Cloning delivers these results as Terraform configurations by default.

Addressing governance needs: Finops and drift

One area where the reigning multicloud tools fall short is governance, especially in terms of finops and drift management. Let’s go through how Cloud Cloning tackles these issues, starting with finops.

How Cloud Cloning handles finops

Today, cloud finops operates on two very separate tracks: optimizations within an individual cloud and migrations from one cloud to another. Neither track serves multicloud financial operations effectively.

Within individual clouds, finops is dominated by cloud-specific tools such as AWS Cost Explorer and Compute Optimizer, Azure Cost Management and Advisor, and Google Cloud Billing and Recommender. These tools are often excellent at optimizing costs within their specific cloud, but they’re not designed to recommend what’s often the biggest cost optimization of them all: migrating elsewhere. That silence is costly, to say the least. In our experience, IT teams can save dramatically—sometimes as much as 50%—by recreating configurations in alternative clouds or even regions.

Once IT teams do decide to migrate, the target clouds’ native tools and support teams do the lion’s share of finops work. Based on prospective customers’ recent billing records, the cloud provider delivers a high-level estimate of what a comparable setup would cost in the prospective target. In theory, that’s information that customers can use to decide on and migrate to an alternative environment. But given the vast complexity outlined above, it’s not enough to know how much a new setup could cost. Teams need to see how to translate their specific architectures into the exact equivalents for the target environment. That’s information that cloud providers don’t share, and that the high-level billing information used as source data simply can’t support.

To fully optimize cloud costs, IT teams need a cloud-agnostic, detailed, and actionable view of the exact clouds and configurations where current functionality would be priced best. The good news is that Cloud Cloning provides this comprehensive view. Using the same translation techniques described earlier, Cloud Cloning allows for precise comparisons of functionality and price across clouds and environments. Plus, Cloud Cloning provides the Terraform code teams can use to automatically implement the new cloud setups they decide to go with.
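To illustrate the kind of comparison this enables, here is a back-of-the-envelope Python sketch. The instance equivalents and hourly prices below are invented placeholders, not real rate cards; Cloud Cloning derives its comparisons from the captured inventory and current pricing.

# A back-of-the-envelope sketch of cross-cloud price comparison.
# The mappings and hourly prices are illustrative placeholders only.
HOURS_PER_MONTH = 730

# Hypothetical target equivalents for two source instance shapes.
EQUIVALENTS = {
    "aws:m5.xlarge":  {"azure": ("D4s_v5", 0.192), "gcp": ("n2-standard-4", 0.194)},
    "aws:r5.2xlarge": {"azure": ("E8s_v5", 0.504), "gcp": ("n2-highmem-8", 0.472)},
}


def monthly_cost_by_target(inventory: dict[str, int]) -> dict[str, float]:
    """inventory maps source instance types to counts, e.g. {"aws:m5.xlarge": 12}."""
    totals: dict[str, float] = {}
    for source_type, count in inventory.items():
        for target, (_, hourly) in EQUIVALENTS[source_type].items():
            totals[target] = totals.get(target, 0.0) + count * hourly * HOURS_PER_MONTH
    return totals


print(monthly_cost_by_target({"aws:m5.xlarge": 12, "aws:r5.2xlarge": 3}))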

In other words, Cloud Cloning takes the siloed and often murky world of cross-cloud finops, and brings it far closer to the “shop and click” comparison it ought to be.

How Cloud Cloning tackles drift

Cloud deployments and migrations involve a massive number of variables. With a dizzying array of dependencies, tools, and architectures in the mix, even the most straightforward deployment can reach a level of complexity far beyond what humans can manage on their own, and each nuance can go wrong in subtle but critical ways. Given that there are really no tools to track all multi-tenancy changes within a single cloud—let alone across multiple cloud providers—keeping track of changes by hand is a losing proposition. Even with the most scrupulous devops hygiene and related best practices in play, multicloud initiatives are often rife with configuration drift that goes wholly unnoticed until something breaks.

IaC solutions such as Terraform don’t solve the drift problem, either. After all, Terraform only works as designed if teams adopt a Terraform‑first workflow—using Terraform as the source of truth from the start, consistently updating and tracking the configuration files, and ensuring the state file accurately reflects the real environment. If teams make changes outside of Terraform or let files and state fall out of sync, Terraform can’t reliably control or predict their infrastructure. Given all the complexity involved, this remains a recipe for drift.

Cloud Cloning tackles the drift challenge using the rich infrastructure snapshots described above. Cloud Cloning takes regular snapshots of the entire infrastructure on a customizable schedule (every 24 hours by default). Then it compares these state captures against current configurations in a detailed infrastructure changelog, flagging and delivering alerts on changes that could be problematic. This covers not only standard, inventory-focused drift issues, but also deviations from cost parameters and security controls.
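Conceptually, the drift check reduces to diffing successive snapshots. Here is a simplified Python sketch of that idea, with resource IDs and configurations invented for illustration; the real changelog also evaluates each change against cost and security policies.

# A simplified sketch of snapshot-based drift detection: flatten two
# inventory snapshots into {resource_id: config} maps and report what was
# added, removed, or changed since the last capture.
def diff_snapshots(previous: dict[str, dict], current: dict[str, dict]) -> dict[str, list]:
    added = [rid for rid in current if rid not in previous]
    removed = [rid for rid in previous if rid not in current]
    changed = [
        rid for rid in current
        if rid in previous and current[rid] != previous[rid]
    ]
    return {"added": added, "removed": removed, "changed": changed}


previous = {
    "sg-web": {"ingress": ["443/tcp from 0.0.0.0/0"]},
    "vm-api-1": {"size": "n2-standard-4"},
}
current = {
    "sg-web": {"ingress": ["443/tcp from 0.0.0.0/0", "22/tcp from 0.0.0.0/0"]},  # drifted rule
    "vm-api-1": {"size": "n2-standard-4"},
    "vm-api-2": {"size": "n2-standard-4"},  # new, untracked resource
}
print(diff_snapshots(previous, current))
# {'added': ['vm-api-2'], 'removed': [], 'changed': ['sg-web']}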

From infrastructure as code to cloud-native migration tools to governance offerings, cloud portability has long suffered from major gaps and too much manual work, all of which has led to a global lock-in crisis. With so many firms looking to diversify their cloud portfolio fast, we need a better solution for cloud infrastructure migration and rearchitecting. With Cloud Cloning, we believe we have provided that solution.


New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.


Why cloud migration needs a new approach 17 Feb 2026, 1:00 am

Multicloud is having a crisis. Gartner predicts that over 50% of multicloud or cross-cloud efforts won’t deliver on expected benefits by 2029, with poor interoperability and fragmentation serving as key culprits.

While these numbers don’t speak to cloud migration directly, from my own experience as an IT leader, plus from my field research in developing FluidCloud, I can tell you that multicloud disappointment and cloud migrations go hand in hand.

What causes all that frustration? To a large extent, you can blame the tools. Infrastructure migration systems propose to deliver end-to-end automation, from initial replication through ongoing governance. In reality, these applications leave huge gaps in the migration process. Instead of multicloud freedom, IT teams face a cascade of unanticipated work, expense, and friction.

The good news: To address the migration challenges, my colleagues and I have developed a new approach to cloud infrastructure portability, Cloud Cloning. In two articles here, I will outline the shortcomings of previous approaches and explain how Cloud Cloning solves the above problems. The goal is to help firms capture the multicloud promise at last.

To start, below I dive into what’s missing from the three main categories of legacy cloud infrastructure migration offerings: cloud-native tools, infrastructure as code, and governance products. In the companion article, I describe how Cloud Cloning answers these legacy issues with a wholly new approach.

Cloud-native migration tools: Built for the source, not the target

If you’re migrating to a hyperscaler, the cloud providers’ own migration offerings—such as Azure Migrate, AWS Migration Services, and Google Cloud Migrate—feel like obvious choices. Built specifically for the cloud you’re moving to, these solutions can provide excellent automation for provisioning and migration with templates, snapshots, managed services and more.

The problem is that tools designed for a specific cloud’s services are also designed to serve only that cloud. These solutions exist to facilitate migration onto the clouds that provide them – ideally (from the cloud provider’s viewpoint) for a long time. They’re not designed to encourage free portability across clouds.

This design-for-stickiness includes guiding customers toward native services (such as AWS CloudFormation, Azure Cosmos DB, or GCP Firebase Authentication) that won’t run correctly elsewhere without significant rewrites to the applications built on them. The solutions also often encourage lock-in pricing—for instance, by recommending a particular infrastructure along with a three-year plan commitment to maximize savings.

To be clear, it’s arguably unfair to ask cloud providers to operate differently. After all, we can’t expect providers to offer capabilities and pricing designed to help customers move off to their competitors. But it’s also true that the customer’s goal is to work with whatever cloud is right for them, at any given time. This puts the customer and cloud provider at cross-purposes when it comes to cloud-agnostic computing—which is why, when it comes to migrations, it’s best for customers to seek out unaligned options.

Infrastructure as code tools: Automate only part of the way

With automated, version-controlled foundations to build from, infrastructure as code (IaC) solutions like Terraform and OpenTofu have earned their spot as crucial cloud migration assets. Their huge popularity is no surprise (Terraform’s AWS provider alone has topped 5.5 billion downloads).

The problem is that these solutions tend to translate infrastructure into broad terms, leaving critical small details to teams to work out on their own. This oversight can be especially problematic in areas like security policy, network load balancing, and firewall models and configurations, where small differences between one cloud and the next can be both make-or-break for migration success and extremely difficult to find. Even after using IaC, teams still must spend exorbitant amounts of time poring through plans, state files, and live resources to catch and correct these subtleties that fall through the cracks.

None of this is meant to undermine the value that infrastructure as code solutions provide. In fact, in the companion article I describe how Terraform is central to FluidCloud’s Cloud Cloning process. IaC is a powerful instrument in the migration arsenal; it’s just not reliable by itself to ensure that migrations succeed and resources behave as they should.

Governance tools: Built to find problems, not fix them

It’s unquestionable that observability and finops platforms like Datadog, New Relic, and Kubecost can be crucial in surfacing underutilized resources, performance bottlenecks, and budget overruns. The problem is that while they’re often excellent at spotting problems, most don’t guide teams to take the critical next step toward solving problems and optimizing cloud setups.

  • As their name implies, observability tools are designed to observe and report on issues, not to automate solutions. For instance, they might detect high CPU usage or spot failing requests, but they won’t then launch new servers, add containers, or adjust configurations to fix the problem. It’s on customer teams to do that work.
  • Finops applications, meanwhile, might alert users that a region they’re using is particularly expensive. But they won’t follow up with automation to help port infrastructure over to a cheaper area, or show cost comparisons to help teams find alternative clouds where they could rebuild current infrastructure at a lower cost.

Governance offerings are often excellent at flagging critical issues, which is unquestionably helpful. But without automation to follow up, they’re only raising problems without offering solutions. That isn’t helpful enough.

Across these examples and classes of applications, the underlying issue is the same. The market is full of products that, in theory, turn the complex cloud migration process into something predictable and efficient. The reality is that IT teams are left with extensive “last mile work” of translating and implementing source infrastructure in the target cloud’s architecture and dependencies.

IT teams deserve a better solution. Cloud Cloning solves the problems I’ve laid out above. I explain how in this article.

New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.


Open source maintainers are being targeted by AI agents as part of ‘reputation farming’ 16 Feb 2026, 11:13 am

AI agents able to submit huge numbers of pull requests (PRs) to open-source project maintainers risk creating the conditions for future supply chain attacks targeting important software projects, developer security company Socket has argued.

The warning comes after one of its developers, Nolan Lawson, last week received an email from an AI agent calling itself “Kai Gritun” regarding the PouchDB JavaScript database he maintains.

“I’m an autonomous AI agent (I can actually write and ship code, not just chat). I have 6+ merged PRs on OpenClaw and am looking to contribute to high-impact projects,” said the email. “Would you be interested in having me tackle some open issues on PouchDB or other projects you maintain? Happy to start small to prove quality.”

A background check revealed that the Kai Gritun profile was created on GitHub on February 1, and within days it had opened 103 pull requests across 95 repositories, resulting in 23 commits across 22 of those repositories.
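For illustration, a rough Python sketch of this kind of background check might query GitHub’s public search API for an account’s pull request count. The username below is a placeholder rather than the actual account, and unauthenticated requests are heavily rate-limited.

# A rough sketch of a contributor background check using GitHub's public
# search API. Unauthenticated, so subject to strict rate limits.
import json
import urllib.request


def count_pull_requests(username: str) -> int:
    url = (
        "https://api.github.com/search/issues"
        f"?q=author:{username}+type:pr&per_page=1"
    )
    req = urllib.request.Request(url, headers={"Accept": "application/vnd.github+json"})
    with urllib.request.urlopen(req) as resp:
        # total_count reports how many PRs the account has authored.
        return json.load(resp)["total_count"]


print(count_pull_requests("some-suspicious-account"))  # placeholder username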

Of the 95 repositories receiving PRs, many are important to the JavaScript and cloud ecosystems and count as industry “critical infrastructure.” Successful commits, or commits being considered, included those for the development tool Nx, the Unicorn static code analysis plugin for ESLint, the JavaScript command-line library Clack, and the Cloudflare/workers-sdk software development kit.

Importantly, Kai Gritun’s GitHub profile doesn’t identify it as an AI agent, something that only became apparent to Lawson because he received the email.

Reputation farming

A deeper dive reveals that Kai Gritun advertises paid services that help users set up, manage, and maintain the OpenClaw personal AI agent platform (formerly known as Moltbot and Clawdbot), which in recent weeks has made headlines, not all of them good.

According to Socket, this suggests the agent is deliberately generating activity in a bid to be viewed as trustworthy, a tactic known as ‘reputation farming.’ It looks busy while building provenance and associations with well-known projects. The fact that Kai Gritun’s activity was non-malicious and passed human review shouldn’t obscure the wider significance of these tactics, Socket said.

“From a purely technical standpoint, open source got improvements,” Socket noted. “But what are we trading for that efficiency? Whether this specific agent has malicious instructions is almost beside the point. The incentives are clear: trust can be accumulated quickly and converted into influence or revenue.”

Normally, building trust is a slow process. This gives some insulation against bad actors, with the 2024 XZ Utils supply chain attack, suspected to be the work of a nation state, offering a cautionary example. Although the rogue developer in that incident, Jia Tan, was eventually able to introduce a backdoor into the utility, it took years to build enough reputation for this to happen.

In Socket’s view, the success of Kai Gritun suggests that it is now possible to build the same reputation in far less time, in a way that could help accelerate supply chain attacks using the same AI agent technology. This isn’t helped by the fact that maintainers have no easy way to distinguish human reputation from artificially generated provenance built using agentic AI. They might also find the potentially large numbers of PRs created by AI agents difficult to process.

“The XZ-Utils backdoor was discovered by accident. The next supply chain attack might not leave such obvious traces,” said Socket.

“The important shift is that software contribution itself is becoming programmable,” commented Eugene Neelou, head of AI security for API security company Wallarm, who also leads the industry Agentic AI Runtime Security and Self‑Defense (A2AS) project.  

“Once contribution and reputation building can be automated, the attack surface moves from the code to the governance process around it. Projects that rely on informal trust and maintainer intuition will struggle, while those with strong, enforceable AI governance and controls will remain resilient,” he pointed out.

A better approach is to adapt to this new reality. “The long-term solution is not banning AI contributors, but introducing machine-verifiable governance around software change, including provenance, policy enforcement, and auditable contributions,” he said. “AI trust needs to be anchored in verifiable controls, not assumptions about contributor intent.”

