Self-Healing Code

Rather read with your ears? I’ve got you covered. Check out this podcast where two LLMs talk through this blog post, available on Spotify and Apple Podcasts.


Imagine a future where software can fix itself – just like your body heals a cut. We’re moving toward truly resilient systems, and it’s closer than you might think.

The exciting part? We’re already seeing the first signs of this happening.

When breakthrough technologies emerge, innovation tends to snowball – and that’s exactly what we’re seeing with generative AI (GenAI). In 2024, LLMs are already starting to automatically fix buggy and vulnerable code. 

I’ve discovered six ways teams tackle this challenge – two are already available products you can use today, while four are research projects. Here’s what’s cool: unlike a lot of AI research, these four research projects are pretty straightforward to try out yourself. 


How Auto-Patching Actually Works

I want to highlight three levels of detail on how these auto-patching systems work (our levels of inception). Our first level is an overview showing the general similarities among all the different approaches.

This diagram shows us the basic workflow.

Step 1: Everything begins with code – whether written by a developer or generated by an LLM. This code enters a CI/CD pipeline (that’s Continuous Integration/Continuous Delivery for non-techies). The system then scans the code for vulnerabilities using static analysis. When it finds an issue, it gathers important contextual information about the problem.
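To make Step 1 concrete, here’s a small sketch of the “gather context” piece: the CI job tells the scanner (CodeQL, Semgrep, or similar) to emit SARIF, the standard JSON format for static-analysis results, and a script pulls out the bits the LLM will need later. The `Finding` fields and file name are my own illustration, not taken from any specific product.

```python
# Sketch: pull the minimal context out of a SARIF report produced in CI.
import json
from dataclasses import dataclass


@dataclass
class Finding:
    rule_id: str      # scanner rule ID, usually mappable to a CWE
    file: str         # path of the flagged file
    start_line: int   # where the issue begins
    message: str      # the scanner's description of the problem


def load_findings(sarif_path: str) -> list[Finding]:
    """Collect the pieces of context the LLM will need later."""
    with open(sarif_path) as f:
        sarif = json.load(f)
    findings = []
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            locations = result.get("locations") or []
            if not locations:
                continue  # skip results without a concrete location
            loc = locations[0]["physicalLocation"]
            findings.append(Finding(
                rule_id=result.get("ruleId", "unknown"),
                file=loc["artifactLocation"]["uri"],
                start_line=loc["region"]["startLine"],
                message=result["message"]["text"],
            ))
    return findings
```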

Step 2: Each approach packages the vulnerability information a bit differently, but they all include three key pieces: the CWE ID, the problematic code, and its exact location. This package is then fed to the LLM for analysis. The most successful method I’ve found doesn’t overwhelm the LLM with the entire codebase – instead, it zeroes in on just the relevant parts of the code. More on this later with LLMPatch.
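Here’s what that package might look like in code. The prompt wording below is just my illustration of the pattern (CWE ID + flagged code + location), not the prompt any of these systems actually uses.

```python
def build_patch_prompt(cwe_id: str, file: str, line: int, snippet: str) -> str:
    """Package the three key pieces; only the relevant snippet goes in,
    not the whole codebase."""
    return (
        f"A static analyzer flagged {cwe_id} in {file} at line {line}.\n"
        f"Vulnerable code:\n{snippet}\n\n"
        "Explain the root cause, then propose a minimal patch that removes the "
        "vulnerability without changing the function's behavior."
    )
```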

Step 3: The LLM analyzes the problem and suggests several possible fixes – typically between three and eight different solutions. This multiple-solution approach increases our odds of finding the right fix.
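In code, that multiple-solution step can be as simple as sampling the model several times. `call_llm` below is a stand-in for whatever chat-completion client you use; a non-zero temperature is what makes the candidates differ.

```python
from typing import Callable


def generate_candidates(prompt: str,
                        call_llm: Callable[[str, float], str],
                        n: int = 5,
                        temperature: float = 0.8) -> list[str]:
    """Collect n independent patch suggestions for later testing and ranking."""
    return [call_llm(prompt, temperature) for _ in range(n)]
```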

Step 4: Some systems, like Google’s, take an extra verification step by automatically testing each suggested fix with a unit test to ensure it doesn’t break anything. Then, one or more LLMs review each fix for correctness – though what counts as “correct” varies between systems.
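A minimal version of that verification step, assuming candidates arrive as unified diffs, the project is a Git checkout, and the test suite runs under pytest, might look like this (a sketch, not Google’s actual pipeline):

```python
import subprocess


def tests_pass(patch_file: str, repo_dir: str) -> bool:
    """Apply one candidate patch, run the test suite, then roll the repo back."""
    try:
        subprocess.run(["git", "apply", patch_file], cwd=repo_dir, check=True)
    except subprocess.CalledProcessError:
        return False  # the patch doesn't even apply cleanly
    try:
        result = subprocess.run(["pytest", "-q"], cwd=repo_dir)
        return result.returncode == 0
    finally:
        # Undo the candidate so the next one starts from a clean tree
        subprocess.run(["git", "checkout", "--", "."], cwd=repo_dir, check=True)
```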

This “LLM-as-judge” strategy is popular in many GenAI products because it improves quality without slowing things down. Here are two good resources to help you get started:

  1. Creating a LLM-as-a-Judge That Drives Business Results
  2. Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge)
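To make the idea concrete, here’s a minimal LLM-as-judge sketch: one model grades another model’s patch. The rubric wording is mine, and `call_llm` is again a placeholder client.

```python
def judge_patch(original: str, patch: str, call_llm) -> bool:
    """Ask a second model to accept or reject a candidate patch."""
    verdict = call_llm(
        "You are reviewing a security patch.\n"
        f"Original code:\n{original}\n\nPatch:\n{patch}\n\n"
        "Reply ACCEPT if the patch fixes the issue without breaking the code, "
        "otherwise reply REJECT."
    )
    return verdict.strip().upper().startswith("ACCEPT")
```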

Step 5: The system ranks the verified fixes and presents the best ones to the developer for final approval. These fixes show up as simple accept/reject options in the code repository – just like a suggestion in a Google Doc.

Now, let’s zoom in on the most impressive approach I’ve found: LLMPatch. 

LLMPatch Flow

Why focus on LLMPatch? Two reasons: it’s the most transparent system out there, and it’s incredibly capable. Unlike other systems, LLMPatch thoroughly explains its approach and can fix vulnerabilities across multiple programming languages – even brand-new security threats (what we call “zero-days”). The results are remarkable: it fixed 7 of the 11 zero-day vulnerabilities in the researchers’ tests.

Here’s the high-level flow…

Step 1: While the basic process follows our earlier framework, LLMPatch takes a unique approach to gathering vulnerability data. The researchers started by combining two massive databases: PatchDB (with over 12,000 real-world fixes) and CVEFixes (with more than 4,000 C-language fixes). Though these databases focus on C, LLMPatch’s approach works across many programming languages. They carefully selected 300 cases from this collection – 200 to guide the system with examples and 100 to test it.

Step 2: When the system identifies a vulnerability, it analyzes the surrounding code with a tool called Joern. Joern builds a map (called a Program Dependence Graph, or PDG) showing how different parts of the code connect to each other. This map helps the LLM focus on exactly the right pieces of code – not just where the vulnerability is, but all the related code that might be affected.
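If you want to try this yourself, Joern is open source. The sketch below drives it from Python; the CLI names and flags (`joern-parse`, `joern-export --repr pdg`) are from my memory of Joern’s docs, so double-check them against your installed version.

```python
import subprocess


def export_pdg(src_dir: str, out_dir: str = "pdg-out") -> str:
    """Parse the code into a code property graph, then export per-method PDGs."""
    subprocess.run(["joern-parse", src_dir], check=True)   # writes cpg.bin in the cwd
    subprocess.run(["joern-export", "--repr", "pdg", "--out", out_dir], check=True)
    return out_dir  # one .dot graph per method; pick the flagged method's graph from here
```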

Step 3: The system bundles together four key pieces of information: 

  • The vulnerability ID (CWE)
  • The problematic code
  • The exact location of the issue
  • Related code pieces identified by our PDG map

This focused package helps the AI understand the problem without getting lost in irrelevant code. 
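As a sketch, that bundle might be modeled like this. The field names are mine; they simply mirror the four pieces listed above.

```python
from dataclasses import dataclass, field


@dataclass
class VulnerabilityBundle:
    cwe_id: str                  # e.g. "CWE-476" (NULL pointer dereference)
    vulnerable_code: str         # the flagged function or snippet
    location: str                # file path plus line number(s)
    related_code: list[str] = field(default_factory=list)  # slices pulled from the PDG

    def to_prompt_context(self) -> str:
        """Flatten the bundle into the context section of a prompt."""
        related = "\n\n".join(self.related_code) or "(none)"
        return (f"CWE: {self.cwe_id}\nLocation: {self.location}\n"
                f"Vulnerable code:\n{self.vulnerable_code}\n"
                f"Related code (from the PDG):\n{related}")
```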

The LLM then analyzes this information to identify the underlying cause of the vulnerability. If it’s unsure, there’s a clever backup plan: another LLM steps in to provide a plain-English summary of what the vulnerable function is trying to do (it’s LLMs all the way down).
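Here’s roughly how that fallback could look in code. The prompt text and the “unsure” check are my own rough stand-ins, not the prompts from the paper (the paper’s actual prompt is shown below).

```python
def root_cause_with_fallback(bundle_context: str, call_llm) -> str:
    """Identify the root cause; if the model waffles, add a plain-English
    summary of the function's intent and ask again."""
    answer = call_llm(f"{bundle_context}\n\nWhat is the root cause of this vulnerability?")
    if not answer.strip() or "unsure" in answer.lower():  # crude stand-in for a confidence check
        summary = call_llm(
            "Summarize, in plain English, what this function is intended to do:\n"
            + bundle_context
        )
        answer = call_llm(
            f"{bundle_context}\n\nFunction intent: {summary}\n\n"
            "Given the intent, what is the root cause of the vulnerability?"
        )
    return answer
```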

Below is the prompt they shared in the paper. 

Step 4: Here’s where LLMPatch gets really smart. After identifying the root cause, it searches through its library of 200 example cases to find similar root causes that were successfully fixed before. It collects up to eight of the most relevant examples, using them as a reference for fixing the current problem. For each new vulnerability, the system crafts a custom prompt that’s specifically tailored to the problem at hand.

This custom prompt has three key parts:

  • Historical Examples: Up to 8 similar root causes and how they were fixed
  • Prompt Question: Clear instructions for what needs to be fixed
  • Response Structure: A pre-built structure that includes the root cause we found and suggested fix strategies 

The system then generates five different possible fixes.
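Here’s a sketch of that assembly step. `exemplars` stands in for however you retrieve the most similar solved cases from the 200-example library, and the prompt layout simply mirrors the three parts above.

```python
def build_dynamic_prompt(root_cause: str, bundle_context: str, exemplars: list[dict]) -> str:
    """Assemble historical examples, the question, and the response structure."""
    examples = "\n\n".join(
        f"Example root cause: {ex['root_cause']}\nExample fix:\n{ex['fix']}"
        for ex in exemplars[:8]   # up to eight similar solved cases
    )
    return (
        f"{examples}\n\n"
        f"New case:\n{bundle_context}\n"
        f"Identified root cause: {root_cause}\n\n"
        "Question: produce a patch that removes the vulnerability while preserving behavior.\n"
        "Answer using this structure:\nRoot cause: ...\nFix strategy: ...\nPatch (diff): ..."
    )


def propose_patches(prompt: str, call_llm, n: int = 5) -> list[str]:
    """Sample with a non-zero temperature so the five candidates differ."""
    return [call_llm(prompt) for _ in range(n)]
```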

Step 5: Each proposed fix goes through a rigorous review process. A panel of LLMs (including Gemini, Claude, GPT, and Llama) examines each solution. If any of these LLM reviewers approves a fix, it’s sent to the human developer for final validation. 

The LLM reviewers check two crucial things:

  1. Valid: Does it preserve the original code’s functionality?
  2. Correct: Does the fix actually solve the security problem?
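A sketch of that review panel: each judge is a callable wrapping one of the models, a candidate moves on if at least one judge answers “yes” to both questions, and the wording of the judge prompt is mine, not the paper’s.

```python
JUDGE_PROMPT = (
    "Original code:\n{original}\n\nProposed patch:\n{patch}\n\n"
    "Answer each question with yes or no:\n"
    "1) Does the patch preserve the original functionality?\n"
    "2) Does the patch remove the security vulnerability?"
)


def panel_approves(original: str, patch: str, judges: list) -> bool:
    """`judges` is a list of callables, one per model, each taking a prompt and returning text."""
    prompt = JUDGE_PROMPT.format(original=original, patch=patch)
    for judge in judges:
        verdict = judge(prompt).lower()
        if verdict.count("yes") == 2:   # this judge says the patch is both valid and correct
            return True                 # one approval is enough to escalate to a human
    return False
```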

Key Lessons for Building Auto-Patching Systems

After reading through all this auto-patching content, I’ve identified three principles for building effective auto-patching tools:

  1. Focus the AI’s Attention: Just like humans, AI works better when it can concentrate on relevant information. Don’t throw the entire codebase at a model and hope a quality patch comes out. Instead, narrow the scope (e.g., to the dependent code identified by the PDG) to produce better patches more consistently.
  2. Traditional Tools: Don’t throw out traditional security tooling. In every case, static analysis discovers the vulnerability, unit tests check the code, and a solid CI/CD pipeline automates the process. LLMs push that automation further, but they are not replacing everything (…yet).
  3. Dynamic Prompting: Systems that automatically adjust their approach to each specific vulnerability perform significantly better. LLMPatch does this well by running a series of automated steps before the final patch-generation prompt: extracting relevant code from the PDG, finding the root cause, retrieving similar root-cause examples, and bundling it all together. Table 2 on page 11 of their paper shows just how much each step contributes.

Looking ahead, we’re witnessing the early stages of self-healing software becoming a reality. Each of these approaches represents another step toward truly resilient systems – and that future might be closer than we think.