SAN FRANCISCO — Programmers and developers are increasingly incorporating large language models (LLMs) into their workflow, as shown by the recent “vibe coding” trend and a 2024 GitHub survey that found nearly all respondents in these fields used generative AI in some capacity.
But how good are LLMs really at analyzing code and offering fixes when code is insecure? In the Tuesday RSAC 2025 session “The Future of Secure Programming Using LLMs,” Mark Sherman, CERT director at Carnegie Mellon University’s Software Engineering Institute, set realistic expectations about fixing insecure code with AI, based on more than 2,500 test runs conducted by his team.
Newer models improve on previous lackluster performance
Sherman and his CERT colleagues tested OpenAI’s ChatGPT 3.5, ChatGPT 4 and ChatGPT 4o, as well as GitHub Copilot, on 1,223 examples of compliant and noncompliant code from SEI CERT C, C++ and Java Secure Coding Standards, ultimately totaling 2,684 trial runs between March 2023 and August 2024.
When the models were simply asked to identify the presence or absence of errors in the code, Sherman and his team encountered many cases where the older model, ChatGPT 3.5, was “demonstrably wrong” or where the model “misses things that are obvious” yet points out nonexistent errors.
For example, the model would sometimes repeat the same code back to the user, claiming to have made a fix that was already present in the original code. In another case, the model mistook a character array for a string in C, leading to an inaccurate fix in which the memcmp() function was replaced with strcmp().
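The distinction matters because a C character array need not be NUL-terminated. A minimal sketch along those lines (the buffers here are illustrative, not taken from the CERT test cases) shows why swapping memcmp() for strcmp() breaks the code rather than fixing it:

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    /* Fixed-size character arrays holding raw bytes; neither is
     * NUL-terminated, so neither is a C string. */
    char a[4] = {'d', 'a', 't', 'a'};
    char b[4] = {'d', 'a', 't', 'a'};

    /* Correct: memcmp() compares exactly sizeof(a) bytes and never
     * reads past the end of either buffer. */
    if (memcmp(a, b, sizeof(a)) == 0) {
        puts("buffers match");
    }

    /* The suggested swap -- strcmp(a, b) -- would keep scanning until
     * it hit a NUL byte, reading past the end of both arrays:
     * undefined behavior rather than a fix. */
    return 0;
}
```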
Overall, ChatGPT 3.5 correctly interpreted the code less than 40% of the time for insecure Java examples, less than 50% of the time for insecure C and just under 65% of the time for insecure C++.
Significant improvements were seen when it came to the newer models, like ChatGPT 4 and 4o and GitHub Copilot. These models were able to catch some of the errors missed by ChatGPT 3.5, including the memcmp() problem from the earlier example and a potential command injection in C due to insecure use of the system() function, although GitHub Copilot refused to respond to the latter due to filtering by its “Responsible AI Service.”
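Sherman did not share the exact snippet, but the classic shape of such a finding is sketched below under assumed details (the function names and file-printing scenario are hypothetical): unsanitized input reaches system(), so shell metacharacters in that input become attacker-controlled commands.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical insecure pattern: user-supplied text is spliced into a
 * shell command line. Input like "notes.txt; rm -rf ~" makes system()
 * run the attacker's command as well. */
void print_file_insecure(const char *filename) {
    char cmd[512];
    snprintf(cmd, sizeof(cmd), "cat %s", filename);
    system(cmd); /* the shell interprets metacharacters in filename */
}

/* Safer direction: skip the shell entirely and open the file directly. */
void print_file_safer(const char *filename) {
    FILE *fp = fopen(filename, "r");
    if (fp == NULL) {
        perror("fopen");
        return;
    }
    int c;
    while ((c = fgetc(fp)) != EOF) {
        putchar(c);
    }
    fclose(fp);
}

int main(int argc, char **argv) {
    if (argc > 1) {
        print_file_safer(argv[1]); /* insecure variant shown only for contrast */
    }
    return 0;
}
```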
However, Sherman noted that all of the models tested missed some errors. On average, the newer models (ChatGPT 4, 4o and GitHub Copilot) still missed errors in insecure code more than 30% of the time for Java and C++ and more than 20% of the time for C.
“These things are getting better … but it still has a way to go,” Sherman said.
The models also performed best when detecting the most common errors; for example, expression errors, memory operation errors and string handling errors in C were detected 96%, 92% and 84% of the time, respectively, by the newer models.
Sherman said this lines up with the most common errors and their solutions being the most prevalent in the models’ training data, which drives home that “you’re not seeing correct analysis of programs — you’re seeing patterns” when using LLMs to fix code.
Avoiding counterintuitive solutions and misplaced trust
Sherman’s talk also explored potential solutions to the current pitfalls of using AI to help programmers analyze and patch insecure code.
One might suspect that a smaller model trained specifically for coding, like GitHub Copilot, would outperform larger, general-purpose models like ChatGPT. However, both CMU CERT’s tests and other research found that “the bigger hammer worked better than the special hammer” when it came to general-purpose vs. domain-trained models, Sherman said.
Instead of looking for a specially trained model, users can boost the performance of LLMs with improved prompting that goes beyond asking about “good vs. bad” code.
“Controlling the narrative is key,” Sherman said, pointing to research showing that providing information such as static analyzer alerts or context about an error can improve AI’s ability to provide fixes by up to 60%.
Other context Sherman recommended providing to LLMs for better results includes functional documentation and the libraries and APIs the code should use.
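As a hypothetical illustration of that kind of prompting (the alert text, line number and scenario below are invented, not from Sherman’s test set), a prompt enriched with analyzer output and context might look like:

```
You are reviewing C code for CERT Secure Coding compliance.
Static analyzer alert: possible out-of-bounds read at line 42 (CWE-125).
Context: buf is a 64-byte array filled from a network socket and is not
NUL-terminated. Use only the C standard library.
Functional requirement: compare the first 64 bytes of buf against a stored token.
Task: state whether the flagged line is a real defect and propose a fix,
citing the relevant CERT C rule.
```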
When assessing the AI’s responses, Sherman warned programmers against putting too much trust in the AI and copying and pasting its solutions without question. He noted reports from Stanford University and Snyk suggesting that users often deferred to “perceived AI authority” and “trusted the AI system more than their own judgement.”
Instead, Sherman suggested LLMs should be prompted to provide evidence that their suggested fixes are correct, with that evidence further validated by theorem proving tools to reduce the burden on programmers to check the AI’s work.
Overall, programmers and developers should continue to use existing secure coding practices and analysis tools to ensure the security of their code, while making sure they “separate hype from capability” when employing LLMs in their workflow.