Dynamic testing methods, including feedback-based fuzzing, are the most effective approach for finding deeply hidden security vulnerabilities. Code Intelligence pioneered this technology for memory-safe languages such as Java and JavaScript, helping numerous enterprise and open-source projects identify issues early in the development process. As a result, we uncovered thousands of bugs and registered over 50 CVEs in OSS, including SQL/Command Injections, XSS, and Remote Code Executions. Despite this incredible track record, one barrier has remained that hinders broad adoption of dynamic white-box testing: the manual engineering effort required to identify relevant interfaces and develop the corresponding test harnesses.
In this blog post, we want to share our work on leveraging generative AI to remove this barrier and to enable broad adoption of dynamic white-box testing despite limited time and resources. Today, we are thrilled to announce CI Spark, our new AI assistant that leverages LLMs to automate the onboarding of new projects. Initial results show an acceleration of 1500%, significantly reducing the workload from several days to under three hours!
Challenges of AI-Powered White-Box Testing
One of the main technologies behind AI-powered white-box testing is feedback-based fuzzing. This testing approach leverages genetic algorithms to automatically generate test cases that maximize test coverage and trigger deep behavior in the tested software. However, getting such a fuzzer started requires human expertise to identify entry points and manually develop a test that consumes the generated inputs. For this reason, creating a sufficient suite of high-quality tests takes days or even weeks. Given the tight schedules of development teams, this manual effort presents a non-trivial barrier to the broad adoption of AI-enhanced white-box testing. Thanks to the latest advancements in generative AI, CI Spark can automate this task in JavaScript/TypeScript, Java, and C/C++.
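To illustrate the manual effort, a hand-written fuzz test for a single Java entry point typically looks like the minimal sketch below. The ConfigParser.parse entry point is hypothetical and only stands in for a project-specific API; real harnesses often need additional setup and input shaping, and a project may need dozens of them.
import com.code_intelligence.jazzer.api.FuzzedDataProvider;

public class ConfigParserFuzzer {
  public static void fuzzerTestOneInput(FuzzedDataProvider data) {
    // Turn the fuzzer-generated bytes into the input the entry point expects.
    String input = data.consumeRemainingAsString();
    // Call the (hypothetical) entry point under test with untrusted data.
    ConfigParser.parse(input);
  }
}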
LLM-Powered Fuzz Test Generation
CI Spark leverages generative AI’s code analysis and generation capabilities to automate the generation of fuzz tests, which are central to AI-powered white-box testing. To that end, we have created an extensive set of prompts that guide LLMs to identify security-critical functions and generate high-quality fuzz tests. The prompts give instructions on how to develop tests that optimally use our underlying fuzzing engines. They also provide the insights necessary for CI Spark to create tests that achieve maximum code coverage. We have built and refined our prompts based on our years of experience in fuzz testing across open and closed-source projects, enabling our AI assistant to maintain a minimal false positive rate. Moreover, CI Spark offers an interactive mode in which users can quickly correct any false positives that slip through and improve the quality of the generated tests.
Advantages
- Automatic identification of fuzzing candidates
Provide a list of public functions/methods that can be used as entry points for fuzz tests. These APIs are called with user-controlled data and thus should be thoroughly tested.
- Automatic generation of tests
Generate a fuzz test for a selected candidate. The interactive mode enables users to give tips to the AI to improve the quality of the generated test and fix any errors.
- Improving existing tests
If you already have fuzz tests, CI Spark can assist you in improving the test to increase code coverage.
- Leverage existing unit tests to generate high-quality fuzz tests
To have higher-quality fuzz tests, you can provide existing unit tests that call the candidate API as hints to CI Spark. These provide valuable examples of the correct usage of the API in the tests and result in better fuzz tests (see the sketch after this list).
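As an illustration of the last point, consider a hypothetical UserParser.parse API (both snippets below are invented for this sketch, not taken from a real project). The existing unit test shows CI Spark how the API is used correctly, and the derived fuzz test keeps the same call pattern while replacing the fixed input with fuzzer-generated data.
// UserParserTest.java: existing unit test provided as a hint (hypothetical API).
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

class UserParserTest {
  @Test
  void parsesValidUser() {
    User user = UserParser.parse("{\"name\":\"alice\",\"age\":30}");
    assertEquals("alice", user.getName());
  }
}

// UserParserFuzzTest.java: fuzz test derived from the hint, with the fixed
// JSON string replaced by fuzzer-generated input.
import com.code_intelligence.jazzer.api.FuzzedDataProvider;
import com.code_intelligence.jazzer.junit.FuzzTest;

class UserParserFuzzTest {
  @FuzzTest
  void fuzzParse(FuzzedDataProvider data) {
    UserParser.parse(data.consumeRemainingAsString());
  }
}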
Example
Here, we show code snippets of a manually written test and an AI-generated test for a non-trivial project, using the example of json-sanitizer, a library that takes untrusted JSON and converts it into trusted JSON.
Human-generated test:
(as seen in OSS-Fuzz)
import com.code_intelligence.jazzer.api.FuzzedDataProvider;
import com.code_intelligence.jazzer.api.FuzzerSecurityIssueLow;
import com.google.gson.Gson;
import com.google.gson.JsonElement;
import com.google.json.JsonSanitizer;

public class ValidJsonFuzzer {
  private static Gson gson = new Gson();

  public static void fuzzerTestOneInput(FuzzedDataProvider data) {
    String input = data.consumeRemainingAsString();
    String output;
    try {
      output = JsonSanitizer.sanitize(input, 10);
    } catch (ArrayIndexOutOfBoundsException e) {
      // ArrayIndexOutOfBoundsException is expected if nesting depth is
      // exceeded.
      return;
    }
    // Check that the output is valid JSON. Invalid JSON may crash other parts
    // of the application that trust the output of the sanitizer.
    try {
      Gson gson = new Gson();
      gson.fromJson(output, JsonElement.class);
    } catch (Exception e) {
      throw new FuzzerSecurityIssueLow("Output is invalid JSON", e);
    }
  }
}
AI-generated test:
//package com.example;
import com.code_intelligence.jazzer.api.FuzzedDataProvider;
import com.code_intelligence.jazzer.junit.FuzzTest;
import com.google.json.JsonSanitizer;

public class JsonSanitizerFuzzingHarness {
  @FuzzTest
  public void fuzzJsonSanitizer(FuzzedDataProvider data) {
    String jsonInput = data.consumeRemainingAsString();
    JsonSanitizer.sanitize(jsonInput);
  }
}
With small guidance in the interactive mode (used LLM prompt: “I only want to catch ArrayIndexOutOfBoundsException as they're expected if the nesting depth is exceeded. Also, I want to ensure that the sanitized output never contains the substring </script>. Finally, validate the output.”), CI Spark produces the following improved test:
//package com.example;
import com.code_intelligence.jazzer.api.FuzzedDataProvider;
import com.code_intelligence.jazzer.junit.FuzzTest;
import com.google.json.JsonSanitizer;
import com.google.gson.Gson;
import com.google.gson.JsonSyntaxException;

public class JsonSanitizerFuzzingHarness {
  @FuzzTest
  public void fuzzJsonSanitizer(FuzzedDataProvider data) {
    try {
      String jsonInput = data.consumeRemainingAsString();
      String sanitizedOutput = JsonSanitizer.sanitize(jsonInput);
      // Check if sanitized output contains "</script>"
      if (sanitizedOutput.contains("</script>")) {
        throw new AssertionError("Sanitized output contains '</script>'");
      }
      // Verify if sanitized output is valid JSON
      try {
        new Gson().fromJson(sanitizedOutput, Object.class);
      } catch (JsonSyntaxException e) {
        throw new AssertionError("Sanitized output is not valid JSON");
      }
    } catch (ArrayIndexOutOfBoundsException e) {
      // Ignore ArrayIndexOutOfBoundsException as it may occur when the nesting depth is exceeded
    }
  }
}
With only minimal guidance about which specifics of this particular API are interesting to us, we obtain a fuzz harness that is almost identical to the one that had already been implemented by hand.
Tested Within Google’s OSS-Fuzz
CI Spark was first tested as part of our collaboration with Google’s OSS-Fuzz, a project that aims to ensure the security of open-source software through continuous fuzz testing. Since Code Intelligence provides the testing technology for JVM-based languages (Jazzer) and JavaScript (Jazzer.js), CI Spark was initially used to onboard JavaScript and Java projects. Early results showed that CI Spark enabled our engineers to condense the average workload needed to generate a fuzz test to a few hours or even minutes, depending on the project.
Complementary to CI Spark, the Google Security team recently added similar functionality to OSS-Fuzz, with a focus on C/C++. While the main purpose of this new OSS-Fuzz addition is to keep securing open-source projects, CI Spark will be rolled out for commercial projects in the near future.
Next Steps
The results from using CI Spark are very encouraging and clearly demonstrate the automation potential of leveraging generative AI. However, this is just the beginning, and there are still significant improvements that we are currently working on. The next items on our roadmap include:
- Plug & Play system for different LLMs
Support a seamless choice of different LLMs to power CI Spark. New models appear with increasing speed and capabilities, which calls for continued experimentation to select the best model for each specific use case.
- Model fine-tuning for better results
Our main focus has been refining our prompts to work with all available models. Next, we will explore fine-tuning existing open-source and commercial models for the specific task of generating high-quality fuzz tests.
- Automatic validation of the fuzz tests
An essential step to fully automate this process is to validate the quality of the generated test. To this end, we will build an evaluation framework to build and execute the generated test. Build errors and progress feedback will then be fed back to the AI to improve the test (see the sketch after this list).
- Static analysis for candidate selection
We will leverage whole-program static analysis to extract relevant candidates for fuzz tests. This analysis will identify APIs that are more likely to contain security issues and result in good code coverage.
- Identification of inadequately tested APIs
Besides the aforementioned enhancements to our static analysis to better pinpoint suitable fuzzing candidates, we’re also investing in uncovering APIs that are inadequately tested. Whether they lack sufficient unit testing or have not been subjected to fuzz tests, our goal is to supplement these areas with additional fuzz tests.
- Support for other languages
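To make the planned validation step concrete, the sketch below outlines one possible shape of such a loop. Every interface and method name in it is hypothetical and does not describe an existing CI Spark API; it only illustrates the build-run-feedback cycle.
import java.time.Duration;

public class FuzzTestValidationLoop {
  // Hypothetical building blocks of the evaluation framework.
  interface Llm { String generate(String candidateApi); String revise(String test, String feedback); }
  interface Builder { Result compile(String test); }
  interface Fuzzer { Result run(String test, Duration budget); }
  record Result(boolean ok, String feedback) {}

  static String generateAndValidate(Llm llm, Builder builder, Fuzzer fuzzer, String candidateApi) {
    String test = llm.generate(candidateApi);
    for (int attempt = 0; attempt < 3; attempt++) {
      Result build = builder.compile(test);
      if (!build.ok()) {
        // Feed compiler errors back to the model and let it revise the test.
        test = llm.revise(test, build.feedback());
        continue;
      }
      Result run = fuzzer.run(test, Duration.ofMinutes(5));
      if (!run.ok()) {
        // Feed coverage and progress feedback back to the model.
        test = llm.revise(test, run.feedback());
        continue;
      }
      return test; // The test builds and makes progress: accept it.
    }
    return test; // Hand back for manual review after repeated failures.
  }
}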
CI Spark Live Demo
In this recorded live demo, our co-founder and Chief Scientist, Khaled Yakdan, demonstrates how you can break the barriers of dynamic testing with CI Spark. Learn how CI Spark combines self-learning AI and LLMs to detect and configure entry points for feedback-based fuzzing automatically.