Teaching a Small AI to Investigate Security Logs (And Everything That Went Wrong Along the Way)
If you want to learn how much you don’t know about your own domain, try explaining it to a 20-billion parameter AI model.
Most AI demos you see these days use frontier models: the biggest, most expensive, most capable large language models money can buy. You send your data to someone else’s API, magic happens, and you get an answer back. It works great, right up until the data you need to analyze is a stream of security logs containing IP addresses, attack payloads, and evidence of who’s trying to break into your infrastructure. Suddenly, “just send it to the AI” sounds less appealing.
This is the story of building a security investigation agent that runs on a 20-billion parameter model, small enough to fit on a single GPU with 16GB of VRAM, and the large number of things that had to go right before it stopped being an expensive random nonsense generator.
Why Build This?
I work at Akamai, and I’ve been thinking about interesting use cases for GPU instances on our cloud platform, which my employer has been pushing heavily recently. We have a managed data lake service where customers store web application firewall (WAF) and CDN logs, and it has a ClickHouse-compatible query interface. We have industry-leading WAF products. And, well, AI is kind of a big deal right now. The question was: could you combine all three to show customers a glimpse of what AI-powered security investigation might look like?
The catch: instead of reaching for GPT or Claude, I wanted to use a small, self-hostable model. Partly because the whole point was to demonstrate running on our GPU infrastructure. But also partly because every AI agent demo I’d seen was using frontier models, and nobody seemed to be seriously trying to make small models work for complex analytical tasks.
When I started experimenting with a 20B parameter open-source LLM developed by OpenAI, I expected it to be barely functional. Instead, I found something more interesting: with the right prompting techniques, it was far more capable than I’d anticipated. Not as capable as frontier models, certainly, but capable enough that it was worth pushing further. And so what started as a quick experiment turned into a deep dive into the craft of making a small AI actually useful.
The AI Doesn’t Know What Your Logs Mean
Here’s something that surprised me, although in retrospect it shouldn’t have: the very first problem wasn’t about SQL generation or model size. It was about the model not understanding what WAF logs actually represent.
I gave the model some Akamai WAF logs and asked “how many attacks succeeded?” The model dutifully found all the events where appliedAction was set to monitor, and counted them as successful attacks. After all, if the WAF detected an attack but didn’t block it, the attack must have gotten through, right?
Wrong. Monitor mode is a deliberate deployment strategy. When you roll out new WAF rules, you often run them in “monitor” mode first; they observe and log but don’t block, so you can validate that they won’t cause false positives before enabling enforcement. The model was confusing “we chose not to block this yet” with “the attacker won.”
Or take reputation-based blocking. The model would see WAF events triggered by rules like PENALTYBOX or REPUTATION (both block based on past behavior, not the current request), where the ruleData field was empty, and dismiss them as false positives. No visible payload means no real evidence of an attack, right? But reputation rules work differently: if IP 203.0.113.5 was attacking at high frequency yesterday, Akamai might block it today based on that history, regardless of what the current request looks like. Empty ruleData is expected for these rules; it’s not a quality issue.
These aren’t bugs in the model. They’re perfectly reasonable interpretations if you don’t have domain expertise in Akamai WAF. A junior analyst might make the same mistakes. The problem is that LLMs are extremely confident in their wrong interpretations, and without domain knowledge baked into the prompts, even frontier models produce what I’d call “confident nonsense”: syntactically correct SQL that answers the wrong question entirely.
Turning Tacit Knowledge Into Words
The fix was conceptually simple: encode the domain knowledge into the prompts. In practice, it was one of the most valuable exercises of the entire project.
I started with minimal system prompts and ran investigations. When the model made a mistake, I’d read through its reasoning and ask myself: “What piece of knowledge is this model missing that caused this error?” Then I’d add that knowledge to the prompt and run again.
It forced me to articulate knowledge I’d never put into words before. Every security engineer carries around a massive inventory of implicit understanding: anyone who’s worked with Akamai WAF “just knows” that monitor mode isn’t the same as a successful attack, or that reputation rules don’t need payload evidence. But this is tribal knowledge; it lives inside our heads, not in any document. Nobody ever writes it down, because it’s just… how things work.
Turning tacit domain knowledge into explicit, unambiguous language is hard. But it was also clarifying. I found gaps in my own understanding. Things I thought I knew turned out to be fuzzier than I’d realized once I had to explain them precisely enough for a machine to get them right. It was like writing documentation, except the reader is an extremely literal-minded intern who will cheerfully misinterpret anything you leave ambiguous.
The Intern Analogy
A 20B parameter LLM is basically an intern. Give it specific, detailed instructions like “find all events where appliedAction is deny and the ruleTag contains ASE/WEB_ATTACK/SQLI” (Akamai’s rule tag for SQL injection) and it does a great job. It follows step-by-step directions with admirable precision.
But ask it something abstract, like “find the most dangerous attacks,” and it falls apart. It will interpret “dangerous” in the most superficial way possible, like sorting by request count. A frontier model, given the same vague question, tends to think more deeply: it might look for SQL injection events where the HTTP response code was 200 OK, suggesting the attack actually succeeded past the WAF. The small model doesn’t make that inferential leap.
The difference isn’t really about raw capability in the way benchmarks measure it. It’s about how each type of model handles ambiguity. When the question is well-specified, meaning you know exactly what you’re looking for and you can express it clearly, a 20B model can do genuinely impressive work. All the architectural tricks in this project are essentially ways of compensating for the small model’s weaknesses by making every individual task as concrete and unambiguous as possible. You’re not extracting hidden brilliance from the model; you’re carefully removing every opportunity for it to be confused.
Frontier models are qualitatively different in exactly the situation where your careful prompt engineering fails: the ill-defined question. A senior analyst walking in with “just poke around these logs and tell me if anything looks off” is giving the kind of vague, open-ended instruction that breaks a small model. A frontier model handles it because it brings enough world knowledge and inferential range to construct a reasonable interpretation of “looks off” (off in what way? statistically? behaviorally? compared to what baseline?) and then pursue it productively.
The honest summary: with the right architecture around it, a 20B model can handle most of the well-defined security investigation questions a real analyst would ask. What it can’t handle is the open-ended, exploratory “I don’t know what I’m looking for” questions, and that’s where frontier models earn their price tag.
Breaking Investigations Into Small Steps
Traditional text-to-SQL systems try to generate all the queries you’ll need in a single pass. For security investigations, this is hopeless. A question like “identify IPs that performed reconnaissance and then launched exploitation attacks” requires multiple correlated queries: first find the reconnaissance activity, then use those results to look for subsequent exploitation events.
A frontier model might be able to plan all of this upfront. A 20B model cannot. So instead of asking the model to be brilliant, I asked it to be incremental.
The agent works in rounds. Each round, the model generates a few simple SQL queries, executes them, looks at the results, and then decides: do I have enough information to answer the question, or do I need to investigate further? If it needs more data, it plans another round of queries, this time using specific values it discovered, including particular IP addresses, timestamps, and rule names, from the previous round’s results.
It transforms a task that’s impossibly hard for a small model into a sequence of tasks that are each individually easy. And it has a nice side benefit: the investigation process looks a lot like how a human analyst actually works. You don’t write all your queries at once. You poke at the data, notice something interesting, and dig deeper.
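In code, the loop itself is simple. Here’s a minimal sketch of the idea, where `plan_queries`, `run_sql`, and `synthesize_answer` are hypothetical stand-ins for the real agent’s LLM calls and ClickHouse client:

```python
# Sketch of the round-based investigation loop. The three callables are
# hypothetical stand-ins for the LLM and database plumbing.

MAX_ROUNDS = 5  # illustrative safety cap

def investigate(question, plan_queries, run_sql, synthesize_answer):
    """Run up to MAX_ROUNDS of plan -> execute -> assess."""
    history = []  # accumulated (sql, rows) pairs from earlier rounds
    for _ in range(MAX_ROUNDS):
        # The model plans a few simple queries, seeded with the concrete
        # values (IPs, timestamps, rule names) it found in earlier rounds.
        queries = plan_queries(question, history)
        if not queries:  # the model decided it has enough evidence
            break
        for sql in queries:
            history.append((sql, run_sql(sql)))
    return synthesize_answer(question, history)
```

The key design choice is that the model never sees the whole plan at once; each call only has to produce the next few concrete queries.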
When Results Don’t Fit in Context
Security investigations can produce enormous result sets. Ask “show me all WAF events for this host over the past week” and you might get tens of thousands of rows. You can’t just dump all of that into the LLM’s context window, as you’ll either blow past the token limit or drown the model in noise.
The naive solution is truncation: just take the first N rows and hope the interesting stuff is in there.
Instead, when a query returns a massive result set, the agent’s Python code catches the overflow before passing the data to the LLM. Rather than feeding the model ten thousand rows and hoping for the best, it asks the LLM to rewrite the query into a few statistical summary queries that preserve the analytical insight while fitting comfortably in context. So instead of returning 10,000 raw rows, you might get “top 20 source IPs by event count, grouped by attack type” and “hourly event distribution.” The model decides which summarizations are most relevant to the investigation question, which is actually something it’s good at, since it’s a judgment call, not a precision task.
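As a sketch, the guard looks something like this; `ROW_BUDGET` and `ask_for_summary_queries` are illustrative names, not the real agent’s API:

```python
# Sketch of the result-overflow guard. ask_for_summary_queries() is a
# hypothetical stand-in for the LLM call that rewrites an oversized
# query into a handful of aggregate queries.

ROW_BUDGET = 200  # illustrative cap on rows fed into the context window

def fetch_for_context(sql, run_sql, ask_for_summary_queries):
    rows = run_sql(sql)
    if len(rows) <= ROW_BUDGET:
        return {"sql": sql, "rows": rows}
    # Too big for the context window: have the model propose statistical
    # summaries (top-N groupings, hourly histograms, ...) instead of
    # truncating blindly.
    summaries = ask_for_summary_queries(sql, sample=rows[:20])
    return {"sql": sql, "summaries": [(s, run_sql(s)) for s in summaries]}
```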
LLMs Can’t Do Math (But They Can Ask For It)
LLMs are extraordinarily bad at arithmetic. They are token prediction machines that learn numerical patterns from training data rather than executing symbolic computation, and the smaller the model, the worse it gets. Ask a 20B model to compute 211 / 712 * 100 and you sometimes get a confident answer that is completely wrong.
Analyses frequently require percentage calculations, rate comparisons, and proportional reasoning. “What percentage of attacks were blocked?” is a basic question that demands division.
The solution: the model decides what to calculate; Python does the actual math. The model never does mental math. Instead, when it needs a computation, it outputs a structured tag:
<calc formula="blocked/total*100" expr="211 / 712 * 100" precision="1" />
A post-processor in the agent’s Python code evaluates the expression using a safe evaluator and substitutes the result. The model only needs to decide which calculation is needed and set it up correctly; the actual number-crunching happens deterministically.
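A minimal sketch of such a post-processor (the regex assumes the attribute order shown in the example tag, and the AST whitelist is one way to build a safe evaluator):

```python
import ast
import re

# Matches <calc ... expr="..." ... precision="N" ... /> tags; assumes
# expr appears before precision, as in the example above.
CALC_RE = re.compile(r'<calc[^>]*\bexpr="([^"]+)"[^>]*\bprecision="(\d+)"[^>]*/>')

def _safe_eval(expr):
    """Evaluate a pure-arithmetic expression via the AST; anything
    beyond numbers and basic operators raises ValueError."""
    allowed = (ast.Expression, ast.BinOp, ast.UnaryOp, ast.Constant,
               ast.Add, ast.Sub, ast.Mult, ast.Div, ast.Mod, ast.Pow,
               ast.USub, ast.UAdd)
    tree = ast.parse(expr, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, allowed):
            raise ValueError(f"disallowed syntax: {type(node).__name__}")
    return eval(compile(tree, "<calc>", "eval"), {"__builtins__": {}})

def resolve_calcs(text):
    """Replace every <calc .../> tag with its computed value."""
    def repl(match):
        value = _safe_eval(match.group(1))
        return f"{value:.{int(match.group(2))}f}"
    return CALC_RE.sub(repl, text)
```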
Factual claims got the same treatment. The model was occasionally hallucinating numbers, stating a count of 500 when the query had returned 483. To fix this, I had it cite every numeric claim with a <fact> tag pointing back to the query result it came from. This “show your sources” pattern dramatically reduced numerical hallucination, likely because forcing the model to declare “this number came from query 2, row 3” makes it actually look at the data instead of generating plausible-sounding numbers from vibes.
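A verifier for those citations can then cross-check every claimed number mechanically. This sketch invents a concrete tag format (`query`/`row`/`col` attributes); the real tag may well look different:

```python
import re

# Illustrative <fact> citation checker. The attribute names and tag
# shape here are hypothetical.
FACT_RE = re.compile(
    r'<fact query="(\d+)" row="(\d+)" col="([^"]+)">([\d.]+)</fact>')

def verify_facts(text, query_results):
    """Check every cited number against the stored query results.
    query_results: list of result sets, each a list of row dicts."""
    mismatches = []
    for q, row, col, claimed in FACT_RE.findall(text):
        actual = query_results[int(q)][int(row)][col]
        if float(claimed) != float(actual):
            mismatches.append((claimed, actual))
    return mismatches
```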
The Day My Colleagues Broke Everything
For a technical evaluation demo, I figured my SQL safety measures were decent enough. The ClickHouse connection was in read-only mode. I had regular expressions checking for dangerous patterns. I wasn’t claiming production-grade security, but for a proof of concept? Not bad, I thought.
Then I let my colleagues try it.
They immediately started trying to make the agent misbehave. And within a distressingly short time, they’d successfully gotten it to extract the ClickHouse version number, list all tables in the database, and access data from tables it had no business querying. All using perfectly valid SELECT statements that my regex patterns hadn’t anticipated. One colleague also managed to convince the agent to speak in cat language; one investigation session featured a detailed reconnaissance analysis of a host where every sentence ended with “Meow.”
LLMs are probabilistic systems that can be steered by adversarial inputs, whether those inputs come from a creative colleague or from malicious data embedded in the logs themselves. Regex-based SQL filtering is fundamentally inadequate because you’re playing whack-a-mole with an infinite space of possible SQL constructions.
The fix was to parse every LLM-generated SQL statement into an abstract syntax tree (AST) using SQLGlot, a Python SQL parser. Before any query reaches the database, the AST is inspected: non-SELECT statements are rejected outright, only whitelisted tables are permitted, and various known error patterns are automatically corrected. The LLM decides what to query; the parser guarantees how that query is allowed to execute.
The AST approach also solved a completely different problem: the model kept mixing up SQL dialects.
ClickHouse Is Not Spark
ClickHouse is a columnar database with its own idiosyncratic SQL dialect. It has functions that don’t exist in other databases, and it’s missing functions that exist everywhere else. A 20B model trained on a broad corpus of SQL examples tends to reach for Apache Spark dialect when it should be using ClickHouse functions.
For example, the model would write collect() (Spark) instead of groupArray() (ClickHouse), collect_set() instead of groupUniqArray(), or size() instead of length(). Each of these produces a perfectly sensible query that simply doesn’t work.
I tried fixing this with prompts. It helped some, but not enough: the model’s training data apparently contains so much Spark SQL that it kept reverting to familiar syntax despite being told not to. And every additional line of “don’t use X, use Y instead” in the system prompt made the prompt longer and more complex, which degraded the small model’s performance on other tasks.
The practical solution was mechanical: the AST parser detects known function name mismatches and rewrites them before execution. It’s a bit of a dirty hack, and in an ideal world you’d fine-tune the model with LoRA or similar techniques to just know ClickHouse dialect. But in the real world, a ten-line AST transformation that works today beats a fine-tuning pipeline you’ll build someday.
I accumulated these corrections over time by maintaining a dedicated error log. Every SQL error the model produced got logged, and periodically I’d review the log, categorize the errors, and either add a prompt hint or an AST transformation. It was the kind of work that makes for terrible conference talks.
The Structured Output Catastrophe
Of all the technical challenges in this project, the single biggest time sink was something I didn’t expect: getting the model to produce structured output reliably.
Modern LLM APIs offer “function calling” and “structured output” modes that are supposed to guarantee the model returns valid JSON conforming to a schema. For frontier models, these presumably work well enough. For the 20B parameter LLM from OpenAI, they were a nightmare.
The model would randomly omit required fields, invent JSON keys that weren’t in the schema, produce malformed JSON with trailing commas or unescaped characters, or sometimes just emit a half-finished response. Each failure triggered a retry, and since LLM inference is slow, retries meant the whole investigation would take forever. I estimate that more than half of my development time in the early phase of the project was wasted fighting structured output failures.
The conventional wisdom is that “native” function calling must be the most reliable approach; after all, the brilliant scientists who built the model specifically trained it for this. This turned out to be completely wrong, at least for small models. There are plenty of reports about this online.
The turning point was adopting BAML (Basically a Made-up Language), a domain-specific language for type-safe LLM interactions. BAML takes a fundamentally different approach: instead of constraining the model’s token generation at decode time, it lets the model generate freely and then parses the output tolerantly into typed data structures. It’s what network engineers call Postel’s Law (be conservative in what you send, be liberal in what you accept), applied to LLM output.
With BAML, structured output went from failing constantly to working almost every time. The same model, the same prompts, dramatically different reliability, just by changing how we handled the output boundary.
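To make the principle concrete without reproducing BAML itself, here’s a toy version of “liberal in what you accept” parsing in plain Python; it tolerates the failure modes I saw most often (markdown fences, surrounding chatter, trailing commas):

```python
import json
import re

def tolerant_parse(raw):
    """Toy illustration of tolerant output parsing; the idea behind
    BAML, not its implementation."""
    # Pull out the first {...} span, ignoring any prose or markdown
    # fences the model wrapped around it.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found")
    candidate = match.group(0)
    # Drop trailing commas before } or ], a common small-model slip.
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    return json.loads(candidate)
```

Compare this with strict decode-time constraints: instead of forcing every token to be schema-valid, you let the model talk however it likes and recover the structure afterwards.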
Context Engineering Is All You… *cough*
If there’s a single lesson from this project, it’s this: context engineering matters more than anything else.
“Context engineering” is a term that’s become popular recently, and I’d heard it before starting this project. I thought I understood it. I didn’t. Not really. Not until I lived it.
Context engineering is everything you do to control the information that reaches the LLM: how you structure system prompts, how you format query results, how you present previous investigation rounds, how you order instructions, what you include and what you omit. It’s the difference between the model producing useful analysis and producing garbage.
The moment this truly clicked for me was when I adopted BAML and took the opportunity to restructure my entire prompt architecture. I organized the system prompts into clear sections, carefully formatted the database output, and thoughtfully designed how previous investigation context was presented to the model. Test pass rates jumped dramatically, not because the model got smarter, but because the information it received was better organized.
There’s a subtlety here that I think is underappreciated. With small models especially, your system prompts must be internally consistent. As I iterated on different parts of a long system prompt, I’d occasionally introduce subtle contradictions: one section would say “always do X” while another section implied “do Y in this situation” where X and Y conflicted. Research shows that LLMs degrade significantly when prompted with contradictory instructions, and my experience confirmed this emphatically. The model’s performance would fall off a cliff, and it could take hours to figure out that the problem was a semantic contradiction buried across two sections of a 2,000-word prompt.
What I Actually Learned
Building this agent changed how I think about several things.
Small models are more capable than you expect. Not for everything; abstract reasoning and handling ambiguity still require larger models. But for well-defined, well-prompted tasks, a 20B model can do genuinely useful work. The key is architectural: break complex tasks into simple steps, give the model excellent context, and handle everything you can deterministically in code.
Domain expertise is the bottleneck. The hardest part of this project wasn’t making the model generate SQL. It was articulating what the SQL should mean. If you can clearly express your domain knowledge in language, you can get surprisingly far with a small model. If you can’t, even a frontier model will produce confident nonsense.
The LLM-code boundary is critical. Don’t let the LLM do things it’s bad at (arithmetic, remembering exact numbers, enforcing security constraints). Do let it do things it’s good at (planning investigations, deciding which summarizations are relevant, judging whether an answer is complete). The more clearly you separate these responsibilities, the more reliable your system becomes.
Trust but verify, then verify again. My regex-based SQL safety was fine until it wasn’t. The model’s structured output was reliable until it wasn’t. Every assumption I made about “this should work” eventually got tested by reality. Build layers of defense, and assume the layer you’re most confident in is the one that will fail.
Where This Goes From Here
This agent is a proof of concept, not a production system. It deliberately pushes work to the LLM even in cases where deterministic code would be more reliable, because the goal was to explore what’s possible, not to build something bulletproof.
But the core finding is real: you can build useful AI-powered security investigation tools that run entirely within your own infrastructure, on hardware that costs a fraction of what frontier model API calls would cost at scale, while keeping sensitive security data within your network boundary. Meow.
security-investigation-agent / GitHub