Build With Me — A Data Scientist's Workshop

Projects

April 13, 2026

Contributing to Apache Flink

Introduction

In the world of data processing and pipeline building, Apache Flink shows up everywhere (much like Kafka, which is just as important to understand in these scenarios). I often see it mentioned in job postings for data roles, particularly Data Engineering.

Apache Flink is an open-source framework and distributed processing engine used for stateful processing over both unbounded datasets and bounded datasets. Understanding how it works is paramount if you want to operate in this space.

For further context, unbounded data is simply ever-flowing, non-stop data streams while bounded datasets can best be described as “batch” data.

This series follows a real open-source contribution to the Flink project. Specifically, I'm working on Jira issue FLINK-25672, which addresses a known limitation in Flink's DataStream filesystem connector. When using an unbounded file source, the enumerator keeps track of every file path it has already processed, and that state can grow indefinitely. For long-running jobs watching directories with high file throughput, this becomes a real problem. The Flink docs themselves acknowledge the need for a compressed form of tracking already-processed files, and that's what this contribution aims to build. In this series I tackle the issue from scratch, making mistakes and learning as I go.

Along the way, we'll dig into the parts of Flink's architecture that are relevant to this fix: how the source enumerator works, how file discovery happens, and why unbounded state growth is a problem worth solving. You'll see the full process: reading the codebase, understanding context, writing and testing a fix, and submitting it upstream.

Welcome!!!

Posts

April 13, 2026 Post 1

Problem Familiarity

Step 1: What exactly is the problem?

When we look at the Apache Flink documentation for the FileSystem connector's current limitations, the problem is laid out in plain English: when using the FileSource connector class for unbounded file sources, the enumerator keeps track of the paths of all files that have been processed. That state grows indefinitely over time, and it will eventually lead to performance issues. The more files your program processes, the larger the state grows, because Flink remembers every single file path it comes across.

Step 2: Can I actually see this happen?

I'm a visual learner and I work best by seeing the problem first. Before jumping into any solution, I want to reproduce this issue — watch the state grow and understand why it becomes a problem in production.

To do that, we need a program that uses FileSource with unbounded data: files that simply do not stop streaming in. Here's the plan:

1. Set up a Flink job with FileSource in unbounded (continuous monitoring) mode with checkpointing enabled.
2. Write a script that rapidly generates small CSV files into the input directory, one every few milliseconds.
3. Run the job and observe the Flink UI to see the checkpoint state size climb over time.
4. Dig into specific parts of FileSource to determine whether we can add custom logging to watch the enumerator tracking processed files in real time.

Flink Limitation Come to Life

The goal here is to understand how the current limitation in Apache Flink's FileSource data source works, how it happens, and why it is a problem. The example we'll see is a very basic one. I'm going to set up a Flink job with FileSource in unbounded (continuous monitoring) mode with checkpointing enabled.

Setting up Apache Flink

Getting Flink set up for this exercise is relatively easy. I'm using IntelliJ IDEA (the free version), and the Apache documentation helps to set up Flink with this IDE.

Pre-reqs

This is the link I used to get started: here (I jumped between Using Maven, Overview, and some of the other tabs that had to do with dependencies and formats).

1. Java 11. In order to use Apache Flink, you need to have Java 11 installed.

# Homebrew (if not already installed)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# brew update (if you have it installed but haven't updated it)
brew update

# Java
brew install openjdk@11

2. Maven is my build automation tool of choice, so I'll be using Maven >= 3.8.6.

# Maven
brew install maven

3. After both installs, verify that they are available on your machine.

java -version
mvn -version

4. Now, download IntelliJ IDEA (or your IDE of choice) here.

Note: JetBrains used to have an IntelliJ IDEA Community Edition product, but they have now unified it. You download one type of product, and you can use the core functionality of Java and Kotlin for free. Of course, folks who want additional functionality can pay for it. For the purposes of Apache Flink and going through this process, that wouldn't be necessary.

5. Create a project using the following archetype with the Maven command below. Enter it into your CLI.

mvn archetype:generate \
  -DarchetypeGroupId=org.apache.flink \
  -DarchetypeArtifactId=flink-quickstart-java \
  -DarchetypeVersion=2.2.0

You'll be prompted for the groupId, artifactId, version, and package. The groupId and artifactId are up to you, but the version and package come with defaults that you can accept by pressing Enter. Maven then asks you to confirm with Y.

6. Open the project in IntelliJ and select the project folder, the one containing pom.xml (mine is flink-playground, named after the artifactId I chose). IntelliJ detects the Maven project automatically and starts downloading dependencies. You might have to give it a couple of minutes the first time.

Selecting the flink project folder in IntelliJ

While that runs, set the JDK: open File → Project Structure → Project, set SDK to your installed JDK 11, and set Language level to match.

The cool thing is you actually can download the necessary JDK directly from this page by clicking SDK → Add SDK and then Download JDK.

7. The Flink documentation notes the following:

Note on IntelliJ: To make the applications run within IntelliJ IDEA, it is necessary to tick the Include dependencies with "Provided" scope box in the run configuration.
If this option is not available (possibly due to using an older IntelliJ IDEA version), then a workaround is to create a test that calls the application's main() method.

So, when running any Flink program, you start by clicking the Play button beside the file name at the top. OR you can click on the Run 'DataStreamJob.main()' button. Of course, DataStreamJob is a placeholder name and this would ideally be any program you create.

After clicking Play or Run, the program will fail at first. Then you go into the Edit Configurations tab and select “Add dependencies with 'provided' scope to classpath”. You'd need to do this with every program, or the other option would be to create a test that calls the application's main() method. I personally prefer the first option, shown in the images below.

Selecting add dependencies with provided scope

8. Now we can go ahead and run DataStreamJob again. It'll still fail, this time because main() doesn't define a pipeline yet.

9. Add the following dependency to pom.xml for the Flink web dashboard, which can be accessed at http://localhost:8081.

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-runtime-web</artifactId>
  <version>${flink.version}</version>
  <scope>provided</scope>
</dependency>

Once you've added the dependency for the web dashboard, swap the generic streaming execution line StreamExecutionEnvironment.getExecutionEnvironment() for:

Configuration conf = new Configuration();
StreamExecutionEnvironment env =
    StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(conf);

Now if you run the program again and visit localhost:8081, you can see the Flink UI in action.

Coming up with the scripts

For the purposes of this demonstration, the goal is to set up a FileSource job with a strict checkpoint timeout (for example, 60 seconds) and a heap-based state backend like HashMapStateBackend. Then create a small Python script that serves as a “file generator” and use it to flood the Flink job with thousands of small files. As the enumerator state grows, checkpoints take longer to serialize and persist. Eventually they start timing out, which means Flink can't complete a checkpoint. On any failure, the job then restarts from an increasingly stale position or, with too many consecutive checkpoint failures, dies entirely.

1. Set up the Python script (the firehose) that floods small files into a directory, then set up a Flink FileSource job with checkpointing enabled and a strict timeout. (I chose 10 seconds because at 60 seconds it would have taken quite a while on my machine for the problem to show up.)
2. At first everything works just fine. Checkpoints complete in milliseconds and the pipeline runs without issue.
3. Then a slight degradation creeps in. Once we have tens of thousands of files, the checkpoint durations climb visibly in the Web UI.
4. Then the checkpoints begin to time out, consecutive failures stack up, and the job restarts or flat out dies.

I will also be showing the checkpoint size and what is being tracked by the enumerator.

We now have Flink installed and configured.

The Flink Job

Make sure the following Flink dependencies are available in pom.xml: flink-streaming-java, flink-clients, and flink-connector-files.

<!-- For Example -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-files</artifactId>
    <version>${flink.version}</version>
</dependency>

The POJO is simple: a UserEvent with four fields. These plain fields match the CSV schema, plus a toString() so that the print() sink produces something we can read in the logs.

package com.example;

public class UserEvent {
    public long timestamp;
    public String userId;
    public String eventType;
    public double amount;

    @Override
    public String toString() {
        return timestamp + " | " + userId + " | " + eventType + " | " + amount;
    }
}

The event generator firehose.py is a small Python loop that writes one CSV file every iteration into /tmp/flink-input. Each file contains five rows of fake user events with the schema timestamp, user_id, event_type, amount. So we'll have a millisecond timestamp, a user ID drawn from a pool of 1,000, a randomly chosen event type (click, purchase, view, logout), and a random dollar amount.

import os
import time
import random

output_dir = "/tmp/flink-input"
os.makedirs(output_dir, exist_ok=True)

event_types = ["click", "purchase", "view", "logout"]
counter = 0

while True:
    counter += 1
    filename = f"events_{counter:08d}.csv"
    filepath = os.path.join(output_dir, filename)

    lines = []
    for _ in range(5):
        ts = int(time.time() * 1000)
        user = f"user{random.randint(1, 1000):04d}"
        event = random.choice(event_types)
        amount = round(random.uniform(1, 500), 2)
        lines.append(f"{ts},{user},{event},{amount}")

    with open(filepath, "w") as f:
        f.write("\n".join(lines))

    if counter % 10000 == 0:
        print(f"Generated {counter} files")

The Flink job is also quite simple and straightforward. The monitorContinuously call is what makes this source unbounded. The Python script generates the raw data, and the job's map() step parses each line into a UserEvent POJO with the timestamp, userId, eventType, and amount fields.

The pipeline reads files, parses records, and outputs results.

package com.example;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.configuration.CheckpointingOptions;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.time.Duration;

public class FileSourceDemo {
    public static void main(String[] args) throws Exception {

        // Configure and launch with Web UI
        Configuration conf = new Configuration();
        conf.setString("rest.port", "8081");
        conf.set(CheckpointingOptions.CHECKPOINT_STORAGE, "filesystem");
        conf.set(CheckpointingOptions.CHECKPOINTS_DIRECTORY, "file:///tmp/flink-checkpoints");

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(conf);

        // Checkpointing: every 10 seconds, with a matching 10-second timeout.
        // At each checkpoint Flink snapshots the state of every operator (including the
        // enumerator's set of processed file paths) and writes it to disk. I chose 10 seconds
        // because I want to see the problem develop relatively quickly.
        // Up to 3 consecutive checkpoint failures are tolerated before the job fails.
        env.enableCheckpointing(10000);
        env.getCheckpointConfig().setCheckpointTimeout(10000);
        env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);

        // FileSource in continuous unbounded mode
        FileSource<String> source = FileSource
                .forRecordStreamFormat(
                        new TextLineInputFormat(),
                        new Path("/tmp/flink-input"))
                .monitorContinuously(Duration.ofSeconds(5))
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source")
                .map(line -> {
                    String[] parts = line.split(",");
                    UserEvent event = new UserEvent();
                    event.timestamp = Long.parseLong(parts[0].trim());
                    event.userId = parts[1].trim();
                    event.eventType = parts[2].trim();
                    event.amount = Double.parseDouble(parts[3].trim());
                    return event;
                })
                .returns(UserEvent.class)
                .print();

        env.execute("FileSource Limitation Demo");
    }
}

I start by running the Flink job first, and then in a separate terminal I run the Python script with a command like python3 firehose.py.

After that, I flip over to the Web UI at localhost:8081 to take a look at the running job and open the checkpoints tab.

FileSource job running with generated files

We can see the generated files (from the firehose script) above.

The image below shows the checkpoint End-to-End duration at 9ms, which is still quite low. Now we wait for that number to climb.

After a while, I ran the following command in my terminal to watch the checkpoint directories on disk:

while true; do du -sh /tmp/flink-checkpoints/*/chk-*/ 2>/dev/null; sleep 1; done

It shows which checkpoint we are currently at (this can also be viewed in the UI) and how much space it takes up. There are some gaps where the size shows as 0B; as far as I can tell, this happens when the directory for the next checkpoint has just been created and nothing has been written to it yet.

At this stage, I'd like to stop and point out something I discovered. The Flink UI reported the checkpointed data size as 3.20 KB. I expected that number to change, but it never did. No matter how much the checkpoint grew on disk, or how far the duration climbed from 9ms, the UI stayed at 3.20 KB. While troubleshooting, every run of while true; do du -sh /tmp/flink-checkpoints/xxxx/chk-*/ 2>/dev/null; sleep 1; done showed the checkpoints growing past 30 MB. The documentation defines the fields in the UI but offers no insight into why the size was stuck at 3.20 KB. One definition did catch my eye: “Full Checkpoint Data Size: The accumulated checkpoint data size over all acknowledged subtasks”. My best explanation is that the FileSource enumerator state isn't counted because the enumerator doesn't run in a subtask: it runs on the JobManager, and its state is written into the checkpoint's _metadata file rather than being reported by an acknowledged subtask.
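The on-disk check can also be done from Java, which is handy if you want to log the size from a test or a small tool. This is a minimal sketch using only the JDK; the class name CheckpointDiskUsage is made up for illustration, and the default directory is the /tmp/flink-checkpoints path configured in the job above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

/** Hypothetical helper, not part of Flink: sums file sizes under a directory. */
final class CheckpointDiskUsage {

    /** Sums the sizes of all regular files under root, similar in spirit to `du -s`. */
    static long totalBytes(Path root) throws IOException {
        try (Stream<Path> files = Files.walk(root)) {
            return files.filter(Files::isRegularFile)
                    .mapToLong(CheckpointDiskUsage::sizeOf)
                    .sum();
        }
    }

    private static long sizeOf(Path file) {
        try {
            return Files.size(file);
        } catch (IOException e) {
            // Can happen if Flink removes an old checkpoint while we are walking the tree.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Path.of(args.length > 0 ? args[0] : "/tmp/flink-checkpoints");
        System.out.println(totalBytes(root) + " bytes under " + root);
    }
}
```

Because Flink creates and deletes checkpoint directories while the job runs, a walk like this can race with cleanup, so treat the number as a rough reading rather than an exact one.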

After playing the waiting game for a while (about 30 minutes), we can see that the duration is starting to climb very high.

At a certain point, with all this growth, I begin to notice the job crashing and restarting (with checkpoints failing).

What we see happening: the enumerator accumulates every file path it discovers in an in-memory set. At every checkpoint, that entire set is serialized and written into the _metadata file. We proved this by looking at the file on disk: it grew from 2.8 MB to 79 MB and, at one point, to 226 MB. We ran strings on it and saw 1,843,535 individual file paths inside.
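Those two numbers give a rough per-path cost. The sketch below does the arithmetic, assuming the 226 MB size and the 1,843,535 path count were taken from the same _metadata file and that 1 MB means 10^6 bytes. The file also holds other checkpoint data, so the result is an upper estimate. The 30-day projection uses a made-up workload of 100,000 new files per day, purely for illustration.

```java
/** Back-of-the-envelope estimate; the class and its numbers are illustrative only. */
final class EnumeratorStateEstimate {

    /** Average bytes of checkpoint metadata per tracked path. */
    static double bytesPerPath(long metadataBytes, long pathCount) {
        return (double) metadataBytes / pathCount;
    }

    /** Projected metadata size in bytes once pathCount paths are tracked. */
    static long projectedBytes(double bytesPerPath, long pathCount) {
        return Math.round(bytesPerPath * pathCount);
    }

    public static void main(String[] args) {
        long metadataBytes = 226_000_000L; // 226 MB observed on disk (assuming 1 MB = 10^6 bytes)
        long pathCount = 1_843_535L;       // paths counted with `strings`
        double perPath = bytesPerPath(metadataBytes, pathCount);
        System.out.printf("~%.0f bytes per tracked path%n", perPath);

        // Hypothetical workload: 100,000 new files per day for 30 days.
        long projected = projectedBytes(perPath, 100_000L * 30);
        System.out.printf("~%d MB after 30 days%n", projected / 1_000_000);
    }
}
```

That works out to a little over 120 bytes per tracked path. The per-path cost is small; the trouble is that nothing ever removes a path, so the total only moves in one direction.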

With the timeout being 10 seconds, the enumerator's state had grown so large that serializing it couldn't be completed within the deadline. Checkpoints started timing out, and once consecutive failures exceeded our configured tolerance of 3, Flink restarted the job. On restart, it restored from the last successful checkpoint, which already had a massive enumerator state. The enumerator then discovered even more new files that had landed during the failure period, making the state bigger. The next checkpoint attempt was even larger and failed even faster. This caused the job to enter a spiral of restarts and failed tasks.

Step 3: Decide on a solution path

With the problem reproduced and the failure mode understood, the next thing to figure out is the direction of the fix. I plan to develop two solutions and compare them. I'm not sure yet what that comparison will be based on, but having both side by side should make the trade-offs clearer.

Option 1: TTL on the processed-path set. Evict expired processed paths via a TTL policy, bounding the enumerator's state by time rather than by file count. I want this to be opt-in so existing users see no behavior change. Concretely, in PendingSplitsCheckpoint, replace (or add alongside) the existing Collection<Path> alreadyProcessedPaths with a new field that pairs each path with a long timestamp, something like alreadyProcessedPathAndTimestamp.

Option 2: A time-based watermark. Instead of remembering paths at all, remember a single long: the highest modification timestamp processed so far. On each rescan, ignore any file with mtime <= watermark. This is the most aggressive form of state compression: one number, regardless of how many files have been processed.

Step 4: Pull down the Flink codebase and start prototyping

With the two paths sketched out, the next move is to pull down the Flink codebase, find my way around the relevant classes, and start testing both options against the reproduction setup from Step 2.

May 4, 2026 Post 2

Decide on a Solution Path

To reiterate the problem: for unbounded file sources, the FileEnumerator currently remembers the paths of every file it has processed, and that state can grow rather large. I'm looking into ways to reduce this state. Stated more technically, in ContinuousFileSplitEnumerator the pathsAlreadyProcessed HashSet, stored in JobManager (JM) state, grows without bound and can cause out-of-memory (OOM) issues.

From the comments on the Jira ticket (FLINK-25672), it's also stated that the problem is exacerbated by the fact that the state isn't distributed.

The two methods I'm weighing:

Method 1: A compressed form of tracking already-processed files, keeping modification timestamps as a lower boundary. In other words, a timestamp-based watermark (Option 2 in Post 1).
Method 2: TTL-based eviction (Option 1 in Post 1).

Does this need a FLIP?

I believe this MIGHT require a FLIP because some of the affected classes such as PendingSplitsCheckpoint have a @PublicEvolving annotation on them. PendingSplitsCheckpointSerializer might need to get bumped from V1 to V2, and that changes what gets written to checkpoints. The Flink contributor guide says: “If a change is identified as a large or controversial change in the discussion on Jira, it might require a Flink Improvement Proposal (FLIP) or a discussion on the Dev mailing list to reach agreement and consensus.” I don't believe this change meets that bar, but I'll be consulting with David Radley and Martijn Visser.

FLIP is for major changes or enhancements, which I don't believe this to be.
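To make the "V1 to V2" concern concrete, here is a minimal sketch of what a version bump means for anything restoring old state. It is self-contained and does not use Flink classes: ProcessedPathsSerializerSketch, both byte layouts, and the timestampForV1Entries parameter are all hypothetical, and the real PendingSplitsCheckpointSerializer layout is not reproduced here. The deserialize(version, bytes) shape loosely mirrors how Flink's versioned serializers receive the version that was written alongside the data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a versioned checkpoint serializer. Not Flink code. */
final class ProcessedPathsSerializerSketch {
    static final int V1 = 1; // layout: count, then one path per entry
    static final int V2 = 2; // layout: count, then (path, timestamp) per entry

    /** Old format: paths only. */
    static byte[] serializeV1(Collection<String> paths) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(paths.size());
        for (String path : paths) {
            out.writeUTF(path);
        }
        out.flush();
        return bytes.toByteArray();
    }

    /** New format: each path is followed by the time it was processed. */
    static byte[] serializeV2(Map<String, Long> pathsWithTimestamps) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(pathsWithTimestamps.size());
        for (Map.Entry<String, Long> entry : pathsWithTimestamps.entrySet()) {
            out.writeUTF(entry.getKey());
            out.writeLong(entry.getValue());
        }
        out.flush();
        return bytes.toByteArray();
    }

    /**
     * Reads either format. V1 entries carry no timestamp, so they are stamped with
     * timestampForV1Entries (for example, the time of the restore).
     */
    static Map<String, Long> deserialize(int version, byte[] data, long timestampForV1Entries)
            throws IOException {
        if (version != V1 && version != V2) {
            throw new IOException("Unknown version: " + version);
        }
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int count = in.readInt();
        Map<String, Long> result = new LinkedHashMap<>();
        for (int i = 0; i < count; i++) {
            String path = in.readUTF();
            long timestamp = (version == V2) ? in.readLong() : timestampForV1Entries;
            result.put(path, timestamp);
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        byte[] oldState = serializeV1(List.of("/tmp/flink-input/events_00000001.csv"));
        System.out.println(deserialize(V1, oldState, 1_000L));
    }
}
```

The point of the sketch is the read path: a job restored from a checkpoint written before the change still has to load, which is why the old layout must stay readable after the bump.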

What's actually changing for FLINK-25672

In FileSourceSplit, fileModificationTime is a per-split metadata field that's captured once at enumeration time from FileStatus#getModificationTime() and carried through the rest of the pipeline. On local disk it's java.io.File#lastModified() (which under the hood is a stat(2) syscall). On HDFS it's whatever the NameNode reports. On S3 it's the Last-Modified header of the object. Each FileSystem implementation supplies its own FileStatus subclass. This means, for our purposes, the modification timestamp already exists.
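For the local-disk case, the same value can be read with the JDK directly. This is a small sketch using java.nio.file rather than Flink's FileStatus (ModificationTimeDemo is a made-up name). It also shows that the modification time is just metadata that a writer can set to whatever it likes, which matters for any design that relies on it.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

/** Illustrative only: reads and overrides a local file's modification time. */
final class ModificationTimeDemo {

    /** Returns the file's last-modified time in epoch milliseconds. */
    static long modificationTimeMillis(Path file) throws IOException {
        return Files.getLastModifiedTime(file).toMillis();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("events_", ".csv");
        try {
            System.out.println("initial modTime: " + modificationTimeMillis(file));

            // A producer (or a backfill tool) can set modTime to any value it likes.
            Files.setLastModifiedTime(file, FileTime.fromMillis(1_000_000_000_000L));
            System.out.println("after override:  " + modificationTimeMillis(file));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```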

The current Java logic looks like this:

// state held in memory
private final HashSet<Path> pathsAlreadyProcessed;


// Filtering, called every discoveryInterval
private void processDiscoveredSplits(Collection<FileSourceSplit> splits, Throwable error) {
    if (error != null) {
        LOG.error("Failed to enumerate files", error);
        return;
    }

    // filter() keeps a split only if add() returns true, i.e. the path is new.
    // Every new path is also added to pathsAlreadyProcessed, which is what makes the state grow.
    final Collection<FileSourceSplit> newSplits =
        splits.stream()
                .filter((split) -> pathsAlreadyProcessed.add(split.path()))
                .collect(Collectors.toList());
    splitAssigner.addSplits(newSplits);

    assignSplits();
}


public PendingSplitsCheckpoint<FileSourceSplit> snapshotState(long checkpointId)
        throws Exception {
    // The full pathsAlreadyProcessed set goes into every checkpoint; a change is needed here as well.
    final PendingSplitsCheckpoint<FileSourceSplit> checkpoint =
            PendingSplitsCheckpoint.fromCollectionSnapshot(
                    splitAssigner.remainingSplits(), pathsAlreadyProcessed);

    LOG.debug("Source Checkpoint is {}", checkpoint);
    return checkpoint;
}

When it comes to HashSet<Path>, this works because if a path is in the set we've seen it before, and if it's not then we haven't. The assumption being made here is that the set is bounded by the user's needs. For a job watching a small directory of config files, or a job that runs for a few hours, this is true. But as we saw in Post 1, for a job running for months over a directory that grows by 100k files per day, the set is not bounded, and that's where issues arise.
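The dedupe idiom is easy to try in isolation. The sketch below swaps Flink's Path for plain strings and uses a made-up class name (DedupeBySetDemo), but the filter relies on the same Set.add behavior: it returns true only the first time a value is added.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Toy model of the enumerator's dedupe logic, using strings in place of Flink's Path. */
final class DedupeBySetDemo {
    private final Set<String> pathsAlreadyProcessed = new HashSet<>();

    /** Returns only the paths not seen before; every new path is remembered forever. */
    List<String> filterNew(List<String> discovered) {
        return discovered.stream()
                .filter(pathsAlreadyProcessed::add) // add() returns false for duplicates
                .collect(Collectors.toList());
    }

    int trackedPaths() {
        return pathsAlreadyProcessed.size();
    }

    public static void main(String[] args) {
        DedupeBySetDemo enumerator = new DedupeBySetDemo();
        System.out.println(enumerator.filterNew(List.of("a.csv", "b.csv")));          // [a.csv, b.csv]
        System.out.println(enumerator.filterNew(List.of("a.csv", "b.csv", "c.csv"))); // [c.csv]
        System.out.println(enumerator.trackedPaths());                                // 3
    }
}
```

Nothing in this model ever removes an entry, so trackedPaths() can only go up. That is the whole limitation in miniature.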

My mindset in approaching this: there has to be a way to let users keep the current behavior when exact tracking over a small, bounded set of files is all they need, while also making allowances for the unbounded sources that Flink, as a streaming engine, is built for.

Method 1: Timestamp watermark tracking

We know that FileSourceSplit carries the fileModificationTime:

/** Returns the modification time of the file, from {@link FileStatus#getModificationTime()}. */
public long fileModificationTime() {
    return fileModificationTime;
}

If we assume that files appear in the directory in roughly modification-time order, then once we've processed every file up to some modification time (call it highestModTime), we no longer need to remember any individual paths older than highestModTime. We do need to remember highestModTime itself, plus the paths of files whose modTime sits right at the boundary.

The HashSet<Path> should be replaced with two fields:

long highestModTime;
final HashSet<Path> recentPaths;

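The dedupe decision then has two parts: skip any file whose modTime falls below the watermark minus a safety margin, and check everything newer against the small recentPaths set. After processing a discovery batch, advance the watermark.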
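Here is a minimal, self-contained sketch of that idea, not the actual patch. It uses strings instead of Flink's Path, and the class name, the marginMillis parameter, and the method names are all hypothetical. One deliberate deviation from the two fields above: recentPaths is a map from path to modTime rather than a plain set, so that entries can be pruned once they fall below the boundary.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical watermark-based tracker; a sketch of Method 1, not Flink code. */
final class WatermarkTrackerSketch {
    private static final long NO_WATERMARK = Long.MIN_VALUE;

    private final long marginMillis;
    private long highestModTime = NO_WATERMARK;
    private long maxModTimeSeen = NO_WATERMARK;
    private final Map<String, Long> recentPaths = new HashMap<>(); // path -> modTime

    WatermarkTrackerSketch(long marginMillis) {
        this.marginMillis = marginMillis;
    }

    /** Returns true if the file should be processed, false if it is treated as already seen. */
    boolean shouldProcess(String path, long modTime) {
        if (highestModTime != NO_WATERMARK && modTime < highestModTime - marginMillis) {
            return false; // below the boundary: assumed processed, so late files are skipped
        }
        if (recentPaths.putIfAbsent(path, modTime) != null) {
            return false; // near the boundary and already remembered
        }
        maxModTimeSeen = Math.max(maxModTimeSeen, modTime);
        return true;
    }

    /** Call after each discovery batch: advance the watermark and prune old paths. */
    void advanceWatermark() {
        highestModTime = Math.max(highestModTime, maxModTimeSeen);
        if (highestModTime == NO_WATERMARK) {
            return; // nothing processed yet
        }
        long boundary = highestModTime - marginMillis;
        recentPaths.values().removeIf(modTime -> modTime < boundary);
    }

    int trackedPaths() {
        return recentPaths.size();
    }

    public static void main(String[] args) {
        WatermarkTrackerSketch tracker = new WatermarkTrackerSketch(1_000L);
        System.out.println(tracker.shouldProcess("a.csv", 10_000L));    // true
        tracker.advanceWatermark();
        System.out.println(tracker.shouldProcess("a.csv", 10_000L));    // false: still in recentPaths
        System.out.println(tracker.shouldProcess("c.csv", 20_000L));    // true
        tracker.advanceWatermark();
        System.out.println(tracker.trackedPaths());                     // 1: a.csv was pruned
        System.out.println(tracker.shouldProcess("late.csv", 12_000L)); // false: silently skipped
    }
}
```

The last line of main is the uncomfortable part: a file that shows up with a modTime below the boundary is indistinguishable from one that was already processed.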

Concerns: it's possible for a producer to write a file with a modTime older than the current watermark minus the margin, and if that's the case the file gets silently skipped. This can happen if:

An object store lets the client override modTime.
There's a historical data backfill (older files copied into a new directory).

In my mind, anything could happen to cause out-of-order modification times, and that's a concern of mine.

Method 2: TTL-based eviction

The core idea here is to keep the same HashSet<Path> semantics, but every entry also carries a timestamp (what time was this file processed), and entries older than a configured retention duration get evicted. This method relies on the user having some mechanism that ensures files older than the agreed-upon retention time won't reappear in the directory listing. Think something like an S3 lifecycle rule.

private final LinkedHashMap<Path, Long> processedPathsByTime;
private final Duration retentionDuration;

With this solution the state could still get huge. The retention time we're bound by could still result in a large state, but for a lot of users the bound matters more than the size. A bound of “7 days of files” is something you can plan capacity around, even if it's still x entries. Today's behavior gives them no bound at all.

Cons:

Misconfigured TTL vs storage retention. If the tracker's retention is 7 days but a user's lifecycle policy keeps files for 14 days, files that aged out of the tracker but are still listed by S3 will be reprocessed.
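The TTL idea and the misconfiguration risk can both be seen in a small sketch. As before, this is self-contained and hypothetical: it uses strings instead of Flink's Path, the class and method names are invented, and it assumes callers pass non-decreasing timestamps so that the map's insertion order matches processing order.

```java
import java.time.Duration;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical TTL-based tracker; a sketch of Method 2, not Flink code. */
final class TtlPathTrackerSketch {
    private final LinkedHashMap<String, Long> processedPathsByTime = new LinkedHashMap<>();
    private final long retentionMillis;

    TtlPathTrackerSketch(Duration retentionDuration) {
        this.retentionMillis = retentionDuration.toMillis();
    }

    /** Returns true if the path is not currently tracked, and records when it was processed. */
    boolean markProcessed(String path, long nowMillis) {
        return processedPathsByTime.putIfAbsent(path, nowMillis) == null;
    }

    /** Evicts entries older than the retention. Oldest entries come first, so we stop early. */
    void evictExpired(long nowMillis) {
        Iterator<Map.Entry<String, Long>> it = processedPathsByTime.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() <= retentionMillis) {
                break; // everything after this entry is newer
            }
            it.remove();
        }
    }

    int trackedPaths() {
        return processedPathsByTime.size();
    }

    public static void main(String[] args) {
        long day = Duration.ofDays(1).toMillis();
        TtlPathTrackerSketch tracker = new TtlPathTrackerSketch(Duration.ofDays(7));
        System.out.println(tracker.markProcessed("a.csv", 0L));      // true: first time seen
        System.out.println(tracker.markProcessed("a.csv", 1 * day)); // false: still tracked
        tracker.evictExpired(8 * day);                               // a.csv is now older than 7 days
        System.out.println(tracker.markProcessed("a.csv", 8 * day)); // true: it would be reprocessed
    }
}
```

The final call is the misconfiguration case: if storage still lists a.csv on day 8, the tracker has already forgotten it and hands it out again.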

Where I'm at

I'm not sure which of these solutions to go for, and I've sent an email to the Flink dev mailing list. I'm hoping to get feedback from others in the community before crafting a path forward for either method.

Post 3

Understand the Codebase

Following the problem write-up, this post digs into Flink's FileSource internals, tracing how the enumerator discovers files, where processed-path state lives, and which classes we'll need to touch in order to fix FLINK-25672.

May 4, 2026 Hackathons Field Notes 12 min read

3rd Place at Modern Day Marine 2026: Building Real-Time Dispatch Translation in Three Days

Last week I flew to D.C. for the Marine Corps Logistics Command (LOGCOM) AI Forum Hackathon, held in conjunction with Modern Day Marine 2026. Eighteen teams started. Eight made it past the semi-finals (it was supposed to be seven but there was a tie). My team, ERMP, came in third in the final round. I'm happy with the progress we made, and I'd like to share here what we built and what I'd want to do differently should any of it ever leave the prototype stage.

What the hackathon actually was

The MARADMIN that went out a few weeks before the event called for solutions to “Marine Corps logistics and sustainment use cases”: things like contested logistics, maintenance, supply chain, and expeditionary operations. The event was sponsored by the Deputy Commandant for Installations and Logistics. Judging happened across two rounds against a Demonstration Readiness Level (DRL) rubric: DRL 1 was concept-level, DRL 2 was a working prototype with representative data, and DRL 3 was an end-to-end demo in a mission-relevant scenario with measurable performance and a clear path to operational testing.

The first round cut 18 teams down to 8 through a five-minute pitch to the judges, with a demo if time allowed. The final round had a similar setup but with three additional minutes (eight in total) and a larger, more varied panel of judges drawn from high-ranking military officials, government, industry, and academia, with the top three teams placing. We made it through the first round and finished third in the final.

The team was five people: two active-duty Marines (one was the team lead and the other was a senior contributor), one retired US Army Lieutenant (and systems engineer) who did most of the coding, a senior consultant who works alongside military personnel in emergency communication centers and acted as the SME for topics involving emergency services on military bases, and me on narrative, problem-framing, and research.

The problem we picked

On the surface, our use case does not look like classic logistics. It was installation protection, specifically the Public Safety Communications Centers (PSCCs) on Marine Corps installations. These are the 911 dispatch centers for bases. They handle medical, fire, and security calls from a population that includes service members, multilingual dependents, civilian contractors, allied forces, and local national employees. A globally diverse population calling 911 in a crisis.

Our team SME laid out the operational gap for us in a statement of need that I helped shape into the narrative we presented from. The short version is:

When a non-English speaking caller reaches a PSCC, the current procedure is to conference in a third-party human translator (Language Line). Industry data puts that connection time at 60–180 seconds, assuming an interpreter for the right dialect is on duty. National Emergency Number Association (NENA) standards require 90% of 911 calls to be answered within 15 seconds and dispatched within 60. Connecting a translator alone blows past the dispatch SLA before anyone has said anything useful.

The compounding problem is location data. Over 80% of 911 calls now originate from mobile or VoIP, where Phase II Automatic Number Identification (ANI) and Automatic Location Identification (ALI) data is frequently degraded, delayed, or absent. That means in a growing share of calls, the dispatcher is fully dependent on the caller's verbal communication to determine where the emergency is. If you can't understand the caller, and you don't have ALI, you have an untraceable, unresolvable emergency in progress.

And the dispatcher's most important pre-arrival job (telling a bystander how to do CPR, how to apply a tourniquet, how to shelter in place) also requires being able to talk to them.

Reframing logistics

I want to flag the reframe because it was deliberate. The hackathon was officially about logistics and sustainment, and dispatch translation isn't logistics in the contested-supply-chain sense. Our framing was that installation protection is part of Logistics Command's mandate. Since the LOGCOM Forum was about applying AI to real, mission-relevant Marine Corps problems, we made the case that PSCC translation is exactly the kind of problem the forum was looking to surface: it's high stakes, currently solved with high-latency legacy tooling, and unblocked by a real AI capability shift. The judges were clearly impressed by the idea. We believed from the get-go that it was relevant and could be solved using analytics and AI.

I think this is worth a sidebar for anyone doing one of these in the future: the headline track matters less than your ability to show that the problem is real, the AI delta is meaningful, and a Marine somewhere will actually use what you built. We had active-duty Marines on the team who could speak to the operational reality.

What we built

The flow: a call comes into a Twilio phone number with Programmable Voice and Media Streams attached. Audio streams to OpenAI Whisper for transcription with automatic language identification. The transcript runs through GPT-4o for triage classification (medical, fire, or security), which gives the dispatcher a category before they get a full translation. ElevenLabs handles the synthesized voice on the way back to the caller in their language, including pre-arrival instructions. The dispatcher sees a live transcript on a React + Vite web console. Everything ran on Replit, which is what let us iterate at the speed we did.

The before/after we walked the judges through was a Mandarin-speaking dependent witnessing a cardiac arrest at Camp Pendleton base housing. With Language Line in the loop, EMS gets dispatched at the three-minute mark and arrives on scene at seven minutes and fifty seconds, with no bystander CPR. With our system, we showed (using PowerPoint slides) the dispatcher getting the language identified at one second and the panicked transcript at six seconds, EMS dispatched at twenty-two seconds, and bystander CPR cued by the system at one minute ten. Survival probability for an out-of-hospital cardiac arrest drops roughly 10% per minute without compressions. The math on those two timelines is the whole pitch.

My role on the team

My role was crafting our narrative and making sure the technical work we did translated to the presentation in a way the judges could understand and relate to.

Our emergency services SME gave us the operational picture from her side and I structured it into the BLUF / Vulnerability / Time-is-the-Enemy framing that ran through our pitch deck. The problem framing (language barrier compounded by ANI/ALI degradation, time-as-adversary, pre-arrival instructions as the dispatcher's most important deliverable) was where I spent most of my research hours. I sourced and verified the NENA standards, the 60–180 second translator connection benchmark, the >80% mobile/VoIP origination figure, and the cardiac arrest survival decline statistic. Every number in the pitch was something I could point to a source on.

On the build side, my contribution was evaluation rather than code. I helped sanity-check the Whisper outputs on different language samples and the GPT-4o triage classifications against the operational categories a 911 dispatch system would actually use. The team lead and the retired-Army systems engineer drove the engineering and my job was to make sure that when an executive judge asked “why this, why now, what does NENA say about it,” I had an answer that didn't fall apart on a follow-up.

How I would push this project forward

A working Replit prototype with Twilio, OpenAI, and ElevenLabs in the loop is great for showing what's possible in three days. It is not a thing you can deploy to a Marine Corps PSCC. In my opinion, the path from where we ended to getting something that can actually serve emergency calls is dominated by concerns that have very little to do with the model layer and a lot to do with the data layer underneath it.

A few things I'd want to think through if any of this gets continued:

Audit trail design. Every translation is potentially evidence. If a dispatcher acted on a transcript and the outcome was bad, someone is going to ask exactly what audio came in, what transcript came out, what model produced it, what the confidence score was, and who saw it on the dispatcher's screen. I'd want to design that schema before I picked a model.
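As a starting point for that conversation, here is a minimal sketch of what one audit record might hold. The class name, field names, and the choice of a SHA-256 hash to pin down "exactly what audio came in" are all my assumptions for illustration, not something we built during the hackathon.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: a record cannot be modified after creation
class TranslationAuditRecord:
    call_id: str
    audio_sha256: str      # fingerprint of the exact audio that came in
    transcript: str        # exactly what came out
    model_name: str        # what model produced it
    model_version: str
    confidence: float      # what the confidence score was (0.0 to 1.0)
    dispatcher_id: str     # who saw it on the dispatcher's screen
    shown_at_utc: str      # when they saw it (ISO 8601)


def make_record(call_id: str, audio: bytes, transcript: str, model_name: str,
                model_version: str, confidence: float, dispatcher_id: str,
                shown_at: datetime) -> TranslationAuditRecord:
    """Build one audit record, rejecting obviously invalid input."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    return TranslationAuditRecord(
        call_id=call_id,
        audio_sha256=hashlib.sha256(audio).hexdigest(),
        transcript=transcript,
        model_name=model_name,
        model_version=model_version,
        confidence=confidence,
        dispatcher_id=dispatcher_id,
        shown_at_utc=shown_at.astimezone(timezone.utc).isoformat(),
    )


# Hypothetical example values.
record = make_record(
    call_id="call-0001",
    audio=b"raw audio bytes would go here",
    transcript="My husband collapsed and is not breathing.",
    model_name="example-asr-model",
    model_version="v1",
    confidence=0.91,
    dispatcher_id="dispatcher-17",
    shown_at=datetime(2026, 4, 1, 12, 0, tzinfo=timezone.utc),
)
print(record)
```

Storing a hash rather than the audio itself is one way to keep the record verifiable even if the audio is not retained, though whether that satisfies an evidentiary standard is a question for the policy side.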

Criminal Justice Information Services (CJIS)-aligned retention. The “Beyond the Prototype” slide gestured at this and I'd want to actually nail it down: audio probably transient, transcripts encrypted at rest, retention windows configurable per CAD policy, and clean deletion semantics that you can prove to a CJIS auditor.
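A minimal sketch of the "configurable window plus provable deletion" idea is below. The class names, the 30-day window, and the shape of the deletion log are hypothetical, and the sketch does not cover encryption at rest or any actual CJIS requirement; it only shows the retention check producing a log entry for every deletion.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class RetentionPolicy:
    transcript_retention: timedelta  # configurable per CAD policy
    store_audio: bool = False        # audio treated as transient by default


@dataclass(frozen=True)
class StoredTranscript:
    call_id: str
    created_at: datetime


def purge_expired(transcripts, policy, now):
    """Split transcripts into those kept and a log entry for each one deleted."""
    kept, deletion_log = [], []
    for t in transcripts:
        if now - t.created_at >= policy.transcript_retention:
            # Record what was deleted, when, and why, so it can be shown later.
            deletion_log.append({
                "call_id": t.call_id,
                "deleted_at": now.isoformat(),
                "reason": "retention_window_elapsed",
            })
        else:
            kept.append(t)
    return kept, deletion_log


# Hypothetical example: a 30-day window evaluated on 1 April 2026.
policy = RetentionPolicy(transcript_retention=timedelta(days=30))
now = datetime(2026, 4, 1, tzinfo=timezone.utc)
transcripts = [
    StoredTranscript("call-old", datetime(2026, 2, 1, tzinfo=timezone.utc)),
    StoredTranscript("call-new", datetime(2026, 3, 20, tzinfo=timezone.utc)),
]
kept, deletion_log = purge_expired(transcripts, policy, now)
print([t.call_id for t in kept])
print(deletion_log)
```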

Long-tail languages. Being able to handle the dialect distribution at a specific installation is key: Camp Pendleton's caller base looks different from Kaneohe Bay's, which looks different from another base's. The right technical answer is probably installation-specific fine-tuning on dialect data, which means you need a data collection and labeling pipeline that respects the population's privacy. That's a project unto itself.

I'm flagging these because, as a security data engineer, this is where my mind goes for such a project. How do we keep an audit trail? I always ask, “What about the data? How do we store it, process it, clean it?” These are the kinds of things that turn a hackathon prototype into a product fit for use by the US military.

What's next?

I'll update this post (if I am allowed to) if I hear anything back from the hackathon organisers about pushing this idea forward as a full-fledged product.

If you want to follow what comes next, the LOGCOM post recognizing the top teams is here. The MARADMIN that started it all is on marines.mil.

April 14, 2026 Cybersecurity 10 min read

Strengthening Cyber Defense with AI: Lessons from the 2026 Threat Landscape

This post is based on a talk I gave at the Women in Data Science (WiDS) NYC event in April 2026. The findings draw from the 2026 IBM X-Force Threat Intelligence Index, an annual report based on data from thousands of real security incidents that the IBM X-Force team responded to across the globe. This post expands on that talk and connects it to something that landed the day before I presented: Anthropic's blog post on preparing your security program for AI-accelerated offense. What struck me was how directly their recommendations mapped to the problems I was already planning to discuss. The threats are real, the solutions are emerging, and as security professionals we must find a way to stay ahead of them.

The Common Thread: Security Basics Are Still Broken

If you follow the cybersecurity news cycle, you'd think the biggest threats are prompt injection attacks and deepfake doomsday scenarios. Those are real concerns, and bad actors do use those techniques to take advantage of people. But based on the X-Force Threat Intelligence Index, there are more pressing threats to focus on, and the report also points to other, more pertinent ways in which AI can be used against us.

The majority of incidents that IBM's X-Force team responded to last year weren't caused by anything exotic. They were caused by the basics not being done:

passwords being reused
authentication controls that were weak or missing entirely
organizations not having clear policies around AI use
teams not even knowing what assets they have

That gap between what makes headlines and what's actually causing breaches is the backdrop for everything that follows. The 2026 threat landscape is punishing organizations for basic security hygiene failures, and attackers are increasingly using AI to exploit those gaps faster than humans can close them.

Here is what I find compelling: the same AI capabilities being used against us can be turned around to strengthen defense. For the talk and this article, I pulled three lessons from the 2025 threat data to illustrate this. For each one, I want to show the same thing: here's the threat, here's how AI makes it worse, and here's how we can use AI to fight back.

Lesson 1: The Attack Surface Is Exploding

The threat. In previous years, the X-Force Threat Intelligence Index listed the use of valid accounts as the leading initial access vector, but that changed in 2025. X-Force observed a 44% increase in attacks that began with the exploitation of public-facing applications, things like customer portals, APIs, and web apps. Exploiting vulnerabilities in internet-facing software is now the number one way attackers gain initial access, overtaking stolen credentials for the first time in years.

How AI makes it worse. AI is squeezing organizations from both sides. On the development side, AI-generated code is introducing more vulnerabilities. Veracode's 2025 GenAI Code Security Report tested over 100 large language models and found that AI-generated code contains roughly 2.7 times more vulnerabilities than human-written code. Meanwhile, Georgia Tech's Vibe Security Radar project tracked CVEs directly caused by AI coding tools and reported finding 56 in the first three months of 2026, with 35 coming from March alone.

On the attack side, AI is helping adversaries find and exploit those vulnerabilities faster. In 2025, over 32% of vulnerabilities were exploited on or before the day the CVE was publicly disclosed, and AI-powered scanning reached 36,000 scans per second. The window between a vulnerability existing and an attacker exploiting it is fast collapsing.

The chain is simple: an organization builds a piece of software (possibly with AI, which introduces more flaws), the software goes live on the internet, attackers use AI to scan for known flaws at massive scale, the scanner finds a match, and the attacker exploits it to get in. I believe this combination of more flaws being created alongside faster scanning to find them is a significant contributor to the 44% jump we see in the X-Force Index.

How can we use AI to curb this? As security teams, we have a tough job in front of us. The Forum of Incident Response and Security Teams, in their 2026 Vulnerability Forecast, predicted a median of 59,000 new CVEs this year.

The good news is that real systems are already being built to solve this. CrowdStrike built an Exposure Prioritization Agent (that works alongside ExPRT.AI) that uses live data to answer questions like “how could a bad actor use this vulnerability?” and “what's the business impact?”, then delivers customers a prioritized list of what to fix first.

In the recent Claude blog post on preparing for AI-accelerated offense, the first recommendation is to close the patch gap by using EPSS (Exploit Prediction Scoring System) to prioritize, automate deployment, and reduce time-to-patch on internet-exposed systems. The blog also talks about what we have just covered, which is to expect more strain on our vulnerability processes. They lay out a playbook for using AI, and one example I want to highlight is AI-powered vulnerability scanning. Traditional code scanners are rule-based: they check your code against a library of known vulnerability patterns. AI-powered scanning works differently: instead of pattern matching, an agent reads and reasons through your code the way a human security researcher would. Anthropic's recommendation is straightforward: build or implement an AI agent that scans your own codebase before a bad actor does. In practice, this means pointing an LLM at your codebase in a contained environment, having it find vulnerabilities, and keeping a human in the loop to verify the findings before acting on them.
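To illustrate the prioritization step, here is a minimal sketch of ranking findings by EPSS score (a probability between 0 and 1 that a CVE will be exploited in the near term) with internet-exposed assets first. The `Finding` class, the placeholder CVE identifiers, the asset names, and the scores are all made up for the example, and putting exposure ahead of score is my own judgment call rather than anything prescribed by the blog post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    cve_id: str              # placeholder identifiers, not real CVEs
    asset: str
    epss: float              # EPSS probability, 0.0 to 1.0
    internet_exposed: bool


def prioritize(findings):
    """Internet-exposed findings first, then highest EPSS score first."""
    return sorted(findings, key=lambda f: (not f.internet_exposed, -f.epss))


findings = [
    Finding("CVE-EXAMPLE-1", "intranet-wiki", 0.92, internet_exposed=False),
    Finding("CVE-EXAMPLE-2", "customer-portal", 0.40, internet_exposed=True),
    Finding("CVE-EXAMPLE-3", "public-api", 0.75, internet_exposed=True),
]

patch_order = [f.cve_id for f in prioritize(findings)]
for rank, finding in enumerate(prioritize(findings), start=1):
    print(rank, finding.cve_id, finding.asset, finding.epss)
```

In this ordering the internal finding lands last despite having the highest score. A team that weights raw exploit likelihood more heavily could flip the sort key, so the ordering rule is worth agreeing on before automating it.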

The X-Force findings tell us what the problem is, and Anthropic's recommendations point to how we can use AI to act faster. The remaining challenge is implementing them quickly.

Lesson 2: Your Software Supply Chain Is Your Attack Surface

The threat. Most organizations today do not build all their software in-house. They rely on platforms, libraries, packages, and services from third-party suppliers. That interconnectedness is what makes modern software possible, but it's also what makes it fragile. The X-Force report tracked this over five years and found that major supply chain and third-party breaches have nearly quadrupled. Attackers target open-source registries like npm and PyPI, exploiting developer trust. One compromised component can propagate across thousands of projects.

How AI makes it worse. The AI supply chain adds a new layer. Organizations aren't just pulling in traditional software dependencies anymore. They're pulling in training data, pre-built models, plugins, skills, and AI agents all from third parties. When you download a pre-trained model from Hugging Face, you're trusting that the weights are safe, that the training data was clean, and that nobody has tampered with it. But you didn't train it and as a result you can't be certain of what went into it. That's a supply chain decision. The chain is getting much more complex and harder to trace, and AI adoption is accelerating that.

How can we use AI to curb this? The core defensive need is visibility: knowing what's in your software and what happens if a piece of it breaks or, worse, is compromised. A lot of organizations don't have this picture.

When it comes to the software supply chain, AI can help in ways that go beyond what traditional tools offer. Anthropic's blog lays out a few practical approaches that stood out to me. The first is using AI to identify redundancy in your dependencies. Most large codebases accumulate multiple libraries doing the same job: multiple HTTP clients, multiple JSON parsers, and each one extends the attack surface for no functional gain. Anthropic recommends pointing an LLM at your dependency file and asking which packages overlap and what consolidation would look like. Fewer dependencies means fewer things that can be compromised.
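Before handing a dependency file to an LLM, a simple pre-pass can already surface the obvious overlaps. The sketch below assumes a `requirements.txt`-style file and a small hand-maintained category map; the map, the sample file contents, and the prompt wording are my own illustration, and the sketch stops short of calling any model.

```python
import re
from collections import defaultdict

# Assumption: a hand-maintained map of package -> job it does.
CATEGORY_BY_PACKAGE = {
    "requests": "http-client", "httpx": "http-client", "aiohttp": "http-client",
    "ujson": "json-parser", "orjson": "json-parser", "simplejson": "json-parser",
}


def parse_requirements(text):
    """Extract lowercase package names from requirements.txt-style text."""
    names = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        if not line or line.startswith("-"):   # skip blanks and pip options
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0).lower())
    return names


def find_overlaps(names):
    """Return categories that are covered by more than one package."""
    groups = defaultdict(list)
    for name in names:
        category = CATEGORY_BY_PACKAGE.get(name)
        if category:
            groups[category].append(name)
    return {cat: sorted(pkgs) for cat, pkgs in groups.items() if len(pkgs) > 1}


def build_prompt(overlaps):
    """Draft a question for an LLM about consolidating the overlapping packages."""
    lines = [f"- {cat}: {', '.join(pkgs)}" for cat, pkgs in sorted(overlaps.items())]
    return ("These packages appear to do the same job in our codebase:\n"
            + "\n".join(lines)
            + "\nWhich could be consolidated, and what would the migration involve?")


# Illustrative file contents; the version pins are placeholders.
sample = """
# web stack
requests==2.0
httpx>=0.1
flask
orjson
ujson~=5.0
"""
overlaps = find_overlaps(parse_requirements(sample))
print(build_prompt(overlaps))
```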

The second is using AI to replace dependencies that are no longer maintained. Some packages your software relies on may have no active maintainer, no recent updates, and no commitment to patching vulnerabilities. Rather than continuing to depend on them, Anthropic recommends having an LLM rewrite the specific functionality you actually use from that package. The LLM can scan the package's codebase and replicate the functionality. This way you replace a risky third-party component with code you now control, thereby removing that link from the supply chain entirely.

Non-AI tools like OpenSSF Scorecard can also help audit the security of your open-source dependencies.

Like the previous lesson, the X-Force report quantifies the risk while Anthropic's recommendations show how AI-powered tooling can address it.

Lesson 3: More Attackers, More Noise, Same You

The threat. In the past couple of years, law enforcement has had real success dismantling the big ransomware gangs. REvil, for example, was dismantled in 2021/2022. Though this is positive for the community, we must remember that when you break up a large criminal operation, the factions disperse into smaller groups. The X-Force report identified 109 active ransomware or extortion groups in 2025, a 49% increase from the year before. These smaller groups are less well resourced, but there are a lot more of them, and they're harder to track because they use shared tooling and overlapping tactics.

How AI makes it worse. We can't prove that the increase in ransomware groups is because of AI. The fragmentation happened because of law enforcement action. That said, AI likely sustains it by lowering the barrier to operate. In March 2026, IBM X-Force published research on “Slopoly”, AI-generated malware found during a real ransomware investigation. The script was technically mediocre, probably produced by a less advanced model. But it did work. The attackers, a group called Hive0163 known to be responsible for major global ransomware attacks, used the malware to maintain persistent access for over a week. As the X-Force analysis concluded: “AI-generated malware doesn't pose a new or sophisticated threat from a technical standpoint. What it does is disproportionately enable attackers by reducing the time needed to develop and execute an attack.”

How can we use AI to curb this? When you go from tracking a handful of major ransomware groups to 109, the volume of threat intelligence explodes. More groups means more indicators of compromise, more tactics and techniques to catalog, more reports to read, more alerts to triage. The people doing this work are our colleagues: the threat intelligence analysts, detection engineers, and threat hunters. This volume of intel can be overwhelming, and we can use AI to sift through a good chunk of it.

This is where AI plugs in most directly. An AI system can take an indicator from a threat feed, say an IP address, and check it against internal telemetry. Has it appeared in your logs? When? What happened when it did appear? When analysts write up incident reports, they need to map attacker behavior to the MITRE ATT&CK framework, and LLMs can help with that mapping. Across all the feeds, alerts, and disclosures, AI can filter, prioritize, and surface what actually needs human attention.
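The indicator lookup itself is plain data work that an agent could call as a tool. Below is a minimal sketch answering the three questions above for one IP address. The `LogEvent` shape and the sample events are hypothetical (the addresses come from reserved documentation ranges), and a real deployment would query a SIEM rather than an in-memory list.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEvent:
    timestamp: str   # ISO 8601 in UTC, so string order matches time order
    src_ip: str
    dst_ip: str
    action: str      # what happened, e.g. "allowed" or "blocked"


def find_sightings(indicator_ip, events):
    """Return events where the indicator is source or destination, oldest first."""
    target = ipaddress.ip_address(indicator_ip)
    hits = [
        e for e in events
        if target in (ipaddress.ip_address(e.src_ip), ipaddress.ip_address(e.dst_ip))
    ]
    return sorted(hits, key=lambda e: e.timestamp)


def summarize(indicator_ip, events):
    """Answer: has it appeared, when, and what happened when it did?"""
    hits = find_sightings(indicator_ip, events)
    return {
        "indicator": indicator_ip,
        "seen": bool(hits),
        "first_seen": hits[0].timestamp if hits else None,
        "last_seen": hits[-1].timestamp if hits else None,
        "actions": sorted({e.action for e in hits}),
    }


events = [
    LogEvent("2026-04-02T09:15:00Z", "10.0.0.5", "203.0.113.7", "allowed"),
    LogEvent("2026-04-01T22:40:00Z", "203.0.113.7", "10.0.0.9", "blocked"),
    LogEvent("2026-04-03T11:00:00Z", "10.0.0.5", "198.51.100.20", "allowed"),
]
summary = summarize("203.0.113.7", events)
print(summary)
```

A structured summary like this is also easier for a human to verify than free-text output, which matters if the result feeds a triage decision.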

Anthropic's blog makes some direct AI agent recommendations. They recommend putting a model at the front of your alert queue, giving every inbound alert an automated first-pass investigation before a human sees it. They also describe an AI “triage agent” with read-only access to your SIEM platform that can direct attention to the alerts requiring human judgment. Finally, they recommend using AI as an incident scribe and parallel investigator during active incidents: letting the agent take notes, capture artifacts, pursue parallel investigation tracks, and draft postmortems can be an immense timesaver.

Their practical advice is also worth noting for this lesson: “pick one noisy alert rule with a high false positive rate, wire a model into its alert stream with read-only access, have it produce a structured disposition for every firing, and measure agreement against a human reviewer for two weeks. Start small, prove it works, expand from there.”
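The "measure agreement against a human reviewer" step can be scored with very little code. The sketch below computes the raw agreement rate and Cohen's kappa, which corrects for agreement that would happen by chance. Using kappa is my own suggestion rather than something the blog prescribes, and the disposition labels and sample data are invented for the example.

```python
from collections import Counter


def agreement_rate(model, human):
    """Fraction of alerts where the model and the human gave the same disposition."""
    if len(model) != len(human) or not model:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(m == h for m, h in zip(model, human)) / len(model)


def cohens_kappa(model, human):
    """Agreement corrected for chance: 1.0 is perfect, 0.0 is chance level."""
    n = len(model)
    observed = agreement_rate(model, human)
    model_counts, human_counts = Counter(model), Counter(human)
    labels = set(model_counts) | set(human_counts)
    expected = sum(model_counts[l] * human_counts[l] for l in labels) / (n * n)
    if expected == 1.0:   # both reviewers used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented dispositions for eight firings of one noisy alert rule.
model = ["benign", "benign", "escalate", "benign",
         "escalate", "benign", "benign", "escalate"]
human = ["benign", "benign", "escalate", "escalate",
         "escalate", "benign", "benign", "benign"]

rate = agreement_rate(model, human)
kappa = cohens_kappa(model, human)
print(f"agreement: {rate:.2f}, kappa: {kappa:.2f}")
```

Raw agreement can look flattering on a rule that is almost always benign, because a model that answers "benign" every time will agree most of the time. That is the reason for reporting a chance-corrected number alongside it.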

The Pattern

What struck me most when I read Anthropic's blog the day before my talk was how clearly the recommendations mapped to the problems the X-Force data was surfacing. Two completely independent sources (one an annual threat report based on thousands of real incidents, the other a set of security recommendations from an AI company based on what they've learned using frontier models to secure real systems) were pointing in the same direction.

The vulnerability flood needs AI-powered prioritization and scanning. The supply chain complexity needs AI-powered dependency mapping and auditing. The threat intelligence overload needs AI-powered triage, classification, and summarization.

At WiDS, I framed this through the lens of data science, asking: where do people with our skills plug into these problems? The broader point holds regardless of your role, though. AI is accelerating both offense and defense in cybersecurity, and the organizations and practitioners that adopt AI-powered defensive tooling will be better positioned than those that don't. The threats the X-Force report documents are not going away; they will likely be back in the report next year, and I reckon we will see these same trends by the time the Verizon DBIR rolls around. The good news is that the tools we can employ to fight back are readily available, but as security practitioners we need to act fast.

The same AI that is being used against us can be used to defend us.

Let's Learn Together

Building Security Agents for AI-Accelerated Offense

Using Open Source LLMs

The MCP Zero-Trust Baseline

Verizon DBIR: How to Read It

What goes into making an MCP server secure?

Coming Soon

Hey, I'm Sophia.

Contributing to Apache Flink

Apache Flink HTTP Connector — Documentation

BuffaLogs — OpenSearch Ingestion Tests

Watch the talk on YouTube

Strengthening Cyber Defense with AI: Lessons from the 2026 Threat Landscape