REPLY AI Agent Challenge - April 2026

April 17, 2026

[Image: REPLY AI Agent Challenge cover]

I participated in the REPLY AI Agent Challenge in April 2026.


What the challenge actually was

Stripping away the narrative, the problem was:

Given multiple datasets (transactions, users, locations, SMS, emails),
decide which transactions are fraudulent.

Key constraints:

  • fraud behavior evolves over time
  • decisions have asymmetric cost (false positives vs false negatives)
  • systems must generalize across datasets
  • only the evaluation dataset determines the official score
  • only the first submission on evaluation data is accepted

Beyond accuracy, the system is also evaluated on:

  • cost
  • speed
  • efficiency

Pre-Challenge Execution

Before the challenge started, I focused on something that usually gets ignored:

how the team would operate under constraint.

The team was split between Japan and Brazil, so I introduced a minimal structure early:

  • one shared channel for final decisions and conclusions
  • private loops for iteration and feedback
  • explicit separation between signal and noise

Roles were defined upfront:

  • Architecture / Direction (me) → system design, decisions, scope control
  • Core Engineering → implementation and integration
  • Validation / Output → testing and final result

We aligned on a simple execution loop:

define → implement → connect → test → repeat

This was not about process overhead. It was about avoiding:

  • duplicated work
  • misalignment under time pressure
  • decision bottlenecks

Result:

the team could move in parallel without losing direction


Strategy Before Data

Before seeing the full problem, we aligned on a few working assumptions:

  • this is a decision system over data, not a UI problem
  • iteration speed matters more than initial accuracy
  • we need something that works early and improves incrementally

Based on that, we prepared:

  • a basic data pipeline
  • a modular scoring structure
  • logging to support fast iteration

So when the challenge started:

we were adapting a system, not starting from zero


System Design

We built a layered decision system to balance signal quality, cost, and speed.


L1 — Statistical Baseline

For each user, we computed:

  • median transaction amount
  • MAD (Median Absolute Deviation)
  • behavioral references (recipients, methods, activity hours)

This established a stable notion of “normal” behavior per user.
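
As an illustration, the per-user baseline can be sketched in a few lines of Python. This is a minimal sketch, not our exact code: the function names, the `(user_id, amount)` input shape, and the 1.4826 scaling constant (a common convention that makes MAD comparable to a standard deviation for normally distributed data) are assumptions made for the example.

```python
from collections import defaultdict
from statistics import median


def build_baselines(transactions):
    """Map each user to (median amount, MAD) from (user_id, amount) pairs."""
    by_user = defaultdict(list)
    for user_id, amount in transactions:
        by_user[user_id].append(amount)

    baselines = {}
    for user_id, amounts in by_user.items():
        med = median(amounts)
        # MAD: median of the absolute deviations from the median.
        mad = median([abs(a - med) for a in amounts])
        baselines[user_id] = (med, mad)
    return baselines


def amount_deviation(amount, med, mad, eps=1e-9):
    """Robust z-score: distance from the median in scaled-MAD units.

    eps avoids division by zero when all of a user's amounts are identical.
    """
    return abs(amount - med) / (1.4826 * mad + eps)
```

Median and MAD are used here instead of mean and standard deviation because a single large fraudulent transaction barely moves them, so the baseline stays representative of the user's normal behavior.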


L2 — Feature-Based Scoring

Each transaction was transformed into a set of features:

  • amount deviation (MAD-based)
  • amount vs income
  • balance drain ratio
  • new recipient detection
  • geo inconsistency (when applicable)
  • description-based signals
  • phishing exposure prior to the transaction

Phishing exposure was modeled by:

  • parsing SMS and emails
  • detecting suspicious patterns
  • building a timeline per user
  • correlating transaction timing with prior events

This added context that is not visible in the transaction alone.
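
The steps above can be sketched roughly as follows. The keyword pattern, the 48-hour window, and the linear decay are illustrative assumptions, as are the function names and the `(user_id, timestamp, text)` message shape; the actual patterns and timing rules we used are not reproduced here.

```python
import re
from bisect import bisect_left
from datetime import datetime, timedelta

# Hypothetical pattern list, for illustration only.
SUSPICIOUS = re.compile(
    r"verify your account|urgent|password|click here|suspended",
    re.IGNORECASE,
)


def build_timeline(messages):
    """Map each user to sorted timestamps of suspicious SMS or emails.

    messages: iterable of (user_id, timestamp, text).
    """
    timeline = {}
    for user_id, ts, text in messages:
        if SUSPICIOUS.search(text):
            timeline.setdefault(user_id, []).append(ts)
    for stamps in timeline.values():
        stamps.sort()
    return timeline


def phishing_exposure(timeline, user_id, tx_time, window=timedelta(hours=48)):
    """Score in [0, 1]: near 1 right after a suspicious message, 0 after `window`."""
    stamps = timeline.get(user_id, [])
    i = bisect_left(stamps, tx_time)  # stamps[:i] are strictly before tx_time
    if i == 0:
        return 0.0  # no suspicious message preceded this transaction
    gap = tx_time - stamps[i - 1]  # time since the most recent one
    if gap >= window:
        return 0.0
    return 1.0 - gap / window
```

A transaction made 24 hours after a flagged message would score 0.5 under these assumptions, while one made before any flagged message scores 0.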


L3 — Composite Scoring

Features were combined into a weighted score.

Design choices:

  • no single feature is decisive
  • known benign patterns reduce score
  • small transactions are heavily down-weighted
  • certain transaction types are penalized

We used a dynamic threshold (roughly the top 10% of scores) to flag suspicious transactions.

This ensured:

  • valid output constraints
  • adaptability across datasets
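
A minimal sketch of the composite score and the dynamic threshold is shown below. All weights, the small-amount cutoff, and the damping factors are hypothetical values chosen for the example, and features are assumed to be pre-scaled to [0, 1]. The "at least one flagged transaction" rule is also an assumption about what a valid output requires.

```python
import math

# Hypothetical weights; the values we actually used are not listed in this post.
WEIGHTS = {
    "amount_deviation": 0.30,
    "balance_drain": 0.20,
    "new_recipient": 0.15,
    "geo_inconsistency": 0.10,
    "phishing_exposure": 0.25,
}
SMALL_AMOUNT = 20.0   # assumed cutoff for "small" transactions
SMALL_FACTOR = 0.2    # assumed down-weighting for small transactions
BENIGN_FACTOR = 0.5   # assumed reduction for known benign patterns


def composite_score(features, amount, known_benign=False):
    """Weighted sum of features clamped to [0, 1], then damped."""
    score = sum(
        weight * min(max(features.get(name, 0.0), 0.0), 1.0)
        for name, weight in WEIGHTS.items()
    )
    if known_benign:
        score *= BENIGN_FACTOR
    if amount < SMALL_AMOUNT:
        score *= SMALL_FACTOR
    return score


def flag_top_fraction(scores, fraction=0.10):
    """Return the indices of the highest-scoring `fraction` (at least one)."""
    k = max(1, math.ceil(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])
```

Because the cut is rank-based rather than a fixed score value, the same code flags a comparable share of transactions even when the score distribution shifts between datasets.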

L4 — Selective LLM Usage

The LLM was used selectively, not as the primary mechanism.

We sent:

the ~30% highest-risk transactions

to a Groq-hosted model.

The model received:

  • transaction context
  • user profile
  • recent communications

and returned a probability score.

In this system:

the LLM acts as a secondary signal layered on top of statistical and behavioral analysis
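
The routing logic can be sketched as below. This is an illustration under stated assumptions: `ask_llm` is a hypothetical callable standing in for the request to the hosted model (no real client API is shown), the 30% share is taken over all scored transactions, local scores are assumed to lie in [0, 1], and the 0.4 blend weight is invented for the example.

```python
import math


def route_to_llm(scored, ask_llm, fraction=0.30, llm_weight=0.4):
    """Blend an LLM probability into the highest-risk share of transactions.

    scored:  list of (tx_id, local_score) with local_score in [0, 1].
    ask_llm: hypothetical callable, tx_id -> fraud probability in [0, 1].
    Transactions outside the top `fraction` keep their local score and
    never trigger a model call.
    """
    k = math.ceil(len(scored) * fraction)
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)

    final = {}
    for rank, (tx_id, local_score) in enumerate(ranked):
        if rank < k:
            # Clamp the model output so a malformed reply cannot dominate.
            p = min(max(float(ask_llm(tx_id)), 0.0), 1.0)
            final[tx_id] = (1 - llm_weight) * local_score + llm_weight * p
        else:
            final[tx_id] = local_score
    return final
```

Keeping the model's weight below 0.5 in this sketch reflects the role described above: the LLM can shift a borderline score, but it cannot override the statistical and behavioral layers on its own.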


Key Design Choice

Instead of trying to detect fraud directly, we focused on:

what happens before the transaction.

Specifically:

  • phishing messages
  • suspicious emails
  • timing between contact and action

This allowed us to model causal context, not just isolated anomalies.


Efficiency Considerations

Efficiency was treated as a first-class constraint:

  • most transactions resolved locally
  • LLM used only when necessary
  • batch processing to control latency
  • simple statistical methods over heavier models

This kept the system responsive and predictable under load.


Result

40th place out of 1,971 teams

Given:

  • a 6-hour constraint
  • a distributed team
  • evolving datasets

this result reflects consistent execution rather than a single optimization.

[Image: REPLY AI Agent Challenge rank]

What I would improve

Under the time constraint, several decisions were made to prioritize speed and reliability.

With more time, I would focus on:

  • Temporal modeling
    Transactions were mostly evaluated independently. Incorporating sequence-aware analysis would better capture evolving fraud behavior.

  • Adaptive weighting
    Feature weights were fixed. A data-driven approach would improve generalization across datasets.

  • Tighter feedback loop
    Thresholding and scoring were calibrated per dataset, but not continuously refined during execution.

  • More selective LLM routing
    The LLM was already used only as a secondary signal, but the selection of which transactions to route could be further optimized for cost vs impact.

These were deliberate trade-offs:

prioritize a system that works reliably under constraint over a more complex system that would require more time to stabilize.


Final Takeaway

This challenge was not about building a perfect model.

It was about:

  • making decisions under uncertainty
  • balancing accuracy with cost and speed
  • structuring a system that can adapt quickly

And at the team level:

creating enough structure so execution remains stable under pressure

REPLY AI Agent Challenge - April 2026 | Baldaia Diniz Vitor