Work · Tim Costagliola

AI automations

#automations

Workflows that run unattended inside real businesses, with the error handling, retries, and human checkpoints that let the owner stop babysitting.

ReplyPilot

Review responses drafted in the owner's voice from a handful of examples. Suggest-before-send by design; a null-provider fallback runs the whole app offline and labels fallback output instead of faking it.

Measuring
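
A minimal sketch of the labeled-fallback idea, in Python for illustration only. `Draft`, `null_provider`, and `draft_reply` are hypothetical names, not ReplyPilot's actual code; the point is that fallback output carries its own label and nothing is sent automatically.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Draft:
    text: str
    source: str  # "model" or "fallback"; meant to be shown next to the draft


def null_provider(review: str) -> str:
    # Hypothetical offline stand-in: a fixed template, no network call.
    return "Thank you for your review. We appreciate the feedback."


def draft_reply(review: str, provider: Optional[Callable[[str], str]] = None) -> Draft:
    """Return a suggested reply. Suggest-before-send: this never posts anything."""
    if provider is not None:
        try:
            return Draft(text=provider(review), source="model")
        except Exception:
            pass  # provider unavailable: fall through to the labeled fallback
    return Draft(text=null_provider(review), source="fallback")
```

Called with no provider, `draft_reply` still returns a usable draft, and its `source` field says plainly that it came from the fallback.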

Review responder (n8n)

Classify → draft → human approval → post → log, self-hosted and production-hardened: error workflows, retries, idempotency. Built to run for months, not for demos.

In progress
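
The real workflow lives in n8n nodes, not Python; the sketch below only illustrates how retries and idempotency fit together at the post step. `post_once`, `idempotency_key`, the `send` callback, and the in-memory `seen` set are all assumptions made for the example.

```python
import hashlib
import time
from typing import Callable, Set


def idempotency_key(review_id: str, reply_text: str) -> str:
    # Same review and same reply always map to the same key.
    return hashlib.sha256(f"{review_id}\n{reply_text}".encode("utf-8")).hexdigest()


def post_once(
    review_id: str,
    reply_text: str,
    send: Callable[[str, str], None],
    seen: Set[str],
    max_attempts: int = 3,
    base_delay: float = 0.5,
) -> str:
    """Post a reply at most once, retrying transient failures with backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    key = idempotency_key(review_id, reply_text)
    if key in seen:
        return "skipped"  # an earlier run already posted this reply
    for attempt in range(1, max_attempts + 1):
        try:
            send(review_id, reply_text)
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of retries: surface to the error-handling path
            time.sleep(base_delay * 2 ** (attempt - 1))
        else:
            seen.add(key)
            return "posted"
    raise AssertionError("unreachable")
```

In a long-running deployment the `seen` set would need to be durable storage rather than memory, so a restart does not cause a double post.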

Production systems

#systems

Full products designed, built, and operated end-to-end.

Twexly

Multi-tenant AI marketing platform, ~33k lines. Tenant isolation enforced by Postgres row-level security. Every model call flows through one generate() funnel: routing, retries, structured output, cost metering in microdollars. A tree-grep test enforces the invariant that no model call bypasses the funnel.

Production
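
One way a tree-grep test can work, sketched under assumptions: the source tree is Python, the funnel lives in a single allowed file, and "bypassing the funnel" means importing a provider SDK directly. The pattern, the SDK names, and `find_violations` are illustrative, not Twexly's actual test.

```python
import re
from pathlib import Path
from typing import List

# Hypothetical rule: a direct import of a provider SDK anywhere but the funnel.
FORBIDDEN = re.compile(r"^\s*(import|from)\s+(openai|anthropic)\b", re.MULTILINE)


def find_violations(root: Path, allowed: Path) -> List[Path]:
    """Return files under root, other than `allowed`, that import a provider SDK."""
    violations = []
    for path in sorted(root.rglob("*.py")):
        if path.resolve() == allowed.resolve():
            continue  # the funnel module is the one place the SDK may appear
        if FORBIDDEN.search(path.read_text(encoding="utf-8")):
            violations.append(path)
    return violations
```

A test suite would then assert that `find_violations(...)` returns an empty list, so any new direct call fails the build.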

Handoffio

Quoting and scheduling for solo contractors, 22k lines. Treats LLM output as untrusted input: strict JSON contract, hard validation layer. Nothing reaches a client unvalidated.

Production
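
A sketch of what "untrusted input" can mean in practice, in Python with the standard library only. The contract fields (`description`, `hours`, `rate_cents`) and the bounds are invented for the example; Handoffio's real contract is not shown here.

```python
import json
from typing import Any, Dict

# Hypothetical contract for a quote; field names are illustrative only.
REQUIRED = {"description": str, "hours": (int, float), "rate_cents": int}


def parse_quote(raw: str) -> Dict[str, Any]:
    """Parse model output and reject anything outside the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if set(data) != set(REQUIRED):
        raise ValueError(f"unexpected or missing keys: {sorted(set(data) ^ set(REQUIRED))}")
    for key, expected in REQUIRED.items():
        value = data[key]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"{key}: wrong type")
    # Range checks; the chained comparison also rejects NaN and Infinity.
    if not 0 < data["hours"] <= 10_000:
        raise ValueError("hours out of range")
    if data["rate_cents"] < 0:
        raise ValueError("rate_cents must be non-negative")
    return data
```

Extra keys are rejected as firmly as missing ones: a strict contract means the model cannot smuggle in fields the rest of the system never agreed to handle.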

Agents & tooling

#agents

Agent runtimes and the plumbing around them. Written by hand first, so no framework holds mysteries.

ARIA

Local-first assistant on a from-scratch runtime: a hand-written think→act→observe loop with a loop-guard that catches repeated identical tool calls; one zod-typed tool registry serving both Ollama function-calling and an in-process MCP server; a symlink-safe permission engine. Files never leave the machine. 199 test files.

Production
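
The loop-guard idea, reduced to a sketch. ARIA's registry is zod-typed, which suggests TypeScript; this Python version is for illustration only, and `LoopGuard`, `max_repeats`, and the consecutive-repeat rule are assumptions rather than ARIA's actual logic.

```python
import json
from typing import Any, Dict, Optional, Tuple


class LoopGuard:
    """Trips when the same tool is called with identical arguments repeatedly in a row."""

    def __init__(self, max_repeats: int = 3) -> None:
        self.max_repeats = max_repeats
        self._last: Optional[Tuple[str, str]] = None
        self._count = 0

    def check(self, tool: str, args: Dict[str, Any]) -> bool:
        """Return True to continue, False once the repeat limit is reached."""
        # Canonical JSON so that key order cannot hide a repeat.
        signature = (tool, json.dumps(args, sort_keys=True))
        if signature == self._last:
            self._count += 1
        else:
            self._last, self._count = signature, 1
        return self._count < self.max_repeats
```

The agent loop would call `check` before each act step and break out, or ask the user, when it returns False.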

ARIA's loop, rebuilt in LangGraph

The same control flow as a stateful graph: loop-guard, Postgres checkpointing, a human-approval interrupt with resume-from-kill.

In progress

Public MCP server + Agent Skill

Published to the registry with OAuth, annotations, a threat model, and tests. Then maintained in public: issues answered, releases cut.

2026

Measurement & safety

#measurement

The machinery that makes everything else defensible. Every artifact ships with its numbers.

Eval harness

Golden dataset from real briefs, deterministic checks, a calibrated LLM judge, and a CI gate that fails the build on regression. Pointed first at Twexly's writer, then proven general against ReplyPilot.

In progress · Q3 2026
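
A CI gate of this kind can be small. The sketch below assumes scores are floats where higher is better and that a stored baseline exists; `regressions`, `gate`, the metric names, and the tolerance are hypothetical.

```python
from typing import Dict, List


def regressions(
    baseline: Dict[str, float], current: Dict[str, float], tolerance: float = 0.01
) -> List[str]:
    """List metrics that dropped more than `tolerance` below baseline, or went missing."""
    failed = []
    for name, base in sorted(baseline.items()):
        score = current.get(name)
        if score is None or score < base - tolerance:
            failed.append(name)
    return failed


def gate(baseline: Dict[str, float], current: Dict[str, float], tolerance: float = 0.01) -> int:
    """Exit code for CI: 0 passes, 1 fails the build."""
    return 1 if regressions(baseline, current, tolerance) else 0
```

A CI step would pass the result to `sys.exit`, so a regression on any tracked metric stops the merge.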

Retrieval bake-off

Context-stuffing vs keyword vs semantic vs hybrid+rerank, scored on labeled queries over real content. The winner ships to production.

Q4 2026
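
The scoring metric is not specified above; recall@k is one common choice for labeled queries and is used here purely as an assumption. Each retrieval strategy would produce a `runs` mapping of query to ranked document IDs, scored against the same `labels`.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the labeled relevant documents found in the top k results."""
    if not relevant:
        raise ValueError("query has no labeled relevant documents")
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(
    runs: Dict[str, List[str]], labels: Dict[str, Set[str]], k: int = 5
) -> float:
    """Average recall@k over every labeled query; a query with no run scores 0."""
    if not labels:
        raise ValueError("no labeled queries")
    scores = [recall_at_k(runs.get(q, []), rel, k) for q, rel in labels.items()]
    return sum(scores) / len(scores)
```

Running every strategy through the same function on the same labels keeps the comparison fair.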

Cost & trace dashboard

Full-funnel tracing, per-feature cost attribution, online scoring of live traffic. Then costs cut with quality held flat, proven by evals.

Q1 2027

Red-team report

A versioned attack battery run against my own products, guardrails built in depth, attack success measured before and after. Responsibly disclosed.

Q2 2027

Hardware & embedded

#hardware

Range past the browser, down to interrupt handlers.

Project Nightlight

Passive Wi-Fi/RF threat-detection field kit. The same detector implemented three times: Python, host-testable C, ISR-safe Arduino. Bespoke C test harness, CI, and a documented ethics standard for defensive use.

Field-tested