Regulated AI  /  Healthcare · Life sciences · Financial services

I build governed AI tools for leading customers.

Fifteen years shipping production AI in regulated, data-sensitive industries: healthcare, life sciences, and financial services. I take agents and ML systems from prototype to audit-ready, with the evals, guardrails, and SOC 2/HIPAA evidence that let Legal, Security, and Compliance all say yes.

Book a 20-min scoping call → See selected work

Jack Challis
Burlingame, CA

What I do

Regulated industries don't need more demos. They need AI that ships — agents and pipelines that are measured, bounded, and audit-ready, built end to end from architecture through production.

AI agents in production

Agentic systems with tool-routing, hybrid retrieval (RAG), and guarded text-to-SQL — deployed where work already happens: Microsoft Teams, Copilot Studio, and zero-trust corporate networks, with managed-identity auth throughout.

AGENTS

Evaluation, safety & guardrails

AI output gets measured and bounded, not trusted blindly. Deterministic eval harnesses, type-safe agent loops, and objective gates that run before any human approval — a thick boundary around a creative interior.

EVALS

Data pipelines & analytics

Fragmented operational data unified into a single source of truth — LLM classification against domain rules, probability-weighted forecasting, idempotent re-runs, and strict separation of sensitive identifiers from shared reporting.

PIPELINES

Compliance & governance

SOC 2 Type II run end to end — gap analysis, control-policy authoring, evidence, audit readiness — plus HIPAA-aware architecture, PHI-safe data lanes, and security boundaries reviewed with zero high/critical findings.

SOC 2 / HIPAA

AI Readiness & Risk Assessment

A clear-eyed read on where your organization actually is — the controls in place, the gaps that block deployment, and a sequenced path to a governed pilot.

ASSESSMENT

Selected work

Recent engagements building AI agents, evaluation systems, and data pipelines for regulated, data-sensitive organizations.

Client and proprietary product names omitted · technologies and architecture described in full · most delivered solo or in small teams, end to end.

AI agents in production

01 Federal research program

Clinical site-discovery agent

A natural-language agent for clinical-trial site feasibility over hundreds of thousands of documents — tool-routing across hybrid (vector + keyword) search, a guarded read-only text-to-SQL pipeline, and peer-adjusted feasibility scoring. OCR fallback lifts effective corpus coverage to ~99.9%; a JWT/JWKS-hardened bot ingress passed security review with zero high/critical findings.

Azure AI Foundry, AI Search, Vision OCR, SQL · SELECT-only, Teams Bot, Terraform
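
As an illustration of what "guarded, read-only text-to-SQL" can mean in practice, here is a minimal sketch of a SELECT-only check applied to model-generated SQL before execution. It is not the production pipeline described above: the function name `is_select_only` and the deny-list are hypothetical, and a real deployment would normally pair a check like this with a read-only database role.

```python
import re

# Hypothetical deny-list of statement keywords (an assumption for this sketch).
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|merge|drop|alter|create|truncate|grant|revoke|exec|execute)\b",
    re.IGNORECASE,
)


def is_select_only(sql: str) -> bool:
    """Return True only for a single statement that starts with SELECT or WITH."""
    stripped = sql.strip().rstrip(";").strip()
    if not stripped or ";" in stripped:
        return False  # empty input, or more than one statement
    if "--" in stripped or "/*" in stripped:
        return False  # comments are rejected outright rather than parsed
    if not re.match(r"(select|with)\b", stripped, re.IGNORECASE):
        return False  # must begin with SELECT or a WITH clause
    return FORBIDDEN.search(stripped) is None


print(is_select_only("SELECT site_id FROM sites WHERE country = 'US'"))
print(is_select_only("SELECT 1; DELETE FROM sites"))
```

The check is deliberately conservative: a legitimate query whose string literal happens to contain a word like "update" is rejected too, which is usually the right trade for an unattended agent.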

02 Global information-services firm

Enterprise knowledge agents

Two production agents deployed inside a zero-trust corporate network — a project-discovery agent mining tens of thousands of engineering work items, and a product-knowledge agent over hundreds of mixed-format documents — surfaced through a Copilot Studio chatbot, with managed-identity auth throughout and a search-quality evaluation harness.

Azure Functions, Azure OpenAI, AI Search, Copilot Studio, Managed Identity, Bicep
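
One common building block of a search-quality evaluation harness is recall@k over a labeled query set. The sketch below is a generic illustration, not the harness used on this engagement; the function name and document IDs are made up for the example.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    top_k = set(retrieved[:k])  # de-duplicate so repeated hits are not double-counted
    return len(top_k & relevant) / len(relevant)


# Hypothetical ranked results for one query and its labeled relevant documents.
ranked = ["d3", "d7", "d1", "d9"]
labels = {"d1", "d2"}
print(recall_at_k(ranked, labels, k=3))
```

Averaging this number over a fixed query set gives a single figure that can be tracked from one index or prompt change to the next.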

03 Healthcare AI company

Clinical evidence platform

A multi-service platform letting clinicians query real-world data and literature with LLM synthesis — React/TypeScript front ends, a SMART-on-FHIR OAuth flow into EHR systems, a graph-based Python search service, and an MCP tool-server exposing platform capabilities to third-party LLM clients with PHI-safe handling.

React / TS, FastAPI, LangGraph, SMART-on-FHIR, PostgreSQL, Terraform / AWS

04 Biotech

Evidence-discovery proof of concept

An end-to-end conversational screening prototype over public trial registries and biomedical literature — tiered cross-source matching, semantic retrieval with a three-state eligibility model (match / no match / not specified), composite-score ranking, and a conversational UI with a live audit diagram and full provenance.

FastAPI, Gemini embeddings, NumPy, SQLite / pgvector, Cloud Run
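
The three-state eligibility model can be sketched as follows. The point is that missing information stays distinct from a failed criterion. The enum, the function name, and the age criterion are illustrative assumptions, not the prototype's actual code.

```python
from enum import Enum
from typing import Optional


class Eligibility(Enum):
    MATCH = "match"
    NO_MATCH = "no match"
    NOT_SPECIFIED = "not specified"


def check_min_age(patient_age: Optional[int], trial_min_age: Optional[int]) -> Eligibility:
    """Compare a patient's age with a trial's minimum age without guessing at gaps."""
    if patient_age is None or trial_min_age is None:
        # Either side is silent: report that, rather than treating it as a failure.
        return Eligibility.NOT_SPECIFIED
    if patient_age >= trial_min_age:
        return Eligibility.MATCH
    return Eligibility.NO_MATCH


print(check_min_age(54, 18).value)
print(check_min_age(54, None).value)
```

Keeping "not specified" as its own state lets a ranking step weigh unknowns differently from hard exclusions instead of silently dropping candidates.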

Evaluation & safety

Deterministic eval harness

Runs an AI coding agent in isolated git worktrees and gates output against objective checks — correctness, latency budgets, extraction F1, schema and migration safety — before any human approval.
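
A rough sketch of what an objective gate can look like, assuming each agent run has already been reduced to a few measurements. The field names and the default thresholds (500 ms, F1 of 0.9) are hypothetical, and the git-worktree isolation is out of scope here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    tests_passed: bool
    p95_latency_ms: float
    true_pos: int
    false_pos: int
    false_neg: int


def extraction_f1(r: RunResult) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there is nothing to score."""
    denom = 2 * r.true_pos + r.false_pos + r.false_neg
    return 2 * r.true_pos / denom if denom else 0.0


def gate(r: RunResult, max_latency_ms: float = 500.0, min_f1: float = 0.9) -> list[str]:
    """Return the names of failed checks; an empty list means the run may go to human review."""
    failures = []
    if not r.tests_passed:
        failures.append("correctness")
    if r.p95_latency_ms > max_latency_ms:
        failures.append("latency")
    if extraction_f1(r) < min_f1:
        failures.append("extraction_f1")
    return failures


print(gate(RunResult(True, 320.0, 90, 5, 5)))
print(gate(RunResult(True, 800.0, 50, 25, 25)))
```

Because the gate is a pure function of recorded measurements, the same run always produces the same verdict, which is what makes the harness deterministic.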

Type-safe agent loops

Static typing and boundary validation that catch AI-generated errors at the edge, with a strict split between the production runtime and the offline eval.
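
For illustration, boundary validation of model output might look like the sketch below: raw text from the model is parsed into a typed value or rejected before it reaches the rest of the loop. The `ToolCall` shape and the tool names are assumptions made for the example.

```python
import json
from dataclasses import dataclass

ALLOWED_TOOLS = {"search", "sql"}  # hypothetical tool names


@dataclass(frozen=True)
class ToolCall:
    tool: str
    query: str


def parse_tool_call(raw: str) -> ToolCall:
    """Validate raw model output at the edge; raise ValueError on anything malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    tool, query = data.get("tool"), data.get("query")
    if not isinstance(tool, str) or tool not in ALLOWED_TOOLS:
        raise ValueError(f"unknown tool: {tool!r}")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query must be a non-empty string")
    return ToolCall(tool=tool, query=query)


print(parse_tool_call('{"tool": "search", "query": "sites in Ohio"}'))
```

Everything downstream of `parse_tool_call` can then rely on the static type instead of re-checking the model's output.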

Auth / SDK spike

A gated build-vs-adopt evaluation of a vendor SDK against a battery of auth and channel tests that shaped a downstream production decision.

Data & analytics

Revenue-forecasting pipeline

A weekly pipeline unifying fragmented operational systems into one source of truth — LLM classification against domain rules, probability-weighted expected value, and prioritized, owner-assigned action queues.
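
Probability-weighted expected value is simple arithmetic: each opportunity's amount is multiplied by its estimated probability of closing, and the products are summed. A minimal sketch with made-up figures (the function name and numbers are illustrative only):

```python
def expected_value(opportunities: list[tuple[float, float]]) -> float:
    """Sum of amount * probability over (amount, probability) pairs."""
    total = 0.0
    for amount, probability in opportunities:
        if not 0.0 <= probability <= 1.0:
            raise ValueError(f"probability out of range: {probability}")
        total += amount * probability
    return total


# Two hypothetical opportunities: $100K at 50% and $40K at 25%.
pipeline = [(100_000, 0.5), (40_000, 0.25)]
print(expected_value(pipeline))  # 60000.0
```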

Financial modeling & automation

Cash-flow reconciliation across accounts with trend-based runway forecasting, plus a set of operational automation and cost-ledger tools with strict data-separation rules.

Compliance

SOC 2 Type II program

Ran a SOC 2 Type II engagement end to end — gap analysis against the trust-services criteria, authoring and rationalizing the control-policy set, organizing audit evidence, and driving audit readiness with a quantified risk-budget model and adversarial review cycles.

How I work

01

Security from the start

Least-privilege access and explicit trust boundaries are designed in, not bolted on.

02

AI made evaluable

Deterministic gates and honest measurement — output is judged, never trusted blindly.

03

Sensitive data, separated

Protected identifiers stay isolated from anything shared or reported on.

04

Shipped, with a clean handoff

Tests, infrastructure-as-code, and documentation come with the delivery.

Writing & thinking

Featured essay

Vibe DevOps: the boring parts that make it work.

The unglamorous habits — clear boundaries, real evals, stable interfaces — that keep AI-assisted building from collapsing into throwaway spaghetti. Eight rules I actually use to ship code I can stand behind.

Read on LinkedIn →

Article · LinkedIn ↗

The one AI metric that matters: H₅₀

The longest task an agent completes reliably — and why the curve is steeper than your roadmap assumes.

Article · LinkedIn ↗

PhD founders: your degree is working against you

Why finishing the doctorate can lower, not raise, the odds of a startup working.

Experience

2025 — now

Independent AI/ML Consultant

Dodecaplex LLC

Governed Teams agents on Azure; evals, guardrails, and SOC 2 evidence packs; Fabric/FHIR pipelines.

2024 — 2025

Sr. Director, AI/ML Operations

Fortrea · Global CRO

Established MLOps and AI governance across a 19,000-person clinical research organization.

2018 — 2024

SVP / VP, US Operations

Lucence · Genomics diagnostics

Built the US lab from the ground up and scaled US IT.

2015 — 2018

Director, Analytics & Outcomes

Global medical-technology company

Real-time inference from 200+ cancer centers; CI/CD and model deployment in an FDA-regulated environment.

2011 — 2015

Co-Founder & President

CliniCast · acquired 2015

Cloud platform and discharge-forecasting models from inception — 85% accuracy in clinical use.

Education

Yale University

PhD, Physics

University of Kentucky

BS, Physics & Mathematics

Certifications

Azure Data Scientist Associate (DP-100)
Azure AI Engineer (AI-900)
Azure Fundamentals (AZ-900)
AWS Solutions Architect — Professional

Full résumé on LinkedIn →

Get in touch

Governance you can actually ship.

The best first step is a short scoping call. Bring a workflow you'd like to make governed and audit-ready, and we'll talk through what's involved.

Book a 20-min call →

Email · LinkedIn /in/jackchallis-ai ↗ · GitHub github.com/jackChallis ↗ · Company dodecaplex.co ↗

Essay

Cheap or expensive is the wrong question

There are two arguments running side by side on the feed right now, and the people having them don't realize they're having the same one.

The first is about the Chinese models. DeepSeek will run you about $0.14 per million tokens of input and $0.28 on the output — somewhere between 35 and 100 times cheaper than a frontier Western model — and at that price the conclusion writes itself: it's over, the bottom has fallen out, intelligence is basically free. The other end I can speak to personally, because I recently paid $160 for an hour of a frontier model on a genuinely hard problem. That's the fully loaded rate of a $330K-a-year engineer. You can call it absurd, and depending on the day I'll agree — except some days I mean absurdly expensive and some days absurdly cheap for what it actually did, and that whiplash is the whole story.

Both reactions are correct. $0.14 a million tokens and $160 an hour aren't competing claims about whether AI is cheap or expensive; they're two points on the same curve. The only honest answer to "is this cheap or expensive" is another question: for what you're trying to do? When I paid for that hour I wasn't overpaying and I wasn't getting robbed — I had a problem that lived at that end of the frontier. On a different problem the same week I'd have reached for something 100x cheaper and been just as right.
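
The two price points can be put on a common footing with a little arithmetic. This sketch assumes a 2,080-hour working year (52 weeks at 40 hours), which is my assumption rather than a figure from the argument, and it uses only the quoted input-token price.

```python
HOURS_PER_YEAR = 52 * 40  # assumption: 2,080 working hours in a year


def hourly_equivalent(annual_cost: float) -> float:
    """Annual cost spread evenly over the assumed working hours."""
    return annual_cost / HOURS_PER_YEAR


def millions_of_tokens(budget: float, price_per_million: float) -> float:
    """How many millions of tokens a budget buys at a given per-million price."""
    return budget / price_per_million


print(round(hourly_equivalent(330_000), 2))      # 158.65
print(round(millions_of_tokens(160.0, 0.14)))    # 1143
```

Under those assumptions, $330K a year works out to roughly $159 an hour, and the same $160 would buy on the order of a billion input tokens at the cheap end of the curve.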

The confusion comes from a mental model we haven't let go of. People still think in terms of the prompt — you write it, you hit go, and either it works or the magic dies and you decide the whole thing was oversold. But the prompt was never the unit of work. The loop is. An agent, the way I'd define it now, is something with a goal, a loop, and an eval, and it grinds against that eval until the goal is met. Seen that way, the disappointment looks less like broken technology and more like a loop nobody bothered to give an eval or a second attempt.

Once work is loops instead of prompts, the loops vary in quality, and that's where it becomes a market. For a fee you'll rent whatever a loop needs — expertise dropped in, compute by the hour, a live link to the real world, a skill on demand — chosen on a blend of price and quality tuned to the job. Which is just the cost-quality frontier showing up one level down.

That's the part we're underrating. At any moment there's a frontier of cost and quality, and different problems sit naturally at different points on it. Cleaning a spreadsheet doesn't belong where designing a bridge belongs. Today we have a handful of crude dots; soon we'll have hundreds, and the frontier itself marches outward the way it did for the semiconductor — the same dollar buys more next year than this year.

This is good news. Stop asking whether a model is cheap or expensive. Nothing is, in the abstract. It's positioned — and the only question that matters is where on the frontier your problem belongs.

What I wish I knew before a Physics PhD

My physics PhD took up 22% of my life, and I hope I learned a lot during it. If I had to pick one thing that would have materially changed how I approached it, it is this: a physics PhD is judged by the volume and quality of the publications you produce. Classes, volunteer work, and extracurriculars are essentially irrelevant to the task at hand. If you're serious about physics, optimize (volume of publications) × (quality per publication). Everything you do should boost one or the other.

On the economics of the decision

Theoretical physics PhDs bear enormous opportunity costs. The time isn't free — you could be doing another kind of research, learning a trade, or being paid handsomely to move numbers around a screen. They carry many of the intellectual challenges of medicine without the impact of relieving a child's suffering, and many of the challenges of engineering without owning valuable IP. Time spent on this instead of anything else needs to be accounted for.

Motivations change. When you start, your focus is on expanding horizons — curiosity, new experiences, growth. When you finish, much of your focus is on becoming rooted — family, career, a personal brand. There are real tidal forces between the ideal program when you begin and the ideal one when you finish.

High supply and low demand for theoretical physics PhDs means a low price of labor. Being smart doesn't mean the market values your contribution. And building an alternative career requires skills a PhD doesn't develop — the modern economy values teamwork, quick decisions, and clear communication, none of which thrive in a monastic working style.

Finishing a PhD can be a red flag for startup entrepreneurship. Your 20s are perfect for startup dynamics; finishing first tends to lower, not raise, your odds. The most successful startups based on PhD research usually had the students drop out before completion.

On physics itself

Physics is an experimental science and the best work is done in experiment, not theory. A mediocre experimentalist contributes vastly more than an outstanding theorist. And formal methods are for chumps — you won't beat a 60-year-old Russian master at asymptotic series, but you can beat him with intelligent numerics sampling billions of examples. If you aren't using computers early, you're doing it wrong.

On being a good PhD

Treat your research like a job: show up on time, write clear presentations, read the relevant work, find the people who can get you over the sticky bits, and care enough to do it well. Bad employee behavior translates directly into bad research. Alignment with your advisor is built over a long time — if you want early graduation or first authorship, build the case for why it's in their interest too. Generalists don't publish much; specialize quickly or run out the clock with something shallow.

What I'd do again

Productive goofing off paid enormous dividends. My unstructured time went to Chinese, economics, and programming: three things I use every day, and more profitable than my formal education during the PhD. Going to a name-brand school helped even more once I left physics; alumni networks can make or break a first job search. And I ran a deliberate two-year job search to figure out what I actually wanted to do.