PRD - Foundry: Managed Private Local Inference for Apple Silicon
1. One-line pitch
For startup and mid-market CTOs who need to cut AI API spend or keep sensitive data off cloud providers, Foundry delivers a managed Apple Silicon local-inference platform that makes private LLM serving reliable, observable, and commercially usable.
2. Problem
CTOs and engineering leads buying Mac Studios for local AI hit the same wall:
- model choice is confusing and changes weekly
- memory pressure, OOM crashes, and endpoint drift make local inference unreliable
- observability is weak, so CTOs cannot trust the stack for real business workflows
- compliance-sensitive teams cannot use OpenAI/Gemini cloud APIs, but DIY local serving is brittle
- every hour spent babysitting local models erodes the cost-saving case
The job-to-be-done is not "run a model locally." It is:
"Give me a local inference stack that my team can depend on, justify to finance, and defend to compliance."
3. Target customers
Primary (V1)
1. Startup CTO / technical founder
- already spending £500-£3k+/month on OpenAI/Anthropic/Gemini
- willing to buy Mac Studio hardware to reduce recurring spend for suitable workloads
- wants fast setup, clear model recommendations, and low babysitting overhead
- values developer-friendly endpoints, benchmark evidence, and a credible migration path from cloud to local
2. Mid-market CTO / head of engineering
- 50-500 person company
- meaningful AI usage but limited appetite for GPU infrastructure
- wants a vendor-like solution, not a GitHub science project
- needs visibility, support boundaries, auditability, and rollback discipline
3. Compliance-constrained organisations
- healthcare, finance, legal, government, defence-adjacent, internal R&D
- cannot or should not send sensitive prompts/data to cloud AI providers
- values on-prem reliability, auditability, and support over raw lowest price
Secondary (V2+)
4. Service businesses / document-heavy operators (trades, field service, logistics)
- high volumes of repetitive admin: job intake, quotes, invoices, completion packs
- value privacy and local control for sensitive customer/job data
- want admin turnaround improvement, not flashy AI demos
- need human review before anything reaches customers or systems of record
- This is a real market, but the sales motion, channel, and product shape are different enough that it belongs in V2 after the core offer is proven
4. Product thesis
Foundry is a complete private AI infrastructure product - not just local inference.
A CTO doesn't need "a reliable local endpoint." They need a working system that handles their workloads, connects to their tools, and runs without babysitting. A local model endpoint without orchestration is a car engine without a steering wheel.
The product has three layers. The harness layer is a choice between two options, not a requirement to run both:
1. Foundry (inference layer) - which model, which quant, which runtime, what fits safely; health checks, capacity guardrails, restart discipline, drift detection
2. OpenClaw OR Hermes (harness layer) - agent orchestration or messaging integration, not both by default:
- OpenClaw for agent orchestration: sessions, memory, cron, specialist routing, workflow automation
- Hermes for messaging integration: connecting models to Slack, email, databases, APIs
- Both together is an advanced setup, not the default
3. llm_stats (observability layer) - live status, memory pressure, loaded models, activity, crash risk, benchmark evidence
Without a harness, the CTO has a working model but no way to route work to it. Without observability, they can't trust it's working. The stack is the product, but the harness is chosen based on the buyer's needs.
For V1, the CTO doesn't need to understand agentic architecture. They describe their workloads and we configure the appropriate harness for them. Pre-built workflow templates (support routing, code review, document search) give them working examples from day one.
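To make "they describe their workloads and we configure the appropriate harness" concrete, the onboarding output could be as small as a per-customer stack descriptor. A minimal sketch, assuming a Python-based config layer; every field name and value here is hypothetical, not a committed schema:

```python
from dataclasses import dataclass, field

# Hypothetical per-customer stack descriptor; field names are illustrative,
# not a committed Foundry schema.
@dataclass
class StackConfig:
    runtime: str                 # e.g. "omlx" or "ollama"
    harness: str                 # "openclaw" (orchestration) or "hermes" (messaging)
    workflows: list[str] = field(default_factory=list)  # templates to enable

# A CTO who says "route support questions from Slack" might resolve to:
config = StackConfig(
    runtime="omlx",
    harness="hermes",            # messaging integration, not agent orchestration
    workflows=["support-routing"],
)
```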
Current building blocks:
- Project Foundry → inference layer
- OpenClaw → orchestration harness
- Hermes → integration harness
- llm_stats → observability layer
- Tes benchmark + capacity work → evidence and model-fit credibility
5. Goals (MVP)
Business goals
- Prove that CTOs will pay for local inference reliability, not just raw benchmarks
- Close first 3 paying design-partner customers (CTO/local-inference buyers)
- Demonstrate one of two primary value stories:
- cost reduction vs cloud API spend
- compliance / data-sovereignty viability without cloud AI
Product goals
- Give customers one supported path to run and monitor local inference on Apple Silicon
- Make runtime state legible enough that a CTO can trust it for internal workflows
- Reduce "DIY local AI ops" time-to-value from weeks to one working day
6. Non-goals (MVP)
- Competing with hyperscaler inference providers on raw throughput
- Training or fine-tuning foundation models
- Windows/Linux fleet support
- General-purpose multi-tenant SaaS for every hardware setup
- Replacing every runtime; MVP wraps existing runtimes rather than inventing a new one
- Service-business workflow automation (V2)
- Headcount reduction promises
- Unsupervised automation of customer-facing, financial, or operational actions
7. Customer promises
Cost story
"Reduce recurring API spend by moving suitable workloads to Apple Silicon local inference."
Reliability story
"Know what is loaded, what is healthy, what fits, and what is about to fall over."
Compliance story
"Keep sensitive inference local, auditable, and under your control."
8. MVP definition
Offer A - Foundry Advisory (£299 one-time)
- Personalised hardware + model-fit report
- Recommended stack and runtime selection
- Workload suitability assessment
- Setup scripts / deployment checklist
Offer B - Foundry Managed Setup (£999 setup + £99/mo)
- Install and configure the full stack on customer Mac Studio: Foundry + (OpenClaw or Hermes) + llm_stats
- Harness choice based on customer needs: OpenClaw for agent orchestration, Hermes for messaging integration, or both for advanced setups
- Pre-configured workflow templates based on customer's stated workloads (support routing, code review, document search, or custom)
- Health checks, capacity profiles, benchmark baseline, observability dashboard
- Basic runbook and support period
- Customer interacts through existing tools (Slack, email, API) - the infrastructure is invisible
Offer C - Foundry On-Prem (£2k-£5k/mo)
- For compliance-sensitive teams
- Single-node on-prem deployment with audit logging and support
- Explicit supported use cases only
- No-cloud mode, support contract, operational runbook
MVP software scope
- Supported runtimes: omlx + Ollama (+ LM Studio visibility where practical)
- Full stack: Foundry + (OpenClaw or Hermes) + llm_stats
- Harness choice: OpenClaw for orchestration, Hermes for messaging integration, or both for advanced setups
- Endpoint health probing
- Capacity guard with named profiles (operational / benchmark / all) - see the sketch after this list
- Benchmark evidence store
- Storage hygiene + duplicate awareness
- Observability dashboard (menu bar + lightweight web)
- Simple recommendation engine for "what should I run for this workload?"
- Pre-configured workflow templates: support routing, code review, internal document search
- Runbook + installer
- Customer onboarding: we configure the full stack based on their stated workloads
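To make the capacity guard concrete: a minimal sketch of named memory profiles and a safe-fit check, assuming macOS `sysctl` for unified memory size. The threshold values are illustrative, not tuned recommendations:

```python
import subprocess

# Illustrative headroom thresholds per profile; not tuned recommendations.
PROFILES = {
    "operational": 0.70,  # conservative: keep 30% of unified memory free
    "benchmark":   0.85,  # more pressure allowed during controlled benchmark runs
    "all":         0.95,  # near the wire; supervised experiments only
}

def total_memory_bytes() -> int:
    """Total unified memory on macOS, via sysctl hw.memsize."""
    return int(subprocess.check_output(["sysctl", "-n", "hw.memsize"]).strip())

def fits(model_bytes: int, in_use_bytes: int, profile: str) -> bool:
    """Safe-fit check: would loading this model stay inside the profile's budget?"""
    budget = total_memory_bytes() * PROFILES[profile]
    return in_use_bytes + model_bytes <= budget
```

A real guard would also need to account for KV-cache growth with context length, which is why the check takes current usage as an input rather than assuming a static per-model cost.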
9. User stories
Startup CTO
- As a CTO, I want to know which models safely fit on my Mac Studio so I can stop guessing.
- As a CTO, I want one dashboard showing runtime health and memory pressure so I can trust local AI in team workflows.
- As a CTO, I want a migration path for suitable prompts from OpenAI to local inference so I can reduce monthly spend.
- As a CTO, I want my routine workloads (support queries, code reviews, document search) running automatically on local infrastructure without me building agent logic from scratch.
- As a CTO, I want to choose between agent orchestration (OpenClaw) and messaging integration (Hermes) based on my use case, not have to learn both.
- As a CTO, I want the AI to work through my existing tools (Slack, email) so my team doesn't have to learn a new interface.
Mid-market / compliance buyer
- As an engineering lead, I want audit-friendly logs and explicit deployment boundaries so I can justify local AI to risk/compliance.
- As an ops owner, I want a supported on-prem deployment with a runbook so this does not depend on one internal tinkerer.
10. Core workflow (golden path)
1. Customer describes hardware, workloads, sensitivity, and current API spend.
2. Foundry assesses fit and recommends runtime/model profiles.
3. Foundry deploys or guides setup on Apple Silicon hardware.
4. Customer sees live status via llm_stats/dashboard.
5. Team routes selected internal workloads to the local endpoint (client sketch after this list).
6. Foundry tracks health, capacity, storage, and benchmark evidence.
7. Customer expands local usage only where economics and reliability hold.
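Step 5 is deliberately low-friction. If the team already uses an OpenAI-compatible client, routing a workload locally can be a one-line base-URL change. The sketch below assumes Ollama's OpenAI-compatible endpoint and an illustrative model name:

```python
from openai import OpenAI

# Existing OpenAI-client code, repointed at the local runtime. Ollama serves an
# OpenAI-compatible API; the key is required by the client but ignored locally.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="local")

resp = client.chat.completions.create(
    model="llama3.1:8b",  # illustrative; use whatever the fit analysis recommends
    messages=[{"role": "user", "content": "Summarise this support ticket: ..."}],
)
print(resp.choices[0].message.content)
```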
11. Technical approach
Runtime strategy
- Wrap existing runtimes rather than forking them.
- Primary serving substrate: omlx where possible.
- Support visibility for Ollama and LM Studio because buyers already use them.
- Foundry owns orchestration, fit analysis, health visibility, and policy.
MVP technical components
1. Model inventory - local model discovery, duplicate/storage accounting
2. Capacity guard - named memory profiles, safe-fit checks, drift notes
3. Health probe - endpoint checks, latency, loaded model state (sketched after this list)
4. Benchmark store - structured benchmark history and comparisons
5. Recommendation engine - runtime/model suggestions by workload class
6. Observability surface - menu bar and/or lightweight dashboard
7. Runbook + installer - deployment scripts, config validation, support playbook
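For the health probe, a minimal sketch against an Ollama endpoint. `/api/ps` (loaded models) exists in recent Ollama versions; omlx or LM Studio would need their own adapters:

```python
import time
import requests

def probe(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> dict:
    """Check endpoint liveness, measure latency, and list loaded models."""
    start = time.monotonic()
    try:
        resp = requests.get(f"{base_url}/api/ps", timeout=timeout)
        resp.raise_for_status()
        latency_ms = round((time.monotonic() - start) * 1000, 1)
        loaded = [m["name"] for m in resp.json().get("models", [])]
        return {"healthy": True, "latency_ms": latency_ms, "loaded": loaded}
    except requests.RequestException as exc:
        return {"healthy": False, "error": str(exc)}
```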
Compliance/enterprise technical requirements
- local-only network mode option
- audit log of runtime changes and incidents (see the sketch after this list)
- explicit no-cloud mode for sensitive environments
- backup/restore story for configs and benchmark state
- safe update path with rollback guidance
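A sketch of the audit-log requirement: append-only JSON lines recording runtime changes and incidents. The log path and event vocabulary are assumptions, not a defined format:

```python
import json
import time
from pathlib import Path

# Hypothetical log location; a real deployment would pick a path the service
# user can write to, and rotate/back it up alongside configs.
AUDIT_LOG = Path("/var/log/foundry/audit.jsonl")

def audit(event: str, detail: dict) -> None:
    """Append one audit entry; JSON lines keep the log greppable and diffable."""
    AUDIT_LOG.parent.mkdir(parents=True, exist_ok=True)
    entry = {"ts": time.time(), "event": event, "detail": detail}
    with AUDIT_LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")

audit("model_loaded", {"model": "llama3.1:8b", "runtime": "ollama",
                       "profile": "operational"})
```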
12. Logistics and operations
Delivery models
1. Advisory only - document, scripts, recommendation pack
2. Remote managed setup - customer-owned Mac Studio, remote configuration/support
3. On-prem managed pilot - customer-owned or customer-sited hardware, restricted support contract
Operational realities to solve
- hardware procurement recommendations (which Mac Studio config is enough?)
- remote access/support process for managed customers
- SSD/storage policy for model libraries
- update cadence for model/runtime changes
- incident response when models OOM, endpoints drift, or disks fill
- boundaries: what workloads should stay cloud even if local is available?
Commercial/logistics questions
- do we require customer-owned hardware for MVP?
- do we support one standard hardware profile first (e.g. M3 Ultra 512GB) before smaller configs?
- what is the support window and SLA for design partners?
- how do we package onboarding so it feels like a product, not consultancy chaos?
13. Success metrics
Design-partner phase
- 10 qualified conversations with target buyers
- 3 paid pilots or advisory engagements
- at least 1 customer using local inference for a real recurring workflow
Product metrics
- time from kickoff to working local endpoint < 1 day for supported hardware
- customer can identify runtime health state in < 30 seconds
- measurable reduction in cloud AI usage for suitable workloads
- zero critical incidents caused by unsupported auto-actions in MVP
14. Risks and objections
1. Market confusion - buyers may want "cheap AI" when the real sell is "reliable local AI ops"
2. Hardware narrowness - M3 Ultra 512GB is powerful but niche; smaller configs need separate guidance
3. Support burden - bespoke environments can turn MVP into consulting soup
4. Reliability gap - if local inference is still flaky, the cost story collapses
5. Procurement friction - some buyers will need budget approval before hardware or pilot spend
6. Compliance sales cycle - lucrative, but slower than startup founder sales
7. Naming collision - Microsoft has a product called "Foundry Local" in the same category
15. Open questions
- Is the first sale better packaged as advisory, managed setup, or on-prem pilot?
- How much of llm_stats should remain free vs become part of paid Foundry?
- Should MVP support only one blessed hardware profile to keep ops sane?
- What workloads are strong local wins vs clear stay-in-cloud cases?
16. V2 path
Service businesses and document-heavy operators (trades, field service, logistics, professional services) represent a real secondary market. They have repetitive admin pain, sensitive data, and no good local AI option. But:
- the sales motion is different (channel/partnership vs direct)
- the product shape is different (workflow orchestration, not just inference ops)
- the evidence for headcount reduction is thin
- incumbents like Housecall Pro and ServiceTitan already own the workflow/budget
V2 plan: After proving the core CTO/local-inference offer with 3+ design partners, run one bounded service-business pilot. Prove measurable admin turnaround reduction with human review. Then decide whether to productise vertical workflow packs.