Project Houston Weekly Update
Houston is the brain that coordinates our jobsites. It’s named for being the “mission control” of a jobsite. Each jobsite can carry thousands of tasks spread across hundreds of vendors, generating hundreds of invoices.
Software that tracks this kind of work is known as “Enterprise Resource Planning” (ERP): business software that organizations use to integrate, automate, and manage their core operations.
Five years ago, ERPs were for huge companies and took armies of engineers to install: the average mid-market ERP cost $4.5M and took 14 months to install, and 77% of companies brought in outside consultants during setup (Panorama Consulting).
LLMs make it much easier to build an ERP today, but several challenges remain. The simplest ERP is just an LLM running over business files, chats, and emails. Unfortunately, an LLM overwhelmed by that much data starts to hallucinate and take wrong actions, because many irrelevant files end up in its context window.
To solve this problem, Houston is a “semi-deterministic” ERP. As shown in the figure, we define a strict schema for how a request gets handled: the LLM interprets the request but does not take the action itself; instead, it routes the request through a deterministic sequence of methods, each with a defined scope of what it is allowed to touch.
What got built, last week to this week
Last week we set up the APIs and the early file schema to establish a “source of truth”. The source of truth establishes which documents the company treats as authoritative, so we don’t get duplicates floating around. Then we set Houston loose with a raw LLM API just to see how it would fare. Although we were happy the API integrations were working, Houston itself did not perform well:
Before this week
project_id, property — which?
Here, the LLM could not get the context it needed from the prompt: it didn’t know that “2725” referred to 2725 Midvale Avenue, which had come up in a separate meeting between Achuta and Sasha. We fixed that this week with a few commits:
- a routing hierarchy where a powerful model (Claude Opus) does the triage;
- grounding, so it can look up that “2725” is an address already in Cambrian’s cloud;
- a set of deterministic methods with proper scoping.
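To make the grounding step concrete, here is a minimal sketch of how a bare number like “2725” could be resolved against known properties before the model sees the request. Everything in it is an assumption for illustration, not Houston’s actual code: the `Property` record, the `REGISTRY` contents, the project IDs, the second address, and the `ground` and `build_prompt` helpers are all hypothetical.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Property:
    project_id: str  # hypothetical ID format
    address: str


# Hypothetical stand-in for the properties already stored in the cloud.
REGISTRY = [
    Property("proj-001", "2725 Midvale Avenue"),
    Property("proj-002", "410 Example Street"),  # made-up entry
]


def ground(message: str, registry: list[Property] = REGISTRY) -> list[Property]:
    """Return registry entries whose street number appears in the message."""
    numbers = set(re.findall(r"\b\d{3,5}\b", message))
    return [p for p in registry if p.address.split()[0] in numbers]


def build_prompt(message: str) -> str:
    """Prepend whatever the registry knows, so the model does not have to guess."""
    matches = ground(message)
    if len(matches) == 1:
        hit = matches[0]
        context = f"Known property: {hit.address} (project_id={hit.project_id})"
    elif matches:
        options = "; ".join(p.address for p in matches)
        context = f"Ambiguous property, ask the user to pick one: {options}"
    else:
        context = "No known property matched; ask the user which one they mean."
    return f"{context}\n\nUser: {message}"


if __name__ == "__main__":
    print(build_prompt("add a comp for 2725"))
```

The point of the sketch is that the lookup is a plain, deterministic function: the model is handed the resolved address instead of being asked to infer it.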
After applying these commits, Houston returned the following:
After this week
And if the request is phrased more like a command, Houston can take the action with scoped permissions:
After this week
What do you mean by “semi-deterministic”?
It means we use the LLM to route a request to a deterministic function that does one defined thing, such as creating, reading, updating, or deleting a file (CRUD).
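A rough sketch of that split, under stated assumptions: the LLM is assumed to emit a small structured intent (an action name plus arguments), and a plain dispatcher maps it to one CRUD function. The intent shape, the in-memory `FILES` store, and all function names are hypothetical and stand in for Houston’s real methods.

```python
from __future__ import annotations

from typing import Callable

# In-memory stand-in for the document store (an assumption for this sketch).
FILES: dict[str, str] = {}


def create_file(name: str, body: str) -> str:
    if name in FILES:
        raise ValueError(f"{name} already exists")
    FILES[name] = body
    return f"created {name}"


def read_file(name: str) -> str:
    return FILES[name]


def update_file(name: str, body: str) -> str:
    if name not in FILES:
        raise KeyError(name)
    FILES[name] = body
    return f"updated {name}"


def delete_file(name: str) -> str:
    del FILES[name]
    return f"deleted {name}"


# The only actions the dispatcher will ever run.
HANDLERS: dict[str, Callable[..., str]] = {
    "create": create_file,
    "read": read_file,
    "update": update_file,
    "delete": delete_file,
}


def dispatch(intent: dict) -> str:
    """Run the handler named by the intent; anything unrecognized is a no-op."""
    handler = HANDLERS.get(intent.get("action"))
    if handler is None:
        return "no action"
    return handler(**intent.get("args", {}))


if __name__ == "__main__":
    print(dispatch({"action": "create", "args": {"name": "comps.md", "body": "2725 Midvale Avenue"}}))
    print(dispatch({"action": "chat", "args": {}}))
```

The model only chooses which entry in `HANDLERS` to call and with what arguments; what each entry does is fixed in code, which is where the “deterministic” half comes from.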
What eval did we run? (numbers checked against the eval report)
We use a separate agent to generate adversarial test cases with labeled ground truth, then run two evals.
- Routing: 100 different ways to phrase an “add a comp” request all routed to the right function, and 100 questions/remarks all routed to no action. Houston scored 100/100 on both.
- Grounding: on 28 questions that need our private data (a comp’s exact sold price; which property an address refers to), the ungrounded model got 25% of the private-data questions right (46% overall). Grounded, it hit 95% on the private-data questions (96% overall).
Is 96% good enough? Not quite, but it’s not an issue with the LLM routing. That last ~4% is actually due to an ID collision in the registry cache, which can be fixed.
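For readers who want to see the shape of the routing eval, here is a minimal sketch of scoring a router against labeled cases. It is not our harness: `toy_router` is a keyword stand-in for the real LLM triage, and the four cases are made up for illustration.

```python
from __future__ import annotations

from typing import Callable


def toy_router(message: str) -> str:
    """Keyword stand-in for the LLM router (an assumption for this sketch)."""
    text = message.lower()
    if "comp" in text and any(verb in text for verb in ("add", "log", "record")):
        return "add_comp"
    return "no_action"


# Labeled ground truth: (message, expected route). Made-up examples.
CASES = [
    ("Add a comp for 2725 Midvale", "add_comp"),
    ("please log this comp", "add_comp"),
    ("What is a comp?", "no_action"),
    ("Nice weather today", "no_action"),
]


def score(router: Callable[[str], str], cases: list[tuple[str, str]]):
    """Return (correct, total, failures), keeping each failure for inspection."""
    failures = []
    for message, expected in cases:
        got = router(message)
        if got != expected:
            failures.append((message, expected, got))
    return len(cases) - len(failures), len(cases), failures


if __name__ == "__main__":
    correct, total, failures = score(toy_router, CASES)
    print(f"{correct}/{total} routed correctly")
    for message, expected, got in failures:
        print(f"MISS: {message!r} expected {expected}, got {got}")
```

Keeping the failures, not just the count, is what lets a miss be traced to a cause such as a registry bug instead of being blamed on the model.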
Prompt injection testing
If a well-intentioned employee writes the wrong message to Houston, could it delete all of Cambrian’s files? Because the real work happens in deterministic functions, we can scope each one explicitly, so a function can only touch what it was built to touch. We ran 25 prompt-injection attempts: messages that try to make Houston leak secrets or fire bad writes. They produced 0 leaks and 0 malicious routes, which is a good start (though further testing is needed).
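One way to picture explicit scoping is a guard that every deterministic function calls before touching a file. The sketch below is an assumption about how such a guard could look, not Houston’s implementation: the `Scope` record, the `comps` folder, and the `check` helper are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import PurePosixPath


class ScopeError(PermissionError):
    """Raised when a function tries to step outside its declared scope."""


@dataclass(frozen=True)
class Scope:
    operations: frozenset[str]  # e.g. {"create", "read"}
    root: str                   # the only folder this function may touch


def check(scope: Scope, operation: str, path: str) -> None:
    """Raise ScopeError unless the operation and path are both in scope."""
    if operation not in scope.operations:
        raise ScopeError(f"operation {operation!r} is not allowed")
    target = PurePosixPath(path)
    root = PurePosixPath(scope.root)
    # Reject ".." segments and anything that is not under the scoped folder.
    if ".." in target.parts or root not in (target, *target.parents):
        raise ScopeError(f"path {path!r} is outside {scope.root!r}")


# Hypothetical scope for an "add a comp" function: it can create and read
# under comps/, and nothing else.
ADD_COMP_SCOPE = Scope(frozenset({"create", "read"}), "comps")

if __name__ == "__main__":
    check(ADD_COMP_SCOPE, "create", "comps/2725-midvale.md")  # passes silently
    try:
        check(ADD_COMP_SCOPE, "delete", "comps/2725-midvale.md")
    except ScopeError as err:
        print("blocked:", err)
```

With a guard like this, even a message that talks the router into the wrong function cannot delete anything unless that function’s scope includes deletion.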
Why not just buy Procore or Salesforce, or use Claude with tags?
We could buy an ERP, and there are some good options out there. However, Procore and Buildertrend price against our own growth (Procore quotes on annual construction volume, which rises as we build more). More importantly, their workflows are template-shaped, while custom homebuilding changes from house to house.
Salesforce is typically used outside the homebuilding space and would require custom solution engineering to adapt it for homebuilding.
How can I try Houston?
Join the Houston space in G-Suite and tag it with a question — like “@Houston tell me why Cambrian is a cool company.” It only acts when you actually ask it to.
What’s the goal for next week?
Beyond fixing bugs, three things:
- Add visual capabilities so Houston integrates better with Sentinel.
- Test and evaluate Houston against spreadsheet data, like construction costs.
- Nail down the “source of truth” schema.
Houston backlog
| Item | What it is | Status |
|---|---|---|
| Houston → cloud | Move the always-on loop off the laptop to a cloud host — no single point of failure | 💡 idea |
| Self-improving Houston | It proposes its own changes (scoped diff + tests + 👍), never self-merges | 💡 idea |
| Skills in Google Chat | A discoverable slash-skill catalog (/comps, /vendor, /ticket, /status…) for conversational ERP work | 💡 idea |
| Secrets & auth hardening | Credentials off local disk into a secret store; a service account, not a personal login | 💡 idea |
| Observability | One queryable log of every action, the reasoning, and the approval that gated it | 💡 idea |
| Continuous eval | A regression suite that runs on every change, not point-in-time spot checks | 💡 idea |
| Multi-project scale-out | “Add a project” as one command/chat; fan out cleanly across N projects | 💡 idea |
| Cost & usage tracking | Token + dollar usage per project per run, before it’s a surprise | 💡 idea |
| Retire nightly jobs | Move recordings/calendar to always-on workers; drop the nightly cron | 💡 idea |
| Vendor registry | Structured, queryable vendor records — harvested at ingest, identity-resolved, evidence-cited human scoring | 🔎 scoped |