Refactoring


How to Build AI into Tech Products 🔨

Lessons learned from 10+ AI-first companies

Luca Rossi
Oct 29, 2025

Hey! With all the AI craze happening right now, there are two questions tech teams usually ask themselves about it: 1) should we use more AI at work? and 2) should we integrate AI into our product?

We talk pretty often about the first one, while I feel the second is a bit underrepresented. It is a complex question, and it leads to an obvious follow-up: how?

“How” questions are tricky because they can always be broken down into multiple angles, and this one is no exception:

  • How do we build AI from a tech perspective? — It’s a cascade of buy vs build decisions.

  • How do we measure whether it’s doing well? — How do we make sure it’s actually improving?

  • How do we structure our team to make it happen? — Do we need to hire PhDs? What do AI engineers actually do?

To explore all of this, today I am bringing in Barr Yaron from Amplify, who has a unique vantage point on this.

Amplify is a peculiar VC fund: it focuses specifically on technical founders, backing the likes of Datadog, Temporal, and Runway, and it is very opinionated about what is happening right now in tech.

Barr also runs a podcast, where every week she talks to CEOs and CTOs building technical products in the AI space, so she is perfectly positioned to connect the dots and share some good practices!



Hey, Barr here!

Everyone is racing to build AI into their products. What makes some teams succeed while others fail?

Thanks to the podcast and to our work with founders, we get the chance to go deep into what it really takes to build lasting AI products, from technical aspects like evals and data stacks to business decisions like how to build and structure teams.

In this post I’m going to cover three themes I’ve seen from these AI leaders that have helped them successfully build AI products (and organizations) for technical audiences – and how you can do the same:

  1. 🔍 Getting evals right — by integrating domain expertise

  2. ⚖️ Building vs buying models — how to think through the dilemma

  3. 🏗️ Structuring AI teams — do you need researchers, engineers, or both?

Let’s dive in!



🔍 Getting evals right — by integrating domain expertise

Every CEO and CTO I’ve spoken to says their company has invested significantly in good evals, and that it’s very hard to get them right. As Olivier (Datadog’s CEO) put it, you fundamentally cannot improve what you can’t measure.

A few examples:

  • Replit — built a multi-step evaluation system that lets them jump into any point in a development trajectory to test how their agent handles prompts and code modifications.

  • Hex — discovered that public benchmarks were not representative of real world data workflows, so they developed their own evaluations tailored to how data scientists actually use data.

  • Decagon — built large-scale tests for voice agents that simulate customer conversations and check for compliance, friendliness, and task completion (see the sketch right after this list for what such a check might look like).
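
To make the last one concrete, here is a toy sketch of a simulated-conversation check. Everything in it is an assumption for illustration: the `call_llm` helper, the `Scenario` fields, and the scoring prompt are placeholders, not Decagon's actual system.

```python
# Toy sketch of a simulated-conversation eval: one model plays the customer,
# the agent under test replies, and a judge model scores the transcript.
# All names (call_llm, Scenario, agent_reply) are illustrative placeholders.
from dataclasses import dataclass
from typing import Callable

def call_llm(prompt: str) -> str:
    """Placeholder: swap in your model provider's client here."""
    raise NotImplementedError

@dataclass
class Scenario:
    persona: str      # e.g. "frustrated customer with a billing issue"
    goal: str         # what the agent should accomplish
    max_turns: int = 6

def run_simulation(scenario: Scenario,
                   agent_reply: Callable[[list, str], str]) -> list:
    """Drive a short conversation between a simulated customer and the agent."""
    transcript = []
    customer_msg = call_llm(f"You are {scenario.persona}. Open the conversation.")
    for _ in range(scenario.max_turns):
        agent_msg = agent_reply(transcript, customer_msg)
        transcript.append({"customer": customer_msg, "agent": agent_msg})
        customer_msg = call_llm(
            f"You are {scenario.persona}. The agent said: {agent_msg}. Reply briefly."
        )
    return transcript

def judge_conversation(transcript: list, scenario: Scenario) -> str:
    """Ask a judge model to score compliance, friendliness, and task completion."""
    return call_llm(
        "Score this support conversation from 1 to 5 on compliance, friendliness, "
        f"and whether the goal '{scenario.goal}' was completed:\n{transcript}"
    )
```

A real system would run hundreds of scenarios and aggregate the scores, but the shape is the same.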

The one theme that was obvious among companies that have nailed evals: they integrate domain expertise into every step of the AI development process, especially evals.

Vercel CEO Guillermo Rauch explained it clearly – when customers rely on you for expertise, scaling that expertise is highly impactful:

“Imagine if I could deploy my CTO who was at Google for 13 years and he could review the code of all our customers and give performance recommendations. How can I automate that process and make it scalable? If we can do that — build up large datasets, build up our evaluations, and constantly infuse them with what we call frontier data – we get models that continuously learn from the edge.”

Another great example is Vanta.

To construct their golden dataset — essentially the canonical representation of expected customer inputs to models and desired outputs — they put together a panel of domain experts in compliance. These folks actively work with Vanta’s product and engineering teams to build these golden datasets, and they also help with human evaluation of model responses in the wild.

When I say actively work with, I mean actively work with. These subject matter experts (SMEs) are deeply embedded in product and engineering workflows, essentially doing paired sessions with an engineer or PM. An SME might say “this model response doesn’t look right,” and the engineer might try tweaking the prompt or adding additional context.

The point here is that domain expertise needs to be deeply integrated into how your engineers build AI products. It can’t be an afterthought, or a panel that takes a look at things occasionally.

Digging further into the Vanta example: they evaluate at different phases of product development. During the build phase, their golden dataset — built through work with SMEs — represents how the model should respond to customer inputs. Vanta integrates LLM-as-judge into their CI/CD pipeline to automatically evaluate model outputs against this golden dataset. When an engineer pushes code that affects an AI feature, the system immediately signals whether that change improved or degraded the model’s performance. They then monitor online quality as well to make sure what they see in their evals matches what they see in production.
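
Here is a minimal sketch of what that kind of CI check could look like. The file name (`golden.jsonl`), the 95% threshold, and the helpers (`call_llm`, `generate_answer`) are assumptions for illustration, not Vanta's actual pipeline.

```python
# Minimal sketch of an LLM-as-judge regression check against a golden dataset.
# File name and helpers (golden.jsonl, call_llm, generate_answer) are illustrative.
import json

def call_llm(prompt: str) -> str:
    """Placeholder: swap in your model provider's client here."""
    raise NotImplementedError

def generate_answer(customer_input: str) -> str:
    """Placeholder for the AI feature under test."""
    raise NotImplementedError

def judge_matches(expected: str, actual: str) -> bool:
    """Ask a judge model whether the candidate answer matches the golden one."""
    verdict = call_llm(
        "Answer PASS or FAIL. Does the candidate answer convey the same substance "
        f"as the reference?\nReference: {expected}\nCandidate: {actual}"
    )
    return verdict.strip().upper().startswith("PASS")

def test_golden_dataset():
    """Run in CI: fail the build if quality drops below the agreed threshold."""
    with open("golden.jsonl") as f:
        cases = [json.loads(line) for line in f]   # {"input": ..., "expected": ...}
    passed = sum(
        judge_matches(c["expected"], generate_answer(c["input"])) for c in cases
    )
    pass_rate = passed / len(cases)
    assert pass_rate >= 0.95, f"Golden-dataset pass rate dropped to {pass_rate:.0%}"
```

In practice, the judge prompt and the pass threshold are where the SME input from above gets encoded: they define what “good” means for your domain.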

Your customers are the ultimate “evaluation” 🙋‍♀️

Domain expertise can also come from your biggest experts: your customers.

Real-world usage often diverges from the best benchmarks and from initial eval datasets. Replit found early on that benchmark scores didn’t always predict user satisfaction:

“Sometimes the benchmark number goes up, but quality goes down. We started measuring quality from human feedback – explicit (‘I like it’) or implicit (‘I rolled back this change’).”

You can A/B test prompts and measure whether they lead to fewer rollbacks, more accepted suggestions, or faster deployments. And you can only do that if you ship!
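
As a sketch of what that measurement could look like, here is a tiny aggregation over a made-up event log, where each record says which prompt variant served a suggestion and whether the user later rolled it back (the event schema is an assumption, not Replit's):

```python
# Sketch of comparing two prompt variants by an implicit-feedback metric
# (rollback rate). The event schema here is made up for illustration.
from collections import Counter

# Each event: which prompt variant served the suggestion, and whether the
# user later rolled it back (an implicit "I didn't like it" signal).
events = [
    {"variant": "prompt_a", "rolled_back": False},
    {"variant": "prompt_a", "rolled_back": True},
    {"variant": "prompt_b", "rolled_back": False},
    # ... streamed from production logs
]

served = Counter(e["variant"] for e in events)
rollbacks = Counter(e["variant"] for e in events if e["rolled_back"])

for variant in served:
    rate = rollbacks[variant] / served[variant]
    print(f"{variant}: {rate:.1%} rollback rate over {served[variant]} suggestions")
```

With real traffic you would add a significance check before declaring a winner, but the core idea holds: the metric comes from shipped behavior, not from a benchmark.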

Suno takes advantage of this in a clever way.

When you generate a song in Suno, you actually get two songs back – not (just) because they’re generous, but because they can learn a lot from which one you prefer. These are two songs their model thinks are equally good, and yet a human prefers one. Why? This kind of feedback, implied or otherwise, is exactly how you steer AI systems towards human preferences over time.
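
A minimal sketch of how you might capture that signal, assuming a simple append-only log (the schema and file name are made up for illustration; this is not Suno's pipeline):

```python
# Sketch of logging pairwise preferences: serve two generations, record which
# one the user picked. The record schema and storage are illustrative only.
import json
import time

def log_preference(prompt: str, output_a: str, output_b: str,
                   user_choice: str, path: str = "preferences.jsonl") -> None:
    """Append one chosen/rejected pair; these pairs later become the data
    used to steer the model toward human taste (e.g. preference tuning)."""
    record = {
        "ts": time.time(),
        "prompt": prompt,
        "chosen": output_a if user_choice == "a" else output_b,
        "rejected": output_b if user_choice == "a" else output_a,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

The product cost is small (show two results instead of one), while the preference pairs accumulate into a steady stream of human-preference signal.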

Ultimately, the best evals are living systems. They’re built from domain knowledge, refined through human judgment, and validated in the real world.


⚖️ Should you build models, buy them, or both?

The age-old question for all software is whether you should build it or buy it, and many companies building AI products are facing the same dilemma when it comes to models.
