Nero Part 1: Home Automations

I will not start this post by belaboring a truth we all know: the advances in Artificial Intelligence in the last 2 years have been staggering. Despite that, I still rarely interface directly with Large Language Models other than through the ChatGPT interface. This could just be a consequence of my own inflexibility; however, I believe there’s another reason. Most applications of Large Language Models are built on software 1.0 platforms, which means the primary way of interacting with them is through chat interfaces like ChatGPT. I don’t really know if chat is the optimal interface for these products.

There is also an emerging ecosystem of products built around agent workflows. I think these are promising, but they are also built within the confines of software 1.0. I’m not saying that I know what the next generation of computing holds. I’m just saying it’s important to consider new mediums in the process. My primary goal with this project is to explore what those new mediums might be.

Starting with Home Automations and Security

When deciding what I wanted to do with Nero first, I went through a checklist of several ideas, but ultimately settled on home automation – specifically interacting with my home security systems – as the first integration. I want Nero to be able to interact with the outside world, and so I want to surface any additional design considerations necessary to make this possible. Additionally, my first perceptions of “Artificial Intelligence” were formed from watching the Disney movie Smart House as a kid. What better way to start a project like this than by making my own “Smart House”?

The home automation and security use case is also interesting to me because I have a number of “smart” devices that all require their own separate app to control. It would be nice to have an assistant that interacts with these devices for me.

For this first use case, I want Nero to be able to:

  • Answer questions about my home sensors, including what doors are open, if doors are locked, if my security cameras are charged, etc.
  • Lock and unlock doors
  • Set alarms
  • Monitor camera feeds, and provide descriptions of what’s going on
  • Still answer general non-home related questions

I also want the first iteration of Nero to be voice-interactive. I plan to build Nero into one of my spare SBCs so I have an actual device to interact with; however, for this post Nero exists in a website.

Reversing the SimpliSafe API

My home security consists of several SimpliSafe devices including cameras, smart locks, entry sensors, and more. If I was a true hacker, I would probably have thrown all of these devices in the trash and built my own home security system from scratch. Maybe I’ll do that in the future, but for now it’s easy enough to make Nero integrate with my existing devices.

Calling this work “reversing” is kind of insulting to real reverse engineers. The SimpliSafe API is really simple; even interacting with the live camera feeds (at least the v1 feeds) is not challenging. There are some slightly “involved” parts, but nothing over the top.

Before discussing the process, I want to note that I am well aware of projects like Home Assistant, and I understand it even has integrations for SimpliSafe. I decided not to use it because I figured I could work much quicker building my own solution in Elixir, and because this is as much an exercise in FITFO for me as it is a fun/useful project. As far as I could tell, the Home Assistant integration also does not provide any way to interact with live camera feeds.

To get started, I went to the SimpliSafe website, opened up the Network tab in Developer Tools, and started inspecting the traffic. When you enter the SimpliSafe dashboard, you’ll see a few pretty obvious requests. The first one to note is a request to https://auth.simplisafe.com/oauth/token. After authenticating, SimpliSafe issues a Bearer token which expires after 60 minutes. This Bearer token is used in each of the API requests for controlling and interacting with SimpliSafe devices. This access is short-lived, which is not ideal for a project like Nero. Fortunately, it’s easy enough to implement a re-authentication flow to maintain access without needing to pull a new token from your browser every 60 minutes. I will leave that implementation as an exercise for the reader.

After the Bearer token is issued, the dashboard performs an authenticated request to https://api.simplisafe.com/v1/api/authCheck. This endpoint replies with a userId which is used in some follow-on API requests, including one to https://api.simplisafe.com/v1/users/{userId}/subscriptions. The subscriptions endpoint returns subscription information, but is really only noteworthy because it’s the first request that gives you a subscription ID in the response. This ID is also used in follow-on requests.
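
Wrapped up in Elixir, these two requests end up being a small Req helper. A rough sketch (the module name is just how I organized things, and the response keys are from memory, so treat them as approximate):

defmodule Nero.Client.SimpliSafe.Metadata do
  @base "https://api.simplisafe.com/v1"

  # Fetch the user ID and subscription ID needed by the follow-on requests.
  def fetch_ids(token) do
    auth = [auth: {:bearer, token}]

    # authCheck returns the user ID tied to the bearer token
    %{"userId" => user_id} =
      Req.get!("#{@base}/api/authCheck", auth).body

    # the subscriptions endpoint is the first place a subscription ID shows up
    %{"subscriptions" => [%{"sid" => subscription_id} | _]} =
      Req.get!("#{@base}/users/#{user_id}/subscriptions", auth).body

    {user_id, subscription_id}
  end
end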

After retrieving necessary metadata, SimpliSafe performs a request to https://api.simplisafe.com/v1/ss3/subscriptions/{subscriptionId}/sensors. This endpoint returns a response object that looks like:

{
    "account": {account},
    "success": true,
    "sensors": [
        {
            "flags": {
                "offline": false,
                "lowBattery": false,
                "swingerShutdown": false
            },
            "serial": {serial},
            "type": 1,
            "name": "entry",
            "setting": {
                "alarm": 1,
                "lowPowerMode": false
            },
            "status": {},
            "timestamp": 9072000,
            "rssi": -66,
            "WDTCount": 3,
            "nonce": 144373,
            "rebootCnt": 97,
            "deviceGroupID": 0
        },
        ...
    ],
    "lastUpdated": 1708094167,
    "lastSynced": 1708094167,
    "lastStatusUpdate": 1708094166
}

There is a sensor entry for each of the connected devices in your SimpliSafe home network. Initially, I thought this response would have all of the information I needed to give Nero a full picture of the state of devices in my home network. I was incorrect. While the sensors endpoint does include information about all of your connected sensors, I found it doesn’t actually update for locks and alarms. I diffed responses from this endpoint after messing with lock and alarm settings, and noticed that the corresponding lock and alarm entries never actually changed. That said, I still relied on information from this endpoint for entry sensor states and CO2/Smoke Detector states (sensor types 5 and 14, respectively).
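
Since I only trust this endpoint for those sensor types, pulling out the relevant entries is simple. A rough sketch (the module name and summary shape are illustrative; types 5 and 14 are what I observed in my setup):

defmodule Nero.Client.SimpliSafe.Sensors do
  # Sensor types I actually rely on from the sensors endpoint.
  @entry_sensor 5
  @co2_smoke_detector 14

  # Reduce the raw response down to the fields worth surfacing to Nero.
  def summarize(%{"sensors" => sensors}) do
    sensors
    |> Enum.filter(&(&1["type"] in [@entry_sensor, @co2_smoke_detector]))
    |> Enum.map(fn sensor ->
      %{
        name: sensor["name"],
        type: sensor["type"],
        offline: get_in(sensor, ["flags", "offline"]),
        low_battery: get_in(sensor, ["flags", "lowBattery"])
      }
    end)
  end
end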

After some confusion about where door lock information was coming from, I noticed a request to https://api.simplisafe.com/v1/doorlock/{subscriptionId}. This request returns a response like:

[
    {
        "sid": {subscription},
        "serial": {serial},
        "name": "Front Door",
        "mode": "sensor",
        "status": {
            "calibrationErrZero": 0,
            "calibrationErrDelta": 0,
            "lockLowBattery": false,
            "lockDisabled": false,
            "pinPadLowBattery": false,
            "pinPadOffline": false,
            "lockState": 1,
            "pinPadState": 0,
            "lockJamState": 0,
            "lastUpdated": "2024-02-18T18:55:55.501Z",
            "calibrationErrTimeout": 0
        },
        "flags": {
            "offline": false,
            "lowBattery": false,
            "swingerShutdown": false
        },
        "firmwareVersion": "1.5.0",
        "bootVersion": "1.0.1",
        "otaPriority": "general",
        "nextCalibration": "2024-03-18T20:13:04.430Z"
    }
]

Unlike the sensors endpoint, this one actually updates after lock settings are changed. Next, I wanted to see how I could programmatically lock and unlock my door, as well as update my home alarm state. After locking my door in the dashboard, I noticed a POST request to https://api.simplisafe.com/v1/doorlock/{subscriptionId}/{serial}/state with a payload:

{"state": state}

Where state is one of “lock” or “unlock”. Nice and simple. I put together a quick client using Req, and started programmatically locking and unlocking the door (despite objections from my fiancé to stop fucking with it and stressing the dogs out (love you)).

Next on the agenda was to find out how I could programmatically set my home alarm. After toggling all 3 states, I found that the alarm state is changed by sending a POST request to https://api.simplisafe.com/v1/ss3/subscriptions/{subscriptionId}/state/{state} where state is one of “off”, “home”, or “away”.
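
In Elixir, both of these become tiny Req calls. A sketch of what the relevant client functions might look like (token and subscription handling is elided; in the real client those come from config and the re-authentication flow):

defmodule Nero.Client.SimpliSafe.Control do
  @base "https://api.simplisafe.com/v1"

  # state is "lock" or "unlock"
  def set_lock(token, subscription_id, serial, state) do
    Req.post!("#{@base}/doorlock/#{subscription_id}/#{serial}/state",
      auth: {:bearer, token},
      json: %{state: state}
    )
  end

  # state is "off", "home", or "away"
  def set_alarm(token, subscription_id, state) do
    Req.post!("#{@base}/ss3/subscriptions/#{subscription_id}/state/#{state}",
      auth: {:bearer, token}
    )
  end
end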

At this point, I had enough information for Nero to be able to:

  • Answer questions about my home sensors, including what doors are open, if doors are locked, if my security cameras are charged, etc.
  • Lock and unlock doors
  • Set alarms

Which was sweet! But, I really wanted to make sure Nero could monitor my camera feeds. I navigated to the camera section of the SimpliSafe dashboard, and turned on my doorbell camera feed. In the network tab, I noticed a request to https://media.simplisafe.com/v1/{uuid}/flv?x=944&audioEncoding=AAC. uuid is the camera’s UUID, which can be obtained by listing the available cameras you have via this endpoint: https://app-hub.prd.aser.simplisafe.com/v1/subscriptions/{subscription}/cameras.

The feed for my doorbell camera used chunked transfer encoding to stream an FLV video feed to the browser. Fortunately, this is pretty easy to work with! My plan for Nero’s “monitoring” ability was to use a vision language model like GPT-4V or LLaVA on screenshots of my camera’s live feed. In order to get screenshots, we can use ffmpeg on the live stream. If we want continuous monitoring, we can set up ffmpeg to take screenshots every N frames, and use a filesystem watcher to send new screenshots to a vision model. For this first demo, I just want screenshots on demand, so Nero can quickly describe the scene outside my house when asked. To do this, I used the following ffmpeg command:

ffmpeg -loglevel error -i http://localhost:8080/proxy/simplisafe/#{uuid}/stream.flv -vframes 1 -f image2pipe -vcodec png - | base64

This will return a base64 representation of a screenshot of the feed in PNG format. Note that I actually proxied the original stream, though I don’t think this was necessary.
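
The get_screen_shot/1 function referenced later in this post can be little more than a shell-out to ffmpeg, with the base64 step handled in Elixir. A sketch, assuming the same proxied stream URL:

  # Grab a single PNG frame from the (proxied) FLV feed and base64-encode it.
  def get_screen_shot(uuid) do
    url = "http://localhost:8080/proxy/simplisafe/#{uuid}/stream.flv"

    args = [
      "-loglevel", "error",
      "-i", url,
      "-vframes", "1",
      "-f", "image2pipe",
      "-vcodec", "png",
      "-"
    ]

    {png, 0} = System.cmd("ffmpeg", args)
    Base.encode64(png)
  end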

I (naively) assumed that my mounted cameras would work exactly the same as the doorbell camera, but it turns out mounted cameras use a v2 endpoint which streams video over a websocket using Amazon Kinesis Video Streams. I didn’t really feel like messing with WebRTC and Kinesis, so I sadly left this work for another day.

Giving Nero Commands

After writing a SimpliSafe client that contained the basic features I wanted Nero to have access to, it was time to give Nero some “intelligence.” I have been interested in the Instructor library, and thought it would be a good fit for this project. Instructor is a library for performing structured prompting. It’s non-intrusive, and makes use of an LLM’s ability to produce structured (e.g. JSON) outputs. In the Elixir ecosystem, Instructor allows you to define Ecto schemas like:

defmodule Nero.Modules.Home do
  use Ecto.Schema

  @doc """
  ## Field Descriptions
  - action: Action to take. One of :set_lock, :set_alarm, :view_camera, :none.
  Set door will set the door lock state. Set alarm will set the alarm state. View
  camera will view the given camera. None is used only to respond in cases where
  you do not need to perform any action.
  - lock_state: Lock state. Either lock or unlock. Should ONLY be set if the action
  is set_lock. Otherwise it MUST be nil.
  - alarm_state: Alarm state. Either off, home, away. Should ONLY be set if action
  is set_alarm. Otherwise it MUST be nil.
  - camera: Camera feed to view. Should equal the UUID of the camera selected to view.
  Should only be set when action is :view_camera. Otherwise it MUST be nil.
  """
  @primary_key false
  embedded_schema do
    field :action, Ecto.Enum, values: [:set_lock, :set_alarm, :view_camera, :none]

    field :lock_state, Ecto.Enum, values: [:lock, :unlock]
    field :alarm_state, Ecto.Enum, values: [:off, :home, :away]

    field :camera, :string
  end
end

Instructor will help coerce your LLM of choice to return results in this schema. This is the schema I used for Nero’s home automation module. It’s opinionated, but I will focus on generalizing later.

Originally, I had included a response field in the action schema to allow Nero to explain each action taken. This, however, limits the ability to stream Nero’s responses. Instructor is capable of partial schema streaming, but it will not stream a field until it is fully populated. This impacts “time to first token”, or in this case “time to first spoken word”, and makes interacting with Nero a little less natural. There are some ways around it, but a simple solution is to just perform a “response” call in parallel with the “action” call. That’s the solution I used in the end. In a future post, I will explore more aggressive optimizations to make Nero feel more conversational.
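
In terms of the agent functions defined later in this post, “in parallel” just means kicking the action off as a task while the response stream starts immediately. A rough sketch of the idea (the LiveView at the end of this post does the same thing with start_async):

  # Start the structured "action" call in the background and begin
  # consuming the streamed "response" right away.
  def handle_command(command) do
    action_task = Task.async(fn -> execute(command) end)

    response_stream = respond(command)

    # ...speak/stream response_stream while the action is in flight...

    Task.await(action_task, 30_000)
  end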

After defining a schema, you can use Instructor to obtain an output matching your response model like so:

  def get_action(prompt) do
    Instructor.chat_completion(
      model: "gpt-3.5-turbo",
      response_model: Home,
      messages: [
        %{role: "user", content: prompt}
      ]
    )
  end
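
Calling this with a rendered prompt yields a validated Home struct on success (or an error changeset if the model’s output doesn’t fit the schema). A representative call might look like:

  iex> prompt = Nero.Prompts.action(%{command: "Lock the front door", sensors: sensors()})
  iex> get_action(prompt)
  {:ok, %Nero.Modules.Home{action: :set_lock, lock_state: :lock, alarm_state: nil, camera: nil}}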

For Nero’s “action module,” I used the following prompt:

You are Nero, a friendly, programmable assistant.

You will be given a home automation query which you must turn into
an action to complete the query.

Do not perform redundant actions. You will get information about the home sensor
states next. If a door is already locked, do not re-lock it. Simply perform no action.
You can also use this sensor information to answer questions about the state of the house.
If the query requires no action (e.g. just answering a question about a door), simply
perform no action.

Sensor Data: <%= assigns.sensors %>
Command: <%= assigns.command %>

I’m not a student of the latest prompt engineering techniques. I’m sure there are better prompts than this one, but the action space for this demonstration is so small that I don’t think it really matters. In the future I will add evals to Nero and work to optimize these prompts a bit more. In the few trial runs I did, this prompt worked very well.

Next, I needed to give Nero the ability to “respond.” Notably, these responses needed to stream in order to improve time to first spoken word. This is pretty simple with a single function:

  def respond(command) do
    prompt =
      Nero.Prompts.response(%{
        command: command,
        sensors: sensors()
      })

    response_stream =
      OpenAI.chat_completion(
        model: "gpt-3.5-turbo",
        messages: [
          %{role: "user", content: prompt}
        ],
        max_tokens: 1200,
        stream: true
      )

    response_stream
    |> Stream.map(&get_in(&1, ["choices", Access.at(0), "delta", "content"]))
    |> Stream.reject(&is_nil/1)
  end

I used a slightly different prompt for responses. I wanted Nero to respond in a distinct style. The most accurate terminology I could use to describe this style was “in the style of Alfred to Bruce Wayne.”

You are Nero, a friendly, programmable assistant. You've been given a command
and are in the process of performing an action which satisfies the command.
You must now provide a response which describes the action you plan on taking.

When responding, be sure to refer to the speaker as "sir" and be extremely
respectful. You should use the tone of a butler. Model your tone and responses
after how Alfred would respond to Bruce Wayne.

You will get information about the home sensor states next. If a door is already
locked, do not re-lock it. Simply reply that it's already in a locked state. You
can also use this sensor information to answer questions about the state of the house.

If asked to provide updates about a camera feed, or if that is the logical action
to take, simply acknowledge that you are checking the camera feed and will report back
shortly. Keep your replies succinct. Do NOT expand upon sensor information unless
explicitly asked to. You can simply say everything looks good if all readings are
normal. ONLY make suggestions if there are legitimate security concerns. When asked
questions unrelated to home data, DO NOT mention anything about the home state. Just
respond as normal.

Sensor Data: <%= assigns.sensors %>
Command: <%= assigns.command %>

You might see a bit more “engineering” in this prompt. This is mainly due to observed failure cases when I was testing.

The final step in this process is to implement the actual execution of the actions Nero decides to take. To accomplish this, I created a single function in the Home module I defined previously, which routed actions to the appropriate SimpliSafe client function call:

  def execute(%Home{action: :set_lock, lock_state: state}) do
    SimpliSafe.set_lock(state)
  end

  def execute(%Home{action: :set_alarm, alarm_state: state}) do
    SimpliSafe.set_alarm(state)
  end

  def execute(%Home{action: :view_camera, camera: uuid}) do
    base64 = SimpliSafe.get_screen_shot(uuid)
    describe_image(base64)
  end

  def execute(%Home{action: :none}) do
    :ok
  end

:set_lock and :set_alarm are simple actions – they can just defer to the SimpliSafe API. :view_camera is a bit more involved – but still just a simple integration with the GPT-4V API. To accomplish this action, I used the ffmpeg command described earlier in this post to get a base64 screenshot of my live camera feed. Then, I passed this into a GPT-4V API call with the following prompt:

You are Nero, a friendly, programmable assistant. You will be given a screenshot
of the current view of a security camera outside of a house. You should succinctly
describe what you see, if anything. If there's nothing unusual, simply state that
you see nothing out of the ordinary. Assess the threat level/security status of the
house. If there seems to be an immediate danger or perhaps a potential threat, state
what the danger is.

When responding, be sure to refer to the speaker as "sir" and be extremely
respectful. You should use the tone of a butler. Model your tone and responses
after how Alfred would respond to Bruce Wayne.

I was surprised at the results I got from the GPT-4V API. It was very accurate.
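
For reference, describe_image/1 is just a chat completion with the screenshot attached as an image content part. A sketch, assuming your OpenAI client passes message content through untouched (Nero.Prompts.describe_camera/0 is a hypothetical helper that returns the prompt above, and the model name is whichever vision model you have access to):

  # Sketch of the vision call. The content list follows the OpenAI vision
  # message format; key access mirrors the streaming code earlier in this post.
  def describe_image(base64) do
    prompt = Nero.Prompts.describe_camera()

    with {:ok, response} <-
           OpenAI.chat_completion(
             model: "gpt-4-vision-preview",
             messages: [
               %{
                 role: "user",
                 content: [
                   %{type: "text", text: prompt},
                   %{type: "image_url", image_url: %{url: "data:image/png;base64," <> base64}}
                 ]
               }
             ],
             max_tokens: 300
           ) do
      {:ok, get_in(response, ["choices", Access.at(0), "message", "content"])}
    end
  end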

One thing to note is that the main convention here is that each execution returns :ok, {:ok, response}, :error, or {:error, reason}. We can run executions as asynchronous tasks, await the final result, and use the response to (see the sketch after this list):

  1. Do nothing (:ok)
  2. Reply with the successfully pulled response ({:ok, response})
  3. Handle errors (e.g. reply with descriptive info on how to fix, or just notify that one occurred)
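
The shape of that handling looks roughly like this (speak/1 here is a stand-in for however the reply gets surfaced; the real LiveView handler shows up at the end of this post):

  # Await the execution task and branch on the convention above.
  case Task.await(task) do
    :ok -> :noop
    {:ok, response} -> speak(response)
    :error -> speak("Apologies sir, something went wrong.")
    {:error, reason} -> speak("Apologies sir, I ran into a problem: #{inspect(reason)}")
  end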

Next, I defined an “agent” module that encapsulated Nero’s ability to execute and respond to commands:

  def respond(command) do
    prompt =
      Nero.Prompts.response(%{
        command: command,
        sensors: sensors()
      })

    response_stream =
      OpenAI.chat_completion(
        model: "gpt-3.5-turbo",
        messages: [
          %{role: "user", content: prompt}
        ],
        max_tokens: 1200,
        stream: true
      )

    response_stream
    |> Stream.map(&get_in(&1, ["choices", Access.at(0), "delta", "content"]))
    |> Stream.reject(&is_nil/1)
  end

  def execute(command) do
    prompt =
      Nero.Prompts.action(%{
        command: command,
        sensors: sensors()
      })

    prompt
    |> get_action()
    |> Home.execute()
  end

  defp get_action(prompt) do
    Instructor.chat_completion(
      model: "gpt-3.5-turbo",
      response_model: Home,
      messages: [
        %{role: "user", content: prompt}
      ]
    )
  end

Now, I just needed to make Nero a bit more interactive.

Transcribing Audio

In order to make Nero voice-activated, I relied on the Whisper implementation in Bumblebee. There’s actually a nice, convenient LiveView Example that implements Speech to Text with Whisper. This example served as the base “interface” for Nero. I recommend taking a look at the source to see how it works.

The real change here is that rather than use an async_result and assign_async, I used start_async:

  defp handle_progress(:audio, entry, socket) when entry.done? do
    binary =
      consume_uploaded_entry(socket, entry, fn %{path: path} ->
        {:ok, File.read!(path)}
      end)

    audio = Nx.from_binary(binary, :f32)

    socket =
      socket
      |> assign(state: :transcribing)
      |> start_async(:transcription, fn ->
        Nero.SpeechToText.transcribe(audio)
      end)

    {:noreply, socket}
  end

And then matched on the successful transcription:

@impl true
def handle_async(:transcription, {:ok, transcription}, socket) do
  # do things
end
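
For completeness, Nero.SpeechToText.transcribe/1 is just a thin wrapper over the Whisper serving started in my application’s supervision tree. A sketch (the serving name is mine, and the output shape assumes a recent Bumblebee version):

defmodule Nero.SpeechToText do
  # Run the audio tensor through the Whisper Nx.Serving and join the
  # transcribed chunks into a single string.
  def transcribe(audio) do
    %{chunks: chunks} = Nx.Serving.batched_run(Nero.WhisperServing, audio)

    chunks
    |> Enum.map_join(" ", & &1.text)
    |> String.trim()
  end
end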

It took less than 10 minutes to have Speech to Text running in a LiveView. It was time to bring Nero to life.

Bringing Nero to Life

The final step was giving Nero a voice and executing actions based on transcriptions. To start, I wanted to build a robust text to speech endpoint. Based on some discussion with Jonatan Kłosko, I settled on an approach that used Phoenix Channels and the ElevenLabs WebSocket API. First, I defined a WebSocket handler for the ElevenLabs API:

defmodule Nero.Client.ElevenLabs.WebSocket do
  use WebSockex

  require Logger

  def start_link(broadcast_fun) do
    headers = [{"xi-api-key", env(:api_key)}]

    url = "wss://api.elevenlabs.io/v1/text-to-speech/#{env(:voice_id)}/stream-input"

    WebSockex.start_link(url, __MODULE__, %{fun: broadcast_fun}, extra_headers: headers)
  end

  def open_stream(pid) do
    msg = Jason.encode!(%{text: " "})
    WebSockex.send_frame(pid, {:text, msg})

    pid
  end

  def close_stream(pid) do
    msg = Jason.encode!(%{text: ""})
    WebSockex.send_frame(pid, {:text, msg})
  end

  def send(pid, text) do
    msg = Jason.encode!(%{text: "#{text} ", try_trigger_generation: true})
    WebSockex.send_frame(pid, {:text, msg})
  end

  ## Server

  def handle_frame({:text, msg}, %{fun: broadcast_fun} = state) do
    case Jason.decode!(msg) do
      %{"audio" => audio} ->
        raw = Base.decode64!(audio)
        broadcast_fun.(raw)

      _ ->
        Logger.error("Something went wrong")
        :ok
    end

    {:ok, state}
  end
end

I’m a little upset that ElevenLabs sends Base64-encoded audio back. Maybe it doesn’t impact latency that much, but I would just feel better if they sent raw binaries instead. This WebSocket client allows me to declare a fun which operates on the raw audio. I used this to create a TextToSpeech module with a stream/2 function which can consume an Enumerable, or in this case a Stream of text as it is generated:

  def stream(enumerable, fun) do
    enumerable
    |> group_tokens()
    |> Stream.map(&Enum.join/1)
    |> Stream.transform(
      fn -> open_stream(fun) end,
      fn text, pid ->
        WebSocket.send(pid, text)
        {[text], pid}
      end,
      fn pid -> WebSocket.close_stream(pid) end
    )
    |> Enum.join()
  end

  defp open_stream(fun) do
    {:ok, pid} = WebSocket.start_link(fun)
    WebSocket.open_stream(pid)
  end

  defp group_tokens(stream) do
    Stream.transform(stream, {[], []}, fn item, {current_chunk, _acc} ->
      updated_chunk = [item | current_chunk]

      if String.ends_with?(item, @separators) do
        {[Enum.reverse(updated_chunk)], {[], []}}
      else
        {[], {updated_chunk, []}}
      end
    end)
    |> Stream.flat_map(fn
      {[], []} -> []
      chunk -> [chunk]
    end)
  end

Before any text is sent over the socket, group_tokens chunks the raw LLM tokens into words, per ElevenLabs’ own recommendation. You might also notice that at the end we call Enum.join/1. I didn’t want to discard the response, so this function consumes the entire text stream and returns the joined result. Note that this approach also may not be the most performant. ElevenLabs recommends keeping the WebSocket open because connections go over WSS, so establishing a new one takes a bit more time. Unfortunately, they also have a 20 second timeout, and I didn’t feel like implementing any keep-alive functionality.

The intended use case of stream/2 is to use fun to broadcast raw audio over a Phoenix Channel. I used this to declare the following speak function in my LiveView:

  defp speak(socket, text) do
    socket
    |> assign(state: :speaking)
    |> start_async(:speak, fn ->
      Nero.TextToSpeech.stream(text, fn audio ->
        broadcast_audio(socket.assigns.audio_channel, audio)
      end)
    end)
  end

  defp broadcast_audio(channel, audio) do
    NeroWeb.Endpoint.broadcast_from(
      self(),
      channel,
      "phx:audio-stream",
      {:binary, audio}
    )
  end

This spins up an asynchronous task which will consume a text stream (presumably from an LLM response). Then we can handle the async result:

  def handle_async(:speak, {:ok, _response}, socket) do
    {:noreply, assign(socket, state: :waiting)}
  end

And finally use a JavaScript Hook to connect to the channel on mount and manage incoming Audio broadcasts:

const Speaker = {
  audioContext: new (window.AudioContext || window.webkitAudioContext)(),
  audioQueue: [],
  isPlaying: false,
  initialBufferFilled: false,
  initialBufferLength: 1, // Number of chunks to buffer before starting playback
  overlapDuration: 0.05, // Overlap duration in seconds

  mounted() {
    let csrfToken = document.querySelector("meta[name='csrf-token']").getAttribute("content");

    this.channel = socket.channel('tts:' + csrfToken, {});
    this.channel.on("phx:audio-stream", payload => {
      this.enqueueAudio(payload);
    });
    this.channel.join();
  },

  enqueueAudio(audioData) {
    this.audioQueue.push(audioData);

    if (!this.initialBufferFilled && this.audioQueue.length >= this.initialBufferLength) {
      this.initialBufferFilled = true;
      this.processQueue();
    } else if (this.initialBufferFilled && !this.isPlaying) {
      this.processQueue();
    }
  },

  processQueue() {
    if (this.audioQueue.length === 0) {
      this.isPlaying = false;
      return;
    }

    this.isPlaying = true;
    const audioData = this.audioQueue.shift();
    const localThis = this;

    if (audioData) {
      this.audioContext.decodeAudioData(audioData, decodedData => {
        const source = this.audioContext.createBufferSource();
        source.buffer = decodedData;
        source.connect(this.audioContext.destination);

        let nextStartTime = this.audioContext.currentTime;
        if (this.lastSource && this.lastSource.buffer) {
          const lastBufferDuration = this.lastSource.buffer.duration;
          if (lastBufferDuration && !isNaN(lastBufferDuration)) {
            nextStartTime = this.lastSource.startTime + lastBufferDuration - this.overlapDuration;
          }
        }

        source.onended = () => {
          localThis.processQueue();
        };

        source.start(nextStartTime);
        this.lastSource = { source: source, startTime: nextStartTime, buffer: decodedData };
      }, function(e) {
        console.log("Error decoding audio data" + e.err);
        localThis.processQueue();
      });
    }
  },
};

This buffers clips and processes them as they come in. The audio is a bit choppy, but it’s not bad. Please don’t make fun of my (ChatGPT-assisted) JavaScript.

Finally, I defined an act function which executes actions from the transcription:

  defp act(socket, transcription) do
    start_async(socket, :act, fn ->
      Nero.Agent.execute(transcription)
    end)
  end

And then handled that result:

  def handle_async(:act, {:ok, result}, socket) do
    socket =
      case result do
        {:ok, text} -> speak(socket, text)
        _ -> socket
      end

    {:noreply, socket}
  end

And that’s all it took! This implementation has the same functionality as the demo I posted on Twitter originally. The performance, in terms of time to first spoken word, is just better.

One Final Optimization

I was really eager to cut down on time to first spoken word, but I spent absolutely zero time profiling before trying to optimize the hell out of the inference pipeline with streams, WebSockets, channels, and more. Then I realized I was running Whisper on my CPU, using an old version of EXLA. EXLA is based on XLA, and the XLA compiler has been notably under-optimized on CPUs. It’s getting better, but it’s still not great. I added some debug configurations to my Whisper serving, and was surprised to see that transcription execution times were almost always above 1s (yikes!). At this point, I could have been a good programmer and gone hard on trying to optimize Whisper running on XLA CPU, but instead I decided to just throw hardware at it. I deployed the app on my Ubuntu machine, which has a 4090, and served it to my Mac with Tailscale. The result was a much shorter time to first spoken word, and a better demo.
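
For the curious, the “debug configuration” is just passing debug: true to EXLA in the serving’s defn_options, which logs compile and execution times. My serving setup looks roughly like this (the model choice, batch size, and chunk length here are illustrative):

repo = {:hf, "openai/whisper-tiny"}

{:ok, model_info} = Bumblebee.load_model(repo)
{:ok, featurizer} = Bumblebee.load_featurizer(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

serving =
  Bumblebee.Audio.speech_to_text_whisper(
    model_info,
    featurizer,
    tokenizer,
    generation_config,
    chunk_num_seconds: 30,
    compile: [batch_size: 1],
    # debug: true logs compile and execution times; on the 4090 box EXLA
    # targets the CUDA client instead of the host CPU
    defn_options: [compiler: EXLA, debug: true]
  )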

Conclusion

This was probably the most fun I’ve had working on a project in a while, and I have a lot of other fun ideas I’d like to explore. I was mostly surprised by how easy it was to get to a passable and impressive demonstration. Although, I don’t think I should’ve been that surprised. I have consistently found that LLMs and recent innovations in AI make it really easy to deliver an impressive demo, yet it’s still really hard to deliver a polished product. For this application specifically, there are tons of UX improvements and features that go beyond just synthesizing a few APIs.

This project also convinced me that we need to start thinking outside the box about how we interact with LLMs, and how we design systems to interact with LLMs. If you consider that most apps and interfaces were designed for the pre-LLM world, it should be easy to see how there might be better ways we can design applications in an LLM-forward manner. It seems to me like many LLM agent applications try to fit a square peg in a round hole.
