Engineering · March 4, 2026 · 7 min read

Anatomy of a 60-second booking

What actually happens between 'ring' and 'confirmed'? A walk-through of a live Hala call, timestamp by timestamp.

Noura Al-Rashid

Engineering Lead, ARA

People ask how the AI works. The honest answer is: not in one step. Hala is a small orchestra of components, coordinated so that the caller experiences one uninterrupted conversation. Here’s what happens during a typical booking, timed against the wall clock.

00:00 — Ring

Call lands on a GCC-local number. Our telephony layer (primary: Twilio Saudi region; failover: local carriers) picks up within 400 ms — fast enough that the caller doesn’t hear a second full ring.

00:01 — Greeting

Hala opens with a regional greeting matched to the dialect profile the business has selected. For a Saudi dental clinic with the “Friendly” tone, it’s: “مرحبا، معك هلا من عيادة الفيصلية، كيف أقدر أساعدك؟” — “Hi, this is Hala from Al-Faisaliah Clinic. How can I help?”

This greeting isn’t generated on the fly. It’s cached, which saves ~700 ms and more importantly lets us deliver it with natural prosody from the moment the call connects.
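The caching idea can be sketched roughly as follows. This is a minimal illustration, not Hala's actual code: the `GreetingCache` class, the `(clinic, dialect, tone)` key, and the `synthesize` callable standing in for a text-to-speech call are all hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical cache key: (clinic_id, dialect_profile, tone).
GreetingKey = Tuple[str, str, str]


class GreetingCache:
    """Holds pre-synthesized greeting audio so pickup needs no TTS call."""

    def __init__(self, synthesize: Callable[[str], bytes]) -> None:
        self._synthesize = synthesize  # stand-in for a real TTS call
        self._audio: Dict[GreetingKey, bytes] = {}

    def warm(self, key: GreetingKey, text: str) -> None:
        # Synthesize ahead of time, e.g. when the clinic saves its settings.
        self._audio[key] = self._synthesize(text)

    def get(self, key: GreetingKey) -> Optional[bytes]:
        # Plain dictionary lookup: nothing is synthesized on the call path.
        return self._audio.get(key)
```

A miss returns `None`, so in this sketch the caller of `get` decides whether to fall back to live synthesis.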

00:04 — Listening

The caller starts speaking. We stream audio in 20-ms chunks to the speech-to-text layer. Partial transcripts come back every 300 ms or so. We’re not waiting for silence to start thinking — reasoning happens concurrently with listening.
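To make the 20-ms figure concrete, here is a small sketch of the framing arithmetic. The chunk duration comes from the description above; the 8 kHz sample rate and 16-bit mono PCM encoding are assumptions chosen for illustration, and `chunk_audio` is a hypothetical helper.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 8_000   # assumption: narrowband telephony audio
BYTES_PER_SAMPLE = 2     # assumption: 16-bit mono PCM
CHUNK_MS = 20            # chunk duration described in the post

# 8000 samples/s * 0.020 s = 160 samples, at 2 bytes each = 320 bytes.
CHUNK_BYTES = SAMPLE_RATE_HZ * CHUNK_MS // 1000 * BYTES_PER_SAMPLE


def chunk_audio(pcm: bytes, chunk_bytes: int = CHUNK_BYTES) -> Iterator[bytes]:
    """Yield fixed-size frames in order; the final frame may be shorter."""
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]
```

Under these assumptions one second of audio is 16,000 bytes and splits into 50 frames, so a partial transcript every 300 ms corresponds to roughly 15 frames of new audio.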

Roughly 1.2 seconds after the caller finishes their sentence, we have a confident transcript: “أبي أحجز موعد مع الدكتور، الأسبوع الجاي إذا ممكن” — “I want to book an appointment with the doctor, next week if possible.”

00:06 — Understanding

The transcript goes to our reasoning layer, which has the clinic’s current context loaded in a compact 2,000-token brief: who the doctors are, appointment types, duration and buffer rules, current availability for the next 14 days, and the clinic’s custom system prompt.

The model reads the intent and decides what’s missing: we know they want to book; we don’t know which doctor, which type of appointment, or whether they’re a new or returning patient. We also don’t yet know their name or phone number.
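Deciding what is still missing is a form of slot filling. The sketch below is one way to track it, assuming a flat set of fields; `BookingState` and its field names are hypothetical and only mirror the items listed in this walk-through.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class BookingState:
    """Hypothetical slot set; a field stays None until the caller supplies it."""
    doctor: Optional[str] = None
    appointment_type: Optional[str] = None
    patient_status: Optional[str] = None  # "new" or "returning"
    name: Optional[str] = None
    phone: Optional[str] = None
    preferred_window: Optional[str] = None

    def missing(self) -> List[str]:
        # Declaration order doubles as the order in which to ask.
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# After the first utterance only the time window is known.
state = BookingState(preferred_window="next week")
print(state.missing())
# ['doctor', 'appointment_type', 'patient_status', 'name', 'phone']
```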

00:07 — The first real response

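Hala answers: “بالتأكيد، مع أي دكتور حاب تحجز؟ عندنا الدكتورة سارة يوم الأحد والاثنين، والدكتور أحمد يوم الثلاثاء والأربعاء.” (“Of course. Which doctor would you like to book with? We have Dr. Sarah on Sunday and Monday, and Dr. Ahmed on Tuesday and Wednesday.”)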

The voice synthesis uses a low-latency model specifically tuned for the Saudi profile. First audio byte goes out in ~600 ms from the end of the caller’s turn. The caller experiences this as instant.

00:10–00:40 — The conversation

The next 30 seconds are the real work. Hala asks for:

  • Doctor preference (Dr. Sarah, Sunday morning)
  • Appointment type (routine check-up, 30 min)
  • Patient status (returning, so we look up the file)
  • Name (if new, we capture it; if returning, we confirm)
  • Contact number (we use the caller ID but confirm it)

Each turn is under 2 seconds of latency. We re-check availability after each constraint the caller adds — if the requested slot gets booked by another caller mid-conversation, we surface alternatives immediately rather than promise a slot we can’t deliver.
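The re-check can be pictured as filtering against the current set of booked slots on every turn, never against a snapshot taken at the start of the call. The following is a sketch under that assumption; `Slot`, `open_slots`, and `recheck` are hypothetical names, and the dates are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class Slot:
    doctor: str
    start: datetime


def open_slots(all_slots: List[Slot], booked: Set[Slot],
               doctor: Optional[str] = None) -> List[Slot]:
    """Slots that are not booked right now, optionally for one doctor."""
    free = [s for s in all_slots
            if s not in booked and (doctor is None or s.doctor == doctor)]
    return sorted(free, key=lambda s: s.start)


def recheck(requested: Slot, all_slots: List[Slot],
            booked: Set[Slot]) -> Tuple[bool, List[Slot]]:
    """Return (True, [requested]) if still free, else (False, alternatives)."""
    free = open_slots(all_slots, booked, doctor=requested.doctor)
    if requested in free:
        return True, [requested]
    return False, free


ten = Slot("Dr. Sarah", datetime(2026, 3, 8, 10, 0))
half_past = Slot("Dr. Sarah", datetime(2026, 3, 8, 10, 30))
booked: Set[Slot] = set()
print(recheck(ten, [ten, half_past], booked)[0])   # True
booked.add(ten)  # another caller takes the slot mid-conversation
print(recheck(ten, [ten, half_past], booked))      # (False, [the 10:30 slot])
```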

00:47 — The write

Before Hala confirms verbally, we write the booking to the database with a 2-second tentative hold. If the write fails (network blip, constraint violation), we apologize and offer a retry. If it succeeds, Hala confirms the time, reads it back, and — crucially — pauses to let the caller object before marking the booking final.
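The hold-then-finalize sequence might look something like the sketch below. It uses an in-memory dictionary as a stand-in for the database, and every name in it (`BookingStore`, `SlotTaken`, `hold`, `finalize`) is hypothetical. It also assumes that a hold simply expires after 2 seconds and that the same caller may renew it by calling `hold` again; the post does not say how the real system keeps a hold alive during the read-back.

```python
import time
from typing import Callable, Dict, Optional, Tuple

HOLD_SECONDS = 2.0  # tentative hold duration described in the post


class SlotTaken(Exception):
    """Raised when another caller holds or owns the slot."""


class BookingStore:
    """In-memory stand-in for the booking database (hypothetical interface)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # slot_id -> (caller_id, expiry); expiry None means the booking is final.
        self._rows: Dict[str, Tuple[str, Optional[float]]] = {}

    def hold(self, slot_id: str, caller_id: str) -> None:
        now = self._clock()
        row = self._rows.get(slot_id)
        if row is not None:
            owner, expires = row
            live = expires is None or expires > now
            if live and owner != caller_id:
                raise SlotTaken(slot_id)
            if expires is None:
                return  # already final for this caller
        # New hold, renewal by the same caller, or takeover of an expired hold.
        self._rows[slot_id] = (caller_id, now + HOLD_SECONDS)

    def finalize(self, slot_id: str, caller_id: str) -> bool:
        """Mark the booking final only if this caller's hold is still live."""
        row = self._rows.get(slot_id)
        if row is None or row[0] != caller_id:
            return False
        expires = row[1]
        if expires is not None and expires <= self._clock():
            return False
        self._rows[slot_id] = (caller_id, None)
        return True
```

Injecting the clock keeps the expiry logic testable without sleeping. A real implementation would enforce the same rule with a database constraint or transaction, not with process memory.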

00:58 — Goodbye

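“تم الحجز. موعدك مع الدكتورة سارة الأحد الجاي الساعة عشرة صباحاً. راح توصلك رسالة واتساب بالتأكيد. شكراً لاتصالك.” (“Your booking is done. Your appointment with Dr. Sarah is next Sunday at ten in the morning. You’ll get a WhatsApp message with the confirmation. Thank you for calling.”)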

Booking confirmed. WhatsApp template fires within 4 seconds. Activity feed in the dashboard updates in real time.

What makes this hard

The hard parts aren’t the speech-to-text or text-to-speech — those are commodities. The hard parts are:

  • Keeping the conversation natural when the caller interrupts, changes their mind, or asks a question mid-sentence
  • Failing gracefully when a booking can’t be completed (no slot, clinic closed, payment hold on the account)
  • Not hallucinating. A receptionist that invents a doctor, a time, or a policy is worse than no receptionist at all.

Everything visible — the warmth, the speed, the natural cadence — sits on top of infrastructure designed around one principle: the AI should never claim something it can’t verify. Every booking Hala confirms is a booking that exists in the database. Every doctor she mentions is a real provider on the clinic’s roster.
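One way to picture that principle is a check that runs on a drafted reply before it is spoken. This is only a sketch: it assumes the reasoning layer emits structured output listing the doctors and slots a reply mentions, which the post does not state, and `DraftReply` and `unverified_claims` are hypothetical names.

```python
from dataclasses import dataclass
from typing import AbstractSet, FrozenSet, List


@dataclass(frozen=True)
class DraftReply:
    """Hypothetical structured output from the reasoning layer."""
    text: str
    doctors_mentioned: FrozenSet[str]
    slots_mentioned: FrozenSet[str]


def unverified_claims(reply: DraftReply, roster: AbstractSet[str],
                      open_slot_ids: AbstractSet[str]) -> List[str]:
    """List every mentioned doctor or slot that the clinic's data does not back."""
    problems = [f"unknown doctor: {d}"
                for d in sorted(reply.doctors_mentioned - roster)]
    problems += [f"slot not open: {s}"
                 for s in sorted(reply.slots_mentioned - open_slot_ids)]
    return problems
```

An empty list means the reply may go out; anything else means it is regenerated or replaced with a safe fallback.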

That’s the bar. That’s what makes the difference between an impressive demo and a receptionist you can trust with the phone line.