Why We Built Our Own Telecom Stack
The architectural decisions behind Trunx — provider abstraction, stateless workers, Redis event bus, and guardrails that can't be bypassed.
If you've ever tried to build an AI-powered communication product, you already know the problem: there is no single vendor that does everything you need. You end up stitching together three to five services — Twilio for calls and SMS, a separate answering machine detection provider, a DNC list service, a reputation checker, maybe a compliance layer on top. Each has its own API semantics, its own auth model, its own billing dashboard, and its own failure modes. Your application code becomes a patchwork of vendor-specific integrations, and swapping any one of them out means rewriting half your stack.
We decided to build a unified telecom hub instead.
The Architecture
Trunx sits between your application and the telecom providers. One API, one auth model, one billing meter. Behind that surface, five architectural decisions keep the system reliable and the codebase maintainable.
1. Provider Abstraction
Route code never touches a vendor SDK. Every external service — Twilio, Ultravox, Asterisk, reputation APIs — sits behind a typed interface. A voice provider implements originate(), hangup(), and bridge(). An SMS provider implements send(). A reputation provider implements lookup().
The provider registry resolves the right implementation at runtime, either by configuration or by inspecting which provider owns a given DID. If we need to swap Twilio for another carrier tomorrow, we implement the interface and update the registry. No route code changes. No service logic changes.
The trade-off is real: you pay an indirection cost, and the interfaces need to be general enough to cover multiple vendors without becoming lowest-common-denominator. We've found that telecom operations are surprisingly uniform at the right level of abstraction — the differences are mostly in auth, webhook formats, and error codes, not in the operations themselves.
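To make the shape of this concrete, here is a minimal sketch of what a typed voice-provider interface and a DID-keyed registry could look like. The interface methods come from the article; the `ProviderRegistry` class, its method names, and the string call-ID return type are illustrative assumptions, not Trunx's actual implementation.

```typescript
// Hypothetical provider interface — route code depends on this, never on a vendor SDK.
interface VoiceProvider {
  originate(to: string, from: string): Promise<string>; // returns a call ID
  hangup(callId: string): Promise<void>;
  bridge(callIdA: string, callIdB: string): Promise<void>;
}

// Registry resolving the implementation that owns a given DID.
class ProviderRegistry {
  private byDid = new Map<string, VoiceProvider>();

  register(did: string, provider: VoiceProvider): void {
    this.byDid.set(did, provider);
  }

  resolve(did: string): VoiceProvider {
    const provider = this.byDid.get(did);
    if (!provider) throw new Error(`no provider owns DID ${did}`);
    return provider;
  }
}
```

Swapping carriers then means registering a different `VoiceProvider` implementation for the affected DIDs; callers of `resolve()` never change.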
2. State in Postgres, Not Memory
Every BullMQ worker in the system is a stateless reactor. DID health scores are not cached counters — they're SQL queries over did_health_events. Campaign progress is not an in-memory state machine — it's a query over campaign_prospects. IVR sessions live in Redis with a TTL, not in process memory.
This means a worker can restart at any time without losing state. It means you can scale workers horizontally without coordination. It means debugging is just querying the database, not attaching to a running process and hoping you catch the right moment.
The cost is latency. A SQL query is slower than reading a local variable. We mitigate this with Redis caching where it matters (IVR definitions, auth tokens), but the source of truth is always the database.
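The "score is a query, not a counter" idea can be sketched as a pure function over rows from `did_health_events`. The event shape, event kinds, and weights below are all illustrative assumptions — the point is only that the score is recomputed from the log on demand, so no worker holds it in memory.

```typescript
// Assumed row shape for did_health_events (illustrative, not the real schema).
interface HealthEvent {
  did: string;
  kind: "answered" | "failed" | "flagged_spam";
  at: Date;
}

// Derive a DID's health score from its event history each time it's needed.
// In production this would be a SQL aggregate; weights here are made up.
function healthScore(events: HealthEvent[]): number {
  let score = 100;
  for (const e of events) {
    if (e.kind === "failed") score -= 2;
    if (e.kind === "flagged_spam") score -= 25;
  }
  return Math.max(0, score);
}
```

Because the function is stateless, any worker can compute the same score from the same rows, and a restart loses nothing.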
3. Events Through Redis Pub/Sub
We don't use in-process event emitters. Every event — call completed, SMS delivered, health score changed — goes through Redis pub/sub. SSE endpoints, webhook delivery workers, and stats aggregation jobs are all independent subscribers on the same channels.
This gives us three things. First, any subscriber can crash and restart without affecting others. Second, we can add new subscribers (a new analytics pipeline, a new alerting system) without modifying event producers. Third, events are logged to a Postgres events table, so SSE clients can replay missed events using Last-Event-ID.
The downside is that Redis pub/sub is fire-and-forget. If a subscriber is down when an event fires, it misses it. The event log table is our answer to that — subscribers that need guaranteed delivery read from the table on reconnect.
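The fan-out-plus-event-log pattern can be sketched in memory. Here the array stands in for the Postgres events table and the subscriber map stands in for Redis pub/sub channels; all names are illustrative. The key piece is `replaySince()`, which is what lets an SSE client recover after a disconnect using `Last-Event-ID`.

```typescript
interface LoggedEvent {
  id: number;
  channel: string;
  payload: string;
}

// In-memory sketch: Redis pub/sub does the fan-out, Postgres keeps the durable log.
class EventBus {
  private log: LoggedEvent[] = [];            // stands in for the Postgres events table
  private nextId = 1;
  private subscribers = new Map<string, Array<(e: LoggedEvent) => void>>();

  publish(channel: string, payload: string): LoggedEvent {
    const event = { id: this.nextId++, channel, payload };
    this.log.push(event);                     // write to the durable log first
    for (const fn of this.subscribers.get(channel) ?? []) fn(event); // then fan out
    return event;
  }

  subscribe(channel: string, fn: (e: LoggedEvent) => void): void {
    const list = this.subscribers.get(channel) ?? [];
    list.push(fn);
    this.subscribers.set(channel, list);
  }

  // On reconnect, a subscriber replays everything after its Last-Event-ID.
  replaySince(lastEventId: number, channel: string): LoggedEvent[] {
    return this.log.filter((e) => e.id > lastEventId && e.channel === channel);
  }
}
```

A subscriber that was down during `publish()` misses the live fan-out but finds the event in the log, which is exactly the guaranteed-delivery path the article describes.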
4. Guardrails Are Infrastructure
DNC suppression, TCPA time window enforcement, rate limiting, and content scanning are not application-level checks that a developer remembers to call. They're middleware and guardrail layers that every outbound action passes through before reaching a provider.
An API caller cannot bypass the DNC list. They cannot send SMS outside TCPA-compliant hours. They cannot exceed their rate limit. These aren't features — they're constraints enforced at the infrastructure level. This matters because telecom compliance violations are expensive, and "the developer forgot to check" is not an acceptable failure mode.
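The enforcement model can be sketched as a pipeline every outbound action must pass through before any provider adapter is invoked. The guardrail names, the 8am-9pm window, and the sample DNC entry are illustrative assumptions; the structural point is that the checks live in the send path itself, not in caller code.

```typescript
interface OutboundSms {
  to: string;
  body: string;
  tenantId: string;
}

// A guardrail throws to block the action; returning means "pass".
type Guardrail = (msg: OutboundSms) => void;

const dncList = new Set(["+15550001111"]); // illustrative suppression entry

const checkDnc: Guardrail = (msg) => {
  if (dncList.has(msg.to)) throw new Error(`DNC: ${msg.to} is suppressed`);
};

// TCPA-safe contact hours are commonly treated as 8am-9pm recipient-local time.
const checkTcpaWindow = (localHour: number): Guardrail => () => {
  if (localHour < 8 || localHour >= 21) throw new Error("outside TCPA window");
};

// The only send path: every guardrail runs before the provider is reached.
function sendSms(msg: OutboundSms, guardrails: Guardrail[]): string {
  for (const g of guardrails) g(msg); // any throw aborts the send
  // ...only now would the message reach a provider adapter
  return "queued";
}
```

Because `sendSms` is the sole entry point, "the developer forgot to check" is structurally impossible: there is no code path to the provider that skips the pipeline.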
5. SIP Channel Semaphores
A shared SIP trunk has a finite number of concurrent channels. Without coordination, a large campaign could consume every channel and starve inbound IVR calls and one-off API calls.
We divide the trunk into pools using Redis semaphores: 60 channels for campaigns, 15 for IVR, 8 for direct API calls. Each pool has its own semaphore with backpressure — when a pool is full, new requests wait or get rejected with a clear error, rather than silently failing at the SIP level. Campaign workers respect their pool boundary, which means your IVR stays responsive even during a 10,000-call outbound campaign.
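The pooling logic reduces to a counting semaphore per pool. This local sketch shows the acquire/release shape with the pool sizes from the article; in production the counters would live in Redis so every worker shares one view, and the class and method names here are illustrative.

```typescript
// Counting semaphore guarding one pool's share of the SIP trunk.
class ChannelPool {
  private inUse = 0;
  constructor(readonly name: string, readonly capacity: number) {}

  // Returns false when the pool is full — the caller waits or rejects,
  // rather than letting the request fail silently at the SIP level.
  acquire(): boolean {
    if (this.inUse >= this.capacity) return false;
    this.inUse++;
    return true;
  }

  release(): void {
    if (this.inUse > 0) this.inUse--;
  }
}

// Pool sizes from the article: campaigns can never starve IVR or API calls.
const pools = {
  campaign: new ChannelPool("campaign", 60),
  ivr: new ChannelPool("ivr", 15),
  api: new ChannelPool("api", 8),
};
```

A campaign worker that fails `pools.campaign.acquire()` backs off without touching the IVR or API pools, which is what keeps inbound calls responsive during a large outbound run.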
What This Means for You
If you're building AI agents that need to make calls, send texts, or handle inbound communication, you don't need to become a telecom integration shop. One API key. One auth model. Usage-based billing through Stripe metering. No Twilio credentials in your environment, no webhook URL management across three vendors, no compliance logic scattered through your codebase.
The telecom complexity still exists — we just handle it on our side, behind interfaces that we can evolve without breaking your integration.
For the full system diagram and technical details, see our architecture documentation.
Wrap-up
Telecom infrastructure shouldn't slow you down. Trunx fits into your workflow — whether you're building voice AI agents, managing outbound campaigns, or scaling SMS at 2am.
If that sounds like the kind of tooling you want to use — try Trunx or join us on Discord.