Campaigns & Bulk Calling/SMS
How the campaign dialer works internally — job queues, rate limiting, DID selection, guardrails, health feedback, SMS drip pacing, and where things break at scale.
This guide explains the internals of the campaign engine. It's meant for engineers debugging campaign behavior, tuning throughput, or understanding why a campaign paused itself.
Architecture Overview
A campaign is a state machine backed by three BullMQ job queues and a set of guardrails. No call/send logic lives in the API route — the route just writes rows and enqueues the first job. The dialer branches on campaign.channel to handle voice and SMS differently.
POST /api/campaigns/:id/start
└─ sets status = "dialing"
└─ enqueues first job → campaign-dialer queue
campaign-dialer (BullMQ worker, concurrency: 5)
└─ loads campaign, branches on channel:
   VOICE:
   └─ fetches batch of prospects (size = calls_per_second)
   └─ calls dialProspect() for each (parallel via Promise.allSettled)
   └─ enqueues itself with delay = 1000/cps ms
   └─ when no prospects left + none dialing → completed
   SMS:
   └─ fetches 1 prospect per tick (human-like drip)
   └─ calls sendProspectSms()
   └─ enqueues itself with delay = (15 min × 60 × 1000) / activeDIDcount ms
   └─ when no prospects left + none sending/sent → completed
dialProspect() (voice, per-prospect)
├─ DNC check → skip if suppressed
├─ TCPA check → re-queue if outside 8am-9pm local
├─ Channel acquire → backpressure if pool full
├─ DID selection (selectBestDid) → pick best caller ID by last_call_at
├─ Originate call (AI or plain)
└─ On result → write health event → recompute score
sendProspectSms() (sms, per-prospect)
├─ DNC check → skip if suppressed
├─ SMS window check → re-queue if outside 7am-7pm local
├─ DID selection (selectBestDidForSms) → pick best by last_sms_at
├─ Send via providers.messaging.send()
├─ Mark prospect sending → sent (with providerMessageId)
└─ Delivery webhook updates to delivered/failed
campaign-health (BullMQ worker, concurrency: 10)
└─ writes did_health_events row
└─ recomputes health score
└─ triggers cooldown/burn if score drops
campaign-stats (BullMQ worker, concurrency: 5)
└─ aggregates prospect statuses + call stats
└─ publishes campaign.stats.updated via Redis pub/sub
Job Queues
Three BullMQ queues share a single Redis connection:
| Queue | Purpose | Concurrency |
|---|---|---|
| campaign-dialer | Fetches prospects, calls dialProspect() or sendProspectSms(), self-enqueues the next batch | 5 |
| campaign-health | Writes health events, recomputes DID scores, triggers state changes | 10 |
| campaign-stats | Aggregates stats and publishes updates via Redis pub/sub (surfaced to clients over SSE) | 5 |
All three are stateless — they read from Postgres, write to Postgres, and publish events to Redis. If a worker crashes mid-job, BullMQ retries it. No in-memory state is lost.
The dialer worker self-enqueues with a delay calculated from calls_per_second. At 10 calls/second, the delay between batches is 100ms. At 1 call/second (default), the delay is 1000ms.
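The pacing rule above can be sketched as a small pure function. This is a minimal illustration, not the engine's actual code: `voicePacing` and the `VoicePacing` shape are hypothetical names, and flooring fractional rates to a batch of at least one prospect is an assumption.

```typescript
// Hypothetical helper: derives the batch size and re-enqueue delay
// from calls_per_second, following the rule described in the text.
interface VoicePacing {
  batchSize: number; // prospects fetched per batch
  delayMs: number; // delay before the next self-enqueued dialer job
}

function voicePacing(callsPerSecond: number = 1): VoicePacing {
  if (!Number.isFinite(callsPerSecond) || callsPerSecond <= 0) {
    throw new RangeError("calls_per_second must be a positive number");
  }
  return {
    // Assumption: fractional rates still fetch at least one prospect.
    batchSize: Math.max(1, Math.floor(callsPerSecond)),
    delayMs: Math.round(1000 / callsPerSecond),
  };
}
```

Calling `voicePacing(10)` gives a batch of 10 with a 100 ms delay, and `voicePacing()` falls back to the default of 1 call/second with a 1000 ms delay.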
Rate Limiting
Rate limiting works differently for each channel.
Voice: Batch pacing (calls_per_second)
The mode.calls_per_second campaign config controls how many prospects the dialer fetches per batch. After dialing a batch, the worker self-enqueues with a delay of 1000 / calls_per_second milliseconds.
calls_per_second: 1 → 1 prospect/batch, 1000ms delay (default)
calls_per_second: 5 → 5 prospects/batch, 200ms delay
calls_per_second: 10 → 10 prospects/batch, 100ms delay
SMS: DID-based drip pacing
SMS campaigns send 1 prospect per tick. The delay between ticks is calculated from the active DID count to spread load across the pool:
delay = (MIN_SMS_GAP_MINUTES × 60 × 1000) / activeDIDcount
10 DIDs × 15-min gap = 90 seconds between messages
50 DIDs × 15-min gap = 18 seconds between messages
Each DID is limited to 4 messages per hour (15-minute gap enforced via the last_sms_at column). Across the 12-hour 7am-7pm window, that caps each DID at 48 messages per day (4 per hour × 12 hours).
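As a rough sketch of the drip arithmetic, the two helpers below compute the tick delay and check the per-DID gap. The function names are hypothetical, and treating a DID with no last_sms_at value as immediately eligible is an assumption.

```typescript
const MIN_SMS_GAP_MINUTES = 15; // per-DID gap between messages, from the text
const MIN_SMS_GAP_MS = MIN_SMS_GAP_MINUTES * 60 * 1000;

// Delay between dialer ticks for an SMS campaign, in milliseconds.
function smsTickDelayMs(activeDidCount: number): number {
  if (!Number.isInteger(activeDidCount) || activeDidCount < 1) {
    throw new RangeError("activeDidCount must be a positive integer");
  }
  return MIN_SMS_GAP_MS / activeDidCount;
}

// True when the DID's last send is at least the minimum gap in the past.
// `lastSmsAt` mirrors the last_sms_at column; null means the DID has never sent.
function didIsPastSmsGap(lastSmsAt: Date | null, now: Date): boolean {
  if (lastSmsAt === null) return true; // assumption: unused DIDs are eligible
  return now.getTime() - lastSmsAt.getTime() >= MIN_SMS_GAP_MS;
}
```

With 10 active DIDs the tick delay is 90,000 ms, and with 50 it is 18,000 ms, matching the figures above.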
SIP channel semaphore
A Redis-based semaphore limits concurrent SIP channels per pool:
| Pool | Max Concurrent |
|---|---|
| campaign | 60 |
| ivr | 15 |
| api | 8 |
When acquireChannel("campaign") fails (60 channels in use), the prospect is re-queued with a 5-second delay. The channel is released in a finally block after each call attempt.
acquire → Redis INCR channels:campaign
if > 60 → DECR, return false (backpressure)
if == 1 → set TTL 3600s (safety net)
release → Redis DECR channels:campaign
if < 0 → reset to 0
If calls are not releasing channels (e.g., provider callback never fires), the semaphore leaks. The 1-hour TTL is a safety net, but stale channels reduce throughput. Check channels:campaign in Redis if campaigns are running slowly.
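The acquire/release steps above can be sketched against a small counter interface. This is an illustration under assumptions: `CounterStore` and `MemoryCounterStore` are hypothetical stand-ins for Redis INCR/DECR/EXPIRE/SET so the example runs without a server, and the extra `store` parameter and the `withChannel` helper are not part of the real acquireChannel() signature.

```typescript
type Pool = "campaign" | "ivr" | "api";

const POOL_LIMITS: Record<Pool, number> = { campaign: 60, ivr: 15, api: 8 };
const SAFETY_TTL_SECONDS = 3600;

// Hypothetical minimal interface over the Redis commands the text mentions.
interface CounterStore {
  incr(key: string): Promise<number>;
  decr(key: string): Promise<number>;
  expire(key: string, seconds: number): Promise<void>;
  set(key: string, value: number): Promise<void>;
}

// In-memory stand-in; the TTL is a no-op here.
class MemoryCounterStore implements CounterStore {
  private counts = new Map<string, number>();
  private bump(key: string, by: number): number {
    const next = (this.counts.get(key) ?? 0) + by;
    this.counts.set(key, next);
    return next;
  }
  async incr(key: string): Promise<number> { return this.bump(key, 1); }
  async decr(key: string): Promise<number> { return this.bump(key, -1); }
  async expire(_key: string, _seconds: number): Promise<void> {}
  async set(key: string, value: number): Promise<void> { this.counts.set(key, value); }
}

async function acquireChannel(store: CounterStore, pool: Pool): Promise<boolean> {
  const key = `channels:${pool}`;
  const inUse = await store.incr(key);
  if (inUse > POOL_LIMITS[pool]) {
    await store.decr(key); // undo the increment: pool is full (backpressure)
    return false;
  }
  if (inUse === 1) await store.expire(key, SAFETY_TTL_SECONDS); // safety net
  return true;
}

async function releaseChannel(store: CounterStore, pool: Pool): Promise<void> {
  const key = `channels:${pool}`;
  const remaining = await store.decr(key);
  if (remaining < 0) await store.set(key, 0); // never let the counter go negative
}

// Releases in a finally block so a failed call attempt cannot leak a channel.
async function withChannel<T>(
  store: CounterStore,
  pool: Pool,
  placeCall: () => Promise<T>,
): Promise<T | null> {
  if (!(await acquireChannel(store, pool))) return null; // caller re-queues
  try {
    return await placeCall();
  } finally {
    await releaseChannel(store, pool);
  }
}
```

The finally block only protects against errors inside the call attempt. A call whose completion callback never arrives still holds its slot, which is the leak scenario described above.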
DID Selection
Each call/message needs a caller ID. Voice uses selectBestDid(), SMS uses selectBestDidForSms(). Both follow the same pattern:
- Query all DIDs where state = "active" AND healthScore >= threshold (default: 0.7)
- If the callerIds array is set on the campaign, filter to those numbers
- If the prospect's area code matches a DID's area code, prefer it (local presence)
- Otherwise, pick the DID with the highest health score
If no healthy DIDs are available, the campaign auto-pauses and emits campaign.paused.no_dids. This is the most common reason a campaign stops unexpectedly.
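The selection steps listed above might look roughly like this. It is a sketch, not the real selectBestDid(): the `Did` shape, `pickDid`, and `areaCodeOf` are hypothetical, area-code matching assumes NANP numbers in E.164 form, and breaking ties among local DIDs by health score is an assumption. Recency ordering by last_call_at or last_sms_at is not modeled here.

```typescript
interface Did {
  number: string; // E.164, e.g. "+14155550100"
  state: "active" | "warming" | "cooling" | "burned";
  healthScore: number; // 0.0 to 1.0
}

interface PickOptions {
  callerIds?: string[]; // campaign-level allow-list of numbers
  threshold?: number; // minimum health score, default 0.7
}

// Assumes "+1" followed by 10 digits; returns null for anything else.
function areaCodeOf(phone: string): string | null {
  const match = /^\+1(\d{3})\d{7}$/.exec(phone);
  return match?.[1] ?? null;
}

function pickDid(dids: Did[], prospectPhone: string, opts: PickOptions = {}): Did | null {
  const threshold = opts.threshold ?? 0.7;
  let pool = dids.filter((d) => d.state === "active" && d.healthScore >= threshold);

  if (opts.callerIds && opts.callerIds.length > 0) {
    const allowed = new Set(opts.callerIds);
    pool = pool.filter((d) => allowed.has(d.number));
  }
  if (pool.length === 0) return null; // the caller would auto-pause the campaign

  // Prefer DIDs sharing the prospect's area code (local presence).
  const prospectArea = areaCodeOf(prospectPhone);
  const local =
    prospectArea === null ? [] : pool.filter((d) => areaCodeOf(d.number) === prospectArea);
  const candidates = local.length > 0 ? local : pool;

  // Highest health score wins among the remaining candidates.
  const ranked = [...candidates].sort((a, b) => b.healthScore - a.healthScore);
  return ranked[0] ?? null;
}
```

A `null` result corresponds to the no-healthy-DIDs case, where the campaign pauses and emits campaign.paused.no_dids.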
Debugging DID exhaustion
# Check how many DIDs are active and healthy
curl "https://api.trunx.io/dids/health" -H "Authorization: Bearer $KEY"
# Check channel pool usage
redis-cli GET channels:campaign
Pre-flight Guardrails
Every prospect passes through three checks before dialing:
DNC / Suppression
Queries the suppression table for a matching phone + customer. If found, the prospect is set to cancelled and a campaign.prospect.dnc event is published. No call is attempted.
Time Windows
Voice: Calls are only allowed 8 AM – 9 PM in the recipient's local timezone (TCPA).
SMS: Messages are only allowed 7 AM – 7 PM (stricter window for text messages).
The timezone comes from campaign.schedule.timezone (default: America/Los_Angeles). If a prospect fails the window check, they're re-queued with a 30-minute delay (nextRetryAt).
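A window check along these lines can be written with the built-in Intl API. This is a sketch: the helper names are hypothetical, and treating the end hour as exclusive (so 9:00 PM itself is outside the voice window) is an assumption the text does not settle.

```typescript
type Channel = "voice" | "sms";

// Hours are in local time; endHour is treated as exclusive (assumption).
const SEND_WINDOWS = {
  voice: { startHour: 8, endHour: 21 },
  sms: { startHour: 7, endHour: 19 },
} as const;

const WINDOW_RETRY_DELAY_MS = 30 * 60 * 1000; // 30-minute re-queue delay

// Local hour (0-23) of `at` in an IANA timezone.
function localHour(at: Date, timeZone: string): number {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone,
    hour: "numeric",
    hour12: false,
  }).formatToParts(at);
  const hourPart = parts.find((p) => p.type === "hour");
  if (!hourPart) throw new Error("hour missing from formatted parts");
  return Number(hourPart.value) % 24; // some engines format midnight as "24"
}

function isWithinWindow(
  channel: Channel,
  at: Date,
  timeZone: string = "America/Los_Angeles",
): boolean {
  const { startHour, endHour } = SEND_WINDOWS[channel];
  const hour = localHour(at, timeZone);
  return hour >= startHour && hour < endHour;
}

// Value a failed window check would write to nextRetryAt.
function nextRetryAt(now: Date): Date {
  return new Date(now.getTime() + WINDOW_RETRY_DELAY_MS);
}
```

For example, 7:30 PM Pacific passes the voice check but fails the SMS check, so an SMS prospect would be re-queued for 30 minutes later.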
Channel Budget
If the SIP channel semaphore is full, the prospect gets a 5-second re-queue delay. This is backpressure, not a failure — the prospect stays queued and will be dialed once a channel frees up.
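Putting the three checks in order, a pre-flight decision could be modeled as below. The types and `runPreflight` are hypothetical, and the checks are injected as callbacks so that the channel is only acquired once the DNC and window checks have passed, which is an assumed (though natural) ordering based on the overview diagram.

```typescript
type PreflightDecision =
  | { action: "cancel"; reason: "dnc" }
  | { action: "requeue"; reason: "outside_window" | "no_channel"; delayMs: number }
  | { action: "dial" };

// Hypothetical hooks; each would wrap the real check for one prospect.
interface PreflightChecks {
  isSuppressed: () => boolean;
  isWithinWindow: () => boolean;
  tryAcquireChannel: () => boolean;
}

const WINDOW_REQUEUE_MS = 30 * 60 * 1000; // time-window miss
const CHANNEL_REQUEUE_MS = 5 * 1000; // channel backpressure

function runPreflight(checks: PreflightChecks): PreflightDecision {
  if (checks.isSuppressed()) {
    return { action: "cancel", reason: "dnc" };
  }
  if (!checks.isWithinWindow()) {
    return { action: "requeue", reason: "outside_window", delayMs: WINDOW_REQUEUE_MS };
  }
  // Acquire last, so a skipped or deferred prospect never holds a channel.
  if (!checks.tryAcquireChannel()) {
    return { action: "requeue", reason: "no_channel", delayMs: CHANNEL_REQUEUE_MS };
  }
  return { action: "dial" };
}
```

Only the `cancel` branch is terminal for the prospect. Both `requeue` branches leave it queued with a different delay.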
Call Outcomes
After AMD (answering machine detection) processes the call, the result flows back through handleCallResult():
| Outcome | Meaning | Next action |
|---|---|---|
| human_connected | Human answered, AI agent engaged | Prospect marked completed |
| human_hangup | Human answered but hung up quickly | Prospect marked completed |
| voicemail | AMD detected voicemail | Voicemail drop if configured |
| no_answer | Phone rang, nobody picked up | Re-queued for retry |
| busy | Busy signal | Prospect marked completed |
| failed | Call failed (network, invalid number) | Prospect marked failed |
Each outcome writes a did_health_events row and triggers a health score recomputation for the DID that placed the call.
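The outcome table can be encoded as an exhaustive lookup, so the compiler flags any outcome that has no next action. The type and constant names below are hypothetical; only the outcome strings and their actions come from the table.

```typescript
type CallOutcome =
  | "human_connected"
  | "human_hangup"
  | "voicemail"
  | "no_answer"
  | "busy"
  | "failed";

type NextAction =
  | "mark_completed"
  | "voicemail_drop_if_configured"
  | "requeue_for_retry"
  | "mark_failed";

// Record<CallOutcome, ...> forces every outcome to have an entry.
const NEXT_ACTION_BY_OUTCOME: Record<CallOutcome, NextAction> = {
  human_connected: "mark_completed",
  human_hangup: "mark_completed",
  voicemail: "voicemail_drop_if_configured",
  no_answer: "requeue_for_retry",
  busy: "mark_completed",
  failed: "mark_failed",
};

function nextActionFor(outcome: CallOutcome): NextAction {
  return NEXT_ACTION_BY_OUTCOME[outcome];
}
```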
Health Feedback Loop
The campaign engine and DID health system form a closed loop:
Campaign dials prospect
→ call completes with outcome
→ did_health_events row written
→ health score recomputed (sliding window of last 50 calls)
→ health action determined:
score >= 0.8 → ok (no action)
0.7 – 0.8 → warning (event published)
0.5 – 0.7 → cooldown (DID pulled from rotation for 2 hours)
< 0.5 → burned (DID permanently removed)
→ next call uses updated scores for DID selection
The did-lifecycle worker runs every 60 seconds to transition cooling → active when cooldown periods expire, and warming → active when warming periods complete.
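The score bands above map onto a simple threshold ladder. In this sketch the function names are hypothetical, and the handling of exact boundary values (0.7 counts as warning, 0.5 as cooldown) is an assumption chosen to agree with the `healthScore >= 0.7` selection rule.

```typescript
type HealthAction = "ok" | "warning" | "cooldown" | "burned";

const COOLDOWN_MS = 2 * 60 * 60 * 1000; // 2-hour cooldown from the text

function healthActionFor(score: number): HealthAction {
  if (score >= 0.8) return "ok";
  if (score >= 0.7) return "warning"; // still selectable, event published
  if (score >= 0.5) return "cooldown"; // pulled from rotation for 2 hours
  return "burned"; // permanently removed
}

// What a lifecycle tick would check before moving cooling → active.
function cooldownExpired(cooledAt: Date, now: Date): boolean {
  return now.getTime() - cooledAt.getTime() >= COOLDOWN_MS;
}
```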
Health score weights
| Component | Weight | What it measures |
|---|---|---|
| Answer rate | 30% | Completed calls / total calls |
| Avg call duration | 25% | Normalized to 60s = 1.0 |
| Human engagement | 20% | Humans answered / completed calls |
| No-answer trend | 15% | Inverse of no-answer rate |
| Spam clean | 10% | Inverse of spam flag rate |
Scores are computed over the last 50 calls per DID. New DIDs with no history start at 1.0.
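As a worked example of the weighting, the sketch below combines pre-aggregated counts for a DID's last 50 calls. The `DidWindowStats` shape and function name are hypothetical, and the exact definitions of each rate (for instance, capping the duration component at 1.0) are assumptions read off the table.

```typescript
// Hypothetical aggregate over a DID's sliding window (last 50 calls).
interface DidWindowStats {
  totalCalls: number;
  completedCalls: number;
  humanAnswered: number;
  noAnswerCalls: number;
  spamFlags: number;
  avgDurationSec: number;
}

const HEALTH_WEIGHTS = {
  answerRate: 0.3,
  avgDuration: 0.25,
  humanEngagement: 0.2,
  noAnswerTrend: 0.15,
  spamClean: 0.1,
} as const;

const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));
const ratio = (num: number, den: number): number => (den > 0 ? num / den : 0);

function computeHealthScore(s: DidWindowStats): number {
  if (s.totalCalls === 0) return 1.0; // new DIDs with no history start at 1.0

  const answerRate = clamp01(ratio(s.completedCalls, s.totalCalls));
  const avgDuration = clamp01(s.avgDurationSec / 60); // 60s normalizes to 1.0
  const humanEngagement = clamp01(ratio(s.humanAnswered, s.completedCalls));
  const noAnswerTrend = 1 - clamp01(ratio(s.noAnswerCalls, s.totalCalls));
  const spamClean = 1 - clamp01(ratio(s.spamFlags, s.totalCalls));

  return (
    HEALTH_WEIGHTS.answerRate * answerRate +
    HEALTH_WEIGHTS.avgDuration * avgDuration +
    HEALTH_WEIGHTS.humanEngagement * humanEngagement +
    HEALTH_WEIGHTS.noAnswerTrend * noAnswerTrend +
    HEALTH_WEIGHTS.spamClean * spamClean
  );
}
```

Under these assumptions, a DID with 50 calls, 25 completed, 10 human answers, 20 no-answers, 5 spam flags, and a 30-second average duration scores 0.15 + 0.125 + 0.08 + 0.09 + 0.09 = 0.535, which falls in the cooldown band.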
Campaign State Machine
created → dialing → completed
            ↕
          paused
            ↓
        cancelled
| Transition | Trigger |
|---|---|
| created → dialing | POST /campaigns/:id/start |
| dialing → paused | Manual pause, or no healthy DIDs available |
| paused → dialing | POST /campaigns/:id/resume |
| dialing → completed | All prospects dialed, none in dialing status |
| any → cancelled | POST /campaigns/:id/cancel |
A campaign that auto-pauses due to DID exhaustion will not resume on its own. You need to either add healthy DIDs to the pool or manually resume after DIDs recover from cooldown.
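The transition table can be expressed as a lookup that rejects anything not listed. The names here are hypothetical, and one assumption is made beyond the table: completed and cancelled are treated as terminal, so "any → cancelled" is read as "any non-terminal state".

```typescript
type CampaignStatus = "created" | "dialing" | "paused" | "completed" | "cancelled";

// Allowed next states for each status, following the transition table.
const CAMPAIGN_TRANSITIONS: Record<CampaignStatus, readonly CampaignStatus[]> = {
  created: ["dialing", "cancelled"],
  dialing: ["paused", "completed", "cancelled"],
  paused: ["dialing", "cancelled"],
  completed: [], // assumption: terminal
  cancelled: [], // assumption: terminal
};

function canTransition(from: CampaignStatus, to: CampaignStatus): boolean {
  return CAMPAIGN_TRANSITIONS[from].includes(to);
}
```

Note that there is no automatic paused → dialing edge: resuming always goes through the resume endpoint, which is why an auto-paused campaign stays paused.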
Prospect Statuses
Voice statuses
| Status | Meaning |
|---|---|
| queued | Waiting to be dialed |
| dialing | Call in progress |
| human_connected | Human answered |
| human_hangup | Human answered but disconnected quickly |
| voicemail | Voicemail detected |
| no_answer | Not answered (eligible for retry) |
| busy | Busy signal |
| failed | Call failed |
| cancelled | Skipped (DNC match or campaign cancelled) |
SMS statuses
| Status | Meaning |
|---|---|
| queued | Waiting to be sent |
| sending | Message handed off to carrier |
| sent | Carrier accepted, awaiting delivery confirmation |
| delivered | Carrier confirmed delivery to recipient |
| failed | Carrier rejected or delivery failed |
| cancelled | Skipped (DNC match or campaign cancelled) |
For SMS campaigns, sent counts as in-progress (the campaign won't complete until all prospects reach a terminal status: delivered, failed, or cancelled). Delivery webhooks from the carrier automatically update prospect status.
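The completion rule for SMS campaigns reduces to "no prospect is still queued, sending, or sent". A minimal sketch, assuming a hypothetical per-status count object such as the stats aggregation might produce:

```typescript
type SmsProspectStatus =
  | "queued"
  | "sending"
  | "sent"
  | "delivered"
  | "failed"
  | "cancelled";

// Statuses that keep an SMS campaign open; everything else is terminal.
const SMS_IN_PROGRESS: readonly SmsProspectStatus[] = ["queued", "sending", "sent"];

// `counts` is a hypothetical map of status to number of prospects.
function isSmsCampaignComplete(
  counts: Partial<Record<SmsProspectStatus, number>>,
): boolean {
  return SMS_IN_PROGRESS.every((s) => (counts[s] ?? 0) === 0);
}
```

A campaign with three prospects still in `sent` is not complete, even if every message has been handed to the carrier. It completes only once the delivery webhooks move them to `delivered` or `failed`.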
What Breaks at Scale
SIP channel saturation
At 60 concurrent campaign channels, new calls back off with a 5-second delay. If calls_per_second outpaces the rate at which in-flight calls finish and release their channels, prospects queue up behind the semaphore. The campaign still runs, just slower.
Fix: Monitor channels:campaign in Redis. If it's consistently at 60, either lower calls_per_second or increase the pool limit in channel-budget.ts.
DID pool exhaustion
If too many DIDs drop below the health threshold (0.7), the campaign runs out of caller IDs and auto-pauses. This typically happens when answer rates are low and a high-volume campaign keeps burning through the same DIDs.
Fix: Add more DIDs to the pool, and size the pool to the campaign's volume. Monitor health scores via GET /dids/health and watch for the campaign.paused.no_dids event.
BullMQ backlog
If Redis is slow or the worker process restarts, the dialer queue can back up. Since the dialer self-enqueues, a long backlog means delayed dialing. BullMQ jobs have removeOnComplete: true, so completed jobs don't accumulate.
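Fix: Check queue depth in Redis. BullMQ keeps waiting jobs in the list bull:campaign-dialer:wait and delayed jobs (including the dialer's self-enqueued next batch) in the sorted set bull:campaign-dialer:delayed. If the worker is down, restart the service. If Redis is slow, check memory usage.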
TCPA re-queue storms
If a campaign targets a timezone where the window just closed, all remaining prospects get re-queued with 30-minute delays simultaneously. When the window reopens, they all become eligible at once, creating a burst.
Fix: This is generally fine — the semaphore and rate limiter absorb the burst. But if it coincides with DID exhaustion, the campaign may pause.
Health score oscillation
A DID that gets cooled down for 2 hours, comes back, and immediately gets hammered with calls can oscillate between active and cooling. The health score resets to whatever the last 50 calls show — if those calls were all before the cooldown, the score looks healthy, but new calls quickly degrade it again.
Fix: After cooldown, gradually ramp the DID back in rather than putting it back at full rotation. Consider using callerIds on the campaign to exclude recently-cooled DIDs.
Monitoring
Key Redis keys
| Key | Type | What it tracks |
|---|---|---|
| channels:campaign | counter | Active SIP channels for campaigns |
| channels:ivr | counter | Active SIP channels for IVR |
| channels:api | counter | Active SIP channels for API calls |
| bull:campaign-dialer:* | BullMQ | Dialer job queue state |
| bull:campaign-health:* | BullMQ | Health job queue state |
| bull:campaign-stats:* | BullMQ | Stats job queue state |
Key SSE events
Subscribe to campaign events for real-time monitoring:
curl -N "https://api.trunx.io/api/events?channels=campaign:{id}" \
  -H "Authorization: Bearer $KEY"
| Event | When |
|---|---|
| campaign.call.initiated | Call placed (voice) |
| campaign.call.human_connected | Human answered (voice) |
| campaign.call.voicemail | Voicemail detected (voice) |
| campaign.call.no_answer | No answer (voice) |
| campaign.call.failed | Call failed (voice) |
| campaign.sms.sent | SMS sent to carrier (SMS) |
| campaign.sms.delivered | SMS delivered to recipient (SMS) |
| campaign.sms.failed | SMS send failed (SMS) |
| campaign.sms.delivery_failed | SMS delivery failed (SMS) |
| campaign.prospect.dnc | Prospect skipped (DNC) |
| campaign.paused.no_dids | Campaign auto-paused, no healthy DIDs (voice only) |
| campaign.stats.updated | Stats snapshot published |
| did.health.cooldown | DID pulled from rotation |
| did.health.burned | DID permanently removed |
Quick health check
# Campaign progress
curl "https://api.trunx.io/api/campaigns/$ID/stats" -H "Authorization: Bearer $KEY"
# DID pool health
curl "https://api.trunx.io/dids/health/report" -H "Authorization: Bearer $KEY"
# Channel usage
redis-cli GET channels:campaign