Campaigns & Bulk Calling/SMS

How the campaign dialer works internally — job queues, rate limiting, DID selection, guardrails, health feedback, SMS drip pacing, and where things break at scale.

This guide explains the internals of the campaign engine. It's meant for engineers debugging campaign behavior, tuning throughput, or understanding why a campaign paused itself.

Architecture Overview

A campaign is a state machine backed by three BullMQ job queues and a set of guardrails. No call/send logic lives in the API route — the route just writes rows and enqueues the first job. The dialer branches on campaign.channel to handle voice and SMS differently.

POST /api/campaigns/:id/start
  └─ sets status = "dialing"
  └─ enqueues first job → campaign-dialer queue

campaign-dialer (BullMQ worker, concurrency: 5)
  └─ loads campaign, branches on channel:

  VOICE:
    └─ fetches batch of prospects (size = calls_per_second)
    └─ calls dialProspect() for each (parallel via Promise.allSettled)
    └─ enqueues itself with delay = 1000/cps ms
    └─ when no prospects left + none dialing → completed

  SMS:
    └─ fetches 1 prospect per tick (human-like drip)
    └─ calls sendProspectSms()
    └─ enqueues itself with delay = (15 × 60 × 1000 ms) / activeDIDcount
    └─ when no prospects left + none sending/sent → completed

dialProspect() (voice, per-prospect)
  ├─ DNC check → skip if suppressed
  ├─ TCPA check → re-queue if outside 8am-9pm local
  ├─ Channel acquire → backpressure if pool full
  ├─ DID selection (selectBestDid) → pick best caller ID by last_call_at
  ├─ Originate call (AI or plain)
  └─ On result → write health event → recompute score

sendProspectSms() (sms, per-prospect)
  ├─ DNC check → skip if suppressed
  ├─ SMS window check → re-queue if outside 7am-7pm local
  ├─ DID selection (selectBestDidForSms) → pick best by last_sms_at
  ├─ Send via providers.messaging.send()
  ├─ Mark prospect sending → sent (with providerMessageId)
  └─ Delivery webhook updates to delivered/failed

campaign-health (BullMQ worker, concurrency: 10)
  └─ writes did_health_events row
  └─ recomputes health score
  └─ triggers cooldown/burn if score drops

campaign-stats (BullMQ worker, concurrency: 5)
  └─ aggregates prospect statuses + call stats
  └─ publishes campaign.stats.updated via Redis pub/sub

Job Queues

Three BullMQ queues share a single Redis connection:

Queue	Purpose	Concurrency
`campaign-dialer`	Fetches prospects, calls `dialProspect()` (voice) or `sendProspectSms()` (SMS), self-enqueues the next tick	5
`campaign-health`	Writes health events, recomputes DID scores, triggers state changes	10
`campaign-stats`	Aggregates stats and publishes `campaign.stats.updated` via Redis pub/sub (surfaced to clients as an SSE event)	5

All three are stateless — they read from Postgres, write to Postgres, and publish events to Redis. If a worker crashes mid-job, BullMQ retries it. No in-memory state is lost.

The dialer worker self-enqueues with a delay calculated from calls_per_second. At 10 calls/second, the delay between batches is 100ms. At 1 call/second (default), the delay is 1000ms.

Rate Limiting

Rate limiting works differently for each channel.

Voice: Batch pacing (calls_per_second)

The mode.calls_per_second campaign config controls how many prospects the dialer fetches per batch. After dialing a batch, the worker self-enqueues with a delay of 1000 / calls_per_second milliseconds.

calls_per_second: 1  → 1 prospect/batch, 1000ms delay (default)
calls_per_second: 5  → 5 prospects/batch, 200ms delay
calls_per_second: 10 → 10 prospects/batch, 100ms delay
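
The pacing rule above can be sketched as a small pure function. This is a minimal illustration rather than the engine's actual code: `voiceBatchPlan` and `VoiceBatchPlan` are hypothetical names, and the fallback to 1 for invalid values is an assumption based on the stated default.

```typescript
// Hypothetical helper: derives the voice batch plan from calls_per_second
// as this guide describes it (batch size = cps, delay = 1000 / cps ms).
interface VoiceBatchPlan {
  batchSize: number; // prospects fetched per dialer tick
  delayMs: number;   // delay before the dialer re-enqueues itself
}

function voiceBatchPlan(callsPerSecond: number = 1): VoiceBatchPlan {
  // Assumption: invalid or non-positive values fall back to the default of 1.
  const cps =
    isFinite(callsPerSecond) && callsPerSecond >= 1
      ? Math.floor(callsPerSecond)
      : 1;
  return { batchSize: cps, delayMs: Math.round(1000 / cps) };
}
```

For example, `voiceBatchPlan(5)` yields a batch of 5 prospects and a 200ms delay, matching the table above.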

SMS: DID-based drip pacing

SMS campaigns send 1 prospect per tick. The delay between ticks is calculated from the active DID count to spread load across the pool:

delay = (MIN_SMS_GAP_MINUTES × 60 × 1000) / activeDIDcount

10 DIDs, 15-min gap → 900 s / 10 = 90 seconds between messages
50 DIDs, 15-min gap → 900 s / 50 = 18 seconds between messages

Each DID is limited to 4 messages per hour (15-minute gap enforced via the last_sms_at column). Within a single 7am-7pm window this caps a DID at 48 messages per day (4 per hour × 12 hours).
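
The drip formula can be written out as follows. This is a sketch under stated assumptions: `smsTickDelayMs` and `maxMessagesPerDidPerDay` are hypothetical helper names, and returning `null` when no DIDs are active is an illustrative choice, not documented behavior.

```typescript
const MIN_SMS_GAP_MINUTES = 15; // per-DID gap stated in this guide

// Delay between SMS ticks in milliseconds, or null when there are no
// active DIDs to send from (assumption: the caller decides what to do).
function smsTickDelayMs(
  activeDidCount: number,
  minGapMinutes: number = MIN_SMS_GAP_MINUTES
): number | null {
  if (!isFinite(activeDidCount) || activeDidCount < 1) return null;
  return Math.round((minGapMinutes * 60 * 1000) / activeDidCount);
}

// Upper bound on messages one DID can send inside a daily window,
// given only the per-DID gap.
function maxMessagesPerDidPerDay(
  windowHours: number,
  minGapMinutes: number = MIN_SMS_GAP_MINUTES
): number {
  return Math.floor((windowHours * 60) / minGapMinutes);
}
```

With 10 active DIDs the tick delay is 90000 ms; with 50 it is 18000 ms. For a 12-hour window and a 15-minute gap, the per-DID bound evaluates to 48.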

SIP channel semaphore

A Redis-based semaphore limits concurrent SIP channels per pool:

Pool	Max Concurrent
`campaign`	60
`ivr`	15
`api`	8

When acquireChannel("campaign") fails (60 channels in use), the prospect is re-queued with a 5-second delay. The channel is released in a finally block after each call attempt.

acquire → Redis INCR channels:campaign
  if > 60 → DECR, return false (backpressure)
  if == 1 → set TTL 3600s (safety net)

release → Redis DECR channels:campaign
  if < 0 → reset to 0

If calls are not releasing channels (e.g., provider callback never fires), the semaphore leaks. The 1-hour TTL is a safety net, but stale channels reduce throughput. Check channels:campaign in Redis if campaigns are running slowly.
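
The acquire/release logic above can be modelled in memory to make the backpressure and floor-at-zero behavior concrete. This is a sketch, not the contents of channel-budget.ts: `ChannelCounter` is a hypothetical stand-in for Redis, and a Redis-backed version would issue INCR, DECR, and EXPIRE instead.

```typescript
type Pool = "campaign" | "ivr" | "api";

const POOL_LIMITS: Record<Pool, number> = { campaign: 60, ivr: 15, api: 8 };

// In-memory stand-in for the Redis counters (assumption, for illustration).
class ChannelCounter {
  private counts: { [key: string]: number } = {};
  incr(key: string): number {
    this.counts[key] = (this.counts[key] || 0) + 1;
    return this.counts[key];
  }
  decr(key: string): number {
    this.counts[key] = (this.counts[key] || 0) - 1;
    return this.counts[key];
  }
  set(key: string, value: number): void {
    this.counts[key] = value;
  }
  get(key: string): number {
    return this.counts[key] || 0;
  }
}

function acquireChannel(store: ChannelCounter, pool: Pool): boolean {
  const key = "channels:" + pool;
  const inUse = store.incr(key);
  if (inUse > POOL_LIMITS[pool]) {
    store.decr(key); // over the limit: undo the increment, signal backpressure
    return false;
  }
  // A Redis-backed version would set the 3600s TTL here when inUse === 1.
  return true;
}

function releaseChannel(store: ChannelCounter, pool: Pool): void {
  const key = "channels:" + pool;
  if (store.decr(key) < 0) store.set(key, 0); // never leave a negative count
}
```

In calling code, the release belongs in a `finally` block so that a thrown error during origination does not leak a channel.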

DID Selection

Each call/message needs a caller ID. Voice uses selectBestDid(), SMS uses selectBestDidForSms(). Both follow the same pattern:

Query all DIDs where state = "active" AND healthScore >= threshold (default: 0.7)
If callerIds array is set on campaign, filter to those numbers
If prospect's area code matches a DID's area code, prefer it (local presence)
Otherwise, pick the DID with the highest health score

If no healthy DIDs are available, the campaign auto-pauses and emits campaign.paused.no_dids. This is the most common reason a campaign stops unexpectedly.
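
The four selection steps can be sketched as a pure function over an in-memory list. This is an illustration, not the real `selectBestDid()`: `pickDid`, `Did`, and `SelectOptions` are hypothetical names, the area-code extraction assumes US/Canada numbers in E.164 form, an empty `callerIds` array is treated as unset, and any tie-breaking by last_call_at or last_sms_at is left out.

```typescript
interface Did {
  number: string;      // E.164, e.g. "+14155550100" (assumed format)
  state: string;       // "active", "cooling", "burned", ...
  healthScore: number; // 0..1
}

interface SelectOptions {
  threshold?: number;   // default 0.7 per this guide
  callerIds?: string[]; // optional campaign-level allow-list
}

// Assumption: "+1" followed by 10 digits, so the area code is the next 3 digits.
function areaCode(e164: string): string | null {
  return e164.length === 12 && e164.slice(0, 2) === "+1"
    ? e164.slice(2, 5)
    : null;
}

function pickDid(
  dids: Did[],
  prospectNumber: string,
  opts: SelectOptions = {}
): Did | null {
  const threshold = opts.threshold === undefined ? 0.7 : opts.threshold;
  const allow = opts.callerIds;
  const eligible = dids
    .filter((d) => d.state === "active" && d.healthScore >= threshold)
    .filter((d) => !allow || allow.length === 0 || allow.indexOf(d.number) !== -1)
    .sort((a, b) => b.healthScore - a.healthScore); // healthiest first
  if (eligible.length === 0) return null; // caller would auto-pause the campaign
  const prospectArea = areaCode(prospectNumber);
  const local = eligible.filter(
    (d) => prospectArea !== null && areaCode(d.number) === prospectArea
  );
  return local.length > 0 ? local[0] : eligible[0];
}
```

A `null` result corresponds to the auto-pause case described above.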

Debugging DID exhaustion

# Check how many DIDs are active and healthy
curl "https://api.trunx.io/dids/health" -H "Authorization: Bearer $KEY"

# Check channel pool usage
redis-cli GET channels:campaign

Pre-flight Guardrails

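Every prospect passes through pre-flight checks before any call or message goes out. Voice runs all three checks below; SMS runs the first two and does not take a SIP channel: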

DNC / Suppression

Queries the suppression table for a matching phone + customer. If found, the prospect is set to cancelled and a campaign.prospect.dnc event is published. No call is attempted.

Time Windows

Voice: Calls are only allowed 8 AM – 9 PM in the recipient's local timezone (TCPA).

SMS: Messages are only allowed 7 AM – 7 PM (stricter window for text messages).

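The timezone used for both checks comes from campaign.schedule.timezone (default: America/Los_Angeles), so the window only matches the recipient's local time when the campaign is configured for the recipients' region. If a prospect fails the window check, they're re-queued with a 30-minute delay (nextRetryAt).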
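
A window check of this kind can be sketched with the built-in timezone support in `Date.prototype.toLocaleString`. The names `localHour`, `isWithinSendWindow`, and `nextRetryAfterWindowMiss` are hypothetical, and treating the end hour as exclusive (9:00 PM itself is outside the voice window) is an assumption.

```typescript
type Channel = "voice" | "sms";

// Windows from this guide, as [startHour, endHour) in 24-hour local time.
const WINDOWS = {
  voice: { startHour: 8, endHour: 21 },
  sms: { startHour: 7, endHour: 19 },
};

const WINDOW_REQUEUE_DELAY_MS = 30 * 60 * 1000; // 30-minute re-queue

// Hour of day (0-23) for an instant in an IANA timezone.
function localHour(at: Date, timeZone: string): number {
  const text = at.toLocaleString("en-US", {
    timeZone: timeZone,
    hour: "numeric",
    hour12: false,
  });
  return Number(text) % 24; // some engines print midnight as "24"
}

function isWithinSendWindow(
  channel: Channel,
  at: Date,
  timeZone: string = "America/Los_Angeles"
): boolean {
  const w = WINDOWS[channel];
  const hour = localHour(at, timeZone);
  return hour >= w.startHour && hour < w.endHour; // assumption: end is exclusive
}

// Value that would be stored as nextRetryAt when the check fails.
function nextRetryAfterWindowMiss(at: Date): Date {
  return new Date(at.getTime() + WINDOW_REQUEUE_DELAY_MS);
}
```

At 7:30 PM Pacific, for instance, the voice check passes while the SMS check fails, so only SMS prospects are re-queued.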

Channel Budget

If the SIP channel semaphore is full, the prospect gets a 5-second re-queue delay. This is backpressure, not a failure — the prospect stays queued and will be dialed once a channel frees up.

Call Outcomes

After AMD (answering machine detection) processes the call, the result flows back through handleCallResult():

Outcome	Meaning	Next action
`human_connected`	Human answered, AI agent engaged	Prospect marked completed
`human_hangup`	Human answered but hung up quickly	Prospect marked completed
`voicemail`	AMD detected voicemail	Voicemail drop if configured
`no_answer`	Phone rang, nobody picked up	Re-queued for retry
`busy`	Busy signal	Prospect marked completed
`failed`	Call failed (network, invalid number)	Prospect marked failed

Each outcome writes a did_health_events row and triggers a health score recomputation for the DID that placed the call.
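
The outcome table can be expressed as a typed lookup, which lets the compiler flag any outcome that lacks a mapping. This is a sketch only: the action labels and the name `nextActionFor` are hypothetical, and the real `handleCallResult()` may be structured differently.

```typescript
type CallOutcome =
  | "human_connected"
  | "human_hangup"
  | "voicemail"
  | "no_answer"
  | "busy"
  | "failed";

// Hypothetical labels for the "Next action" column above.
type NextAction =
  | "mark_completed"
  | "voicemail_drop_if_configured"
  | "requeue_for_retry"
  | "mark_failed";

const NEXT_ACTION: Record<CallOutcome, NextAction> = {
  human_connected: "mark_completed",
  human_hangup: "mark_completed",
  voicemail: "voicemail_drop_if_configured",
  no_answer: "requeue_for_retry",
  busy: "mark_completed",
  failed: "mark_failed",
};

function nextActionFor(outcome: CallOutcome): NextAction {
  return NEXT_ACTION[outcome];
}
```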

Health Feedback Loop

The campaign engine and DID health system form a closed loop:

Campaign dials prospect
  → call completes with outcome
  → did_health_events row written
  → health score recomputed (sliding window of last 50 calls)
  → health action determined:
      score >= 0.8  → ok (no action)
      0.7 – 0.8    → warning (event published)
      0.5 – 0.7    → cooldown (DID pulled from rotation for 2 hours)
      < 0.5        → burned (DID permanently removed)
  → next call uses updated scores for DID selection
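
The threshold bands above map to a simple classifier. This is a sketch: `healthActionFor` is a hypothetical name, and since the guide does not say which band owns the exact boundary values, the code assumes each band includes its lower bound.

```typescript
type HealthAction = "ok" | "warning" | "cooldown" | "burned";

const COOLDOWN_MS = 2 * 60 * 60 * 1000; // 2-hour cooldown from this guide

function healthActionFor(score: number): HealthAction {
  // Assumption: each band includes its lower bound (0.8 is "ok", 0.7 is "warning").
  if (score >= 0.8) return "ok";
  if (score >= 0.7) return "warning";
  if (score >= 0.5) return "cooldown";
  return "burned";
}
```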

The did-lifecycle worker runs every 60 seconds to transition cooling → active when cooldown periods expire, and warming → active when warming periods complete.

Health score weights

Component	Weight	What it measures
Answer rate	30%	Completed calls / total calls
Avg call duration	25%	Normalized to 60s = 1.0
Human engagement	20%	Humans answered / completed calls
No-answer trend	15%	Inverse of no-answer rate
Spam clean	10%	Inverse of spam flag rate

Scores are computed over the last 50 calls per DID. New DIDs with no history start at 1.0.
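
As a worked example, the weighted score can be computed from aggregate counts over the 50-call window. This is a sketch under assumptions: the field names in `DidWindowStats` are hypothetical, the duration component is capped at 1.0, and the no-answer and spam rates are taken over total calls.

```typescript
// Aggregates over the last 50 calls for one DID (hypothetical shape).
interface DidWindowStats {
  totalCalls: number;
  completedCalls: number;
  humanAnswered: number;
  noAnswerCalls: number;
  spamFlagged: number;
  avgDurationSec: number;
}

function ratio(numerator: number, denominator: number): number {
  return denominator > 0 ? numerator / denominator : 0;
}

function healthScore(s: DidWindowStats): number {
  if (s.totalCalls === 0) return 1.0; // new DIDs with no history start at 1.0
  const answerRate = ratio(s.completedCalls, s.totalCalls);
  const duration = Math.min(s.avgDurationSec / 60, 1); // 60s = 1.0, capped (assumption)
  const engagement = ratio(s.humanAnswered, s.completedCalls);
  const noAnswerClean = 1 - ratio(s.noAnswerCalls, s.totalCalls);
  const spamClean = 1 - ratio(s.spamFlagged, s.totalCalls);
  return (
    0.3 * answerRate +
    0.25 * duration +
    0.2 * engagement +
    0.15 * noAnswerClean +
    0.1 * spamClean
  );
}
```

For a DID with 50 calls, 25 completed, 10 answered by humans, 20 unanswered, 5 spam-flagged, and a 30-second average duration, the terms are 0.15 + 0.125 + 0.08 + 0.09 + 0.09, or about 0.535.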

Campaign State Machine

created → dialing → completed
              ↕
           paused
              ↓
          cancelled

Transition	Trigger
`created → dialing`	`POST /campaigns/:id/start`
`dialing → paused`	Manual pause, or no healthy DIDs available
`paused → dialing`	`POST /campaigns/:id/resume`
`dialing → completed`	All prospects dialed, none in `dialing` status
`any → cancelled`	`POST /campaigns/:id/cancel`

A campaign that auto-pauses due to DID exhaustion will not resume on its own. You need to either add healthy DIDs to the pool or manually resume after DIDs recover from cooldown.
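
The transition table can be encoded as an allow-list so that invalid transitions are rejected in one place. This is a sketch: `canTransition` is a hypothetical name, and "any → cancelled" is read here as "any state other than cancelled itself", which is an assumption.

```typescript
type CampaignStatus = "created" | "dialing" | "paused" | "completed" | "cancelled";

const TRANSITIONS: Record<CampaignStatus, CampaignStatus[]> = {
  created: ["dialing", "cancelled"],
  dialing: ["paused", "completed", "cancelled"],
  paused: ["dialing", "cancelled"],
  completed: ["cancelled"], // assumption: "any → cancelled" taken literally
  cancelled: [],
};

function canTransition(from: CampaignStatus, to: CampaignStatus): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}
```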

Prospect Statuses

Voice statuses

Status	Meaning
`queued`	Waiting to be dialed
`dialing`	Call in progress
`human_connected`	Human answered
`human_hangup`	Human answered but disconnected quickly
`voicemail`	Voicemail detected
`no_answer`	Not answered (eligible for retry)
`busy`	Busy signal
`failed`	Call failed
`cancelled`	Skipped (DNC match or campaign cancelled)

SMS statuses

Status	Meaning
`queued`	Waiting to be sent
`sending`	Message handed off to carrier
`sent`	Carrier accepted, awaiting delivery confirmation
`delivered`	Carrier confirmed delivery to recipient
`failed`	Carrier rejected or delivery failed
`cancelled`	Skipped (DNC match or campaign cancelled)

For SMS campaigns, sent counts as in-progress (the campaign won't complete until all prospects reach a terminal status: delivered, failed, or cancelled). Delivery webhooks from the carrier automatically update prospect status.
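
The completion rule for SMS campaigns amounts to checking that every prospect has reached a terminal status. A minimal sketch, with `isSmsCampaignComplete` and `smsInProgressCount` as hypothetical names:

```typescript
type SmsStatus = "queued" | "sending" | "sent" | "delivered" | "failed" | "cancelled";

const SMS_TERMINAL: SmsStatus[] = ["delivered", "failed", "cancelled"];

function isTerminal(status: SmsStatus): boolean {
  return SMS_TERMINAL.indexOf(status) !== -1;
}

// True once no prospect is still queued, sending, or sent.
// Note: an empty prospect list also returns true.
function isSmsCampaignComplete(statuses: SmsStatus[]): boolean {
  return statuses.every(isTerminal);
}

// Prospects still awaiting a send or a delivery webhook.
function smsInProgressCount(statuses: SmsStatus[]): number {
  return statuses.filter((s) => !isTerminal(s)).length;
}
```

A campaign whose last prospect is stuck in `sent` stays open until the delivery webhook arrives, which is worth checking when an SMS campaign appears to hang near the end.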

What Breaks at Scale

SIP channel saturation

At 60 concurrent campaign channels, new calls back off with a 5-second delay. If calls start faster than channels free up (roughly, when call attempts per second × the average seconds each call holds a channel exceeds 60), calls queue up behind the semaphore. The campaign still runs, just slower.

Fix: Monitor channels:campaign in Redis. If it's consistently at 60, either lower calls_per_second or increase the pool limit in channel-budget.ts.
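
A quick way to judge whether a given rate will saturate the pool is Little's law: average concurrency equals arrival rate times average hold time. The helpers below are a hypothetical back-of-envelope sketch and are not part of the engine.

```typescript
// Expected concurrent channels for a steady attempt rate (Little's law).
function expectedConcurrentChannels(
  attemptsPerSecond: number,
  avgChannelHoldSec: number
): number {
  return attemptsPerSecond * avgChannelHoldSec;
}

// Highest steady attempt rate a pool can sustain without backpressure.
function maxSustainableRate(poolLimit: number, avgChannelHoldSec: number): number {
  return poolLimit / avgChannelHoldSec;
}
```

For example, 2 attempts per second with calls holding a channel for 45 seconds on average needs about 90 channels, which exceeds the 60-channel campaign pool; with a 30-second average hold, the pool sustains about 2 attempts per second.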

DID pool exhaustion

If too many DIDs drop below the health threshold (0.7), the campaign runs out of caller IDs and auto-pauses. This typically happens when answer rates are low and the same DIDs get burned through a high-volume campaign.

Fix: Add more DIDs to the pool. Use a larger DID pool for high-volume campaigns. Monitor health scores via GET /dids/health and watch for the campaign.paused.no_dids event.

BullMQ backlog

If Redis is slow or the worker process restarts, the dialer queue can back up. Since the dialer self-enqueues, a long backlog means delayed dialing. BullMQ jobs have removeOnComplete: true, so completed jobs don't accumulate.

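Fix: Check queue depth through the BullMQ keys in Redis, for example LLEN bull:campaign-dialer:wait for waiting jobs and ZCARD bull:campaign-dialer:delayed for delayed ones (self-enqueued dialer jobs carry a delay, so they normally sit in the delayed set). If the worker is down, restart the service. If Redis is slow, check memory usage.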

TCPA re-queue storms

If a campaign targets a timezone where the window just closed, all remaining prospects get re-queued with 30-minute delays simultaneously. When the window reopens, they all become eligible at once, creating a burst.

Fix: This is generally fine — the semaphore and rate limiter absorb the burst. But if it coincides with DID exhaustion, the campaign may pause.

Health score oscillation

A DID that gets cooled down for 2 hours, comes back, and immediately gets hammered with calls can oscillate between active and cooling. The health score resets to whatever the last 50 calls show — if those calls were all before the cooldown, the score looks healthy, but new calls quickly degrade it again.

Fix: After cooldown, gradually ramp the DID back in rather than putting it back at full rotation. Consider using callerIds on the campaign to exclude recently-cooled DIDs.

Monitoring

Key Redis keys

Key	Type	What it tracks
`channels:campaign`	counter	Active SIP channels for campaigns
`channels:ivr`	counter	Active SIP channels for IVR
`channels:api`	counter	Active SIP channels for API calls
`bull:campaign-dialer:*`	BullMQ	Dialer job queue state
`bull:campaign-health:*`	BullMQ	Health job queue state
`bull:campaign-stats:*`	BullMQ	Stats job queue state

Key SSE events

Subscribe to campaign events for real-time monitoring:

curl -N "https://api.trunx.io/api/events?channels=campaign:{id}" \
  -H "Authorization: Bearer $KEY"

Event	When
`campaign.call.initiated`	Call placed (voice)
`campaign.call.human_connected`	Human answered (voice)
`campaign.call.voicemail`	Voicemail detected (voice)
`campaign.call.no_answer`	No answer (voice)
`campaign.call.failed`	Call failed (voice)
`campaign.sms.sent`	SMS sent to carrier (SMS)
`campaign.sms.delivered`	SMS delivered to recipient (SMS)
`campaign.sms.failed`	SMS send failed (SMS)
`campaign.sms.delivery_failed`	SMS delivery failed (SMS)
`campaign.prospect.dnc`	Prospect skipped (DNC)
`campaign.paused.no_dids`	Campaign auto-paused — no healthy DIDs (voice only)
`campaign.stats.updated`	Stats snapshot published
`did.health.cooldown`	DID pulled from rotation
`did.health.burned`	DID permanently removed

Quick health check

# Campaign progress
curl "https://api.trunx.io/api/campaigns/$ID/stats" -H "Authorization: Bearer $KEY"

# DID pool health
curl "https://api.trunx.io/dids/health/report" -H "Authorization: Bearer $KEY"

# Channel usage
redis-cli GET channels:campaign
