Campaigns & Bulk Calling/SMS

How the campaign dialer works internally — job queues, rate limiting, DID selection, guardrails, health feedback, SMS drip pacing, and where things break at scale.

This guide explains the internals of the campaign engine. It's meant for engineers debugging campaign behavior, tuning throughput, or understanding why a campaign paused itself.

Architecture Overview

A campaign is a state machine backed by three BullMQ job queues and a set of guardrails. No call/send logic lives in the API route — the route just writes rows and enqueues the first job. The dialer branches on campaign.channel to handle voice and SMS differently.

POST /api/campaigns/:id/start
  └─ sets status = "dialing"
  └─ enqueues first job → campaign-dialer queue

campaign-dialer (BullMQ worker, concurrency: 5)
  └─ loads campaign, branches on channel:

  VOICE:
    └─ fetches batch of prospects (size = calls_per_second)
    └─ calls dialProspect() for each (parallel via Promise.allSettled)
    └─ enqueues itself with delay = 1000/cps ms
    └─ when no prospects left + none dialing → completed

  SMS:
    └─ fetches 1 prospect per tick (human-like drip)
    └─ calls sendProspectSms()
    └─ enqueues itself with delay = (15 min × 60 × 1000) ms / activeDIDcount
    └─ when no prospects left + none sending/sent → completed

dialProspect() (voice, per-prospect)
  ├─ DNC check → skip if suppressed
  ├─ TCPA check → re-queue if outside 8am-9pm local
  ├─ Channel acquire → backpressure if pool full
  ├─ DID selection (selectBestDid) → pick best caller ID by last_call_at
  ├─ Originate call (AI or plain)
  └─ On result → write health event → recompute score

sendProspectSms() (sms, per-prospect)
  ├─ DNC check → skip if suppressed
  ├─ SMS window check → re-queue if outside 7am-7pm local
  ├─ DID selection (selectBestDidForSms) → pick best by last_sms_at
  ├─ Send via providers.messaging.send()
  ├─ Mark prospect sending → sent (with providerMessageId)
  └─ Delivery webhook updates to delivered/failed

campaign-health (BullMQ worker, concurrency: 10)
  └─ writes did_health_events row
  └─ recomputes health score
  └─ triggers cooldown/burn if score drops

campaign-stats (BullMQ worker, concurrency: 5)
  └─ aggregates prospect statuses + call stats
  └─ publishes campaign.stats.updated via Redis pub/sub

Job Queues

Three BullMQ queues share a single Redis connection:

| Queue | Purpose | Concurrency |
| --- | --- | --- |
| campaign-dialer | Fetches prospects, calls dialProspect(), self-enqueues next batch | 5 |
| campaign-health | Writes health events, recomputes DID scores, triggers state changes | 10 |
| campaign-stats | Aggregates stats and publishes updates via SSE | 5 |

All three are stateless — they read from Postgres, write to Postgres, and publish events to Redis. If a worker crashes mid-job, BullMQ retries it. No in-memory state is lost.

The dialer worker self-enqueues with a delay calculated from calls_per_second. At 10 calls/second, the delay between batches is 100ms. At 1 call/second (default), the delay is 1000ms.
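The self-enqueue pattern for the voice path can be sketched as a single tick function. This is a minimal illustration, not the engine's actual code: `DialerDeps`, `voiceTick`, and every method on the dependency interface are hypothetical names. In a real worker, `enqueueNext` would presumably wrap a BullMQ `queue.add(name, data, { delay })` call and the other methods would query Postgres.

```typescript
// Hypothetical dependencies, injected so the tick logic can be read in isolation.
interface DialerDeps {
  fetchQueued(campaignId: string, limit: number): Promise<string[]>;
  countDialing(campaignId: string): Promise<number>;
  dialProspect(prospectId: string): Promise<void>;
  enqueueNext(campaignId: string, delayMs: number): Promise<void>;
  markCompleted(campaignId: string): Promise<void>;
}

type TickResult = "rescheduled" | "completed";

async function voiceTick(
  campaignId: string,
  callsPerSecond: number,
  deps: DialerDeps,
): Promise<TickResult> {
  // Batch size equals calls_per_second, as described above.
  const batch = await deps.fetchQueued(campaignId, callsPerSecond);

  // Nothing queued and nothing in flight: the campaign is done.
  if (batch.length === 0 && (await deps.countDialing(campaignId)) === 0) {
    await deps.markCompleted(campaignId);
    return "completed";
  }

  // Dial in parallel; one rejected dial must not abort the rest of the batch.
  await Promise.allSettled(batch.map((id) => deps.dialProspect(id)));

  // Schedule the next tick as a delayed job instead of looping in-process.
  await deps.enqueueNext(campaignId, Math.round(1000 / callsPerSecond));
  return "rescheduled";
}
```

Because the only thing that carries the campaign forward is the next delayed job, a worker restart loses nothing: the job is still in Redis and the prospect state is still in Postgres.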

Rate Limiting

Rate limiting works differently for each channel.

Voice: Batch pacing (calls_per_second)

The mode.calls_per_second campaign config controls how many prospects the dialer fetches per batch. After dialing a batch, the worker self-enqueues with a delay of 1000 / calls_per_second milliseconds.

calls_per_second: 1  → 1 prospect/batch, 1000ms delay (default)
calls_per_second: 5  → 5 prospects/batch, 200ms delay
calls_per_second: 10 → 10 prospects/batch, 100ms delay
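The mapping from `calls_per_second` to batch size and delay is simple enough to express directly. The helper below is a sketch: the name `voicePacing`, the input validation, and the rounding are assumptions, and only the two formulas come from the text above.

```typescript
interface VoicePacing {
  batchSize: number; // prospects fetched per dialer tick
  delayMs: number; // delay before the worker re-enqueues itself
}

// Hypothetical helper; the default of 1 matches the documented default.
function voicePacing(callsPerSecond: number = 1): VoicePacing {
  if (!Number.isFinite(callsPerSecond) || callsPerSecond <= 0) {
    throw new RangeError("calls_per_second must be a positive number");
  }
  return {
    batchSize: Math.max(1, Math.floor(callsPerSecond)),
    delayMs: Math.round(1000 / callsPerSecond),
  };
}

console.log(voicePacing(5)); // batchSize 5, delayMs 200
```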

SMS: DID-based drip pacing

SMS campaigns send 1 prospect per tick. The delay between ticks is calculated from the active DID count to spread load across the pool:

delay = (MIN_SMS_GAP_MINUTES × 60 × 1000) / activeDIDcount

10 DIDs: 15-min gap ÷ 10 = 90 seconds between messages
50 DIDs: 15-min gap ÷ 50 = 18 seconds between messages

Each DID is limited to 4 messages per hour (15-minute gap enforced via the last_sms_at column). Over the 12-hour 7am-7pm window, that caps a DID at 48 messages per day.
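As a worked version of the drip formula and the per-DID gap, here is a small sketch. `smsTickDelayMs` and `didCanSend` are hypothetical names. `MIN_SMS_GAP_MINUTES` and `last_sms_at` come from the text above, and the real engine may handle an empty pool differently (for example by pausing the campaign rather than throwing).

```typescript
const MIN_SMS_GAP_MINUTES = 15;

// Delay between SMS ticks, spread evenly across the active DID pool.
function smsTickDelayMs(
  activeDidCount: number,
  gapMinutes: number = MIN_SMS_GAP_MINUTES,
): number {
  if (!Number.isInteger(activeDidCount) || activeDidCount <= 0) {
    throw new RangeError("activeDidCount must be a positive integer");
  }
  return Math.round((gapMinutes * 60 * 1000) / activeDidCount);
}

// Per-DID check against last_sms_at: a DID that has never sent is always eligible.
function didCanSend(
  lastSmsAt: Date | null,
  now: Date,
  gapMinutes: number = MIN_SMS_GAP_MINUTES,
): boolean {
  if (lastSmsAt === null) return true;
  return now.getTime() - lastSmsAt.getTime() >= gapMinutes * 60 * 1000;
}

console.log(smsTickDelayMs(10)); // 90000 ms, i.e. 90 seconds
console.log(smsTickDelayMs(50)); // 18000 ms, i.e. 18 seconds
```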

SIP channel semaphore

A Redis-based semaphore limits concurrent SIP channels per pool:

| Pool | Max concurrent |
| --- | --- |
| campaign | 60 |
| ivr | 15 |
| api | 8 |

When acquireChannel("campaign") fails (60 channels in use), the prospect is re-queued with a 5-second delay. The channel is released in a finally block after each call attempt.

acquire → Redis INCR channels:campaign
  if > 60 → DECR, return false (backpressure)
  if == 1 → set TTL 3600s (safety net)

release → Redis DECR channels:campaign
  if < 0 → reset to 0

If calls are not releasing channels (e.g., provider callback never fires), the semaphore leaks. The 1-hour TTL is a safety net, but stale channels reduce throughput. Check channels:campaign in Redis if campaigns are running slowly.
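The acquire/release logic above translates to roughly the following. This is a sketch under stated assumptions: `CounterStore` is a hypothetical minimal interface (a Redis client such as ioredis exposes commands with the same names), `InMemoryStore` is a stand-in for local experimentation that ignores TTLs, and `withChannel` is an illustrative wrapper showing the release-in-finally pattern.

```typescript
// Hypothetical minimal store interface mirroring INCR, DECR, EXPIRE, and SET.
interface CounterStore {
  incr(key: string): Promise<number>;
  decr(key: string): Promise<number>;
  expire(key: string, seconds: number): Promise<unknown>;
  set(key: string, value: string): Promise<unknown>;
}

const POOL_LIMITS = { campaign: 60, ivr: 15, api: 8 } as const;
type Pool = keyof typeof POOL_LIMITS;

async function acquireChannel(store: CounterStore, pool: Pool): Promise<boolean> {
  const key = `channels:${pool}`;
  const inUse = await store.incr(key);
  if (inUse > POOL_LIMITS[pool]) {
    await store.decr(key); // over the limit: undo the increment, signal backpressure
    return false;
  }
  if (inUse === 1) await store.expire(key, 3600); // safety-net TTL on first acquire
  return true;
}

async function releaseChannel(store: CounterStore, pool: Pool): Promise<void> {
  const key = `channels:${pool}`;
  const inUse = await store.decr(key);
  if (inUse < 0) await store.set(key, "0"); // never let the counter go negative
}

// Returns null on backpressure; the caller would re-queue with a 5-second delay.
async function withChannel<T>(
  store: CounterStore,
  pool: Pool,
  fn: () => Promise<T>,
): Promise<T | null> {
  if (!(await acquireChannel(store, pool))) return null;
  try {
    return await fn();
  } finally {
    await releaseChannel(store, pool);
  }
}

// In-memory stand-in for Redis; the TTL is ignored here.
class InMemoryStore implements CounterStore {
  private values = new Map<string, number>();
  async incr(key: string): Promise<number> {
    const next = (this.values.get(key) ?? 0) + 1;
    this.values.set(key, next);
    return next;
  }
  async decr(key: string): Promise<number> {
    const next = (this.values.get(key) ?? 0) - 1;
    this.values.set(key, next);
    return next;
  }
  async expire(_key: string, _seconds: number): Promise<void> {}
  async set(key: string, value: string): Promise<void> {
    this.values.set(key, Number(value));
  }
}
```

The leak described above corresponds to a code path that acquires but never reaches the release. Keeping acquire and release in one wrapper makes that harder to get wrong, although it only helps if `fn` itself resolves when the call ends.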

DID Selection

Each call/message needs a caller ID. Voice uses selectBestDid(), SMS uses selectBestDidForSms(). Both follow the same pattern:

  1. Query all DIDs where state = "active" AND healthScore >= threshold (default: 0.7)
  2. If callerIds array is set on campaign, filter to those numbers
  3. If prospect's area code matches a DID's area code, prefer it (local presence)
  4. Otherwise, pick the DID with the highest health score

If no healthy DIDs are available, the campaign auto-pauses and emits campaign.paused.no_dids. This is the most common reason a campaign stops unexpectedly.
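The four selection steps can be sketched as a pure function over an in-memory list. The `Did` shape, the `pickDid` and `areaCode` names, and the assumption that numbers are North American E.164 (`+1` followed by 10 digits) are all illustrative. The real `selectBestDid()` presumably runs as a database query and also factors in `last_call_at`, which this sketch omits.

```typescript
// Hypothetical DID record; field names are assumptions for illustration.
interface Did {
  number: string; // E.164, e.g. "+14155550101"
  state: "active" | "warming" | "cooling" | "burned";
  healthScore: number;
}

// Assumes North American numbers: "+1", a 3-digit area code, then 7 digits.
function areaCode(e164: string): string | null {
  const match = /^\+1(\d{3})\d{7}$/.exec(e164);
  return match?.[1] ?? null;
}

function pickDid(
  dids: readonly Did[],
  prospectPhone: string,
  callerIds?: readonly string[],
  threshold: number = 0.7,
): Did | null {
  // Step 1: active and healthy only.
  let pool = dids.filter((d) => d.state === "active" && d.healthScore >= threshold);

  // Step 2: restrict to the campaign's callerIds when set.
  if (callerIds && callerIds.length > 0) {
    const allowed = new Set(callerIds);
    pool = pool.filter((d) => allowed.has(d.number));
  }
  if (pool.length === 0) return null; // caller auto-pauses the campaign

  const byScore = [...pool].sort((a, b) => b.healthScore - a.healthScore);

  // Step 3: prefer a DID sharing the prospect's area code (local presence).
  const prospectArea = areaCode(prospectPhone);
  const local =
    prospectArea === null
      ? undefined
      : byScore.find((d) => areaCode(d.number) === prospectArea);

  // Step 4: otherwise fall back to the healthiest DID.
  return local ?? byScore[0] ?? null;
}
```

A `null` return corresponds to the exhaustion case: the caller would pause the campaign and emit campaign.paused.no_dids.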

Debugging DID exhaustion

# Check how many DIDs are active and healthy
curl "https://api.trunx.io/dids/health" -H "Authorization: Bearer $KEY"

# Check channel pool usage
redis-cli GET channels:campaign

Pre-flight Guardrails

Every prospect passes through three checks before dialing:

DNC / Suppression

Queries the suppression table for a matching phone + customer. If found, the prospect is set to cancelled and a campaign.prospect.dnc event is published. No call is attempted.

Time Windows

Voice: Calls are only allowed 8 AM – 9 PM in the recipient's local timezone (TCPA).

SMS: Messages are only allowed 7 AM – 7 PM (stricter window for text messages).

Both windows are evaluated in the timezone set in campaign.schedule.timezone (default: America/Los_Angeles), which stands in for the recipient's local time. A prospect that fails the window check is re-queued with a 30-minute delay (nextRetryAt).
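A window check of this kind can be written with the built-in `Intl.DateTimeFormat` API. The sketch below is an assumption-laden illustration: `withinWindow`, `localHour`, and `computeNextRetryAt` are hypothetical names, and treating the end hour as exclusive (9:00 PM sharp is already outside the voice window) is a guess the source does not confirm.

```typescript
type Channel = "voice" | "sms";

// Start inclusive, end exclusive (the exclusivity is an assumption).
const WINDOWS: Record<Channel, { startHour: number; endHour: number }> = {
  voice: { startHour: 8, endHour: 21 }, // 8 AM to 9 PM
  sms: { startHour: 7, endHour: 19 }, // 7 AM to 7 PM
};

// Hour of day (0-23) for an instant, as seen in the given IANA timezone.
function localHour(at: Date, timeZone: string): number {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone,
    hour: "numeric",
    hour12: false,
  }).formatToParts(at);
  const hour = parts.find((p) => p.type === "hour");
  if (!hour) throw new Error("could not resolve local hour");
  return Number(hour.value) % 24; // some engines render midnight as "24"
}

function withinWindow(
  channel: Channel,
  at: Date,
  timeZone: string = "America/Los_Angeles",
): boolean {
  const { startHour, endHour } = WINDOWS[channel];
  const hour = localHour(at, timeZone);
  return hour >= startHour && hour < endHour;
}

// Value to store in nextRetryAt when the window check fails.
function computeNextRetryAt(now: Date): Date {
  return new Date(now.getTime() + 30 * 60 * 1000);
}
```

A fixed 30-minute retry means a prospect that fails the check at 9:05 PM is re-checked many times overnight before the window reopens at 8 AM, which is where the re-queue bursts at window boundaries come from.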

Channel Budget

If the SIP channel semaphore is full, the prospect gets a 5-second re-queue delay. This is backpressure, not a failure — the prospect stays queued and will be dialed once a channel frees up.

Call Outcomes

After AMD (answering machine detection) processes the call, the result flows back through handleCallResult():

| Outcome | Meaning | Next action |
| --- | --- | --- |
| human_connected | Human answered, AI agent engaged | Prospect marked completed |
| human_hangup | Human answered but hung up quickly | Prospect marked completed |
| voicemail | AMD detected voicemail | Voicemail drop if configured |
| no_answer | Phone rang, nobody picked up | Re-queued for retry |
| busy | Busy signal | Prospect marked completed |
| failed | Call failed (network, invalid number) | Prospect marked failed |

Each outcome writes a did_health_events row and triggers a health score recomputation for the DID that placed the call.
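The outcome table reads naturally as an exhaustive lookup. The sketch below only encodes the "next action" column. The type names and action labels are hypothetical, and it says nothing about how handleCallResult() is actually structured.

```typescript
type CallOutcome =
  | "human_connected"
  | "human_hangup"
  | "voicemail"
  | "no_answer"
  | "busy"
  | "failed";

// Hypothetical action labels mirroring the table's "Next action" column.
type NextAction =
  | "mark_completed"
  | "voicemail_drop_if_configured"
  | "requeue_for_retry"
  | "mark_failed";

// Record<CallOutcome, ...> makes the compiler reject a missing outcome.
const NEXT_ACTION: Record<CallOutcome, NextAction> = {
  human_connected: "mark_completed",
  human_hangup: "mark_completed",
  voicemail: "voicemail_drop_if_configured",
  no_answer: "requeue_for_retry",
  busy: "mark_completed",
  failed: "mark_failed",
};

function nextActionFor(outcome: CallOutcome): NextAction {
  return NEXT_ACTION[outcome];
}
```

Typing the map as a `Record` over the outcome union means that adding a new outcome without deciding its next action becomes a compile error rather than a silent fallthrough.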

Health Feedback Loop

The campaign engine and DID health system form a closed loop:

Campaign dials prospect
  → call completes with outcome
  → did_health_events row written
  → health score recomputed (sliding window of last 50 calls)
  → health action determined:
      score >= 0.8  → ok (no action)
      0.7 – 0.8    → warning (event published)
      0.5 – 0.7    → cooldown (DID pulled from rotation for 2 hours)
      < 0.5        → burned (DID permanently removed)
  → next call uses updated scores for DID selection

The did-lifecycle worker runs every 60 seconds to transition cooling → active when cooldown periods expire, and warming → active when warming periods complete.
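The score-to-action thresholds in the loop above can be sketched as a single function. The listed ranges share their endpoints, so this sketch assumes each band includes its lower bound (0.7 exactly is a warning, 0.5 exactly is a cooldown). The name `healthAction` is hypothetical.

```typescript
type HealthAction = "ok" | "warning" | "cooldown" | "burned";

// Lower bounds are treated as inclusive; that boundary choice is an assumption.
function healthAction(score: number): HealthAction {
  if (score >= 0.8) return "ok"; // no action
  if (score >= 0.7) return "warning"; // event published, DID stays in rotation
  if (score >= 0.5) return "cooldown"; // pulled from rotation for 2 hours
  return "burned"; // permanently removed
}
```

With these boundaries, "warning" lines up with the default selection threshold of 0.7: a DID in the warning band is still eligible for selection, while anything lower is already out of rotation.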

Health score weights

| Component | Weight | What it measures |
| --- | --- | --- |
| Answer rate | 30% | Completed calls / total calls |
| Avg call duration | 25% | Normalized to 60s = 1.0 |
| Human engagement | 20% | Humans answered / completed calls |
| No-answer trend | 15% | Inverse of no-answer rate |
| Spam clean | 10% | Inverse of spam flag rate |

Scores are computed over the last 50 calls per DID. New DIDs with no history start at 1.0.
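As a worked example of the weighting, the sketch below computes a score from pre-aggregated counts over a DID's window. The `DidWindowStats` shape and the `computeHealthScore` name are hypothetical. Reading "inverse" as `1 - rate`, capping the duration component at 1.0, and scoring engagement as 0 when no calls completed are interpretations, not confirmed behavior.

```typescript
// Hypothetical aggregate over a DID's last 50 calls.
interface DidWindowStats {
  totalCalls: number;
  completedCalls: number;
  humanAnswered: number;
  noAnswerCalls: number;
  spamFlaggedCalls: number;
  avgDurationSeconds: number;
}

function computeHealthScore(s: DidWindowStats): number {
  if (s.totalCalls === 0) return 1.0; // new DIDs with no history start at 1.0

  const answerRate = s.completedCalls / s.totalCalls;
  const duration = Math.min(s.avgDurationSeconds / 60, 1); // 60s or more = 1.0
  const engagement = s.completedCalls === 0 ? 0 : s.humanAnswered / s.completedCalls;
  const noAnswerTrend = 1 - s.noAnswerCalls / s.totalCalls;
  const spamClean = 1 - s.spamFlaggedCalls / s.totalCalls;

  return (
    0.3 * answerRate +
    0.25 * duration +
    0.2 * engagement +
    0.15 * noAnswerTrend +
    0.1 * spamClean
  );
}

// 50 calls: 25 completed, 10 humans, 20 no-answers, 5 spam flags, 30s average.
// 0.3*0.5 + 0.25*0.5 + 0.2*0.4 + 0.15*0.6 + 0.1*0.9 = 0.535, which is cooldown territory.
const example = computeHealthScore({
  totalCalls: 50,
  completedCalls: 25,
  humanAnswered: 10,
  noAnswerCalls: 20,
  spamFlaggedCalls: 5,
  avgDurationSeconds: 30,
});
console.log(example.toFixed(3));
```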

Campaign State Machine

created → dialing → completed
             ↕
           paused

any state → cancelled

| Transition | Trigger |
| --- | --- |
| created → dialing | POST /campaigns/:id/start |
| dialing → paused | Manual pause, or no healthy DIDs available |
| paused → dialing | POST /campaigns/:id/resume |
| dialing → completed | All prospects dialed, none in dialing status |
| any → cancelled | POST /campaigns/:id/cancel |

A campaign that auto-pauses due to DID exhaustion will not resume on its own. You need to either add healthy DIDs to the pool or manually resume after DIDs recover from cooldown.
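The transition table can be encoded as an adjacency map and checked before any status write. This is an illustrative sketch: `canTransition` is a hypothetical name, and treating completed and cancelled as terminal (so "any → cancelled" applies only to non-terminal states) is an assumption the source does not spell out.

```typescript
type CampaignStatus = "created" | "dialing" | "paused" | "completed" | "cancelled";

// Allowed targets per state. Terminal states have none (an assumption).
const ALLOWED_TRANSITIONS: Record<CampaignStatus, readonly CampaignStatus[]> = {
  created: ["dialing", "cancelled"],
  dialing: ["paused", "completed", "cancelled"],
  paused: ["dialing", "cancelled"],
  completed: [],
  cancelled: [],
};

function canTransition(from: CampaignStatus, to: CampaignStatus): boolean {
  return ALLOWED_TRANSITIONS[from].includes(to);
}
```

Nothing in the map moves a campaign out of paused automatically, which matches the behavior described above: recovery from DID exhaustion needs an explicit resume.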

Prospect Statuses

Voice statuses

| Status | Meaning |
| --- | --- |
| queued | Waiting to be dialed |
| dialing | Call in progress |
| human_connected | Human answered |
| human_hangup | Human answered but disconnected quickly |
| voicemail | Voicemail detected |
| no_answer | Not answered (eligible for retry) |
| busy | Busy signal |
| failed | Call failed |
| cancelled | Skipped (DNC match or campaign cancelled) |

SMS statuses

| Status | Meaning |
| --- | --- |
| queued | Waiting to be sent |
| sending | Message handed off to carrier |
| sent | Carrier accepted, awaiting delivery confirmation |
| delivered | Carrier confirmed delivery to recipient |
| failed | Carrier rejected or delivery failed |
| cancelled | Skipped (DNC match or campaign cancelled) |

For SMS campaigns, sent counts as in-progress (the campaign won't complete until all prospects reach a terminal status: delivered, failed, or cancelled). Delivery webhooks from the carrier automatically update prospect status.
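The completion rule for SMS campaigns reduces to "every prospect is in a terminal status". A minimal sketch, with hypothetical names, operating on a plain list of statuses rather than a database query:

```typescript
type SmsProspectStatus =
  | "queued"
  | "sending"
  | "sent"
  | "delivered"
  | "failed"
  | "cancelled";

const TERMINAL_SMS_STATUSES: ReadonlySet<SmsProspectStatus> = new Set<SmsProspectStatus>([
  "delivered",
  "failed",
  "cancelled",
]);

// True only when no prospect is still queued, sending, or awaiting a delivery webhook.
function smsCampaignComplete(statuses: readonly SmsProspectStatus[]): boolean {
  return statuses.every((status) => TERMINAL_SMS_STATUSES.has(status));
}

console.log(smsCampaignComplete(["delivered", "sent"])); // false: "sent" is still in progress
```

One consequence worth keeping in mind: if a carrier never sends a delivery webhook for some message, that prospect stays in sent and, under this rule, the campaign never reaches completed.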

What Breaks at Scale

SIP channel saturation

At 60 concurrent campaign channels, new calls back off with a 5-second delay. If calls_per_second places calls faster than in-progress calls finish and release their channels, prospects queue up behind the semaphore. The campaign still runs, just slower.

Fix: Monitor channels:campaign in Redis. If it's consistently at 60, either lower calls_per_second or increase the pool limit in channel-budget.ts.

DID pool exhaustion

If too many DIDs drop below the health threshold (0.7), the campaign runs out of caller IDs and auto-pauses. This typically happens when answer rates are low and a high-volume campaign keeps cycling through the same small set of DIDs until they cool down or burn.

Fix: Add more DIDs to the pool. Use a larger DID pool for high-volume campaigns. Monitor health scores via GET /dids/health and watch for the campaign.paused.no_dids event.

BullMQ backlog

If Redis is slow or the worker process restarts, the dialer queue can back up. Since the dialer self-enqueues, a long backlog means delayed dialing. BullMQ jobs have removeOnComplete: true, so completed jobs don't accumulate.

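Fix: Check queue depth in Redis: LLEN bull:campaign-dialer:wait counts jobs ready to run, and ZCARD bull:campaign-dialer:delayed counts self-enqueued jobs still waiting on their delay. If the worker is down, restart the service. If Redis is slow, check memory usage.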

TCPA re-queue storms

If a campaign targets a timezone where the window just closed, all remaining prospects get re-queued with 30-minute delays simultaneously. When the window reopens, they all become eligible at once, creating a burst.

Fix: This is generally fine — the semaphore and rate limiter absorb the burst. But if it coincides with DID exhaustion, the campaign may pause.

Health score oscillation

A DID that gets cooled down for 2 hours, comes back, and immediately gets hammered with calls can oscillate between active and cooling. The health score resets to whatever the last 50 calls show — if those calls were all before the cooldown, the score looks healthy, but new calls quickly degrade it again.

Fix: After cooldown, gradually ramp the DID back in rather than putting it back at full rotation. Consider using callerIds on the campaign to exclude recently-cooled DIDs.

Monitoring

Key Redis keys

| Key | Type | What it tracks |
| --- | --- | --- |
| channels:campaign | counter | Active SIP channels for campaigns |
| channels:ivr | counter | Active SIP channels for IVR |
| channels:api | counter | Active SIP channels for API calls |
| bull:campaign-dialer:* | BullMQ | Dialer job queue state |
| bull:campaign-health:* | BullMQ | Health job queue state |
| bull:campaign-stats:* | BullMQ | Stats job queue state |

Key SSE events

Subscribe to campaign events for real-time monitoring:

curl -N "https://api.trunx.io/api/events?channels=campaign:{id}" \
  -H "Authorization: Bearer $KEY"

| Event | When |
| --- | --- |
| campaign.call.initiated | Call placed (voice) |
| campaign.call.human_connected | Human answered (voice) |
| campaign.call.voicemail | Voicemail detected (voice) |
| campaign.call.no_answer | No answer (voice) |
| campaign.call.failed | Call failed (voice) |
| campaign.sms.sent | SMS sent to carrier (SMS) |
| campaign.sms.delivered | SMS delivered to recipient (SMS) |
| campaign.sms.failed | SMS send failed (SMS) |
| campaign.sms.delivery_failed | SMS delivery failed (SMS) |
| campaign.prospect.dnc | Prospect skipped (DNC) |
| campaign.paused.no_dids | Campaign auto-paused, no healthy DIDs (voice only) |
| campaign.stats.updated | Stats snapshot published |
| did.health.cooldown | DID pulled from rotation |
| did.health.burned | DID permanently removed |

Quick health check

# Campaign progress
curl "https://api.trunx.io/api/campaigns/$ID/stats" -H "Authorization: Bearer $KEY"

# DID pool health
curl "https://api.trunx.io/dids/health/report" -H "Authorization: Bearer $KEY"

# Channel usage
redis-cli GET channels:campaign
