QUICK SUMMARY
Most RTP systems fail long before they run out of bandwidth; they fail under packet velocity.
This blog dives into RTP scaling architecture, from kernel-bypass media relays and tiered designs to codec strategy and cloud instance selection.
If you’re building platforms that must survive 30,000+ concurrent media streams, this is the engineering reality you can’t ignore.
Most telephony engineers learn the hard way that SIP is easy, but RTP is the silent killer.
You can spin up a Kamailio server on a modest virtual machine and route thousands of call setups per second without breaking a sweat. SIP is just text; it’s lightweight.
But the moment those calls are answered, the real challenge arises: RTP (Real-time Transport Protocol).
Handling 30,000 concurrent calls doesn’t just mean tracking 30,000 database entries. It means processing roughly 3 million UDP packets every second. If your RTP scaling architecture isn’t built to handle that packet velocity, your CPU will redline, audio will chop, and your customers will churn.
This blog moves beyond basic setups to explore the high-performance engineering required to handle carrier-grade media traffic without exploding your cloud bill. We’ll dig into the architectural decisions, hardware optimizations, and operational strategies that separate systems handling thousands of calls from those handling tens of thousands.
Handling 30,000 Calls in a Large-Scale RTP Architecture
The most common misconception in large-scale RTP architecture is that bandwidth is the bottleneck. It rarely is. Engineers often size their infrastructure around throughput capacity, only to discover that their systems collapse under packets-per-second (PPS) load long before they saturate the available bandwidth.
Let’s look at the numbers for 30,000 concurrent G.711 calls:
Bandwidth consumption:
- G.711 payload: 64 kilobits per second per direction
- IP/UDP/RTP headers: 40 bytes per packet (20 IPv4 + 8 UDP + 12 RTP), or roughly 16 kilobits per second per direction at 50 packets per second
- Total per call leg: approximately 80 kilobits per second (before Ethernet framing)
- 30,000 calls = 60,000 legs (inbound + outbound)
- Total bandwidth: approximately 4.8 gigabits per second
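The arithmetic is easy to check yourself. The sketch below assumes a 20 ms packetization interval and IPv4 with no Ethernet framing or SRTP overhead. The per-leg figure depends on which headers you count: the 12-byte RTP header alone gives about 68.8 kbps, while the full IPv4 + UDP + RTP stack (40 bytes) gives 80 kbps. Either way, the total stays under 5 Gbps.

```python
# Per-leg and aggregate bandwidth for G.711 at a 20 ms packetization interval.
PAYLOAD_KBPS = 64            # G.711 payload bitrate
PACKETS_PER_SEC = 50         # one packet every 20 ms
RTP_HEADER_BYTES = 12
UDP_HEADER_BYTES = 8
IPV4_HEADER_BYTES = 20       # assumption: IPv4 without options
LEGS = 60_000                # 30,000 calls x 2 legs


def leg_kbps(header_bytes: int) -> float:
    """Payload plus per-packet header overhead, in kilobits per second."""
    overhead_kbps = header_bytes * 8 * PACKETS_PER_SEC / 1000
    return PAYLOAD_KBPS + overhead_kbps


def total_gbps(per_leg_kbps: float, legs: int = LEGS) -> float:
    """Aggregate bandwidth across all legs, in gigabits per second."""
    return per_leg_kbps * legs / 1_000_000


rtp_only = leg_kbps(RTP_HEADER_BYTES)
full_stack = leg_kbps(RTP_HEADER_BYTES + UDP_HEADER_BYTES + IPV4_HEADER_BYTES)

print(f"RTP header only : {rtp_only:.1f} kbps/leg, {total_gbps(rtp_only):.2f} Gbps total")
print(f"IPv4 + UDP + RTP: {full_stack:.1f} kbps/leg, {total_gbps(full_stack):.2f} Gbps total")
```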
Raw throughput analysis:
A modern 25-gigabit Network Interface Card (NIC) can handle this easily. You’re using less than 20% of available capacity. In fact, if bandwidth were the constraint, you could run this on a single physical 10-gigabit link and still have headroom.
So why do servers crash at lower call counts? PPS (Packets Per Second) is the real killer.
The Packet Rate Problem
Voice audio is sliced into small chunks. By default, RTP sends voice packets every 20 milliseconds (the standard packetization interval for G.711). A single RTP stream generates 50 packets per second in each direction.
- 30,000 calls = 60,000 RTP streams (inbound + outbound)
- 60,000 streams × 50 packets/sec = 3 million packets per second
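To make that number concrete, it helps to turn it into a time budget. The sketch below reproduces the packet-rate math and then estimates how many nanoseconds of CPU time each packet can consume if the load spreads perfectly across cores. The core counts are hypothetical, and real systems never balance this evenly, so treat the budget as an upper bound.

```python
CALLS = 30_000
STREAMS_PER_CALL = 2          # one stream per leg, as counted above
PTIME_MS = 20                 # packetization interval

packets_per_stream = 1000 // PTIME_MS                     # 50 packets/sec
total_pps = CALLS * STREAMS_PER_CALL * packets_per_stream


def per_packet_budget_ns(pps: int, cores: int) -> float:
    """Average CPU time available per packet with perfectly even load."""
    return cores * 1_000_000_000 / pps


print(f"{total_pps:,} packets/sec")
for cores in (1, 8, 16):      # hypothetical core counts
    budget = per_packet_budget_ns(total_pps, cores)
    print(f"{cores:>2} cores -> {budget:,.0f} ns per packet")
```

On a single core, that leaves roughly a third of a microsecond per packet, which is the scale at which interrupts and system calls start to dominate.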
This is where the CPU bottleneck emerges. The default Linux networking stack (like that of most general-purpose operating systems) is not designed to push this packet rate through a user-space application.
Here’s what happens when each RTP packet arrives:
- Hardware interrupt: The NIC triggers an interrupt signal to the CPU.
- Context switch: The CPU pauses the running process and switches to interrupt handler mode.
- Kernel processing: The kernel examines packet headers, performs routing lookups, and applies firewall rules.
- User space boundary crossing: If the packet belongs to an application (like RTPProxy or a media processing engine), it must be copied from kernel space into user space memory.
- Application processing: The application decodes headers, applies media logic, and may perform transformations.
- Return crossing: The processed packet is copied back into kernel space.
- Driver transmission: The kernel queues the packet for transmission and notifies the NIC driver.
- Context switch out: The CPU returns to the interrupted process.
This “context switching tax” (especially the repeated boundary crossings between kernel space and user space) is what kills your RTP media server optimization efforts.
At 3 million packets per second, you’re forcing the CPU to perform millions of these expensive state transitions. You will hit 100% CPU usage and thermal throttling long before you saturate your available bandwidth.
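The shape of the problem is visible in even the smallest user-space relay. The following is a minimal sketch in Python, purely for illustration (production relays are not written this way, and the function names are hypothetical): every forwarded packet costs one receive call and one send call, each crossing the kernel/user boundary.

```python
import socket


def relay_packets(listen_sock: socket.socket, dest: tuple, max_packets: int) -> int:
    """Forward up to max_packets datagrams to dest, one at a time.

    Each iteration makes two system calls (recvfrom + sendto) and copies
    the payload across the kernel/user boundary twice.
    """
    out_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    forwarded = 0
    try:
        while forwarded < max_packets:
            data, _src = listen_sock.recvfrom(2048)   # kernel -> user copy
            out_sock.sendto(data, dest)               # user -> kernel copy
            forwarded += 1
    finally:
        out_sock.close()
    return forwarded


def demo() -> list:
    """Loopback run: sender -> relay -> receiver, all on 127.0.0.1."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    relay_in = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        receiver.bind(("127.0.0.1", 0))     # port 0 lets the OS pick a free port
        relay_in.bind(("127.0.0.1", 0))
        receiver.settimeout(2.0)
        relay_in.settimeout(2.0)
        packets = [b"pkt0", b"pkt1", b"pkt2"]
        for p in packets:
            sender.sendto(p, relay_in.getsockname())
        relay_packets(relay_in, receiver.getsockname(), len(packets))
        return [receiver.recv(2048) for _ in packets]
    finally:
        for s in (receiver, relay_in, sender):
            s.close()


if __name__ == "__main__":
    print(demo())
```

Three packets are trivial. Multiply the same loop by millions of packets per second and the two boundary crossings per packet become the dominant cost.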
Why Traditional Approaches Fail
Standard Linux can handle approximately 1 million packets per second with optimization, but beyond that, the kernel becomes the bottleneck. A standard server running Linux with a standard NIC driver hits saturation well before reaching multi-million PPS loads at the user-space application level.
This is why a FreeSWITCH instance handling media in user space maxes out at roughly 1,000–3,000 concurrent calls, depending on transcoding requirements. It’s not memory pressure. It’s not disk I/O. It’s pure CPU saturation caused by interrupt and context-switch overhead.
The Fix: Kernel-Bypass Networking (RTPEngine)
To survive high concurrency, you must stop processing media in User Space. You need a solution that pushes packet forwarding down into the OS kernel, bypassing the application layer entirely. This is the architectural pattern that separates scaling from failure.
RTPProxy vs. RTPEngine
The difference between RTPProxy and RTPEngine is fundamentally about where packet processing happens.
RTPProxy (Legacy Architecture):
- Runs entirely in User Space as a daemon process.
- Receives RTP packets via kernel network stack.
- Application decodes RTP headers, examines metadata, applies transformation logic, and re-encodes packets.
- Application sends modified packets back to the kernel for transmission.
- Great for compatibility and feature richness.
- Terrible for scale (every single packet makes the expensive user/kernel boundary crossing).
RTPProxy is adequate for small deployments, but its CPU cost grows linearly with packet rate. To handle 30,000 calls with RTPProxy, you would need dozens of servers running in parallel. The operational and cost overhead becomes astronomical.
RTPEngine (Industry Standard for Scaling):
- Hybrid architecture: signaling logic runs in User Space, but media forwarding runs in Kernel Space.
- Uses a custom kernel module (xt_RTPENGINE) to intercept and forward RTP packets at wire speed.
- Packet processing is offloaded to Netfilter (the Linux kernel’s packet filtering framework).
- Scales near-linearly with hardware (adding more CPUs and NICs adds close to proportional throughput).
The genius of RTPEngine is that it separates concerns: the application handles the complex, relatively infrequent signaling decisions (which codec? which endpoint? enable recording?), while the kernel handles the high-frequency, simple forwarding (here’s a packet, forward it to destination IP: port).
Here’s how RTPEngine kernel offloading works:
- Signaling: The RTPEngine daemon negotiates the ports and logic in User Space.
- Handover: Once the call is established, the daemon installs a forwarding entry for that stream in the kernel module, and an nftables or iptables rule steers matching UDP traffic to the module.
- Speed: Subsequent RTP packets are intercepted inside the kernel's Netfilter path and forwarded immediately. They never touch the User Space application.
Result: At 3 million PPS, the CPU load is distributed across the forwarding hardware offload engine (if available) and the kernel’s optimized netfilter code. This is far more efficient for an RTP scaling architecture than user-space processing.
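The split between infrequent signaling work and per-packet forwarding can be modeled in a few lines. This is a conceptual toy, not RTPEngine's code or API: the class and method names are hypothetical, and the "kernel" here is just a dictionary. It shows why the per-packet step is cheap once a rule exists.

```python
from dataclasses import dataclass, field


@dataclass
class ForwardingPlane:
    """Toy model: the control plane installs rules, the data plane only looks them up."""
    rules: dict = field(default_factory=dict)     # (ip, port) -> (ip, port)
    fast_path_hits: int = 0
    slow_path_hits: int = 0

    def install_rule(self, local: tuple, remote: tuple) -> None:
        """Signaling step: runs once per stream, after ports are negotiated."""
        self.rules[local] = remote

    def remove_rule(self, local: tuple) -> None:
        """Teardown step: runs once when the call ends."""
        self.rules.pop(local, None)

    def handle_packet(self, local: tuple, payload: bytes):
        """Per-packet step: a single table lookup decides the destination."""
        remote = self.rules.get(local)
        if remote is None:
            self.slow_path_hits += 1      # no rule: would be handed to the daemon
            return None
        self.fast_path_hits += 1          # rule found: forward without the daemon
        return remote, payload


plane = ForwardingPlane()
plane.install_rule(("10.0.0.1", 30000), ("203.0.113.7", 40000))

for _ in range(50):                       # one second of audio at 20 ms ptime
    plane.handle_packet(("10.0.0.1", 30000), b"rtp")
plane.handle_packet(("10.0.0.1", 30002), b"rtp")   # unknown port

print(plane.fast_path_hits, plane.slow_path_hits)
```

One rule installation serves every subsequent packet of the stream, so the expensive decision is made once per call rather than 50 times per second.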
| Feature | RTPProxy | RTPEngine |
| Architecture | User Space Only | Hybrid (User Space + Kernel) |
| Packet Processing | Application-level | Kernel-level (xt_RTPENGINE module) |
| Best For | Small deployments (<1,000 calls) | Large-scale (30,000+ calls) |
| Scalability | Linear (add servers for more capacity) | Near-linear per node (scales with CPUs and NICs) |
| CPU Overhead | High (context switching, boundary crossing) | Low (kernel offloading) |
| Setup Complexity | Simple | More complex (kernel module required) |
| Codec Support | Excellent (flexible) | Excellent (efficient passthrough) |
| Cost at 30,000 Calls | Dozens of servers needed | Roughly 8–12 nodes |
| Operational Burden | High | Moderate |
RTPEngine Deployment Patterns for 30,000 Calls
For 30,000 concurrent streams, you need multiple RTPEngine instances in a stateless, load-balanced cluster:
- 8–12 RTPEngine nodes (depending on hardware), each handling thousands of concurrent streams.
- Kamailio dispatcher to load-balance RTP allocation decisions across the cluster.
- Hardware NIC with SR-IOV (Single Root I/O Virtualization) on each RTPEngine instance to handle the high PPS load.
- Optional: DPDK (Data Plane Development Kit) integration for further kernel-bypass (advanced deployments).
Each RTPEngine node is independent: it shares no call state with its peers, so any node can take any new call. If one fails, the SIP proxy redistributes new calls to the remaining nodes. Existing calls on the failed node will drop (because its kernel forwarding rules are gone), which is why graceful draining is important.
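Graceful draining boils down to two rules: stop sending new calls to a node marked as draining, and shut it down only once its last call ends. Here is a minimal sketch of that selection logic. The `MediaNode` type, the per-node limit, and the least-loaded policy are assumptions made for illustration; in a real deployment this decision lives in the SIP proxy's load-balancing configuration.

```python
from dataclasses import dataclass


@dataclass
class MediaNode:
    name: str
    active_calls: int = 0
    max_calls: int = 3000        # hypothetical per-node limit
    draining: bool = False


def pick_node(nodes):
    """Return the least-loaded node that still accepts calls, or None."""
    candidates = [
        n for n in nodes
        if not n.draining and n.active_calls < n.max_calls
    ]
    return min(candidates, key=lambda n: n.active_calls, default=None)


def can_shut_down(node: MediaNode) -> bool:
    """A draining node is safe to stop once its last call has ended."""
    return node.draining and node.active_calls == 0


cluster = [
    MediaNode("rtp-1", active_calls=2100),
    MediaNode("rtp-2", active_calls=1500, draining=True),   # being drained
    MediaNode("rtp-3", active_calls=1800),
]

chosen = pick_node(cluster)
print(chosen.name)                   # rtp-3: least loaded among non-draining nodes
print(can_shut_down(cluster[1]))     # False: rtp-2 still carries calls
```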
Dedicated Media Servers vs. Media Proxies (For The Tiered Architecture)
Another major architectural flaw is forcing your PBX (Asterisk/FreeSWITCH) to relay media for every call. This is a fundamental misunderstanding of roles.
Why Your PBX Should NOT be Your Media Relay
The engines (Asterisk, FreeSWITCH, etc.) are B2BUAs (Back-to-Back User Agents). They’re designed to:
- Decode incoming RTP, extract audio
- Analyze the audio (transcoding, voicemail detection, tone detection)
- Re-encode and send outgoing RTP
- Apply business logic (IVRs, call transfers, voicemail)
This is computationally expensive.
Per-call CPU cost of a B2BUA:
- RTP decoding: approximately 5% CPU per core
- Jitter buffer management: approximately 3% CPU
- Audio analysis/transcoding: approximately 15–30% CPU (depending on codecs)
- Re-encoding: approximately 5% CPU
- Memory management, context switching overhead: approximately 10–15% CPU
A 16-core server running FreeSWITCH can handle a few concurrent transcoded calls before saturating. Scaling to 30,000 calls would require thousands of instances. That’s operationally infeasible.
But here’s the secret: most calls don’t need a B2BUA at all.
The Tiered Architecture Model for RTP Media Server Optimization
You need to implement a tiered architecture with four layers, numbered Tier 0 through Tier 3.
Tier 0 (SIP Signaling Proxy):
- Kamailio or OpenSIPS
- Handles SIP routing, authentication, and billing records
- Stateless (easy to scale horizontally)
- Delegates media handling to Tier 1
Tier 1 (RTP Media Proxy – The Edge):
- RTPEngine cluster
- Stateless relays that don’t understand audio content
- Handle NAT traversal, SRTP encryption, and packet reordering
- Scale to 30,000+ concurrent streams with modest hardware
- Protect your core infrastructure from media load
Tier 2 (Core PBX – The Brain):
- FreeSWITCH or Asterisk cluster
- Only processes calls that need business logic:
  - IVRs (phone tree menus)
  - Voicemail recording and retrieval
  - Conference bridges
  - Call recording (or delegate to a separate recording engine)
  - Transcoding (if endpoints negotiate incompatible codecs)
- Capacity: 1,000–5,000 concurrent calls per instance (depending on features)
Tier 3 (Specialized Services):
- Dedicated recording engines (separate fleet)
- Dedicated transcoding engines (separate fleet)
- STT/TTS engines (speech-to-text, text-to-speech for IVR)
- Each is purpose-built and scales independently
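One way to picture the tiers is as a routing decision made per call: every call touches Tier 0 and Tier 1, and the deeper tiers are involved only when a feature demands them. The sketch below is hypothetical (the type, field, and component names are invented for illustration), and it assumes recording, transcoding, and speech are delegated to Tier 3 rather than handled on the PBX.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRequirements:
    needs_ivr: bool = False
    needs_voicemail: bool = False
    needs_conference: bool = False
    needs_recording: bool = False
    needs_transcoding: bool = False
    needs_speech: bool = False        # STT/TTS


def components_for_call(req: CallRequirements) -> list:
    """List the components a call involves, in tier order."""
    path = ["tier0-sip-proxy", "tier1-rtpengine"]     # every call
    if req.needs_ivr or req.needs_voicemail or req.needs_conference:
        path.append("tier2-pbx")
    if req.needs_recording:
        path.append("tier3-recording")
    if req.needs_transcoding:
        path.append("tier3-transcoding")
    if req.needs_speech:
        path.append("tier3-speech")
    return path


print(components_for_call(CallRequirements()))
print(components_for_call(CallRequirements(needs_ivr=True, needs_speech=True)))
```

A plain two-party call never leaves the first two tiers, which is what keeps the expensive components small.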
The Transcoding Trap in an RTP Scaling Architecture
Transcoding (converting audio from one codec to another, e.g., Opus to G.711) is the enemy of density. It’s also often unnecessary. Understanding when to transcode (and when to avoid it) is critical for scaling.
The CPU Cost of Transcoding
Transcoding requires:
- Decompression: Extract audio samples from the incoming encoded format
- Resampling: Convert sample rate if codecs use different rates
- Compression: Re-encode into the outgoing format
For transcoding between codec families (e.g., G.729 to G.711), the computational cost increases significantly. Complex algorithms like G.729 require substantial CPU cycles to decode, any sample-rate mismatch adds a resampling step, and the audio must then be re-encoded in the target format.
At 30,000 calls with even a modest transcode rate, you’d need dedicated hardware or a separate transcoding engine.
Avoiding Transcoding Can Help
The best way to reduce transcoding is to avoid it entirely.
Implement Late Negotiation:
- User A sends a SIP INVITE whose SDP offer says, “I support G.711 (PCMU/PCMA) and Opus.”
- User B receives it and responds with “I support G.711 and GSM.”
- No SIP proxy is forcing a codec choice. The endpoints negotiate directly.
- The intersection is G.711. Both support it, so they use it.
- Result: No transcode needed.
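The negotiation above is essentially a list intersection that respects the offerer's preference order. A minimal sketch follows, assuming codecs are compared by name only (a real SDP answer also matches clock rates and format parameters); the function name is hypothetical.

```python
def negotiate(offer, answerer_supported):
    """Return the first offered codec the answerer also supports, or None.

    The offer is walked in order, so the offerer's preference wins.
    """
    supported = {codec.lower() for codec in answerer_supported}
    for codec in offer:
        if codec.lower() in supported:
            return codec
    return None


user_a_offer = ["PCMU", "PCMA", "opus"]     # G.711 u-law, G.711 A-law, Opus
user_b_codecs = ["PCMU", "GSM"]

chosen = negotiate(user_a_offer, user_b_codecs)
print(chosen)                                # PCMU

# No common codec means something in the path has to transcode.
print(negotiate(["opus"], ["PCMU", "GSM"]))  # None
```

A `None` result is the only case that should ever reach a transcoding pool.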
Configure your SIP proxy (Kamailio) and PBX (FreeSWITCH) to:
- Avoid codec locking: Don’t force a specific codec in the SDP offer. Let endpoints choose.
- Pass-through on common codecs: If endpoints agree on G.711 (PCMU or PCMA), configure the PBX to relay audio without processing.
- Reserve transcoding pools: When transcoding is truly necessary (e.g., a PSTN gateway uses only G.711, but internal users prefer Opus), delegate it to a separate transcoding engine, not your main PBX.
Ecosmob Expert Tip
The easiest way to unlock massive RTP scale is to treat transcoding as an exception, not the default. Let endpoints negotiate codecs directly and design your flows so that most media stays in passthrough. When transcoding does happen, isolate it on a separate pool instead of burdening your core call logic. This single shift often delivers the biggest jump in call density and stability.
Codec Selection for Maximum Density
So, which codecs support maximum scalability in a large-scale RTP architecture?
Let’s look at when you should choose specific codecs.
G.711
- Audio quality: Excellent for voice (narrowband sampling)
- Bandwidth: 64 kilobits per second per direction (uncompressed)
- CPU overhead: Negligible (simple quantization, minimal computation)
- Latency: Minimal
- Use case: If you’re scaling to 30,000 calls and bandwidth isn’t a constraint, use G.711.
Opus
- Audio quality: Excellent across a wide bandwidth
- Bandwidth: 20–40 kilobits per second (highly configurable)
- CPU overhead: Moderate
- Latency: Low
- Use case: When bandwidth is expensive (mobile apps, international calls). Requires dedicated transcoding or peer support.
G.729
- Audio quality: Good, but noticeably lower than G.711
- Bandwidth: approximately 8 kilobits per second
- CPU overhead: Very high due to a complex algorithm
- Licensing: The core patents expired in 2017, so the codec itself is now royalty-free; check the terms of any commercial implementation you deploy
- Use case: Only if you absolutely need the bandwidth savings and can afford dedicated DSP hardware
- Recommendation for 30,000+ scale: Avoid. The CPU overhead and reduced audio quality make this a poor trade at scale.
GSM
- Audio quality: Poor to fair
- Bandwidth: approximately 13 kilobits per second
- CPU overhead: High
- Use case: Legacy mobile phones (rare now)
- Recommendation: Deprecated. Don’t design new systems around it.
| Codec | Bandwidth | CPU Overhead | Audio Quality | Best Use Case | Scale Suitability |
| G.711 | 64 kbps/direction | Negligible | Excellent (voice) | Default choice for all compatible endpoints | ⭐⭐⭐⭐⭐ Perfect |
| Opus | 20–40 kbps | Moderate | Excellent (wideband) | Mobile apps, international calls, bandwidth-constrained | ⭐⭐⭐⭐ Good |
| G.729 | ~8 kbps | Very High | Good (but degraded) | Only with dedicated DSP hardware | ⭐⭐ Poor (avoid) |
| GSM | ~13 kbps | High | Poor to Fair | Legacy mobile (deprecated) | ⭐ Not recommended |
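The bandwidth column lists payload bitrates only. On the wire, every packet also carries headers, and at 50 packets per second that overhead is the same for every codec. The sketch below estimates per-direction wire bandwidth under stated assumptions: 20 ms packetization for all four codecs, 40 bytes of IPv4 + UDP + RTP headers, no Ethernet framing or SRTP, and 30 kbps as a hypothetical midpoint for Opus.

```python
HEADER_BYTES = 40            # IPv4 (20) + UDP (8) + RTP (12)
PACKETS_PER_SEC = 50         # 20 ms packetization
HEADER_KBPS = HEADER_BYTES * 8 * PACKETS_PER_SEC / 1000   # 16.0

# Payload bitrates from the comparison above; Opus value is an assumed midpoint.
CODEC_PAYLOAD_KBPS = {"G.711": 64, "Opus": 30, "G.729": 8, "GSM": 13}


def wire_kbps(payload_kbps: float) -> float:
    """Payload plus fixed header overhead, per direction."""
    return payload_kbps + HEADER_KBPS


for codec, payload in CODEC_PAYLOAD_KBPS.items():
    total = wire_kbps(payload)
    header_share = HEADER_KBPS / total * 100
    print(f"{codec:<6} {total:5.1f} kbps on the wire, "
          f"{header_share:4.1f}% headers, {PACKETS_PER_SEC} pps")
```

Two things stand out. A low-bitrate codec saves less than its payload rate suggests, because headers make up most of each G.729 packet. And the packet rate is identical for every codec, so switching codecs relieves bandwidth but does nothing for the PPS bottleneck.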
Practical codec strategy for 30,000 calls:
- Default to G.711 for all calls between compatible endpoints.
- Support Opus for endpoints that explicitly request it (high-quality, low-bandwidth scenarios).
- Operate a separate Transcoding Pool (2–4 dedicated Asterisk/FreeSWITCH instances) for the rare case where you must bridge G.711 and Opus.
- Never advertise G.729 or GSM in new deployments. If legacy endpoints demand it, handle them separately.
This keeps the vast majority of your 30,000 concurrent calls flowing through RTPEngine without any codec transformation, maximizing density and minimizing cost.
Choosing the Right Cloud Specs for High Media Concurrency
When deploying RTP infrastructure for 30,000+ concurrent streams, the instance you choose matters far more than the size. Generic cloud instances aren’t built for the relentless packet-per-second demands of real-time media. You need to prioritize packet processing over raw CPU power. Here’s exactly which instances to pick on each cloud platform.
AWS
- Use: c5n.4xlarge or c5n.9xlarge (or C6gn variants)
- Why: Network-optimized (ENA + SR-IOV). Can handle high PPS without jitter degradation.
- Avoid: T-series, M-series, general-purpose instances
Google Cloud
- Use: M2 series with gVNIC, or Tau T2D
- Why: High packet processing capability with premium networking options.
- Avoid: Standard N2 or E2 series
Azure
- Use: Fsv2 series with Accelerated Networking, or D-series with Accelerated Networking
- Why: High CPU clock rates + direct NIC access = low jitter.
- Avoid: B-series, general-purpose instances without Accelerated Networking
Spot Instances (All Clouds)
Spot Instances are unused cloud capacity that providers (AWS, Google Cloud, Azure) sell at steep discounts. The trade-off is that the provider can reclaim them on short notice (about two minutes on AWS, and as little as 30 seconds on Google Cloud and Azure) if they need the capacity. For RTP deployments with graceful draining, this can be acceptable: the node stops taking new calls as soon as the notice arrives, but any calls still in progress when the instance is reclaimed will drop.
- Save: Significant cost reduction through discount pricing
- Trade-off: Small risk of interruption (handle with graceful draining)
- Best for: Non-mission-critical deployments, testing, internal systems
- Avoid for: SLA-critical customer-facing calls
Scaling RTP to 30,000+ concurrent streams isn’t about buying bigger servers; it’s about respecting the unique ins and outs of real-time UDP traffic and architecting systems that embrace those constraints rather than fight them.
To handle 30,000+ concurrent streams, you must:
- Understand the real bottleneck.
- Offload media forwarding to the kernel.
- Separate concerns with a tiered architecture.
- Choose codecs wisely.
- Select network-optimized cloud instances.
If you build your large-scale RTP architecture on these principles, you won’t just support 30,000 calls; you’ll be ready for 100,000, and you’ll do it cost-effectively.
FAQs
What's the difference between jitter and packet loss? Which matters more for RTP?
Both matter, but they hurt differently. Packet loss creates gaps that make audio choppy. Jitter (variation in packet arrival time) forces larger jitter buffers, which adds delay, and causes distortion when packets arrive too late to be played. At high concurrency, jitter problems emerge first (caused by context switching and hypervisor overhead). This is why SR-IOV and ENA matter: they reduce jitter by letting VMs talk directly to the NIC.
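Jitter has a precise definition in RTP. RFC 3550 describes a running estimate that compares the spacing between packets on arrival with their spacing when sent, smoothed with a gain of 1/16. The sketch below implements that estimator over two lists of timestamps; the function name is our own, and it assumes both lists use the same time unit (milliseconds here).

```python
def interarrival_jitter(send_times, recv_times):
    """Running interarrival jitter estimate as described in RFC 3550.

    For each consecutive packet pair, D is the difference between the
    receive spacing and the send spacing; the estimate moves 1/16 of the
    way toward |D| on every packet.
    """
    jitter = 0.0
    for i in range(1, len(send_times)):
        d = (recv_times[i] - recv_times[i - 1]) - (send_times[i] - send_times[i - 1])
        jitter += (abs(d) - jitter) / 16
    return jitter


# Perfectly paced stream: packets sent and received every 20 ms.
print(interarrival_jitter([0, 20, 40, 60], [5, 25, 45, 65]))

# Third packet arrives 5 ms late relative to its predecessor.
print(interarrival_jitter([0, 20, 40], [0, 20, 45]))
```

A constant network delay produces zero jitter; only variation in delay moves the estimate.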
How do I know if the RTPEngine cluster is about to hit capacity?
Watch three signals: CPU utilization approaching saturation, packet loss appearing in your metrics, and jitter spiking. If you see these signs, you're near capacity. Set up alerting at the warning threshold and gracefully drain or add nodes before you hit a wall.
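As a rough sketch, the three signals can be folded into one check that an alerting job runs per node. The thresholds below are hypothetical placeholders, not recommendations; derive real values from your own baseline measurements.

```python
from dataclasses import dataclass


@dataclass
class NodeMetrics:
    cpu_percent: float
    packet_loss_percent: float
    jitter_ms: float


# Hypothetical warning thresholds; tune them against your own baseline.
WARN = NodeMetrics(cpu_percent=70.0, packet_loss_percent=0.5, jitter_ms=20.0)


def capacity_warnings(metrics: NodeMetrics, warn: NodeMetrics = WARN) -> list:
    """Return the names of the signals at or above their warning threshold."""
    warnings = []
    if metrics.cpu_percent >= warn.cpu_percent:
        warnings.append("cpu")
    if metrics.packet_loss_percent >= warn.packet_loss_percent:
        warnings.append("packet_loss")
    if metrics.jitter_ms >= warn.jitter_ms:
        warnings.append("jitter")
    return warnings


print(capacity_warnings(NodeMetrics(45.0, 0.0, 4.0)))    # healthy node
print(capacity_warnings(NodeMetrics(82.0, 0.1, 27.0)))   # nearing capacity
```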
Should I use an RTP proxy, RTPEngine, or dedicated media servers for scaling?
Use RTPEngine as your primary media relay layer, and treat dedicated media servers like FreeSWITCH or Asterisk as feature engines, not routers. RTP proxy-style components that live entirely in user space are fine for small deployments but become inefficient at scale. A tiered architecture (SIP proxy at the edge, RTPEngine for media, and PBX only when logic is needed) is the most sustainable RTP scaling architecture pattern.
Why does my system need a transcoding pool?
A small percentage of calls will always need transcoding due to legacy equipment or PSTN gateways. Rather than burden your main PBX, dedicate a few instances to transcoding. They'll mostly sit idle, which is fine since their job is to be there when you need them.
Which codecs support maximum scalability at high concurrency?
Codecs with minimal CPU overhead and broad interoperability scale best for RTP. G.711 is ideal when bandwidth is affordable, while Opus is a strong choice where bandwidth is constrained but you can control endpoints and transcoding pools. Heavy, legacy codecs like G.729 or GSM reduce density and add complexity, so they should be isolated and never used as the default.







