RTP Scaling Architecture to Handle 30,000+ Concurrent Media Streams

Updated on: 29th April 2026

QUICK SUMMARY

Most RTP systems fail long before they run out of bandwidth; they fail under packet velocity.

This blog dives into RTP scaling architecture, from kernel-bypass media relays and tiered designs to codec strategy and cloud instance selection.

If you’re building platforms that must survive 30,000+ concurrent media streams, this is the engineering reality you can’t ignore.

Most telephony engineers learn the hard way that SIP is easy, but RTP is the silent killer.

You can spin up a Kamailio server on a modest virtual machine and route thousands of call setups per second without breaking a sweat. SIP is just text; it’s lightweight.

But the moment those calls are answered, the real challenge arises: RTP (Real-time Transport Protocol).

Handling 30,000 concurrent calls doesn’t just mean tracking 30,000 database entries. It means processing roughly 3 million UDP packets every second. If your RTP scaling architecture isn’t built to handle that packet velocity, your CPU will redline, audio will chop, and your customers will churn.

This blog moves beyond basic setups to explore the high-performance engineering required to handle carrier-grade media traffic without exploding your cloud bill. We’ll dig into the architectural decisions, hardware optimizations, and operational strategies that separate systems handling thousands of calls from those handling tens of thousands.

Handling 30,000 Calls in a Large-Scale RTP Architecture

The most common misconception in large-scale RTP architecture is that bandwidth is the bottleneck. It rarely is. Engineers often size their infrastructure around throughput capacity, only to discover that their systems collapse under the weight of packet-per-second (PPS) load long before saturating the available bandwidth.

Let’s look at the numbers for 30,000 concurrent G.711 calls:

Bandwidth consumption:

G.711 payload: 64 kilobits per second per direction
IP/UDP/RTP headers: approximately 40 bytes per packet (20 IPv4 + 8 UDP + 12 RTP), or about 16 kilobits per second per direction at 50 packets per second
Total per call leg: approximately 80 kilobits per second (before Ethernet framing)
30,000 calls = 60,000 legs (inbound + outbound)
Total bandwidth: approximately 4.8 gigabits per second

Raw throughput analysis:

A modern 25-gigabit Network Interface Card (NIC) can handle this easily. You’re using less than 20% of available capacity. In fact, if bandwidth were the constraint, you could run this on a single physical 10-gigabit link and still have headroom.
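As a rough check on the bandwidth side, the per-leg rate can be computed from the codec payload rate and the per-packet header size. This is a minimal sketch that assumes IPv4 (20 bytes), UDP (8 bytes), and RTP (12 bytes) headers, a 20 ms packetization interval, and no Ethernet framing; the function name is illustrative.

```python
def leg_kbps(payload_kbps: float, ptime_ms: int = 20, header_bytes: int = 40) -> float:
    """On-the-wire rate of one RTP leg in kbit/s (IP layer, no Ethernet framing)."""
    packets_per_sec = 1000 / ptime_ms
    header_kbps = header_bytes * 8 * packets_per_sec / 1000
    return payload_kbps + header_kbps


legs = 30_000 * 2                       # two legs per call
per_leg = leg_kbps(64)                  # G.711 payload rate
total_gbps = legs * per_leg / 1_000_000
print(f"{per_leg:.0f} kbps per leg, {total_gbps:.1f} Gbps total")
```

Whichever header assumptions you plug in, the total stays far below what a single modern NIC can carry.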

So why do servers crash at lower call counts? PPS (Packets Per Second) is the real killer.

The Packet Rate Problem

Voice audio is sliced into small chunks. By default, RTP sends voice packets every 20 milliseconds (the standard packetization interval for G.711). A single RTP stream generates 50 packets per second in each direction.

30,000 calls = 60,000 RTP streams (inbound + outbound)
60,000 streams × 50 packets/sec = 3 million packets per second
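The same arithmetic in code makes it easy to see how the packet rate moves with the packetization interval. This is a sketch only: the function name and the choice of intervals are illustrative, and it assumes an interval that divides 1,000 ms evenly.

```python
def packets_per_second(calls: int, legs_per_call: int = 2, ptime_ms: int = 20) -> int:
    """Total RTP packets per second for a given number of concurrent calls."""
    streams = calls * legs_per_call
    return streams * (1000 // ptime_ms)


for ptime in (10, 20, 40):
    print(f"{ptime} ms -> {packets_per_second(30_000, ptime_ms=ptime):,} pps")
```

Doubling the interval halves the packet rate, but at the price of added latency and a larger audio gap per lost packet, so treat it as a tuning knob rather than a fix.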

This is where the CPU bottleneck emerges. Standard Linux networking stacks (and most operating systems) are not architected to handle this packet rate.

Here’s what happens when each RTP packet arrives:

Hardware interrupt: The NIC triggers an interrupt signal to the CPU.
Context switch: The CPU pauses the running process and switches to interrupt handler mode.
Kernel processing: The kernel examines packet headers, performs routing lookups, and applies firewall rules.
User space boundary crossing: If the packet belongs to an application (like RTPProxy or a media processing engine), it must be copied from kernel space into user space memory.
Application processing: The application decodes headers, applies media logic, and may perform transformations.
Return crossing: The processed packet is copied back into kernel space.
Driver transmission: The kernel queues the packet for transmission and notifies the NIC driver.
Context switch out: The CPU returns to the interrupted process.

This “context switching tax” (especially the repeated boundary crossings between kernel space and user space) is what kills your RTP media server optimization efforts.
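To make the boundary crossings concrete, here is a deliberately naive user-space relay step in Python. It is a sketch of the pattern described above, not how RTPProxy or any production relay is written; `relay_one_packet` and its parameters are hypothetical names.

```python
import socket


def relay_one_packet(sock: socket.socket, peer_a: tuple, peer_b: tuple,
                     bufsize: int = 2048) -> int:
    """Forward one datagram from either peer to the other; returns bytes sent."""
    data, src = sock.recvfrom(bufsize)          # syscall 1: kernel -> user copy
    dst = peer_b if src == peer_a else peer_a   # "application logic": pick the far side
    return sock.sendto(data, dst)               # syscall 2: user -> kernel copy
```

Every relayed packet costs at least two system calls and two memory copies here, and that per-packet cost is multiplied by the full packet rate.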

At 3 million packets per second, you’re forcing the CPU to perform millions of these expensive state transitions. You will hit 100% CPU usage and thermal throttling long before you saturate your available bandwidth.
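One way to see why this hurts is to turn the packet rate into a time budget per packet. The sketch below assumes the load spreads perfectly evenly across cores, which real systems rarely achieve; the function name is illustrative.

```python
def ns_per_packet(pps: int, cores: int = 1) -> float:
    """Average CPU time available per packet, in nanoseconds."""
    return cores * 1_000_000_000 / pps


print(round(ns_per_packet(3_000_000)))        # 333 ns on a single core
print(round(ns_per_packet(3_000_000, 16)))    # 5333 ns across 16 cores
```

An interrupt, a context switch, and two copies across the user/kernel boundary all have to fit inside that window for every packet.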

Why Traditional Approaches Fail

Standard Linux can handle approximately 1 million packets per second with optimization, but beyond that, the kernel becomes the bottleneck. A standard server running Linux with a standard NIC driver hits saturation well before reaching multi-million PPS loads at the user-space application level.

This is why a FreeSWITCH instance handling media in user space maxes out at roughly 1,000–3,000 concurrent calls, depending on transcoding requirements. It’s not memory pressure. It’s not disk I/O. It’s pure CPU saturation caused by interrupt and context-switch overhead.

The Fix: Kernel-Bypass Networking (RTPEngine)

To survive high concurrency, you must stop processing media in User Space. You need a solution that pushes packet forwarding down into the OS kernel, bypassing the application layer entirely. This is the architectural pattern that separates scaling from failure.

RTPProxy vs. RTPEngine

The difference between RTPProxy and RTPEngine is fundamentally about where packet processing happens.

RTPProxy (Legacy Architecture):

Runs entirely in User Space as a daemon process.
Receives RTP packets via kernel network stack.
Application decodes RTP headers, examines metadata, applies transformation logic, and re-encodes packets.
Application sends modified packets back to the kernel for transmission.
Great for compatibility and feature richness.
Terrible for scale (every single packet makes the expensive user/kernel boundary crossing).

RTPProxy is adequate for small deployments, but scales linearly with computational cost. To handle 30,000 calls with RTPProxy, you would need dozens of servers running in parallel. The operational and cost overhead becomes astronomical.

RTPEngine (Industry Standard for Scaling):

Hybrid architecture: signaling logic runs in User Space, but media forwarding runs in Kernel Space.
Uses a custom kernel module (xt_RTPENGINE) to intercept and forward RTP packets at wire speed.
Packet processing is offloaded to Netfilter (the Linux kernel’s packet filtering framework).
Scales efficiently with hardware (adding more CPUs and NICs adds nearly linear throughput).

The genius of RTPEngine is that it separates concerns: the application handles the complex, relatively infrequent signaling decisions (which codec? which endpoint? enable recording?), while the kernel handles the high-frequency, simple forwarding (here’s a packet, forward it to destination IP: port).

Here’s how RTPEngine kernel offloading works:

Signaling: The RTPEngine daemon negotiates the ports and logic in User Space.
Handover: Once the call is established, it pushes a forwarding rule down to the kernel using nftables or iptables.
Speed: Subsequent RTP packets are matched by the kernel module inside the Netfilter path and forwarded immediately. They never touch the User Space application.

Result: At 3 million PPS, the CPU load is distributed across the forwarding hardware offload engine (if available) and the kernel’s optimized netfilter code. This is far more efficient for an RTP scaling architecture than user-space processing.
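The split between a rare slow path and a hot fast path can be modelled in a few lines. This is a toy model for intuition only: `ForwardingTable` and its methods are hypothetical and do not reflect RTPEngine’s real data structures or API.

```python
from dataclasses import dataclass, field


@dataclass
class ForwardingTable:
    """Toy stand-in for the kernel-side forwarding table."""
    rules: dict = field(default_factory=dict)
    slow_path_hits: int = 0

    def install(self, local_port: int, dest: tuple) -> None:
        # Slow path: done once per call leg, after signaling completes.
        self.rules[local_port] = dest

    def forward(self, local_port: int, payload: bytes):
        # Fast path: one lookup per packet, no media logic at all.
        dest = self.rules.get(local_port)
        if dest is None:
            self.slow_path_hits += 1   # unknown stream: leave it to the daemon
            return None
        return dest, payload


table = ForwardingTable()
table.install(30000, ("203.0.113.10", 40000))
print(table.forward(30000, b"\x80\x00"))
```

The per-packet work shrinks to a single lookup, while everything expensive happens once per call.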

Feature	RTPProxy	RTPEngine
Architecture	User Space Only	Hybrid (User Space + Kernel)
Packet Processing	Application-level	Kernel-level (xt_RTPENGINE module)
Best For	Small deployments (<1,000 calls)	Large-scale (30,000+ calls)
Scalability	Linear (add servers for more capacity)	Sublinear (scale with hardware efficiently)
CPU Overhead	High (context switching, boundary crossing)	Low (kernel offloading)
Setup Complexity	Simple	More complex (kernel module required)
Codec Support	Excellent (flexible)	Excellent (efficient passthrough)
Cost at 30,000 Calls	Dozens of servers needed	A few
Operational Burden	High	Moderate

RTPEngine Deployment Patterns for 30,000 Calls

For 30,000 concurrent streams, you need multiple RTPEngine instances in a stateless, load-balanced cluster:

8–12 RTPEngine nodes (depending on hardware), each handling thousands of concurrent streams.
Kamailio dispatcher to load-balance RTP allocation decisions across the cluster.
Hardware NIC with SR-IOV (Single Root I/O Virtualization) on each RTPEngine instance to handle the high PPS load.
Optional: DPDK (Data Plane Development Kit) integration for further kernel-bypass (advanced deployments).

RTPEngine nodes share no call state with each other. If one fails, the SIP proxy sends new calls to the remaining nodes, but existing calls on the failed node drop (because its kernel forwarding rules are gone), which is why graceful draining is important.
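Graceful draining comes down to two rules: stop giving a node new calls, and only stop it once its last call has ended. Here is a minimal sketch of that logic; `Node`, `pick_node`, and `safe_to_stop` are hypothetical names, and a real deployment would let the SIP proxy’s own load-balancing do the selection.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    active_calls: int = 0
    draining: bool = False


def pick_node(nodes: list) -> Node:
    """Choose the least-loaded node that is not draining."""
    candidates = [n for n in nodes if not n.draining]
    if not candidates:
        raise RuntimeError("no RTPEngine node available for new calls")
    return min(candidates, key=lambda n: n.active_calls)


def safe_to_stop(node: Node) -> bool:
    """A draining node can be stopped once its last call has ended."""
    return node.draining and node.active_calls == 0
```

Least-loaded selection is just one possible policy; weighted or hash-based distribution would slot into `pick_node` the same way.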

Dedicated Media Servers vs. Media Proxies (For The Tiered Architecture)

Another major architectural flaw is forcing your PBX (Asterisk/FreeSWITCH) to relay media for every call. This is a fundamental misunderstanding of roles.

Why Your PBX Should NOT be Your Media Relay

The engines (Asterisk, FreeSWITCH, etc.) are B2BUAs (Back-to-Back User Agents). They’re designed to:

Decode incoming RTP, extract audio
Analyze the audio (transcoding, voicemail detection, tone detection)
Re-encode and send outgoing RTP
Apply business logic (IVRs, call transfers, voicemail)

This is computationally expensive.

Per-call CPU cost of a B2BUA:

RTP decoding: approximately 5% of a core
Jitter buffer management: approximately 3% CPU
Audio analysis/transcoding: approximately 15–30% CPU (depending on codecs)
Re-encoding: approximately 5% CPU
Memory management, context switching overhead: approximately 10–15% CPU

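Taken at face value, these figures add up to roughly 40–60% of a core for each transcoded call, so a 16-core server running FreeSWITCH saturates at a few dozen concurrent transcoded calls. Scaling to 30,000 calls that way would require on the order of a thousand instances. That’s operationally infeasible.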
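As a back-of-the-envelope check, the per-call percentages listed above can be summed and divided into a 16-core budget. The sketch treats those percentages as exact and additive, which is a simplification; the dictionary keys are illustrative labels.

```python
import math

# Per-call cost in percent of one core, as (low, high), from the list above.
COSTS = {
    "rtp_decoding": (5, 5),
    "jitter_buffer": (3, 3),
    "analysis_transcoding": (15, 30),
    "re_encoding": (5, 5),
    "memory_and_context_switching": (10, 15),
}


def calls_per_server(cores: int, percent_per_call: int) -> int:
    """Whole calls that fit into the combined CPU budget of one server."""
    return (cores * 100) // percent_per_call


low = sum(lo for lo, _ in COSTS.values())
high = sum(hi for _, hi in COSTS.values())
for pct in (low, high):
    per_server = calls_per_server(16, pct)
    instances = math.ceil(30_000 / per_server)
    print(f"{pct}% per call -> {per_server} calls/server, {instances} servers")
```

Even at the optimistic end, the server count lands in the hundreds, which is the practical argument for keeping the PBX out of the media path.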

But here’s the secret: most calls don’t need a B2BUA at all.

The Tiered Architecture Model for RTP Media Server Optimization

You need to implement a tiered architecture with four layers, numbered Tier 0 through Tier 3.

Tier 0 (SIP Signaling Proxy):

Kamailio or OpenSIPS
Handles SIP routing, authentication, and billing records
Stateless (easy to scale horizontally)
Delegates media handling to Tier 1

Tier 1 (RTP Media Proxy – The Edge):

RTPEngine cluster
Stateless relays that don’t understand audio content
Handle NAT traversal, SRTP encryption, and packet reordering
Scale to 30,000+ concurrent streams with modest hardware
Protect your core infrastructure from media load

Tier 2 (Core PBX – The Brain):

FreeSWITCH or Asterisk cluster
Only processes calls that need business logic:
- IVRs (phone tree menus)
- Voicemail recording and retrieval
- Conference bridges
- Call recording (or delegate to a separate recording engine)
- Transcoding (if endpoints negotiate incompatible codecs)
Capacity: 1,000–5,000 concurrent calls per instance (depending on features)

Tier 3 (Specialized Services):

Dedicated recording engines (separate fleet)
Dedicated transcoding engines (separate fleet)
STT/TTS engines (speech-to-text, text-to-speech for IVR)
Each is purpose-built and scales independently
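The routing rule behind these tiers can be stated as a small decision function. This is a sketch under the assumption that recording and transcoding are always delegated to their own pools; `CallNeeds`, `media_path`, and the tier labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CallNeeds:
    ivr: bool = False
    voicemail: bool = False
    conference: bool = False
    recording: bool = False
    transcoding: bool = False


def media_path(needs: CallNeeds) -> list:
    """Return the tiers a call's media should traverse, edge first."""
    path = ["rtpengine"]                      # Tier 1 relays every call
    if needs.ivr or needs.voicemail or needs.conference:
        path.append("pbx")                    # Tier 2 only when call logic is needed
    if needs.recording:
        path.append("recording-pool")         # Tier 3
    if needs.transcoding:
        path.append("transcoding-pool")       # Tier 3
    return path


print(media_path(CallNeeds()))                # ['rtpengine']
print(media_path(CallNeeds(ivr=True)))        # ['rtpengine', 'pbx']
```

A plain two-party call never leaves Tier 1, which is what keeps the expensive tiers small.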

The Transcoding Trap in an RTP Scaling Architecture

Transcoding (converting audio from one codec to another, e.g., Opus to G.711) is the enemy of density. It’s also often unnecessary. Understanding when to transcode (and when to avoid it) is critical for scaling.

The CPU Cost of Transcoding

Transcoding requires:

Decompression: Extract audio samples from the incoming encoded format
Resampling: Convert sample rate if codecs use different rates
Compression: Re-encode into the outgoing format

For codec transformation (e.g., G.729 to G.711), the computational cost increases significantly. The reason is that complex algorithms like G.729 require substantial CPU cycles to decode before the audio is re-encoded in a different format. Resampling adds further cost when the two codecs use different sample rates (for example, 48 kHz Opus to 8 kHz G.711).

At 30,000 calls with even a modest transcode rate, you’d need dedicated hardware or a separate transcoding engine.
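A quick sizing helper shows how fast a “modest” transcode rate turns into real capacity. The 5% rate and the 400 calls per instance used here are placeholders, not measurements; substitute figures from your own load tests.

```python
import math


def transcoding_instances(total_calls: int, transcode_percent: int,
                          calls_per_instance: int) -> int:
    """Instances needed to carry the transcoded share of the traffic."""
    transcoded_calls = math.ceil(total_calls * transcode_percent / 100)
    return math.ceil(transcoded_calls / calls_per_instance)


# Hypothetical inputs: 5% of 30,000 calls, 400 transcoded calls per instance.
print(transcoding_instances(30_000, 5, 400))   # 4
```

The result is very sensitive to the per-instance capacity, which is why that number should come from testing rather than a datasheet.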

Avoiding Transcoding Can Help

The best way to reduce transcoding is to avoid it entirely.

Implement Late Negotiation:

User A sends a SIP INVITE specifying “I support G.711 (PCMU and PCMA) and Opus.”
User B receives it and responds with “I support G.711 and GSM.”
No SIP proxy is forcing a codec choice. The endpoints negotiate directly.
The intersection is G.711. Both support it, so they use it.
Result: No transcode needed.
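The intersection step above is simple enough to express directly. This sketch picks the first codec in the caller’s preference order that the callee also supports; it ignores payload type numbers and format parameters, and `negotiate` is an illustrative name rather than any SIP stack’s API.

```python
def negotiate(offer: list, answerer_supports: list):
    """Return the first offered codec the answerer supports, or None."""
    supported = {codec.lower() for codec in answerer_supports}
    for codec in offer:
        if codec.lower() in supported:
            return codec
    return None   # no overlap: this is the case that needs a transcoding pool


print(negotiate(["PCMU", "PCMA", "opus"], ["PCMU", "GSM"]))   # PCMU
print(negotiate(["opus"], ["PCMU", "GSM"]))                   # None
```

Only the `None` case should ever reach a transcoder.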

Configure your SIP proxy (Kamailio) and PBX (FreeSWITCH) to:

Avoid codec locking: Don’t force a specific codec in the SDP offer. Let endpoints choose.
Pass-through on common codecs: If endpoints agree on G.711 (PCMU or PCMA), configure the PBX to relay audio without processing.
Reserve transcoding pools: When transcoding is truly necessary (e.g., a PSTN gateway uses only G.711, but internal users prefer Opus), delegate it to a separate transcoding engine, not your main PBX.

Ecosmob Expert Tip

The easiest way to unlock massive RTP scale is to treat transcoding as an exception, not the default. Let endpoints negotiate codecs directly and design your flows so that most media stays in passthrough. When transcoding does happen, isolate it on a separate pool instead of burdening your core call logic. This single shift often delivers the biggest jump in call density and stability.

Codec Selection for Maximum Density

So, which codecs support maximum scalability in a large-scale RTP architecture?

Let’s look at when you should choose specific codecs.

G.711

Audio quality: Excellent for voice (narrowband sampling)
Bandwidth: 64 kilobits per second per direction (uncompressed)
CPU overhead: Negligible (simple quantization, minimal computation)
Latency: Minimal
Use case: If you’re scaling to 30,000 calls and bandwidth isn’t a constraint, use G.711.

Opus

Audio quality: Excellent across a wide bandwidth
Bandwidth: 20–40 kilobits per second (highly configurable)
CPU overhead: Moderate
Latency: Low
Use case: When bandwidth is expensive (mobile apps, international calls). Requires dedicated transcoding or peer support.

G.729

Audio quality: Good, but noticeably lower than G.711
Bandwidth: approximately 8 kilobits per second
CPU overhead: Very high due to a complex algorithm
Licensing: The core patents expired in 2017, so per-channel royalties no longer apply
Use case: Only if you absolutely need the bandwidth savings and can afford dedicated DSP hardware
Recommendation for 30,000+ scale: Avoid. The CPU overhead and the transcoding it usually forces make this economically unviable at scale.

GSM

Audio quality: Poor to fair
Bandwidth: approximately 13 kilobits per second
CPU overhead: High
Use case: Legacy mobile phones (rare now)
Recommendation: Deprecated. Don’t design new systems around it.

Codec	Bandwidth	CPU Overhead	Audio Quality	Best Use Case	Scale Suitability
G.711	64 kbps/direction	Negligible	Excellent (voice)	Default choice for all compatible endpoints	⭐⭐⭐⭐⭐ Perfect
Opus	20–40 kbps	Moderate	Excellent (wideband)	Mobile apps, international calls, bandwidth-constrained	⭐⭐⭐⭐ Good
G.729	~8 kbps	Very High	Good (but degraded)	Only with dedicated DSP hardware	⭐⭐ Poor (avoid)
GSM	~13 kbps	High	Poor to Fair	Legacy mobile (deprecated)	⭐ Not recommended
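One detail the table hides is that headers cost the same for every codec. The sketch below adds 40 bytes of IP/UDP/RTP headers at 50 packets per second to each payload rate; the 32 kbps figure for Opus is an assumed midpoint of its range, and Ethernet framing is ignored.

```python
HEADER_KBPS = 40 * 8 * 50 / 1000   # 16 kbps of headers at 50 packets/sec

# Payload rates in kbps; the Opus value is an assumed midpoint of 20-40 kbps.
CODEC_PAYLOAD_KBPS = {"G.711": 64, "Opus": 32, "G.729": 8, "GSM": 13}


def wire_kbps(codec: str) -> float:
    """Approximate IP-layer rate per direction for a codec at 20 ms ptime."""
    return CODEC_PAYLOAD_KBPS[codec] + HEADER_KBPS


for codec in CODEC_PAYLOAD_KBPS:
    total = wire_kbps(codec)
    print(f"{codec}: {total:.0f} kbps on the wire, {HEADER_KBPS / total:.0%} headers")
```

A low-bitrate codec saves bandwidth but not packets: at the same packetization interval, the packet rate is identical for all four.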

Practical codec strategy for 30,000 calls:

Default to G.711 for all calls between compatible endpoints.
Support Opus for endpoints that explicitly request it (high-quality, low-bandwidth scenarios).
Operate a separate Transcoding Pool (2–4 dedicated Asterisk/FreeSWITCH instances) for the rare case where you must bridge G.711 and Opus.
Never advertise G.729 or GSM in new deployments. If legacy endpoints demand it, handle them separately.

This keeps the vast majority of your 30,000 concurrent calls flowing through RTPEngine without any codec transformation, maximizing density and minimizing cost.

Choosing the Right Cloud Specs for High Media Concurrency

When deploying RTP infrastructure for 30,000+ concurrent streams, the instance family you choose matters far more than the instance size. Generic cloud instances aren’t built for the relentless packet-per-second demands of real-time media. You need to prioritize packet processing over raw CPU power. Here’s exactly which instances to pick on each cloud platform.

AWS

Use: c5n.4xlarge or c5n.9xlarge (or c6gn variants)
Why: Network-optimized (ENA + SR-IOV). Can handle high PPS without jitter degradation.
Avoid: T-series, M-series, general-purpose instances

Google Cloud

Use: M2 series with gVNIC, or Tau T2D
Why: High packet processing capability with premium networking options.
Avoid: Standard n2 or e2 series

Azure

Use: Fsv2 series with Accelerated Networking, or D-series with Accelerated Networking
Why: High CPU clock rates + direct NIC access = low jitter.
Avoid: B-series, general-purpose instances without Accelerated Networking

Spot Instances (All Clouds)

Spot Instances are unused cloud capacity that providers (AWS, Google Cloud, Azure) sell at steep discounts. The trade-off is that the provider can reclaim them on short notice (about two minutes on AWS, and as little as 30 seconds on Google Cloud and Azure). Graceful draining keeps new calls off a node that is being reclaimed, but calls still in progress when the notice expires will drop, so match spot capacity to workloads that can tolerate that.

Save: Significant cost reduction through discount pricing
Trade-off: Small risk of interruption (handle with graceful draining)
Best for: Non-mission-critical deployments, testing, internal systems
Avoid for: SLA-critical customer-facing calls
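On AWS, the interruption notice is published through the instance metadata service as a small JSON document with an `action` and a UTC `time`. The sketch below only parses such a document to work out how long a drain has; fetching it from the metadata endpoint is left out, and the sample timestamp is made up.

```python
import json
from datetime import datetime, timezone


def seconds_until_reclaim(instance_action_json: str, now: datetime) -> float:
    """Seconds left before the instance is stopped or terminated."""
    notice = json.loads(instance_action_json)
    deadline = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    return (deadline - now).total_seconds()


# Made-up sample in the shape of an AWS Spot interruption notice.
sample = '{"action": "terminate", "time": "2026-01-01T12:02:00Z"}'
now = datetime(2026, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(seconds_until_reclaim(sample, now))   # 120.0
```

Whatever drain logic you build has to finish inside that window, so the first action on a notice should be to stop accepting new calls.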

Scaling RTP to 30,000+ concurrent streams isn’t about buying bigger servers; it’s about respecting the unique ins and outs of real-time UDP traffic and architecting systems that embrace those constraints rather than fight them.

To handle 30,000+ concurrent streams, you must:

Understand the real bottleneck.
Offload media forwarding to the kernel.
Separate concerns with a tiered architecture.
Choose codecs wisely.
Select network-optimized cloud instances.

If you build your large-scale RTP architecture on these principles, you won’t just support 30,000 calls; you’ll be ready for 100,000, and you’ll do it cost-effectively.

FAQs

What's the difference between jitter and packet loss? Which matters more for RTP?

Both matter, but they hurt differently. Packet loss makes audio choppy, with audible dropouts. Jitter (variation in packet arrival time) creates buffering delays and distortion. At high concurrency, jitter problems emerge first (caused by context switching and hypervisor overhead). This is why SR-IOV and ENA matter: they reduce jitter by letting VMs talk directly to the NIC.
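For reference, RTP defines a running interarrival jitter estimate in RFC 3550: each packet’s transit-time change is folded into the estimate with a gain of 1/16. Here is a minimal sketch of that calculation, assuming arrival times have already been converted to RTP timestamp units; the function name is illustrative.

```python
def interarrival_jitter(arrivals, rtp_timestamps) -> float:
    """RFC 3550 running jitter estimate, in RTP timestamp units.

    arrivals: packet arrival times, converted to timestamp units
    rtp_timestamps: the RTP timestamp carried in each packet
    """
    jitter = 0.0
    for i in range(1, len(arrivals)):
        d = (arrivals[i] - arrivals[i - 1]) - (rtp_timestamps[i] - rtp_timestamps[i - 1])
        jitter += (abs(d) - jitter) / 16
    return jitter


# G.711 at 8 kHz: 20 ms of audio is 160 timestamp units.
on_time = interarrival_jitter([0, 160, 320, 480], [0, 160, 320, 480])
one_late = interarrival_jitter([0, 160, 360, 480], [0, 160, 320, 480])
print(on_time, one_late)
```

Divide the result by the clock rate (8,000 for G.711) to get seconds.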

How do I know if the RTPEngine cluster is about to hit capacity?

Watch three signals: CPU utilization approaching saturation, packet loss appearing in your metrics, and jitter spiking. If you see these signs, you're near capacity. Set up alerting at the warning threshold and gracefully drain or add nodes before you hit a wall.
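Those three signals can be combined into a simple status check. The thresholds below are hypothetical placeholders and should be replaced with values from your own load tests; `Thresholds` and `capacity_status` are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    """Hypothetical warning levels; tune them from your own measurements."""
    cpu_warn: float = 70.0      # percent CPU utilization
    loss_warn: float = 0.5      # percent of packets lost
    jitter_warn: float = 20.0   # milliseconds


def capacity_status(cpu_pct: float, loss_pct: float, jitter_ms: float,
                    t: Thresholds = Thresholds()) -> str:
    """Map the three signals to 'ok', 'warn', or 'drain-or-scale'."""
    breaches = sum([
        cpu_pct >= t.cpu_warn,
        loss_pct >= t.loss_warn,
        jitter_ms >= t.jitter_warn,
    ])
    if breaches == 0:
        return "ok"
    return "warn" if breaches == 1 else "drain-or-scale"
```

Treating two simultaneous breaches as the trigger to act is a design choice; a stricter policy would react to the first one.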

Should I use an RTP proxy, RTPEngine, or dedicated media servers for scaling?

Use RTPEngine as your primary media relay layer, and treat dedicated media servers like FreeSWITCH or Asterisk as feature engines, not routers. RTP proxy-style components that live entirely in user space are fine for small deployments but become inefficient at scale. A tiered architecture (SIP proxy at the edge, RTPEngine for media, and PBX only when logic is needed) is the most sustainable RTP scaling architecture pattern.

Why does my system need a transcoding pool?

A small percentage of calls will always need transcoding due to legacy equipment or PSTN gateways. Rather than burden your main PBX, dedicate a few instances to transcoding. They'll mostly sit idle, which is fine since their job is to be there when you need them.

Which codecs support maximum scalability at high concurrency?

Codecs with minimal CPU overhead and broad interoperability scale best for RTP. G.711 is ideal when bandwidth is affordable, while Opus is a strong choice where bandwidth is constrained but you can control endpoints and transcoding pools. Heavy, legacy codecs like G.729 or GSM reduce density and add complexity, so they should be isolated and never used as the default.

Nikunj Limbachiya

Principal VoIP Solution Analyst

Published on: 31st Dec, 2025

https://www.linkedin.com/in/parmarnikunj/

Nikunj Limbachiya is Principal Solution Analyst and Head of Solution Analyst & UI/UX Practice at Ecosmob, specializing in architecting scalable, secure technology solutions for Telecom, Government, and Enterprise organizations.
