As demand for AI technology grows, models are growing at an unprecedented rate, pushing the boundaries of what current hardware can support. Today’s landscape is dominated by increasingly large and complex models, with ever-larger models being announced, such as NEAR’s 1.4T parameter model. These models require immense computational power and memory. Despite the correlated growth of increasingly powerful GPUs, the traditional approach of vertical scaling is beginning to show its limitations.
This article explores the current state of AI inference, the challenges of vertical scaling, and how horizontal scaling through pipeline parallelism, novel optimization techniques, and hybrid GPU configurations provides a transformative solution as the market trends toward increasingly large and complex models.
AI inference involves running pre-trained models to generate insights, outputs, or predictions. From text and image generation to video interpretation, inference is computationally intensive, requiring high-performance hardware to keep pace with an industry of ever-growing model sizes.
Massive Models Require Massive Resources:
Deepseek R1 671B, with 671 billion parameters, pushes the boundaries of large-scale AI, requiring distributed compute infrastructure to function efficiently.
Models like Llama 3.1 405B have 405 billion parameters, necessitating multiple high-end GPUs for training and inference.
Google’s Switch-C 2048 takes scaling to another level with 1.6 trillion parameters, requiring terabytes of memory and large accelerator clusters for optimal performance.
Computer vision models like Vision Transformers (ViTs) and high-resolution generative models like Stable Diffusion similarly demand significant VRAM, often exceeding 40GB for large-scale deployments.
Vertical Scaling: The Current Solution
Enterprises today rely on powerful clusters of GPUs to handle the ever-growing computational demands of AI inference. These GPUs, such as NVIDIA’s H100 and A100, are either self-hosted in enterprise datacenters or rented from cloud providers like AWS Bedrock, Google Cloud, and Azure AI.
As models continue to grow, so do GPUs. Ever more powerful hardware, such as NVIDIA’s GH200 and the newly unveiled GB200, as well as unified-memory chips such as Apple’s M4, continues to emerge in the market.
NVIDIA GH200:
The GH200 offers up to 10 times higher performance for applications handling terabytes of data. It integrates 96GB of HBM3 memory, delivering a bandwidth of 4TB/s.
An upcoming version with HBM3e memory will increase capacity to 144GB and bandwidth to over 4.9TB/s.
NVIDIA GB200:
Recently unveiled, the GB200 provides a combined memory of 1.7TB, designed to handle the most demanding AI workloads, offering exceptional performance and scalability.
Apple's M4: The latest M4 chip family supports up to 128GB of unified memory with a bandwidth of 546GB/s on the M4 Max.
As models like Deepseek R1 671B and Llama 3.1 405B continue to grow, along with the introduction of newer and larger models, the amount of VRAM required for both training and inference grows rapidly. Larger models demand more memory per layer and require more GPUs to process the increased parameter counts. This trend is pushing vertical scaling to its breaking point:
Physical Limits of Hardware: Although more and more powerful GPUs are being produced, the design of GPUs is approaching practical limits in terms of memory and processing power.
Limited Scalability: Adding ever more powerful GPUs yields diminishing returns; performance does not scale linearly with the additional hardware.
Skyrocketing Costs: Clusters built on GPUs like the NVIDIA GH200 can cost hundreds of thousands of dollars to own, or tens of thousands of dollars per month to rent.
Supply Constraints: Enterprise-grade GPUs are in high demand, often resulting in inflated costs, or limited availability.
Energy Consumption: High-end GPUs consume significant amounts of power, leading to high energy & maintenance costs as well as environmental impacts.
These issues underscore the need for alternative strategies as AI continues to scale. Vertical scaling, while crucial, can no longer keep pace with the growth of AI in the market.
Relying on centralized AI service providers poses significant risks, as evidenced by several notable outages and their widespread impacts:
1. OpenAI's ChatGPT Downtime (June 2024, January 2025): An outage in June 2024 rendered ChatGPT inaccessible for several hours. On January 23, 2025, another outage prevented users from logging in and produced error messages, prompting users to seek alternative services and affecting multiple downstream enterprises. This shift in user behavior demonstrated the potential for customer attrition and highlighted the risks of relying on a sole provider.
2. DeepSeek's Server Outage (January 2025): The free AI chatbot DeepSeek faced 'server is busy' errors, frustrating users and prompting complaints on social media, underscoring the challenges centralized services face in scaling infrastructure to meet growing demand.
3. CrowdStrike-Related IT Outages (July 2024): A faulty update from cybersecurity firm CrowdStrike caused widespread outages, disrupting organizations that relied on its AI-driven cybersecurity solutions. The cascading effects raised concerns over user data and demonstrated how failures tied to a centralized service can ripple across multiple sectors.
4. Amazon Web Services (AWS) Disruptions (Dec 2021): AWS has experienced multiple outages over the years, including a significant one on December 7, 2021, that disrupted services like Disney+, Netflix’s AI recommendation systems, and communication tools. These events illustrate the extensive reach and potential impact of centralized service failures, even on large enterprises.
5. Anthropic's Claude Outages (June 2024): Anthropic's Claude AI Chatbot experienced outages on the morning of June 4, 2024, coinciding with disruptions in ChatGPT and impacting dependent services. Although the cause of the outage was not disclosed, it highlights the risk of simultaneous failures across multiple AI platforms.
6. Perplexity AI Outages (June 2024): Perplexity AI, recognized for its AI-powered search capabilities, experienced service disruptions that same month. The platform displayed messages about reaching its capacity limit, indicating the outage likely resulted from an overload due to high demand. This highlights the critical need for scalable infrastructure to meet the growing market demands.
Implications of Centralized AI Service Dependencies:
Single Point of Failure: Dependence on a sole provider can lead to widespread disruptions if that provider experiences issues.
Operational Risks: Outages can halt business operations, leading to financial losses and reputational damage.
Data Privacy Concerns: Centralized data storage increases the risk of large-scale breaches.
To address the risks posed by outages in centralized AI service providers, such as those outlined above, various mitigation strategies have been employed. One prominent approach is the use of platforms like OpenRouter, which enable routing across multiple AI providers. While this offers a level of redundancy and operational continuity, it also introduces challenges that highlight the limitations of current solutions.
OpenRouter serves as middleware that routes requests dynamically between different AI providers (e.g., OpenAI, Anthropic, and others). In the event of an outage at one provider, requests can be redirected to another, maintaining functionality. Despite its benefits, OpenRouter is essentially a stopgap, and it introduces a significant technical issue: non-unified Key-Value (KV) caches.
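The routing pattern itself is simple to sketch. Below is a minimal failover wrapper in Python (the provider clients and their `complete` method are hypothetical placeholders, not OpenRouter's or any provider's actual API):

```python
import time

class ProviderError(Exception):
    """Raised by a provider client when a request cannot be served."""

def complete_with_failover(prompt: str, providers: list, retries_per_provider: int = 1) -> str:
    """Try each provider in order, falling through to the next on failure."""
    for provider in providers:
        for _ in range(retries_per_provider):
            try:
                return provider.complete(prompt)   # hypothetical client interface
            except ProviderError:
                time.sleep(0.5)                    # brief backoff before retrying
    raise ProviderError("all providers failed")

# Usage (clients are placeholders): complete_with_failover(prompt, [openai_client, anthropic_client])
# Note: any KV cache built up at the first provider does not follow the request
# to the fallback, which is the limitation discussed below.
```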
KV caches store intermediate attention states (the keys and values computed for previously processed tokens) to speed up generation of subsequent tokens, serving as a ‘memory’ for follow-up requests. However, KV caches are not standardized across providers, meaning data cached by one provider cannot be reused by another when requests are rerouted. The result is higher computational cost, increased latency, and lost prior context whenever providers are swapped.
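Conceptually, a KV cache is just per-request state: the attention keys and values already computed for earlier tokens. A minimal sketch, assuming toy lists in place of real tensors, illustrates why losing it is costly:

```python
from dataclasses import dataclass, field

@dataclass
class KVCache:
    """Per-request cache of attention keys/values; one entry per transformer layer."""
    keys: dict = field(default_factory=dict)    # keys[layer] -> list of per-token key vectors
    values: dict = field(default_factory=dict)  # values[layer] -> list of per-token value vectors

    def append(self, layer: int, k, v) -> None:
        # Store the new token's key/value so later tokens can attend to it
        # without re-running earlier layers over the whole prompt.
        self.keys.setdefault(layer, []).append(k)
        self.values.setdefault(layer, []).append(v)

    def context_length(self, layer: int = 0) -> int:
        return len(self.keys.get(layer, []))

# During decoding, each new token attends over everything already in the cache,
# so prompt tokens are processed only once. If a request is rerouted to a different
# provider, this state does not travel with it: the new provider must rebuild the
# cache by re-processing the full prompt, which is the cost described above.
cache = KVCache()
cache.append(layer=0, k=[0.1, 0.2], v=[0.3, 0.4])   # toy vectors in place of real tensors
print(cache.context_length())  # 1
```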
Looking toward the future, shifting from centralized AI providers to decentralized, distributed inference systems will minimize reliance on single points of failure. It would also open the door to a unified, common format for KV caches across providers, allowing cached data to be shared seamlessly. The following sections explore this topic.
As demand for larger, more powerful AI models grows, so too does the need for more scalable and efficient ways to perform inference. Enter pipeline parallelism, a technique that embraces horizontal scaling by distributing model computation across multiple devices.
Pipeline parallelism itself is not novel: it was originally developed to maximize utilization within a single computing system, achieving high throughput by efficiently partitioning and overlapping tasks across multiple GPUs in the same machine. With the advent of larger models and distributed systems, pipeline parallelism has evolved, incorporating more recent optimization techniques that make it an effective tool for scaling AI inference across distributed computing environments.
To understand its significance, let’s dive deeper into what pipeline parallelism is, how it works, and how it can change the landscape of AI inference for the future.
We can picture the current standard of vertical scaling as a single worker, or a single factory, building an entire product from start to finish in one monolithic process. Pipeline parallelism, by contrast, distributes the work across a series of factories: each builds one part of the product, which becomes progressively more complete as it moves down the line in a distributed, efficient workflow.
Pipeline parallelism applies this concept to AI inference:
A model’s computations are divided into sequential stages, with each stage assigned to a different device.
As the input data flows through the pipeline, each device processes its assigned portion before passing the results to the next device in line.
To implement pipeline parallelism, a model’s computational graph (the representation of its operations) is divided into segments. Each segment corresponds to a stage in the pipeline, which is handled by a specific GPU or computational node. Here’s an example:
Input Embeddings: The first GPU processes the input data, such as converting text or images into numerical embeddings.
Hidden Layers: The embeddings are passed to the next GPU, which performs calculations for a subset of the model’s layers.
Output Generation: After flowing through all stages, the final device produces the output, whether it’s text, an image, or a classification.
This sequential processing enables multiple devices to work on different parts of the computation simultaneously, optimizing resource usage and reducing bottlenecks.
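As a rough illustration of the partitioning step, the sketch below slices a stack of layers into contiguous stages and places each stage on its own device (a toy PyTorch example; `layers` stands in for a real model's embedding, transformer blocks, and output head):

```python
import torch
import torch.nn as nn

def partition_into_stages(layers: nn.ModuleList, devices: list) -> list:
    """Split a sequential stack of layers into contiguous pipeline stages, one per device."""
    per_stage = (len(layers) + len(devices) - 1) // len(devices)
    stages = []
    for i, device in enumerate(devices):
        chunk = layers[i * per_stage:(i + 1) * per_stage]
        stages.append(nn.Sequential(*chunk).to(device))
    return stages

def pipeline_forward(x: torch.Tensor, stages: list) -> torch.Tensor:
    """Run one input through every stage, moving activations between devices."""
    for stage in stages:
        x = x.to(next(stage.parameters()).device)   # ship activations to the stage's device
        x = stage(x)
    return x

# Example: a toy 8-layer model split across two devices
# (use "cuda:0", "cuda:1" instead of "cpu" on a multi-GPU machine).
layers = nn.ModuleList([nn.Linear(64, 64) for _ in range(8)])
stages = partition_into_stages(layers, devices=["cpu", "cpu"])
out = pipeline_forward(torch.randn(1, 64), stages)
```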
When horizontally scaling an AI inference system using pipeline parallelism across multiple GPUs, the benefits are amplified, especially for large-scale models that cannot fit or process efficiently on a single machine.
Support for Large Models: Pipeline parallelism splits the model into stages distributed across multiple GPUs, allowing inference of models that exceed the memory and compute capacity of any single GPU or machine.
Dynamic Expansion: Additional GPUs can be integrated into the pipeline to handle increasing workloads or deploy more model partitions.
Elastic Workload Distribution: Pipeline stages can be adjusted dynamically to balance workloads across GPUs, ensuring that no single GPU becomes a bottleneck.
Modular Design: Changes to one stage of the pipeline (e.g. updating a layer or swapping hardware) can be made without affecting the entire system.
Node-Level Redundancy: Multiple GPUs can be allocated to the same pipeline stage, ensuring that a failure in one GPU doesn’t halt the entire stage. Input data and intermediate activations can additionally be replicated across nodes, reducing the risk of data loss during inference.
Increased Uptime: Idle or underutilized GPUs can act as backups, ready to take over when an active GPU fails, acting as hot standby nodes.
Graceful Degradation: In a multi-GPU setup, failures in one pipeline stage can be mitigated by redistributing tasks to other GPUs or reconfiguring the pipeline dynamically (e.g. Using high-speed interconnects like NVLink or Infiniband ensures that failures in one communication path can be bypassed through alternate routes).
Reduced Impact of Node Failures: Horizontal scaling ensures redundancy, preventing single GPU or machine failures from causing complete system downtime, as failures are isolated within a single stage.
Efficient Use of Resources: Distributed pipeline systems can use GPUs with smaller memory capacity, reducing the need for expensive high-memory devices.
Hybrid GPU Integration: High-end GPUs with large VRAM capacities, like the NVIDIA GH200, are expensive and often in short supply. Pipeline parallelism supports combining enterprise-grade GPUs (e.g. NVIDIA GH200) and consumer-grade GPUs (e.g. RTX 3090), balancing cost and performance in hybrid setups.
Parallel Execution: Stages of the pipeline operate concurrently, processing multiple input batches in parallel, which significantly increases overall throughput (see the schedule sketch after this list).
Optimized GPU Utilization: Each GPU focuses on specific parts of the model, ensuring all devices are used efficiently and consistently.
Reduced Bottlenecks: By breaking the model into smaller pipeline stages, each GPU handles a fraction of the total computation, reducing the time per stage and overall latency for batch inference.
Overlap of Computation and Communication: Pipeline parallelism allows concurrent data transfer and computation, hiding communication delays and minimizing idle time.
Optimized Workload Balancing: Each GPU operates at its most efficient load, minimizing unnecessary power consumption.
Selective Activation: Idle GPUs can remain powered down until required, reducing energy use during low-demand periods.
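The "Parallel Execution" point above is easiest to see with a toy schedule: once the model is split into stages, new micro-batches can enter stage 0 while earlier micro-batches are still being processed downstream. The following sketch simulates a simple GPipe-style schedule (an illustration, not a production scheduler):

```python
# Toy GPipe-style schedule: with M micro-batches and S stages, stage s works on
# micro-batch m during time step (m + s), so different stages process different
# micro-batches at the same time instead of sitting idle.
NUM_STAGES, NUM_MICROBATCHES = 4, 6

for step in range(NUM_STAGES + NUM_MICROBATCHES - 1):
    active = [(s, step - s) for s in range(NUM_STAGES) if 0 <= step - s < NUM_MICROBATCHES]
    busy = ", ".join(f"stage {s} -> micro-batch {m}" for s, m in active)
    print(f"t={step}: {busy}")

# Pipeline efficiency: fraction of stage-steps doing useful work.
total_steps = NUM_STAGES + NUM_MICROBATCHES - 1
print(f"utilization ~ {NUM_STAGES * NUM_MICROBATCHES / (NUM_STAGES * total_steps):.0%}")
```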
With these benefits in mind, let’s dive further into how horizontal scaling can affect the landscape of AI Inference.
The rise of massive AI models like Llama 3.1 (405B parameters) and Deepseek R1 671B has highlighted the growing need for efficient and scalable inference solutions. Traditionally, these workloads have been reserved for enterprise-grade GPUs like NVIDIA’s GH200 or H100. However, pipeline parallelism introduces an exciting opportunity: leveraging consumer-grade GPUs, or hybrid approaches that combine consumer and enterprise-grade hardware, to achieve high performance at a fraction of the cost.
Consumer-grade GPUs, such as NVIDIA’s RTX 3090 (24GB VRAM), offer impressive computational power at significantly lower costs than their enterprise counterparts. While these GPUs were not initially designed for multi-device AI workloads, pipeline parallelism makes it possible to use them effectively by distributing the workload across multiple devices.
Affordability:
Consumer GPUs are often 5-10x cheaper than enterprise GPUs with comparable raw performance.
Availability:
Consumer GPUs are widely available, making them a practical choice for organizations with budget constraints.
Hybrid Potential:
Combining consumer-grade GPUs with enterprise-grade GPUs allows for cost-effective scaling while retaining high-end capabilities for bottleneck stages.
Let’s explore how pipeline parallelism enables these possibilities by comparing costs, latency, and throughput.
Below is an example of different configurations an enterprise might use with hybrid pipeline parallelism. Many enterprises currently run H100 clusters, so we will use that as a baseline and compare two very large models. Below are some theoretical estimates of potential hybrid setups.
Calculations:
Llama 3.1 405B (8-bit) requires ~486GB of GPU memory. [Source]
8 x 80GB = 640GB (~31% overhead)
24 x 24GB = 576GB (~18% overhead)
4 x 80GB + 12 x 24GB = 608GB (~25% overhead)
Deepseek R1 671B requires ~1,342GB of GPU memory. [Source]
20 x 80GB = 1,600GB (~19% overhead)
64 x 24GB = 1,536GB (~14% overhead)
10 x 80GB + 32 x 24GB = 1,568GB (~17% overhead)
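As a quick sanity check, the overhead figures above can be reproduced with a short script (a minimal sketch using the GPU counts and memory estimates listed above):

```python
# Rough capacity/overhead check for the configurations above.
CONFIGS = {
    "Llama 3.1 405B (8-bit)": {
        "required_gb": 486,
        "setups": {
            "Enterprise (8 x H100 80GB)": 8 * 80,
            "Consumer (24 x RTX 3090 24GB)": 24 * 24,
            "Hybrid (4 x H100 + 12 x RTX 3090)": 4 * 80 + 12 * 24,
        },
    },
    "Deepseek R1 671B": {
        "required_gb": 1342,
        "setups": {
            "Enterprise (20 x H100 80GB)": 20 * 80,
            "Consumer (64 x RTX 3090 24GB)": 64 * 24,
            "Hybrid (10 x H100 + 32 x RTX 3090)": 10 * 80 + 32 * 24,
        },
    },
}

for model, info in CONFIGS.items():
    required = info["required_gb"]
    print(model)
    for name, capacity in info["setups"].items():
        overhead = (capacity - required) / required * 100   # spare memory as % of requirement
        print(f"  {name}: {capacity}GB total, ~{overhead:.0f}% overhead")
```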
| Configuration | Llama 3.1 405B (8-bit) | Deepseek R1 671B |
| --- | --- | --- |
| Enterprise | 8 x H100 (80GB VRAM) | 20 x H100 (80GB VRAM) |
| Consumer | 24 x RTX 3090 (24GB VRAM) | 64 x RTX 3090 (24GB VRAM) |
| Hybrid | 4 x H100 + 12 x RTX 3090 | 10 x H100 + 32 x RTX 3090 |
H100 cost per unit: $27,988 [Source]
RTX 3090 cost per unit: $1,790 [Source]
| Configuration | Llama 3.1 405B (8-bit) | Deepseek R1 671B |
| --- | --- | --- |
| Enterprise | $223,904 (8 x $27,988) | $559,760 (20 x $27,988) |
| Consumer | $42,960 (24 x $1,790) | $114,560 (64 x $1,790) |
| Hybrid | $133,432 (4 x $27,988 + 12 x $1,790) | $337,160 (10 x $27,988 + 32 x $1,790) |
H100 cost per unit per hour: $2.13 [Source]
RTX 3090 cost per unit per hour: $0.18 [Source]
At an assumed 720 usage hours per month (24 hours/day x 30 days):
| Configuration | Llama 3.1 405B (8-bit) | Deepseek R1 671B |
| --- | --- | --- |
| Enterprise | $12,268.80 (8 x $2.13 x 720) | $30,672.00 (20 x $2.13 x 720) |
| Consumer | $3,110.40 (24 x $0.18 x 720) | $8,294.40 (64 x $0.18 x 720) |
| Hybrid | $7,689.60 (4 x $2.13 x 720 + 12 x $0.18 x 720) | $19,483.20 (10 x $2.13 x 720 + 32 x $0.18 x 720) |
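The purchase and rental figures above follow directly from the per-unit prices; a short script makes the arithmetic explicit (the prices and the 720-hour month are the assumptions stated above):

```python
# Back-of-envelope hardware and rental costs for the configurations above.
H100_PRICE, RTX3090_PRICE = 27_988, 1_790      # purchase cost per unit (USD)
H100_HOURLY, RTX3090_HOURLY = 2.13, 0.18       # rental cost per unit-hour (USD)
HOURS_PER_MONTH = 24 * 30                      # assumed 720 usage hours per month

def costs(h100s, rtx3090s):
    purchase = h100s * H100_PRICE + rtx3090s * RTX3090_PRICE
    monthly_rental = (h100s * H100_HOURLY + rtx3090s * RTX3090_HOURLY) * HOURS_PER_MONTH
    return purchase, monthly_rental

for label, (h, r) in {
    "Enterprise (Llama 405B)": (8, 0),
    "Consumer (Llama 405B)": (0, 24),
    "Hybrid (Llama 405B)": (4, 12),
    "Enterprise (R1 671B)": (20, 0),
    "Consumer (R1 671B)": (0, 64),
    "Hybrid (R1 671B)": (10, 32),
}.items():
    purchase, monthly = costs(h, r)
    print(f"{label}: purchase ${purchase:,.0f}, rental ${monthly:,.2f}/month")
```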
Llama 3.1 405B (8-Bit): [Source]
Using the source data, we find that a pipeline-parallel H200 setup achieves 764 tokens/second.
H200s are approximately 50% more performant than H100s: the H200's HBM3e memory bandwidth is 4.8 TB/s versus 3.35 TB/s on the H100 (~1.43x faster memory access). We therefore estimate an H100 setup at roughly 66.67% (1/1.5) of the H200 throughput.
764 tokens/sec x 66.67% = 509.33 tokens/sec
Deepseek R1 671B: [Source]
Using the source data, we find that a pipeline-parallel H200 setup achieves 3872 tokens/second.
3872 tokens/sec × 66.67% = 2581.33 tokens/sec
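Writing the scaling step out explicitly (the 1/1.5 factor is the H200-vs-H100 assumption stated above):

```python
# Scale measured H200 pipeline-parallel throughput down to an H100 estimate,
# assuming an H100 delivers ~1/1.5 (about 66.67%) of H200 throughput.
H200_TO_H100 = 1 / 1.5

llama_h200_tps = 764       # tokens/sec on the H200 setup (from the source above)
deepseek_h200_tps = 3872

print(f"Llama 3.1 405B on H100s: ~{llama_h200_tps * H200_TO_H100:.0f} tok/sec")       # ~509
print(f"Deepseek R1 671B on H100s: ~{deepseek_h200_tps * H200_TO_H100:.0f} tok/sec")  # ~2581
```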
| Configuration | Llama 3.1 405B (8-bit) throughput (tok/sec) | Latency | Monthly cost (USD) | Deepseek R1 671B throughput (tok/sec) | Latency | Monthly cost (USD) |
| --- | --- | --- | --- | --- | --- | --- |
| Enterprise | ~510 | ~250ms | $12,268.80 | ~2580 | ~50ms | $30,672.00 |
| Consumer | ~340 | ~380ms | $3,110.40 | ~1700 | ~75ms | $8,294.40 |
| Hybrid | ~435 | ~295ms | $7,689.60 | ~2200 | ~60ms | $19,483.20 |



While pipeline parallelism allows AI inference to scale horizontally, it comes with a significant challenge—inter-node communication bandwidth. Unlike vertical scaling, where all computations happen within a single high-memory GPU, distributed inference requires frequent data transfers between devices, which can quickly become a bottleneck.
Minimizing these transfers is crucial for efficiency. Techniques like activation checkpointing, fused communication, and tensor rematerialization help manage memory and bandwidth overhead by strategically controlling how and when data moves between nodes. Even with these optimizations, however, factors like network topology, PCIe/NVLink speeds, and interconnects such as InfiniBand can limit overall performance.
As models grow, the balance between compute efficiency and communication overhead becomes increasingly difficult. High-bandwidth, low-latency connections are essential to maintaining smooth inference, but the scalability of pipeline parallelism will always be constrained by how effectively inter-node communication is handled.
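To make the communication cost concrete, here is a rough per-token estimate of the activation traffic between two adjacent pipeline stages (an illustrative sketch; the hidden size, batch size, and link bandwidths are assumed values, not measurements from any of the setups above):

```python
# Estimate the time to ship activations between adjacent pipeline stages.
# Assumed values for illustration only.
hidden_size = 16_384        # model hidden dimension (a Llama-405B-class model)
batch_size = 8              # concurrent sequences
bytes_per_value = 2         # fp16/bf16 activations

# Activations handed to the next stage per generated token:
payload_bytes = hidden_size * batch_size * bytes_per_value   # ~256 KB

links = {"NVLink 4 (~900 GB/s)": 900e9,
         "InfiniBand NDR (~50 GB/s)": 50e9,
         "10 GbE (~1.25 GB/s)": 1.25e9}

for name, bandwidth in links.items():
    transfer_us = payload_bytes / bandwidth * 1e6
    print(f"{name}: ~{transfer_us:.1f} us per token hop")
```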
Balanced Cost-Performance Tradeoff:
Hybrid setups combine enterprise GPUs (e.g., NVIDIA GH200, H100) and consumer GPUs (e.g., NVIDIA RTX 3090), balancing the high upfront and operational costs of enterprise solutions with the affordability of consumer-grade hardware.
This configuration provides a cost-efficient way to scale horizontally without compromising significantly on performance.
Increased Reliability and Fault Tolerance:
By leveraging multiple GPUs across consumer and enterprise tiers, hybrid systems reduce the risk of single points of failure, ensuring higher availability and redundancy for mission-critical applications.
If a GPU or node fails, workloads can be dynamically reallocated to maintain continuity, providing strong fault tolerance.
Competitive Throughput with Optimized Parallelism:
Hybrid setups benefit greatly from pipeline parallelism and other distributed inference optimizations, achieving competitive throughput (tokens/sec) compared to fully enterprise configurations. With proper parallelism, hybrid setups may even have improved throughput.
These optimizations mitigate the bottlenecks typically associated with consumer GPUs, such as memory limitations and interconnect latency.
Scalability Across Diverse Workloads:
Hybrid setups provide flexibility in scaling horizontally to accommodate growing model sizes and inference demands.
They are particularly effective for mid-to-large scale models like Llama 3.1 405B and Deepseek R1 671B.
Future-Proofing for AI Workflows:
As model sizes continue to grow, hybrid setups provide a scalable and adaptable architecture that can evolve with technological advancements in both consumer and enterprise hardware.
They enable organizations to experiment with state-of-the-art models without committing fully to high-cost enterprise solutions.
Hybrid setups represent a pragmatic approach to horizontally scaling AI inference, making high-performance AI accessible to a wider range of organizations while optimizing for cost, latency, and throughput.
Function Network enables this scalability by providing a decentralized AI infrastructure, allowing models to run efficiently across distributed compute resources with seamless optimization for performance and cost. To learn more about how Function Network facilitates efficient decentralized AI inference, check out our deep dive on how we're tackling distributed AI Inference.