[Paper Review] Towards Understanding, Analyzing, and Optimizing Agentic AI Execution: A CPU-Centric Perspective

2026. 5. 6. 19:39 · ComputerScience/Computer Architecture

 

 

 


https://arxiv.org/abs/2511.00739

 


A paper I came across while reading articles about the "CPU renaissance".

 

 

 

https://cruciblecapital.substack.com/p/cpus-are-the-bottleneck-and-branching

 

CPUs are the Bottleneck - and Branching is the Unlock

Why VERS (HDR Research) sits at a durable leverage point of the agentic compute stack


 

 

 

 

Abstract 

 Agentic AI serving converts monolithic LLM-based inference into autonomous problem-solvers that can plan, call tools, perform reasoning, and adapt on the fly. Due to diverse task execution needs, such serving relies heavily on heterogeneous CPU–GPU systems, with the majority of the external tools responsible for agentic capability either running on, or orchestrated by, the CPU. Towards a deeper understanding of its role, this paper aims to characterize and analyze the system bottlenecks introduced by agentic AI workloads from a largely overlooked CPU-centric perspective.

 We first present a compile-time characterization of agentic AI execution and choose representative workloads to capture the algorithmic diversity. We then perform runtime characterization of the representative workloads, analyzing end-to-end latency and throughput on two different hardware systems to isolate the respective architectural bottlenecks. Based on the insights into these bottlenecks, we finally present two scheduling optimizations, namely (1) CPU-Aware Overlapped Micro-Batching (COMB) and (2) Mixed Agentic Scheduling (MAS), for homogeneous and heterogeneous agentic workloads, respectively. Specifically, these methods optimize for improved CPU-GPU concurrent utilization while reducing skewed resource allocation for heterogeneous execution.

 Experimental evaluations on the two hardware systems demonstrate the efficacy of COMB in yielding up to 1.7× lower P50 latency in standalone homogeneous workload execution and up to 3.9×/1.8× lower service/total latency under homogeneous open-loop load. Additionally, for heterogeneous open-loop load, MAS can reduce the total latency for the minority request type by up to 2.37×/2.49× at the P50/P90 percentiles.
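To make the COMB idea concrete for myself: the gist is that while the CPU runs the tool stage for one micro-batch, the GPU can already serve the next micro-batch. Below is a toy simulation of that overlap I wrote, not the paper's implementation — `gpu_infer`, `cpu_tool`, and the sleep-based timings are all made up:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Toy stand-ins for the two pipeline stages (NOT the paper's code):
# gpu_infer simulates LLM inference on one micro-batch, cpu_tool
# simulates the CPU-side tool call on that batch's output.
def gpu_infer(batch):
    time.sleep(0.05)  # pretend GPU work
    return f"llm_out_{batch}"

def cpu_tool(llm_out):
    time.sleep(0.05)  # pretend CPU tool work
    return f"tool_out_{llm_out}"

def sequential(batches):
    # Baseline: GPU and CPU stages strictly alternate.
    return [cpu_tool(gpu_infer(b)) for b in batches]

def overlapped(batches):
    # COMB-style overlap: while the CPU pool runs the tool for
    # micro-batch i, the main thread already runs GPU inference
    # for micro-batch i+1.
    results = []
    with ThreadPoolExecutor(max_workers=1) as cpu_pool:
        pending = None
        for b in batches:
            out = gpu_infer(b)                        # GPU stage
            if pending is not None:
                results.append(pending.result())      # collect previous tool
            pending = cpu_pool.submit(cpu_tool, out)  # CPU stage, async
        results.append(pending.result())
    return results

if __name__ == "__main__":
    batches = list(range(4))
    t0 = time.perf_counter(); seq = sequential(batches); t_seq = time.perf_counter() - t0
    t0 = time.perf_counter(); ovl = overlapped(batches); t_ovl = time.perf_counter() - t0
    assert seq == ovl
    print(f"sequential {t_seq:.2f}s vs overlapped {t_ovl:.2f}s")
```

With these fake stage times, the overlapped version hides most of the CPU stage behind the GPU stage, which is the latency win the paper reports at a much larger scale.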


 More importantly, CPU-parallelization strategies often exhibit lower efficiency than their GPU counterparts, prematurely saturating throughput and thereby reducing GPU utilization. This necessitates that CPU execution be carefully optimized to improve execution latency for agentic workloads.

 

 In our experiments, we chose a CPU-based lexical summarizer (LexRank [18]) instead of an LLM-based summarizer. The lexical summarizer helps reduce hallucination [44] while improving domain accuracy [21]. Refer to Appendix A for more workload details. -> I'm skeptical about this part..
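For context on what a lexical summarizer actually does: LexRank (Erkan & Radev, 2004) scores sentences by eigenvector centrality on a sentence-similarity graph and extracts the top-ranked ones — purely CPU work, no model inference. A minimal dependency-free sketch of the idea (my own illustration with plain bag-of-words cosine; the paper presumably uses a real LexRank implementation [18]):

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def lexrank(sentences, top_k=1, damping=0.85, iters=50):
    # Build the sentence-similarity graph.
    bags = [Counter(s.lower().split()) for s in sentences]
    n = len(sentences)
    sim = [[cosine(bags[i], bags[j]) for j in range(n)] for i in range(n)]
    row_sum = [sum(row) for row in sim]
    # PageRank-style power iteration over the row-normalized graph.
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[j] * sim[j][i] / row_sum[j]
                            for j in range(n) if row_sum[j] > 0)
            for i in range(n)
        ]
    # Return the top_k most central sentences in document order.
    order = sorted(range(n), key=lambda i: -scores[i])
    return [sentences[i] for i in sorted(order[:top_k])]
```

This also hints at why the authors' choice bugs me less on the systems side: the summary quality debate aside, this kind of graph centrality is exactly the CPU-compute-heavy tool that makes the workload CPU-bound.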

 

Key Takeaway 1: Tool processing on CPUs can take a significant chunk of the E2E latency, motivating a CPU-centric optimization strategy. Moreover, a system with a high-performance (HP) CPU and a low-performance (LP) GPU can match a system with an HP GPU in E2E latency on such tool-dominated agentic AI workloads, motivating cost-effective agentic AI deployments.

 

Key Takeaway 2: An HP GPU system can shift the bottleneck from GPU to CPU when tool execution latency is comparable to LLM inference latency, making it more CPU-bound than a system with an LP GPU and motivating system-aware optimization strategies.


 CPU Parallelism Choice for Agentic Workloads. We analyze the tradeoff between multi-processing (MP) and multi-threading (MT) CPU parallelism strategies. MT has lower memory usage since all threads share the same address space, whereas MP requires independent memory for each process. Since ENNS retrieval has very high memory usage, we use MT for the RAG (Haystack) workload. The MT approach is also lightweight, incurring lower creation and switching overhead than MP, so it works better for I/O workloads. Therefore, we select MT for Toolformer as it contains an I/O tool, i.e., the WolframAlpha API.
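The MT-for-I/O argument in Python terms: during a blocking network call the GIL is released, so threads overlap the waits almost perfectly. A small sketch with a sleep-based stand-in for an API call like WolframAlpha (`fake_api_call` and the latency numbers are hypothetical, just to show the scaling):

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for an I/O-bound tool call such as the
# WolframAlpha API in Toolformer: the thread just waits on "network"
# latency, releasing the GIL, so multi-threading (MT) scales well here.
def fake_api_call(query):
    time.sleep(0.05)  # simulated network round-trip
    return f"answer:{query}"

queries = [f"q{i}" for i in range(8)]

t0 = time.perf_counter()
serial = [fake_api_call(q) for q in queries]
t_serial = time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    threaded = list(pool.map(fake_api_call, queries))
t_threads = time.perf_counter() - t0

print(f"serial {t_serial:.2f}s, 8 threads {t_threads:.2f}s")
```

Eight serial calls pay eight round-trips; eight threads pay roughly one — and since the threads spend the whole time blocked, the cheap-to-create MT pool is the right tool, matching the paper's choice for Toolformer.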

 For CPU-compute-intensive tools, including LexRank summarization, Bash/Python execution, and RDKit conformer generation, MT is ineffective due to Python's Global Interpreter Lock (GIL) and cannot attain true multi-core performance. Therefore, we choose the MP approach for the Web-Augmented Agent (LangChain), SWE-Agent, and ChemCrow workloads. We further quantify the GIL bottleneck of MT by comparing it with the MP approach for the Web-Augmented Agent on Sys2 in Appendix B. Notably, CPU throughput on multi-core systems can saturate well before all cores are busy. For instance, a study [6] shows that a dual-socket Haswell node reaches >80% of peak bandwidth on the STREAM benchmark [45] with only four processes per socket. If the number of parallel processes exceeds the available cores (over-subscription [28]), OS scheduler contention and context-switching overheads dominate.

 

 

 

Key Takeaway 3: CPU-parallelization strategies fundamentally exhibit lower efficiency than GPU ones. In agentic AI workloads, they prematurely saturate throughput, subsequently bottlenecking the system and degrading the utilization of costly GPU resources.


Conclusions 

 Agentic AI shifts the system bottleneck from monolithic LLM inference toward CPU-resident tool execution and orchestration. In this work, we characterize representative agentic workloads from a CPU-centric perspective and show that these workloads exhibit CPU latency and throughput bottlenecks. To tackle these bottlenecks, we introduce COMB and MAS, two scheduling techniques for homogeneous and heterogeneous agentic workloads, respectively. Together, these optimizations yield improved CPU-GPU concurrent utilization while reducing skewed resource allocation for heterogeneous execution.