SohyeonKim(401)
[Paper Review] Towards Understanding, Analyzing, and Optimizing Agentic AI Execution: A CPU-Centric Perspective
Towards Understanding, Analyzing, and Optimizing Agentic AI Execution: A CPU-Centric Perspective https://arxiv.org/abs/2511.00739 Agentic AI serving converts monolithic LLM-based inference into autonomous problem-solvers that can plan, call tools, perform reasoning, and adapt on the fly. Due to diverse ..
19:39:45 -
[Paper Review] Processing in Memory: The Terasys Massively Parallel PIM Array
Processing in Memory: The Terasys Massively Parallel PIM Array (PACT'95) The notion of computing in memory has been with us for several decades. For example, Stone proposed a logic-in-memory computer consisting of an enhanced cache memory array that serves as a high-speed buffer between CPU and conventional memory. More recently, a group at the University of Toronto has designed a computati..
2026.04.22 -
[Paper Review] DFTL: A Flash Translation Layer Employing Demand-based Selective Caching of Page-level Address Mappings
DFTL: A Flash Translation Layer Employing Demand-based Selective Caching of Page-level Address Mappings (ASPLOS'09) Abstract Recent technological advances in the development of flash memory based devices have consolidated their leadership position as the preferred storage media in the embedded systems market and opened new vistas for deployment in enterprise-scale storage systems. Unlike hard ..
2026.04.22 -
[Paper Review] Hitting the Memory Wall: Implications of the Obvious
Hitting the Memory Wall: Implications of the Obvious (1994) This brief note points out something obvious—something the authors “knew” without really understanding. With apologies to those who did understand, we offer it to those others who, like us, missed the point. We all know that the rate of improvement in microprocessor speed exceeds the rate of improvement in DRAM memory speed; each is imp..
2026.04.17 -
[Paper Review] Near-Memory Computing: Past, Present, and Future
Near-Memory Computing: Past, Present, and Future The conventional approach of moving data to the CPU for computation has become a significant performance bottleneck for emerging scale-out data-intensive applications due to their limited data reuse. At the same time, the advancement in 3D integration technologies has made the decade-old concept of coupling compute units close to the memory — call..
2026.04.17 -
[Paper Review] QuCo: Efficient and Flexible Hardware-Driven Automatic Configuration of Tile Transfers in GPUs
QuCo: Efficient and Flexible Hardware-Driven Automatic Configuration of Tile Transfers in GPUs (HPCA'26) Abstract The growing complexity and parallelism demands of modern GPU workloads have driven architectural innovations toward asynchronous tile transfers (ATTs) to overlap computation and data movement. While ATT units such as NVIDIA's Tensor Memory Accelerator (TMA) introduce high-thro..
2026.04.03 -
[Interconnection Networks] Chap24. Simulation
Interconnection Networks, Simulation Simulation is a double-edged sword — while it can provide excellent models of complex network designs, simulators and simulations are equally complex. To that end, the quality of simulation results is only as good as the methodology used to generate and measure these results. In this chapter, we address the basics of simulation input, measurement, and design...
2026.03.25 -
[CUDA] Chap11. Prefix sum (scan)
Chap11. Prefix sum (scan) Our next parallel pattern is prefix sum, which is also commonly known as scan. Parallel scan is frequently used to parallelize seemingly sequential operations, such as resource allocation, work assignment, and polynomial evaluation. In general, if a computation is naturally described as a mathematical recursion in which each item in a series is defined in terms of the p..
2026.03.24 -
[CUDA] Chap10. Reduction
Chap10. Reduction A reduction derives a single value from an array of values. The single value could be the sum, the maximum value, the minimal value, and so on among all elements. The value can also be of various types: integer, single-precision floating-point, double-precision floating-point, half-precision floating-point, characters, and so on. All these types of reductions have the same comp..
2026.03.24 -
[CUDA] Chap9. Parallel histogram
Chap9. Parallel histogram In practice, whenever there is a large volume of data that needs to be analyzed to distill interesting events, histograms are likely used as a foundational computation. Note that multiple threads need to update the same counter (m-p), which is a conflict that is referred to as output interference. Programmers must understand the concepts of race conditions and atomic o..
2026.03.24