[Paper Review] ZipLLM: Efficient LLM Storage via Model-Aware Synergistic Data Deduplication and Compression

2026. 5. 7. 19:34 · ComputerScience/Computer Architecture

 

 

 

ZipLLM: Efficient LLM Storage via Model-Aware Synergistic Data Deduplication and Compression (NSDI'26)

https://www.usenix.org/conference/nsdi26/presentation/wang-zirui

 


 

 

 

Abstract

 Modern model hubs, such as Hugging Face, store tens of petabytes of LLMs, with fine-tuned variants vastly outnumbering base models and dominating storage consumption. Existing storage reduction techniques—such as deduplication and compression—are either LLM-oblivious or not compatible with each other, limiting data reduction effectiveness.

 

 Our large-scale characterization study across all publicly available Hugging Face LLM repositories reveals several key insights: (1) fine-tuned models within the same family exhibit highly structured, sparse parameter differences suitable for delta compression; (2) bitwise similarity enables LLM family clustering; and (3) tensor-level deduplication is better aligned with model storage workloads, achieving high data reduction with low metadata overhead.

 

 Building on these insights, we design BitX, an effective, fast, lossless delta compression algorithm that compresses the XORed difference between fine-tuned and base LLMs. We build ZipLLM, a model storage reduction pipeline that unifies tensor-level deduplication and lossless BitX compression. By synergizing deduplication and compression around LLM family clustering, ZipLLM reduces model storage consumption by 54%, over 20% higher than state-of-the-art deduplication and compression approaches.

 

 

 

 

 

 

Introduction

 Large language models (LLMs) have become foundational tools in modern artificial intelligence (AI). With the rapid progress in open-source LLM development [35, 45, 47–49], millions of LLMs are now publicly available through model hubs such as Hugging Face [21] and TensorFlow Hub [26]. These platforms support uploads, downloads, and sharing of base models and fine-tuned variants, enabling users to adapt models to diverse downstream tasks with minimal effort.

 

 This trend has led to an explosion in the number of hosted models. As shown in Figure 1, Hugging Face alone hosts over 14 petabytes (PB) of models (as of Q1 2025), with storage volume growing exponentially, posing a serious threat to the sustainability of machine learning (ML) infrastructure.

 Two observations underscore this challenge. First, fine-tuned LLMs vastly outnumber base models and contribute disproportionately to the overall storage footprint, despite being only slight modifications of their bases. Second, LLM storage is dominated by two floating-point formats: BF16 and FP32. While FP32 is popular in terms of model count (often in smaller models such as those for computer vision), BF16 accounts for the majority of total LLM storage size. These trends highlight the need to prioritize LLM-specific storage patterns in future optimizations. (Hence these two formats are the targets of this work.)

 

 We collect all public LLM repositories from Hugging Face (cutoff date: March 2025) and conduct a first-of-its-kind, large-scale study focusing on LLM storage. Our analysis leads to the following insights:

 

• Element-wise weight deltas are small and structured within LLM model families. Fine-tuned models derived from the same base exhibit tiny differences, making them ideal for lossless delta compression.

 

• Bitwise similarity enables LLM clustering and lineage tracking. Bit distance, a new metric that we propose, based on the bitwise Hamming distance, serves as a lightweight, robust signal for identifying LLM families and potentially supporting applications like model provenance, duplicate detection, and clustering.

 

• Chunk-based deduplication is LLM-oblivious and sub-optimal for modern model storage. Chunk-level deduplication, such as content-defined chunking (CDC) [53,56,82], operates on raw byte streams without LLM structure awareness, resulting in the loss of crucial information needed for effective model-aware compression. It also scales poorly with storage capacity.

 

• Model-aware, tensor-level deduplication is well-suited for LLM-aware lossless compressors, offering reasonable data reduction, but with significantly higher performance and lower metadata overhead compared to CDC.

 

 

 

 

 

 

Cross-model Parameter Difference

 Since most fine-tuned LLMs share the same model structure as their base models [27, 89]—meaning each tensor has the same shape and position—a natural and direct approach is to analyze the element-wise differences in their weights (model parameters). To verify that such similarity is indeed prevalent, we compute the value differences (delta ∆w) at each parameter position i as ∆w_i = w_i − ŵ_i, where w_i and ŵ_i represent the i-th float value in the fine-tuned and base model, respectively. Here, the index i corresponds to the position of each float in the serialized model file, obtained by traversing all tensors in their original storage order and flattening each tensor in row-major layout. This delta is computed across all tensors to capture fine-grained numerical changes between models.
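
 To make this concrete, here is a minimal sketch (not the authors' code) of the element-wise delta computation described above. It assumes both checkpoints are single .safetensors files with identical tensor names and shapes; the file paths are placeholders.

```python
# Minimal sketch: element-wise weight deltas between a fine-tuned model and its
# base, traversing tensors in stored order and flattening each one row-major.
# Assumes single-file .safetensors checkpoints with matching tensor names/shapes.
import torch
from safetensors.torch import load_file

base = load_file("Llama-3.1-8B/model.safetensors")        # placeholder paths
ft   = load_file("finetuned-variant/model.safetensors")

deltas = []
for name, w_base in base.items():                         # original storage order
    w_ft = ft[name]
    assert w_ft.shape == w_base.shape                     # same architecture, same shapes
    # upcast to float32 so the delta itself is not rounded away in BF16
    deltas.append((w_ft.float() - w_base.float()).flatten())

delta = torch.cat(deltas)
print(f"mean |Δw| = {delta.abs().mean().item():.3e}, "
      f"exact zeros = {(delta == 0).float().mean().item():.2%}")
```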

 

 We begin with the Llama-3.1-8B [47] base model and select three of its fine-tuned variants. As shown in Figure 3 (top), the delta values are small and centered around zero, with similar bell-shaped distributions across all variants. This indicates that fine-tuned models in the same family introduce only minor modifications to the base model’s weights.

 

 To validate whether this property holds across model families, we repeat the same analysis using models from the Mistral-7B-v0.3 [35] family. See Figure 3 (bottom). Although the architectures are almost identical (except for the embedding and lm_head layers), the resulting delta distributions are much wider and asymmetric. The element-wise differences are significantly larger, suggesting that models from different origins have less similarity—even if their architecture matches. We tested many models and found consistent results, which are omitted due to space limits.

 

 Element-wise weight deltas can serve as a simple, efficient, and robust tool for identifying model lineage and clustering models by family. Fine-tuned variants derived from the same base model consistently exhibit small and structured deltas, making them well-suited for delta compression.

 

 

 

Similarity among fine-tuned models that share the same base model (bit distance)

 

 

 

Cross-model Bit Distance and LLM Clustering

 Building on the observed correlation between element-wise deltas and model similarity, we propose a bitwise distance metric that measures how many bits differ between two model files. Given two models with the same architecture, we align their floating-point weights in original order and compute the bit distance as follows:

 

Bit Distance: D(w, ŵ) = (1/n) ∑_{i=1}^{n} H(w_i, ŵ_i)

 

 Here, n is the total number of float values in the models. w_i and ŵ_i denote the i-th float value from the model pair w and ŵ, respectively, both represented in raw binary format. H(w_i, ŵ_i) computes the number of differing bits (i.e., the Hamming distance [28]) between the two binary representations. The final bit distance measures the average number of differing bits per float across the two models.
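
 A small sketch of how this metric can be computed for BF16 checkpoints, reusing the tensor dictionaries loaded in the earlier sketch; this illustrates the definition above and is not the paper's implementation.

```python
import numpy as np
import torch

def bit_distance(base: dict, ft: dict) -> float:
    """Average number of differing bits per BF16 weight across two aligned models."""
    diff_bits, n_floats = 0, 0
    for name, w_base in base.items():
        # reinterpret each BF16 weight as its raw 16-bit pattern
        a = w_base.contiguous().flatten().view(torch.int16)
        b = ft[name].contiguous().flatten().view(torch.int16)
        xor = torch.bitwise_xor(a, b).numpy()
        diff_bits += int(np.unpackbits(xor.view(np.uint8)).sum())   # popcount
        n_floats += xor.size
    return diff_bits / n_floats

print("bit distance D(w, ŵ):", bit_distance(base, ft))
```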

 

 Using this metric, we compute pairwise distances across 311 models from four major LLM families: Llama-3.1 [47], Llama-3 [27], Mistral [35], and Qwen2.5 [85]. We then construct a similarity graph by connecting model pairs whose bit distance falls below a fixed threshold (Figure 4).
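
 The clustering itself can be sketched as thresholded graph construction followed by connected components. The networkx dependency, the example threshold, and the cluster_models helper below are illustrative assumptions, not details from the paper.

```python
import itertools
import networkx as nx

def cluster_models(models: dict, threshold: float = 2.0) -> list[set[str]]:
    """models maps a model name to its loaded tensor dict; returns one name set per family."""
    g = nx.Graph()
    g.add_nodes_from(models)
    for (name_a, a), (name_b, b) in itertools.combinations(models.items(), 2):
        if a.keys() != b.keys():             # only compare architecturally aligned models
            continue
        if bit_distance(a, b) < threshold:   # bit_distance from the sketch above
            g.add_edge(name_a, name_b)
    return [set(c) for c in nx.connected_components(g)]
```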

 

 We observe clear clustering behavior: models within the same family tend to form dense groups, while connections across families are sparse. This supports our earlier hypothesis and findings—models that share a pretrained origin exhibit high structural redundancy, even at the bit level. In contrast, different pretrained base models or models fine-tuned from different bases diverge significantly in their binary representation.

 

 To better understand which bits contribute most to the observed bit-level differences between models, we break down the bit distance by position within the 16-bit BF16 format, as shown in Figure 5. We observe that within the same family, most differences are concentrated in the lower mantissa bits, with the upper mantissa and exponent bits contributing far less, and the sign bit almost never flipping.

 

 This indicates a high degree of bit-level similarity, particularly in the high-order bits, which provides a good compression opportunity. In contrast, cross-family comparisons exhibit nearly uniform bit differences across all bit positions, with the exception of a few exponent bits (typically 1–2), which show slightly lower divergence. This reflects their much weaker alignment and lower suitability for compression. These findings further support the utility of LLM-family-aware compression techniques.
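
 The per-position breakdown behind Figure 5 can be reproduced with a sketch like the following, under the same assumptions as above (in BF16, bit 15 is the sign, bits 14–7 the exponent, and bits 6–0 the mantissa).

```python
import numpy as np
import torch

def flip_rate_per_bit(base: dict, ft: dict) -> np.ndarray:
    """Fraction of weights whose bit differs, for each of the 16 BF16 bit positions."""
    flips, count = np.zeros(16, dtype=np.int64), 0
    for name, w_base in base.items():
        a = w_base.contiguous().flatten().view(torch.int16).numpy().view(np.uint16)
        b = ft[name].contiguous().flatten().view(torch.int16).numpy().view(np.uint16)
        xor = a ^ b
        for bit in range(16):
            flips[bit] += int(((xor >> bit) & 1).sum())
        count += xor.size
    return flips / count

rates = flip_rate_per_bit(base, ft)
print("sign:", rates[15], "exponent (high→low):", rates[14:6:-1], "mantissa (high→low):", rates[6::-1])
```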

 

 Bit-level similarity provides a powerful signal for organizing model repositories and guiding LLM storage optimizations. Models that are close in bit distance are more likely to benefit from delta encoding [63, 78], XOR-based compression [61], or structural reuse [37].

 

 Beyond compression, the bit distance metric offers broader implications for large-scale model hubs such as Hugging Face, where accurate, automated identification of model lineage remains an open challenge. Current tools often rely on manually curated metadata. In contrast, bit distance enables content-based provenance analysis, opening the door to a range of applications such as lineage tracking [57], duplicate detection [80], model clustering [3], and even LLM testing and evaluation [90].

 

 

 

BitX main idea: XOR against the base model, then compress the result with zstd

 

 

 

BitX Delta Compression

 Our earlier analysis from §3.4 reveals a key structural property of modern LLMs: fine-tuned models within the same LLM family exhibit small, consistent differences from their base models, both at the parameter value level (Figure 3) and at the bit level (Figure 4). These differences are often localized and sparse, forming a strong basis for compression. In particular, element-wise deltas show that most parameters remain nearly unchanged during fine-tuning, while the bit-level similarity confirms that related models can be clustered based on shared pretrained origin. Together, these insights motivate a compression strategy that directly encodes the bitwise differences between models. This is the foundation of our Bit-XOR (BitX) approach, which exploits fine-grained redundancy for efficient and lossless model storage reduction.

 

BitX Workflow 

 §3.4.3 shows that models in the same family often exhibit significant bit-level similarity. ZipLLM introduces a new compression algorithm, BitX, which exploits this bit-level redundancy to reduce storage consumption. Figure 6 illustrates the BitX workflow. Given a base model and a fine-tuned model that share the same architecture, BitX first aligns all floating-point values in their original order. For each corresponding pair of floats, BitX performs a bitwise XOR operation. This generates a sequence of XOR results, where many bits—especially in the sign, exponent, and high mantissa bits—are expected to be zero due to high redundancy. The XOR results capture the fine-grained differences between the models. Since most of the XOR bits are zero, the resulting sequence is highly compressible. BitX then applies a generic lossless compression algorithm, such as zstd [11], to further reduce storage. This two-stage process efficiently eliminates redundancy by directly encoding only the minimal changes required to reconstruct the fine-tuned model from its base.
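
 A minimal BitX-style sketch under the same assumptions as the earlier snippets (aligned BF16 tensor dictionaries, the zstandard Python bindings); splitting the restored flat tensor back into named tensors would follow the base model's shapes. This is an illustration of the workflow, not ZipLLM's code.

```python
import torch
import zstandard as zstd

def bitx_compress(base: dict, ft: dict, level: int = 3) -> bytes:
    """XOR aligned BF16 weights bit-by-bit, then zstd-compress the sparse result."""
    xor_chunks = []
    for name, w_base in base.items():
        a = w_base.contiguous().flatten().view(torch.int16)
        b = ft[name].contiguous().flatten().view(torch.int16)
        xor_chunks.append(torch.bitwise_xor(a, b))       # mostly-zero bit patterns
    payload = torch.cat(xor_chunks).numpy().tobytes()
    return zstd.ZstdCompressor(level=level).compress(payload)

def bitx_decompress(base: dict, blob: bytes) -> torch.Tensor:
    """Losslessly recover the fine-tuned weights (flattened) by XOR-ing the delta back in."""
    raw = zstd.ZstdDecompressor().decompress(blob)
    xor = torch.frombuffer(bytearray(raw), dtype=torch.int16)
    ref = torch.cat([t.contiguous().flatten().view(torch.int16) for t in base.values()])
    return torch.bitwise_xor(ref, xor).view(torch.bfloat16)
```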

 

Why XOR?

 We choose XOR rather than numerical differencing because it generates more zero bits at the binary level, leading to higher compressibility. For two similar floating-point numbers, numerical differencing often yields a new value whose exponent and mantissa differ from both inputs, making the output denser and harder to compress.

 In contrast, XOR preserves the bit-level similarity in the exponent and mantissa, producing sparse outputs that are far more amenable to compression. By focusing on the bit-level delta between aligned floats, BitX achieves much higher compression ratios (§5.2) for fine-tuned models than traditional methods, without sacrificing accuracy or requiring any changes to model architectures.
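
 A tiny worked example of this point (the two values below are arbitrary): nearby BF16 weights share their sign, exponent, and high mantissa bits, so their XOR has only a couple of low bits set, while their arithmetic difference is a brand-new float with its own exponent.

```python
import torch

w_base = torch.tensor([0.8125],     dtype=torch.bfloat16)   # bit pattern 0x3F50
w_ft   = torch.tensor([0.82421875], dtype=torch.bfloat16)   # bit pattern 0x3F53

xor  = torch.bitwise_xor(w_base.view(torch.int16), w_ft.view(torch.int16)).item() & 0xFFFF
diff = (w_ft - w_base).view(torch.int16).item() & 0xFFFF

print(f"XOR  bits: {xor:016b}")    # 0000000000000011 -> only two low mantissa bits differ
print(f"diff bits: {diff:016b}")   # 0011110001000000 -> new exponent and mantissa pattern
```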

 

 

 

 

 

 

Putting It All Together: ZipLLM Design

 Figure 7 illustrates the overall design of our LLM storage reduction pipeline, ZipLLM, which is tailored to the unique data characteristics of LLM storage. 

 

In Step 1, ZipLLM deduplicates files by computing content hashes and removing exact duplicates.

In Step 2, ZipLLM extracts all tensors across repositories and hashes them individually to identify repeated tensors. Unique tensors are stored in a global tensor pool (see the sketch after this list).

ZipLLM also extracts metadata such as model cards [33] from non-parameter files (Step 1a), and uses them to group models into families (Step 3a). When the model family metadata is missing or incomplete, ZipLLM uses bit distance for similarity search (Step 3b) to identify the closest base model (see §3.4.3).

In Step 4, ZipLLM performs BitX compression, which consists of two sub-steps. In Step 4a, XOR deltas are computed between fine-tuned tensors and their corresponding base tensors, producing sparse binary differences.

In Step 4b, these XOR results are further compressed using generic algorithms such as zstd, yielding the final compact representation.
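
 Below is a minimal sketch of Step 2 (tensor-level deduplication) under the same assumptions as earlier; the tensor_pool structure and dedup_checkpoint helper are hypothetical names for illustration, not ZipLLM's API.

```python
import hashlib
import torch
from safetensors.torch import load_file

tensor_pool: dict[str, bytes] = {}                 # content hash -> raw tensor bytes

def dedup_checkpoint(path: str) -> dict[str, str]:
    """Return a manifest mapping tensor name -> content hash for one model file."""
    manifest = {}
    for name, tensor in load_file(path).items():
        raw = tensor.contiguous().flatten().view(torch.uint8).numpy().tobytes()
        digest = hashlib.sha256(raw).hexdigest()
        tensor_pool.setdefault(digest, raw)        # keep only the first copy seen
        manifest[name] = digest
    return manifest
```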

 

 

 

Lossless Compression

 After deduplication, ZipLLM performs LLM-family-aware BitX compression across fine-tuned and base models.

Model Lineage Extraction. This step analyzes the configuration and metadata files extracted from non-parameter files (e.g., config.json, README.md) to identify lineage relationships among models. We use a combination of regular expressions and an LLM-based parser to extract base model information. Specifically, we parse architectures, tokenizers, and family identifiers to group structurally similar models.
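
 As a sketch of this metadata path (Step 3a): Hugging Face model cards commonly declare a base_model entry in the README's YAML front matter, so a simple extractor might look like the following. The regex-only approach and the helper name are assumptions; the paper's parser also uses an LLM and falls back to bit-distance search (Step 3b) when metadata is missing.

```python
import re
from pathlib import Path

def extract_base_model(repo_dir: str) -> str | None:
    """Return the declared base model of a repository, or None if not stated."""
    readme = Path(repo_dir, "README.md")
    if not readme.exists():
        return None
    match = re.search(r"^base_model:\s*(\S+)", readme.read_text(), re.MULTILINE)
    return match.group(1) if match else None
```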

 

 

 

 

 

 

 

Baselines

We compare ZipLLM with both real-world production systems and recent state-of-the-art algorithms:

 

• FileDedup and ChunkDedup (FastCDC) are used by Hugging Face [80]. Because model information is lost during ChunkDedup, Hugging Face does not use compression in conjunction with the deduplication.

 

• ZipNN is the state-of-the-art model compressor, which groups the different components of floating-point numbers for compression [30]. Because it does not consider deduplication, we add FileDedup to ZipNN for a fair comparison.

 

• Compress-then-FastCDC is a baseline we design to study the effect of execution order. In this setting, we first apply a compression algorithm (e.g., zstd) and then perform ChunkDedup (FastCDC). This allows us to evaluate how the ordering between compression and deduplication impacts the overall reduction efficiency.

 

 

 

 

 

 

Conclusion

 This paper presents ZipLLM, a model storage reduction pipeline that unifies tensor-level deduplication and a new lossless delta compression called BitX to address the growing scale of LLM storage. Our large-scale study reveals key redundancies in LLM repositories and motivates design principles that synergize model storage deduplication with compression. ZipLLM achieves significantly higher storage savings and throughput compared to state-of-the-art approaches, without sacrificing losslessness.