LLM Quantization Benchmark

A minimalist infrastructure experiment demonstrating that NF4 quantization lets production-grade LLMs fit onto consumer hardware. It’s essentially a feasibility study for running Llama-3 on a budget.

What is this?

This is a first-principles infrastructure project. I didn't want to just read the arXiv papers on quantization or trust the "8-bit" toggle in a UI. I wanted to see the raw memory metrics myself.

Most tutorials on AI Infrastructure fall into two buckets:

  1. The "Hello World": They run a small model in a notebook, hide the complexity, and don't measure anything meaningful.
  2. The "Burn Cash": They spin up an H100 cluster on AWS, cost $40 for a 2-hour experiment, and assume you have unlimited budget.

I wanted the middle ground: rigorous "Hardware Math" and driver-level monitoring, executed on a $0 budget.
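
For a sense of what that math looks like, here is a back-of-the-envelope sketch (weights only; it ignores the KV cache, activations, CUDA context, and NF4's quantization constants, and the parameter count is rounded):

# Weights-only footprint: params x (bits / 8), ignoring KV cache and CUDA context.
def weight_footprint_gib(n_params: float, bits_per_param: float) -> float:
    return n_params * (bits_per_param / 8) / 1024**3

N = 1.1e9  # TinyLlama-1.1B (rounded)
print(f"FP16 (16-bit): {weight_footprint_gib(N, 16):.2f} GiB")  # ~2.05 GiB
print(f"NF4  (4-bit) : {weight_footprint_gib(N, 4):.2f} GiB")   # ~0.51 GiB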

The Backstory

This project exists because I am currently deep-diving into the physics of AI inference. I realized that while I can write Python code on my MacBook Air M2, I cannot run CUDA kernels locally.

I found myself in a "Dependency Hell" loop:

  • I wanted to write code locally (VS Code, nice linting).
  • I needed to run code remotely (Google Colab T4 GPU).
  • I kept breaking my local environment trying to install NVIDIA-specific libraries on Apple Silicon.

I realized I was approaching MLOps wrong. I didn't need a stronger laptop; I needed a better deployment pipeline.

So I built this benchmark to function as a bridge. It allows me to develop locally on ARM64 and seamlessly deploy to a Linux/CUDA environment without changing a single line of code.

The Experiment

The core goal was to measure the "Tax" vs "Savings" of 4-bit quantization. I built a harness that loads TinyLlama-1.1B (a proxy for Llama-3) in two distinct modes on a Tesla T4:

  1. Baseline (FP16): Standard half-precision; this is the default for most production endpoints.
  2. Optimized (INT4): 4-bit NormalFloat (NF4), using the bitsandbytes custom kernels to compress the weights (both loading paths are sketched below).
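
A minimal sketch of how these two modes can be loaded with transformers and bitsandbytes; the exact settings here (compute dtype, device map) are illustrative assumptions, not a copy of the harness in main.py:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Baseline: plain half-precision weights on the GPU.
fp16_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

# Optimized: 4-bit NormalFloat (NF4) via the bitsandbytes kernels.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
nf4_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=nf4_config, device_map="auto"
)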

The tool doesn't just check model.get_memory_footprint(). It hooks into the NVIDIA Management Library (pynvml) to measure the actual VRAM pressure on the hardware, capturing the context overhead that PyTorch often hides.
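A minimal sketch of that driver-level measurement (the helper name used_vram_mib is mine, not the harness's actual API):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # the Colab T4 is GPU 0

def used_vram_mib() -> float:
    # Total VRAM in use as the driver reports it, including CUDA context overhead.
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    return info.used / 1024**2

before = used_vram_mib()
# ... load one of the models here ...
print(f"VRAM delta: {used_vram_mib() - before:.0f} MiB")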

Benchmark Results

[Benchmark results chart]

The "Infrastructure Logic"

This is the technical differentiator. Most GPU scripts break the moment you move them between a Mac and a Linux/CUDA box. I architected this with a strict "Platform Agnostic" pattern:

  • Conditional Imports: The engine wraps the NVIDIA libraries in safety blocks. If it detects it's running on Metal (Mac), it mocks the GPU monitor; if it detects CUDA, it loads the real drivers (see the sketch after this list).
  • Dependency Markers: I used uv and pyproject.toml with sys_platform markers. This ensures that the heavy CUDA wheels are only pulled during the remote build, keeping the local dev environment clean.
  • Garbage Collection: Python's GC is lazy. I implemented manual CUDA cache flushing to ensure the second benchmark run (INT4) wasn't polluted by the memory fragmentation of the first run (FP16).
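
A minimal sketch of the conditional-import and cache-flush pattern (names like NVML_AVAILABLE and flush_gpu are illustrative; the real implementation lives in src/):

import gc

import torch

# Conditional import: on Apple Silicon there is no NVML, so fall back to a mocked monitor.
try:
    import pynvml
    pynvml.nvmlInit()
    NVML_AVAILABLE = True
except Exception:  # ImportError on macOS, NVMLError when no NVIDIA driver is present
    NVML_AVAILABLE = False

def gpu_used_mib() -> float:
    if not NVML_AVAILABLE:
        return 0.0  # mocked reading on Metal / CPU-only machines
    info = pynvml.nvmlDeviceGetMemoryInfo(pynvml.nvmlDeviceGetHandleByIndex(0))
    return info.used / 1024**2

def flush_gpu() -> None:
    # Force-release VRAM between the FP16 and INT4 runs so the second run starts clean.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()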

Usage

Clone the repo:

git clone https://github.com/rudra-swnt-12/llm-quantization-benchmark.git

Since the target environment (Colab) is separate from your dev environment (Local), I use a "Zip & Ship" workflow.

Package the artifact:

zip -r deploy.zip src main.py requirements-colab.txt

Deploy to Google Colab:

  1. Open a Notebook with a T4 Runtime.
  2. Upload deploy.zip.
  3. Run the harness:
# Extract the uploaded artifact into the Colab runtime
!unzip -o deploy.zip

# Install the CUDA-side dependencies (these wheels never touch the local Mac environment)
print("Force-Installing bitsandbytes...")
!pip install -U bitsandbytes accelerate transformers torch scipy pynvml

# Sanity-check that bitsandbytes imports cleanly before spending GPU time
print("\nVerifying bitsandbytes installation...")
try:
    import bitsandbytes
    print(f"bitsandbytes version {bitsandbytes.__version__} found!")
except ImportError:
    print("bitsandbytes still not found.")

# Run the FP16 vs NF4 benchmark harness
print("\nStarting Benchmark...")
!python main.py --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

Tech

Built with Python, uv (for dependency management), PyTorch, Hugging Face transformers, bitsandbytes (for the quantization kernels), and pynvml (for driver-level VRAM monitoring).

I used Matplotlib for the final visualization because measuring performance is useless if you can't visualize the impact.
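
If you want to reproduce the chart, a minimal Matplotlib sketch looks like this (the function and values are placeholders, not the repo's plotting code or measured results):

import matplotlib.pyplot as plt

def plot_vram(fp16_mib: float, nf4_mib: float, out_path: str = "benchmark.png") -> None:
    # Bar chart comparing peak VRAM of the two runs; values come from the harness.
    labels = ["FP16 baseline", "NF4 (4-bit)"]
    plt.bar(labels, [fp16_mib, nf4_mib])
    plt.ylabel("Peak VRAM (MiB)")
    plt.title("NF4 quantization: VRAM impact on a Tesla T4")
    plt.savefig(out_path, dpi=150, bbox_inches="tight")

# plot_vram(fp16_mib=..., nf4_mib=...)  # fill in with your measured numbers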

Feel free to use this pattern for your own benchmarks. Enjoy.
