LLM Quantization Benchmark

A minimalist infrastructure experiment demonstrating that NF4 quantization lets production-grade LLMs fit onto consumer hardware. It’s essentially a feasibility study for running Llama-3 on a budget.

What is this?

This is a first-principles infrastructure project. I didn't want to just read the arXiv papers on quantization or trust the "8-bit" toggle in a UI. I wanted to see the raw memory metrics myself.

Most tutorials on AI Infrastructure fall into two buckets:

  1. The "Hello World": They run a small model in a notebook, hide the complexity, and don't measure anything meaningful.
  2. The "Burn Cash": They spin up an H100 cluster on AWS, cost $40 for a 2-hour experiment, and assume you have unlimited budget.

I wanted the middle ground: rigorous "Hardware Math" and driver-level monitoring, executed on a $0 budget.
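
For a sense of what that math looks like, here is a back-of-the-envelope sketch (weights only; it ignores the KV cache, activations, CUDA context, and NF4's quantization constants, and the parameter count is rounded):

# Weights-only footprint: params x (bits / 8), ignoring KV cache and CUDA context.
def weight_footprint_gib(n_params: float, bits_per_param: float) -> float:
    return n_params * (bits_per_param / 8) / 1024**3

N = 1.1e9  # TinyLlama-1.1B (rounded)
print(f"FP16 (16-bit): {weight_footprint_gib(N, 16):.2f} GiB")  # ~2.05 GiB
print(f"NF4  (4-bit) : {weight_footprint_gib(N, 4):.2f} GiB")   # ~0.51 GiB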

The Backstory

This project exists because I am currently deep-diving into the physics of AI inference. I realized that while I can write Python code on my MacBook Air M2, I cannot run CUDA kernels locally.

I found myself in a "Dependency Hell" loop:

  • I wanted to write code locally (VS Code, nice linting).
  • I needed to run code remotely (Google Colab T4 GPU).
  • I kept breaking my local environment trying to install NVIDIA-specific libraries on Apple Silicon.

I realized I was approaching MLOps wrong. I didn't need a stronger laptop; I needed a better deployment pipeline.

So I built this benchmark to function as a bridge. It allows me to develop locally on ARM64 and seamlessly deploy to a Linux/CUDA environment without changing a single line of code.

The Experiment

The core goal was to measure the "Tax" vs "Savings" of 4-bit quantization. I built a harness that loads TinyLlama-1.1B (a proxy for Llama-3) in two distinct modes on a Tesla T4:

  1. Baseline (FP16): Standard half-precision; this is the default for most production endpoints.
  2. Optimized (INT4): 4-bit NormalFloat (NF4), using the bitsandbytes custom kernels to compress the weights (both loading paths are sketched below).
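
A minimal sketch of how these two modes can be loaded with transformers and bitsandbytes; the exact settings here (compute dtype, device map) are illustrative assumptions, not a copy of the harness in main.py:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Baseline: plain half-precision weights on the GPU.
fp16_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

# Optimized: 4-bit NormalFloat (NF4) via the bitsandbytes kernels.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
nf4_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=nf4_config, device_map="auto"
)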

The tool doesn't just check model.get_memory_footprint(). It hooks into the NVIDIA Management Library (pynvml) to measure the actual VRAM pressure on the hardware, capturing the context overhead that PyTorch often hides.
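A minimal sketch of that driver-level measurement (the helper name used_vram_mib is mine, not the harness's actual API):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # the Colab T4 is GPU 0

def used_vram_mib() -> float:
    # Total VRAM in use as the driver reports it, including CUDA context overhead.
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    return info.used / 1024**2

before = used_vram_mib()
# ... load one of the models here ...
print(f"VRAM delta: {used_vram_mib() - before:.0f} MiB")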

Benchmark Results

[Benchmark results chart]

The "Infrastructure Logic"

This is the technical differentiator. Most GPU scripts break the moment you move them between a Mac and a Linux/CUDA box. I architected this with a strict "Platform Agnostic" pattern:

  • Conditional Imports: The engine wraps the NVIDIA libraries in safety blocks. If it detects it's running on Metal (Mac), it mocks the GPU monitor; if it detects CUDA, it loads the real drivers (see the sketch after this list).
  • Dependency Markers: I used uv and pyproject.toml with sys_platform markers. This ensures that the heavy CUDA wheels are only pulled during the remote build, keeping the local dev environment clean.
  • Garbage Collection: Python's GC is lazy. I implemented manual CUDA cache flushing to ensure the second benchmark run (INT4) wasn't polluted by the memory fragmentation of the first run (FP16).
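
A minimal sketch of the conditional-import and cache-flush pattern (names like NVML_AVAILABLE and flush_gpu are illustrative; the real implementation lives in src/):

import gc

import torch

# Conditional import: on Apple Silicon there is no NVML, so fall back to a mocked monitor.
try:
    import pynvml
    pynvml.nvmlInit()
    NVML_AVAILABLE = True
except Exception:  # ImportError on macOS, NVMLError when no NVIDIA driver is present
    NVML_AVAILABLE = False

def gpu_used_mib() -> float:
    if not NVML_AVAILABLE:
        return 0.0  # mocked reading on Metal / CPU-only machines
    info = pynvml.nvmlDeviceGetMemoryInfo(pynvml.nvmlDeviceGetHandleByIndex(0))
    return info.used / 1024**2

def flush_gpu() -> None:
    # Force-release VRAM between the FP16 and INT4 runs so the second run starts clean.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()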

Usage

Clone the repo:

git clone https://github.com/rudra-swnt-12/llm-quantization-benchmark.git

Since the target environment (Colab) is separate from your dev environment (Local), I use a "Zip & Ship" workflow.

Package the artifact:

zip -r deploy.zip src main.py requirements-colab.txt

Deploy to Google Colab:

  1. Open a Notebook with a T4 Runtime.
  2. Upload deploy.zip.
  3. Run the harness:
# Extract the uploaded artifact into the Colab runtime
!unzip -o deploy.zip

# Install the CUDA-side dependencies (these wheels never touch the local Mac environment)
print("Force-Installing bitsandbytes...")
!pip install -U bitsandbytes accelerate transformers torch scipy pynvml

# Sanity-check that bitsandbytes imports cleanly before spending GPU time
print("\nVerifying bitsandbytes installation...")
try:
    import bitsandbytes
    print(f"bitsandbytes version {bitsandbytes.__version__} found!")
except ImportError:
    print("bitsandbytes still not found.")

# Run the FP16 vs NF4 benchmark harness
print("\nStarting Benchmark...")
!python main.py --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

Tech

Built with Python, uv (for dependency management), PyTorch, Hugging Face transformers, bitsandbytes (for the quantization kernels), and pynvml (for driver-level VRAM monitoring).

I used Matplotlib for the final visualization because measuring performance is useless if you can't visualize the impact.
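
If you want to reproduce the chart, a minimal Matplotlib sketch looks like this (the function and values are placeholders, not the repo's plotting code or measured results):

import matplotlib.pyplot as plt

def plot_vram(fp16_mib: float, nf4_mib: float, out_path: str = "benchmark.png") -> None:
    # Bar chart comparing peak VRAM of the two runs; values come from the harness.
    labels = ["FP16 baseline", "NF4 (4-bit)"]
    plt.bar(labels, [fp16_mib, nf4_mib])
    plt.ylabel("Peak VRAM (MiB)")
    plt.title("NF4 quantization: VRAM impact on a Tesla T4")
    plt.savefig(out_path, dpi=150, bbox_inches="tight")

# plot_vram(fp16_mib=..., nf4_mib=...)  # fill in with your measured numbers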

Feel free to use this pattern for your own benchmarks. Enjoy.
