The quantized (local) future
Over the last few weeks, I've been playing with quantized versions of LLMs and exploring options to run LLMs locally. I am using a MacBook Pro with 1 TB of storage and 16 GB of RAM. Typically, that would not be enough to run a more capable LLM locally, but it becomes very manageable with a quantized model. Quantization, in the context of Large Language Models (LLMs) like ChatGPT, is a technique for reducing the precision of the numbers (weights) in the neural network. This shrinks the model and makes it run more efficiently, which matters most in deployment scenarios with limited hardware.
The easiest way for me to understand this was through the analogy of RAW vs. JPEG photos.
Imagine you're a photographer deciding between saving your photos in RAW or JPEG format. That choice is quite similar to the trade-off quantization makes for LLMs.
RAW Images: They are like the unfiltered, complete data from your camera. They offer the highest quality and detail, perfect for professional editing. But they're big files and take up a lot of space.
JPEG Images: When you take a photo and save it as a JPEG, the camera does some behind-the-scenes work to compress the file. This makes it more manageable and easier to share, but you lose some of the finer details and editing flexibility.
High-Precision Weights in LLMs (RAW Format): Think of an LLM as initially having high-precision, detailed weights (like RAW images). They are rich in information, allowing the model to perform complex tasks with great accuracy, but they are also big and resource-intensive.
Quantized LLMs (JPEG Format): Quantization is like converting RAW to JPEG. It simplifies the model's data, making it smaller and faster to work with, much like a compressed JPEG file. The trade-off? Just a bit of precision, similar to losing some image details in JPEG.
By quantizing, we can deploy smarter AI on more devices, even those with limited resources. It's like being able to share beautiful photos without needing huge storage space. Just as photographers choose between RAW and JPEG, we can use quantization to balance between the model's depth of knowledge (accuracy) and practical deployment (size and speed).
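To make the analogy concrete, here is a minimal sketch of what quantization does to a weight tensor. It uses an illustrative symmetric 8-bit scheme (not how any particular library implements it): the weights are rescaled into the int8 range and stored together with a single scale factor.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric 8-bit quantization: map float32 weights to int8 plus one scale factor."""
    scale = np.abs(weights).max() / 127.0            # largest weight maps to +/-127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights (the 'JPEG' version of the tensor)."""
    return q.astype(np.float32) * scale

# A toy weight matrix standing in for one layer of an LLM
w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_approx = dequantize(q, scale)

print("float32 storage:", w.nbytes, "bytes; int8 storage:", q.nbytes, "bytes")
print("max rounding error:", np.abs(w - w_approx).max())
```

The int8 copy takes a quarter of the memory of the float32 original, and the small rounding error is the "lost detail" from the JPEG analogy. Real schemes, such as the 4- and 5-bit block-wise quantizations used by llama.cpp, are more sophisticated, but the idea is the same.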
Using a 5-bit (Q5_K) quantized model, I've been able to run Llama 2 quite easily on a mid-range laptop without much loss in quality. The best part: it all runs locally, so I don't need to worry about data privacy. This technique has a lot of potential for low- and middle-income countries (LMICs), which may be compute-constrained and/or have concerns about data privacy and sovereignty.
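For anyone who wants to reproduce this setup, the sketch below shows roughly how I run a quantized model locally. It assumes the llama-cpp-python package and a GGUF model file already downloaded to disk; the file name and prompt are placeholders.

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# Placeholder path to a quantized GGUF file downloaded beforehand
# (e.g. a 5-bit Q5_K build of Llama 2 7B)
llm = Llama(
    model_path="./models/llama-2-7b-chat.Q5_K_M.gguf",
    n_ctx=2048,      # context window size
    n_threads=8,     # CPU threads available on the laptop
)

# Everything below runs on the local machine; no data leaves it
output = llm(
    "Explain in two sentences why quantization makes local LLMs practical.",
    max_tokens=200,
)
print(output["choices"][0]["text"])
```

On a 16 GB machine this works because a 7B-parameter model at 5-bit precision is only around 5 GB on disk, compared with roughly 13 GB at full 16-bit precision.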