Straight talk on scaling teams, shipping AI, and running technology
like a business, not a science project

I spent last week auditing several local LLM setups for a client who was frustrated by “slow” results. They were running a massive model on hardware that couldn’t handle it, but they didn’t know why. When I looked at their filenames, the problem was staring me in the face.

They had chosen a high parameter count but hadn’t accounted for how quantization or architecture affects actual performance. It is easy to get lost in the alphabet soup of model filenames. You see strings like Llama-3-70B-Instruct-Q4_K_M.gguf and it looks like a secret code.

It isn’t a code. It is a roadmap.

If you want to choose the right model for your hardware, you have to be able to read this map. Let’s break down what these labels actually mean for your experience. Don’t worry: there are only a handful of key indicators to keep an eye on, and by the time you have read this, you’ll be able to hold your own at any C-suite table.

LLM Size and Purpose

The first part of the name tells you what the “brain” is made of.

A name like 70B means 70 billion parameters. Parameters are the learned numbers (weights) that make up the model, so this is its raw size. Generally, more parameters mean better reasoning and a broader range of facts because the model has more “neurons” to draw from. However, larger brains require more memory. If you don’t have enough VRAM to hold those 70 billion numbers, the system will slow down as it tries to swap data in and out of your system memory.

Following that, you will often see tags like Instruct or Chat. These tell you how the model was “raised.” A “base” model is a raw student; it knows facts but doesn’t know how to follow instructions. It might finish your sentence for you rather than answering your question. An “Instruct” model has been fine-tuned to be an assistant. It understands that when you ask a question, you want an answer, not a continuation of the text.

The Resolution: Quantization (Q4, Q6, Q8)

This is where we talk about “resolution.” Think of it like a television screen.

A Q8 model is like watching a movie in 8K Ultra-HD. It stores each weight in roughly 8 bits, which keeps it very close in quality to the original, unquantized model (typically 16 bits per weight). It is the most accurate of the common quantized versions, but it also needs the most VRAM of them. You use Q8 when you need maximum precision and have the hardware to support it.

A Q6 model is your High Definition (1080p) sweet spot. For almost every human interaction, you cannot tell the difference between Q6 and Q8, but it fits much more easily on standard hardware. It provides a buffer of accuracy that prevents the model from “tripping” over complex logic.

A Q4 model is like a high-quality stream. It is compressed significantly to fit onto consumer GPUs. While there is a slight loss in “resolution,” the model remains highly capable for the vast majority of common tasks.

Quant Level | Resolution Analogy | Use Case
Q8 | 8K Ultra-HD | Maximum precision; high VRAM available
Q6 | High Definition | The “sweet spot” for most power users
Q4 | Standard HD | Best for consumer hardware and speed
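To make the “resolution” idea concrete, here is a minimal sketch of what rounding weights to fewer bits does to them. It is an illustration under simplifying assumptions, not how any particular tool works: it uses plain symmetric round-to-nearest with one scale for the whole list, whereas real GGUF quantization works on small blocks of weights with per-block scales. The function names and the random “weights” are made up for the example.

```python
import random

def quantize_roundtrip(weights, bits):
    """Round each weight to the nearest of a small set of levels, then convert back."""
    levels = 2 ** (bits - 1) - 1              # 127 levels each side for 8 bits, 7 for 4 bits
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) * scale for w in weights]

def mean_abs_error(original, restored):
    """Average distance between the original weights and the restored ones."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

random.seed(0)
# Stand-in for a slice of model weights (an assumption, not real model data).
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

errors = {
    bits: mean_abs_error(weights, quantize_roundtrip(weights, bits))
    for bits in (8, 6, 4)
}
for bits, err in errors.items():
    print(f"Q{bits}: mean absolute error = {err:.6f}")
```

The error grows as the bit count drops, which is the “loss of resolution” in the television analogy.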

If you see QAT (Quantization-Aware Training), it means the model was trained to tolerate compression. Instead of taking a finished model and shrinking it after the fact, the developers trained (or fine-tuned) the model knowing it would be stored at low precision. It is like a suit tailored for your specific body rather than one bought off the rack and pinned at the seams. A QAT model often outperforms the same model quantized to Q4 after training because it was “prepared” for the lower resolution.

Size and quant level are the two most important indicators of how well a model will run on your local hardware, in terms of both memory and GPU load.
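A rough back-of-the-envelope calculation shows why these two numbers matter so much. The sketch below multiplies the parameter count by the nominal bits per weight. Treat the results as lower bounds: it ignores the context cache and runtime overhead, and real quantized files use slightly more bits per weight than the number in the label. The function name is mine, and FP16 stands in for the unquantized 16-bit original.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

for label, bits in [("FP16", 16), ("Q8", 8), ("Q6", 6), ("Q4", 4)]:
    print(f"70B at {label}: ~{weight_memory_gb(70, bits):.1f} GB")
```

By this estimate a 70B model needs about 140 GB unquantized, 70 GB at Q8, 52.5 GB at Q6 and 35 GB at Q4, which is why a 70B file struggles on a single consumer GPU whichever quant you pick.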

The Planning: NTP vs. MTP

Next, we look at how the model “thinks” about the future of the sentence.

Most models use NTP (Next-Token Prediction). They are like a person walking while looking only at their feet. They see the next step clearly, but they can lose the plot of the overall journey because they aren’t “looking ahead.” Because they only care about the very next word, they can occasionally drift into nonsensical loops during long paragraphs.

A newer approach, used by some recent models, is MTP (Multi-Token Prediction). This is like a person looking down the path. The model is trained to predict a window of several words at once. Because it has to plan for the next few steps, it tends to stay on track better during long-form writing or complex coding tasks.

Strategy | Focus | Benefit
NTP | Next Word | Standard; works well for short responses
MTP | Future Window | Better coherence and “planning” in long text
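The difference is easiest to see in what the model is asked to predict during training. This is a toy sketch with made-up function names and whole words standing in for tokens. Real training works on token IDs and the details differ between models.

```python
def ntp_targets(tokens):
    """Pair each context with the single token that follows it."""
    return [(tokens[: i + 1], tokens[i + 1]) for i in range(len(tokens) - 1)]

def mtp_targets(tokens, window):
    """Pair each context with the next `window` tokens."""
    return [
        (tokens[: i + 1], tokens[i + 1 : i + 1 + window])
        for i in range(len(tokens) - window)
    ]

tokens = ["the", "cat", "sat", "on", "the", "mat"]
print(ntp_targets(tokens)[0])      # context ['the'], target 'cat'
print(mtp_targets(tokens, 3)[0])   # context ['the'], targets ['cat', 'sat', 'on']
```

With NTP the model is graded on one step at a time. With MTP it is graded on the whole window, which is where the “looking down the path” behaviour comes from.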

The Workforce: MoE (Mixture of Experts)

Finally, we look at how the model organizes its internal labor.

A standard “Dense” model is like a single doctor who has to know everything about medicine. To make that doctor faster, you have to make them “smaller” or less detailed.

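An MoE (Mixture of Experts) model is a hospital. It contains many “experts,” but only a few of them are activated for each token the model processes. This allows a massive model to be much faster than a dense model of the same total size because it isn’t using its entire brain for every word. It provides much of the intelligence of a large model with the speed of a smaller one. One caveat for your hardware: the whole hospital still has to fit in the building. All of the experts normally need to be loaded into memory, so MoE saves compute per token, not memory.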
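Here is a minimal sketch of the “triage desk” in that hospital: a router scores every expert for the current token and only the top few are run. The scores and function names are hypothetical, and real MoE models differ in the details (how scores are normalized, how many experts are picked, whether some experts are always on).

```python
import math

def softmax(scores):
    """Turn raw router scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_scores, top_k):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts."""
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

# Hypothetical router scores for one token across 8 experts.
scores = [0.1, 2.3, -1.0, 0.4, 1.9, -0.5, 0.0, 0.7]
active = route(scores, top_k=2)
print(active)
print(f"{len(active)} of {len(scores)} experts run for this token")
```

Only two of the eight experts do any work for this token, which is where the speed comes from, but all eight still have to be sitting in memory ready to be called.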

To wrap this up, let’s take a real-world example and put it through the “decoder” we just built.

If you see a file named Qwen2.5-32B-Instruct-Q5_K_M.gguf, here is exactly what that tells you about the model’s DNA:

The Body: Qwen2.5-32B-Instruct

  • Qwen2.5: This is the “brand.” It’s a highly capable, modern series of models from Alibaba.
  • 32B: This is the size. It has 32 billion parameters. It’s large enough to have deep reasoning capabilities but small enough to be very efficient.
  • Instruct: This tells you it’s been “coached.” It is ready to follow your commands, act as a tutor, or write code, rather than just completing the next line of text like a raw student.

The Resolution: Q5_K_M

  • Q5: This is a 5-bit quantization. It sits right between the high-fidelity Q6 and the highly compressed Q4. You are getting almost all the intelligence of the original model with a significant reduction in file size.
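  • K_M: The “K” marks this as one of the “k-quant” formats used in GGUF files, a smarter way of grouping the weights into blocks during compression. The “M” stands for “Medium”: these quants come in small, medium, and large mixes (S, M, L), where the larger mixes keep more of the sensitive parts of the model at higher precision. It ensures that even though it’s compressed, the most important parts of the “brain” stay sharp.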
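The same decoding can be done in a few lines of code. This is a sketch only: the pattern below is my own and is written to handle names shaped like the two examples in this article. Filenames are a convention, not a standard, so plenty of real files will not match it.

```python
import re

# Hypothetical pattern: <family>-<size>B[-<variant>]-<quant>.gguf
PATTERN = re.compile(
    r"^(?P<family>.+?)-(?P<size>\d+(?:\.\d+)?B)"
    r"(?:-(?P<variant>[A-Za-z]+))?"
    r"-(?P<quant>Q\d+(?:_[A-Z0-9]+)*)\.gguf$"
)

def parse_model_filename(name):
    """Split a model filename into its labelled parts, or return None if it does not fit."""
    match = PATTERN.match(name)
    if match is None:
        return None
    return match.groupdict()

print(parse_model_filename("Qwen2.5-32B-Instruct-Q5_K_M.gguf"))
print(parse_model_filename("Llama-3-70B-Instruct-Q4_K_M.gguf"))
```

For the Qwen file this returns the family (Qwen2.5), the size (32B), the variant (Instruct) and the quant (Q5_K_M) as separate fields.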

How to Choose

When you are looking at a list of models, use this logic:

  1. If you have a high-end workstation and want the best possible logic, look for Q8 or Q6 with MTP. You want the highest resolution and the best planning.
  2. If you are running on a home PC and want a fast, capable assistant, look for Q4_K_M (the standard “medium” compression) and an MoE architecture. These provide the best balance of “smart” and “fast.”
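If you want a starting point before you download anything, the logic above can be roughed out as a small helper. Everything here is an assumption for illustration: the function name, the 2 GB of headroom reserved for context and overhead, and the use of nominal bit widths (real files are a little larger). It only considers memory, not speed.

```python
def pick_quant(params_billions, vram_gb, headroom_gb=2.0, options=(8, 6, 5, 4)):
    """Return the highest nominal bit width whose weights fit in VRAM, or None."""
    for bits in options:                          # tried from highest to lowest
        weights_gb = params_billions * bits / 8   # billions of params x bytes per weight
        if weights_gb + headroom_gb <= vram_gb:
            return bits
    return None

print(pick_quant(32, 24))   # a 32B model on a 24 GB card
print(pick_quant(70, 24))   # a 70B model on a 24 GB card
```

On these assumptions a 32B model on a 24 GB card lands on a 5-bit quant, while a 70B model does not fit at any of the listed levels, which is the situation my client was in.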

Stop guessing which file to download. Read the name: look at the size, the resolution, the planning, and the workforce.

Which of these factors is your current hardware’s biggest bottleneck?

AI Disclaimer: Gemini Nano Banana Pro was used to generate the photo – Natural Treasure (2004)
