A 70-billion-parameter AI model needs approximately 140GB of RAM to run at its native 16-bit precision. Your Mac Mini has 48GB. Quantization is the reason it runs anyway — and understanding it is the difference between a hardware decision based on guesswork and one based on what actually matters. INT4 quantization delivers a 4x RAM reduction, shrinking that 140GB model to approximately 35GB (digitalapplied.com). That’s why a $1,999 Mac Mini can do what a $20,000 GPU cluster was doing two years ago.
What Quantization Actually Is
Every AI model is built from billions of numerical weights — values that tell the model how to process language and generate responses. At full precision, each weight is stored as a 32-bit floating-point number (FP32). That level of precision is necessary during training. But once training is complete and you’re using the model for inference — asking it questions, generating text — you don’t need that precision. (In practice, most large models are trained and distributed at 16-bit precision, FP16 or BF16, which is why a 70B model is commonly quoted at around 140GB rather than 280GB.)
Quantization is the process of converting those high-precision numbers to lower-precision formats after training. Instead of storing each weight as a 32-bit float, you store it as a 16-bit float (FP16), an 8-bit integer (INT8), or a 4-bit integer (INT4). The model gets smaller. The RAM requirement drops. Speed often improves.
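The arithmetic is simple: weight storage is parameter count times bytes per weight. A minimal Python sketch, illustrative only — real model files carry some extra metadata overhead:

```python
# Bits used to store each weight at the precisions discussed above.
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}

def weight_storage_gb(num_params: float, precision: str) -> float:
    """Storage for the weights alone: params x bits per weight / 8 bits per byte."""
    return num_params * BITS_PER_WEIGHT[precision] / 8 / 1e9

for precision in BITS_PER_WEIGHT:
    print(f"70B model at {precision}: ~{weight_storage_gb(70e9, precision):.0f} GB")
```

A 70B model comes out at ~140GB at FP16 and ~35GB at INT4 — the same figures quoted above.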
Quantization is not a compromise reluctantly accepted because of hardware limits. It is a mature, well-understood technique that makes powerful AI accessible on consumer hardware with minimal quality trade-off.
The JPEG Analogy: Why You Already Understand Quantization
If you’ve ever saved a photo as a JPEG and chosen a quality level, you’ve made a quantization-style decision. A RAW image file from a professional camera might be 30MB. The same image as a JPEG at 85% quality is 3MB and looks virtually identical to the human eye. At 60% quality it’s 1MB and you start to see degradation. The information lost is real — but at 85% the loss is below the threshold of practical consequence.
Quantized AI models work the same way. INT8 quantization (8-bit) is like JPEG at 90% quality — virtually indistinguishable from the original. INT4 quantization (4-bit) is like JPEG at 80% — some information loss, but for most analytical and conversational tasks, the practical difference is minimal.
The format you’ll encounter most often when downloading models for tools like Ollama or LM Studio is GGUF — a file format that packages quantized models, optimised for efficient CPU and Apple Silicon inference. When you see a filename like qwen3-72b-q4_k_m.gguf, the q4 tells you it’s INT4 quantized and the k_m tells you which variant of INT4 compression was used (a mid-range balance of quality and size).
Quantization Levels in Practice: Q4, Q6, Q8
Not all quantization is equal. Here’s what the levels mean for marketing and analytics use cases:
Q8 (INT8 — 8-bit integer). The most conservative quantization level. File sizes are roughly half the FP16 original. Quality degradation is negligible across virtually all tasks. If RAM isn’t a constraint, Q8 is the safe default for quality-critical work. A 13B parameter model at Q8 requires around 13-14GB of RAM.
Q6 (6-bit). A practical middle ground that most users won’t be able to distinguish from Q8 in everyday use. Smaller file sizes than Q8 with quality that holds up well for analytical and writing tasks.
Q4 (INT4 — 4-bit integer). The most widely used quantization level for running large models on consumer hardware. INT4 quantization delivers a 4x reduction from 16-bit precision — bringing a 70B model from 140GB down to approximately 35GB (digitalapplied.com). This is the quantization level that makes 70B models viable on a 48GB Mac Mini. Quality trade-offs are real but modest for most marketing, analytics, and writing tasks.
For a practical sense of scale: a 7B model at Q4 requires just 4-5GB of RAM (Sitepoint, 2026) — meaning it runs on any Mac sold in the last three years.
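The RAM figures above all follow from the same arithmetic: billions of parameters times bytes per weight. A rough estimator — treat these as floor values, since real GGUF files run slightly larger and inference needs extra headroom for context:

```python
def approx_weights_gb(params_billion: float, bits: int) -> float:
    """Approximate weight storage in GB: params (in billions) x bits / 8."""
    return params_billion * bits / 8

for label, bits in [("Q8", 8), ("Q6", 6), ("Q4", 4)]:
    print(f"13B at {label}: ~{approx_weights_gb(13, bits):.1f} GB")
    print(f" 7B at {label}: ~{approx_weights_gb(7, bits):.1f} GB")
```

13B at Q8 lands at ~13GB and 7B at Q4 at ~3.5GB — consistent with the 13-14GB and 4-5GB figures above once loading overhead is added.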
What Quantization Means for Your Hardware Decision
This is where quantization knowledge converts into a buying decision that makes sense.
The question isn’t “what’s the biggest model I can run?” — it’s “what’s the smallest model that’s good enough for the tasks I actually have?” For most marketing analytics use cases — querying first-party data, summarising reports, drafting content — a 13B or 32B model at Q4 or Q6 performs excellently. You don’t need 70B for everything.
Here’s the practical hardware map at Q4 quantization:
MacBook Air / Mac Mini M4 base (16GB RAM): Comfortably runs 7B and 13B models. That’s more than sufficient for drafting, summarisation, and simple data analysis. Good entry point for individuals.
Mac Mini M4 Pro (24GB RAM): Comfortably handles models in the 13B-20B class; 32B models are feasible if you drop to a lower-bit quantization. The practical sweet spot for a solo marketer who wants serious local AI capability.
Mac Mini M4 Pro (48GB RAM — $1,999): This is the community consensus pick for agencies. It runs 70B parameter models at Q4 (approximately 35GB RAM requirement) with room to spare. At 20+ tokens per second, it’s fast enough for interactive sessions. This is the machine that runs models that would have required a $20,000 GPU cluster two years ago.
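The hardware map above reduces to a simple fit check. This is a hypothetical helper, not part of any tool's API, and the 8GB headroom figure is a deliberately conservative assumption covering the OS, context window, and inference overhead:

```python
def fits_at_q4(ram_gb: float, params_billion: float, headroom_gb: float = 8.0) -> bool:
    """Does a Q4 model's weight footprint plus headroom fit in RAM?"""
    return params_billion * 4 / 8 + headroom_gb < ram_gb

for ram in (16, 24, 48):
    sizes = (7, 13, 20, 32, 70)
    largest = max((b for b in sizes if fits_at_q4(ram, b)), default=None)
    print(f"{ram}GB machine: largest comfortable Q4 model = {largest}B")
```

Under these assumptions the check reproduces the tiers above: 13B-class models on 16GB, 20B-class on 24GB, and 70B on 48GB.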
Does Quantization Hurt Quality?
Honestly, less than you’d expect — and it depends almost entirely on the task.
Community benchmarks consistently show that well-implemented INT4 quantization produces minimal measurable quality loss on conversational, analytical, and writing tasks. The degradation is most noticeable in complex multi-step reasoning chains and highly technical domains that require precise numerical outputs. For marketing analytics — querying first-party data, interpreting trends, drafting reports — Q4 models perform at a level that is practically indistinguishable from their full-precision counterparts in real use.
The quality gap between a Q4 70B model and a Q8 7B model is enormous — the larger model wins despite heavier quantization. Model size matters more than quantization level, within reason. If you have the RAM for a 70B model at Q4, that’s a better choice than a 7B model at Q8.
The rule of thumb: run the largest model your hardware can handle at Q4 or Q6. Don’t sacrifice model size for quantization precision — size wins.
The Data Worth Querying with Your Local Model
A capable local LLM is only as useful as the data it has access to. For WooCommerce operators, that means first-party event data in BigQuery — purchase events, customer behaviour, attribution signals — captured server-side before ad blockers or consent mode can touch it.
Transmute Engine™ is a dedicated first-party Node.js server that runs on your subdomain, not a WordPress plugin. The inPIPE WordPress plugin captures WooCommerce events and sends them via API to your Transmute Engine server, which routes them directly to BigQuery via Streaming Insert — complete, unsampled, and fully owned. That’s the dataset a local quantized LLM can query via RAG — your customer intelligence, analysed locally, with zero data leaving your infrastructure.
Key Takeaways
- Quantization converts high-precision model weights to lower-precision formats — INT8 or INT4 — reducing RAM requirements dramatically while preserving most capability.
- INT4 quantization delivers a 4x RAM reduction — shrinking a 70B model from 140GB at 16-bit precision to approximately 35GB (digitalapplied.com).
- A 7B model at Q4 requires just 4-5GB of RAM (Sitepoint, 2026) — it runs on any modern Mac.
- The Mac Mini M4 Pro 48GB ($1,999) runs 70B models at Q4 — the community consensus pick for agencies setting up a shared local AI server.
- GGUF is the file format you’ll encounter most often for quantized models in Ollama and LM Studio — the number after “q” is the quantization level.
- Run the largest model your hardware can handle at Q4 or Q6. Model size beats quantization precision — a Q4 70B model outperforms a Q8 7B model significantly.
- Quality loss from INT4 quantization is minimal for marketing, analytics, and writing use cases. Community benchmarks show negligible degradation on the tasks marketing teams actually run.
Frequently Asked Questions
Does Quantization Reduce Model Quality?
Yes, but minimally for most marketing and analytics use cases. INT4 quantization (Q4) produces measurable quality loss compared to full precision, but community benchmarks consistently show the degradation is negligible for conversational, analytical, writing, and data interpretation tasks. The quality loss is more noticeable in complex multi-step mathematical reasoning. For the tasks marketing teams actually run — querying data, summarising reports, drafting content — Q4 models perform at a practically indistinguishable level from full-precision models.
What Is GGUF?
GGUF (GPT-Generated Unified Format) is a file format for quantized AI models optimised for CPU and Apple Silicon inference. It’s the standard format for models used in Ollama and LM Studio. When you download a model in GGUF format, the filename typically includes the quantization level — for example, “llama3-70b-q4_k_m.gguf” means a 70B parameter Llama 3 model at INT4 quantization using the K-M variant. GGUF replaced the older GGML format and is the format to look for when downloading models for local use.
What’s the Difference Between Q4 and Q8?
Q4 (INT4) stores each model weight as a 4-bit integer, while Q8 (INT8) stores each weight as an 8-bit integer. Q8 uses roughly twice as much RAM as Q4 but preserves more of the original model quality. The practical choice: Q8 when RAM is not a constraint and you want maximum quality; Q4 when you want to run a larger model within a fixed RAM limit. For most marketing and analytics tasks, the quality difference between Q4 and Q8 is not meaningful in practice.
Which Quantization Level Should I Start With?
Start with Q4 or Q6 at the largest model your hardware can comfortably fit. As a rule of thumb: leave at least 4-6GB of RAM free after the model loads for context and inference overhead. If your primary tasks are conversational or analytical, Q4 is sufficient. If you’re doing tasks where precision matters — complex multi-step calculations, technical documentation review — consider Q6 or Q8 at a slightly smaller model size. Experimentation matters more than theory here; run the model and judge output quality for your specific use case.
Can a Mac Mini Run a 70B Model?
Yes. The Mac Mini M4 Pro with 48GB unified memory runs 70B parameter models at Q4 quantization — which requires approximately 35GB of RAM — at 20+ tokens per second. This is fast enough for interactive analytics sessions and general-purpose use. Apple Silicon’s unified memory architecture (where CPU and GPU share the same high-bandwidth memory pool) is the technical reason Macs punch above their weight for local AI inference compared to traditional PC hardware with separate VRAM.
Quantization is the bridge between the AI models that matter and the hardware your team can actually buy. Understanding it means you can make hardware decisions with confidence and choose the right model for your use case. Find out how Seresa’s first-party data infrastructure gives your local model something worth querying at seresa.io.
