Actually, using llama.cpp it’s possible to run the full (although heavily quantized) 671B version of DeepSeek R1: see https://unsloth.ai/blog/deepseekr1-dynamic
The reason it’s possible to run a model with less physical RAM than the size of the model, as long as the mass storage is an SSD, is memory mapping (mmap): only the layers currently in use are loaded into RAM. Of course this slows operations down due to frequent loading of pages from the SSD, but since no writes are involved (the model is read-only), the SSD is not subject to wear and tear.
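The mechanism can be sketched in a few lines of Python (a simplified illustration, not llama.cpp’s actual loading code — the file here is a small stand-in for model weights):

```python
import mmap
import os
import tempfile

# Create a file standing in for model weights (small here for demo purposes).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x00" * (1024 * 1024))  # 1 MiB placeholder
    path = f.name

with open(path, "rb") as f:
    # ACCESS_READ gives a read-only mapping: the OS pages in only the
    # regions actually touched, and never writes pages back to disk --
    # which is why the SSD sees no write wear.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    chunk = mm[4096:8192]  # touching this range loads only those pages
    print(len(chunk))
    mm.close()

os.remove(path)
```

The mapping makes the whole file addressable as if it were in memory, while physical RAM only ever holds the pages being read; untouched layers stay on disk until needed.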
I’ve run it on a gaming laptop (MSI Katana 15 with 64 GB RAM), and it’s indeed very slow: about 0.22 tokens per second.