Hello hydn. I’m testing the R1 model locally with Ollama and Chatbox AI, out of sheer curiosity and because it’s open source.
So far I have had a good experience in English; in my language, which is Spanish, it has been a disaster.
Thank you very much for your linuxblog.io article Install DeepSeek on Linux in 3 Minutes, which of course I studied and used as a guide for my experiments, and through which I found linuxcommunity.io.
As it’s open source, all the code making network calls is public. I have mine hosted offline and connect to it via Open WebUI.
I would say at least 32 GB of RAM, 16+ CPU cores, and an NVIDIA RTX 4060 Ti 16 GB or AMD RX 7800 XT 16 GB or better. But refer to this page.
The two options are the full model or quantized models. Most of us will be running the quantized models, but here’s a comparison.
1. Full Model (FP16/FP32)
- Precision: Uses full-precision floating-point numbers (16-bit or 32-bit).
- Memory Usage: Requires a lot of VRAM or RAM—often 20 GB+ for larger models.
- Performance: Provides the highest accuracy and retains all fine-tuned capabilities.
- Inference Speed: Slower than quantized models due to higher computation demands; needs powerful hardware.
- Use Case: Best for research, high-end inference, and tasks where precision is crucial.
Pros:
Highest accuracy
Retains all model weights and structures
Best suited for detailed and complex tasks
Cons:
Very high memory requirements
Requires powerful GPUs or TPUs
Slower inference speed
2. Quantized Models (8-bit, 4-bit, etc.)
- Precision: Weights compressed to 8-bit or 4-bit integers
- Memory Usage: Much lower; many quantized models run on 8 GB GPUs or less (rough math sketched after the pros/cons below)
- Performance: Slight accuracy loss, most noticeable with 4-bit or more extreme quantization
- Inference Speed: Faster than the full model, with far lower compute demands
- Use Case: Consumer-grade hardware, local inference, and fast responses
Pros:
Requires less RAM/VRAM (can run on 8GB GPUs or lower)
Faster inference times
Works well on consumer-grade hardware
Cons:
Slight degradation in accuracy (more noticeable in 4-bit or extreme quantization)
Some models may lose certain capabilities due to weight compression
Not ideal for research-level precision
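To make the memory gap concrete, here is a rough back-of-the-envelope sketch (approximate figures only; real GGUF files differ a bit because quantization formats store extra scale data per block, and the KV cache adds more on top):

```python
# Rough weight-size estimate for a 7B-parameter model at different precisions.
# Ignores KV cache and runtime overhead, which add a few more GB in practice.

def weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / (1024 ** 3)

n_params = 7e9  # e.g. a 7B model

for label, bits in [("FP32", 32), ("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>5}: ~{weights_gib(n_params, bits):.1f} GiB")

# Prints roughly:
#  FP32: ~26.1 GiB
#  FP16: ~13.0 GiB
# 8-bit: ~6.5 GiB
# 4-bit: ~3.3 GiB
```

That is why a 4-bit model fits comfortably on an 8 GB card, while the full-precision version needs 20 GB+ before you even count overhead.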
Comparison Table
Feature | Full Model (FP16/FP32) | Quantized Model (8-bit / 4-bit)
---|---|---
Accuracy | Highest | Slightly lower (more noticeable at 4-bit)
Memory Usage | Very high (often 20 GB+) | Low (can run on 8 GB GPUs or less)
Inference Speed | Slower | Faster
Hardware Needs | High-end GPU or TPU | Consumer-grade GPU or CPU
Best For | Research, high-detail inference | Consumer hardware, efficient inference
Which One Should You Use?
- If you have a high-end GPU with lots of VRAM, use the full model.
- On a consumer GPU, or if you just need fast responses, use a quantized model (see the sketch below).
- Fine-tuning? Full models, though quantized can work depending on the framework.
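As a quick test once you’ve pulled a quantized model with Ollama, here is a minimal sketch that queries it through Ollama’s local HTTP API. It assumes Ollama is running on its default port (11434) and that a quantized tag such as deepseek-r1:8b is already pulled; swap in whatever model you actually installed.

```python
# Minimal sketch: ask a locally hosted quantized model a question via
# Ollama's HTTP API (default endpoint http://localhost:11434).
import json
import urllib.request

payload = {
    "model": "deepseek-r1:8b",   # assumes this quantized tag is already pulled
    "prompt": "In one sentence, why do quantized models fit on consumer GPUs?",
    "stream": False,             # return a single JSON response instead of a stream
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```

Setting `"stream": False` keeps it simple; for interactive use you would stream tokens instead.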
If offline, it’s self-sustaining; these models do not require an internet connection to run.
If you need to monitor your GPU performance, check out this article/list:
Hi all. First of all, thanks for the great guidance on the installation process. I am planning to install DeepSeek on my virtual machine in AWS and use it to prepare SQL queries that will fetch data from my database.
The question is: which model would be sufficient for my use case, and what is the most cost-efficient EC2 instance type that could run DeepSeek?
Any help is appreciated. Thanks.
Hi @Ubuntu, welcome to our Linux community! Nice job securing that username.
If your SQL queries are relatively simple and follow predictable patterns, a 4-bit quantized model should work fine. For complex multi-table queries or optimizations, an 8-bit quantized model is a safer choice.
You will want to use DeepSeek Coder: https://deepseekcoder.github.io/
As for instance type: g4dn.xlarge (NVIDIA T4) or g5.xlarge (NVIDIA A10G). If you can get what you need out of the smallest models, then even a c6i.4xlarge (CPU-only) will do.
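To give a feel for the workflow, here is a rough sketch of prompting a local model for SQL through the ollama Python package. The model tag and schema are placeholders for illustration; swap in whichever quantized coder model you end up pulling.

```python
# Sketch: generate a SQL query from a locally hosted quantized model via
# the ollama Python package (pip install ollama). The model tag and the
# schema below are placeholders.
import ollama

SCHEMA = """
orders(id, customer_id, total, created_at)
customers(id, name, country)
"""

question = "Total order value per country for the last 30 days."

response = ollama.chat(
    model="deepseek-coder-v2:16b",  # placeholder tag; use whichever coder model you pulled
    messages=[
        {"role": "system", "content": "You are a SQL assistant. Reply with a single SQL query only."},
        {"role": "user", "content": f"Schema:\n{SCHEMA}\nQuestion: {question}"},
    ],
)

print(response["message"]["content"])
```

Keeping the schema in the prompt and asking for a single query only tends to keep even 4-bit models on track for simple, predictable patterns.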
Ollama is made by clowns.
I’ll be using DeepSeek in a different framework;
maybe write an article on how to deploy DeepSeek text-to-image with a web UI?
Thanks a lot! Will try g4dn.xlarge and see how it goes.
Actually, using Llama.cpp it’s possible to run the full (although heavily quantized) 671B version of DeepSeek R1: see https://unsloth.ai/blog/deepseekr1-dynamic
The reason it’s possible to run a model with less physical RAM than the size of the model, as long as the mass storage is an SSD, is memory mapping (mmap): only the layers currently in use are loaded into RAM. Of course this slows down operations due to frequent loading of pages from the SSD, but since no writes are involved (the model is read-only), the SSD is not subject to wear and tear.
I’ve run it on a gaming laptop (an MSI Katana 15 with 64 GB of RAM), and it’s indeed very slow: about 0.22 tokens per second.
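For anyone who wants to try the same thing from Python instead of the llama.cpp CLI, here is a minimal sketch using the llama-cpp-python bindings. The GGUF path is a placeholder for whichever dynamic-quant file you download from the Unsloth page; mmap is the default behavior anyway, it is only spelled out here to make the point explicit.

```python
# Sketch: run a heavily quantized GGUF with llama-cpp-python, relying on
# mmap so the file is paged in from SSD on demand instead of being fully
# loaded into RAM. The model path below is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/DeepSeek-R1-UD-IQ1_S.gguf",  # placeholder path to the downloaded GGUF
    n_ctx=2048,        # modest context to keep the KV cache small
    n_gpu_layers=0,    # CPU-only; raise this if you can offload layers to a GPU
    use_mmap=True,     # map the file instead of loading it all up front
    use_mlock=False,   # don't pin pages; lets the OS evict layers not in use
)

out = llm("Explain memory mapping in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```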
Been loving your work here @hydn; thanks for the insights and excellent walkthroughs. Have not yet implemented, but am gearing up.
Speaking of gearing up, do you talk anywhere about the best headless (ideally mini, but possibly full-form) system to use exclusively for this?
I have been favoring Beelink machines for the last few years; they tend to be AMD-based, but Beelink also has Intel-based systems. Curious if you have ever worked with Beelink machines, or if any comparable machines come to mind?
They are basically laptop hardware ramped up to the level of a stationary system, and they work great both as servers and as workstations. For example, they are geared toward multi-monitor and ‘gamer’ implementations… but they unfortunately disproportionately favor Windows, which is a major pain from the perspective of hardware support.
I have a hell of a time trying to get Linux to play nice with them, but I kept using them anyway, especially when it comes to AMD support for X11, with many weird wrinkles. Still, I suspect they will be an awesome option for AI-only systems, as Docker clusters, etc. I am trying to find the best (performance/heat/energy/etc. vs. cost) arrangement to implement on-site as a “local cloud fallback”, where everything that runs in a VPC can have a fail-safe LAN mirror with a master-master update system between them, since we control/develop all the systems, or all the systems are F/OSS and chosen for their ability to be master-master.
Pardon the lengthy follow-up, but I am trying to work out how best to implement your recommendations and demonstrations so far as a standardizable model next.
Thanks for the kind words @migrator
So I’ve worked with this one before, for a friend who owns a small business and needed to replace a full-size tower with something small because their storefront cashout area is small. What I can say is that after 1 year it has had no issues yet. It’s a lot smaller than the photos make it seem, so I was able to mount it behind the 24" monitor. They are indeed very powerful. It runs Win 11 and a pretty heavy POS application. This was the model I purchased on March 19th, 2024.
Reviews suggest that this one should work with Linux:
I didn’t realize they were trouble to get working with Linux; I have not tried. My go-to device has been the ThinkCentre Tiny by Lenovo — works like a dream with Linux. I buy them on eBay for around $200.
You can see the ThinkCentre pictured and reviewed a bit here:
I own two; one failed after 4 years due to a lightning strike, so I replaced it with this model off eBay:
I have not gotten around to the review yet, but it would be 5/5: power, cooling, noise, compatibility, ports, etc. It’s 100% a mini server.
Well, only for small models, like < 8B. Unless you have something like 64 GB of RAM, and even then CPU inference is slow, so only the smaller models work without too long a delay. I’m running deepseek-r1:8b on the other ThinkCentre Tiny, which has 16 GB of RAM: DeepSeek Local: How to Self-Host DeepSeek (Privacy and Control) - #3 by hydn
Glad to know you have worked with Beelink, @hydn, and great recommendation on the ThinkCentre Tiny… And a “lightning strike” is the kind of hardware-failure cause I am willing to accept.
It seems like you might be the ideal person to do some projects with, since you have so much experience and focus on the same core purpose. Also, comments like “cons: one less closet, unhappy wife” show real understanding.
Funny enough, I also have that same Beelink model deployed, and it is by far the most reliable one, and it is Intel-based. I have never had any issue with it other than the fan… after running it for 4 years straight, having started it out as a mobile lab system in the middle of the desert, hours from civilization, with extremely fine sand. Still using it! And I have deployed others of the same line in the field. I tend to use those as a “jump-off point” inside a LAN, with a VPN connection into an external secure VPC. That is the go-to system I had in mind to use as a local web server, but now I will consider the ThinkCentre Tiny as well and try that too.
Thanks for the thoughtful answer, and glad to hear it will not take a massive beast to run a headless digital agent, nor expand the closet too seriously. I am fortunate my wife did not really want the (coat) closet I took over (that badly).