Because it’s open source, all of the code that makes network calls is public. I have mine hosted offline and connect to it via Open WebUI.
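If you want to poke at a locally hosted model directly, here’s a minimal sketch. It assumes your local stack exposes an OpenAI-compatible chat endpoint (many local setups, including Ollama behind Open WebUI, can do this); the URL, port, and model name below are placeholders for whatever you actually run.

```python
# Minimal sketch: query a locally hosted model over the loopback interface.
# Assumes an OpenAI-compatible chat endpoint; the URL, port, and model name
# are placeholders -- adjust them to match your own setup.
import requests

LOCAL_URL = "http://localhost:11434/v1/chat/completions"  # e.g. a local Ollama instance

resp = requests.post(
    LOCAL_URL,
    json={
        "model": "llama3",  # whichever model you have pulled locally
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

Because everything stays on localhost, you can unplug the network and the request still works, which is easy to verify yourself.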
I would say at least 32 GB of RAM, 16+ CPU cores, and an Nvidia RTX 4060 Ti 16 GB or AMD RX 7800 XT 16 GB or better. But refer to this page.
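As a rough sanity check against those numbers, something like this will print what your machine actually has. The thresholds are just the suggestions above, not hard requirements, and the VRAM check assumes you have a GPU-enabled PyTorch build installed.

```python
# Rough sanity check of the suggested minimums above (32 GB RAM,
# 16+ CPU cores, ~16 GB of VRAM). Thresholds are guidelines, not hard limits.
import os
import psutil  # pip install psutil

ram_gb = psutil.virtual_memory().total / 1024**3
cores = os.cpu_count() or 0
print(f"RAM:   {ram_gb:.1f} GB ({'ok' if ram_gb >= 32 else 'below suggestion'})")
print(f"Cores: {cores} ({'ok' if cores >= 16 else 'below suggestion'})")

try:
    import torch  # optional; needs a CUDA/ROCm build to see the GPU
    if torch.cuda.is_available():
        vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
        print(f"VRAM:  {vram_gb:.1f} GB ({'ok' if vram_gb >= 16 else 'below suggestion'})")
    else:
        print("No GPU visible to PyTorch; check nvidia-smi or rocm-smi instead.")
except ImportError:
    print("PyTorch not installed; check VRAM with nvidia-smi or rocm-smi instead.")
```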
The two options are the full model or a quantized model. Most of us will be running quantized models, but here’s a comparison.
1. Full Model (FP16/FP32)
- Precision: Uses full-precision floating-point numbers (16-bit or 32-bit).
- Memory Usage: Requires a lot of VRAM or RAM—often 20 GB+ for larger models.
- Performance: Provides the highest accuracy and retains all fine-tuned capabilities.
- Inference Speed: Slower than quantized models due to higher compute demands; needs powerful hardware.
- Use Case: Best for research, high-end inference, and tasks where precision is crucial.
Pros:
Highest accuracy
Retains all model weights and structures
Best suited for detailed and complex tasks
Cons:
Very high memory requirements
Requires powerful GPUs or TPUs
Slower inference speed
2. Quantized Models (8-bit, 4-bit, etc.)
- Precision: Weights compressed to lower-bit representations (8-bit, 4-bit, or lower).
- Memory Usage: Much lower; many models fit in 8 GB of VRAM or less.
- Performance: Slight accuracy loss, more noticeable at 4-bit and below.
- Inference Speed: Faster than full models and lighter on compute.
- Use Case: Consumer hardware, local setups, and anywhere fast responses matter more than maximum precision.
Pros:
Requires less RAM/VRAM (can run on 8GB GPUs or lower)
Faster inference times
Works well on consumer-grade hardware
Cons:
Slight degradation in accuracy (more noticeable in 4-bit or extreme quantization)
Some models may lose certain capabilities due to weight compression
Not ideal for research-level precision
Comparison Table
Feature | Full Model (FP16/FP32) | Quantized Model (8-bit / 4-bit)
---|---|---
Accuracy | Highest | Slightly lower (more noticeable at 4-bit)
Memory Usage | Very high (often 20 GB+) | Low (8 GB of VRAM or less)
Inference Speed | Slower | Faster
Hardware Needs | High-end GPU/TPU | Consumer-grade GPU
Best For | Research, high-detail inference | Consumer hardware, efficient inference
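The memory row in that table is mostly simple arithmetic: each parameter costs about 4 bytes at FP32, 2 at FP16, 1 at 8-bit, and 0.5 at 4-bit, ignoring KV cache and runtime overhead. A rough back-of-the-envelope sketch:

```python
# Back-of-the-envelope weight-memory estimate. Ignores KV cache, activations,
# and runtime overhead, so treat the numbers as a floor, not an exact figure.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "8-bit": 1.0, "4-bit": 0.5}

def weight_memory_gb(params_billion: float, precision: str) -> float:
    return params_billion * 1e9 * BYTES_PER_PARAM[precision] / 1024**3

for model, size in [("7B", 7), ("13B", 13), ("70B", 70)]:
    row = ", ".join(f"{p}: {weight_memory_gb(size, p):5.1f} GB" for p in BYTES_PER_PARAM)
    print(f"{model:>4} -> {row}")
```

So a 7B model that needs roughly 13 GB of weights at FP16 drops to around 3–4 GB at 4-bit, which is why quantized models fit on consumer cards.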
Which One Should You Use?
- If you have a high-end GPU with lots of VRAM, use the full model.
- On a consumer GPU, or if you just need fast responses, use a quantized model.
- Fine-tuning? Use the full model (though quantized can work, depending on the framework).
If run offline, it’s self-sustained; these models do not require an internet connection to run.
If you need to monitor your GPU performance, check out this article/list: