There is much conversation around the rising cost of public AI, which is driving more people to look inward, and experiment with running an AI LLM locally on their own desktop. I am going to take you through the steps to do that so you can discover how painfully easy it is, and how good they really are.
Now before you get too excited, as the old saying goes, mileage will vary. You are not going to achieve anywhere the performance of the public models sitting at the end of an API call. I am going to assume you have standard hardware on your local machine, and not a rack of Nvidia graphic cards providing not only the heating for your room, but a load of GPU compute. That said, for most tasks, the performance is more than adequate.
Let’s get started and lift the lid on this. Get you running in no time.
What is an AI LLM file?
Most don’t think of what an LLM actually is – all they know is typing into a chat box, some magic happens in the cloud, and results are then returned. In reality, an LLM is just a file, albeit a very large one, in the GGUF (GPT-Generated Unified Format) which has become the standard binary format to package a model for local execution. This includes all the learnings, weights and metadata to run the model.

Just like any other file format, you need an app to run it. Like a JPG file requiring a viewer, a GGUF file requires a model runner to make it accessible.
If you are familiar with the world of video files, you will be familiar the wide range of codecs, creating a whole sub-standard of file naming to distinguish between one another (H264 / H265 / 10bit etc.), but the actual wrapper file (mkv/mp4) is a single file and it is the viewer (the popular VLC viewer) that knows how to render the video. LLM AI files are no different – though it is all about model weights (as you can see in the screen shot above). This is deserving of an article in of itself. Once you get comfortable running your own LLM engines, they will become clearer as to their meaning/impact.
Where do I get LLM GGUF files?
The most popular place to get open source files is Hugging Face – the “GitHub” of AI models. This is a whole ecosystem for those developing models training (taking a base model and training it on a specific data set to be highly specialized). For us, we just want to consume their hard (and we appreciate it) work locally.

Head over to Hugging Face and discover some of the near 3 million models currently available, covering a wide range of use cases. Now this can be intimidating to begin with, so bookmark this site for the time being.
Word of caution, hosting an LLM model requires a lot of compute, so just because you can download the file, does not mean you will be able to execute it. Disk space is not your yard stick here. Fortunately help is at hand as to what you can and cannot run.
Top 3 Self hosting AI LLM Apps
As you can imagine, running a model is a little more complicated than rendering a JPG file. Given the sheer compute/memory required, apps have been designed to inspect the local hardware setup and utilize the memory, CPU and GPU as efficiently as possible. Some apps abstract all this away from you, others let you fine tune the execution.
The following apps abstract all the logistics away from hosting and setting up the environment, sitting on top of some very powerful open source software (llama.cpp is to AI what ffmpeg is to video). Incidentally, if you have an Apple M series, then you are in luck, as Apple have their own MLX framework, that works seamless across their CPU/GPU (two of these apps take advantage of that). They all offer a clean GUI (Linux, Windows and Mac) and do all the heavy lifting of acquiring and execution models for you.
These are listed in the order of increasing functionality. Ollama is relatively basic, but can get you going quickly. Though, given the ethical controversy that is surrounding Ollama not crediting the underlying framework and its limited functionality, I am going to skip this one.
Both the other two offer functionality beyond that of running and interacting with your local model (including automation, workflows and team work). I am going to leave those features to a future in-depth article.
Msty Studio
Msty offers a clean UX allowing you to quickly install and configure llama.cpp, as well as its own Local AI runtime engine. Once you have these configured (auto detecting any GPU cards) you can move quickly to the model selection area.
Msty and LM Studio both offer an integrated model/search feature, which keeps things real simple and quick. The real nice feature is their guidance, based on your local hardware, as to which models will run perfectly acceptable, the ones that will struggle, and the ones that simply won’t run at all.

Look for the little green icon indicating the model selection will run happily. Select the one you want and then hit the download button. Msty will download the model in the background and once acquired, will let you start to prompt against it.
Incidentally the “search” box will accept a full Hugging Face link to a model you have looked up on their website. It will inspect and then make recommendations on whether or not you can use it or not. LM Studio has the same feature and it is a beautiful feature to include.
Feel free to start downloading some models so you can start experimenting. Keep an eye on your disk space as these models will chew up GB’s very quickly. Having acquired a handful, move over to the main prompting UX space. This is where you will spend most of your time. It is how you interact with your model.

You will be faced with a familiar prompting chatting experience, with previous chats on the left, and a chat text box on the bottom. Like Claude/ChatGPT/Gemini, you can choose the model to which you want to utilize. Except this time, the selection is the ones you have downloaded.
So lets try it out. I have selected a model that is optimized for coding, so let me ask it to write a small node app. You can see the results below in the 18 second video.

This was a pretty fast example, at 219 tokens/second (a universal rough benchmark you will often see people refer to, especially if you frequent some of the Reddit subreddits). One of the nice features of the local models is that you get a little more raw metrics on what is going on that you do with the public cloud based models.
This is something you should keep an eye on, as different models will have different metrics. I ran the same prompt on a much larger model (Qwen3.5-9B-Q4_K_M.gguf) and it took 83 tokens/second.
Now this was a completely raw interaction. Msty gives you the tools to create persona’s and customize the tools that will be passed to the model. The persona is effectively the system prompt that will be passed to the model on each call (something I went into detail previously with “Behind the Prompt Curtain“). Each model lets you customize the parameters to which is sent to the model, and assuming you read my previous post, these will start to make some sense. Temperature is a universal metric on the randomness of the output – the higher the number the more random it is.

You can pass images/files to the models (assuming they support it) just as you would normally. Tools are supported via MCP, and Msty has a nice built-in tool to let the model make Google searches if needs to. The more tools you pass to the model the more of the context it uses up – so utilize them sparingly.
Msty is a lot of fun and makes it real easy to getting started. For most, this will be more than adequate.
LM Studio
Now, if you want to have a little more control then look at LM Studio, which is a very complete wrapper to llama.cpp that runs the model under the covers. One of the first things you will notice, is how well LM Studio lets you know the state of your current setup, highlighting the resources it has, and giving you the ability to offload some of the processing from the GPU to your main CPU/Memory.

Don’t go changing anything too much out of the gate, except to ensure your GPU is enabled if you have one installed. Now let us get some models installed. Again, like Msty, you can use the in-app search to find models to go after, with it advising in real time, the ones that will comfortably run within your hardware. I found this one to be a little more informative, in terms of which models will run exclusively on your GPU or needs to be partially offloaded to the CPU.

Again, this one supports dropping in a Hugging Face URL to a model, that will then inspect and determine if what you are about to download is a wise decision.
Once you have some models acquired, you can then choose to load them. LM Studio differs slightly from the others, in that will pre-load the models in to memory (GPU or main depending on setup).

This time, lets ask it to describe a given image and write up some content for promoting it on social media. As you can see from the video below, the results are extremely impressive and fast. Do not let anyone tell you running locally is not feasible.
LM Studio lets you setup what they call presets, which is the model configuration (temperature etc.) and the system prompt that best describes the context to which you are in. It is very worthwhile setting these up, as they will improve the output of the model considerable.
Quick piece of advice – models that support “Thinking Mode” will have it turned on by default. This is where it basically deliberates with itself before giving you back an answer. Useful when you are needing to do complex problem solving or need it to review something in detail. Most of the time however, you do not need it and you can turn it off. The difference in speed/response is huge.
Integration with Development Tools
If you are looking to use local models for development, you are not going to do this via the chat prompt. You are going to integrate a coding agent into your world. All of these apps offer easy hosting of these, by exposing your loaded model (and optional system context) via an OpenAI compliant API on a built in server.

This allows you to hook in the likes of Zed and Open Code, including all the tools that the AI model may ask the calling agent to execute for it on its behalf. This works surprisingly well and again, calling the local model, means your source code is not leaving your machine.
Why run locally
While the AI models available are not necessarily the latest’n’greatest, they are maybe 6-12 months behind their contemporary public cloud counterparts (though the likes of DeepSeek/Kimi/Qwen this can be shorter). Think of it like this – the AI you were using in Gemini/ChatGPT/Claude 6 months ago, is “close” to what you can get locally now. So with that, what are the reasons you want to run locally?
- Cost
Outside of the original hardware outlay (which you most likely already have spent) it costs $0 to run prompts. - Security
Knowing your data remains within your domain. Processing sensitive financial (or even medical) data without fear that isn’t being used to train a public model. - Predictability
Public models go on and off line, depending on their load; your world is completely under your control. - Offline access
Running locally, means you are not requiring network access. Handy if you are up in the air, or in an area where internet access is problematic (or in a clean room implementation deep inside the CIA!) - Experimentation
Easy to do extensive research to try different highly specialized models. - Avoiding Censorship
Public models can be heavily censored; even from profanity usage. Local models do not suffer this fate, depending on your use case.
If you are interested in figuring out just what types of models your current laptop/desktop can handle then head on over to https://llmfitcheck.com. This will filter the models (largely from Hugging Face) that you comfortable can manage.
In Summary
The secret to the success of running local is to temper expectations and be realistic. You may not be able to replace your Claude/ChatGPT subscription instantly, but what you can do is to whin yourself off of them slowly. I find myself reaching for LM Studio/Gemma a lot more often, especially for tasks that do not involve any internet searches. Writing validation, email interpretation, PDF analysis for example are all things I have thrown locally without any issues.
I have also left Zed running overnight, creating unit tests for a large code base I am working on, and come in the next morning to great success. All without it costing a single dime.
Nothing stopping you from trying. Download one of the apps and start playing.
No models will be harmed.
AI Disclaimer: Gemini Nano Banana Pro was used to generate the photo – Mission Impossible






Leave a Reply