Run LLMs locally with Ollama

Getting started with local LLMs. Run DeepSeek Coder on your PC.

This tutorial explores the practical aspects of integrating and using DeepSeek Coder through Ollama. We aim to provide a comprehensive walkthrough for running LLMs locally.

This is a good first step towards mastering the local operation of large language models (LLMs) on your own devices, letting you power AI-based applications without depending on third-party APIs. Running LLMs locally also brings privacy, control, and customization, allowing developers and researchers to harness the full potential of these models securely and efficiently.

Understanding Ollama

Before getting into the technicalities of running LLMs locally, it's important to understand what Ollama is and its significance in leveraging large language models. Ollama is a framework designed to simplify the deployment and operation of large language models (LLMs) on local machines. It provides a flexible platform for developers and researchers, enabling them to run, customize, and manage LLMs efficiently without relying on external cloud services.

By abstracting the complexities involved in setting up and running LLMs, Ollama facilitates a more accessible and user-friendly experience. Ollama provides the tools and flexibility to innovate and experiment with LLMs locally, whether for development, research, or production purposes.

With this foundation, let's explore how you can integrate and use DeepSeek Coder through Ollama, harnessing the power of LLMs directly on your devices.

Installing Ollama

Ollama is available on the three major OSs: macOS, Linux, and Windows. To get started, download and install it for your platform directly from the Ollama website.

That's it, it's that simple.
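
If you want to confirm the installation from your terminal, Ollama's CLI exposes a version flag (it's listed in the help output shown later in this article):

$ ollama --version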

Introducing DeepSeek Coder

DeepSeek Coder is a series of open-source code language models developed by DeepSeek. With models trained on an extensive dataset comprising 87% code and 13% natural language, DeepSeek Coder is adept at understanding and generating code across various programming languages and tasks. Its training on 2 trillion tokens makes the models proficient not only at code generation but also at offering suggestions and optimizations that improve coding efficiency.

This is the model we'll use and run locally in this tutorial.

Running DeepSeek Coder with Ollama

After successfully installing Ollama, you can run DeepSeek Coder directly on your local machine. Note that I haven't covered hardware requirements: Ollama does a good job of running these models without dedicated hardware. I run them on a MacBook Pro M2 with no issues.

Models Available

  • 1.3 billion parameter model: Ideal for general code generation and smaller-scale tasks.

  • 6.7 billion parameter model: Offers a balance between performance and computational requirements, suitable for more complex code generation tasks.

  • 33 billion parameter model: Designed for the most demanding tasks, offering the highest level of performance and understanding.

💡
I have been running the default 1.3 billion parameter model on my laptop.

Ollama basic commands

Ollama offers various CLI commands. Start with ollama --help to see what's available:

$ ollama --help

Large language model runner

Usage:
  ollama [flags]
  ollama [command]

Available Commands:
  serve       Start ollama
  create      Create a model from a Modelfile
  show        Show information for a model
  run         Run a model
  pull        Pull a model from a registry
  push        Push a model to a registry
  list        List models
  cp          Copy a model
  rm          Remove a model
  help        Help about any command

Flags:
  -h, --help      help for ollama
  -v, --version   Show version information

Use "ollama [command] --help" for more information about a command.

As you can see, we have a few commands available. You can, for example, display the system prompt for a model using ollama show MODEL_NAME --system.

💡
You need to download the model first; instructions are in the next section.
$ ollama show deepseek-coder --system

You are an AI programming assistant, utilizing the Deepseek Coder model, 
developed by Deepseek Company, 
and you only answer questions related to computer science. 
For politically sensitive questions, security and privacy issues, 
and other non-computer science questions, you will refuse to answer.
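
If you would rather download a model ahead of time instead of letting the run command fetch it (as described in the next section), the pull command from the list above does exactly that:

$ ollama pull deepseek-coder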

Running Models

Ollama essentially allows two ways to interact with a model: directly from the CLI or through an API endpoint. The first time you run the CLI command, the model will be downloaded, so it might take some time; these models are surprisingly light in size, though. The 1.3 billion parameter DeepSeek Coder model is only around 800 MB.

To run DeepSeek Coder using the Ollama CLI, open your terminal and execute one of the following commands based on the model size you wish to use:

  • For the 1.3 billion parameter model:

      ollama run deepseek-coder
    
  • For the 6.7 billion parameter model:

      ollama run deepseek-coder:6.7b
    
  • For the 33 billion parameter model:

      ollama run deepseek-coder:33b
    

This command initializes the specified DeepSeek Coder model, preparing it to generate code or assist with your coding queries. The first time you run it, it will take longer as it requires downloading the model.

So, let's try it: run ollama run deepseek-coder in your terminal, and the interactive prompt will be ready.

$ ollama run deepseek-coder
>>> Make a simple Fast API server in Python with one health check endpoint.

Here is an example of how we can implement it using the Starlette (ASGI) web framework for python that 
provides us great flexibility and scalability when developing APIs, including its built-in support to 
provide HTTP HEAD requests on /healthcheck endpoints which FastAPI uses by default. 
This endpoint will return a simple "200 OK" response if the server is running correctly:

```python
from fastapi import FastAPI
app = FastAPI()

@app.get("/")
async def root():
    return {"message": "Hello World"}

@app.get("/healthcheck", status_code=200)  # Specify HTTP HEAD request here as default is GET which can't
be used in the same way for health check requests  
def read_root() -> dict:   
    return {"status": "up"}     
```    

This server will respond with a message to "/" and always say that it’s up when accessed through 
/healthcheck.  Please note this endpoint is not available on default configuration of FastAPI as by 
design, the root endpoints (like '/' or '') do NOT provide additional information about health check 
requests - they are designed for system status only

I asked it to make a simple FastAPI server in Python, and it performed well. If you test the code, you'll see it works. You can also ask follow-up questions:

>>> how do I run the code?
Once you've installed all necessary dependencies, here are general steps on how to start your FastAPI server and its health check endpoint 
using Uvicorn. 
This example assumes that `uvicorn main:app --host=0.0.0.0 --port=8001` is the command you run in terminal for development purposes, where 
"main" refers back to our python file from which FastAPI app was created (in this case it's called 'run') and `--reload` will refresh your 
server every time there are changes made.

You can then use curl commands like below for health checking of running app.   Replace 'localhost' and ':8001' with your server hostname &
port if they are different:    `curl http://localhost:8001/healthcheck -X HEAD ` (Use "-v" option to see more detailed response)
```bash 
# Run this command in terminal   --> For checking the health of FastAPI app.    You should be able get a "2xx OK" message if server is 
running correctly and it's healthy, otherwise you would expect different responses from /healthcheck endpoint or have some sort of error 
logs present indicating issues with your application starting up
curl http://localhost:8001/healthcheck -X HEAD  & python utils.log_handler   #--> this will log health check response in the file specified
(defaults to ./logs)    you can run above commands at least once and then try running FastAPI server again, it should provide more detailed
logs indicating whether app is up or not
```     

In both of these cases I've used HEAD request as we need only status information without actual content. It would be different if 
the endpoint requires a specific format/content type (like POST with json in above curl commands), then use `-d 'json={"key":"value"}' -H 
"Content-Type: application/json"'` instead to send data or JSON payload, depending on what your health check request actually needs.

This is great for simple tasks and for unlimited, offline access to LLMs. Now, let's see how to use these models to power other applications.

Running Models via API

Ollama also supports API access for applications that require integrating DeepSeek Coder programmatically. This method is particularly useful for web applications or services that need to generate code dynamically.

Once the model is running, Ollama exposes the API on http://localhost:11434 and you can use the generate API to get responses.

Here's how you can make a request using curl in your terminal after you use the run command:

curl -X POST http://localhost:11434/api/generate -d '{
  "model": "deepseek-coder",
  "prompt": "Make a simple Python script to add two numbers."
}'
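
By default, the generate endpoint streams its answer back as a series of JSON chunks. If you would rather receive a single JSON object, you can disable streaming in the request body, using the same stream flag the Python app below relies on:

curl -X POST http://localhost:11434/api/generate -d '{
  "model": "deepseek-coder",
  "prompt": "Make a simple Python script to add two numbers.",
  "stream": false
}'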

In both cases, the raw output in your terminal isn't particularly useful since nothing is parsing it. Let's see how to make a simple Python app that uses this API.

Make a Python app using Ollama's API

This app will be very simple and give you an idea of how to use the Ollama API. Create a new Python project, install the requests library with pip install requests, and paste the following code:

import requests
import json

# Ollama's local generate endpoint
url = "http://localhost:11434/api/generate"
headers = {
    'Content-Type': 'application/json',
}
# keeps every prompt and response so the model retains context
conversation_history = []

while True:
    print('Code with Deepseek Coder:')
    user_input = input("You: ")
    if user_input.lower() in ["quit", "exit"]:
        print("Exiting conversation.")
        break

    # append the new prompt and send the whole history as context
    conversation_history.append(user_input)
    full_prompt = "\n".join(conversation_history)
    data = {
        "model": "deepseek-coder",
        "stream": False,
        "prompt": full_prompt,
    }

    response = requests.post(url, headers=headers, data=json.dumps(data))
    if response.status_code == 200:
        response_data = response.json()
        ai_response = response_data["response"]
        print("AI:", ai_response)
        # store the answer so follow-up questions keep their context
        conversation_history.append(ai_response)
    else:
        print(f"Error: {response.status_code} - {response.text}")

Here is how it works:

Importing Necessary Libraries

  • requests: This is used for sending HTTP requests to the AI model's API, which is essential for communication between the script and the model.

  • json: Facilitates encoding and decoding data in JSON format. The script uses JSON to send data to and receive data from the AI model's API.

Setting Up API Communication

  • url: The API endpoint where the AI model is hosted. A local URL pointing to the model's generate function indicates that the AI model is running on the same machine as the script.

  • headers: Specifies that the content type of the request body is JSON. This header ensures that the server can correctly parse the data sent in the requests.

Initializing Conversation History

  • conversation_history: A list that keeps track of the entire conversation. It includes both the user's prompts and the AI's responses, maintaining context for the AI model to generate relevant replies.

Interaction Loop

  • The script enters an infinite loop, continuously accepting input from the user and generating responses from the AI model until the user decides to exit.

  • Prompt for User Input: The script prints a message inviting the user to input their message. It then waits for the user's input.

  • Exit Condition: If the user types "quit" or "exit," the script prints a farewell message and breaks out of the loop, ending the conversation.

  • Update Conversation History: The user's input is added to the conversation_history list.

  • Construct Full Prompt: The script combines all items in conversation_history into a single string, full_prompt, which serves as the context for the AI model's next response.

  • Prepare Data for Request: The data sent to the AI model includes the name of the model ("deepseek-coder"), a flag indicating that streaming is not used ("stream": False), and the full_prompt as the input for the model.

  • Sending the Request: The script sends a POST request to the AI model's API with the prepared headers and data. The data is encoded in JSON format.

  • Handling the Response:

    • If the request is successful (HTTP status code 200), the script decodes the JSON response to extract the AI's generated response and prints it. This response is also added to conversation_history to maintain the flow of the conversation.

    • If there's an error (e.g., network issues, server errors), the script prints the error status and message.

Run the code and ask a coding question; on my side, it performed well:

$ python3 main.py

Code with Deepseek Coder:
You: make a simple python script to add two numbers

AI: Sure! Here's a very basic Python program that takes in user inputs for the first number (num1) and then second number(Num2). It adds them together using an addition operation, prints out the result. 
Here is your requested code snippet :
```python
# Taking input from users/receiver of output stream by terminal or command line interface
print("Enter First Number: ") # prompt to enter first nummber  
num1 = float(input())    # user's response (both numbers) are stored in variable 'a'.  Typecast is necessary because we can get an input as string, so a number should be there.

# Input for the second numeric data type has to match with datatype of first numeral provided by users/receiver     # prompts user enter Second Number  
print("Enter The Another Nummber: ") 
num2 = float(input())    # same as above, this time we are getting input from terminal or command line interface. It is stored in variable 'b' and typecast to make sure it can perform an addition operation with the numbers provided by user  

# performing Addition of two inputs 
sum_of = num1 + num2    # sum equals first number added second one (numbers)    
print("The Sum is : ", sum_of )      # print out result. The program will output a text message saying the numbers and their sums   when we run this code in python console/terminal or IDEs like PyCharm, etc   
```  
You can copy paste it into your Python environment where you have an active internet connection (Python Repl is recommended). If not connected to any web server then replace "input" with just a number. It will work as long both numbers are numeric in nature ie., integers or floats, etc..

This is the code it gave me; it works well, even though it made a few typos!

# Taking input from users/receiver of output stream by terminal or command line interface
print("Enter First Number: ") # prompt to enter first nummber  
num1 = float(input())    # user's response (both numbers) are stored in variable 'a'.  Typecast is necessary because we can get an input as string, so a number should be there.

# Input for the second numeric data type has to match with datatype of first numeral provided by users/receiver     # prompts user enter Second Number  
print("Enter The Another Nummber: ") 
num2 = float(input())    # same as above, this time we are getting input from terminal or command line interface. It is stored in variable 'b' and typecast to make sure it can perform an addition operation with the numbers provided by user  

# performing Addition of two inputs 
sum_of = num1 + num2    # sum equals first number added second one (numbers)    
print("The Sum is : ", sum_of )      # print out result. The program will output a text message saying the numbers and their sums   when we run this code in python console/terminal or IDEs like PyCharm, etc

💡
Also, note that streaming is not handled in this simple example; the request disables it by setting "stream": False.
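
If you do want tokens to show up as they are generated, here is a minimal sketch of how the stream could be consumed. It assumes the generate endpoint returns newline-delimited JSON chunks, each carrying a response fragment and a final done flag; check the Ollama API documentation for the exact format your version uses.

import json
import requests

url = "http://localhost:11434/api/generate"
data = {
    "model": "deepseek-coder",
    "prompt": "Make a simple Python script to add two numbers.",
    # streaming is left on (the default), so chunks arrive as they are generated
}

# stream=True tells requests not to buffer the whole response body
with requests.post(url, json=data, stream=True) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        # each chunk carries a fragment of the answer in "response"
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            # the final chunk signals that generation is finished
            print()
            break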

So, this is a good way to run a local LLM and even use it to power your apps over the Ollama API. Find more APIs on the Ollama documentation.
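
For example, newer Ollama versions also expose a chat-style endpoint that takes a list of messages instead of a single prompt. A hedged sketch of what a request could look like (verify the endpoint and its exact format against the documentation for your Ollama version):

curl -X POST http://localhost:11434/api/chat -d '{
  "model": "deepseek-coder",
  "messages": [
    { "role": "user", "content": "Make a simple Python script to add two numbers." }
  ]
}'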

Conclusion

In this tutorial, we've journeyed through running Large Language Models (LLMs) locally, leveraging the Ollama framework to deploy and operate DeepSeek Coder directly on personal devices. This exploration underscores the significant advantages of local LLM operation, including enhanced privacy, control, and the ability to customize the models to meet specific needs without reliance on third-party APIs. Through a step-by-step walkthrough, we've demonstrated the ease of installing Ollama across major operating systems and the simplicity of integrating and running the versatile DeepSeek Coder, covering various model sizes to accommodate a wide range of tasks.

Now go and create cool apps!
