    Python, Ollama, LLM

    🤖 My big list of AI/LLM tools, notes, and how I'm using them

    I have been using, working with, and running commercial and local LLMs for years, but I never got around to sharing the tools and applications I use. Here are some quick notes, tools, and resources I have landed on and use daily.

    Mac Apps I use:

    I do all of my development on Macs. These tools make running local LLMs accessible.

    • Ollama is the server that can download and run 100s of AI models.

    • Ollamac is a GUI client for Ollama that lets you write prompts, save your history, and quickly switch between models to test them out.

    • Tailscale - I use Tailscale on all of my devices, which gives me access to my work M2 Mac Studio and my home office Mac Mini Pro, both of which run Ollama, from anywhere in the world. This makes prototyping at home quick, and when I run a larger model from my work machine, it’s so fast that it feels like the machine is running in my house.

    • OpenAI Bundle - I bought this bundle because it was the cheapest way to get a bunch of AI apps, including four of Jordi Bruin’s apps. I have used these for a few years.

      • MacWhisper - I use MacWhisper to turn voice notes and podcasts into plain text files for my notes and sometimes blog articles.
      • Voices - I use Voices when I find a large blog post and want to listen to it while working.
    • Claude for Desktop gets a lot of crap for being “yet another Electron app” instead of a custom-built macOS app, but the people saying that don’t know what they are talking about. Claude Desktop has voice support and keyboard hotkeys, which make the app incredibly useful. More importantly, it also supports the Model Context Protocol, which lets Claude access your file system, git, and anything else you want to give it access to. It’s incredibly powerful, and there’s nothing quite like it.

    Baseline rules for running a model

    While running models locally is possible, consumer hardware is constrained by RAM and GPU even for the smallest models. The easiest mental model to work with is that a model’s parameter count in billions is roughly the amount of RAM it needs in gigabytes: an 8B model needs roughly 8 GB of RAM to fit into memory.

    My mental formula is somewhat lossy because 40B models fit in 32 GB of memory, and 72B models fit in 64 GB of memory with some room to spare. This is just the rough estimate that I use.

    Even though you can run models locally, even the smallest models with a significant context window will exceed your machine’s available RAM. A 128k context window needs about 64 GB of RAM to load fully into memory for an 8B parameter model, even though the model itself easily fits into 8 GB of RAM. That doesn’t mean the model won’t run locally, but it will run slower than it would on a machine with more than 72 GB of RAM, where the model and its full context fit into memory.
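
    To put rough numbers on that rule of thumb, here is a quick back-of-the-envelope sketch. The bytes-per-parameter values are my own assumptions for common precision and quantization levels, and it ignores the extra memory the context window needs.

    # memory-estimate.py
    # A rough back-of-the-envelope for the "billions of parameters ~= gigabytes
    # of RAM" rule. The bytes-per-parameter values are assumptions for common
    # precision/quantization levels and ignore the context window's overhead.
    def estimate_gb(billions_of_params: float, bytes_per_param: float) -> float:
        return billions_of_params * 1e9 * bytes_per_param / (1024**3)


    for label, bytes_per_param in [("fp16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
        for size in (8, 40, 72):
            print(f"{size}B at {label}: ~{estimate_gb(size, bytes_per_param):.0f} GB")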

    I look for three things when I’m evaluating a model:

    • The number of parameters, measured in billions
    • Context length
      • The input context length, which is effectively the model’s memory
      • The output context length, which is how big the answer can be
    • The type of model:
      • Default Models are general-purpose models like GPT-4 and Llama 3.3.
      • Vision Models can process and read visual data like images and videos.
      • Tool Models can call external tools and APIs and perform custom actions to which you give them access.
      • Embedding Models can turn text into vectors, which is useful for comparing your prompts against other text and for RAG operations.
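
    If you have already pulled a model, Ollama can report most of these numbers for you. Here is a minimal sketch using the ollama-python library; the exact response keys vary between Ollama versions, so I just print everything and look for the parameter size, quantization level, and context length.

    # inspect-model.py
    # Print what Ollama knows about a model: parameter size, quantization
    # level, model family, and (depending on the Ollama version) entries
    # like "llama.context_length". Assumes a local Ollama server and that
    # the model has already been pulled.
    import ollama

    info = ollama.show("llama3.1:latest")
    print(info)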

    What about quantization? Quantization can help you scale a model down so that it might fit into memory, but there’s always a loss in quality, which defeats the purpose of using the bigger model in my book.

    Keeping up

    My favorite resource for keeping up is Ollama’s Models page sorted by Newest. I check it a few times a day, and new models often show up there hours to days before the press releases catch up.

    I like Matt Williams' YouTube Channel a lot. It’s the one channel I come back to, and I find that I always learn something from it. His videos tend to be ten to twenty minutes long, which is about right since the material is so dense.

    Start with his Optimize Your AI Models videos. They’re a lot to fit in your brain, but they’re a great starting point.

    Simon Willison’s Weblog is good too.

    Python

    I’ll have to write a few posts on how I’m using LLMs with code, but Simon’s LLM is a good general-purpose AI hammer if you need one.

    As of last week, I’m using Pydantic AI instead of OpenAI’s or Anthropic’s Python libraries. Pydantic AI installs both of those libraries for you, but I find it to be 100% better, and it makes switching between models easier than LangChain (not linked) or anything else I have tried.
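
    Here is a rough sketch of what that looks like; swapping providers is a one-line change to the model string. It assumes an OPENAI_API_KEY is set in your environment, and note that early Pydantic AI releases expose the answer as result.data, which newer releases renamed to result.output.

    # pydantic-ai-sketch.py
    # A minimal Pydantic AI example. Swapping providers is a one-line change
    # to the model string, e.g. "anthropic:claude-3-5-sonnet-latest".
    # Assumes OPENAI_API_KEY is set in your environment.
    from pydantic_ai import Agent

    agent = Agent("openai:gpt-4o-mini", system_prompt="Be concise.")

    result = agent.run_sync("How much is 2+2?")
    print(result.data)  # renamed to result.output in newer releases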

    Wednesday January 29, 2025
    Python, Ollama, Today I Learned

    🦙 Ollama Tool Calling Loose Notes

    I spent a few hours this week working with the Ollama project and trying to get tool calling to work with the LangChain library.

    Tool calling is a way to expose Python functions to a language model so the model can call them. This lets models perform more complex actions and even reach out to the outside world for more information.

    I haven’t used LangChain before, and I found the whole process frustrating. The docs were full of errors. I eventually figured it out, but I was limited to one tool call per prompt, which felt broken.

    Earlier today, I was telling a colleague about it, and when we got back from grabbing coffee, I thought I would check the Ollama Discord channel to see if anyone else had figured it out. To my surprise, they added and released Tool support last night, which allowed me to ditch LangChain altogether.

    The Ollama project’s tool calling example was just enough to help get me started.

    I struggled with the function calling syntax, but after digging a bit deeper, I found this example from OpenAI’s Function calling docs, which matches the format the Ollama project is following. I still don’t fully understand it, but I got more functions working and verified that I can make multiple tool calls within the same prompt.
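
    For reference, here is a pared-down sketch of the shape that worked for me, loosely modeled on the Ollama project’s example. The get_current_weather function and its canned response are placeholders; the interesting part is the tools schema, which follows OpenAI’s function calling format.

    # tools-sketch.py
    # A pared-down Ollama tool calling example. get_current_weather is a
    # placeholder; a real version would call a weather API.
    import json

    import ollama


    def get_current_weather(city: str) -> str:
        return json.dumps({"city": city, "temperature_f": 72, "conditions": "sunny"})


    response = ollama.chat(
        model="llama3.1:latest",
        messages=[{"role": "user", "content": "What is the weather in Kansas City?"}],
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "get_current_weather",
                    "description": "Get the current weather for a city",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "city": {"type": "string", "description": "The city name"}
                        },
                        "required": ["city"],
                    },
                },
            }
        ],
    )

    # The model may return zero or more tool calls; run each one it asked for.
    for tool_call in response["message"].get("tool_calls") or []:
        arguments = tool_call["function"]["arguments"]
        print(get_current_weather(**arguments))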

    Meta’s Llama 3.1 model supports tool calling, and the two work quite well together. I am also impressed with Llama 3.1 and the large context window support. I’m running the 8B and 70B models on a Mac Studio, and they feel very close to the commercial APIs I have worked with, but I can run them locally.

    Embedding models

    Tonight, I tried out Ollama’s Embedding models example, and while I got it working, I still need to put practical data into it to give it a better test.
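
    A minimal version along the lines of Ollama’s example looks something like this; mxbai-embed-large is one of the embedding models Ollama hosts, but any embedding model you have pulled should work.

    # embeddings-sketch.py
    # Generate an embedding vector for each document. mxbai-embed-large is
    # one of Ollama's embedding models; swap in whichever you have pulled.
    import ollama

    documents = [
        "Llamas are members of the camelid family.",
        "Ollama can run large language models locally.",
    ]

    for doc in documents:
        response = ollama.embeddings(model="mxbai-embed-large", prompt=doc)
        vector = response["embedding"]
        print(len(vector), doc)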

    One more tip

    If you did not know, Ollama can return valid JSON; check out How to get JSON response from Ollama. It made my JSON parsing and responses much more reliable.
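
    With the ollama-python library, this boils down to passing format="json" and still asking for JSON in the prompt so the model knows what shape you want. A quick sketch:

    # json-sketch.py
    # Ask for JSON in the prompt and set format="json" so the response is
    # guaranteed to be parseable.
    import json

    import ollama

    response = ollama.chat(
        model="llama3.1:latest",
        format="json",
        messages=[
            {
                "role": "user",
                "content": "List three facts about llamas as JSON with a 'facts' array.",
            }
        ],
    )

    print(json.loads(response["message"]["content"]))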

    Friday July 26, 2024
    Python, Ollama, LLM, Today I Learned

    🦙 Ollama Llama 3.1 Red Pajama

    For a few weeks, I told friends I was excited to see if the new Llama 3.1 release would be as good as the hype.

    Yesterday, Llama 3.1 was released, and I was impressed that the Ollama project published a release to Homebrew and had the models ready to use.

    ➜ brew install ollama
    
    ➜ ollama serve
    
    # (optionally) I run Ollama as a background service
    ➜ brew services start ollama
    
    # This takes a while (defaults to the llama3.1:8b model)
    ➜ ollama pull llama3.1:latest 
    
    # (optional) This takes a longer time
    ➜ ollama pull llama3.1:70b
    
    # (optional) This takes so long that I skipped it and ordered a CAT6 cable...
    # ollama pull llama3.1:405b
    

    To chat with the model, you use the same ollama console command:

    ➜ ollama run llama3.1:latest
    >>> how much is 2+2?
    The answer to 2 + 2 is:
    4!
    
    Accessing Ollama Llama 3.1 with Python
    
    The Ollama project has an ollama-python library (https://github.com/ollama/ollama-python), which I use to build applications.
    
    My demo has a bit of flair because there are a few options, like --stream, that improve the quality of life while waiting for Ollama to return results.
    
    # hello-llama.py
    import typer
    
    from enum import Enum
    from ollama import Client
    from rich import print
    
    
    class Host(str, Enum):
        local = "http://127.0.0.1:11434"
        the_office = "http://the-office:11434"
    
    
    class ModelChoices(str, Enum):
        llama31 = "llama3.1:latest"
        llama31_70b = "llama3.1:70b"
    
    
    def main(
        host: Host = Host.local,
        local: bool = False,
        model: ModelChoices = ModelChoices.llama31,
        stream: bool = False,
    ):
        if local:
            host = Host.local
    
        client = Client(host=host.value)
    
        response = client.chat(
            model=model.value,
            messages=[
                {
                    "role": "user",
                    "content": \
                        "Please riff on the 'Llama Llama Red Pajama' book but using AI terms like the 'Ollama' server and the 'Llama 3.1' model."
                        "Instead of using 'Llama Llama', please use 'Ollama Llama 3.1'.",
                }
            ],
            stream=stream,
        )
    
        if stream:
            for chunk in response:
                print(chunk["message"]["content"], end="", flush=True)
            print()
    
        else:
            print(f"[yellow]{response['message']['content']}[/yellow]")
    
    if __name__ == "__main__":
        typer.run(main)
    

    Some of my family’s favorite books are the late Anna Dewdney’s Llama Llama books. Please buy and support her work. I can’t read Llama 3.1 and Ollama without thinking of the “Llama Llama Red Pajama” book.

    To set up and run this:

    # Install a few "nice to have" libraries
    ➜ pip install ollama rich typer
    
    # Run our demo
    ➜ python hello-llama.py --stream
    
    Here's a riff on "Llama Llama Red Pajama" but with an AI twist:
    
    **Ollama Llama 3.1, Ollama Llama 3.1**
    Mama said to Ollama Llama 3.1,
    "Dinner's done, time for some learning fun!"
    But Ollama Llama 3.1 didn't wanna play
    With the data sets and algorithms all day.
    
    He wanted to go out and get some rest,
    And dream of neural nets that were truly blessed.
    But Mama said, "No way, young Ollama Llama 3.1,
    You need to train on some more NLP."
    
    Ollama Llama 3.1 got so mad and blue
    He shouted at the cloud, "I don't wanna do this too!"
    But then he remembered all the things he could see,
    On the Ollama server, where his models would be.
    
    So he plugged in his GPU and gave a happy sigh
    And trained on some texts, till the morning light shone high.
    He learned about embeddings and wordplay too,
    And how to chat with humans, that's what he wanted to do.
    
    **The end**
    

    Connecting to Ollama

    I have two Macs running Ollama, and I use Tailscale to bounce between them from anywhere. When I’m at home upstairs, it’s quicker to run a local instance. When I’m on my 2019 MacBook Pro, it’s faster to connect to the office.

    The only stumbling block I ran into was needing to set a few ENV variables so that Ollama listens on a port that I can proxy to. This was frustrating to figure out, but I hope it saves you some time.

    ➜ launchctl setenv OLLAMA_HOST 0.0.0.0:11434
    ➜ launchctl setenv OLLAMA_ORIGINS http://*
    
    # Restart the Ollama server to pick up on the ENV vars
    ➜ brew services restart ollama
    

    Simon Willison’s LLM tool

    I also like using Simon Willison’s LLM tool, which supports a ton of different AI services via third-party plugins. I like the llm-ollama library, which allows us to connect to our local Ollama instance.

    When working with Ollama, I start with the Ollama run command, but I have a few bash scripts that might talk to OpenAI or Claude 3.5, and it’s nice to keep my brain in the same tooling space. LLM is useful for mixing and matching remote and local models.

    To install and use LLM + llm-ollama + Llama 3.1:

    Please note that the Ollama server should already be running as previously outlined.

    # Install llm
    ➜ brew install llm
    
    # Install llm-ollama
    ➜ llm install llm-ollama
    
    # List all of the models available from Ollama
    ➜ llm ollama list-models
    
    # Ask Llama 3.1 a question through llm
    ➜ llm -m llama3.1:latest "how much is 2+2?"
    The answer to 2 + 2 is:
    
    4
    

    Bonus: Mistral Large 2

    While I was working on this post, Mistral AI launched their Large Enough: Mistral Large 2 model. The Ollama project released support for the model within minutes of its announcement.

    The Mistral Large 2 release is noteworthy because it outperforms Llama 3.1’s 405B parameter model at less than a third of the size. It is also the second GPT-4 class model released in the last two days.

    Check out Simon’s post for more details and an LLM plugin that gives you another way to access it.

    Wednesday July 24, 2024