Running Large Language Models (LLMs) locally gives you data privacy, freedom from per-request API costs and network round-trips, and full control over your inference environment. This guide demonstrates how to containerize Ollama, automate the Mistral model download, and expose it through an Nginx reverse proxy to a Golang backend.

1. Orchestrating the Stack (Docker Compose)
We use a “double-container” pattern for Ollama: one to initialize the model and one to serve it. This prevents the backend from querying a model that hasn’t finished downloading. Here is a sample Docker Compose file.
services:
  # The Gateway
  nginx:
    image: nginx:alpine
    ports: ["3001:80"]
    volumes: ["./docker/nginx.conf:/etc/nginx/nginx.conf"]
    depends_on:
      backend: { condition: service_started }
      frontend: { condition: service_healthy }

  # LLM Initializer: pulls Mistral, then exits
  llama-init:
    image: ollama/ollama:latest
    volumes: ["./models:/root/.ollama/models"]
    # "ollama pull" talks to a running server, so start one temporarily for the pull
    entrypoint: >
      bash -c "ollama serve & sleep 5 && ollama pull mistral && echo 'Init complete'"

  # LLM Inference Server
  llama:
    image: ollama/ollama:latest
    volumes: ["./models:/root/.ollama/models"]
    environment:
      - OLLAMA_NUM_PARALLEL=1
      - OLLAMA_NUM_THREAD=8
    depends_on:
      llama-init: { condition: service_completed_successfully }
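The depends_on condition handles ordering at startup, but you can also make the Go backend defensive in its own right (for example, after restarting only the llama service) by polling Ollama's /api/tags endpoint until the model is listed. The sketch below is illustrative rather than part of the stack above; the waitForModel helper and the direct http://llama:11434 address on the Compose network are assumptions about how your backend is wired.

import (
    "context"
    "encoding/json"
    "fmt"
    "net/http"
    "strings"
    "time"
)

// waitForModel polls Ollama's /api/tags until a model whose name starts with
// "model" (e.g. "mistral" matching "mistral:latest") is listed, or ctx expires.
func waitForModel(ctx context.Context, baseURL, model string) error {
    for {
        if present, err := modelPresent(ctx, baseURL, model); err == nil && present {
            return nil
        }
        select {
        case <-ctx.Done():
            return fmt.Errorf("model %q never became available: %w", model, ctx.Err())
        case <-time.After(2 * time.Second):
        }
    }
}

func modelPresent(ctx context.Context, baseURL, model string) (bool, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, baseURL+"/api/tags", nil)
    if err != nil {
        return false, err
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return false, err
    }
    defer resp.Body.Close()
    var tags struct {
        Models []struct {
            Name string `json:"name"`
        } `json:"models"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&tags); err != nil {
        return false, err
    }
    for _, m := range tags.Models {
        if strings.HasPrefix(m.Name, model) {
            return true, nil
        }
    }
    return false, nil
}

Calling waitForModel(ctx, "http://llama:11434", "mistral") during backend startup turns "model still downloading" into an explicit wait instead of a failed first request.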
2. Configuring the API Gateway (Nginx)
Nginx acts as the single entry point. We map /api/ai/ specifically to the Ollama service. Crucially, we increase the timeouts; local LLM inference takes significantly longer than standard REST requests. Below is a sample nginx.conf snippet.
upstream ollama {
    server llama:11434;
}

server {
    listen 80;

    location /api/ai/ {
        proxy_pass http://ollama/;       # trailing slash strips /api/ai/ from the upstream request
        proxy_read_timeout 300s;         # prevent timeouts during long inference
        proxy_connect_timeout 300s;
        proxy_send_timeout 300s;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
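Before wiring up the backend, it is worth confirming that the rewrite works end to end. One quick check is to hit Ollama's version endpoint (GET /api/version) through the proxy from the host. The tiny standalone program below is a sketch that assumes the 3001:80 port mapping from the Compose file above.

package main

import (
    "fmt"
    "io"
    "net/http"
)

func main() {
    // Nginx strips the /api/ai/ prefix, so this request reaches Ollama's /api/version.
    resp, err := http.Get("http://localhost:3001/api/ai/api/version")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    body, _ := io.ReadAll(resp.Body)
    fmt.Println(resp.Status, string(body)) // expect something like: 200 OK {"version":"..."}
}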
3. Consuming the Service in Go
With the infrastructure set up, calling the LLM becomes a standard POST request. The backend uses an environment variable OLLAMA_ENDPOINT to point to the Nginx proxy.
import (
    "bytes"
    "context"
    "encoding/json"
    "fmt"
    "net/http"
    "os"
    "time"
)

// callSummarizer sends a non-streaming generate request to Ollama through the Nginx proxy.
func callSummarizer(ctx context.Context, text string) (string, error) {
    endpoint := os.Getenv("OLLAMA_ENDPOINT") // e.g., http://nginx/api/ai
    payload := map[string]interface{}{
        "model":  "mistral",
        "prompt": fmt.Sprintf("Summarize this: %s", text),
        "stream": false, // false => a single JSON response instead of a token stream
    }
    jsonPayload, err := json.Marshal(payload)
    if err != nil {
        return "", fmt.Errorf("marshal payload: %w", err)
    }
    req, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint+"/api/generate", bytes.NewBuffer(jsonPayload))
    if err != nil {
        return "", fmt.Errorf("build request: %w", err)
    }
    req.Header.Set("Content-Type", "application/json")
    client := &http.Client{Timeout: 180 * time.Second}
    resp, err := client.Do(req)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()
    var result struct {
        Response string `json:"response"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
        return "", fmt.Errorf("decode response: %w", err)
    }
    return result.Response, nil
}
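Wiring this into an HTTP handler is then straightforward: decode the caller's text, pass the request's context through, and return the summary as JSON. The /summarize route, request shape, and main function below are illustrative assumptions rather than part of the article's backend; besides the imports already shown, the sketch only needs the standard log package.

// Illustrative wiring: the route and payload shape are assumptions.
func summarizeHandler(w http.ResponseWriter, r *http.Request) {
    var in struct {
        Text string `json:"text"`
    }
    if err := json.NewDecoder(r.Body).Decode(&in); err != nil {
        http.Error(w, "invalid JSON body", http.StatusBadRequest)
        return
    }
    summary, err := callSummarizer(r.Context(), in.Text)
    if err != nil {
        http.Error(w, err.Error(), http.StatusBadGateway)
        return
    }
    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(map[string]string{"summary": summary})
}

func main() {
    http.HandleFunc("/summarize", summarizeHandler)
    log.Fatal(http.ListenAndServe(":8080", nil))
}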
Key Technical Takeaways
- Model Persistence: By mounting ./models, you download the 4.1GB Mistral weights only once.
- Initialization Safety: Using service_completed_successfully for llama-init ensures the inference engine never starts with a missing model.
- Reverse Proxy Benefits: The Go backend doesn’t need to know the Ollama port (11434). It simply talks to nginx/api/ai, allowing you to swap LLM providers or versions without changing application code.
The Importance of Extended Timeouts
In a standard web application, a 5–10 second timeout is usually more than enough. However, when dealing with local Generative AI, you must adjust your expectations. In the configuration above, the Go client timeout is set to 180 seconds and the Nginx proxy timeouts to 300 seconds.
There are three primary reasons for this:
- Resource Contention: Unlike cloud APIs (like OpenAI) that run on clusters of H100 GPUs, a local container often relies on the host’s CPU or a consumer-grade GPU. If the system is under load, the “Time to First Token” (TTFT) can increase significantly.
- Model Loading: If the Ollama service has been idle, it may need to load the Mistral model weights from the disk into memory (RAM/VRAM) before it can begin processing the prompt. For a 4GB+ model, this can take several seconds on slower drives.
- Prompt Complexity: The longer the input text (e.g., a large patient clinical history), the longer the “pre-fill” phase takes. The LLM must process the entire prompt before it can generate the first word of the summary.
Failure to extend these timeouts may result in 504 Gateway Timeouts from Nginx or “Context Deadline Exceeded” errors in your Go backend.
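In Go, whichever limit is shortest wins: the http.Client timeout (180 seconds in callSummarizer), the request context's deadline, and Nginx's proxy_read_timeout all race each other. One way to keep them aligned, and to surface a clearer error when the budget is blown, is sketched below; summarizeWithDeadline is an illustrative helper rather than code from the article, and it uses only packages already imported above.

// summarizeWithDeadline gives the LLM call an explicit budget matching the 300s
// Nginx limit. Note that the 180s client timeout inside callSummarizer remains
// the binding limit unless you raise it to match.
func summarizeWithDeadline(parent context.Context, text string) (string, error) {
    ctx, cancel := context.WithTimeout(parent, 300*time.Second)
    defer cancel()

    summary, err := callSummarizer(ctx, text)
    if err != nil && ctx.Err() == context.DeadlineExceeded {
        return "", fmt.Errorf("LLM did not answer within 300s (cold model load or long pre-fill?): %w", err)
    }
    return summary, err
}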
Conclusion
By combining Docker’s service orchestration, Nginx’s routing and timeout management, and Go’s robust HTTP handling, we can create a resilient local AI environment.
