Scaling SQL Server 2025 Vector Search with Load-Balanced Ollama Embeddings
SQL Server 2025 introduces native support for vector data types and external AI models. This opens up new scenarios for semantic search and AI-driven experiences directly in the database. But as with any external service integration, performance and scalability are immediate concerns, especially when generating embeddings at scale.
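To ground that, here’s a minimal sketch of a vector-enabled table, matching the one used in the timing queries later in this post. The VECTOR dimension is an assumption on my part: it has to match your embedding model’s output, and 768 fits a model like nomic-embed-text.

-- Sketch of a vector-enabled table; match the VECTOR dimension to your model
-- (768 assumes an embedding model like nomic-embed-text)
CREATE TABLE dbo.PostEmbeddings
(
    PostID    INT         NOT NULL PRIMARY KEY,  -- points back to dbo.Posts
    Embedding VECTOR(768) NOT NULL,              -- native vector type in SQL Server 2025
    CreatedAt DATETIME2   NOT NULL
);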
Demo repository: https://github.com/nocentino/ollama-lb-sql
Problem: Bottlenecks in Embedding Generation
When you call out to an external embedding service from T-SQL via REST over HTTPS, you’re limited by the throughput of that backend. If you’re running a single Ollama instance, you’ll quickly hit a ceiling on how fast you can generate embeddings, especially for large datasets. I ran into this firsthand at a recent event. My first attempt at generating embeddings was for a three-million-row table, and I had access to some world-class hardware to do it. When I arrived at the lab and kicked off the embedding generation for this dataset, I quickly realized it would take approximately nine days to complete. Upon closer examination, I found I wasn’t utilizing the GPUs to their full potential; in fact, I was using only about 15% of one GPU’s capacity. So I started to cook up this concept in my head, and here we are: load balancing embedding generation across multiple Ollama instances to more fully utilize the available hardware.
Solution: Load Balancing with Nginx and Docker
To address this, I built a demo that runs multiple Ollama instances behind an Nginx load balancer. SQL Server connects to the load balancer as an external model endpoint, and Nginx distributes embedding requests across all available Ollama instances. The setup uses Docker Compose for repeatability and quick local deployment. Now, this isn’t production-grade code; it’s an educational demo meant to show you how all these pieces work together. There are also some nuances around model sizes and how they fit into the memory available on your GPUs, so as you build something like this for real, you’ll need to tune these scalability parameters for your workload and the models you’re using.
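To make the connection concrete, here’s a hedged sketch of the two external model definitions that the timing queries below rely on: ollama_lb points at the Nginx endpoint, and ollama_single points at one Ollama container directly. The hostnames, ports, and model name are illustrative assumptions, not the demo’s exact values; note that SQL Server calls these endpoints over HTTPS.

-- Hypothetical endpoints: Nginx load balancer vs. a single Ollama container
CREATE EXTERNAL MODEL ollama_lb
WITH (
    LOCATION   = 'https://localhost:11443/api/embed',  -- Nginx fronting all Ollama instances
    API_FORMAT = 'Ollama',
    MODEL_TYPE = EMBEDDINGS,
    MODEL      = 'nomic-embed-text'
);

CREATE EXTERNAL MODEL ollama_single
WITH (
    LOCATION   = 'https://localhost:11434/api/embed',  -- one Ollama instance
    API_FORMAT = 'Ollama',
    MODEL_TYPE = EMBEDDINGS,
    MODEL      = 'nomic-embed-text'
);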
Key points:
- Vector-enabled tables and external models in SQL Server 2025
- Nginx load balancer in front of multiple Ollama containers
- Automated setup scripts
- Performance comparison: load-balanced vs. single backend
Results
With the load balancer in place, generating 1,000 embeddings took about 5.4 seconds. Against a single backend, the same workload took just over 30 seconds. That’s nearly a 6x improvement just by scaling out the embedding service.
T-SQL for comparison:
Below is the code that generates embeddings for 1,000 rows using the load-balanced endpoint. It adds the query hint OPTION (USE HINT('ENABLE_PARALLEL_PLAN_PREFERENCE')) to push the optimizer toward a parallel plan, which fans the embedding calls out across multiple threads.
-- Load-balanced endpoint
INSERT INTO dbo.PostEmbeddings (PostID, Embedding, CreatedAt)
SELECT TOP 1000
    p.Id,
    AI_GENERATE_EMBEDDINGS(p.Title USE MODEL ollama_lb),
    GETDATE()
FROM dbo.Posts p
WHERE p.Title IS NOT NULL
  AND NOT EXISTS (SELECT 1 FROM dbo.PostEmbeddings pe WHERE pe.PostID = p.Id)
OPTION (USE HINT('ENABLE_PARALLEL_PLAN_PREFERENCE'));
Below is the code that generates embeddings for the same 1,000 rows using the single backend endpoint. It uses OPTION (MAXDOP 1) to force a serial plan.
-- Single backend endpoint
INSERT INTO dbo.PostEmbeddings (PostID, Embedding, CreatedAt)
SELECT TOP 1000
    p.Id,
    AI_GENERATE_EMBEDDINGS(p.Title USE MODEL ollama_single),
    GETDATE()
FROM dbo.Posts p
WHERE p.Title IS NOT NULL
  AND NOT EXISTS (SELECT 1 FROM dbo.PostEmbeddings pe WHERE pe.PostID = p.Id)
OPTION (MAXDOP 1);
Execution times:
- Load-balanced: ~5.4 seconds
- Single backend: ~30.4 seconds
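Once the embeddings are in place, the payoff is the search itself. Here’s a minimal sketch of a semantic search query using the VECTOR_DISTANCE built-in; the search phrase and TOP count are purely illustrative.

-- Embed the search phrase, then rank posts by cosine distance
DECLARE @query VECTOR(768);
SELECT @query = AI_GENERATE_EMBEDDINGS(N'tuning backup performance' USE MODEL ollama_lb);

SELECT TOP 10
    p.Title,
    VECTOR_DISTANCE('cosine', pe.Embedding, @query) AS Distance
FROM dbo.PostEmbeddings pe
JOIN dbo.Posts p ON p.Id = pe.PostID
ORDER BY Distance;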
Takeaway
If you’re using SQL Server 2025’s vector features and need to scale embedding generation, load balancing your AI backend is straightforward and delivers immediate performance gains. The full demo is on GitHub; spin it up, run the scripts, and see the results for yourself: https://github.com/nocentino/ollama-lb-sql