Scaling SQL Server 2025 Vector Search with Load-Balanced Ollama Embeddings
SQL Server 2025 introduces native support for vector data types and external AI models. This opens up new scenarios for semantic search and AI-driven experiences directly in the database. But as with any external service integration, performance and scalability are immediate concerns, especially when generating embeddings at scale.
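To ground that, here’s a minimal sketch of a vector-enabled table, matching the one used in the timing queries later in this post. The VECTOR dimension is an assumption on my part: it has to match your embedding model’s output, and 768 fits a model like nomic-embed-text.

-- Sketch of a vector-enabled table; match the VECTOR dimension to your model
-- (768 assumes an embedding model like nomic-embed-text)
CREATE TABLE dbo.PostEmbeddings
(
    PostID    INT         NOT NULL PRIMARY KEY,  -- points back to dbo.Posts
    Embedding VECTOR(768) NOT NULL,              -- native vector type in SQL Server 2025
    CreatedAt DATETIME2   NOT NULL
);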
Demo repository: https://github.com/nocentino/ollama-lb-sql
Problem: Bottlenecks in Embedding Generation
When you call out to an external embedding service from T-SQL via REST over HTTPS, you’re limited by the throughput of that backend. If you’re running a single Ollama instance, you’ll quickly hit a ceiling on how fast you can generate embeddings, especially for large datasets. I ran into this firsthand at a recent event. My first attempt at generating embeddings was for a three-million-row table, and I had access to some world-class hardware to do it. When I arrived at the lab and kicked off the embedding generation for this dataset, I quickly realized it would take approximately nine days to complete. Upon closer examination, I found I wasn’t utilizing the GPUs to their full potential; in fact, I was using only about 15% of one GPU’s capacity. So I started to cook up this concept in my head, and here we are: load balancing embedding generation across multiple Ollama instances to more fully utilize the available hardware.
Solution: Load Balancing with Nginx and Docker
To address this, I built a demo that runs multiple Ollama instances behind an Nginx load balancer. SQL Server connects to the load balancer as an external model endpoint, and Nginx distributes embedding requests across all available Ollama instances. The setup uses Docker Compose for repeatability and quick local deployment. Now, this isn’t production-grade code; it’s an educational demo meant to show you how all these pieces work together. There are also some nuances around model sizes and how they fit into the memory available on your GPUs, so as you build something like this for real, you’ll need to tune these scalability parameters for your workload and the models you’re using.
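To make the connection concrete, here’s a hedged sketch of the two external model definitions that the timing queries below rely on: ollama_lb points at the Nginx endpoint, and ollama_single points at one Ollama container directly. The hostnames, ports, and model name are illustrative assumptions, not the demo’s exact values; note that SQL Server calls these endpoints over HTTPS.

-- Hypothetical endpoints: Nginx load balancer vs. a single Ollama container
CREATE EXTERNAL MODEL ollama_lb
WITH (
    LOCATION   = 'https://localhost:11443/api/embed',  -- Nginx fronting all Ollama instances
    API_FORMAT = 'Ollama',
    MODEL_TYPE = EMBEDDINGS,
    MODEL      = 'nomic-embed-text'
);

CREATE EXTERNAL MODEL ollama_single
WITH (
    LOCATION   = 'https://localhost:11434/api/embed',  -- one Ollama instance
    API_FORMAT = 'Ollama',
    MODEL_TYPE = EMBEDDINGS,
    MODEL      = 'nomic-embed-text'
);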
Key points:
- Vector-enabled tables and external models in SQL Server 2025
- Nginx load balancer in front of multiple Ollama containers
- Automated setup scripts
- Performance comparison: load-balanced vs. single backend
Results
With the load balancer in place, generating 1,000 embeddings took about 5.4 seconds. Against a single backend, the same workload took just over 30 seconds. That’s nearly a 6x improvement just by scaling out the embedding service.
T-SQL for comparison:
Below is the code that generates embeddings for 1,000 rows using the load-balanced endpoint. It adds the query hint OPTION (USE HINT('ENABLE_PARALLEL_PLAN_PREFERENCE')) to push the optimizer toward a parallel plan, which fans the embedding calls out across multiple threads.
-- Load-balanced endpoint
INSERT INTO dbo.PostEmbeddings (PostID, Embedding, CreatedAt)
SELECT TOP 1000
    p.Id,
    AI_GENERATE_EMBEDDINGS(p.Title USE MODEL ollama_lb),
    GETDATE()
FROM dbo.Posts p
WHERE p.Title IS NOT NULL
  AND NOT EXISTS (SELECT 1 FROM dbo.PostEmbeddings pe WHERE pe.PostID = p.Id)
OPTION (USE HINT('ENABLE_PARALLEL_PLAN_PREFERENCE'));
Below is the code that generates embeddings for the same 1,000 rows using the single backend endpoint. It uses OPTION (MAXDOP 1) to force a serial plan.
-- Single backend endpoint
INSERT INTO dbo.PostEmbeddings (PostID, Embedding, CreatedAt)
SELECT TOP 1000
    p.Id,
    AI_GENERATE_EMBEDDINGS(p.Title USE MODEL ollama_single),
    GETDATE()
FROM dbo.Posts p
WHERE p.Title IS NOT NULL
  AND NOT EXISTS (SELECT 1 FROM dbo.PostEmbeddings pe WHERE pe.PostID = p.Id)
OPTION (MAXDOP 1);
Execution times:
- Load-balanced: ~5.4 seconds
- Single backend: ~30.4 seconds
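Once the embeddings are in place, the payoff is the search itself. Here’s a minimal sketch of a semantic search query using the VECTOR_DISTANCE built-in; the search phrase and TOP count are purely illustrative.

-- Embed the search phrase, then rank posts by cosine distance
DECLARE @query VECTOR(768);
SELECT @query = AI_GENERATE_EMBEDDINGS(N'tuning backup performance' USE MODEL ollama_lb);

SELECT TOP 10
    p.Title,
    VECTOR_DISTANCE('cosine', pe.Embedding, @query) AS Distance
FROM dbo.PostEmbeddings pe
JOIN dbo.Posts p ON p.Id = pe.PostID
ORDER BY Distance;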
Takeaway
If you’re using SQL Server 2025’s vector features and need to scale embedding generation, load balancing your AI backend is straightforward and delivers immediate performance gains. The full demo is on GitHub; spin it up, run the scripts, and see the results for yourself: https://github.com/nocentino/ollama-lb-sql