Microsoft Azure SQL and Entity Framework Core Performance Issues and Solution
Many IdentityServer users are upgrading their deployments to the latest version of .NET and Duende IdentityServer. Most upgrades are uneventful and fair, with many users pleasantly surprised by a smoother-than-expected process. Successful deployments are excellent news! We love to hear that our care and attention to building a secure and stable product have led to a great developer experience.
Developers choose Duende IdentityServer over other commercial offerings because of its flexibility and almost infinite configurability. With great power comes great responsibility, and as developers, we all need to understand the consequences of our decisions. While these options are good in most cases, they can make performance investigations bespoke and time-consuming.
During the upgrade process, a handful of users reported a particular kind of performance degradation, which led our highly skilled and experienced customer success team to investigate.
The investigation has taken a few weeks of back-and-forth with customers, jumping on calls, and deep-diving into the Duende IdentityServer code and the .NET codebases. In this post, we’ll describe what our customers saw, the cause of the instability, and the solution.
The Symptoms of the Performance Degradation
To understand the performance degradation, we must first describe the context in which many of these users found themselves when performing the upgrade. These users have a typical profile that found themselves in:
- Upgrading from an older version, e.g., IdentityServer4, to Duende IdentityServer 7+
- Upgrading from .NET 6 to .NET 8+
- Upgrading NuGet packages, including Entity Framework Core
- The use of Microsoft Azure SQL or SQL Server
Once upgrading their codebases and deployments, users found several concerning problems in their production environments.
- CPU usage was sometimes spiking and doubling, leading to increased resource utilization.
- Database connection timeouts or slow query response times.
- Cache lock timeouts leading to IdentityServer exceptions.
- Overall system performance degradation with request/response times in the seconds.
As you can see, running a system under this performance profile is not what anyone wants, especially users who rely on a mission-critical system like Duende IdentityServer. Also, given the breadth of changes between .NET 6 and .NET 8, we had an expansive area to cover to solve this mysterious problem.
Thanks to two customers who jumped on a call with us to help diagnose the root cause, we noticed a couple of things:
- CPU usage was higher than previously, and as a result, the HTTP response time was up
- Application Insights traces showed a lot of time spent connecting to and querying the database
- Database engine metrics did not show a meaningful bump in CPU usage or query response times
Based on our customer success team’s experience, we began to suspect issues with the connection between the database and the application. So, we started scouring the Internet for other developers experiencing a similar problem.
Entity Framework Core, Microsoft.Data.SqlClient, and Transitive Packages
As many Duende IdentityServer users may know, we offer a wide array of templates and starter implementations to get your instance of Duende IdentityServer up and running. The options we provide have helped thousands of developers implement, modify, and deploy security solutions to their organizations.
One of those essential implementations is a set of Entity Framework Core implementations powering our operational and configuration stores. The data access implementation allows developers to store and share system information across multiple instances. Database storage is highly convenient for teams operating at an enterprise scale but does require your instance to have a healthy connection to a database.
As we mentioned, most users upgrading their codebases have successfully deployed Duende IdentityServer instances, but only a handful have seen these performance degradations. But why?
We stumbled upon this issue in the .NET GitHub repository, and it all started to make sense.
The issue states that there are bands of Entity Framework Core that suffer from a severe and crippling issue due to the dependency on Microsoft.Data.SqlServer
. Thanks to developer Evgeny Muryshkin, we now have a reproduction of what our users were reporting.
Let’s look at the test we’ll run that recreates the issue folks are seeing in their Duende IdentityServer deployments using Entity Framework Core.
using Microsoft.EntityFrameworkCore;
using test.sql;
[TestClass]
public class MigrationTests
{
static tstTestDBContext CreateTestDBContext(string[] args = null) =>
new TestDbContextDesignFactory()
.CreateDbContext(Guid.Empty, args);
[TestMethod]
public async Task MigrateAsync()
{
using (var ctx = CreateTestDBContext())
{
await ctx.Database.EnsureDeletedAsync();
await ctx.Database.MigrateAsync();
}
}
}
Evgeny created a test project for multiple Entity Framework Core package versions. After running the reproduction, we found several versions of Entity Framework Core affected by this issue. Notice the increase from 1 sec to 11 seconds in execution time.
When profiling these tests, we found a peculiar method invocation originating from Microsoft.Data.SqlClient
causing these long execution run times. Can you spot the issue?
There’s a call to Thread.Sleep
for over 10 seconds originating from within the SqlInternalConnectionTds
class.
Let’s look at the code itself to see what’s happening.
_timeout = TimeoutTimer.StartSecondsTimeout(connectionOptions.ConnectTimeout);
// If transient fault handling is enabled then we can retry the login up to the ConnectRetryCount.
int connectionEstablishCount = applyTransientFaultHandling ? connectionOptions.ConnectRetryCount + 1 : 1;
int transientRetryIntervalInMilliSeconds = connectionOptions.ConnectRetryInterval * 1000; // Max value of transientRetryInterval is 60*1000 ms. The max value allowed for ConnectRetryInterval is 60
for (int i = 0; i < connectionEstablishCount; i++)
{
try
{
OpenLoginEnlist(_timeout, connectionOptions, credential, newPassword, newSecurePassword, redirectedUserInstance);
break;
}
catch (SqlException sqlex)
{
if (i + 1 == connectionEstablishCount
|| !applyTransientFaultHandling
|| _timeout.IsExpired
|| _timeout.MillisecondsRemaining < transientRetryIntervalInMilliSeconds
|| !IsTransientError(sqlex))
{
throw;
}
else
{
Thread.Sleep(transientRetryIntervalInMilliSeconds);
}
}
}
If the error is “transient”, then the code invokes Thread.Sleep
, which, when debugging, immediately jumps up to 10 seconds.
As anyone who works with a cloud hosting provider knows, you should build your application to handle errors and be resilient to transient cloud issues, but 10 seconds is an eternity in computer time. During this wait time, your server cannot process any other requests, and in high-load situations, you can quickly find yourself out of threads. What makes this worse is that this transient error can occur on every attempt to open a new SQL connection to the database, adding more pressure to a system already taxed by exceptions and diminishing threads.
So why aren’t all of our users seeing this issue? Well, the fix is available but only in specific versions of Microsoft.Data.SqlClient
. Since the Microsoft.Data.SqlClient
is a transitive dependency of Microsoft.EntityFrameworkCore.SqlServer
, users can find themselves in a game of chance depending on which version of EF Core 8 they choose.
Here’s a breakdown of the Entity Framework Core versions affected based on their transitive dependency on Microsoft.Data.SqlClient
. Note this is not comprehensive, but we’ll give you an easy fix after this breakdown.
Microsoft.EntityFrameworkCore.SqlServer | Microsoft.Data.SqlClient | Status |
---|---|---|
8.0.0 | >=5.1.1 | Good |
8.0.3 | >=5.1.5 | Affected |
8.0.4 | >=5.1.5 | Affected |
8.0.6 | >=5.1.5 | Affected |
8.0.11 | >=5.1.6 | Good |
9.0.1 | >=5.1.6 | Good |
NuGet’s default behavior is to install the lowest applicable version of each transitive dependency. The behavior can mean folks upgrading their Entity Framework Core dependencies, including the ones used by Duende’s implementations, might be affected by which Entity Framework Core version they choose when upgrading.
The “Easy” Fix to Microsoft.Data.SqlClient
For folks now wondering what next steps they should take, it’s easy. We recommend an explicitly dependency on Microsoft.Data.SqlClient
at the 5.2.2
version.
<PackageReference Include="Microsoft.Data.SqlClient" Version="5.2.2" />
Do not upgrade to version 6.0.0
or higher, as this performance issue is also present there. The problem persists in higher package versions because Microsoft forks its packages based on .NET release cycles, and the .NET developers forked before they discovered this issue. The fix has not been ported to that version and includes the same problematic resiliency code from above.
We hope this helps resolve any degradation issues you’re currently experiencing in your Microsoft Azure deployments of Duende IdentityServer. If you were affected by this particular issue, please let us know in the comments.
Conclusion
At Duende, we are a developer-centric company; when you do better, we do better. While these investigations are outside our product support agreements, we take out-of-scope reports seriously when several users report similar issues, even when it looks like the Duende code does not directly cause the problem. That said, when using an SDK like Duende IdentityServer, there are decisions we leave up to our developer customers, and sometimes those decisions can cause unintended consequences.
Thankfully, our investigations led us to a resolution, and we thought we’d share our findings with anyone else who may be trying to diagnose and fix this issue.
We want to thank our customers for informing us about this particular issue, the Duende customer success engineers for going above and beyond, and the .NET community for communicating and reproducing the issue in GitHub.