Channel: SQL Journey

SQL 2008/2008 R2/2012 setup disappears/fails when installing Setup Support files


I’m sure many of you have seen this issue when running SQL 2008/2008 R2/2012 setup on a new server: setup proceeds to install the Setup Support Files, the window disappears, but, strangely enough, the next window never shows up.

Here’s what you need to do:

  1. Click on start->run and type %temp% and press enter (basically, go to the temp folder)
  2. Here, look for SQLSetup.log and SQLSetup_1.log. Open the SQLSetup_1.log file. In there, check for the following messages:
    04/16/2012 17:16:47.950 Error: Failed to launch process
    04/16/2012 17:16:47.952 Error: Failed to launch local setup100.exe: 0x80070003

Typically, you get this error only in SQL 2008, SQL 2008 R2 and SQL 2012. The steps are slightly different for all 3, and I’ve tried to outline them here:

SQL Server 2008

1. Save the following in a .reg file and merge to populate the registry:

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\100\Bootstrap]
"BootstrapDir"="C:\\Program Files\\Microsoft SQL Server\\100\\Setup Bootstrap\\"

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\100\Bootstrap\Setup]
"PatchLevel"="10.0.1600.22"

2. Next, copy the following files and folders from the media to the specified destinations:

    File/Folder in media -> Destination

    • X64/X86 folder (depending on what architecture you want to install) -> C:\Program Files\Microsoft SQL Server\100\Setup Bootstrap\Release
    • Setup.exe -> C:\Program Files\Microsoft SQL Server\100\Setup Bootstrap\Release
    • Setup.rll -> C:\Program Files\Microsoft SQL Server\100\Setup Bootstrap\Release\Resources\1033\

    SQL Server 2008 R2

    1. Save the following in a .reg file and merge to populate the registry:

    Windows Registry Editor Version 5.00

    [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\100\Bootstrap]
    "BootstrapDir"="C:\\Program Files\\Microsoft SQL Server\\100\\Setup Bootstrap\\"

    [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\100\Bootstrap\Setup]
    "PatchLevel"="10.50.1600.00"

    2. Next, copy the following files and folders from the media to the specified destinations:

    File/Folder in media -> Destination

    • X64/X86 folder (depending on what architecture you want to install) -> C:\Program Files\Microsoft SQL Server\100\Setup Bootstrap\SQLServer2008R2
    • Setup.exe -> C:\Program Files\Microsoft SQL Server\100\Setup Bootstrap\SQLServer2008R2
    • Resources folder -> C:\Program Files\Microsoft SQL Server\100\Setup Bootstrap\SQLServer2008R2

    Next, re-run the setup, and it should proceed beyond the point of error this time.

    SQL Server 2012

    1. Save the following in a .reg file and merge to populate the registry:

    Windows Registry Editor Version 5.00

    [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\110\Bootstrap]
    "BootstrapDir"="C:\\Program Files\\Microsoft SQL Server\\110\\Setup Bootstrap\\"

    [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\110\Bootstrap\Setup]
    "PatchLevel"="11.00.2100.60"

    2. Next, copy the following files and folders from the media to the specified destinations:

    File/Folder in media -> Destination

    • X64/X86 folder (depending on what architecture you want to install) -> C:\Program Files\Microsoft SQL Server\110\Setup Bootstrap\SQLServer2012
    • Setup.exe -> C:\Program Files\Microsoft SQL Server\110\Setup Bootstrap\SQLServer2012
    • Resources folder -> C:\Program Files\Microsoft SQL Server\110\Setup Bootstrap\SQLServer2012

    Next, re-run the setup, and it should proceed beyond the point of error this time.

    As always, comments/suggestions/feedback are welcome and solicited.


    SQL Server Cluster Failover Root Cause Analysis–the what, where and how


    I know many of you get into situations where SQL Server fails over from one node of a cluster to the other, and you’re hard-pressed to find out why. In this post, I shall seek to answer quite a few questions about how to go about conducting a post-mortem analysis of a SQL Server cluster failover, aka Cluster Failover RCA.

    First up, since this is a post mortem analysis, we need all the logs we can get. Start by collecting the following:

    • SQL Server Errorlogs
    • The “Application” and “System” event logs, saved in txt or csv format (eases analysis)
    • The cluster log (see here and here for details on how to enable/collect cluster logs for Windows 2003 and 2008 respectively)

    Now that we have all the logs in place, then comes the analysis part. I’ve tried to list down the steps and most common scenarios here:

    1. Start with the SQL Errorlog. The Errorlog files in the SQL Server log folder can be viewed using notepad, textpad or any other text editor. The current file will be named Errorlog, the one last used Errorlog.1, and so on.  See if the SQL Server was shut down normally. For example, the following stack denotes a normal shutdown for SQL:

      2012-09-04 00:32:54.32 spid14s     Service Broker manager has shut down.
      2012-09-04 00:33:02.48 spid6s      SQL Server is terminating in response to a 'stop' request from Service Control Manager. This is an informational message only. No user action is required.
      2012-09-04 00:33:02.50 spid6s      SQL Trace was stopped due to server shutdown. Trace ID = '1'. This is an informational message only; no user action is required.
    2. You might see a lot of situations where SQL Server failed over due to a system shutdown, i.e. the node itself rebooted. In that case, the stack at the bottom of the SQL Errorlog will look something like this:

       2012-07-13 06:39:45.22 Server      SQL Server is terminating because of a system shutdown. This is an informational message only. No user action is required.
       2012-07-13 06:39:48.04 spid14s     The Database Mirroring protocol transport has stopped listening for connections.
       2012-07-13 06:39:48.43 spid14s     Service Broker manager has shut down.
       2012-07-13 06:39:55.39 spid7s      SQL Trace was stopped due to server shutdown. Trace ID = '1'. This is an informational message only; no user action is required.
       2012-07-13 06:39:55.43 Server      The SQL Server Network Interface library could not deregister the Service Principal Name (SPN) for the SQL Server service. Error: 0x6d3, state: 4. Administrator should deregister this SPN manually to avoid client authentication errors.

       You can also use the systeminfo command from a command prompt to check when the node was last rebooted (look for “System Boot Time”), and see if this matches the time of the failover. If so, then you need to investigate why the node rebooted, because SQL was just a victim in this case.

    3. Next come the event logs. Look for peculiar signs in the application and system event logs that could have caused the failover. For example, one strange scenario that I came across was when the disks hosting tempdb became inaccessible for some reason. In that case, I saw the following in the event logs:

       Information 7/29/2012 12:44:07 AM MSSQLSERVER 680 Server Error [8, 23, 2] occurred while attempting to drop allocation unit ID 423137010909184 belonging to worktable with partition ID 423137010909184.

       Error 7/29/2012 12:44:07 AM MSSQLSERVER 823 Server The operating system returned error 2(The system cannot find the file specified.) to SQL Server during a read at offset 0x000001b6d70000 in file 'H:\MSSQL\Data\tempdata4.ndf'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

       And then some time later, we see SQL shutting down in reaction to this:

       Error 7/29/2012 12:44:17 AM MSSQLSERVER 3449 Server SQL Server must shut down in order to recover a database (database ID 2). The database is either a user database that could not be shut down or a system database. Restart SQL Server. If the database fails to recover after another startup, repair or restore the database.

       Error 7/29/2012 12:44:17 AM MSSQLSERVER 3314 Server During undoing of a logged operation in database 'tempdb', an error occurred at log record ID (12411:7236:933). Typically, the specific failure is logged previously as an error in the Windows Event Log service. Restore the database or file from a backup, or repair the database.

       Error 7/29/2012 12:44:17 AM MSSQLSERVER 9001 Server The log for database 'tempdb' is not available. Check the event log for related error messages. Resolve any errors and restart the database.

       Another error that clearly points toward the disks being a culprit is this:

       Error 7/29/2012 12:44:15 AM MSSQLSERVER 823 Server The operating system returned error 21(The device is not ready.) to SQL Server during a read at offset 0x00000000196000 in file 'S:\MSSQL\Data\tempdb.mdf'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

       The next logical step, of course, would be to check why the disks became unavailable/inaccessible. I would strongly recommend having your disks checked for consistency, speed and stability by your vendor.

    4. If you don’t have any clue from these past steps, try taking a look at the cluster log as well. Please do note that the Windows cluster logs are always recorded in the GMT/UTC time zone, so you’ll need to make the necessary calculations to determine what time to focus on in the cluster log. See if you can find anything which could have caused the cluster group to fail, such as the network being unavailable, failure of the IP/Network name, etc.
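Since the cluster log is in UTC while the SQL Errorlog timestamps are in local server time, you need to shift the failover time by your server's UTC offset before searching the cluster log. A small illustrative sketch (the UTC+5:30 offset below is just an example; substitute your server's offset):

```python
from datetime import datetime, timedelta

def to_utc(local_time_str, utc_offset_hours):
    """Convert a local timestamp (as logged in the SQL Errorlog)
    to UTC, for locating the same moment in the cluster log."""
    local = datetime.strptime(local_time_str, "%Y-%m-%d %H:%M:%S")
    return local - timedelta(hours=utc_offset_hours)

# Example: shutdown logged at 06:39:45 local time on a server at UTC+5:30 (assumed offset)
print(to_utc("2012-07-13 06:39:45", 5.5))  # look near 2012-07-13 01:09:45 in the cluster log
```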

    There is no exhaustive guide to finding the root cause for a Cluster Failover, mainly because it is an approach thing. I do, however, want to talk about a few cluster concepts here, which might help you understand the messages from the various logs better.

    checkQueryProcessorAlive: Also known as the isAlive check in SQL Server, this executes “SELECT @@SERVERNAME” against the SQL Server instance. It waits 60 seconds before running the query again, but checks every 5 seconds whether the service is alive by calling sqsrvresCheckServiceAlive. Both these values (60 seconds and 5 seconds) are the defaults, and can be changed from the properties of the SQL Server resource in Failover Cluster Manager/Cluster Administrator. I understand that for SQL 2012, we’ve included some more comprehensive checks, like running sp_server_diagnostics as part of this check, to ensure that SQL is in good health.

    sqsrvresCheckServiceAlive: Also known as the looksAlive check in SQL Server, this checks the status of the SQL Service and returns “Service is dead” if the status is not one of the following:

    • SERVICE_RUNNING
    • SERVICE_START_PENDING
    • SERVICE_PAUSED
    • SERVICE_PAUSE_PENDING

    So if you see messages related to one of these checks failing in either the event logs or the cluster logs, you know that SQL Server was not exactly “available” at that time, which caused the failover. The next step, of course, would be to investigate why SQL Server was not available at that time. It can be due to a resource bottleneck such as high CPU or memory consumption, SQL Server being hung/stalled, etc.
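To make the cadence of the two checks concrete, here is a toy simulation over a given window, using the default intervals described above (5-second looksAlive, 60-second isAlive). The counting logic is purely illustrative; it is not the actual cluster resource DLL behavior:

```python
LOOKS_ALIVE_INTERVAL = 5   # seconds: sqsrvresCheckServiceAlive (lightweight service-status check)
IS_ALIVE_INTERVAL = 60     # seconds: checkQueryProcessorAlive (runs SELECT @@SERVERNAME)

def checks_in_window(seconds):
    """Count how many of each check would fire in a window (simulated, no sleeping)."""
    looks = sum(1 for t in range(1, seconds + 1) if t % LOOKS_ALIVE_INTERVAL == 0)
    is_alive = sum(1 for t in range(1, seconds + 1) if t % IS_ALIVE_INTERVAL == 0)
    return looks, is_alive

# In a 60-second window: 12 lightweight status checks, 1 full query check
print(checks_in_window(60))  # (12, 1)
```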

    The base idea here, as with any post-mortem analysis, is to construct a logical series of events leading up to the failover, based on the data. If we can do that, then we have at least a clear indication on what caused the failover, and more importantly, how to avoid such a situation in the future.

    If you’re still unable to determine anything about the cause of the failover, I would strongly recommend contacting Microsoft CSS to review the data once and see if they’re able to spot anything.

    Hope this helps. As always, comments, feedback and suggestions are welcome.

    Why the registry size can cause problems with your SQL 2012 AlwaysOn/Failover Cluster setup


    I recently worked on a very interesting issue, where one of the cluster nodes in an AlwaysOn environment became unstable, and the administrators ended up evicting the node from the Windows cluster as an emergency measure. Ideally, since the primary node/replica was no longer available, the Availability Group should have come up on the secondary replica, but it didn’t in this case. The AG was showing online in the Failover Cluster Manager, but in SQL Server Management studio, the database in the AG was in “Not Synchronizing\Recovery Pending” state.

    We checked the errorlogs (on the secondary), and found these messages:

    2012-09-05 04:01:32.300 spid18s      AlwaysOn Availability Groups: Waiting for local Windows Server Failover Clustering service to start. This is an informational message only. No user action is required.
    2012-09-05 04:01:32.310 spid21s      Error: 35262, Severity: 17, State: 1.
    2012-09-05 04:01:32.310 spid21s      Skipping the default startup of database 'Test' because the database belongs to an availability group (Group ID:  65537). The database will be started by the availability group. This is an informational message only. No user action is required.
    ……..

    2012-09-05 04:01:32.430 spid18s      AlwaysOn: The local replica of availability group 'PST TEST' is starting. This is an informational message only. No user action is required.
    …….      
    2012-09-05 04:01:32.470 spid18s      The state of the local availability replica in availability group 'PST TEST' has changed from 'NOT_AVAILABLE' to 'RESOLVING_NORMAL'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error.
    …….

    2012-09-05 04:01:32.880 spid52       AlwaysOn: The local replica of availability group 'PST TEST' is preparing to transition to the primary role in response to a request from the Windows Server Failover Clustering (WSFC) cluster. This is an informational message only. No user action is require
    2012-09-05 04:01:32.980 spid52       The state of the local availability replica in availability group 'PST TEST' has changed from 'RESOLVING_NORMAL' to 'PRIMARY_PENDING'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. 
    2012-09-05 04:01:33.090 Server       Error: 41015, Severity: 16, State: 1.
    2012-09-05 04:01:33.090 Server       Failed to obtain the Windows Server Failover Clustering (WSFC) node handle (Error code 5042).  The WSFC service may not be running or may not be accessible in its current state, or the specified cluster node name is invalid.

    Since there were clear errors related to the Windows Server Failover Cluster (WSFC), we checked and ensured that the windows cluster was stable. It was, and the cluster validation came back clean.

    We tried bringing the database online using "Restore database lab with recovery", but it failed saying the database is part of an availability group. We then tried removing it from the Availability Group, but it failed with error 41190, stating that the database is not in a state that it can be removed from the Availability Group. The only option we had at this point was to delete the AG. We tried doing so, but that too returned with an error:

    Msg 41172, Level 16, State 0, Line 3
    An error occurred while dropping availability group 'PST TEST' from Windows Server Failover Clustering (WSFC) cluster and from the local metadata. The operation encountered SQL OS error 41036, and has been terminated. Verify that the specified availability group name is correct, and then retry the command.

    However, the AG was no longer visible in SQL Server Management Studio or Failover Cluster Manager. I was still skeptical, since the error had clearly complained about the metadata cleanup. Sure enough, when we tried creating a new AG with the name PST TEST, it errored out, stating that the AG was still present. So we ended up creating an AG with a different name and adding the Test database to it.

    Root Cause Analysis

    So much for getting the environment back up, but what about the root cause? I mean, how can we ensure that such an issue never happens again? I checked with some friends in the Product Group, and according to them, deleting an AG should “always” work. So why didn’t it work in this case?

    The answer lies in the size of the registry on the servers. As many of you might know, the limit for registry size is still 2 GB. This is also documented in the msdn article here. The proper way to investigate would be to follow these steps:

    1. Check the Paged pool usage from perfmon by checking the Memory->Pool Paged Bytes counter
    2. If you see high memory usage there (close to the 2 GB limit), then we need to find who’s using the pool. There’s a very useful article on this:
      http://msdn.microsoft.com/en-us/library/windows/hardware/gg463213.aspx
    3. Using one of the methods described in the article, we can easily identify which process is using the paged pool. One other way is to use the Process->Pool paged bytes Perfmon counter.
    4. In our case, we identified CM31 as the tag using about 1.97 GB from the paged pool. Looking up the tag list available here, we can see that the CM series corresponds to “Configuration Manager (registry)”.
      So it’s clear that registry is using a large chunk of the paged pool, and once this usage hits 2 GB, users will not be able to login to the system, and as a result, everything else, including the cluster service and the AG, will fail. This issue can happen either due to large registry hives or some process loading keys multiple times.
    5. Next, check the sizes of the files in the Windows/system32/config folder. If these are large (>1 GB), then that will be the cause of the issue. Also, check the sizes of the NTUser.dat files in C:\Users. There will be one for each user, so searching for them in c:\users is the simplest way.
    6. In our case, we could clearly see that the SOFTWARE hive was by far the largest, and very close to the limit:
      [Screenshot: file sizes in the Windows\System32\config folder, with the SOFTWARE hive close to the 2 GB limit]
    7. The next step is to figure out which process/hive is responsible for the huge size of the Software branch. In our case we found that it was a known issue with the Cluster service, outlined in this KB:
      http://support.microsoft.com/kb/2616514
    8. Another known issue with SQL 2012 SP1:
      https://connect.microsoft.com/SQLServer/feedback/details/770630/msiexec-exe-processes-keep-running-after-installation-of-sql-server-2012-sp1
    9. Yet another known issue that can cause similar issues in Windows 2000 and 2003:
      http://support.microsoft.com/kb/906952
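    Step 5 above can be scripted: walk the hive files and flag anything approaching the limit. A minimal sketch (the 1 GB warning threshold comes from the text; the directory is a parameter so you can point it at %windir%\System32\config or at C:\Users for the NTUser.dat files):

```python
import os

GB = 1024 ** 3

def oversized_hives(config_dir, threshold=1 * GB):
    """Return (filename, size_in_bytes) for hive files larger than the threshold,
    biggest first."""
    flagged = []
    for name in os.listdir(config_dir):
        path = os.path.join(config_dir, name)
        if os.path.isfile(path) and os.path.getsize(path) > threshold:
            flagged.append((name, os.path.getsize(path)))
    return sorted(flagged, key=lambda item: item[1], reverse=True)

# e.g. oversized_hives(r"C:\Windows\System32\config")
```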

    The best remedial measure is to compress the “Bloated” registry hives, using the steps outlined in this KB:
    http://support.microsoft.com/kb/2498915

    There can, of course, be other processes bloating the Software hive, and the only way to find out is to take a backup of the registry hive and try to find which hives/keys are the largest. Once we have identified the keys, we can trace them back to the process which is responsible.

    Update: A fix for the issue SQL Server issue (the msiexec.exe keeps running after SP1 installation) is available at:
    http://www.microsoft.com/en-us/download/details.aspx?id=36215

    The fix is also included in the Cumulative Update 2 (CU2) for SQL 2012 SP1, available at:
    http://support.microsoft.com/kb/2790947

    Hope this helps. Any feedback/suggestions are welcome.
      

    How To : SQL 2012 Filetable Setup and Usage


    One of the cool things about my job is that I get to work on the latest technologies earlier than most people. I recently stumbled upon an issue related to Filetables, a new feature in SQL Server 2012.

    To start with, a Filetable brings you the ability to view files and documents in SQL Server, and allows you to use SQL Server specific features such as Full-Text Search and semantic search on them. At the same time, it also allows you to access those files and documents directly, through windows explorer or Windows Filesystem API calls.

    Setting up Filetables

    Here are some basic steps for setting up Filetables in SQL Server 2012:

    1. Enable Filestream for the instance in question from SQL Server Configuration Manager (right-click on the SQL Server service -> Properties -> Filestream -> Enable Filestream for Transact-SQL access). Also make sure you provide a Windows share name. Restart SQL after making this change.
    2. Create a database in SQL exclusively for Filetables (preferable to using an existing database), and specify the WITH FILESTREAM option. Here’s an example:

       CREATE DATABASE FileTableDB
       ON PRIMARY
       (
           NAME = N'FileTableDB',
           FILENAME = N'C:\FileTable\FileTableDB.mdf'
       ),
       FILEGROUP FilestreamFG CONTAINS FILESTREAM
       (
           NAME = FileStreamGroup1,
           FILENAME = 'C:\FileTable\Data'
       )
       LOG ON
       (
           NAME = N'FileTableDB_Log',
           FILENAME = N'C:\FileTable\FileTableDB_log.ldf'
       )
       WITH FILESTREAM
       (
           NON_TRANSACTED_ACCESS = FULL,
           DIRECTORY_NAME = N'FileTables'
       )

    3. Alternatively, you can add a Filestream Filegroup to an existing database, and then create a Filestream directory for the database:

       ALTER DATABASE [FileTableDB] ADD FILEGROUP FileStreamGroup1 CONTAINS FILESTREAM (NAME = FileStreamGroup1, FILENAME = 'C:\FileTable\Data')
       GO

       ALTER DATABASE FileTableDB
           SET FILESTREAM ( NON_TRANSACTED_ACCESS = FULL, DIRECTORY_NAME = N'FileTables' );
       GO

    4. To verify the directory creation for the database, run this query:

       SELECT DB_NAME ( database_id ), directory_name
           FROM sys.database_filestream_options;
       GO

    5. Next, you can run this query to check if enabling non-transacted access on the database was successful (the database should have the value 'FULL' in the non_transacted_access_desc column):

       SELECT DB_NAME(database_id), non_transacted_access, non_transacted_access_desc
           FROM sys.database_filestream_options;
       GO

    6. The next step is to create a Filetable. Specifying the Filetable directory name is optional; if you don’t specify one, the directory will be created with the same name as the Filetable. Example:

       CREATE TABLE DocumentStore AS FileTable
           WITH (
                 FileTable_Directory = 'DocumentTable',
                 FileTable_Collate_Filename = database_default
                );
       GO

    7. Next, you can verify the previous step using this query (don’t be daunted by the number of rows you see for a single object):

       SELECT OBJECT_NAME(parent_object_id) AS 'FileTable', OBJECT_NAME(object_id) AS 'System-defined Object'
           FROM sys.filetable_system_defined_objects
           ORDER BY FileTable, 'System-defined Object';
       GO

    8. Now comes the most exciting part. Open the following path in Windows Explorer:
       \\<servername>\<Instance FileStream Windows share name (from config mgr)>\<DB Filetable directory>\<Table Directory Name>
       In our case, it will be:
       \\Harsh2k8\ENT2012\FileTables\DocumentTable
    9. Next, copy files over to this share, and see the magic:

       select * from DocumentStore

    So you get the best of both worlds: Accessing files through SQL, searching for specific words/strings inside the files from inside SQL, etc. while retaining the ability to access the files directly through a windows share. Really cool, right? I think so too.
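The UNC path opened in Windows Explorer above is just a concatenation of four names; a tiny helper makes the structure explicit (the server, share, and directory names below are the ones from this example and are environment-specific):

```python
def filetable_unc_path(server, filestream_share, db_directory, table_directory):
    r"""Build the Windows share path to a Filetable's directory:
    \\<server>\<instance Filestream share>\<database directory>\<table directory>"""
    return "\\\\" + "\\".join([server, filestream_share, db_directory, table_directory])

print(filetable_unc_path("Harsh2k8", "ENT2012", "FileTables", "DocumentTable"))
# \\Harsh2k8\ENT2012\FileTables\DocumentTable
```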

    A few points to remember:

    • The Filestream/Filetable features together give you the ability to manage Windows files from SQL Server. Since we’re talking about files on the file system, accessing them requires a Windows user. Thus, these features will not work with SQL Server authentication. The only exception is using a SQL Server login that has sysadmin privileges (in which case it will impersonate the SQL Server service account).
    • Filetables give you the ability to get the logical/UNC path to files and directories, but any file manipulation operations (such as copy, cut, delete, etc.) must be performed by your application, possibly using file system APIs such as CreateFile or CreateDirectory. In short, the onus is on the application to obtain a handle to the file using file system APIs; Filetables only serve the purpose of providing the path to the application.

    Some useful references for Filetables:
    http://msdn.microsoft.com/en-us/library/gg492089.aspx
    http://msdn.microsoft.com/en-us/library/gg492087.aspx

    Hope this helps. Any comments/feedback/suggestions are welcome.

    An in-depth look at SQL Server Memory–Part 3


    In part 1 and part 2 of the series, we talked about the memory architecture and the Procedure Cache respectively. In this third and final instalment of the SQL Server Memory series, I will look to focus on troubleshooting SQL Server Memory pressure issues.

     

    Before we start on the troubleshooting part though, we need to determine the type of memory pressure that we’re seeing here. I’ve tried to list those down here:


    1. External Physical Memory pressure – overall RAM pressure on the server. We need to find the largest consumers of memory (which might be SQL), and try to reduce their consumption. It might also be that the system has inadequate RAM for the workload it’s running.

    2. Internal Physical Memory pressure – memory pressure on specific components of SQL Server. Can be a result of external physical memory pressure, or of one of the components hogging too much memory.

    3. Internal Virtual Memory pressure – VAS pressure on SQL Server. Mostly seen only on 32-bit (x86) systems these days (x64 has 8 TB of VAS, whereas x86 only had 4 GB; refer to Part 1 for details).

    4. External Virtual Memory pressure – page file pressure on the OS. SQL Server does not recognize or respond to this kind of pressure.

     

    Troubleshooting

    Now for getting our hands dirty. When you suspect memory pressure on a server, I would recommend checking the following things, in order:

     


    1. Log in to the server, and take a look at the Performance tab of Task Manager. Do you see the overall memory usage on the server getting perilously close to the total RAM installed on the box? If so, it’s probable that we’re seeing external physical memory pressure.

    2. Next, look at the Processes tab, and see which of the processes is using the maximum amount of RAM. Again, for SQL, the true usage might not be reflected in the working set if LPIM is enabled (i.e. SQL is using AWE APIs to allocate memory). To check SQL’s total memory consumption, you can run the following query from inside SQL (valid from SQL 2008 onwards):

       select physical_memory_in_use_kb/(1024) as sql_physical_mem_in_use_mb,
           locked_page_allocations_kb/(1024) as awe_memory_mb,
           total_virtual_address_space_kb/(1024) as max_vas_mb,
           virtual_address_space_committed_kb/(1024) as sql_committed_mb,
           memory_utilization_percentage as working_set_percentage,
           virtual_address_space_available_kb/(1024) as vas_available_mb,
           process_physical_memory_low as is_there_external_pressure,
           process_virtual_memory_low as is_there_vas_pressure
       from sys.dm_os_process_memory
       go

    For SQL installations prior to 2008 (valid for 2008 and 2008 R2 as well), you can run DBCC Memorystatus, and take the total of VM Committed and AWE Allocated from the memory manager section to get a rough idea of the amount of memory being used by SQL Server.
     


    3. Next, compare this with the total amount of RAM installed on the server. If SQL seems to be taking most of the memory, or at least much more than it should, then we need to focus our attention on SQL Server. The exact specifics will vary according to the environment, and factors such as whether it is a dedicated SQL Server box, the number of SQL Server instances running on the server, etc. In case you have multiple instances of SQL Server, it is best to start with the instance consuming the maximum amount of memory (or showing the maximum deviation from “what it should be consuming”), tune it, and then move on to the next one.
     


    4. One of the first things to check should be the value of the “max server memory” setting for SQL Server. You can check this by turning on the ‘show advanced options’ setting of sp_configure, or by right-clicking on the instance in Object Explorer in SSMS, selecting Properties, and navigating to the “Memory” tab. If the value is 2147483647, the setting has been left at the default, and has not been changed since the instance was installed. It’s absolutely vital to set the max server memory setting to an optimal value. A general rule of thumb that you can use to set a starting value is as follows:
       Total server memory - (memory for other applications/instances + OS memory)
       The recommendation for the OS memory value is around 3-4 GB on 64-bit systems, and 1-2 GB on 32-bit systems. Please note that this is only a recommendation for the starting value. You need to fine-tune it based on observations w.r.t. the performance of both SQL and other applications (if any) on the server.
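The rule of thumb above is simple arithmetic; here is a hedged sketch that applies it, with values in GB (the 4 GB default OS reservation is just the 64-bit starting recommendation from the text, and the function name is mine):

```python
def max_server_memory_gb(total_ram_gb, other_apps_gb=0, os_reserve_gb=4):
    """Starting value for 'max server memory' per the rule of thumb:
    total RAM minus (other applications/instances + OS reservation).
    Only a starting point; fine-tune from observation."""
    return total_ram_gb - (other_apps_gb + os_reserve_gb)

# A dedicated 64 GB SQL box: start around 60 GB, then fine-tune
print(max_server_memory_gb(64))  # 60
```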
     


    5. Once you’ve determined that max server memory is set properly, the next step is to find out which component within SQL is consuming the most memory. The best place to start is, quite obviously, the good old “DBCC MEMORYSTATUS” command, unless you’re using NUMA, in which case it is best to use perfmon counters to track page allocations across NUMA nodes, as outlined here. I will try to break down most of the major components in the DBCC MEMORYSTATUS output here (I would recommend reading KB 907877 as a primer before this):

    I. First up is the Memory Manager section. As discussed earlier, this section contains details about the overall memory consumption of SQL Server. An example:

       Memory Manager                           KB
       ---------------------------------------- -----------
       VM Reserved                              4059416
       VM Committed                             43040
       Locked Pages Allocated                   41600
       Reserved Memory                          1024
       Reserved Memory In Use                   0

     


    II. Next, we have the memory nodes, starting with 0. As I mentioned, there is a known issue with the way DBCC MEMORYSTATUS displays the distribution of allocations across memory nodes, so it is best to study the distribution through the SQL Server performance counters. Here’s a sample query:

       select * from sys.dm_os_performance_counters
       where object_name like '%Buffer Node%'


    III. Next, we have the clerks. I’ve tried to outline the not-so-obvious ones in this table, along with their uses:

       Clerk Name                              Used for
       --------------------------------------- ---------------------------------------------------
       MEMORYCLERK_SQLUTILITIES                Database mirroring, backups, etc.
       MEMORYCLERK_SQLXP                       Extended Stored Procedures (loaded into SQL Server)
       MEMORYCLERK_XE, MEMORYCLERK_XE_BUFFER   Extended Events


    If you see any of the clerks hogging memory, then you need to focus on that clerk, and try to narrow down the possible causes.

     

    Another thing to watch out for is high values for the multipage allocator. If you see any clerk with extremely high values for multipage allocator, it means that the non-Bpool area is growing due to one of the following:


                                           i.            CLR Code: Check the errorlog for appdomain messages

                                         ii.            COM Objects : Check the errorlog for sp_oacreate

                                        iii.            Linked servers: Can be checked using Object Explorer in SSMS

                                       iv.             Extended stored procedures :  Check the errorlog for loading extended stored procedure messages.

                                        Alternatively, you can query the sys.extended_procedures view as well.

                                         v.            Third-party DLLs: third-party DLLs loaded into the SQL Server process space. Run the following query to check:
            select * from sys.dm_os_loaded_modules where company <> 'Microsoft Corporation'
     


    Here’s a query to check for the biggest multipage consumers:

    select type, name, sum(multi_pages_kb)/1024 as multi_pages_mb

    from sys.dm_os_memory_clerks

    where multi_pages_kb > 0

    group by type, name

    order by multi_pages_mb desc
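
    As an aside, if you try the query above on SQL Server 2012 or later, you'll find that sys.dm_os_memory_clerks no longer exposes the single_pages_kb/multi_pages_kb columns; the single- and multi-page allocators were merged, so an equivalent sketch would be:

    ```sql
    -- SQL 2012+ variant: single- and multi-page allocations are reported
    -- together in the pages_kb column of sys.dm_os_memory_clerks.
    select type, name, sum(pages_kb)/1024 as pages_mb
    from sys.dm_os_memory_clerks
    where pages_kb > 0
    group by type, name
    order by pages_mb desc
    ```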

     


    Yet another symptom to watch out for is a high ratio of stolen pages from the Buffer Pool. You can check this in the ‘Buffer Pool’ section of the MEMORYSTATUS output. A sample:


    Buffer Pool                                      Value

    ---------------------------------------- -----------

    Committed                                          4448

    Target                                                25600

    Database                                             2075

    Dirty                                                        50

    In IO                                                          0

    Latched                                                     0

    Free                                                       791

    Stolen                                                  1582

    Reserved                                                   0

    Visible                                                25600

    Stolen Potential                                 22738

    Limiting Factor                                        17

    Last OOM Factor                                       0

    Last OS Error                                             0

    Page Life Expectancy                         87529



    What this means is that Buffer Pool pages are being utilized for “other” uses, and not for holding data and index pages in the BPool. This can lead to performance issues and a crunch on the Bpool, thereby slowing down overall query performance (please refer to part 1 for consumers that “Steal” pages from the BPool). You can use the following query to check for the highest “Steal” consumers:


    select type, name, sum((single_pages_kb*1024)/8192) as stolen_pages

    from sys.dm_os_memory_clerks

    where single_pages_kb > 0

    group by type, name

    order by stolen_pages desc

     


                   IV.   Next, we have the stores, namely Cachestore, Userstore and Objectstore. Please refer to part 1 for how and by which components these clerks are used. You can use the following queries to check for the biggest Cachestores, Userstores and Objectstores respectively:
     


    select name, type, (SUM(single_pages_kb)+SUM(multi_pages_kb))/1024

    as store_size_mb

    from sys.dm_os_memory_cache_counters

    where type like 'CACHESTORE%'

    group by name, type

    order by store_size_mb desc

    go

     

    select name, type, (SUM(single_pages_kb)+SUM(multi_pages_kb))/1024

    as store_size_mb

    from sys.dm_os_memory_cache_counters

    where type like 'USERSTORE%'

    group by name, type

    order by store_size_mb desc

    go

     

    select name, type, (SUM(single_pages_kb)+SUM(multi_pages_kb))/1024

    as store_size_mb

    from sys.dm_os_memory_clerks

    where type like 'OBJECTSTORE%'

    group by name, type

    order by store_size_mb desc

    go

     


                     V.   Next, we have the gateways. The concept of gateways was introduced to throttle the use of query compilation memory. In plain English, this means that we did not want to allow too many queries with a high requirement for compilation memory to be running at the same time, as this would lead to consequences like internal memory pressure (i.e. one of the components of the buffer pool growing and creating pressure on other components).

    The concept basically works like this: when a query starts execution, it will start with a small amount of memory. As its consumption grows, it will cross the threshold for the small gateway, and must wait to acquire it. The gateway is basically implemented through a semaphore, which means that it will allow up to a certain number of threads to acquire it, and make threads beyond the limit wait. As the memory consumption for the query grows, it must acquire the medium and big gateways before being allowed to continue execution. The exact thresholds depend on factors like the total memory on the server, the SQL max server memory setting, the memory architecture (x86 or x64), the load on the server, etc.

    The number of queries allowed at each of the gateways is described below:

    Small: Dynamic; the default is (number of CPUs SQL sees * 4)

    Medium: Static; the number of CPUs SQL sees

    Large: Static; 1 per instance

     


    So if you see a large number of queries waiting on the large gateway, it means that you need to see why there are so many queries requiring large amounts of memory, and try to tune those queries. Such queries will show up with  RESOURCE_SEMAPHORE_QUERY_COMPILE or RESOURCE_SEMAPHORE wait types in sysprocesses, sys.dm_exec_requests, etc.
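
    To catch such queries while they are waiting, a sketch along these lines works; the statement text lookup via sys.dm_exec_sql_text is standard, and you can adjust the wait-type filter to the symptoms you're seeing:

    ```sql
    -- Sessions currently waiting on compile/execution memory semaphores,
    -- along with the statement text for each.
    select r.session_id, r.wait_type, r.wait_time_ms, t.text
    from sys.dm_exec_requests r
    cross apply sys.dm_exec_sql_text(r.sql_handle) t
    where r.wait_type in ('RESOURCE_SEMAPHORE_QUERY_COMPILE', 'RESOURCE_SEMAPHORE')
    ```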

     


    I am listing down some DMVs that might come in handy for SQL Server memory troubleshooting:

    sysprocesses

    sys.dm_exec_requests

    sys.dm_os_process_memory: Usage above.

    sys.dm_os_sys_memory: Will give you the overall memory picture for the server.

    sys.dm_os_sys_info: Can be used to check OS-level information like the hyperthread ratio, CPU ticks, OS quantum, etc.

    sys.dm_os_virtual_address_dump: Used to check VAS usage (reservations). The following query will give you VAS usage in descending order of reservations:

     


    with vasummary(Size,reserved,free) as (select size = vadump.size,

    reserved = SUM(case(convert(int, vadump.base) ^ 0)  when 0 then 0 else 1 end),

    free = SUM(case(convert(int, vadump.base) ^ 0x0) when 0 then 1 else 0 end)

    from

    (select CONVERT(varbinary, sum(region_size_in_bytes)) as size,

    region_allocation_base_address as base

    from sys.dm_os_virtual_address_dump

    where region_allocation_base_address<> 0x0

    group by region_allocation_base_address

    UNION(

    select CONVERT(varbinary, region_size_in_bytes),

    region_allocation_base_address

    from sys.dm_os_virtual_address_dump

    where region_allocation_base_address = 0x0)

    )

    as vadump

    group by size)

    select * from vasummary order by reserved desc

    go

     


    sys.dm_os_memory_clerks (usage above)

    sys.dm_os_memory_nodes: Just a select * would suffice. This DMV has one row for each memory node.

    sys.dm_os_memory_cache_counters: Used above to find the size of the cachestores. Another sample query would be:

    select (single_pages_kb+multi_pages_kb) as memusage, * from sys.dm_os_memory_cache_counters order by memusage desc

     

    Once you have narrowed down the primary consumer and the specific component which is causing a memory bottleneck, the resolution steps should be fairly simple. For example, if you see some poorly written code, you can hound the developers to tune it. For other processes hogging memory at the OS level, you will need to investigate them. For high consumption by a particular clerk, check the corresponding components. For example, in case of high usage by the SQLUtilities clerk, one of the first things to check is whether there is any mirroring set up on the instance, and if so, whether it's working properly.
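
    For the mirroring example, a quick way to check whether any database on the instance has mirroring configured (no rows returned means no mirroring):

    ```sql
    -- Databases with database mirroring configured on this instance,
    -- along with their current role and state.
    select db_name(database_id) as database_name,
           mirroring_role_desc, mirroring_state_desc
    from sys.database_mirroring
    where mirroring_guid is not null
    ```

    If mirroring shows up here, check that the state is SYNCHRONIZED/SYNCHRONIZING rather than SUSPENDED or DISCONNECTED before digging further.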

    Another thing I would strongly recommend would be to watch out for memory related KB articles, and make sure you have the relevant fixes applied.

    Hope this helps. Any feedback, questions or comments are welcome.

     

    Why the service account format matters for upgrades


     

    I’ve seen this issue a few times in the past few months, so decided to blog about this. When upgrading from SQL 2005 to SQL 2008/SQL 2008 R2 (or even from SQL 2008 to SQL 2008 R2), you might face an error with the in-place upgrade.

    Open the setup logs folder (located in C:\Program Files\Microsoft SQL Server\<version - 100 for 2008 and 2008 R2>\Setup Bootstrap\Log by default), and look for a folder with the date-time of the upgrade attempt. Inside this folder, look for a file named "Detail.txt".

    Looking inside the detail.txt file, check for the following stack:

    2013-01-21 11:16:42 Slp: Sco: Attempting to check if container ‘WinNT://Harsh2k8,computer’ of user account exists

    2013-01-21 11:16:42 Slp: Sco: User srv_sql@contoso.test wasn’t located

    2013-01-21 11:16:42 Slp: Sco: User srv_sql@contoso.test doesn’t exist

    2013-01-21 11:16:42 SQLBrowser: SQL Server Browser Install for feature ‘SQL_Browser_Redist_SqlBrowser_Cpu32′ generated exception, and will invoke retry option.  The exception: Microsoft.SqlServer.Configuration.Sco.ScoException: The specified user ‘srv_sql@contoso.test’ does not exist.

       at Microsoft.SqlServer.Configuration.Sco.UserGroup.AddUser(String userName)

       at Microsoft.SqlServer.Configuration.SqlBrowser.SqlBrowserPrivateConfig.AddAccountToGroup(SqlBrowserPublicConfig publicConfigSqlBrowser)

       at Microsoft.SqlServer.Configuration.SqlBrowser.SqlBrowserPrivateConfig.UpdateAccountIfNeeded(SqlBrowserPublicConfig publicConfigSqlBrowser)

       at Microsoft.SqlServer.Configuration.SqlBrowser.SqlBrowserPrivateConfig.ConfigUserProperties(SqlBrowserPublicConfig publicConfigSqlBrowser)

       at Microsoft.SqlServer.Configuration.SqlBrowser.SqlBrowserPrivateConfig.ExecConfigNonRC(SqlBrowserPublicConfig publicConfigSqlBrowser)

       at Microsoft.SqlServer.Configuration.SqlBrowser.SqlBrowserPrivateConfig.SelectAndExecTiming(ConfigActionTiming timing, Dictionary`2 actionData, PublicConfigurationBase spcbPublicConfig)

       at Microsoft.SqlServer.Configuration.SqlBrowser.SqlBrowserPrivateConfigBase.ExecWithRetry(ConfigActionTiming timing, Dictionary`2 actionData, PublicConfigurationBase spcbPublicConfig).

    2013-01-21 11:16:42 SQLBrowser: The last attempted operation: Adding account ‘srv_sql@contoso.test’ to the SQL Server Browser service group ‘SQLServer2005SQLBrowserUser$Harsh2k8′..

     

    The key thing here is the message "Attempting to check if container WinNT://Harsh2k8,computer of user account exists". If you see this message, go to SQL Server Configuration Manager, right-click on the offending service mentioned in Detail.txt, open the properties window and navigate to the "Log On" tab. Check the format of the service account here. It should be in the format domain\username; if it is in the format username@domain instead, change it to domain\username, and type in the password. After this, restart the SQL service to make sure the changes have taken effect.

    Try the setup again, and it should work this time.

     

    Hope this helps.

     

    An interesting issue with SQL Server Script upgrade mode


    Here’s another common issue that I’ve seen quite a few people run into of late.

    When you run a patch against SQL Server, the patch installs successfully, but on restart, SQL goes into “script upgrade mode” and you’re unable to connect to it. Upon looking at the errorlog, you see something like this:

     

    2012-08-23 03:43:38.29 spid7s      Error: 5133, Severity: 16, State: 1.

    2012-08-23 03:43:38.29 spid7s      Directory lookup for the file "D:\SQLData\temp_MS_AgentSigningCertificate_database.mdf" failed with the operating system error 2(The system cannot find the file specified.).

    2012-08-23 03:43:38.29 spid7s      Error: 1802, Severity: 16, State: 1.

    2012-08-23 03:43:38.29 spid7s      CREATE DATABASE failed. Some file names listed could not be created. Check related errors.

    2012-08-23 03:43:38.31 spid7s      Error: 912, Severity: 21, State: 2.

    2012-08-23 03:43:38.31 spid7s      Script level upgrade for database ‘master’ failed because upgrade step ‘sqlagent100_msdb_upgrade.sql’ encountered error 598, state 1, severity 25. This is a serious error condition which might interfere with regular operation and the database will be taken offline. If the error happened during upgrade of the ‘master’ database, it will prevent the entire SQL Server instance from starting. Examine the previous errorlog entries for errors, take the appropriate corrective actions and re-start the database so that the script upgrade steps run to completion.

    2012-08-23 03:43:38.31 spid7s      Error: 3417, Severity: 21, State: 3.

    2012-08-23 03:43:38.31 spid7s      Cannot recover the master database. SQL Server is unable to run. Restore master from a full backup, repair it, or rebuild it. For more information about how to rebuild the master database, see SQL Server Books Online.

     

    Script upgrade means that when SQL is restarted for the first time after the application of the patch, the upgrade scripts are run against each system db (to upgrade the system tables, views, etc. ). During this process, SQL Server attempts to create this mdf file in the default data location, and if the path is not available, then we get this error. Most of the time, it’s a result of the data having been moved to a different folder, and the original Default Data path being no longer available.

    The default data path can be checked from the following registry key (for a default SQL 2008 instance):

    HKEY_LOCAL_MACHINE\Software\Microsoft\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQLServer

    The MSSQLServer key will have a string value named "DefaultData". If you see a location here that's no longer available, please change it to the current data location (alternatively, you can also "recreate" the default data path mentioned in the string value).

    If you do not see the value, please check the HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.<instance name>\Setup key, and see if you can spot the SQLDataRoot value there. Check whether it has the path mentioned above, and if so, update it to the current path.

    If the path is correct and the instance is clustered, then one of the following conditions may hold true:

    1. The relevant drive is not added as a resource to the SQL Server group in Failover cluster manager.

    2. The SQL Server resource does not have a dependency on the specified drive.

    After this, restart SQL Server and the script upgrade should complete successfully this time. Hope this helps.




    How To: Troubleshooting SQL Server I/O bottlenecks


    One of the most common reasons for server performance issues with respect to SQL Server is the presence of an I/O bottleneck on the system. By I/O bottleneck, I mean issues like slow disks, other processes hogging I/O, outdated drivers, etc. In this blog, I will outline the approach for identifying and troubleshooting I/O bottlenecks on SQL Server.

     

    The Symptoms

     

    The following are the most common symptoms of an I/O bottleneck on the SQL Server machine:

    • You see a lot of threads waiting on one or more of the following waits:
      • PAGEIOLATCH_*
      • WRITELOG
      • TRACEWRITE
      • SQLTRACE_FILE_WRITE_IO_COMPLETION
      • ASYNC_IO_COMPLETION
      • IO_COMPLETION
      • LOGBUFFER
         
    • You see the famous “I/O taking longer than 15 seconds” messages in the SQL Server errorlogs: 
      2012-11-11 00:21:25.26 spid1 SQL Server has encountered 192 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [E:\SEDATA\stressdb5.ndf] in database [stressdb] (7). The OS file handle is 0x00000000000074D4. The offset of the latest long I/O is: 0x00000000022000.
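
    When you see these symptoms, you can catch the waiting sessions in the act from inside SQL Server; the wait-type filter in this sketch simply mirrors the list above, so extend it as needed:

    ```sql
    -- Sessions currently stuck on I/O-related waits.
    select session_id, wait_type, wait_duration_ms, resource_description
    from sys.dm_os_waiting_tasks
    where wait_type like 'PAGEIOLATCH%'
       or wait_type in ('WRITELOG', 'IO_COMPLETION', 'ASYNC_IO_COMPLETION')
    ```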

     

    Troubleshooting

     

    Data Collection:

     

    If you see the symptoms outlined above quite frequently on your SQL Server installation, then it will be safe to draw the conclusion that your instance is suffering from a disk subsystem or I/O bottleneck. Let’s look at the data collection and troubleshooting approach pertaining to the same:

    1. Enable a custom Performance Monitor collector to capture all disk related counters. Just go to start->run, type perfmon, and hit ok. Next, go to Data Collector sets->User Defined, right click on User Defined, and click New-> Data Collector set.
      Note: The best thing about perfmon (apart from the fact that it is built into Windows) is that it's a very lightweight diagnostic, and has negligible performance overhead/impact.

    2. Give the data collector set a name, and select Create manually. Under type of data, select the “Create data logs” option, and check the Performance Counter checkbox under it.

    3. Next, click on add performance counters, and select the "LogicalDisk", "Process" and "PhysicalDisk" groups, selecting "All instances" for each before adding them.

    4. After you have added the counters, you can also modify the sample interval. You might want to do this if you see spikes lasting less than 15 seconds, which is the default sample interval. I sometimes use an interval of 5 seconds when I want to closely monitor an environment .

    5. Click on Finish and you will now see the new Data Collector set created under User Defined.

    6. Next, right click on the Data Collector set you just created, and click start.

     

    I normally recommend that my clients run the perfmon collector set for at least one business day, so that it captures the load exerted by at least one standard business cycle.

     

     

    Analysis:

     

    Now that we have the data, we can start the analysis. After stopping the collector set, you can open the .blg file generated (the path is displayed under the output column, on the right-hand side in perfmon) using perfmon (a simple double-click works, as the file type is associated with perfmon by default). Once open, it should have automatically loaded all the counters. Analysing with all the counters can be a bit cumbersome, so I would suggest that you first delete all the counters and then add specific counters one by one.

    I will list out the important counters here, along with their expected values: 

    1. Process->IO Data Bytes/sec: This counter represents the average amount of IO Data bytes/sec spawned by each process. In conjunction with IO Other Bytes/sec, this counter can be used to determine the average IO per second as well as the total amount of IO spawned by each process during the capture. Check for the largest I/O consumers, and see if SQL is being starved of I/O due to some other process spawning a large amount of I/O on the system.
       
    2. Process-> IO Other Bytes/sec: This counter represents the non-data IO spawned by each process during the capture. Usually, the amount of non-data IO is very low as compared to data IO. Use the total of both IO Data Bytes and IO other bytes to determine the total amount of IO spawned by each process during the capture. Check for the largest I/O consumers, and see if SQL is being starved of I/O due to some other process spawning a large amount of I/O on the system.
       
    3. Physical Disk/Logical Disk->Avg. Disk Sec/Read: This counter signifies the average time it takes for a read I/O request to be serviced for each physical/logical disk. Note that the counter reports in seconds: an average of less than 10 ms (0.010) is good, between 10-15 ms (0.010-0.015) is acceptable, but anything beyond 15 ms (0.015) is a cause for concern.
       
    4. Physical Disk/Logical Disk->Avg. Disk Sec/Write: This counter signifies the average time it takes for a write I/O request to be serviced for each physical/logical disk. Again, the counter reports in seconds: an average of less than 10 ms (0.010) is good, between 10-15 ms (0.010-0.015) is acceptable, but anything beyond 15 ms (0.015) is a cause for concern.

    5. Physical Disk/Logical Disk->Disk Bytes/Sec: This counter represents, in bytes, the throughput of your I/O subsystem for each physical/logical disk. Look for the max value for each disk, and divide it by 1024 twice to get the max throughput in MB. SANs generally start from 200-250 MB/s these days. If you see that the throughput is lower than the specifications for the disk, it's not necessarily a cause for concern. Check this counter in conjunction with the Avg Disk Sec/Read or Avg Disk Sec/Write counters (depending on the wait/symptom you see in SQL), and see the latency at the time of the maximum throughput. If the latency is green, then it just means that SQL spawned I/O that was less than the disk throughput capacity, and was easily handled by the disk.

    6. Physical Disk/Logical Disk->Avg. Disk Queue Length: This counter represents the average number of I/Os pending in the I/O queue for each physical/logical disk. Generally, if the average is greater than 2 per spindle, it's a cause for concern. Please note that I mentioned the acceptable threshold as 2 per spindle. Most SANs these days have multiple spindles. So, for example, if your SAN has 4 spindles, the acceptable threshold for Avg Disk Queue Length would be 8.
      Check the other counters to confirm.

    7. Physical Disk/Logical Disk->Split IO/Sec: This counter indicates the I/Os for which the Operating System had to make more than one command call, grouped by physical/logical disk. This happens if the I/O request touches data on non-contiguous file segments. It's a good indicator of file/volume fragmentation.

    8. Physical Disk/Logical Disk->%Disk Time: This counter is a general mark of how busy the physical/logical disk is. Actually, it is nothing more than the “Avg. Disk Queue Length” counter multiplied by 100. It is the same value displayed in a different scale. This is the reason you can see the %Disk Time going greater than 100, as explained in the KB http://support.microsoft.com/kb/310067. It basically means that the Avg. Disk Queue Length was greater than 1 during that time. If you’ve captured the perfmon for a long period (a few hours or a complete business day), and you see the %Disk Time to be greater than 80%, it’s generally indicative of a disk bottleneck, and you should take a closer look at the other counters to arrive at a logical conclusion.
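
    You can corroborate the perfmon latency numbers from inside SQL Server using sys.dm_io_virtual_file_stats. Keep in mind that the stall values are cumulative since the last restart of the instance, so this sketch gives you an overall average rather than a point-in-time measure:

    ```sql
    -- Average read/write latency (ms) per database file, cumulative since
    -- the last SQL Server restart. High averages point at the hot files.
    select db_name(vfs.database_id) as database_name,
           mf.physical_name,
           vfs.io_stall_read_ms / nullif(vfs.num_of_reads, 0) as avg_read_ms,
           vfs.io_stall_write_ms / nullif(vfs.num_of_writes, 0) as avg_write_ms
    from sys.dm_io_virtual_file_stats(null, null) vfs
    join sys.master_files mf
      on mf.database_id = vfs.database_id
     and mf.file_id = vfs.file_id
    order by avg_read_ms desc
    ```

    Cross-checking the worst files here against the perfmon Avg. Disk Sec/Read numbers for the corresponding drives helps confirm whether the bottleneck is the disk itself or a specific database workload.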
       

    It’s important to keep 2 things in mind. One, make sure your data capture is not skewed or biased in any way (for example, do not run a capture at the time of a monthly data load or something). Second, make sure you correlate the numbers reflected across the various counters to arrive at the overall picture of how your disks are doing.

     

    Most of the time, I see that people are surprised when they are told that there are I/O issues on the system. Their typical response is “But, it’s been working just fine for x years, how can it create a bottleneck now?”. The answer lies within the question itself. When the server was initially configured, the disk resources were sufficient for the load on the server. However, with time, it’s inevitable that the business grows as a whole, and so do the number of transactions, as well as the overall load. As a result, there comes a day when the load breaches that threshold, and the disk resources on the server are no longer sufficient to handle it. If you come to office one fine day, see high latency on the disks during normal working hours, and are sure that

    • No special/additional workloads are running on SQL
    • No other process on the server is spawning excessive I/O,
    • Nothing changed on the server in the past 24 hours (like a software installation, patching, reboot, etc.)
    • All the BIOS and disk drivers on the server are up to date,

    Then it’s highly likely that the load on your server has breached this threshold, and you should think about asking your disk vendor(s) for a disk upgrade (after having them check the existing system once for latency and throughput, of course). Another potential root cause that can cause high latency is that your disk drivers and/or BIOS are out of date. I would strongly recommend checking periodically for updates to all the drivers on the machine, as well as the BIOS.

     

    Hope this helps. As always, comments, feedbacks and suggestions are welcome.

     

     


    SQL Server patch fails with "Could not find any resources appropriate for the specified culture or the neutral culture"


    I recently worked on a number of issues where SQL Server Service Pack/patch installation would fail, and we would see this error in the relevant Detail.txt (located in C:\Program Files\Microsoft SQL Server\100\Setup Bootstrap\Log\<date time of the installation attempt> for SQL 2008/2008 R2):
     

    2013-04-07 20:14:07 Slp: Package sql_bids_Cpu64: – The path of cached MSI package is: C:\Windows\Installer\5c23b5e.msi . The RTM product version is: 10.50.1600.1

    2013-04-07 20:14:07 Slp: Error: Action “Microsoft.SqlServer.Configuration.SetupExtension.InitializeUIDataAction” threw an exception during execution.

    2013-04-07 20:14:13 Slp: Received request to add the following file to Watson reporting: C:\Users\kalerahul\AppData\Local\Temp\2\tmpCC09.tmp

    2013-04-07 20:14:13 Slp: The following is an exception stack listing the exceptions in outermost to innermost order

    2013-04-07 20:14:13 Slp: Inner exceptions are being indented

    2013-04-07 20:14:13 Slp:

    2013-04-07 20:14:13 Slp: Exception type: System.Resources.MissingManifestResourceException

    2013-04-07 20:14:13 Slp:     Message:

    2013-04-07 20:14:13 Slp:         Could not find any resources appropriate for the specified culture or the neutral culture.  Make sure “Errors.resources” was correctly embedded or linked into assembly “Microsoft.SqlServer.Discovery” at compile time, or that all the satellite assemblies required are loadable and fully signed.

    2013-04-07 20:14:13 Slp:     Stack:

    2013-04-07 20:14:13 Slp:         at System.Resources.ResourceManager.InternalGetResourceSet(CultureInfo culture, Boolean createIfNotExists, Boolean tryParents)

    2013-04-07 20:14:13 Slp:         at System.Resources.ResourceManager.GetObject(String name, CultureInfo culture, Boolean wrapUnmanagedMemStream)

    2013-04-07 20:14:13 Slp:         at Microsoft.SqlServer.Discovery.MsiException.GetErrorMessage(Int32 errorNumber, CultureInfo culture)

    2013-04-07 20:14:13 Slp:         at Microsoft.SqlServer.Discovery.MsiException.GetErrorMessage(MsiRecord errorRecord, CultureInfo culture)

    2013-04-07 20:14:13 Slp:         at Microsoft.SqlServer.Discovery.MsiException.get_Message()

    2013-04-07 20:14:13 Slp:         at System.Exception.ToString()

    2013-04-07 20:14:13 Slp:         at System.Exception.ToString()

    2013-04-07 20:14:13 Slp:         at System.Exception.ToString()

    2013-04-07 20:14:13 Slp:         at Microsoft.SqlServer.Setup.Chainer.Workflow.ActionEngine.RunActionQueue()

    2013-04-07 20:14:13 Slp:         at Microsoft.SqlServer.Setup.Chainer.Workflow.Workflow.RunWorkflow(WorkflowObject workflowObject, HandleInternalException exceptionHandler)

    2013-04-07 20:14:13 Slp:         at Microsoft.SqlServer.Chainer.Setup.Setup.RunRequestedWorkflow()

    2013-04-07 20:14:13 Slp:         at Microsoft.SqlServer.Chainer.Setup.Setup.Run()

    2013-04-07 20:14:13 Slp:         at Microsoft.SqlServer.Chainer.Setup.Setup.Start()

    2013-04-07 20:14:13 Slp:         at Microsoft.SqlServer.Chainer.Setup.Setup.Main()

     

    Now that’s a weird and hard to understand error, isn’t it? However, look closely at what the setup is trying to do, and you will see that it’s trying to access the following file from the installer cache:
     C:\Windows\Installer\5c23b5e.msi
     

    Open the installer cache and try to install the MSI manually. If it succeeds, try running the patch setup again and it should proceed beyond the error this time. If the MSI setup fails, then you will need to troubleshoot that first, before the patch setup can proceed further. This behaviour is expected, in that the service pack setup will try to access the MSIs (Microsoft Installer files, installed with the base installation of SQL) and MSPs (Microsoft Patch files, installed by service packs, CUs and hotfixes) of each of the installed components of SQL Server. If it's unable to access/run any of these, the service pack setup will fail.

     

    Hope this helps.

     

    SQL 2012 Availability Group does not come up on one instance


    I recently came across this interesting issue with SQL 2012 AlwaysOn Availability Groups, wherein after the network and IP were changed, the AG would not come up on one of the instances.

     

    We checked the errorlogs on the servers, and found the following stacks for the successful failovers that had been attempted on the healthy instance:

    <

    2013-01-30 13:03:21.09 spid1347 The state of the local availability replica in availability group ‘SQL2012CLUS02′ has changed from ‘NOT_AVAILABLE’ to ‘RESOLVING_NORMAL’. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

    2013-01-30 13:03:21.66 spid135 AlwaysOn: The local replica of availability group ‘SQL2012CLUS02′ is preparing to transition to the primary role in response to a request from the Windows Server Failover Clustering (WSFC) cluster. This is an informational message only. No user action is required.

    2013-01-30 13:03:21.69 spid135 The state of the local availability replica in availability group ‘SQL2012CLUS02′ has changed from ‘RESOLVING_NORMAL’ to ‘PRIMARY_PENDING’. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

    2013-01-30 13:03:21.70 Server The Service Broker endpoint is in disabled or stopped state.

    2013-01-30 13:03:21.72 Server The Service Broker endpoint is in disabled or stopped state.

    2013-01-30 13:03:21.82 Server The state of the local availability replica in availability group ‘SQL2012CLUS02′ has changed from ‘PRIMARY_PENDING’ to ‘PRIMARY_NORMAL’. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

    2013-01-30 13:03:21.82 Server The Service Broker endpoint is in disabled or stopped state.

    2013-01-30 13:04:15.72 spid50s A connection for availability group ‘SQL2012CLUS02′ from availability replica ‘CTGDNAV’ with id [E4A205DD-0098-481F-94F4-B15ABDC3BAD1] to ‘CDGI-SQLPROD-02′ with id [58F02E10-68CB-4EB2-B517-60306BCC0E72] has been successfully established. This is an informational message only. No user action is required.

    And later in the same errorlog:

    2013-01-30 13:04:36.51 spid754 AlwaysOn: The local replica of availability group 'SQL2012CLUS02' is preparing to transition to the resolving role in response to a request from the Windows Server Failover Clustering (WSFC) cluster. This is an informational message only. No user action is required.

    2013-01-30 13:04:36.51 spid754 The state of the local availability replica in availability group 'SQL2012CLUS02' has changed from 'PRIMARY_NORMAL' to 'RESOLVING_NORMAL'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

    2013-01-30 13:04:36.52 Server The Service Broker endpoint is in disabled or stopped state.

    2013-01-30 13:04:38.62 spid305 AlwaysOn: The local replica of availability group 'SQL2012CLUS02' is preparing to transition to the primary role in response to a request from the Windows Server Failover Clustering (WSFC) cluster. This is an informational message only. No user action is required.

    2013-01-30 13:04:38.97 spid305 The state of the local availability replica in availability group 'SQL2012CLUS02' has changed from 'RESOLVING_NORMAL' to 'PRIMARY_PENDING'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

    2013-01-30 13:04:38.97 Server The Service Broker endpoint is in disabled or stopped state.

    2013-01-30 13:04:39.13 Server The state of the local availability replica in availability group 'SQL2012CLUS02' has changed from 'PRIMARY_PENDING' to 'PRIMARY_NORMAL'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

    2013-01-30 13:04:39.14 Server The Service Broker endpoint is in disabled or stopped state.

    2013-01-30 13:04:46.01 spid30s A connection for availability group 'SQL2012CLUS02' from availability replica 'CTGDNAV' with id [E4A205DD-0098-481F-94F4-B15ABDC3BAD1] to 'CDGI-SQLPROD-02' with id [58F02E10-68CB-4EB2-B517-60306BCC0E72] has been successfully established. This is an informational message only. No user action is required.


    We then checked the critical events for the AG in Failover Cluster Manager; this was all I could find:

    Cluster resource 'SQL2012CLUS02' in clustered service or application 'SQL2012CLUS02' failed.

     

    I then collected the errorlog for the CDGI-SQLPROD-02 instance, and found this:

    2013-01-30 13:04:14.890 spid1535 The state of the local availability replica in availability group 'SQL2012CLUS02' has changed from 'NOT_AVAILABLE' to 'RESOLVING_NORMAL'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more

    2013-01-30 13:04:15.580 spid1536s The state of the local availability replica in availability group 'SQL2012CLUS02' has changed from 'RESOLVING_NORMAL' to 'SECONDARY_NORMAL'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For m

    2013-01-30 13:04:15.890 spid43s A connection for availability group 'SQL2012CLUS02' from availability replica 'CDGI-SQLPROD-02' with id [58F02E10-68CB-4EB2-B517-60306BCC0E72] to 'CTGDNAV' with id [E4A205DD-0098-481F-94F4-B15ABDC3BAD1] has been successfully established. This is an infor

    2013-01-30 13:04:36.370 spid1540 The state of the local availability replica in availability group 'SQL2012CLUS02' has changed from 'SECONDARY_NORMAL' to 'RESOLVING_PENDING_FAILOVER'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster er

    2013-01-30 13:04:38.080 Logon Error: 18456, Severity: 14, State: 5.

    2013-01-30 13:04:38.080 Logon Login failed for user 'CLT-SQLPROD-02$'. Reason: Could not find a login matching the name provided. [CLIENT: <local machine>]

     

    We can see that the login failure appears to be responsible for the failed failover of the AG. I tried adding the login manually and restarted the instance, but the failover still failed. I then checked the event logs on CDGI-SQLPROD-02 and found just this:

    Log Name: Application

    Source: MSSQLSERVER

    Date: 1/30/2013 4:54:43 PM

    Event ID: 35206

    Task Category: Server

    Level: Information

    Keywords: Classic

    User: N/A

    Computer: CDGI-SQLPROD-02.CLT.com

    Description:

    A connection timeout has occurred on a previously established connection to availability replica 'CTGDNAV' with id [E4A205DD-0098-481F-94F4-B15ABDC3BAD1]. Either a networking or a firewall issue exists or the availability replica has transitioned to the resolving role.

    And then this:

    Log Name: Application

    Source: MSSQLSERVER

    Date: 1/30/2013 4:54:33 PM

    Event ID: 18456

    Task Category: Logon

    Level: Information

    Keywords: Classic,Audit Failure

    User: SYSTEM

    Computer: CDGI-SQLPROD-02.CLT.com

    Description:

    Login failed for user 'CLT-SQLPROD-02$'. Reason: Could not find a login matching the name provided. [CLIENT: <local machine>]

    The interesting thing here is that the connection attempt seems to be coming from the OS on the same box. I then captured a Profiler trace to confirm, and found the login error event:

    ErrorLog 2013-01-31 11:21:30.55 Logon Error: 18456, Severity: 14, State: 5.

    2013-01-31 11:21:30.55 Logon Login failed for user 'CLT-SQLPROD-02$'. Reason: Could not find a login matching the name provided. [CLIENT: <local machine>]

    Microsoft® Windows® Operating System CDGI-SQLPROD-02$ 3828 86 2013-01-31 11:21:30.550 1 master 18456 26102 CDGI-SQLPROD-02 CLT 0 CDGI-SQLPROD-02 CLT\CDGI-SQLPROD-02$ 5 14

    EventLog Login failed for user 'CLT\CDGI-SQLPROD-02$'. Reason: Could not find a login matching the name provided. [CLIENT: <local machine>] Microsoft® Windows® Operating System CDGI-SQLPROD-02$ 3828 86 2013-01-31 11:21:30.550 0X184800000E0000001000000043004400470049002D00530051004C00500052004F0044002D00300032000000070000006D00610073007400650072000000 1 master 18456 26103 CDGI-SQLPROD-02 CLT 0 CDGI-SQLPROD-02 CLT\CDGI-SQLPROD-02$ 5 14

    Audit Login Failed Login failed for user 'CLT\CDGI-SQLPROD-02$'. Reason: Could not find a login matching the name provided. [CLIENT: <local machine>] Microsoft® Windows® Operating System CDGI-SQLPROD-02$ CLT\CDGI-SQLPROD-02$ 3828 86 2013-01-31 11:21:30.550 1 master 18456 26104 1 – Nonpooled CDGI-SQLPROD-02 CLT 0 CDGI-SQLPROD-02 CLT\CDGI-SQLPROD-02$ 5 0 1 – Non-DAC

    User Error Message Login failed for user 'CLT\CDGI-SQLPROD-02$'. Microsoft® Windows® Operating System CDGI-SQLPROD-02$ 3828 86 2013-01-31 11:21:30.550 1 master 18456 26105 CDGI-SQLPROD-02 CLT 0 CDGI-SQLPROD-02 CLT\CDGI-SQLPROD-02$ 1 1 0 14

    The Profiler trace confirms our hunch. We then ran the following commands to add the Local System account as a sysadmin on the problem instance:

     

    USE [master]
    GO

    /****** Object:  Login [NT AUTHORITY\SYSTEM]    Script Date: 01-02-2013 03:31:56 ******/
    CREATE LOGIN [NT AUTHORITY\SYSTEM] FROM WINDOWS WITH DEFAULT_DATABASE=[master],
    DEFAULT_LANGUAGE=[us_english]
    GO

    ALTER SERVER ROLE [sysadmin] ADD MEMBER [NT AUTHORITY\SYSTEM]
    GO

     

    After this, the failover worked perfectly fine.

     

    Hope this helps.

    When using SSL, SQL Failover Cluster Instance fails to start with error 17182


    I recently worked on an interesting issue with a SQL Server Failover Cluster Instance (FCI). We were trying to use an SSL certificate on the instance, and we followed these steps:

    1. Made sure the certificate was requested according to the requirements defined here.
    2. Loaded the certificate into the Personal store of the computer account on all the nodes.
    3. Copied the thumbprint of the certificate, eliminated the spaces, and pasted it into the Certificate value under the HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.CLUSTEST\MSSQLServer key. Please note that this was a SQL 2008 instance named "CLUSTEST".

     

    However, when we restarted SQL Server after performing these changes, it failed. In the errorlog, we saw these messages:

    2013-07-21 14:06:11.54 spid19s     Error: 17182, Severity: 16, State: 1.

    2013-07-21 14:06:11.54 spid19s     TDSSNIClient initialization failed with error 0xd, status code 0x38. Reason: An error occurred while obtaining or using the certificate for SSL. Check settings in Configuration Manager. The data is invalid.

    2013-07-21 14:06:11.54 spid19s     Error: 17182, Severity: 16, State: 1.

    2013-07-21 14:06:11.54 spid19s     TDSSNIClient initialization failed with error 0xd, status code 0x1. Reason: Initialization failed with an infrastructure error. Check for previous errors. The data is invalid.

    2013-07-21 14:06:11.54 spid19s     Error: 17826, Severity: 18, State: 3.

    2013-07-21 14:06:11.54 spid19s     Could not start the network library because of an internal error in the network library. To determine the cause, review the errors immediately preceding this one in the error log.

    2013-07-21 14:06:11.54 spid19s     Error: 17120, Severity: 16, State: 1.

    2013-07-21 14:06:11.54 spid19s     SQL Server could not spawn FRunCommunicationsManager thread. Check the SQL Server error log and the Windows event logs for information about possible related problems.

    I checked and made sure the certificate was okay, and that it was loaded properly. Then, I noticed something interesting. After copying the thumbprint to a text file, I got a Unicode to ANSI conversion warning when I tried to save the file in txt format:

    [Screenshot: Notepad's Unicode-to-ANSI conversion warning]

     

    This is expected, since the default format for Notepad is indeed ANSI. I went ahead and clicked OK. When we reopened the file, we saw a "?" at the beginning, which meant that there was an invisible Unicode character at the beginning of the string. We followed these steps to resolve the issue:

    1. Eliminated the Unicode character from the thumbprint
    2. Converted all the alphabetical characters in the thumbprint to uppercase
    3. Eliminated the spaces from the thumbprint
    4. Saved this thumbprint to the HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.CLUSTEST\MSSQLServer\Certificate value.
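The cleanup steps above can also be scripted. Here's a minimal sketch (a hypothetical helper, not part of any Microsoft tooling): it keeps only hex digits, which drops spaces and invisible Unicode characters such as a byte-order mark, and uppercases the result before you paste it into the registry.

```python
def clean_thumbprint(raw: str) -> str:
    """Normalize a certificate thumbprint copied from the certificate UI:
    keep only hex digits (dropping spaces and invisible Unicode such as
    a byte-order mark) and return the result in uppercase."""
    hex_digits = set("0123456789abcdefABCDEF")
    return "".join(ch for ch in raw if ch in hex_digits).upper()

# Hypothetical thumbprint pasted with a leading BOM (U+FEFF) and spaces
raw = "\ufeffaa bb cc dd ee ff 00 11 22 33 44 55 66 77 88 99 aa bb cc dd"
print(clean_thumbprint(raw))  # → AABBCCDDEEFF00112233445566778899AABBCCDD
```

The thumbprint shown is a made-up example; substitute the one from your own certificate.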

     

    The instance came online just fine this time.

     

    Hope this helps.


    Something to watch out for when using IS_MEMBER() in TSQL


    I recently worked on an interesting issue with my good friend Igor (@sqlsantos), where we were facing performance issues with a piece of code that used the IS_MEMBER() function. Basically, the IS_MEMBER function is used to find out whether the current user (the Windows/SQL login used for the current session) is a member of the specified Windows group or SQL Server database role.


    In the specified code, the IS_MEMBER function was being used to determine the Windows group membership of the Windows login. The Windows groups were segregated according to geographical areas and, based on the user's group membership, the result set was filtered to show rows for only those geographical areas for which the user was a member of the corresponding groups in Active Directory.

    Here’s an example of a piece of code where we perform this check:

    WITH SalesOrgCTE AS (
        SELECT MinSalesOrg, MaxSalesOrg
        FROM RAuthorized WITH (NOLOCK)
        WHERE IS_MEMBER([Group]) = 1
    )


    The problem was that the complete procedure using IS_MEMBER was taking several minutes to complete, against a table where the maximum result set cardinality was in the range of 18,000-20,000 rows. We noticed the following wait types while the procedure was executing:

    PREEMPTIVE_OS_AUTHORIZATIONOPS

    PREEMPTIVE_OS_LOOKUPACCOUNTSID


    I did some research on these waits and found that, since both are related to communication with and validation against Active Directory, the waiting happens outside of SQL Server, and there are no SQL Server configuration changes that can help reduce or eliminate these waits.


    Next, we studied the code, broke it down, and tested the performance of the various sections that used the IS_MEMBER function. We found that the section chiefly responsible for the execution time was the WHERE clause that consumed the result set of the CTE above. This is what the WHERE clause looked like:


    (SELECT COUNT(*)
     FROM SalesOrgCTE
     WHERE SORGNBR BETWEEN MinSalesOrg AND MaxSalesOrg) > 0


    Notice that in this code, we've asked SQL Server to check the value of SORGNBR for each row and, if it's between MinSalesOrg and MaxSalesOrg, add it to the row count. We observed that, due to this design, a trip to AD was needed to validate each row, which added up to quite a long time for an 18,000-20,000 row result set and was responsible for the slow performance of the procedure.


    We did some more research with different approaches for the WHERE clause, and the combined efforts of Igor, his team, and myself resulted in the following version, whose performance was acceptable:

    WHERE
        SORGNBR IN (SELECT MinSalesOrg FROM SalesOrgCTE)
        AND SORGNBR IN (SELECT MaxSalesOrg FROM SalesOrgCTE)


    If you look carefully, you’ll notice that in this code snippet, we’ll need to communicate with AD only twice, thereby improving the performance of the procedure as a whole.
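The effect of moving the membership check out of the per-row predicate can be illustrated with a toy model (plain Python, with hypothetical group names and ranges; the counter stands in for round trips to Active Directory, and this is a simplification, not how SQL Server actually executes the query):

```python
calls = {"AD": 0}

def is_member(group):
    """Stand-in for IS_MEMBER(): every call counts as one (expensive)
    round trip to Active Directory."""
    calls["AD"] += 1
    return group in {"SALES-EMEA", "SALES-APAC"}   # hypothetical memberships

# Hypothetical RAuthorized rows: (Group, MinSalesOrg, MaxSalesOrg)
r_authorized = [("SALES-EMEA", 100, 199),
                ("SALES-APAC", 300, 399),
                ("SALES-AMER", 500, 599)]
rows = list(range(50, 650))                        # SORGNBR values

# Shape of the slow query: the membership check runs for every row
calls["AD"] = 0
slow = [s for s in rows
        if any(is_member(g) and lo <= s <= hi for g, lo, hi in r_authorized)]
slow_trips = calls["AD"]

# Shape of the fast query: resolve the authorized ranges once, then filter
calls["AD"] = 0
ranges = [(lo, hi) for g, lo, hi in r_authorized if is_member(g)]
fast = [s for s in rows if any(lo <= s <= hi for lo, hi in ranges)]
fast_trips = calls["AD"]

assert slow == fast      # identical result set
assert fast_trips == 3   # one membership check per group
assert slow_trips > fast_trips
```

Both shapes return the same rows, but the second performs one membership check per group instead of up to one per result-set row, which is exactly the behavior we saw with the rewritten WHERE clause.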


    Summing up: The importance of writing good code cannot be over-emphasized. It’s good coding practices like this that lead to performance gains most of the time.


    Hope this helps.


    An interesting issue with Peer to Peer Replication


    I recently ran into an interesting issue when setting up Peer-to-Peer replication across 3 instances.

    The primary instance was SM-UTSQL, where we configured a Peer-to-Peer publication named "PUBLISH1" on the database "DDS_TRANS". Next, we proceeded to configure the Peer-to-Peer topology (right-click on the publication and click on "Configure Peer-To-Peer Topology"), and added the other 2 instances to the topology. After this, we clicked on the primary node and selected "Connect to all displayed nodes":

     

    [Screenshot: Configure Peer-to-Peer Topology wizard]

    We then went ahead through the UI and configured the replication. However, when we checked in Object Explorer, we saw that on the primary instance (SM-UTSQL), under Replication -> Publish1, both the peer nodes SO-UTSQL and ST-UTSQL appeared as subscribers, but on SO-UTSQL and ST-UTSQL, we could see only SM-UTSQL as the subscriber for the publication; i.e., SO-UTSQL and ST-UTSQL did not recognize each other as subscribers.

    We tried to add the missing subscriber through the new subscriptions wizard, but got the following error:

    TITLE: New Subscription Wizard

    ——————————

    SQL Server could not create a subscription for Subscriber 'ST-UTSQL'.

    ——————————

    ADDITIONAL INFORMATION:

    An exception occurred while executing a Transact-SQL statement or batch. (Microsoft.SqlServer.ConnectionInfo)

    ——————————

    Peer-to-peer publications only support a '@sync_type' parameter value of 'replication support only', 'initialize with backup' or 'initialize from lsn'.

    The subscription could not be found.

    Changed database context to 'DDS_TRANS'. (Microsoft SQL Server, Error: 21679)

    For help, click: http://go.microsoft.com/fwlink?ProdName=Microsoft+SQL+Server&ProdVer=10.50.4000&EvtSrc=MSSQLServer&EvtID=21679&LinkId=20476

    ——————————

    BUTTONS:

    OK

    ——————————

     

    The resolution? Here are the steps:

    1. Navigate to the replication tab in object explorer on the Primary instance (where you can see both subscriptions under the publication, SM-UTSQL in our case).
    2. Right-click on the publication, select "Generate Scripts", and choose the "To create or enable the components" radio button.
    3. In the resulting script, navigate to the bottom. Here, you will see 2 sets of "sp_addsubscription" and "sp_addpushsubscription_agent" calls:
    4. The two sets will look like this:

      -- Adding the transactional subscriptions

      use [DDS_TRANS]

      exec sp_addsubscription @publication = N'PUBLISH1', @subscriber = N'SO-UTSQL', @destination_db = N'DDS_TRANS', @subscription_type = N'Push', @sync_type = N'replication support only', @article = N'all', @update_mode = N'read only', @subscriber_type = 0

      exec sp_addpushsubscription_agent @publication = N'PUBLISH1', @subscriber = N'SO-UTSQL', @subscriber_db = N'DDS_TRANS', @job_login = N'dds\dtsql.admin', @job_password = null, @subscriber_security_mode = 1, @frequency_type = 64, @frequency_interval = 1, @frequency_relative_interval = 1, @frequency_recurrence_factor = 0, @frequency_subday = 4, @frequency_subday_interval = 5, @active_start_time_of_day = 0, @active_end_time_of_day = 235959, @active_start_date = 0, @active_end_date = 0, @dts_package_location = N'Distributor'

      GO

      use [DDS_TRANS]

      exec sp_addsubscription @publication = N'PUBLISH1', @subscriber = N'ST-UTSQL', @destination_db = N'DDS_TRANS', @subscription_type = N'Push', @sync_type = N'replication support only', @article = N'all', @update_mode = N'read only', @subscriber_type = 0

      exec sp_addpushsubscription_agent @publication = N'PUBLISH1', @subscriber = N'ST-UTSQL', @subscriber_db = N'DDS_TRANS', @job_login = N'dds\dtsql.admin', @job_password = null, @subscriber_security_mode = 1, @frequency_type = 64, @frequency_interval = 1, @frequency_relative_interval = 1, @frequency_recurrence_factor = 0, @frequency_subday = 4, @frequency_subday_interval = 5, @active_start_time_of_day = 0, @active_end_time_of_day = 235959, @active_start_date = 0, @active_end_date = 0, @dts_package_location = N'Distributor'

      GO

       

    5. Copy these commands over, provide the value for the @job_password parameter (the password for the login used to configure replication, reflected in the @job_login parameter), and run the appropriate set on the 2 subscribers. For example, we ran the first set of commands (@subscriber = N'SO-UTSQL') on the ST-UTSQL instance, and the second set (@subscriber = N'ST-UTSQL') on the SO-UTSQL instance.

    And voila, the subscriptions were created and syncing.

     

    Hope this helped you. Comments and feedback are welcome.


    An In-depth look at memory – SQL Server 2012/2014


    I finally had some time on my hands, so I thought I'd get around to blogging about the memory architecture of SQL Server 2012/2014. The memory architecture of the SQL Server relational database engine was practically overhauled in SQL Server 2012, and most of it has remained the same in SQL 2014. Please note that the In-Memory OLTP feature introduced in SQL 2014 has a dedicated engine of its own, which will not be covered as part of this blog post.

     

    Memory Manager

    We have a new memory manager in SQL 2012. The new memory manager is responsible for almost all the memory management related activities, especially memory allocation. It includes 2 allocators, the Page Allocator, and the Virtual Address Space (VAS) allocator.

    Memory consumers request memory from the memory manager through these allocators. The new memory manager supports allocations of all sizes; we no longer have the concept of separate single-page and multi-page allocators. For the sake of simplicity, I will refer to the allocator as the "Any Size Page allocator" in this post. Consumers like memory clerks and the Buffer Pool are clients of the memory manager. This is a major change from previous versions, where the Buffer Pool was a consumer as well as a provider (allocator) of single-page memory.

    Another major change you will notice is that, with the introduction of the Any Size Page allocator, the SQL Server "max server memory" setting now controls the memory allocated for the Buffer Pool, CLR, and allocations larger than 8 KB (which came from the separate multi-page allocator in earlier versions). In prior versions, max server memory controlled only the size of the buffer pool, which made it difficult for DBAs/architects to apportion memory between applications/instances on shared servers. SQL Server 2012 gives us the ability to establish much tighter control over memory allocation using tools such as Resource Governor, because Resource Governor is now able to control all page allocations.

    The buffer pool in SQL 2012 does not contain any memory management functionality. It just manages the caching of database pages, and is now treated like a regular, external cache. Considering that the Buffer Pool is still likely to be the largest consumer of memory though, it is allowed to use almost all free memory from the SQL memory manager.

    There are several visual representations of the changes to memory manager in SQL 2012 floating around on the internet, which I’ve included for your benefit:

     

     

    [Diagram: SQL Server 2012 memory manager architecture]

     

    As you might have noticed, memory allocation for CLR is a bit of a special case: even though it uses the VAS Allocator (the Virtual Allocator in the diagram), it is still governed by the limits set using the "max server memory" setting in SQL. A side effect of this is that CLR is now initialized at SQL Server DB Engine startup, as opposed to being initialized on first use, as in earlier versions of SQL Server.

     

    The memory manager architecture in SQL Server 2012 can be broken down into the following components:

    1. Fragments/Fragment Manager: A fragment is a large region of VAS. The fragment manager deals with fragments, which the memory manager commits dynamically.
    2. Top-level block allocator: The top-level block allocator takes these large fragments and splits them up into top-level blocks. The top-level block size is fixed at 16 MB on 64-bit installations and 4 MB on 32-bit installations.
    3. Workspace: A workspace is a set of top-level blocks containing allocations of similar lifetime. Examples of workspaces include the buffer pool, memory objects, etc.
    4. Fixed Size Allocator: The most crucial component of the new memory manager hierarchy is the Fixed Size Allocator. This allocator is part of the workspace; it gets memory from the top-level block allocator and breaks it up into smaller fixed-size blocks. These are called parent blocks, and each parent block has its own descriptor, which holds a free list as well as the current state of the block. A parent block can be in one of four states:

     

    • Active – The block from which allocations happen. New allocation requests are serviced from the Active block of the local CPU or NUMA node.
    • Full – The block is full of allocations; there is no more free space in it. This was previously an Active block.
    • Partial – The block is neither Full nor Active. There are multiple lists of Partial blocks, maintained per NUMA node. A block is placed on a specific list according to the fill factor of the allocations inside it. This lets the engine free sparsely-used blocks sooner, and prefer allocating from nearly-full blocks to fill them up; the overall algorithm thus pushes against having a large list of Partial blocks to manage.
    • Empty – The state of a block when it is first allocated. It does not contain any allocations yet.
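This parent-block state machine can be sketched as a toy model (illustrative Python only; real block sizes, per-NUMA partitioning, and fill-factor buckets are simplified away, and the names are mine, not SQL Server internals):

```python
class Block:
    """A parent block: a fixed number of equally-sized allocation slots."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

class FixedSizeAllocator:
    """Tracks one Active block plus Partial/Full/Empty lists for one node."""
    def __init__(self, block_capacity, num_blocks):
        self.empty = [Block(block_capacity) for _ in range(num_blocks)]
        self.partial = []
        self.full = []
        self.active = None

    def alloc(self):
        if self.active is None or self.active.used == self.active.capacity:
            if self.active is not None:
                self.full.append(self.active)       # Active -> Full
                self.active = None
            if self.partial:
                # Prefer the fullest Partial block so it fills up sooner
                self.partial.sort(key=lambda b: b.used, reverse=True)
                self.active = self.partial.pop(0)   # Partial -> Active
            elif self.empty:
                self.active = self.empty.pop()      # Empty -> Active
            else:
                raise MemoryError("node target reached: OOM")
        self.active.used += 1
        return self.active

    def free(self, block):
        """Return one slot; a drained block goes back to the Empty list."""
        block.used -= 1
        if block in self.full:
            self.full.remove(block)
            self.partial.append(block)              # Full -> Partial
        if block.used == 0 and block in self.partial:
            self.partial.remove(block)
            self.empty.append(block)                # Partial -> Empty

alloc = FixedSizeAllocator(block_capacity=2, num_blocks=2)
blocks = [alloc.alloc() for _ in range(3)]  # fills block 1, starts block 2
print(len(alloc.full), alloc.active.used)   # → 1 1
```

The key design point the model captures is that allocation is always served from a single Active block, and state transitions (Empty, Partial, Full) happen only when that block is exhausted or an allocation is freed.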

     

     

    [Diagram: parent block states in the Fixed Size Allocator]

     

    Memory Allocation

    Let's talk about how memory is allocated using the new memory manager. The first step is, of course, to determine which allocator to use, based on the amount of memory requested. This involves traversing the workspace hierarchy to find the appropriate allocator. If we're unable to find an allocator of the exact size, the next larger allocator is used.

    Once we have the allocator, the fastest way to allocate is to look at the local partition, find the Active block, and pop an entry from its free list. If the Active block does not have enough free space to allocate from, we need to replace it with a block that does.

     

    Next, we start traversing the partial buckets, looking for a partial block that is almost full. If a partial block is found, we first re-evaluate its fill factor, since many free operations could have happened after it was inserted into its fill factor bucket. If the fill factor has changed, the block is moved to the appropriate bucket. This process is repeated until we find the partial block with the most allocations in it. Once found, the partial block is installed as the Active block, and the allocation is performed from it.

     

    If, during the above process, another thread installs a different Active block, that Active block should be used for the allocation instead; if the allocation from it succeeds, the block we found is returned to the partial bucket it came from.

     

    If no partial block is found, we go up to the parent block allocator and request a new (Empty) block to be created. This corresponds to growing toward the configured target memory (for the conventional and locked-pages memory models).

     

    If, even after the above process, we are unable to find a suitable block, we switch to the next NUMA node and repeat the same process. Switching to the next node happens once the target for the current node is reached. If all nodes have reached their target memory, an OOM (Out Of Memory) error is returned.

     

    To free an allocation, we first find the parent block descriptor for the block being freed. Next, we return the allocation to the free list and update the state in the descriptor; the block can move to one of two states, Partial or Empty.

     

    When we configure SQL Server to use large pages, all the top-level blocks are committed at startup. For the conventional and locked-pages memory models, blocks are committed as needed.

     

    The rest of the components work in pretty much the same way as in earlier versions of SQL Server. For more details on the memory architecture up to SQL Server 2008 R2, please refer to the following posts:

    An in-depth look at SQL Server Memory–Part 1

    An in-depth look at SQL Server Memory–Part 2

    An in-depth look at SQL Server Memory–Part 3

     

    As always, comments and feedback are appreciated.

    Disclaimer: The information in this weblog is provided “AS IS” with no warranties, and confers no rights. This weblog does not represent the thoughts, intentions, plans or strategies of my employer. It is solely my opinion. Feel free to challenge me, disagree with me, or tell me I’m completely nuts in the comments section of each blog entry, but I reserve the right to delete any comment for any reason whatsoever (abusive, profane, rude, or anonymous comments) – so keep it polite, please.

