Channel: AlwaysOn Archives - SQL Authority with Pinal Dave

SQL SERVER – Mirroring Error 1456 – The ALTER DATABASE Command Could not be Sent to the Remote Server Instance


As a part of my consulting, I still get a few clients who prefer to use database mirroring over Always On Availability Groups. In this blog, we would cover one possible cause of database mirroring Error 1456 – The ALTER DATABASE command could not be sent to the remote server instance.

The client contacted me during a disaster, and they were facing this issue as part of that situation. They were trying to reconfigure the database mirroring witness server and were having a hard time doing so. Their old witness server had crashed, and they had built a new server which they were trying to add to the database mirroring configuration.

Here is the error we got when trying to add the witness server to the currently mirrored database.

The ALTER DATABASE command could not be sent to the remote server instance ‘TCP://srv_w.sqlauthority.com:5022’. The database mirroring configuration was not changed. Verify that the server is connected, and try again. (Microsoft SQL Server, Error: 1456)

As usual, I asked them to check the ERRORLOG, and we found the below on the principal server.

2018-03-15 07:16:12.040 spid49s      Database mirroring is inactive for database ‘test’. This is an informational message only. No user action is required.
2018-03-15 07:16:12.110 Logon        Database Mirroring login attempt by user ‘SQLAUTHORITY\srv_w$’ failed with error: ‘Connection handshake failed. The login ‘SQLAUTHORITY\srv_w$’ does not have CONNECT permission on the endpoint. State 84.’.  [CLIENT: 10.17.144.60]
2018-03-15 07:16:12.110 spid109s     Error: 1474, Severity: 16, State: 1.
2018-03-15 07:16:12.110 spid109s     Database mirroring connection error 5 ‘Connection handshake failed. The login ‘SQLAUTHORITY\srv_w$’ does not have CONNECT permission on the endpoint. State 84.’ for ‘TCP://srv_w.sqlauthority.com:5022’.
2018-03-15 07:16:14.600 Logon        Database Mirroring login attempt by user ‘SQLAUTHORITY\srv_w$’ failed with error: ‘Connection handshake failed. The login ‘SQLAUTHORITY\srv_w$’ does not have CONNECT permission on the endpoint. State 84.’.  [CLIENT: 10.17.144.60]

I have already blogged about similar errors earlier. You may want to refer to them if the solution in this blog does not help.

SOLUTION/WORKAROUND

This issue looked like the CONNECT permission was not granted on the endpoint, but it was not that easy. I used the below query to check if the CONNECT permission had been granted to this account on the endpoint.

SELECT EP.name,
   SP.state,
   CONVERT(NVARCHAR(38), SUSER_NAME(SP.grantor_principal_id)) AS grantor,
   SP.type AS permission,
   CONVERT(NVARCHAR(46), SUSER_NAME(SP.grantee_principal_id)) AS grantee
FROM sys.server_permissions SP
   INNER JOIN sys.endpoints EP ON SP.major_id = EP.endpoint_id
ORDER BY permission, grantor, grantee;
GO

If the CONNECT permission is not granted, we need to grant it. In our case, everything looked good, but we still encountered failures. Then a thought struck me: this is a new server, and the account we are talking about is a machine account, i.e., SQLAUTHORITY\srv_w$. Because the server was rebuilt, the SID of this account is different from that of the account which was present on the old server.
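If the permission turns out to be missing, a grant along the following lines would add it. This is a minimal sketch only; the endpoint name Mirroring is an assumption, so substitute the endpoint name returned by the query above.

USE [master]
GO
-- Endpoint name [Mirroring] is an assumption; use the name from the query above
GRANT CONNECT ON ENDPOINT::[Mirroring] TO [SQLAUTHORITY\srv_w$]
GO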

As per the documentation, “For two server instances to connect to each other’s database mirroring endpoint, the login account of each instance requires access to the other instance. Also, each login account requires CONNECT permission to the database mirroring endpoint of the other instance.”

So, the login for this account on the other servers still holds the old SID, because they had not re-added the account there. I ran the below command to make a note of the SID.

SELECT * FROM sys.syslogins WHERE name = 'sqlauthority\srv_w$'

We deleted the old login and re-added it, which creates the login with the new SID. After that, we were able to add the witness successfully.
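For reference, the fix boiled down to commands along these lines on each server that still held the stale login; this is a sketch of the idea, and after re-creating the login, the CONNECT grant shown earlier has to be applied again because dropping the login removes its endpoint permission.

USE [master]
GO
-- Drop the login that still carries the old machine-account SID
DROP LOGIN [SQLAUTHORITY\srv_w$]
GO
-- Re-create it so that it picks up the new SID from the domain
CREATE LOGIN [SQLAUTHORITY\srv_w$] FROM WINDOWS
GO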

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – Mirroring Error 1456 – The ALTER DATABASE Command Could not be Sent to the Remote Server Instance


SQL SERVER – AlwaysOn Automatic Seeding Failure – Failure_code 15 and Failure_message: VDI Client Failed


Recently, I wrote a blog explaining a situation where taking a log backup can break automatic seeding so that synchronization does not work. Below is the blog reference. SQL SERVER – AlwaysOn Automatic Seeding – Database Stuck in Restoring State

While fixing the above, I came across an interesting situation where I wanted to add multiple databases to the availability group. Two of them were able to seed properly, but one database was not getting restored on the secondary replica. In this blog, we would learn how to fix the "VDI Client failed" error raised during automatic seeding.

After looking into various articles on the internet, I learned that the dynamic management view (DMV) sys.dm_hadr_physical_seeding_stats can be used to track the progress of seeding. I queried the DMV to see the cause of failure. Here is the query which I used to find the status.

SELECT local_database_name,
   role_desc,
   internal_state_desc,
   failure_code,
   failure_message
FROM sys.dm_hadr_physical_seeding_stats;

Here is the output

[Screenshot: seed-fail-vdi-01 – output of sys.dm_hadr_physical_seeding_stats showing failure_code 15]

I had no clue about failure_code 15 and the failure_message "VDI Client failed" shown in the above output. To move further, I checked the ERRORLOG on the secondary replica and found the below interesting messages.

2018-03-17 02:01:48.56 spid69s Error: 911, Severity: 16, State: 1.
2018-03-17 02:01:48.56 spid69s Database ‘SQLAuthorityDB’ does not exist. Make sure that the name is entered correctly.
2018-03-17 02:01:48.70 spid69s Error: 3633, Severity: 16, State: 1.
2018-03-17 02:01:48.70 spid69s The operating system returned the error ‘5(Access is denied.)’ while attempting ‘RestoreContainer::ValidateTargetForCreation’ on ‘C:\Database\SQLAuthorityDB.mdf’ at ‘container.cpp'(2759).
2018-03-17 02:01:48.70 spid69s Error: 3634, Severity: 16, State: 1.
2018-03-17 02:01:48.70 spid69s The operating system returned the error ‘5(Access is denied.)’ while attempting ‘RestoreContainer::ValidateTargetForCreation’ on ‘C:\Database\SQLAuthorityDB.mdf’.
2018-03-17 02:01:48.70 spid69s Error: 3156, Severity: 16, State: 5.
2018-03-17 02:01:48.70 spid69s File ‘SQLAuthorityDB’ cannot be restored to ‘C:\Database\SQLAuthorityDB.mdf’. Use WITH MOVE to identify a valid location for the file.
2018-03-17 02:01:48.70 spid69s Error: 3119, Severity: 16, State: 1.
2018-03-17 02:01:48.70 spid69s Problems were identified while planning for the RESTORE statement. Previous messages provide details.
2018-03-17 02:01:48.70 spid69s Error: 3013, Severity: 16, State: 1.
2018-03-17 02:01:48.70 spid69s RESTORE DATABASE is terminating abnormally.
2018-03-17 02:01:48.70 spid71s Automatic seeding of availability database ‘SQLAuthorityDB’ in availability group ‘AG’ failed with a transient error. The operation will be retried.

Now, this was interesting and clearly told us the problem. Automatic seeding was trying to restore the database and create its files in the C:\Database folder, and it was failing with the error "The operating system returned the error '5(Access is denied.)'". When I checked the other two databases which I was adding along with this one, I found that they were in a different location and the permissions were fine there.

WORKAROUND/SOLUTION

There were two options to fix the issue so that automatic seeding can work.

  1. Provide permission on the destination folder on the secondary replica to the service account. Once that is done, we can restart the secondary so that seeding kicks in again. We can also remove this database from the availability group and add it again (see the sketch after this list).
  2. Move the files on primary to a location where seeding is working.
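For the first option, once the folder permissions are fixed, the remove-and-re-add route looks roughly like the below. This is a sketch under the assumption that the availability group is named AG (as seen in the ERRORLOG above) and is configured with SEEDING_MODE = AUTOMATIC.

-- On the primary replica: take the database out of the AG and add it back so seeding restarts
ALTER AVAILABILITY GROUP [AG] REMOVE DATABASE [SQLAuthorityDB];
ALTER AVAILABILITY GROUP [AG] ADD DATABASE [SQLAuthorityDB];

-- On the secondary replica: if a partially restored copy was left behind, drop it first,
-- then make sure the AG is allowed to create databases through automatic seeding
-- DROP DATABASE [SQLAuthorityDB];
ALTER AVAILABILITY GROUP [AG] GRANT CREATE ANY DATABASE;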

If we don't want automatic seeding, then we can also perform a manual backup and restore of this database and then add it to the availability group by using the "Join Only" option in the wizard.

Have you come across any other seeding failure code? If yes, please share your experience with other readers via the comments. If I can reproduce those codes, I will write a blog about them.

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – AlwaysOn Automatic Seeding Failure – Failure_code 15 and Failure_message: VDI Client Failed

SQL SERVER – Error: 1067 – Unable to Bring Analysis Service Online in Cluster


Even after working with many clients on various issues (Always On, deployment, performance tuning), I always get new challenges as a part of my freelancing job. One thing I have learned is that one needs a systematic approach to diagnose the issue and, of course, passion to solve the problem. In this blog, we would learn how to fix error 1067, which comes up while trying to bring Analysis Services online in a cluster.

Recently, one of my clients wanted me to fix an issue where they were not able to start SQL Server Analysis Services which was installed in a cluster. The major challenge here was that all the data available in the logs was very generic, and what was surprising was that even the cluster log had generic error messages. Refer below:

If you are new to clusters and want to know how to generate a cluster log, then refer to my earlier blog.

ERR [RES] Generic Service : Service failed during initialization. Error: 1067.
ERR [RHS] Online for resource Analysis Services (SSAS) failed.
WARN [RCM] HandleMonitorReply: ONLINERESOURCE for ‘Analysis Services (SSAS)’, gen(29) result 5018/0.
ERR [RCM] rcm::RcmResource::HandleFailure: (Analysis Services (SSAS))
WARN [RCM] Not failing over group Data Warehousing Instances, failoverCount 4, failoverThresholdSetting 4294967295, computedFailoverThreshold 1, lastFailover

From the cluster log (ERR stands for error) we can see that we pretty much have a very generic error – "Error: 1067" – which means "The process terminated unexpectedly".

[Screenshot: ssas-clu-err-01 – cluster log entries for the failed SSAS resource]

As I mentioned earlier, let's start with the basic approach. We know this is a cluster and the SSAS instance is a clustered instance. There are two ways we can confirm that the service starts successfully.

  • Successfully starts as a clustered resource in the failover cluster manager
  • Successfully starts when we try to start it locally using Service Control Manager.

We know that we are failing with the first option. So, we went on to the second option.

  • We opened services.msc, right-clicked on the SSAS service, and clicked Start. We got the same error 1067.
  • Used the NET START <servicename> method, and that too failed with the same error.
  • Then we tried to start the service directly using the SSAS executable (msmdsrv.exe).
  • We went to the properties of the SSAS service in Service Control Manager to get the path of the EXE and the startup parameters being used. We saw the below values:

“D:\Program Files\Microsoft SQL Server\MSAS11.SSAS\OLAP\bin\msmdsrv.exe” -s “F:\OLAP\Config”

SOLUTION/WORKAROUND

As soon as my client saw the above data, he exclaimed, "I think I know what the problem is!" So, I asked him, what is it? And he replied, "We don't have an F:\ drive in this cluster!!"

He then explained the cluster storage re-shuffling they had done recently, which may have introduced this issue. Now how do we fix this?

  1. Go back and rename the cluster drive letters to the original ones (this was unlikely to happen).
  2. Find another way to edit those values, and we are talking about the registry here.

Every service installed in Windows gets its registration under:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\<ServiceName>\

So, in this case, we headed to:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSAS$SSAS\ImagePath

Here we could see the same value that we saw in the service properties. We changed it to the below:

“D:\Program Files\Microsoft SQL Server\MSAS11.SSAS\OLAP\bin\msmdsrv.exe” -s “M:\OLAP\Config”

From then on, we could start the SSAS service locally from Service Control Manager, and we were also able to bring it online from Failover Cluster Manager.

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – Error: 1067 – Unable to Bring Analysis Service Online in Cluster

SQL SERVER – Steps to Deploy Distributed Availability Group – Windows Clusters in Different Domains


In the current era, where company mergers and takeovers are a common phenomenon, it is possible to run into a situation where there are two domains which are connected but not trusted. I was engaged in consulting for a client who had gone through a similar merger in the past. The reporting application was in a different domain, and they wanted data to be synchronized into a SQL Server database in the parent domain. Essentially, they had two Windows clusters and they wanted to have one AG across them. The only solution which came to my mind was a Distributed Availability Group in SQL Server. To make things a little more complicated, they did not want to use certificates to configure this setup; as per them, certificates would add maintenance overhead. Here is the oversimplified diagram of their deployment.

[Diagram: dag-wg-01 – two Windows clusters in different domains joined by a distributed availability group]

SOLUTION/WORKAROUND

After contemplating and testing for a while, we decided to make use of NTLM pass-through authentication between the domains, and we were successful in doing so. Below are the high-level steps we followed to achieve this.

  • Create a local Windows user on each of the nodes (N1, N2, N3, N4) across domains using the same name and password. Note: only the node name will change here; the username and password must be the same.

For example N1\DAGUser, N2\DAGUser, N3\DAGUser and so on.

  • Add this account to Local Administrators group on all the nodes across domains
  • Change SQL Server startup account to this newly created account on all the nodes across domains (via SQL Server Configuration Manager).
  • Create the primary Availability Group (HA_AG) with a corresponding listener name (HA_LIST)
  • Grant the SQL Server service account CONNECT permissions to the endpoint. Run on ALL nodes — GRANT CONNECT ON ENDPOINT::Hadr_endpoint TO [<nodename>\UserName]
  • Each node should have “connect” permission from all other three nodes.
  • Create the secondary Availability Group (DR_AG) with a corresponding listener name (DR_LIST)
  • Join the secondary replicas to the secondary Availability Group
  • Add entries to the HOST File with “IPAddr and Listener FQDN” info of the other domain. i.e. Domain1 nodes will contain [1.1.xxx DR_LIST.MyDomain2.com] Domain2 nodes will contain [192.168.1.xxx HA_LIST.MyDomain.com]
  • Create Distributed Availability Group (HA_DR_AG) on the primary Availability Group (HA_AG)
  • Join the secondary Availability Group (DR_AG) to the Distributed Availability Group (see the T-SQL sketch after this list)
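To put the last two bullets into T-SQL, the statements look roughly like the below. This is a sketch only: the listener URLs reuse the HOSTS-file names mentioned above, while port 5022 and the availability, failover, and seeding modes are assumptions to be adjusted for your environment.

-- On the primary replica of HA_AG (Domain1 side)
CREATE AVAILABILITY GROUP [HA_DR_AG]
   WITH (DISTRIBUTED)
   AVAILABILITY GROUP ON
      'HA_AG' WITH
      (
         LISTENER_URL = 'tcp://HA_LIST.MyDomain.com:5022',
         AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
         FAILOVER_MODE = MANUAL,
         SEEDING_MODE = AUTOMATIC
      ),
      'DR_AG' WITH
      (
         LISTENER_URL = 'tcp://DR_LIST.MyDomain2.com:5022',
         AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
         FAILOVER_MODE = MANUAL,
         SEEDING_MODE = AUTOMATIC
      );
GO

-- On the primary replica of DR_AG (Domain2 side)
ALTER AVAILABILITY GROUP [HA_DR_AG]
   JOIN AVAILABILITY GROUP ON
      'HA_AG' WITH
      (
         LISTENER_URL = 'tcp://HA_LIST.MyDomain.com:5022',
         AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
         FAILOVER_MODE = MANUAL,
         SEEDING_MODE = AUTOMATIC
      ),
      'DR_AG' WITH
      (
         LISTENER_URL = 'tcp://DR_LIST.MyDomain2.com:5022',
         AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
         FAILOVER_MODE = MANUAL,
         SEEDING_MODE = AUTOMATIC
      );
GO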

There were some hiccups in the deployment but overall, the above steps worked, and we were able to fulfill their requirement. If you have such consultancy requirements for Always On, feel free to contact me.

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – Steps to Deploy Distributed Availability Group – Windows Clusters in Different Domains

SQL SERVER – Parallel Redo on AlwaysOn Secondary – DIRTY_PAGE_TABLE_LOCK


As most of you might know, my expert area in SQL Server is performance tuning. But I also deal with almost every issue related to the SQL Server engine. Many times, it so happens that clients call me for performance tuning and, while fixing that, we find some other problem. In this blog, we would learn about Parallel Redo on an AlwaysOn secondary causing new waits which were introduced in SQL Server 2016.

As mentioned earlier, one of my clients is using the AlwaysOn Availability Group feature and contacted me for assistance in performance tuning. They were seeing latency associated with the redo on the secondary. When I logged in to the secondary, I could clearly see high waits due to "PARALLEL_REDO_WORKER_WAIT_WORK". There were a few other waits at the top of the list. The other predominant wait types which we saw were:

  1. DIRTY_PAGE_TABLE_LOCK (this was the top wait)
  2. REDO_THREAD_PENDING_WORK
  3. PARALLEL_REDO_DRAIN_WORKER
  4. PARALLEL_REDO_FLOW_CONTROL

Based on my search, these all appear when there is parallel redo activity happening. This can also happen during recovery of the database (restart of the SQL service, or if we take the database offline/online). Based on the documentation, this is a new feature introduced in SQL Server 2016. The secondary was used for reading purposes, and they were doing load testing on SQL Server.

WORKAROUND/SOLUTION

When I asked them to show me the testing environment and the steps they take to simulate the issue, they said that not much infrastructure is required. They just rebuild an index online on the primary replica and run a SELECT on the secondary replica so that it scans the complete table. While doing that, they see the waits. After a lot of searching on the internet, I was able to find a way to disable parallel redo. You can do it in SQL Server by using trace flag 3459. SQL SERVER – What is Trace Flag – An Introduction

Using SQL Server Configuration Manager, you can enable the trace flag as a startup parameter and fall back to the older serial redo behavior, the way it used to work for ages.
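For a quick test before touching startup parameters, the trace flag can also be turned on globally from a query window on the secondary replica. This is a sketch; note that DBCC TRACEON lasts only until the next restart, so to make it permanent add -T3459 as a startup parameter in SQL Server Configuration Manager.

-- Enable trace flag 3459 globally to fall back to serial redo (lost after a restart)
DBCC TRACEON (3459, -1);
GO
-- Verify the trace flag status
DBCC TRACESTATUS (3459);
GO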

Have you ever faced issues with new features in SQL Server released by Microsoft? I am sure there are many! Please comment and let others know.

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – Parallel Redo on AlwaysOn Secondary – DIRTY_PAGE_TABLE_LOCK

SQL SERVER – Slow SQL Server 2016 Installation in Cluster: RunRemoteDiscoveryAction


As a part of my AlwaysOn related consultancy, one of my clients was having challenges installing SQL Server 2016 in a clustered environment. In this blog, we would learn about one cause of slow SQL Server 2016 installation in a cluster.

When I started SQL Server setup, it hung at this screen.

[Screenshot: hung-2k16-install-01 – setup stuck at the rule-check screen]

In the Detail.txt we saw the below info:

(01) 2018-03-30 09:25:34 Slp: Running Action: RunRemoteDiscoveryAction
(08) 2018-03-30 09:25:34 Slp: Discovered update on path C:\SQL2016\SQLServer2016SP1 \PCUSOURCE; Update: Microsoft SQL Server 2016 with SP1, Type: PCU, KB: 3182545, Baseline: 13.0.1601, Version: 13.1.4001
(01) 2018-03-30 09:25:34 Slp: Running discovery on remote machine: NODE2
(01) 2018-03-30 09:25:34 Slp: Running discovery on local machine
(08) 2018-03-30 09:25:34 Slp: Using service ID ‘3da21691-e39d-4da6-8a4b-b43877bcb1b7’ to search product updates.
(10) 2018-03-30 09:25:34 Slp: Searching updates on server: ‘3da21691-e39d-4da6-8a4b-b43877bcb1b7’

Based on above snip of the log, we can see that the action which setup was running is — Running Action: RunRemoteDiscoveryAction

What came to my mind is: what does it take for setup to connect to Node2? This is where it is trying to perform RemoteDiscoveryAction. Based on my previous experience fixing such things, I could think of:

  • Remote registry service in a stopped state
  • Remote Registry connectivity is disabled
  • Admin$ shares are disabled.

SOLUTION/WORKAROUND

In this case, we saw that the Admin$ shares were disabled. This can be easily tested by typing the below path at a CMD prompt or in any File Explorer window.

C:\>\\NODE1\c$

As soon as we hit the enter key, we got the message.

[Screenshot: hung-2k16-install-02 – access error when opening \\NODE1\c$]

Here are the steps to get Admin$ share back (Reference)

  • Open a registry editor, start > Run > Regedit.exe.
  • Navigate to: HKLM\System\CurrentControlSet\Services\LanmanServer\Parameters
  • In the right pane, locate and double-click AutoShareServer.
  • Change the value from 0 to 1.
  • Close the registry editor and restart the “Server” service for the change to take effect.

After allowing Admin$ share access, SQL setup did not have any further challenges and completed successfully. This action needs to be done on all the nodes participating in the cluster. A reboot of the node is also required, although on the latest operating systems (like Windows Server 2016) a reboot may not be needed.

If above steps solve your installation issue, please let me know via comments.

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – Slow SQL Server 2016 Installation in Cluster: RunRemoteDiscoveryAction

SQL SERVER – Always On Replica Disconnected After Changing SQL Server Service Account


In my lab environment, I was testing a script to change service account and it worked fine. I had Always On configured on this SQL Server and soon I realized that after changing the SQL Server service account, the secondary replica went into a disconnected state.

[Screenshot: ao-sec-disc-01 – secondary replica shown as disconnected in SSMS]

We could see the below errors in the Primary Replica SQL Errorlog

<DateTime> Logon  Database Mirroring login attempt by user ‘MyDC\SQLAccount’ failed with error: ‘Connection handshake failed. The login ‘MyDC\SQLAccount’ does not have CONNECT permission on the endpoint. State 84.’.  [CLIENT: 192.168.x.x]

If we change the service account back to the old one, everything goes back to normal and there are no issues. So, it seemed like there was some issue with the new account we were using, and apparently the AlwaysOn AG did not like that account. The error also says that the account does not have CONNECT permission on the endpoint. We ran the below query to check who has permissions on the AlwaysOn endpoint, Hadr_endpoint.

SELECT e.name AS mirror_endpoint_name
	,s.name AS login_name
	,p.permission_name
	,p.state_desc AS permission_state
	,e.state_desc endpoint_state
FROM sys.server_permissions p
INNER JOIN sys.endpoints e ON p.major_id = e.endpoint_id
INNER JOIN sys.server_principals s ON p.grantee_principal_id = s.principal_id
WHERE p.class_desc = 'ENDPOINT'
	AND e.type_desc = 'DATABASE_MIRRORING'

The output looked like below:

[Screenshot: ao-sec-disc-02 – endpoint permissions showing only the old service account]

We had changed the SQL service account to — ‘MyDC\SQLAccount’. The old one was ‘MyDC\SQLSvcAccount’

It looked like we were missing the CONNECT grant on the endpoint for the new account. We performed the following steps, which resolved the issue.

WORKAROUND/SOLUTION

  1. Created a login for the newly added service account on both replicas.
USE [master]
GO
CREATE LOGIN [MyDC\SQLAccount] FROM WINDOWS WITH DEFAULT_DATABASE=[master]
GO 
  2. Granted CONNECT permission on the endpoint on both replicas.
GRANT CONNECT ON ENDPOINT::hadr_endpoint TO [MyDC\SQLAccount]
GO
  3. Stopped the endpoint on both replicas.
ALTER ENDPOINT hadr_endpoint STATE=STOPPED
  4. Started the endpoint on both replicas.
ALTER ENDPOINT hadr_endpoint STATE=STARTED

After making the above changes, the replicas were back in the connected state. We tested failovers back and forth, and everything worked great. Have you faced a similar issue with AlwaysOn AG? Please share your experience via comments.

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – Always On Replica Disconnected After Changing SQL Server Service Account

SQL SERVER – Install Error: Microsoft Cluster Service (MSCS) Cluster Verification Errors – Part 3


In the recent past, I was assisting a client in installing a SQL Server clustered instance in a Windows cluster. There were many errors encountered, and I learned a lot from this experience. In this blog, we would learn about the Microsoft Cluster Service (MSCS) cluster verification errors which might appear during installation or AddNode.

We got the below error and as we can see, it was a setup “rule” failure rather than an installation error.

[Screenshot: clus-facet-p3-01 – setup rule failure for MSCS cluster verification]

The setup wizard failed with the rule – Microsoft Cluster Service (MSCS) cluster verification errors. Once we hit this error, we cannot proceed further unless we fix it or follow the KB to skip this rule.

From the Detail.txt we found the below error:

<DateTime>Slp: Initializing rule : Microsoft Cluster Service (MSCS) cluster verification errors
<DateTime>Slp: Rule applied features : ALL
<DateTime>Slp: Rule is will be executed : True
<DateTime>Slp: Init rule target object: Microsoft.SqlServer.Configuration.Cluster.Rules.ClusterServiceFacet
<DateTime>Slp: Validation Report not found on Node1.
<DateTime>Slp: Validation Report not found on Node2.DomainName.com.
<DateTime>Slp: Rule ‘Cluster_VerifyForErrors’ detection result: Is Cluster Online Results = True; Is Cluster Verfication complete = False; Verfication Has Warnings = False; Verification Has Errors = False; on Machine Node1
<DateTime>Slp: Evaluating rule : Cluster_VerifyForErrors
<DateTime>Slp: Rule running on machine: Node1
<DateTime>Slp: Rule evaluation done : Failed
<DateTime>Slp: Rule evaluation message: The cluster either has not been verified or there are errors or failures in the verification report. Refer to KB953748 or SQL Server Books Online for more information.

<DateTime>Slp: Send result to channel : RulesEngineNotificationChannel
<DateTime>Slp: Initializing rule : Microsoft Cluster Service (MSCS) cluster verification warnings
<DateTime>Slp: Rule applied features : ALL
<DateTime>Slp: Rule is will be executed : True
<DateTime>Slp: Init rule target object: Microsoft.SqlServer.Configuration.Cluster.Rules.ClusterServiceFacet
<DateTime>Slp: Validation Report not found on Node1.
<DateTime>Slp: Validation Report not found on Node2.DomainName.com.
<DateTime>Slp: Rule ‘Cluster_VerifyForWarnings’ detection result: Is Cluster Online Results = True; Is Cluster Verfication complete = False; Verfication Has Warnings = False; Verification Has Errors = False; on Machine Node1
<DateTime>Slp: Evaluating rule : Cluster_VerifyForWarnings
<DateTime>Slp: Rule running on machine: Node1
<DateTime>Slp: Rule evaluation done : Warning
<DateTime>Slp: Rule evaluation message: The MSCS cluster has been validated but there are warnings in the MSCS cluster validation report, or some tests were skipped while running the validatation. To continue, run validation from the Windows Cluster Administration tool to ensure that the MSCS cluster validation has been run and that the MSCS cluster validation report does not contain errors.

We had discussed a few issues in Part 1 and Part 2 of this blog series, and you can find them here.

Today we will discuss another scenario which produces the same error, but the solution is different.

Unable to find the Cluster Validation report:

If the respective file is not found in the given location, we would see it clearly mentioned in the Detail.txt file, as shown below:

<DateTime>Slp: Validation Report not found on Node1.DomainName.com
<DateTime>Slp: Validation Report not found on Node2.DomainName.com

If you use one of my favorite tools, Process Monitor (procmon), and capture a trace, you will notice entries like the below. The user might have deleted or moved the given file. The report is usually located under C:\Windows\Cluster\Reports.

<DateTime>
setup1xx.exe
7212 CreateFile
\\Node1.Domain.com\admin$\Cluster\Reports\Validation Data For Node Set 9431D731EF6810180D4000D550DDB48DD960F349.xml
NAME NOT FOUND

SOLUTION/WORKAROUND

Open the cluster validation report to check the results. If no errors are found, just rename the file to the name found in the procmon trace (Validation Data For Node Set 9431D731EF6810180D4000D550DDB48DD960F349.xml). Otherwise, re-run Windows cluster validation to generate a new report; this new report will be used by setup the next time it runs.

I have written three blogs in this series, as I found multiple reasons for such behavior. Please comment and let me know if you have found more causes and ways to fix it.

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – Install Error: Microsoft Cluster Service (MSCS) Cluster Verification Errors – Part 3


SQL SERVER – Database Mirroring Login Attempt Failed With Error: ‘Connection Handshake Failed. An OS Call Failed: (80090350)


Microsoft introduced the AlwaysOn Availability Group feature in SQL Server 2012, and people have been using it instead of database mirroring. In this blog, we would learn how to fix the error Database Mirroring login attempt failed with error: 'Connection handshake failed. An OS call failed: (80090350)'.

When they contacted me, the situation was that the primary replica was showing the secondary in a disconnected state. When I asked them to check the SQL Server ERRORLOG, we found the below on the secondary replica.

Database Mirroring login attempt failed with error: ‘Connection handshake failed. An OS call failed: (80090350) 0x80090350(The system cannot contact a domain controller to service the authentication request. Please try again later.). State 66.’. [SERVER: 10.xx.xx.xx]

The IP address at the end of the error message was the IP address of the primary replica. One of the interesting things about the error message is that even though they were not using database mirroring, the message still said "Database Mirroring login". I think this is because the same endpoint is used for availability groups and mirroring.
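As a side note, the disconnected state and the last connection error that SQL Server recorded can also be checked from a query window with something along these lines. This is a generic sketch, not specific to this client's setup.

SELECT ar.replica_server_name,
       ars.role_desc,
       ars.connected_state_desc,
       ars.last_connect_error_description,
       ars.last_connect_error_timestamp
FROM sys.dm_hadr_availability_replica_states AS ars
INNER JOIN sys.availability_replicas AS ar
    ON ars.replica_id = ar.replica_id;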

WORKAROUND/SOLUTION

I asked more about how they landed in this situation. They informed me that there had been some network issues. Their network team fixed those, but now they had an issue where the mirror server was not able to get in sync with the principal, and the LDF file was growing.

I have an existing blog on the same error message. SQL SERVER – Database Mirroring login attempt failed – Connection handshake failed

In our case, the hexadecimal code was different: 0x80090350. I used COM Error Codes (Security and Setup), and it maps to SEC_E_DOWNGRADE_DETECTED – The system cannot contact a domain controller to service the authentication request. Please try again later. That is the same message which we are seeing in the ERRORLOG, so we are in the right direction.

We were able to log in to SQL Server using Windows accounts on the mirror server. We were also able to ping the domain controller. I asked them to restart the machine so that it re-establishes the connection to the domain controller. Since it was a secondary replica, there was no problem in rebooting it. As expected, it worked for them, and after the restart the replicas were able to talk to each other and AlwaysOn started working.

Have you ever encountered such cryptic errors?

Reference: Pinal Dave (https://blog.SQLAuthority.com)

First appeared on SQL SERVER – Database Mirroring Login Attempt Failed With Error: ‘Connection Handshake Failed. An OS Call Failed: (80090350)

SQL SERVER – Scripts to Overview HADR / AlwaysOn Local Replica Server


Today, I received a very interesting script from SQL Server expert Dominic Wirth. He has written a very helpful script which displays a utilization overview of the local availability group replica server. The overview contains the number of databases as well as the total size of the databases (DATA, LOG, FILESTREAM) and is grouped by the following two categories:

  1. Replica role (PRIMARY / SECONDARY)
  2. Availability Group

Let us first see the script:

/*==================================================================
Script: HADR Local Replica Overview.sql
Description: This script will display a utilisation overview
of the local Availability Group Replica Server.
The overview will contain amount of databases as
well as total size of databases (DATA, LOG, FILESTREAM)
and is group by ...
1) ... Replica role (PRIMARY / SECONDARY)
2) ... Availability Group
Date created: 05.09.2018 (Dominic Wirth)
Last change: -
Script Version: 1.0
SQL Version: SQL Server 2014 or higher
====================================================================*/
-- Load size of databases which are part of an Availability Group
DECLARE @dbSizes TABLE (DatabaseId INT, DbTotalSizeMB INT, DbTotalSizeGB DECIMAL(10,2));
DECLARE @dbId INT, @stmt NVARCHAR(MAX);
SELECT @dbId = MIN(database_id) FROM sys.databases WHERE group_database_id IS NOT NULL;
WHILE @dbId IS NOT NULL
BEGIN
SELECT @stmt = 'USE [' + DB_NAME(@dbId) + ']; SELECT ' + CAST(@dbId AS NVARCHAR) + ', (SUM([size]) / 128.0), (SUM([size]) / 128.0 / 1024.0) FROM sys.database_files;';
INSERT INTO @dbSizes (DatabaseId, DbTotalSizeMB, DbTotalSizeGB) EXEC (@stmt);
SELECT @dbId = MIN(database_id) FROM sys.databases WHERE group_database_id IS NOT NULL AND database_id > @dbId;
END;
-- Note: no GO here, so that the @dbSizes table variable stays in scope for the queries below
-- Show utilisation overview grouped by replica role
SELECT AR.replica_server_name, DRS.is_primary_replica AS IsPrimaryReplica, COUNT(DB.database_id) AS [Databases]
,SUM(DBS.DbTotalSizeMB) AS SizeOfAllDatabasesMB, SUM(DBS.DbTotalSizeGB) AS SizeOfAllDatabasesGB
FROM sys.dm_hadr_database_replica_states AS DRS
INNER JOIN sys.availability_replicas AS AR ON DRS.replica_id = AR.replica_id
LEFT JOIN sys.databases AS DB ON DRS.group_database_id = DB.group_database_id
LEFT JOIN @dbSizes AS DBS ON DB.database_id = DBS.DatabaseId
WHERE DRS.is_local = 1
GROUP BY DRS.is_primary_replica, AR.replica_server_name
ORDER BY AR.replica_server_name ASC, DRS.is_primary_replica DESC;
-- Show utilisation overview grouped by Availability Group
SELECT AR.replica_server_name, DRS.is_primary_replica AS IsPrimaryReplica, AG.[name] AS AvailabilityGroup, COUNT(DB.database_id) AS [Databases]
,SUM(DBS.DbTotalSizeMB) AS SizeOfAllDatabasesMB, SUM(DBS.DbTotalSizeGB) AS SizeOfAllDatabasesGB
FROM sys.dm_hadr_database_replica_states AS DRS
INNER JOIN sys.availability_groups AS AG ON DRS.group_id = AG.group_id
INNER JOIN sys.availability_replicas AS AR ON DRS.replica_id = AR.replica_id
LEFT JOIN sys.databases AS DB ON DRS.group_database_id = DB.group_database_id
LEFT JOIN @dbSizes AS DBS ON DB.database_id = DBS.DatabaseId
WHERE DRS.is_local = 1
GROUP BY AG.[name], DRS.is_primary_replica, AR.replica_server_name
ORDER BY AG.[name] ASC, AR.replica_server_name ASC;
GO

Here is the screenshot of the result set which you will get if you run the above script.

[Screenshot: alwaysonreplica – result sets of the two overview queries]

There are many different kinds of reports which you can run via SQL Server Management Studio. However, sometimes simple scripts like this one are very helpful and return quick results. You can further modify the above script to get additional details for your server as well.

Here are a few additional scripts which also discuss various concepts related to AlwaysOn.

If you have any other interesting script, please let me know and I will be happy to publish it on the blog with due credit to you.

Reference: Pinal Dave (https://blog.SQLAuthority.com)

First appeared on SQL SERVER – Scripts to Overview HADR / AlwaysOn Local Replica Server

SQL SERVER – Microsoft Azure – Unable to Find Higher Tier Series Virtual Machine to Upgrade


With the current market trend of shifting to the cloud, many of my clients have moved their SQL Servers from on-premises virtual machines to Microsoft Azure virtual machines. In this blog, I would explain a situation where my client was not able to find the relevant size when they wanted to upsize a VM to a higher tier.

My client was using the AlwaysOn Availability Groups feature on Azure virtual machines for production. Due to increased workload, they decided to increase the number of CPUs and the RAM of a VM by upsizing it to a higher tier. They had been using a D-Series VM and wanted to move to a G-Series VM. When they tried upsizing the VM, it was not showing the G-Series sizes to upgrade to.

[Screenshot: vm-size-err-01 – Azure portal resize blade without the G-Series sizes]

When I tried with my own VM in the same region, I was able to see the size. This meant there was something specific to their VM. (As of now, the G-Series is not available in all regions.)

I suggested stopping the VM and trying again, but there was no luck. Later I realized that since they were using AlwaysOn, they had created an availability set for the two Azure VMs (primary and secondary). As soon as we stopped both the VMs, we were able to upsize the VM.

WORKAROUND/SOLUTION

To summarize, if you see that a VM size is not available during an upsize, the steps are as follows:

  1. Make sure the region where the VM was deployed supports that VM size.
  2. If the region has that VM size, then check if the VM is part of an availability set; otherwise contact Microsoft support to check whether your subscription allows it.
  3. If the VM is part of an availability set, then shut down all VMs in that availability set.

You should be able to see all the VM sizes, which were not visible earlier.

Here is one interesting error which my client received after upgrading and starting the other VM. (Note that the error is for sqlserver-1, and we upgraded sqlserver-0.)

Failed to start virtual machine ‘sqlserver-1’. Error: Unable to perform Operation ‘Start VM’ on the VM because the requested size Standard_DS13_v2 is not available in the cluster where the availability set is currently allocated. The available sizes are: Standard_G1,Standard_G2,Standard_G3,Standard_G4,Standard_G5,Standard_GS1,Standard_GS2,Standard_GS3,Standard_GS4,Standard_GS4-8,Standard_GS4-4,Standard_GS5,Standard_GS5-16,Standard_GS5-8,Standard_L4s,Standard_L8s,Standard_L16s,Standard_L32s. Read more on VM resizing strategy.

This means that when VMs are deployed in an availability set, all VMs should be of either the same size or the same family. VMs of a different family are not possible in an availability set. That is why we received the above error.

Let me know, via comments, if you have found something other than the above.

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – Microsoft Azure – Unable to Find Higher Tier Series Virtual Machine to Upgrade

SQL SERVER – How to Fix log_reuse_wait_desc – AVAILABILITY_REPLICA?


Earlier this week I had multiple engagements of Comprehensive Database Performance Health Check. While working with one of the customers, we realized that when we query sys.databases for one of the databases, we always see the column log_reuse_wait_desc containing the reason AVAILABILITY_REPLICA. The customer was really curious about what it means. Let us see what we found out after a quick research.


First of all, run the following script

SELECT name, log_reuse_wait_desc
FROM sys.databases

The above script returns the name of each database along with its log_reuse_wait_desc. The column displays the reason why transaction log space is currently waiting to be cleared. If you see it containing AVAILABILITY_REPLICA, that means an AlwaysOn Availability Groups secondary replica is applying transaction log records of this database to the corresponding secondary database. In other words, your secondary database is being synced at that moment.
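If you want to see how far behind the secondary actually is for that database, a query along these lines against the replica state DMV shows the send and redo queue sizes (a generic sketch):

SELECT DB_NAME(drs.database_id) AS database_name,
       ar.replica_server_name,
       drs.synchronization_state_desc,
       drs.log_send_queue_size,   -- KB of log not yet sent to the secondary
       drs.redo_queue_size        -- KB of log received but not yet redone
FROM sys.dm_hadr_database_replica_states AS drs
INNER JOIN sys.availability_replicas AS ar
    ON drs.replica_id = ar.replica_id;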

There can be multiple reasons why this particular wait type may show up. Let me quickly list all the reasons here:

Reason 1: Slow Network or Unreachable Network

In this case, you may want to check with your network admin about the reasons for the slow network. One of my customers once had an incorrectly configured network card that was slowing down the syncing.

Reason 2: Long Running Transactions

Sometimes it is quite possible that you will see both the primary and secondary databases in a synced state, but you will still see log_reuse_wait_desc displaying AVAILABILITY_REPLICA. In that case, it is quite possible that on the primary server there is a long-running transaction which is currently actively writing to the log file, and that log has not yet been shipped to the secondary.
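To check whether a long-running transaction on the primary is the culprit, the classic quick check is DBCC OPENTRAN in that database (a sketch; the database name below is hypothetical, use the database you found in sys.databases):

USE [YourDatabase];  -- hypothetical name, replace with the affected database
GO
-- Shows the oldest active transaction in the database, if any
DBCC OPENTRAN;
GO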

Reason 3: Resource Constraint

In a rare incident, it is quite possible that on the secondary we run out of worker threads. In that case, you can restart the sync, or if that does not work, you may have to reset the sync from the beginning.

Well, so far I have found that any one of the reasons listed above can be the cause of AVAILABILITY_REPLICA showing up as the log_reuse_wait_desc. Let me know if you find any other reasons and solutions for the same.

Reference: Pinal Dave (https://blog.SQLAuthority.com)

First appeared on SQL SERVER – How to Fix log_reuse_wait_desc – AVAILABILITY_REPLICA?

SQL SERVER – Always On AG – HADRAG: Did not Find the Instance to Connect in SqlInstToNodeMap Key


During my On Demand (50 Minutes) consultancy, I solve issues which seem quick to my clients. SQL not starting, AlwaysOn not failing over, and a cluster not working are a few of the quick things for which my clients engage me. In this blog, I would share a situation where an Always On Availability Group was not coming online due to the error – Did not find the instance to connect in SqlInstToNodeMap key.

THE SITUATION

There was some instability in the cluster, which caused a few unexpected failovers of the Always On availability group from node1 to node2 and sometimes back and forth. When they contacted me, we found that the clustered resource for the availability group was not coming online.

My first step, always, is to get the error that is being reported by SQL Server, the cluster, or Windows. The event log reported the below error:

Cluster resource ‘PRODAG’ of type ‘SQL Server Availability Group’ in clustered role ‘PRODAG’ failed.


Based on the failed policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

Above error is very generic and does not tell more than what we know already.

When I checked SQL Server Management Studio, we saw that the secondary replica was not connected to the primary replica. The connected state was "DISCONNECTED" in the DMV, and it showed a red symbol for this replica. The next step was to generate a cluster log.

SQL SERVER – Steps to Generate Windows Cluster Log?

And BINGO! We were able to see some relevant messages there.

INFO  [RES] SQL Server Availability Group <PRODAG>: [hadrag] The DeadLockTimeout property has a value of 300000
INFO  [RES] SQL Server Availability Group <PRODAG>: [hadrag] The PendingTimeout property has a value of 180000
ERR   [RES] SQL Server Availability Group <PRODAG>: [hadrag] Did not find the instance to connect in SqlInstToNodeMap key.
ERR   [RHS] Online for resource PRODAG failed.

"ERR" is the tag I look for in the cluster log, and it is what you should focus on as well. Just before the failure, we see this error: Did not find the instance to connect in SqlInstToNodeMap key. I searched and found that SqlInstToNodeMap is a registry key which should have the same information as sys.dm_hadr_instance_node_map.
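To compare what SQL Server itself believes about the instance-to-node mapping, the DMV can be queried directly (a simple sketch):

-- Shows the AG resource ID, instance name, and cluster node name that SQL Server has recorded
SELECT ag_resource_id, instance_name, node_name
FROM sys.dm_hadr_instance_node_map;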

When I checked the primary replica, we were not able to see the AG under the "Availability Groups" node in SSMS. Also, there were no replicas listed under the "Availability Replicas" node. When we tried querying sys.dm_hadr_database_replica_states, we did not get any results.

WORKAROUND/SOLUTION

All the above symptoms mean that there is a metadata mismatch between the information in the cluster and the information in SQL Server; even the two replicas had a mismatch of information about the availability group. We ran the below command on the secondary to remove the information about the AG. We were not able to use the UI, as it was giving an error.

DROP AVAILABILITY GROUP PRODAG

As soon as we executed it, the databases went into the restoring state and the AG information was cleared from all the DMVs and from the cluster as well. Then we recreated the availability group using the AG wizard, and we were back in business in less than 20 minutes of the call with me.

I truly hope that this blog can help someone who is getting the same issue with AG.

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – Always On AG – HADRAG: Did not Find the Instance to Connect in SqlInstToNodeMap Key

SQL SERVER – Always On Secondary Replica Huge Redo Queue – Version Store is Full. New Version(s) Could Not be Added


One of my earlier clients, whom I had helped configure an Always On Availability Group, came back to me with strange behavior. In this blog, we would discuss the error Version store is full. New version(s) could not be added. In this situation, I also found that the redo queue was increasing continuously on the secondary replica.

THE SITUATION

My client had set up a three-node availability group with one primary, one synchronous secondary, and one asynchronous secondary. They noticed that the redo queue size was increasing continuously on the secondary replicas (both of them, sync and async).

I started digging using DMVs and found that the redo process was working, but it was waiting for a latch on APPEND_ONLY_STORAGE_FIRST_ALLOC. Here is the query I used:

SELECT * FROM sys.sysprocesses WHERE dbid = DB_ID('ProdDB')

Then I looked into the SQL Server ERRORLOG and found the below message logged continuously.

The version store is full. New version(s) could not be added. A transaction that needs to access the version store may be rolled back. Please refer to BOL on how to configure tempdb for versioning.

Based on my understanding of the secondary version store, it comes into the picture when the secondary is readable. All queries that run against the secondary databases are automatically mapped to the snapshot isolation level, even when other transaction isolation levels are explicitly set.

Active Secondaries: Readable Secondary Replicas (Always On Availability Groups)
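To see how much of tempdb the version store is consuming at any point, a query like the below against tempdb works on either replica (a generic sketch):

USE [tempdb];
GO
SELECT SUM(version_store_reserved_page_count) AS version_store_pages,
       SUM(version_store_reserved_page_count) * 8 / 1024.0 AS version_store_mb
FROM sys.dm_db_file_space_usage;
GO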

WORKAROUND/SOLUTION

I found that they had a serious space issue on the drive where TempDB is located. TempDB was not able to grow, and hence new versions were not getting generated.

Because of the readable secondary, SQL Server was using the version store, so the quick solution in this case was to disable reads from the secondary replicas. We used the below command to disallow reads from the secondary.

USE [master]
GO
ALTER AVAILABILITY GROUP [PROD_AG]
MODIFY REPLICA ON N'Godzilla_2' WITH (SECONDARY_ROLE (ALLOW_CONNECTIONS = NO))
GO

We did this for both replicas. As soon as we disabled the reads, the version store usage vanished and redo picked up speed. Since TempDB was no longer in use, we were also able to shrink it to avoid space issues on the drive holding the TempDB database.

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – Always On Secondary Replica Huge Redo Queue – Version Store is Full. New Version(s) Could Not be Added

SQL SERVER – Unable to Create Always On Listener – Attempt to Locate a Writeable Domain Controller (in Domain Unspecified Domain) Failed


This is one of the common issues in Always On for which my clients contact me – "Unable to create listener". In this blog, we would learn how to fix event ID 1212 – Attempt to locate a writeable domain controller.

When my client contacted me, they were having the same error with three different clusters, which made them believe that this was an issue outside the cluster configuration. They were unable to create the listener and were getting Msg 41009 – The Windows Server Failover Clustering (WSFC) Resource Control API Returned Error.

The first thing I did was check the event log, and I found the below message.

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Event ID: 1212
Task Category: Network Name Resource
Level: Error
Keywords:
User: SYSTEM
Description:
Cluster network name resource ‘AGListener’ cannot be brought online. Attempt to locate a writeable domain controller (in domain unspecified domain) in order to create or update a computer object associated with the resource failed for the following reason:
More data is available.
The error code was ‘234’. Ensure that a writeable domain controller is accessible to this node within the configured domain. Also, ensure that the DNS server is running in order to resolve the name of the domain controller.

I explained that whenever there are any issues related to a cluster resource, we should always look at the cluster log. If you are not sure how to generate cluster logs, read my earlier blog on the same topic. SQL SERVER – Steps to Generate Windows Cluster Log?

Here were the messages in the cluster log

ERR[RES] Network Name: [NNLIB] Searching for object SQLLISTENER on first choice DC failed. Error 234.
WARN [RES] Network Name : AccountAD: PopulateNetnameADState – FindSuitableDCNew failed with error 234 with known good password. Retrying with proposed password
INFO [RES] Network Name: [NNLIB] FindSuitableDCNew – objectName SQLLISTENER, username – MZNAPWCLEQU004$, firstChoiceDCName – \\DCAZ.domain.com
INFO [RES] Network Name: [NNLIB] LDAP bind to first choice DC \\DCAZ.domain.com failed. Error 5. Trying to LDAP bind with empty DC
INFO [RES] Network Name: [NNLIB] LDAP bind to first choice DC failed. Error 5. Trying second choice DC
INFO [RES] Network Name: [NNLIB] LDAP bind to second choice DC failed. Error 5.
ERR [RES] Network Name : AccountAD: PopulateNetnameADState – FindSuitableDCNew failed with error 5 with known proposed password
ERR [RES] Network Name : AccountAD: Populate netname AD state (bind to DC & search for object if exists) for object SQLLISTENER failed with error 234

We are seeing two errors here:

  1. Error 234
  2. Error 5

I don’t know what error 234 is but I do know that error 5 = Access is denied.

One of the most common causes is when the domain administrator has not given the CNO the "Read All Properties" and "Create Computer Objects" permissions. In that case you might see "Access is denied" in the event log.

Here are the steps, which are also known as pre-staging the virtual computer object (VCO) in the domain controller.

  • If possible, connect to the domain controller. Ensure that we are logged in as a user that has permissions to create computer objects in the domain.
  • Open the Active Directory Users and Computers Snap-in (dsa.msc).
  • In the menu, go to View > Advanced Features. (Otherwise, we would not see the options explained in the next steps.)
  • Right-click the OU/Container where we want the VCO to be created and click “New” -> “Computer”
  • Provide a name for the object (This will be your SQL Server Network Name in FCI or Listener Name in AG) and click “OK”:
  • Right-click on the VCO which we just created and select "Properties". Click the Security tab and then click "Add":
  • Enter the CNO (Make sure to select “Computers” option in the “Object Types” window) and click “OK”. The CNO is a Cluster Name Object. This is the name of the Windows Cluster name NOT listener or FCI name.
  • Give CNO “Full Control” over the VCO.

If all the above steps are followed, we should not get access denied, and if we try creating the listener, it should be successful.
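Once the VCO is pre-staged, the listener itself can be created either from the Availability Group wizard or with T-SQL along these lines. This is a sketch: the availability group name, IP address, subnet mask, and port below are placeholders for illustration, while the listener name must match the pre-staged VCO.

ALTER AVAILABILITY GROUP [YourAG]   -- hypothetical AG name
ADD LISTENER N'AGListener'          -- must match the pre-staged VCO name
(
    WITH IP ((N'10.10.10.50', N'255.255.255.0')),  -- placeholder static IP and subnet mask
    PORT = 1433
);
GO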

Reference: Pinal Dave (https://blog.SQLAuthority.com)

First appeared on SQL SERVER – Unable to Create Always On Listener – Attempt to Locate a Writeable Domain Controller (in Domain Unspecified Domain) Failed


SQL SERVER – Always On Availability Groups and Full-Text Index


One of the most successful offerings from me has been the Comprehensive Database Performance Health Check. Sometimes during my assistance, some random issues appear which I try to solve as well. In a recent engagement, one of the client's developers asked a question about the coexistence of full-text indexes and Always On Availability Groups. In this blog, we would learn about one common issue which you might face when full-text indexes and availability groups are used together.

THE PROBLEM

One of my clients, whom I had helped configure Always On Availability Groups, came back to me with an interesting situation. They had observed blocking of read queries on the secondary replica. Since the database is in read-only mode, they wanted to know what write activity was being performed in the database that could cause blocking.

THE INVESTIGATION

I knew that this was not user write activity but must be system write activity causing the blocking. When I started troubleshooting, I found the below:

  1. DB STARTUP thread (redo thread) being blocked by user session in sys.dm_exec_requests
  2. Wait type: LCK_M_SCH_M
  3. Wait_resource: METADATA: database_id = 8 COMPRESSED_FRAGMENT(object_id = 484196875, fragment_id = 9715700) – found using sys.all_objects

When I looked further, I found the object name was ifts_comp_fragment_484196875_10739738 and it was an INTERNAL_TABLE.

THE SOLUTION

It became clear that the redo thread was getting blocked, not a user session. This causes the replica to start lagging because redo stops often. In my lab, I also observed that if a database with a full-text index is in an availability group, we can see the same type of blocking whenever the full-text index is enabled for automatic or manual population and there are read queries running full-text searches.

For my client, we were able to prevent this behavior by disabling change tracking. My client was OK with disabling change tracking on the full-text index temporarily and then setting up an incremental population on a schedule. Here is the T-SQL to change the tracking to manual.

USE [CRM]
GO
ALTER FULLTEXT INDEX ON [dbo].[CustomerData] SET CHANGE_TRACKING = MANUAL
GO

Later, I suggested that my client refer to Populate Full-Text Indexes and consider "incremental population based on a timestamp". This was the long-term solution for them.
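With change tracking set to MANUAL (or OFF), the population can then be triggered on a schedule, for example from a SQL Server Agent job, with a statement like the below. This is a sketch using the same table as above; an incremental population additionally requires a timestamp column on the base table.

USE [CRM]
GO
-- Kick off an incremental crawl of the full-text index on demand or from an Agent job
ALTER FULLTEXT INDEX ON [dbo].[CustomerData] START INCREMENTAL POPULATION
GO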

Reference: Pinal Dave (https://blog.SQLAuthority.com)

First appeared on SQL SERVER – Always On Availability Groups and Full-Text Index

SQL SERVER – Cluster Resource ‘AGName’ of type ‘SQL Server Availability Group’ in Clustered Role ‘AGName’ Failed


I never leave my customers alone when they are having an issue with something which I helped them build. Typically, I help customers in creating POCs and deploying AlwaysOn Availability Groups. Just the other day, while doing the Comprehensive Database Performance Health Check, I came across an error related to cluster resources.

I must admit that configuring an availability group is a piece of cake and smooth as butter, but the challenge comes when something breaks in the cluster. A DBA should know about troubleshooting a Windows cluster so that they can recover from a disaster.

My client contacted me and informed me that, due to some issue, the SQL Server availability group was in the "Resolving" state in SQL Server Management Studio (SSMS). When they tried to bring the resource online in Failover Cluster Manager, it didn't work and showed the below message in the event logs.

Cluster resource ‘AGNAME’ of type ‘SQL Server Availability Group’ in clustered role ‘AGNAME’ failed.

Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

This is not a very useful message and doesn't tell us what needs to be done. Here is how it looks in SSMS.

[Screenshot: ao-resolving-01 – availability group shown in the Resolving state in SSMS]

When we tried to remove the database from the secondary replica, below is the error we got:

The database ‘AG_DB’ failed to leave the availability group ‘AGNAME’ on the availability replica ‘DB02’. (Microsoft.SqlServer.Management.SDK.TaskForms)

The local availability replica of availability group ‘AGNAME’ cannot accept signal ‘UNJOIN_DB’ in its current replica role, ‘RESOLVING_NORMAL’, and state (configuration is in Windows Server Failover Clustering store, local availability replica has joined).  The availability replica signal is invalid given the current replica role.  When the signal is permitted based on the current role of the local availability replica, retry the operation. (Microsoft SQL Server, Error: 41121)

What is the RESOLVING state in SQL Server AlwaysOn?

In an availability group, a replica is either in the primary or the secondary role while the resource is online in Failover Cluster Manager. Resolving is an intermediate state during the transition from primary to secondary or vice versa. If for some reason the transition does not complete, the replica stays in the “Resolving” state, and in this state the databases are not accessible.

What can we do?

First, we need to find out why it’s not coming online. There are multiple logs which need review.

  1. SQL Server ERRORLOG. SQL SERVER – Where is ERRORLOG? Various Ways to Find ERRORLOG Location
  2. Cluster Log: SQL SERVER – Steps to Generate Windows Cluster Log? (a PowerShell sketch to generate it follows this list)
  3. Windows Event Viewer.
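
For the cluster log in item 2, here is a minimal PowerShell sketch (the destination folder and the 60-minute window are just examples):

# Generate cluster.log for all nodes, covering the last 60 minutes, into C:\Temp
Get-ClusterLog -Destination "C:\Temp" -TimeSpan 60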

In most situations, the cluster log gives the right message and cause. I have written blogs about various causes and will continue sharing my knowledge as I find more, like the one below.

SQL SERVER – Always On AG – HADRAG: Did not Find the Instance to Connect in SqlInstToNodeMap Key

WORKAROUND/SOLUTION

Based on the error messages and the situation, sometimes we need to perform a forced failover of the availability group. See: Perform a Forced Manual Failover of an Availability Group (SQL Server)
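
For reference, a minimal sketch of the forced failover T-SQL is below (run it on the replica you want to promote; the AG name is illustrative, and note that this option allows data loss):

-- Run on the replica that should become the new primary; this allows data loss
ALTER AVAILABILITY GROUP [AGNAME] FORCE_FAILOVER_ALLOW_DATA_LOSS;
GO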

If the error message makes sense and you are able to solve the issue, please share via comments.

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – Cluster Resource ‘AGName’ of type ‘SQL Server Availability Group’ in Clustered Role ‘AGName’ Failed

SQL SERVER – AlwaysOn – Queries Waiting for HADR_AR_CRITICAL_SECTION_ENTRY


In the past few days, I have been contacted by clients about AlwaysOn-related issues, and I have been writing blogs about them. In this blog, we will learn how to fix queries waiting on HADR_AR_CRITICAL_SECTION_ENTRY.


THE SITUATION

My client was using SQL Server in a virtual environment. Due to some instability in their network infrastructure, the Windows cluster lost quorum for a few minutes and then came back. As you might know, an AlwaysOn availability group is tightly coupled with the Windows Server Failover Cluster, so anything happening in the cluster can also impact the availability group. That is precisely what happened here.

As usual, they sent me an email, I responded with GoToMeeting details, and we were talking to each other within a few minutes. When I joined the call with them, we found:

  1. All of our AG modification queries (removing an availability database, removing an availability replica) were stuck waiting on HADR_AR_CRITICAL_SECTION_ENTRY (a query to confirm this wait follows this list).
  2. We were unable to make modifications to the AG as it was in an inconsistent state, pending an update of the replica’s state.
  3. As per the Microsoft docs, this wait type “Occurs when an Always On DDL statement or Windows Server Failover Clustering command is waiting for exclusive read/write access to the runtime state of the local replica of the associated availability group.”
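
Here is a minimal sketch of the query mentioned in item 1, to confirm which sessions are stuck on this wait type:

-- Sessions currently waiting on HADR_AR_CRITICAL_SECTION_ENTRY
SELECT r.session_id,
       r.command,
       r.wait_type,
       r.wait_time,
       t.text AS sql_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.wait_type = N'HADR_AR_CRITICAL_SECTION_ENTRY';
GO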

SOLUTION/WORKAROUND

Based on my search on the internet, a restart of the SQL Server instance is the only way to come out of this.

We set the AG failover to manual and restarted both replicas; after doing so, our secondary replica became synchronized after a few minutes and we were able to successfully remove databases from the AG. We tested failover back and forth, and everything was working as expected.
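
For reference, a minimal sketch of switching a replica to manual failover is below (the AG and replica names are illustrative; repeat for each replica before restarting the instances):

-- Switch a replica to manual failover before restarting the instances
ALTER AVAILABILITY GROUP [AGNAME]
MODIFY REPLICA ON N'NODE1' WITH (FAILOVER_MODE = MANUAL);
GO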

Have you seen this wait in your environment? It would be great if you could share its cause via comments and how you came out of it.

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – AlwaysOn – Queries Waiting for HADR_AR_CRITICAL_SECTION_ENTRY

SQL SERVER – FIX: Msg 41105: Failed to Create the Windows Server Failover Clustering (WSFC) Resource With Name and Type ‘SQL Server Availability Group’


Sometimes there are errors which give us the solution to the problem, and I love to discover other ways to fix them. In this blog, we will learn how to fix “Failed to create the Windows Server Failover Clustering (WSFC) resource with name and type ‘SQL Server Availability Group’”.

Here is the exact error which I received while creating an availability group:

Msg 41105, Level 16, State 0, Line 3
Failed to create the Windows Server Failover Clustering (WSFC) resource with name ‘SQLAUTHORITY_AG’ and type ‘SQL Server Availability Group’. The resource type is not registered in the WSFC cluster. The WSFC cluster may have been destroyed and created again. To register the resource type in the WSFC cluster, disable and then enable Always On in the SQL Server Configuration Manager.
Msg 41152, Level 16, State 2, Line 3
Failed to create availability group ‘SQLAUTHORITY_AG’. The operation encountered SQL Server error 41105 and has been rolled back. Check the SQL Server error log for more details. When the cause of the error has been resolved, retry CREATE AVAILABILITY GROUP command.

The T-SQL which I used was as follows.

USE [master]
GO
CREATE AVAILABILITY GROUP [SQLAUTHORITY_AG]
WITH (AUTOMATED_BACKUP_PREFERENCE = SECONDARY,
DB_FAILOVER = OFF,
DTC_SUPPORT = NONE,
REQUIRED_SYNCHRONIZED_SECONDARIES_TO_COMMIT = 0)
FOR DATABASE [SQLAUTHORITY_DB]
REPLICA ON N'NODE1' WITH (ENDPOINT_URL = N'TCP://NODE1.SQLAUTHORITY.COM:5022', FAILOVER_MODE = AUTOMATIC, AVAILABILITY_MODE = SYNCHRONOUS_COMMIT, SESSION_TIMEOUT = 10, BACKUP_PRIORITY = 50, SEEDING_MODE = MANUAL, PRIMARY_ROLE(ALLOW_CONNECTIONS = ALL), SECONDARY_ROLE(ALLOW_CONNECTIONS = NO));
GO

Here is the screenshot of the error message.

[Screenshot ao-type-err-01: Msg 41105 error message]

I checked the ERRORLOG and there were no messages. As the error message says, “The resource type is not registered in the WSFC cluster”, so I used PowerShell to list the registered resource types and found the below.

Get-ClusterResourceType | where name -like "SQL Server Availability Group"

The output showed no result, which means the error message is correct.

[Screenshot ao-type-err-02: Get-ClusterResourceType returns no result]

As we can see in the screenshot, there is no result, which is exactly what the error message told us.

WORKAROUND/SOLUTION

I was able to find two ways to fix the issue:

  1. Register the resource type manually using the below PowerShell.
Add-ClusterResourceType -Name "SQL Server Availability Group" -DisplayName "SQL Server Availability Group" -Dll "C:\Windows\System32\hadrres.dll"

[Screenshot ao-type-err-03: Add-ClusterResourceType output]

  2. The better way: disable and then enable the Always On feature using SQL Server Configuration Manager. That is what the error message suggests as well.

Now, if we run the same Get-ClusterResourceType command as earlier, we should see the output:

[Screenshot ao-type-err-04: Get-ClusterResourceType now lists the SQL Server Availability Group resource type]

Have you encountered the same error? What was the cause of it?

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – FIX: Msg 41105: Failed to Create the Windows Server Failover Clustering (WSFC) Resource With Name and Type ‘SQL Server Availability Group’

SQL SERVER – Always On Availability Group Listener Missing in SSMS but Working Fine in Failover Cluster Manager


I have helped many customers solve complex issues in their environment through the Comprehensive Database Performance Health Check. Sometimes, the issue looks very complex, but once the solution is found it seems very easy. In this blog, we will learn about a situation where the listener is missing in SSMS but working fine in Failover Cluster Manager (cluadmin.msc).

While doing checks of their databases, they showed me an interesting situation. Here it goes.

THE SITUATION

My client had a two-node Always On Availability Group on SQL Server 2017 and Windows Server 2016. They noticed that:

  1. In Failover Cluster Manager, we were able to see the Network Name resource for the listener.
  2. In SQL Server Management Studio (SSMS), we were not able to see anything under “Availability Group Listener”. It was empty!
  3. The below query also didn’t show any listener in SQL. (0 rows affected)
SELECT *
FROM sys.availability_group_listeners
GO
SELECT *
FROM sys.availability_group_listener_ip_addresses
GO

[Screenshot list-miss-02: no listener shown in SSMS or in the DMV query output]

  4. We were able to connect to the listener and it was working fine, even after a failover.

SOLUTION/WORKAROUND

When I asked them about the history of the listener creation, they informed me that it was created by the Windows admin team. The SQL DBA team couldn’t create the listener due to an issue which I have written about in my previous blog:

SQL SERVER – AlwaysOn Listener Error – The WSFC Cluster Could Not Bring the Network Name Resource With DNS Name ‘DNS name’ Online

When I checked the availability group resource in Failover Cluster Manager, I found that it did not have any dependency on the listener. As soon as the dependency was added (no downtime needed), we were able to see the listener in #2 and #3 above.

[Screenshot list-miss-03: listener dependency added to the availability group resource]

In short, if you create the listener via Failover Cluster Manager, a dependency must be added to the AG resource to make the AG dependent upon the listener. If you create it via SQL Server (using SSMS, T-SQL, or PowerShell), you should not face this issue.
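
If you hit the same situation, a minimal PowerShell sketch to add the missing dependency is below (the resource names are illustrative; confirm the exact names with Get-ClusterResource first):

# Confirm the exact AG and listener resource names first
Get-ClusterResource

# Make the availability group resource depend on the listener's Network Name resource
Add-ClusterResourceDependency -Resource "AGNAME" -Provider "AGNAME_ListenerName"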

Have you seen such a situation in your production? Check it now and fix it!

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – Always On Availability Group Listener Missing in SSMS but Working Fine in Failover Cluster Manager
