SCOM Login Failure Error

Posted on June 2, 2011 by jeff

System Center Operations Manager is a pretty nice monitoring and management application, but it is also a complex application that can be very difficult to configure and troubleshoot. Recently, after some account clean-up, we started receiving login failure pop-up windows when trying to run some reports inside the interface. This happened to coincide with a password change, so we thought for sure it was just a password issue – boy were we wrong.

After months of troubleshooting, we finally had a breakthrough today. After previous attempts at resetting passwords, reconfiguring Run As accounts and profiles, and applying hotfixes and service packs had all failed, the answer came to us. A colleague was able to narrow down the issue specifically to this Event Viewer entry:

Log Name:      Operations Manager
Source:        OpsMgr SDK Service
Date:          6/2/2011 10:45:04 AM
Event ID:      26319
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      SERVER
Description:
An exception was thrown while processing GetDataWarehouseMonitoringObjectsByRowId
for session id uuid:54910ec9-1832-4399-9864-a2fd482aa340;id=863.
Exception Message: The creator of this fault did not specify a Reason.
Full Exception:
System.ServiceModel.FaultException`1[Microsoft.EnterpriseManagement.Common.UnknownDatabaseException]:
The creator of this fault did not specify a Reason. (Fault Detail is equal to Login failed for user 'SCOMUser'.).

That’s obviously a SQL error. However, we had quadruple-checked that the Data Warehouse account was configured to use the proper SCOMUser domain account, and that the domain account had a proper login and permissions to the databases. (We knew this was the Run As credential in use because when we changed the case of the user name to ‘SCOmuser’, the error message changed to match.) Furthermore, after enabling SQL login auditing, we received the following message:

Log Name:      Application
Source:        MSSQLSERVER
Date:          6/2/2011 2:22:36 PM
Event ID:      18456
Task Category: Logon
Level:         Information
Keywords:      Classic,Audit Failure
User:          N/A
Computer:      SERVER
Description:
Login failed for user 'SCOMUser'. Reason: Could not find a login matching the name provided. [CLIENT: ]

What finally occurred to me was that it was not passing the domain as part of the login. Rather, it was attempting to use SCOMUser as a SQL account, not a Windows account. By clicking Properties on the Data Warehouse Action Account, going to the Distribution tab, and selecting the “Where is this credential used?” link, we found that it was being used by the “Reporting SDK SQL Authentication Account” profile. Because that profile is meant for SQL authentication, SCOM was presenting the Windows account’s user name as a SQL login, which is why the domain was never passed.

To correct the issue, we re-associated the Reporting SDK SQL Authentication Account with the appropriate run as credential. After doing so, the pop-up errors are gone!
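The clue in the audit entry was the missing domain prefix. As a rough illustration (not part of our actual troubleshooting), here is a small Python sketch that pulls the user name out of an 18456-style description and reports whether it looks like a Windows login or a SQL login. The function name and the CONTOSO domain are made up for the example.

```python
import re

# Matches the "Login failed for user '...'" text seen in the events above.
LOGIN_FAILED = re.compile(r"Login failed for user '(?P<user>[^']*)'")


def describe_failed_login(description: str) -> str:
    """Return a hint about how the failing login name was presented."""
    match = LOGIN_FAILED.search(description)
    if match is None:
        return "no login failure found"
    user = match.group("user")
    # A Windows login normally shows up as DOMAIN\user (or user@domain).
    if "\\" in user or "@" in user:
        return f"{user}: domain-qualified, presented as a Windows login"
    return f"{user}: no domain, presented as a SQL login"


# The description from the MSSQLSERVER event quoted above.
event_text = ("Login failed for user 'SCOMUser'. Reason: Could not find a "
              "login matching the name provided. [CLIENT: ]")
print(describe_failed_login(event_text))
# CONTOSO is a hypothetical domain name used only for contrast.
print(describe_failed_login("Login failed for user 'CONTOSO\\SCOMUser'."))
```

A bare user name in that message is worth a second look whenever the credential is supposed to be a domain account.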

ASP.NET Routing and IIS 404 errors

Posted on March 28, 2011 by jeff

ASP.NET Routing is a powerful feature introduced in .NET 3.5 SP1 and included with .NET 4.0 that allows a developer to route URLs that are not real files. There are several ways you can accomplish this same task, but having it in the ASP.NET pipeline allows the developer great flexibility in how URLs are routed. However, it does not always work “out-of-the-box” for devs. The most common error I’ve seen is a 404 error – meaning the page cannot be found.

There can be several contributing factors. If you are attempting to utilize extensionless URLs, ensure you have the appropriate IIS hotfix installed: http://support.microsoft.com/kb/980368. More commonly however, the issue is a result of the module not firing for your request type. The easy fix is to add runAllManagedModulesForAllRequests="true" to the modules tag in your web.config. However, that could have performance implications as you are now telling IIS to ignore the preCondition setting for all modules. The alternative solution is to remove the managedHandler preCondition for the URL routing module only.
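To illustrate the second option, here is a hedged Python sketch that blanks the preCondition on a single module entry in a web.config fragment. The fragment itself is an assumption: the module name "UrlRoutingModule-4.0" and its type are what I would expect on a .NET 4.0 site, so check the names in your own configuration before copying anything.

```python
import xml.etree.ElementTree as ET

# Hypothetical web.config fragment; module name and type are assumptions.
WEB_CONFIG = """<configuration>
  <system.webServer>
    <modules>
      <remove name="UrlRoutingModule-4.0" />
      <add name="UrlRoutingModule-4.0"
           type="System.Web.Routing.UrlRoutingModule"
           preCondition="managedHandler" />
    </modules>
  </system.webServer>
</configuration>"""


def clear_module_precondition(config_xml: str, module_name: str) -> str:
    """Blank the preCondition of one <add> entry under system.webServer/modules."""
    root = ET.fromstring(config_xml)
    for section in root:
        if section.tag != "system.webServer":
            continue
        for modules in section.findall("modules"):
            for entry in modules.findall("add"):
                if entry.get("name") == module_name:
                    # An empty preCondition lets the module run for every
                    # request, not only those mapped to a managed handler.
                    entry.set("preCondition", "")
    return ET.tostring(root, encoding="unicode")


updated = clear_module_precondition(WEB_CONFIG, "UrlRoutingModule-4.0")
print(updated)
```

Only the routing module is touched, so every other module keeps its preCondition and the performance cost stays limited to routing.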

Problems installing KB2479628, KB2485376 and SP1 with SATA IDE mode on P55 chipset

Posted on February 17, 2011 by jeff

After Microsoft released February’s patches, I ran into a strange issue on my custom-built Win7 x64 desktop. I would get to the Windows 7 splash screen, but the system would inexplicably reboot at that screen. In order to get my system back up and running, I would have to boot into Safe Mode to allow it to unconfigure the failed patch attempt, and then boot into Safe Mode a second time before I could boot Windows normally. No memory dump was created and nothing was logged to Event Viewer about the failed attempts.

I spent one evening installing each patch one-by-one and rebooting until I was able to narrow it down to KB2479628 and KB2485376. The first patch affects kernel mode drivers and given the type of failure I was seeing, I started looking at driver issues. The most logical was my graphics driver since I could load Safe Mode without issue. I have an ATI Radeon HD5770 and a co-worker recommended trying driver ver 10.11 but that didn’t solve the issue. Since SP1 was around the corner, I figured I would just hide the patches and deal with it then.

Fast-forward to yesterday, and I decided to give the SP1 install a try after I finished working for the day. Installation went off without a hitch, but alas, I ran into the same issue on first reboot. It was time to tackle the driver issue. First thing I did was get back to a good desktop, and then perform a reboot and enable boot logging (press F8 after the POST to enable boot logging). Doing so will write a list of drivers loaded to C:\windows\ntbtlog.txt. This can be helpful in finding the last driver loaded before a failure when no BSOD occurs.

Knowing that KB2479628 caused the same behavior as the SP1 install, was directly related to drivers and only took a second to install, I decided that I would use it to “test” if a change I made solved the problem. So, I ran the install and on reboot I enabled boot logging again – last driver to load is flpydisk.sys … hmmm … I don’t even have a floppy drive. I then compared to the reference log looking for what might load right after this driver. First thing that pops up is my graphics card, so after I go through the Safe Mode process to boot back to a desktop, I uninstall the display adapter, selecting the “Delete driver software for this device” option and then reboot. I have to repeat the process several times as it keeps picking up an older version of the driver but eventually, I get the base MSFT VGA driver. At that point, I retry the patch installation with boot logging enabled, but again Windows automatically power cycles at the splash screen.
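The log comparison above can be scripted. This is only a sketch in Python, assuming the log uses "Loaded driver <path>" lines as mine did; the sample excerpts and driver paths are illustrative, and the encoding may need adjusting when reading the real ntbtlog.txt from disk.

```python
from typing import List, Optional, Tuple

PREFIX = "Loaded driver "


def loaded_drivers(log_text: str) -> List[str]:
    """Return the driver paths from 'Loaded driver' lines, in load order."""
    drivers = []
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith(PREFIX):
            # Lower-case so the two logs compare case-insensitively.
            drivers.append(line[len(PREFIX):].lower())
    return drivers


def last_and_next(failed_log: str,
                  reference_log: str) -> Tuple[Optional[str], Optional[str]]:
    """Last driver in the failed boot, and what followed it in a good boot."""
    failed = loaded_drivers(failed_log)
    reference = loaded_drivers(reference_log)
    if not failed:
        return None, None
    last = failed[-1]
    if last not in reference:
        return last, None
    position = reference.index(last)
    following = reference[position + 1] if position + 1 < len(reference) else None
    return last, following


# Illustrative excerpts only; real logs are much longer.
FAILED = r"""Loaded driver \SystemRoot\system32\drivers\disk.sys
Loaded driver \SystemRoot\system32\drivers\flpydisk.sys"""
REFERENCE = r"""Loaded driver \SystemRoot\system32\drivers\disk.sys
Loaded driver \SystemRoot\system32\drivers\flpydisk.sys
Loaded driver \SystemRoot\system32\drivers\atikmdag.sys
Did not load driver \SystemRoot\system32\drivers\example.sys"""

print(last_and_next(FAILED, REFERENCE))
```

Keep in mind that the driver after the last logged one is only a suspect. As the rest of this post shows, it can send you in the wrong direction.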

Long story short, I repeat the process for nearly every driver on my system: SATA controller, Realtek HD audio, Realtek NIC, etc. The boot logging led me on a wild goose chase and had me uninstalling any software that had a device associated with it including MagicDisc, my IOCell NDAS software, even Microsoft Intellipoint, all to no avail. Finally, I remember that when I first built this system, I had an issue with the SATA controller mode. I had tried installing Windows 7 in AHCI mode, but saw something similar to this issue so I switched to IDE mode. Looking at the BIOS settings, it was still set to IDE mode, but I tried switching to AHCI mode since nothing else had worked and I was basically out of ideas at that point. To my surprise, the system booted after the patch installation.

I was then able to re-install the latest drivers from the mobo manufacturer’s website and finally, I tried an SP1 install. After about an hour, the system rebooted and updated to SP1 properly. After all was said and done, I tried switching SATA mode back to IDE and for whatever reason, I now can’t reproduce the issue. It’s possible it was a race condition with the driver for the SATA controller in IDE mode while trying to update (perhaps the updates were looking to replace those drivers but couldn’t because of some incompatibility) – though the currently loaded driver is from 2006 so that’s doubtful.

For reference, this is a Core i5-750 on a Gigabyte P55-UD4P motherboard running BIOS F10. If you have similar problems, try switching your SATA controller mode from IDE to AHCI or vice-versa.

UPDATE: After installing an OCZ Vertex SSD, I had to set the controller mode back to AHCI as I was experiencing the same issue. Since changing it, I’ve had no problems.

HP Proliant MicroServer

Posted on January 2, 2011 by jeff

I have been wanting to upgrade my home server for quite some time and have been eyeballing WHS machines as I was looking for something with a small footprint. While I would normally build my own system, cases for mini-ITX boards that have space for at least 4 HDDs are quite expensive. In fact, the only one I could find that really fit the bill was the Chenbro case. I also had my eye on the Acer H340 series (now H341 and H342), but when I found the HP Proliant MicroServer, it was exactly what I was looking for.

A quick peek at the specs:

AMD Athlon II Neo N36L Dual-Core processor @ 1.3GHz w/ 2M L2 cache
2 DIMM slots supporting up to 8GB DDR3 PC3-10600E unbuffered ECC RAM @ 800MHz (while ECC is supported, it is not required)
AMD RS785E/SB820M Chipset
Integrated SATA controller with RAID0,1 (this is done in the driver)
4-port SATA backplane supports up to 4x2TB LFF SATA drives (ships with single 160GB HDD)
Single ODD SATA port (forced IDE mode)
Embedded Broadcom NC107i PCI Express Gigabit Ethernet Server Adapter supports PXE & WOL
Onboard VGA with 128MB shared video RAM
1x PCI-e Gen2 16x half-height full length slot (max 25W)
1x PCI-e Gen2 1x half-height full length slot
7 USB ports (4 on the front, 2 in the back, 1 internal)
1 rear eSATA port
200W 1U Flex ATX Power Supply
Trusted Platform Module support
IPMI 2.0 compliant
Optional ILO management card

I made a few small modifications to the base configuration by adding 2x4GB PC3-10666 DDR3 RAM modules (this was non-ECC consumer grade RAM), a CD/DVD-ROM drive, 2x1TB Western Digital Caviar Green HDDs, and a 16GB Patriot memory stick (to run ESXi). The integrated CPU supports AMD-V and the XD bit required for virtualization, so running ESXi (and I suspect Hyper-V) works fine. The system itself is lightweight and the perfect size (10.5″ x 8.3″ x 10.2″). With the exception of the HDD caddies, the case is very sturdy – metal all around.

Great compact design in my opinion. Cables are routed well and the motherboard is mounted to a tray secured by two thumb screws. You need to disconnect cables and remove the motherboard in order to install PCI-e cards and RAM.

You’ll notice what looks like a PCI-e x4 slot on the motherboard. That’s actually for the optional remote management card and adding it will cover the PCI-e x1 slot. The motherboard sits underneath the HDD housing, so RAM with certain size heat-spreaders may not fit. The internal USB connection is in the lower left-hand corner of the motherboard. You’ll notice there is plenty of room for a large USB flash drive if you so choose.

The HDD drive caddies are plastic and the backplane has necessary SATA connections. Screws are located on the bottom inside the front door. There’s also a handy torx driver for working on the system.

Plenty of clearance for the USB flash drive that will hold the ESXi installation.

As far as the ESXi installation goes, it couldn’t be simpler. I attempted to use SYSLINUX to install ESXi from a USB flash drive, but couldn’t get the installation going so I hit Best Buy for a cheap SATA CD/DVD-ROM drive. Note that the system has just a standard Molex power connector for the ODD, so you’ll need to purchase a Molex to SATA power cable. Total installation time was only about 10 minutes. I am using the supplied 160GB HDD for VMDK storage and will use RDM for the 1TB HDDs to create a software RAID 1 array since the embedded RAID controller is not supported by ESXi.

While 8GB of RAM won’t be enough to virtualize a datacenter, there should be plenty of RAM to run a few virtual machines. I was able to P2V my domain controller (was an Optiplex GX1 PII 400MHz box with 256MB – I know … well past its prime) in about an hour (only had a 100Mbps NIC) and am currently downloading Windows Home Server codename Vail to test out WHS in a virtual machine. All-in-all, if you’re looking for a home server or virtual test lab, this box is a great fit!

DPM 2010 Tape Belongs to a DPM server sharing this library

Posted on October 21, 2010 by jeff

Recently, I ran into an issue with our DPM 2010 shared tape library installation where several tapes added back to the library were reporting that they belonged to another DPM server sharing the library. I did not care about the data on these tapes; they just needed to be marked as Free in order to be re-used. I logged into each of our DPM servers trying to find the server that owned the tapes, but all of them reported the same error.

I tried performing erase operations, re-cataloging the tapes, identifying the unknown tapes, using the ForceTapeFree script, and external erase operations but DPM did not want to release its grip. Finally, I surmised that it must be something in the DPMDB rather than actual data on the tape.

It turns out that the media had been associated with an orphaned Media Pool. To correct this, I used the following DB queries.

First, I needed to locate the proper information about the tape. This query will give you the slot and barcode number which should allow you to find the piece of media you need to correct. You’ll want the GlobalMediaId field from this query:

-- List each tape with its slot, barcode, and global media ID
select media.BarcodeValue, media.SlotNum, media.MediaId, media.GlobalMediaId, gmedia.MediaPoolId
from tbl_MM_Global_ArchiveMedia gmedia
inner join tbl_MM_Media media
on gmedia.MediaId = media.GlobalMediaId

Next, you’ll want to find the appropriate “Free Media Pool” for your library. You can do this with the following query:

-- Find the Free Media Pool for each library
select library.ProductId, library.SerialNo, library.LibraryId, mpool.Name, mpool.MediaPoolId, mpool.GlobalMediaPoolId
from tbl_MM_MediaPool mpool
inner join tbl_MM_Library library
on mpool.LibraryId = library.LibraryId
where mpool.Name = 'Free Media Pool'

You’ll want the GlobalMediaPoolId GUID from that query. We then need to update the media with the proper MediaPoolId:

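-- varchar needs an explicit length; without one the GUID is truncated to a single character
declare @GlobalMediaId as varchar(36)
declare @GlobalMediaPoolId as varchar(36)

set @GlobalMediaId = '<GUID from query 1>'
set @GlobalMediaPoolId = '<GUID from query 2>'

-- Move the tape into the Free Media Pool
update tbl_MM_Global_ArchiveMedia
set MediaPoolId = @GlobalMediaPoolId
where MediaId = @GlobalMediaId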

Lastly, perform a refresh in the DPM Console. Your tapes should now be marked as Free.

Network Uptime

Posted on June 8, 2010 by jeff

We upgraded the firmware on some network devices at OrcsWeb during last month’s maintenance window. Before that, they had some impressive uptime:

Firewall Uptime

Switch Uptime

The devices are configured with HA redundancy, so the rolling firmware upgrades went beautifully with minimal downtime during the route convergence and no manual intervention.

High CPU on Cisco 4500 with MSFT NLB multicast cluster

Posted on May 26, 2010 by jeff

Recently, we were alerted to higher than normal CPU on some of our core Cisco Catalyst 4507 switches running IOS 12.2. Using Cisco’s CPU troubleshooting doc, I was able to narrow down the source to the Cat4K Mgmt LoPri process. From there, issuing a “sh platform health” command found it was the K2CpuMan Review process, meaning packets were being forwarded by the CPU. To find out which queue, we issued the “sh platform cpu packet statistics” command. That showed the L3 Fwd Queue was much higher than normal.

By creating a CPU span and monitoring the traffic with Network Monitor 3.3, we could see that all of the traffic destined for VIPs in our 2008 NLB clusters was hitting the CPU. I checked the configuration to ensure it matched the Catalyst and MSFT NLB example on Cisco’s site, which it did. We were using the multicast NLB configuration as explained in the document. I set up a test NLB cluster to play with the settings to figure out why cluster-bound packets were hitting the CPU. What I found was in relation to this section:

However, since the incoming packets have a unicast destination IP address and multicast destination MAC the Cisco device ignores this entry and process-switches each cluster-bound packets. In order to avoid this process switching, insert a static mac-address-table entry as given below in order to switch cluster-bound packets in hardware.

mac-address-table static 0300.5e11.1111 vlan 200 interface fa2/3 fa2/4

Note: For Cisco Catalyst 6000/6500 series switches, you must add the disable-snooping parameter. For example:

mac-address-table static 0300.5e11.1111 vlan 200 interface fa2/3 fa2/4 disable-snooping

The disable-snooping parameter is essential and applicable only for Cisco Catalyst 6000/6500 series switches. Without this statement, the behavior is not affected.

I double and triple checked that our switches had the static MAC entry for the CAM tables and they did. So, I reconfigured my test cluster from the ground up and found that cluster-bound packets only hit the CPU AFTER this command was entered. By removing this command from our production switches, CPU dropped 30-40% instantly. This seems to contradict what Cisco has posted in their example.

There was no adverse effect or downtime from removing this command. Both cluster nodes are connected locally to the switch, however, and this command may be necessary if an NLB node is connected to a down-level switch. Furthermore, a “sh int stats” is showing that no packets are switched by the “processor.”

SCVMM 2008 R2 Installation defaults and Self Service Portal

Posted on April 8, 2010 by jeff

SCVMM is an excellent product for managing your Hyper-V environment. The Self Service Portal (SSP), a component of the SCVMM install, allows end users to manage and deploy VMs remotely. However, be sure to read the fine print when installing.

During the installation process, you will be prompted to select what ports the VMM Agent will use when communicating with the Hyper-V host. The default ports WinRM and BITS use are 80 and 443 respectively. If you plan on running the Self Service Portal from the same host system, you will either need to change the ports the VMM Agent uses or change the ports the Self Service Portal uses.

Since browsers and IIS always default to 80 and 443 for HTTP and HTTPS, I would recommend making the change to the VMM Agent. Port 8080 for the VMM Agent control port (WinRM) and 8443 for the VMM Agent data port (BITS) are nice alternatives. Note that using a different IP for the SSP is NOT an option, as WinRM and BITS will self-configure to listen on all IP addresses thereby hijacking the ports.
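A quick way to sanity-check a port plan before installing is to list which components would end up on the same port. The Python sketch below does only that. The component labels are my own, and the portal ports simply reflect the HTTP/HTTPS defaults mentioned above.

```python
from collections import defaultdict
from typing import Dict, List


def find_port_conflicts(listeners: Dict[str, int]) -> Dict[int, List[str]]:
    """Map each port claimed by more than one component to those components."""
    by_port = defaultdict(list)
    for name, port in listeners.items():
        by_port[port].append(name)
    return {port: sorted(names)
            for port, names in by_port.items() if len(names) > 1}


# Defaults as described in this post (labels are illustrative).
DEFAULTS = {
    "VMM Agent control (WinRM)": 80,
    "VMM Agent data (BITS)": 443,
    "Self Service Portal (HTTP)": 80,
    "Self Service Portal (HTTPS)": 443,
}

# The alternative agent ports suggested above.
ADJUSTED = dict(DEFAULTS)
ADJUSTED["VMM Agent control (WinRM)"] = 8080
ADJUSTED["VMM Agent data (BITS)"] = 8443

print(find_port_conflicts(DEFAULTS))
print(find_port_conflicts(ADJUSTED))
```

The same check applies to any other website you plan to host on the box.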

A quick note: changing the default ports is recommended if you are planning on running ANY website on the same box. For instance, we had initially installed SCVMM on the same box running Operations Manager. It wasn’t until our VM migrations began failing that we realized the default installation of SCVMM was conflicting with the Operations Manager Console, which was also running on the same system.

Lastly, you may not even receive an error of any type when having this issue – rather, the SSP simply won’t install. You may see behavior similar to this:

http://social.technet.microsoft.com/Forums/en-US/virtualmachinemanager/thread/bba52f08-7b95-4a74-9c9b-ceaf0499e29c/#1dbef478-f896-48e4-af4e-b455d120c10b

Error 0x800423f3 backing up Hyper-V VM with DPM 2007

Posted on April 8, 2010 by jeff

One error you may receive while backing up a Hyper-V VM with DPM 2007 is the generic “DPM encountered a retryable VSS error. (ID 30112 Details: Unknown error (0x800423f3) (0x800423F3)).” There are a couple of different things that could cause this error. The two most common are:

1. You are running a Windows Server 2008 SP1 Hyper-V host and do not have the appropriate pre-requisites installed. Specifically, the hotfix described in KB959962.

http://technet.microsoft.com/en-us/library/dd347840.aspx

2. There is a VSS error of some kind inside the VM causing the Hyper-V VSS writer to fail.

One of the most common VSS errors I have seen inside a Server 2008 VM is event ID 8193:

Log Name:      Application
Source:        VSS
Date:          <DateTime>
Event ID:      8193
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      <ComputerName>
Description:
Volume Shadow Copy Service error: Unexpected error calling routine ConvertStringSidToSid. hr = 0x80070539.

Operation:
OnIdentify event
Gathering Writer Data

Context:
Execution Context: Shadow Copy Optimization Writer
Writer Class Id: {4dc3bdd4-ab48-4d07-adb0-3bee2926fd7f}
Writer Name: Shadow Copy Optimization Writer
Writer Instance ID: {3586f039-f2f9-4dcb-a46e-3aaa20f1a2fa}

This error can be solved by following the instructions in this blog post. Specifically, perform these steps outlined in KB947242:

Delete unresolvable SIDs in the ‘Administrators’ group on the VM.
Open regedit and locate ‘HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\ProfileList’
Under the ProfileList subkey, delete any subkey that is named SID.bak

This has resolved the issue in most cases where I have seen that DPM error occur. Some other suggested troubleshooting tips that have solved this problem for me in the past:

Re-install the Integration Components and reboot the VM
Resolve issues for any VSS writers not listed as stable from the “vssadmin list writers” command on the host or inside the VM. You can restart the following services to resolve some problems:
- System Writer – Cryptographic Services service (doesn’t affect the system)
- IIS Metabase Writer – IIS Admin Service (will reset all of IIS)
- SqlServerWriter – SQL Server VSS Writer service (doesn’t affect SQL)
- WMI Writer – Windows Management Instrumentation service (WMI will be unavailable during the service restart)
- BITS Writer – BITS service (BITS will be unavailable during the service restart)
Re-register VSS components as described in KB940032
Ensure there is sufficient space inside the VM for shadow copies
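When there are a lot of writers, picking out the unstable ones by eye gets tedious. Here is a hedged Python sketch that scans saved “vssadmin list writers” output for writers whose state is not Stable. The sample text is trimmed and made up for illustration (the real output also has Writer Id and Writer Instance Id lines), and the function name is my own.

```python
import re
from typing import Dict

# Trimmed, illustrative sample of "vssadmin list writers" output.
SAMPLE = """Writer name: 'System Writer'
   State: [1] Stable
   Last error: No error

Writer name: 'WMI Writer'
   State: [7] Failed
   Last error: Timed out
"""


def unstable_writers(output: str) -> Dict[str, str]:
    """Return {writer name: state} for every writer not reported as Stable."""
    result = {}
    current = None
    for raw_line in output.splitlines():
        line = raw_line.strip()
        name = re.match(r"Writer name: '(.+)'", line)
        if name:
            current = name.group(1)
            continue
        state = re.match(r"State: \[\d+\] (.+)", line)
        if state and current is not None and state.group(1) != "Stable":
            result[current] = state.group(1)
    return result


print(unstable_writers(SAMPLE))
```

Each writer it reports can then be matched against the service list above to decide which service to restart.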

Using RPC custom port range with Windows Firewall

Posted on January 6, 2010 by jeff

I ran into an interesting issue today. We use a dedicated port range for RPC connections through firewalls per this Microsoft article. Doing so allows RPC to work through dedicated hardware firewalls. We also enable the local Windows Firewall on several boxes, as this protects systems that are not behind a dedicated piece of hardware and shields systems from other hosts behind the same dedicated firewall.

While using Shavlik NetChk Configure to scan systems for compliance, I noticed some inconsistencies which I traced back to a firewall issue on the server being scanned. The scans perform some of the checks over RPC. I confirmed that Remote Administration had been enabled using this command:

netsh firewall set service REMOTEADMIN enable

However, netstat would show the connection in a SYN_SENT state on a port in the dedicated RPC range. Buried in this TechNet article, I found the reason:

Remote Administration

Adds TCP ports 135 and 445 to the exceptions list. Also adds Svchost.exe and Lsass.exe to the exceptions list to allow hosted services to open additional, dynamically-assigned ports, typically in the range of 1024 to 1034. This setting allows a computer to be remotely managed with administrative tools, such as the Microsoft Management Console (MMC) and Windows Management Instrumentation (WMI). It also allows a computer to receive unsolicited incoming Distributed Component Object Model (DCOM) and remote procedure call (RPC) traffic.

It seems that when setting a custom range of ports for RPC via the HKLM\Software\Microsoft\RPC\Internet key, it “breaks” the Remote Administration firewall rule in the Windows Firewall. This was tested on a Server 2003 R2 SP2 system, but I suspect similar issues would apply to Server 2008.
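One possible workaround, which I have not verified on every configuration, is to open each port in the custom RPC range explicitly. The Python sketch below only prints the “netsh firewall add portopening” commands for a range. The 5000-5002 range and the rule names are placeholders, so substitute the range configured under the RPC\Internet key.

```python
from typing import List


def rpc_port_openings(first: int, last: int) -> List[str]:
    """Build one netsh port-opening command per TCP port in the range."""
    if not (1 <= first <= last <= 65535):
        raise ValueError("expected 1 <= first <= last <= 65535")
    return [
        f'netsh firewall add portopening TCP {port} "RPC {port}"'
        for port in range(first, last + 1)
    ]


# Placeholder range; use the one set in the registry on your systems.
for command in rpc_port_openings(5000, 5002):
    print(command)
```

Review the generated commands before running them, since each one adds a permanent exception to the Windows Firewall.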

Jeff's Blog

Author Archives: jeff