Archives For Troubleshooting

I blogged a while ago about downloading the Virtualisation Eco Shell (VESI), and since then I haven’t stopped using it. It is now one of the tools I open as soon as I am dealing with VMware. VESI puts a graphical user interface over the top of the VI Toolkit / PowerCLI and PowerShell, allowing you to run queries and get information out of an environment in a matter of seconds. Below is a short list of some of the ways I have used it recently. I plan on doing a more detailed blog post shortly showing how useful this free tool is. Big thanks must go to Scott Herold, who runs the project; he has been extremely helpful and accommodating with assistance and feature requests.

  • Quickly shutting down all VMs when doing SAN maintenance
  • Documentation / Infrastructure Diagrams
  • Checking host log files
  • Checking for snapshots
  • Using the script editor to write, test and debug PowerShell scripts
  • Checking Windows service status / restarting / starting services
  • Checking Windows event logs for VMs

Plus lots more.

For more information please visit the website www.thevesi.org

As I have mainly installed ESX servers, today I was posed with a situation where I needed to restart the management agents on an ESXi server. Immediately I was thinking I would need to access the emergency service console etc., but I found the convenient restart management agents option in the F2 menu.

To access it, do the following:

1. From the console screen (use an iLO / DRAC connection to connect if needed) press F2

2. Login

3. Now press F11 to restart the agents

You can now log out again. This is a common fix for a host that appears as disconnected and that you are unable to manage.

VMware Snapshots

May 22, 2009

After another customer recently had an issue with old VMware snapshots, I thought I would put together some pointers regarding VMware snapshots on production servers and some items I discussed with the VMware engineer at the time.

  • Snapshots are not a backup and should not be treated as such. Use them as a point-in-time recovery point whilst installing updates, new software etc. on a server. Once you are happy, remove the snapshot immediately.
  • Only keep snapshots live for the shortest amount of time possible.
  • As a rule of thumb, ensure your snapshots are deleted within 24 hours at the latest. If you have a requirement to keep them longer, consider taking a backup instead.
  • Don’t remove snapshots during your server’s busiest times.
  • If you have a very old snapshot, consider either cloning the VM to a new VM (this will consolidate the snapshots and keep the original for fail back) or turning off the VM and removing the snapshot, so that no changes are written to the delta whilst it is being consolidated.
  • When removing large snapshots from the VI Client, the task will time out after around 15 minutes. This doesn’t mean it has failed! Be patient and check the progress via the service console by looking at the datestamp on the VMDK.
  • Check regularly to ensure there are no outstanding snapshots. A tool such as the Virtualisation Eco Shell will help you with this.
  • If you wish to check for snapshots from the service console, use the following command. It will show you the location and size of the delta files.

 find /vmfs/volumes/ -name "*delta*" -type f -print0 | xargs -0 du --human-readable --total
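If you would rather script the same check, here is a rough Python sketch of what the command above does: it walks a directory tree, picks out regular files with "delta" in the name and totals their sizes. It assumes Python 3.6+ and a directory tree you can read from wherever the script runs; the function names and the demo file names are made up for illustration and are not part of any VMware tool.

```python
import os
import tempfile


def find_delta_files(root):
    """Return a sorted list of (path, size_in_bytes) for files with 'delta' in the name."""
    results = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if "delta" in name:
                path = os.path.join(dirpath, name)
                if os.path.isfile(path):
                    results.append((path, os.path.getsize(path)))
    return sorted(results)


def human_readable(size):
    """Format a byte count with a 1024-based unit suffix, e.g. 2048 -> '2.0K'."""
    for unit in ("B", "K", "M", "G", "T"):
        if size < 1024 or unit == "T":
            return f"{size:.1f}{unit}"
        size /= 1024


def demo():
    """Build a throwaway tree with one delta file and one base disk, then scan it."""
    with tempfile.TemporaryDirectory() as root:
        vm_dir = os.path.join(root, "datastore1", "myvm")  # hypothetical layout
        os.makedirs(vm_dir)
        with open(os.path.join(vm_dir, "myvm-000001-delta.vmdk"), "wb") as f:
            f.write(b"\0" * 2048)
        with open(os.path.join(vm_dir, "myvm-flat.vmdk"), "wb") as f:
            f.write(b"\0" * 4096)
        found = find_delta_files(root)
        total = sum(size for _path, size in found)
        return [(os.path.basename(p), s) for p, s in found], total


if __name__ == "__main__":
    names, total = demo()
    for name, size in names:
        print(human_readable(size), name)
    print(human_readable(total), "total")
```

Point find_delta_files at a real datastore path instead of the demo tree to use it for an actual check.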

Please feel free to comment with other suggestions; I will add to this over time.

Cheers

Barry

Whilst recently experiencing performance issues running a virtualised backup on a customer’s site, I thought I would blog some useful steps to help diagnose issues with virtualised backup performance with products such as vRanger and Veeam. A lot of these also apply to physical backups.

This is a list of things to consider / try when diagnosing the issues you are having. I will add to it over time, but please feel free to comment with further suggestions or changes.

  • Network
  • ESX
  • Storage
  • Anti-Virus / Third Party Tools
  • Veeam  Specific Settings

Network

  1. Are you using a Gb connection or a 100Mb LAN connection?
  2. Are the ESX host, VM and backup server all seeing the expected network connection speed?
  3. Do a ping test between all the affected elements. Are you losing any packets? Are you seeing the expected response times?
  4. Is the backup server on the same subnet as the ESX host or VM being backed up? If not, could the routing device be causing a bottleneck? Is it a 100Mb routing device or Gb? Do a tracert to ensure you are taking the expected route. Are you able to temporarily put this backup server on the same network and test backup performance?
  5. Are you using the latest drivers for your NIC?
  6. What is your backup server / VM / ESX host currently doing? Could this be causing a networking bottleneck?
  7. Try a standard file transfer between your VM and the backup server, how does this perform? Use Veeam FastSCP to do a file transfer between the ESX host and your backup server, how does this perform?
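When comparing test transfers against the link speed, it helps to know roughly what "good" looks like. The sketch below is only back-of-envelope arithmetic, not a measurement tool: it assumes decimal gigabytes and an 80% usable share of the link, which is a figure I have picked for illustration rather than anything from VMware or Veeam.

```python
def ideal_transfer_seconds(size_gb, link_mbps, efficiency=0.8):
    """Best-case seconds to move size_gb gigabytes over a link_mbps megabit/s link.

    efficiency is an assumed usable fraction of the raw link speed.
    """
    size_megabits = size_gb * 1000 * 8  # decimal gigabytes -> megabits
    return size_megabits / (link_mbps * efficiency)


def format_duration(seconds):
    """Render a number of seconds as 'Hh MMm SSs'."""
    hours, remainder = divmod(int(round(seconds)), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


if __name__ == "__main__":
    for speed in (100, 1000):
        print(speed, "Mb:", format_duration(ideal_transfer_seconds(100, speed)))
```

If a 100GB backup over a Gb link is taking hours rather than somewhere in the region of the figure printed here, the network is probably not the only bottleneck.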

ESX

  1. Are you using VCB or network backup? If you are using VCB, try network backup. How does this affect the performance? Have you followed the relevant VCB articles on VMware’s website? Are you using an FC or iSCSI SAN? If you are using iSCSI, are you expecting a performance increase by using VCB, or are you just looking to remove backup from the LAN or take load away from the ESX boxes?
  2. How much memory have you reserved for the service console? Try setting it to 800MB. Does this improve the performance?
  3. Have you reserved CPU for the service console? Try increasing this from the default settings.

Storage

  1. Where are you storing your backups? What performance do you get when copying a standard file to this store? How are you connecting to this storage and what are the bandwidth limitations of this connection?
  2. Are you connecting to a NAS or SAN via iSCSI? If you are using a software iSCSI initiator, how is this affecting the CPU usage of your backup server?
  3. Are the drivers and firmware for your FC card up to date?
  4. Is anything else being actively written to this device? How many backups are you trying to run at once to this store?

Anti Virus / Third Party Tools

  1. Is your anti virus causing a bottleneck for reads / writes? Is it possible to set up exclusions or temporarily disable the AV to test? (Please ensure you don’t leave your AV permanently disabled!)
  2. Is there any other software on the server that could be causing a bottleneck for your backup?

Veeam Specific Settings

  1. If using Network mode have you enabled the SSH client on the ESX host as instructed when you added your hosts?
  2. If using full fat ESX rather than ESXi, try changing the “Data Transfer Engine” under the properties of your host in Veeam to “Force Service Console Agent Mode”. Note that the default “Automatic” mode will always attempt to use SSH on a “fat” ESX host if it is configured (see the previous point).
  3. How many backups are you running at once? Note that when using network backup mode, multiple backups may slow down individual job rates, but the overall speed may be quicker. Test different mixes until you find a sweet spot. For VCB SAN mode, though, backup storage disk speed is often the bottleneck, so unless parallel jobs write to different disks there’s little sense in running many jobs in parallel.
  4. What compression level is your backup set to? If currently set to Best, try Optimal.

VMware have recently published an article on the Knowledge Base blog regarding common fault resolution paths.

The blog post reads:

Many common tech support issues in VMware products can be solved using what we call Resolution Paths. Resolution Paths are collections of modular steps that can be used to solve tech support issues.

Being modular, they can be re-used in other resolution paths. A good example is using the ping command to test network connectivity. This step is used in all kinds of troubleshooting procedures. Put a number of these steps together, and you have a method.

Down the very first left hand column of each resolution path are common issues that customers encounter. To the right of each these are VMware’s recommended steps to resolving these issues. Each and every one is hyper-linked directly to our Knowledge base article on that topic.

These can be very handy and can save you having to make that call into Tech Support. Click the links below. There’s one for each potential problem area

This link is definitely worth bookmarking and keeping for when you are having issues.

http://blogs.VMware.com/kb/2009/05/resolution-paths-published.html

VM from template customization fails at 99% when trying to deploy a VM from template using VC 2.5 Update 4. Ensure that you enter your Windows licence information in the customization wizard when it asks for it; it seems to be a bug in this update that customization fails if you leave it out. It is also worth checking that VMware Tools is up to date in your template, as this has also been reported as an issue in the past.

I stumbled across this useful article on problems with VMware snapshots:

http://hyperinfo.wordpress.com/2008/06/12/troubleshooting-vmwareesx-snapshots/

Specifically the items below. Please note I have changed the wording and content of some items from the original post.

Locating VMs that have snapshots


Trying to find out which VMs have snapshots can be challenging. There is no centralized way to do this built into the VMware Infrastructure Client or VirtualCenter, so you should periodically check your ESX servers for old snapshots that need to be deleted. There are a few methods you can use to accomplish this.

Method 1 – use the Find command on the Service Console

  1. Login to service console.
  2. Change to your /vmfs/volumes/ directory.
  3. Type find -iname "*-delta.vmdk" -mtime +7 -ls to find snapshot files that have not been modified in the last 7 days, or simply find -iname "*-delta.vmdk" to find all snapshot files.

and

Dealing with snapshots that do not delete properly


Occasionally, a snapshot will not delete properly leaving an active snapshot for a VM. This can happen when using VMware Consolidated Backup or when deleting snapshots through Snapshot Manager. In most cases, the snapshot will not appear in the Snapshot Manager for you to delete. The only indication that a snapshot may still exist is the presence of delta files in the VM’s directory.

If you do have a snapshot running that is not in Snapshot Manager, you can attempt to delete it one of two ways. First, create a new snapshot using the VI Client and delete all snapshots from the Snapshot Manager after the new one has been created. Alternatively, log in to the ESX Service Console, switch to the VM’s home directory and create a new snapshot with vmware-cmd createsnapshot (the syntax is “vmware-cmd <path to .vmx file> createsnapshot name description quiesce memory”). Wait for the snapshot to be created, then run vmware-cmd <path to .vmx file> removesnapshots. When it completes, check to see if the delta files have been deleted. If they have, the operation completed successfully.

If the delta files weren’t deleted, check the vmx file for the VM and locate the lines starting with scsi. If the VM is configured with only one virtual disk, it is usually scsi0:0 (if .present is false, it is a non-existent drive that you can ignore). The .fileName should be using the original disk file that was created with the VM and is usually the same name as your VM. If this is the case, then your VM is not using the snapshot files. If it has a -00000# in the filename, it is currently using a snapshot file. The following makes this a little clearer:

VM with no snapshots:
scsi0:0.present = "true"
scsi0:0.fileName = "myvmname.vmdk"

VM with snapshots:
scsi0:0.present = "true"
scsi0:0.fileName = "myvmname-000001.vmdk"
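The vmx check described above can be sketched in a few lines of Python if you have several VMs to look at. This is only an illustration: it assumes the scsiX:Y.present / scsiX:Y.fileName layout quoted in the text and a six-digit snapshot suffix such as -000001, and the function name is my own invention, not a VMware API.

```python
import re

# Matches lines such as: scsi0:0.fileName = "myvmname-000001.vmdk"
DISK_LINE = re.compile(
    r'^\s*(scsi\d+:\d+)\.(present|fileName)\s*=\s*"(.*)"\s*$', re.IGNORECASE
)
# Assumed snapshot naming: a dash, six digits, then .vmdk
SNAPSHOT_NAME = re.compile(r"-\d{6}\.vmdk$", re.IGNORECASE)


def disks_on_snapshots(vmx_text):
    """Map each present scsi disk to True if its fileName looks like a snapshot disk."""
    disks = {}
    for line in vmx_text.splitlines():
        match = DISK_LINE.match(line)
        if not match:
            continue
        device, key, value = match.groups()
        disks.setdefault(device.lower(), {})[key.lower()] = value

    result = {}
    for device, props in disks.items():
        # Skip drives marked as not present, as the text suggests.
        if props.get("present", "").lower() != "true" or "filename" not in props:
            continue
        result[device] = bool(SNAPSHOT_NAME.search(props["filename"]))
    return result


if __name__ == "__main__":
    sample = 'scsi0:0.present = "true"\nscsi0:0.fileName = "myvmname-000001.vmdk"\n'
    print(disks_on_snapshots(sample))
```

Read the contents of a .vmx file into a string and pass it in; any disk reported as True is running on a snapshot file.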

If this is the case and the above operation failed, another option you have is to either clone the VM or clone the VM’s disk file. To clone the VM I would recommend the VM is powered down and cloned through the VI client to another VM, although this could technically be done whilst the VM is powered up if you are using ESX 3.5 Update 2 or newer.

Another method is to shut down the VM, log in to the Service Console, switch to the VM’s directory and clone the VM’s disk file by using vmkfstools and specifying the snapshot file as the source disk, i.e. “vmkfstools -i myvmname-000001.vmdk myvmnamenew.vmdk”. Once it completes, go into the settings for the VM, remove (don’t delete) the hard disk, add a new hard disk and browse to the newly created disk file. Power on the VM and verify everything is working before you delete the old disk and delta files.

Please note if you are struggling with a particular issue with snapshot and you have a current VMware support contract, VMware support are very good at assisting with these problems and I would suggest you log a call with them rather than trying anything you are unsure of.

A good article taken from http://www.ozvms.com/content/view/160/9/ on how to change the service console IP address

Written by Damian Murdoch

Friday, 15 December 2006

If you want to change the IP address of the service console in ESX 3.x, you can do so using a command in the service console. Read on for more.

To change the IP address of the ESX 3.x host, you need to change the configuration of the vswif. By default this is vswif0 and this is assumed in this document. Log in to the service console with root permissions, either by using root or by doing a su - to get the permissions.
Once in the service console run the command

“esxcfg-vswif -d vswif0”.

This command deletes the existing vswif0. Don’t worry if you get a message about nothing to flush. Then you need to run the command to change the ip address, subnet mask and broadcast address. They are also specified in that order when the command is given. An example command is below.

“esxcfg-vswif -a vswif0 -p Service\ Console -i 10.1.1.1 -n 255.255.255.0 -b 10.1.1.255”

In this command the -a switch is to add a vswif, the \ in the Service\ Console is deliberate, the -i is the ip address, the -n is the netmask and the -b is the broadcast address.
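It is easy to mistype the broadcast address when the subnet is not a simple /24. As a small sanity check, the following sketch uses Python’s standard ipaddress module to derive the broadcast address from the IP and netmask and to print the command in the same form as the example above. The helper names are made up for this post and the script only builds a string; it does not run anything on the host.

```python
import ipaddress


def vswif_addresses(ip, netmask):
    """Return (ip, netmask, broadcast) as strings for the given IPv4 address and mask."""
    iface = ipaddress.IPv4Interface(f"{ip}/{netmask}")
    return str(iface.ip), str(iface.netmask), str(iface.network.broadcast_address)


def build_vswif_command(ip, netmask, portgroup="Service Console", vswif="vswif0"):
    """Build the esxcfg-vswif add command as text, escaping spaces in the port group."""
    ip, netmask, broadcast = vswif_addresses(ip, netmask)
    escaped = portgroup.replace(" ", "\\ ")
    return f"esxcfg-vswif -a {vswif} -p {escaped} -i {ip} -n {netmask} -b {broadcast}"


if __name__ == "__main__":
    print(build_vswif_command("10.1.1.1", "255.255.255.0"))
```

With the values from the example it reproduces the same command line, which is a quick way to confirm the broadcast address before typing it into the service console.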

Remember to change the hosts file in /etc/hosts.


You now need to change your default gateway, you can do this by editing the network file located at /etc/sysconfig/network. To do this at the command prompt, follow the steps below.
“cd /etc/sysconfig”
“vi network”
Then while in vi, go to the location of the default gateway using the arrow keys.
Hit “i” which will perform an insert and change the default gateway to your liking.
Hit the escape key twice to exit insert mode.
Type “:wq!” to write (i.e. save) and quit.
At this point you can run some commands to restart the VMware management agents, but I prefer to restart the server and recommend you do that. Once the server comes up there are a few things that still need to be done for management in VirtualCenter.
Open a remote console to your VirtualCenter server and do a ping <yourESXhostname> to make sure the ESX host is pingable after the IP change. Make sure you are seeing the new IP address; it is assumed you have already changed that in DNS. If you are seeing the host correctly, open VirtualCenter and disconnect then reconnect the host.
Once the host is connected in VirtualCenter we need to change a few bits of configuration information, namely the VMkernel IP address, subnet and gateway. This is so we can VMotion correctly. Click on your host and bring up the configuration tab. Select networking and then properties on the virtual switch.
Select your VMkernel port and hit the edit button. Change your IP address for VMotion and the subnet mask here. You will not be able to change the default gateway until you hit OK and go back in. Once you have selected OK, hit edit again on the VMkernel port. Select the edit button on the default gateway and change the default gateway on the menu that appears. Select OK, OK again and then close.

I found this fantastic article on SearchVMware.com

Eric Siebert, contributor

Panicking at the onset of a high impact technical problem can cause impulsive decision making that enhances the problem. Before trying to troubleshoot any problem, pause and relax to approach the task with a clear mind, then address each symptom, possible cause and resolution appropriately.

In this series, I offer solutions for many common problems that arise with VMware ESX host servers, VirtualCenter, and virtual machines in general. Let’s begin by addressing common issues with VMware ESX host servers.

Windows server administrators have long been familiar with the dreaded Blue Screen of Death (BSOD), which signifies a complete halt by the server. VMware ESX has a similar state called the purple screen of death (PSOD) which is typically caused by hardware problems or a bug in the VMware code.

Troubleshooting a purple screen of death
When a PSOD occurs, the first thing you want to do is note the information displayed on the screen. I suggest using a digital camera or cell phone to take a quick photo. The PSOD message consists of the ESX version and build, the exception type, register dump, what was running on each CPU at the time of the crash, back-trace, server up-time, error messages and memory core dump info. The information won’t be useful to you, but VMware support can decipher it and help determine the cause of the crash.

Unfortunately, other than recording the information on the screen, your only option when experiencing a PSOD is to power the server off and back on. Once the server reboots you should find a vmkernel-zdump-* file in your server’s /root directory. This file will be valuable for determining the cause. You can use the vmkdump utility to extract the vmkernel log file from the file (vmkdump -l ) and examine it for clues as to what caused the PSOD. VMware support will usually want this file also. One common cause of PSODs is defective server memory; the dump file will help identify which memory module caused the problem so it can be replaced.

Checking your RAM for errors
If you suspect your system’s RAM may be at fault you can use a built-in utility to check your RAM in the background without affecting your running virtual machines. The RAM check utility runs in the VMkernel space and can be started by logging into the Service Console and typing service ramcheck start.

While RAM check is running it will log all activity and any errors to the /var/log/vmware directory in files called ramcheck.log and ramcheck-err.log. One drawback, however, is that it’s hard to test all of your RAM with this utility if you have virtual machines (VMs) running, as it will only test unused RAM in the ESX system. A more thorough method of testing your server’s RAM is to shutdown ESX, boot from a CD, and run Memtest86+.

Using the vm-support utility
If you contact VMware support, they will usually ask you to run the vm-support utility that packages all of the ESX server log and configuration files into a single file. To run this utility, simply log in to the service console with root access, and type “vm-support” without any options. The utility will run and create a single Tar file that will be named “esx—..tgz”. You can send it via FTP to VMware support. Make sure you delete the Tar file from the ESX Server once you are done to save disk space.

Alternatively, you can generate the same file by using the VMware Infrastructure Client (VI Client). Select Administration, then Export Diagnostic Data, and select your host (VirtualCenter data optional) and a directory on your local PC to store the file that will be created.

Using log files for troubleshooting
Log files are generally your best tool for troubleshooting any type of problem. ESX has many log files. Which ones you should check depends on the problem you are experiencing. Below is the list of ESX log files that you will commonly use to troubleshoot ESX server problems. The VMkernel and hosted log files are usually the logs you will want to check first.

  • VMkernel – /var/log/vmkernel – Records activities related to the virtual machines and ESX server. Rotated with a numeric extension, current log has no extension, most recent has a “.1” extension.
  • VMkernel Warnings – /var/log/vmkwarning – Records activities with the virtual machines, a subset of the VMkernel log and uses the same rotation scheme.
  • VMkernel Summary – /var/log/vmksummary – Used to determine uptime and availability statistics for ESX Server; readable summary found in /var/log/vmksummary.txt.
  • ESX Server host agent log – /var/log/vmware/hostd.log – Contains information on the agent that manages and configures the ESX Server host and its virtual machines. (Search the file date/time stamps to find the log file it is currently outputting to, or open hostd.log, which is linked to the current log file.)
  • ESX Firewall log – /var/log/vmware/esxcfg-firewall.log – Logs all firewall rule events.
  • ESX Update log – /var/log/vmware/esxupdate.log – Logs all updates done through the esxupdate tool.
  • Service Console – /var/log/messages – Contains all general log messages used to troubleshoot virtual machines or ESX Server.
  • Web Access – /var/log/vmware/webAccess – Records information on web-based access to ESX Server.
  • Authentication log – /var/log/secure – Contains records of connections that require authentication, such as VMware daemons and actions initiated by the xinetd daemon.
  • Vpxa log – /var/log/vmware/vpx – Contains information on the agent that communicates with VirtualCenter. Search the file date/time stamps to find the log file it is currently outputting to, or open vpxa.log, which is linked to the current log file.

As part of the troubleshooting process, often times you’ll need to find out the version of various ESX components and which patches are applied. Below are some commands you can run from the service console to do this:

  • Type vmware -v to check ESX Server version, i.e., VMware ESX Server 3.0.1 build-32039
  • Type esxupdate -l query to see which patches are installed.
  • Type vpxa -v to check the ESX Server management version, i.e. VMware VirtualCenter Agent Daemon 2.0.1 build-40644.
  • Type rpm -qa | grep VMware-esx-tools to check the ESX Server VMware Tools installed version, i.e., VMware-esx-tools-3.0.1-32039.

If all else fails, restart the VMware host agent service
Many ESX problems can be resolved by simply restarting the VMware host agent service (vmware-hostd), which is responsible for managing most of the operations on the ESX host. To do this, log into the service console and type service mgmt-vmware restart.

NOTE: ESX 3.0.1 contained a bug that would restart all your VMs if your ESX server was configured to use auto-startups for your VMs. This bug was fixed in a patch for 3.0.1 and also in 3.0.2, but appeared again in ESX 3.5 with another patch released to fix it. It’s best to temporarily disable auto-startups before you run this command.

In some cases restarting the vmware-vpxa service when you restart the host agent will fix problems that occur between ESX and both the VI Client and VirtualCenter. This service is the management agent that handles all communication between ESX and its clients. To restart it, log into the ESX host and type service vmware-vpxa restart. It is important to note that restarting either of these services will not impact the operation of your virtual machines (with the exception of the bug noted above).

Fixing a frozen service console
Another problem that can occur is that your Service Console can hang and not allow you to log in locally. This can be caused by hardware lock-ups or a deadlocked condition. Your VMs may continue to operate normally when this occurs, but rebooting ESX is usually the only way to recover. Before you do that, however, try shutting down your guest VMs and/or using VMotion to migrate them to another ESX host. To do this, use the VI Client, connect remotely via SSH, or use one of the alternate/emergency consoles, which you can access by pressing Alt-F2 through Alt-F6. You can also press Alt-F12 to display VMkernel messages on the console screen.

If you are able to shutdown or move your VMs, then you can try rebooting the server by issuing the reboot command through the VI Client or alternate consoles. If not, cold-booting the server is your only option.

Lost network configurations
Another problem that can occur is that you may lose part or all of your networking configuration. If this happens, you must rebuild your network by using the ESX local service console, since you will be unable to connect using the VI Client. VMware has published knowledgebase articles that detail how to rebuild your networking using the esxcfg-* service console commands and also how to verify your network settings.

Conclusion
In this tip, I have addressed a few of the most common problems that can occur with VMware ESX. In the next installment of this series, I will cover troubleshooting VirtualCenter issues.

Check the following links for solutions to other possible ESX problems:

This article relates to Platespin, but may assist with VMware Converter as well. I have come across these problems with servers that have been built from images in the past.
Server Details Discovery Problems

Article © Platespin KB

Checklist for discovering Windows based source and target servers:

  • Ensure that Windows Management Instrumentation (WMI) is installed and that the service is running
  • Ensure that DCOM is enabled (on the source machine and PowerConvert Server – see instructions below on how to check if DCOM is enabled)
  • Ensure that the RPC service is running
  • Ensure that the Remote registry service is running (Please see Q20371 for more details)
  • Ensure that File and Printer Sharing for Microsoft Networks is installed and enabled
  • Ensure that the administrative shares Admin$ and C$ on the source server are accessible from the PowerConvert server
  • Administrative credentials are required when discovering Windows servers in order to remotely gather the necessary server details. 
    Try the discovery using a local admin account with the user name syntax: hostname\LocalAdmin or try the discovery with a domain admin account with the user name syntax: domain\DomainAdmin
  • If you are using the UPN (User Principal Name) format for the credentials, ensure that the PowerConvert Server is installed on a Windows 2003 Server ( see Q20293 for more information)
  • If you are experiencing problems discovering a Windows NT 4.0 Server, ensure that the latest WMI core version is installed. (See the link below to download WMI Core V1.5 from Microsoft’s web site).  For more details on troubleshooting Windows NT 4.0 Server discoveries, please see Q20035
  • If you are experiencing problems discovering Windows XP Professional machines, please see Q20350 for more details
  • If problems still exist after checking the above, please see the related articles below

If you are still experiencing problems discovering servers after verifying the above, Microsoft’s Windows Management Instrumentation Tester (WBEMTest) can be used to troubleshoot WMI connections as follows:

1. On the machine where the PowerConvert Server is installed, click on the Windows Start button and then select Run
Type “Wbemtest” (without the quotes) and press Enter.

2. In the Namespace enter in the name of the machine you are trying to discover with \root\cimv2 appended.  For example,  if your machine name is win2k you would enter: \\win2k\root\cimv2 as the namespace

3. Enter in the appropriate credentials using either the hostname\LocalAdmin or domain\DomainAdmin format

4. Click on the Connect button to test the WMI connection.  If an error message is returned after clicking on  “Connect”, this indicates that a WMI connection cannot be established between the machine where PowerConvert is installed and the machine being discovered.  Please contact your Support Administrator for further assistance.


How do I check if DCOM is enabled?

1. Click on the Windows Start button and select Run

2. Type “dcomcnfg” (without quotes) and press Enter

3. On a Windows NT/2000 server machine, the DCOM Configuration dialog window should appear.  Click on the Default Properties tab and ensure that “Enable Distributed COM on this computer” is checked.

4. For Windows 2003, the Component Services window will appear instead.  In the Computers folder of the console tree of the Component Services administrative tool, right-click on the computer for which you want to check if DCOM is enabled and then click on Properties.  Click on the Default Properties tab and ensure that “Enable Distributed COM on this computer” is checked.

5. If DCOM was not enabled, please enable it and restart the Windows Management Instrumentation Service (if rebooting the server is not possible) and try the discovery again

Administrative shares do not appear on server

http://support.microsoft.com/kb/245117