Friday, 16 September 2011

RHEL 5 : Crash Dump capturing for Red Hat Linux


RHEL 5 : Crash Dump capturing for Red Hat Linux


There are numerous occasions when a crash dump can be a valuable source of information when troubleshooting a system. The most common times are a system hang or a system panic.

Under Solaris[TM] on both SPARC(R) and x86 platforms, the mechanisms for getting a crash dump in these situations are well understood. Under Linux (specifically Red Hat) this situation is less clear.

This post explains about hot to get a crash dump from Red Hat linux to aid in troubleshooting system hangs after the operating system has been loaded. It covers which versions of RHEL are required, and the differences between 32bit and 64bit support.

RHEL crash dump Utilities:

The two main options for getting a crash dump (all pages in memory dumped to a file) under RHEL are netdump and diskdump.


Netdump – Supplied in RHEL 3 U1 and later – If you are on update 1 please see RHSA-2004:017-06 from Red Hat, this will allow 64 bit os dump as well – This will dump a vmcore file containing the entire contents of memory, over the network to a dedicated netdump-server. It will also dump a thread list and register info over the network to a log file. Kernel oops information will be dumped as well.

This allows for a central netdump-server, that can receive dumps and logs from multiple systems on a network and multiple architectures. This machine can be provisioned with large amounts of disk space and allows for central maintinance.

Secuity between the client and server is catered for.

There is a bug regarding netdump working across subnets (https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=90803). Currently the server and client need to be on the same subnet. Scheduled for fix in RHEL3 U5 and RHEL4 U1.

Netdump-server is on cd1 of RHEL3 and will need to be installed manually (rpm install)

Netdump client and the kernel modules are installed by default.







Diskdump – Supplied in RHEL3 U3 and later – This is more familiar to Solaris users, and is closer to the savecore facility in solaris. A dedicated partition is formatted to receive disk dumps. When the system panics, it will write a memory image to this partition. When the system comes back up, this partition will be checked and if it contains a valid dump image, this will be written back out to /var/crash (or another location) on the system. After this has completed, the dump partiton will be reformatted (which can take a while) ready to take another crash dump.

Diskdump is supplied with RHEL3 U3 or later on both 32bit and 64bit

Both the above methods supply a vmcore image, and a textual stack dump. They do not provide a namelist or symbol table. To analyze the resultant dump image, a kernel needs to be built with debug flags set, that is matched to the kernel the customer will be running. As most RHEL installs will use the default kernel, this isn’t as tricky as might be expected.

The contents of /boot on the customer system should be tar’d up, as it can contain useful system maps for assistance in performing a Red Hat Linux crash dump.

The crash analysis tool provided with Red Hat Linux ‘crash’ contains info in the manual page about what it requires. It can be run against a live kernel image as well.

Forcing Crash Dumps from Hung Linux systems

Setting up RHEL for crash dumps

The main method for forcing a crash dump from a hung Linux system is using the alt-sysrq-<key> combination. This is analogus to STOP-A (or L1-A) on a Sun SPARC system.

echo ?h? > /proc/sysrqtrigger will have the same effect as pressing alt-sysrq-h.

Enabling alt-sysrq key sequence

The alt-sysrq key sequence is disabled by default under RHEL. To enable it, edit /etc/sysctl.conf and set kernel.sysrq = 1

Netdump configuration

http://www.redhat.com/support/wpapers/redhat/netdump/

Netdump only works on i386, not x86_64 the netconsole.o kernel module is not supplied for x86_64. Even if you roll your own kernel, it will load, but not dump the memory image over the network in 64bit mode.

The server and client need to be on the same subnet.

On the server
chkconfig netdump-server on
service netdump-server start
create user netdump with password

On the client
Edit the file /etc/sysconfig/netdump and add a line like NETDUMPADDR=10.0.0.1
make sure the DEV= line reflects the ethernet adaptor that the server is accessible on (e.g. DEV=eth1)
chkconfig netdump on
service netdump propagate (will require netdump user/password on server)
service netdump start (make sure the module loads ok)

IF the server changes IP address or mac address, then all netdump client modules will need to be unloaded and reloaded.




netdump example outputCPU#0 is frozen. CPU#1 is executing netdump. CPU#2 is frozen. CPU#3 is frozen. < netdump activated - performing handshake with the client. > NETDUMP START! < handshake completed - listening for dump requests. > 0(79500)/


Diskdump configuration

Disk dump requires RHEL3 Update 3 or later. It works under 32bit and 64bit modes.

Create a new partition using fdisk.
NOTE: It MUST be bigger than the amount of physical memory in the system.

Swap partitions cannot be used for dump devices.

Format the newly created dump partition (a reboot may be required to reread the partition tables on the disk) with diskdumpfmt -f -p <device> (e.g. diskdumpfmt -f /dev/sdb2)

chkconfig diskdump on
service diskdump start

disk dump example outputCPU frozen: #0#1 CPU#1 is executing diskdump. start dumping
dumping memory...


and on the way back upINIT: Entering runlevel: 3 Entering non-interactive startup Saving panic dump: [ OK ] Formatting dump device: [ OK ] Starting diskdump: [ OK ]


No comments:

Post a Comment

Twitter Bird Gadget