Wednesday, 31 August 2011

Linux disk cache


Experiments and fun with the Linux disk cache

Hopefully you are now convinced that Linux didn't just eat your ram. Here are some interesting things you can do to learn how the disk cache works.

Effects of disk cache on application memory allocation

Since I've already promised that disk cache doesn't prevent applications from getting the memory they want, let's start with that. Here is a C app (munch.c) that gobbles up as much memory as it can, or up to a specified limit:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char** argv) {
    int max = -1;
    int mb = 0;
    char* buffer;

    /* Optional argument: stop after this many megabytes. */
    if(argc > 1)
        max = atoi(argv[1]);

    /* Grab memory 1MB at a time until malloc fails or we hit the limit.
       The memset forces each page to actually be mapped and counted. */
    while((buffer=malloc(1024*1024)) != NULL && mb != max) {
        memset(buffer, 0, 1024*1024);
        mb++;
        printf("Allocated %d MB\n", mb);
    }

    return 0;
}
Running out of memory isn't fun, but the OOM killer should end just this process and hopefully the rest will be unperturbed. We'll definitely want to disable swap for this, or the app will gobble up that as well.
$ sudo swapoff -a

$ free -m
             total       used       free     shared    buffers     cached
Mem:          1504       1490         14          0         24        809
-/+ buffers/cache:        656        848
Swap:            0          0          0

$ gcc munch.c -o munch

$ ./munch
Allocated 1 MB
Allocated 2 MB
(...)
Allocated 877 MB
Allocated 878 MB
Allocated 879 MB
Killed

$ free -m
             total       used       free     shared    buffers     cached
Mem:          1504        650        854          0          1         67
-/+ buffers/cache:        581        923
Swap:            0          0          0

$
Even though it said 14MB "free", that didn't stop the application from grabbing 879MB. Afterwards, the cache is pretty empty, but it will gradually fill up again as files are read and written. Give it a try.

Effects of disk cache on swapping

I also said that disk cache won't cause applications to use swap. Let's try that as well, with the same 'munch' app as in the last experiment. This time we'll run it with swap on, and limit it to a few hundred megabytes:
$ free -m
             total       used       free     shared    buffers     cached
Mem:          1504       1490         14          0         10        874
-/+ buffers/cache:        605        899
Swap:         2047          6       2041

$ ./munch 400
Allocated 1 MB
Allocated 2 MB
(...)
Allocated 399 MB
Allocated 400 MB

$ free -m
             total       used       free     shared    buffers     cached
Mem:          1504       1090        414          0          5        485
-/+ buffers/cache:        598        906
Swap:         2047          6       2041

munch ate 400MB of RAM, which was taken from the disk cache without resorting to swap. Likewise, we can fill the disk cache again and it will not start eating swap either. If you run watch free -m in one terminal, and find . -type f -exec cat {} + > /dev/null in another, you can see that "cached" will rise while "free" falls. After a while it tapers off, but swap is never touched.¹
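The two commands above run in separate terminals. If you want to script the observation instead, a small helper like this can snapshot the "cached" figure (cached_mb is just an illustrative name; the awk field index assumes the classic free -m layout shown in these transcripts, and newer versions of free print different columns):

```shell
# Warm the page cache by reading every file under the current directory
# (same command as in the text above):
#   find . -type f -exec cat {} + > /dev/null

# Helper: pull the last field ("cached") out of the Mem: line of the
# classic free -m output. A sketch, not a portable tool.
cached_mb() {
    awk '/^Mem:/ {print $NF}'
}

# Example, using a Mem: line captured from this article:
printf 'Mem:  1504  1490  14  0  24  809\n' | cached_mb
```

Running cached_mb before and after the find makes the growth of the cache easy to log.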

Clearing the disk cache

For experimentation, it's very convenient to be able to drop the disk cache. For this, we can use the special file /proc/sys/vm/drop_caches. By writing 3 to it, we can clear most of the disk cache:
$ free -m
             total       used       free     shared    buffers     cached
Mem:          1504       1471         33          0         36        801
-/+ buffers/cache:        633        871
Swap:         2047          6       2041

$ echo 3 | sudo tee /proc/sys/vm/drop_caches 
3

$ free -m
             total       used       free     shared    buffers     cached
Mem:          1504        763        741          0          0        134
-/+ buffers/cache:        629        875
Swap:         2047          6       2041

Notice how "buffers" and "cached" went down, free mem went up, and free+buffers/cache stayed the same.

Effects of disk cache on load times

Let's make two test programs, one in Python and one in Java. Python and Java both come with pretty big runtimes, which have to be loaded in order to run the application. This is a perfect scenario for disk cache to work its magic.
$ cat hello.py
print "Hello World! Love, Python"

$ cat Hello.java
class Hello { 
    public static void main(String[] args) throws Exception {
        System.out.println("Hello World! Regards, Java");
    }
}

$ javac Hello.java

$ python hello.py
Hello World! Love, Python

$ java Hello
Hello World! Regards, Java

$ 
Our hello world apps work. Now let's drop the disk cache, and see how long it takes to run them.
$ echo 3 | sudo tee /proc/sys/vm/drop_caches
3

$ time python hello.py
Hello World! Love, Python

real	0m1.026s
user	0m0.020s
sys	    0m0.020s

$ time java Hello
Hello World! Regards, Java

real	0m2.174s
user	0m0.100s
sys	    0m0.056s

$ 
Wow. 1 second for Python, and 2 seconds for Java? That's a lot just to say hello. However, now all the files required to run them are in the disk cache, so they can be fetched straight from memory. Let's try again:
$ time python hello.py
Hello World! Love, Python

real    0m0.022s
user    0m0.016s
sys     0m0.008s

$ time java Hello
Hello World! Regards, Java

real    0m0.139s
user    0m0.060s
sys     0m0.028s

$ 
Yay! Python now runs in just 22 milliseconds, while Java uses 139ms. That's a 98% improvement for Python and 94% for Java! This works the same way for every application.

Effects of disk cache on file reading

Let's make a big file and see how the disk cache affects how fast we can read it. I'm making a 200MB file, but if you have less free RAM, you can adjust the size.
$ echo 3 | sudo tee /proc/sys/vm/drop_caches
3

$ free -m
             total       used       free     shared    buffers     cached
Mem:          1504        546        958          0          0         85
-/+ buffers/cache:        461       1043
Swap:         2047          6       2041

$ dd if=/dev/zero of=bigfile bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 6.66191 s, 31.5 MB/s

$ ls -lh bigfile
-rw-r--r-- 1 vidar vidar 200M 2009-04-25 12:30 bigfile

$ free -m
             total       used       free     shared    buffers     cached
Mem:          1504        753        750          0          0        285
-/+ buffers/cache:        468       1036
Swap:         2047          6       2041

$ 

Since the file was just written, it will go in the disk cache. The 200MB file caused a 200MB bump in "cached". Let's read it, clear the cache, and read it again to see how fast it is:
$ time cat bigfile > /dev/null

real    0m0.139s
user    0m0.008s
sys     0m0.128s

$ echo 3 | sudo tee /proc/sys/vm/drop_caches
3

$ time cat bigfile > /dev/null

real    0m8.688s
user    0m0.020s
sys     0m0.336s

$ 
That's more than sixty times faster!

Conclusions

The Linux disk cache is very unobtrusive. It uses spare memory to greatly increase disk access speeds, without taking any memory away from applications. A fully used store of RAM on Linux is efficient hardware use, not a warning sign.


1. This is somewhat oversimplified. While newly allocated memory will always be taken from the disk cache instead of swap, Linux can be configured to preemptively swap out other unused applications in the background to free up memory for cache. This is tunable through the 'swappiness' setting, accessible through /proc/sys/vm/swappiness.
A server might want to swap out unused apps to speed up disk access of running ones (making the system faster), while a desktop system might want to keep apps in memory to prevent lag when the user finally uses them (making the system more responsive). This is the subject of much debate.
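For the curious, the current setting is easy to inspect, and changing it is a one-liner (the value 10 below is only an example; 60 is the usual default on 2.6-era kernels):

```shell
# Show the current swappiness (0 = avoid swapping, higher = swap more
# eagerly; 60 is the common default). Guarded so this is a no-op on
# systems without /proc -- the example assumes Linux.
[ -r /proc/sys/vm/swappiness ] && cat /proc/sys/vm/swappiness || true

# To change it for the running system (root required); example value only:
#   sudo sysctl vm.swappiness=10
# To persist across reboots, set vm.swappiness=10 in /etc/sysctl.conf.
```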

Clear filesystem memory cache


How to clear or drop the cache buffer pages from Linux memory

Introduction

The cache is where the Linux kernel stores information it may need again later. Since memory is incredibly much faster than disk, it is great that the kernel takes care of this automatically.
You can also manipulate how the cache behaves, though there is usually no need to: Linux manages your computer's memory very efficiently, and will automatically free RAM and drop the cache if an application needs the memory. Still, let's see how to force Linux to drop the cache from memory.


Writing to /proc/sys/vm/drop_caches will cause the kernel to drop clean caches, dentries and inodes from memory, causing that memory to become free.



Since kernel 2.6.16, you can control how the cache behaves. There are four possible "positions" for the switch:

0 -> Gives the kernel full control of the cache memory
1 -> Frees the page cache
2 -> Frees dentries and inodes
3 -> Frees dentries and inodes as well as the page cache

So, just echo the desired value into /proc/sys/vm/drop_caches, as root, running sync first so that dirty pages are written back (drop_caches only frees clean pages):

sync; echo 0 > /proc/sys/vm/drop_caches
sync; echo 1 > /proc/sys/vm/drop_caches
sync; echo 2 > /proc/sys/vm/drop_caches
sync; echo 3 > /proc/sys/vm/drop_caches

Better, use sysctl instead of echoing directly:

sync; /sbin/sysctl vm.drop_caches=3

or, if you prefer the echo form from a sudo-capable account (the redirection must happen in a root shell):

sync; sudo sh -c "echo 1 > /proc/sys/vm/drop_caches"

This file contains the documentation for the sysctl files in /proc/sys/vm and is valid for Linux kernel version 2.6.29. The files in this directory can be used to tune the operation of the virtual memory (VM) subsystem of the Linux kernel and the writeout of dirty data to disk. Default values and initialization routines for most of these files can be found in mm/swap.c.

Currently, these files are in /proc/sys/vm:

- block_dump
- compact_memory
- dirty_background_bytes
- dirty_background_ratio
- dirty_bytes
- dirty_expire_centisecs
- dirty_ratio
- dirty_writeback_centisecs
- drop_caches
- extfrag_threshold
- hugepages_treat_as_movable
- hugetlb_shm_group
- laptop_mode
- legacy_va_layout
- lowmem_reserve_ratio
- max_map_count
- memory_failure_early_kill
- memory_failure_recovery
- min_free_kbytes
- min_slab_ratio
- min_unmapped_ratio
- mmap_min_addr
- nr_hugepages
- nr_overcommit_hugepages
- nr_pdflush_threads
- nr_trim_pages (only if CONFIG_MMU=n)
- numa_zonelist_order
- oom_dump_tasks
- oom_kill_allocating_task
- overcommit_memory
- overcommit_ratio
- page-cluster
- panic_on_oom
- percpu_pagelist_fraction
- stat_interval
- swappiness
- vfs_cache_pressure
- zone_reclaim_mode

==============================================================

block_dump
block_dump enables block I/O debugging when set to a nonzero value.

==============================================================

compact_memory

Available only when CONFIG_COMPACTION is set. When 1 is written to the file, all zones are compacted such that free memory is available in contiguous blocks where possible. This can be important for example in the allocation of huge pages, although processes will also directly compact memory as required.

==============================================================

dirty_background_bytes

Contains the amount of dirty memory at which the pdflush background writeback daemon will start writeback. Note: dirty_background_bytes is the counterpart of dirty_background_ratio. Only one of them may be specified at a time. When one sysctl is written it is immediately taken into account to evaluate the dirty memory limits and the other appears as 0 when read.

==============================================================

dirty_background_ratio

Contains, as a percentage of total system memory, the number of pages at which the pdflush background writeback daemon will start writing out dirty data.

==============================================================

dirty_bytes

Contains the amount of dirty memory at which a process generating disk writes will itself start writeback. Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may be specified at a time. When one sysctl is written it is immediately taken into account to evaluate the dirty memory limits and the other appears as 0 when read. Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any value lower than this limit will be ignored and the old configuration will be retained.

==============================================================

dirty_expire_centisecs

This tunable is used to define when dirty data is old enough to be eligible for writeout by the pdflush daemons. It is expressed in 100'ths of a second. Data which has been dirty in-memory for longer than this interval will be written out next time a pdflush daemon wakes up.

==============================================================

dirty_ratio

Contains, as a percentage of total system memory, the number of pages at which a process which is generating disk writes will itself start writing out dirty data.

==============================================================

dirty_writeback_centisecs

The pdflush writeback daemons will periodically wake up and write `old' data out to disk. This tunable expresses the interval between those wakeups, in 100'ths of a second. Setting this to zero disables periodic writeback altogether.

==============================================================

drop_caches

Writing to this will cause the kernel to drop clean caches, dentries and inodes from memory, causing that memory to become free.

To free pagecache: echo 1 > /proc/sys/vm/drop_caches
To free dentries and inodes: echo 2 > /proc/sys/vm/drop_caches
To free pagecache, dentries and inodes: echo 3 > /proc/sys/vm/drop_caches

As this is a non-destructive operation and dirty objects are not freeable, the user should run `sync' first.

==============================================================

extfrag_threshold
This parameter affects whether the kernel will compact memory or direct reclaim to satisfy a high-order allocation. /proc/extfrag_index shows what the fragmentation index for each order is in each zone in the system. Values tending towards 0 imply allocations would fail due to lack of memory, values towards 1000 imply failures are due to fragmentation and -1 implies that the allocation will succeed as long as watermarks are met. The kernel will not compact memory in a zone if the fragmentation index is <= extfrag_threshold. The default value is 500.

==============================================================

hugepages_treat_as_movable

This parameter is only useful when kernelcore= is specified at boot time to create ZONE_MOVABLE for pages that may be reclaimed or migrated. Huge pages are not movable so are not normally allocated from ZONE_MOVABLE. A non-zero value written to hugepages_treat_as_movable allows huge pages to be allocated from ZONE_MOVABLE. Once enabled, ZONE_MOVABLE is treated as an area of memory the huge pages pool can easily grow or shrink within. Assuming that applications are not running that mlock() a lot of memory, it is likely the huge pages pool can grow to the size of ZONE_MOVABLE by repeatedly entering the desired value into nr_hugepages and triggering page reclaim.

==============================================================

hugetlb_shm_group

hugetlb_shm_group contains the group id that is allowed to create SysV shared memory segments using hugetlb pages.

==============================================================

laptop_mode

laptop_mode is a knob that controls "laptop mode".

==============================================================

legacy_va_layout

If non-zero, this sysctl disables the new 32-bit mmap layout - the kernel will use the legacy (2.4) layout for all processes.

==============================================================

lowmem_reserve_ratio
For some specialised workloads on highmem machines it is dangerous for the kernel to allow process memory to be allocated from the "lowmem" zone. This is because that memory could then be pinned via the mlock() system call, or by unavailability of swapspace. And on large highmem machines this lack of reclaimable lowmem memory can be fatal. So the Linux page allocator has a mechanism which prevents allocations which _could_ use highmem from using too much lowmem. This means that a certain amount of lowmem is defended from the possibility of being captured into pinned user memory. (The same argument applies to the old 16 megabyte ISA DMA region. This mechanism will also defend that region from allocations which could use highmem or lowmem.)

The `lowmem_reserve_ratio' tunable determines how aggressive the kernel is in defending these lower zones. If you have a machine which uses highmem or ISA DMA and your applications are using mlock(), or if you are running with no swap, then you probably should change the lowmem_reserve_ratio setting.

The lowmem_reserve_ratio is an array. You can see it by reading this file:

% cat /proc/sys/vm/lowmem_reserve_ratio
256     256     32

Note: the number of elements is one fewer than the number of zones, because the highest zone's value is not needed for the following calculation. These values are not used directly. The kernel calculates the number of protection pages for each zone from them. These are shown as an array of protection pages in /proc/zoneinfo, like the following (an example from an x86-64 box). Each zone has an array of protection pages like this:

Node 0, zone      DMA
  pages free     1355
        min      3
        low      3
        high     4
        :
        numa_other 0
  protection: (0, 2004, 2004, 2004)
  pagesets
    cpu: 0 pcp: 0
        :

These protections are added to the score used to judge whether a zone should be used for page allocation or should be reclaimed. In this example, if normal pages (index=2) are required of this DMA zone and watermark[WMARK_HIGH] is used as the watermark, the kernel judges that this zone should not be used because pages_free (1355) is smaller than watermark + protection[2] (4 + 2004 = 2008). If this protection value is 0, this zone would be used for a normal page requirement. If the requirement is for the DMA zone (index=0), protection[0] (=0) is used.

zone[i]'s protection[j] is calculated by the following expression:

(i < j): zone[i]->protection[j] = (total sums of present_pages from zone[i+1] to zone[j] on the node) / lowmem_reserve_ratio[i];
(i = j): (should not be protected. = 0;)
(i > j): (not necessary, but looks 0)

The default values of lowmem_reserve_ratio[i] are 256 (if zone[i] means the DMA or DMA32 zone) and 32 (others). As the expression shows, they are the reciprocal of the ratio: 256 means 1/256. The number of protection pages becomes about "0.39%" of the total present pages of higher zones on the node. If you would like to protect more pages, smaller values are effective. The minimum value is 1 (1/1 -> 100%).

==============================================================

max_map_count

This file contains the maximum number of memory map areas a process may have. Memory map areas are used as a side-effect of calling malloc, directly by mmap and mprotect, and also when loading shared libraries. While most applications need less than a thousand maps, certain programs, particularly malloc debuggers, may consume lots of them, e.g., up to one or two maps per allocation. The default value is 65536.

==============================================================

memory_failure_early_kill

Control how to kill processes when an uncorrected memory error (typically a 2-bit error in a memory module) is detected in the background by hardware and cannot be handled by the kernel. In some cases (like the page still having a valid copy on disk) the kernel will handle the failure transparently without affecting any applications.
But if there is no other up-to-date copy of the data, it will kill to prevent any data corruption from propagating.

1: Kill all processes that have the corrupted and not reloadable page mapped as soon as the corruption is detected. Note this is not supported for a few types of pages, like kernel internally allocated data or the swap cache, but works for the majority of user pages.
0: Only unmap the corrupted page from all processes and only kill a process that tries to access it.

The kill is done using a catchable SIGBUS with BUS_MCEERR_AO, so processes can handle this if they want to. This is only active on architectures/platforms with advanced machine check handling and depends on the hardware capabilities. Applications can override this setting individually with the PR_MCE_KILL prctl.

==============================================================

memory_failure_recovery

Enable memory failure recovery (when supported by the platform).

1: Attempt recovery.
0: Always panic on a memory failure.

==============================================================

min_free_kbytes

This is used to force the Linux VM to keep a minimum number of kilobytes free. The VM uses this number to compute a watermark[WMARK_MIN] value for each lowmem zone in the system. Each lowmem zone gets a number of reserved free pages based proportionally on its size. Some minimal amount of memory is needed to satisfy PF_MEMALLOC allocations; if you set this to lower than 1024KB, your system will become subtly broken, and prone to deadlock under high loads. Setting this too high will OOM your machine instantly.

==============================================================

min_slab_ratio

This is available only on NUMA kernels. A percentage of the total pages in each zone. On zone reclaim (fallback from the local zone occurs), slabs will be reclaimed if more than this percentage of pages in a zone are reclaimable slab pages. This ensures that slab growth stays under control even in NUMA systems that rarely perform global reclaim. The default is 5 percent. Note that slab reclaim is triggered in a per-zone / node fashion. The process of reclaiming slab memory is currently not node specific and may not be fast.

==============================================================

min_unmapped_ratio

This is available only on NUMA kernels. This is a percentage of the total pages in each zone. Zone reclaim will only occur if more than this percentage of pages are in a state that zone_reclaim_mode allows to be reclaimed. If zone_reclaim_mode has the value 4 OR'd, then the percentage is compared against all file-backed unmapped pages including swapcache pages and tmpfs files. Otherwise, only unmapped pages backed by normal files, but not tmpfs files and similar, are considered. The default is 1 percent.

==============================================================

mmap_min_addr

This file indicates the amount of address space which a user process will be restricted from mmapping. Since kernel null dereference bugs could accidentally operate based on the information in the first couple of pages of memory, userspace processes should not be allowed to write to them. By default this value is set to 0 and no protections will be enforced by the security module. Setting this value to something like 64k will allow the vast majority of applications to work correctly and provide defense in depth against future potential kernel bugs.

==============================================================
nr_hugepages

Change the minimum size of the hugepage pool. See Documentation/vm/hugetlbpage.txt

==============================================================

nr_overcommit_hugepages

Change the maximum size of the hugepage pool. The maximum is nr_hugepages + nr_overcommit_hugepages.

==============================================================

nr_pdflush_threads

The current number of pdflush threads. This value is read-only. The value changes according to the number of dirty pages in the system. When necessary, additional pdflush threads are created, one per second, up to nr_pdflush_threads_max.

==============================================================

nr_trim_pages

This is available only on NOMMU kernels. This value adjusts the excess page trimming behaviour of power-of-2 aligned NOMMU mmap allocations. A value of 0 disables trimming of allocations entirely, while a value of 1 trims excess pages aggressively. Any value >= 1 acts as the watermark where trimming of allocations is initiated. The default value is 1. See Documentation/nommu-mmap.txt for more information.

==============================================================

numa_zonelist_order

This sysctl is only for NUMA. 'Where the memory is allocated from' is controlled by zonelists. (This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 for a simple explanation; you may be able to read ZONE_DMA as ZONE_DMA32.)

In the non-NUMA case, a zonelist for GFP_KERNEL is ordered as follows:

ZONE_NORMAL -> ZONE_DMA

This means that a memory allocation request for GFP_KERNEL will get memory from ZONE_DMA only when ZONE_NORMAL is not available.

In the NUMA case, you can think of the following two types of order. Assume a 2-node NUMA system; below is the zonelist of Node(0)'s GFP_KERNEL:

(A) Node(0) ZONE_NORMAL -> Node(0) ZONE_DMA -> Node(1) ZONE_NORMAL
(B) Node(0) ZONE_NORMAL -> Node(1) ZONE_NORMAL -> Node(0) ZONE_DMA

Type (A) offers the best locality for processes on Node(0), but ZONE_DMA will be used before ZONE_NORMAL exhaustion. This increases the possibility of out-of-memory (OOM) of ZONE_DMA, because ZONE_DMA tends to be small. Type (B) cannot offer the best locality but is more robust against OOM of the DMA zone.

Type (A) is called "Node" order. Type (B) is "Zone" order. "Node order" orders the zonelists by node, then by zone within each node; specify "[Nn]ode" for node order. "Zone order" orders the zonelists by zone type, then by node within each zone; specify "[Zz]one" for zone order. Specify "[Dd]efault" to request automatic configuration. Autoconfiguration will select "node" order in the following cases: (1) if the DMA zone does not exist, or (2) if the DMA zone comprises greater than 50% of the available memory, or (3) if any node's DMA zone comprises greater than 60% of its local memory and the amount of local memory is big enough. Otherwise, "zone" order will be selected. Default order is recommended unless this is causing problems for your system/application.
==============================================================

oom_dump_tasks

Enables a system-wide task dump (excluding kernel threads) to be produced when the kernel performs an OOM-killing, including such information as pid, uid, tgid, vm size, rss, cpu, oom_adj score, and name. This is helpful to determine why the OOM killer was invoked and to identify the rogue task that caused it. If this is set to zero, this information is suppressed. On very large systems with thousands of tasks it may not be feasible to dump the memory state information for each one. Such systems should not be forced to incur a performance penalty in OOM conditions when the information may not be desired. If this is set to non-zero, this information is shown whenever the OOM killer actually kills a memory-hogging task. The default value is 1 (enabled).

==============================================================

oom_kill_allocating_task

This enables or disables killing the OOM-triggering task in out-of-memory situations. If this is set to zero, the OOM killer will scan through the entire tasklist and select a task based on heuristics to kill. This normally selects a rogue memory-hogging task that frees up a large amount of memory when killed. If this is set to non-zero, the OOM killer simply kills the task that triggered the out-of-memory condition. This avoids the expensive tasklist scan. If panic_on_oom is selected, it takes precedence over whatever value is used in oom_kill_allocating_task. The default value is 0.

==============================================================

overcommit_memory

This value contains a flag that enables memory overcommitment. When this flag is 0, the kernel attempts to estimate the amount of free memory left when userspace requests more memory. When this flag is 1, the kernel pretends there is always enough memory until it actually runs out. When this flag is 2, the kernel uses a "never overcommit" policy that attempts to prevent any overcommit of memory. This feature can be very useful because there are a lot of programs that malloc() huge amounts of memory "just-in-case" and don't use much of it. The default value is 0. See Documentation/vm/overcommit-accounting and security/commoncap.c::cap_vm_enough_memory() for more information.

==============================================================

overcommit_ratio

When overcommit_memory is set to 2, the committed address space is not permitted to exceed swap plus this percentage of physical RAM. See above.

==============================================================

page-cluster

page-cluster controls the number of pages which are written to swap in a single attempt (the swap I/O size). It is a logarithmic value: setting it to zero means "1 page", setting it to 1 means "2 pages", setting it to 2 means "4 pages", etc. The default value is three (eight pages at a time). There may be some small benefits in tuning this to a different value if your workload is swap-intensive.

==============================================================
panic_on_oom

This enables or disables the panic-on-out-of-memory feature. If this is set to 0, the kernel will kill some rogue process via the oom_killer. Usually the oom_killer can kill a rogue process and the system will survive. If this is set to 1, the kernel panics when out-of-memory happens. However, if a process limits its allocations to certain nodes with mempolicy/cpusets, and those nodes reach memory exhaustion, one process may be killed by the oom-killer. No panic occurs in this case, because other nodes' memory may be free and the system as a whole may not yet be in a fatal state. If this is set to 2, the kernel panics compulsorily even in the above-mentioned case. Even if OOM happens under a memory cgroup, the whole system panics. The default value is 0. Values 1 and 2 are for failover of clustering; select one according to your failover policy. panic_on_oom=2 + kdump gives you a very strong tool to investigate why OOM happens: you can get a snapshot.

==============================================================

percpu_pagelist_fraction

This is the fraction of pages at most (high mark pcp->high) in each zone that are allocated for each per-cpu page list. The minimum value for this is 8, meaning that we don't allow more than 1/8th of the pages in each zone to be allocated in any single per_cpu_pagelist. This entry only changes the value of hot per-cpu pagelists. A user can specify a number like 100 to allocate 1/100th of each zone to each per-cpu page list. The batch value of each per-cpu pagelist is also updated as a result; it is set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8). The initial value is zero. The kernel does not use this value at boot time to set the high water marks for each per-cpu page list.

==============================================================

stat_interval

The time interval between which vm statistics are updated. The default is 1 second.

==============================================================
swappiness This control is used to define how aggressive the kernel will swap memory pages. Higher values will increase agressiveness, lower values decrease the amount of swap. The default value is 60. ============================================================== vfs_cache_pressure ------------------ Controls the tendency of the kernel to reclaim the memory which is used for caching of directory and inode objects. At the default value of vfs_cache_pressure=100 the kernel will attempt to reclaim dentries and inodes at a "fair" rate with respect to pagecache and swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will never reclaim dentries and inodes due to memory pressure and this can easily lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100 causes the kernel to prefer to reclaim dentries and inodes. ============================================================== zone_reclaim_mode: Zone_reclaim_mode allows someone to set more or less aggressive approaches to reclaim memory when a zone runs out of memory. If it is set to zero then no zone reclaim occurs. Allocations will be satisfied from other zones / nodes in the system. This is value ORed together of 1 = Zone reclaim on 2 = Zone reclaim writes dirty pages out 4 = Zone reclaim swaps pages zone_reclaim_mode is set during bootup to 1 if it is determined that pages from remote zones will cause a measurable performance reduction. The page allocator will then reclaim easily reusable pages (those page cache pages that are currently not used) before allocating off node pages. It may be beneficial to switch off zone reclaim if the system is used for a file server and all of memory should be used for caching files from disk. In that case the caching effect is more important than data locality. 
Allowing zone reclaim to write out pages stops processes that are writing large amounts of data from dirtying pages on other nodes. Zone reclaim will write out dirty pages if a zone fills up and so effectively throttle the process. This may decrease the performance of a single process since it cannot use all of system memory to buffer the outgoing writes anymore but it preserve the memory on other nodes so that the performance of other processes running on other nodes will not be affected. Allowing regular swap effectively restricts allocations to the local node unless explicitly overridden by memory policies or cpuset configurations.
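These tunables live under /proc/sys/vm on Linux. A minimal sketch of reading them programmatically, returning None when a tunable does not exist on the running system (e.g. non-NUMA kernels without zone_reclaim_mode):

```python
# Minimal sketch: read a vm tunable from /proc/sys/vm (Linux only).
# Returns None if the file is absent or unreadable on this system.
def read_vm_tunable(name):
    try:
        with open("/proc/sys/vm/" + name) as f:
            return int(f.read().split()[0])
    except (OSError, ValueError, IndexError):
        return None

if __name__ == "__main__":
    for t in ("swappiness", "vfs_cache_pressure", "zone_reclaim_mode"):
        print(t, "=", read_vm_tunable(t))
```

Writing a value works the same way in reverse (or via `sysctl -w vm.swappiness=10`), but requires root.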





Network Troubleshooting


Network Troubleshooting Using Packet Capture Utilities


1.0 Introduction


Purpose:

The purpose of this paper is to demonstrate how to monitor and troubleshoot common network-based applications using standard UNIX packet capture utilities. A basic understanding of the TCP/IP stack and Ethernet hardware is assumed.

Packet Capture/hSniffing the Wireh

Promiscuous mode generally refers to the practice of putting a network card into a setting so that it passes all traffic it receives to the CPU, rather than just packets addressed to it. Many operating systems require superuser privileges to operate in promiscuous mode. A non-routing node in promiscuous mode can generally only monitor traffic to and from other nodes within the same collision domain (for Ethernet and wireless LAN) or ring (for Token Ring or FDDI), which is why network switches are used to combat malicious use of promiscuous mode. A router may monitor all traffic that it routes. Promiscuous mode is commonly used to diagnose network connectivity issues. Programs like Ethereal, tcpdump, and AirSnort (for wireless LANs) make use of this feature to show the user all the data being transferred over the network. Some programs like FTP and Telnet transfer data and passwords in clear text, without encryption, and network sniffers can see this data. Therefore, computer users are encouraged to stay away from programs like telnet and use more secure ones such as SSH.

2.0 libpcap Based Tools


The libpcap library is a system-independent interface for user-level packet capture. Many UNIX utilities use the libpcap interface as their underlying packet capture engine. Due to the portability of this code, all utilities that use the libpcap library share the same syntax. The most common utilities that use the libpcap library are tcpdump and ethereal.

2.1 The tcpdump Utility


The tcpdump utility is the most common packet capture utility for UNIX based systems. It is highly versatile
and filterable. Here is a standard run of the command:
box:~# tcpdump

tcpdump: WARNING: eth0: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes

There are a few very helpful options. In the following example, the "-i" option specifies another Ethernet interface besides the default (eth0). Also included is the "-n" option, which turns off host and port resolution. By default, the tcpdump utility will attempt to resolve IP addresses, which can lead to significant delays in output due to the latency of network lookups. This makes "-n" a very handy option when attempting to monitor high volumes of traffic.
© 2005 Darren Hoch - webmaster [at] litemail.org 1



box:~# tcpdump -ni eth1
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 96 bytes

19:47:22.935554 IP 192.168.1.105.32783 > 67.110.253.165.993: P 3432387228:3432387265(37)
ack 2742259796 win 63712 <nop,nop,timestamp 239675 1064682926>

19:47:22.967508 IP 67.110.253.165.993 > 192.168.1.105.32783: P 1:54(53) ack 37 win 1984
<nop,nop,timestamp 1064879093 239675>

Here is a breakdown of a single packet:
Real Time: 19:47:22.967508
Source IP Address: IP 67.110.253.165.993
Direction of Packet Flow: >
Destination Address: 192.168.1.105.32783:
TCP Flags: P
TCP Sequence Number: 1:
Next TCP Sequence Number: 54(53) # original sequence (1) + payload (53) = next sequence (54)
TCP ACK Number: ack 37
TCP Window Size: win 1984
TCP Options: <nop,nop,timestamp 1064879093 239675>
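The sequence-number arithmetic in the breakdown above can be checked mechanically. A hypothetical helper (not part of tcpdump) that pulls the `first:next(len)` fields out of a tcpdump summary line and verifies that the first sequence number plus the payload length equals the next sequence number:

```python
import re

# A tcpdump summary line like the one broken down above.
LINE = ("19:47:22.967508 IP 67.110.253.165.993 > 192.168.1.105.32783: "
        "P 1:54(53) ack 37 win 1984")

def check_seq(line):
    # Match the "first:next(payload)" field, e.g. "1:54(53)".
    m = re.search(r"(\d+):(\d+)\((\d+)\)", line)
    first, nxt, payload = map(int, m.groups())
    return first + payload == nxt   # 1 + 53 == 54

print(check_seq(LINE))  # True
```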

If the volume of traffic to be monitored is too great for a standard terminal window buffer, the captured packets can be written to a file instead of STDOUT with the "-w" (write) option and then read back in with the "-r" (read) option.

box:~# tcpdump -w /tmp/tcp.out -ni eth1
tcpdump: listening on eth1, link-type EN10MB (Ethernet), capture size 96 bytes
46 packets captured
46 packets received by filter
0 packets dropped by kernel

box:~# tcpdump -r /tmp/tcp.out -ni eth1
reading from file /tmp/tcp.out, link-type EN10MB (Ethernet)
19:56:07.190888 IP 192.168.1.105.32783 > 67.110.253.165.993: P 3432387731:3432387768(37)
ack 2742260475 win 63712 <nop,nop,timestamp 292100 1065283060>
19:56:07.227315 IP 67.110.253.165.993 > 192.168.1.105.32783: P 1:54(53) ack 37 win 1984
<nop,nop,timestamp 1065403449 292100>
<<snip>>
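The file produced by "-w" is an ordinary pcap capture file. As a sketch (assuming the classic libpcap savefile format), its fixed 24-byte global header can be built and decoded with nothing but the Python standard library; link-type 1 is the EN10MB (Ethernet) value seen in the tcpdump output above:

```python
import struct

# Build a minimal little-endian pcap global header for illustration:
# magic, version major/minor, thiszone, sigfigs, snaplen, linktype.
header = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 96, 1)

magic, major, minor, _zone, _sig, snaplen, linktype = struct.unpack("<IHHiIII", header)
print(hex(magic), snaplen, linktype)  # 0xa1b2c3d4 96 1
```

Reading the first 24 bytes of a real /tmp/tcp.out and unpacking them the same way should show the 96-byte capture size reported by tcpdump.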

2.2 The ethereal Utility


Just like tcpdump, ethereal is based on the libpcap interface. There are two main versions of ethereal: the text version, called "tethereal", and the GUI-based version, called "ethereal". The text-based version is very similar in syntax to the tcpdump command. Once again, this is because they use the same libpcap engine.


box:~# tethereal -w /tmp/ethereal.out -ni eth1
Capturing on eth1
0.327450 192.168.1.105 > 67.110.253.165 TLS Application Data
0.361175 67.110.253.165 > 192.168.1.105 TLS Application Data
0.361220 192.168.1.105 > 67.110.253.165 TCP 32783 > 993 [ACK] Seq=37 Ack=53 Win=63712 Len=0 TSV=389797 TSER=1066380554
0.363460 192.168.1.105 > 67.110.253.165 TLS Application Data
0.410951 67.110.253.165 > 192.168.1.105 TLS Application Data

box:~# tethereal -r /tmp/ethereal.out
6 2.543822 192.168.1.105 > 67.110.253.165 TLS Application Data
7 2.593330 67.110.253.165 > 192.168.1.105 TLS Application Data
8 2.593375 192.168.1.105 > 67.110.253.165 TCP 32783 > imaps [ACK] Seq=37 Ack=53 Win=63712 Len=0 TSV=412045 TSER=1066603077
9 2.595989 192.168.1.105 > 67.110.253.165 TLS Application Data

2.3 The dsniff Utility


Unlike the previously mentioned utilities, dsniff takes packet capture one level further. Using the underlying
libpcap engine, dsniff takes the packets captured and attempts to report something a little more useful. The
dsniff program is one of many utilities in the dsniff package. The standard dsniff command will attempt to
capture and replay all unencrypted sessions including: FTP, telnet, SMTP, IMAP, and POP. The following
example demonstrates how to use dsniff to audit telnet and ftp sessions:

box:~# dsniff -ni eth1

dsniff: listening on eth1

06/01/05 20:35:46 tcp 192.168.1.105.32883 > 192.168.1.220.21 (ftp)
USER darren
PASS darren$$$$

06/01/05 20:37:53 tcp 192.168.1.105.32889 > 192.168.1.220.23 (telnet)
darren
darren$$$$
ls
ls -l
ps -ef
exit

2.4 The snort Utility


The snort utility is the most common open source intrusion detection system. Like dsniff, it attempts to make sense out of traffic. Whereas dsniff simply reports clear-text payloads, snort attempts to identify malicious traffic patterns using signatures of known malicious packets. Like all the other utilities, snort uses libpcap as the underlying engine. The following is a very basic run of snort.

box:~# snort -i eth1 -D -d -u snort -g snort -c /etc/snort/snort.conf

The snort utility will run in daemon mode in the background. All alerts are written to a text file
/var/log/snort/alert. Here is a sample scan from a possible intruder:

[root@targus ~]# nmap -p 22 -sX 192.168.1.105

This questionable port scan of the ssh port (22) is obfuscated by the -sX (Christmas scan) switch.
Here is the output from the alert file on the host running snort:

box:~# tail -f /var/log/snort/alert
[**] [1:1228:7] SCAN nmap XMAS [**]
[Classification: Attempted Information Leak] [Priority: 2]
06/01-20:42:02.813099 192.168.1.220:37325 > 192.168.1.105:22
TCP TTL:47 TOS:0x0 ID:15194 IpLen:20 DgmLen:40
**U*P**F Seq: 0x82A00B2 Ack: 0x0 Win: 0x1000 TcpLen: 20 UrgPtr: 0x0
[Xref => http://www.whitehats.com/info/IDS30]

3.0 Sun Solaris snoop Utility


Snoop is specific to Sun Microsystems' Solaris UNIX. Although there is a port of tcpdump for Solaris, there is no port of snoop for LINUX. Much like tcpdump, it is a utility that puts your system's interface(s) in promiscuous mode. Although similar in design goals, snoop uses its own packet capture library, so the options are a little different. The following example demonstrates how to run snoop on an alternate interface (-d) with name resolution disabled (-r).

pilate > snoop -r -d hme0

Using device /dev/hme (promiscuous mode)
66.27.208.74 > 67.110.253.164 TCP D=22 S=32897 Ack=1172114638 Seq=1267082364 Len=0
Win=11200 Options=<nop,nop,tstamp 663391 727479722>
67.110.253.164 > 66.27.208.74 TCP D=32897 S=22 Push Ack=1267082364 Seq=1172114638
Len=208 Win=47824 Options=<nop,nop,tstamp 727479725 663391>

Like libpcap utilities, snoop also enables the redirection of packets to a file instead of STDOUT.

pilate > snoop -o /tmp/snoop.out -r -d hme0

Using device /dev/hme (promiscuous mode)
pilate > snoop -i /tmp/snoop.out
1 0.00000 cpe-66-27-208-74.socal.res.rr.com > pilate TCP D=22 S=32897 Ack=1172118318 Seq=1267084668 Len=0 Win=15600 Options=<nop,nop,tstamp 670581 727486912>

2 0.00005 pilate > cpe-66-27-208-74.socal.res.rr.com TCP D=32897 S=22 Push Ack=1267084668 Seq=1172118318 Len=112 Win=47824 Options=<nop,nop,tstamp 727486915 670581>

4.0 Using Filter Expressions


It may be easy to identify specific traffic streams on small or idle networks; it is much harder on large WANs or saturated networks. The ability to use filter expressions is extremely important in these cases to cut unwanted "noise" packets out of the traffic in question. Fortunately, the libpcap-based utilities and the snoop utility all use the same filter syntax. There are many ways to filter traffic in all utilities; the most common filters are by port, protocol, and host. The following example tracks only telnet traffic and host 192.168.1.105:
[root@targus ~]# tcpdump -ni eth0 port 23 and host 192.168.1.105

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
21:13:26.905262 IP 192.168.1.105.32899 > 192.168.1.220.telnet: S 1903904803:1903904803(0)
win 5840 <mss 1460,sackOK,timestamp 722613 0,nop,wscale 0>

[root@targus ~]# tethereal -ni eth0 port 23 and host 192.168.1.105
Capturing on eth0
0.000000 192.168.1.105 > 192.168.1.220 TCP 32900 > 23 [SYN] Seq=0 Ack=0 Win=5840 Len=0 MSS=1460 TSV=729689 TSER=0 WS=0

box:~# dsniff -ni eth1 port 23 and host 192.168.1.105
dsniff: listening on eth1 [port 23 and host 192.168.1.105]

06/01/05 21:11:16 tcp 192.168.1.105.32901 > 192.168.1.220.23 (telnet)
root
cangetin
pilate > snoop -r -d hme0 port 53 and host 192.168.1.105
There are other cases where an administrator may want to capture all but certain types of traffic. Specifically, a lot of "noise" can be generated if one is trying to run a packet capture while logged into the remote host: much of the traffic captured will be the control traffic back to that host. The following example shows how to filter out the ssh control traffic to and from the control connection (192.168.1.105 connected to 192.168.1.220 as root) and all DNS traffic.

[root@targus ~]# who
root pts/2 Jun 1 21:30 (192.168.1.105)

[root@targus ~]# tcpdump -ni eth0 not host 192.168.1.105 and not port 53
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
21:32:07.692524 IP 216.93.214.50.51606 > 192.168.1.220.ssh: S 2550704294:2550704294(0)
win 5840 <mss 1460,sackOK,timestamp 431608120 0,nop,wscale 2>
21:32:07.692596 IP 192.168.1.220.ssh > 216.93.214.50.51606: S 729994889:729994889(0) ack
2550704295 win 5792 <mss 1460,sackOK,timestamp 111008380 431608120,nop,wscale 2>
21:32:07.796911 IP 216.93.214.50.51606 > 192.168.1.220.ssh: . ack 1 win 1460
<nop,nop,timestamp 431608221 111008380>
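The semantics of a filter like "port 23 and host 192.168.1.105" can be sketched as a simple predicate over (src, sport, dst, dport) tuples. This is a toy model of filter evaluation, not the real BPF compiler:

```python
# Toy model of the filter "port 23 and host 192.168.1.105":
# match packets where either endpoint uses the port AND either IP matches.
def match(pkt, port=23, host="192.168.1.105"):
    src, sport, dst, dport = pkt
    return port in (sport, dport) and host in (src, dst)

telnet_syn = ("192.168.1.105", 32899, "192.168.1.220", 23)
dns_query  = ("192.168.1.105", 32772, "4.2.2.2", 53)
print(match(telnet_syn), match(dns_query))  # True False
```

A "not" filter such as "not host 192.168.1.105 and not port 53" is just the negation of each primitive, which is why it is so effective at removing the capture session's own control traffic.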

5.0 Protocol Layer Problems

5.1.0 ARP Layer Problems - No ARP Reply

The ping command is often used to test whether or not a remote host has a configured network stack. When a ping hangs or no reply is received, the assumption is that the host is not up or working. This may be true; however, there could be multiple issues at the ARP layer preventing a working host from communicating on the network. The following example demonstrates what an expected ARP exchange looks like BEFORE a ping command can commence. It shows the ARP REQUEST by the source host followed by the ARP REPLY from the destination host.
box:~# ping 192.168.1.102
PING 192.168.1.102 (192.168.1.102) 56(84) bytes of data.
64 bytes from 192.168.1.102: icmp_seq=1 ttl=128 time=5.83 ms


box:~# tcpdump -ni eth1 arp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 96 bytes
21:43:55.459856 arp who-has 192.168.1.102 tell 192.168.1.105
21:43:55.462638 arp reply 192.168.1.102 is-at 00:0f:1f:17:ab:a7

Here is an example of a failed ping attempt. Notice that the ARP REQUEST was never answered by the
destination host (192.168.1.107). This could mean that the destination host is not online.


box:~# ping 192.168.1.107
PING 192.168.1.107 (192.168.1.107) 56(84) bytes of data.
From 192.168.1.105 icmp_seq=1 Destination Host Unreachable
From 192.168.1.105 icmp_seq=2 Destination Host Unreachable


box:~# tcpdump -ni eth1 arp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 96 bytes
21:36:48.995766 arp who-has 192.168.1.107 tell 192.168.1.105
21:36:49.992668 arp who-has 192.168.1.107 tell 192.168.1.105
21:36:50.992667 arp who-has 192.168.1.107 tell 192.168.1.105

5.1.1 ARP Layer Problems - Duplicate IP Addresses


When two hosts have the same IP address assigned, there will be two ARP REPLYs to the ARP REQUEST. The first reply enters the source host's ARP table; the problem is that the first reply may come from the wrong host. The following example demonstrates how a Windows XP and a LINUX host compete for the same IP address. On the surface, the ping makes it appear that the destination host (expected to be LINUX) is up and responding.
box:~# ping 192.168.1.102

PING 192.168.1.102 (192.168.1.102) 56(84) bytes of data.
64 bytes from 192.168.1.102: icmp_seq=1 ttl=128 time=5.83 ms
However, a capture of ARP traffic shows that two replies were sent to the original request. The first reply was from the Windows host; the LINUX host's reply came after.

box:~# tcpdump -ni eth1 arp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 96 bytes
22:04:35.074216 arp who-has 192.168.1.102 tell 192.168.1.105
22:04:35.078448 arp reply 192.168.1.102 is-at 00:40:f4:83:48:24 # Windows Host
22:04:35.079562 arp reply 192.168.1.102 is-at 00:0f:1f:17:ab:a7 # LINUX Host

A check of the ARP cache shows that the Windows XP Ethernet address is the one populated in the cache.
box:~# arp -a


targus (192.168.1.220) at 00:02:55:74:41:1B [ether] on eth1
? (192.168.1.1) at 00:06:25:77:63:8B [ether] on eth1
? (192.168.1.102) at 00:40:f4:83:48:24 [ether] on eth1
An attempt to use ssh to connect to the remote host fails because the source host is connecting to the Windows XP host instead of the LINUX host, using the Windows XP Ethernet address.

box:~# ssh -v 192.168.1.102
OpenSSH_3.8.1p1 Debian-8.sarge.4, OpenSSL 0.9.7e 25 Oct 2004
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Connecting to 192.168.1.102 [192.168.1.102] port 22.
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Connecting to 192.168.1.102 [192.168.1.102] port 22.
^C
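A quick way to spot a duplicate IP in a capture is to look for one IP address answering from two different MAC addresses. A hypothetical helper (not part of tcpdump) over the (ip, mac) pairs seen in "arp reply ... is-at ..." lines:

```python
from collections import defaultdict

# (ip, mac) pairs extracted from the "arp reply ... is-at ..." lines above.
replies = [
    ("192.168.1.102", "00:40:f4:83:48:24"),  # Windows host
    ("192.168.1.102", "00:0f:1f:17:ab:a7"),  # LINUX host
]

# Map each IP to the set of MACs claiming it; more than one MAC = conflict.
claims = defaultdict(set)
for ip, mac in replies:
    claims[ip].add(mac)

conflicts = {ip for ip, macs in claims.items() if len(macs) > 1}
print(conflicts)  # {'192.168.1.102'}
```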

5.2.0 IP Problems - Misconfigured Broadcast Address

The broadcast address is often overlooked when diagnosing network problems. Some network communications rely on a properly configured broadcast address, including NTP, RIPv1, and ICMP broadcast pings. The following example demonstrates a standard ICMP broadcast ping and an associated reply.


box:~# ifconfig eth1

eth1 Link encap:Ethernet HWaddr 00:06:53:E4:8D:B8
inet addr:192.168.1.105 Bcast:192.168.1.255 Mask:255.255.255.0


box:~# ping -b 192.168.1.255
WARNING: pinging broadcast address
PING 192.168.1.255 (192.168.1.255) 56(84) bytes of data.
64 bytes from 192.168.1.105: icmp_seq=1 ttl=64 time=0.052 ms
64 bytes from 192.168.1.1: icmp_seq=1 ttl=150 time=3.54 ms (DUP!)
64 bytes from 192.168.1.220: icmp_seq=1 ttl=64 time=4.43 ms (DUP!)
If the broadcast address is not consistent with the rest of the hosts on the network, none of those hosts will reply to the broadcast ping. In the following example, the correct subnet mask is 255.255.255.0; however, the source host has a misconfigured netmask of 255.0.0.0.

box:~# ifconfig eth1
eth1 Link encap:Ethernet HWaddr 00:06:53:E4:8D:B8
inet addr:192.168.1.105 Bcast:192.255.255.255 Mask:255.0.0.0
A broadcast ping returns no replies from the network.

box:~# ping -b 192.168.1.255
PING 192.168.1.255 (192.168.1.255) 56(84) bytes of data.
From 192.168.1.105 icmp_seq=1 Destination Host Unreachable
A packet capture further confirms the problem. The correct broadcast is 192.168.1.255; however, since the source host's broadcast is 192.255.255.255, the source host is mistakenly trying to ARP for 192.168.1.255, thinking it is a real host. The source host will never receive a valid ARP reply, as no host on this network can have .255 as its last octet.

box:~# tcpdump -ni eth1
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 96 bytes
22:31:01.512750 arp who-has 192.168.1.255 tell 192.168.1.105
22:31:02.512665 arp who-has 192.168.1.255 tell 192.168.1.105
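The mismatch is easy to reproduce with Python's standard ipaddress module: the same host address yields different broadcast addresses under the correct /24 netmask and the misconfigured /8 netmask.

```python
import ipaddress

# Correctly configured host: /24 netmask gives the expected broadcast.
good = ipaddress.ip_interface("192.168.1.105/255.255.255.0")
# Misconfigured host: a /8 netmask silently changes the broadcast address.
bad = ipaddress.ip_interface("192.168.1.105/255.0.0.0")

print(good.network.broadcast_address)  # 192.168.1.255
print(bad.network.broadcast_address)   # 192.255.255.255
```

With the /8 mask, 192.168.1.255 falls inside the host's (oversized) local network, which is exactly why it ARPs for it as if it were an ordinary host.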

5.2.1 IP Problems - Misconfigured Default Gateway

When a client host on a LAN can't communicate with the outside world, it can be one of 4 issues:

- no network connectivity
- misconfigured /etc/nsswitch.conf
- misconfigured or nonexistent /etc/resolv.conf for DNS
- misconfigured or wrong gateway information

The first 3 issues can be solved by viewing files and checking physical links. There is no real way to tell if the gateway entry is truly routing packets. The following example demonstrates how to monitor whether the gateway is routing packets.
The client host is unable to reach a host on the Internet.

box:~# ping yahoo.com
ping: unknown host yahoo.com
The client has a gateway configured in the routing table. However, the table alone cannot tell us whether the gateway is actually routing.
box:~# netstat -rn
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
192.168.1.0 0.0.0.0 255.255.255.0 U 0 0 0 eth1
0.0.0.0 192.168.1.220 0.0.0.0 UG 0 0 0 eth1

The following packet capture is taken from the router. Packets are coming in from the source host 192.168.1.105, but the interface is NOT showing any return packets.

[root@targus ~]# tcpdump -ni eth0 not port 22

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
09:01:39.063347 IP 192.168.1.105.32770 > 4.2.2.2.domain: 49762+ A? yahoo.com. (27)
09:01:44.075062 IP 192.168.1.105.32771 > 4.2.2.1.domain: 49762+ A? yahoo.com. (27)
From the client's perspective, the router is on the network, as it replies to a ping request.

box:~# ping 192.168.1.220
PING 192.168.1.220 (192.168.1.220) 56(84) bytes of data.
64 bytes from 192.168.1.220: icmp_seq=1 ttl=64 time=2.69 ms
64 bytes from 192.168.1.220: icmp_seq=2 ttl=64 time=2.95 ms

However, the client is not receiving any replies to its DNS requests. Packets are going to the router, but they are getting dropped.

box:~# tcpdump -ni eth1 not port 22

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 96 bytes
09:10:06.367238 IP 192.168.1.105.32772 > 4.2.2.2.53: 49099+ A? yahoo.com. (27)
09:10:11.381598 IP 192.168.1.105.32773 > 4.2.2.1.53: 49099+ A? yahoo.com. (27)
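The `netstat -rn` check can also be done programmatically. On LINUX, /proc/net/route lists the default route with destination `00000000` and stores addresses as little-endian hex. The sample below is a hypothetical snippet matching the routing table shown above:

```python
import socket
import struct

# Hypothetical /proc/net/route sample matching the netstat -rn output above
# (tab-separated; Destination/Gateway are little-endian hex IPv4 addresses).
SAMPLE = """Iface\tDestination\tGateway\tFlags
eth1\t0001A8C0\t00000000\t0001
eth1\t00000000\tDC01A8C0\t0003
"""

def default_gateway(text):
    for line in text.splitlines()[1:]:
        iface, dest, gw, flags = line.split("\t")
        if dest == "00000000":  # the default route
            # Decode the little-endian hex gateway into dotted-quad form.
            return socket.inet_ntoa(struct.pack("<I", int(gw, 16)))
    return None

print(default_gateway(SAMPLE))  # 192.168.1.220
```

Knowing the configured gateway still only proves configuration, not forwarding; the packet captures above remain the way to prove the router is (or is not) passing traffic.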

5.3.0 TCP Problems - Closed Ports


There are multiple reasons why a remote server may not respond to a client request. Examples have already been given for troubleshooting at the Ethernet, ARP, and IP levels. A common mistake is to assume that because a host is available at the IP level, it is also available at the TCP level.

The host is available at the IP level as per the ping replies.
box:~# ping 192.168.1.220
PING 192.168.1.220 (192.168.1.220) 56(84) bytes of data.
64 bytes from 192.168.1.220: icmp_seq=1 ttl=64 time=2.71 ms
64 bytes from 192.168.1.220: icmp_seq=2 ttl=64 time=2.64 ms

The telnet service is not available to the client.
box:~# telnet 192.168.1.220
Trying 192.168.1.220...
telnet: Unable to connect to remote host: Connection refused

Taking a look at the packet capture, it is clear that the telnet service is not running. The standard TCP reply to a connection attempt on a closed port is to send a "reset" (RST) flag to the source host.
box:~# tcpdump -ni eth1 port 23

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 96 bytes
09:18:12.495679 IP 192.168.1.105.32780 > 192.168.1.220.23: S 2557844136:2557844136(0) win
5840 <mss 1460,sackOK,timestamp 1643931 0,nop,wscale 0>
09:18:12.498346 IP 192.168.1.220.23 > 192.168.1.105.32780: R 0:0(0) ack 2557844137 win 0
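The RST in the capture above is what surfaces in userland as "Connection refused". A minimal Python sketch, probing a loopback port that is assumed to be closed:

```python
import socket

# Attempt a TCP connection; the kernel's RST for a closed port surfaces
# as ConnectionRefusedError (ECONNREFUSED), while a filtered port that
# silently drops the SYN surfaces as a timeout instead.
def probe(host, port, timeout=2):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "closed (RST received)"
    except socket.timeout:
        return "filtered (no reply)"

# Port 1 on loopback is assumed closed here.
print(probe("127.0.0.1", 1))
```

The "filtered" branch corresponds to the packet-filtered case covered later, where the client's SYNs are simply dropped and never answered.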

5.3.1 TCP Problems - TCP Wrapped Services

TCP Wrappers have been around for quite some time. Their purpose is to do host-based access control. Unlike with closed ports, the port on the server is open. Upon connection to the server, a check is made against the /etc/hosts.allow and /etc/hosts.deny files. If the client is allowed to connect, then a standard TCP connection is made. If not, the server closes the connection. The following is an example of a TCP wrapped ssh service.

box:~# ssh 192.168.1.220
ssh_exchange_identification: Connection closed by remote host

A look at the packet capture shows that upon determining that the client is denied, the ssh server initiates a
standard port closing through a series of FIN packet exchanges.

box:~# tcpdump -ni eth1 port 22
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 96 bytes
10:24:44.508208 IP 192.168.1.105.32786 > 192.168.1.220.22: S 2304868053:2304868053(0) win
5840 <mss 1460,sackOK,timestamp 2043132 0,nop,wscale 0>
10:24:44.510770 IP 192.168.1.220.22 > 192.168.1.105.32786: S 762146323:762146323(0) ack
2304868054 win 5792 <mss 1460,sackOK,timestamp 44455276 2043132,nop,wscale 2>
10:24:44.510807 IP 192.168.1.105.32786 > 192.168.1.220.22: . ack 1 win 5840
<nop,nop,timestamp 2043132 44455276>
10:24:49.526296 IP 192.168.1.220.22 > 192.168.1.105.32786: F 1:1(0) ack 1 win 1448
<nop,nop,timestamp 44460292 2043132>
10:24:49.526647 IP 192.168.1.105.32786 > 192.168.1.220.22: F 1:1(0) ack 2 win 5840
<nop,nop,timestamp 2043634 44460292>
10:24:49.529124 IP 192.168.1.220.22 > 192.168.1.105.32786: . ack 2 win 1448
<nop,nop,timestamp 44460295 2043634>

5.3.2 TCP Problems - Packet Filtered TCP Ports


When a port is blocked by a packet filter (iptables or IPFilter, for example), the service may be running but the port is filtered at the IP level. In this case, the client will send multiple SYN packets to the server, and the server will not respond simply because the packets have been dropped by the filter.

box:~# telnet 192.168.1.220
Trying 192.168.1.220...
telnet: Unable to connect to remote host: No route to host

A look at the packet capture shows that the client sent 3 TCP SYN packets to the telnet port (23) that were simply dropped by the server and never answered.
box:~# tcpdump -ni eth1 port 23
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 96 bytes
10:35:02.302120 IP 192.168.1.105.32787 > 192.168.1.220.23: S 2934062463:2934062463(0) win
5840 <mss 1460,sackOK,timestamp 2104912 0,nop,wscale 0>
10:35:13.847729 IP 192.168.1.105.32788 > 192.168.1.220.23: S 2951328215:2951328215(0) win
5840 <mss 1460,sackOK,timestamp 2106066 0,nop,wscale 0>
10:35:19.518837 IP 192.168.1.105.32789 > 192.168.1.220.23: S 2947681121:2947681121(0) win
5840 <mss 1460,sackOK,timestamp 2106633 0,nop,wscale 0>

6.0 Application Layer Problems


6.1.0 DHCP Problems - IP Address Already Assigned


The DHCP protocol is a largely transparent protocol on the network. The following packet capture shows the
pieces that comprise a DHCP exchange between a client and a DHCP server.

[root@targus dhcp]# tethereal -ni eth1 not port 22
4.471802 0.0.0.0 > 255.255.255.255 DHCP DHCP Discover Transaction ID 0x803fcf50

After the initial DHCP DISCOVER made by the client, the server sends out an ARP request to the network to see if any other host has taken the IP Address from the DHCP pool.


4.475738 00:06:25:77:63:8b > ff:ff:ff:ff:ff:ff ARP Who has 192.168.1.104? Tell 192.168.1.1

If there are no ARP replies, then the DHCP server offers the IP address to the client. The client accepts the lease and the DHCP server acknowledges that the lease is now assigned.


5.135171 192.168.1.1 > 255.255.255.255 DHCP DHCP Offer Transaction ID 0x803fcf50
5.135471 0.0.0.0 > 255.255.255.255 DHCP DHCP Request Transaction ID 0x803fcf50
5.139041 192.168.1.1 > 255.255.255.255 DHCP DHCP ACK Transaction ID 0x803fcf50

From this point on, a client will continue to ask for whatever address it previously had. If the client disconnects from the network and then tries to reconnect, the old IP address could easily be taken by another client. In the following packet capture, the client requests its old IP address of 192.168.1.104.

[root@targus dhcp]# tethereal -ni eth1 not port 22

Capturing on eth1

4.462923 0.0.0.0 > 255.255.255.255 DHCP DHCP Request Transaction ID 0x6ccda4f

The DHCP server sent out an ARP request and received a valid ARP reply from another host.

4.465737 00:06:25:77:63:8b > ff:ff:ff:ff:ff:ff ARP Who has 192.168.1.104? Tell 192.168.1.1
4.465762 00:c3:61:f9:42:a7 > 00:06:25:77:63:8b ARP 192.168.1.104 is at 00:c3:61:f9:42:a7
4.467191 192.168.1.1 > 192.168.1.104 ICMP [Malformed Packet]

Therefore, the old requested address of 192.168.1.104 is no longer available, and the DHCP server sends a "NAK" (no acknowledgment) packet back to the client, causing the client to start at the beginning with a DHCP DISCOVER.

4.471480 192.168.1.1 > 255.255.255.255 DHCP DHCP NAK Transaction ID 0x6ccda4f

6.1.1 DHCP Problems - Using DHCP for Manual Network Configuration
If there are hardware or software version issues that prevent a client from negotiating a lease from a DHCP server, then a NIC can be configured manually with all the networking information from a DHCP ACK message sent by the DHCP server. The following packet capture displays verbose output of a DHCP ACK. This packet includes all the information needed to successfully configure a NIC manually.

[root@targus ~]# ifconfig eth1 up
[root@targus ~]# ifup eth1
[root@targus ~]# tethereal -nVi eth1 port bootpc

Capturing on eth1
<<snip>>
Your (client) IP address: 192.168.1.100 (192.168.1.100)
Next server IP address: 192.168.1.1 (192.168.1.1)
Relay agent IP address: 0.0.0.0 (0.0.0.0)
Client MAC address: 00:40:f4:83:48:24 (CameoCom_83:48:24)
Server host name not given
Boot file name not given
Magic cookie: (OK)
Option 53: DHCP Message Type = DHCP ACK
Option 1: Subnet Mask = 255.255.255.0
Option 3: Router = 192.168.1.1
Option 6: Domain Name Server
IP Address: 66.75.160.62
IP Address: 66.75.160.41
IP Address: 66.75.160.37
Option 15: Domain Name = "socal.rr.com"
Option 51: IP Address Lease Time = 1 day
Option 54: Server Identifier = 192.168.1.1
End Option
Padding

6.1.2 Samba Problems - Unable to Connect to Samba Server

The Samba protocol, often used to network Windows clients to UNIX servers, is largely a broadcast protocol. When a Windows-based client fails to connect to a UNIX server, there is no useful debugging information available on the Windows host; there is simply an error message stating "Network Path Not Found" or something similar. The following packet capture shows that the client and server can't communicate over Samba due to the difference in broadcast addresses. The client (192.168.1.102) is attempting to discover the server (192.168.1.220); however, the two have different broadcast domains and therefore can't hear each other's requests.


[root@targus tmp]# tcpdump -ni eth0 not port 22 and not host 192.168.1.1
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
13:51:58.319893 IP 192.168.1.102.netbios-ns > 192.168.1.255.netbios-ns: NBT UDP PACKET(137): QUERY; REQUEST; BROADCAST
13:51:59.069690 IP 192.168.1.102.netbios-ns > 192.168.1.255.netbios-ns: NBT UDP PACKET(137): QUERY; REQUEST; BROADCAST
13:51:59.820093 IP 192.168.1.102.netbios-ns > 192.168.1.255.netbios-ns: NBT UDP PACKET(137): QUERY; REQUEST; BROADCAST
13:52:02.114031 IP 192.168.1.220.netbios-ns > 192.168.255.255.netbios-ns: NBT UDP PACKET(137): QUERY; REQUEST; BROADCAST
13:52:04.113961 IP 192.168.1.220.netbios-ns > 192.168.255.255.netbios-ns: NBT UDP PACKET(137): QUERY; REQUEST; BROADCAST

Monday, 29 August 2011

Configuring Multi-Path I/O for AIX client logical partitions


Scenario: Configuring Multi-Path I/O for AIX client logical partitions

Multi-Path I/O (MPIO) helps provide increased availability of virtual SCSI resources by providing redundant paths to the resource. This topic describes how to set up Multi-Path I/O for AIX® client logical partitions.
In order to provide MPIO to AIX client logical partitions, you must have two Virtual I/O Server logical partitions configured on your system. This procedure assumes that the disks are already allocated to both the Virtual I/O Server logical partitions involved in this configuration.
To configure MPIO, follow these steps. In this scenario, hdisk5 in the first Virtual I/O Server logical partition, and hdisk7 in the second Virtual I/O Server logical partition, are used in the configuration.
The following figure shows the configuration that will be completed during this scenario.
An illustration of an MPIO configuration with two Virtual I/O Server logical partitions.
Using the preceding figure as a guide, follow these steps:
  1. Using the HMC, create SCSI server adapters on the two Virtual I/O Server logical partitions.
  2. Using the HMC, create two virtual client SCSI adapters on the client logical partitions, each mapping to one of the Virtual I/O Server logical partitions.
  3. On either of the Virtual I/O Server logical partitions, determine which disks are available by typing lsdev -type disk. Your results look similar to the following:
    name            status     description
    
    hdisk3          Available  MPIO Other FC SCSI Disk Drive
    hdisk4          Available  MPIO Other FC SCSI Disk Drive
    hdisk5          Available  MPIO Other FC SCSI Disk Drive
    Select the disk that you want to use in the MPIO configuration. In this scenario, we selected hdisk5.
  4. Determine the ID of the disk that you have selected. For instructions, see Identifying exportable disks. In this scenario, the disk does not have an IEEE volume attribute identifier or a unique identifier (UDID), so we determine the physical volume identifier (PVID) by running the lspv hdisk5 command. Your results look similar to the following:
    hdisk5          00c3e35ca560f919                    None
    The second value is the PVID. In this scenario, the PVID is 00c3e35ca560f919. Note this value.
  5. List the attributes of the disk using the lsdev command. In this scenario, we typed lsdev -dev hdisk5 -attr. Your results look similar to the following:
    ..
    lun_id          0x5463000000000000               Logical Unit Number ID           False
    ..
    ..
    pvid            00c3e35ca560f9190000000000000000 Physical volume identifier       False
    ..
    reserve_policy  single_path                      Reserve Policy                   True
    Note the values for lun_id and reserve_policy. If the reserve_policy attribute is set to anything other than no_reserve, then you must change it. Set the reserve_policy to no_reserve by typing chdev -dev hdiskx -attr reserve_policy=no_reserve.
  6. On the second Virtual I/O Server logical partition, list the physical volumes by typing lspv. In the output, locate the disk that has the same PVID as the disk identified previously. In this scenario, the PVID for hdisk7 matched:
    hdisk7          00c3e35ca560f919                    None
    Tip: Although the PVID values should be identical, the disk numbers on the two Virtual I/O Server logical partitions might vary.
  7. Determine if the reserve_policy attribute is set to no_reserve using the lsdev command. In this scenario, we typed lsdev -dev hdisk7 -attr. You see results similar to the following:
    ..
    lun_id          0x5463000000000000               Logical Unit Number ID           False
    ..
    pvid            00c3e35ca560f9190000000000000000 Physical volume identifier       False
    ..
    reserve_policy  single_path                      Reserve Policy                   True
    If the reserve_policy attribute is set to anything other than no_reserve, you must change it. Set the reserve_policy to no_reserve by typing chdev -dev hdiskx -attr reserve_policy=no_reserve.
  8. On both Virtual I/O Server logical partitions, use the mkvdev command to create the virtual devices. In each case, use the appropriate hdisk value. In this scenario, we type the following commands:
    • On the first Virtual I/O Server logical partition, we typed mkvdev -vdev hdisk5 -vadapter vhost5 -dev vhdisk5
    • On the second Virtual I/O Server logical partition, we typed mkvdev -vdev hdisk7 -vadapter vhost7 -dev vhdisk7
    The same LUN is now exported to the client logical partition from both Virtual I/O Server logical partitions.
  9. AIX can now be installed on the client logical partition. 
  10. After you have installed AIX on the client logical partition, check for MPIO by running the following command:
    lspath
    You see results similar to the following:
    Enabled hdisk0 vscsi0
    Enabled hdisk0 vscsi1
    If one of the Virtual I/O Server logical partitions fails, the results of the lspath command look similar to the following:
    Failed  hdisk0 vscsi0
    Enabled hdisk0 vscsi1
    Unless a health check is enabled, the state continues to show Failed even after the disk has recovered. To have the state updated automatically, type chdev -l hdiskx -a hcheck_interval=60 -P. The client logical partition must be rebooted for this change to take effect.
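The failed-path state above can be spotted mechanically. A minimal sketch, with the lspath output simulated in a variable for illustration (on a real client partition you would capture the output of the actual lspath command instead):

```shell
# Sample lspath output mirroring the failure case shown above.
lspath_out='Failed  hdisk0 vscsi0
Enabled hdisk0 vscsi1'

# Print the virtual SCSI adapter of every path that is not Enabled.
failed=$(echo "$lspath_out" | awk '$1 != "Enabled" {print $3}')
echo "paths needing attention: $failed"
```

The same one-liner could be dropped into a cron job to alert on path failures between health-check intervals.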

rootvg: Creating a mksysb backup to tape



Question
Mksysb related questions / how to create and restore.
Answer
This document discusses the ‘mksysb’ command when run to a tape drive (rmt device).

What is a mksysb and why create one ?
Mksysb tape structure
Files important to the mksysb
Important information concerning mksysb flags
Creating a mksysb to a tape drive in AIX V5
Creating a mksysb to a tape drive in AIX V6
Verification of a mksysb
Restoring a mksysb
Restore menus
Restoring individual files or directories from a mksysb tape
FAQ


*Note : For all examples the tape drive will be referred to as /dev/rmt0. This may not be the case in your environment; simply substitute the correct tape drive number as needed. Furthermore, this document does not cover restoring mksysb images to systems other than the one the backup was taken from (cloning). 

What is a mksysb and why create one ?
A mksysb is a bootable backup of your root volume group. The mksysb process will backup all mounted JFS and JFS2 filesystem data. The file-system image is in backup-file format. The tape format includes a boot image, system/rootvg informational files, an empty table of contents, followed by the system backup (root
volume group) image. The root volume group image is in backup-file format, starting with the data files and then any optional map files.

When a bootable backup of a root volume group is created, the boot image reflects the currently running kernel. If the current kernel is the 64-bit kernel, the backup's boot image is also 64-bit, and it only boots 64-bit systems. If the current kernel is a 32-bit kernel, the backup's boot image is 32-bit, and it can boot both 32-bit and 64-bit systems.

In general the mksysb backup is the standard backup utility used to recover a system from an unusable state - whether that be a result of data corruption, a disk failure, or any other situation that leaves you in an unbootable state. You should create a mksysb backup on a schedule in line with how often your rootvg data changes, and always before any sort of system software upgrade.

A mksysb tape can also be used to boot a system into maintenance mode for work on the rootvg in cases where the system can not boot into normal mode.

Mksysb tape structure

When creating a mksysb to tape, 4 images are created in total:

+-----------+--------------+-------------+----------------+
| Bosboot   | Mkinsttape   | Dummy TOC   | rootvg         |
| Image     | Image        | Image       | data           |
|-----------+--------------+-------------+----------------|
|<----------- Block size 512 ----------->| Blksz defined  |
|                                        | by the device  |
+-----------+--------------+-------------+----------------+

Image #1:
The bosboot image contains a copy of the system's kernel and specific device drivers, allowing the user to boot from this tape.
blocksize: 512
format: raw
files: kernel, device drivers

Image #2:
The mkinsttape image contains files to be loaded into the RAM file system when you are booting into maintenance mode.
blocksize: 512
format: backbyname
files: ./image.data, ./tapeblksz, ./bosinst.data, and other commands required to initiate the restore.

Image #3:
The dummy image contains a single file containing the words "dummy toc". This image is used to make the mksysb tape contain the same number of images as a BOS Install tape. It is merely a reference to pre-AIX V4 days, when AIX was installed from tape.

Image #4:
The rootvg image contains data from the rootvg volume group (mounted JFS/JFS2 file systems only).
blocksize: determined by tape drive configuration on creation
format: backbyname (backup/restore)
files: rootvg, mounted JFS/JFS2 filesystems

WARNING: If the device block size is set to 0, mksysb will use a hardcoded value of 512 for the fourth image. This can cause the create and restore to take 5-10 times longer than expected. You should set your tape drive's block size to the recommended value for optimal performance.

Files important to the mksysb

There are a few files that the mksysb uses in order to successfully rebuild your rootvg environment. These files are located on the 2nd image of your mksysb tape. Three of the files you may find yourself working with are described below.

bosinst.data : This file can be used to pre-set the BOS menu options. Selections such as which disk to install to, kernel settings, and whether or not to recover TCP related information can all be set here.
This file is mainly used for non-prompted installations. Any option selected during a prompted install will override the corresponding setting in this file. 

 image.data : This file is responsible for holding information used to rebuild the rootvg structure before the data is restored. This information includes the sizes, names, maps, and mount points of logical volumes and file systems in the root volume group. It is extremely important that this file is up to date and correct, otherwise
the restore can fail. It is common to edit this file when it is necessary to break mirroring during a restore.

 tapeblksz : This is a small text file that indicates the block size the tape drive was changed to in order to write the 4th image of the mksysb. This information would be useful if you wanted to restore
individual files/directories from your mksysb image.

Important information concerning mksysb flags

It is very important that you understand the use and intent of a few of the flags used by the mksysb command. Improper use, lack of use, or use of certain flags in certain situations could cause your mksysb to be difficult to restore. In some cases it may cause your mksysb to be unrestorable.

-i : Calls the ‘mkszfile’ command, which updates the image.data file with current filesystem sizes and characteristics. This flag should always be used unless there is a very specific reason you do not wish to have this information updated. Failure to have an accurate image.data file can cause your mksysb restore to fail with “out of space” errors. 

-e : Allows you to exclude data by editing the /etc/exclude.rootvg file.

A few tips on excluding data from your mksysb are listed below :

There should be one entry per line of the file. It can be either a single file or directory name.
The correct format of each entry should be ^./<path>
Never use wildcards.
Do not leave extra spaces or blank lines in the file. 

While the /etc/exclude.rootvg file excludes data, bear in mind that it does not exclude the fact that a filesystem exists. For example if you have a 50Gig filesystem “/data” and add an entry in your /etc/exclude.rootvg file :
^./data

This will exclude all files in /data but it will still recreate the /data filesystem as a 50Gig filesystem (except it will now be empty).
The only way to truly exclude a filesystem from your mksysb would be to unmount the filesystem before initiating your mksysb.
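Because the formatting rules above are easy to get wrong, a quick sanity check can help. A minimal sketch that validates simulated /etc/exclude.rootvg content against those rules (one entry per line, ^./<path> form, no wildcards, no blank lines); on a real system you would read the actual file instead of the variable:

```shell
# Simulated exclude-file content for illustration.
exclude_content='^./data
^./home/olddumps'

bad=0
while IFS= read -r line; do
    case "$line" in
        '')           bad=1 ;;   # blank line
        *'*'*|*'?'*)  bad=1 ;;   # wildcard
        '^./'?*)      : ;;       # correct ^./<path> form
        *)            bad=1 ;;   # anything else is malformed
    esac
done <<EOF
$exclude_content
EOF
[ "$bad" -eq 0 ] && echo "exclude entries look well-formed"
```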

-p : Using this flag disables the software packing (compression) of the backup.

When creating a mksysb during any level of system activity it is recommended to use the “-p” flag. Failure to do so can cause “unpacking / file out of phase” errors during your mksysb restore.

These errors are fatal (unrecoverable) errors. No warning is given during the creation of the mksysb that notifies you of the possibility of having these errors during the restore.

You may want to make the “-p” flag compulsory when running your mksysb command so you do not run into this situation.

-X : This flag causes the system to automatically expand the /tmp filesystem during the backup if necessary. Approximately 32 MB of free space is required in /tmp.

For more information about these and other mksysb command flags, please refer to the mksysb man page.
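Putting the recommendations above together, the flag set suggested in this section can be assembled into a single command string. A sketch only; the device name is this document's running /dev/rmt0 example:

```shell
# -i : regenerate image.data; -p : disable software packing;
# -X : expand /tmp if needed. Device name per this document's examples.
device=/dev/rmt0
mksysb_cmd="mksysb -i -p -X $device"
echo "$mksysb_cmd"
```

On a real AIX system you would run the resulting command directly; here it is only printed so the flag combination is visible.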

Creating a mksysb to a tape drive in AIX V5

1. Using SMITTY :
# smitty mksysb
    Backup DEVICE or FILE.........................[/dev/rmt0]
    Create MAP files?.............................no
    EXCLUDE files?................................no    (-e)
    List files as they are backed up?.............no
    Verify readability if tape device?............no
    Generate new /image.data file?................yes   (-i)
    EXPAND /tmp if needed?........................no    (-X)
    Disable software packing of backup?...........no    (-p)
    Backup extended attributes?...................yes
    Number of BLOCKS to write in a single output..[]

The only required selection here is the tape drive to use for the backup. Default flags are listed above. Change flags as necessary for your environment / situation.
*Please refer to the section above entitled "Important information concerning mksysb flags".

2. From the command line :
# mksysb -i /dev/rmt0

This command reflects the default options listed in the above "smitty mksysb" output. It does not take into account any customization flags. Please review the section above entitled "Important information concerning mksysb flags" to be best informed concerning the flags that you should use.

Creating a mksysb to a tape drive in AIX V6

1. Using SMITTY :
# smitty mksysb
    Backup DEVICE or FILE.........................[/dev/rmt0]
    Create MAP files?.............................no
    EXCLUDE files?................................no    (-e)
    List files as they are backed up?.............no
    Verify readability if tape device?............no
    Generate new /image.data file?................yes   (-i)
    EXPAND /tmp if needed?........................no    (-X)
    Disable software packing of backup?...........no    (-p)
    Backup extended attributes?...................yes
    Number of BLOCKS to write in a single output..[]
    Location of existing mksysb image.............[]
    File system to use for temporary work space...[]
    Backup encrypted files?.......................yes
    Back up DMAPI filesystem files?...............yes

The only required selection here is the tape drive to use for the backup. Default flags are listed above. Change flags as necessary for your environment / situation.
*Please refer to the section above entitled "Important information concerning mksysb flags".

There are a few extra options with V6 mksysb using SMIT. The most notable being the option “Location of existing mksysb image”. You can now use an existing mksysb taken to file and copy that to tape. An
attempt will be made to make the tape a bootable tape. You should use a system at the same or higher technology level as the mksysb image if you choose to do this. The command line flag would be “-F”.

This does require a minimum of 100 MB free in /tmp. See the manpage for further information. This flag was introduced as a command line option in AIX V5 (5300-05).

2. From the command line :
# mksysb -i /dev/rmt0

This command reflects the default options listed in the above "smitty mksysb" output. It does not take into account any customization flags. Please review the section above entitled "Important information concerning mksysb flags" to be best informed concerning the flags that you should use.
Verification of a mksysb

There is no true verification of the "restorability" of a mksysb other than actually restoring it. Precautions such as understanding the flags used to create the mksysb, checking your error report for any tape drive related errors before running the mksysb, cleaning the tape drive regularly, and verifying the readability of the mksysb after creation are all good checks. If your system is in good health, your mksysb should be in good health. Similarly, if you create a mksysb of a system logging hundreds of disk errors, or a system with known filesystem corruption, your mksysb will likely retain that corruption.

To verify the readability of your backup run the following command :
# listvgbackup -Vf /dev/rmt0

Any errors that occur while reading the headers of any of the files will be displayed, otherwise only the initial backup header information will be displayed. Keep in mind that this check tests the readability of the
file only, not the writeability.
Restoring a mksysb

To restore a mksysb image you simply need to boot from the tape and verify your selections in the BOS menus. Next, we'll cover two booting scenarios: one in which your system is currently up and operational, and one in which your system is down.

1. If your system is currently running and you need to restore your mksysb, simply change the bootlist to reflect the tape drive and reboot the system.
# bootlist -m normal rmt0 
# shutdown -Fr 
 
2. If your system is in a down state, boot to the SMS menus and set your bootlist to reflect the tape drive. The SMS menu options are listed below. Your menu options may differ depending on your level of firmware; however, it should be clear enough from this document which options to choose if yours differ.

SMS - SYSTEM MANAGEMENT SERVICES
    1. Select Language
    2. Change Password Options
    3. View Error Log
    4. Setup Remote IPL (Remote Initial Program Load)
    5. Change SCSI Settings
    6. Select Console
--> 7. Select Boot Options

The next menu should come up :
--> 1. Select Install or Boot Device
    2. Configure Boot Device Order
    3. Multiboot Startup

The next menu will have the following :
Select Device Type :
    1. Diskette
    2. Tape
    3. CD/DVD
    4. IDE
    5. Hard Drive
    6. Network
--> 7. List all Devices 

The system will scan itself to determine which devices are available to boot from. All of your available boot devices will be displayed here. This menu can be a little tricky. If you have a device pre-selected it
will have a 1 next to it under the “Current Position” column. Use the “Select Device Number” listing to choose the device you want to boot from to change that. 

The next screen will offer you three choices :
    1. Information
--> 2. Normal Mode Boot
    3. Service Mode Boot

Restore menus

I. From the Installation and Maintenance Menu, select (2):
1) Start Installation Now with Default Settings
2) Change/Show Installation Settings and Install
3) Start Maintenance Mode for System Recovery
 
II. From the System Backup Installation and Settings menu, you'll see the default options taken from your "bosinst.data" file. If these are correct, select (0) and skip down to step IV below.
If you need to change any options, such as the disks you would like to install to, select (1):

Setting:                                  Current Choice(s):
1. Disk(s) where you want to install...   hdisk0
   Use Maps............................   No
2. Shrink File Systems.................   No
0. Install with the settings listed above.
 
To shrink the file systems to reclaim allocated free space, select option 2 so the setting is set to Yes. For the file systems to be restored with the same allocated space as the original system, make
sure option 2 is set to No.

III. Change Disk(s) Where You Want to Install.

Type one or more numbers for the disks to be used for installation and press Enter. The current choice is indicated by >>>. To deselect a choice, type the corresponding number and press Enter. At least one bootable disk must be selected. Choose the location by its SCSI ID.

       Name     Location Code   Size (MB)  VG Status  Bootable
>>> 1. hdisk0   00-01-00-0,0    70008      rootvg     yes
>>> 2. hdisk1   00-01-00-1,0    70008      rootvg     yes
    0. Continue with the choices indicated above
After the desired disks have been chosen, select (0) to continue.

IV. From the System Backup Installation and Settings menu, select (0) to continue:

Setting:                                      Current Choice(s):
1. Disk(s) where you want to install........  hdisk0...
2. Use Maps.................................  No
3. Shrink File Systems......................  No
0. Install with the settings listed above.

Restoring individual files or directories from a mksysb tape

You may at some point need to restore a file, several files, or directories from your mksysb. First find the block size at which the rootvg data (the 4th image) was written. Files are restored relative to your current location on the system when the restore command is executed. If you would like the files to return to their original location, run the restore command (step 3) from /; otherwise cd down to the path where you wish the file(s) to be restored.


1. Display the contents of the ./tapeblksz file on the mksysb to determine the correct block size the tape drive should be set to for the restore command.

# cd /tmp
# tctl -f /dev/rmt0 rewind
# chdev -l rmt0 -a block_size=512
# restore -s2 -xqvdf /dev/rmt0.1 ./tapeblksz
# cat ./tapeblksz
 
The output that is given is the blocksize to which the tape drive was set when the mksysb was made.

2. Next, set the blocksize of the tape drive accordingly by running the following command :
# chdev -l rmt0 -a block_size=<number in the ./tapeblksz file> 

3. Restore the files or directories by running the following commands :

# cd / (if the file is to be restored to its original place) 
# tctl -f /dev/rmt0 rewind
# restore -s4 -xqdvf /dev/rmt0.1 ./<pathname>

You can specify multiple <pathname> entries for multiple file(s)/directory structures to restore. Simply separate each entry with a space. Remember to always use a “./” before each pathname.
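Steps 1 and 2 above boil down to reading one number out of ./tapeblksz and feeding it to chdev. A minimal sketch with the file content simulated in a variable; after the real restore -s2 step you would cat the actual ./tapeblksz instead:

```shell
# Simulated ./tapeblksz content; the real file records the block size the
# drive was set to when the 4th image of the mksysb was written.
tapeblksz_content='1024'

# Build the chdev command that resets the drive to that block size.
blksz=$(echo "$tapeblksz_content" | awk '{print $1}')
chdev_cmd="chdev -l rmt0 -a block_size=$blksz"
echo "$chdev_cmd"
```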

**As an alternative, you can also use the 'restorevgfiles' command. In the interest of keeping this document "relatively" short, no further examples are given; please see the manpage for use of this command.

FAQ

This section provides answers to common questions concerning mksysb. It is not intended to diagnose any problem or perform problem determination; rather, these questions and answers are intended to prevent the need to open a problem ticket for a short question. If you have any questions that you feel might be helpful, please submit feedback on this document and they may be added. 

1. The rootvg on my mksysb tape has all JFS filesystems, and I'd like to change them to JFS2 filesystems. How can I do this ?
The only supported method of changing rootvg system filesystems from JFS to JFS2 would be to run a “New and Complete Overwrite” installation.

2. Does the mksysb command back up NFS mountpoints ? 
No, NFS mountpoints are not followed.

3. Will my non-root volume groups automatically mount after the restore completes ?
That setting is held in the VGDA of the disks that make up the volume group. There is now an option in the BOS menus that allows this to be set, so this should no longer be an issue. 

4. The document mentions I can restore files from my mksysb.
Are there any restrictions to what I should/should not restore ?
Absolutely. You do not want to restore any files that are critical to the system running.
Examples of files you do not want to restore: most library files, ODM files, applications, the kernel...
Examples of files safe to restore : /etc/group, /etc/passwd, cron related files, /home, any data filesystems you created....

5. How long will my mksysb take to restore ?
That is dependent on many factors - the amount of data that needs to be restored being the major player in the restore time. A ballpark rule of thumb would be 1.5 - 2x the time it took to create the mksysb. You
also have to consider reboot time.
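The 1.5 - 2x rule of thumb above is easy to apply with shell arithmetic. A sketch; create_min is a hypothetical creation time in minutes, not a value from this document:

```shell
# Hypothetical: the mksysb took 40 minutes to create.
create_min=40
low=$(( create_min * 3 / 2 ))   # 1.5x
high=$(( create_min * 2 ))      # 2x
echo "expect roughly ${low}-${high} minutes to restore, plus reboot time"
```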

6. The restore appears to be hung at 83%, what do I do ?
First you want to make sure this is a “true” hang. This point in the restore can take anywhere from 10 minutes to even upwards of 60 minutes depending on the size of the rootvg. Make sure you’ve given
it ample time to bypass this portion of the restore before becoming concerned.

7. I have a mksysb tape but I don’t know anything about it. Are there any commands that I can run to get information about the rootvg it contains ?
There are some very helpful 'lsmksysb' command options that can provide all sorts of information. Some of the things you can find out :
- the 'lslpp -L' output, to see what filesets are installed on that rootvg
- the 'lsvg -l rootvg' output, showing volume group information and the oslevel
- the total backup size, and the size of the volume group if shrunk to minimum