Document OALWP01950320



Document Id: OALWP01950320
Date Loaded: 03-21-95

Description: HP-UX 10.0 HFS File System White Paper

HP-UX 10.0 HFS File System White Paper HP 9000 Series 700/800 Computers March 1995, First Edition LEGAL NOTICES The information in this document is subject to change without notice. Hewlett-Packard makes no warranty of any kind with regard to this manual, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. Hewlett-Packard shall not be held liable for errors contained herein or direct, indirect, special, incidental or consequential damages in connection with the furnishing, performance, or use of this material. Warranty. A copy of the specific warranty terms applicable to your Hewlett-Packard product and replacement parts can be obtained from your local Sales and Service Office. Restricted Rights Legend. Use, duplication, or disclosure by the U.S. Government Department is subject to restrictions as set forth in subparagraph (c) (1) (ii) of the Rights in Technical Data and Computer Software clause at DFARS 252.227-7013 for DOD agencies, and subparagraphs (c) (1) and (c) (2) of the Commercial Computer Software Restricted Rights clause at FAR 52.227-19 for other agencies. HEWLETT-PACKARD COMPANY 3000 Hanover Street Palo Alto, California 94304 U.S.A. Use of this manual and flexible disk(s) or tape cartridge(s) supplied for this pack is restricted to this product only. Additional copies of the programs may be made for security and back-up purposes only. Resale of the programs in their present form or with alterations, is expressly prohibited. Copyright Notices. (C)copyright 1983-95 Hewlett-Packard Company, all rights reserved. Reproduction, adaptation, or translation of this document without prior written permission is prohibited, except as allowed under the copyright laws. (C)copyright 1979, 1980, 1983, 1985-93 Regents of the University of California This software is based in part on the Fourth Berkeley Software Distribution under license from the Regents of the University of California. 
(C)copyright 1980, 1984, 1986 Novell, Inc.
(C)copyright 1986-1992 Sun Microsystems, Inc.
(C)copyright 1985-86, 1988 Massachusetts Institute of Technology.
(C)copyright 1989-93 The Open Software Foundation, Inc.
(C)copyright 1986 Digital Equipment Corporation.
(C)copyright 1990 Motorola, Inc.
(C)copyright 1990, 1991, 1992 Cornell University
(C)copyright 1989-1991 The University of Maryland.
(C)copyright 1988 Carnegie Mellon University.

Trademark Notices. UNIX is a registered trademark in the United States and other countries, licensed exclusively through X/Open Company Limited. X Window System is a trademark of the Massachusetts Institute of Technology. MS-DOS and Microsoft are U.S. registered trademarks of Microsoft Corporation. OSF/Motif is a trademark of the Open Software Foundation, Inc. in the U.S. and other countries.

First Edition: March 1995 (HP-UX Release 10.0)

==============================================================================

HP-UX 10.0 HFS File System
==========================

The predominant file system used by HP-UX is called the High Performance File System (HFS), which is also known as the McKusick (or BSD) file system. This white paper describes the structure of the file system and its relationship to the disks on which file systems reside.

The following additional resources are useful in gaining further understanding of the HP-UX file systems and how to administer them:

* HP-UX System Administration Tasks manual, for creating and managing file systems and disk space.
* Section (4) of the HP-UX Reference, for specifications of file-system formats.
* HP-UX 10.0 Documentation Map, identifying additional sources of information.
* Other white papers on file-system subjects, available from SupportLine.

To work effectively with file systems, you must understand their interrelationship with physical disks. Every file of the HFS file system is stored on a formatted mass storage medium, a disk.
The disk is known to HP-UX by the path name of the disk's device file. Device drivers in the operating system enable communication with the disk. Each architecture supports a different set of disks, based on the device drivers written for that architecture and disk.

To access files in a file system, you mount the file system that resides on a disk, by associating the path name of its mount point with the disk's device file. Once mounted, the file system is accessible to the operating system and users.

This paper discusses file-system creation, storage, modification, and protection.

Understanding File-System Creation
==================================

As a system administrator, much of what you do concerns file systems. System files, application files, and user files are typically organized as file systems. Also, although disks are the storage devices that hold data, the data must reside in a file system to be available to the operating system. Thus, if you run short of space, you can install a new disk and create a file system on it to hold additional data.

Conceptually, creating a file system involves:

* making the physical environment (the disk device) available to the file system.
* creating the software entity (the file system) itself.
* establishing (by mounting) the "connective threads" between the physical and software elements.

HP-UX uses the term "file system" to mean several things:

* A file system is the HP-UX directory tree (often several file systems mounted together), starting from root.
* A file system is also a body of structures that exists on each file-system device and enables you to keep data contiguous with the existing data hierarchy.

This second meaning of file system is the subject of this white paper.

This section summarizes the numerous aspects of file system creation, to explain how a file system is connected to HP-UX as a whole.
Note: All procedures for creating and maintaining file systems are found in the HP-UX System Administration Tasks manual.

There are many reasons why you might add a new file system, including:

* You anticipate that your file system will soon exceed its current maximum capacity.
* Your current file system has already reached maximum capacity.
* You wish to separate portions of a file system physically, to restrict growth of files on a portion of the file system or to increase concurrent access for better performance.

To create a file system, you can use a sequence of HP-UX commands, or you can invoke the SAM utility and perform the task interactively. In either case, adding a file system involves:

* Installing the necessary device files for the new device (if the disk is newly connected).
* Preparing the storage medium (the disk device) for the file system (if the disk is newly connected).
* Creating the file system itself.
* Mounting the file system to make it available for system use.
* Adding the file system to /etc/fstab for automatic mounting.

If you are creating your new file system on a new disk drive, you first connect the physical device to the system, referring to the device's installation manual.

Use a hard disk to hold an HP-UX file system. Flexible disks and cartridge and reel tape drives are too limited in capacity, too slow, and too subject to deterioration for such constant use. Rewritable magneto-optical disks are slower than hard disks, but substantially faster than flexible disks or tape, and are typically used to back up a file system. If necessary, magneto-optical disks can be used to hold an auxiliary file system.

Each disk is accessed physically via a compatible interface card that connects the disk to the computer's bus architecture. Hard disk drives might use any of the following interfaces: standard or high-speed HP-IB, fiber link (HP-FL), or small computer systems interface (SCSI).
The protocol for each interface is encoded in a specific device driver, which must be present for HP-UX to communicate with the disk.

The operating system accesses physical devices logically through both the device driver and device special files.

* You can see the device drivers used by your system by reading the /stand/system file or by running the lsdev(1M) command.
* You can see device special files for disks by listing the /dev/dsk (for block special files) and /dev/rdsk (for character special files) directories.

Create device files using mknod(1M), mksf(1M), or insf(1M).

Character and block device special files are required for devices that hold file systems. If you are apportioning disk space using the Logical Volume Manager (LVM), you need a character and block device special file for each logical volume. Without LVM on Series 700 systems, you need a character and block device special file for the entire disk drive. Using disk sections on a Series 800, you need a character and block device special file for each section used.

The device special files are used when performing system administration tasks involving the file system. For example:

* The mediainit(1M) command requires a transparent special file to reformat a disk or tape for a file system. Use mediainit if you suspect the media is corrupted or worn. To use mediainit, you must create the device files using the -t option of mksf(1M).
* The mount(1M) command requires block device files to mount and unmount (umount) the file system.
* The newfs(1M) command requires a character special file to create a file system.

HP-UX cannot use media to store files until you place a file system on it. You can create a new file system using SAM, mkfs(1M), or newfs(1M). Of the two manual commands, newfs is easier to use.

When you create a file system, you create an environment to contain files, much like building a "file cabinet" for paper files. When first built, the file cabinet is empty. Then you add files.
To create a file system, you specify the disk special file to newfs; newfs queries the device driver, which returns information that newfs can then use to set disk characteristics and key values, including block and fragment size, number of bytes per inode, percentage of reserved free space, and rotational delay.

Procedures for building a file system are documented in the HP-UX System Administration Tasks manual, Chapter 4.

After you create a file system, it has to be mounted (attached) to the HP-UX file hierarchy, using the mount(1M) command. This incorporates the file system into the existing file system's overall hierarchy. You do this by logically associating the root directory of the new file system with a mount point, a directory on the existing file system.

Once a file system is mounted, the mount points are seamless. You can access the new mounted disk space as a contiguous part of the entire HP-UX file-system hierarchy, as shown in the following figure.

    File System /users Mounted to Root File System /dev/dsk/c1t4d0 at /home

    +--------------------------------------+
    |                 /                    |
    |                 |                    |
    |  +--------------+---------------+    |   root file system
    |  |              |               |    |   /dev/dsk/c1t4d0
    | bin            usr             home  |
    |                                 |    |
    +---------------------------------|----+
                                      |
                +---------------------|--------------------+
                |                     |                    |
                | +---------+---------+---------+          |   file system
                | |         |         |         |          |   /users
                |beth       jo        amy       meg        |
                |                                          |
                +------------------------------------------+

Once mounted, user jo's pathname is /home/jo, but when you run bdf, you will see the file system /users mounted to /home.

To mount a file system:

* Make a mount point directory (using the mkdir command) for the file system.
* Mount the newly created file system to the mount point (using the mount command).

An existing file system can be moved to a different location on the HP-UX file hierarchy by unmounting (detaching) it from its current location using the -u option (or umount command) of mount(1M) and remounting the file system.
A file system cannot be unmounted if any files are open or if any user's current working directory is in that file system. You can use the fuser(1M) command to identify which processes are using a file system or file structure, and if necessary, terminate them. The shutdown command unmounts all mounted file systems before bringing a system down, so that the file systems are not corrupted.

You cannot unmount the root file system or any file system that has dynamic swap enabled. Likewise, be sure that the /stand and /sbin directories are part of the root file system, so that they cannot be inadvertently unmounted. (Directories such as /var, /opt, and /usr are made to be mountable.)

For mounting, you refer to the file system by its logical volume and its mount point directory. For unmounting, you refer to the file system by either the device file name or mount point, because unmounting breaks the link between the two.

As a system administrator, you maintain the /etc/fstab file as a record of mountable file systems and swap space. The /etc/fstab file is read:

* by /sbin/init.d/hfsmount, to mount all listed file systems when the system boots up.
* by fsck(1M), to determine the order for conducting file-system checks.
* by shutdown(1M), to unmount all file systems before halting the system.
* by library calls such as getfsent(3X) and getmntent(3X), which enable programs to make use of file system information.

Disk Layout
===========

The disk layout is the geometry applied to a physical disk. Typically, a disk is divided into areas that accommodate file systems or raw I/O, dump, and swap.

A disk from which the system can be booted is called a root disk. It is organized somewhat differently from other disks and is discussed later in this paper. Non-root disks typically contain a single swap area, file systems, or a combination of both.

The following sections discuss layout principles of HP-UX disks for each architecture.
Logical Volume Manager (LVM) is the recommended method of apportioning disk space on both Series 700 and Series 800.

Logical Volumes
_______________

* LVM enables you to partition disks flexibly. You combine one or more disks (called physical volumes) into a volume group, which can then be subdivided into logical volumes.
* The size of logical volumes can be defined according to need. You can extend or reduce the size of a logical volume as needs change.
* Logical volumes can span disks. This enables you to create very large logical volumes, or use small portions of disk space more efficiently.
* You can mirror logical volumes, using an optional product, MirrorDisk/UX.

Procedures for using Logical Volume Manager (LVM) are documented in "HP-UX System Administration Tasks."

Note: Software Disk Striping (SDS), which had been a Series 700-only feature, is no longer supported on 10.0. Instead, you need to convert the disk to 10.0 LVM. LVM provides comparable striping capability for both Series 700 and 800, using the lvcreate command with -i and -I options. See lvcreate(1M) and lvextend(1M) in the HP-UX Reference.

Series 700 Disk Layout
______________________

The first 8 KB of the Series 700 disk is used for the LIF directory, which contains pointers to the file system and each boot program in the boot area. In the absence of a boot area, the swap area occupies the remaining space. The Series 700 boot program occupies the last 2 MB of the root disk layout.

Series 700 Disk Layout
----------------------

Area             Data Structure             Size
----             --------------             ----
Boot pointers    LIF directory              8 KB

File system      Superblocks                8 KB
and Dynamic      (primary and redundant)
Swap             Cylinder group 1           varies
                 Cylinder group 2
                 ...
                 Cylinder group n

Swap             Swap tables                0 or more blocks
                 (defined in
                 /usr/include/sys/swap.h)

Boot area        LIF file system            2 MB
(optional)

Series 800 Disk Layout
______________________

For backward compatibility, Series 800 disks can be apportioned in sections (also called partitions).
Using LVM is the recommended method, however, and you are encouraged to convert your disks to LVM.

Disk space can be partitioned on the Series 800 in a variety of ways. Each section can be addressed like a separate disk drive. A section can be used for:

* Boot area
* File system
* Swap area
* Raw I/O

The layout of each section is nearly identical to the same areas on the Series 700. However, the boot and swap areas reside in their own sections instead of residing in the same section as the file system.

Series 800 disks can be partitioned into sixteen possible section choices. The size and location of each hard-coded section, as shown below, depend on the disk model.

Disk Sections and Relative Sizes
--------------------------------

[Figure not reproduced: diagram of the relative sizes and overlaps of disk sections 0 through 15.]

Limited information on section sizes and locations is defined in the /etc/disktab file (maintained only for backward compatibility).

If you are managing a disk using hard-coded sections, when you create a new file system (with mkfs, newfs, or SAM), you declare on what section the file system is to be mounted. You must be careful not to use overlapping disk sections.

File System Size
================

HP-UX supports file systems up to 4 GB; however, the size limit for individual files is 2 GB. Applications also may not use raw access to disk sections larger than 2 GB.

For very large disks (such as HP C2254B), the boot partition must lie within 2 GB of the beginning of the disk.

Protocols do not permit NFS-mounting file systems larger than 2 GB.
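
The size limits above can be collected into a small planning check. The following is only an illustrative sketch: the function name and its arguments are hypothetical, and it assumes that "GB" means 2^30 bytes, which this paper does not state.

```python
# Limits quoted in the File System Size section above.
# ASSUMPTION: 1 GB is taken to be 2**30 bytes; the paper does not say.
GB = 1024 ** 3
MAX_FILE_SYSTEM_BYTES = 4 * GB      # largest supported file system
MAX_FILE_BYTES = 2 * GB             # largest individual file
MAX_NFS_FILE_SYSTEM_BYTES = 2 * GB  # largest NFS-mountable file system


def check_plan(fs_bytes, largest_file_bytes, nfs_mounted=False):
    """Hypothetical helper: list the documented limits a plan would exceed."""
    problems = []
    if fs_bytes > MAX_FILE_SYSTEM_BYTES:
        problems.append("file system exceeds 4 GB")
    if largest_file_bytes > MAX_FILE_BYTES:
        problems.append("file exceeds 2 GB")
    if nfs_mounted and fs_bytes > MAX_NFS_FILE_SYSTEM_BYTES:
        problems.append("NFS-mounted file system exceeds 2 GB")
    return problems


if __name__ == "__main__":
    # A 3 GB file system is acceptable locally but not for NFS mounting.
    print(check_plan(3 * GB, 1 * GB))
    print(check_plan(3 * GB, 1 * GB, nfs_mounted=True))
```

An empty list means the plan stays within the limits quoted in this section; it says nothing about other constraints, such as the boot-partition placement rule.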
Disk and File System Tools
==========================

When working with file systems, you often have to understand how much disk space you have and how large your file systems are. /usr/sbin/diskinfo can help you determine available disk space. To view the size of a file system, you can use bdf, df, or du. For backward compatibility, /etc/disktab provides some information about disk geometry. Each tool is discussed in the next sections.

Disk Characteristics Command -- /usr/sbin/diskinfo
__________________________________________________

The diskinfo(1M) command displays characteristics of a disk device, when given the device's character special file. /usr/sbin/diskinfo is particularly useful when setting up or managing logical volumes.

When used without options, /usr/sbin/diskinfo produces terse output:

    % /usr/sbin/diskinfo /dev/rdsk/c2t5d0
    SCSI describe of /dev/rdsk/c2t5d0:
                 vendor: HP
             product id: C3010
                   type: direct access
                   size: 1956086 Kbytes
       bytes per sector: 512

With the -b option, /usr/sbin/diskinfo returns the size of the disk in 1024-byte sectors.

    % /usr/sbin/diskinfo -b /dev/rdsk/c2t5d0
    1956086

The verbose (-v) option of /usr/sbin/diskinfo displays different information, depending on the type of disk:

* vendor and product ID (SCSI devices)
* device name (CS/80 and SCSI)
* number of bytes/sector (CS/80 and SCSI)
* geometry, interleave, and timing information (CS/80)
* size in bytes and logical blocks, revision level, SCSI conformance level (SCSI)

For example:

    % /usr/sbin/diskinfo -v /dev/rdsk/c2t5d0
    SCSI describe of /dev/rdsk/c2t5d0:
                 vendor: HP
             product id: C3010
                   type: direct access
                   size: 1956086 Kbytes
       bytes per sector: 512
              rev level: 0BQ3
        blocks per disk: 3912172
            ISO version: 0
           ECMA version: 0
           ANSI version: 2
        removable media: no
        response format: 2

Free Disk Blocks Command -- bdf
_______________________________

The bdf command (Berkeley's variation of df) reports the number of free disk blocks available on a file system.
If no file system is given as an argument, bdf reports on all file systems. Several options are available:

    -b        Displays information about file system swapping.
    -i        Displays used and free inodes.
    -l        Local. Displays HFS file systems mounted on a client.
              Does not display NFS-mounted file systems.
    -t type   Displays only information on mounted file systems of a
              given type.

Here is sample output of bdf:

    % bdf
    Filesystem          kbytes    used   avail %used Mounted on
    /dev/vg00/lvol1      47829   19886   23160   46% /
    /dev/vg00/lvol8      34541    8260   22826   27% /var
    /dev/vg00/lvol7     299157  157561  111680   59% /usr
    /dev/vg00/lvol6      23013    3576   17135   17% /tmp
    /dev/vg00/lvol5      99669   11100   78602   12% /opt
    /dev/vg00/lvol4      19861       9   17865    0% /home

bdf reports its output in 1024-byte blocks. df reports its output in 512-byte blocks.

Disk Usage Command -- du
________________________

The du command reports disk usage in 512-byte blocks for all files or directories specified; if none is specified, du reports on the current directory. Its report traverses the file tree recursively.

Here is sample output using du on a subdirectory of one of the file systems listed in the previous example:

    % du /var/sam
    4       /var/sam/preferences
    10      /var/sam/log
    2       /var/sam/lock
    2       /var/sam/rt
    142     /var/sam

The final number reported is the total of 512-byte blocks for the /var/sam directory. Because du counts 512-byte blocks, the number is twice as large as the same usage expressed in the 1024-byte blocks that bdf reports.

Note: If it encounters a protected directory (that is, one whose file permissions are set to prevent access), du cannot report the number of blocks contained in that directory or its subdirectories.

Disk Geometry Database -- /etc/disktab
______________________________________

Note: /etc/disktab is provided for backward compatibility only. Do not rely on it for current information; newfs now determines the geometry requirements of disks when it creates a file system.
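
Because the tools described in this section report sizes in different units (512-byte blocks for du and df, 1024-byte blocks for bdf, Kbytes and 512-byte sectors for diskinfo), a small conversion helper can make their numbers comparable. This is a minimal sketch; the function name is hypothetical, and the figures are taken from the sample outputs shown in this section.

```python
def blocks_to_kbytes(blocks, block_size):
    """Convert a count of blocks of block_size bytes into whole Kbytes."""
    return blocks * block_size // 1024


if __name__ == "__main__":
    # diskinfo -v sample: 3912172 blocks per disk, 512 bytes per sector.
    # The result should agree with the "size: ... Kbytes" line.
    print(blocks_to_kbytes(3912172, 512))

    # du sample: 142 blocks of 512 bytes for /var/sam, expressed in the
    # 1024-byte units that bdf uses.
    print(blocks_to_kbytes(142, 512))
```

The first conversion gives 1956086 Kbytes, which matches the size diskinfo reports for the sample disk; the second gives 71 Kbytes for /var/sam.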
The /etc/disktab file is a database and informational file about disks. It provides reference information about the many HP disks supported on a given computer system and tutorial information about disk geometry.

Because /etc/disktab is a database, its information appears in terse form, as follows:

    ty   Type of disk.
    ns   Number of 1K sectors per track.
    nt   Number of tracks per cylinder.
    nc   Total number of cylinders per disk.
    s0   Size of file system in 1K blocks.
    b0   Block size in bytes. (Default block size for all systems is 8K.)
    f0   Fragment size in bytes. (Default fragment size for all systems
         is 1K.)
    se   Number of bytes per physical sector.
    rm   Rotational speed of disk platters in revolutions per minute.

Not all abbreviations are used on all systems.

The contents of /etc/disktab are used if you construct a file system with newfs -O. /etc/disktab provides entries that enable you to specify whether you want portions of a disk used for swap and boot. newfs no longer shows Series 800 disk sections.

If you are using LVM, you have even less cause to refer to /etc/disktab, although you might refer to it when you want to use non-default settings for file-system specification (for example, to change the fragment size or customize the various file-system sizes). Before adding a physical volume (disk) to a volume group, you might consult /etc/disktab to get an idea of the disk size.

For full specifications, see disktab(4) in the HP-UX Reference.

Boot Area
_________

The boot area is the portion of the disk that holds the code used to bring the system into an operational state. The boot code initializes and tests the hardware, then loads into memory a secondary loader. The secondary loader is the program that loads /stand/vmunix (the operating system) into memory to enable you to use your system. (For detailed information about boot code, see the white paper entitled "System Startup.")

The boot area is reserved on the mass storage medium (usually a disk) during the installation process.
Information in the boot area is used only if the disk is used for booting (boot disk), but the space can be reserved on all disks.

Although the disk layouts for HP-UX platforms differ, all systems use a small file system for the initial system booting, written in Logical Interchange Format (LIF). (LIF is described in lif(4) in the HP-UX Reference. The manual page also contains pointers to the LIF utilities.)

Using LVM, the boot data is contained in a Boot Data Reserved Area, which is created using the pvcreate -B command.

If the system is administered without LVM, the boot area on a Series 700 precedes the file system on the disk. On Series 800 systems using traditional disk sections, the boot area must reside in its own disk section distinct from the file system and swap area sections.

Series 700 Boot Area Implementation
-----------------------------------

The Series 700 loader understands the layout of the file system.

The Series 700 boot area has pointers to the actual bootstrap programs. The lifls command on a Series 700 reports the presence of FS, SWAP, ISL, AUTO, HPUX, IOMAP, EST, and PAD files. Its report of FS and SWAP indicates that Series 700 LIF has knowledge of the entire disk, including the file system and swap.

When the system is booted, the loader can find /stand/vmunix at a default or designated location, using the boot console user interface. For more information, see the owner's guide for the Series 700 systems or hpux(1M) in the HP-UX Reference.

Series 800 Boot Area Implementation
-----------------------------------

On Series 800, the LIF header contains ISL, HPUX, AUTO, RDB, and IOMAP files. ISL uses the AUTO file to locate the HP-UX kernel.

Primary Swap Area
_________________

The primary swap area is a contiguous area of the root disk used by the virtual memory system (see the white paper on Memory Management) to temporarily store a process image. The primary swap area is specified in /etc/fstab.
Until /sbin/rc1.d/S500swap_start executes swapon, primary swap is your only swap device.

Device swap space is used for primary swap, because the system can access it directly, without having to go through a file system.

On systems using LVM, the primary swap area resides in the root volume group in a designated logical volume. You can set up multiple swap areas in logical volumes that are on separate disks (physical volumes).

On Series 700 systems, the primary swap area occupies blocks after a file system area or an entire disk dedicated as a swap disk. If you have multiple disks, each one can contain its own swap area, but there is still only one primary swap area on the entire system.

On Series 800 systems using disk sections, primary swap space occupies its own section, separate from the file system and boot area sections. A single disk should not have multiple swap sections, because performance will degrade as the system attempts to do interleaved writes to swap areas on separate areas of the disk. Instead, configure multiple swap areas on separate disks. (For discussion of interleaving, see the Memory Management white paper.)

You can list all swap areas on your system using the swapinfo(1M) command; see the HP-UX Reference.

Procedures for managing swap space are found in the "HP-UX System Administration Tasks" manual.

File System Layout
==================

With the exception of disk drives used for raw data, every disk drive contains some file systems. All HFS file systems are laid out in a common format, with the following structures:

* Primary superblock
* Multiple cylinder groups

The many data structures governing the superblock and cylinder group are defined in several header files, particularly /usr/include/sys/fs.h.
Superblock headers, defined in fs.h, also include absolute disk addresses for the first boot block and definitions of numerous file system attributes, including cylinder-group characteristics (such as rotational positions, number of inodes per group, number of fragments per block), file length, and mirror states of root and primary swap.

For a description of the file-system format, see fs(4) of the HP-UX Reference.

The Superblock
______________

The superblock is a contiguous, 8-KB block of disk space near the beginning of the file system's disk section. The superblock contains a record of the static information about the state of the file system at the time of its creation (or extension, if using LVM):

* file system size
* number of inodes it can store
* locations of free space on the file system
* number of cylinder groups
* location of superblocks, cylinder groups, inode blocks, and data blocks
* size and number of blocks and fragments.

The primary superblock also keeps track of file system update information in its summary information area.

HP-UX uses information in the superblock for various file system maintenance procedures -- for example, when you mount a file system or perform a file system check by executing fsck.

Because the superblock is so important, HP-UX always keeps redundant copies on disk in each cylinder group. One copy is brought into main memory when you boot up. A primary superblock is at the beginning of the file system, and each cylinder group has a copy of the superblock. This redundancy further ensures the integrity of file system data.

The primary (non-redundant) superblocks on the disk are updated whenever the sync command is executed and when a file system is unmounted (see sync(1M) in the HP-UX Reference).

A record of all superblock locations can be found in /var/adm/sbtab.

The Cylinder Group
__________________

The cylinder group is the term used to describe a further internal organization of disk layout.
Picture a set of disks stacked on top of one another, rotating around the same single point. One movable arm for each disk in the set extends from outside the edge of the disk toward the center of rotation. All the arms are tethered together so that they move in unison. At the end of each arm (toward the central point) is a read/write head that can access any point on the disk surface.

A cylinder is a collection of tracks located the same distance from the edge of a disk platter, accessible by the read/write heads. Because all the tracks in a cylinder are positioned under the read/write heads of the disk drive at the same time, the blocks on each track can be accessed without moving the arms; that is, with no seek time and only rotational latency.

For performance reasons, small groups of adjacent cylinders (sixteen by default; see newfs_hfs(1M)) are grouped together as cylinder groups. Each cylinder group has its own set of inodes and local mappings of free space in the group.

This internal organization both keeps file-system inodes close to their associated data, avoiding long seeks, and disperses data and inodes across the disk's cylinders. Minimum time is lost seeking file data within a cylinder group.

The cylinder group controls all access to a file and its associated data. Each cylinder group contains a copy of the superblock, a cylinder group information structure, an inode table, and data blocks.

Cylinder Group Layout
---------------------

Data Structure                Size
--------------                ----
Boot block                    8 KB
Primary superblock            8 KB
Redundant superblock          8 KB
Cylinder group information    1 block (4 KB or 8 KB)
Inode table                   varies (see Inodes section)
Data blocks                   0 or more blocks (due to offset;
                              see Data Blocks section)

Only the first cylinder group is likely to have a boot block. The beginning of all subsequent cylinder groups might be filled by data blocks, depending on offset.

A redundant copy of the superblock is located in each cylinder group.
This ensures that if any single track, cylinder, or platter is damaged, the file system itself can be repaired by executing fsck and specifying an alternate superblock. Further, each successive cylinder group is laid out offset by one track in relation to the previous cylinder group, so that the redundant copies of the superblock spiral down the platters.

The cylinder group information contains the dynamic parameters of the cylinder group:

* Number of inodes and data blocks in the cylinder group
* Pointers to the last used block, fragment, and inode
* Number of available fragments
* Used inode map
* Free block map.

The cylinder group information data structure's size is one block (a block can be defined when running newfs as either 4 KB or 8 KB). The layout of the cylinder group information is defined in /usr/include/sys/fs.h.

Inodes
------

Besides maintaining information about the file-system state, the cylinder group holds key information about the file-system inodes -- the system's index to the actual files of data. Inodes contain the locations of the actual file data.

The cylinder group maintains an inode table, which provides summary information about each file in the cylinder group (see the figure, "Mapping from Inode to File Data Blocks," later in this paper). In addition, the "disk inodes" appear in an expanded version ("in-core inodes") in memory for inodes currently (or recently) used.

A disk inode includes the following information:

* mode and file type
* number of links to the file
* owner and group information
* file size in bytes
* time stamps
* pointers to the file's actual blocks of data on disk

When a file is read into memory, its in-core inode also shows the following:

* status of the in-core inode, including whether the inode is locked, whether a process is waiting for the inode, whether the disk inode now differs from the in-core copy due to file modification, and whether the file is a mount point.
* numeric address of the file system containing the file.
* inode array number by which the kernel identifies the disk inode. * pointers to other in-core inodes linked on buffer hash and free lists. The /usr/include/sys/inode.h header file defines the in-core inode; the /usr/include/sys/ino.h header file defines the disk inode. When the operating system accesses a file, it finds the file using the inode pointers to the file blocks of data. This is discussed later in this paper. A static number of inodes is allocated for each cylinder group when the file system is created. HFS uses a default that provides sufficient inodes per cylinder group for average usage. If the file described by the inode is not a regular file, some of the inode fields differ as follows: * FIFO and pipes: The space reserved for indirect block pointers contains information about the current state of a FIFO or pipe. * Character or block device files: The first direct block address is actually the major and minor number of the device. The rest of the direct block addresses are 0. * Directories: The pointers point to regular file system data blocks that contain specially formatted data described in /usr/include/sys/dir.h. When you create a file system (using newfs or mkfs), the system creates inodes. The number of created inodes limits the number of files that you can have in a file system. Each time you create a file, an inode is allocated for that file. Both commands default to 6144 bytes per inode, meaning the system assumes that the average size of your files will be 6144 bytes. Although uncommon, an inode error message, inode: table is full, might require changing the size of the inode table. This message refers to the kernel's in-core inode parameter. A configurable parameter, ninode, defines the maximum number of open, in-core inodes. You can use SAM to change these configurable parameters. Data Blocks ----------- Disk space before or after the superblock, cylinder group information, and inode table is filled with data blocks. 
(The specific locations of data on each platter are different, due to the cylinder-block offset.) The blocks are used to store data for regular files, directories, and symbolic links.

HP-UX provides support for file systems in several block sizes: 4 KB, 8 KB, 16 KB, 32 KB, or 64 KB. Block size is set using the mkfs or newfs command, when you construct a file system. See mkfs(1M) and newfs(1M) in the HP-UX Reference. Larger block sizes are faster for sequential access to the file system, while smaller block sizes use space more efficiently and are better for random I/O.

Having a large block size has both benefits and costs. For big files, a large block size significantly reduces the number of disk accesses, thereby increasing file system throughput. The problem is that most HP-UX files are small; thus, using a large block size for small files might waste space.

In the fs.h header file, the block size is held in fs_bsize; its value depends on the block size your file system uses.

Fragment size is specified at file system creation. To minimize wasted space, fragments can be one-eighth, one-fourth, one-half or the same size as a block. A block can be divided into 1 KB, 2 KB, 4 KB, or 8 KB fragments.

How a File is Accessed from Inode to Data Blocks
================================================

An inode in the cylinder group contains pointers to the locations of a file's actual data. Depending on the size of a file, its data might be reached through direct pointers or through indirect pointers, which point to a block containing more pointers to the data. HP-UX allows for up to triple indirect pointing for enormous files.

The next figure shows the mapping from an inode to a file's data blocks. The first 12 pointers in an inode point directly to the first 12 blocks or fragments containing the file's data. If the file is larger than 12 blocks (greater than 12 times fs_bsize), indirect reference is made to the file's data.
A group of 4-byte long indirect pointers is contained in one data block; there can be either 1024 pointers (4096/4) or 2048 pointers (8192/4) in each block of indirect pointers. The thirteenth block address in the inode points to a block containing 1024 or 2048 additional pointers to data blocks. The number of indirect pointers in a block is called num_ip. Thus, the thirteenth (single indirect) block address handles files up to 4,243,456 bytes in a 4-KB block file system or 16,875,520 bytes in an 8-KB block file system (fs_bsize times (12+num_ip)).

If the file is larger, the fourteenth inode block address points to a block of num_ip pointers to indirect blocks, each of which contains pointers to an additional num_ip actual data blocks. If the file cannot be contained in this space, the fifteenth inode block address points to a block of num_ip pointers to double-indirect blocks. With the fifteenth (triple-indirect) block address, the size of a file is limited to fs_bsize times (12 + num_ip + num_ip squared + num_ip cubed).

Mapping from Inode to File Data Blocks
--------------------------------------

 inode                  1st level   2nd level       file contents here
+-------------------+  indirection indirection
| mode & file type  |                               +---+ +---+     +---+
+-------------------+                               |   | |   | ... |   |
|# links to file    |                               +---+ +---+     +---+
+-------------------+                                 ^     ^         ^
|owner, group info  |                                 |     |         |
+-------------------+                                 |     |         |
| file size in bytes|                                 |     |         |
+-------------------+                                 |     |         |
| time stamps       |                                 |     |         |
+-------------------+                                 |     |         |
| direct          1 |---------------------------------+     |         |
| blocks          2 |---------------------------------------+         |
|               ... |                                                 |
|                12 |-------------------------------------------------+
+-------------------+                               +---+ +---+     +---+
|single indirect    |-------+                       |   | |   | ... |   |
+-------------------+       |                       +---+ +---+     +---+
|double indirect    |-+     v                         ^     ^         ^
+-------------------+ |  +-----+                      |     |         |
|triple indirect    | |  |  1  |----------------------+     |         |
|                   | |  |  2  |----------------------------+         |
+-------------------+ |  | ... |                                      |
                      |  |1K or|                                      |
                      |  | 2K* |--------------------------------------+
                      |  +-----+                    +---+ +---+     +---+
                      |                             |   | |   | ... |   |
                      |                             +---+ +---+     +---+
                      |  +-----+      +-----+         ^     ^         ^
                      +->|  1  |----->|  1  |---------+     |         |
                         |  2  |      |  2  |---------------+         |
                         | ... |      | ... |                         |
                         |1K or|      |1K or|                         |
                         | 2K* |-+    | 2K* |-------------------------+
                         +-----+ |    +-----+       +---+ +---+     +---+
                                 |                  |   | |   | ... |   |
                                 |                  +---+ +---+     +---+
                                 |    +-----+         ^     ^         ^
                                 +--> |  1  |---------+     |         |
                                      |  2  |---------------+         |
                                      | ... |                         |
                                      |1K or|                         |
                                      | 2K* |-------------------------+
                                      +-----+

 * 1K pointers if file-system block size = 4KB
   2K pointers if file-system block size = 8KB

Inode pointers hold the address of a fragment. The address references an entire block or one or more fragments, depending on the number of bytes stored at the address. All blocks but the last have a full block of data allocated to them. If the amount of data in the last block is less than the file system block size, only the consecutive fragments needed to store the data are allocated. For example, in an 8-KB/1-KB file system, a 19-KB file is stored as 2 8-KB blocks and 3 consecutive 1-KB fragments. (The latter might also be referred to as a 3-KB fragment.) This allocation scheme provides the performance advantage of large blocks with the space savings of small fragments.

The next figure shows an example of a 20-KB file stored in 8-KB blocks with 1-KB fragments. The number of blocks needed is 20/8 (file size/block size): 2 full blocks with a remainder of 4 fragments. Therefore, the first and second pointers point to full blocks, but the third pointer points to the remaining 4 fragments.

Sample Inode Addressing
-----------------------

          Inode
         +-------+
         |  ...  |
         +-------+
     file|  20K  |     +------+      +------+      +-------+
     size|       |     +------+      +------+      +--|----+
         +-------+     8     15      24    31      40 43   48
         |  ...  |     ^             ^                ^
         +-------+     |             |                |
        1|   8   |-----+             |                |
direct   +-------+                   |                |
blocks  2|  24   |-------------------+                |
         +-------+                                    |
        3|  43   |------------------------------------+
         +-------+
        4|   0   |
         +-------+
      ...|       |
         +-------+
       12|   0   |
         +-------+
      ...|       |
         +-------+

All indirect blocks are referenced only as full blocks; no pieces of the file are addressed at the fragment level beyond the 12 direct pointers.

When a block or fragment is needed, the disk is searched for free blocks. Ideally, free blocks are available throughout the disk, so that a search can locate a free block close to related blocks. When the file system is nearly full, there are long linear searches to find the block, and when a block is allocated, it is likely to be placed far from the previous block of the file, resulting in long seeks and slow performance.

Minimum Free Space
__________________

To ensure the availability of free blocks near one another, a certain percentage of free space must always be available in the file system. This minimum free space percentage is specified at file system creation using the -m option of the newfs command or the minfree argument of the mkfs command. The default is 10 percent. Values lower than 10 percent may severely degrade system performance, by causing the file system to search harder for free space. The percent of free space can be changed at any time using tunefs -m.

The reserved free space is inaccessible to the normal user; once this threshold is met, only the superuser can continue to allocate blocks. When the percentage of free space drops below the threshold, system throughput (to and from newly created files) drops because the file system can no longer localize the blocks for a file. Accessing a file is quicker if the entire file is grouped together.

How Disk Space is Allocated
===========================

Free space availability is determined from a bit map associated with each cylinder group. The bit map contains one bit for each fragment.
To determine if a block is available, the system examines consecutive fragments. A piece of the bit map from a file system using 1024-byte fragments and 8192-byte blocks is shown next.

Sample Free Block Bitmap in an 8KB/1KB File System
--------------------------------------------------

bit map            00000000  00000011  11111100  11111111
Fragment numbers   0-7       8-15      16-23     24-31
Block numbers      0         1         2         3

Fragment numbers 14-21 and 24-31 in this example are free, indicated by ones in the bit map. Fragment numbers 0-13 and 22-23 are allocated, as indicated by zeroes in the bit map.

Fragments in adjacent blocks cannot be used to create a full block; only eight contiguous fragments starting on a block boundary can be used to allocate a full block. Fragments 24-31 can be coalesced to form a full block, but not fragments 14-21. Also, if a partial block is allocated, the fragments must be consecutive and not cross a block boundary. For example, if three fragments are needed, fragments 16-18 can be allocated, but not fragments 14-16.

Every time data is written to an existing file, the system checks to see if file size must increase. If so, one of three conditions exists:

* Sufficient space exists in the existing block or fragment; the new data is written into the already allocated space.
* The file contains only whole blocks; the last block contains insufficient space to hold additional data. If more than a full block of data must be written, a new block is allocated and written. This process is repeated until less than a full block of new data is needed. At that point, a block containing enough contiguous fragments is located and the new data is written there.
* The file contains fragments, but not enough to hold the new data. If the size of the existing data in fragments plus the new data exceeds the size of a full block, a new block is allocated. Both the old and new data are written to the new block.
If the size of the old and new data is less than a full block, a block with enough contiguous fragments (or a full block) is located and allocated. When a block or fragment has been located, the address is recorded in the inode table and the free block bit map is updated. Allocation Policies ___________________ Allocation is performed globally to place new directories and files and locally to place data in blocks. A global decision determines which cylinder group contains a given file or directory. HP-UX attempts to put all files from a single directory in the same cylinder group. Newly created directories are put in the cylinder group with the greatest number of free inodes and smallest number of directories. Once the file size reaches maxbpg (maxbpg is defined via the tunefs command), HP-UX allocates blocks from another cylinder group. This helps to enforce grouping of all files within one directory into a single cylinder group by spreading the less common larger files over several cylinder groups. Global allocation routines call local allocation routines with requests for specific data blocks. Blocks are allocated by the following priorities: * Allocate block requested. * Allocate a block on the same cylinder that is rotationally closest to the requested block. * Allocate any block within the same cylinder group. * Use a quadratic hash to find a new cylinder group; allocate a block somewhere in the new cylinder group. * Use sequential search to find an available block. Speed in allocating blocks is the most important characteristic of this strategy. For this reason, the percentage of free space must be maintained. The File-System Buffer Cache ============================ The file-system buffer cache manages data flow between main memory and secondary memory (principally disks), by temporarily holding (buffering) information about data being transferred to and from disk. 
The buffer cache speeds data transfer from the file system to main memory; once buffered, data is accessed by a process's executing space in main memory much faster than from the file system on disk. The buffer cache is used for all file system I/O operations, plus all other block I/O operations in the system (for example, mount, inode reading, LVM management, and some device drivers).

The role of the buffer cache is illustrated below. When you execute a program, the shell passes the file path name to exec, which finds the file on disk and reads the a.out header into the buffer cache. The a.out header contains preliminary information about the executable, including the size of the text and the uninitialized data (bss areas).

Buffer Cache Holds the a.out Header of Executing Programs
---------------------------------------------------------

  Secondary Storage                      Main Memory
+--------------------+                   +-------------------------+
|                    |                   | buffer cache            |
| program file ++    |------------------>| containing a.out header |
|              ++    |                   |                         |
|                    |<----------------->| program executable      |
+--------------------+                   +-------------------------+

As the code executes, the virtual-memory system reads the pages of data directly from the disk into memory. (Some additional pages might also be read in, based on the probability they will be needed.) The file's a.out header, which is only needed to begin the "demand-paging", might (or might not) remain in the buffer cache throughout the process execution, depending on whether its buffer is needed.

If you have just created and compiled a program, all transactions occur from the buffer cache. For an existing program, however, data might exist on both disk and buffer cache. When a page is faulted in from disk to memory, HP-UX also ensures that the process executes using the most current copy of the data.
During a file-system write, HP-UX ensures that only the most current copy of the data, whether in the virtual-memory system's page cache or in the buffer cache, is written to disk.

Structure of the Buffer Cache
-----------------------------

The buffer cache consists of two parts:

* buffer headers, which have pointers to the buffer and describe its contents.
* buffer data area, which resides in data blocks ranging in size from DEV_BSIZE to MAXBSIZE. Like a file system, a buffer must always be some multiple of DEV_BSIZE. MAXBSIZE is the largest buffer size, in bytes. The smallest unit of memory assigned to a buffer is one page.

The data structures used for allocating and managing buffers are defined in the /usr/include/sys/buf.h header file.

Requests for buffers come from many sources, including file-system reads and writes, and device driver allocations. If a buffer is requested and not already in the cache, the operating system obtains the buffer header, allocates memory for the pages of the buffer, and then gives it to the part of the operating system making the request.

Implementation of the Buffer Cache
----------------------------------

The HP-UX file-system buffer cache can be implemented in two ways:

* Dynamically. The dynamic buffer-cache implementation allows the buffer cache to change in size depending on system demand for virtual memory vs. buffer cache. As of HP-UX release 10.0, the buffer cache is implemented dynamically, by default. Instead of setting fixed values using the familiar nbuf and bufpages parameters (both nbuf and bufpages are now set to zero), the operating system uses two new parameters, set as a percentage of physical memory. By default, dbc_min_pct is set to 5% of physical memory; dbc_max_pct is set to 50% of physical memory. These percentages can be changed to as low as 2% or as high as 90%, respectively.
* Fixed.
The number of buffers in the cache is set by two operating-system parameters in the /stand/system file -- nbuf and bufpages. When you power up your system, these parameters reserve memory for buffer headers (nbuf) and for pages of memory for buffer-cache use (bufpages) based upon the amount of available RAM. Of the two parameters, bufpages is more critical, defining the amount of memory in buffer cache, which can vary depending on block size. If either nbuf or bufpages is set to a value other than zero, a fixed buffer cache is implemented.

You can use SAM to change the buffer-cache operating system parameters (dbc_min_pct, dbc_max_pct, nbuf, bufpages) and then reboot to implement the changes. Since the values are stored in /stand/system, you can edit the file to assign the values, but the SAM method is recommended. For further information, refer to the SAM online help and the "HP-UX System Administration Tasks" manual.

Implementation of a Dynamic Buffer Cache
----------------------------------------

From a system-administration perspective, using the dynamic buffer cache is simple: the operating system is shipped with it set up by default. The size of the buffer cache is determined by two parameters, dbc_min_pct (5%) and dbc_max_pct (50%), which are set in SAM. The dynamic buffer cache begins at the dbc_min_pct value and can grow to the dbc_max_pct value, as the I/O requests occur. When memory pressure occurs, the cache can shrink to a minimum of dbc_min_pct.

Although the nbuf and bufpages operating-system parameters are not specified in /stand/system, the operating system determines how many buffer-cache pages are needed for optimal system performance. You can choose to configure these parameters, but if you do, the buffer cache will not function dynamically.

With dynamic buffer cache, nbuf is set to one-half bufpages (that is, half the minimum percent; by default, 2 1/2%). The number of buffer headers (nbuf) does not change.
The dynamic buffer cache is implemented to grow and shrink in size, depending on operating-system and virtual-memory need. Demand for memory is generated not only by the file system, but also by other objects, including processes, data regions, and memory-mapped files. Both buffer cache and virtual-memory subsystem access the same body of RAM in main memory.

The dynamic buffer cache is allowed to grow considerably larger than a fixed buffer cache, permitting more data to be held in memory. When the virtual-memory system requires more memory, the dynamic buffer cache is reduced to yield memory for processes.

The dynamic buffer cache functions like a large memory-mapped file shared among all the processes running on the system. (Note, there are a number of subtle interactions between the buffer cache and memory-mapped files that can streamline bringing data into the virtual-memory subsystem.)

The dynamic buffer cache uses an algorithm based on two free lists, LRU (least recently used) and EMPTY (unallocated buffer headers), for reusing existing buffer pages and allocating more pages from memory. The LRU list holds buffers in most-recently to least-recently used order. This list may grow as long as the buffer cache is growing. When a block is read for the first time, its buffer is inserted mid-list, in fairly high priority. If accessed again, its priority is increased. Other buffers might decrease in priority (such as file-system writes to an entire block, which typically do not get referenced again).

The dynamic buffer cache shrinks by use of vhand, the virtual-memory subsystem's pageout daemon. vhand reclaims pages of memory from the buffer cache as well as virtual memory, by using reference bits, much as it does through the virtual memory subsystem's regions. Its first (age) hand clears the status bits of any buffer pages not recently accessed. If the status bit remains clear by the time the second (steal) hand traverses it, vhand reclaims (pages out) the associated page.
The dynamic buffer cache gives the operating system flexibility to accommodate both small application programs that do a lot of I/O and large programs that do little I/O but require many pages of memory for data. For information on memory-mapped files, and the vhand and swapper daemons, see the Memory Management white paper. Implementation of a Fixed Buffer Cache -------------------------------------- Buffer headers are allocated in a single contiguous block and treated as an array. Inactive buffer headers are placed on one of three doubly-linked lists -- LRU (least recently used), AGE, and EMPTY: * Although its name suggests otherwise, the LRU list actually points to blocks of most frequently accessed data, representing no more than 40% of total buffers. If data in a buffer is dirty (that is, its contents changed since accessed from the file system), its pages must be written to disk before the pages can be reallocated. * The AGE list contains buffers accessed less frequently and the overflow of the LRU list. * The EMPTY list contains unallocated buffer headers. If a buffer requires more than one page, HP-UX ensures that the pages are assigned consecutive addresses. As code and data move from the file system into the buffer cache, the system copies the information from the buffer cache into user's main memory. If a user requests information already in the buffer cache, the information is copied from the cache to user's main memory, eliminating the I/O operation to bring it in from disk. When data is written through the buffer cache, any data in the virtual-memory system's page cache (in main memory) with the same vnode and block address is purged. Virtual addresses used by the buffer cache are in kernel space. When a pagein occurs, both the buffer header (on one of the buffer lists) and associated data in the buffer cache are flushed. 
How the HFS File System Modifies Files ====================================== Every time a file is modified, the HP-UX operating system updates the file system to ensure its consistency. When a process updates (writes to) the file system, the data being written is copied into an in-memory buffer cache. The physical disk is updated asynchronously from the buffer write. The data and inode information reflecting the change is written to the disk later, unless the file was opened in the synchronous mode (see the section on Synchronized I/O Flags in the open(2) manpage of the HP-UX Reference). The process continues, though the data has not yet been written to the disk. If the system is halted without writing the buffer to disk, the file system on the disk is left in an inconsistent state. Such inconsistencies are flagged and corrected, if possible, by the fsck command at system startup. The sync command can be used to force synchronization. The syncer command routinely updates the file system's superblock, inodes, data blocks, and cylinder group information, as described below. (See fsck(1M), sync(1M), and syncer(1M) in the HP-UX Reference.) Primary Superblock: The superblock of a mounted file system is written to the disk whenever a umount command is issued, or when a sync command is issued and the file system has been modified. Inodes: An inode contains information describing the file. The inode is written to disk after every modification, unless the fs_async parameter is set in the /stand/system file. (See "fs_async on an HFS File System," later in this paper.) Data blocks: In-core blocks (including directories, indirect blocks, files, pipes, symbolic links, and FIFOs) are written to the file system after being modified and released by the operating system. Upon release, data blocks are buffered or queued for eventual writing. Physical I/O takes place when the buffer is needed by HP-UX, when a sync or fsync command is issued, or when O_SYNC is set for the file. 
If a file is opened with the O_SYNC or O_SYNCIO flag set, the write system call does not return until completed.

Cylinder group information: The cylinder group information is updated whenever a sync is executed, or when the system needs a buffer and the cylinder group is written.

CAUTION:

* Always unmount a file system BEFORE executing fsck.
* Always reboot the system WITHOUT sync'ing (that is, use reboot -n) after altering the root device with fsck.

A file system can become inconsistent if you execute fsck on a mounted file system other than the root file system; you risk missing buffered information not yet written to the file system. If this information is then flushed from the buffer cache, it might overwrite corrections that fsck had made.

Immediate Reporting
___________________

Numerous SCSI disk devices are shipped with a feature called immediate reporting. Workstation disk devices are set with default ON; multi-user disks are set with default OFF.

Immediate reporting speeds status notification; its implementation is handled by the disk controller and disk device. However, immediate reporting also has some associated risks.

With immediate reporting, when a device driver sends a write request to a device, the device accepts the data, places it in its buffer or its cache, and reports to the SPU that the write completed successfully. Without immediate reporting, status is not returned until the data goes to the media itself.

In a power (or other) failure, data might not have been written successfully to disk, but might, in fact, still reside in a buffer. An application writing to the raw device or to the file system using O_SYNC continues processing as though the data has been written. If data remains in the buffer at the time of a system failure, the application's data (a database, for example) is left in an inconsistent state.
Note, however, O_SYNC might cause the driver to attempt to have I/Os issued through that open (marked B_SYNC) written through the cache to media by use of a SCSI command, Write FUA. Not all devices support this command, however.

Under rare circumstances, immediate reporting might also cause delayed errors or system panics. This can occur in the following scenario: A user has a write request and the system returns good status immediately. If the next request is a kernel request and an error occurs (such as a write failure) caused by the user's write request, the error might get associated with the kernel request. If the kernel request cannot tolerate the error, the kernel might panic.

In other words, the I/O which has already been reported successful actually fails. This failure is reported on a subsequent I/O by a "deferred" error. Such erroneous I/Os cannot be retried, nor reported to the application or the kernel, since the only information available to the driver is the report itself. The original I/O (prematurely reported successful) is long gone, as might be the application. Thus, the system's sole recourse may be to panic.

Immediate reporting can be set or disabled using scsictl(1M). If it is critical that your system not go down (or cause silent data corruption), you might want to disable immediate reporting.

Although SCSI disks available for Series 800 systems can be set for immediate reporting, the feature poses greater risk of inconsistent data; the disks are shipped with the feature disabled.

fs_async on an HFS File System
______________________________

When HP-UX writes data to disk synchronously, any file-system activity must complete to the disk before the program is allowed to continue; the process does not regain control until completion of the physical I/O (regardless of whether the I/O is user data or operating-system data). Synchronous writes include some file-system structures and whatever an application writes with O_SYNC set.
When HP-UX writes to disk asynchronously, I/O is scheduled at some later time and the process regains control immediately, without waiting for the write to complete. (In the case of a SCSI disk, the data is actually written to a write cache in the card controller, which is as far toward the disk as the operating system can tell.)

By default, some critical changes to the structure of the file system are posted to disk synchronously. Synchronous writes ensure file system integrity in case of system crash, but this kind of disk writing also impedes system performance. Run-time performance increases significantly on I/O-intensive applications when all disk writes occur asynchronously; little effect is seen for compute-bound processes. However, if a system using asynchronous disk writes crashes, recovery might require system-administrator intervention using fsck, and user data or directories might disappear.

As a system administrator, you can specify whether some disk writes are performed synchronously or asynchronously. The fs_async parameter specified in the /stand/system file enables and disables the feature. (You cannot modify whether or not other types of disk writes occur synchronously. They are asynchronous by default and synchronous if synchronous I/O flags are set by the application.)

On both Series 700 and 800, the fs_async value is set to 0 by default. This specifies that the writes should be performed synchronously. Setting fs_async to 1 causes these writes to be performed asynchronously, so that fewer writes are synchronous. Typically, this causes file-system performance to improve.

Note too that fs_async deals with inodes and directories, while O_SYNC deals with files and data. If a file is opened with O_SYNC, the file continues to be written synchronously, regardless of the fs_async setting. O_SYNC also causes inodes to be updated synchronously. For further information on synchronous I/O, refer to open(2) in the HP-UX Reference.
Although asynchronous disk writes increase system performance for most applications, if a system crashes, file-system data structures are likely to be left in an inconsistent state. For this reason, we do NOT recommend that you turn on fs_async on a production system.

Normally, file-system recovery is performed automatically by fsck in the reboot process and does not require any intervention by the system administrator. However, using asynchronous disk writes might require system administrator intervention in the event of a crash. For further information, refer to fsck(1M) in the "HP-UX Reference."

Minimizing File-System Corruption
=================================

Although the HFS file system is very reliable, hardware failures, accidental power loss, or improper shutdown procedures can cause its corruption.

Problems, such as a bad block on a disk, power loss, or a non-functional disk controller, can occur and cause the hardware to fail. By following recommended hardware preventive maintenance procedures and by keeping regular backups (as defined in the "HP-UX System Administration Tasks" manual), you can avoid most serious problems and be prepared for any that might occur.

As a system administrator, you are responsible for preserving users' data. Since the file system is the HP-UX data structure that stores the data, it is essential that you safeguard the file system by performing maintenance tasks (such as regular backups), following proper startup and shutdown procedures, and checking the file system when necessary using the fsck command.

System Shutdown and Startup Guidelines
______________________________________

To ensure file system integrity, always follow proper shutdown and startup procedures (described in the "HP-UX System Administration Tasks" manual):

* Always shut down the system using reboot or shutdown.
* Never physically write-protect a mounted file system, unless it is mounted read-only.
* Never take a mounted file system off-line (for example, by shutting its power off or by disconnecting it) while it is in use.

Follow proper startup procedures:

* Always check the file system for inconsistencies. (The fsck command runs automatically when the system reboots.)
* Always repair inconsistencies, using fsck. Allowing a corrupted file system to be further modified in such circumstances can be disastrous.

The /lost+found Directory
_________________________

Every file system should have a lost+found directory at its root. fsck, the file system check command (discussed in the next section), places any problem files or directories in lost+found. After fsck completes, you should examine each file in lost+found to determine its name and location and attempt to return it to its rightful place.

lost+found is created by both mkfs and newfs when they create file systems. However, if your system lacks lost+found, you can rebuild it using mklost+found(1M). mklost+found creates several empty file slots for fsck.

Understanding Use of fsck to Detect and Correct File-System Corruption
======================================================================

The fsck command is the principal file-system maintenance tool for checking system consistency and making repairs. NEVER run fsck on a mounted file system. However, fsck should be run regularly to ensure the file system's structural integrity:

* fsck is invoked during system boot-up by the /etc/bcheckrc script run by init.
* For preventative maintenance, fsck should be run weekly (before each full backup) on all file systems, but particularly on file systems that have been unmounted.
* You should run fsck any time you suspect problems with the HP-UX file system. Be sure to unmount the file system first!

In performing its checks, fsck examines the file system several times, each time examining different characteristics, including:

* Block and file size
* Path names
* Connectivity (parent-child relationships)
* Reference count links
* Cylinder groups

fsck checks intrinsically redundant file-system data. The redundant data is either read from the file system or computed from known values.

The file system should be in an unmounted state when you check it. fsck should be run on the root file system only from init run-level s, the system administrator run-level. (Thus, you can check the root file system after performing a system shutdown.) Do not run fsck on the root file system when the system is busy. You can check non-root file systems any time, but be sure they are unmounted.

You can run fsck interactively or non-interactively. When invoked without options, fsck runs interactively on file systems marked hfs in /etc/fstab and queries you for a response when it finds an inconsistency. In non-interactive mode (typically, in the -p or preen mode), fsck reports inconsistencies and corrects many problems, but does not remove data. If it cannot solve a problem, fsck terminates. If this happens, you should run fsck interactively to fix the problems.

Note: When running fsck -p before a backup, if the command completes successfully, perform your backup. If it aborts with errors, back up the bad file system, repair it, then back up the file system again.

Do not issue the reboot command in its default form after fsck has repaired a mounted file system. By default, reboot executes sync on the disks, thus writing out inconsistent data. If you must reboot, use reboot -n, which does not issue a sync.

For further discussion of fsck, see the fsck white paper and fsck(1M) in the HP-UX Reference. The following subsections describe the interaction of fsck with various elements of the file system.

Superblock Consistency
______________________

The superblock's summary information can become inconsistent because every change to the file system's blocks or inodes modifies it. Most often, the superblock and its associated parts become corrupted when the computer is halted and the last command involving output to the file system is not a reboot, shutdown, sync, or umount command.

fsck checks the superblock for inconsistencies involving:

* Free block count -- this is fairly common
* Free inode count -- this is fairly common
* File system size -- this rarely happens

If it detects corruption in the static parameters of the primary (default) superblock, fsck asks the system administrator to specify the location of an alternate superblock. The alternate superblock addresses are listed during file-system creation. An alternate superblock is always found at block number 16. If this superblock is also corrupted, you must supply the address of another superblock. If the file system was last created during installation, a list of superblock addresses can be found in the /var/adm/sbtab file.

File System Size
----------------

fsck examines the superblock for inconsistencies involving file system size, number of inodes, free block count, and the free inode count. The file system size must be larger than the number of blocks used by the superblock and the number of blocks used by the list of inodes.

The file system size and layout information are critical pieces of information to the fsck program. While there is no way to actually check these sizes, fsck can verify that they are within reasonable bounds. All other checks of the file system depend on the correctness of these sizes.

Free-Block Checking
-------------------

fsck checks that all data associated with files and directories can be found. The superblock summary information contains a count of the total number of free blocks within the file system.
fsck checks that all the blocks marked as free are not claimed by any files. When all the blocks have been accounted for, fsck compares this count to the number of free blocks it finds within the file system. If the figures do not agree, fsck replaces the count in the summary information by the actual free-block count. If any of the free-block maps is erroneous, fsck rebuilds them, excluding all blocks in the list of allocated blocks.

Inode Checking
--------------

The superblock summary contains a count of the total number of free inodes within the file system. fsck compares this count to the number of free inodes it finds within the file system. If the figures do not agree, fsck replaces the count in the summary information by the actual free inode count.

Inode Consistency
_________________

Individual inodes are less likely than superblock summary information to be corrupted. However, because of the great number of active inodes, it is possible that a few inodes might become corrupted.

The inode list is checked sequentially, from inode 2 (inode 0 marks unused directory slots and inode 1 is reserved for future use) to the last inode in the file system. The inode structure is defined in the /usr/include/sys/inode.h header file.

There are two major types of inodes: primary and continuation. Continuation inodes contain only a mode (which is of type continuation), a link count, and access control list (ACL) entries. Continuation inodes exist only if a file has optional ACL entries associated with it. fsck checks the continuation inode's mode, link count, and the reference from the primary inode. It does not examine the ACL information itself.

fsck checks each primary inode for inconsistencies in the following areas:

* Format and type
* Link count
* Duplicate blocks
* Bad blocks
* Inode size
* Block count

Format and Type
---------------

fsck verifies each inode's classification -- regular file, directory, block special file, character special file, network device, FIFO, symbolic link, or continuation inode. It also examines the inode state, which is one of:

* Unallocated
* Allocated
* Neither unallocated nor allocated

This last state indicates an incorrectly formatted inode. An inode can get into this state, for example, if bad data is written into the inode list through a hardware failure. To correct such an ambiguous state, fsck clears the defective inode.

Link Count
----------

Contained in each inode is a count of the total number of directory entries linked to the inode. fsck verifies the link count stored in each inode by traversing the total directory structure (starting from the root directory) and calculating an actual link count for each inode.

If the stored link count is non-zero and the actual link count is zero, no directory entry appears for the inode; fsck links the disconnected file to the /lost+found directory. If the stored and actual link counts are non-zero and unequal, a directory entry may have been added or removed without the inode being updated. fsck replaces the stored link count in the inode by the actual link count.

Duplicate Blocks
----------------

Duplicate blocks can occur from using a file system with blocks claimed by both the free-block list and other parts of the file system. Each inode contains a list (or for large files, pointers to lists in indirect blocks) of all blocks containing its file's data.

fsck compares each block number claimed by an inode to a list of allocated blocks. fsck updates the list of allocated blocks to include the block number. If a block number is already claimed by another inode, fsck adds the block number to a list of duplicate blocks.
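
The first pass over block claims described above can be sketched in Python as follows. This is an illustration of the bookkeeping only, not fsck's actual code: the function name find_duplicate_blocks and the dictionary that maps inode numbers to claimed block numbers are hypothetical stand-ins for the on-disk inode list.

```python
def find_duplicate_blocks(inode_blocks):
    """Return (allocated, duplicates) for a map of inode number -> block list.

    allocated holds every block number claimed by at least one inode;
    duplicates holds every block number claimed more than once.
    """
    allocated = set()
    duplicates = set()
    # Walk the inodes in order, as the inode list is checked sequentially.
    for inode in sorted(inode_blocks):
        for block in inode_blocks[inode]:
            if block in allocated:
                # Already claimed earlier: record it as a duplicate.
                duplicates.add(block)
            else:
                allocated.add(block)
    return allocated, duplicates
```

For example, if inodes 2 and 3 both list block 101, that block number ends up in the duplicate set while remaining in the allocated set.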

To resolve duplicate blocks, fsck makes a partial second pass of the inode list to find the duplicated blocks' inodes. fsck prompts the operator to clear both inodes. Often clearing only one inode solves the problem, but the data in the other inode is suspect.

Bad Blocks
----------

Contained in each inode is a list, or pointers to lists, of all the blocks claimed by the inode. fsck checks each block number claimed by an inode for a value outside the range of the file system (lower than that of the first data block or greater than the last block in the file system). If the block number is outside this range, the block number is a bad block number. fsck prompts the operator to clear the inode.

LVM provides another mechanism for relocating bad blocks. (See the Logical Volume Manager documentation.)

Inode Size
----------

Each inode contains a 64-bit (eight-byte) size field indicating the number of characters in the file associated with the inode. fsck uses the inode size field to check for size inconsistencies.

fsck calculates the number of blocks that should be claimed by an inode by dividing the number of characters in the file by the number of characters per block and rounding up to get the number of direct blocks. fsck then counts actual direct and indirect blocks associated with the inode. If the actual number of blocks does not match the computed number of blocks, fsck warns of a possible file-size error. This is only a warning because HP-UX does not fill in blocks in sparse data files.

A directory inode within the HP-UX file system has the mode word set to "directory". The directory size of a file system using 14-character filename limits must be a multiple of 32 characters, because a directory entry contains 32 bytes of information. The number of blocks actually used for the directory should match that indicated by the inode size. fsck reports any directory misalignment, but cannot correct it.
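
The arithmetic behind the bad-block and inode-size checks above can be sketched in Python as follows. The function names are hypothetical, the block size passed by the caller is an assumed example value, and the size check is simplified to count data blocks only (indirect blocks are not modeled).

```python
def is_bad_block(block, first_data_block, last_block):
    """True if block lies outside the range of the file system."""
    return block < first_data_block or block > last_block

def expected_data_blocks(size_in_bytes, block_size):
    """File size divided by block size, rounded up."""
    return (size_in_bytes + block_size - 1) // block_size

def possible_size_error(size_in_bytes, block_size, actual_blocks):
    """True when the counted blocks differ from the computed number.

    This is only a warning condition: a sparse file legitimately holds
    fewer blocks than its size implies.
    """
    return actual_blocks != expected_data_blocks(size_in_bytes, block_size)

def directory_size_is_aligned(size_in_bytes, entry_size=32):
    """True if a directory's size is a whole number of 32-byte entries."""
    return size_in_bytes % entry_size == 0
```

With an assumed 8192-byte block, a file of 8193 bytes is expected to claim two data blocks, and a directory of 70 bytes would be reported as misaligned.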

Block Count
-----------

fsck checks the block count of two types of data blocks:

* Ordinary data blocks containing information stored in a file. fsck does not attempt to check the validity of the contents of an ordinary data block.
* Directory data blocks containing directory entries.

Indirect blocks are owned by an inode; thus, inconsistencies in indirect blocks affect the inode that points to the block. fsck checks indirect blocks for the following block-count inconsistencies:

* Blocks already claimed by another inode.
* Block numbers outside the range of the file system.

fsck detects and corrects the indirect-block inconsistencies iteratively, by the same scheme used for direct blocks.

fsck checks each directory data block for inconsistencies involving:

* Directory inode numbers pointing to unallocated inodes. If a directory entry inode number points to an unallocated inode, fsck removes the directory entry.
* Directory inode numbers larger than the number of inodes in the file system. If a directory entry inode number points beyond the end of the inode list, fsck removes the directory entry. This occurs if bad data is written into a directory data block.
* Incorrect directory inode numbers for "." and ".." (current and parent directories, respectively). The directory inode number entry for "." should be the first entry in the directory data block. Its value should be equal to the inode number for the directory data block. The directory inode number entry for ".." should be the second entry in the directory data block. Its value should be equal to the inode number for the parent of the directory entry (or the inode number of the directory data block if the directory is the root directory). If the directory inode numbers for "." and ".." are incorrect, fsck replaces them with correct values.

File-System Connectivity
________________________

fsck checks the general connectivity of the file system.
If it finds directories not linked into the file system, fsck links them back into the file system by placing them in the /lost+found directory.

Uncorrectable File System Corruption
____________________________________

In certain instances, fsck may be unable to check and repair the file system (for example, if all copies of the superblock are lost). The fsdb (file system debugger) command is provided for such situations.

CAUTION: fsdb should be used ONLY by an HP-UX file system expert, since it can easily destroy the entire file system. Refer to the fsdb(1M) entry in the HP-UX Reference for details.

Transferring Files between HP-UX and Other Systems
==================================================

Not all computers use the HFS File System. To accommodate variances, HP-UX supports several utilities and services for transferring files, including transfers to other vendors' operating systems.

The following listing shows what to use when transferring information between HP-UX and various systems. In some cases (such as with networking products), optional products must be present.

Utilities and Services for File Transfer
----------------------------------------

kermit:
    Use when both systems, connected by serial lines, run kermit. kermit transfers data between HP-UX and many incompatible operating systems.
    For more info:
    * Kermit Mailer
    * Using C-Kermit, Columbia University, Digital Press

LIF Utilities:
    Use when transferring files between HP-UX and systems that support the LIF file format, including HP-UX Basic, Pascal, and other HP-UX systems.
    For more info: lif(4) in the HP-UX Reference

uucp:
    Use when the other system is a UNIX system (including HP-UX), connected by modem lines, direct connection, or X.25 network, and with UUCP utilities installed. uucp automatically reconciles differences in file format between systems.
    For more info: UUCP chapter of Remote Access: User's Guide

Internet Services:
    Use when both systems are connected via LAN to the same services.
    The other system can be an HP-UX or UNIX system, or an MS-DOS personal computer. Internet Services automatically reconcile any differences in file format between systems.
    For more info:
    * Installing & Administering Internet Services
    * Using Internet Services
    * ftp(1) in the HP-UX Reference

HP FTAM/9000:
    Use when both systems are networked via OSI and using FTAM (OSI File Transfer, Access, and Management). OSI is a multi-vendor standard compatible with UNIX and non-UNIX operating systems. FTAM handles binary and ASCII file transfers, but does no data conversion.
    For more info:
    * HP FTAM/9000 Reference Manual
    * HP FTAM/9000 Programmer's Guide
    * HP FTAM/9000 User's Guide
    * Installing and Administering HP FTAM/9000
    * FTAM/9000 Technical Addendum
    * Release Notes: FTAM 9000

Network Services/9000:
    Use when transferring files over LAN to any HP-UX platform.
    For more info: Using Network Services

NFS:
    Use when users on different HP-UX (and other UNIX) systems want to share files. Explicit file transfers are unnecessary because file-system access is transparent.
    For more info:
    * Installing and Administering NFS Services

cpio:
    Use when transferring files by magnetic tape (cartridge or reel-to-reel) to another UNIX system that supports the cpio format.
    NOTE: cpio can be used with the tcio command to ensure smoother tape access.
    For more info: cpio(1) in the HP-UX Reference

ftio:
    Use when copying files to magnetic tape (cartridge, reel-to-reel, or DDS format). Faster throughput than either tar or cpio.
    For more info: ftio(1) in the HP-UX Reference

tar:
    Use when transferring files by magnetic tape (cartridge, reel-to-reel, DDS format) to another UNIX system supporting the tar format.
    NOTE: tar can be used with the tcio command to ensure smoother tape access. However, tcio only works on cartridge tapes.
    For more info: tar(1) in the HP-UX Reference

tcio:
    Use when transferring files between cartridge tape units (including autochanger) and a controlling HP-UX computer.
    tcio is typically used with cpio or tar.
    For more info: tcio(1) in the HP-UX Reference

fbackup, frecover:
    Use when transferring (typically backing up and restoring) files to magnetic tape, standard out, DAT tape, rewritable magneto-optical disk, or to a file. Combines features of dump and ftio.
    For more info: fbackup(1M) and frecover(1M) in the HP-UX Reference

File Protection
===============

When created, each file in the file system is assigned a set of file protections stored in the file permission bits (often called the file's mode). The file permission bits determine which classes of users may read from the file, write to the file, or execute the program stored in the file.

Read, write, and execute permissions for a file can be set for the file's owner, all members of the file's group (other than the file's owner), and all other system users. These three classes of users (user, group, and other) are mutually exclusive; that is, no member of one class of users is included in any other class of users.

When a file is created, it is associated with an owner and a group ID. For example, a file created by pjw in group dbase is listed as being owned by user pjw of group dbase. These values specify which user owns the file and which group has special access capability.

The default permissions of a file are initially determined by umask (set systemwide, in the users' environment file, or on the command line), or by parameters passed to creat, mknod, or mkdir system calls when the file is created. The permissions can be changed with the chmod command.

File permissions are represented as the binary form of four octal digits. The initial discussion deals with only the three least significant digits. When the most significant digit is not specified, its value is assumed to be zero (0).

Organization of File Permission Bits
------------------------------------

                        |     file     |     file     |    others
                        |     owner    |     group    |
         +----+----+----+----+----+----+----+----+----+----+----+----+
  binary |    |    |    |    |    |    |    |    |    |    |    |    |
         +----+----+----+----+----+----+----+----+----+----+----+----+
                          |    |    |    |    |    |    |    |    |
                          |    |  exec   |    |  exec   |    |  exec
                          |  write       |  write       |  write
                         read           read           read

Each group of three binary bits -- one bit to specify read permission, one bit to specify write permission, and one bit to specify execute permission for file owner, group, and others -- is interpreted as a single octal digit. If the binary bit value is one, permission is granted for the associated operation. If the bit value is zero, permission is denied.

Consider a file whose permission bits are set to 754 (octal). Octal 754 is equivalent to 111 101 100 binary. The ll command represents this as rwxr-xr--. The file's owner may read, write, and execute the file, while read and execute permission is granted to members of the file-owner's group. This includes any user (except the file's owner) whose effective group ID equals the ID of the file's group, or whose group access list includes the file's group ID. All other system users may only read the file.

Note that if a file has associated Access Control List (ACL) entries, "+" is displayed following the permissions. By default, the chmod command deletes any ACL entries, but you can use the -A option to preserve them. For more information on ACLs, refer to acl(5) in the HP-UX Reference.
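
The octal-to-letters interpretation just described can be sketched in a few lines of Python. The function name permission_string is hypothetical; the sketch handles only the nine low-order permission bits and ignores the most significant octal digit.

```python
def permission_string(mode):
    """Render the nine low-order permission bits the way ll displays them."""
    result = ""
    for shift in (6, 3, 0):            # file owner, file group, others
        digit = (mode >> shift) & 0o7  # one octal digit holds three bits
        result += "r" if digit & 0o4 else "-"
        result += "w" if digit & 0o2 else "-"
        result += "x" if digit & 0o1 else "-"
    return result
```

Calling permission_string(0o754) yields rwxr-xr--, matching the worked example above.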

File Permission Bits of rwxr-xr--
---------------------------------

                        |     file     |     file     |    others
                        |     owner    |     group    |
         +----+----+----+----+----+----+----+----+----+----+----+----+
  binary |    |    |    | 1  | 1  | 1  | 1  | 0  | 1  | 1  | 0  | 0  |
         +----+----+----+----+----+----+----+----+----+----+----+----+
                          |    |    |    |    |    |    |    |    |
                          |    |  exec   |    |  exec   |    |  exec
                          |  write       |  write       |  write
                         read           read           read

  as seen using ll:       r    w    x    r    -    x    r    -    -

  octal                  ______7_______ ______5_______ ______4_______

Protecting Directories
______________________

Directories, like all files in the HP-UX file system, have permissions. The format of a directory's permission bits is identical to that of an ordinary file; however, the read, write, and execute permissions have a slightly different meaning when applied to a directory.

* Read permission grants access to display the contents of a directory.
* Write permission grants access to add a file to the directory, rename a file within the directory, and remove a file from the directory. Users (even superusers) cannot write directly to the directory itself. Only the kernel can write directly to directories.
* Execute permission grants access to search a directory for a file. If execute permission is not set, the files below that directory in the file-system hierarchy cannot be accessed, even when you supply the file's correct path name.

Setting the sticky bit on a directory provides additional protection to files within the directory: files cannot be removed from the directory except by the owner of the file, the owner of the directory, or a user having appropriate privileges. (See rm(1) in the HP-UX Reference.)

Setting Effective User and Group ID Bits (suid, sgid)
_____________________________________________________

A process has effective user and group IDs that can be used to ensure file security. Using user and group IDs, a file can be protected so that when executed, the process's effective IDs are identical to the file owner's IDs.
This capability is specified through the most significant digit of the four octal file protection digits. The most significant digit is represented by three bits: set user ID, set group ID, and sticky bit. These bit values affect the capabilities of file owner, group, and other.

When the most significant bit of this digit is 1, the effective user ID of the process executing the file is set equal to the user ID of the file's owner. This bit is called the set user ID bit (suid or setuid).

Similarly, if the middle bit of the most significant octal digit is 1, then the effective group ID of the process executing the file is set equal to the group ID of the file's group. This bit is called the set group ID bit (sgid or setgid).

If the sgid bit is set for an ordinary file, and the file does not have group execute permission, the file is in enforcement locking mode. Refer to the section "File Sharing and Locking" later in this paper, or to the lockf(2) entry in the HP-UX Reference.

For example, consider a file whose permission bits are octal 6754. The binary equivalent is 110 111 101 100, as shown below and explained following the figure. Note that because the set user ID and set group ID bits are set, the ll listing shows the letter s in the execute bit of file owner and file group. If the sticky bit had been set, the execute bit of others would be designated with the letter t.
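
As a hedged illustration of the most significant octal digit, the following Python sketch uses the standard stat module to test the three special bits and to render a mode the way a long listing would. The function names are hypothetical, and stat.filemode reflects common UNIX listing conventions rather than HP-UX ll specifically.

```python
import stat

def special_bits(mode):
    """Report which bits of the most significant octal digit are set."""
    return {
        "setuid": bool(mode & stat.S_ISUID),  # octal 4000
        "setgid": bool(mode & stat.S_ISGID),  # octal 2000
        "sticky": bool(mode & stat.S_ISVTX),  # octal 1000
    }

def in_enforcement_mode(mode):
    """True for a regular file with sgid set and group execute clear."""
    return (stat.S_ISREG(mode)
            and bool(mode & stat.S_ISGID)
            and not mode & stat.S_IXGRP)

def listing(mode):
    """Permission letters as in a long listing, without the file-type column."""
    return stat.filemode(mode)[1:]
```

For octal 6754, special_bits reports setuid and setgid as set and sticky as clear, and listing returns rwsr-sr--. A regular file with mode 2644 satisfies in_enforcement_mode because its group execute bit is clear.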

Permission Bits of an suid/sgid file set to rwsr-sr--
-----------------------------------------------------

        most significant|     file     |     file     |    others
              bits      |     owner    |     group    |
         +----+----+----+----+----+----+----+----+----+----+----+----+
  binary | 1  | 1  | 0  | 1  | 1  | 1  | 1  | 0  | 1  | 1  | 0  | 0  |
         +----+----+----+----+----+----+----+----+----+----+----+----+
           |    |    |    |    |    |    |    |    |    |    |    |
           |    |  sticky |    |  exec   |    |  exec   |    |  exec
           |    |   bit   |  write       |  write       |  write
           |    |        read           read           read
           |    set group ID
           set user ID

  as seen using ll:       r    w    s    r    -    s    r    -    -

  octal   ______6_______ ______7_______ ______5_______ ______4_______

Explanation of File Permission Bits rwsr-sr--
---------------------------------------------

Most Significant Bits:
    Octal digit: 6
    Binary form: 110
    Permissions:
      set user ID:  Effective user ID of the process executing this file is set equal to the user ID of the file's owner.
      set group ID: Effective group ID of the process executing this file is set equal to the group ID of the file's group.
      sticky bit:   The sticky bit is not set; see "Protecting Directories," earlier in this paper.

File Owner Permissions:
    Octal digit: 7
    Binary form: 111
    Permissions:
      read:    File owner may read the file.
      write:   File owner may write to the file.
      execute: File owner may execute the file.

File Group Permissions:
    Octal digit: 5
    Binary form: 101
    Permissions:
      read:    Members of the file's group may read the file.
      write:   Members of the file's group may not write to the file.
      execute: Members of the file's group may execute the file.

All Others Permissions:
    Octal digit: 4
    Binary form: 100
    Permissions:
      read:    Any other user may read the contents of the file.
      write:   No other users can write to the file.
      execute: No other users can execute the file.

Access Control Lists
____________________

Access control lists (ACLs) offer a finer degree of file protection than traditional file-mode protection bits. With ACLs, you can allow or restrict file access to individual users, regardless of the groups to which the users belong.
For additional information, see acl(5) in the HP-UX Reference.

File Sharing and Locking
========================

In a multi-user, multi-tasking environment such as HP-UX, it is often desirable to control interaction with files. Many applications share disk files, and the status of information contained in them could have serious implications for the user (such as lost or inaccurate information).

Imagine we are responsible for maintaining on-line technical reports for a myriad of projects, and we have many different people who must have simultaneous access to these reports. The content of a given report at a given time could significantly affect a company decision, and so we want a way to control how records are accessed.

One potential problem could arise if one person (let's call him George) adds to or modifies information in a report while someone else (Sarah) is working on it. Sarah is unaware of changes that George has just made in the report. And once she is done, Sarah overwrites the information George added. The result is that we have lost ALL of George's information, and when Sarah added data she was unaware of information that might have been pertinent.

Advisory Locks
______________

A solution to this problem, which is common to file sharing, is file locking. In HP-UX, file locking is done with the lockf or fcntl system calls, which support two modes: advisory locks and enforcement mode.

Advisory locks are placed on disk resources to inform (warn) other processes desiring access that a file is currently being accessed or modified. Advisory locks are only valuable for cooperating processes that are both aware of and use file locking.

In our example, the programs used to access the on-line reports can use advisory locks. When George begins to work on the Marketing project, his program can call lockf and set an advisory lock. A few minutes later, when Sarah tries to access records in the Marketing report, she would get an error message indicating that the report is busy.
By using the lockf system call, her program could wait until George is done and then access the report.

Enforcement Mode
________________

Even if we use advisory locks in our example, Sarah would still be able to overwrite the Marketing report if she uses commands or utilities that do not check for advisory locks. She needs some way to ensure that no records are written until George finishes accessing the report. HP-UX does this with enforcement mode. When a process attempts to read or write to a locked record in a file opened in enforcement mode, the process sleeps until the record is unlocked.

Enforcement mode can be used only on regular files. Enforcement mode is enabled by setting the set-group-id bit (sgid) but not the group execute bit. For example, if we opened a file whose permission bits are set to 644, a long listing of the file would resemble:

    -rw-r--r-- 1 george fiscal 512 May 7 16:11 Marketing

To enable enforcement mode, type:

    chmod g+s Marketing

This command turns on the sgid bit, resulting in file protection of 2644. Enforcement mode can also be enabled by using the chmod system call. After enforcement mode is enabled, a long listing shows:

    -rw-r-Sr-- 1 george fiscal 512 May 7 16:11 Marketing

Using enforcement mode, George can prevent Sarah from overwriting his changes, and Sarah would have the data that George has added.

When attempting to access a file locked under enforcement mode, the process sleeps until the file is released. This provides a means for one process to control execution of another. Be careful when doing this, because a system deadlock is possible.

Locking Activities
__________________

All file locking is controlled with the lockf or fcntl system calls. lockf controls four file actions:

* Testing file accessibility by checking to see if another process is present on a specific file record.
* Attempting to lock a file.
  If the record is already locked by another process, lockf puts the requesting process to sleep until the record is free again.
* Testing file accessibility, locking the record if it is free, and returning immediately if it is not.
* Unlocking a record previously locked by the requesting process.

When the locking process either closes the locked file or terminates, all locks placed by that process are removed.

For more details on how specific locking activities work on HP-UX, refer to lockf(2) and fcntl(2) in the HP-UX Reference.
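
The test-and-lock and unlock actions listed above can be sketched with Python's fcntl.lockf, which wraps the underlying fcntl locking calls on POSIX systems. This is an advisory-lock illustration under that assumption, not HP-UX-specific code; the function names try_lock and unlock are hypothetical, and the sketch locks the whole file rather than a single record.

```python
import fcntl

def try_lock(fd):
    """Lock the whole file if it is free; return False at once if it is not.

    fd must be open for writing, because an exclusive lock is requested.
    """
    try:
        # LOCK_NB makes the call return immediately instead of sleeping.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except OSError:
        # Another process already holds a conflicting lock.
        return False

def unlock(fd):
    """Release a lock previously placed by this process."""
    fcntl.lockf(fd, fcntl.LOCK_UN)
```

In the report scenario, George's program would call try_lock before editing and unlock afterward, and Sarah's program would see False while George holds the lock. Dropping LOCK_NB gives the blocking behavior in which the caller sleeps until the record is free.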