Chapter 2. Files

Input File Formats

Haploview currently accepts genotype data in three formats: standard linkage format, completely or partially phased haplotypes, and HapMap Project data dumps. It also accepts a separate file with marker position information. All four file types are explained in depth below.

Linkage Format

Linkage data should be in the Linkage Pedigree (pre MAKEPED) format, with columns of family, individual, father, mother, sex, affected status and genotypes. The file should not have a header line (i.e. the first line should describe the first individual, not the names of the columns). Please note that Haploview can only interpret biallelic markers; markers with more than two alleles (e.g. microsatellites) will not work correctly. A sample line from such a file might look something like:

3    12    8    9    1    2    1 2    3 3    0 0    4 2
a     b    c    d    e    f    -----------g------------

(a) pedigree name

A unique alphanumeric identifier for this individual's family. Unrelated individuals should not share a pedigree name.

(b) individual ID

An alphanumeric identifier for this individual. It should be unique within the individual's family (see above).

(c) father's ID

Identifier corresponding to father's individual ID or "0" if unknown father. Note that if a father ID is specified, the father must also appear in the file.

(d) mother's ID

Identifier corresponding to mother's individual ID or "0" if unknown mother. Note that if a mother ID is specified, the mother must also appear in the file.

(e) sex

Individual's gender (1=MALE, 2=FEMALE).

(f) affectation status

Affectation status to be used for association tests (0=UNKNOWN, 1=UNAFFECTED, 2=AFFECTED).

(g) marker genotypes

Each marker is represented by two columns (one for each allele, separated by a space) and coded 1-4 where: 1=A, 2=C, 3=G, 4=T. A 0 in any of the marker genotype positions (as in the genotypes for the third marker above) indicates missing data.
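
As an illustration of the column layout described above, the following Python sketch splits one line into its six leading fields and decodes the allele pairs. The names PedRecord and parse_ped_line are hypothetical and are not part of Haploview; the sketch assumes whitespace-separated columns as in the sample line.

```python
from typing import NamedTuple

# Allele coding from the format description: 0 means missing data.
ALLELE_CODES = {"0": None, "1": "A", "2": "C", "3": "G", "4": "T"}


class PedRecord(NamedTuple):  # hypothetical container, for illustration only
    family: str
    individual: str
    father: str
    mother: str
    sex: int
    status: int
    genotypes: list  # one (allele, allele) tuple per marker


def parse_ped_line(line: str) -> PedRecord:
    fields = line.split()
    # Six leading columns, then exactly two allele columns per marker.
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 leading columns plus two columns per marker")
    family, individual, father, mother, sex, status = fields[:6]
    alleles = fields[6:]
    genotypes = []
    for i in range(0, len(alleles), 2):
        pair = alleles[i:i + 2]
        if any(code not in ALLELE_CODES for code in pair):
            raise ValueError(f"allele codes must be 0-4, got {pair}")
        genotypes.append((ALLELE_CODES[pair[0]], ALLELE_CODES[pair[1]]))
    return PedRecord(family, individual, father, mother,
                     int(sex), int(status), genotypes)


record = parse_ped_line("3    12    8    9    1    2    1 2    3 3    0 0    4 2")
print(record.genotypes)
```

For the sample line, the third marker decodes to a pair of missing alleles, matching the description of the 0 code.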

It is also worth noting that this format can be used with non-family based data. Simply use a dummy value for the pedigree name (1, 2, 3...) and fill in zeroes for father and mother ID. It is important that the "dummy" value for the ped name be unique for each individual. Affectation status can be used to designate cases vs. controls (2 and 1, respectively).

Files should also adhere to the following guidelines:

  • Families should be listed consecutively within the file (i.e. all the lines with the same pedigree ID should be adjacent)
  • If an individual has a nonzero parent ID, that parent should be included in the file on a separate line.
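
The two guidelines above can be checked before loading a file. The sketch below is one possible pre-flight check, not Haploview's own validation; the function name check_ped_rows and the message wording are hypothetical. It takes the first four columns of every line, in file order.

```python
def check_ped_rows(rows):
    """rows: (family, individual, father, mother) tuples in file order.

    Returns a list of human-readable problems (empty if none were found).
    """
    problems = []
    known = {(fam, ind) for fam, ind, _, _ in rows}
    finished = set()   # families whose run of lines has already ended
    previous = None
    for fam, ind, father, mother in rows:
        if fam != previous:
            # A family that reappears after another family started
            # violates the "listed consecutively" guideline.
            if fam in finished:
                problems.append(f"family {fam} is not listed consecutively")
            if previous is not None:
                finished.add(previous)
            previous = fam
        for role, parent in (("father", father), ("mother", mother)):
            # A nonzero parent ID must have its own line in the same family.
            if parent != "0" and (fam, parent) not in known:
                problems.append(
                    f"{role} {parent} of {fam}/{ind} has no line in the file")
    return problems


rows = [("1", "c", "a", "b"), ("2", "x", "0", "0"), ("1", "a", "0", "0")]
for problem in check_ped_rows(rows):
    print(problem)
```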

Phased Haplotypes

Haplotype data for Haploview's input must be formatted in columns of Family, Individual and Genotypes. There should be two lines (chromosomes) for each individual. This is the standard format of Genehunter's TDT output. See the sample below:

FAM1    FAM1M01    0    4    2    2
FAM1    FAM1M01    0    4    2    2
FAM1    FAM1F02    3    h    1    2
FAM1    FAM1F02    3    h    1    2

The data format uses the numerals 1-4 to represent alleles, the number zero to represent missing data, and the letter "h" to represent an allele at a heterozygous locus whose phase is unknown. That is, if an individual is heterozygous at a locus, both alleles should be "h" if the phasing (which allele falls on which chromosome) is uncertain.
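
A minimal sketch of reading this layout follows, assuming whitespace-separated columns and that the two lines for an individual are adjacent (as in the sample). The function name read_haps and the state labels are hypothetical.

```python
def read_haps(lines):
    """Pair consecutive lines into individuals and label each locus."""
    rows = [line.split() for line in lines if line.strip()]
    if len(rows) % 2 != 0:
        raise ValueError("expected two lines (chromosomes) per individual")
    individuals = []
    for first, second in zip(rows[0::2], rows[1::2]):
        if first[:2] != second[:2] or len(first) != len(second):
            raise ValueError(f"lines for {first[1]} and {second[1]} do not pair up")
        loci = []
        for a, b in zip(first[2:], second[2:]):
            if "0" in (a, b):
                state = "missing"
            elif "h" in (a, b):
                state = "unphased_het"   # heterozygous, phase uncertain
            else:
                state = "phased"
            loci.append((a, b, state))
        individuals.append((first[0], first[1], loci))
    return individuals


sample = [
    "FAM1    FAM1M01    0    4    2    2",
    "FAM1    FAM1M01    0    4    2    2",
    "FAM1    FAM1F02    3    h    1    2",
    "FAM1    FAM1F02    3    h    1    2",
]
for family, individual, loci in read_haps(sample):
    print(family, individual, [state for _, _, state in loci])
```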

HapMap Project Data Dumps

Data from the HapMap Project can be dumped by region using the GBrowse interface. Downloading data requires user registration and agreement to the terms of use. The saved data file is in a marker-per-line format which can be loaded in Haploview.

GBrowse dumps only one file, which has one marker per line and which includes familial relationships among the HapMap samples as well as marker position information. The file format has several header lines (beginning with "#") which Haploview parses. Open the file by selecting the "Browse HapMap Data" option and choosing the downloaded file.

Marker Information File

The marker info file has two columns: marker name and position. The positions can be either absolute chromosomal coordinates or relative positions. It might look something like this:

marker01 190299
marker02 190950
marker03 191287

An optional third column can be included in the info file to make additional notes for specific SNPs. SNPs with additional information are highlighted in green on the LD display. For instance, you could make note that the first SNP is a coding variant as follows:

marker01 190299 CODING_SNP
marker02 190950
marker03 191287
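
The two- and three-column forms can be read with the same routine. The sketch below is illustrative only: parse_info is a hypothetical name, and it assumes whitespace-separated columns, integer positions, and a note that contains no spaces (as in CODING_SNP).

```python
def parse_info(lines):
    """Return (name, position, note) tuples; note is None when absent."""
    markers = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) not in (2, 3):
            raise ValueError(f"expected 2 or 3 columns, got {len(fields)}: {line!r}")
        name = fields[0]
        position = int(fields[1])  # assumption: positions are whole numbers
        note = fields[2] if len(fields) == 3 else None
        markers.append((name, position, note))
    return markers


info = parse_info([
    "marker01 190299 CODING_SNP",
    "marker02 190950",
    "marker03 191287",
])
print([name for name, _, note in info if note is not None])
```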

Batch Load File

The "-batch" flag on the command line allows you to run Haploview automatically (in nogui mode) on several files. Batch input files should have one genotype file per line, along with an info file (if desired) separated by a space. Filenames must conform to the following rules:

  • Pedfile names must end in ".ped"
  • Phased haplotype file names must end in ".haps"
  • HapMap file names must end in ".hmp"
  • Info file names must end in ".info"

The following example shows two pedfiles (with info files) and a HapMap file:

sample1.ped   sample1.info
sample2.ped   sample2.info
sample3.hmp
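
The naming rules above can be checked before starting a batch run. This sketch is not part of Haploview; parse_batch_line and the type labels are hypothetical, and it assumes file names contain no spaces, since a space separates the genotype file from the info file.

```python
import os

# File-name suffixes from the rules above, mapped to illustrative labels.
GENOTYPE_TYPES = {".ped": "linkage", ".haps": "phased haplotypes", ".hmp": "HapMap"}


def parse_batch_line(line):
    """Return (genotype_file, file_type, info_file_or_None) for one batch line."""
    names = line.split()
    if not 1 <= len(names) <= 2:
        raise ValueError(f"expected one or two file names: {line!r}")
    extension = os.path.splitext(names[0])[1]
    if extension not in GENOTYPE_TYPES:
        raise ValueError(f"unrecognized genotype file suffix: {names[0]}")
    info_file = None
    if len(names) == 2:
        if not names[1].endswith(".info"):
            raise ValueError(f"info file names must end in .info: {names[1]}")
        info_file = names[1]
    return names[0], GENOTYPE_TYPES[extension], info_file


for entry in ["sample1.ped   sample1.info", "sample2.ped   sample2.info", "sample3.hmp"]:
    print(parse_batch_line(entry))
```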

Output Files

For any given tab the information in the display can be saved. For the data check and association test tabs, a simple tab-delimited text file is generated from the tables. For the LD and Haplotype tabs, data can either be dumped to text files or the image can be saved to a PNG.

LD Text Output File

LD text output is a tab delimited set of columns containing the various measures of LD used by the program. Details for each column are shown below:

  • L1 and L2 are the two loci in question, referenced by number or by name (if a marker info file is provided).
  • D' is the value of D prime between the two loci.
  • LOD is the log of the likelihood odds ratio, a measure of confidence in the value of D'.
  • r2 is the squared correlation coefficient between the two loci.
  • CIlow is the 95% confidence lower bound on D'.
  • CIhi is the 95% confidence upper bound on D'.
  • Dist is the distance (in bases) between the loci, and is only displayed if a marker info file has been loaded.
  • T-int is a statistic used by the HapMap Project to measure the completeness of information represented by a set of markers in a region.

Details about additional options for this output type can be found below in the Export Options section.
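
Because the LD output is tab-delimited, it can be post-processed with ordinary tools. The sketch below selects marker pairs above an r2 threshold. It assumes the file starts with a header row whose column names match the list above; the exact header spelling in a real file may differ, so the column name is a parameter. The function name strong_pairs and the sample values are invented for illustration.

```python
import csv
import io


def strong_pairs(handle, min_r2=0.8, r2_column="r2"):
    """Return (L1, L2, r2) for every pair whose r2 is at least min_r2."""
    reader = csv.DictReader(handle, delimiter="\t")
    pairs = []
    for row in reader:
        r2 = float(row[r2_column])
        if r2 >= min_r2:
            pairs.append((row["L1"], row["L2"], r2))
    return pairs


# Made-up values, only to show the call; not real Haploview output.
sample = "L1\tL2\tD'\tLOD\tr2\n1\t2\t1.0\t12.5\t0.95\n1\t3\t0.4\t0.8\t0.10\n"
print(strong_pairs(io.StringIO(sample)))
```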

LD PNG Output

When saving the LD table to a PNG, Haploview saves an image using the current display settings. This includes color scheme, zoom and proportional spacing. Thus, in order to save a less detailed image to a PNG, first zoom out, then export the tab. Note that Haploview cannot save large datasets at the higher zoom levels. For more information see the Export Options section below.

Haplotype Text Output File

Haplotype output shows a block, its markers, the haplotypes and their population frequencies, the crossover percentages to the next block and the multiallelic D prime. Tag SNPs are denoted with a "!". Crossover percentages are shown as a matrix with this block's haplotypes as the rows and the next block's haplotypes as the columns. An example might look like:

BLOCK 1.  MARKERS: 1 2 3! 4!
3312 (0.825)    |0.800  0.025   0.000|
1144 (0.163)    |0.031  0.125   0.007|
3342 (0.013)    |0.006  0.000   0.006|
Multiallelic Dprime: 0.802
BLOCK 2.  MARKERS: 10! 11! 12
441 (0.837)
222 (0.150)
242 (0.013)

In this example, the first block has 4 markers with 3 haplotypes displayed and the second block has 3 markers and 3 haplotypes. The tag SNPs for each block are (3,4) and (10,11) respectively. The crossover percentage matrix can be read as follows: 80% of all samples have the pattern 3312-441, 3.1% have the pattern 1144-441 and so forth.
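
The layout of the example can be read back into a simple structure, as in the following sketch. It is written against the example shown here only; parse_haplotype_output and the dictionary keys are hypothetical names, and real output may contain variations this sketch does not handle.

```python
import re

BLOCK_RE = re.compile(r"^BLOCK (\d+)\.\s+MARKERS:\s+(.*)$")
# Haplotype line: alleles, frequency in parentheses, optional |crossover row|.
HAP_RE = re.compile(r"^(\S+) \(([\d.]+)\)(?:\s+\|(.*)\|)?\s*$")


def parse_haplotype_output(text):
    blocks = []
    for raw in text.splitlines():
        line = raw.strip()
        match = BLOCK_RE.match(line)
        if match:
            tokens = match.group(2).split()
            blocks.append({
                "number": int(match.group(1)),
                "markers": [t.rstrip("!") for t in tokens],
                "tags": [t[:-1] for t in tokens if t.endswith("!")],  # "!" marks a tag SNP
                "haplotypes": [],   # (alleles, frequency, crossover row)
                "dprime": None,     # multiallelic D' to the next block
            })
            continue
        match = HAP_RE.match(line)
        if match and blocks:
            row = match.group(3)
            crossover = [float(x) for x in row.split()] if row else []
            blocks[-1]["haplotypes"].append(
                (match.group(1), float(match.group(2)), crossover))
            continue
        if line.startswith("Multiallelic Dprime:") and blocks:
            blocks[-1]["dprime"] = float(line.split(":")[1])
    return blocks


EXAMPLE = """BLOCK 1.  MARKERS: 1 2 3! 4!
3312 (0.825)    |0.800  0.025   0.000|
1144 (0.163)    |0.031  0.125   0.007|
3342 (0.013)    |0.006  0.000   0.006|
Multiallelic Dprime: 0.802
BLOCK 2.  MARKERS: 10! 11! 12
441 (0.837)
222 (0.150)
242 (0.013)
"""
for block in parse_haplotype_output(EXAMPLE):
    print(block["number"], block["tags"], len(block["haplotypes"]))
```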

Haplotype PNG Output

Saving the haplotype tab to a PNG produces an image using the current display settings (such as haplotype frequency cutoff).

Single Marker Association Text Output File

Single marker association results are saved in a tab-delimited text file with the following columns:

  • # is the marker number.
  • Name is the marker ID specified if an info file is loaded.
  • Chi Square is the chi square value for the marker.
  • p value is the significance level for the above chi square.

Trio (TDT) data only:

  • Overtransmitted is the allele overtransmitted to affected offspring.
  • T:U is the ratio of transmissions to non-transmissions of the overtransmitted allele (see above).

Case-Control data only:

  • Major Alleles are the major alleles in the case and control populations respectively.
  • Case Control Ratios are the ratios (shown as either counts or quotients, depending on selected options) for the case and control populations, respectively.

Haplotype Association Text Output

Haplotype association text output is a tab-delimited file, broken into sections by block. The columns are:

  • Haplotype is the sequence of alleles for this haplotype in this block.
  • Frequency is the population frequency for this haplotype.
  • Chi Square is the chi square value for the haplotype.
  • p value is the significance level for the above chi square.

Trio (TDT) data only:

  • T:U is the ratio of transmissions to non-transmissions of the haplotype to affected offspring.

Case-Control data only:

  • Case Control Ratios are the ratios (shown as either counts or quotients, depending on selected options) for the case and control populations, respectively.

Marker Check Text Output File

The marker check data is a tab-delimited file with the following columns:

  • # is the marker number.
  • Name is the marker ID specified (only if an info file is loaded).
  • Position is the marker position specified (only if an info file is loaded).
  • ObsHET is the marker's observed heterozygosity.
  • PredHET is the marker's predicted heterozygosity (i.e. 2*MAF*(1-MAF)).
  • HWpval is the Hardy-Weinberg equilibrium p value, which is the probability that its deviation from H-W equilibrium could be explained by chance.
  • %Geno is the percentage of non-missing genotypes for this marker.
  • FamTrio is the number of fully genotyped family trios for this marker (0 for datasets with unrelated individuals).
  • MendErr is the number of observed Mendelian inheritance errors (0 for datasets with unrelated individuals).
  • MAF is the minor allele frequency (using founders only) for this marker.
  • Rating is "BAD" if the marker failed any of the above tests and blank otherwise.
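
The PredHET formula in the list above is simple enough to reproduce by hand. The short sketch below evaluates 2*MAF*(1-MAF) for a given minor allele frequency; the function name is hypothetical.

```python
def predicted_heterozygosity(maf):
    """Predicted heterozygosity for a biallelic marker: 2 * MAF * (1 - MAF)."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("a minor allele frequency lies between 0 and 0.5")
    return 2 * maf * (1 - maf)


# A marker with MAF = 0.25 has predicted heterozygosity 2 * 0.25 * 0.75.
print(predicted_heterozygosity(0.25))
```

The value peaks at 0.5 when both alleles are equally common (MAF = 0.5) and falls to 0 for a monomorphic marker.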

Export Options

The "Export Options" item in the File menu allows adjustment of several parameters and allows the user to save any tab without having to switch to it. In particular, for the LD tab the output can be restricted to a subset of the markers:

All

The default setting (and the only one available for most tabs) is to use all the markers.

Marker Range

Generates the LD text or PNG file for only a specific range of markers.

Adjacent Markers

Generates the LD text file for only adjacent markers. This can be useful to view the T-int stat, which measures LD information content in the gaps between markers.

There is also an option to generate a "compressed" LD PNG, which is useful for very large datasets. The image is shrunk to an arbitrary zoom level which allows Haploview to save the PNG with minimal memory usage.

Auxiliary Input Files

Blocks File

You can specify a set of blocks by loading a blocks file. Each line is a space separated list of markers with one block per line. For example:

1 2 3 4
9 10 11 12 13 14 15

This would create one block from markers 1-4 and another from markers 9-15. The first marker in the file is number 1 (not 0).
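
A blocks file is easy to generate or check from a script. The sketch below reads the format into lists of marker numbers and rejects numbers below 1, reflecting the 1-based numbering; parse_blocks is a hypothetical name, and the sketch assumes markers are given as numbers, as in the example.

```python
def parse_blocks(lines):
    """Return one list of 1-based marker numbers per block."""
    blocks = []
    for line in lines:
        markers = [int(token) for token in line.split()]
        if not markers:
            continue  # skip blank lines
        if min(markers) < 1:
            raise ValueError("marker numbers start at 1, not 0")
        blocks.append(markers)
    return blocks


blocks = parse_blocks(["1 2 3 4", "9 10 11 12 13 14 15"])
print([(block[0], block[-1]) for block in blocks])
```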

Analysis Track File

You can add an analysis track along the top of the LD display by loading a file with two columns, <position> <value>. Haploview will plot the values continuously with respect to the positions of the markers, so the positions should use the same coordinates as the marker info file. For example:

1000 0.3
2000 1.7
3000 11.0
4000 2.3
5000 4.6

This would plot a line from position 1000 to position 5000. The values can be of any units or magnitude, as Haploview scales the analysis track to the bounds of the values.
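
To illustrate what scaling to the bounds of the values can mean, the sketch below reads a track file and applies min-max scaling so that the smallest value maps to 0 and the largest to 1. This is an assumption about the general idea, not a description of Haploview's internal plotting code; parse_track and scale_to_bounds are hypothetical names, and integer positions are assumed.

```python
def parse_track(lines):
    """Return (position, value) pairs from '<position> <value>' lines."""
    points = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(f"expected two columns: {line!r}")
        points.append((int(fields[0]), float(fields[1])))
    return points


def scale_to_bounds(points):
    """Min-max scale the values to the range 0..1 (assumed behaviour)."""
    values = [value for _, value in points]
    low, high = min(values), max(values)
    span = high - low
    # A flat track (all values equal) is mapped to 0 to avoid dividing by zero.
    return [(pos, (value - low) / span if span else 0.0) for pos, value in points]


track = parse_track(["1000 0.3", "2000 1.7", "3000 11.0", "4000 2.3", "5000 4.6"])
for position, scaled in scale_to_bounds(track):
    print(position, round(scaled, 3))
```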