Chapter 1. Using Haploview

Loading a Dataset

Data can be loaded in three formats along with an optional marker info file (except for HapMap files which do not require an info file). Further options are presented on the load screen:

  • Haploview saves time by only computing pairwise LD statistics for markers within a certain distance of each other. The default is 500KB. Enter a value of zero to force all pairwise computations.
  • Haploview excludes individuals with less than 50% complete genotypes. This threshold can be adjusted in the load dialog. Additional details about excluded individuals are available from the marker check tab.
  • When loading a file dumped from the HapMap project website, it is possible to automatically display SNP and gene tracks from the HapMap above the data by checking the "Download and show HapMap info track" box. More information is available with the LD Display help. [hapmap file only]
  • If you wish to perform association tests, you must inform the program now and select either family trios or case/controls. More details are available under association. [pedfile only]
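
The distance cutoff in the first option above can be pictured with a short Python sketch. This is not Haploview's code: the function name `pairs_to_compare` is hypothetical, positions are assumed to be in base pairs, 500KB is assumed to mean 500,000 bp, and the comparison is assumed to be inclusive.

```python
def pairs_to_compare(positions, max_distance=500_000):
    """Return index pairs (i, j), i < j, that would get an LD computation.

    positions    -- marker positions in base pairs (assumed unit)
    max_distance -- cutoff in base pairs; 0 means "compare every pair"
    """
    pairs = []
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            separation = abs(positions[j] - positions[i])
            # A cutoff of zero disables the distance filter entirely.
            if max_distance == 0 or separation <= max_distance:
                pairs.append((i, j))
    return pairs
```

With three markers at 0, 100,000 and 700,000 bp, only the first two fall within the default cutoff of each other, whereas a cutoff of zero yields all three pairs.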

Please note that bugs are sure to crop up which occasionally cause the program to hang. In general any dataset which causes the program to hang for > 15 minutes is unlikely ever to return. Therefore, prolonged wait times are probably a symptom of a failure as opposed to lengthy compute time. If you encounter such a problem, please contact the authors.

Haploview allocates 1G of memory by default. This is usually sufficient to handle datasets with several thousand markers. If you are running the program on very large datasets (>20,000 markers) you may need to force more memory (presuming your computer has sufficient resources available). This can be accomplished using the following command:

java -Xmx2000M -cp Haploview.jar edu/mit/wi/haploview/Haploview

Where "2000M" in this case specifies 2000 megabytes (2G) of memory and can be adjusted as necessary.

Data Quality Checks

Marker Checks

After loading a linkage or HapMap format file, Haploview shows some basic data quality checks for the markers. Markers are filtered out based on some default criteria which can be adjusted as necessary. Markers can be added or removed from analyses by hand via the checkboxes.

  • # is the marker number.
  • Name is the marker ID specified (only if an info file is loaded).
  • Position is the marker position specified (only if an info file is loaded).
  • ObsHET is the marker's observed heterozygosity.
  • PredHET is the marker's predicted heterozygosity (i.e. 2*MAF*(1-MAF)).
  • HWpval is the Hardy-Weinberg equilibrium p value, which is the probability that the marker's deviation from H-W equilibrium could be explained by chance.
  • %Geno is the percentage of non-missing genotypes for this marker.
  • FamTrio is the number of fully genotyped family trios for this marker (0 for datasets with unrelated individuals).
  • MendErr is the number of observed Mendelian inheritance errors (0 for datasets with unrelated individuals).
  • MAF is the minor allele frequency (using founders only) for this marker.
  • Rating is checked if the marker passes all the tests and unchecked if it fails one or more tests (highlighted in red).

You can adjust the filtering thresholds and click "Rescore" to refilter the markers using the new values. Markers can also be selected/unselected by hand by clicking the "Rating" checkbox. Any marker which fails one of the quality tests will have the relevant field(s) highlighted in red.
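
As an illustration of how several of the columns above relate to one another, the following Python sketch derives ObsHET, PredHET, %Geno and MAF from genotype counts for a single biallelic marker. It is a minimal sketch rather than Haploview's implementation: the function name and arguments are hypothetical, and the counts are assumed to come from founders only. The Hardy-Weinberg p value is left out because the manual does not describe how it is computed.

```python
def marker_summary(n_aa, n_ab, n_bb, n_missing=0):
    """Summarise one biallelic marker from genotype counts.

    n_aa, n_bb -- counts of the two homozygous genotypes
    n_ab       -- count of heterozygous genotypes
    n_missing  -- count of individuals with no genotype call
    """
    n_called = n_aa + n_ab + n_bb
    if n_called == 0:
        raise ValueError("no called genotypes for this marker")

    # Frequency of allele A; the minor allele is whichever is rarer.
    freq_a = (2 * n_aa + n_ab) / (2 * n_called)
    maf = min(freq_a, 1 - freq_a)

    return {
        "ObsHET": n_ab / n_called,
        "PredHET": 2 * maf * (1 - maf),  # formula given in the list above
        "%Geno": 100 * n_called / (n_called + n_missing),
        "MAF": maf,
    }


summary = marker_summary(n_aa=50, n_ab=40, n_bb=10)
```

For these counts the minor allele frequency is 0.3, so the predicted heterozygosity is 2*0.3*0.7 = 0.42, slightly above the observed value of 0.40.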

Duplicate Markers

If two markers in an input file have the same chromosomal position, Haploview will by default ignore the less completely genotyped marker and highlight both in yellow on the check markers panel. When running in nogui mode, Haploview always ignores the less completely genotyped version of two markers with the same position. If you want to use both from the command line, you'll need to adjust one of the positions.

Filtered Individuals

The top of the tab contains information about individuals filtered during the loading of the file. It will show overview information about the number of singletons and trios used and the number of independent families loaded. Further information about filtered individuals can be shown by clicking the "Show Excluded Individuals" button. This will present a list of excluded individuals as well as the reason for exclusion. Details about individual filtering can be found in the FAQ.

LD Display

Perusing the LD Display

  • The color scheme option (Display menu) allows you to choose among several LD color schemes. The following tables provide details on the color schemes, and a key to the meaning of the currently selected scheme can be dropped down from the "Key" menu in the upper right corner of the screen.

    Table 1.1. Standard Color Scheme

                D' < 1                D' = 1
    LOD < 2     white                 blue
    LOD ≥ 2     shades of pink/red    bright red

    Table 1.2. Confidence Bounds Color Scheme

    Strong Evidence of LD               dark grey
    Uninformative                       light grey
    Strong Evidence of Recombination    white

    Table 1.3. r2 Color Scheme

    r2 = 0        white
    0 < r2 < 1    shades of grey
    r2 = 1        black

    Table 1.4. Alternate D'/LOD Color Scheme

                Low D'    High D'
    Low LOD     white     shades of pink
    High LOD    white     black

    (r2 and Alt D'/LOD courtesy of Will Fitzhugh)

    Table 1.5. 4 Gamete Color Scheme

    4 distinct 2-marker haplotypes      white
    < 4 distinct 2-marker haplotypes    black
  • In order to help keep the display uncluttered, D prime values of 1.0 are never shown (the box is empty). The r2 and Alternate D'/LOD schemes do not show any values in the boxes.
  • The zoom option (Display menu) allows you to select one of three zoom modes. The two zoomed out versions can be useful for browsing large datasets.
  • Large datasets also show a "map" in the lower left corner which gives an overview of the D prime display and allows you to navigate quickly. Clicking on an area of the map will cause the main display to jump to that area. This map also shows the currently defined blocks as small black lines across the top.
  • Markers with additional notes (as loaded from the info file) are highlighted: the names are green in the zoomed-in view, and the lines from the SNP position to the LD chart are green in the zoomed-out view. Details can be viewed by right clicking on the marker number (as mentioned below).
  • Right clicking on the marker number (or the equivalent space in the zoomed out views) shows the marker name, minor allele frequency and any additional notes specified in the info file. This can be especially helpful in the zoomed out views which do not display marker names.
  • Right clicking on any pairwise LD comparison will show a more detailed summary of the LD between the two markers in question.
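
The standard and r2 color schemes (Tables 1.1 and 1.3) amount to simple lookups on the pairwise statistics. The Python sketch below restates them as functions. The function names are hypothetical, and the handling of floating-point values at exactly 0 or 1 is an assumption, not Haploview's documented behavior.

```python
def standard_color(d_prime, lod):
    """Color category for one marker pair under the standard scheme (Table 1.1)."""
    complete_ld = d_prime >= 1.0  # D' = 1 column of the table
    if lod < 2:
        return "blue" if complete_ld else "white"
    return "bright red" if complete_ld else "shades of pink/red"


def r2_color(r2):
    """Color category for one marker pair under the r2 scheme (Table 1.3)."""
    if r2 <= 0.0:
        return "white"
    if r2 >= 1.0:
        return "black"
    return "shades of grey"
```

For example, a pair with D' = 1 and LOD = 1.2 falls in the "blue" cell, which marks complete D' with weak statistical support.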

Additional Data Tracks

Analysis Track

A graph of any variable versus chromosomal location can be added above the LD plot with the "Load Analysis Track" option. Simply create a file with two columns: <position> <value> . Haploview will plot the values in a continuous line along the top of the screen.
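
A track file of this shape can be produced with a few lines of Python. The sketch below assumes one whitespace-separated <position> <value> pair per line, with no header row. Sorting by position is a convenience here and is not a documented requirement. The function names are hypothetical.

```python
def format_analysis_track(points):
    """Render (position, value) pairs as two-column text, one pair per line."""
    lines = [f"{position} {value}" for position, value in sorted(points)]
    return "\n".join(lines) + "\n"


def write_analysis_track(path, points):
    """Write the two-column track text to `path`."""
    with open(path, "w") as handle:
        handle.write(format_analysis_track(points))
```

Calling write_analysis_track("track.txt", [(200, 0.5), (100, 1.25)]) would write the line "100 1.25" followed by the line "200 0.5".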

HapMap Gene/SNP Track

The "Download HapMap info track" option (which requires an internet connection) connects to the HapMap Project server and downloads and displays a track with HapMap genotyped SNPs and gene names. If an info file is specified, the default boundaries are the positions of the first and last markers (which is only valid if the info file is in genomic coordinates). You must specify the proper chromosome in the dialog box. If you are using a file downloaded from the HapMap website, the program will specify the correct default chromosome and start/end positions. This track display can be configured with the "HapMap Info Track Options" item in the "Display" menu. Available tracks include HapMap SNPs, gene names, mRNAs, contigs, gaps and GC content.

Blocks and Haplotypes

Blocks

Haploview generates blocks whenever a file is opened, but these blocks can be edited and redefined in a number of ways. In the Analysis menu, you can clear all the blocks in order to start over, define blocks based on one of several automated methods or customize the parameters of those algorithms. Additionally, the blocks can be edited by hand.

Confidence Intervals [DEFAULT]

The default algorithm is taken from Gabriel et al, Science, 2002. 95% confidence bounds on D prime are generated and each comparison is called "strong LD", "inconclusive" or "strong recombination". A block is created if 95% of informative (i.e. non-inconclusive) comparisons are "strong LD". This method by default ignores markers with MAF < 0.05. The MAF cutoff and the confidence bound cutoffs can be edited by choosing "Customize Block Definitions" (Analysis menu). This definition allows for many overlapping blocks to be valid. The default behavior is to sort the list of all possible blocks and start with the largest and keep adding blocks as long as they don't overlap with an already declared block.
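
The two steps described above, testing a candidate block and then choosing non-overlapping blocks largest first, can be sketched in Python as follows. This is an illustration of the description, not Haploview's implementation. The function names are hypothetical, the per-pair calls are assumed to have been made already, and "largest" is assumed to mean the widest span of marker indices.

```python
def is_block(calls, min_strong_fraction=0.95):
    """Decide whether a candidate region qualifies as a block.

    calls -- one label per marker pair in the region: "strong LD",
             "inconclusive" or "strong recombination"
    """
    informative = [c for c in calls if c != "inconclusive"]
    if not informative:
        return False
    strong = sum(1 for c in informative if c == "strong LD")
    return strong / len(informative) >= min_strong_fraction


def pick_blocks(candidates):
    """Greedily keep the largest candidates that do not overlap.

    candidates -- (first_marker, last_marker) index pairs, inclusive
    """
    chosen = []
    by_size = sorted(candidates, key=lambda b: b[1] - b[0], reverse=True)
    for start, end in by_size:
        # Keep the candidate only if it is disjoint from every chosen block.
        if all(end < s or start > e for s, e in chosen):
            chosen.append((start, end))
    return sorted(chosen)
```

Given the candidates (0, 5), (4, 8), (6, 9) and (10, 12), the sketch keeps (0, 5) first, rejects (4, 8) because it overlaps, and then accepts (6, 9) and (10, 12).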

Four Gamete Rule

This is a variant on the algorithm described in Wang et al, Am. J. Hum. Genet., 2002. For each marker pair, the population frequencies of the 4 possible two-marker haplotypes are computed. If all 4 are observed with at least frequency 0.01, a recombination is deemed to have taken place. Blocks are formed by consecutive markers where only 3 gametes are observed. The 1% cutoff can be edited to make the definition more or less stringent.
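
The per-pair test can be written down directly. In the Python sketch below, the function name is hypothetical and the four estimated two-marker haplotype frequencies are assumed to be supplied by the caller. How the pairwise results are combined into blocks is not covered.

```python
def recombination_observed(hap_freqs, cutoff=0.01):
    """Four gamete test for one marker pair.

    hap_freqs -- estimated population frequencies of the 4 possible
                 two-marker haplotypes
    cutoff    -- minimum frequency for a haplotype to count as observed
    """
    if len(hap_freqs) != 4:
        raise ValueError("expected exactly 4 haplotype frequencies")
    # Recombination is inferred only when all four gametes are present.
    return all(freq >= cutoff for freq in hap_freqs)
```

A pair with frequencies 0.60, 0.30, 0.10 and 0.00 shows only three gametes and so gives no evidence of recombination, whereas 0.50, 0.30, 0.19 and 0.01 does.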

Solid Spine of LD

This internally developed method searches for a "spine" of strong LD running from one marker to another along the legs of the triangle in the LD chart (this would mean that the first and last markers in a block are in strong LD with all intermediate markers but that the intermediate markers are not necessarily in LD with each other).

Markers can be removed from blocks by clicking on the marker number (along the top of the D prime graph). Blocks can be defined by hand by clicking and dragging along the marker number row. Any new block which overlaps with an existing block will take precedence and delete the existing block.

Haplotypes

Display

View haplotypes for selected blocks by clicking on the "Haplotypes" tab or selecting "Haplotypes" from the Display menu. Haplotypes are estimated using an accelerated EM algorithm similar to the partition/ligation method described in Qin et al, 2002, Am J Hum Genet. This creates highly accurate population frequency estimates of the phased haplotypes based on the maximum likelihood as determined from the unphased input.

The haplotype display shows each haplotype in a block with its population frequency and connections from one block to the next. In the crossing areas, a value of multiallelic D' is shown. This represents the level of recombination between the two blocks. Note that the value of multiallelic D' is computed for only the haplotypes ("alleles") currently displayed. This usually does not have a strong effect, as the rare haplotypes contribute only slightly to the overall value. Above the haplotypes are marker numbers along with a tick beneath haplotype tag SNPs (htSNPs).

Display Controls

The display can be edited using the controls at the bottom of the screen to display only more common haplotypes or to adjust the connecting lines. By default, alleles are displayed using A,C,G,T along with the special symbol 'X' which represents a fairly rare situation in which only one allele is unambiguously observed in phased data. The 'X' represents the allele of unknown identity. The display can also be changed to show the alleles numerically from 1-4 with 8 being the equivalent of 'X', or as blue and red boxes, with blue being the major allele and red the minor.

Tag SNPs

The htSNPs are selected on a block-by-block basis. This means that the final set is not necessarily the most parsimonious one for the entire dataset, but it provides a much more thorough testing set if you plan to move from an initial sample used to pick htSNPs to a much larger sample. In other words, it is much more likely to catch variation in the new, larger dataset that was not observed in the initial dataset. Preference for this method or another depends on experimental design.

Specifically, the SNPs in each block are ranked in order of completeness of genotyping and then a set is selected which defines all haplotypes above a certain (adjustable) frequency threshold. The method ensures the most efficient possible set for strict 4 gamete blocks, but occasionally produces redundancies for larger and more loosely defined blocks (where some recombination has occurred).

Association Tests

If selected when loading the data, Haploview computes single locus and multi-marker haplotype association tests. For case/control data, the chi square and p-value for the allele frequencies in cases vs. control are shown. For family trios, all probands (affected individual with genotyped parents) are used to compute TDT values.

The haplotype association test is performed on the set of blocks selected on the LD and haplotype tabs. Results are shown only for those haplotypes above the display threshold on the haplotype tab. Counts for both TDT and case control association tests are obtained by summing the fractional likelihoods of each individual for each haplotype. In other words, if a particular individual has been determined by the EM to have a 40% likelihood of haplotype A and 60% likelihood of haplotype B, 0.4 and 0.6 would be added to the counts for A and B respectively.
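
The fractional counting described above can be sketched in a few lines of Python. The function name and the input layout (one dictionary of haplotype likelihoods per individual) are assumptions made for illustration. The subsequent TDT and case/control statistics are not shown.

```python
from collections import defaultdict


def fractional_counts(individual_likelihoods):
    """Sum per-individual haplotype likelihoods into haplotype counts.

    individual_likelihoods -- one dict per individual, mapping a haplotype
                              label to the likelihood assigned by the EM
    """
    counts = defaultdict(float)
    for likelihoods in individual_likelihoods:
        for haplotype, likelihood in likelihoods.items():
            counts[haplotype] += likelihood
    return dict(counts)


counts = fractional_counts([
    {"A": 0.4, "B": 0.6},  # the individual from the example above
    {"A": 1.0},            # an individual whose haplotype is unambiguous
])
```

Here the first individual contributes 0.4 to A and 0.6 to B, and the second contributes a full count to A, giving totals of 1.4 for A and 0.6 for B.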

Additional information about the way in which pedigrees are filtered for TDT purposes can be found in the FAQ.

Haploview is not intended to be the only way of testing association results, but to provide a straightforward way to do simple association tests. It's always a good idea to try out multiple approaches to analyzing your data.