Data can be loaded in three formats along with an optional marker info file (except for HapMap files which do not require an info file). Further options are presented on the load screen.
Please note that bugs are sure to crop up which occasionally cause the program to hang. In general any dataset which causes the program to hang for > 15 minutes is unlikely ever to return. Therefore, prolonged wait times are probably a symptom of a failure as opposed to lengthy compute time. If you encounter such a problem, please contact the authors.
Haploview allocates 1G of memory by default. This is usually sufficient to handle datasets with several thousand markers. If you are running the program on very large datasets (>20,000 markers) you may need to force more memory (presuming your computer has sufficient resources available). This can be accomplished using the following command:
java -Xmx2000M -cp Haploview.jar edu/mit/wi/haploview/Haploview
Where "2000M" in this case specifies 2000 megabytes (2G) of memory and can be adjusted as necessary.
After loading a linkage or HapMap format file, Haploview shows some basic data quality checks for the markers. Markers are filtered out based on some default criteria which can be adjusted as necessary. Markers can be added or removed from analyses by hand via the checkboxes.
You can adjust the filtering thresholds and click "Rescore" to refilter the markers using the new values. Markers can also be selected/unselected by hand by clicking the "Rating" checkbox. Any marker which fails one of the quality tests will have the relevant field(s) highlighted in red.
If two markers in an input file have the same chromosomal position, Haploview will by default ignore the less completely genotyped marker and highlight both in yellow on the check markers panel. When running in nogui mode, Haploview always ignores the less completely genotyped of two markers with the same position. If you want to use both from the command line, you'll need to adjust one of the positions.
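The duplicate-position rule can be illustrated with a minimal sketch. The function name, the tuple layout (name, position, fraction genotyped) and the tie-breaking behavior (first marker seen wins) are assumptions made for illustration and are not taken from Haploview itself.

```python
def drop_duplicate_positions(markers):
    """Keep only the most completely genotyped marker at each position.

    markers: list of (name, position, fraction_genotyped) tuples.
    This tuple layout is hypothetical, chosen only for this example.
    """
    best = {}
    for name, pos, frac in markers:
        # Replace the stored marker only if this one is more completely genotyped.
        if pos not in best or frac > best[pos][2]:
            best[pos] = (name, pos, frac)
    return sorted(best.values(), key=lambda m: m[1])


# Made-up example markers; rs1 and rs2 share position 100.
example = [("rs1", 100, 0.98), ("rs2", 100, 0.91), ("rs3", 250, 0.99)]
kept = drop_duplicate_positions(example)
```

Here the marker with 91% genotyping at position 100 would be dropped in favor of the one with 98%.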
The top of the tab contains information about individuals filtered during the loading of the file. It will show overview information about the number of singletons and trios used and the number of independent families loaded. Further information about filtered individuals can be shown by clicking the "Show Excluded Individuals" button. This will present a list of excluded individuals as well as the reason for exclusion. Details about individual filtering can be found in the FAQ.
The color scheme option (Display menu) allows you to choose among several LD color schemes. The following tables provide details on the color schemes, and a key to the meaning of the currently selected scheme can be dropped down from the "Key" menu in the upper right corner of the screen.
Table 1.2. Confidence Bounds Color Scheme
Strong Evidence of LD | dark grey
Uninformative | light grey
Strong Evidence of Recombination | white
(r2 and Alt D'/LOD courtesy of Will Fitzhugh)
A graph of any variable versus chromosomal location can be added above the LD plot with the "Load Analysis Track" option. Simply create a file with two columns: <position> <value> . Haploview will plot the values in a continuous line along the top of the screen.
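As an illustration of the two-column layout, the following sketch formats (position, value) pairs as text. It assumes the columns are separated by a single space and sorts the rows by position for readability; neither detail is confirmed by the description above, and the function name and data are hypothetical.

```python
from pathlib import Path


def format_analysis_track(points):
    """Return two-column '<position> <value>' text, one pair per line.

    Assumption: columns are separated by a single space.
    """
    rows = [f"{int(pos)} {float(val)}" for pos, val in sorted(points)]
    return "\n".join(rows) + "\n"


# Made-up positions and values, purely for illustration.
track_text = format_analysis_track([(120500, 0.8), (101200, 0.35), (135750, 1.2)])

if __name__ == "__main__":
    # Write the track to a file that could then be chosen in "Load Analysis Track".
    Path("example_track.txt").write_text(track_text)
```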
The "Download HapMap info track" option (with an internet connection) allows you to connect to the HapMap Project server and download and display a track with HapMap genotyped SNPs and gene names. If an info file is specified, the default boundaries are the positions of the first and last markers (which is only valid if the info file is in genomic coordinates). You must specify the proper chromosome in the dialog box. If you are using a file downloaded from the HapMap website the program will specify the correct default chromosome and start/end positions. This track display can be configured with the "HapMap Info Track Options" item in the "Display" menu. Available tracks include HapMap SNPs, gene names, mRNAs, contigs, gaps and GC content.
Haploview generates blocks whenever a file is opened, but these blocks can be edited and redefined in a number of ways. In the Analysis menu, you can clear all the blocks in order to start over, define blocks based on one of several automated methods or customize the parameters of those algorithms. Additionally, the blocks can be edited by hand.
The default algorithm is taken from Gabriel et al, Science, 2002. 95% confidence bounds on D prime are generated and each comparison is called "strong LD", "inconclusive" or "strong recombination". A block is created if 95% of informative (i.e. non-inconclusive) comparisons are "strong LD". This method by default ignores markers with MAF < 0.05. The MAF cutoff and the confidence bound cutoffs can be edited by choosing "Customize Block Definitions" (Analysis menu). This definition allows for many overlapping blocks to be valid. The default behavior is to sort the list of all possible blocks and start with the largest and keep adding blocks as long as they don't overlap with an already declared block.
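The logic of the confidence-interval method can be sketched roughly as follows. The cutoff values (0.70, 0.98 and 0.90), the function names, and the use of marker-index span as the measure of block size are assumptions made for illustration; the actual values are whatever is set under "Customize Block Definitions".

```python
def classify_pair(ci_low, ci_high, strong_low=0.70, strong_high=0.98, recomb_high=0.90):
    """Call one marker pair from its 95% confidence bounds on D'.

    The three cutoffs are assumed defaults, not values confirmed by this manual.
    """
    if ci_low >= strong_low and ci_high >= strong_high:
        return "strong LD"
    if ci_high < recomb_high:
        return "strong recombination"
    return "inconclusive"


def is_block(calls, min_fraction=0.95):
    """A region qualifies if enough informative comparisons are 'strong LD'."""
    informative = [c for c in calls if c != "inconclusive"]
    if not informative:
        return False
    strong = sum(1 for c in informative if c == "strong LD")
    return strong / len(informative) >= min_fraction


def pick_blocks(candidates):
    """Greedy selection: largest candidate first, skipping any overlap.

    candidates: list of (first_marker, last_marker) index pairs, inclusive.
    """
    chosen = []
    for start, end in sorted(candidates, key=lambda b: (-(b[1] - b[0]), b[0])):
        if all(end < s or start > e for s, e in chosen):
            chosen.append((start, end))
    return sorted(chosen)
```

For example, given the candidates (0, 5), (4, 8), (6, 9) and (10, 11), the sketch keeps (0, 5) first, rejects (4, 8) because it overlaps, and then accepts (6, 9) and (10, 11).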
The four gamete rule is a variant on the algorithm described in Wang et al, Am. J. Hum. Genet., 2002. For each marker pair, the population frequencies of the 4 possible two-marker haplotypes are computed. If all 4 are observed with a frequency of at least 0.01, a recombination is deemed to have taken place. Blocks are formed by consecutive markers where only 3 gametes are observed. The 1% cutoff can be edited to make the definition more or less stringent.
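A simplified sketch of the gamete test is shown below. It assumes already phased haplotypes given as strings, whereas Haploview works from estimated population frequencies, and it uses a simple left-to-right scan to form blocks; both are simplifications chosen for illustration, and the function names are hypothetical.

```python
from collections import Counter


def gamete_count(haplotypes, i, j, min_freq=0.01):
    """Count two-marker haplotypes at markers i and j with frequency >= min_freq."""
    pairs = Counter((h[i], h[j]) for h in haplotypes)
    total = sum(pairs.values())
    return sum(1 for n in pairs.values() if n / total >= min_freq)


def four_gamete_blocks(haplotypes, min_freq=0.01):
    """Scan left to right, closing a block when any pair shows all 4 gametes.

    Returns inclusive (first_marker, last_marker) index pairs of length >= 2.
    """
    n_markers = len(haplotypes[0])
    blocks, start = [], 0
    for k in range(1, n_markers + 1):
        # Close the current run at the end of the data, or when marker k
        # shows evidence of recombination with any marker already in the run.
        if k == n_markers or any(
            gamete_count(haplotypes, i, k, min_freq) == 4 for i in range(start, k)
        ):
            if k - start >= 2:
                blocks.append((start, k - 1))
            start = k
    return blocks


# Made-up phased haplotypes over three biallelic markers.
haps = ["AAA", "AAT", "ATA", "ATT", "TTA", "TTT"]
```

In this toy data, markers 0 and 1 show only three gametes (AA, AT, TT), while marker 2 shows all four with each of the others, so the only block is markers 0 to 1.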
The solid spine of LD method, developed internally, searches for a "spine" of strong LD running from one marker to another along the legs of the triangle in the LD chart. This means that the first and last markers in a block are in strong LD with all intermediate markers, but that the intermediate markers are not necessarily in LD with each other.
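The spine condition can be sketched as a check on a symmetric matrix of pairwise D' values. The threshold of 0.8 for "strong LD" and the function name are assumptions made for this illustration.

```python
def is_spine(dprime, first, last, threshold=0.8):
    """Check whether markers first..last satisfy the spine condition.

    dprime: symmetric matrix (list of lists) of pairwise D' values.
    threshold: assumed cutoff for "strong LD"; adjust as appropriate.
    """
    if dprime[first][last] < threshold:
        return False
    for k in range(first + 1, last):
        # Each intermediate marker must be in strong LD with both ends;
        # LD among the intermediate markers themselves is not examined.
        if dprime[first][k] < threshold or dprime[k][last] < threshold:
            return False
    return True


# Made-up D' matrix for five markers. Markers 1 and 2 are in weak LD with
# each other (0.10) but in strong LD with markers 0 and 3.
d = [
    [1.00, 0.90, 0.95, 0.90, 0.20],
    [0.90, 1.00, 0.10, 0.85, 0.30],
    [0.95, 0.10, 1.00, 0.90, 0.40],
    [0.90, 0.85, 0.90, 1.00, 0.30],
    [0.20, 0.30, 0.40, 0.30, 1.00],
]
```

With this matrix, markers 0 through 3 form a spine even though markers 1 and 2 are not in LD with each other, while extending the range to marker 4 fails.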
Markers can be removed from blocks by clicking on the marker number (along the top of the D prime graph). Blocks can be defined by hand by clicking and dragging along the marker number row. Any newly defined block which overlaps with an existing block will take precedence and delete the existing block.
View haplotypes for selected blocks by clicking on the "Haplotypes" tab or selecting "Haplotypes" from the Display menu. Haplotypes are estimated using an accelerated EM algorithm similar to the partition/ligation method described in Qin et al, 2002, Am J Hum Genet. This creates highly accurate population frequency estimates of the phased haplotypes based on the maximum likelihood as determined from the unphased input.
The haplotype display shows each haplotype in a block with its population frequency and connections from one block to the next. In the crossing areas, a value of multiallelic D' is shown. This represents the level of recombination between the two blocks. Note that the value of multiallelic D' is computed for only the haplotypes ("alleles") currently displayed. This usually does not have a strong effect, as the rare haplotypes contribute only slightly to the overall value. Above the haplotypes are marker numbers along with a tick beneath haplotype tag SNPs (htSNPs).
The display can be edited using the controls at the bottom of the screen to display only more common haplotypes or to adjust the connecting lines. By default, alleles are displayed using A,C,G,T along with the special symbol 'X' which represents a fairly rare situation in which only one allele is unambiguously observed in phased data. The 'X' represents the allele of unknown identity. The display can also be changed to show the alleles numerically from 1-4 with 8 being the equivalent of 'X', or as blue and red boxes, with blue being the major allele and red the minor.
The htSNPs are selected on a block-by-block basis. This means that the final set is not necessarily the most parsimonious one for the entire dataset. It does, however, provide a much more thorough testing set if you plan to move from an initial sample used to pick htSNPs to a much larger sample: it is much more likely to catch variation in the new, larger dataset that was not observed in the initial dataset. Preference for this method or another depends on experimental design.
Specifically, the SNPs in each block are ranked in order of completeness of genotyping and then a set is selected which defines all haplotypes above a certain (adjustable) frequency threshold. The method ensures the most efficient possible set for strict 4 gamete blocks, but occasionally produces redundancies in larger and more loosely defined blocks (where some recombination has occurred).
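One way to picture this kind of selection is a simple greedy pass over the ranked SNPs, keeping a SNP only if it separates haplotypes that the SNPs chosen so far cannot tell apart. This is a stand-in written for illustration, not Haploview's actual selection procedure; the function name and inputs are hypothetical, and the haplotype list is assumed to already contain only haplotypes above the frequency threshold.

```python
def select_tag_snps(haplotypes, ranked_snps):
    """Greedily pick SNP indices that together distinguish all haplotypes.

    haplotypes: list of allele strings, one per haplotype in the block.
    ranked_snps: SNP indices, best (most completely genotyped) first.
    """
    def n_groups(snps):
        # Number of distinct allele patterns seen at the given SNPs.
        return len({tuple(h[s] for s in snps) for h in haplotypes})

    target = len(set(haplotypes))
    chosen = []
    for snp in ranked_snps:
        if n_groups(chosen) == target:
            break  # every haplotype is already uniquely identified
        if n_groups(chosen + [snp]) > n_groups(chosen):
            chosen.append(snp)
    return chosen


# Made-up block with three haplotypes over three SNPs.
tags = select_tag_snps(["AAC", "ATC", "TTG"], ranked_snps=[2, 0, 1])
```

In this toy block, SNP 2 splits the third haplotype from the other two, SNP 0 adds nothing new, and SNP 1 separates the remaining pair.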
If selected when loading the data, Haploview computes single locus and multi-marker haplotype association tests. For case/control data, the chi square and p-value for the allele frequencies in cases vs. controls are shown. For family trios, all probands (affected individuals with genotyped parents) are used to compute TDT values.
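For orientation, the standard forms of these two statistics can be sketched as below: a Pearson chi square on the 2x2 table of allele counts (without continuity correction) and the usual TDT statistic on transmitted versus untransmitted counts, each with one degree of freedom. Whether Haploview computes them in exactly this form is an assumption, and the function names are hypothetical.

```python
import math


def chi2_p_value_1df(stat):
    """Upper-tail p-value of a chi square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def allelic_chi_square(case_a, case_b, ctrl_a, ctrl_b):
    """Pearson chi square on allele counts (allele a vs. b, cases vs. controls)."""
    n = case_a + case_b + ctrl_a + ctrl_b
    numerator = n * (case_a * ctrl_b - case_b * ctrl_a) ** 2
    denominator = (
        (case_a + case_b) * (ctrl_a + ctrl_b) * (case_a + ctrl_a) * (case_b + ctrl_b)
    )
    stat = numerator / denominator
    return stat, chi2_p_value_1df(stat)


def tdt(transmitted, untransmitted):
    """TDT statistic from counts of an allele transmitted and not transmitted
    by heterozygous parents to affected offspring."""
    stat = (transmitted - untransmitted) ** 2 / (transmitted + untransmitted)
    return stat, chi2_p_value_1df(stat)


# Made-up counts, purely for illustration.
cc_stat, cc_p = allelic_chi_square(60, 40, 40, 60)
tdt_stat, tdt_p = tdt(30, 10)
```

With these made-up counts the case/control statistic is 8.0 and the TDT statistic is 10.0.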
The haplotype association test is performed on the set of blocks selected on the LD and haplotype tabs. Results are shown only for those haplotypes above the display threshold on the haplotype tab. Counts for both TDT and case control association tests are obtained by summing the fractional likelihoods of each individual for each haplotype. In other words, if a particular individual has been determined by the EM to have a 40% likelihood of haplotype A and 60% likelihood of haplotype B, 0.4 and 0.6 would be added to the counts for A and B respectively.
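The summing of fractional likelihoods can be shown in a few lines. The data layout (one dictionary of haplotype likelihoods per individual) and the function name are assumptions chosen for this illustration.

```python
from collections import defaultdict


def fractional_counts(likelihoods):
    """Sum per-individual haplotype likelihoods into fractional counts.

    likelihoods: list of dicts mapping haplotype label -> likelihood,
    one dict per individual (a hypothetical layout for this sketch).
    """
    counts = defaultdict(float)
    for per_individual in likelihoods:
        for haplotype, likelihood in per_individual.items():
            counts[haplotype] += likelihood
    return dict(counts)


# The first individual matches the example in the text: 40% A, 60% B.
counts = fractional_counts([{"A": 0.4, "B": 0.6}, {"A": 1.0}])
```

Here haplotype A accumulates 0.4 from the first individual and 1.0 from the second, while B accumulates only 0.6.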
Additional information about the way in which pedigrees are filtered for TDT purposes can be found in the FAQ.
Haploview is not intended to be the only way of testing association results, but to provide a straightforward way to do simple association tests. It's always a good idea to try out multiple approaches to analyzing your data.