FastGap homepage

Finn Borchsenius, Department of Biological Sciences, University of Aarhus, Denmark

Last updated 07 February 2012

DOWNLOAD FastGap 1.2 (zip file)

If you find the program useful please send me an email

Cite as: Borchsenius, F. 2009. FastGap 1.2. Department of Biosciences, Aarhus University, Denmark. Published online at http://www.aubot.dk/FastGap_home.htm

Introduction

FastGap is a Windows executable program for fast and efficient assembly of DNA sequence alignment files in BioEdit fasta format into #NEXUS format ready for analysis in programs such as PAUP* or MrBayes. In the process, gap or indel characters can be coded using the simple method of Simmons & Ochoterena 2000 and added to the data file as separate partitions.

The program was built with Borland C++Builder and needs the following resource file to be present in order to run:

VCL35.BPL

That file is supplied together with the Windows executable. Place it in the same folder as the FastGap program.

The Windows interface of FastGap is simple and largely self-explanatory. One or several sequence alignment files can be added to the assembly list using a standard Windows file open dialog. The list of files can also be edited manually if necessary. The present version accepts at most 9 files, but this limit can easily be changed if there is a need to do so. When the Make command is executed, each sequence alignment is written to the #NEXUS file as a separate partition defined in a SETS block. Nucleotide partitions are named region1_nuc, region2_nuc, etc. If gaps are coded, they are added to the #NEXUS file as separate partitions following each sequence alignment. Gap partitions are also defined in the SETS block and named region1_gaps, region2_gaps, etc. A set including all gap partitions is also defined (charset ‘gaps’) to facilitate fast inclusion and exclusion of gap characters in PAUP. A list of all gaps that have been coded, with the first occurrence of each, is written to the #NEXUS file as a list of comments. I owe inspiration for the format of that list to Young and Healy 2003.
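To make the partition naming concrete, here is a minimal Python sketch that builds charset statements from the number of nucleotide and gap characters contributed by each input file. The function name charset_lines and the exact layout of the statements are assumptions made for illustration; the SETS block that FastGap itself writes may differ in detail.

```python
def charset_lines(regions):
    """Build charset statements for concatenated alignments.

    regions: list of (n_nucleotides, n_gap_characters), one pair per
    input file, in assembly order. Each region's gap characters are
    assumed to follow its nucleotide characters directly.
    """
    lines = []
    gap_ranges = []
    pos = 1  # NEXUS character positions are 1-based
    for i, (n_nuc, n_gaps) in enumerate(regions, start=1):
        lines.append(f"charset region{i}_nuc = {pos}-{pos + n_nuc - 1};")
        pos += n_nuc
        if n_gaps > 0:
            rng = f"{pos}-{pos + n_gaps - 1}"
            lines.append(f"charset region{i}_gaps = {rng};")
            gap_ranges.append(rng)
            pos += n_gaps
    # One extra set collects all gap partitions for quick exclusion.
    if gap_ranges:
        lines.append("charset gaps = " + " ".join(gap_ranges) + ";")
    return lines


if __name__ == "__main__":
    # Hypothetical example: two files, the second without coded gaps.
    for line in charset_lines([(14, 3), (10, 0)]):
        print(line)
```

For two files with 14 and 10 nucleotide positions, of which only the first yields 3 gap characters, this gives region1_nuc = 1-14, region1_gaps = 15-17, region2_nuc = 18-27 and gaps = 15-17.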

Other program settings include:

Coding of gap characters on/off
Specification of the first n taxa in the data matrix as outgroup. This option results in an outgroup setting being written to the data file and is particularly convenient if you are using PAUP for PC where the outgroup otherwise has to be specified from the command line.

From version 1.2 the program interprets ‘?’, ‘N’ and ‘n’ in the input file as missing data, i.e., ‘uncertain whether gap or nucleotide’. A hyphen ‘-’ represents a gap. All other characters are treated as nucleotides. Prior versions used only a single missing data character.
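The three character classes can be expressed in a few lines of Python. This is only an illustration of the rule stated above, not FastGap's own code; the names MISSING, GAP and classify are hypothetical.

```python
MISSING = set("?Nn")  # missing data from version 1.2 onwards
GAP = "-"


def classify(ch):
    """Return 'missing', 'gap' or 'nucleotide' for one alignment character."""
    if ch in MISSING:
        return "missing"
    if ch == GAP:
        return "gap"
    # Everything else, including ambiguity codes other than N, counts
    # as a nucleotide under this rule.
    return "nucleotide"


if __name__ == "__main__":
    print([classify(c) for c in "AC-N?"])
```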

An error message will appear if the specified sequence alignment files cannot be found or opened, or if the number of taxa varies among different files. The program does not check that all lines of an input sequence alignment file contain the same number of nucleotide characters. Such errors will be detected only when you try to execute the generated #NEXUS file in PAUP*. Upon successful assembly a preview of the #NEXUS file is displayed for inspection. Note that the file cannot be edited from the FastGap window. If you detect an error, correct the input files and assemble them again. If you wish to modify the #NEXUS file after assembly, do so with your favourite text editor.
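Since FastGap does not perform this length check, it can be done beforehand with a short script. The following Python sketch assumes a plain fasta layout (a header line starting with ‘>’ followed by one or more sequence lines) and compares the total length of each sequence; the function names read_fasta and length_problems are hypothetical.

```python
def read_fasta(text):
    """Parse fasta text into a list of (name, sequence) pairs."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif records:
            # Sequence data may be wrapped over several lines.
            records[-1][1] += line
    return [(name, seq) for name, seq in records]


def length_problems(records):
    """Return the names of sequences whose length differs from the first."""
    if not records:
        return []
    expected = len(records[0][1])
    return [name for name, seq in records if len(seq) != expected]


if __name__ == "__main__":
    example = ">a\nACGT\n>b\nAC-T\n>c\nACG\n"
    print(length_problems(read_fasta(example)))
```

An empty result means that all sequences in the file have the same length.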

One common source of error is incorrect format of the input file. It must be in the BioEdit standard fasta format. If you experience problems, try opening your alignment in BioEdit, saving it in a different format (e.g., GenBank file ‘.gb’), then re-opening it in BioEdit and saving it once more in fasta format. That should ensure that the format is correct for FastGap.

Another common source of error is to have space characters in the taxon names. This will cause FastGap to interpret text following the first space as nucleotide characters. The output format is intended for direct use in PAUP. If your aim is to analyse the file in MrBayes then you need to manually delete the line:

OPTIONS GAPMODE=MISSING;

in the DATA block. You may also wish to add a MrBayes block with the necessary specifications for your analysis. The data partitions specified in the SETS block can be copied to the MrBayes block if you intend to analyse a partitioned model.
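Both adjustments described above, removing spaces from taxon names before assembly and deleting the OPTIONS line afterwards, can also be scripted. The Python sketch below works on text already read from the files. The function names are hypothetical, and replacing spaces with underscores is only one possible convention.

```python
def underscore_taxon_names(fasta_text):
    """Replace spaces in fasta header lines with underscores."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            line = ">" + line[1:].strip().replace(" ", "_")
        out.append(line)
    return "\n".join(out) + "\n"


def drop_gapmode_line(nexus_text):
    """Remove the 'OPTIONS GAPMODE=MISSING;' line from #NEXUS text."""
    kept = [
        line for line in nexus_text.splitlines()
        if line.strip().upper() != "OPTIONS GAPMODE=MISSING;"
    ]
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    print(underscore_taxon_names(">Homo sapiens\nACGT\n"), end="")
    print(drop_gapmode_line("BEGIN DATA;\nOPTIONS GAPMODE=MISSING;\nMATRIX\n"), end="")
```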

Gap coding algorithm

FastGap scores gap or indel characters according to the simple method described by Simmons and Ochoterena (2000). The results from FastGap are no different from those obtained with GapCoder (Young and Healy 2003) or the online Gap Recoder program (except the latter will place gap characters in a different order). The main point of FastGap is that it supplies a Windows interface for concatenation of several independent sequence alignment files while simultaneously performing gap coding. Furthermore FastGap reads unmodified BioEdit fasta files. These two features make FastGap very efficient for #NEXUS file assembly by users of BioEdit and PAUP/MrBayes on a PC platform, irrespective of whether gaps are coded or not!

Under the simple method gaps are considered homologous if and only if they start and end in the same position in the sequence alignment. The computational approach to gap coding applied in FastGap is initiated by a search for the first gap character in the data matrix. The search starts in position 1 of sequence 1 and proceeds down across sequences before moving to the next position. When a gap is located its starting and ending positions are recorded and written to a list of unique gaps maintained in the program memory. Then a decision on how to code the gap is made for each sequence in the matrix. The rules governing this procedure are (see figure):
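The search order can be illustrated with a short Python sketch. It is not the FastGap source code: the function name find_unique_gaps is hypothetical, the matrix is assumed to be a list of equal-length strings, and the sketch skips leading and trailing gaps and gaps bordered by missing data (‘?’, ‘N’, ‘n’), which this page says are not coded.

```python
MISSING = set("?Nn")


def find_unique_gaps(matrix):
    """Return unique interior gaps as (start, end), 1-based and inclusive.

    The scan goes position by position and, within each position, down
    the sequences, so gaps are listed in order of first occurrence.
    """
    n_pos = len(matrix[0])
    gaps = []
    for col in range(n_pos):
        for seq in matrix:
            # A gap starts here only if the previous character is not a gap.
            if seq[col] != "-" or (col > 0 and seq[col - 1] == "-"):
                continue
            end = col
            while end + 1 < n_pos and seq[end + 1] == "-":
                end += 1
            # Skip leading and trailing gaps.
            if col == 0 or end == n_pos - 1:
                continue
            # Skip gaps bordered by missing data on either side.
            if seq[col - 1] in MISSING or seq[end + 1] in MISSING:
                continue
            gap = (col + 1, end + 1)
            if gap not in gaps:
                gaps.append(gap)
    return gaps


if __name__ == "__main__":
    matrix = [
        "GAAC------ATGC",
        "GAAC------TTGC",
        "GAAC---CCTTTGC",
        "GAA---------GC",
    ]
    print(find_unique_gaps(matrix))  # [(4, 12), (5, 10), (5, 7)]
```

For the four sequences of Fig. 1 the gap at positions 4-12 is found first, because position 4 is scanned before position 5.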

1) If a sequence has a gap starting and ending at exactly the same positions as the gap being coded then the gap is scored as present (default value A)

2) If a sequence has a nucleotide character at either the starting OR the ending position of the gap being coded then the gap is coded as absent (default value C)

3) If a sequence has a longer gap that spans the gap being coded, i.e. one that starts at the same or an earlier position AND ends at the same or a later position without matching it exactly, then the gap is scored as unknown (default value -)

4) If a sequence has a gap that starts and ends at the same positions as the gap being coded but is bordered by missing data (‘?’, ‘N’, ‘n’) then the gap is scored as unknown. This is also the case if the sequence has missing data in the gap positions.
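The four rules can be sketched as a single Python function that scores one sequence for one gap. This is an illustration under stated assumptions rather than the FastGap implementation: the name score_gap is hypothetical, ‘?’, ‘N’ and ‘n’ are treated as missing data as described for version 1.2, and a nucleotide anywhere inside the gap span (not only at its first or last position) is taken to mean that the gap is absent, a case the rules above do not spell out.

```python
MISSING = set("?Nn")


def score_gap(seq, start, end, present="A", absent="C", unknown="-"):
    """Score one sequence for the gap at start..end (1-based, inclusive)."""
    s, e = start - 1, end - 1  # convert to 0-based indices
    region = seq[s:e + 1]
    left = seq[s - 1] if s > 0 else None
    right = seq[e + 1] if e + 1 < len(seq) else None

    # Rule 2, extended by assumption to nucleotides anywhere in the span.
    if any(ch != "-" and ch not in MISSING for ch in region):
        return absent
    # Rule 4: missing data inside the gap positions.
    if any(ch in MISSING for ch in region):
        return unknown
    # From here on the whole span consists of gap characters.
    # Rule 4: the gap is bordered by missing data.
    if left in MISSING or right in MISSING:
        return unknown
    # Rule 3: the gap in this sequence extends beyond the span.
    if left == "-" or right == "-":
        return unknown
    # Rule 1: the gap starts and ends exactly at the same positions.
    return present


if __name__ == "__main__":
    matrix = [
        "GAAC------ATGC",
        "GAAC------TTGC",
        "GAAC---CCTTTGC",
        "GAA---------GC",
    ]
    gaps = [(4, 12), (5, 10), (5, 7)]
    for seq in matrix:
        codes = "".join(score_gap(seq, s, e, "1", "0") for s, e in gaps)
        print(seq, codes)
```

Called with "1" and "0" as the present and absent values, the sketch gives the codes 01-, 01-, 001 and 1-- for the four sequences of Fig. 1.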

Having coded the first gap the search proceeds for other unique gaps until the last position of the matrix is reached. Ambiguity codes (incl. N) are handled identically to nucleotide characters. Leading and trailing gap characters are not coded as gaps. Likewise gaps with missing data on either side are not coded since their length and position cannot be defined exactly. The source code that handles file concatenation and gap coding in FastGap can be downloaded here. Note that this cannot be compiled directly – you will need to supply code for your own console (or GUI) interface to interact with the program.

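Fig. 1. Example of gap coding. Three unique gaps (pink, blue and green in the original figure) are identified and coded. The codes follow each sequence, with 1 = present, 0 = absent and - = unknown:

GAAC------ATGC 01-
GAAC------TTGC 01-
GAAC---CCTTTGC 001
GAA---------GC 1--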