RemoveGaps

A Java program to remove gaps in multiple DNA sequence alignments

© Copyright February 2002, Dirk Husmeier, Biomathematics and Statistics Scotland (BioSS)


Get the file

Just click here.

Compile the program

First, make sure that a Java compiler (JDK) is properly installed. Then give the command:

javac RemoveGaps.java

To run the program, type:

java RemoveGaps


What does the program do?

RemoveGaps reads an alignment in BAMBE format, which must be in a file called dna.dat in the current directory. It discards every site (alignment column) in which at least one sequence contains a symbol other than a nucleotide, such as a gap character. The remaining columns are written to the file dna_no_gaps.dat, which is also in BAMBE format. The program also writes a file called gapSites.out, which lists the sites of the original alignment that were discarded.
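
The column filter described above can be illustrated with a short sketch. This is not the source of RemoveGaps.java; the class name GapColumnSketch, its methods, and the rule that only A, C, G and T (in either case) count as nucleotides are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not the original RemoveGaps implementation.
class GapColumnSketch {

    /** Ungapped sequences plus the 1-based numbers of the discarded sites. */
    static final class Result {
        final List<String> sequences = new ArrayList<>();
        final List<Integer> discardedSites = new ArrayList<>();
    }

    // Assumption: only A, C, G and T (upper or lower case) are nucleotides.
    static boolean isNucleotide(char c) {
        return "ACGTacgt".indexOf(c) >= 0;
    }

    /** Keeps only the columns in which every sequence has a nucleotide. */
    static Result removeGapColumns(List<String> aligned) {
        Result result = new Result();
        if (aligned.isEmpty()) {
            return result;
        }
        int length = aligned.get(0).length();
        List<StringBuilder> kept = new ArrayList<>();
        for (String s : aligned) {
            if (s.length() != length) {
                throw new IllegalArgumentException("Sequences differ in length");
            }
            kept.add(new StringBuilder());
        }
        for (int col = 0; col < length; col++) {
            boolean allNucleotides = true;
            for (String s : aligned) {
                if (!isNucleotide(s.charAt(col))) {
                    allNucleotides = false;
                    break;
                }
            }
            if (allNucleotides) {
                // Copy this column into every output sequence.
                for (int i = 0; i < aligned.size(); i++) {
                    kept.get(i).append(aligned.get(i).charAt(col));
                }
            } else {
                // Record the site using 1-based numbering.
                result.discardedSites.add(col + 1);
            }
        }
        for (StringBuilder sb : kept) {
            result.sequences.add(sb.toString());
        }
        return result;
    }
}
```

A column is dropped as soon as a single sequence has a non-nucleotide symbol there, so all output sequences keep the same length.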

Example

Assume you are given the following file (in PHYLIP format):

4 50
strain1 TGGGGCAAAA -TTTAGTCAA TTTTCGTAAC TTTTTTATTT TGAAAAATTC
strain2 TGGGGCAAAA TTTTAGTCAA TTTTCGTAAC TTTTTTATTT TGAAAAATTT
strain3 TGGGGCAAAA TTTTAGTCAA TTCTCATAAC TTTTTTATTT TGAAAAATTT
strain4 TGGGGCAAAA -TTTAGTCAA TTTTCATAAC TTTTTTATTT TGAAAAAT-C

First, transform this file into BAMBE format, and call this new file dna.dat:

4 50
strain1
TGGGGCAAAA-TTTAGTCAATTTTCGTAACTTTTTTATTTTGAAAAATTC
strain2
TGGGGCAAAATTTTAGTCAATTTTCGTAACTTTTTTATTTTGAAAAATTT
strain3
TGGGGCAAAATTTTAGTCAATTCTCATAACTTTTTTATTTTGAAAAATTT
strain4
TGGGGCAAAA-TTTAGTCAATTTTCATAACTTTTTTATTTTGAAAAAT-C
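
This transformation can be done by hand or with a small helper. The following is a minimal sketch, assuming the PHYLIP file has the simple layout shown above (one line per strain, a name without spaces, then blocks separated by white space) and that the target is the name-per-line layout shown here. The class name PhylipToBambeSketch and the command-line arguments are hypothetical and not part of RemoveGaps.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; handles just the one-line-per-strain layout.
class PhylipToBambeSketch {

    /** Converts PHYLIP-style lines into header, then name and sequence on separate lines. */
    static List<String> convert(List<String> phylipLines) {
        List<String> out = new ArrayList<>();
        boolean headerSeen = false;
        for (String line : phylipLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            if (!headerSeen) {
                out.add(trimmed); // "number of strains, number of sites" line is kept as is
                headerSeen = true;
                continue;
            }
            String[] fields = trimmed.split("\\s+");
            out.add(fields[0]); // strain name on its own line
            StringBuilder sequence = new StringBuilder();
            for (int i = 1; i < fields.length; i++) {
                sequence.append(fields[i]); // join the blocks of ten
            }
            out.add(sequence.toString());
        }
        return out;
    }

    // Hypothetical usage: java PhylipToBambeSketch input.phy dna.dat
    public static void main(String[] args) throws IOException {
        List<String> input = Files.readAllLines(Paths.get(args[0]));
        Files.write(Paths.get(args[1]), convert(input));
    }
}
```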

Now give the command:

java RemoveGaps

This gives you two output files. The file dna_no_gaps.dat contains the ungapped DNA sequence alignment (in BAMBE format):

4 48
strain1
TGGGGCAAAATTTAGTCAATTTTCGTAACTTTTTTATTTTGAAAAATC
strain2
TGGGGCAAAATTTAGTCAATTTTCGTAACTTTTTTATTTTGAAAAATT
strain3
TGGGGCAAAATTTAGTCAATTCTCATAACTTTTTTATTTTGAAAAATT
strain4
TGGGGCAAAATTTAGTCAATTTTCATAACTTTTTTATTTTGAAAAATC

On transforming this into PHYLIP format, you get:

4 48
strain1 TGGGGCAAAA TTTAGTCAAT TTTCGTAACT TTTTTATTTT GAAAAATC
strain2 TGGGGCAAAA TTTAGTCAAT TTTCGTAACT TTTTTATTTT GAAAAATT
strain3 TGGGGCAAAA TTTAGTCAAT TCTCATAACT TTTTTATTTT GAAAAATT
strain4 TGGGGCAAAA TTTAGTCAAT TTTCATAACT TTTTTATTTT GAAAAATC

Note that the second number in the first line, which gives the number of nucleotides per sequence, has changed from 50 to 48 because two columns containing gaps have been removed. The numbers of the discarded columns, counted from 1 in the original alignment, are written to the file gapSites.out:

11
49
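
Because discarded columns shift all later positions, a site in the ungapped alignment generally has a different number than in the original. The sketch below shows one way to recover the original numbering from the discarded sites. It assumes gapSites.out holds one 1-based site number per line, as in the example; the class name SiteMapSketch and the command-line argument are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; not part of RemoveGaps.
class SiteMapSketch {

    /** For each ungapped site, in order, returns its 1-based number in the original alignment. */
    static List<Integer> originalSites(int originalLength, List<Integer> discarded) {
        Set<Integer> removed = new HashSet<>(discarded);
        List<Integer> kept = new ArrayList<>();
        for (int site = 1; site <= originalLength; site++) {
            if (!removed.contains(site)) {
                kept.add(site);
            }
        }
        return kept;
    }

    // Hypothetical usage: java SiteMapSketch 50
    // (the argument is the number of sites in the original alignment)
    public static void main(String[] args) throws IOException {
        int originalLength = Integer.parseInt(args[0]);
        List<Integer> discarded = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get("gapSites.out"))) {
            if (!line.trim().isEmpty()) {
                discarded.add(Integer.parseInt(line.trim()));
            }
        }
        List<Integer> map = originalSites(originalLength, discarded);
        for (int i = 0; i < map.size(); i++) {
            System.out.println((i + 1) + " " + map.get(i));
        }
    }
}
```

For the example above, site 11 of the ungapped alignment corresponds to site 12 of the original, and site 48 corresponds to site 50.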


Last modified: February 2002