Clean-up FASTA Header with Vim

The Raw FASTA file

>gi|937372680|gb|KT805061.1| Steiropteris leprieurii voucher A. Salino 14500 (BHCB) tRNA-Leu (trnL) gene and trnL-trnF intergenic spacer, partial sequence; chloroplast

Cleaned up FASTA file


Step by Step Directions

1. Open terminal and navigate to your home directory (cd ~/)
2. Check to see if .vimrc exists (ls -l .vimrc). If it doesn’t move on to step 3, otherwise skip to step 4
3. Create .vimrc (touch .vimrc)
4. Open .vimrc (vim ~/.vimrc)
5. Insert following code in the file (Get into insert mode by pressing ‘i’). Copy pasting directly ‘as-is’ from here is recommended!

function! CleanFasta()
nmap :call CleanFasta()

6. Save and close .vimrc (:wq)
7. Open Fasta file in vim (vim my_fasta_file.fas)
8. Press escape to ensure being in command mode, as opposed to insert mode
9. Press key and watch magic happen

Estimating Pi (nucleotide diversity) for a large number of populations

If you want to estimate diversity statistics such as ‘pi’ for a large number of populations in a contiguous data set, this tutorial may help you.

What we have:
– .vcf format data file
– .pop file (unique names of pops, one per line)
– .map file (individual to population mapping file — 2 columns)

If you want to use vcftools, your first thought might be to subset the data into multiple vcfs, one per population. But this is entirely unnecessary. Let’s see how we can combine different flags to achieve the same result. This makes use of simple bash scripting and vcftools.

Subset popfile for indv pops

cat plants.pop | while read line;
 grep "$line" > $line.pop

If this worked, you now should have one .pop file per population containing mappings for individuals in that population.

Estimate ‘pi’ diversity stat

for p in *.pop
 vcftools --vcf input.vcf --keep $p --site-pi --out $p

This will subset the the input vcf for a given population on the fly and estimate pi statistic.

Fast Estimation of Per Locus Pairwise FST (W&C)

Many tools are available for obtaining W&C FST estimates for pairs of populations on a locus by locus basis. My preferred method involved creating a genind object with Adegenet package, using seploc to separate loci and then running pairwise.fst() function on each locus to obtain FST matrices.

The adegenet pairwise.fst remains a great function for doing a few estimations. But if you are dealing with simulation data with hundreds to thousands of input files, the computational burden becomes prohibitive.

In these cases, the diffCalc() function from Kevin Keenan’s diveRsity package is very useful. It is fast, and is scalable due to it’s ability to parallelize jobs. And it does so automatically (assumes you have relevant packages installed though e.g. doParallel and parallel). The only caveat with this package is that it requires genepop formatted files. This was the first bottleneck for me.

Our simulation data is in Structure formatted files. We will use some bash scripting to loop PGDSpider over structure files one by one and output corresponding genepop files. First fire up the GUI version of PGDSpider and create a SPID settings file for conversion from STRUCTURE to GENEPOP. Let’s name it str2gpop.spid.

## Bash Script to loop PGDSpider
## Input files have .str extension, output files .gen. -Xmx5g indicates that java is allowed 5Gb memory.

for f in ./*.str;
java -Xmx5g -jar /pgd/PGDSpider2-cli.jar -inputfile "$f" -outputfile "${f%.str}.gen" -inputformat STRUCTURE -outputformat GENEPOP -spid str2gpop.spid

If you have a large number of files to process, it is recommended to run this inside a screen session. Look up ‘unix screen’ if you don’t know what it is.

Next, install the diveRsity R package including it’s dependencies. Let’s assume we have now have 5 genepop files in the current directory.

ls -lh ~/fst/genepop/

Fire up R

gen <- list.files(pattern='.gen')
fst <- lapply(gen, function(x) diffCalc(infile=x, outfile=x, fst=TRUE, pairwise=TRUE))

At this point, we should have five separate directories (one per input file) where results are stored:

ls -l iter0*/

Now, let’s rename the directories to a convenient format

find . -name '*diffCalc*' -type d -exec bash -c 'mv "${0}" "${0//-\[diffCalc\]/_fst"' {} \;

Fst matrices are all stored in a file named ‘pw_locus.txt’ Let’s rename these files after the directories they reside in.

for subdir in */*; do mv $subdir/pw_locus.txt $subdir/$subdir.fst; done;

You will find several measures of population differentiation (including Gst, D and W&C Fst) in the “pw_locus.txt” files. We only need the W&C FST. But instead of extracting those matrices, we would delete everything else. For this we can make use of the fact that diffCalc() stored these matrices of interest always starting with line#4027. Before we go ahead delete everything else, let’s insert a column header line containing names of pairwise comparisons.

In R do this:

a <- t(combn(1:30, 2))
write.table(a, 'pw.txt', row.names=F, col.names=F, quote=F, sep='\t')

Open up the file in vi, prefix every number with ‘pop’ and join every row by column by putting an underscore in between. This is how the names will look in pw.txt

locus p1_p2 p1_p3 .....p29_p30

Now insert contents of ‘pw.txt’ below 4027th row in each of the fst files.
find */* -name '*.fst' -exec sed -i '4027r./pw.txt' {} \;

Finally, delete lines 1 through 4027 in each of the files
sed -i.bak -e '1,4027d' */*.fst

At this point, your FST matrices should be ready.

Preparing BGC data files for Parental Pops: Allele Counts

For BGC website, see: Bayesian Genomic Clines

The genotype data for parental and admixed populations for BGC is in the form of counts of alleles per locus. A separate file for each population is needed. First convert your data to structure format and read into R as an adegenet object. Then convert genind to genpop to obtain allele counts.


mygenind <- import2genind('input.str', onerowperind=F, n.ind=10, n.loc=20, row.marknames=1, col.lab=1, col.pop=2, ask=F)

mygenpop <- genind2genpop(mygenind)

mygenpop@tab contains a table of allele counts per locus and per population. Now transpose this table, save it, then read it back (probably not necessary, but I hate dealing with atomic vector errors).

write.table(mygenpop@tab, 'AllCountTable.txt', quote=F, sep='\t')
ct <- read.table('AllCountTable.txt', header=T)

   Allele pop1 pop2 
1 L0001.1    7    3 
2 L0001.2    3    1 
3 L0002.1   10    4 
4 L0002.2    0    0 

We would like to retain the locus names from the first column (ct$Allele), which R probably does not recognize as a character vector yet. Let’s get that out of the way first.
ctall <- as.character(ct$Allele)

Split the locus names from the composite loc+allele names.
ctall_split <- lapply(strsplit(ctall, '.', fixed=TRUE), '[[', 1)

[1] "L0001"

[1] "L0001"

[1] "L0002"

[1] "L0002"

We will need to get rid of duplicate entries. We are interested in only every other entry.

locnames <- ctall_split[c(TRUE, FALSE),]
[1] "L0001"

[1] "L0002"

Now select only odd rows (containing allele 1 count at each locus in each pop), then even rows. This assumes that you do not have any monomorphic sites in your data. If you do, this and the previous steps will generate errors. Make sure you have exactly twice as many rows as you have loci. Should you find odd number of rows, that’s a clear indication that you have at least one locus with only one allele in your data. You will need to recreate structure file to get rid of such loci before proceeding again.

ct_odd <- ct[seq(1, length(ct$Allele), 2), ]
ct_evn <- ct[seq(2, length(ct$Allele), 2), ]

At this point, we have all the data we need i.e. locus names and allele counts in each population. I will demonstrate putting all this information together for one population.

1. Create an vector of length equal to number of loci and fill it with any symbol.
filler <- rep('#', length(locnames))

2. Create a new data frame by cbinding all components together.
df <- data.frame(locnames, filler, ct_odd$pop1, ct_evn$pop1)

3. head(df)
locnames filler pop1 pop1
1 loc01 # 7 3
2 loc02 # 8 7
3 loc03 # 3 5
4 loc04 # 4 2
5 loc05 # 2 6

4. Save this table
write.table(df, 'pop1_allct_bgc.txt', row.names=F, col.names=F, quote=F, sep='\t')

5. Finally, open the saved table in vi and perform this final operation:
vim pop1_allct_bgc.txt
:%s/\t/ /g

You can do ^M by first pressing Ctrl-V, then quickly hitting ENTER.

That’s it. Check the file to make sure everything looks ok.
7 3
8 7
3 5
4 2
2 6

At this point, you should be done coding the parental population files. Remember, only two parental populations are allowed. Thus, if you have data from two species with multiple pops each, you will need to collapse individual pops into a composite population for each species.

In the next blog post, I will summarize creating allele count data files for admixed populations, the format for which is somewhat different. Stay tuned.

Nextgen data: How to keep monomorphic loci out

Working on population genomic questions with nextgen (GBS to be specific) data, I find my unix server full of different iterations and sizes of the data set (in vcf format) based on various filtering strategies etc. I have always religiously applied an arbitrary MAF filter (0.001) as the last step before converting a vcf file to the format needed for popgen analysis (structure, bayescan etc.).

When my data sets were large, this approach worked fine. So I didn’t think much of it. But some of the hypotheses I am recently testing required exclusion of a large number of individuals. Suddenly, even after applying the MAF filter, monomorphic loci started showing up in my downstream analysis files.

If only I had given the MAF filter a second thought. Even if you do not want to get rid of any rare alleles, the MAF threshold needs to be calculated fresh every time remove/add individuals.

MAF = 1/n*2

Say, if you have 156 individuals: 1/(156*2) = 0.003205128

This number is very different from an arbitrary 0.001. As soon as I applied MAF=0.0032, vcftools dropped 2678 loci (more than 50% of the original data set).

Lesson learned.

vcftools --vcf --input.vcf --min-alleles 2 --max-alleles 2 --maf 0.0032 --recode

Hierarchical and Per Locus AMOVA using R (adegenet and poppr)

Several R packages currently offer AMOVA (analysis of molecular variance) functionality. This post is a work in progress and I will update it as I make progress. My goal here was to obtain locus wise estimates of phi-statistic at two levels of hierarchy:
– Among populations
– Among Regions comprising of populations

Starting data in Structure format:
156 individuals, 14 populations, 2 regions, 5300 SNP sites

Hierarchy levels in text format:
ind pop reg
1 1 1 1
2 2 1 1
3 3 2 1
4 4 2 1
5 5 3 2
6 6 3 2

Let’s load the packages we need:

Import data into genind object:
mygenind <- import2genind('mydata.str', onerowperind=F, n.ind=156, n.loc=5300, col.lab=1, col.pop=2, ask=FALSE)

Convert to genclone object:
mygenclone <- as.genclone(mygenind)

Set hierarchy levels for the genclone object:
myhier <- read.table('myhier.txt', header=T)
sethierarchy(mygenclone) <- myhier

Separate loci using ‘seploc’ function:
mygenclone_seploc <- seploc(mygenclone)

Finally, let’s do AMOVA at two levels: regions and populations
amova_res <- lapply(mygenclone_seploc, poppr.amova, hier=~reg/pop, within=TRUE)

The last command currently does not work. To be updated soon with more info.

If you do not need to do hierarchical amova and simply need phi stats among all populations in your data sets, it is very easy to do. Package mmod will estimate Meirman’s global phi-stat.

First, load package mmod:

Then do amova directly on the genind object:
amova_phistat_Meirmans <- Phi_st_Meirmans(mygenind)

This will generate global and individual locus phi estimates.

Acknowledgements: Thanks to Zhian Kamvar, an author of poppr package for help with the above code.

Installing BGC (Bayesian Genomic Clines) on Ubuntu 14.04 Server

BGC source code is available from

It depends upon GSL (GNU Scientific Library) and HDF5 (  GSL is usually installed on unix like systems at “/usr/local/”.  If you have installed GSL at a nonstandard location, it would be necessary to tell BGC compiler of that location.  For now let’s assume that location as “/usr/include/gsl”

HDF packages you will need. Use ‘sudo apt-get install PACKAGE’

Now unpack the bgc tar archive you downloaded earlier.  Navigate inside ‘./bgcdist’ in your favorite terminal shell.  Then compile bgc as follows:

h5c++ -Wall -O2 -L/usr/include/gsl -I/usr/include/gsl -o bgc bgc_main.C bgc_func_readdata.C bgc_func_initialize.C bgc_func_mcmc.C bgc_func_write.C bgc_func_linkage.C bgc_func_ngs.C bgc_func_hdf5.C mvrandist.c -lgsl -lgslcblas

If you do not get any error messages, your current directory should now contain the ‘bgc’ binary. Execute it without any options to make sure it works.

crypticlineage@Wright:/popgen/bgcdist$ ./bgc
./bgc version 1.03 -- 03 April 2014
Usage: bgc -a p0infile -b p1infile -h admixinfile [options]
-a Infile with genetic data for parental population 0
-b Infile with genetic data for parental population 1
-h Infile with genetic data for admixed population(s)
-M Infile with genetic map, only used for ICARr

If you get an error message that the compiler can’t find mpi.h file, open /usr/include/H5public.h in your favorite text editor as a sudoer, and edit L63 to reflect the correct path to mpi.h. That should get rid of the error messages.