Fast Estimation of Per Locus Pairwise FST (W&C)

Many tools are available for obtaining W&C FST estimates for pairs of populations on a locus by locus basis. My preferred method involved creating a genind object with Adegenet package, using seploc to separate loci and then running pairwise.fst() function on each locus to obtain FST matrices.

The adegenet pairwise.fst remains a great function for doing a few estimations. But if you are dealing with simulation data with hundreds to thousands of input files, the computational burden becomes prohibitive.

In these cases, the diffCalc() function from Kevin Keenan’s diveRsity package is very useful. It is fast, and is scalable due to it’s ability to parallelize jobs. And it does so automatically (assumes you have relevant packages installed though e.g. doParallel and parallel). The only caveat with this package is that it requires genepop formatted files. This was the first bottleneck for me.

Our simulation data is in Structure formatted files. We will use some bash scripting to loop PGDSpider over structure files one by one and output corresponding genepop files. First fire up the GUI version of PGDSpider and create a SPID settings file for conversion from STRUCTURE to GENEPOP. Let’s name it str2gpop.spid.

## Bash Script to loop PGDSpider
## Input files have .str extension, output files .gen. -Xmx5g indicates that java is allowed 5Gb memory.

for f in ./*.str;
java -Xmx5g -jar /pgd/PGDSpider2-cli.jar -inputfile "$f" -outputfile "${f%.str}.gen" -inputformat STRUCTURE -outputformat GENEPOP -spid str2gpop.spid

If you have a large number of files to process, it is recommended to run this inside a screen session. Look up ‘unix screen’ if you don’t know what it is.

Next, install the diveRsity R package including it’s dependencies. Let’s assume we have now have 5 genepop files in the current directory.

ls -lh ~/fst/genepop/

Fire up R

gen <- list.files(pattern='.gen')
fst <- lapply(gen, function(x) diffCalc(infile=x, outfile=x, fst=TRUE, pairwise=TRUE))

At this point, we should have five separate directories (one per input file) where results are stored:

ls -l iter0*/

Now, let’s rename the directories to a convenient format

find . -name '*diffCalc*' -type d -exec bash -c 'mv "${0}" "${0//-\[diffCalc\]/_fst"' {} \;

Fst matrices are all stored in a file named ‘pw_locus.txt’ Let’s rename these files after the directories they reside in.

for subdir in */*; do mv $subdir/pw_locus.txt $subdir/$subdir.fst; done;

You will find several measures of population differentiation (including Gst, D and W&C Fst) in the “pw_locus.txt” files. We only need the W&C FST. But instead of extracting those matrices, we would delete everything else. For this we can make use of the fact that diffCalc() stored these matrices of interest always starting with line#4027. Before we go ahead delete everything else, let’s insert a column header line containing names of pairwise comparisons.

In R do this:

a <- t(combn(1:30, 2))
write.table(a, 'pw.txt', row.names=F, col.names=F, quote=F, sep='\t')

Open up the file in vi, prefix every number with ‘pop’ and join every row by column by putting an underscore in between. This is how the names will look in pw.txt

locus p1_p2 p1_p3 .....p29_p30

Now insert contents of ‘pw.txt’ below 4027th row in each of the fst files.
find */* -name '*.fst' -exec sed -i '4027r./pw.txt' {} \;

Finally, delete lines 1 through 4027 in each of the files
sed -i.bak -e '1,4027d' */*.fst

At this point, your FST matrices should be ready.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s