Association between SNPs and climate variables (LFMM)

Formatting SNPs
Imputing missing SNP data
Test run with SNP BOPA2_12_30894 (in VrnH3 gene)
Analyzing a set of SNPs

This document describes how SNP genotypes from the Spanish Barley Core Collection (SBCC) were checked and processed for further use with LFMM in combination with climate data.

In the next steps we convert SNPs of SBCC barleys to fit the appropriate formats for downstream analyses. SNPs conserve sample order of climate data.

We shall write a SNPSFILE containing allele counts across populations/barleys, where each SNP is represented by two lines in the file, with the counts of allele 1 on the first line and the counts for allele 2 on the second, and son on. The counts of allele 1 and allele 2 are assumed to sum to the sample size typed at this SNP in this population (i.e. the total sample size excluding missing data). This means that PAV markers must be converted to SNP-like markers if they are to be used.

Formatting SNPs

We wrote a Perl script (SNP2LFMM.pl) to carry out these formatting tasks. For instance, a set of 9,920 Infinium and GBS markers in file 9920_SNPs_SBCC_50K.tsv, was converted as follows, accepting 10% missing data per position and accepting only biallelic loci with \(MAF \geq 0.05\) (n=6822):

./SNP2LFMM.pl raw/9920_SNPs_SBCC_50K.tsv SBCC_order.txt \
  SBCC_9K_LFMM.tsv 2> SBCC_9K_LFMM.log 
  
./SNP2LFMM.pl raw/9920_SNPs_SBCC_50K.tsv SBCC_order_complete_env.txt \
  SBCC_9K_LFMM.complete.tsv 2> SBCC_9K_LFMM.complete.log   

perl -F'\t' -ane '$r++;for(1 .. $#F){$m[$r][$_]=$F[$_-1]};$mx=$#F;END{for(1 .. $mx){for $t(1 .. $r){print"$m[$t][$_]\t"}print"\n"}}' SBCC_9K_LFMM.tsv > SBCC_9K_LFMM.tr.tsv

perl -F'\t' -ane '$r++;for(1 .. $#F){$m[$r][$_]=$F[$_-1]};$mx=$#F;END{for(1 .. $mx){for $t(1 .. $r){print"$m[$t][$_]\t"}print"\n"}}' SBCC_9K_LFMM.complete.tsv > SBCC_9K_LFMM.complete.tr.tsv

head SBCC_9K_LFMM.log 
echo ...
tail SBCC_9K_LFMM.log

# MAF=0.05 MAXMISSINGRATIO=0.1 MISSINGVALUE=9
# total samples in SBCC_order.txt: 135
# creating SBCC_9K_LFMM.annot.tsv
# skip sample SBCC038
# skip sample SBCC138
# total valid samples: 135
#   snpname  allele1 allele2 missing MAF
1   BOPA1_2511-533  A   G   0   0.052
2   BOPA1_3107-422  A   G   0   0.096
3   BOPA1_8670-388  C   G   0   0.104
...
6814    SCRI_RS_7407    C   T   6   0.287
6815    SCRI_RS_81903   A   G   0   0.119
6816    SCRI_RS_8250    A   C   0   0.052
6817    SCRI_RS_8401    C   T   0   0.415
6818    SCRI_RS_88466   G   T   1   0.157
6819    SCRI_RS_88710   A   G   0   0.096
6820    SCRI_RS_9327    C   T   0   0.281
6821    SCRI_RS_9584    G   T   0   0.385
6822    SCRI_RS_9618    C   T   3   0.091
# total valid markers=6822

The resulting SBCC_9K_LFMM.tr.tsv file is the SNPSFILE required by LFMM. Note that file SBCC_9K_LFMM.annot.tsv with matching fullnames of SNPs (columns) is produced alongside.

Imputing missing SNP data

We realised that LFMM does not cope well with missing data, as in our tests markers with missing data were two orders or magnitude more likely to be associated to climate vars. Therefore we used instead a dataset with imputed missing calls, which is contained in file SBCC_9K_SNPs.imputed.tsv. The following code filters out SNPs with with \(MAF \geq 0.05\):

./SNPimp2LFMM.pl SBCC_9K_SNPs.imputed.tsv SBCC_order.txt \
  SBCC_9K_LFMM.imputed.tsv 2> SBCC_9K_LFMM.imputed.log

./SNPimp2LFMM.pl SBCC_9K_SNPs.imputed.tsv SBCC_order_complete_env.txt \
  SBCC_9K_LFMM.complete.imputed.tsv 2> SBCC_9K_LFMM.complete.imputed.log

perl -F'\t' -ane '$r++;for(1 .. $#F){$m[$r][$_]=$F[$_-1]};$mx=$#F;END{for(1 .. $mx){for $t(1 .. $r){print"$m[$t][$_]\t"}print"\n"}}' SBCC_9K_LFMM.imputed.tsv > SBCC_9K_LFMM.imputed.tr.tsv

perl -F'\t' -ane '$r++;for(1 .. $#F){$m[$r][$_]=$F[$_-1]};$mx=$#F;END{for(1 .. $mx){for $t(1 .. $r){print"$m[$t][$_]\t"}print"\n"}}' SBCC_9K_LFMM.complete.imputed.tsv > SBCC_9K_LFMM.complete.imputed.tr.tsv

head SBCC_9K_LFMM.imputed.log 
echo ...
tail SBCC_9K_LFMM.imputed.log

# MAF=0.05
# total samples in SBCC_order.txt: 135
# creating SBCC_9K_LFMM.imputed.annot.tsv
# total valid samples: 135
#   snpname  allele1 allele2 other MAF
1   B3258773|F|0    0   1   0   0.059
2   3256790|F|0 0   1   0   0.311
3   BOPA2_12_10420  0   1   0   0.467
4   BOPA1_3107-422  0   1   0   0.096
5   SCRI_RS_204276  0   1   0   0.289
...
6120    BOPA2_12_30827  0   1   0   0.370
6121    BOPA2_12_30826  0   1   0   0.370
6122    3256939|F|0 0   1   0   0.119
6123    BOPA1_1437-687  0   1   0   0.363
6124    BOPA2_12_10378  0   1   0   0.089
6125    BOPA2_12_31019  0   1   0   0.089
6126    3260398|F|0 0   1   0   0.111
6127    SCRI_RS_167617  0   1   0   0.200
6128    SCRI_RS_123211  0   1   0   0.407
# total valid markers=6128

The resulting SBCC_9K_LFMM.imputed.tr.tsv file is the SNPSFILE which will be used by LFMM (n=6128). Note that file SBCC_9K_LFMM.imputed.annot.tsv with matching fullnames of SNPs (columns) is produced alongside.

Test run with SNP BOPA2_12_30894 (in VrnH3 gene)

We now run a test with SNP BOPA2_12_30894, found within an intron of gene VrnH3, which we expected to be correlated with latitude among SBCC landraces.

First, we will extract the relevant SNP with the previous script, parsing file Vrn3.txt:

./SNP2LFMM.pl raw/9920_SNPs_SBCC_50K.tsv SBCC_order.txt Vrn3/Vrn3b.tsv \
  raw/Vrn3.txt 2> Vrn3/Vrn3b.log 

perl -F'\t' -ane '$r++;for(1 .. $#F){$m[$r][$_]=$F[$_-1]};$mx=$#F;END{for(1 .. $mx){for $t(1 .. $r){print"$m[$t][$_]\t"}print"\n"}}' Vrn3/Vrn3b.tsv > Vrn3/Vrn3b.tr.tsv

Now we’ll invoke LFMM with this SNP and the climate data, which was produced earlier:

time ./soft/LFMM_CL_v1.4/bin/LFMM -x Vrn3/Vrn3b.tr.tsv -v SBCC_environfile.tr.tsv \
  -K 4 -p 5 -i 50000 -b 25000 -s 12345 -m -o Vrn3/Vrn3b_LFMM &> Vrn3/Vrn3b_LFMM.SBCC_environfile.log
rm -f Vrn3/Vrn3b_LFMM_s*.4.dic
cat Vrn3/Vrn3b_LFMM_s*.4.zscore

Analyzing a set of SNPs

The LFMM documentation recommends using a large number of cycles (ie 10K) with a burn-in period set to at least to one-half of the total number of cycles. We increased these numbers to 50K as the authors noticed that the results were sensitive to the run-length with datasets of a few hundreds individuals and a few thousands of loci, as it is our case. Note that the number of latent factors (-K) was initially set to 4 as this is the optimal number of subpopulations detected by STRUCTURE within the SBCC:


# first with K=4
./soft/LFMM_CL_v1.5/bin/LFMM -x SBCC_9K_LFMM.imputed.tr.tsv -v SBCC_environfile.tr.tsv \
  -K 4 -p 5 -i 50000 -b 25000 -s 12345 -o SBCC_9K_LFMM.rep1 \
  &> SBCC_9K_LFMMimp.SBCC_environfile.k4.rep1.log

./soft/LFMM_CL_v1.5/bin/LFMM -x SBCC_9K_LFMM.imputed.tr.tsv -v SBCC_environfile.tr.tsv \
  -K 4 -p 5 -i 50000 -b 25000 -s 23456 -o SBCC_9K_LFMM.rep2 \
  &> SBCC_9K_LFMMimp.SBCC_environfile.k4.rep2.log

./soft/LFMM_CL_v1.5/bin/LFMM -x SBCC_9K_LFMM.imputed.tr.tsv -v SBCC_environfile.tr.tsv \
  -K 4 -p 5 -i 50000 -b 25000 -s 34567 -o SBCC_9K_LFMM.rep3 \
  &> SBCC_9K_LFMMimp.SBCC_environfile.k4.rep3.log

./soft/LFMM_CL_v1.5/bin/LFMM -x SBCC_9K_LFMM.imputed.tr.tsv -v SBCC_environfile.tr.tsv \
  -K 4 -p 5 -i 50000 -b 25000 -s 45678 -o SBCC_9K_LFMM.rep4 \
  &> SBCC_9K_LFMMimp.SBCC_environfile.k4.rep4.log

./soft/LFMM_CL_v1.5/bin/LFMM -x SBCC_9K_LFMM.imputed.tr.tsv -v SBCC_environfile.tr.tsv \
  -K 4 -p 5 -i 50000 -b 25000 -s 56789 -o SBCC_9K_LFMM.rep5 \
  &> SBCC_9K_LFMMimp.SBCC_environfile.k4.rep5.log

# move results to LFMM folder  
mv SBCC_9K_LFMM.rep* SBCC_9K_LFMMimp.*rep?.log LFMM/K4
gzip LFMM/K4/*

# now with K=6
./soft/LFMM_CL_v1.5/bin/LFMM -x SBCC_9K_LFMM.imputed.tr.tsv -v SBCC_environfile.tr.tsv \
  -K 6 -p 5 -i 50000 -b 25000 -s 12345 -o SBCC_9K_LFMM.rep1 \
  > SBCC_9K_LFMMimp.SBCC_environfile.k6.rep1.log

./soft/LFMM_CL_v1.5/bin/LFMM -x SBCC_9K_LFMM.imputed.tr.tsv -v SBCC_environfile.tr.tsv \
  -K 6 -p 5 -i 50000 -b 25000 -s 23456 -o SBCC_9K_LFMM.rep2 \
  > SBCC_9K_LFMMimp.SBCC_environfile.k6.rep2.log

./soft/LFMM_CL_v1.5/bin/LFMM -x SBCC_9K_LFMM.imputed.tr.tsv -v SBCC_environfile.tr.tsv \
  -K 6 -p 5 -i 50000 -b 25000 -s 34567 -o SBCC_9K_LFMM.rep3 \
  > SBCC_9K_LFMMimp.SBCC_environfile.k6.rep3.log

./soft/LFMM_CL_v1.5/bin/LFMM -x SBCC_9K_LFMM.imputed.tr.tsv -v SBCC_environfile.tr.tsv \
  -K 6 -p 5 -i 50000 -b 25000 -s 45678 -o SBCC_9K_LFMM.rep4 \
  > SBCC_9K_LFMMimp.SBCC_environfile.k6.rep4.log

./soft/LFMM_CL_v1.5/bin/LFMM -x SBCC_9K_LFMM.imputed.tr.tsv -v SBCC_environfile.tr.tsv \
  -K 6 -p 5 -i 50000 -b 25000 -s 56789 -o SBCC_9K_LFMM.rep5 \
  > SBCC_9K_LFMMimp.SBCC_environfile.k6.rep5.log
  
mv SBCC_9K_LFMM.rep* SBCC_9K_LFMMimp.*rep?.log LFMM/K6
gzip LFMM/K6/*

# now with K=8
./soft/LFMM_CL_v1.5/bin/LFMM -x SBCC_9K_LFMM.imputed.tr.tsv -v SBCC_environfile.tr.tsv \
  -K 8 -p 5 -i 50000 -b 25000 -s 12345 -o SBCC_9K_LFMM.rep1 \
  > SBCC_9K_LFMMimp.SBCC_environfile.k8.rep1.log

./soft/LFMM_CL_v1.5/bin/LFMM -x SBCC_9K_LFMM.imputed.tr.tsv -v SBCC_environfile.tr.tsv \
  -K 8 -p 5 -i 50000 -b 25000 -s 23456 -o SBCC_9K_LFMM.rep2 \
  > SBCC_9K_LFMMimp.SBCC_environfile.k8.rep2.log

./soft/LFMM_CL_v1.5/bin/LFMM -x SBCC_9K_LFMM.imputed.tr.tsv -v SBCC_environfile.tr.tsv \
  -K 8 -p 5 -i 50000 -b 25000 -s 34567 -o SBCC_9K_LFMM.rep3 \
  > SBCC_9K_LFMMimp.SBCC_environfile.k8.rep3.log

./soft/LFMM_CL_v1.5/bin/LFMM -x SBCC_9K_LFMM.imputed.tr.tsv -v SBCC_environfile.tr.tsv \
  -K 8 -p 5 -i 50000 -b 25000 -s 45678 -o SBCC_9K_LFMM.rep4 \
  > SBCC_9K_LFMMimp.SBCC_environfile.k8.rep4.log

./soft/LFMM_CL_v1.5/bin/LFMM -x SBCC_9K_LFMM.imputed.tr.tsv -v SBCC_environfile.tr.tsv \
  -K 8 -p 5 -i 50000 -b 25000 -s 56789 -o SBCC_9K_LFMM.rep5 \
  > SBCC_9K_LFMMimp.SBCC_environfile.k8.rep5.log
 
mv SBCC_9K_LFMM.rep* SBCC_9K_LFMMimp.*rep?.log LFMM/K8
gzip LFMM/K8/*

We should now combine the correlation z-scores obtained from those 5 independent runs, using the Fisher-Stouffer approach recommended by the authors:

library(qqman)
library(pracma)

# set type of map
maptype = "cM"
#maptype = "bp"

# set barley chr names
chrnames = c("1H","2H","3H","4H","5H","6H","7H")

# get names of env vars
env_var_names = read.table(file="SBCC_environfile_order.txt")

# get SNP names
snp_names = read.table(file="SBCC_9K_LFMM.imputed.annot.tsv",sep="\t",
                       col.names = c("index","SNPidentifier"))

# read genetic/bp positions of markers
if(maptype == "bp"){
  genmap = read.table(file="raw/9920_SNPs_SBCC_bp_map2017.curated.tsv",header=T)
} else{
  genmap = read.table(file="raw/9920_SNPs_SBCC_cM_map2017.curated.tsv",header=T)
}

# read neighbor candidate genes
heading_genes = read.table(file="SBCC_9K_SNPs.annotated_genes.tsv",header=F)

# set rootname of LFMM outfiles
rootname = "SBCC_9K_LFMM"

# produce multi-panel histograms to control P-values with K=4
png("plots_LFMM/Pvalue_histogram_K4.png",width=600,height=800)
par(mfrow=c(7,3),mar=c(2,1.75,1.5,1))

for (v in 1:nrow(env_var_names)){
  # combined P-values from replicates
  z.table = NULL
  combined.p.values = NULL
  for (i in 1:5){
    file.name = paste("LFMM/K4/",rootname,".rep",i,"_s",v,".4.zscore.gz",sep="")
    z.table = cbind(z.table, read.table(file.name)[,1])
  }
  z.score = apply(z.table, MARGIN = 1, median) #combines z-scores
  
  # estimate genomic inflaction 
  # http://membres-timc.imag.fr/Olivier.Francois/lfmm/files/note.pdf
  # https://www.ncbi.nlm.nih.gov/pubmed/11315092
  lambda = median(z.score^2)/0.456 
  
  # get P-values for those Z-scores
  combined.p.values = pchisq(z.score^2/lambda, df = 1, lower = F) 

  # plot the distribution of P-values
  hist(combined.p.values, col = 3, main=paste(env_var_names[v,1],
        " (lambda=",sprintf("%1.2f)",lambda)))
}  
dev.off()

png 
  2

# now with K=6
png("plots_LFMM/Pvalue_histogram_K6.png",width=600,height=800)
par(mfrow=c(7,3),mar=c(2,1.75,1.5,1))

for (v in 1:nrow(env_var_names)){
  # combined P-values from replicates
  z.table = NULL
  combined.p.values = NULL
  for (i in 1:5){
    file.name = paste("LFMM/K6/",rootname,".rep",i,"_s",v,".6.zscore.gz",sep="")
    z.table = cbind(z.table, read.table(file.name)[,1])
  }
  z.score = apply(z.table, MARGIN = 1, median) #combines z-scores
  lambda = median(z.score^2)/0.456 
  
  # get P-values for those Z-scores
  combined.p.values = pchisq(z.score^2/lambda, df = 1, lower = F) 

  # plot the distribution of P-values
  hist(combined.p.values, col = 3, main=paste(env_var_names[v,1],
        " (lambda=",sprintf("%1.2f)",lambda)))
}  
dev.off()

png 
  2

# now with K=8
png("plots_LFMM/Pvalue_histogram_K8.png",width=600,height=800)
par(mfrow=c(7,3),mar=c(2,1.75,1.5,1))

for (v in 1:nrow(env_var_names)){
  # combined P-values from replicates
  z.table = NULL
  combined.p.values = NULL
  for (i in 1:5){
    file.name = paste("LFMM/K8/",rootname,".rep",i,"_s",v,".8.zscore.gz",sep="")
    z.table = cbind(z.table, read.table(file.name)[,1])
  }
  z.score = apply(z.table, MARGIN = 1, median) #combines z-scores
  lambda = median(z.score^2)/0.456 
  
  # get P-values for those Z-scores
  combined.p.values = pchisq(z.score^2/lambda, df = 1, lower = F) 

  # plot the distribution of P-values
  hist(combined.p.values, col = 3, main=paste(env_var_names[v,1],
        " (lambda=",sprintf("%1.2f)",lambda)))
}  
dev.off()

png 
  2

# produce Manhattan plots and export -log10(P-values) for K=6
# as in lambda values seem better than K=4 and K=8
FDR = 0.01
outTSV1 = paste("LFMM/SBCC_9K_LFMM.FDR",FDR,".tsv", sep="")
unlink(outTSV1)

for (v in 1:nrow(env_var_names)){
  z.table = NULL
  combined.p.values = NULL
  outTSV2 = paste("LFMM/SBCC_9K_LFMM.",env_var_names[v,1],".",maptype,".tsv", sep="")
  
  for (i in 1:5){
    file.name = paste("LFMM/K6/",rootname,".rep",i,"_s",v,".6.zscore.gz",sep="")
    z.table = cbind(z.table, read.table(file.name)[,1])
  }

  z.score = apply(z.table, MARGIN = 1, median) 
  lambda = median(z.score^2)/0.456
  combined.p.values = pchisq(z.score^2/lambda, df = 1, lower = F) 
  
  # export data for this env variable
  z.score = cbind(z.score, combined.p.values)
  data.table = cbind(snp_names, z.score)
  
  data.table$index <- NULL
  data.table = merge( data.table, genmap, by="SNPidentifier")
  data.table = data.table[order(data.table$chr,data.table[maptype],decreasing=F),]
  write.table(data.table,file=outTSV2,sep="\t",row.names=F,col.names=T,quote=F)

  # calculate Benjamini-Hochberg (FDR) combined adjusted P-values
  q = FDR
  L = length(combined.p.values)
  w = which(sort(combined.p.values) < q * (1:L) / L)
  if(length(w) == 0) w = c(1)
  candidates = order(combined.p.values)[w]
  fprintf("%s\t%2.5g\n", env_var_names[v,1],
          combined.p.values[candidates[length(candidates)]], 
          file = outTSV1, append = T)

  # plot manhattan plots
  SNPs = as.data.frame(combined.p.values)
  SNPs = cbind(SNPs,snp_names$SNPidentifier)
  colnames(SNPs) = c("P-value","SNPidentifier")
  SNPs = merge( SNPs, genmap, by="SNPidentifier")
  
  mht.data = SNPs[ , c("chr","cM","P-value","SNPidentifier")]
  
  png(file=paste("plots_LFMM/",rootname,".",env_var_names[v,1],
                 ".png",sep=""),width=1200)

  manhattan(mht.data,chr="chr",bp="cM",p="P-value",snp="SNPidentifier",logp=T,
          col = c("gray10", "gray60"), chrlabs=chrnames,suggestiveline=F,
          genomewideline = -log10(combined.p.values[candidates[length(candidates)]]),
          main=env_var_names[v,1], heading_genes$V1) #c("BOPA2_12_30894"))
  dev.off()
}

Histograms of combined adjusted P-values with K=6

In addition to the raw agroclimatic variables we can also test association to PCAs of all the climatic variables tested:


# with K=6
./soft/LFMM_CL_v1.5/bin/LFMM -x SBCC_9K_LFMM.complete.imputed.tr.tsv -v SBCC_PC_environfile.tr.tsv \
  -K 6 -p 5 -i 50000 -b 25000 -s 12345 -o SBCC_9K_LFMM.PC.rep1 \
  > SBCC_9K_LFMMimp.SBCC_PC_environfile.k6.rep1.log

./soft/LFMM_CL_v1.5/bin/LFMM -x SBCC_9K_LFMM.complete.imputed.tr.tsv -v SBCC_PC_environfile.tr.tsv \
  -K 6 -p 5 -i 50000 -b 25000 -s 23456 -o SBCC_9K_LFMM.PC.rep2 \
  > SBCC_9K_LFMMimp.SBCC_PC_environfile.k6.rep2.log

./soft/LFMM_CL_v1.5/bin/LFMM -x SBCC_9K_LFMM.complete.imputed.tr.tsv -v SBCC_PC_environfile.tr.tsv \
  -K 6 -p 5 -i 50000 -b 25000 -s 34567 -o SBCC_9K_LFMM.PC.rep3 \
  > SBCC_9K_LFMMimp.SBCC_PC_environfile.k6.rep3.log

./soft/LFMM_CL_v1.5/bin/LFMM -x SBCC_9K_LFMM.complete.imputed.tr.tsv -v SBCC_PC_environfile.tr.tsv \
  -K 6 -p 5 -i 50000 -b 25000 -s 45678 -o SBCC_9K_LFMM.PC.rep4 \
  > SBCC_9K_LFMMimp.SBCC_PC_environfile.k6.rep4.log

./soft/LFMM_CL_v1.5/bin/LFMM -x SBCC_9K_LFMM.complete.imputed.tr.tsv -v SBCC_PC_environfile.tr.tsv \
  -K 6 -p 5 -i 50000 -b 25000 -s 56789 -o SBCC_9K_LFMM.PC.rep5 \
  > SBCC_9K_LFMMimp.SBCC_PC_environfile.k6.rep5.log
  
mv SBCC_9K_LFMM.PC.rep* SBCC_9K_LFMMimp.*rep?.log LFMM/K6
gzip LFMM/K6/*

These data can be used to produce Manhattan plots as explained below:

library(qqman)
library(pracma)

# set type of map
maptype = "cM"
#maptype = "bp"

# set barley chr names
chrnames = c("1H","2H","3H","4H","5H","6H","7H")

# get names of env vars
env_var_names = read.table(file="SBCC_PC_environfile_order.txt")

# get SNP names
snp_names = read.table(file="SBCC_9K_LFMM.complete.imputed.annot.tsv",sep="\t",
                       col.names = c("index","SNPidentifier"))

# read genetic/bp positions of markers
if(maptype == "bp"){
  genmap = read.table(file="raw/9920_SNPs_SBCC_bp_map2017.curated.tsv",header=T)
} else{
  genmap = read.table(file="raw/9920_SNPs_SBCC_cM_map2017.curated.tsv",header=T)
}

# read neighbor candidate genes
heading_genes = read.table(file="SBCC_9K_SNPs.annotated_genes.tsv",header=F)

# set rootname of LFMM outfiles
rootname = "SBCC_9K_LFMM.PC"

# produce Manhattan plots of PCAs
FDR = 0.01
outTSV1 = paste("LFMM/SBCC_9K_LFMM.PC.FDR",FDR,".tsv", sep="")
unlink(outTSV1)

for (v in 1:nrow(env_var_names)){
  z.table = NULL
  combined.p.values = NULL
  v
  
  outTSV2 = paste("LFMM/SBCC_9K_LFMM.PC.",env_var_names[v,1],".",maptype,".tsv", sep="")
  
  for (i in 1:5){
    file.name = paste("LFMM/K6/",rootname,".rep",i,"_s",v,".6.zscore.gz",sep="")
    z.table = cbind(z.table, read.table(file.name)[,1])
  }
  z.score = apply(z.table, MARGIN = 1, median) 
  lambda = median(z.score^2)/0.456
  combined.p.values = pchisq(z.score^2/lambda, df = 1, lower = F) 
  
  # export data for this env variable
  z.score = cbind(z.score, combined.p.values)
  data.table = cbind(snp_names, z.score)
  data.table$index <- NULL
  data.table = merge( data.table, genmap, by="SNPidentifier")
  data.table = data.table[order(data.table$chr,data.table[maptype],decreasing=F),]
  write.table(data.table,file=outTSV2,sep="\t",row.names=F,col.names=T,quote=F)

  # calculate Benjamini-Hochberg (FDR) combined adjusted P-values
  q = FDR
  L = length(combined.p.values)
  w = which(sort(combined.p.values) < q * (1:L) / L)
  if(length(w) == 0) w = c(1)
  candidates = order(combined.p.values)[w]
  fprintf("%s\t%2.5g\n", env_var_names[v,1],
          combined.p.values[candidates[length(candidates)]], 
          file = outTSV1, append = T)

  # plot manhattan plots
  SNPs = as.data.frame(combined.p.values)
  SNPs = cbind(SNPs,snp_names$SNPidentifier)
  colnames(SNPs) = c("P-value","SNPidentifier")
  SNPs = merge( SNPs, genmap, by="SNPidentifier")
  
  mht.data = SNPs[ , c("chr","cM","P-value","SNPidentifier")]
  
  png(file=paste("plots_LFMM_pc/",rootname,".",env_var_names[v,1],
                 ".png",sep=""),width=1200)

  manhattan(mht.data,chr="chr",bp="cM",p="P-value",snp="SNPidentifier",logp=T,
          col = c("gray10", "gray60"), chrlabs=chrnames,suggestiveline=F,
          genomewideline = -log10(combined.p.values[candidates[length(candidates)]]),
          main=env_var_names[v,1], heading_genes$V1) #c("BOPA2_12_30894"))
  dev.off()
}

Association between SNPs and climate variables (LFMM)

26/jun/2018

Formatting SNPs

Imputing missing SNP data

Test run with SNP BOPA2_12_30894 (in VrnH3 gene)

Analyzing a set of SNPs