Sorry, my mistake! I didn't read the hap file properly.
hap <- matrix(scan(hap.con, n=hap.ncol * nlines), ncol=hap.ncol)
should become
hap <- matrix(scan(hap.con, n=hap.ncol * nlines), ncol=hap.ncol, byrow=T)
and the MAFs would be fine.
Tim
-----Original Message-----
From: Oxford Statistical Genetics Software [mailto:[log in to unmask]] On Behalf Of Timothy Mak
Sent: 26 January 2016 10:20
To: [log in to unmask]
Subject: Re: [OXSTATGEN] 1000G Phase 3 hap files MAF different from expected
Here, I'm using all 2504 participants. And the .legend file provides MAFs for different populations, including ALL. I've verified that the MAFs provided in the .legend files with respect to ALL are correct with respect to the original files downloaded from 1000G. So it seems that the .hap files are not correct.
Tim
-----Original Message-----
From: Oxford Statistical Genetics Software [mailto:[log in to unmask]] On Behalf Of Speed, Doug
Sent: 26 January 2016 10:16
To: [log in to unmask]
Subject: Re: [OXSTATGEN] 1000G Phase 3 hap files MAF different from expected
Is it that the MAF's are population specific - so if you were to compute using only, say, the CEU individuals, they would be correct?
On 26/01/16 02:10, Timothy Mak wrote:
> I have been using the 1000G Phase 3 data as provided at https://mathgen.stats.ox.ac.uk/impute/impute_v2.html#reference. However, I noticed there seems to be some discrepancy between the .hap file provided and the .legend file, in that the MAFs don't match. Here's an R script that demonstrates the difference. Can someone tell me why?
>
> Many many thanks!
>
> Tim
>
> ### R script to show difference ###
> leg.con <- gzfile("1000GP_Phase3_chr22.legend.gz")
> nlines <- 10
> leg <- strsplit(readLines(leg.con, n=nlines + 1), split=" ")
> leg2 <- as.data.frame(do.call(rbind, leg[-1]))
> colnames(leg2) <- leg[[1]]
>
> hap.con <- gzfile("1000GP_Phase3_chr22.hap.gz")
> hap.ncol <- 5008
> hap <- matrix(scan(hap.con, n=hap.ncol * nlines), ncol=hap.ncol)
> hapmeans <- rowMeans(hap) cbind(hapmeans,
> as.numeric(as.character(leg2$ALL)))
> ### End of R script ###
>
> Output:
> hapmeans
> [1,] 0.0013977636 0.0001996805
> [2,] 0.0013977636 0.0063897764
> [3,] 0.0009984026 0.0075878594
> [4,] 0.0019968051 0.0001996805
> [5,] 0.0031948882 0.0001996805
> [6,] 0.0021964856 0.0003993610
> [7,] 0.0009984026 0.0009984026
> [8,] 0.0025958466 0.0003993610
> [9,] 0.0023961661 0.0001996805
> [10,] 0.0011980831 0.0017971246
>
> One would expect the left column to either equal the right column, or 1 - the right column, but it is neither.
>
> To unsubscribe from the list visit this webpage
> https://www.jiscmail.ac.uk/cgi-bin/webadmin?SUBED1=OXSTATGEN&A=1
>
To unsubscribe from the list visit this webpage https://www.jiscmail.ac.uk/cgi-bin/webadmin?SUBED1=OXSTATGEN&A=1
To unsubscribe from the list visit this webpage https://www.jiscmail.ac.uk/cgi-bin/webadmin?SUBED1=OXSTATGEN&A=1
To unsubscribe from the list visit this webpage https://www.jiscmail.ac.uk/cgi-bin/webadmin?SUBED1=OXSTATGEN&A=1
|