Hello all,

I have a follow-up question on the strategy mentioned below: reducing the computational burden of imputing a large study sample by pre-phasing all samples together and then splitting them into smaller batches for the imputation step. This will give you one set of metrics for each batch of imputed samples, yes? How would you deal with this when choosing variants for downstream analysis, i.e. filtering on a minimum “info” value? I suppose you could take one set of metrics and assume that the other sets are similar (if your sample batches were created at random). Or is there some way to create a unified set of metrics, such as a script that you could run on a .gprobs file created by patching all the individual, per-batch .gprobs files back together?
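To make the second option concrete, here is a minimal Python sketch of what such a script might look like. It assumes the info measure usually described for IMPUTE-style output (one minus the ratio of the summed per-individual dosage variance to the variance expected at the estimated allele frequency), and it assumes each line of the patched .gprobs file has five leading columns followed by three genotype probabilities per individual. The function names are hypothetical, triples that sum to zero (missing data) are not handled, and I have not checked the result against the values IMPUTE2 itself reports.

```python
def info_score(probs):
    """Info-style metric for one variant.

    probs is a flat list [p_AA, p_AB, p_BB, ...] with one triple per
    individual (assumed layout of the genotype-probability columns).
    """
    if len(probs) == 0 or len(probs) % 3 != 0:
        raise ValueError("expected a non-empty list of probability triples")
    n = len(probs) // 3
    sum_dosage = 0.0    # sum of expected B-allele counts
    sum_variance = 0.0  # sum of per-individual dosage variances
    for i in range(0, len(probs), 3):
        p_ab, p_bb = probs[i + 1], probs[i + 2]
        e = p_ab + 2.0 * p_bb   # expected allele count
        f = p_ab + 4.0 * p_bb   # expected squared allele count
        sum_dosage += e
        sum_variance += f - e * e
    theta = sum_dosage / (2.0 * n)  # estimated B-allele frequency
    if theta <= 0.0 or theta >= 1.0:
        return 1.0  # monomorphic: treated as fully informative by convention
    return 1.0 - sum_variance / (2.0 * n * theta * (1.0 - theta))


def info_from_gprobs_line(line, n_leading=5):
    """Return (rsid, info) for one line; n_leading=5 is an assumed layout."""
    fields = line.split()
    probs = [float(x) for x in fields[n_leading:]]
    return fields[1], info_score(probs)
```

Because the metric only depends on sums over individuals, it should not matter in which order the per-batch columns are patched together, as long as every batch file lists the same variants in the same order.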

Curious how others would approach this.

Thanks!

~Sarah

From: Oxford Statistical Genetics Software On Behalf Of Bryan Howie
Sent: Sunday, April 14, 2013 6:06 AM
Subject: Re: [OXSTATGEN] Imputation interval N SNPs versus megabases (Impute2)

Hi George,

Thanks Bryan, this is very helpful. The main impetus for using SNP count rather than region size, for me, was that the sample I will be trying to impute is ~20,000 subjects, and using the default region size of 5 Mb gives a reasonable number of regions to impute, but regions that my computer cannot handle.

Yes, 20,000 is a big data set. You can reduce the computational load by using smaller regions or subsetting the individuals for imputation; further comments on this below.

If you would like to explain further why setting chunk boundaries based on reference variants is preferable, I would be very interested (for example, what would be the extremes of this? How few study variants would indicate possible problems?) ... Therefore controlling the region interval by reference variant count might lead to finer control over running time and memory usage.

You've answered your own question: there are many more reference than study variants, so they drive the computational burden. A practical approach is to split the genome into chunks with a fixed number of reference variants, then discard chunks where there are too few study SNPs for reliable imputation. (There is no fixed way to define "too few", but usually the distinction isn't hard to make.)
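As a rough illustration of that chunking approach, here is a minimal Python sketch. The function name, the `ref_per_chunk` setting, and the `min_study` cutoff are all hypothetical; as noted above there is no fixed definition of "too few", so the cutoff is something you would have to choose by inspection. It also ignores buffer regions and works on plain lists of base-pair positions for a single chromosome.

```python
from bisect import bisect_left, bisect_right


def make_chunks(ref_positions, study_positions, ref_per_chunk, min_study):
    """Split one chromosome into chunks of a fixed number of reference variants.

    Returns (start, end, n_study) for every chunk that contains at least
    min_study study SNPs; chunks with fewer study SNPs are discarded.
    """
    ref = sorted(ref_positions)
    study = sorted(study_positions)
    kept = []
    for i in range(0, len(ref), ref_per_chunk):
        block = ref[i:i + ref_per_chunk]
        start, end = block[0], block[-1]
        # Count study SNPs whose position lies within [start, end].
        n_study = bisect_right(study, end) - bisect_left(study, start)
        if n_study >= min_study:
            kept.append((start, end, n_study))
    return kept
```

Each kept `(start, end)` pair could then be used as the imputation interval for one run.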

A related question to my specific imputation is whether imputing all subjects together produces results that are just as helpful as splitting the sample by subject, imputing, and then combining the subjects back into one dataset for association testing. Really I would just like the option of using the default IMPUTE2 setting of 5 Mb to impute, and I believe splitting the sample by subjects would allow this, as running time and memory use tend to increase with sample size. Can you provide any further guidance on that approach? My two main worries with it are a) the accuracy of the imputation will be decreased due to a smaller sample size and b) when combining the subjects from two imputation runs, things might get sticky with SNPs that are difficult to impute having genotypes that correlate with which imputation run they came from.

Your concerns can be addressed by pre-phasing the study individuals all together, then using the pre-phased haplotypes for imputation in batches. This approach avoids losing accuracy (since the pre-phasing uses all available information), reduces computation at the imputation step (since it lets you impute in batches), and avoids problems of imputation batch effects (since pre-phased haplotypes are imputed independently from one another).
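The batching step can be as simple as cutting the pre-phased haplotype file into column blocks. Below is a minimal Python sketch under stated assumptions: each line of the haplotype file has five leading columns followed by two haplotype columns per individual, and the function name and `batch_size` parameter are hypothetical. The matching sample list would need to be split in the same order.

```python
def split_haps_line(line, batch_size, n_leading=5):
    """Split one line of a pre-phased haplotype file into per-batch lines.

    Assumes n_leading variant columns followed by two haplotype columns
    per individual; returns one output line per batch of individuals.
    """
    fields = line.split()
    lead, haps = fields[:n_leading], fields[n_leading:]
    if len(haps) % 2 != 0:
        raise ValueError("expected two haplotype columns per individual")
    n_individuals = len(haps) // 2
    batches = []
    for start in range(0, n_individuals, batch_size):
        # Columns for individuals start .. start + batch_size - 1.
        cols = haps[2 * start:2 * (start + batch_size)]
        batches.append(" ".join(lead + cols))
    return batches
```

Since each pair of pre-phased haplotypes is imputed on its own, how the individuals are assigned to batches should not affect the imputed genotypes.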

I also notice that sometimes the buffer has fewer input SNPs, so perhaps I could control this as well.

I wouldn't worry too much about this.

Just to finally check: when you suggest controlling reference variant count, the benefits of this are mainly for memory usage and running time compared to controlling input SNP count, i.e. there is no benefit to imputation accuracy?

Yes.

Best,

--Bryan
