Challenges of Adjusting Single-Nucleotide Polymorphism Effect Sizes for Linkage Disequilibrium

Background: Genome-wide association studies (GWAS) have been successful in identifying SNPs associated with disease, but their individual effect sizes are small and require large sample sizes to achieve statistical significance. Methods of post-GWAS analysis, including gene-based, gene-set and polygenic risk score analyses, combine the SNP effect sizes in an attempt to boost the power of the analyses. To avoid giving undue weight to SNPs in linkage disequilibrium (LD), the LD needs to be taken into account in these analyses. Objectives: We review methods that attempt to adjust the effect sizes (β-coefficients) of summary statistics, instead of simple LD pruning. Methods: We subject LD adjustment approaches to a mathematical analysis, recognising Tikhonov regularisation as a framework for comparison. Results: Observing the similarity of the processes involved with the more straightforward Tikhonov-regularised ordinary least squares estimate for multivariate regression coefficients, we note that current methods based on a Bayesian model for the effect sizes effectively provide an implicit choice of the regularisation parameter, which is convenient, but at the price of reduced transparency and, especially in smaller LD blocks, a risk of incomplete LD correction. Conclusions: There is no simple answer to the question of which method is best, but where interpretability of the LD adjustment is essential, as in research aiming to identify the genomic aetiology of disorders, our study suggests that a more direct choice of mild regularisation in the correction of effect sizes may be preferable.


Introduction
Genome-wide association studies (GWAS) have rapidly advanced the genetics of complex disorders, led to the generation of large data sets and spurred the development of novel data analysis methods that summarise the vast amount of information and make use of it in a comprehensible manner. GWAS were successful in identifying genetic variants (single-nucleotide polymorphisms [SNPs]) that show association with a disease, but their individual effect sizes are small and require large sample sizes to achieve statistical significance. Methods of post-GWAS analysis such as gene-based and gene-set analysis and the use of polygenic risk scores (PRSs) combine the SNP effect sizes in an attempt to boost the power of the analyses. The most commonly used approaches for assessing the significance of a gene or gene set are implemented in the MAGMA data analysis tool [1]. In addition, the polygenic risk score approach provides an individual score per person and is often applied to the whole genome. The PRS for an individual is calculated as a linear combination of the numbers of risk alleles carried by the individual, weighted by the effect sizes of the SNPs estimated from an independent sample [2]. This mimics the combination of the SNP genotypes in multivariate (logistic) regression, but as the coefficients are taken from single-SNP associations, the linkage disequilibrium (LD) between SNPs needs to be taken into account in order to avoid giving undue weight to correlated SNP genotypes. One way to achieve this is to avoid LD by selecting independent (LD-pruned) SNPs for inclusion in the risk score and, to make the score disease relevant, prioritising the SNPs by their strength of association with the disease (p value), a simple approach now often referred to as pruning and thresholding ("P + T"). Several real-data and simulation studies have shown that this strategy is limited in power and falls short of the heritability explained by the SNPs [3, 4]. Strategies were therefore developed to adjust the effect sizes for LD, instead of LD pruning, thus preserving the information carried by SNP markers in LD [1, 3, 4]. Several methods that have been proposed and are gaining widespread use are based on a Bayesian framework, assuming a prior distribution of putative effect sizes and using LD information from a reference panel to obtain posterior joint effect sizes from those observed in single-SNP association [4-7]. In earlier work, starting from ideas similar to MAGMA-PCA [1], we have suggested a more direct adjustment of the SNP effect sizes using the singular value decomposition of the SNP-SNP LD matrix [3].
In this paper we analyse the adjustment of SNP effect sizes performed by the methods mentioned above, considering in particular LDpred as a model indicative of the more complex variants. Observing the similarity of the process to the simple Tikhonov-regularised ordinary least squares estimate for the putative multivariate regression coefficients, we consider the questions to what extent LD correction is actually performed, and how transparent the adjustment and its interpretation are to the user. The latter point may be an important element in the approach to understanding the biological mechanisms of complex genetic disorders, as, for example, PRSs are now used not only to test statistical significance and prediction accuracy in a sample, but also for designing clinical trials and biological experiments, including the selection of samples for stem cell research.

LD Adjustment by Multivariate Regression
The effect size of the association of a single SNP with a trait can easily be calculated as the regression coefficient from case-control data. However, when the effect sizes of a number of SNPs are to be combined, for example, in calculating a polygenic risk score or a set-based significance, the effect sizes of SNP markers in LD may be given undue weight, as the single-SNP regression coefficients partly replicate the association of the correlated markers. In the extreme case of 2 identical markers, the effect size would be counted double. Therefore, some adjustment of single-SNP effect sizes before combining them into summary indicators is desirable.
If, instead of single-SNP regression, multivariate regression over all SNPs is performed, then the correlation between the SNPs is automatically taken into account and eliminated. However, multivariate regression over a large number of variables is numerically unstable and prone to convergence issues, and therefore not practically feasible. Moreover, it is impossible when only single-SNP effect sizes are available, for example, from a GWAS, but the original case and control genotypes are not accessible. Nevertheless, the joint regression coefficients can be calculated from the single-SNP regression coefficients if the SNP-SNP correlation is known. Indeed, denoting the M-vector of single-SNP regression coefficients for M SNPs, calculated from standardised genotypic data, by β̃, the corresponding vector of joint regression coefficients by β̂ and the SNP-SNP (LD) correlation matrix of the genotypic data by D, it is well known that the ordinary least squares (OLS) estimate for the joint regression is given by the formula

β̂ = D⁻¹ β̃  (1)

(see, e.g., Yang et al. [8]; note that, as did Vilhjálmsson et al. [4], we here consider linear regression for simplicity).
To illustrate the effect of the adjustment (equation 1) on the effect sizes of correlated markers, consider the simple case of only 2 markers in correlation r, but independent of all other markers. In this case

D = ( 1  r ; r  1 ),  D⁻¹ = 1/(1 − r²) ( 1  −r ; −r  1 ).

The action of this matrix on the 2-vector β̃ can be more clearly seen in the sum and difference of the coefficients,

β̂₁ + β̂₂ = (β̃₁ + β̃₂)/(1 + r),  β̂₁ − β̂₂ = (β̃₁ − β̃₂)/(1 − r).
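This 2-marker calculation can be checked numerically. The following sketch uses hypothetical values for r and the single-SNP effect sizes (not taken from the text) and confirms that applying equation 1 divides the sum of the coefficients by 1 + r and their difference by 1 − r:

```python
import numpy as np

def ols_adjust(D, beta_single):
    """Joint (OLS) effect sizes from single-SNP effect sizes, as in equation 1."""
    return np.linalg.solve(D, beta_single)

r = 0.8                          # hypothetical LD between the two markers
D = np.array([[1.0, r], [r, 1.0]])
beta = np.array([0.10, 0.12])    # hypothetical single-SNP effect sizes

beta_hat = ols_adjust(D, beta)

# Sum is divided by (1 + r), difference by (1 - r):
assert np.isclose(beta_hat.sum(), beta.sum() / (1 + r))
assert np.isclose(beta_hat[0] - beta_hat[1], (beta[0] - beta[1]) / (1 - r))
```

`np.linalg.solve` is used instead of forming D⁻¹ explicitly, which is the numerically preferable way to evaluate equation 1.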

 
The adjustment of the sum (or average) is an intuitive correction for LD, as the weight of 2 highly correlated markers (r ≈ 1) is reduced to approximately half. (In the more general case of a block of M_l markers in constant LD with correlation r, the correction factor of the average of the effect sizes will be 1/(1 + [M_l − 1]r), so if the correlation is close to 1, then the average effect size of SNPs in the block is divided by approximately the number of SNPs in the block.) The adjustment of the difference of the effect sizes is perhaps less intuitive and more problematic. If 2 SNPs are in high LD, we expect the single-SNP effect sizes to be very similar, too, so any difference between them will be very important and is hence amplified in the adjustment. For r close to 1, the difference will be amplified very strongly. This can be very problematic when we take proxies for the correlation coefficients from a different sample from the one used for regression, because random fluctuations in the effect sizes between the samples will be amplified if the markers are in high LD. For example, consider 2 SNP markers in LD with correlation r = 0.98 and nearly equal single-SNP effect sizes β̃. After adjustment, the sum of the effect sizes is approximately halved, but their small difference is strongly amplified, leading in particular to a severe adjustment of both effect sizes. If furthermore the LD is taken from a proxy with, for example, r = 0.99, the resulting adjusted effect sizes have sum -0.452. While the adjusted sum is almost the same as before, we here see an even more severe adjustment of the difference and therefore of both effect sizes, showing the instability of the straightforward OLS adjustment. Note that when the markers are highly negatively correlated, so r is close to −1, the sum and difference of the effect sizes swap roles in the above discussion.
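The instability under a slightly different proxy correlation can be reproduced in a few lines. The effect sizes below are hypothetical stand-ins, chosen only to be nearly equal; they are not the values used in the text:

```python
import numpy as np

def ols_adjust(D, beta_single):
    # OLS adjustment of single-SNP effect sizes (equation 1)
    return np.linalg.solve(D, beta_single)

beta = np.array([-0.23, -0.22])   # hypothetical, nearly equal single-SNP effect sizes

for r in (0.98, 0.99):            # in-sample LD vs. a slightly different proxy LD
    D = np.array([[1.0, r], [r, 1.0]])
    adj = ols_adjust(D, beta)
    print(f"r={r}: adjusted={adj}, sum={adj.sum():.4f}, diff={adj[0] - adj[1]:.4f}")

# The sum is stable (divided by 1 + r, i.e. roughly halved in both cases),
# but the small difference of -0.01 is divided by 1 - r: a factor of 50 for
# r = 0.98 and 100 for r = 0.99.
```

The adjusted sum barely moves between the two runs, while the adjusted difference doubles, illustrating how sample-to-sample noise in either β̃ or the proxy LD is blown up by the unregularised inverse.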

Tikhonov-Regularised LD Adjustment
Generally, in the presence of high LD, the correlation matrix D will be numerically ill-conditioned or even singular, and therefore the calculation of its inverse in equation 1 will be numerically difficult or unstable, and will in any case lead to an adjustment matrix which has the tendency to amplify differences between the effect sizes of correlated markers, even when these differences are due to random fluctuations between different samples.
A standard method of avoiding such difficulties arising from an ill-conditioned matrix is Tikhonov regularisation [9, sect. 5.3]. (In a regression context, this method is associated with ridge regression, but we prefer to consider the more general concept of regularising the inverse problem (equation 1).) The idea here is to make the nonnegative definite correlation matrix strictly positive definite, with a positive lower bound, by adding a multiple of the unit matrix. This gives a regularised adjustment of the effect sizes of the general form

β_adj = (D + T·I)⁻¹ β̃,  (2)

where T > 0 is the regularisation constant. The operator norm of the inverse matrix in equation 2 is bounded by 1/T, ensuring that there is no uncontrolled amplification of effect sizes β̃. (Note that the operator norm of a matrix is the maximal factor of amplification of a vector under multiplication with this matrix.) In the present situation, where D is a correlation matrix, Tikhonov regularisation can be interpreted in the following way. Separating the diagonal entries 1 of the matrix D from the correlation coefficients of SNP pairs by writing D = D̂ + I, where D̂ is the reduced matrix of correlation coefficients with the diagonal entries set to 0, we can rewrite the adjustment formula 2 equivalently in the form

β_adj = (1 + T)⁻¹ (I + D̂/(1 + T))⁻¹ β̃.  (3)

This shows that, apart from an overall division by the constant 1 + T, the regularisation consists of dividing each pairwise correlation coefficient by the number 1 + T.
Thus, the regularisation avoids the singularities at correlation r = ±1 by just scaling down all the correlation coefficients. This means that instead of adjusting for the LD present in the markers, the regularised adjustment (equation 3) only adjusts for a fixed fraction of this LD in each pair of markers. The effect sizes for a pair of markers in LD with some value of r² are only corrected as if the LD were r²/(1 + T)². In consequence, T needs to be fairly small for effective LD adjustment to occur; for large values of T, the values β_adj will essentially be uniformly scaled-down values of β̃, rendering the adjustment ineffective when the β_adj are used in a subsequent analysis that introduces its own scale, such as linear or logistic regression. Although this downscaling is applied uniformly to all correlation coefficients, the effect on the adjustment of effect sizes will depend on the size of the LD blocks. Due to the LD structure of the genome, the SNP-SNP correlation matrix D is approximately block-diagonal, that is, it is composed of square matrices D_l of noticeable correlation strung along the diagonal of D, with small entries outside these blocks [10]. If we neglect these small entries, the regularisation and inverse in formula 3 apply to each LD block matrix D_l separately. While the operator norm of the regulariser (1 + T)I is equal to 1 + T independently of the block size, the norm of D̂_l, the part of D̂ restricted to the block, will depend on the correlation coefficients in the block, but even in case of extreme LD it is bounded by M_l − 1, where M_l is the number of SNPs in the block (see Appendix B). If 1 + T is larger than the norm of D̂_l, then, apart from the overall scaling factor, the adjustment of each effect size will only be a minor correction.
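The equivalence of the two forms of the regularised adjustment can be verified numerically. The sketch below builds a small hypothetical correlation matrix containing one pair of markers in high LD and checks that the Tikhonov-regularised inverse equals the rescaled-correlation form:

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical 5x5 SNP-SNP correlation matrix with two markers in high LD.
A = rng.normal(size=(50, 5))
A[:, 1] = A[:, 0] + 0.1 * rng.normal(size=50)
D = np.corrcoef(A, rowvar=False)

T = 0.5
D_hat = D - np.eye(5)                     # off-diagonal correlations only

lhs = np.linalg.inv(D + T * np.eye(5))                       # equation 2
rhs = np.linalg.inv(np.eye(5) + D_hat / (1 + T)) / (1 + T)   # equation 3

# Apart from the overall factor 1/(1 + T), the regularisation just divides
# each pairwise correlation coefficient by 1 + T:
assert np.allclose(lhs, rhs)
```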

As an illustrative, albeit somewhat cartoonish example, suppose the LD matrix is exactly of block-diagonal form, and the component block D_l has constant correlation r_l, so the entries of D_l are 1 on the diagonal and r_l everywhere else. Then the adjustment (equation 3) can be calculated in explicit form (see Appendix B), giving for each β̃_j in the corresponding LD block

β_adj,j = 1/(1 − r_l + T) · (β̃_j − M_l r_l/(1 − r_l + M_l r_l + T) · ⟨β̃⟩_l),  (4)

where ⟨β̃⟩_l is the mean of all observed effect sizes in the LD block. We can see that the adjustment consists of a scaling with factor 1/(1 − r_l + T) and a constant shift, the latter dependent on the average of SNP effect sizes in the LD block. For large T, the dependence of the scaling factor on the correlation r_l, which lies between −1 and 1, is of minor significance, and the scaling factor is approximately equal to the common overall factor (1 + T)⁻¹ of equation 3. The shift term depends on the average effect size, but will be small unless the product M_l r_l is of noticeable size compared to T. For larger values of T, this will only be the case for large blocks of high LD.
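The closed form for a constant-correlation block can be checked against the direct matrix inverse. The block size, correlation and effect sizes below are hypothetical:

```python
import numpy as np

def adjust_constant_block(beta, r_l, T):
    """Closed-form adjustment (equation 4) for a block of constant LD r_l."""
    M_l = len(beta)
    shift = (M_l * r_l) / (1 - r_l + M_l * r_l + T) * beta.mean()
    return (beta - shift) / (1 - r_l + T)

M_l, r_l, T = 6, 0.7, 0.1
beta = np.array([0.08, 0.10, 0.09, 0.11, 0.07, 0.10])   # hypothetical effect sizes

# Block with 1 on the diagonal and r_l everywhere else.
D_l = np.full((M_l, M_l), r_l) + (1 - r_l) * np.eye(M_l)

direct = np.linalg.solve(D_l + T * np.eye(M_l), beta)   # equation 2 on the block
closed = adjust_constant_block(beta, r_l, T)

assert np.allclose(direct, closed)
```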
As a further illustration of this effect, we show the result of a simulated example in Figure 1. Genotypes for 10,000 cases and 10,000 controls at 100 SNPs were randomly generated. The first block, comprising 6 SNPs, has an odds ratio of 1.1 and LD r² ≈ 0.9 between SNPs; the second block, comprising 50 SNPs, has an odds ratio of 1.1 and LD r² ≈ 0.4 between SNPs. The remaining 54 SNPs were generated unassociated and without LD, as a baseline for comparison. Figure 1a shows the SNP-SNP correlation matrix; there is some small random correlation between the LD blocks and the baseline SNPs. The vector of single-SNP effect sizes, calculated as logistic regression coefficients, was then subjected to regularised LD adjustment according to equation 2, with regularisation parameters T = 0.01, T = 1 and T = 10. The comparison of the resulting adjusted effect sizes with the raw (single-SNP) values is shown in Figure 1b-d. For mild regularisation (T = 0.01), the LD blocks show adjustment according to the size of the correlation, with some strong corrections in the small high-LD block. For moderate regularisation (T = 1), the outcome is similar, but extreme adjustment is avoided, and the correction factors (line slopes) are more aligned. With stronger regularisation (T = 10), the large block of moderate LD still receives adjustment, but the small block of high LD tends to fall in line with the uncorrelated baseline SNPs that do not need adjustment; as the regularisation constant exceeds the block size, the regularisation makes the adjustment less effective.
The size of the LD blocks in the genome varies greatly. In Reich et al. [11], blocks of significant LD ranged in size from 6 to 155 kb, with an average of about 60 kb, in North American individuals of European descent. Some LD blocks can be very large; for example, a block in the MHC region on chromosome 6 spans up to 5 Mb [12]. The relevant figure for the present considerations is the number of SNPs in an LD block. As an example, a typical set of summary statistics, including imputed SNPs after quality control, contains about 10 million SNPs. We estimated about 4 SNPs/kb on average in Alzheimer's disease summary statistics [13]. The HapMap3 list of SNPs used in LD score regression [14] contains about 1.2 million SNPs, giving an average of 0.5 SNPs/kb. The publicly available 1000 Genomes data, often used as a reference panel for LD calculation, contain about 2.3 million SNPs for a European population, giving an estimated 1 SNP/kb. These figures suggest that the size of LD blocks can be expected to range from about 3 to 600 SNPs or, in some cases, even more.
When setting the regularisation constant T explicitly, one would naturally choose a small value (in particular, smaller than the smallest number of SNPs in an LD block) in order to retain effective LD adjustment; however, the constant T may turn out to be large when it arises implicitly and out of the researcher's direct control, as shown in the following section.
DOI: 10.1159/000513303

LDpred's Bayesian Approach in the Context of Regularised LD Adjustment
The fundamental idea of LDpred, as explained in Vilhjálmsson et al. [4], is to correct the vector of single-SNP effect sizes or marginal least-squares effects β̃ by replacing them with the expectation values of a Bayesian update of an informationless prior, using these effect sizes as empirical input. In the infinitesimal model (LDpred-inf), it is assumed that each SNP is a priori equally likely to have an effect. This is modelled by setting the prior distribution to i.i.d. normal with mean 0 and variance h²/M, where h² is the total expected variance (disease heritability) and M is the total number of SNPs. Given the observed effect sizes β̃, the posterior distribution of the true effect sizes β is then normal with expectation and covariance

E(β | β̃) = (D + (1 − h²_l) M/(N h²) I_M)⁻¹ β̃,  (5)

Var(β | β̃) = ((1 − h²_l)/N) (D + (1 − h²_l) M/(N h²) I_M)⁻¹.  (6)

Here N is the number of individuals in the data set from which the observed effect sizes were calculated, D is the SNP-SNP correlation matrix estimated from the data set, 1 − h²_l is the variance of the residuals per individual in the regression model, and I_M is the M × M unit matrix. We show the derivation of these formulae in Appendix A, both for ease of reference and because the corresponding passage in Vilhjálmsson et al. [4] contains a number of misprints, including a spurious factor of 1 − h²_l in the formula for the expectation E(β|β̃) on p. 587. The expected vector of effect sizes from the posterior distribution (equation 5) is then taken to be a Bayesian LD-adjusted vector of effect sizes. Assuming that the adjustment is applied to LD regions of small size compared to the total number M of SNPs, the approximation 1 − h²_l ≈ 1 is applied [4, pp. 578-579]. Clearly, the adjustment is then just a Tikhonov-regularised LD correction as in equation 2 with the region-independent constant

T = M/(h² N).  (7)
In particular, the considerations at the end of section 2 above about the dependence of the effect of regularisation on the size of the LD region apply.
Thus, the main contribution of LDpred is to provide a rationale, based on the Bayesian model, for a particular choice of the regularisation constant T. Taken at face value, LDpred's regularisation constant (equation 7) is determined by the total number of SNPs M, the sample size N and the estimated overall heritability h²; in fact, the derivation shows that the estimated heritability per SNP marker, h²/M, is the basic quantity here. For a large full-genome study, M will be on the order of 10⁶ to 10⁷ and the number of individuals genotyped is typically in the range N ≈ 10⁴ to 10⁵. If the heritability h² lies below 1, this will give a regularisation constant T of the order of 10² or higher, which is fairly large and exceeds the size of the smaller LD blocks.
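This order-of-magnitude statement is easy to check; the study parameters below are hypothetical but typical values in the stated ranges:

```python
def ldpred_inf_T(M, N, h2):
    """Implicit regularisation constant T = M / (h^2 * N) of LDpred-inf (equation 7)."""
    return M / (h2 * N)

# Hypothetical full-genome study: 5 million SNPs, 50,000 individuals, h^2 = 0.5.
T = ldpred_inf_T(M=5_000_000, N=50_000, h2=0.5)
print(T)   # 200.0, far larger than the size of small LD blocks
```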
We note that the ostensible dependence of T on M and N in equations 7 and 5 is somewhat misleading. Indeed, in LDpred the heritability h² is estimated from the data using the formula

h² = M(⟨χ²⟩ − 1)/(N ⟨l⟩),  (8)

where ⟨χ²⟩ is the average χ² statistic of the markers and ⟨l⟩ their average LD score; inserting this estimate into equation 7 gives T = ⟨l⟩/(⟨χ²⟩ − 1), in which M and N cancel. As a result, the effect of the adjustment (equation 3) on each LD block will be determined by the trade-off between the operator norm of the reduced correlation matrix for this block, an average LD score and the average χ² statistic across the markers. The LD score of the j-th marker is approximately l_j = Σ_k r²_jk, summing over the neighbouring markers in a window of size K. Estimating the average value of r² in this window at r² = 0.2 for K = 10 and r² = 0.1 for K = 100, and taking a typical value of ⟨χ²⟩ = 1.1 (e.g., ⟨χ²⟩ = 1.0783 in Kunkle et al. [13]), we find T = 20 for K = 10 and T = 100 for K = 100. Again, these values are so large that the regularisation will render the LD adjustment ineffective in smaller LD blocks.
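The cancellation of M and N, and the resulting values of T, can be reproduced directly from the window estimates used in the text:

```python
def implicit_T(mean_ldscore, mean_chi2):
    """T = M / (h^2 N) with h^2 = M (mean_chi2 - 1) / (N * mean_ldscore):
    M and N cancel, leaving T = mean_ldscore / (mean_chi2 - 1)."""
    return mean_ldscore / (mean_chi2 - 1)

# Average LD score approximated as K * r2 over a window of K neighbours.
print(implicit_T(10 * 0.2, 1.1))    # K = 10, r2 = 0.2  ->  T ≈ 20
print(implicit_T(100 * 0.1, 1.1))   # K = 100, r2 = 0.1 ->  T ≈ 100
```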
There is a non-infinitesimal variant of LDpred based on a Bayesian prior in which the normal distribution for each SNP is mixed with a delta distribution at 0, intended to model a situation where a certain fraction of genetic variants is not causal at all. This provides a parameter for optimisation, but makes the analysis much less transparent and in particular does not allow an explicit formula for the posterior distribution beyond the observation of per-SNP scaling with the constant 1 + T, with T as in equation 7; see Vilhjálmsson et al. [4, p. 588]. The LD adjustment is not made explicit, and Márquez-Luna et al. [7] recommend exclusion of SNPs from long-range LD regions. For these reasons, we do not discuss non-infinitesimal LDpred further here. The PRS-CS method [5] and the SBayesR method [6] use a more sophisticated prior composed of a mixture of normal distributions with optimised parameters. These approaches are shown to achieve higher prediction accuracy than LDpred, but also preclude explicit analysis of the resulting posterior distribution.

Variants of Regularisation
More recently, a functionally informed variant of the infinitesimal model for LDpred has been proposed; see "LDpred-funct-inf" in Márquez-Luna et al. [7]. Here the multiple T·I of the unit matrix in equation 2 is replaced by a diagonal matrix with i-th diagonal entry 1/(N c σ²_i), where σ²_i is a previously estimated per-SNP heritability and c = h²/Σ_i σ²_i, giving the adjustment

β_adj = (D + diag(1/(N c σ²_i)))⁻¹ β̃.  (9)

Using the matrix D̂ defined above and abbreviating γ_i = 1 + 1/(N c σ²_i) for the i-th marker, we observe that the matrix in equation 9 can be rewritten as

D + diag(1/(N c σ²_i)) = diag(√γ_i) D̃ diag(√γ_i),

where D̃ is a correlation matrix with non-diagonal entries r_ij/√(γ_i γ_j). Thus, in comparison with the plain adjustment (equation 1), the correlation coefficients are scaled down and the effect sizes are also directly reduced. In order to assess the extent of this scaling, we note that the mean of N c σ²_i is equal to N h²/M₊ = 1/T, with T as in equation 7, but where M is replaced by the generally smaller number M₊ of SNPs with σ²_i > 0, and h²_l is as before neglected. By Jensen's inequality,

⟨γ_i⟩ = 1 + ⟨1/(N c σ²_i)⟩ ≥ 1 + 1/⟨N c σ²_i⟩ = 1 + T,

so if T is big, then γ_i will be large for a large proportion of SNPs. For these SNPs, there will be a strong direct downscaling of the effect sizes, with a factor varying between SNPs in dependence on their estimated σ²_i, but the SNP-SNP correlation coefficients measuring the actual LD will also be scaled down severely. The effect on these SNPs, then, will be a per-SNP scaling rather than an LD adjustment. For SNPs with a large per-SNP heritability σ²_i, the corresponding γ_i may be smaller and close to the minimal possible value 1, in which case the adjustment (equation 9) will be close to the straightforward OLS estimate (equation 1) with little direct downscaling.
In the overall result, a PRS formed with the adjusted effect sizes will give great prominence to SNPs with a high previously estimated per-SNP heritability, with LD adjustment essentially restricted to these markers.
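The factorisation of the functionally informed adjustment matrix into the per-SNP factors √γ_i and the rescaled correlation matrix D̃ can be verified numerically. The correlation matrix, per-SNP heritabilities and h² below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 4, 10_000
A = rng.normal(size=(200, M))
D = np.corrcoef(A, rowvar=False)               # hypothetical SNP-SNP correlations

sigma2 = np.array([1e-5, 2e-5, 5e-4, 1e-3])    # hypothetical per-SNP heritabilities
c = 0.5 / sigma2.sum()                         # c = h^2 / sum(sigma2), assuming h^2 = 0.5
gamma = 1 + 1 / (N * c * sigma2)

lhs = D + np.diag(1 / (N * c * sigma2))        # the matrix inverted in equation 9

# D-tilde: off-diagonal entries r_ij / sqrt(gamma_i * gamma_j), unit diagonal.
G = np.sqrt(np.outer(gamma, gamma))
D_tilde = D / G + np.diag(1 - 1 / gamma)

rhs = np.sqrt(np.diag(gamma)) @ D_tilde @ np.sqrt(np.diag(gamma))

assert np.allclose(lhs, rhs)
```

The factorisation makes explicit that a large γ_i both scales down the i-th effect size directly and shrinks all correlation coefficients involving marker i.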
A rather different approach to the question of LD adjustment for the purpose of calculating PRS was proposed in Baker et al. [3]. The POLARIS method uses, instead of equation 2, the adjustment

β_adj = (D + T·I)^(−1/2) β̃,

either without regularisation (T = 0) or very mildly regularised (T ⪡ 1). The rationale behind this is that the OLS formula 1 eliminates all correlations on the right-hand side βᵀX of the linear regression formula, so it adjusts for correlation both in the coefficients β and the data vectors X. However, although the effect sizes β̃ are obtained as single-SNP regression coefficients, they enter the PRS as per-SNP effect sizes which may arise from different samples and can be connected functionally as well as by LD.
Hence, forming a PRS from a genotype vector x in the form β̂ᵀx = β̃ᵀD⁻¹x can be considered an overcorrection, and the form β̃ᵀD^(−1/2)x (or a mildly regularised version of it) is preferred; this can be taken as a correction of x for LD with the actual data correlation matrix D while taking the single-SNP effect sizes at face value. This form of LD adjustment for PRS was shown to be effective [3].
From the practical point of view, the matrix D^(1/2), if not directly singular, is considerably better conditioned than the matrix D, as the square root function moves the very small eigenvalues disproportionately farther away from 0. In this sense, taking the square root is a form of regularisation which runs less risk of obscuring the correlation structure encoded in the LD matrix than Tikhonov regularisation with a large constant T.
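The conditioning claim can be made concrete: for a symmetric positive definite D, the condition number of D^(1/2) is the square root of that of D. The sketch below builds a hypothetical ill-conditioned correlation matrix (one near-duplicate marker) and computes the matrix square root via an eigendecomposition:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(60, 8))
A[:, 1] = A[:, 0] + 0.05 * rng.normal(size=60)   # near-duplicate marker -> ill-conditioned D
D = np.corrcoef(A, rowvar=False)

# Matrix square root via eigendecomposition (D is symmetric positive semi-definite;
# tiny negative rounding errors in the eigenvalues are clipped to 0).
w, V = np.linalg.eigh(D)
D_sqrt = V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T

print(np.linalg.cond(D), np.linalg.cond(D_sqrt))
# cond(D^(1/2)) = sqrt(cond(D)): the small eigenvalues are moved away from 0.
```

A POLARIS-style adjustment would then be obtained as `np.linalg.solve(D_sqrt, beta)`, with a small multiple of the identity added to D first if mild regularisation is desired.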

Discussion
In order to avoid spurious contributions from associated SNPs in medium to high LD, it is essential to take the LD between SNPs into account when preparing SNP effect sizes for inclusion in aggregates for further analysis, such as in PRS or (converted to p values) in gene sets as in Brown's method (MAGMA [1]). When assessing different methods of LD adjustment, a main concern may be the prediction accuracy achieved in the end; however, since the ultimate aim of polygenic statistical analysis is not just to find a suitable measure of disease risk, but to interpret and understand its structure and thus to identify the co-causal genes, we contend that transparency of the choices and calculations involved is a further crucial characteristic. A comparison of prediction accuracy for different approaches can be found in Ge et al. [5] and Lloyd-Jones et al. [6]; in the present discussion we focus on interpretability and possible misconceptions on the side of the user.
The "P + T" approach avoids the issue by using independent (low-LD) SNPs only, but this may lead to a loss of information and hence of prediction accuracy. Nevertheless, this method is fast, transparent and easy to interpret; for example, identifying the SNPs for which an individual at high risk of the disease carries 2 risk alleles can indicate which variants or genes contribute most to the PRS.
Direct multivariate regression of case-control data would automatically account for correlation between SNPs, but is not practically feasible, both due to computational problems when the number of SNPs included is large and due to the fact that usually only single-SNP regression coefficients are available from previous studies. It can be replaced by using a proxy SNP-SNP correlation matrix in the OLS formula (equation 1); the computation of the inverse matrix, however, usually requires some regularisation, such as the standard Tikhonov regularisation (equation 2). Normally, one would expect to use a fairly small regularisation parameter T, as strong regularisation is essentially equivalent to an artificial downscaling of the LD correlation coefficients. Our analysis indicates that T should not exceed 1 in order to ensure effective LD adjustment even of small blocks of highly correlated SNPs.
LDpred appears to offer a different, more data-driven way, as it is based on a Bayesian model involving quantities estimated from the available data themselves. However, on inspection, LDpred with the infinitesimal prior turns out to give a regularised OLS estimate for the multivariate regression coefficients. The contribution of the Bayesian framework is that it provides a specific choice of the regularisation parameter. This saves the user the trouble of choosing a suitable regularisation parameter, or even of realising that regularisation is taking place, but there are 2 caveats.
First, the choice of the regularisation parameter is not very transparent; although it ostensibly depends on the study parameters of sample size and number of SNPs, the implicit heritability estimate used in fact removes the direct dependence on these manifest numbers and instead determines the regularisation parameter in terms of average LD scores and the average χ² statistic. Therefore the resulting LD adjustment will be non-local: to what extent the effect sizes of SNPs in some region will be corrected depends not only on the LD between the SNPs, but also on the effect sizes and LD found on average throughout the genome.
Second, the regularisation parameter resulting from the Bayesian estimate may be rather large. In this case, adjustment will only be made for a possibly small fraction of the actual LD. This effect is particularly pronounced in small LD blocks, which even in case of extreme LD receive no more LD adjustment than if they were uncorrelated. (Note that the equal downscaling of all effect sizes has no effect when the effect sizes are subsequently linearly combined into a single regression variable.) This gives rise to a situation where, outside the control or knowledge of the user and contrary to the expectation raised by its name, LDpred does not perform full LD adjustment of SNPs, even in high LD and possibly with large effect sizes, resulting in the spurious contributions, for example, to PRS, that accompany the use of uncorrected single-SNP effect sizes.
The functionally informed variant of LDpred varies the strength of regularisation for each SNP based on its estimated per-SNP heritability σ²_i. As a result, SNPs with low or medium σ²_i will have their effect sizes directly scaled down, but not adjusted for LD. SNPs with high σ²_i will be promoted due to less penalisation of their effect sizes and adjusted for LD among each other. Thus, LDpred-funct-inf can be considered as a fuzzy version of P + T or, if the SNPs with high σ²_i are in LD, as fuzzy pruning combined with a mildly regularised OLS estimate. Therefore LDpred-funct-inf points in the direction of what the field wants to achieve by intelligent pruning [16]. In comparison to LDpred-inf, which imposes a uniformly regularised, possibly much diluted LD adjustment on all SNPs on the basis of a Bayesian model, LDpred-funct-inf aims more directly at a focus on SNPs that appear likely to have explanatory power and subjects their effect sizes to LD correction with little adjustment, while eliminating or diminishing the effect sizes of the other SNPs. This method achieves higher prediction accuracy than LDpred-inf by using the estimated per-SNP heritabilities, but, as the selection of SNPs for emphasis remains internal, it is not as transparent as a straightforward method of pruning and mildly regularised OLS. This also applies to methods with variant Bayesian priors, such as non-infinitesimal LDpred, PRS-CS and SBayesR. These methods may reach higher prediction accuracy, likely due to promoting the most associated SNPs by reduced and more relaxed LD adjustment. This may make practical sense, but ultimately transcends the paradigm of approximating the "true" effect sizes one would obtain from multivariate regression.
The POLARIS method uses the matrix square root for a regularisation that runs less risk of obscuring the LD structure encoded in the correlation matrix than Tikhonov regularisation with a large parameter, but at additional computational cost.
In conclusion, faced with an array of automated methods of processing single-SNP effect sizes for LD, some of considerable complexity, but not always guaranteed to perform the expected task of LD adjustment, the user may consider whether, beyond mere prediction accuracy, the aim of identifying the polygenic elements of a disease is not better served by a simple and transparent method of cautious pruning and gently regularised OLS adjustment for LD.
Appendix A

Derivation of the LDpred Adjustment Formulae

We assume the standardised linear model

Y = Xβ + ε,  (10)

where X is the N × M matrix of standardised genotypes, β is the vector of true effect sizes and ε is the vector of residuals with variance 1 − h²_l per individual. The vector of single-SNP (marginal) effect sizes is β̃ = (1/N) XᵀY, equal to the result of separate linear regression on each SNP. Now assume the "true" effect sizes are β, so equation 10 holds, where only Y and ε are random variables. Under this assumption of true effects β, the distribution parameters of β̃ can be calculated as

E(β̃ | β) = Dβ  (11)

for the expectation and

Cov(β̃ | β) = ((1 − h²_l)/N) D  (12)

for the covariance matrix. From equations 11 and 12 we then obtain the conditional probability density

f(β̃ | β) ∝ exp(−(N/(2(1 − h²_l))) (β̃ − Dβ)ᵀ D⁻¹ (β̃ − Dβ)),

and further use the assumed density for a multivariate Gaussian prior with covariance matrix C_p,

f(β) ∝ exp(−(1/2) βᵀ C_p⁻¹ β).

We now rewrite the term in the exponential of the product f(β̃|β) f(β) as a quadratic expression in β, that is, to be of the form (β − m)ᵀ Σ⁻¹ (β − m) + const, where the constant does not depend on β and m is subsequently interpreted as the posterior mean, Σ as the posterior covariance matrix for β. Completing the square gives

Σ⁻¹ = (N/(1 − h²_l)) D + C_p⁻¹,  m = (N/(1 − h²_l)) Σ β̃ = (D + ((1 − h²_l)/N) C_p⁻¹)⁻¹ β̃.

When using the infinitesimal non-informative prior with C_p = (h²/M) I_M, this gives the formulae 5 and 6. For the functionally informed prior with previously estimated per-SNP heritability σ²_i for the i-th SNP and c = h²/Σ_i σ²_i, we have the diagonal covariance matrix C_p = c · diag(σ²_1, ..., σ²_{M₊}), with reduced dimension M₊ due to the omission of SNPs which do not have σ²_i > 0, and the adjustment formula 9 follows when h²_l is neglected.

Appendix B

The Importance of LD Block Size

Consider an LD block of size M_l and denote by D_l the corresponding part of the SNP-SNP correlation matrix. We write D_l = I + D̂_l, where I is the M_l × M_l unit matrix and D̂_l is the reduced correlation matrix with entries r_jk off-diagonal and zeros on the diagonal. With respect to the vector norm ‖x‖₁ = Σ_j |x_j|, the operator norm of D̂_l is the maximum value of the ratio ‖D̂_l x‖₁/‖x‖₁ as x ranges over all M_l-vectors; since each column of D̂_l contains M_l − 1 correlation coefficients, each of absolute value at most 1, this norm is bounded by M_l − 1, even in case of extreme LD.

Now suppose the correlation in the block is constant, so r_jk = r_l with fixed r_l, and neglect the correlations with markers outside the block. Then, considering SNPs in the corresponding LD block only, the relationship between the observed effect size β̃_j and the adjusted effect size β_adj,j by equation 3 is

β̃_j = (1 − r_l + T) β_adj,j + r_l M_l ⟨β_adj⟩_l,

where ⟨...⟩_l denotes averaging over the LD block of length M_l. Hence ⟨β̃⟩_l = (1 − r_l + M_l r_l + T) ⟨β_adj⟩_l, and equation 4 follows by substitution and rearrangement.

Escott-Price/Schmidt: Hum Hered 2020;85:24-34. DOI: 10.1159/000513303
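As a small numerical check of the Appendix A derivation, the following sketch (with a hypothetical correlation matrix, observed effect sizes and parameter values) verifies that the posterior mean obtained from completing the square coincides with the Tikhonov-regularised OLS form of equation 5:

```python
import numpy as np

rng = np.random.default_rng(3)
M, N, h2, h2l = 5, 20_000, 0.4, 0.4            # hypothetical study parameters
A = rng.normal(size=(100, M))
D = np.corrcoef(A, rowvar=False)               # hypothetical SNP-SNP correlations
beta_tilde = rng.normal(scale=0.05, size=M)    # hypothetical observed effect sizes

# Posterior precision and mean from completing the square,
# with the infinitesimal prior C_p = (h^2 / M) I.
Cp_inv = (M / h2) * np.eye(M)
Sigma_inv = (N / (1 - h2l)) * D + Cp_inv
m = np.linalg.solve(Sigma_inv, (N / (1 - h2l)) * beta_tilde)

# ... equals the Tikhonov-regularised form of equation 5:
m2 = np.linalg.solve(D + (1 - h2l) * M / (N * h2) * np.eye(M), beta_tilde)

assert np.allclose(m, m2)
```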