DAWN: a framework to identify autism genes and subnetworks using gene expression and genetics

Background De novo loss-of-function (dnLoF) mutations are found twofold more often in autism spectrum disorder (ASD) probands than their unaffected siblings. Multiple independent dnLoF mutations in the same gene implicate the gene in risk and hence provide a systematic, albeit arduous, path forward for ASD genetics. It is likely that using additional non-genetic data will enhance the ability to identify ASD genes. Methods To accelerate the search for ASD genes, we developed a novel algorithm, DAWN, to model two kinds of data: rare variations from exome sequencing and gene co-expression in the mid-fetal prefrontal and motor-somatosensory neocortex, a critical nexus for risk. The algorithm casts the ensemble data as a hidden Markov random field in which the graph structure is determined by gene co-expression and it combines these interrelationships with node-specific observations, namely gene identity, expression, genetic data and the estimated effect on risk. Results Using currently available genetic data and a specific developmental time period for gene co-expression, DAWN identified 127 genes that plausibly affect risk, and a set of likely ASD subnetworks. Validation experiments making use of published targeted resequencing results demonstrate its efficacy in reliably predicting ASD genes. DAWN also successfully predicts known ASD genes, not included in the genetic data used to create the model. Conclusions Validation studies demonstrate that DAWN is effective in predicting ASD genes and subnetworks by leveraging genetic and gene expression data. The findings reported here implicate neurite extension and neuronal arborization as risks for ASD. Using DAWN on emerging ASD sequence data and gene expression data from other brain regions and tissues would likely identify novel ASD genes. DAWN can also be used for other complex disorders to identify genes and subnetworks in those disorders.

among hidden variables I in our model to gain power.
The TADA p-values can be converted to Z-scores, Z = (Z 1 , ..., Z n ), to obtain a measure of the evidence of ASD association for each gene. It follows immediately that each of the Zscores under the null hypothesis (I = 0) has a standard normal distribution. We assume that under the alternative (I = 1) the Z-scores approximately follow a shifted normal distribution, with a mean µ and variance σ 2 . Further we assume that the Z-scores are conditionally independent given the hidden indicators I.
We model the clustering of association signals with respect to gene expression networks using a hidden Markov random field (HMRF) model. Based on the co-expression network we have an adjacency matrix Σ. We assume that the distribution of I follows an Ising model. The Ising model implies the following conditional probability model for the nodes in the network: the conditional probability the j'th node is a risk node, given the status of the adjacent nodes, is where N j is the neighborhood of node j based on the adjacency matrix Σ, b models the overall fraction of ASD risk nodes, c > 0 reflects the enhanced connectivity of risk status with respect to co-expression of genes. If c is near 0, it suggests that ASD risk genes are not clustered with respect to co-expression of genes.
1 Each step of DAWN algorithm is summarized by the following:

DAWN Algorithm
Input: an n × n matrix of pairwise correlation of gene expression for n genes, and the TADA P value of each gene.
Output: a risk ASD (rASD) gene list, posterior probabilities and FDR values of each gene.
1. Cluster genes into modules by applying WGCNA (Langfelder and Horvath, 2008) to the input pairwise correlation matrix.
2. Within each module, group tightly correlated genes to supernodes (absolute correlation larger than .75).
3. Within each module, an adjacency matrix Σ connects nodes (and supernodes) with high absolute correlation (average linkage smaller than .3).
4. Assign Z-score derived from the TADA P value to each node. Supernodes are represented by the score associated with the minimum P value of all genes contained in the node.

Parameters estimation for HMRF model
We use the following iterative algorithm to estimate the parameters and the posterior probability of P (I j |Z): 1. Use model based clustering (Fraley & Raftery, 2002) to initialize the hidden states I (0) and parameters µ (0) , σ 2 (0) .
(b) Apply a single cycle of iterative conditional mode (ICM) algorithm to update I.
After we obtain the parameters of the Ising model ( b, c, µ, σ 2 ) for each module, we then calculate the weighted average of µ, σ 2 using the parameter estimates from all modules, with weight proportional to the number of nodes per module. Two parameters ( b and c) remain module specific, and two others ( µ, σ 2 ) are common across modules. We then use the revised set of parameters to update the posterior probability for each node. Nodes which have posterior probability P (I j = 0|Z) less than 0.5 are selected as nASD nodes.

Parameters estimation for the hierarchical model
After we identify risk nodes using the HMRF algorithm, genes are further analyzed to identify risk genes (rASD genes) within the set of risk nodes identified by network analysis (nASD genes). This step is essential for all nASD genes, but it is especially important for genes within supernodes. The posterior probability a supernode is a risk node can be interpreted as the overall level of confidence that at least one gene in the supernode is a risk gene.
Subsequent inference is carried out for each risk node to determine which, if any, members can be labeled as risk genes at a preset level of FDR control. To do so we compute the posterior probability for each gene within a risk node, working across all risk nodes and modules simultaneously. Consider the following hierarchical model: Z jk ∼ P (I jk = 0)N (0, σ 2 0 ) + P (I jk = 1)N (µ, σ 2 1 ), where j = 1, ...M and M is the number of nodes, and k represents gene k in the j-th node.
The parameters can be estimated by applying the EM algorithm iteratively until convergence is achieved: The posterior probability the j'th gene in the k'th supernode is a risk gene is After we obtain P (I jk = 0|Z, θ (T ) j ), the FDR is then computed to identify the significant genes within the risk nodes.
This procedure is not suitable for small supernodes (< 10 genes) and single gene nodes because there is not sufficient information from which to estimate θ j for each small node.
Thus we adjust the algorithm using a two-stage procedure. First, for those nASD genes that fall into large supernodes (≥ 10 genes), the proportion of risk genes in supernode j, θ j , is estimated for each individual supernode, but the mean and standard deviation µ and σ are estimated jointly over all nASD genes in the big supernodes. Thus using (θ j , µ, σ) in Eq. (1) we compute the posterior probability for each gene in each large supernode. Next, we compute θ as the average of the θ j 's estimated from big supernodes. Using (θ, µ, σ) in Eq. (1) we compute the posterior probability for the remaining nASD genes. Finally, let q (s) be the sorted posterior probability in descending order, the Bayesian FDR correction (Müller et al., 2006) of the l'th sorted gene can be calculated as FDR l = l s=1 q (s) /l Genes with FDR less than 0.05 are selected as the rASD genes.