Rare deleterious mutations of the gene EFR3A in autism spectrum disorders

Background Whole-exome sequencing studies in autism spectrum disorder (ASD) have identified de novo mutations in novel candidate genes, including the synaptic gene Eighty-five Requiring 3A (EFR3A). EFR3A is a critical component of a protein complex required for the synthesis of the phosphoinositide PtdIns4P, which has a variety of functions at the neural synapse. We hypothesized that deleterious mutations in EFR3A would be significantly associated with ASD. Methods We conducted a large case/control association study by deep resequencing and analysis of whole-exome data for coding and splice site variants in EFR3A. We determined the potential impact of these variants on protein structure and function by a variety of conservation measures and analysis of the Saccharomyces cerevisiae Efr3 crystal structure. We also analyzed the expression pattern of EFR3A in human brain tissue. Results Rare nonsynonymous mutations in EFR3A were more common among cases (16 / 2,196 = 0.73%) than matched controls (12 / 3,389 = 0.35%) and were statistically more common at conserved nucleotides based on an experiment-wide significance threshold (P = 0.0077, permutation test). Crystal structure analysis revealed that mutations likely to be deleterious were also statistically more common in cases than controls (P = 0.017, Fisher exact test). Furthermore, EFR3A is expressed in cortical neurons, including pyramidal neurons, during human fetal brain development in a pattern consistent with ASD-related genes, and it is strongly co-expressed (P < 2.2 × 10−16, Wilcoxon test) with a module of genes significantly associated with ASD. Conclusions Rare deleterious mutations in EFR3A were found to be associated with ASD using an experiment-wide significance threshold. Synaptic phosphoinositide metabolism has been strongly implicated in syndromic forms of ASD. These data for EFR3A strengthen the evidence for the involvement of this pathway in idiopathic autism.


Background
Autism spectrum disorders (ASDs) are defined by persistent deficits in social communication and social interaction and restricted repetitive patterns of behavior, interests or activities [1]. These syndromes are common in the population, with a prevalence of approximately 1% [2], and demonstrate both considerable phenotypic and extensive genetic heterogeneity [3]. High-throughput sequencing approaches have provided substantial insight into the genomic architecture of ASDs. For example, multiple analyses of whole-exome sequencing data demonstrate an over-representation of de novo, loss-of-function mutations in brain-expressed genes in affected individuals and point to half a dozen new ASD genes [3][4][5][6]. These have been identified based on the clustering of mutations in the same gene in unrelated individuals, providing strong evidence for association [3]. However, a large number of compelling, rare, de novo missense mutations are also found in probands, though a clear threshold for identifying the association of these mutations with ASD is less obvious. Both the rarity of the individual mutations and the small size of current exome discovery cohorts suggest that clarifying which of these de novo mutations point to bona fide ASD genes will require alternative approaches. Largescale, targeted, case/control sequencing as a complement to de novo mutation discovery in ASD is one such strategy.
In a whole-exome analysis of 238 families [3], we identified a single proband carrying two novel de novo missense mutations in synaptic genes, one each in EFR3A (Eighty-five Requiring 3A [NCBI Reference Sequence: NM_015137]) and CASK (Calcium/Calmodulin-dependent Serine/Threonine Kinase [NCBI Reference Sequence: NM_003688]). Both yeast EFR3 [7] and mammalian EFR3A and EFR3B [8] have been linked to the control of phosphoinositide metabolism, a pathway demonstrated to play a role in ASD [9]. CASK is implicated in X-linked intellectual disability [10]. Neither the occurrence of one or several de novo missense mutations in a single affected individual is a statistically significant finding. However, our overall analysis of 599 simplex ASD quartets suggests that approximately 20% of de novo missense mutations in brainexpressed genes found in cases will prove to be true ASD loci, representing an approximately fourfold increase over a brain-expressed gene chosen at random [11]. Given an increased prior probability based on the exome results and the strong biological plausibility of both genes, we conducted a targeted analysis of EFR3A and CASK in large cohorts using both Sanger sequencing and whole-exome data. We found that rare deleterious mutations in EFR3A are associated with ASD using an experiment-wide significance threshold.

Subjects
Initial cases were drawn from the Simons Simplex Collection (SSC). The SSC is an exhaustively characterized ASD family cohort, with the majority of families consisting of a proband, two unaffected parents and an unaffected sibling. The diagnostic methodology used is well described elsewhere [12]. EFR3A and CASK were sequenced in 705 cases of European ancestry (Family Distribution List v13) based on genome-wide genotyping data (see below). Based on a preliminary analysis of the sequence data, mutation screening then focused on EFR3A, which was evaluated in several cohorts for which we had access to DNA or whole-exome sequencing results. Additional cases were drawn from the SSC (n = 452) and via collaboration with the ARRA Autism Sequencing Collaboration (AASC, n = 1,039). All cases were identified as having European ancestry via genome-wide genotyping data. Sample characteristics and diagnostic methodology for the AASC have been described previously [5,13]. For controls, 912 were drawn from the National Institute of Neurological Disorders and Stroke (NINDS) Neurologically Normal Caucasian Control Panel (NDPT020, 079, 082, 084, 090, 093, 094, 095, 096, and 098). This set of adult subjects has a negative personal and family history (first-degree relatives) of neuropsychiatric illness. Additional controls were drawn from the AASC (n = 863) and from ongoing studies of non-neuropsychiatric conditions at our home institution (northern European (NE) controls, n = 1,614). Again, all controls were of confirmed European ancestry. The NE and AASC controls were considered population controls since subjects with potential neuropsychiatric disorders were not excluded. This study only accessed de-identified biospecimens or sequencing data and no protected health information; it received an exemption from human subject research from the Yale Human Research Protection Program.
For ancestry matching, Golden Helix SNP and Variation Suite v7.5.4 (Bozeman, MT, USA) was used in principal component analysis (PCA) of SSC cases, NINDS controls and NE controls using 8,210 SNPs common to all arrays and not in high linkage disequilibrium. Based on visualization of a scree plot (Additional file 1: Figure S1), eigenvalues of the first three principal components, which contributed the greatest amount of variation relative to the other principal components, were plotted against one another (Additional file 2: Figure S2). The interquartile range (IQR) distance around the median of the study population cluster was calculated. A threshold that included all NINDS and NE controls was determined to lie at 5 IQRs from the third quartile, and 54 SSC cases beyond this threshold were excluded as ancestral outliers (Additional file 3: Figure S3). The final cohort sizes were 1,157 SSC cases, 912 NINDS controls and 1,614 NE controls.
AASC cases and controls were genotyped using Illumina microarrays, including 550 K, 610 K and 1 M Bead-Chips, and also filtered to exclude subjects because of: (1) genotyping call rate <95%, (2) discrepancy of genotyping data with recorded sex, and (3) Mendelian inconsistencies or cryptic relatedness. Ancestry matching between AASC cases and controls was conducted using PCA of genotyping data for a subset of SNPs common to all arrays; each case was matched to the nearest control using a greedy algorithm. The final cohort sizes were 1,039 cases and 863 controls, all of European descent.
Sanger sequencing of Simons Simplex Collection cases and NINDS controls PCR primers were designed to flank all coding exons and splice sites of EFR3A and CASK (Additional file 4: Table S1). Then 10 ng lymphoblastoid cell line-derived genomic DNA served as template in a 25 μl PCR containing 1× PreMix D buffer (Epicentre Biotechnologies, Madison, WI, USA), 0.48 μM each forward and reverse primer, and 0.36 μL Taq polymerase/0.072 μL Pyrococcus furiosus (PFU) polymerase. Both enzymes, which were synthesized in house, were used to permit proofreading during PCR and reduce Taq-induced mutations. A Tetrad2 Peltier Thermal Cycler (Bio-Rad, Hercules, CA, USA) was programmed as follows: 95.0°C/5 min; 40 cycles of 95.0°C/30 sec, 60.0°C/30 sec and 72.0°C/60 sec; 72.0°C/ 10 min. PCR products were visualized by agarose gel electrophoresis and sent to Beckman Coulter Genomics (Danvers, MA, USA) or the Yale Keck Biotechnology Resource Laboratory (New Haven, CT, USA) for Sanger sequencing. Chromatograms were aligned and analyzed using Sequencher v4.9 (Gene Codes, Ann Arbor, MI, USA). We obtained a 96% sequencing success rate for both cases and controls. All potential rare (<1% frequency) nonsynonymous variants were confirmed by a second round of PCR and Sanger sequencing in forward and reverse directions, using blood-derived genomic DNA for SSC cases since it was available. Segregation analysis of confirmed variants was performed using blood-derived genomic DNA from all family members, which were only available for SSC cases.

Whole-exome data from northern European controls and ARRA Autism Sequencing Collaboration cases/controls
For the NE controls, we examined whole-exome sequencing data. Greater than 98% of the EFR3A coding and splice site sequence was covered by at least eight independent reads. For variant calling, a minimum read threshold of only one independent read was used to minimize the liability for false negatives. All coding and splice site variants with a SAMtools SNP quality score ≥50 were subjected to confirmation by PCR and Sanger sequencing of whole-genome amplified DNA.
We also examined whole-exome sequencing data for AASC cases and controls (generated by the Broad Institute and Baylor College of Medicine), which were treated as a matched set and subjected to identical quality control and variant calling criteria within each site. All coding and splice site variants were identified after three rounds of filtering the whole-exome data for quality control. Variants were excluded if: (1) they had ≥10% missing calls, (2) they had average coverage <17 for Broad cases/ controls and <12 for Baylor cases/controls, and (3) >50% of minor allele calls had <17 reads or a balance of depth >0.66 for Broad cases/controls and <12 reads or a balance of depth >0.75 for Baylor cases/controls (balance of depth being defined as the number of reference reads divided by the total number of reads). Filtering criteria differed between the two sites since samples were sequenced on different platforms and the data were processed using different software packages (Illumina/GATK at Broad and Solid/AtlasSNP2 at Baylor). AASC case and control variants were not confirmed by Sanger sequencing. However, given that the cohorts are approximately the same size and the entire AASC set was subjected to identical sequencing methods, we anticipated that calling errors would be randomly distributed across affected and unaffected individuals.
To assess the novel singleton status of variants identified in all case and control groups, we queried dbSNP137 and whole-exome sequencing data from an additional 6,503 individuals from release ESP6500 of the Exome Variant Server, comprising 4,300 European-Americans and 2,203 African-Americans.

Analysis by conservation measures
We evaluated conservation at the positions of novel nonsynonymous singleton mutations in EFR3A with three widely used informatics tools: PhyloP (phylogenetic P values), GERP (genomic evolutionary rate profiling) and ConSurf. PhyloP scores [14] were obtained from the UCSC Genome Browser. A PhyloP score ≥1.3 indicates P = 0.05 for conservation and was used as a threshold to determine whether a mutation occurred at a conserved site. GERP scores [15] were obtained from the SeattleSeq annotation pipeline. A GERP score ≥5 was used as a threshold to determine conservation [16]. Regarding Con-Surf analysis, a multiple EFR3 protein sequence alignment was constructed using PSI-Blast, which was then edited to remove partial or redundant sequences and produce a comprehensive sampling of genetic space. Both EFR3A and EFR3B were included to increase the number of sequences available, which totaled 42 (Additional file 5: Figure S4). The alignment was produced with TCoffee and sent to the ConSurf server to quantify conservation [17]. The server normalizes the conservation score for each amino acid such that average positions cluster around zero; the most conserved residues have negative scores, and the least conserved are positive.

Structure-based analysis
The crystal structure of the N-terminal fragment (amino acids 8 to 562) of the Saccharomyces cerevisiae Efr3 was recently determined [18]. Yeast Efr3 and human EFR3A were aligned through amino acid 451, corresponding to the most conserved portion of Efr3 (Additional file 6: Figure S5). We created a homology model and found that secondary structure predictions of the human EFR3A matched well with the observed secondary structure of the yeast protein. Based on the crystal structure, human EFR3A case and control mutations were blindly assessed for their potential to disrupt protein structure and function using the following structural criteria prioritization: first, it was determined whether the mutated residue was located in the protein core or on the surface as shown by the crystal. If the mutation was located in the core, it was then assessed, taking secondary structure into account, for whether a hydrophilic residue would be placed in a hydrophobic environment or whether the mutation changed the residue size, which could result in a defect in packing the core or misfolding. If the mutation was located on the surface of the protein, it was then determined whether that area was well conserved and hence likely to be functionally important. If so, any change in residue charge and/or size was categorized as potentially disruptive as these could affect protein-protein or protein-membrane interactions. To this end we devised a grading scheme, where deleterious variants received a score of 3 or higher. It should be noted, however, that this grading scheme cannot take into account interactions of EFR3A that have not been described to date.

Gene co-expression analysis
Using data from Kang et al. [19], Spearman correlation was performed between expression levels of EFR3A and M12 genes [20] and between EFR3A and all 15,132 genes expressed in the human brain [19]. Of the 432 unique genes in M12, 356 had expression data in the array platform used by Kang et al. [19], and these were used to perform the analysis (Additional file 7: Table S2). These analyses were also performed with ten additional genes: ACTB (a housekeeping gene), CHD8, DYRK1A, EFR3B (a homologue), GRIN2B, KATNAL2, NRXN1, SCN2A, SHANK2 and SHANK3. The median expression correlation coefficients for the 11 genes when compared to M12 and all brain-expressed genes are shown in Additional file 8: Table S3. To show the distribution of the correlation coefficients, kernel density plots were generated using the sm.density.compare function in the sm package in R with the smoothing parameter h = 0.1. The entire process was repeated for an additional three modules: M2, M3 and M16 genes [21].

Statistical analysis
All P values for mutation burden, conservation measures and crystal structure analysis were calculated by the Fisher exact test. We used the right-tailed test based on the hypothesis that there would be a greater number of mutations in cases versus controls and that case mutations would be more deleterious. Since we initially investigated two genes, CASK as well as EFR3A, we performed a Bonferroni correction and multiplied the P value for overall mutation burden by two. The initial de novo mutation F338S identified by whole-exome sequencing was not included in calculations of overall mutation burden between cases and controls but was included when assessing the potential deleteriousness of case versus control mutations.
To study the relative enrichment of variants at conserved positions in cases and controls, we conducted the following analysis. For each novel nonsynonymous singleton variant, we used cutoff values for three conservation measures to annotate whether each variant maps to a conserved position and is, therefore, potentially deleterious: PhyloP ≥ 1.3 (indicates P = 0.05 for conservation), GERP ≥ 5 [16] and ConSurf < 0 (indicates conservation). We also performed a permutation test by first creating an input file (Additional file 9: Table S4) of binary entries, with '1' indicating that the variant met the cutoff and is functional by that conservation measure and '0' indicating that it did not. For each measure, we calculated the proportion of individuals carrying functional variants in the case and control cohorts. We then calculated the ratio of the two proportions as the relative enrichment in cases. We used the largest ratio among the three measures as the test statistic for the observed data. To estimate the statistical significance, we adopted the following permutation procedure. For each of 10,000 permutations, we permuted the case and control labels of the subjects. Based on the case and control groups defined by the permuted labels, we repeated the same relative enrichment ratio calculation and estimated a P value for enrichment of deleterious mutations in cases. The nonparametric Wilcoxon test was used to calculate the P value for the difference in median expression correlation coefficients between EFR3A/M12 and EFR3A/all brain-expressed genes.

Results
To test our hypothesis that mutations in EFR3A and/or CASK confer risk for ASD, we performed Sanger sequencing of all coding exons and splice sites of both genes in 705 comprehensively phenotyped European cases from the SSC. All rare (<1% frequency) nonsynonymous variants were confirmed by a second round of Sanger sequencing. We focused on novel alleles seen only once (singleton variants) and not present in two large databases, dbSNP137 and Exome Variant Server, the latter of which had 6,503 exomes. We reasoned that this strategy would most likely identify deleterious substitutions subject to purifying selection and provide, along with case/control matching for ancestry, the most robust protection against population stratification [22].
In CASK, only two variants met these criteria among all 705 cases (Additional file 11: Table S6). In light of the low cumulative allele frequency and anticipated low power to detect an effect, we did not pursue this gene further. We identified six novel nonsynonymous singleton mutations in EFR3A (Table 1 and Additional file 12:  Table S7) and, consequently, we proceeded to screen this gene in several cohorts for which we had access to DNA or whole-exome sequencing data. We identified an additional 1,491 European cases: (1) 452 from the SSC were subjected to Sanger sequencing and (2) 1,039 from the AASC had whole-exome sequencing data, for a total of 2,196 cases. We identified a total of 3,389 European controls: (1) 912 NINDS neurologically normal European controls matched to SSC cases via PCA of genotyping data and subjected to Sanger sequencing, (2) 1,614 neuropsychiatrically unscreened controls of NE origin matched to SSC cases and who had whole-exome sequencing data, and (3) 863 from the AASC with whole-exome sequencing data. For the NE control exomes, a minimum read  threshold of only one independent read was used to identify variants in an effort to minimize false negatives. For the AASC dataset, Broad cases/controls and Baylor cases/controls were matched and evaluated using identical variant-calling approaches and filtering criteria within each site. All rare nonsynonymous variants in SSC cases, NINDS controls and NE controls were confirmed by Sanger sequencing; confirmations were not available for case and control AASC variants. The analysis of a total of 2,196 cases and 3,389 controls demonstrated that novel nonsynonymous singleton mutations in EFR3A were twice as frequent in cases compared to controls. However, the P value was not statistically significant when corrected for the investigation of two genes since we initially analyzed CASK as well as EFR3A (16/2,196 cases and 12/3,389 controls; P = 0.084, odds ratio = 2.065, 95% confidence interval = 0.924 to 4.652, Fisher exact test, right-tailed). Since the combination of low allele frequency and high conservation has been shown to provide high sensitivity and specificity for predicting functionality in rare variant studies, in contrast to in silico prediction programs [23], we evaluated conservation with three widely used informatics tools: PhyloP, GERP and ConSurf. All found significantly more case variants mapping to conserved positions (Tables 1 and 2). Furthermore, using 10,000 permutations of the cases and controls to test the significance of the enrichment of deleterious mutations in cases, we calculated a P value of 0.0077. We evaluated family data from SSC subjects and found that all newly identified variants were transmitted. Whole-exome data for only two of these subjects (11379. p1 and 11808.p1) have now been reported; neither has a de novo loss-of-function mutation which might contribute to their phenotype. SSC case 11808.p1 does have a novel de novo missense mutation (N160S) in DGCR14 (DiGeorge Syndrome Critical Region Gene 14), which has not been associated with ASD or intellectual disability [24]. We also determined that all of the SSC subjects except one (13507. p1) have undergone genome-wide copy number variant (CNV) analysis; none have de novo CNVs that might better explain their phenotype [25].
We took advantage of the recently available crystal structure (Protein Data Bank ID 4N5A) of the N-terminal fragment of S. cerevisiae Efr3 [18] to map the human mutations ( Figure 1) and determine their potential to disrupt protein structure and function, blinded to case/control status. Because the C-terminal portion is not well conserved, only mutations up to and including amino acid 451 could be evaluated with high confidence. (The full length EFR3A protein has 821 amino acid residues.) Every mutation that was assessed to be deleterious as informed by the crystal structure was a case variant, except R161*, which is assumed to be damaging (Table 1 and Additional file 13: Table S8). R161* was found in an NE control for whom neuropsychiatric information is unavailable, so we cannot determine if this mutation is associated with any neuropsychiatric condition. Thus, crystal structure analysis also identifies a significantly greater number of deleterious mutations in cases than controls (P = 0.017, odds ratio = 9.282, 95% confidence interval = 1.119 to 204.784, Table 2). As would be expected, all of these deleterious mutations are also at highly conserved positions as per PhyloP, GERP and ConSurf. Interestingly, the reverse is not always true, i.e., there are mutations at highly conserved positions which were assessed to be benign in light of the crystal structure. Therefore, having knowledge of the three-dimensional structure of Efr3 enriches our analysis by providing more biological information to evaluate the deleteriousness of mutations. We also noted for subjects with family data, deleterious mutations as per the crystal structure are not shared by the unaffected siblings (Table 1).
EFR3A is a member of the EFR3 family of genes, conserved throughout eukaryotes and essential for viability [7]. The Drosophila melanogaster homologue, rolling blackout (RBO), is highly expressed in the nervous system [26], is enriched at the neural synapse [27], and was proposed to regulate phospholipase C signaling [26]. RBO has also been proposed to function as a transmembrane lipase [26], but structural analysis of Efr3 does not support this hypothesis (Additional file 14: Table S9) [18]. Instead, it shows that EFR3/RBO has a scaffold function with the majority of the protein comprising alpha-helical HEAT (Huntington, Elongation factor 3, regulatory subunit A of protein phosphatase 2A, and Target of rapamycin) repeats.
The tissue expression of EFR3A has not been described, so we performed Western blot analysis of several mouse tissues and found that EFR3A is broadly expressed, including in the brain (Additional file 15: Figure S6). We also analyzed its expression using exon-array data from a study of the spatio-temporal transcriptome of the human brain [19]. There is a steady increase in EFR3A mRNA levels in multiple brain regions through fetal development and into adolescence (Figure 2A). In situ hybridization of adult human dorsolateral prefrontal cortex revealed the presence of EFR3A in cortical neurons including pyramidal neurons ( Figure 2B). This pattern is consistent with prior data on the expression of ASD genes [9,19], as well as functional annotation of genes that are highly co-expressed with ASD genes, showing enrichment for a category associated with the development of cortical projection (pyramidal) neurons [19].
We next identified the top 100 genes co-expressed with EFR3A (Additional file 16: Table S10) using the same dataset [19]. Gene ontology enrichment analysis using the Database for Annotation, Visualization and Integrated Discovery (DAVID v6.7) [28,29] revealed synaptic genes, including SYNJ1, the major PtdIns(4,5)P 2 phosphatase in the brain [30], as the most significant finding ( Figure 2C). We also compared the expression profile of EFR3A with a discrete module of co-expressed genes (M12) significantly associated with ASD in a prior transcriptome analysis of post-mortem autism and control brains [20]. M12 is enriched for genes involved in synaptic function, vesicular transport and neuronal projection and is downregulated in the autistic brain. We compared the distribution of expression correlation coefficients between EFR3A and M12 genes ( Figure 2D) and between EFR3A and all brainexpressed genes [19] ( Figure 2E). We found that the distribution between EFR3A and M12 genes was significantly skewed toward positive correlation coefficients compared to the distribution between EFR3A and all brain-expressed genes (P < 2.2 × 10 −16 , Wilcoxon test). When a similar analysis was performed on the homologue EFR3B (which is largely brain-expressed) and eight genes strongly associated with ASD from recent CNV and exome studies [3][4][5][6]25], EFR3A was the most strongly correlated with M12 expression ( Figure 2D). We repeated this process with three additional modules of co-expressed genes (M2, M3 and M16) identified by a prior analysis of BrainSpan transcriptome data from normal brains [21]. All are significantly associated with ASD candidate genes, although M2 and M3 are enriched for early fetal transcriptional regulators affected by de novo loss-of-function mutations in ASD, while M16, which has significant overlap with M12, is enriched for synaptic genes upregulated during late fetal/early postnatal stages and genes harboring inherited common variants in ASD. As might be expected given its developmental expression pattern and synaptic function, EFR3A is positively correlated with M16 (Additional file 17: Figure S7A; P < 2.2 × 10 −16 , Wilcoxon test) and negatively correlated with M2 and M3 (Additional file 17: Figure S7B,C; P < 2.2 × 10 −16 , Wilcoxon test).

Discussion
Our conservation, structure-based functional and expression analyses suggest a role for rare deleterious EFR3A Figure 1 Ribbon diagram of S. cerevisiae Efr3 crystal structure. Crystallization of Efr3 revealed a series of HEAT repeats, as we had predicted bioinformatically. Alignment of yeast Efr3 and human EFR3A was reliable to amino acid 451. Blinded to case/control status, the human mutations were mapped and analyzed for their potential to disrupt protein structure and function given the three-dimensional crystal structure. Mutations in red are deleterious and found in cases. (R161* in a control and G216Sfs*12 in a case are not shown but presumed to be deleterious.) Mutations in green are benign and found in cases. Mutations in blue are benign and found in controls. mutations in the risk for ASD, adding to the emerging data on specific synaptic functions, including phosphoinositide metabolism, relevant to these disorders. Multiple resequencing projects for ASD have revealed numerous rare variants in both cases and controls. Given the overrepresentation of de novo loss-of-function mutations in cases, it is implausible that a subset of damaging missense mutations does not carry risk as well. However, differentiating relevant functional mutations from the large collection of neutral background variation remains a challenge. We have approached this issue by following up an observation of a de novo mutation in an ASD proband with a relatively large case/control analysis, relying on diverse approaches to identify putatively deleterious variants. While  [20]. (E) Distribution of expression correlation coefficients of EFR3A and ASD genes with all brain-expressed genes (n = 15,132) [19]. The homologue EFR3B is shown for comparison and ACTB, a housekeeping gene, is included as a negative control.
the overall burden of singleton variants was not impressive, the use of multiple conservation measures and crystal structure analysis to segregate functional variation showed consistent evidence for experiment-wide association with ASD.
Our results would clearly not survive correction for genome-wide comparisons. Of course, given the distribution of singleton mutations across the genome, this statistical threshold, if applied to every targeted analysis, would demand implausibly large case/control samples. In an effort to skirt this problem, we used an initial observation in an unbiased exome-wide study to establish a narrow hypothesis and then relied on an experimentwide P value threshold for our case/control analysis. At present, this seems a reasonable approach to evaluating single gene association. Additional data on the distribution of de novo missense mutations in the genome and the integration of ASD risk associated with varying classes of mutations [31] with co-expression network data [11,21] will shed significant light on the contribution of any one gene to ASD.
Our expression data, combined with evidence for the involvement of EFR3A in synaptic phosphoinositide metabolism [8], suggest that EFR3A may play an important role in synaptic function during human fetal brain development. In addition to the significant conservation and structure-based findings, our analysis comparing the expression profile of EFR3A with M12 and M16 further suggests that this gene is associated with ASD. Not only are EFR3A and M12/M16 expression strongly correlated but EFR3A is also the most strongly correlated in the context of ASD-associated genes and its homologue EFR3B. Although co-expression data do not prove that a gene causes a disorder, they can provide another piece of supportive evidence [11,21]. The determination of when in development and in what cell types EFR3A is expressed provides insight into how EFR3A mutations might contribute to the pathophysiology of ASD.
A potential limitation of this study is that we combined data from Sanger sequencing (SSC cases and NINDS controls) and whole-exome sequencing (AASC cases/controls and NE controls). It is possible that the two techniques can yield different sets of variants. As described under Methods, for the NE controls, we determined that >98% of the coding and splice site sequences were covered by at least eight independent reads. To minimize false negatives in controls that might bias toward an excess of rare mutations in cases, a minimum of only one independent read was used to identify variants for confirmation. Regarding the AASC samples, case and control exome data were subjected to identical variant-calling approaches and filtering criteria within each site and were, therefore, treated equally, suggesting that any error should be randomly distributed between these groups.
Another limitation is that we did not find additional de novo mutations in the subjects for whom family DNA was available (only SSC cases), which would strengthen the association of EFR3A mutations with ASD. However, there is abundant evidence that inherited mutations also contribute to ASD [31]. The presence of SSC case mutations in unaffected parents and/or siblings points to incomplete penetrance, as expected in complex genetic disorders such as ASD. The crystal structure analysis was able to stratify the mutations further by determining that potentially deleterious variants were generally not shared by siblings. Although this is an interesting observation, it is based on a very small number of events (five SSC case mutations with both sibling data and crystal structure information) and cannot be assigned statistical significance. We did observe one premature stop codon mutation in an NE control as well as in an SSC case, indicating that EFR3A mutations are not sufficient to cause ASD. However, neuropsychiatric data was not available for this control cohort. Moreover, the identification of well-established ASD-associated variants in unscreened controls is so commonplace as to be expected in a study such as this one.
EFR3A is a critical component of a complex containing a phosphatidylinositol 4-kinase that synthesizes the plasma membrane pool of the phosphoinositide PtdIns4P, the direct precursor of PtdIns(4,5)P 2 [8]. PtdIns(4,5)P 2 has a wide variety of direct functions in the central nervous system, including regulation of exo/endocytosis, ion channel function, neurotransmitter receptors, and transporters and nucleation of the actin cytoskeleton [32,33]. Additionally, PtdIns(4,5)P 2 is a precursor to numerous signaling metabolites: diacylglycerol and InsP 3 (via phospholipase C activity), which are key regulators of Ca 2+ signaling, and PtdIns(3,4,5)P 3 (via PI 3-kinase activity), which mediates many cellular processes such as activation of the Akt/ mTOR signaling pathway [30]. Mutations in PTEN, which encodes a PtdIns(3,4,5)P 3 phosphatase, and in TSC1 and TSC2, which are key effectors in the PtdIns(3,4,5)P 3 signaling pathway, have demonstrated the importance of synaptic phosphoinositide signaling in syndromic forms of autism ( Figure 3) [9,34,35]. Common polymorphisms and rare CNVs in MET, another gene involved in phosphoinositide metabolism, have implicated this pathway in idiopathic ASD as well [36,37].
The identification of rare deleterious mutations in EFR3A, a gene linked to PtdIns4P synthesis (Figure 3), further strengthens the role of phosphoinositide metabolism in ASD. The precise effects of EFR3A on the levels of various phosphoinositides still have to be determined, and an EFR3A knock-out mouse is not yet available. Delineating the molecular details and functional significance of interactions between EFR3A and its binding partners will allow the development of in vitro assays to assess further the severity of the variants we report here. Importantly, phosphoinositide metabolizing enzymes are pharmacologically targetable [38][39][40]. The applicability of this approach towards ASD has been shown for the closely connected mTOR pathway in mouse models of tuberous sclerosis [41,42]. Therefore, mutations in EFR3A and perturbations in phosphoinositide metabolism may point to a potential avenue for treatment in a subset of ASD patients.

Conclusions
Rare nonsynonymous mutations in EFR3A are significantly more common among ASD cases than controls at positions that are conserved and positions that would be disruptive to protein structure and function based on analysis of the Efr3 crystal structure. These results further implicate phosphoinositide metabolism in the pathophysiology of ASD, a pathway that is pharmacologically targetable. Exactly how EFR3A mutations contribute to that pathophysiology will have to await further delineation of how the protein functions and the development of specific assays to test their severity. detecting and scoring HEAT repeats was also used [6]. Using this technique, three HEAT repeats were detected with an E value less than 50, the benchmark for significance. Additionally, six HEAT repeats were detected by the REP server [7].