Samples
Samples were acquired through the Autism Tissue Program (which has since joined with the Autism Brain Net, https://autismbrainnet.org/). Post-mortem, frozen brain samples from the cerebral cortex Brodmann area (BA) 19 were collected at two different brain banks: the Harvard Brain Tissue Resource Center and the NICHD Brain and Tissue Bank at the University of Maryland with written informed consent having been obtained from next-of-kin or a legal guardian. Work herein was both approved by the IRB of The Johns Hopkins Hospital and University of Alabama at Birmingham and conducted in accordance with institutional guidelines.
RRBS library preparation
Seventy-one samples were prepared for reduced representation bisulfite sequencing (RRBS). RRBS libraries were prepared using 100 ng of genomic DNA (gDNA). gDNA was first digested with MspI making cuts exclusively at methylated cytosines. 3′ A-overhangs were created and filled in with Klenow Fragments. DNA was then purified using the Qiagen MinElute Kit. Methylated ilAdap PE adapters (Illumina) were ligated to purified gDNA. Fragment size selection (105–185 bp) was carried out by gel extraction on a 2.5% NuSieve GTG agarose gel (Lonza). DNA was purified using Qiaquick Gel extraction Kit eluting DNA in elution bugger pre-warmed to 55 °C. Bisulfite treatment was performed using the ZymoResearch EZ DNA Methylation Gold Kit following the manufacturer’s instructions; however, we eluted with 20 μl M-Elution buffer. Bisulfite-treated DNA was cleaned up using EpiTect spin columns. Samples were PCR amplified (using the following primers: AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC*T and CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATC*T; * = phosphorothioate bond), and size selection was carried out on a 3% Metaphor agarose gel to ensure that fragments of the correct size (175–275 bp) were amplified. PCR product was cleaned up using the Qiagen minElute column, eluting with elution buffer warmed to 55 °C. Each sample (10 nM) was sequenced in a single lane on the Illumina HiSeq2000 to produce 50 bp single end reads.
Alignment
Adaptor sequences were removed, and reads shorter than 20 bp were excluded using Trim Galore (v0.2.8). Remaining reads were aligned to hg19 using Bismark (v0.7.7) [10] allowing for one mismatch and setting the seed substring length to 24.
Methylation estimation
Two separate analyses were carried out based on cytosine context; one for cytosines in the CpG context and a separate analysis for all other cytosines in the genome (CpH, where H = A, C, or T). Samfiles for every sample and each of the two contexts were formatted for input into the R package methylKit [11] (v0.9.5) using in-house scripts. Reads were filtered in methylKit based on read count discarding bases with coverage below 10× as well as those with coverage greater than the 99.9th percentile of coverage in each sample to remove reads suffering from PCR bias. Data were normalized based on median coverage and methylation percentage estimated using “normalizeCoverage” and “percMethylation,” respectively within methylKit.
Illumina 27K methylation array
To independently verify methylation estimates from RRBS, CpG methylation was also analyzed in 71 cortical brain samples using the HumanMethylation27 BeadChip. These samples comprised 41 controls and 30 autism cases. Data were generated as described previously [12]. Normalized β-values were used for analysis. For comparison to RRBS data, mean methylation was quantified for the 1249 CpGs that directly overlap between the two platforms.
Sample outlier removal
Four samples were excluded from analysis upon initial diagnostics as their profiles indicated failed library preparation or failed sequencing. In comparison to expected distributions (see Additional file 2: Figure S1a), two were removed due to technical failures, as nearly all (>99%) of their cytosines were methylated after alignment and methylation estimation (see Additional file 2: Figure S1b, c). A third sample was removed because its CpG methylation percentage distribution was not bimodal (see Additional file 2: Figure S1d). The fourth sample was removed because its read coverage distribution did not match the expected distribution (see Additional file 2: Figure S1e).
After identifying samples that failed library preparation and/or sequencing, remaining sample outliers were identified using surrogate variable analysis (SVA) [13], in a manner similar to how outliers have been identified for removal in previous publications [4, 14]. Ten surrogate variables (SVs) were generated using methylation estimates from CpG sites with data across all samples (254,824 CpGs). Samples demonstrating global altered patterns in methylation were identified as any sample that was greater than 4 standard deviations away from the mean in any of the SVs generated. Global outlier samples were removed from analysis. This process was carried out iteratively, and after each round of sample outlier removal, to independently ensure that samples identified as outliers should in fact be removed from analysis; the percentage of known brain meQTLs [15, 16] detected was quantified using a method previously developed for RNA-Sequencing data [14] to guide data analysis. After each round of sample outlier removal, cis meQTLs (1 Mb) were detected at SNPs and CpGs present in both the previously reported meQTL studies and the brain data using high-quality genotype data described previously for these samples [14]. meQTLs were detected using MatrixEQTL [17] with age, sex, site, and SVs included as covariates, and the percentage of known meQTLs was recorded.
Single-site differential methylation analysis
To ensure that a single sample’s outlier methylation estimation would not inaccurately inflate the number of sites identified as differentially methylated, methylation outliers at each single site were first identified, as previously described [14]. Briefly, at each tested site, sample outliers were defined as any sample greater than 3 standard deviations away from the mean methylation at that site. These samples were identified at each site and removed from analysis. Further, only variant sites were included for analysis to minimize the multiple testing burden. Accordingly, the 25% least variable sites were excluded from analysis. Single-site differential methylation was then carried out on each site using the lmFit function in the limma R package [18]. For all cytosines, case-control status was regressed on methylation percentage with age, sex, brain bank, and ten SVs included as covariates (full model). To account for unwanted sources of variation, such as cell type proportion differences or other technical sequencing artifacts, ten SVs were generated using methylation data from all variant sites with data across all samples utilizing the irw method from the sva package and protecting case-control status. Additionally, as read coverage impacts our confidence in methylation estimates, the log10 of read coverage at each site was included as weights in the model.
Statistical significance was determined by residual bootstrapping, again using limma. For each bootstrap, the full model (described above) was fit and residuals recorded. A null model, in which the variable of interest (here, case-control status) was excluded, was also fit. The residuals from the full model were resampled with replacement, randomizing the sample order. “Pseudonull” data were then generating adjusting the fits from the null model with the resampled residuals from the full model. These pseudonull methylation values were then substituted as the outcome variable into the full model, generating a null set of p values. These p values were collected for each of the 1000 bootstraps to empirically determine study-wide significance.
Differentially methylated region analysis
Differentially methylated region (DMR) analysis combines nearby sites for analysis and thus benefits from denser methylation data. As such, coverage requirements were relaxed to include sites with at least five reads across 20 cases and 20 controls in both the CpG and CpH data. 1,638,398 CpG and 6,382,340 CpH sites were included for analysis. As above, age, sex, brain bank, and ten SVs were included as covariates. DMRs were then detected using the bumphunterEngine within the R package bumphunter (v1.14.0) [19, 20]. Default values were used aside from the following: (1) pickCutoff was set to “TRUE” as to not unnecessarily impose an arbitrary cutoff on the data, (2) the maxgap was set to “300,” in line with previous analyses [19], (3) smoothing was used (TRUE) as it is known that methylation sites tend to be correlated across 300 bp (maxgap), (4) nullmethod was set to “bootstrap” as is required when correcting for covariates, as is the case here, and (5) 1000 bootstraps were carried out (B = 1000). Significance was determined by calculating the family-wise error rate (fwer), the proportion of the 1000 residual boostraps with at least one null candidate region more extreme (defined by having a length longer and higher average value of the regression coefficients) than the region observed. Due to the number of sites included for analysis, chromosomes were analyzed independently, necessitating a multiple test correcting for 24 independent tests (22 autosomes, plus X and Y chromosomes). Significance was determined to be fwer <0.002 (0.05/24).
Overlap with previous findings
We also tested for altered methylation patterns within the four genomic regions identified as genome-wide differentially methylated in Ladd-Acosta et al. [21] Here, we note specifically that the technology and brain regions are not directly comparable between the studies. This analysis includes RRBS data on 63 individual samples from a single cortical brain region (BA19). Ladd-Acosta et al. studied 41 total samples across three brain regions (temporal cortex, N = 16; prefrontal cortex, N = 12; cerebellum (N = 13) using the Illuina Infinium 450k array, identifying three regions in the temporal cortex samples and one other region in the cerebellar samples to be genome-wide differentially methylated. While one would not expect perfectly overlapping coverage between the data sets given the altered technology, it is possible to query methylation patterns within our samples at the regions reported in Ladd-Acosta et al. To do so, we utilized the DMR data set (at least five reads across 20 cases and 20 controls) to visualize methylation patterns within the regions reported. Methylation patterns were visualized by case-control status, and overlap with significant DMRs was queried.
Global altered methylation analysis
For each cytosine context, the proportion of sites hypermethylated (defined as mean methylation in cases greater than zero) was calculated at three p value cutoffs (0.05, 5 × 10−3, and 5 × 10−4). To assign significance, this proportion was then compared to the proportion of sites hypermethylated in each of the 1000 residual bootstraps (see Additional file 2: Figure S2). Specifically, to conclude that there was in fact significant global methylation differences between cases and controls at p < 0.05, the signal for global alteration differences in the case-control analysis would have to be more extreme than the signal detected in 95% of the residual bootstraps.
Lists of functional genomic categories
Lists for 28 different functional genomic categories to test for enrichment of hypermethylated cytosines within the CpH context were downloaded from four different sources: (1) the UCSC Genome Browser (mRNA, transcription factor binding sites (tfbs), DNase I hypersensitive sites (dnase), enhancers, CTCF binding sites (CTCF), segmental duplications (segdups), repetitive regions (repeats), and histone marks from lymphoblastoid cell line GM12878 (H3K4m1, H3K4m2, H3K4m3, H3K9Ac, H3K9m3, H3K27Ac, H3K27m3, H3K36m3, H3K79m2, and H4K20m1), (2) UCL Cancer Institute (beacons), (3) the methylKit package [11] (promoters, exons, introns, transcription start sites (TSS), CpG islands (CGI), and CGI shores), and (4) the Epigenome Roadmap Project [22] (H327me3.brain, H3K9me3.brain, H3K36me3.brain, H3K4me1.brain, H3K9ac.brain, and H3K4me3.brain) (details in see Additional file 1: Table S1). Brain data from the Epigenome Roadmap project were downloaded from adult cingulate gyrus. For histone marks with data generated on more than one individual (H3K36me3.brain, H3K4me1.brain, H3K4me3.brain, and H3K9me3.brain), the intersection of regions across individuals was utilized for downstream analyses. For beacons, the 200 bp flanking the identified CpG dinucleotide were investigated for enrichment.
Functional enrichment testing
To test for genomic enrichment of hypermethylated CpH sites in each genomic list and at each p value cutoff from the differential methylation analysis (0.05, 5 × 10−3, and 5 × 10−4), two-sided Fisher’s exact 2 × 2 test was carried out. For each list and at each differential methylation p value cutoff, odds ratios and p values for enrichment were recorded.
Power calculation
Power calculations were carried out using the “pwr.t2n.test” function from the pwr package in R. This two-sided t test of means for samples of different sizes (N = 34 controls and 29 cases) was carried out at the 0.05 significance level (Type I error probability).