Inference of the epigenomic communication network

ChIP-Seq, MeDIP and GLIB data processing

We downloaded sra files from 139 Chromatin Immunoprecipitation Sequencing (ChIP-Seq), Methylated DNA immunoprecipitation (MeDIP) and GLIB (glucosylation, periodate oxidation and biotinylation) experiments described in Supplementary Table 1 at Juan et al. This collection includes 3 types of cytosine modifications (5mC, 5hmC and 5fC), 13 histone marks (H2Aub1, H2AZ, H3K4me1, H3K4me2, H3K4me3, H3K9ac, H3K9me3, H3K27ac, H3K27me3, H3K36me2, H3K36me3, H3K79me2, H4K20me3) and 61 different Chromatin related Proteins (CrPs). CrPs include structural proteins, elements of the machinery involved in cytosine and histone modification, transcription factors (TFs, such as the stemness-related TFs NANOG, OCT4 and SOX2), and four different post-translational modifications of RNA polymerase II (RNAPolII: S2P, S5P, S7P and 8WG16 - unmodified) with binding data available for ESCs. The sra files were transformed into fastq files with the sra-toolkit (v2.1.12) and aligned to the reference mm9/NCBI37 genome with bwa v0.5.9-r16 allowing 0-1 mismatches. Unique reads were converted to BED format.

Genome segmentation

The input information used to segment the genome into different chromatin states was that derived from the 3 cytosine modifications, the 13 histone marks and the insulator protein CTCF - which has been previously shown to define a particular chromatin state per se (Ernst & Kellis, 2010). A multivariate Hidden Markov Model (HMM) was employed that uses two types of information: the frequency with which different combinations of chromatin marks are found with each other, and the frequency with which different chromatin states are spatially related in the genome. To apply this method we used the ChromHmm software (Ernst & Kellis, 2012: v1.03). The input data to generate the model were the ChIP-Seq, MeDIP and GLIB bed files containing the genomic coordinates and strand orientation of the mapped sequences (see above). First, the genome was divided in 200 bp non-overlapping segments which were independently assigned a value of 1 or 0 in function of the presence or absence of histone marks, respectively, based on the count of the tags mapping to the segment and on a Poisson background model (Ernst & Kellis, 2012) using a threshold of 10-4. After establishing a binary distribution for each mark, we trained the HMM model using a fixed number of randomly-initialized hidden states that varied from 20 to 33 states. We focused on a 20-state model that provided sufficient resolution to resolve biologically meaningful chromatin patterns. We used this model to compute the probability that each location is in a given chromatin state and we then assigned each 200 bp segment to its most likely state. Only, intervals with a probability higher than 0.95 were considered for further analysis.

Read counts and pre-processing for the co-location network inference

We used the ChromHMM segments with a probability higher than 0.95 as samples for the network inference. We filtered all bins for each state that were unexpectedly large (the upper 1% for each state) because they might produce outliers in the data and it is hard to justify where the signal occurs within the region. We counted the overlapping ChIP-Seq reads for the resulting segments using Rsamtools, although some of the ChIP-experiments had to be excluded from the network inference due to the low number of reads per bin, or the low number of bins with signal or study dependent artifacts, including: CTCF_GSE11431, NANOG_GSE11431, LAMIN1B and H3K27me3_GSE36114, SMAD1_GSE11431, MBD1A_GSE39610, MBD1B_GSE39610, MBD2A_GSE39610, MBD3A_GSE39610, MBD4_GSE39610, and MECP2_GSE39610 (as MBD2A was not used, the MBD2 co-localization data corresponds to MBD2T).

Co-location network inference

We used the ChromHMM segments with a probability higher than 0.95 as samples for the network inference. We applied the method described in (Perner et al, 2014) that aims to unravel the direct interactions between factors that cannot be "explained" by the other observed factors and thus, this is a more specific approach than an analysis of simple pairwise correlations. Consequently, the more complete the number of factors included in the analysis, the higher the certainty that inferred direct interactions correspond to actual co-dependences. We inferred an interaction network for each chromHMM state. Briefly, an Elastic Net was trained in a 10-fold Cross-validation to predict the HM/CTCF/DNA methylation of the CrPs or to predict each CrP from all other CrPs. Furthermore, the sparse partial correlation network (SPCN) was obtained using all the samples available. We selected the interactions between Histone marks/cytosine modifications and CrPs that obtained a high coefficient (w >= 2*sd(all_w)) in the Elastic Net prediction and that have a non-zero partial correlation coefficient in the SPCN.

Functional Modularization of the Co-localization Network

The co-localization network was decomposed into local networks of positive interactions. First, we calculated the frequency of each positive interaction for every chromatin state using ChromHMM peaks, considering that an interaction is present if both interactors are "present" in the same 200 bp genomic window. We calculated this frequency for each of the 20 chromatin states, such that we have a vector of 20 frequencies for each positive interaction. In order to reduce state-specific biases, the frequencies of the interactions were standardized separately for every state. These vectors were clustered by hierarchical clustering (Pearson correlation, average linkage) and the largest statistically supported clusters (p-value < 0.05, n=10,000) according to Pvclust (Suzuki & Shimodaira, 2006) were defined as chromnets.

References

Ernst J & Kellis M (2010) Discovery and characterization of chromatin states for systematic annotation of the human genome. Nat. Biotechnol. 28: 817-825
Ernst J & Kellis M (2012) ChromHMM: automating chromatin-state discovery and characterization. Nat. Methods 9: 215-216
Juan D, Perner J., Carrillo de Santa Pau E., Marsili S., Ochoa D., Ho-Ryun Chung, Vingron M., Rico D., Valencia A. Epigenomic co-localization and co-evolution reveal a key role for 5hmC as a communication hub in the chromatin network of ESCs. [Cell Reports]
Perner J, Lasserre J, Kinkley S, Vingron M & Chung H-R (2014) Inference of interactions between chromatin modifiers and histone modifications: from ChIP-Seq data to chromatin-signaling. Nucleic Acids Res. 42: 13689-13695
Suzuki R & Shimodaira H (2006) Pvclust: an R package for assessing the uncertainty in hierarchical clustering. Bioinforma. Oxf. Engl. 22:1540-1542