Introduction
Colorectal cancer (CRC) is one of the most frequent cancers worldwide. A global incidence of more than one million new cases is reported worldwide each year [1]. CRC ranks as the third most common cancer in incidence and cancer-related death in the United States. In Europe, this disease is the third and second most commonly diagnosed cancer in men and women, respectively, and ranks second in both sexes in cancer-related mortality [2]. Therefore, more efficient preventative and therapeutic interventions of this disease are pivotal to the battle against this major public health threat.
Advances in early diagnosis and treatment modalities have reduced mortality rate over the past two decades [3]. However, depending on the tumor stage, patient survival varies considerably among CRC patients, thereby posing challenges to targeted treatment and prognosis. According to the SEER database (1973-2014, 2017 release), the 5-year relative survival rates are approximately 90% for CRC patients with tumors of stages I, IIA, and IIIA, but decreases to approximately 65% for stage IIB and IIIB patients, and further to 53% for stage IIIC. Furthermore, for metastatic, stage IV patients, the 5-year survival rate shows a drastic fall to 11% [4].
Most CRC cases (approx. 70%) occur sporadically as a result of genetic and epigenetic modifications of a number of genes, including prominent oncogenes and tumor suppressors such as APC, TP53, and KRAS, as well as genes involved in DNA mismatch repair [5-9]. Moreover, a panel of genetic (e.g. gene expression) and epigenetic (e.g. DNA methylation) aberrations have been associated with patient survival [5, 10-12]. For instance, overexpression of immune checkpoint regulator PD-L2 was recently reported to be associated with poor survival [13]. Additionally, a study of primary tumors found expression of several epithelial (claudin-1 and E-cadherin) and mesenchymal markers (snail-1 and vimentin) to be independent predictors of recurrence of stage II CRC [14].
In addition to protein-coding genes (PCGs), the significance of microRNAs (miRNAs) in CRC has been increasingly highlighted [15, 16]. miRNAs are a class of small, non-coding RNA molecules that exert their functions through binding to the 3’ untranslated region of target mRNAs. This in turn results in degradation of or translation repression of the bound mRNAs and therefore downregulation of the corresponding genes [17]. Using transcriptome profiling techniques, a flurry of studies has proposed miRNAs as prognostic indicators for CRC patients. For instance, serum miR-200c level was found to be a better predictor of reoccurrence than pathological staging for stage II and III patients [18]. Hansen et al. [19] also reported significant association between miR-126 upregulation and a longer progression-free survival.
Besides genetic features, epigenetic modifications such as DNA methylation are drawing increasing interest in CRC research. Observed predominantly at CpG dinucleotides in repetitive sequences, DNA methylation usually occurs early in carcinogenesis [20]. Through characterizing methylomes, studies have also proposed a number of methylation sites as potential prognosis factors. Methylation of genes p14ARF, RASSF1A and APC1A were suggested as unfavorable prognostic factors independent of tumor stage and differentiation [21]. Hypermethylation of p16 was also proposed as a predictive factor for poor prognosis [22]. Inversely, hypomethylation of genes such as insulin growth factor II (IGF2) has also been associated with poor prognosis [23].
Together, these studies have provided multiple lines of evidence to pursue to establish a robust panel of predictive factors for CRC prognosis. However, a distinct limitation is that most investigations were conducted without taking into consideration the network formed by aberrations at different levels of regulation. For colorectal cancer, a multi-faceted disease during which dysregulations abound at the genome, transcriptome, and proteome levels, a network-based, systems biology approach can lend more comprehensive insights into diagnosis estimation.
In this study, we applied a battery of network-based analyses to identifying potential prognostic factors in CRC. Using a TCGA dataset of 379 CRC patients, we have found significant positive or negative associations between patient survival and the levels of six miRNAs, 11 mRNAs, and nine methylated DNA species. Furthermore, we proposed a prognostic index (PI) as an integrated measure of all these factors, and validated its efficiency in differentiating between groups of different lengths of survival. Interestingly, combination of PI and tumor stage resulted in more precision in distinguishing overall survival between different patient subgroups, suggesting PI as a promising indicator to be incorporated into prognosis estimation for colorectal cancer patients.
Materials and Methods
Data acquisition and preprocessing
A dataset containing information of 379 CRC patients was downloaded from the TCGA (The Cancer Genome Atlas) database, including expression levels of 1, 881 mRNAs and 60, 484 miRNAs, genome-wide level 3 data of 295 specimens with 482, 421 DNA methylation sites (Illumina methylation 450), and survival. Afterwards, a filter was applied and removed genes whose transcript (mRNA or miRNA) or DNA methylation level was absent in more than 50% of all samples.
Statistical analysis
All statistical analyses were performed with R, a statistical software program, and related packages survival, limma, glmnet. The effect of age, gender, pathological stage and TNM stage on survival rate was assessed by Cox regression, respectively, and considered statistically significant when p ≤ 0.05. Kaplan-Meier survival analysis was used to plot the survival curve. Additionally, log-rank test was used to evaluate the difference of survival curves between different groups, where p ≤ 0.05 is considered statistically significant.
Screening for genes and DNA methylation sites closely associated to patient survival
Cox regression analysis was used to evaluate the association of the level of each gene and methylated DNA species with survival length. Due to the large number of genes and methylation sites included in the dataset, we used an analytical strategy that combines univariate Cox regression with Lasso-Cox (Least Absolute Shrinkage and Selection Operator-Cox) to identify survival-associated factors. First, univariate Cox regression was applied to each gene and methylation site to evaluate its effect on patient survival, followed by a filter which retains those with p ≤ 0.1 for subsequent analysis. Subsequently, Lasso-Cox was performed to further screen and shrink the data. Afterwards, association of gene expression and methylation level with patient survival was further analyzed with multivariate Cox regression, which adjusts for other patient characteristics (e.g. age and sex), as well as medical factors (e.g. pathological stage).
Screening for potential miRNA target genes
Next, for miRNAs whose expression levels showed significant association with patient survival, we retrieved their potential target genes from miRTarBase (the experimentally validated microRNA-target interactions database, release 7.0). miRTarBase is a manually collected, experimentally validated, and actively used database of miRNA targets. An miRNA-target co-expression network was then constructed. High-confidence (p ≤ 0.05) miRNA targets were identified and subjected to Lasso-Cox regression to screen for high-confidence targets. Finally, an miRNA-target interaction network was constructed by Cytoscape (version 3.5.1).
Construction of gene regulatory network and identification of prognostic factors
After identifying survival-associated mRNAs, miRNAs and DNA methylation sites, we constructed an interaction network among these features using Ingenuity Pathway Analysis (www.ingenuity.com). Univariate Cox regression was applied to further screen for significant nodes (p ≤ 0.1) in the network, in order to detect potentially important mRNAs that may have been missed in previous analyses. Subsequently, effect of each candidate on patient survival was evaluated with multivariate Cox regression, and a final list of candidate prognostic factors was established by retaining those with p < 0.05.
Validation of the prognostic value of the candidate prognostic factors
To evaluate the prognostic value of the candidate prognostic factors obtained in the last section, a prognostic index (PI) was proposed as an integrated indicator. In brief, PI was calculated as a weighted sum of expression levels of the transcripts (mRNA/miRNA) and methylated species present in a given sample (denoted as x in the following equation), using the Cox regression coefficient β as the corresponding weights. Therefore, for specimen j, PI was calculated as follows:
A greater value of PI is indicative of worse prognosis. Base on the corresponding PI value, patients were divided into high-risk (PI > 0) and low-risk groups (PI < 0). Afterwards, Kaplan-Meier survival analysis and log-rank test were used to evaluate the differences in patient survival between the two groups.
Results
Patient information
Of the 379 CRC patients, 206 (54.35% in all patients) were male and 173 (45.65%) female. The average age was 63.7 ± 11.7 years. Locations of the primary tumors encompassed a wide range, include the sigmoid colon (98 cases, 25.86% in all patients), ascending colon (81, 21.37%), cecum (68, 17.94%), transverse colon (16, 4.22%), descending colon (15, 3.95%), colon (10, 2.64%), and splenic flexure (3 cases, 0.79%).
Histologically, the majority of the cases were categorized as adenoma (M8140/3/1) or mucinous adenoma (M8480/3/1), accounting for 325 (85.75%) and 50 (13.19%) cases respectively. In terms of pathological stages, 62 cases (16.35%) were determined at stage I, 142 cases at stage II (37.47%), 110 cases at stage III (29.02%), and 55 cases of stage IV (14.51%). The average overall survival (OS) was 902 days (approx. 30 months), during which 77 patients passed away after follow-up.
We first examined the association between OS and each individual patient characteristics (e.g. age and sex) and medical factors (tumor histology, pathological stage, and TNM stage, etc.). Univariate Cox regression analysis revealed that pathological stage and TNM stage were most significantly associated with OS (Table 1). Additionally, Kaplan-Meier survival analysis and log-rank test showed that there was a significant difference in OS among patients with different tumors of pathological stages (Fig. 1D), but not between different age (> 65 and ≤ 65 years old) or sex groups (Fig. 1B & 1C). In particular, patients with specimens at stages III and IV showed significant decreases in OS (Fig. 1D), whereas specimens at stages I and II did not. A second analysis with the multivariate Cox regression model was further performed, in order to examine each factor after adjusting for all others, and validated pathological stage as the most significant survival-association factor (Table 1).
Fig. 1.
Overall survival was compared among patient groups with different age (B), sex (C), and pathological stage (D). While groups of different ages and sexes did not show significant difference, patients with tumors at stages III and IV showed significantly different OS with those with tumors at stages I and II. Dashed lines indicate 90% CI.
Screening miRNAs, mRNAs and DNA methylation sites for potential prognostic factors
From 1, 181 miRNAs, 60, 484 mRNAs and 394, 353 methylation sites, a combination of univariate Cox and Lasso-Cox analyses identified a series of factors closely associated with patient survival, including 18 miRNAs, 11 mRNAs and 12 methylation sites (Tables 2 and 3). These features were further validated with a multivariate Cox regression analysis. Among these features, worse prognosis was shown to be significantly associated with high expression of hsa-mir-503, ENSG00000228126, ENSG00000244242, and ENSG00000249001 (Table 2), as well as hypermethylation of cg00199007, cg04138846, cg06328127, and cg10248231 (Table 3). A better prognosis, on the other hand, showed strong association with elevated expression of hsa-mir-4442, hsa-mir-552, and hsa-mir-6854 (Table 2), as well as hypermethylation of cg00147009, cg03895094, cg07206725, and cg10584300 (Table 3).
Table 2.
miRNAs and mRNA whose expression levels showed significant association with overall survival. HR: Hazard Ratio; CI: 95.0% confidence interval; SE: standard errors of Coefficient; z value: Wald z-statistic value
Table 3.
DNA methylation sties whose methylation levels showed significant association with overall survival. HR: Hazard Ratio; CI: 95.0% confidence interval; SE: standard errors of Coefficient; z value: Wald z-statistic value
Next, we set to identify high-confidence targets for the survival-associated miRNAs in Table 2. First, a list of 1, 749 target genes for these miRNAs were retrieved from the miRTarBase database. Then, a Lasso-Cox strategy was used to further select 38 target genes showing significant co-expression with the corresponding miRNAs in our TCGA dataset. A co-expression network of these miRNAs and their potential targets was constructed, visualizing the strong, diverse correlation between the expression levels of each miRNA and its targets (Fig. 2).
Fig. 2.
A co-expression network of survival-associated miRNAs and potential target genes. Yellow boxes indicate miRNAs; orange boxes target genes. Solid lines indicate significantly co-expressed miRNA-target pairs (p ≤ 0.05).
Identification of candidate prognostic factors
In the next step, a regulatory network was constructed, showing regulations reported thus far between the miRNAs and PCGs from Table 2, as well as the 38 high-confidence miRNA targets identified in the last section. As shown in Fig. 3, the resulting network is one of considerable connectivity, suggesting close regulations between the high-confidence miRNA targets and the survival-associated miRNAs and PCGs. In particular, the highly connected nodes have been established to play roles in a plethora of cellular processes, including carcinogenesis, tumor proliferation, invasion, metastasis, and apoptosis (Table 4). Among others, mir-145, mir-26, and mir-34 have been implicated in adenocarcinoma growth, EPK2, mir-145 and mir-26 in tumor cell migration. In addition, CDK5R1, EPHA2, FOSL1, mir-328, mir-34, mir-503, and TGFB2 have been proposed roles in CRC metastasis. The high relevance of these survival-associated genes and miRNA targets in cancer biology suggest the power of the network-based analyses employed in this study in identifying essential factors related to patient prognosis.
Fig. 3.
A gene regulatory network of survival-associated miRNAs and PCGs identified in this study, including PCGs and miRNAs showing significant association with patient survival (Table 2), and high-confidence target genes of these miRNAs (Table 5). Molecules are represented as nodes, and the biological relationship between two nodes is represented as an edge (line). Nodes are displayed using various shapes that represent the functional class of the gene product.
As a final round of identification and validation, all the nodes in this regulatory network was subjected to a multivariate Cox regression analysis, in order to examine their association with patient survival. Compared with the previous multivariate Cox regression, which validated the 18 miRNAs and seven mRNAs in Table 2, this analysis also examined the association between patient survival and the 38 high-confidence miRNA targets. As miRNAs can exert their functions through downregulating the mRNA levels of their targets, addition of these 38 targets allowed the examination potential relevant factors that could have been missed in previous screenings due to a p value slightly higher than the cut-off threshold. Also, the new multivariate analysis took into account mRNA levels of these 38 targets in correcting the significance level of each survival-associated genes and methylated sites, thereby allowing screening for significant prognostic factors from a densely connected functional network.
This analysis yielded a final list consisting of 6 miRNAs, 11 mRNA (four from the 38 miRNA targets) and 9 methylation sites (Table 5), which were considered as candidate prognostic factors in colorectal cancer.
Table 5.
Candidate prognostic factors identified in this study, including 6 mRNAs, 11 mRNAs, and 9 methylation sites. HR: Hazard Ratio; CI: 95.0% confidence interval; SE: standard errors of Coefficient; z value: Wald z-statistic value
Validation of the candidate prognostic factors
To validate the prognostic value of all candidate prognostic factors identified in the last section, a prognostic index (PI) was proposed as an integrative prognostic predictor. A PI was calculated for each specimen as the weighted sum of expression/methylation level of each candidate gene/methylation site, with the corresponding Cox regression coefficients as weights. Consequently, a higher PI is a result of high expression levels of the miRNAs, mRNAs, or methylations whose expression associated with poor prognosis, as well as low expression levels of those whose expression associated with favorable prognosis. Therefore, a higher PI should be indicative of poor prognosis.
To test the efficiency of PI in grouping patients by overall survival, the 379 patients in the dataset were divided into two groups and subjected to survival analysis. The low-risk group (PI < 0) included 150 cases, whereas the high-risk group (PI > 0) included 145. There were five tumor-related deaths in the low-risk group, and 52 in the high-risk group. In addition, median OS for the low and high-risk groups were 783 and 514 days, respectively (Fig. 4). The former showed an average 1-, 3-, and 5-year survival rates 89.3%, 54.9%, and 35.9%, respectively, while those of the latter group were 70.4%, 37.6%, and 18.4%, respectively.
Fig. 4.
Median survival for patients in the low-risk (PI < 0) and high-risk (PI > 0) groups. Low-risk group showed a significantly greater survival time compared with the high-risk group (p< 0.0001).
Furthermore, Kaplan-Meier survival analysis and log-rank test showed a significant difference in the OS between the two risk groups, with the high-risk group exhibiting poorer survival (log-rank test p < 0.0001, Fig. 5A).
Fig. 5.
KaplanMeier curve showing OS of patients grouped by (A) PI, and (B) PI and pathological stage. (A) Lowrisk (PI < 0) and high-risk (PI > 0) groups showed significantly different survival curves. (B) Patients in the low- and high-risk groups were further divided based on pathological stages of the tumor (stage I/II vs stage III/IV), and analyzed for difference in the corresponding survival curves.
Interestingly, difference in the OS between patients grouped by tumor stage appeared to depend on PI level (Fig. 5B). Patients in the low- and high-risk groups can be further divided into a total of four subgroups, based on pathological staging. In the two high-risk subgroups (PI > 0), there was significant different (Cox p < 0.0001) in OS between the earlier (stages I/II) and later stages (stages III/IV). This observation was consistent with known fact. However, the two low-risk subgroups (PI < 0) showed no significant difference in OS (Cox p = 0.185). On the other hand, PI level was efficient in grouping patients by OS in manner that is independent of pathological staging. There were significant differences in OS between the low- and high-risk subgroups when both subgroups were of earlier stages (Cox p < 0.001) or of later stages (Cox p < 0.0001).
Together, these results suggest high value of the candidate prognostic factors identified in this study. Moreover, incorporation of PI into the conventional practice of patient grouping by pathological stage increased the precision with which to distinguish patients with different lengths of survival, thereby suggesting the use of tumor stage and PI to achieve a more accurate prognosis for CRC patients.
Discussion
Rapid progress of next-generation sequencing has provided a number of novel tools and thereby enabled the scientific community to approach cancers from previously impractical perspectives and to gain deeper insights. A first attempt was launched by The Cancer Genome Atlas, which conducted a genome-wide characterization of somatic mutations in CRC [24]. A variety of transcriptomic tools, such as mRNA and miRNA expression and promoter methylation, were applied. Based on the extent of mutations, a total of 276 CRC specimens were divided into two subsets, one showing hypermutation (16% of all cases), of which the majority exhibited high microsatellite instability and mutations in mismatch repair genes. The remaining 84% specimens showed similar patterns of mutations in 24 genes, including APC, TP53, and KRAS. In addition, recurrent alterations in copy-number (e.g. ERBB2 and IGF2) and chromosomal translocations were also identified. These alterations serve as the basis for CRC diagnosis and subtyping, and warranted investigation to establish the clinical practice of applying the subtyping in improving prognosis estimation.
Later, Guinney et al. [25] orchestrated a consortium and analyzed a large of number of CRC patients with a network-based approach. Among more than 4000 cases, the majority presented distinctly distinguishable features. Samples with mixed features accounted for only 13% of all cases. Four robust consensus molecular subtypes (CMSs) were identified, including a hypermutated, microsatellite unstable CMS1 subtype, a CMS2 marked by WNT and MYC signaling activation, a CMS3 showing evident metabolic dysregulation, and a CMS4 with marked activation of transforming growth factor-β signaling. Through characterizing the subtypes of CRC with transcriptomics approaches, this study revealed convergent pathway dependencies in thousands of cases, and proposed a classification system with clear biological interpretability, which could prove useful for clinical stratification and targeted interventions [26, 27].
In the present study, through a comprehensive analysis of miRNA and mRNA expression, DNA methylation, and patient survival of 379 CRC patients documented in the TCGA database, we identified a set of genetic and epigenetic modifications as potential prognostic factors, and validated the value of a prognostic index (PI) that takes into consideration all these factors. Furthermore, based on the association between pathological stage and patient survival, we propose a novel subgrouping system combining PI and pathological staging, for predicting prognosis for CRC patients.
We started with screening for clinical features associated with prognosis. A combination of Cox regression and Kaplan-Meier survival analysis revealed that overall survival and tumor-free survival showed little association with patient characteristics such as age and sex, but were significantly different between patients at different pathological stages (Fig. 1 & Fig. 2). These findings were consistent with known fact. Further screening of genetic and epigenetic aberrations highlighted a collection of 18 miRNAs, seven mRNAs, and 12 DNA methylation sites whose expression/methylation levels were significantly associated with overall survival (Tables 1 & 2). In order to identify potential target genes of the survival-associated 18 miRNAs, we retrieved targets from miRTarBase, a curated database of experimentally validated miRNA-target interactions [28], and performed a screening using a Lasso-Cox approach. Thirty-eight high-confidence targets were selected and showed highly correlated co-expression with their corresponding miRNAs (Fig. 3), suggesting potential regulations between the miRNA-target pairs [29], and thereby provide hints for downstream effectors through which the 18 miRNAs affect CRC prognosis.
As miRNAs exert their roles in CRC by downregulating the mRNA levels of their targets, it was possible that some survival-associated mRNAs have been missed during the first screening due to a slightly higher p value. To validate the functional relevance of these 38 targets with the survival-associated miRNAs and PCGs, a regulatory network was first constructed with these genes to visualize the reported biological relationships of the 38 targets in this network. Subsequently, a multivariate Cox regression was performed to screen all the nodes from this network and survival-associated methylated DNA species for factors significantly associated with OS. This analysis produced a list of candidate prognostic factors consisting of six miRNAs, 11 mRNA (four from the 38 miRNA targets) and nine methylation sites.
In order to validate the prognostic value of these candidates, a prognostic index (PI) was computed as an integrative indicator. Based on which patients were divided into low-risk (PI < 0) and high-risk (PI > 0) groups. Kaplan-Meier survival analysis revealed significant differences between OS of these two groups, suggesting the efficiency of the PI as a potential prognostic indicator. Furthermore, after combining PI with tumor pathological stages, a more precise prognosis was possible. On one hand, for patients in the same stage group (stages I & II), prognosis can be significantly different depending on the PI level. On the other, in the low-risk group, there was no significant difference in OS in patients at different stage groups (Cox p = 0.185), suggesting prognosis as the outcome of a comprehensive panel of factors, and therefore require a more well-defined approach in predicting prognosis.
These findings can be of considerable clinical importance, especially for patients at the later stages (stages III & IV). Evaluation of PI could provide assistance to making a more accurate prognosis, but also to planning therapeutic interventions. For instance, a low PI could help avoid prescribing a patient intense yet potentially unnecessary radio- and/or chemotherapies, and, equally importantly, provide mental relief for the patient.
However, our prognostic index model didn’t take into consideration risk factors like lifestyle (diet, physical activity, tobacco smoking, etc) and therapeutic regimen. Many studies support the association between lifestyle and the development of colorectal cancer [30-34]. Various energy balance host factors, including obesity, physical inactivity and certain dietary factors can affect patient’s outcome [33]. And, smoking has detrimental effects on survival even after CRC diagnosis [34]. The NCCN guideline recommend different therapeutic regimen according to disease status, like operation for early stage patients, targeted therapy for advanced patients without RAF/RAS driver gene mutation, chemotherapy for advanced patients with RAF/RAS driver gene mutation. Constrained by the limited lifestyle and therapeutic regimen information in our data, we have to assume that these factors are balanced between low-risk and high-risk groups identified by our prognostic index model. We will try to validate this model with patients in our hospital and think carefully the influence of lifestyle and therapeutic regimen in the future trail.
Conclusion
In summary, in this study, starting from a TCGA dataset of 379 colorectal cancer cases, we applied an integrative analysis combining Cox regression, co-expression and gene regulatory network analyses, and survival analysis to identify factors significantly associated to patient survival. By screening incorporating patient characteristics, genetic and epigenetic aberrations, and survival data, we identified a collection of 6 miRNAs, 11 mRNAs, and 9 DNA methylation sites whose expression levels were significantly associated with overall survival. An integrative prognostic index was proposed and used to divide patients to groups of different risk levels, which showed significantly different survival, thereby suggesting prognostic value of the PI and its underlying potential prognostic factors. Importantly, we found that combination of PI and tumor stage provided additional precision in subtyping patients with significantly different overall survival, suggesting the promise of a more comprehensive subtyping system, using both PI and pathological staging in predicting prognosis. Further research is warranted to validate the efficiency of this subtyping system, as well as to elucidate the roles these genetic and epigenetic modifications play in colon cancer biology.
Acknowledgements
We thank all the doctors in our department for their help in patients care so that we can have enough time to conduct this study.
Disclosure Statement
The authors declare that they have no competing interests.


Get Permission
