April 12, 2022

AACR 2022

Discovery and validation of orphan noncoding RNA profiles across multiple cancers in TCGA and two independent cohorts

Jeffrey Wang1,2, Helen Li1, Lisa Fish1,2, Kimberly H. Chau1, Patrick Arensdorf1, Hani Goodarzi2, Babak Alipanahi1

1Exai Bio Inc., Palo Alto, CA; 2UCSF School of Medicine, University of California, San Francisco, CA


  • Small non-coding RNAs (sncRNAs) have established roles as posttranscriptional regulators of cancer pathogenesis.
  • We recently reported a novel and previously unannotated class of sncRNAs that were found in breast cancer tissue but not in normal tissue adjacent to the tumor (hereinafter normal), which we termed orphan non-coding RNA (oncRNA).1
  • We showed that one of these oncRNAs is exploited by breast cancer cells to promote cancer metastasis,1 which suggests that other oncRNAs may also have pathological roles in cancer.
  • We now hypothesize that oncRNAs are present in many types of cancer and that oncRNAs may enable a new, RNA-based liquid biopsy strategy for early detection and monitoring in a wide variety of cancers.


  • Identify and validate novel oncRNAs in six different cancer sites in 10,963 samples (7,942 cancer and 3,021 normal samples) across three large independent cohorts to create the largest known library of oncRNAs.
  • Develop and validate an artificial intelligence (AI)-based approach to predict cancer tissue-of-origin by leveraging oncRNA expression profiles.


  • Two sources provided small RNA (smRNA) data for 10 types of cancer from 6 tissue sites and for their corresponding normal tissues.
  • The Cancer Genome Atlas (TCGA). A joint NCI and NHGRI program that provides open-source tumor/normal sequencing data. TCGA smRNA data were used here to discover new oncRNAs and to train a predictive AI model.
  • IndivuType. A comprehensive multi-omics dataset provided by Indivumed GmbH that offers sequencing data from a global patient sample collection. IndivuType provided two cohorts, A and B, which were used here to validate oncRNAs identified in TCGA and to validate the AI tissue-of-origin model.
  • The 6 cancer sites studied here (breast, colorectal, gastric, kidney, liver, lung) represent 46% of new cancer diagnoses and 51% of cancer deaths worldwide, per GLOBOCAN 2020.2
  • Sample collection/preparation and RNA sequencing for the TCGA and IndivuType cohorts were performed prior to and independently of this study, using standard methods. Patients had provided informed consent and contributing centers had obtained IRB approval.


  • Fisher’s exact test followed by Benjamini-Hochberg correction was used to identify cancer-specific oncRNAs in TCGA data. OncRNAs were considered validated in IndivuType data if the corresponding statistical test yielded P ≤ 0.05.
  • An eXtreme Gradient Boosting (XGB) model for prediction of cancer tissue-of-origin from oncRNA profiles was trained with TCGA data and validated in the two IndivuType cohorts.
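
The discovery test named above can be sketched as follows. This is a minimal illustration, not the study's pipeline: the one-sided alternative, the function names, and the sample counts are all assumptions, since the poster does not state the sidedness, detection thresholds, or significance cutoff used in TCGA.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test P value for a 2x2 presence/absence table.

    a: cancer samples with the RNA detected, b: cancer samples without it,
    c: normal samples with the RNA detected, d: normal samples without it.
    Returns P(X >= a) under the hypergeometric null (enrichment in cancer).
    The one-sided alternative is an assumption made for this sketch.
    """
    cancer_total = a + b
    detected_total = a + c
    n = a + b + c + d
    upper = min(cancer_total, detected_total)
    tail = sum(
        comb(cancer_total, k) * comb(n - cancer_total, detected_total - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(n, detected_total)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts for three candidate RNAs: (a, b, c, d) as defined above.
candidates = {
    "rna_1": (30, 70, 0, 50),
    "rna_2": (12, 88, 5, 45),
    "rna_3": (3, 97, 2, 48),
}
raw = [fisher_exact_greater(*counts) for counts in candidates.values()]
for name, p, q in zip(candidates, raw, benjamini_hochberg(raw)):
    print(f"{name}: P={p:.3g}, BH-adjusted={q:.3g}")
```

Exact binomial coefficients keep the sketch free of third-party dependencies; for tables at the scale of the TCGA cohort a library routine would be the more practical choice.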

Results 1: Three Large Cohorts

Table 1. TCGA Samples used for Identifying oncRNAs

Numbers of independent samples evaluated from the TCGA cohort, grouped by cancer site and tissue type (cancer or normal tissue).

Table 1 graphic illustrating TCGA Samples used for Identifying oncRNAs.

Table 2. IndivuType Samples used for Validating oncRNAs

Samples evaluated from non-overlapping IndivuType cohorts, A and B, are grouped by cancer site and tissue type (cancer or normal tissue).

Table 2 graphic illustrating IndivuType Samples used for Validating oncRNAs.

*Cohort B lacked normal tissue samples for breast cancer, so it was excluded from validation of breast cancer oncRNAs.

Results 2: Heat Map of oncRNAs in TCGA Cohort

Figure 1. Discovery of cancer-specific orphan non-coding RNAs across 6 cancer sites.

749 representative oncRNAs discovered in 4,445 cancer samples across 6 cancer sites are shown.

Results 3: Identification and Validation of oncRNAs

Table 3. Identification of new oncRNAs in TCGA Cohort and Validation in IndivuType cohorts.

  • In total, 144,695 distinct oncRNAs were identified among the six cancers.
  • For each cancer site category, the majority of oncRNAs identified in TCGA were validated in at least one IndivuType cohort (Union column).
  • A total of 51,208 oncRNAs were validated in both independent IndivuType cohorts.
Table 3 graphic illustrating identification of new oncRNAs in TCGA Cohort and Validation in IndivuType cohorts.

*TCGA codes for each cancer site appear in Table 1. **Normal tissue samples were not available from Cohort B breast cancer patients. Totals are less than the sum of the rows above them because some oncRNAs were found in >1 cancer site. Stouffer’s method was used to combine P values from Cohort A and Cohort B for each oncRNA.
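
Combining the two cohort-level P values with Stouffer's method can be sketched as below. This is an illustrative, unweighted version; whether the study weighted the cohorts (for example by sample size) is not stated, so the optional `weights` argument and the example P values are assumptions.

```python
from statistics import NormalDist


def stouffer_combine(pvalues, weights=None):
    """Combine one-sided P values with Stouffer's Z-score method.

    Each P value must lie strictly between 0 and 1, because the inverse
    normal CDF is undefined at the endpoints.
    """
    nd = NormalDist()
    if weights is None:
        weights = [1.0] * len(pvalues)  # unweighted form (an assumption)
    z_scores = [nd.inv_cdf(1.0 - p) for p in pvalues]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = sum(w * w for w in weights) ** 0.5
    return 1.0 - nd.cdf(numerator / denominator)


# Hypothetical P values for one oncRNA in Cohort A and Cohort B.
combined = stouffer_combine([0.03, 0.08])
print(f"combined P = {combined:.4f}")
```

Two moderately small P values pointing in the same direction yield a combined P value smaller than either alone, which is why an oncRNA can pass the combined test without passing in each cohort separately.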

Results 4: AI Analysis of oncRNA Profiles to Predict Cancer Tissue-of-Origin

Figure 2. Validation in 2 IndivuType Cohorts of an XGB prediction model that was Trained on TCGA oncRNA Data.

  • Accuracies were: 91.5% (95% CI: 90.3%–92.7%) for IndivuType Cohort A, and 96% (94.7%–97.0%) for IndivuType Cohort B.
Figure 2 graphic of validation in 2 IndivuType Cohorts of an XGB prediction model that was Trained on TCGA oncRNA Data.

Numbers across a row indicate the frequencies of the ground truth cancer among samples for which the model predicted the cancer site named at left.
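
The row-wise reading of Figure 2 and an accuracy confidence interval can be sketched as follows. The matrix values are hypothetical, and the Wilson score interval is an assumption: the poster does not state how its 95% CIs were computed.

```python
from math import sqrt


def row_frequencies(matrix):
    """Normalize each row so it sums to 1.

    Rows are predicted sites and columns are ground-truth sites, so each
    row gives the distribution of true cancers among samples that received
    that prediction.
    """
    result = []
    for row in matrix:
        total = sum(row)
        result.append([count / total if total else 0.0 for count in row])
    return result


def accuracy_with_wilson_ci(matrix, z=1.96):
    """Return (accuracy, lower, upper) using a Wilson score interval."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return p, centre - half, centre + half


# Hypothetical counts: rows = predicted site, columns = true site.
counts = [
    [90, 5, 5],
    [4, 80, 6],
    [6, 15, 89],
]
acc, low, high = accuracy_with_wilson_ci(counts)
print(f"accuracy = {acc:.3f} (95% CI: {low:.3f}-{high:.3f})")
for row in row_frequencies(counts):
    print([round(value, 3) for value in row])
```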


  • We have identified 144,695 distinct oncRNAs across 6 cancer sites, of which 51,208 were validated in each of two independent cohorts.
  • We developed an artificial intelligence model that uses oncRNA profiles to predict cancer tissue-of-origin with high accuracy.
  • These results suggest that oncRNAs are a unique, generalizable feature of cancers with potential for use in cancer detection and monitoring.


Mathias Saver and Margarita Krawczyk from Indivumed are thanked for performing data processing. Indivumed combines the world’s most comprehensive multi-omics cancer data with extensive medical and bioinformatics expertise. Samples are collected within a global clinical network using a standardized approach to ensure biospecimen quality.


JW, HL, LF, and KC are full-time employees of Exai Bio. BA and PA are cofounders, stockholders, and full-time employees of Exai Bio. HG is a cofounder, stockholder, and advisor of Exai Bio.


  1. Fish L., et al. Nature Med. 2018;24:1743-51.
  2. Sung H., et al. CA Cancer J Clin. 2021;71:209-249.