Single-Cell Multiomics of Cancer And Machine Learning Approaches

Paramita Mishra

INTRODUCTION

Cancer is a quintessential complex disease, hallmarked by biomarkers acquired through molecular modifications at all levels (genomic, transcriptomic, proteomic, metabolomic, epigenomic). At the advent of next-generation sequencing (NGS) technologies, efforts were made to understand cancer mechanisms using single-omics data. For instance, epigenetic changes could be found using differential ATAC-Seq landscapes1, mRNA-Seq profiles could be used for differential expression analysis, and proteins could be identified using proteomic and transcriptomic data from ChIP-Seq and RNA-Seq.2 These techniques facilitate gene identification, epigenetic marker detection, and classification of tumors based on proteomic profiles. However, there is a missing piece of the puzzle: the relationship between the molecular nature of cancer and its phenotypes has remained unclear.3

Single-cell multiomics (scMO) is a recent and groundbreaking technology for creating tangible omic data at multiple layers to identify cells and cellular function in tandem, bringing hope for understanding the genomic-level causation, heterogeneity, and branched evolution of cancer cells. In single-cell cancer epigenomics, integration of scMO data gives insight into tumor heterogeneity, the role of the tumor microenvironment (TME) and corresponding precision therapies, and cancer cell evolution, resistance and survival. The integrative analysis of single-cell transcriptomics and single-cell epigenomics in cancer has already answered some questions about molecular function and phenotypes.4 Single-cell omic and multi-omic analyses are boosted by new techniques for sample collection and analysis, such as liquid biopsy and scMO-based machine learning (ML). Our understanding of epigenetic abnormalities in cancer - often responsible for dysregulation, aberrant function and altered expression of genes - has grown rapidly with the onset of these techniques. In the process, scMO data has grown exponentially, increasing in dimensionality as new integrative scMO sequencing techniques emerge5,6. This points to the need for feature selection, data science, sequencing data optimization and data integration in multi-omics. Single-cell multiomics therefore involves an interdisciplinary integration of omic data analysis, big data, machine learning, precision medicine, and prior knowledge of carcinogenic genome and epigenome patterns, to gain sensitive and precise insights.

In this review, I aim to discuss cancer genomics, the growing presence of scMO in oncology research, processing and interpretation of big multiomics data, and machine learning for scMO. Specifically, I highlight the pathology, causation, genomics and epigenomics of cancer, single-cell sequencing techniques and their role in oncology, the importance of multi-omics and the role of scMO in precision medicine. The latter half reviews current trends in single-cell oncology-based ML for superior detection and treatment, highlighting basic probabilistic and ML models used in multi-omics and future ML-based genomic medicine. Finally, I discuss an underrepresented challenge - the presence of big data in cancer as a result of rapidly improving single-cell omics technologies and the subsequent “curse of dimensionality”7, highlighting data- and algorithm-based solutions.




I. THE COMPLEXITY OF CANCER

The definition of cancer is often extended beyond that of a targeted disease - it is considered a complex biological system, governed by genetic alterations. Cancer shows high heterogeneity, both in its phenotypic manifestation and its genome. Despite this variability, a set of genomic properties is common in cancers: (1) uncontrolled growth and division of cells in the body, (2) damage to DNA and the presence of alterations, deletions, rearrangements, and genetic dysregulation, and (3) temporal genetic and epigenetic changes, causing loss of control at the single-cell gene level8.


Tumor Progression

Studies have generated a so-called “cancer blueprint”9, showing how one alternatively expressed cell can create a myriad of manifestations. The first manifestation is the pre-malignant lesion. An example is the autosomal dominant disease FAP (familial adenomatous polyposis), which is related to colon polyps. The gene APC, which is mutated in FAP, is an inhibitor of the WNT signalling pathway, and cells without APC have unstoppable growth10. In practice, lesions often disappear11; if not, the next stage of cancer presents as a primary tumor. The local malignant tumor, unlike lesions, rarely disappears and often reaches the next stage: a lethal tumor that may spread in the body, resist immune response/treatment, and cause invasion and metastatic disease - the two key drivers of cancer mortality12.


Carcinogenic Mutations

High-throughput sequencing technologies have enabled the identification of three classes of genes shaping our understanding of the blueprint: oncogenes (e.g. growth factors or GF, GF receptors, signaling molecules, transcription regulators in the nucleus13), which control cell growth and cell lethality; tumor suppressor genes, which control DNA repair and growth14,15; and, most interestingly, epigenetic modifiers such as enzymes regulating transcription or proteins for chromatin formation. 10% of cancers are inherited through germline mutations with high penetrance (40-90% of people inheriting the mutations develop one or more cancers16). Many such mutations were identified using bulk sequencing and are often oncogene or tumor suppressor mutations.

Tumor genome sequencing has revealed prevalent proteins affecting epigenetic regulation called chromatin remodeling factors (CRFs). CRFs act by changing DNA methylation, histone protein levels, or nucleosome position. The function of these complexes is unclear - yet, almost all cancers have one or more of these proteins frequently mutated17. The holistic study of often-unknown mutations like those in CRFs is important. One reason for this is Knudson’s Two-Hit Model (KTHM)18 - an early theoretical model used to explain a majority of cancers. KTHM states a lower bound of two mutations in a cell to start tumor development. Therefore, markers of many forms of mutations must be measured to capture the holistic single-cell profile of a cancer cell.


Epigenomic Modification Of The Cancer Cell

The presence of unique cellular combinations of mutations is an important trait in cancer. Some of the least understood aspects of these combinations involve the epigenome. Mutations in epigenetic modifiers cause the reprogramming of gene expression. The epigenome is actively changing in cancers - yet, it is one of the less-understood domains within single-cell genomics, because single-cell sequencing of the epigenome is less reliable. Technologies for sc-Seq of the epigenome are promising but still in their early stages; despite this, breakthroughs in oncology have been made through single-cell epigenomic data19.

Recent sc-Seq techniques have provided solutions for this problem and revealed that epigenetics is far more relevant in cancer cells than previously appreciated. Many epigenomic changes are now known to be mutually dependent on other omic activity20. For instance, promoter CpG island hypermethylation-based silencing of repair genes may occur in a “loop”21: the silencing can cause genetic changes, and translocations and mutations can then cause epigenetic disruption - a great example of sophisticated epigenomic insights using multi-omics, in this case using epigenomic and transcriptomic profiles. Furthermore, the integration of epigenomic data in omic and multi-omic analysis has made us aware of the phenomena below.


Cancer lethality, resistance and metastasis: Epigenomic changes are a key form of adaptation and survival, generating late-stage mutations and heterogeneity in cancer. In general, mutations in histone-modifying enzymes allow the cancer cell to show differential regulation, causing heightened adaptation in tumor cells to changing conditions like chemotherapy, so that they ultimately develop resistance and take on properties such as invasion and metastasis22. Most cancer cells have epigenetic mutations that modify proteins regulating transcription. If a cancerous tumor were to stop at the formation of a local, primary malignant tumor, we could simply use surgery, radiotherapy or chemotherapy - forms of physical tumor removal. Instead, however, it becomes a rapidly spreading cancer due to cancer epigenomics. The alterations of the epigenome allow the tumor to “hack” the genetic code, and eventually, with mutations and evolutionary pressure, this causes resistance to genomic/cancer therapy and the initiation or increase of metastasis.


Tumor microenvironment (TME): The TME is a diverse molecular system composed of immune, stromal, endothelial, and tumor cells23. It also includes non-cellular components such as the extracellular matrix and secreted signaling molecules. The area is very epigenetically active, showing heterogeneity, plasticity, and complex molecular cross-interactions24. Immune checkpoint blockade25 therapies - one of the greatest paradigm shifts in cancer immunotherapy - rely on T-cell mediated anti-tumor immunity within the TME26. High-dimensional multimodal datasets can enable cancer evolutionary lineage tracing and epigenetic profiling of TME immune cells. This would give insight into mechanisms driving the functional diversity of TME immune cells and tumor cells - a diversity driven to a large degree by epigenomic changes.


Cancer evolution: Tumor diversity undergoes Darwinian selection pressures, which affect cancer genomics. Some TME-based environmental selection pressures are the immune system, deprivation of food, oxygen or water, pH changes, temperature, chemotherapy, radiotherapy, and exposure to mutagens. Epigenetic changes are also under the same Darwinian pressure, assuming variation and perfect cellular competition, and are often heritable. This can be used to create computational frameworks for the single-cell epigenome. Single-cell tumor phylogenetic trees can reveal “driver mutations” (initial mutations that occurred before mutagenesis) for therapies against resistant cells.27 Clonal cell populations can be used alongside spatial tags and serial sampling to understand the TME complex better.


Heterogeneity:28 Tumors contain heterogeneous cells with distinct genetic and phenotypic properties that can promote metastasis and drug resistance differentially. Inter-tumor heterogeneity refers to differences between the tumors of different patients, whereas intra-tumor or spatial heterogeneity refers to differences within a single tumor mass. Precision medicine often aims to decipher and target this heterogeneity, providing treatment based on the precise molecular makeup of a tumor. Single-cell techniques provide a way to profile individual cells within tumors and learn about differential function, resistance or metastasis.29



II. SEQUENCING CANCER

Next Generation Sequencing (NGS)

NGS made it possible to sequence DNA more economically, sensitively and efficiently than Sanger sequencing. The parallel sequencing feature of NGS makes it a staple, since it allows several samples and genomic regions to be processed quickly. Additionally, it can work much better than Sanger sequencing on low-quantity input, detecting mutations more accurately. In one study, mutations in tumor tissues for the five most common cancer types were analyzed using NGS.30 The “Shannon entropy level” (to measure analytical utility) was calculated for each tumor. The aim was to see whether NGS reveals new information (higher entropy values are favorable). Even within the most common cancers, there was a big difference in scores, which suggests that for some major cancer types, NGS may have analytic utility and the right sensitivity to provide cancer diagnoses.
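As a minimal sketch of the entropy measure mentioned above, the snippet below computes Shannon entropy, H = -sum(p * log2(p)), over the distribution of mutated genes in a tumor type. The gene lists are hypothetical illustrations, and the cited study's exact definition of its entropy score may differ from this textbook form.

```python
import math
from collections import Counter

def shannon_entropy(counts):
    """Shannon entropy (in bits) of a discrete distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            entropy -= p * math.log2(p)
    return entropy

# Hypothetical mutation calls: one mutated gene recorded per sample.
tumor_a = ["TP53"] * 8                            # a single gene dominates
tumor_b = ["TP53", "KRAS", "APC", "PIK3CA"] * 2   # mutations spread evenly

entropy_a = shannon_entropy(Counter(tumor_a).values())
entropy_b = shannon_entropy(Counter(tumor_b).values())
print(entropy_a, entropy_b)
```

Under this reading, a tumor type whose mutations are spread over many genes scores higher, because each sequenced sample is less predictable in advance.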

Since the onset of NGS, whole genome and exome DNA-sequencing have been key drivers of the scientific understanding of the disease. In 2013, somatic mutations within tumors were characterized31, along with their mutation load and how that load differs by cancer type. For the well-characterized oncogenes and tumor suppressor genes, there is evidence documenting their cellular growth-based functionality. However, NGS has some limitations for cancer: in the case of aneuploidy, tumor heterogeneity and contamination with normal tissue, which are all common in cancer, NGS would not give high-quality values. In general, bulk analysis techniques average signals from mixed cells, which may mask tumor clones so that they are never detected when measuring cell diversity.32 Finally, tumor genome sequencing using NGS may reveal mutations, but for many of them there is no data defining gene function.33
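The averaging problem described above can be illustrated with a toy calculation. All numbers here are hypothetical: 100 cells, one marker gene, and an arbitrary detection threshold chosen only for the example.

```python
from statistics import mean

# Hypothetical expression of one resistance marker across 100 cells:
# 95 cells from the dominant clone (no expression), 5 from a rare subclone.
major_clone = [0.0] * 95
rare_subclone = [20.0] * 5
cells = major_clone + rare_subclone

# A bulk assay effectively reports the average over all cells.
bulk_signal = mean(cells)

# A single-cell assay keeps one value per cell, so the subclone stays visible.
threshold = 5.0  # hypothetical cutoff for calling a cell "expressing"
expressing = [x for x in cells if x > threshold]
fraction_expressing = len(expressing) / len(cells)

print(f"bulk signal: {bulk_signal}")
print(f"cells above threshold: {len(expressing)} ({fraction_expressing:.0%})")
```

The bulk value falls below the cutoff even though five cells express the marker strongly, which is the sense in which a rare clone can be hidden in bulk data.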


Single-Cell Sequencing

Single-cell sequencing (sc-Seq) technologies have been rapidly developed over the past decade with the aim of observing the “multilayered status” of cells. sc-Seq has the power to elucidate genomic, epigenomic, and transcriptomic heterogeneity in cellular populations, and the changes at these levels.


Sample preparation for cancer: The benefit of creating a sample for scSeq is that almost any tissue can be used, although the data quality is fragile and the right techniques must be used to avoid amplification bias or allele dropout. An average cancer cell contains ∼6–12 pg of total DNA and 10–50 pg of total RNA, depending on ploidy and type. Only 1 to 5% of this RNA is mRNA, meaning that amplification (WGA) is very important. Solid tumor tissue is difficult to obtain by surgery or solid biopsy at very late metastatic stages of a cancer, at very early stages, and especially for a pre-cancer screen. Also, isolating cells from solid tissues may cause biased disaggregation of the tissue, skewing ‘omic data. An efficient solution is to use another biomarker for cancer, circulating tumor cells (CTCs), which are cells shed by solid tumors during metastasis, with liquid biopsy as the sampling method. Liquid biopsies34 avoid the problem of non-ideal tissue disaggregation, since released circulating tumor DNA (ctDNA) can be collected directly as the biomarker.35 Micro-manipulation devices or special pipettes are common isolation techniques when low throughput is acceptable. Fluorescence-activated cell sorting (FACS) and laser capture microdissection were developed to improve isolation, and finally microfluidic techniques were developed, giving a significant increase in throughput with little material required. Barcoded sequencing library preparation may also be required for some protocols.


Drawbacks To Single-Cell Sequencing

Drawbacks include loss of tumor characteristics, including spatial information, intratumor heterogeneity, and important cell-to-cell interactions. Single-cell preparation by nature requires cells to be dissociated, which loses spatial information. When only a small biopsy is taken, single-cell sequencing may not accurately represent the underlying genome of the whole tumor, because intratumor heterogeneity is not fully captured. Dissociating single cells from tissues may itself alter the cells and their gene expression. There is also a tradeoff for microfluidic devices: allelic dropout is reduced, but entire cell populations can be lost. There may be bias for certain cell sizes, which, like uneven amplification, can skew results. Some single-cell sequencing techniques require complex dissociation protocols to obtain individualized fresh cells, which adds manipulation between sample collection and processing. To avoid this, researchers have to work with cell lines or organoids, which are not a perfect substitute for the co-existing system in the TME and forgo the single-cell sequencing insights that could have been gained from these interactions.


Single-Cell Multiomics And Analysing Cancer scMO Data

Multi-omics describes a set of multi-dimensional tools for wrangling high-throughput sequencing data from various domains and techniques. To study all types of cells and omics layers, we should consider single-cell sequencing methods from both laboratory and clinical views. Single-cell sequencing can resolve the heterogeneity of bulk tissue at the genotypic and phenotypic level. Multi-omics studies the characteristics of single cells, but also studies combined regulatory mechanisms evident only in reduced dimensions. The most distinctive aspects of scMO are its ability to interpret correlations between separate omics readouts, its ability to facilitate machine learning and dimensionality processing, and its applications to systems biology (through networks and correlations) and precision medicine (through ML-based regressions and classifications). scMO frequently covers the following at the single-cell level: transcriptomics, genomics, epigenomics, proteomics, and temporal and spatial multi-omics.
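To make the idea of cross-layer correlation concrete, the sketch below computes a Pearson correlation between two omics layers measured on the same cells. The values are hypothetical and constructed to be perfectly linear, so they only illustrate the calculation, not real biology; real scMO data are sparse and noisy and usually need normalization first.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical paired measurements for one gene across five cells.
accessibility = [0.1, 0.4, 0.5, 0.8, 0.9]   # e.g. promoter openness
methylation = [0.9, 0.6, 0.5, 0.2, 0.1]     # e.g. promoter methylation
expression = [1.0, 2.5, 3.0, 4.5, 5.0]      # e.g. normalized RNA level

r_open = pearson(accessibility, expression)
r_meth = pearson(methylation, expression)
print(f"accessibility vs expression: {r_open:.2f}")
print(f"methylation vs expression:  {r_meth:.2f}")
```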


Single-cell transcriptomics: Smart-seq is often the preferred method for scMO transcriptomics. It uses full-length cDNA amplification alongside oligo-dT priming and template switching. RamDA-seq7 detects RNAs with no poly-A tail, such as enhancer RNAs. scRNA-Seq is difficult due to the volume of RNA copies in cells. Therefore, microdroplet technologies are used to optimize reverse transcription by conducting it with barcoding of each droplet. Microwells can also be used, since they can handle thousands of cells, and this adds to sensitivity by reducing allele dropouts. In cancer, the transcriptomic landscape was first studied in 2016, solely using scRNA-seq on CD4 cells in melanoma patients.36


Single-cell genomics: Uniform amplification of DNA is difficult at the single-cell level since only two copies of DNA are present in a single cell, quite unlike the situation in transcriptomics. Amplification methods like degenerate oligonucleotide-primed PCR (DOP-PCR) must be used.37 Allelic dropout and amplification bias affect data quality and sequencing depth. Nuc-Seq and single nucleus exome-seq are good alternatives when SNVs and indels need to be identified, since allelic dropout and bias make traditional scDNA-seq techniques insufficiently sensitive for these mutations.


Single-cell epigenomics: We can use DNA methylation and histone profiles for single-cell epigenomics. Single-cell bisulfite sequencing (scBS-seq) is the most reliable technique for methylation. scATAC-seq and sc-Hi-C are methods for quantifying, using small numbers of cells, open chromatin patterns and chromatin structure, respectively. Meanwhile, for histone modifications, Drop-ChIP and scChIC-seq can be used. Drop-ChIP is a droplet microfluidics approach that allows ChIP-Seq to be conducted at the single-cell level, although ChIP-seq has yet to be fully adapted for single cells. In 2019, single-cell chromatin immunoprecipitation was used to study breast cancer patients. Histone landscapes of patient-derived xenografts showed a contrast between cells that would respond to chemotherapy and cells that would be resistant to it, confirming the existence of an epigenetic basis for tumor resistance.38


Single-cell proteomics, temporal and spatial multi-omics: For proteomics at a single-cell level, mass spectrometry or flow cytometry is often preferred over sequencing. However, for techniques like mass spectrometry there are some concerns about large sample size requirements as well as the ability to measure only a few proteins.39 In this case, CyTOF has recently been used. A form of mass cytometry, it can conduct analyses on a large number of proteins using labelled antibody tags. The integration of spatial data in multiomics is lacking. Single-cell sequencing lacks spatial data by nature - the tissue is isolated into single cells before sequencing. However, there are some new transcriptomics-based spatial techniques that use barcoding to retain spatial information. Recently, the spatial transcriptomics techniques Slide-seq75 and Visium have been used for gene expression analysis in tissue sections, with spatial information tagged through molecular barcoding. Finally, for temporal data, the closest available tools are Monocle and Monocle 2, algorithms for learning complex single-cell trajectories. Monocle uses a machine learning technique called “reversed graph embedding” to learn a principal curve passing through the central tendencies of a dataset and then generates a tree that serves as a temporal map.40 Monocle learns the output graph from single-cell RNA-Seq data.



III. MACHINE LEARNING

Machine Learning In Multi-Omics

Machine learning refers to algorithms that mathematically fit a predictive model to observed (“training”) data. This model can then be applied to predict properties or “labels” of as-yet-unencountered (“testing”) data. In the training process, the algorithm progressively improves its performance on the assigned task with each iteration or “epoch” of data fed to the machine. It is a branch of artificial intelligence with an “ability to interpret large, cryptic cancer datasets”, and predict over them. Meanwhile, deep learning (DL) is a subset of machine learning that emerged in recent years, in which neural networks are used to process and find complex representations of multi-omic data. When given large-scale datasets with high dimensionality, DL tends to outperform classical ML in oncology.41
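The training/testing distinction above can be sketched with a deliberately simple model. The example below is an assumption-laden toy: it simulates two-feature "cells" from two hypothetical classes, fits a nearest-centroid classifier on a training split, and scores it on held-out cells. The class names, feature values, and the 70/30 split are all illustrative choices rather than anything prescribed by the literature reviewed here.

```python
import random

def centroid(rows):
    """Mean vector of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit(features, labels):
    """Training step: store one centroid per label."""
    model = {}
    for label in set(labels):
        rows = [f for f, l in zip(features, labels) if l == label]
        model[label] = centroid(rows)
    return model

def predict(model, feature):
    """Assign the label whose centroid is closest to the feature vector."""
    return min(model, key=lambda label: squared_distance(model[label], feature))

random.seed(0)

def simulate(center, n, label):
    """Hypothetical cells: Gaussian noise around a class-specific profile."""
    return [([random.gauss(c, 0.5) for c in center], label) for _ in range(n)]

data = simulate([0.0, 0.0], 50, "normal") + simulate([5.0, 5.0], 50, "tumor")
random.shuffle(data)
train, held_out = data[:70], data[70:]   # 70 training cells, 30 testing cells

model = fit([f for f, _ in train], [l for _, l in train])
correct = sum(predict(model, f) == l for f, l in held_out)
accuracy = correct / len(held_out)
print(f"held-out accuracy: {accuracy:.2f}")
```

Accuracy is reported only on cells the model never saw during fitting, which is what makes it an estimate of performance on unencountered data.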

Until recently, interest in DL for multi-omics analysis has been rather limited42. However, the performance of DL algorithms on omics data has shown promise across applications, from classification for cancer detection to precision risk stratification for cancer patients. Current single-cell sequencing technologies produce profiles for millions of single cells very quickly, opening the door to powerful deep learning approaches. On top of that, multiomic data, after integration, can readily be fed to these algorithms, with many choices of neural network architecture depending on the kind of information provided. For instance, convolutional neural networks (CNNs), derived from their application in computer vision, can process positional and spatial information very well, interpreting data through a “moving window”.
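The “moving window” idea can be shown in one dimension. In the sketch below, a small kernel slides along a positional signal and produces a weighted sum at each position, which is the core operation inside a convolutional layer. The signal and kernel values are hypothetical; in a trained CNN the kernel weights would be learned from data rather than fixed by hand.

```python
def slide_window(signal, kernel):
    """Slide the kernel along the signal, returning one weighted sum per position."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

# Hypothetical per-position signal, e.g. a binarized accessibility track.
signal = [0, 0, 1, 1, 1, 0, 0]

# Hand-picked kernel that responds to rising (+) and falling (-) edges.
kernel = [-1, 0, 1]

response = slide_window(signal, kernel)
print(response)
```

Because the same kernel is reused at every position, the layer detects a pattern wherever it occurs along the sequence or image.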

Another appropriate choice for single-cell multi-omics learning is the autoencoder, a type of neural network containing three layers (encoder, bottleneck, decoder) that learns a compressed representation of raw data. The dimension of the bottleneck layer is normally lower than that of the input layer, mitigating the curse of dimensionality that accompanies single-cell data. The encoder learns as much information about the input as possible while ignoring the noise that is commonplace in this form of genetic data. Autoencoders therefore act as a dimensionality reduction algorithm, storing a low-dimensional, optimized view of complex data that can be analysed and visualized. They also have a flexible architecture, increasing integration possibilities between gene and protein expression data. Other DL algorithms that are commonplace in this domain include DNNs, ANNs, and GANs.
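A minimal sketch of the encoder, bottleneck and decoder structure follows, assuming NumPy is available. It is a simplification in several ways: the autoencoder is linear (no activation functions), the data are simulated from two hidden factors, and the learning rate and step count are arbitrary illustrative choices. Practical single-cell autoencoders are typically deeper, nonlinear, and built with a deep learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 "cells" x 6 "features" driven by 2 hidden factors.
n_cells, n_features, n_latent = 200, 6, 2
factors = rng.normal(size=(n_cells, n_latent))
loadings = rng.normal(size=(n_latent, n_features))
X = factors @ loadings + 0.05 * rng.normal(size=(n_cells, n_features))
X = X - X.mean(axis=0)  # center each feature

# Encoder maps 6 features to a 2-dimensional bottleneck; decoder maps back.
W_enc = 0.1 * rng.normal(size=(n_features, n_latent))
W_dec = 0.1 * rng.normal(size=(n_latent, n_features))

def reconstruction_loss(X, W_enc, W_dec):
    """Mean squared error between the input and its reconstruction."""
    return float(np.mean((X @ W_enc @ W_dec - X) ** 2))

initial_loss = reconstruction_loss(X, W_enc, W_dec)

learning_rate = 0.02  # illustrative value
for _ in range(2000):
    Z = X @ W_enc                      # bottleneck codes
    X_hat = Z @ W_dec                  # reconstruction
    G = 2.0 * (X_hat - X) / X.size     # gradient of the loss w.r.t. X_hat
    grad_dec = Z.T @ G
    grad_enc = X.T @ (G @ W_dec.T)
    W_dec -= learning_rate * grad_dec
    W_enc -= learning_rate * grad_enc

final_loss = reconstruction_loss(X, W_enc, W_dec)
codes = X @ W_enc                      # low-dimensional view of each cell

print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
print("code matrix shape:", codes.shape)
```

The `codes` matrix is the compressed representation that would be passed on to clustering or visualization. A linear autoencoder like this one recovers the same subspace as principal component analysis, so the added value of real autoencoders comes from their nonlinear layers.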