Group and Basis Restricted Non-Negative Matrix Factorization and Random Forest for Molecular Histotype Classification and Raman Biomarker Monitoring in Breast Cancer

Raman spectroscopy is a non-invasive optical technique that can be used to investigate biochemical information embedded in cells and tissues exposed to ionizing radiation used in cancer therapy. Raman spectroscopy could potentially be incorporated in personalized radiation treatment design as a tool to monitor radiation response in at the metabolic level. However, tracking biochemical dynamics remains challenging for Raman spectroscopy. Here we developed a novel analytical framework by combining group and basis restricted non-negative matrix factorization and random forest (GBR-NMF-RF). This framework can monitor radiation response profiles in different molecular histotypes and biochemical dynamics in irradiated breast cancer cells. Five subtypes of; human breast cancer (MCF-7, BT-474, MDA-MB-230, and SK-BR-3) and normal cells derived from human breast tissue (MCF10A) which had been exposed to ionizing radiation were tested in this framework. Reference Raman spectra of 20 biochemicals were collected and used as the constrained Raman biomarkers in the GBR-NMF-RF framework. We obtained scores for individual biochemicals corresponding to the contribution of each Raman reference spectrum to each spectrum obtained from the five cell types. A random forest classifier was then fitted to the chemical scores for performing molecular histotype classifications (HER2, PR, ER, Ki67, and cancer versus non-cancer) and assessing the importance of the Raman biochemical basis spectra for each classification test. Overall, the GBR-NMF-RF framework yields classification results with high accuracy (>97%), high sensitivity (>97%), and high specificity (>97%). Variable importance calculated in the random forest model indicated high contributions from glycogen and lipids (cholesterol, phosphatidylserine, and stearic acid) in molecular histotype classifications.


Introduction
Radiation therapy (RT) is a treatment prescribed to approximately 50% of cancer patients. 1 Currently, RT treatment plans are often designed and evaluated based on morphological information acquired by imaging modalities such as computed tomography (CT) and magnetic resonance imaging (MRI). [2][3][4] RT treatment schemes do not incorporate information on biochemical dynamics in tumours as part of the treatment design and evaluation. However, tumor cells possess signaling pathways and metabolic processes inherently distinct from normal cells, leading to non-uniform cellular radiation response. 5 For example, breast cancer as a heterogeneous disease can be classified into molecular subtypes based on the immunochemical molecular histotypes. 6 Histopathological assessment based on the expression of estrogen receptor (ER), progesterone receptor (PR) and the human epidermal growth factor receptor 2 (HER2) has been included in breast cancer classification system. [7][8][9] A strong interest in designing personalized treatments based on the molecular subtypes has been advocated. 10,11 Although molecular subtypes and histotypes are often regarded as essential when making treatment decisions, the biomolecular dynamics for each subtype and histotype remain unclear and challenging to track in response to treatment. Developing a tool to identify and monitor biomarkers that can be used as targets in cancer treatment (e.g., to develop radiosensitizers) is necessary.
Raman spectroscopy (RS) is a noninvasive optical method applicable to soft tissues and cells to monitor and evaluate biological changes in response to stimuli on a cellular level in the tumor microenvironment. RS provides information on multiple biomarkers in cell samples using the vibrational fingerprint of the biochemicals. Previous works have demonstrated that RS can detect radiationinduced biochemical variations in cancer cells and tumours. [12][13][14][15][16][17] Performing unsupervised dimension reduction (e.g., principal component analysis (PCA) and nonnegative matrix factorization (NMF)) helps overcome the complexity of RS data analysis. Applying unsupervised dimension reduction techniques has helped identify glycogen and lipids as primary biochemicals responsible for spectral changes on cellular and tumour samples exposed to ionizing radiation. [12][13][14][15][16][17][18] However, the primary difficulty encountered by the commonly used unsupervised dimension reduction algorithms is interpreting the decomposed components or loading vectors. This limitation often results in unidentifiable bases which can be related to specific chemicals.
Negative matrix factorization is an unsupervised learning technique used for dimensionality reduction. 19 Due to the non-negativity assumption, NMF can improve the interpretability of decomposed components in spectroscopic data over other dimension reduction techniques. 16,20,21 NMF, as an unsupervised analytical method, also has its limitations. For example, matching chemical bases decomposed from unsupervised NMF to the actual biochemicals present within a Raman spectrum is a complex task. To address this limitation, a group and basis restricted non-negative matrix factorization (GBR-NMF) algorithm was used in the current study to conduct dimensionality reduction. 22 GBR-NMF constrains biochemicals during dimension reduction and allows for better evaluation of chemicals of interest. 23 In the present study, GBR-NMF is combined with a random forest (RF) classifier to form a data analytical framework. RF is an ensemble classifier that generates the forest's classification result based on the voting results of a collection of decision tree classifiers within the forest. 24 After GBR-NMF decomposes the Raman spectra, a random forest classifier was applied to the GBR-NMF decomposed scores of constrained chemicals. RF can classify the chemical scores into cell lines and molecular histotypes while providing information on the relative contribution of individual biochemicals in the classification via variable importance.
The current work aims to demonstrate that the GBR-NMF and random forest combined (GBR-NMF-RF) data analytical framework can accurately classify molecular histotypes of breast cell line RS data acquired from cells exposed to clinically relevant radiation doses. This data analytical framework also has the advantage of revealing the high-contributing chemicals in the classification. The combination of RS and GBR-NMF-RF provides a robust technique to track the biochemical dynamics across breast cancer molecular histotypes throughout treatment. The data used to test the model are Raman spectra acquired on one normal breast cell line, MCF10A, and four breast cancer cell lines (MCF-7, BT-474, MDA-MB-231, and SK-BR-3) irradiated with single doses of radiation (0, 10, 30, 50 Gy). These cell lines span a range of molecular histotypes and are presented in Table I. The GBR-NMF was performed on RS cellular data with two reference libraries of basis biochemicals (containing 16 and 20 biochemical bases, respectively) for comparison. RF was applied to the decomposed biochemical scores to classify the breast cell lines into different molecular histotypes, yielding classification results with high accuracy (>97%), high sensitivity (>97%), and high specificity (>97%). The variable importance ranking reveals high contributions from the Raman spectra of the biomarkers glycogen and lipids (cholesterol, phosphatidylserine, and stearic acid) in molecular histotype classifications.

Breast Cancer Cell Line Molecular Histotypes
The molecular histotypes of the breast cell lines are summarized from previous studies. 12,25,26 The previously established immunohistochemical information is shown in   27,28 Based on the immunohistochemistry of the histotypes, the cancer cell lines selected can be categorized into subtypes, such as Luminal A (with ER and/or PRþ, HER2�, low Ki67), Luminal B (HER2þ) (with ER and/or PRþ, HER2þ), HER2 (with ER�, PR�, HER2þ) and Basal-like (with ER�, PR�, HER2�, low Ki67). 8,9 Raman Spectroscopy Reference Biochemical Library Selection We constructed two chemical basis libraries to cover a range of biochemicals known to be present within the Raman cellular spectra, which includes carbohydrates, lipids, enzymes, amino acids, and nucleic acids. [29][30][31][32] The chemical reference libraries were built based on the biological composition of cells and biomolecular dynamics in cancer metabolism. Building a model is a trade-off between complexity and the ability to cover a complete set of molecular compounds. Hence, we reported both simpler (16-chemical) and more complex (20-chemical) libraries to assess the trade-off in complexity and model performance.
A 16-chemical library is presented in Table S1 (Supplemental Material). Glycogen, glucose, and lipids are included in the reference library since previous studies have traced them as Raman biomarkers for radiation treatment response. [13][14][15][16]23,33 A wide range of lipids were selected, including; cholesterol, glycerol, glyceryl tripalmitoleate, oleic acid, palmitic acid, phosphatidylcholine, phosphatidylserine, and stearic acid. The RS spectra of lipids are highly alike, [30][31][32] and previous studies could not distinguish the contribution of individual lipids in radiation response. 12,16 Citric acid and lactic acid were added to the library, given they are downstream products of glucose metabolism in cancer metabolic pathways. After glucose is consumed and changes to pyruvate via glycolysis, pyruvate will feed into the citric acid cycle, also known as the tricarboxylic acid cycle (TCA cycle), to make energy or be converted to lactic acid. 34,35 Coenzyme A (CoA) was included in the library as it is involved in the oxidative decarboxylation of pyruvate into acetyl-CoA, which has a vital role in fatty acid metabolism. 36 DNA is taken into account as an essential chemical to monitor since RT delivers ionizing radiation to the tumor, leading to DNA damage, particularly DNA double-strand breaks. 37 This library also includes glutathione and mannose. Glutathione is the most abundant antioxidant and non-protein thiol present in mammalian tissues due to cellular redox homeostasis (the maintenance of many cellular processes including signalings, etc.), which can implicate tumour initiation, progression, and treatment response. 38,39 Mannose is a sugar monomer carbohydrate participating in glycosylation (the reaction in which a carbohydrate is attached to a hydroxyl or other functional group). 40 Abnormal growth factor signalling in cancer development can be attributed to specific types of glycosylation. 41 A 20-chemical extended library was also constructed to exploit GBR-NMF's ability to constrain chemical bases of interest (Table S2). For example, amino acids or proteins were difficult to trace in previous studies by unsupervised dimension reduction methods, [12][13][14][15][16]33 although they have significant roles in cancer metabolism. Two amino acids of interest (valine and leucine) were selected for inclusion in the 20-basis library due to their importance in metabolic pathways related to amino acids in breast cancer. 42 They are branched-chain amino acids that serve as alternative sources of organic molecules that can also serve as indirect substrates for the TCA cycle. 43 Collagen is the major fibrous protein in the extracellular matrix and connective tissue. 44 However, given the Raman spectral data used in the current study are cellular, it is only included in the 20-chemical extended library to test the effect on the model. A triglyceride mixture was also added to the extended library to test whether the analytical methods could distinguish chemicals with similar spectra, such as triglycerides and glyceryl tripalmitoleate. The triglyceride mixture contains glyceryl tridecanoate, glyceryl tridodecanoate, glyceryl trimyristate, glyceryl trioctanoate, and tripalmitin, which all have similar RS spectra. 32 The RS spectrum of glyceryl tripalmitoleate resembles the triglyceride mixture spectrum with minor differences at 1266 and 1656 cm �1 . 32 All chemicals other than the triglycerides mixture were supplied by Sigma Aldrich, USA. The triglycerides mixture is a certified reference material of lipid standards purchased from Supelco (Bellefonte, USA).

Cell Cultures
All cell samples were cultured as described previously by Meksiarun et al. 12 Four types of human epithelial breast tumour cells (MCF-7, BT-474, MDA-MB-231, and SK-BR-3) were cultured in DMEM medium supplemented with 10% fetal bovine serum (FBS, Hyclone). MCF10A, the normal breast cells were cultured in MEBM medium supplemented with MEGM SingleQuots (Lonza, Basel, Switzerland). Five percent CO 2 was supplied to the cells in an incubator, and the temperature was maintained at 37.8 � C. The cells were harvested and plated into 12 identical flasks as equivalent aliquots of cell suspension. The flask cultures were incubated for 96 h to reach approximately 50% confluency. Fresh media were replaced in the cultures 1 h before irradiation. All flasks were removed from the incubator for less than 40 min before irradiation.

Cell Irradiation and Harvesting
For irradiation, flasks were placed between two 5 cm-thick pieces of water equivalent plastic solid water (Gammex RMI, USA) in order to provide electron buildup at the layer of the cells. 12 Irradiation was delivered using a 6 MV photon beam from a Varian 21EX linear accelerator (Varian Medical Systems Inc., USA) with a dose rate of 6 Gy min -1 at the isocenter with single fractions of 0, 10, 30, or 50 Gy. Each dose was delivered to three flasks.
At 18, 42, and 66 h post-irradiation, one flask from each dose was harvested for RS acquisition. The harvested cells were washed with phosphate-buffered saline (PBS, Hyclone) and then re-suspended in PBS plus 10% FBS. The suspension was centrifuged into a pellet (without isolating cells) and placed onto a 5 mm-thick MgF 2 disk (Janos Technology Inc., USA) based on a previously established protocol. 14

Raman Spectroscopy Acquisition (Breast Cells and Reference Biochemicals) and Pre-Processing of Spectra
The Raman spectra of cells were acquired based on a previously developed protocol. 33,[12][13][14] After the cell pellet was placed onto a 5 mm-thick MgF 2 disk, cells were user-identified and chosen for ease of Raman sampling (not overly splayed out, good sampling volume). Raman spectra of cells were acquired with an inVia Raman microscope (Renishaw Inc., USA) with a 100� dry objective (NA ¼ 0.9) (Leica Microsystems, Germany), a 600 lines per mm diffraction grating, a thermoelectrically cooled iDus charge-coupled device (CCD) detector (Andor Technology, UK). The acquisition parameters were 10 s acquisition time per cell and a 425-1820 cm � 1 spectral window. A 785 nm laser (Renishaw-based diode laser) was used for excitation, providing a laser power density at the sample of �0.5 mW/mm �3 . The size of the sampling volume was � 2 � 5 � 10 mm (aligned with the centre of the selected cell), allowing a single acquisition to represent the spectrum of a single cell (� 10 mm diameter). The laser focal spot on the Renishaw system is rectangular in shape due to the nature of the laser assembly (Renishaw-based 785 nm diode laser). Twenty spectra from single cells were collected from each cell line, resulting in 240 Raman spectra acquired over four doses (0, 10, 30, and 50 Gy) and three harvesting times (18,42, and 66 h), which accumulated in a total of 1200 acquired spectra.
Spectra of chemicals presented in Tables S1 and S2 (liquid or solid forms) were acquired using a Renishaw inVia Raman microscope (Renishaw Inc.) with a 100� dry objective (NA ¼ 0.9) (Leica Microsystems, Germany), a 1200 lines/mm diffraction grating, a 10-s exposure time, and a 785 nm laser (Renishaw). The spectra of all 20 reference chemicals are presented in Fig. S1. The reference chemical basis spectra were acquired at a higher spectral resolution in order to distinguish any overlapping features. However, all spectra were interpolated onto the same wavenumber axis (resolution) prior to performing our analysis. No adverse effects of the interpolation or higher spectral resolution of the basis spectra were observed in the analysis.
Each cell spectrum was processed with a cosmic ray removal program from WiRE (Renishaw Inc.). All the reference chemical and cosmic-ray-removed cellular spectra were processed with baseline subtraction (estimate and subtract the baseline due to the substrate and biological fluorescence) and normalization under the curve (equal to 1) in Matlab (The Mathworks, Inc.).

Data Analysis: GBR-NMF-RF Data Analytical Framework
Non-Negative Matrix Factorization and Group and Basis Restricted Non-Negative Matrix Factorization. NMF decomposes the non-negative data matrix X 2 R n�p �0 (where n is the number of samples and p is the number of variables in the data matrix) into two lower rank non-negative matrices where q is the number of underlying factors. This is generalized from the true decomposition of X with e 2 R n�p �0 representing the residual error. 19,22 Based on the NMF, group and basis restricted nonnegative matrix factorization (GBR-NMF) variation was proposed by Shreeves et al. 22 GBR-NMF further decomposed the non-negative data matrix X into W, A, and S such that where X is decomposed into an n � q score matrix W, a q � q auxiliary matrix A, and a q � p matrix S containing the (partially) constrained factors. The S matrix is partially constrained as the Raman biochemical library. GBR-NMF then updates sequentially the portions of each matrix of the decomposition not assumed to be constrained; for example, the biochemical bases in the matrix S are not updated during the optimization.
Random Forest and Variable Importance. After conducting the GBR-NMF decomposition on the Raman cellular data, the decomposed chemical scores were classified by random forest (RF) based on the molecular histotype information of the cellular data. RF is a classification method first published by Breiman. 24 RF contains a collection of decision tree classifiers, wherein each tree generates an output of the classification result from the input data. The forest chooses the final classification result based on the most popular vote over all the trees in the forest. Variable importance is a measure used to rank the importance of variables during the RF classification. 45 The out-of-bag (OOB) strategy is commonly used to determine the variable importance. [45][46][47] When building each tree, the bootstrap observations which are not used to construct a tree are saved as out-of-bag (OOB) observations. The prediction error on the out-of-bag portion for each tree of the data is recorded as the OOB error. Each predictor variable (e.g., biochemical basis) will later be permuted to calculate the importance of variables. The mean decrease in accuracy (MDA) is calculated by averaging the total difference of OOB errors before and after the random permutation of a single variable over all the trees.

Workflow of GBR-NMF-RF Framework
A workflow of the GBR-NMF-RF is summarized in Fig. 1. Before performing dimension reduction with GBR-NMF, the RS data set was randomly split into 10 training and testing sets to imitate the case where the model was first trained with prior-knowledge Raman data and then was used to classify new data. Due to the limited number of cells in some categories (e.g., cancer versus non-cancer), 40% of all the data were randomly selected as the testing set each time to make sure both training and testing sets have enough samples from each classification category. A summary of the classification tests and sizes of training and testing sets is presented in Table S3.
After the spectra were pre-processed, GBR-NMF was performed for dimension reduction. The matrix S in the GBR-NMF was constrained as the reference chemical RS libraries. One unconstrained factor is allowed in the matrix S to represent all the other biochemical change residuals unspecified in the Raman biochemical library. The GBR-NMF algorithm was implemented based on Shreeves et al. and available on Github. 22,48 The group constraining option was omitted in the current work since the underlying groups of the data set are not investigated through GBR-NMF. GBR-NMF was performed on the training set and testing set individually while using the same unconstrained basis matrix obtained from the training set decomposition.
After the dimension reduction step, RF was used to classify the molecular histotypes, and cell line types on the GBR-NMF decomposed biochemical scores. RF was performed with the randomForest package in R. 45,49 The number of trees specified was 1000 as default. The biomarker contribution results were obtained from the MDA variable importance function of the package. The decomposed biochemical scores from the training set were used to train the RF classifier, and the decomposed scores of chemical bases from the testing set were used to test the performance of the classifier. Every classification test was repeated 10 times to verify the stability of the model as the training and testing set of the data were selected randomly. Relevant metrics such as averaged accuracy, sensitivity, specificity, and variable importance were calculated based on the 10 repeated trials of the randomly split training and testing sets.
We also performed unsupervised NMF to compare with GBR-NMF. A standard NMF function from the scikit-learn Figure 1. GBR-NMF-RF data analytical framework workflow. After the cellular spectra were processed, the data set was split into 10 training and testing sets pairs. GBR-NMF dimension reduction was performed individually on the training set. The GBR-NMF decomposed chemical scores from the training set were used to random forests (RFs) based on the molecular histotype information of the cellular data. The GBR-NMF decomposed chemical scores from the testing set were used to get the classification performance and variable importance. package in Python was used to perform unsupervised NMF for comparison. 50

Results and Discussion Comparison Between Unsupervised NMF and GBR-NMF Chemical Scores
In the current study, Raman spectra were acquired on five cell lines (MCF10A, MCF-7, MDA-MB-231, BT-474, and SK-BR-3) spanning a range of immunochemical features (Table I) over four doses of radiation (0, 10, 30, and 50 Gy) and three harvesting times (18,42, and 66 h). The GBR-NMF-RF is implemented to analyze the cellular spectra to provide information on biomolecular dynamics while tracking the RT response profiles of molecular histotypes and subtypes. Two RS reference chemical libraries (Tables S1 and S2, and Fig. S1) were used as the constrained chemicals in the GBR-NMF model.
The average spectrum (red spectrum) with AE1 standard deviation (shadow spectrum) for each cell line across doses and days is presented in Fig. 2. For MCF10A, MDA-MB-231, and SK-BR-3, the most outstanding standard deviation occurs at 1436 cm À1 , which can be attributed to lipids/proteins (CH 2 scissoring). 30,31,51 The greatest standard deviation appears at 1438 cm À1 for BT-474, which can be attributed to the change in saturated fatty acids and triglycerides. 32 The greatest standard deviation of MCF-7 spectra is at 482 cm À1 and attributed to glycogen. 14,23,52 Only two chemicals can be identified from the unsupervised NMF decomposed chemical bases, as shown in Fig. 3. These two bases decomposed by unsupervised NMF were similar to two previously identified biochemicals, glycogen, and lipid. 13,14,16 Other than the first two bases shown in Fig. 3, the remaining decomposed bases were difficult to identify.
To compare the performance of the unsupervised NMF and GBR-NMF, the score plots of glycogen extracted from these two methods are presented in Fig. 4. The plots are displayed over different dose levels (0, 10, 30, and 50 Gy) and time post-irradiation (18, 42, and 66 h). The glycogen variation trend in the score plots was almost identical for GBR-NMF and the unsupervised NMF.
An elevated glycogen level in Fig. 4 was observed in MCF-7 and BT-474 cell lines, which belong to ER-positive and PR-positive molecular histotypes. Meksiarun et al. has , the greatest standard deviation occurs at 1436 cm À1 , which can be attributed to lipids/proteins (CH 2 scissoring). 30,31,51 The greatest standard deviation appears at 1438 cm À1 for BT474, which can be attributed to the change in saturated fatty acids and triglycerides. 32 For MCF7, the greatest standard deviation of MCF7 spectra is at 482 cm À1 and attributed to glycogen. 52 reported the same observation analyzed with PCA. 12 Upregulated glycogen levels can be linked to impaired inhibition of glycogen synthase kinase-3 (GSK-3) which regulates glycogen synthesis. 53 GSK-3 has been reported to have a vital role in regulating the transcription factors of hormone receptors (e.g., estrogen and progesterone). 54,55 Classification Performance on Molecular Histotypes, Cancer Versus Non-Cancer Cell Types and Cell Lines The GBR-NMF-RF analytical framework was tested to classify molecular histotypes, cancer versus non-cancer cell types, and cell lines based on the GBR-NMF decomposed biomarker scores with 10 trials (Tables II to IV).
The accuracy was high (>97%) for both 16-chemical and 20-chemical libraries, although the 20-chemical library had a slightly higher accuracy. In Table I, PR and ER histotypes have identical distributions among the cell lines. MCF-7 and BR-474 are PR-positive and ER-positive. MCF10A, MDA-MB-231, and SK-BR-3 are PR-negative and ER-negative. Therefore, ER and PR were combined into one classification test.
Overall, the framework had an outstanding performance in each of the molecular histotype and cancer-or-not classification tests resulting in high accuracy (>97.0%), high sensitivity (>97.0%), and high specificity (>97.0%) for both the 16-chemical (Table II) and 20-chemical libraries (Table III). The 20-chemical library strategy slightly  outperforms the 16-chemical library except forKi67 classification. When comparing the results of 16-chemical and 20-chemical reference libraries, the differences across different metrics were within 2% of the means.
The framework also performed well in the multi-class classification of cell line types, yielding high accuracy (>98.0%) for both the 16-chemical and 20-chemical libraries (Table IV). Again, the 20-chemical library classification average accuracy (98.5%) is slightly improved over the result of the 16-chemical library (98.0%).

Variable Importance of Chemicals and High Lipids Contribution to Histotype Classification
The variable importance plots (i.e., biochemical bases) of 10 trials for each classification test are presented in Fig. 5. The y-axis is the reference chemicals tested in the GBR-NMF-RF analysis framework, and x-axis is the averaged mean decrease accuracy of 10 trials. The variables with larger values of MDA are ranked higher in terms of their importance in the classification. In Fig. 5, different types of lipids including cholesterol, fatty acids (stearic acid, oleic acid, and palmitic acid), phospholipids (phosphatidylserine and phosphatidylcholine), and triglycerides (glyceryl tripalmitoleate and triglycerides mix) were commonly ranked high (top five) across all the molecular histotype classification tests. Glycogen also demonstrated high importance in HER2, PR/ER, and cancer-or-not classification tests. Valine, an amino acid, was only regarded as important to distinguish cell lines as cancer or non-cancer.
Although various lipids were observed as highcontributing chemicals, cholesterol, phosphatidylserine, and stearic acid stood out as the top three chemicals in multiple classifications (HER2, Ki67, and cell line types) in Fig. 5. The GBR-NMF decomposed score plots of cholesterol, phosphatidylserine, and stearic acid are displayed in Fig. 6. The plots are displayed over different dose levels (0, 10, 30, and 50 Gy) and time post-irradiation (18, 42, and 66 h).
Cholesterol is ranked as the most important chemical in the PR/ER classification test in Fig. 5(2). The three PR/ERnegative cell lines (MCF10A, MDA-MB-231, and SK-BR-3) also exhibited high levels of cholesterol content in Fig. 6a. Cholesterol has been directly linked to the implication of the development of estrogen receptor-positive breast cancer via the metabolite 27-hydroxycholesterol. 56-59 27hydroxycholesterol may act as a modulator in the cholesterol metabolism. 60,61 High levels of phosphatidylserine were presented in Ki67-high cell lines (MDA-MB-231, BT-474, and SK-BR-3) in Fig. 6b. The Ki67-low cell lines (MCF10A and MCF-7) have high stearic acid scores in Fig. 6c. Ki67 has been widely acknowledged as a proliferation marker for human tumor cells. 62 Phosphatidylserine was found to increase significantly on the surface of tumour cells and facilitate tumour growth and metastasis in the tumor microenvironment. 63,64 Studies on stearic acid have discovered its role in inhibiting human breast cancer cell proliferation in vitro. [65][66][67][68] Our observations of high variable importance of phosphatidylserine and stearic acid in the Ki67 classification test (Fig. 5 2 ), high phosphatidylserine scores in Ki67high cell lines, and high stearic acid scores in Ki67-low cell lines are congruent with previous studies on phosphatidylserine and stearic acid metabolism in cancer.

General Discussion
From the results presented above, RS combined with the GBR-NMF-RF analytical framework exhibited an outstanding ability to classify breast cell lines into molecular histotypes even after treated by RT. The combination of RS and GBR-NMF-RF provides an efficient approach to track the biochemical profiles of different molecular histotypes throughout radiation therapy treatment. The accuracy, sensitivity, and specificity results across all the molecular histotype classifications were high (>97%) under our novel analytical framework. The GBR-NMF-RF analytical framework produced robust molecular histotype classification results across different radiation doses (0, 10, 30, and 50 Gy) in this breast cell data set. Using the GBR-NMF-RF framework along with RS provides the potential to track molecular histotypes under varied conditions in different radiotherapy plans and has the potential to open up new treatment pathways which could be exploited in order to both improve and personalize RT. The chemicals selected in the reference library were to cover a complete set of molecular compounds based on the cell composition and chemicals involved in cancer metabolism. The original motivation to include two Raman libraries with different sizes was to examine whether the more extensive 20-chemical library would improve the margins of classification accuracy. Although the 20-chemical library did yield slightly higher accuracy, the difference was marginal (<2%) and demonstrates that the 16-chemical library selected captured a vast majority of biological information for the current molecular histotype classification purpose since the differences on the metrics were often less than 2% (in Tables II to IV).
The linkage between Raman biomarkers (e.g., glycogen and lipids (cholesterol, phosphatidylserine, and stearic acid)) and molecular histotypes (PR/ER and Ki67) were uncovered by GBR-NMF generated score plots and variable importance calculated by RF (Figs. 5 and 6). The importance of glycogen and lipids in radiation response and cancer metabolism has been previously reported. [12][13][14][15][16]23 The insights on how glycogen and lipids (cholesterol, phosphatidylserine, and stearic acid) can be linked to PR/ER and Ki67 molecular histotypes and how this relates to RT are reported by RS for the first time to our knowledge, although these metabolites have been studied intensively in the cancer research community. [56][57][58][59][63][64][65][66][67][68] The motivation to add extra chemicals such as leucine and valine into the reference chemical library was to potentially investigate whether RS and the GBR-NMF-RF framework could track changes in protein or amino acids in the breast cell lines. Proteins are difficult to trace because the amino acid composition of proteins can vary greatly. 69 Valine and leucine were selected given their emerging importance in cancer biology. 42 Understanding the requirements for valine, leucine, and other amino acids in the TCA may open up potential ways to target their utilization in combination with radiation therapy. 70 In this study, valine only showed a high contribution to cancer or non-cancer classification. Although tracking complex proteins with RS is currently challenging, possible improvements can still be achieved with advanced data analysis methods to extract relevant information.
Using RS with the novel GBR-NMF-RF analytical framework, the interpretability of the underlying biological information in our RS cellular data set was significantly improved over the previous studies. 33,[13][14][15][16]23 Combining RS and GBR-NMF-RF demonstrates a great potential to track the molecular histotypes and therefore monitor biochemical dynamics in specific tumour environments during radiation therapy treatments. Although a range of breast cancer cell lines were covered in the current study, future studies on other types of breast cancer cell lines or non-breast cancer cell lines should be conducted. Validation of the current results obtained using RS using other techniques such as mass spectrometry is also a promising direction to pursue for future research.

Conclusion
The GBR-NMF-RF analytical framework is capable of monitoring known immunohistochemical profiles of different breast cancer cell lines using Raman spectra collected subsequent to treatment with varying dose levels of ionising radiation. The GBR-NMF-RF framework had classification results for immunohistochemical profiles with high accuracy (>97%), high sensitivity (>97%), and high specificity (>97%). The variable importance function from the random forest also provided information about the high contributions of RS identified biomarkers (e.g., glycogen and lipids (cholesterol, phosphatidylserine, and stearic acid)) in molecular histotype classification tests and how those markers relate to RT. The combination of RS and GBR-NMF-RF gives a robust technique to track the molecular histotypes throughout the radiation therapy treatment.

Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by grant funding from Natural Sciences and Engineering Research Council (NSERC).

Supplemental Material
All supplemental material mentioned in the text is available in the online version of the journal.