diff --git a/report.tex b/report.tex index 949c564..1e683cb 100644 --- a/report.tex +++ b/report.tex @@ -162,7 +162,9 @@ \section{Datasets} Data plays a crucial role in machine learning, serving as the foundation for model training and evaluation. The quality and quantity of data directly influence the performance and generalizability of machine learning algorithms. In the fields of biology and medicine, data collection is often costly and time-consuming. Additionally, the complexity and variability inherent in biological systems further complicate data acquisition and interpretation. In cancer research, these challenges are even more pronounced due to the heterogeneity of tumors and the intricate nature of cancer biology. However, there are valuable resources available, such as the Gene Expression Omnibus (GEO) database~\cite{geo} and The Cancer Genome Atlas (TCGA) database~\cite{tcga}, which provide researchers with access to extensive datasets. Moreover, nonprofit organizations like the American Type Culture Collection (ATCC)~\cite{atcc} enable researchers to obtain biological materials, including cancer cells. - Authors of~\cite{paclitaxel} prepared their dataset specifically for their research. Four kinds of epithelial ovarian cancer cells with different drug sensitivity (SKOV3, SKOV3\_Ta\_2\textmu M, SKOV3\_Ta\_8\textmu M, and SKOV3\_Ta\_20\textmu M) were studied in this work. The SKOV3 cells were sourced from the ATCC~\cite{atcc} and preserved at the Obstetrics and Gynecology Laboratory of Peking University People’s Hospital. The drug-resistant characteristics of SKOV3\_Ta\_2\textmu M, SKOV3\_Ta\_8\textmu M, and SKOV3\_Ta\_20\textmu M were acquired by progressively exposing SKOV3 cells to varying concentrations of paclitaxel. After approximately ten months, all the drug-resistant cancer cells were acquired. They then utilized Digital Holographic Flow Cytometry (DHFC), an advanced technology for label-free, high-throughput cell detection. 
Using DHFC along with additional post-processing, the authors generated a dataset comprising approximately 3000 a quantitative phase images (QPIs) of EOC cells, each sized at 300 by 300 pixels. Fig.~\ref{fig:skov3} presents the reconstructed QPIs of EOC cells with various degrees of drug resistance. + The authors of~\cite{paclitaxel}, \cite{sers}, and~\cite{cervical} prepared their own datasets specifically for their research. + + In~\cite{paclitaxel}, four kinds of epithelial ovarian cancer (EOC) cells with different drug sensitivity (SKOV3, SKOV3\_Ta\_2\textmu M, SKOV3\_Ta\_8\textmu M, and SKOV3\_Ta\_20\textmu M) were studied. The SKOV3 cells were sourced from the ATCC~\cite{atcc} and preserved at the Obstetrics and Gynecology Laboratory of Peking University People’s Hospital. The drug-resistant characteristics of SKOV3\_Ta\_2\textmu M, SKOV3\_Ta\_8\textmu M, and SKOV3\_Ta\_20\textmu M were acquired by progressively exposing SKOV3 cells to varying concentrations of paclitaxel. All the drug-resistant cancer cells were obtained after approximately ten months. The authors then used Digital Holographic Flow Cytometry (DHFC), an advanced technology for label-free, high-throughput cell detection. Using DHFC along with additional post-processing, they generated a dataset comprising approximately 3000 quantitative phase images (QPIs) of EOC cells, each $300 \times 300$ pixels in size. Fig.~\ref{fig:skov3} presents the reconstructed QPIs of EOC cells with various degrees of drug resistance. \begin{figure}[h] \centering @@ -171,8 +173,9 @@ \label{fig:skov3} \end{figure} - In article~\cite{sers}, same as in~\cite{paclitaxel}, authors choosed the approach of collecting their own dataset. Their dataset was based on clinical plasma samples from 60 healthy volunteers which were used as a control group, and 60 nasopharyngeal cancer patients (30 plasma samples from radiotherapy sensitivity patients and 30 plasma samples from radiotherapy resistance patients). All plasma samples were 
All plasma samples were - obtained from Fujian Provincial Cancer Hospital. As well as in~\cite{paclitaxel}, authors used unique method called surface enhanced Raman spectroscopy (SERS) to extract molecular profiles of patients plasma. Authors even claim that SERS based on surface plasmon resonance was used for this task for the first time. The SERS spectra were processed by deducting the fluorescence background signal using a fifth-order polynomial fitting method, and then the SERS signals were peak normalized, after which the spectra of the same plasma sample were averaged to represent the final SERS data for that sample. + The dataset in~\cite{sers} was based on clinical plasma samples from 60 healthy volunteers which were used as a control group, and 60 nasopharyngeal cancer patients (30 plasma samples from radiotherapy sensitivity patients and 30 plasma samples from radiotherapy resistance patients). All plasma samples were obtained from Fujian Provincial Cancer Hospital. As well as in~\cite{paclitaxel}, authors used unique method called surface enhanced Raman spectroscopy (SERS) to extract molecular profiles of patients plasma. Authors even claim that SERS based on surface plasmon resonance was used for this task for the first time. The SERS spectra were processed by deducting the fluorescence background signal using a fifth-order polynomial fitting method, and then the SERS signals were peak normalized, after which the spectra of the same plasma sample were averaged to represent the final SERS data for that sample. + + In~\cite{cervical}, authors prepared dataset with 259 samples. They choosed 259 patients at the People’s Hospital of Gansu Province and the First and Second Hospital of Lanzhou University who were diagnosted with locally advanced cervical cancer (LACC), applied neoadjuvant chemotherapy (NACT) to them and extracted their whole blood genomic DNA. After that 24 SNPs from PTEN/PI3K/AKT pathway: PTEN, PIK3CA, Akt1, and Akt2 were selected. 
In total, 70 features were generated from the 24 SNPs in the raw data using one-hot encoding, resulting in a $259 \times 70$ dataset. Clinical examination, colposcopy, and abdominal computed tomography were used to estimate the change in tumor size in all patients before and after each NACT cycle. In this study, patients with a complete or partial response were classified as the NACT-effective group, and patients with stable or progressive disease were classified as the NACT-ineffective group. Authors of articles~\cite{heterogeneity}, \cite{mitochondria}, \cite{kras}, \cite{glut} and~\cite{tabular} turned to open databases to prepare datasets for their research. Authors of~\cite{heterogeneity} downloaded frozen histopathologic images of 494 ovarian and 70 paracarcinoma tissues with hematoxylin–eosin (HE) staining from TCGA~\cite{tcga}. The corresponding clinical information, genomics, and transcriptomics profiles required for this study were also obtained from this database. Authors of~\cite{mitochondria} also used TCGA. They downloaded information on 183 esophageal cancer patients (95 squamous cell carcinomas and 88 adenocarcinomas) was obtained, including mRNA expression profiles, clinical features such as survival time and status, age, gender, and pathological stage (T, N, and M). Additionally authors used Gene Expression Omnibus (GEO) database~\cite{geo}. RNA sequencing (RNA-seq) for GSE45670 was downloaded from it. GSE45670 includes a total of 17 esophageal squamous cell carcinomas (ESCC) that did not respond to preoperative CRT, 11 ESCC that responded to preoperative CRT, and 10 samples from normal esophageal epithelium. The GEO dataset GSE53625 comprises 358 samples, including 179 ESCC tissue samples and an equal number of samples of adjacent normal tissues, along with detailed clinical data for the 179 ESCC patients. The GEO dataset GSE19417 contains data from 76 esophageal adenocarcinoma patients, offering detailed clinical data for 48 of these patients. 
Authors of~\cite{kras} also took gene expression profile data from GEO database, specifically from accession number GSE137912. Their analysis involved 7612 samples treated with KRAS G12C inhibitors. Among these samples, 4297 were tumor cells that persisted in proliferation, whereas 3315 were tumor cells that had ceased proliferating. Each sample contained the expression of 8687 genes. In~\cite{glut}, authors used datasets from both TCGA and GEO and also from European Genome-Phenome Archive (EGA)~\cite{ega}. Authors of~\cite{tabular} used GSE47856, GSE15622 and GSE146965 from the GEO database and RNAseq data from TCGA.