Added an article to the datasets

2024-11-30 20:28:19 +03:00
parent d43ccf568f
commit f8c9e6ada5

@@ -159,12 +159,12 @@
The authors of~\cite{tabular} developed a machine learning model to predict cisplatin sensitivity based on gene expression changes induced by cisplatin treatment. They combined gene expression data from sensitive ovarian cancer cell lines and patients with specific signaling alterations to identify a gene signature. Using this signature, they trained TabNet, an interpretable deep learning algorithm for tabular data, to perform binary classification of sensitivity to cisplatin. Several other machine learning algorithms, including Ridge, LASSO, Elastic Net, Nu-Support Vector Classification (Nu-SVC), XGBoost, and Random Forest, were also applied to the same task for comparison with TabNet.
As in~\cite{heterogeneity}, the authors of~\cite{deep} used algorithms from the specialized software Acapella (developed by PerkinElmer~\cite{PerkinElmer}) to extract 624 quantitative image features from cellular images. This information was fed into a deep learning model that identified a continuous 27-dimensional space describing all of the observed cell morphologies. A random forest classifier was then trained on populations of cells labeled as either drug sensitive or drug resistant.
\section{Datasets}
Data plays a crucial role in machine learning, serving as the foundation for model training and evaluation. The quality and quantity of data directly influence the performance and generalizability of machine learning algorithms. In the fields of biology and medicine, data collection is often costly and time-consuming. Additionally, the complexity and variability inherent in biological systems further complicate data acquisition and interpretation. In cancer research, these challenges are even more pronounced due to the heterogeneity of tumors and the intricate nature of cancer biology. However, there are valuable resources available, such as the Gene Expression Omnibus (GEO) database~\cite{geo} and The Cancer Genome Atlas (TCGA) database~\cite{tcga}, which provide researchers with access to extensive datasets. Moreover, nonprofit organizations like the American Type Culture Collection (ATCC)~\cite{atcc} enable researchers to obtain biological materials, including cancer cells.
In articles~\cite{paclitaxel}, \cite{sers} and \cite{cervical} the authors decided to prepare their own datasets specifically for their research.
In articles~\cite{paclitaxel}, \cite{sers}, \cite{cervical} and \cite{deep} the authors decided to prepare their own datasets specifically for their research.
In~\cite{paclitaxel}, four kinds of epithelial ovarian cancer (EOC) cells with different drug sensitivity (SKOV3, SKOV3\_Ta\_2\textmu M, SKOV3\_Ta\_8\textmu M, and SKOV3\_Ta\_20\textmu M) were studied. The SKOV3 cells were sourced from the ATCC~\cite{atcc} and preserved at the Obstetrics and Gynecology Laboratory of Peking University People's Hospital. The drug-resistant characteristics of SKOV3\_Ta\_2\textmu M, SKOV3\_Ta\_8\textmu M, and SKOV3\_Ta\_20\textmu M were acquired by progressively exposing SKOV3 cells to varying concentrations of paclitaxel. After approximately ten months, all the drug-resistant cancer cells were acquired. The authors then used Digital Holographic Flow Cytometry (DHFC), an advanced technology for label-free, high-throughput cell detection. Using DHFC along with additional post-processing, they generated a dataset comprising approximately 3000 quantitative phase images (QPIs) of EOC cells, each sized at 300 by 300 pixels. Fig.~\ref{fig:skov3} presents the reconstructed QPIs of EOC cells with various degrees of drug resistance.
@@ -179,6 +179,8 @@
In~\cite{cervical}, the authors prepared a dataset of 259 samples. They selected 259 patients at the People's Hospital of Gansu Province and the First and Second Hospital of Lanzhou University who were diagnosed with locally advanced cervical cancer (LACC), treated them with neoadjuvant chemotherapy (NACT), and extracted their whole-blood genomic DNA. Then 24 SNPs from the PTEN/PI3K/AKT pathway genes PTEN, PIK3CA, Akt1, and Akt2 were selected. Using one-hot encoding, 70 features were generated from the 24 SNPs in the raw data, resulting in a $259 \times 70$ dataset. Clinical examination, colposcopy, and abdominal computed tomography were used to estimate the change in tumor size in all patients before and after each NACT cycle. In this study, patients with a complete or partial response were assigned to the NACT-effective group, and patients with stable or progressive disease were assigned to the NACT-ineffective group.
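As a hedged illustration of this encoding (the notation $g_j$ is ours and is not taken from~\cite{cervical}): assuming the standard one-hot scheme, in which SNP $j$ is replaced by one binary indicator per genotype category observed for it, the number of feature columns is
\[
  \sum_{j=1}^{24} g_j = 70,
\]
where $g_j$ denotes the number of genotype categories of SNP $j$; the individual values of $g_j$ are not reported here.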
The dataset in~\cite{deep} was based on 12 drug-resistant clones, i.e., populations of cells derived from a single progenitor cell and genetically identical to it, generated from five human cancer cell lines (tongue, lung, breast, and esophageal cancers). The authors used an approach similar to that of~\cite{paclitaxel} to obtain drug-resistant cells. Over a period of six months, clones were made resistant to cetuximab, pertuzumab, or trastuzumab, which are inhibitory antibodies targeting members of the ErbB family of receptor tyrosine kinases. Following this, cells were treated with small interfering RNAs (siRNAs) targeting 536 protein kinases and then with one of 11 different ErbB-inhibiting antibody drugs. In total, the study imaged 848,802,073 cells. From each cell, 624 features were extracted using the Acapella image analysis software, resulting in a dataset containing 529,652,493,552 data points. This approach enabled the comparison of the effects of inhibiting ErbB kinase signaling on cell morphology in drug-sensitive and drug-resistant cancer cell lines.
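As a consistency check on these figures, the number of data points equals the number of imaged cells multiplied by the number of features extracted per cell:
\[
  848{,}802{,}073 \times 624 = 529{,}652{,}493{,}552.
\]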
The authors of~\cite{heterogeneity}, \cite{mitochondria}, \cite{kras}, \cite{glut}, and~\cite{tabular} turned to open databases to prepare datasets for their research. The authors of~\cite{heterogeneity} downloaded frozen histopathologic images of 494 ovarian and 70 paracarcinoma tissues with hematoxylin-eosin (HE) staining from TCGA~\cite{tcga}. The corresponding clinical information and the genomic and transcriptomic profiles required for this study were also obtained from this database. The authors of~\cite{mitochondria} also used TCGA. They obtained information on 183 esophageal cancer patients (95 squamous cell carcinomas and 88 adenocarcinomas), including mRNA expression profiles and clinical features such as survival time and status, age, gender, and pathological stage (T, N, and M). Additionally, the authors used the Gene Expression Omnibus (GEO) database~\cite{geo}, from which RNA sequencing (RNA-seq) data for GSE45670 were downloaded. GSE45670 includes 17 esophageal squamous cell carcinoma (ESCC) samples that did not respond to preoperative CRT, 11 ESCC samples that responded to preoperative CRT, and 10 samples of normal esophageal epithelium. The GEO dataset GSE53625 comprises 358 samples: 179 ESCC tissue samples and an equal number of adjacent normal tissue samples, along with detailed clinical data for the 179 ESCC patients. The GEO dataset GSE19417 contains data from 76 esophageal adenocarcinoma patients, with detailed clinical data for 48 of them. The authors of~\cite{kras} also took gene expression profile data from the GEO database, specifically from accession number GSE137912. Their analysis involved 7612 samples treated with KRAS G12C inhibitors. Of these samples, 4297 were tumor cells that continued to proliferate, whereas 3315 were tumor cells that had ceased proliferating. Each sample contained expression values for 8687 genes.
In~\cite{glut}, the authors used datasets from both TCGA and GEO as well as from the European Genome-Phenome Archive (EGA)~\cite{ega}. The authors of~\cite{tabular} used GSE47856, GSE15622, and GSE146965 from the GEO database and RNA-seq data from TCGA.
In~\cite{platinum}, the authors prepared their own dataset and also used open databases. In this study, 4D data-independent acquisition (DIA) proteomic sequencing was performed on tissue-derived extracellular vesicles (tsEVs) obtained from 58 platinum-sensitive and 30 platinum-resistant patients with EOC. The authors also used the GSE15372, GSE33482, GSE26712, and GSE63885 microarray datasets from the Gene Expression Omnibus database~\cite{geo}. GSE15372 and GSE33482 are EOC cell line-derived RNA microarray datasets: GSE15372 comprises 5 platinum-sensitive and 5 platinum-resistant cell line samples, and GSE33482 comprises 6 platinum-sensitive and 6 platinum-resistant cell line samples. GSE26712 and GSE63885 contain clinical and sequencing data for 195 and 101 EOC patients, respectively. Additionally, transcriptomic sequencing data and clinical information from the tumor tissues of 379 patients with EOC, sourced from the TCGA database~\cite{tcga}, were used.