Item: Categorical embedding with Deep Learning (ΕΛ.ΜΕ.ΠΑ., School of Engineering (ΣΜΗΧ), Department of Electrical and Computer Engineering, 2021-02-04) Giannakos, Iakovos; Γιαννακός, Ιάκωβος

This study was conducted in the framework of the dissertation "Categorical embedding with deep learning". The purpose of the dissertation is to study and implement a word-embedding neural network for genomic data: a network consisting of three layers, namely the input layer, the hidden layer, and the output layer. These layers are interconnected by weights, which also form the word embeddings. The selected neural-network architecture falls in the Natural Language Processing (NLP) category. NLP is a research field that investigates how a computer can process and extract knowledge from text or dialogue in a natural language. The model implemented in this dissertation is the Continuous Bag of Words (CBOW), a model that accepts as input a set of contexts, i.e. windows of word indices corresponding to a text. Each context covers several words, the number of which is defined by the developer, and has a target word together with a table of the words in the text that correspond to that context. The network is trained under the assumption that each context lies close to the words that are its targets. The aim is to train the CBOW neural network and form word embeddings using known mutations of a human as input. Before training, the network requires input data. Our data come from the human genome via the Ensembl Variant Effect Predictor (VEP). Our main objective is to obtain all human mutations (about 80 million) and train a model that handles each mutation as a word and each disease as a context. VEP is a tool for annotating, evaluating and prioritizing genomic mutations, even in non-coding regions.
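A CBOW training step of the kind described above can be sketched as follows. This is a minimal NumPy illustration with a toy vocabulary size and embedding dimension (both hypothetical), not the thesis implementation; the input-to-hidden weight matrix `W_in` plays the role of the word embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

V, D = 10, 4          # toy vocabulary size and embedding dimension (hypothetical)
W_in = rng.normal(scale=0.1, size=(V, D))   # input->hidden weights = word embeddings
W_out = rng.normal(scale=0.1, size=(D, V))  # hidden->output weights

def cbow_step(context_ids, target_id, lr=0.1):
    """One CBOW step: average the context embeddings, predict the
    target word with a softmax, and back-propagate the error."""
    global W_in, W_out
    h = W_in[context_ids].mean(axis=0)       # hidden layer activation, shape (D,)
    scores = h @ W_out                       # output scores over the vocabulary, (V,)
    p = np.exp(scores - scores.max())
    p /= p.sum()                             # softmax probabilities
    loss = -np.log(p[target_id])             # cross-entropy loss for the target word
    dscores = p.copy()                       # gradient of the loss w.r.t. scores
    dscores[target_id] -= 1.0
    W_out -= lr * np.outer(h, dscores)       # update hidden->output weights
    dh = W_out @ dscores                     # gradient flowing into the hidden layer
    W_in[context_ids] -= lr * dh / len(context_ids)  # update the context embeddings
    return loss

# repeatedly train on one toy (context, target) pair; the loss should decrease
losses = [cbow_step([1, 2, 4, 5], 3) for _ in range(50)]
print(losses[0] > losses[-1])  # → True
```

After enough such steps over real contexts, the rows of `W_in` are the embeddings that the thesis extracts for downstream analysis.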
VEP predicts the effects of sequence mutations on transcripts, protein products, regulatory regions, and binding patterns, utilizing the high quality, wide scope, and comprehensive design of the Ensembl databases with high accuracy. Next, we pass the variants/mutations to a Python script that selects input features based on specific criteria described in the chapters Experiment 1 (subsection Data) and Experiment 2 (subsection Data). After selecting the data, we form the context list and a target context for each single-nucleotide polymorphism (SNP) variant. The CBOW model is then trained with these variant contexts, and after some epochs the embeddings (the weights between the input layer and the hidden layer) are formed. We extract these weights from the network and pass them to Principal Component Analysis (PCA) to visualize them as a scatter plot. PCA is a dimensionality-reduction method often used on large data sets; it transforms a large set of variables into a smaller one that still contains most of the information of the original set. Finally, cosine similarity was used. Cosine similarity is a measurement from information retrieval and can be applied to two corpora (a paragraph, a sentence, or a whole corpus): the higher the similarity score between two term vectors, the greater the relevance between text and query. Taking an SNP as a sample and passing it through cosine similarity, we can find the SNPs closest to it, which we expect to be the most similar, so it is possible that those mutations affect our sample. We applied this methodology to three experiments. The first one was the representation and clustering of human chromosome 22 variants, in which we attempt to find relevance between random SNPs and verify it.
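The embedding extraction, PCA projection, and cosine-similarity lookup can be sketched as below. The embedding matrix here is random placeholder data standing in for the input-to-hidden weights extracted from the trained network, and all shapes and names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
# placeholder embedding matrix: one row per SNP, standing in for the
# weights extracted from the trained network's input->hidden layer
embeddings = rng.normal(size=(100, 16))

# 2-D PCA projection of the embeddings for a scatter plot, via SVD of
# the mean-centred data (equivalent to projecting onto the top 2 PCs)
centred = embeddings - embeddings.mean(axis=0)
U, S, Vt = np.linalg.svd(centred, full_matrices=False)
coords_2d = centred @ Vt[:2].T   # shape (100, 2)

def nearest_snps(query_id, embeddings, k=5):
    """Rank all other SNPs by cosine similarity to the query SNP."""
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = E @ E[query_id]           # cosine similarity of every row to the query
    order = np.argsort(-sims)        # most similar first
    return [i for i in order if i != query_id][:k]

neighbours = nearest_snps(7, embeddings)
print(len(neighbours))  # → 5
```

In the pipeline described above, the nearest SNPs returned this way are the candidates expected to behave most similarly to the query mutation.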
Due to the large amount of chromosome data and the processing time, it was hard to obtain the best possible results. We therefore moved to the second and third experiments with less data, targeted at a disease, specifically known cancer variants and possible cancer variants. The results of the model are promising, and we believe that such a methodology could be used in the genomics era.