Result: Analiza proteinskih nizova iz CoViD-a 19 ; Analysis of protein sequences from CoViD 19

Title:
Analiza proteinskih nizova iz CoViD-a 19 ; Analysis of protein sequences from CoViD 19
Authors:
Contributors:
Goldstein, Pavle
Publisher Information:
Sveučilište u Zagrebu. Prirodoslovno-matematički fakultet. Matematički odsjek.
University of Zagreb. Faculty of Science. Department of Mathematics.
Publication Year:
2021
Collection:
Croatian Digital Theses Repository (National and University Library in Zagreb)
Document Type:
Dissertation master thesis
File Description:
application/pdf
Language:
Croatian
Rights:
http://rightsstatements.org/vocab/InC/1.0/ ; info:eu-repo/semantics/openAccess
Accession Number:
edsbas.35F36F4
Database:
BASE

Further information

The topic of this thesis is the analysis of protein sequences from the coronavirus, specifically the E, M, N, and S proteins. Statistical analysis and machine learning techniques are applied to multiple sequence alignments of these proteins in an attempt to identify dominant mutations. The thesis first introduces the mathematical concepts needed to follow the rest of the work. It then describes the structure of the data on which the analyses were performed and prepares the data for clustering. Finally, one clustering technique (the k-means++ algorithm) is applied and the results are analyzed. Each coronavirus protein is analyzed separately, and the positions that contribute most to the resulting clustering are identified. The thesis was implemented mostly in Python; R was also used, and Tableau was used to visualize the results.
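The pipeline outlined in the abstract (encode aligned sequences, cluster them with k-means++, then look for the alignment positions that separate the clusters) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the thesis's method: the abstract does not say how sequences were encoded or how positions were ranked, so the one-hot encoding, the centre-spread score, the toy alignment, and all function names are hypothetical.

```python
import random

# Assumed alphabet: the 20 standard amino acids plus the alignment gap symbol.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def one_hot(seq):
    """Encode one aligned sequence as a flat 0/1 vector of length len(seq) * len(ALPHABET)."""
    vec = []
    for ch in seq:
        vec.extend(1.0 if ch == a else 0.0 for a in ALPHABET)
    return vec


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: each further centre is drawn with probability
    proportional to its squared distance from the nearest centre chosen so far.
    Assumes the data contain at least k distinct points."""
    centres = [rng.choice(points)]
    while len(centres) < k:
        d2 = [min(sq_dist(p, c) for c in centres) for p in points]
        centres.append(rng.choices(points, weights=d2, k=1)[0])
    return centres


def kmeans(points, k, seed=0, iters=50):
    """Lloyd iterations started from k-means++ seeds; returns (labels, centres)."""
    rng = random.Random(seed)
    centres = kmeans_pp_init(points, k, rng)
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centres[j])) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster is empty
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centres


def position_scores(centres, n_positions):
    """Hypothetical score per alignment column: how far the cluster centres
    spread apart inside that column's one-hot block."""
    width = len(ALPHABET)
    scores = []
    for pos in range(n_positions):
        total = 0.0
        for i in range(pos * width, (pos + 1) * width):
            values = [c[i] for c in centres]
            mean = sum(values) / len(values)
            total += sum((v - mean) ** 2 for v in values)
        scores.append(total)
    return scores


# Toy alignment (invented): two variants that differ only at column index 1 (K vs R).
alignment = ["MKTAY", "MKTAY", "MKTAY", "MRTAY", "MRTAY", "MRTAY"]
points = [one_hot(s) for s in alignment]
labels, centres = kmeans(points, k=2)
scores = position_scores(centres, len(alignment[0]))
top_position = max(range(len(scores)), key=scores.__getitem__)
print(labels, top_position)
```

On this toy input the two variants fall into separate clusters and the only column with a non-zero score is the one where they differ. Real alignments would need a choice of k and would likely be handled with a library implementation of k-means++ rather than this hand-written loop.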