Sakamoto, Tetsu
Lacerda, Lucas de Freitas
2024-11-08
2024-11-08
2024-09-04
LACERDA, Lucas de Freitas. Desenvolvimento de pipeline para análise de SNPs otimizados para identificação de espécies e seus híbridos: um estudo de caso em Sapajus (Primates). Advisor: Dr. Tetsu Sakamoto. 2024. 62 leaves. Dissertation (Master's in Bioinformatics) - Universidade Federal do Rio Grande do Norte, Natal, 2024.
https://repositorio.ufrn.br/handle/123456789/60594
The anthropogenic pressures on the remnants of the Atlantic Forest along the Brazilian coast affect the conservation status of the species that make up its fauna, including Neotropical primates. To conserve the threatened primates of the Northeast, the National Center for Research and Conservation of Brazilian Primates (CPB/ICMBio) coordinates the National Action Plan for the Conservation of Northeastern Primates (PAN-PRINE). One of the target species is the blond capuchin (Sapajus flavius), categorized as Endangered. To contribute to the implementation of PAN-PRINE actions, the present study aimed to analyze the genetic structure of Sapajus individuals from both wild and captive populations and to propose a panel of diagnostic SNPs for identifying two parental species (S. flavius and S. libidinosus) and their hybrids, using machine learning techniques. Two population structure analyses were performed: an exploratory one, involving several species of the genus and captive samples (n=228), and a specific one, with captive samples (n=52) and natural populations (n=127) of S. flavius and S. libidinosus, including natural hybrids between the species. Based on the exploratory analysis, eight captive samples that did not show the ancestry pattern expected for hybridization between the species of interest were removed from the dataset. Of the remaining samples, 30 were classified as hybrids, 14 as S. libidinosus, and 8 as S.
flavius, based on the ancestry coefficient threshold established to assign a sample to a species (Q>90%). These samples, along with the wild samples, were partitioned into a validation set (20%) and a training and testing set (80%), the latter split 70%/30% between training and testing. Six supervised learning algorithms were used to train predictive models: k-Nearest Neighbors (kNN), Decision Tree (DT), Naive Bayes (NVB), Support Vector Machine (SVM), Extreme Gradient Boosting (XGB), and Random Forest (RF), followed by feature selection over the initial set of features (n=2484), which in this case are SNPs. All models were trained using K-fold cross-validation (K=5). Sets of 15, 30, and 45 features were selected through forward feature selection. The RF, SVM, and NVB models consistently ranked highest as the number of features increased, based on accuracy scores in the validation set, with RF yielding the best results for larger numbers of SNPs. When the SNP sets selected by the models were ranked according to the best clustering generated by an unsupervised methodology, XGB and kNN were the top-ranked models based on the Rand Score (RS). None of the variants with a high capacity for group identification were located in coding regions of the genome; most were found in intergenic regions (n=20) and intronic regions, which may belong to different splicing variants of genes (n_vars=24, n_genes=119). From the initial set of 2484 SNPs, we were able to reduce the dimensionality of our data while retaining highly informative variants for group differentiation. Additionally, we found that most of these variants do not affect coding regions but are highly associated with species differentiation.
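As an illustration of the forward feature selection with K-fold cross-validation (K=5) described above, here is a minimal sketch in pure Python. It is not the pipeline used in the dissertation: the 1-nearest-neighbour classifier, the contiguous unshuffled folds, the 0/1/2 genotype coding, the toy dataset and all function names are assumptions made only for this example.

```python
from typing import List, Tuple


def kfold_indices(n_samples: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Split sample indices into k contiguous (train, test) folds, without shuffling."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds


def predict_1nn(X, y, train, query, features):
    """Label of the training sample closest to `query` (Manhattan distance on `features`)."""
    def dist(i):
        return sum(abs(X[i][f] - X[query][f]) for f in features)
    return y[min(train, key=dist)]


def cv_accuracy(X, y, features, k=5):
    """Fraction of samples classified correctly when each fold is held out once."""
    correct = 0
    for train, test in kfold_indices(len(X), k):
        correct += sum(predict_1nn(X, y, train, q, features) == y[q] for q in test)
    return correct / len(X)


def forward_select(X, y, n_select, k=5):
    """Greedy forward selection: repeatedly add the SNP that maximizes CV accuracy."""
    selected, best_score = [], 0.0
    remaining = list(range(len(X[0])))
    while len(selected) < n_select:
        scores = [(cv_accuracy(X, y, selected + [f], k), f) for f in remaining]
        # max() keeps the first best entry, so ties go to the lowest SNP index
        best_score, best_f = max(scores, key=lambda s: s[0])
        selected.append(best_f)
        remaining.remove(best_f)
    return selected, best_score


# Toy data (hypothetical): 30 samples, 6 SNPs coded as 0/1/2 allele counts.
# Only SNP index 2 is diagnostic; the other columns are unrelated to the class.
labels = ["flavius", "libidinosus", "hybrid"]
diagnostic = {"flavius": 0, "libidinosus": 2, "hybrid": 1}
y = [labels[i % 3] for i in range(30)]
X = [[diagnostic[y[i]] if j == 2 else (i // 3 + j) % 3 for j in range(6)]
     for i in range(30)]

selected, score = forward_select(X, y, n_select=2)
print(selected, score)  # the diagnostic SNP (index 2) is selected first
```

In this sketch the held-out validation set is omitted for brevity; following the partition described above, the selected SNP set would be scored once more on samples that took no part in the selection.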
These results are important for developing a product that can serve as a tool for National Action Plans for the Conservation of endangered species and for management decisions that take into account the genetic profile of the populations and species studied, enabling better-targeted conservation measures.
Open Access
Hybridization
Genetic markers
Diagnostic SNPs
Machine Learning
Conservation
Desenvolvimento de pipeline para análise de SNPs otimizados para identificação de espécies e seus híbridos: um estudo de caso em Sapajus (Primates) [Development of a pipeline for the analysis of SNPs optimized for the identification of species and their hybrids: a case study in Sapajus (Primates)]
masterThesis
CNPQ::CIENCIAS BIOLOGICAS