Souza, Gustavo Antonio de
Machado, Karla Cristina Tabosa
2018-10-10
2018-10-10
2018-07-27
MACHADO, Karla Cristina Tabosa. Desenvolvimento de abordagens computacionais para proteogenômica de procarioto. 2018. 77 leaves. Dissertation (Master's in Bioinformatics) - Instituto Metrópole Digital, Universidade Federal do Rio Grande do Norte, Natal, 2018.
https://repositorio.ufrn.br/jspui/handle/123456789/26028
The development of next-generation sequencers caused a revolution in genomic research, and the complete genomic information of thousands of bacterial strains is now available. Similar technological breakthroughs in sensitivity and throughput also occurred in protein analysis by mass spectrometry (MS) over the last decade. Peptide sequence characterization in proteomics samples can be used to validate genomic regions as coding, a research field known as proteogenomics. The proteogenomic approach is applied by constructing customized protein sequence databases against which peptide sequence data collected by MS are searched. The probabilistic nature of peptide identification by MS and the limitations in constructing precise protein databases have been relevant bottlenecks in developing approaches for analyzing samples that contain proteins from a bacterial community. Developing these approaches becomes increasingly critical given the importance of characterizing biomes of clinical, environmental and industrial relevance. Because peptide identification depends on the quality and accuracy of the protein databases, this work aims to develop a computational strategy that builds customized protein sequence databases by processing and analyzing protein sequence data from several strains of the same bacterial species. To construct the databases, the approach first aligns the protein sequences of the bacterial strains. It then identifies and compares homologous and uniquely annotated proteins in all strains.
Finally, it reports those sequences in a non-redundant manner: sequences extensively repeated among annotations are reported only once, to keep the database size under control. The databases also report sequence variations, whether they result from genetic variations or annotation divergences, which are usually disregarded in databases used in proteomic analysis. Using mass spectrometry data collected from 8 clinical strains of Mycobacterium tuberculosis, the protein identification performance of two sequence databases was assessed: one including all proteins from 65 sequenced strains, and one constructed with this approach from the same 65 strains. The approach reduced computational time, while the number of identifications obtained in the two searches was practically identical. In addition, databases were created for 10 bacterial species, each with at least 65 characterized strains. These databases were assessed according to the characteristics relevant to probability-based protein identification in proteomics. Alongside the databases, a log file was created in which each observation regarding the presence of homologs, sequence differences, modification type and presence in strains is described. Analysis of the databases created with this approach showed that, as expected, the increase in database complexity correlates with the pangenomic complexity of the bacterial species. However, Mycobacterium tuberculosis and Bordetella pertussis generated very complex databases despite having low pangenomic complexity or no pangenome at all, respectively. This indicates that differences in gene annotation between strains of those species are higher than average.
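The non-redundant reporting step described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: it assumes exact sequence identity as the only redundancy criterion (no alignment or homology detection), and the function names `build_nonredundant_db` and `to_fasta`, the header layout, and the toy sequences are all hypothetical.

```python
from collections import defaultdict


def build_nonredundant_db(strain_proteomes):
    """Collapse identical protein sequences annotated in several strains.

    strain_proteomes: dict mapping strain name -> {protein_id: sequence}.
    Returns a dict mapping each distinct sequence to the sorted list of
    (strain, protein_id) pairs in which it was annotated.
    Assumption: redundancy means exact sequence identity only.
    """
    seen = defaultdict(list)
    for strain, proteins in strain_proteomes.items():
        for protein_id, sequence in proteins.items():
            seen[sequence.upper()].append((strain, protein_id))
    return {seq: sorted(sources) for seq, sources in seen.items()}


def to_fasta(nonredundant):
    """Write each distinct sequence once, listing its strains in the header."""
    lines = []
    for index, (seq, sources) in enumerate(sorted(nonredundant.items()), start=1):
        strains = ",".join(sorted({strain for strain, _ in sources}))
        lines.append(f">entry{index} strains={strains}")  # hypothetical header layout
        lines.append(seq)
    return "\n".join(lines)


# Toy data (invented for illustration): two strains sharing one identical
# protein and carrying one single-residue variant each.
example = {
    "strainA": {"A1": "MKTLLV", "A2": "MSTNPK"},
    "strainB": {"B1": "MKTLLV", "B2": "MSTNPR"},
}
db = build_nonredundant_db(example)
fasta = to_fasta(db)
```

In this toy case the shared sequence is stored once with both strains listed, while the two variant sequences are kept as separate entries, which mirrors the idea of reporting sequence variations without repeating identical entries.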
It was also demonstrated that this strategy can be used to create databases containing sequences from multiple species, in order to perform metaproteomic analyses of MS data.
Open Access
Proteomics; Mass spectrometry; Databases; Proteins; Bacteria
Desenvolvimento de abordagens computacionais para proteogenômica de procarioto
Development of a computational approach for proteogenomics of prokaryotes
masterThesis
CNPQ::CIENCIAS BIOLOGICAS: BIOINFORMÁTICA