Advisor: Bezerra, Leonardo Cesar Teonacio
Author: Andrade Junior, José Estevam de
Dates: accessioned 2021-04-29; available 2021-10-06; issued 2021-04-27
Citation: ANDRADE JUNIOR, José Estevam de. Comparando embeddings contextuais no problema de busca de similaridade semântica em português. 2021. 50 f. Trabalho de Conclusão de Curso (Graduação em Engenharia de Computação) - Centro de Tecnologia, Universidade Federal do Rio Grande do Norte, Natal, 2021.
URI: https://repositorio.ufrn.br/handle/123456789/43620

Abstract: Semantic textual similarity (STS) is a natural language processing problem that aims to measure how semantically similar two sentences are. The problem has gained considerable attention both in industry, through the development of several textual recommendation systems, and in academia, mainly through shared tasks such as those proposed by the International Workshop on Semantic Evaluation (SemEval). Although SemEval has contributed to the growth of work in this area, the literature still lacks studies focused on STS for the Portuguese language. To that end, the ASSIN and ASSIN 2 workshops created shared tasks for semantic similarity search in Portuguese, providing datasets that were used to evaluate models during the events. More recently, a BERT model pre-trained in Portuguese and fine-tuned for STS established the state of the art on those datasets. This work compares the performance of BERT and Sentence-BERT (SBERT) contextual embeddings on the datasets created for the ASSIN and ASSIN 2 workshops. The BERT models were pre-trained in Portuguese, with (ptBERTft) and without (ptBERT) fine-tuning for STS. The SBERT model, in turn, was pre-trained on a multilingual corpus (mSBERT), initially without fine-tuning. The results of this comparison show that the embeddings produced by the SBERT models were competitive, surpassing the results of ptBERT as well as the results observed during the ASSIN and ASSIN 2 shared tasks. In fact, the result of mSBERT was second only to that of ptBERTft. In the second part of our investigation, we fine-tuned the multilingual SBERT models for the proposed problems. The results of this step vary by dataset. For ASSIN 2, fine-tuning made the SBERT models competitive with ptBERTft while requiring fewer computational resources. For ASSIN, by contrast, the performance gain obtained by fine-tuning was not enough to match the performance of ptBERTft.

Keywords: Deep learning; Natural language processing; Semantic textual similarity; Word embeddings
Title: Comparando embeddings contextuais no problema de busca de similaridade semântica em português
Type: Bachelor's thesis (bachelorThesis)
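As a minimal sketch of the comparison the abstract describes, the snippet below scores a Portuguese sentence pair with a multilingual SBERT bi-encoder and cosine similarity, using the sentence-transformers library. The checkpoint name is an assumption: this record does not identify the exact mSBERT variant evaluated in the thesis.

```python
from sentence_transformers import SentenceTransformer, util

# Assumed checkpoint: any multilingual SBERT model works here; the record
# does not say which mSBERT variant the thesis actually used.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

sentences = [
    "O gato está dormindo no sofá.",  # "The cat is sleeping on the couch."
    "Um felino descansa no sofá.",    # "A feline is resting on the couch."
]

# Encode each sentence independently (the SBERT bi-encoder setup), then
# score semantic similarity as the cosine of the two embedding vectors.
embeddings = model.encode(sentences, convert_to_tensor=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Cosine similarity: {score:.3f}")
```

For the fine-tuning stage mentioned in the abstract, a typical sentence-transformers setup would train such a bi-encoder on the ASSIN gold similarity scores with a regression objective such as losses.CosineSimilarityLoss; the exact training configuration used in the thesis is not given in this record.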