UNIVERSIDADE FEDERAL DO RIO GRANDE DO NORTE
CENTRO DE CIÊNCIAS EXATAS E DA TERRA
INSTITUTO DE QUÍMICA
PROGRAMA DE PÓS-GRADUAÇÃO EM QUÍMICA

Multivariate Classification and Fourier-Transform Mid-Infrared Spectroscopy (FT-MIR) in Cancer Prostate Tissue

Laurinda Fernanda Saldanha Siqueira

Doctoral Thesis
Natal/RN, January 2017

Thesis submitted to the Chemistry Postgraduate Program of the Federal University of Rio Grande do Norte (PPGQ/UFRN) in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Chemistry.

Coordination for the Improvement of Higher Education Personnel (CAPES)

Advisor: Prof. Dr. Kássio Michel Gomes de Lima

Natal, RN
2017

Universidade Federal do Rio Grande do Norte - UFRN
Sistema de Bibliotecas - SISBI
Cataloging-in-Publication. UFRN - Biblioteca Setorial do Instituto de Química - IQ

Siqueira, Laurinda Fernanda Saldanha.
Multivariate classification and Fourier-Transform Mid-Infrared Spectroscopy (FT-MIR) in cancer prostate tissue / Laurinda Fernanda Saldanha Siqueira. - 2017.
131 leaves: ill.
Thesis (Doctorate) - Universidade Federal do Rio Grande do Norte, Centro de Ciências Exatas e da Terra, Programa de Pós-Graduação em Química, Natal, 2017.
Advisor: Prof. Dr. Kássio Michel Gomes de Lima.
1. Spectral analysis - Thesis. 2. Prostate neoplasms - Thesis. 3. Fourier-transform infrared spectroscopy - Thesis. 4. Multivariate classification - Thesis. 5. Analytical chemistry - Thesis. I. Lima, Kássio Michel Gomes de. II. Title.
RN/UF/BS-IQ CDU 543.42(043.2)

Approved January 30, 2017

Examination committee:
Kássio Michel Gomes de Lima, PhD – UFRN (Advisor)
Edvan Cirino Da Silva, PhD – UFPB
Luciano Farias de Almeida, PhD – UFPB
Edgar Perin Moraes, PhD – UFRN
Tatiana de Campos Bicudo, PhD – UFRN

To my Family, of blood and by choice.
To millions of people with cancer: we'll win.

ACKNOWLEDGMENTS

Completing this thesis was a complex, long and (self-)exploratory journey of professional and personal maturation. It came to fruition thanks to the people who gave me the support I needed over these years.

First of all, I thank my advisor, Kássio Gomes de Lima (PPGQ/UFRN), for his motivation, enthusiasm, and academic and personal support, for being an example, and for the opportunity to leave my personal mark on the Chemistry Postgraduate Program and on our research group: the first to publish two review papers; the first to publish in the journal considered to have the second-highest impact factor in Analytical Chemistry (Trends in Analytical Chemistry, TrAC, Impact Factor: 7.487); the first to publish in this kind of journal; the first to publish in the journal Chemometrics and Intelligent Laboratory Systems; and, above all, the first to publish a thesis in which chemistry talks openly with medicine.
I hope that my legacy will be followed and improved upon by our group and postgraduate program.

I would like to acknowledge the colleagues of the Biological Chemistry and Chemometrics group (QBQ/UFRN) for their inspiring efforts, especially Ana Carolina and Camilo Lelis. I would also like to acknowledge Mr. Godoy of Bruker Inc. for his collaboration, the Federal Institute of Education, Science and Technology of Maranhão (IFMA) for encouraging my professional development, and the Coordination for the Improvement of Higher Education Personnel (CAPES) for financial support. I will never forget to thank my first advisor, Mariano Ibañez Rojas (IFMA/UFMA), for showing me that research is the way.

I am grateful to my family in São Luís (MA, Brazil) for their support from afar, especially Francisca Siqueira and João Silveira, for listening to me and for simply being with me in the most decisive, painful and hard moments. I am grateful to my family in Natal (RN, Brazil) for their support close at hand, especially Maynara Costa, for looking after me. These thanks extend to my closest friends; they know it. You have loved me and supported me along this journey. You have no idea how much it meant and still means.

To all those who doubted, who told me that I could not, that I would not and that I should not: their resistance made me stronger, made me insist even more, made me the fighter, made me the woman that I am today. Thank you.

And of course, I would like to thank the guy up there with a magnifying glass in his hand.

Primeiramente, Jamais Temer (First of all, never fear).

"If life gives you lemonade, make lemons. She’ll insane" (Phil Dunphin)

“My job is making windows where there were once walls.” (Michel Foucault)

“Veni, vidi, vici.” (Julius Caesar)

CURRICULUM VITAE

Formal Education/Degree
2012 – Present: Ph.D. in progress in Chemistry – UFRN (Natal, RN)
2008 – 2010: Specialization in Statistics – UEMA (São Luís, MA)
2003 – 2007: Undergraduate degree in Chemistry – IFMA (São Luís, MA)

Professional Experience
2012 – Present: IFMA (Barreirinhas, MA), Analytical Chemistry Professor
2010 – 2012: UFMA (São Luís, MA), Analytical Chemistry Professor
2010 – 2010: UNB (Brasília, DF), Trainee Activities in Analytical Chemistry
2008 – 2012: IFMA (São Luís, MA), Analytical Chemistry Professor

Research Projects
2012 – Present: Biospectroscopy and chemometrics in cancer studies. Financial support: CAPES. Coordinator: Prof. Dr. Kássio Michel Gomes de Lima, IQ/UFRN (Natal, RN)
2012 – 2013: Analysis of the water quality of lagoons of tourist interest in the Lençóis Maranhenses National Park, Barreirinhas, MA. Financial support: CNPQ/IFMA. Coordinator: Prof. Laurinda Fernanda Saldanha Siqueira, IFMA (São Luís, MA)
2008 – 2012: Analysis of time series, indicators of sustainability and water quality of beaches on the Island of Maranhão, Brazil. Financial support: CAPES. Coordinator: Prof. Dr. Mariano Oscar Anibal Ibanez Rojas, DEOLI/UFMA (São Luís, MA)
2005 – 2008: Analysis of micro- and macronutrients and heavy metals in water bodies and pisciculture nurseries. Financial support: CNPQ/IFMA/UFMA and BASA. Advisor: Prof. Dr. Mariano Oscar Anibal Ibanez Rojas, DQ/IFMA (São Luís, MA), DEOLI/UFMA (São Luís, MA)

Principal Bibliographical Production
SIQUEIRA, L. F. S. & LIMA, K. M. G. A decade (2004 - 2014) of FTIR prostate cancer spectroscopy studies: an overview of recent advancements. TrAC. Trends in Analytical Chemistry (Regular ed.), v. 82, p. 208-221, 2016.
SIQUEIRA, L. F. S. & LIMA, K. M. G. MIR-biospectroscopy coupled with chemometrics in cancer studies. Analyst (London. 1877. Print), v. 141, p. 4833-4847, 2016.
SIQUEIRA, L. F. S., ARAÚJO JR, R. F., ARAÚJO, A. A., MORAIS, C. L. M., LIMA, K. M. G. LDA vs. QDA for FT-MIR prostate cancer tissue classification.
Chemometrics and Intelligent Laboratory Systems, 162, 123–129, 2017.

Projects Reviewer
2016 – Present: Federal Institute of Education, Science and Technology of Pernambuco (IFPE)
2016 – Present: Federal Institute of Education, Science and Technology Fluminense (IFF)
2015 – Present: Federal Institute of Education, Science and Technology of Maranhão (IFMA)

Reviewer
2016 – Present: Revista de Engenharias (FSMA)

ABSTRACT

MULTIVARIATE CLASSIFICATION AND FOURIER-TRANSFORM MID-INFRARED SPECTROSCOPY (FT-MIR) IN CANCER PROSTATE TISSUE

This thesis is a theoretical and practical contribution to the differentiation of prostate cancer stages through multivariate classification applied to MIR spectra from human tissues. The aims were to identify spectral differences between prostate cancer stages, to determine potential biochemical markers responsible for the differentiation, and to compare the performance of multivariate classification models built from prostate tissue samples previously classified as Gleason II, III and IV. In a first study, PCA-LDA, SPA-LDA and GA-LDA models were built to develop a methodology for discriminating prostate cancer stages based on the Gleason grading criteria versus the 'Low and High grade' categorization, and to identify potential spectral markers. The performances of the models were compared. GA-LDA produced the most satisfactory results and performed best under the 'Low and High grade' categorization, with a correct classification rate of 83% and sensitivity and specificity values of 100% and 80%, respectively. In a second study, the performances of PCA-LDA/QDA and GA-LDA/QDA were compared in the classification of 'Low and High grades' of prostate cancer, considering a linear or quadratic character in the differentiation. The QDA models obtained better results than the LDA models, and the variable selection method (GA) performed better than the variable reduction method (PCA).
GA-QDA obtained the best performance, with classification rates for calibration and prediction samples of 97% and 100%, respectively, and sensitivity and specificity of 75% and 100%, respectively. In a third study, independent SVM models (linear, polynomial, RBF and quadratic SVMs) and the PCA-SVM, SPA-SVM and GA-SVM algorithms were applied in order to evaluate the use of variable reduction and selection methods in a nonlinear approach for screening 'Low and High grades' of prostate cancer. The independent SVM models performed worse than the others. The best model was GA-SVM, with 100% and 90% of the 'Low grade' calibration and prediction samples correctly classified, respectively, and sensitivity and specificity of 90%. The potential spectral biomarkers identified by the studies were attributed to the regions of amides I, II, III and proteins (≈1,591–1,483 cm⁻¹), DNA and RNA (≈1,000–1,490 cm⁻¹) and protein phosphorylation (≈970 cm⁻¹). The variation in intensity was more pronounced in the 'High grade' spectra. Changes in these regions may indicate metabolic changes caused by cancer progression. The proposed methods showed potentially better performance than traditional diagnostic methods. The results showed that multivariate classification combined with FT-MIR can differentiate pathological states of tissues, mainly in the early stages of cancer ('Low grade'), with objectivity, speed, accuracy, easy procedures, independence from intra- and inter-observer variability, and high sensitivity and specificity, in comparison with traditional techniques, which are operator-dependent, subject to high intra- and inter-observer variability, time-consuming and difficult to prepare, and which show lower sensitivity and specificity. In addition, the methodologies proposed here may bring economic and social benefits through early diagnosis and treatment, allowing improvements in the quality of life and survival of patients.

Keywords: Cancer. Multivariate Classification.
FT-MIR.

ABBREVIATIONS

ANN: Artificial Neural Networks
ATR: Attenuated Total Reflectance
BPH: Benign Prostatic Hyperplasia
CIN1: Cervical Intraepithelial Neoplasia, grade 1
DA: Discriminant Analysis
DF: Discriminant Function
EMSC: Extended Multiplicative Scatter Correction
FCM: Fuzzy C-Means Clustering
FFPE: Formalin-Fixation and Paraffin-Embedding
FN: False negative
FP: False positive
FPA: Focal Plane Array
FSD: Fourier Self-Deconvolution
FTIR: Fourier-Transform Infrared Spectroscopy
FTIR-PAS: FTIR Photoacoustic Spectroscopy
FT-MIR: Fourier-Transform Mid-Infrared Spectroscopy
GA: Genetic Algorithm
GA-LDA: Genetic Algorithm-Linear Discriminant Analysis
GA-QDA: Genetic Algorithm-Quadratic Discriminant Analysis
GS: Gleason Score
H&E: Hematoxylin and eosin
HCA: Hierarchical Cluster Analysis
IR: Infrared
KMC: K-means clustering
LD1: Linear Discriminant score
LDA: Linear Discriminant Analysis
LHS: Left-hand shoulder
LNCaP: Lymph node metastasis
LR-: Negative Likelihood Ratio
LR+: Positive Likelihood Ratio
MCT: Mercury–Cadmium–Telluride
MIR: Mid-Infrared Region
MS: Mass Spectrometry
MSC: Multiplicative Scatter Correction
N/C: Nucleus-to-Cytoplasm Ratio
NIRS: Near-Infrared Spectroscopy
NPV: Negative Predictive Value
PC-3: Bone marrow metastasis
PCA: Principal Component Analysis
PCa: Prostate Cancer
PCA-DA: Principal Component Analysis-Discriminant Analysis
PCA-LDA: Principal Component Analysis-Linear Discriminant Analysis
PCA-QDA: Principal Component Analysis-Quadratic Discriminant Analysis
PCs: Principal Components
PLS: Partial Least Squares
PLS-DA: Partial Least Squares-Discriminant Analysis
PNT2-C2: Non-malignant normal prostate epithelial cells
PPV: Positive Predictive Value
PROG: Cytology that progressed to high-grade disease
QCL: Quantum-Cascade Laser
QDA: Quadratic Discriminant Analysis
REG: Cytology that regressed after 1 year
RHS: Right-hand side
RMieS: Resonant Mie scattering
RMieS-EMSC: Resonant Mie Scattering-Extended Multiplicative Signal Correction Algorithm
SENS: Sensitivity
SNR: Signal-to-Noise Ratio
SNV: Standard Normal Variate
SPA: Successive Projections Algorithm
SPA-LDA: Successive Projections Algorithm-Linear Discriminant Analysis
SPEC: Specificity
SVM: Support Vector Machines
TMA: Tissue microarrays
TN: True negative
TNM: Tumour/Node/Metastases
TP: True positive
WHO: World Health Organization
YOU: Youden’s index
νasPO2⁻: Asymmetric phosphate stretching vibrations
νsPO2⁻: Symmetric phosphate stretching vibrations

PREFACE

This thesis was developed through a partnership between the Institute of Chemistry (IQ), the Department of Pathology (DPAT), the Department of Morphology (DMOR) and the Department of Biophysics and Pharmacology (DBF) of the Federal University of Rio Grande do Norte (UFRN), and the Liga Norte Riograndense contra o Câncer (Centro Avançado de Oncologia – CECAN). All experiments were performed in compliance with the relevant laws and institutional guidelines (Res. CNS n. 466/2012; CEP n. 030/0030/2006), and the institutional committees of the Liga Norte-Riograndense Contra o Câncer, Brazil, approved this research (n. 030/0030/2006).

Contributions to this study: Kássio Michel G. Lima (IQ/UFRN) and the master's student Camilo Lelis M. Morais (PPGQ/IQ/UFRN) provided guidance on the computational and experimental analysis; Prof. Raimundo F. Araújo Júnior (DPAT/DMOR/UFRN) and Aurigena Antunes de Araújo (DBF/UFRN) provided the biological samples for spectral analysis; the pharmacy student Melyna Souto was responsible for sample treatment; and I, Laurinda Siqueira, prepared the samples for spectral analysis, performed the spectral acquisition and data pre-processing, built the multivariate classification models and wrote the first draft of the manuscript.

The academic biographies of the author and co-authors are listed below.

Laurinda Fernanda Saldanha Siqueira.
Received her BSc in chemistry from the Federal Institute of Education, Science and Technology of Maranhão State (Brazil) in 2008, and became a specialist in statistics at the State University of Maranhão (Brazil) in 2010. She is currently a PhD student in the Biological Chemistry and Chemometrics group (UFRN), working on biospectroscopy and classification techniques in prostate cancer studies. She is also an assistant professor of analytical chemistry at the Federal Institute of Education, Science and Technology of Maranhão State (http://lattes.cnpq.br/3968320118706677).

Kássio Michel Gomes de Lima. Is an assistant professor of analytical chemistry at the Institute of Chemistry at UFRN (Brazil). He received his PhD in sciences (2007) from UNICAMP (Brazil) and was a postdoctoral fellow (July 2013–July 2014) at CSIC (Barcelona, Spain) in the Environmental Chemometrics Group (led by Dr. Romà Tauler). He also worked as a visiting researcher at Lancaster University (2014), Centre for Biophotonics (led by Prof. Francis L. Martin). He has been the head of the Biological Chemistry and Chemometrics group (UFRN) since 2011, and his current research interests are multivariate calibration and classification techniques for analytical and biological systems (http://lattes.cnpq.br/6928918856031880).

Raimundo Fernandes Araújo Júnior. Completed a postdoctoral fellowship in experimental oncology at USP (2008–2010). His area of study is the biological application of pharmaceutical drugs associated with synthetic and natural nanoparticles in in vitro and in vivo models of inflammation and cancer, observing signaling pathways and the oxidative stress system through protein-gene expression, and the analysis of tumor progression in human cancers (http://lattes.cnpq.br/1903940945895093).

Aurigena Antunes de Araujo. Is an Associate Professor I at UFRN. Received her bachelor's degree in Dentistry, a master's degree in Social Dentistry and a PhD in the Postgraduate Program in Health Sciences from UFRN. Currently has experience in the Pharmacology field, with emphasis on Pharmacokinetics, Pharmacoepidemiology and Experimental Pharmacology (http://lattes.cnpq.br/3531154240424211).

Camilo de Lelis Medeiros de Morais. Received his bachelor's degree in chemistry from UFRN. He is currently a master's student in chemistry at PPGQ/UFRN, working in the Biological Chemistry and Chemometrics group (QBQ/UFRN), where he researches digital systems and electronic devices, and multivariate analysis (http://lattes.cnpq.br/7832928791745545).

Melyna Soares de Souto. Is currently a pharmacy student, with experience in morphology (http://lattes.cnpq.br/8709166372799086).

CONTENTS

CHAPTER 1 GENERAL INTRODUCTION

CHAPTER 2 MIR-biospectroscopy coupled with chemometrics in cancer studies. Laurinda F. S. Siqueira, Kássio M. G. Lima. Analyst, 2016, 141, 4833-4847.

CHAPTER 3 A decade (2004 – 2014) of FTIR prostate cancer spectroscopy studies: an overview of recent advancements. Laurinda F. S. Siqueira, Kássio M. G. Lima. Trends in Analytical Chemistry, 2016, 82, 208–221.

CHAPTER 4 A comparison of multivariate analysis and variable selection methods to prostate cancer classification from FT-MIR biomedical spectroscopy data. Laurinda F. S. Siqueira, Raimundo F. Araújo Júnior, Aurigena Antunes de Araújo, Camilo L.M. Morais, Kássio M.G. Lima. Article submitted to Analytical Methods. Manuscript number: AY-COM-09-2016-002461.

CHAPTER 5 LDA vs. QDA for FT-MIR prostate cancer tissue classification. Laurinda F. S. Siqueira, Raimundo F. Araújo Júnior, Aurigena Antunes de Araújo, Camilo L.M. Morais, Kássio M.G. Lima. Chemometrics and Intelligent Laboratory Systems, 2017, 162, 123–129.

CHAPTER 6 SVM for FT-MIR prostate cancer classification. Laurinda F. S. Siqueira, Camilo L.M. Morais, Kássio M.G. Lima.
Article submitted to Scientific Reports – Nature.

CHAPTER 7 CONCLUSIONS AND PERSPECTIVES

CHAPTER 1 – GENERAL INTRODUCTION

1. INTRODUCTION AND MOTIVATIONS
2. MAIN OBJECTIVES
3. THESIS OUTLINE
4. METHODOLOGICAL PROCEDURES
REFERENCES

1. INTRODUCTION AND MOTIVATIONS

This research provides theoretical and practical support for differentiating prostate cancer stages by multivariate classification applied to mid-infrared spectra obtained from human tissues. Multivariate classification combined with Fourier-Transform Mid-Infrared Spectroscopy (FT-MIR) is presented here as a complement and/or alternative to traditional methods of cancer screening and diagnosis.

1.1 Prostate cancer

Prostate cancer is the second most common cancer in the world’s male population, with projections of around 1 million new cases per year [1]. In Brazil, prostate cancer is also the second most common cancer among men in all regions, with an estimated 61,200 new cases in 2016, which corresponds to an estimated risk of 61.82 new cases per 100,000 men [2]. The highest incidence of this cancer is found in men over 65 years old; however, an increase can be observed in the other age groups, related to changes in the socioeconomic context and to improvements in the quality of information systems. The mortality rate also shows a rising trend [1,2]. The risk factors are age, family history and race/ethnicity (footnote 1). In addition, changes in lifestyle, nutrition, smoking, alcoholism and comorbidities (such as diabetes, obesity and others) are also reported [3].

Footnote 1: “Race/ethnicity” is an internationally adopted criterion. In Brazil, the term “color and race/or ethnicity” is adopted. Color and race represent individual identity, while ethnicity corresponds to individual culture.

Prostate cancer diagnosis is based on histopathological examination of tissue samples obtained by biopsy, after detection of abnormalities by Digital Rectal Examination (DRE) and/or measurement of Prostate Specific Antigen (PSA) and Prostatic Acid Phosphatase (PAP) levels. DRE allows palpation of only the lateral and posterior portions, leaving out 40-50% of the prostate, which results in sensitivity and specificity values of about 21-37% and 71-91%, respectively. When DRE is combined with PSA measurement, sensitivity and specificity increase to 51-68% and 92-94%, respectively. However, PSA levels can be influenced by other factors, such as prostatitis, ejaculation and acute urinary retention, which may lead to unnecessary biopsies [4–8].

The Gleason grading system is the gold-standard staging method for prostate cancer. Originally established in 1960-1970, this system was changed significantly after two international consensus conferences (in 2005 and 2014) promoted by the International Society of Urological Pathology (ISUP). The changes aimed to increase the objectivity of the diagnosis and to reduce the high intra- and inter-observer variability [9]. The Gleason grading varies from grade 1, the least aggressive and with the best prognosis, to grade 5, the most aggressive and with the worst prognosis [7–11].

Despite the evolution of the Gleason system, the early stages of prostate cancer are still difficult to identify, detect and diagnose. Moreover, this system is based on visual pattern-recognition criteria, which are operator-dependent and subject to high intra- and inter-observer variability. In addition, its procedures are difficult and time-consuming [12].

The motivation of this research is built on the estimates of prostate cancer incidence and on the limitations of the reference methods and techniques: the need for new technologies for early and rapid diagnosis that are, at the same time, low in operational cost, accurate, effective, objective, easy to perform and independent of intra- and inter-observer variability.
Early and rapid detection of prostate cancer implies a better prognosis and a greater chance of cure. At least 10 years of survival is reached in 98% of the cases in which early detection occurs, against only 46% when detection occurs late (metastatic cancer) [2,5,6,13]. The demand on Brazil’s Unified Health System (SUS), added to the complex procedures and the long time elapsed between the initial evaluation by exams, the biopsy and the anatomopathological analysis in prostate cancer detection by traditional methods, can lead to a late diagnosis, even considering the maximum period of 60 days to start cancer treatment specified in Brazilian Law n. 12,732/2012 [14].

Early detection also involves less aggressive and less mutilating treatments (such as videolaparoscopic surgery) and reduces the high costs of treating advanced and/or metastatic cancer [5,6]. According to the SUS Table of Procedures, Medications and Orthoses, Prostheses and Special Materials [15], the treatment cost for the early stages of prostate cancer (Fig. 1.1A) varies from 2 to 4.5 thousand reais per patient, against 7 to 9.5 thousand reais for the intermediate stages (Fig. 1.1B) and 7.5 to 15 thousand reais for the advanced stages (Fig. 1.1C), disregarding materials and hospitalization expenses. Thus, treatments for cancer in advanced stages can be almost eight times more expensive than for cancer in early stages. For 2016, costs of around 6 billion reais for cancer treatments by the Brazilian Ministry of Health are estimated [2]. This higher cost of advanced-stage treatments is found globally [16,17]. Expenses exceeding $100 billion for cancer treatments are projected worldwide for 2016 [18].

Fig. 1.1 – (1) Proposed routine. (2) Brazilian diagnostic and therapeutic guidelines for prostate cancer, according to the Brazilian Ministry of Health [19].

It is in this scenario that multivariate classification combined with Fourier-Transform Mid-Infrared Spectroscopy (FT-MIR) is a potential tool, since it permits the detection of biochemical alterations even in the early stages of cancer [20–22]. This opens possibilities for early diagnosis, rather than relying on visual criteria to evaluate morphological alterations, as traditional methods do. The routine for prostate cancer classification proposed in this work (Fig. 1.1) focuses on the speed and objectivity of the combination of multivariate classification and FT-MIR, in agreement with the related literature [23–26]. Comparatively, FT-MIR simplifies sample treatment and procedures, speeding up the diagnosis. Based on the vibrational spectral information of a given sample, it is possible to differentiate the pathological stages of tissues. Moreover, this methodology is independent of intra- and inter-observer variability and has been demonstrating high sensitivity and specificity, as well as reliable results [27–30]. It is expected that FT-MIR and multivariate classification may produce beneficial economic and social impacts derived from the detection and treatment of the early stages of cancer, consequently improving the quality of life and survival rate of patients.

1.2 Multivariate classification and FT-MIR

Chemometrics is “the science of relating measurements made on a chemical system or process to the state of the system via application of mathematical or statistical methods” (p. 408) [31], generally applied in (i) data sampling, (ii) data pre-processing, (iii) experimental design, (iv) pattern recognition, (v) multivariate classification, (vi) signal processing, (vii) multivariate calibration, (viii) image treatment, and others.
In view of these applications, multivariate classification combined with FT-MIR appears as a tool in several studies on the exploratory analysis and categorization of spectral data derived from biological samples, particularly in cancer studies aimed at diagnosis and classification [32,33].

Data pre-processing consists of the “manipulation of raw data prior to a specified data analysis treatment” (p. 408) [31]. Pre-processing is the first step for the effective application of multivariate classification to MIR spectral data, improving the robustness and accuracy of the subsequent methods. Pre-processing tests can be grouped by their goals into (i) quality tests, generally applied for optimization; (ii) baseline correction and spectral filtering tests, generally applied to correct and remove the baseline, de-noise the data and smooth the signals; and (iii) normalization tests, generally applied to reduce distortions.

In the work reported here, Extended Multiplicative Scatter Correction (EMSC), Savitzky–Golay smoothing and normalization to the amide I peak were applied. EMSC corrects scattering effects and interferences, includes normalization, and allows the separation and quantification of many kinds of chemical and physical variations in vibrational spectra [34]. Savitzky–Golay smoothing is used mainly to remove spectral noise without degrading spectral information; it is based on a least-squares fit of a polynomial of adequate degree within a sliding window of finite width [35,36]. Normalization to the amide I peak (1,650 cm⁻¹) shifts and scales all spectra so that they have the same absorbance intensity at the amide I peak, reducing distortions and highlighting spectral differences [24]. Together and in this order of application, these tests remove useless information, optimize the application of the subsequent methods, and improve the visual interpretation and reliability of the results.

Sample selection is the second step before building the multivariate classification models.
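The smoothing and normalization steps described above can be sketched in a few lines. This is a minimal illustration using NumPy only, not the code used in the thesis: the function names `savgol_smooth` and `normalize_to_amide_i`, the window width, the polynomial order and the synthetic spectra are all assumptions made for the example, and EMSC is left out because it needs a reference spectrum and a model of the interferents.

```python
import numpy as np


def savgol_smooth(spectrum, window=9, order=2):
    """Savitzky-Golay smoothing: least-squares polynomial fit in a sliding window."""
    half = window // 2
    offsets = np.arange(-half, half + 1, dtype=float)
    # Design matrix of the local polynomial: columns are offsets**0 ... offsets**order.
    design = np.vander(offsets, order + 1, increasing=True)
    # Row 0 of the pseudo-inverse holds the weights that give the value of the
    # fitted polynomial at the window centre (offset 0).
    weights = np.linalg.pinv(design)[0]
    # Repeat the edge values so that the output keeps the input length.
    padded = np.pad(spectrum, half, mode="edge")
    return np.convolve(padded, weights[::-1], mode="valid")


def normalize_to_amide_i(wavenumbers, spectra, peak=1650.0):
    """Shift each spectrum to a zero minimum, then scale it to 1 at the amide I peak."""
    idx = int(np.argmin(np.abs(wavenumbers - peak)))
    shifted = spectra - spectra.min(axis=1, keepdims=True)
    return shifted / shifted[:, [idx]]


# Synthetic data (illustrative only): three noisy "spectra" with an amide I band.
rng = np.random.default_rng(0)
wavenumbers = np.linspace(900.0, 1800.0, 451)
band = np.exp(-0.5 * ((wavenumbers - 1650.0) / 30.0) ** 2)
raw = np.array([a * band + b for a, b in [(1.0, 0.1), (2.0, 0.3), (0.5, 0.0)]])
raw += rng.normal(scale=0.01, size=raw.shape)

# Apply the steps in the order given in the text: smoothing, then normalization.
smoothed = np.array([savgol_smooth(s) for s in raw])
processed = normalize_to_amide_i(wavenumbers, smoothed)
```

After these two steps every row of `processed` has the same intensity at the amide I position, so the remaining differences between spectra reflect band shapes and relative intensities rather than overall offset or scale.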
Indeed, the success of the models is directly related to sample selection. The Kennard-Stone algorithm is a very popular method for this purpose. 37 The samples are divided into calibration (training) and test sets. The models are built on training samples, which should be representative and have as much variability as possible, so that unknown samples are also covered. Therefore, most samples (60-80%) are assigned to calibration. Test samples allow the models built from the training samples to be applied and validated. 3,38

Variable reduction and selection methods are generally combined with classification methods for exploratory and grouping purposes, respectively. They improve the robustness, specificity and sensitivity of the models built, preserve as much useful information as possible, reduce redundancy and collinearity, eliminate potential interferences, increase the signal-to-noise ratio and optimize the models, among other benefits. 33 Although they rely on distinct processes, these methods tend to indicate similar and/or complementary variables (wavenumbers), depending on the goals of the analysis.

Principal Component Analysis (PCA) is a popular method of variable and dimensionality reduction based on a sequence of linear combinations of the variables with the largest covariance. These independent combinations are the Principal Components (PCs); the data matrix is represented as the product of Scores (the coordinates of the samples on each PC) and Loadings (the weights of each variable in the linear combination that defines the respective PC). Each PC explains more variance than the next one throughout the iterative process of variable reduction. Thus, the choice of the number of PCs can lead to information redundancy (too many PCs) or insufficiency (too few PCs). 26,39–42

Successive Projections Algorithm (SPA) and Genetic Algorithms (GA) are two methods that select variables from the original data.
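The Kennard-Stone selection described above can be sketched in a few lines of Python. This is only an illustrative version under stated assumptions: it uses Euclidean distances between spectra, the function name `kennard_stone` is hypothetical, and the tiny one-variable data set stands in for real spectra. The rule starts from the two most distant samples and then repeatedly adds the sample that is farthest from its nearest already-selected neighbour.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Return the indices of n_select samples chosen by the Kennard-Stone rule."""
    X = np.asarray(X, dtype=float)
    n_samples = X.shape[0]
    if not 2 <= n_select <= n_samples:
        raise ValueError("n_select must be between 2 and the number of samples")
    # Pairwise Euclidean distances between all samples (rows of X).
    sq = (X ** 2).sum(axis=1)
    dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))
    # Start with the two samples that are farthest apart.
    first, second = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(first), int(second)]
    remaining = [i for i in range(n_samples) if i not in selected]
    while len(selected) < n_select:
        # Distance from each remaining sample to its nearest selected sample.
        nearest = dist[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(nearest))]
        selected.append(pick)
        remaining.remove(pick)
    return selected

# Toy data: four "spectra" with a single variable each.
X = np.array([[0.0], [1.0], [2.0], [10.0]])
calibration = kennard_stone(X, 3)
test_set = [i for i in range(len(X)) if i not in calibration]
```

In practice `n_select` would be set to roughly 60-80% of the samples, in line with the proportion mentioned above, and the samples left over would form the test set.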
SPA selects the most important variables through a series of vector projections, in which new variables are added to an initial variable until a number N of variables is reached. 3,22,33,38,43,44 GA selects the best variables through stochastic and heuristic modelling inspired by evolutionary theory, using selection, recombination/crossover and mutation operators. Variables are selected according to the lowest prediction and validation errors and a fitness criterion. 12,22,38,44–46

Finally, Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA) and Support Vector Machine (SVM) as classification methods have the advantages of objectivity, speed, accuracy and simple procedures, among others; they may be applied as exploratory and/or predictive methods, and they separate the classes according to the optimal behavior found (linear, quadratic, nonlinear, etc.). 33 In Discriminant Analysis, one global model is built to discriminate the behavior of the classes: in this case, linear (LDA) or quadratic (QDA). LDA classifies according to linear combinations of variables and maximizes the ratio of the differences between classes (interclass variance) to the differences within the classes. The separation between the classes is linear and is based on the Mahalanobis distance (footnote 2), derived from a single variance-covariance matrix shared by all classes (namely, the pooled variance-covariance matrix). QDA classification is based on the Mahalanobis distance derived from a separate covariance matrix for each class, and the separation behavior is quadratic. 12,47,48 SVM classification is based on the construction of two optimal parallel hyperplanes for class separation and on data multidimensionality. 47–52

The complexity of biological matrices and of the associated normal and abnormal processes implies a large amount of spectral information.
For this reason, the chemometric tools mentioned above are applied to reduce, select and classify the information derived from MIR spectra (400 – 4,000 cm⁻¹) that is useful for the diagnosis and categorization of cancer stages. In FT-MIR, the sample absorbs energies corresponding to molecular vibrational modes (such as stretching and bending), which characterize a given chemical or biochemical species. 53 The ‘fingerprint region’ (900 – 1,800 cm⁻¹) contains most of the biochemical information and potential spectral biomarkers related to normal and abnormal patterns of health and disease, such as spectral bands attributed to functional groups of lipids, carbohydrates, nucleic acids and proteins. Despite the complexity of normal and pathological biochemical processes and the fragility of biological matrices, the strengths of FT-MIR are its low operational cost, versatility, speed and simple procedures, in addition to its nondestructive and noninvasive character. 12,22,24,32,33,38,44,53,54

The combination of FT-MIR and multivariate classification tends to be a complementary and/or alternative tool for the investigation, diagnosis, categorization and surveillance of cancer and other diseases, relative to the traditional methods of detection and classification. Many studies worldwide corroborate this. 32,33,53,55

Footnote 2. The Mahalanobis distance was used in this research because our priority was classification and the identification of the most important variables responsible for it, since this measure incorporates, through a covariance matrix, the differences in the variances of the variables and the correlations between them. The Euclidean distance, in contrast, gives the same weight to all variables, so the whole classification may be influenced by some irrelevant variables. 56–59

2. MAIN OBJECTIVES

• To provide a potential complementary and/or alternative method to traditional methods for the early diagnosis and classification of prostate cancer.
• To apply multivariate classification and Fourier-Transform Mid-Infrared Spectroscopy (FT-MIR) to the differentiation of prostate cancer stages via tissue analysis.
• To identify spectral differences and potential spectral bands acting as biochemical markers responsible for the differentiation of cancer stages.
• To evaluate and compare the performance of the multivariate classification models applied.

3. THESIS OUTLINE

This thesis presents first-author works that are essential to obtaining the degree of PhD in Chemistry from the Chemistry Postgraduate Program of the Federal University of Rio Grande do Norte (IQ/PPGQ/UFRN). The thesis is organized as follows:

Chapter 2 – “MIR-biospectroscopy coupled with chemometrics in cancer studies” (published in Analyst, DOI: 10.1039/c6an01247g) – reviews the application of chemometrics in cancer spectroscopic studies, highlighting the advantages, disadvantages and applications of chemometric algorithms, and provides a conceptual framework for multivariate classification based on biomedical spectral data (Chapters 4, 5 and 6).

Chapter 3 – “A decade (2004 – 2014) of FTIR prostate cancer spectroscopy studies: An overview of recent advancements” (published in Trends in Analytical Chemistry – TrAC, DOI: 10.1016/j.trac.2016.05.028) – gives an overview of FT-MIR applied specifically to prostate cancer studies, emphasizing sample treatment, instrumentation, spectral acquisition, pre-processing tests and feature extraction techniques, and provides conceptual support for the application of FT-MIR and multivariate analysis to prostate cancer samples (Chapters 4, 5 and 6).
Chapter 4 – “A comparison of multivariate analysis and variable selection methods to prostate cancer classification from FT-MIR biomedical spectroscopy data” (submitted to Analytical Methods, manuscript number: AY-COM-09-2016-002461) – gives a practical illustration of multivariate analysis applied to MIR data for prostate cancer classification, highlighting feature selection and reduction and biomarker extraction, and provides an experimental basis for the applications of multivariate classification (Chapters 5 and 6).

Chapter 5 – “LDA vs. QDA for FT-MIR prostate cancer spectroscopy classification” (submitted to Chemometrics and Intelligent Laboratory Systems, manuscript number: CHEMOLAB_2016_123) – presents the application of two discriminant analysis methods as examples of multivariate classification applied to prostate cancer spectral datasets, with a comparative discussion of their applications, strengths and drawbacks.

Chapter 6 – “SVM for FT-MIR prostate cancer classification” (submitted to Scientific Reports – Nature) – provides a practical application of nonlinear algorithms to spectral data derived from prostate cancer samples, emphasizing the spectral differences between prostate cancer stages, with a comparative discussion against traditional methods of cancer screening.

Chapter 7 – “Conclusions and Perspectives” – summarizes the achievements, with a comparative discussion of the results of the main papers, and presents suggestions for future work.

4. METHODOLOGICAL PROCEDURES

This study was developed through a partnership between the Institute of Chemistry and the Department of Pathology of the Federal University of Rio Grande do Norte, Natal, Brazil. All experiments were performed in compliance with the relevant laws and institutional guidelines, and the institutional committee (No. 030/0030/2006) of the Liga Norte-Riograndense Contra o Cancer, Brazil, approved this research.
4.1 Tissue collection and preparation

Prostate tissue sections were provided by the Pathology Department of the Federal University of Rio Grande do Norte (UFRN/Brazil) as formalin-fixed, dehydrated and paraffin-embedded (FFPE) pathology blocks, previously classified by pathologists according to the Gleason system. The tissue sections were floated onto ZnSe slides (Bruker Optics Ltd., Coventry, UK), de-waxed by serial immersion in fresh xylene baths, washed and cleared in an absolute ethanol bath, allowed to air-dry and then placed in a desiccator until analysis.

4.2 FT-MIR spectroscopy

FT-MIR spectra were collected in transmission mode over the wavenumber range 600–4,000 cm⁻¹ (32 scans, spectral resolution of 8 cm⁻¹) using a Bruker Lumos FTIR spectrometer-microscope (Bruker Optics Ltd., Coventry, UK) and converted into absorbance by Bruker OPUS software. A new background was taken for every new sample.

4.3 Computational analysis

The samples were divided into calibration, validation and test datasets by the Kennard-Stone algorithm. 37

a. Pre-processing. Extended Multiplicative Scatter Correction (EMSC), 1st order Savitzky-Golay smoothing (15 points) and normalization to the amide I peak (1,650 cm⁻¹) were performed, in this order, as pre-processing tests.

b. Methods. The variable reduction method Principal Component Analysis (PCA) and the variable selection methods Successive Projections Algorithm (SPA) and Genetic Algorithms (GA), followed by Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA) and Support Vector Machines (SVM), were performed to classify prostate cancer grades.

c. Figures of merit. Sensitivity (SENS), Specificity (SPEC), Positive Predictive Value (PPV, or precision), Negative Predictive Value (NPV), Youden index (YOU), Positive Likelihood Ratio (LR+) and Negative Likelihood Ratio (LR-) were analyzed in order to validate and compare the multivariate classification models applied.

REFERENCES

1. WHO.
Mortality and global health estimates. Projection of death rates for 2015-2030. http://apps.who.int/gho/data/node.main.PROJRATEWORLD?lang=en.
2. BRASIL/INCA. Instituto Nacional do Cancer - Estimativa 2016. Ministério Da Saúde. 2016:51. doi:978-85-7318-194-4.
3. Theophilou G, Lima KMG, Briggs M, Martin-Hirsch PL. A biospectroscopic analysis of human prostate tissue obtained from different time periods points to a trans-generational alteration in spectral phenotype. Nat Publ Gr. 2015;(August):1-13. doi:10.1038/srep13465.
4. Chen D, Tian Y, Liu X. Structural nonparallel support vector machine for pattern recognition. Pattern Recognit. 2016;60:296-305. doi:10.1016/j.patcog.2016.04.017.
5. BRASIL/INCA. Programa nacional de controle do câncer da próstata: documento de consenso. http://bvsms.saude.gov.br/bvs/publicacoes/cancer_da_prostata.pdf.
6. BRASIL/INCA. Informativo detecção precoce: monitoramento das ações de controle do câncer de próstata. http://www1.inca.gov.br/inca/Arquivos/Informativo_Deteccao_Precoce_2_agosto_2014.pdf.
7. Epstein JI, Egevad L, Amin MB, Delahunt B, Srigley JR, Humphrey PA. The 2014 International Society of Urological Pathology (ISUP) Consensus Conference on Gleason Grading of Prostatic Carcinoma. Am J Surg Pathol. October 2015:1. doi:10.1097/PAS.0000000000000530.
8. van Leeuwen PJ, van Vugt HA, Bangma CH. The implementation of screening for prostate cancer. Prostate Cancer Prostatic Dis. 2010;13(3):218-227. doi:10.1038/pcan.2010.14.
9. Chen N, Zhou Q. The evolving Gleason grading system. Chin J Cancer Res. 2016;28(1):58-64. doi:10.3978/j.issn.1000-9604.2016.02.04.
10. Epstein JI, Allsbrook WCJ, Amin MB, Egevad LL. The 2005 International Society of Urological Pathology (ISUP) Consensus Conference on Gleason Grading of Prostatic Carcinoma. Am J Surg Pathol. 2005;29(9):1228-1242. doi:10.1097/01.pas.0000173646.99337.b1.
11. Pierorazio PM, Walsh PC, Partin AW, Epstein JI.
Prognostic Gleason grade grouping: Data based on the modified Gleason scoring system. BJU Int. 2013;111(5):753-760. doi:10.1111/j.1464-410X.2012.11611.x.
12. Khanmohammadi M, Ghasemi K, Garmarudi AB. Genetic algorithm spectral feature selection coupled with quadratic discriminant analysis for ATR-FTIR spectrometric diagnosis of basal cell carcinoma via blood sample analysis. Rsc Adv. 2014;4(78):41484-41490. doi:10.1039/c4ra04965a.
13. BRASIL/INCA. Rastreamento do câncer de próstata. http://www2.inca.gov.br/wps/wcm/connect/9e6e07004a50eca8968bd6504e7bf539/Nota+Técnica+CAP+finalizada.pdf?MOD=AJPERES&CACHEID=9e6e07004a50eca8968bd6504e7bf539.
14. BRASIL. Ministério da Saúde. Lei n. 12.732, de 22 de novembro de 2012. http://www.planalto.gov.br/ccivil_03/_ato2011-2014/2012/lei/l12732.htm.
15. SIGTAP/DATASUS. Tabela de Procedimentos, Medicamentos e OPM do SUS. http://sigtap.datasus.gov.br/tabela-unificada/app/sec/procedimento/exibir/0416010113/10/2016.
16. Subramanian S, Tangka Fl, Sabatino S, et al. Impact of Chronic Conditions on the Cost of Cancer Care for Medicaid Beneficiaries. 2012;2(4):1-21. doi:10.5600/mmrr.002.04.a07.
17. Tangka FKL, Subramanian S, Sabatino SA, et al. End-of-life medical costs of medicaid cancer patients. Health Serv Res. 2015;50(3):690-709. doi:10.1111/1475-6773.12259.
18. Aitken M. Global Oncology Trend Report. IMS Inst Healthc Informatics. 2016;(June). www.theimsinstitute.org.
19. BRASIL/CONITEC. Diretrizes diagnósticas e terapêuticas do adenocarcinoma de próstata. http://conitec.gov.br/images/Consultas/Relatorios/2015/DDT_Adenocarcinomadeprostata_CP.pdf.
20. Patel II, Trevisan J, Singh PB, et al. Segregation of human prostate tissues classified high-risk (UK) versus low-risk (India) for adenocarcinoma using Fourier-transform infrared or Raman microspectroscopy coupled with discriminant analysis. Anal Bioanal Chem. 2011;401(3):969-982. doi:10.1007/s00216-011-5123-z.
21. Pezzei C, Pallua JD, Schaefer G, et al.
Characterization of normal and malignant prostate tissue by Fourier transform infrared microspectroscopy. Mol Biosyst. 2010;6(11):2287-2295. doi:10.1039/c0mb00041h.
22. Purandare NC, Patel II, Lima KMG, et al. Infrared spectroscopy with multivariate analysis segregates low-grade cervical cytology based on likelihood to regress, remain static or progress. Anal Methods. 2014;6:4576-4584. doi:10.1039/c3ay42224k.
23. Fernandez DC, Bhargava R, Hewitt SM, Levin IW. Infrared spectroscopic imaging for histopathologic recognition. Nat Biotechnol. 2005;23(4):469-474. doi:10.1038/nbt1080.
24. Baker MJ, Trevisan J, Bassan P, et al. Using Fourier transform IR spectroscopy to analyze biological materials. Nat Protoc. 2014;9(8):1771-1791. doi:10.1038/nprot.2014.110.
25. Gazi E, Baker M, Dwyer J, et al. A Correlation of FTIR Spectra Derived from Prostate Cancer Biopsies with Gleason Grade and Tumour Stage. Eur Urol. 2006;50(4):750-761. doi:10.1016/j.eururo.2006.03.031.
26. Kelly JG, Trevisan J, Scott AD, et al. Biospectroscopy to metabolically profile biomolecular structure: A multistage approach linking computational analysis with biomarkers. J Proteome Res. 2011;10(4):1437-1448. doi:10.1021/pr101067u.
27. Petibois C, Deleris G. Chemical mapping of tumor progression by FT-IR imaging: towards molecular histopathology. doi:10.1016/j.tibtech.2006.08.005.
28. Bunaciu AA, Hoang VD, Aboul-Enein H. Applications of FT-IR Spectrophotometry in Cancer Diagnostics. doi:10.1080/10408347.2014.904733.
29. Lyng F, Ramos I, Ibrahim O, Byrne H. Vibrational Microspectroscopy for Cancer Screening. Appl Sci. 2015;5(1):23-35. doi:10.3390/app5010023.
30. Verdonck M, Garaud S, Duvillier H, Willard-Gallo K, Goormaghtigh E. Label-free phenotyping of peripheral blood lymphocytes by infrared imaging. doi:10.1039/c4an01855a.
31. Hibbert DB. Vocabulary of concepts and terms in chemometrics (IUPAC Recommendations 2016). Pure Appl Chem. 2016;88(4):407-443. doi:10.1515/pac-2015-0605.
32.
Siqueira LFS, Lima KMG. A decade (2004–2014) of FTIR prostate cancer spectroscopy studies: An overview of recent advancements. Trends Anal Chem. 2016;82:208-221. doi:10.1016/j.trac.2016.05.028.
33. Siqueira LFS, Lima KMG. MIR-biospectroscopy coupled with chemometrics in cancer studies. Analyst. 2016:4833-4847. doi:10.1039/C6AN01247G.
34. Afseth NK, Kohler A. Extended multiplicative signal correction in vibrational spectroscopy, a tutorial. Chemom Intell Lab Syst. 2012;117:92-99. doi:10.1016/j.chemolab.2012.03.004.
35. Luo J, Ying K, Bai J. Savitzky-Golay smoothing and differentiation filter for even number data. Signal Processing. 2005;85(7):1429-1434. doi:10.1016/j.sigpro.2005.02.002.
36. Savitzky A, Golay MJE. Smoothing and Differentiation of Data by Simplified Least Squares Procedures. Anal Chem. 1964;36(8):1627-1639. doi:10.1021/ac60214a047.
37. Kennard R, Stone LA. Computer aided design of experiments. Technometrics. 1969;11(1):137-148. http://amstat.tandfonline.com/doi/full/10.1080/00401706.1969.10490666.
38. Lima KMG, Gajjar KB, Martin-Hirsch PL, Martin FL. Segregation of ovarian cancer stage exploiting spectral biomarkers derived from blood plasma or serum analysis: ATR-FTIR spectroscopy coupled with variable selection methods. Biotechnol Prog. 2015;31(3):832-839. doi:10.1002/btpr.2084.
39. Abdi H, Williams LJ. Principal component analysis. Wiley Interdiscip Rev Comput Stat. 2010;2(4):433-459. doi:10.1002/wics.101.
40. Bro R, Smilde AK. Principal component analysis. Anal Methods. 2014;6(9):2812. doi:10.1039/c3ay41907j.
41. Bickel P, Diggle P, Fienberg S, et al. Principal component analysis. Springer Verlag. 2002:2812-2831. doi:10.1016/0169-7439(87)80084-9.
42. Harvey TJ, Gazi E, Henderson A, et al. Factors influencing the discrimination and classification of prostate cancer cell lines by FTIR microspectroscopy. Analyst. 2009;134(6):1083-1091. doi:10.1039/b903249e.
43. Soares SFC, Galvão RKH, Araújo MCU, et al.
A modification of the successive projections algorithm for spectral variable selection in the presence of unknown interferents. Anal Chim Acta. 2011;689(1):22-28. doi:10.1016/j.aca.2011.01.022.
44. Theophilou G, Lima KMG, Martin-Hirsch PL, Stringfellow HF, Martin FL. ATR-FTIR spectroscopy coupled with chemometric analysis discriminates normal, borderline and malignant ovarian tissue: classifying subtypes of human cancer. Analyst. 2015:585-594. doi:10.1039/c5an00939a.
45. Niazi A, Leardi R. Genetic algorithms in chemometrics. J Chemom. 2012;26(6):345-351. doi:10.1002/cem.2426.
46. Latif AHMM, Brunner E. A genetic algorithm for designing microarray experiments. Comput Stat. 2016;31(2):409-424. doi:10.1007/s00180-015-0618-2.
47. Dixon SJ, Brereton RG. Comparison of performance of five common classifiers represented as boundary methods: Euclidean Distance to Centroids, Linear Discriminant Analysis, Quadratic Discriminant Analysis, Learning Vector Quantization and Support Vector Machines, as dependent on. Chemom Intell Lab Syst. 2009;95(1):1-17. doi:10.1016/j.chemolab.2008.07.010.
48. Dixon SJ, Heinrich N, Holmboe M, et al. Application of classification methods when group sizes are unequal by incorporation of prior probabilities to three common approaches: Application to simulations and mouse urinary chemosignals. Chemom Intell Lab Syst. 2009;99(2):111-120. doi:10.1016/j.chemolab.2009.07.016.
49. Cortes C, Vapnik V. Support-Vector Networks. Mach Learn. 1995;20(3):273-297. doi:10.1023/A:1022627411411.
50. Dietrich R, Opper M, Sompolinsky H. Statistical Mechanics of Support Vector Networks. Phys Rev Lett. 1999;82(14):2975-2978. doi:10.1103/PhysRevLett.82.2975.
51. Valentini G, Dietterich TG. Bias-Variance Analysis and Ensembles of SVM. J Mach Learn Res. 2002;5:222-231. doi:10.1007/3-540-45428-4_22.
52. Luts J, Ojeda F, Van de Plas Raf R, De Moor B, Van Huffel S, Suykens JAK.
A tutorial on support vector machine-based methods for classification problems in chemometrics. Anal Chim Acta. 2010;665(2):129-145. doi:10.1016/j.aca.2010.03.030.
53. Hughes C, Baker MJ. Can mid-infrared biomedical spectroscopy of cells, fluids and tissue aid improvements in cancer survival? A patient paradigm. Analyst. 2016;141(2):467-475. doi:10.1039/C5AN01858G.
54. Hughes C, Gaunt L, Brown M, Clarke NW, Gardner P. Assessment of paraffin removal from prostate FFPE sections using transmission mode FTIR-FPA imaging. Anal Methods. 2014;6(4):1028-1035. doi:10.1039/c3ay41308j.
55. Hands JR, Dorling KM, Abel P, et al. Attenuated Total Reflection Fourier Transform Infrared (ATR-FTIR) spectral discrimination of brain tumour severity from serum samples. J Biophotonics. 2014;7(3-4):189-199. doi:10.1002/jbio.201300149.
56. De Maesschalck R, Jouan-Rimbaud D, Massart DLL. The Mahalanobis distance. Chemom Intell Lab Syst. 2000;50(1):1-18. doi:10.1016/S0169-7439(99)00047-7.
57. Xiang S, Nie F, Zhang C. Learning a Mahalanobis distance metric for data clustering and classification. Pattern Recognit. 2008;41(12):3600-3612. doi:10.1016/j.patcog.2008.05.018.
58. Brereton RG. The mahalanobis distance and its relationship to principal component scores. J Chemom. 2015;29(3):143-145. doi:10.1002/cem.2692.
59. Ramirez-Lopez L, Behrens T, Schmidt K, Rossel RAV, Demattê JAM, Scholten T. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma. 2013;199:43-53. doi:10.1016/j.geoderma.2012.08.035.

CHAPTER 2

MIR-biospectroscopy coupled with chemometrics in cancer studies.
Laurinda F. S. Siqueira, Kássio M. G. Lima.
Analyst, 2016, 141, 4833-4847.

Contributions:
• I wrote the review manuscript.
Laurinda F. S. Siqueira, Kássio M. G. Lima.

Analyst CRITICAL REVIEW
Cite this: Analyst, 2016, 141, 4833
Received 31st May 2016, Accepted 6th July 2016
DOI: 10.1039/c6an01247g
www.rsc.org/analyst

MIR-biospectroscopy coupled with chemometrics in cancer studies

Laurinda F. S.
Siqueira and Kássio M. G. Lima*

This review focuses on chemometric techniques applied in MIR-biospectroscopy for cancer diagnosis and analysis over the last ten years of research. Experimental applications of chemometrics coupled with biospectroscopy are discussed throughout this work. The advantages and drawbacks of this association are also highlighted. Chemometric algorithms are evidenced as a powerful tool for cancer diagnosis and classification in different matrices. In fact, it is shown how chemometrics can be implemented along all different types of cancer analyses.

Laurinda F. S. Siqueira received her BSc in chemistry from the Federal Institute of Education, Science and Technology of Maranhão State (Brazil) in 2008, and became a specialist in statistics at the State University of Maranhão (Brazil) in 2010. She is currently a PhD student in the Biological Chemistry and Chemometrics group (UFRN), working on biospectroscopy and classification techniques in prostate cancer studies. She is also an assistant professor of analytical chemistry at the Federal Institute of Education, Science and Technology of Maranhão State.

Kássio M. G. Lima is an assistant professor of analytical chemistry at the Institute of Chemistry at UFRN (Brazil). He received his PhD in sciences (2007) from UNICAMP (Brazil) and he was a post-doctorate fellow (July 2013–July 2014) at CSIC (Barcelona, Spain) in the Environmental Chemometrics Group (led by Dr Roma Tauler). He also worked as a visiting researcher at Lancaster University (2014), Centre for Biophotonics (led by Prof. Francis L. Martin). He has been the head of the Biological Chemistry and Chemometrics group (UFRN) since 2011 and his current research interests are on multivariate calibration and classification techniques of analytical and biological systems.

Biological Chemistry and Chemometrics, Institute of Chemistry, Federal University of Rio Grande do Norte, Natal 59072-970, RN-Brazil. E-mail: kassiolima@gmail.com; Tel: +55 84 3342 2323

This journal is © The Royal Society of Chemistry 2016. Analyst, 2016, 141, 4833–4847.

Introduction

This paper presents a review from 2005 to 2015 of chemometrics application in MIR biospectroscopy, emphasizing cancer diagnosis and classification. It shows the strengths and drawbacks of MIR and chemometrics; exploring the place of each in biomedical work is an emerging theme which continues to advance at a fast pace.

MIR-biospectroscopy is based on the principle that when a sample is interrogated with an IR beam, the functional groups within the sample will absorb infrared radiation and vibrate in one of a number of ways. These absorptions/vibrations can then be directly correlated to the biochemical species and the resulting infrared absorption spectrum can be described as an infrared ‘fingerprint’ characteristic of any chemical or biochemical substance.1

Spectroscopic techniques do not need any time-consuming or labor-intensive sample pre-treatments, destructive or complex chemical analysis or high quantities of organic solvents; conversely, they provide rapid analysis with minimum sample preparation. Thus, IR techniques result in time and cost saving, and increase sample throughput.2 Versatility is one of the main strengths of IR methods, because almost all clinical compounds are active in the IR range and can therefore be quantified. In addition, modern spectrophotometers are versatile instruments able to directly measure solid, liquid or gaseous samples using easily interchangeable accessories, including a wide variety of ATR modules, transmission cells for liquids and gases, and flow cells for macro-samples and micro-samples.3 For this reason, IR spectroscopy is a powerful technique for the identification, quantification and structural analysis of small molecules.
Thus, very high resolution materials can be imaged at the subcellular level and beyond to allow a detailed understanding of biological processes.4

Among the numerous spectral bands in the MIR spectrum (400–4000 cm⁻¹), there is a region which represents specific biochemical supply. This biochemical-cell-fingerprint region generated by MIR-biospectroscopy reflects the compositional and quantitative differences of biochemical constituents in cells. Peaks within the biochemical-cell fingerprint region (1800 cm⁻¹ to 900 cm⁻¹) contain spectral features associated with lipids (~1750 cm⁻¹), amide I (~1650 cm⁻¹), amide II (~1550 cm⁻¹), methyl groups of lipids and proteins (~1400 cm⁻¹), amide III (~1260 cm⁻¹), asymmetric phosphate stretching vibrations (νasPO₂⁻; ~1225 cm⁻¹), symmetric phosphate stretching vibrations (νsPO₂⁻; ~1080 cm⁻¹), C–OH groups of serine, threonine and tyrosine and C–O groups of carbohydrates (~1155 cm⁻¹), glycogen (~1030 cm⁻¹) and protein phosphorylation (~970 cm⁻¹).5,6 Each of these “prominent” wavenumbers works as a biomarker, which can be associated with biochemical alterations when samples are compared with the reference class.

However, the changes and variations which occur in biosamples as soon as they are separated from their substrate tissue and blood, the tremendous complexity of the systems and biochemical processes to be correctly represented with minimum reproducible samples, and also the complexity and variation in biochemical processes promoted by cancer and other diseases are all challenges in MIR-biospectroscopy studies. These problems have resulted in this technology having a limited impact on clinical practice and being poorly understood by health professionals.

Alterations in cell functions may be a product of changes in biochemical pathways, but are more likely to represent changes in the magnitude or amplification of cell pathways, thus requiring quantitative measurements of molecules.
Identifying specific ‘biomarkers’ of disease is likely to only be possible under a limited range of conditions. Ranges of quantitative change with cross over between the disease state and normality are likely to prove challenging when making careful assessment of any proposed clinical instruments in a well conducted clinical trial to establish sensitivity and specificity.4,7–9

In order to try to solve these problems and to map quantitative changes, it is necessary to conduct chemometrics. Chemometrics is a multivariate data analysis approach that uses mathematics, statistics and formal logic, and is commonly employed: (i) to design or select optimal experimental procedures, (ii) to provide maximum relevant chemical information by analyzing chemical data, and (iii) to obtain knowledge about chemical systems.3,9–11 The appropriate use of chemometrics permits processing of large amounts of MIR data variables that subsequently require data reduction approaches in order to identify sources of spectral and inter-class variations.12

The goal of this work is to highlight possible parameters and a variety of biological data where chemometrics should be applied. This review is organized into two sections as brief overviews: first, chemometric data analysis is discussed in four steps (pre-processing tests, feature extraction, clustering analysis, and validation and quality tools); and second, chemometric applications in MIR-biospectroscopy data from cancer are focused on.

Chemometrics in MIR-biospectroscopy

Several issues related to vibrational spectra data analysis of biological samples are constantly discussed in the academic community.
However, there is a consensus that typically employed chemometric models have two purposes: exploratory analysis and/or classification of samples from MIR-biospectro- scopy data, as will be shown.13,14 For better exploration of its potential, chemometric data analysis was divided into four stages: (1) pre-processing; (2) feature extraction; (3) clustering; and (4) validation and quality tools. Pre-processing The first step of IR spectra analysis. The current trends in pre- processing tests and some of the visual effects are presented in Table 1 and Fig. 1, respectively. Pre-processing has been shown to be of crucial importance for subsequent data mining tasks. In fact, it is now widely recognized that quantitative and classification models deve- loped on the basis of pre-processed data generally perform Table 1 Current trends in pre-processing tests Preprocessing tests Ref. Quality tests (test based on second derivative spectra in region intensity thresholds of water vapor lines; test for sample thickness; test of the spectral signal-to-noise ratio – SNR; test for specific band; and others) 76 Baseline correction and spectral filtering (offset baseline correction; polynomial fit; Savitzky–Golay smoothing; first and second differentiation; Fourier self-deconvolution – FSD; and others) 87 Normalization (normalization to a particular peak (e.g. for amide I and amide II peaks); vector normalization (Euclidean or L2-norm); multiplicative scatter correction – MSC; extended multiplicative signal correction – EMSC; resonant Mie scattering correction – RMieSC; min–max normalization; 1-norm normalization; standard normal variate – SNV, often used in near-infrared spectroscopy; and others) 77 Critical Review Analyst 4834 | Analyst, 2016, 141, 4833–4847 This journal is © The Royal Society of Chemistry 2016 31 better than models that solely use raw data. 
Pre-processing improves the robustness and accuracy of subsequent multivariate analyses and increases the interpretability of the data by correcting issues associated with spectral data acquisition.15–27 Combined or alone, pre-processing methods aim to (i) improve the robustness and accuracy of subsequent quantitative or classification analyses; (ii) improve interpretability, since raw data are transformed into a format that will be better understood by both humans and machines; (iii) detect and remove outliers and trends; (iv) reduce the dimensionality of the data mining task; and (v) remove irrelevant and redundant information by feature selection.24

Feature extraction

The feature extraction stage is responsible for producing a smaller number of variables that are more informative than the original whole set of wavenumber variables. It constitutes an important data reduction step in order to match the complexity of the subsequent supervised classifier with the amount of data available so as to avoid overfitting or undertraining. The rationale is data visualization, or preparing the dataset for classification. It is argued that efficient feature extraction can relieve the load of subsequent classifier design.28 In this section, traditional techniques such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA) and Partial Least Squares (PLS) are discussed as part of the computational methodology; variable selection methods such as the Successive Projections Algorithm (SPA) and Genetic Algorithm (GA) are examined; and ultimately, algorithm combinations are considered. Moreover, strengths and drawbacks are also highlighted.

PCA. PCA is a particularly popular form of unsupervised feature extraction. It is a linear transformation of the wavenumber data set operated by the PCA loadings matrix. The loading vectors (principal components, PCs) within this matrix are eigenvectors of the covariance matrix of the data.
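The statement that the loading vectors are eigenvectors of the covariance matrix can be illustrated with a small numerical sketch. The example below is an assumption-laden toy, not a production implementation: it uses plain Python, extracts only the first PC by power iteration (a full eigendecomposition routine would normally be used instead), and the data matrix and function names are hypothetical.

```python
import math


def mean_center(X):
    """Subtract the column (wavenumber) means from every spectrum."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]


def covariance(Xc):
    """Sample covariance matrix of mean-centred data (variables in columns)."""
    n, p = len(Xc), len(Xc[0])
    return [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(p)]
            for i in range(p)]


def first_pc(C, iters=500):
    """Leading eigenvector and eigenvalue of C by power iteration."""
    p = len(C)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(a * a for a in w))
        v = [a / norm for a in w]
    eigenvalue = sum(v[i] * sum(C[i][j] * v[j] for j in range(p))
                     for i in range(p))
    return v, eigenvalue


# Hypothetical data: 4 spectra (rows) x 3 wavenumber variables (columns).
X = [[1.0, 2.0, 3.0],
     [2.0, 4.0, 6.1],
     [3.0, 6.0, 8.9],
     [4.0, 8.0, 12.2]]

Xc = mean_center(X)
C = covariance(Xc)
loading, eigenvalue = first_pc(C)

# Fraction of total variance explained by PC1, and the PC1 scores.
explained = eigenvalue / sum(C[i][i] for i in range(len(C)))
scores = [sum(x * l for x, l in zip(row, loading)) for row in Xc]
```

The scores are the projections of the mean-centred spectra onto the loading vector; the sign of the loading vector is arbitrary.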
PCA may be applied to the spectral data set, followed by the selection of single factors. The number of factors to retain may be subject to optimization. One approach is to order the PCA factors from the most to the least discriminant on the basis of their P values as determined by a statistical test. The percentage of explained variance can also be taken into account.24 Strengths: it is employed to reduce dimensionality and generate a visualization of data; it captures as much variability as possible, with the assumption that variation implies information; it removes the redundancy in the original data set.21 Weaknesses: it impartially considers all between- and within-class variability in its algorithm; it only separates classes if there is great variability among them, which may not exist in many sets of spectral data; it modifies the original data vectors.18,29

Fig. 1 Visual effect of different pre-processing steps on a set of FTIR spectra (Baker et al.3).

LDA. LDA is a supervised technique which forms linear combinations of variables dependent on differences between the classes in the data set.15,30,31 Like PCA, LDA is a feature reduction method; however, it selects the space directions that achieve maximum separation among the different classes and it uses Euclidean distance to classify unknown samples.2 LDA classification is based on the Mahalanobis distance, which is derived from a common covariance matrix for all classes.
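The classification rule based on the Mahalanobis distance with a common covariance matrix can be sketched as follows. This is a minimal illustration under stated assumptions: two variables per sample (so that a closed-form 2 × 2 inverse suffices), equal prior probabilities for all classes, and hypothetical band-ratio values and function names. It is not the implementation used in the cited works.

```python
def mean_vector(rows):
    """Class centroid (mean of each variable)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def pooled_covariance(classes):
    """Common (pooled) within-class covariance matrix for two variables."""
    total = sum(len(rows) for rows in classes.values())
    S = [[0.0, 0.0], [0.0, 0.0]]
    for rows in classes.values():
        m = mean_vector(rows)
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    S[i][j] += d[i] * d[j]
    dof = total - len(classes)
    return [[S[i][j] / dof for j in range(2)] for i in range(2)]


def invert_2x2(S):
    """Closed-form inverse of a 2 x 2 matrix."""
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    return [[S[1][1] / det, -S[0][1] / det],
            [-S[1][0] / det, S[0][0] / det]]


def mahalanobis_sq(x, m, S_inv):
    """Squared Mahalanobis distance from sample x to centroid m."""
    d = [x[0] - m[0], x[1] - m[1]]
    return sum(d[i] * S_inv[i][j] * d[j] for i in range(2) for j in range(2))


def classify(x, classes):
    """Assign x to the class whose centroid is nearest in Mahalanobis
    distance (equivalent to LDA when class priors are equal)."""
    S_inv = invert_2x2(pooled_covariance(classes))
    return min(classes,
               key=lambda label: mahalanobis_sq(x, mean_vector(classes[label]), S_inv))


# Hypothetical training data: two spectral band ratios per sample.
training = {
    "normal": [(1.0, 2.0), (1.2, 2.1), (0.9, 1.8), (1.1, 2.2)],
    "cancer": [(2.0, 1.0), (2.2, 1.1), (1.9, 1.2), (2.1, 0.9)],
}

predicted_a = classify((1.1, 1.9), training)
predicted_b = classify((2.0, 1.1), training)
```

Replacing the pooled matrix with one covariance matrix per class would give the quadratic rule (QDA) instead.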
LDA is usually applied using the spectral band ratios as parameters to distinguish the FTIR spectra of normal blood samples and of cancerous cases.28 Strengths: fast, accurate and easy to perform; it maximizes the between-class variance over the within-class variance, giving optimal class segregation; in other words, it focuses on finding optimal boundaries between classes; it may be applied as a predictive method, with the goal of formulating a discrimination rule used to predict or allocate unknown samples in predefined classes, and it may also be applied as an exploratory tool to increase the understanding of the differences between classes.32 Weaknesses: it requires an initial reduction of the number of variables; it can over-fit if the number of spectra is insufficient; it can only be used with a covariance matrix estimated directly from the original data.4,13

PLS. PLS is a linear regression method and a supervised technique, based on the construction of a set of linear combinations of the wavenumbers by the same means as PCA, but it uses the data classes in the construction.4 It has become the reference or first-choice method and has been the most common algorithm employed for regression and modelling of IR signals. It is the calibration method normally selected for modelling IR spectra.8 However, PLS is possibly one of the most misunderstood and misused methods because of its complicated mathematics. Strengths: wide applicability and availability in many kinds of software; it belongs to the family of “full-spectrum” methods.29–31,33 Weaknesses: complicated mathematics is involved in the correlation of wavenumbers and data classes for complete interpretation;7 it only retains variance in the independent variables that exhibit linear correlation to the dependent variables; it has a tendency to over-fit; it requires more validation than PCA.30

QDA.
QDA is based on the Mahalanobis distance, which is founded on class-specific covariance matrices, unlike the LDA algorithm that considers all-class covariance in the calculation.34,35 The QDA algorithm has an assumption that the within-class variance is smaller than the between-class variance. Together with the LDA and PLS algorithms, it belongs to the Canonical Variate Analysis class, which separates samples into classes by minimizing the within-class variance and maximizing the between-class variance.8,11,13 QDA by itself corresponds to a robust algorithm, and its use is growing due to the complexity of biological sample analysis. Strengths: it incorporates the non-linear behavior of samples; it is less subject to constraints; it has a quadratic boundary between classes.36 Weaknesses: it can only be used with a covariance matrix estimated directly from the original data; it only accepts quadratic behavior for all variables.8,16,37

GA. GA is a combinatorial algorithm inspired by Mendelian genetics that uses a combination of selection, recombination and mutation to develop a solution to a problem.16 It is important to realize that the performance of GA is based on a population of solutions, rather than on one specific solution. GA is also used as an optimization method for selection of parameters.19,38–43 GA is commonly used as a subset and wavelength selection strategy. Strengths: it eliminates potential interferents; selected variables generate a lower signal/noise ratio; it optimizes a given response function.8 Weaknesses: it requires a creative encoding of the “chromosome” idea.16,42,43

SPA. The selection of variables is a combinatorial optimization problem with constraints. The optimization is constrained due to the reduced number of variable subsets, which are formed according to a sequence of projection operations involving the matrix of instrumental responses.
The projection operations are used to choose subsets of variables with a small degree of multi-collinearity in order to minimize redundancy and ill-conditioning problems. This allows for detection of specific spectral ranges within which specimens differ within sample categories, and also from those that fall within the boundaries between different categories.42,43 Strengths: it does not modify the original data vectors; it selects wavelengths whose information content is minimally redundant; it solves collinearity problems; its projections are only used for selection purposes; the relationship between spectral variables and data vectors is preserved. Weaknesses: high computational requirements and more time-consuming according to sample size.44–65

Algorithm combinations. PCA–DA discriminates the spectra, using the PCs to reduce the dimensionality of the data prior to DA; DA then discriminates on the basis of the resulting PCs and the a priori knowledge of the group memberships that are fed into the DA algorithm. This is achieved by maximizing the intergroup variance and minimizing the intra-group variance. Strengths: a PCA–DA model is built for each number of PCs up to a maximum, and the optimum number of PCs then provides maximal group separation and correct identification of classes, in addition to both being supervised methods.66 Weaknesses: it needs a method of calculating the optimum number of principal components; the only robust means of estimating the correct number of PCs is by carrying out a method of cross validation, in this case training-set/test-set validation.53

PCA–LDA is a supervised model that searches for variables which contain the smallest intra-group separation and the largest inter-group separation, and constructs a linear combination of variables to discriminate between the groups.
Strengths: it permits the construction of a predictive model that can be used for multi-group data classification and dimensionality reduction for a given data set.54–57,67 Weaknesses: it needs a method of calculating the optimum number of principal components.54,56,57

PLS-DA is performed to extract latent variables that enable the construction of a factor capable of predicting a class. This technique will group the IR data into classes predefined by the operator and construct a discriminant model which will be tested on a validation data set.56 Strengths: wide applicability and availability in many kinds of software.55 Weaknesses: it needs group predefinitions; it requires a priori identification of data groups contained within the training samples; IR images are not included in the training data set.68

Moreover, the use of variable selection algorithms (such as GA and SPA) followed by classification tools (such as LDA and QDA) has been increasingly reported.53,61,69 These algorithm combinations are discussed in the following sections.

Clustering

Clustering is an unsupervised method that aims to group the spectra into classes.2,70–78 It is based on the fact that data from specimens and biological tissues have high molecular complexity, so the discrimination between the different types of biological samples requires a thorough evaluation and comparison of the similarities and dissimilarities between the spectra.2,16,30,35–37,41,73,79,80 A measure of similarity is established for each class of related spectra and a mean characteristic spectrum can be extracted for each class. The mean spectrum of a cluster represents all spectra in a cluster and can be used for the interpretation of the chemical or biochemical differences between clusters.
There are also a variety of algorithms to select from to perform cluster analysis, including Ward’s algorithm, which we employ because it minimizes the heterogeneity of the cluster.15,16,18,29,30,66,68,69,81–84 In this regard, Hierarchical Clustering Analysis (HCA), K-means clustering (KMC) and Fuzzy C-means clustering (FCM) are discussed below.

HCA. HCA is an unsupervised clustering method, and is a partitioning algorithm in two forms, agglomerative and divisive. Agglomerative clustering forms clusters by merging two similar data points at a time, until all of the data points belong to one cluster. By an opposite route, divisive clustering starts out with all the data belonging to one cluster; then, by similarity measures, the cluster is divided until all data points are their own clusters.10,11 HCA is visualized by a ‘dendrogram’; agglomerative and divisive clustering on the same data set should produce the same diagram except they will be mirror images of one another.54–56,67 The distance between each cluster gives an estimation of the spectral differences.35,66,78,85 Strengths: it produces clusters of high homogeneity; it does not require random starting points; the results of clustering are always independent of the starting conditions. Weaknesses: the cluster membership of an individual spectrum can only take the values 0 or 1; high computational requirements; more time-consuming.43,72

KMC. KMC is a non-hierarchical clustering method, used to reclassify spectra that offer similar spectral characteristics. The minimization of the squared distances between the data and their cluster center is the basis of this method, whereas the class membership of an individual spectrum can only take the value 0 or 1. An iterative algorithm is used for updating randomly selected initial cluster centers. Assuming well-defined boundaries between the clusters, this algorithm obtains the class membership for each spectrum.
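The iterative KMC procedure described above (random initial centers, hard 0/1 membership, minimization of squared distances to the cluster center) can be sketched as follows. This is a minimal illustration in plain Python; the function names, the seed and the six two-variable "spectra" are hypothetical, and practical implementations usually add several random restarts because the result depends on the starting points.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two spectra."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, seed=0, max_iter=100):
    """Hard k-means clustering: returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # randomly selected initial centers
    labels = None
    for _ in range(max_iter):
        # Assignment step: each spectrum joins the cluster with the closest center.
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                      for p in points]
        if new_labels == labels:  # no membership changed: converged
            break
        labels = new_labels
        # Update step: each center becomes the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical data: six "spectra" described by two variables each.
points = [[1.0, 1.0], [1.2, 1.1], [0.8, 0.9],
          [5.0, 1.0], [5.2, 1.1], [4.8, 0.9]]

labels, centers = k_means(points, k=2)
```

The cluster numbers themselves are arbitrary; only the grouping of the spectra is meaningful.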
If the closest center is not associated with the cluster to which the object currently belongs, the object is reassigned to the cluster with the closest center.7,59 Strengths: it produces clusters of high homogeneity; it allows for molecular and spatial information to be obtained. Weaknesses: the cluster membership of an individual spectrum can only take the values 0 or 1; randomizing initial k starting points is required.8,72

FCM. This is a non-hierarchical clustering method that differentiates objects into groups whose members reveal a certain degree of similarity. The output of this clustering method is a membership function that defines the degree of membership of a given spectrum to the clusters. To calculate the class membership grade for each spectrum, a fuzzy iterative algorithm is used based on minimizing an objective function. The objective function represents the distance from any given data point (spectrum) to the actual cluster center weighted by that data point’s membership grade.21,86 FCM is very similar to KMC except that it has an additional component known as a “fuzzifier,” which controls the amount of fuzziness each data point has to a cluster.18 Strengths: outliers and data that display properties of more than one class can be characterized by assigning non-zero class membership values to several clusters.42 Weaknesses: the fuzzifier can affect the stability of cluster assignments; therefore, it is often set at a value that will not cause instability; randomizing initial C starting points is required.43

Validation and quality tools

After the model application, quality metrics need to be verified through multivariate classification quality features such as sensitivity, specificity, positive predictive value (or precision) and negative predictive value, Youden index, and positive and negative likelihood ratios. Table 2 summarizes these equations.
Table 2 Quality metrics used in multivariate classification analysis
Validation and quality tools | Equations
Sensitivity (SENS) | $\frac{TP}{TP + FN} \times 100$
Specificity (SPEC) | $\frac{TN}{TN + FP} \times 100$
Positive predictive value (PPV) | $\frac{TP}{TP + FP} \times 100$
Negative predictive value (NPV) | $\frac{TN}{TN + FN} \times 100$
Youden’s index (YOU) | $\mathrm{SENS} - (1 - \mathrm{SPEC})$
Likelihood ratio positive (LR(+)) | $\frac{\mathrm{SENS}}{1 - \mathrm{SPEC}}$
Likelihood ratio negative (LR(−)) | $\frac{1 - \mathrm{SENS}}{\mathrm{SPEC}}$
TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives, respectively.

Sensitivity is the confidence that a positive result for a sample of the label class is obtained; this is positive in disease. Specificity is the confidence that a negative result for a sample of the non-label class is obtained; this is negative in health.73 A sensitivity of 100% means that the test recognizes all people with the condition. A specificity of 100% means that the test recognizes all healthy people as healthy.72

The Positive Predictive Value (PPV) measures the proportion of correctly assigned positive examples;16 therefore, it shows how many test positives are true positives. The Negative Predictive Value (NPV) measures the proportion of correctly assigned negative examples;8,11,13 thus, it shows how many test negatives are true negatives. For both, a high value (as close to 100 as possible) suggests that the test is accurate. Predictive values of diagnostic or screening tests recognize the influence of the prevalence of disease.29–31,33

Youden’s index (YOU) evaluates the classifier’s ability to avoid failure.29 It is a function of sensitivity and specificity. It is a commonly used measure of overall diagnostic effectiveness. This index ranges between 0 and 1, with values close to 1 indicating that the effectiveness is relatively large and values close to 0 indicating limited effectiveness.
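As a worked illustration of the Table 2 metrics, the sketch below computes them from the four cells of a confusion matrix. The function name and the counts are hypothetical. Two assumptions are made explicit: sensitivity and specificity enter Youden's index and the likelihood ratios as fractions (0 to 1) rather than percentages, and LR(−) is taken as the false-negative rate divided by the true-negative rate, (1 − SENS)/SPEC. The function does not guard against division by zero (for example, a specificity of exactly 1).

```python
def classification_metrics(tp, fp, tn, fn):
    """Quality metrics from true/false positive and negative counts."""
    sens = tp / (tp + fn)  # sensitivity as a fraction
    spec = tn / (tn + fp)  # specificity as a fraction
    return {
        "sensitivity_pct": 100 * sens,
        "specificity_pct": 100 * spec,
        "ppv_pct": 100 * tp / (tp + fp),
        "npv_pct": 100 * tn / (tn + fn),
        "youden": sens - (1 - spec),
        "lr_pos": sens / (1 - spec),
        "lr_neg": (1 - sens) / spec,
    }


# Hypothetical counts: 50 diseased and 100 healthy samples.
metrics = classification_metrics(tp=45, fp=10, tn=90, fn=5)
```

With these counts the sensitivity and specificity are both 90%, whereas the PPV is lower than the NPV because the hypothetical sample contains fewer diseased than healthy cases, which illustrates the dependence of predictive values on prevalence.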
Thus, it optimizes the biomarker’s differentiating ability when equal weight is given to sensitivity and specificity.29

The magnitude of the LR provides an intuitive feeling for how strongly a given test result will raise (rule-in) or lower (rule-out) the likelihood of a disease. The positive likelihood ratio (LR+) represents the ratio between the probability of predicting an example as positive when it truly is positive, and the probability of predicting an example as positive when it actually is not positive.15,30,79 The LR+ corresponds to the clinical concept of “ruling-IN disease”; thus, if the LR+ of a test is very high and the test is positive, it rules in the disease.71 The negative likelihood ratio (LR−) represents the ratio between the probability of predicting an example as negative when it is actually positive, and the probability of predicting an example as negative when it truly is negative.87 The LR− corresponds to the clinical concept of “ruling-OUT disease”; thus, if the LR− of a test is very low and the test is negative, it rules out the disease.88

Current trends in MIR-biospectroscopy coupled with chemometrics in cancer studies

Table 3 presents examples of chemometrics applied in cancer studies over ten years (2005–2015) of MIR-biospectroscopy research. This is by no means exhaustive, but suffices to illustrate the potential applications of the technology and of chemometrics.

As shown, MIR-biospectroscopy coupled with chemometrics has been used as a very powerful tool for cancer screening, diagnosis and surveillance through classification and characterization of the cell, tissue and biofluids, as well as biomarker identification.
In this section, chemometric applications in MIR-biospectroscopy data from cancer will be discussed, and a brief overview is summarized in Table 3 according to the literature.80

Screening and diagnosis

In the cancer screening field, cancer may change the biochemical composition, not only of cells and tissues, but also of body fluids; as such, biofluids appear as a major potential sample to be analyzed by MIR-biospectroscopy as an alternative to invasive samples from biopsies. Hughes et al.68 and Tatarkovič et al.89 used blood samples to diagnose prostate cancer and differentiate colon cancer or non-cancer, respectively, with high accuracy and sensitivity. The same high sensitivity was found by Patel & Martin28 for breast, skin, gastric, bladder, esophagus and colon cancer diagnosis from blood samples. Menzies et al.66 used sputum as a non-invasive sample to investigate head and neck cancer at an earlier stage; in this case, FTIR detects laryngeal tumors, hidden from non-invasive observation, in addition to oral and oropharyngeal tumors, providing a single set of biomarkers for multiple tumor types.

There is a considerable amount of literature on cancer diagnosis and classification and on differentiating cancer and non-cancer areas, so only a few studies in this area are described here. In this field, the PCA chemometric multivariate algorithm was the most used, alone as the first choice, or coupled with other algorithms (Fig. 2). Harvey et al.18 successfully applied PCA alone after some pre-processing tests to differentiate normal and cancerous prostate cell lines through FTIR-Photo Acoustic Spectroscopy (FTIR-PAS) data. However, in this study, the aim was only to ‘separate’ cell line types; for more specific results, it would be necessary to conduct more robust tests, such as clustering analysis or other algorithms coupled with PCA.
Clustering techniques (HCA, KMC, FCM), complemented (or not) with PCA or other algorithms, appear as unsupervised methods that aim to group the spectra into classes. Conti et al.73 used PCA and HCA analysis for colon cancer diagnosis and to characterize totally healthy and malignant tissues in agreement with histopathological analysis, and it was also possible to characterize necrotic zones. Ly et al.72 discussed the identification of tumor lesions from IR spectral images of skin and colon tissue sections using KMC clustering to highlight tumor tissue within non-cancerous tissue and HCA on cluster centers to evaluate the discriminating power; they also created a “subtraction paraffin-model” using PCA to keep the maximum variance in the paraffin dataset while reducing the amount of data and EMSC to eliminate paraffin contribution in all spectra. Pezzei et al.,16 aiming to obtain FTIR imaging data with the corresponding histological and immunohistochemical tissue morphology, implemented multiple clustering techniques for FTIR imaging to increase the information content, enabling visualization of different tissue features in pseudo colour images. These studies had sensitivities higher than 90%, which demonstrates high classification power.

Table 3 MIR-biospectroscopy coupled with chemometrics in cancer studies, from 2005 to 2015
Determination/parameter | Matrix/IR mode/spectral region (cm−1) | Pre-processing methods | Chemometric analysis | Results | Year | Ref.
Characterization and classification of astrocytic glioma tissue in brain tumors | Tissue/Transmission/1000–1800 | Linear baseline; offset correction; min–max normalization; mean and histogram filter | LDA–GA | 95% accuracy for distinction of control from tumor tissue | 2005 | 78
Correlation of FTIR spectra derived from prostate cancer tissue with Gleason score and the clinical stage of the tumor | Tissue/Transmission/750–4000 | Baseline correction in spectral region (1720–980 cm−1); normalization to amide I (1650 cm−1) | LDA | 70% sensitivity and 81% specificity between Gleason criteria and FTIR-LDA grade correlation | 2006 | 41
Segregation grades of exfoliative cervical cytology | Cell/ATR/900–1850 | Normalization to the amide I (1650 cm−1) | PCA–LDA | Possible biomarkers were identified (protein phosphorylation at 970 cm−1, glycogen at 1030 cm−1, and shifts in the centroid of amide I at 1650 cm−1) | 2007 | 85
Breast, skin, gastric, bladder, esophagus and colon cancer diagnosis by human blood sample analysis | Biofluid/ATR/900–2000 | Water correction background | LDA | 90% sensitivity in cancer detection | 2007 | 81
Obtainment of FTIR-PAS spectra of prostate cancer cell line | Cell/Reflection/500–4000 | Background subtraction; vector normalization; mean-centering | PCA | Different cell line types were successfully separated into clusters from each form of cancerous disease | 2007 | 82
Identification of colon and skin cancerous lesions from IR spectral images of paraffin-embedded biopsies | Tissue/Transmission/900–1800 | Water correction background; EMSC | PCA; KMC; HCA | EMSC–KMC model was very powerful for the identification of tumors and for the automation of paraffin correction | 2008 | 68
Characterization of spectral biomarkers to distinguish healthy from pathological colon tissues | Tissue/ATR/700–4000 | Polynomial line fit; 2nd derivative; FSD; curve fitting; normalization to amide I (1650 cm−1) | PCA; HCA | Identification of markers (nucleic acids at 1913 cm−1; proteins at 2196 cm−1; 2300/2210 cm−1 band ratio approximately equal to unity; the intensity of the band at 2460 cm−1 that appears related to the progression of the disease) | 2008 | 66
Categorization of prostate cancer tissue specimens | Tissue/Transmission/750–4000 | 1st derivative (Savitzky–Golay algorithm, 9 smoothing points); 2nd derivative; vector normalization | PCA–DA | Sensitivity of 92.3% and specificity of 99.4% in the determination of the biochemical changes associated with the progression of cancer | 2008 | 69
Determination of whether FTIR microspectroscopy might be exploited in order to derive an endometrial carcinoma subtype-specific biochemical-cell fingerprint | Cell/Transmission/900–1800 | Normalization to the amide I (1650 cm−1) | PCA–LDA | Identify markers νsPO2− (1080 cm−1), amide I (1650 cm−1) and amide II (1550 cm−1) | 2009 | 83
Determination of a spectral signature that identifies specific sub-types of prostate cancer tissue | Tissue/Transmission/750–4000 | Vector normalization | PCA–DA | Sensitivity and specificity as high as 83.6% and 86.0% | 2009 | 84
Recording of infrared spectra of closely related prostate cell types and their classification, and investigation of the factors influencing this classification | Cell/Reflection/500–4000 | Vector normalization; mean-centering | PCA–LDA | Sensitivity and specificity >94% and >98%. Neither growth media nor the N/C ratio could totally explain the classification. Biochemical differences were the major contributing factors | 2009 | 67
Investigation of a family of prostate cancer cell lines derived from the same anatomical position by FTIR spectroscopy, laboratory and synchrotron based studies | Cell/Reflection/750–4000 | Vector normalization; EMSC; smoothing (Savitzky–Golay 5th order); GA | PCA–DA | Sensitivity and specificity values of >85% and >77%. Differentiation classification accuracy is better within the laboratory based study compared to the synchrotron based study | 2010 | 89
Differentiation between normal and malignant breast tissue | Tissue/Transmission/1000–1400 | Water vapor correction; SNR correction; normalization to the amide I (1650 cm−1); 2nd derivative | PLS | More than 95% of the training and validation spectra were correctly identified | 2010 | 15, 58, 60, 69 and 90
Identification and characterization of the tumor initiating renal epithelial carcinoma or cancer stem cell and markers | Cell/ATR/1000–1450 | Smoothing (Savitzky–Golay algorithm, 7 points); 2nd derivative; RMieS-EMSC | PCA–LDA | Identify lipid and phosphodiester vibrations as markers | 2010 | 61–64 and 84
Identification of biomarkers in corneal squamous cell carcinoma | Cell/Transmission/400–4000 | Smoothing (Savitzky–Golay algorithm, 13 points); baseline correction; normalization to the amide I peak (1650 cm−1) | PCA–LDA | Identify carbohydrates, glycogen, amide I, lipids, protein and DNA/RNA as markers | 2010 | 15, 16, 18, 68, 78 and 85
Development of new tools for histomorphological analysis and the characterization of snap frozen prostate cancer tissues | Tissue/Transmission/750–4000 | Vector normalization; smoothing (Savitzky–Golay algorithm, 13 points); 2nd derivative | PCA; HCA; KMC; FCM | Spectroscopic imaging accurately reproduces tissue histology of cancer. All clustering analyses allow the identification of specific tissue structures (cancer, benign, stroma) | 2010 | 28, 41, 58, 60, 70 and 85
Determination of whether biospectroscopy coupled with multivariate analysis may be employed to interrogate fine needle aspirates of breast tumors | Cell/ATR, Transmission/650–4000 | Baseline correction; normalization to the amide I peak (1650 cm−1) | PCA; PCA–LDA | DNA/RNA region should be used as the difference between cytology grades in the atypical and malignant specimens | 2011 | 58, 60 and 69
Detection of colon cancer according to the spectral features of colon tissues | Tissue/ATR/1000–1400 | SNR correction | PCA–LDA; HCA | HCA cannot separate two groups. PCA–LDA had sensitivity and specificity of 100 and 93.1%, respectively. Identify DNA, RNA and carbohydrates as markers | 2011 | 35, 66 and 78
Test for endometrial cancer tissue | Tissue/ATR/1000–1800 | Baseline correction; normalization to the amide I peak (1650 cm−1) | PCA–LDA; HCA | Identify lipids, amide I and amide II as markers | 2011 | 87
Characterization of different types of pituitary gland cancer and normal pituitary tissue | Tissue/Transmission/1000–1800 | Removal of outliers; baseline correction; normalization | KMC; LDA | Classification with an overall accuracy success rate of 86% | 2012 | 77
Determination of discriminatory power and formalin-fixation and paraffin-embedding (FFPE)-induced spectral modifications in normal cell lines and epithelial, melanoma and breast cancerous cell lines | Cell/Transmission/1000–1800 | Water vapor correction; SNR correction; normalization for equal area between the amide I and II (1725 and 1481 cm−1) | HCA; PCA; PLS-DA | Correct identification of the cell types despite spectral modifications due to the fixation process | 2013 | 78
Identification of blood-borne spectral bladder cancer marker candidates | Biofluid/Transmission, ATR/1000–4000 | Removal of outliers; baseline correction based on Pearson-correlation; min–max normalization; 1st and 2nd derivatives | LDA | Spectral marker in amide I and C–H stretching vibration region. Sensitivity of 93 ± 10% and a specificity of 46 ± 18% | 2013 | 41
Differentiation of normal cells from premalignant and cancerous oral epithelial cells | Cell/Transmission/650–3600 | Baseline correction | LDA | Accuracy of 77–89.6% in the classification of different stages of oral carcinogenesis. Identify methylene (CH2) and methyl group (CH3) stretching vibrations as markers | 2013 | 85
Classification of ovarian and endometrial cancer, by blood plasma and serum analysis | Biofluid/ATR/900–1800 | Vector normalization; smoothing (Savitzky–Golay algorithm, 9 points) | PLS; PCA; LDA; QDA | Classification results for ovarian cancer and endometrial cancer had >96.7% and >81.7% accuracy | 2013 | 81
Examination of cervical cancer cytology | Cell/ATR/900–1800 | Rubber band baseline correction (64 points); normalization to the amide I (1650 cm−1) | PCA–LDA | Identify glycogen, amide I, amide II and phosphate bands as spectral markers | 2013 | 82
Analyze blood serum to diagnose prostate cancer | Biofluid/ATR, Transmission/700–4000 | Normalization to the amide I (1650 cm−1); 1st derivative; smoothing (Savitzky–Golay, 13 points) | PCA; KMC | Distorted amide I and II peak ratios present in some transmission spectra were not observed in the equivalent ATR spectra | 2014 | 68
Use of FTIR to diagnose head and neck cancer at an earlier stage by sputum analysis | Biofluid/Transmission/450–4000 | Baseline correction; smoothing (Savitzky–Golay, 9 points); vector normalization | PLS | Identify amides I and II (1650 cm−1 and 1540 cm−1) as markers and region between carbohydrate and associated nucleic acid (1042 cm−1) | 2014 | 66
Diagnosis approach for basal cell carcinoma via blood sample analysis (skin cancer) | Biofluid/ATR/900–2000 | Water vapor correction | QDA; GA–QDA | 94.74% and 100% accuracy for QDA and GA–QDA models. Identify amides I and II (1650 cm−1 and 1540 cm−1) as markers | 2014 | 69
Detection and identification of colon adenocarcinoma | Tissue/Transmission/750–4000 | Water vapor correction; EMSC | LDA; KMC | 100% sensitivity for detection and differentiation of normal and malignant colonic features based purely on their intrinsic biochemical features | 2014 | 83
Identification of low-grade cases of cervical cancer and of the wavenumbers as predictive markers of disease progression | Cell/ATR/900–1800 | Rubber band baseline correction; normalization to the amide I peak (1650 cm−1) | PCA–LDA; SPA–LDA; GA–LDA | Identification of bands responsible for separating between static, regressive and progressive disease specimens | 2014 | 84
Determination of changes in lipids of cell extracts, after cancer treatments | Cell/ATR/600–4000 | Water vapor correction; baseline correction; normalization for equal area | PLS | No evidence of significant changes in the lipid composition of cell extracts after treatment by the PLS model | 2014 | 67
Investigation of blood plasma samples of colon cancer patients and healthy controls | Biofluid/ATR/600–4000 | Water vapor correction; linear baseline correction | PCA–LDA | Sample discrimination accuracy reached 100% and the cross-validation resulted in 93% sensitivity and 81% specificity | 2015 | 89

PCA–DA and PCA–LDA are a common combination of algorithms in many cancer studies. Harvey et al.18,29 applied the PCA–LDA algorithm to prostate cancer cell line classification, showing that the four cell lines clustered and separated to a much greater extent than with PCA.
The high sensitivity values in this study indicate that FTIR spectroscopy coupled with PCA–LDA can discriminate and classify unknown prostate cancer cell lines with a high degree of accuracy, and therefore has promising diagnostic potential. Comparably satisfactory results with PCA–LDA were reported by Baker et al.15 for prostate cancer tissue; by Walsh et al.,71 Taylor et al.,87 Purandare et al.,82,84 Hughes et al.,80 and Kelly et al.75 for cervical, renal, corneal and breast cancer in cells; by Tatarkovič et al.89 for colon cancer in plasma samples; and by others such as German et al.,24 Bhargava et al.,21 Harvey et al.,18,29 Patel et al.,28 and Gajjar et al.81

Many other researchers have coupled variable selection algorithms, such as GA and SPA, with classification algorithms, such as LDA.73 For example, Baker et al.3 used GA to discover the optimum pre-processing technique. Purandare et al.84 used PCA, GA and SPA followed by LDA to investigate the progress of cervical cancer. SPA–LDA and GA–LDA resulted in a better segregation of cytology categories than PCA–LDA. In turn, the GA–LDA model gave even better segregation and successfully detected the biochemical alterations in the cytology specimens using 35 wavenumbers, against the 10 variables of the SPA–LDA model. This research indicated that the main biochemical alterations are associated with lipids, proteins, nucleic acids and carbohydrates and, to a lesser extent, with DNA vibrations.

Results achieved by Harvey et al.29 showed that multivariate chemometric analysis (PCA and PCA–LDA) could discriminate and classify prostate cell lines with good values for sensitivity and specificity. For that study, Fig. 3(a) and (b) show the PCA plots for both the 1st and 2nd derivative FTIR spectra.
As can be seen, Fig. 3(a) shows evidence of separation of the four cell lines into four defined quadrants, with separation occurring on both the PC1 and PC3 axes. For the 2nd derivative plot (Fig. 3(b)), separation of cell lines occurs only on the PC2 axis, with LNCaP and BPH displaying the greatest separation. Purandare et al.84 investigated PCA–LDA and variable selection techniques, namely the successive projection algorithm (SPA) and genetic algorithm (GA), to predict which cytology samples would progress, remain static or regress. In that study, SPA selected 10 variables, as shown in Fig. 4A. Using these 10 selected wavenumbers, Fisher scores were obtained (Fig. 4B). When GA was applied to the dataset, it selected 35 variables (Fig. 4C), the number at which the cost function reached its minimum. Using these 35 selected wavenumbers, Fisher scores were obtained for all the specimens in the dataset (Fig. 4D), achieving good separation for each category, especially for the progressive disease class.

Fig. 2 Applications of chemometric tools for biospectroscopy data analysis of cancer studies from 2005 to 2015, according to an ISI Web of Science search (May 2016).

Fig. 3 PCA versus PCA–LDA training set template for clustering and separation of four prostate cell lines (BPH: benign prostatic hyperplasia; PC-3: bone marrow metastases; LNCaP: lymph node metastases; PNT2-C2: non-malignant normal prostate epithelial cells). (Harvey et al.29).

Moreover, other authors have sought to use even more robust classification algorithms, such as QDA, coupled with variable selection algorithms. Khanmohammadi et al.69 applied GA coupled with QDA for diagnosis of skin cancer in blood sample analyses.
In this case, GA–QDA showed marked improvements in the calibration and test sets, since GA checks several sets of populations (wavelengths) and selects wavelengths with high efficiency in the classification algorithm, while QDA classifies the selected features, giving the best classification performance for diagnosis.

Fig. 4 The application of variable selection techniques to the segregation of retrospectively categorized low-grade cervical cytology specimens (CIN1: static as cervical intraepithelial neoplasia; REG: cytology that regressed after 1 year; PROG: cytology that progressed to high-grade disease). SPA–LDA results: (A) 10 wavenumber variables selected; and, (B) DF1 × DF2 discriminant function values calculated using the variables selected. GA–LDA results: (C) 35 wavenumbers selected; and, (D) DF1 × DF2 discriminant function values calculated using the variables selected (Purandare et al.84).

A high proportion of the abovementioned studies were aimed at biomarker identification, that is, the extraction of the wavenumber variables that are most important according to the algorithm and model used. Conti et al.73 and Taylor et al.87 identified nucleic acids at 1913 cm−1 and proteins at 2196 cm−1 as biomarkers for colon cancer, and related the intensity of the band at 2460 cm−1 to the progression of the disease; also, Kelly et al.2 and Purandare et al.84 identified νsPO2− (1080 cm−1), amide I (1650 cm−1) and amide II (1550 cm−1) as markers for endometrial carcinoma. However, there has been limited effort towards the development and validation of stable biomarker identification methods. In this regard, Trevisan et al.53 suggest validating biomarker identification methods to improve their stability, mainly those used to perform multivariate feature selection, such as GA, PCA, LDA and QDA algorithms.
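To make the idea of ranking wavenumber variables by importance concrete, the following is a minimal sketch only. It does not implement GA or SPA as used in the cited studies; it swaps in a much simpler univariate filter, the two-class Fisher ratio (squared difference of class means divided by the sum of class variances). The function names, the class labels and the absorbance values are hypothetical and invented for illustration; only the three wavenumbers are borrowed from the markers mentioned above.

```python
from statistics import mean, pvariance


def fisher_ratio(class_a, class_b):
    """Two-class Fisher ratio for one variable:
    (mean_a - mean_b)^2 / (var_a + var_b), using population variances."""
    spread = pvariance(class_a) + pvariance(class_b)
    gap = (mean(class_a) - mean(class_b)) ** 2
    if spread == 0:
        # Degenerate case: no within-class scatter at all.
        return float("inf") if gap > 0 else 0.0
    return gap / spread


def rank_wavenumbers(wavenumbers, spectra, labels, top_k=3):
    """Return the top_k wavenumbers sorted by decreasing Fisher ratio.

    spectra: list of spectra, each a list of absorbances aligned with
             `wavenumbers`; labels: one class label per spectrum.
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    scored = []
    for j, wn in enumerate(wavenumbers):
        a = [s[j] for s, y in zip(spectra, labels) if y == classes[0]]
        b = [s[j] for s, y in zip(spectra, labels) if y == classes[1]]
        scored.append((fisher_ratio(a, b), wn))
    scored.sort(reverse=True)
    return [wn for _, wn in scored[:top_k]]


# Hypothetical toy data: absorbances are invented, not measured values.
WAVENUMBERS = [1080, 1550, 1650]  # cm^-1
SPECTRA = [
    [0.20, 0.50, 0.90],
    [0.22, 0.52, 0.88],
    [0.21, 0.48, 0.91],
    [0.40, 0.51, 0.89],
    [0.42, 0.49, 0.92],
    [0.41, 0.50, 0.90],
]
LABELS = ["normal"] * 3 + ["cancer"] * 3

if __name__ == "__main__":
    print(rank_wavenumbers(WAVENUMBERS, SPECTRA, LABELS, top_k=2))
```

A filter of this kind scores each wavenumber independently, so unlike GA or SPA it ignores collinearity between neighbouring variables; it is shown here only because it makes the notion of a "most discriminating wavenumber" easy to inspect.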
Lastly, Table 4 summarizes the main characteristics of the chemometric algorithms applied in cancer studies. It is important to highlight that all the research employed MATLAB toolboxes (Mathworks Inc.) for MIR-biospectroscopy data analysis, with the following exceptions: Walsh et al.,71 Kelly et al.,75 Gajjar et al.,81 and Menzies et al.66 used Opus software (Bruker Optics Inc.); Gazi et al.70 used OMNIC software; Kelly et al.2 used Pirouette software (Infometrix Inc.); Chiu et al.85 employed Win-DAS software (Wiley Inc.); Tatarkovič et al.89 employed Unscrambler X software (Camo, Norway); and Conti et al.73 used Spectrum 5.3 (Perkin-Elmer Inc.), Grams AI (Galactic Corp.) and Pirouette 4.0 (Infometrix Corp.) software. MATLAB (http://www.mathworks.com) is still a popular development environment and programming language, mainly because customized software can be written in it for specific aims.

Surveillance

FTIR surveillance of cancer progress and of the changes induced by anticancer drug therapy was conducted by Derenne et al.67 and others.29 These studies investigated the subtle differences that occur in the spectrum of cancer cell lines upon exposure to concentrations of classical anticancer drugs. For this purpose, the PLS discriminant model,15 the RMieS-EMSC algorithm35 and HCA clustering80 were applied. When these models were applied to compare the impact of various anticancer treatments on cell lines, no evidence of significant changes could be observed, indicating that the changes were too small for meaningful quantification. However, these studies made the important contribution of identifying the IR spectra of entire cells exposed to the same drugs, which showed significant absorbance variations in a spectral region (1800–1700 cm−1), as reported for amide I and II bands and protein content.
This suggests that FTIR detects slight metabolic modifications and that the spectrum of cells exposed to anticancer drugs could characterize each unique cytotoxic activity. FTIR was also demonstrated to be a useful tool for obtaining fingerprints of anticancer drugs.

Table 4 Current chemometric algorithms used in cancer studies
Algorithm | Aims | Assumptions | Ref.
GA | Feature subset and wavelength selection | Selection based on the prediction accuracy of the model; heuristic; variables in non-linear models | 15, 58, 60, 69 and 90
SPA | Variable selection and calibration | Selection based on a minimal redundant subset of samples for minimization of collinearity; a separate validation set is available; the number of selected samples must be larger than the number of variables | 61–64 and 84
PCA | Projection and visualization of data in a low-dimensional space | Unsupervised; clustering due to the greatest variance | 15, 16, 18, 68, 78 and 85
LDA | Classification of samples | Supervised; intra-class variance is smaller than inter-class variance; linear boundary between classes; classes have normal distributions; class covariance matrices are identical | 28, 41, 58, 60, 70 and 85
QDA | Classification of samples | Supervised; intra-class variance is smaller than inter-class variance; quadratic boundary between classes; classes have normal distributions | 58, 60 and 69
PLS | Classification of samples | Supervised; intra-class variance is smaller than inter-class variance; covariance between data and class membership is related to classification | 35, 66 and 78

Conclusions

This retrospective study was intended to explore chemometric applications in cancer MIR-biospectroscopy data. We have highlighted that chemometrics can improve decisions by basing them on mathematical models rather than only on univariate fixed values. We have also highlighted some pre-processing, feature extraction and clustering methods used in a variety of studies, which demonstrate the ability of chemometrics to differentiate, classify and identify tumors, stages of tumors and biomarkers. Further, we have set out the varied forms in which MIR biospectroscopy can be applied with regard to the sample formats (cell lines, tissues and biofluids), the instrumentation modes (ATR, transmission and reflection) and the spectral acquisition.

Although biological sample heterogeneity produces very complex spectra, which makes the detection, determination and classification of minor components of the samples more difficult, IR spectroscopy coupled with chemometrics allows significant and valuable information to be extracted from large and complex data sets. As a consequence, many studies have been carried out in recent years using MIR-biospectroscopy coupled with exhaustive chemometric analyses. This technique has been confirmed as a rapid, non-destructive and reliable measuring tool, as shown by the large body of literature on the subject.
With the advantages of chemometrics and MIR-biospectroscopy, and the high number of applications previously mentioned in this and in other studies, it is expected that chemometric and analytical technique combinations will continue to grow during the next few years and become a reliable tool for understanding the biochemical and biomolecular basis of all cancer and disease types.

Abbreviations

ATR: Attenuated total reflection
BPH: Benign prostatic hyperplasia
CIN1: Static as cervical intraepithelial neoplasia
DA: Discriminant analysis
DF: Discriminant function
EMSC: Extended multiplicative signal correction
FCM: Fuzzy C-means clustering
FFPE: Formalin-fixation and paraffin-embedding
FN: False negative
FP: False positive
FSD: Fourier self-deconvolution
FTIR: Fourier-transform infrared spectroscopy
FTIR-PAS: FTIR-photo acoustic spectroscopy
GA: Genetic algorithm
GA–LDA: Genetic algorithm–linear discriminant analysis
GA–QDA: Genetic algorithm–quadratic discriminant analysis
HCA: Hierarchical cluster analysis
IR: Infrared
KMC: K-means clustering
LDA: Linear discriminant analysis
LNCaP: Lymph node metastasis
LR−: Negative likelihood ratio
LR+: Positive likelihood ratio
MIR: Mid-infrared
MSC: Multiplicative scatter correction
N/C: Nucleus-to-cytoplasm ratio
NPV: Negative predictive value
PC-3: Bone marrow metastasis
PCA: Principal components analysis
PCA–DA: Principal components–discriminant analysis
PCA–LDA: Principal components–linear discriminant analysis
PCs: Principal components
PLS: Partial least-squares
PLS–DA: Partial least-squares–discriminant analysis
PNT2-C2: Non-malignant normal prostate epithelial cells
PPV: Positive predictive value
PROG: Cytology that progressed to high-grade disease
QDA: Quadratic discriminant analysis
REG: Cytology that regressed after 1 year
RMieS-EMSC: Resonant Mie scattering-extended multiplicative signal correction algorithm
SENS: Sensitivity
SNR: Signal-to-noise ratio
SNV: Standard normal variate
SPA: Successive projections algorithm
SPA–LDA: Successive projections algorithm–linear discriminant analysis
SPEC: Specificity
TN: True negative
TP: True positive
YOU: Youden’s index

Acknowledgements

L. F. S. Siqueira would like to acknowledge the financial support from the PPGQ/UFRN/CAPES and IFMA. K. M. G. Lima acknowledges the CNPq/CAPES project (Grant 070/2012 and 305962/2014-FAPERN) (PPP 005/2012) for financial support. We are grateful to Fabio Godoy (Bruker Optics Ltd) for excellent technical assistance with the Bruker Lumus FTIR spectrometer used in the study.

References

1 D. I. Ellis and R. Goodacre, Analyst, 2006, 131, 875–885.
2 J. G. Kelly, M. N. Singh, H. F. Stringfellow, M. J. Walsh, J. M. Nicholson, F. Bahrami, K. M. Ashton, M. A. Pitt, P. L. Martin-Hirsch and F. L. Martin, Cancer Lett., 2009, 274, 208–217.
3 M. J. Baker, J. Trevisan, P. Bassan, R. Bhargava, H. J. Butler, K. M. Dorling, P. R. Fielden, S. W. Fogarty, N. J. Fullwood, K. A. Heys, C. Hughes, P. Lasch, P. L. Martin-Hirsch, B. Obinaju, G. D. Sockalingum, J. Sulé-Suso, R. J. Strong, M. J. Walsh, B. R. Wood, P. Gardner and F. L. Martin, Nat. Protoc., 2014, 9, 1771–1791.
4 D. Perez-Guaita, S. Garrigues and M. de la, TrAC, Trends Anal. Chem., 2014, 62, 93–105.
5 S. Wold, Kem. Tidskr., 1972, 34–37.
6 D. L. Massart, B. M. G. Vandeginste, L. M. C. Buydens, S. de Jong, P. J. Lewi and J. Smeyers-Verbeke, Handbook of chemometrics and qualimetrics-data handling in science, Part A, Elsevier, Amsterdam, Netherlands, 1997.
7 G. G. Dumancas, S. Ramasahayam, G. Bello, J. Hughes and R. Kramer, TrAC, Trends Anal. Chem., 2015, 74, 79–88.
8 J. G. Kelly, J. Trevisan, A. D. Scott, P. L. Carmichael, H. M. Pollock, P. L. Martin-Hirsch and F. L. Martin, J. Proteome Res., 2011, 10, 1437–1448.
9 J. Trevisan, P. Angelov, P. L. Carmichael, A. Scott and F. Martin, Analyst, 2012, 137, 3202–3215.
10 P. Lasch, Chemom. Intell. Lab. Syst., 2012, 117, 100–114.
11 P. Lasch and W.
Petrich, Biomed. Appl. Sync. Infrared Microspec., 2011, 11, 192–225.
12 S. G. Guyon, M. Nikravesh and L. A. Zadeh, Feature Extraction - Foundations and Applications, Springer, New York, 2006.
13 L. Wang and B. Mizaikoff, Anal. Bioanal. Chem., 2008, 391, 1641–1654.
14 H. Abdi and L. J. Williams, Principal component analysis, Wiley Interdiscip. Rev. Comput. Stat., 2010, 2.
15 M. J. Baker, C. Clarke, D. Démoulin, J. M. Nicholson, F. M. Lyng, H. J. Byrne, C. A. Hart, M. D. Brown, N. W. Clarke and P. Gardner, Analyst, 2010, 135, 887–894.
16 C. Pezzei, J. D. Pallua, G. Schaefer, C. Seifarth, V. Huck-Pezzei, L. K. Bittner, H. Klocker, G. Bartsch, G. K. Bonn and C. W. Huck, Mol. BioSyst., 2010, 6, 2287–2295.
17 P. Bassan, A. Sachdeva, A. Kohler, C. Hughes, A. Henderson, J. Boyle, J. H. Shanks, M. Brown, N. W. Clarke and P. Gardner, Analyst, 2012, 137, 1370–1377.
18 T. J. Harvey, A. Henderson, E. Gazi, N. W. Clarke, M. Brown, E. C. Faria, R. D. Snook and P. Gardner, Analyst, 2007, 132, 292–295.
19 P. Bassan, A. Sachdeva, J. H. Shanks, M. D. Brown, N. W. Clarke and P. Gardner, Analyst, 2013, 138, 7066–7069.
20 J. Vongsvivut, M. R. Miller, D. McNaughton, P. Heraud and C. J. Barrow, Food Bioprocess Technol., 2014, 7, 2410–2422.
21 R. Bhargava, D. C. Fernandez, S. M. Hewitt and I. W. Levin, Biochim. Biophys. Acta, Biomembr., 2006, 1758, 830–845.
22 C. Hughes, J. Iqbal-Wahid, M. Brown, J. H. Shanks, A. Eustace, H. Denley, P. J. Hoskin, C. West, N. W. Clarke and P. Gardner, J. Biophotonics, 2013, 6, 73–87.
23 T. Korenius, J. Laurikkala and M. Juhola, Inf. Sci., 2007, 177, 4893–4905.
24 M. J. German, A. Hammiche, N. Ragavan, M. J. Tobin, L. J. Cooper, S. S. Matanhelia, A. C. Hindley, C. M. Nicholson, N. J. Fullwood, H. M. Pollock and F. L. Martin, Biophys. J., 2006, 90, 3783–3795.
25 K. Polat and S. Güneş, Expert Syst. Appl., 2008, 34, 214–221.
26 M. J. Walsh, M. N. Singh, H. F. Stringfellow, H. M. Pollock, A. Hammiche, O. Grude, N. J. Fullwood, M. A. Pitt, P. L.
Martin-Hirsch and F. L. Martin, Biomarker Insights, 2008, 3, 179–189.
27 M. Khanmohammadi, M. A. Ansari, A. B. Garmarudi, G. Hassanzadeh and G. Garoosi, Cancer Invest., 2007, 25, 397–404.
28 I. I. Patel and F. L. Martin, Analyst, 2010, 135, 3060–3069.
29 T. J. Harvey, E. Gazi, A. Henderson, R. D. Snook, N. W. Clarke, M. Brown and P. Gardner, Analyst, 2009, 134, 1083–1091.
30 M. J. Baker, E. Gazi, M. D. Brown, J. H. Shanks, P. Gardner and N. W. Clarke, Br. J. Cancer, 2008, 99, 1859–1866.
31 M. J. Baker, M. D. Brown, E. Gazi, N. W. Clarke, J. C. Vickerman and N. P. Lockyer, Analyst, 2008, 133, 175–179.
32 V. Llabjani, F. L. Martin, K. M. Ashton, T. Dawson, L. D. Heppenstall, P. L. Martin-Hirsch, W. Pang, J. Trevisan, H. F. Stringfellow, I. I. Patel and K. Gajjar, Anal. Methods, 2012, 5, 89–102.
33 M. Romeo, B. Mohlenhoff, M. Jennings and M. Diem, Biochim. Biophys. Acta, Biomembr., 2006, 1758, 915–922.
34 M. Verdonck, A. Denayer, B. Delvaux, S. Garaud, B. R. De Wind, C. Desmedt, C. Sotiriou, K. Willard-Gallo and E. Goormaghtigh, Analyst, 2016, 141, 606–619.
35 A. Bénard, C. Desmedt, V. Durbecq, G. Rouas, D. Larsimont, C. Sotiriou and E. Goormaghtigh, Spectroscopy, 2010, 24, 67–72.
36 B. R. Wood, K. R. Bambery, C. J. Evans, M. A. Quinn and D. McNaughton, BMC Med. Imaging, 2006, 6, 12.
37 F. L. Martin, M. J. German, E. Wit, T. Fearn, N. Ragavan and H. M. Pollock, J. Comput. Biol., 2007, 14, 1176–1184.
38 P. Bassan, J. Lee, A. Sachdeva, J. Pissardini, K. M. Dorling, J. S. Fletcher, A. Henderson and P. Gardner, Analyst, 2013, 138, 144–157.
39 P. Bassan, A. Sachdeva, J. Lee and P. Gardner, Analyst, 2013, 138, 4139–4146.
40 E. Gazi, J. Dwyer, N. P. Lockyer, P. Gardner, J. H. Shanks, J. Roulson, C. A. Hart, N. W. Clarke and M. D. Brown, Anal. Bioanal. Chem., 2007, 387, 1621–1631.
41 J. Ollesch, S. L. Drees, H. M. Heise, T. Behrens, T. Brüning and K. Gerwert, Analyst, 2013, 138, 4092–4102.
42 P. Lasch, M. Diem and D. Naumann, Proc. SPIE, 2004, 5321, 1–9.
43 P. Lasch, W.
Haensch, D. Naumann and M. Diem, Biochim. Biophys. Acta, 2004, 1688, 176–186.
44 K. M. G. Lima, K. Gajjar, G. Valasoulis, M. Nasioutziki, M. Kyrgiou, P. Karakitsos, E. Paraskevaidis, P. L. Martin and F. L. Martin, Anal. Methods, 2014, 6, 9643–9652.
45 K. M. G. Lima, K. B. Gajjar, P. L. Martin-Hirsch and F. L. Martin, Biotechnol. Prog., 2015, 31, 832–839.
46 A. Mignolet and E. Goormaghtigh, Analyst, 2015, 140, 2393–2401.
47 F. Großerüschkamp, A. Kallenbach-Thieltges, T. Behrens, T. Brüning, M. Altmayer, G. Stamatis, D. Theegarten and K. Gerwert, Analyst, 2015, 140, 2114–2120.
48 N. Wald and E. Goormaghtigh, Analyst, 2015, 140, 2144–2155.
49 M. J. Pilling, P. Bassan and P. Gardner, Analyst, 2015, 140, 2383–2392.
50 J. Dudala, M. Bialas, A. Surowka, M. Bereza-Buziak, A. Hubalewska-Dydejczyk, A. Budzynski, M. Pedziwiatr, M. Kolodziej, K. Wehbe and M. Lankosz, Analyst, 2015, 140, 2101–2106.
51 A. D. Surowka, D. Adamek and M. Szczerbowska-Boruchowska, Analyst, 2015, 140, 2428–2438.
52 S. Rak, T. De Zan, J. Stefulj, M. Kosovic, O. Gamulin and M. Osmak, Analyst, 2014, 139, 3407–3415.
53 J. Trevisan, J. Park, P. P. Angelov, A. A. Ahmadzai, K. Gajjar, A. D. Scott, P. L. Carmichael and F. L. Martin, J. Biophotonics, 2014, 7, 254–265.
54 A. Derenne, R. Gasper and E. Goormaghtigh, Analyst, 2011, 136, 1134–1141.
55 A. Derenne, M. Verdonck and E. Goormaghtigh, Analyst, 2012, 137, 3255–3264.
56 A. Derenne, A. Mignolet and E. Goormaghtigh, Analyst, 2013, 138, 3998–4005.
57 G. Bellisola, G. Cinque, M. Vezzalini, E. Moratti, G. Silvestri, S. Redaelli, C. G. Passerini, K. Wehbe and C. Sorio, Analyst, 2013, 138, 3934–3945.
58 E. Szymanska, J. Gerretzen, J. Engel, B. Geurts, L. Blanchet and L. M. C. Buydens, TrAC, Trends Anal. Chem., 2015, 69, 34–51.
59 D. Habier, R. L. Fernando and J. C. Dekkers, Genetics, 2007, 177, 2389–2397.
60 J. D. Pallua, C. Pezzei, B.
Zelger, G. Schaefer, L. K. Bittner, V. A. Huck-Pezzei, S. A. Schoenbichler, H. Hahn, A. Kloss-Brandstaetter, F. Kloss, G. K. Bonn and C. W. Huck, Analyst, 2012, 137, 3965–3974.
61 G. Theophilou, K. M. G. Lima, M. Briggs, P. L. Martin-Hirsch, H. F. Stringfellow and F. L. Martin, Sci. Rep., 2015, 5, 13465.
62 G. Theophilou, K. M. G. Lima, P. L. Martin-Hirsch, H. F. Stringfellow and F. L. Martin, Analyst, 2015, 141, 585–594.
63 S. F. C. Soares, A. A. Gomes, A. R. G. F. Filho, M. C. U. Araujo and R. K. H. Galvão, TrAC, Trends Anal. Chem., 2013, 42, 84–98.
64 M. J. C. Pontes, R. K. H. Galvão, M. C. U. Araújo, P. N. T. Moreira, O. D. P. Neto, G. E. José and T. C. B. Saldanha, Chemom. Intell. Lab. Syst., 2005, 78, 11–18.
65 D. Ballabio and V. Consonni, Anal. Methods, 2013, 5, 3790–3798.
66 G. E. Menzies, H. R. Fox, C. Marnane, L. Pope, V. Prabhu, S. Winter, A. V. Derrick and P. D. Lewis, Transl. Res., 2014, 163, 19–26.
67 A. Derenne, O. Vandersleyen and E. Goormaghtigh, Biochim. Biophys. Acta, 2014, 1841, 1200–1209.
68 C. Hughes, M. Brown, G. Clemens, A. Henderson, G. Monjardez, N. W. Clarke and P. Gardner, J. Biophotonics, 2014, 7, 180–188.
69 M. Khanmohammadi, K. Ghasemi and A. Bagheri Garmarudi, RSC Adv., 2014, 4, 41484–41490.
70 E. Gazi, M. Baker, J. Dwyer, N. P. Lockyer, P. Gardner, J. H. Shanks, R. S. Reeve, C. A. Hart, N. W. Clarke and M. D. Brown, Eur. Urol., 2006, 50, 750–761.
71 M. J. Walsh, M. N. Singh, H. M. Pollock, L. J. Cooper, M. J. German, H. F. Stringfellow, N. J. Fullwood, E. Paraskevaidis, P. L. Martin-Hirsch and F. L. Martin, Biochem. Biophys. Res. Commun., 2007, 352, 213–219.
72 E. Ly, O. Piot, R. Wolthuis, A. Durlach, P. Bernard and M. Manfait, Analyst, 2008, 133, 197–205.
73 C. Conti, P. Ferraris, E. Giorgini, C. Rubini, S. Sabbatini, G. Tosi, J. Anastassopoulou, P. Arapantoni, E. Boukaki, S. Konstadoudakis, T. Theophanides and C. Valavanis, J. Mol. Struct., 2008, 881, 46–51.
74 J. G. Kelly, T. Nakamura, S. Kinoshita, N. J. Fullwood and F. L.
Martin, Analyst, 2010, 135, 3120–3125.
75 J. G. Kelly, A. A. Ahmadzai, P. Hermansen, M. A. Pitt, Z. Saidan, P. L. Martin-Hirsch and F. L. Martin, Anal. Bioanal. Chem., 2011, 401, 957–967.
76 M. Khanmohammadi, A. Bagheri Garmarudi, S. Samani, K. Ghasemi and A. Ashuri, Pathol. Oncol. Res., 2011, 17, 435–441.
77 G. Steiner, L. Mackenroth, K. D. Geiger, A. Stelling, T. Pinzer, O. Uckermann, V. Sablinskas, G. Schackert, E. Koch and M. Kirsch, Anal. Bioanal. Chem., 2012, 403, 727–735.
78 M. Verdonck, N. Wald, J. Janssis, P. Yan, C. Meyer, A. Legat, D. E. Speiser, C. Desmedt, D. Larsimont, C. Sotiriou and E. Goormaghtigh, Analyst, 2013, 138, 4083–4091.
79 M. J. Baker, E. Gazi, M. D. Brown, J. H. Shanks, N. W. Clarke and P. Gardner, J. Biophotonics, 2009, 2, 104–113.
80 C. Hughes, M. Liew, A. Sachdeva, P. Bassan, P. Dumas, C. A. Hart, M. D. Brown, N. W. Clarke and P. Gardner, Analyst, 2010, 135, 3133–3141.
81 K. Gajjar, J. Trevisan, G. Owens, P. J. Keating, N. J. Wood, H. F. Stringfellow, P. L. Martin-Hirsch and F. L. Martin, Analyst, 2013, 138, 3917–3926.
82 N. C. Purandare, I. I. Patel, J. Trevisan, N. Bolger, R. Kelehan, G. von Bünau, P. L. Martin-Hirsch, W. J. Prendiville and F. L. Martin, Analyst, 2013, 138, 3909–3916.
83 J. Nallala, M.-D. Diebold, C. Gobinet, O. Bouché, G. D. Sockalingum, O. Piot and M. Manfait, Analyst, 2014, 139, 4005–4015.
84 N. C. Purandare, I. I. Patel, K. M. G. Lima, J. Trevisan, M. Ma’Ayeh, A. McHugh, G. Von Bünau, P. L. Martin Hirsch, W. J. Prendiville and F. L. Martin, Anal. Methods, 2014, 6, 4576–4584.
85 L. F. Chiu, P. Y. Huang, W. F. Chiang, T. Y. Wong, S. H. Lin, Y. C. Lee and D. Bin Shieh, Anal. Bioanal. Chem., 2013, 405, 1995–2007.
86 D. Naumann, SPIE BiOS, 2008, 1–12.
87 S. E. Taylor, K. T. Cheung, I. I. Patel, J. Trevisan, H. F. Stringfellow, K. M. Ashton, N. J. Wood, P. J. Keating, P. L. Martin-Hirsch and F. L. Martin, Br. J. Cancer, 2011, 104, 790–797.
88 A. Derenne, R. Gasper and E. Goormaghtigh, Analyst, 2011, 136, 1134–1141.
89 M. Tatarkovič, M. Miškovičová, L. Šťovíčková, A. Synytsya, L. Petruželka and V. Setnička, Analyst, 2015, 140, 2287–2293.
90 C. Beleites, G. Steiner, M. G. Sowa, R. Baumgartner, S. Sobottka, G. Schackert and R. Salzer, Vib. Spectrosc., 2005, 38, 143–149.

CHAPTER 3

A decade (2004 – 2014) of FTIR prostate cancer spectroscopy studies: an overview of recent advancements.
Laurinda F. S. Siqueira, Kássio M. G. Lima. Trends in Analytical Chemistry, 2016, 82, 208–221.

Contributions:
• I wrote the review manuscript
Laurinda F. S. Siqueira
Kássio M. G. Lima.

A decade (2004 – 2014) of FTIR prostate cancer spectroscopy studies: An overview of recent advancements
Laurinda F.S. Siqueira, Kássio M.G. Lima *
Biological Chemistry and Chemometrics, Institute of Chemistry, Federal University of Rio Grande do Norte, Natal, RN 59072-970, Brazil

ARTICLE INFO
Keywords: Infrared (IR) microspectroscopy; Prostate tissue; Cancer diagnosis; Data analysis; Biomarker extraction

ABSTRACT
This paper presents a retrospective study from 2004 to 2014 of FTIR prostate cancer spectroscopy related to tissues and cell biology. Since vibrational spectroscopy is highly sensitive to the biochemical composition of the sample and variations therein, it is possible to monitor metabolic processes in tissue and cells, and to construct spectral maps based on thousands of collected IR spectra. These reveal information on tissue structure, distribution of cellular components, metabolic activity and the health condition of cells and tissues.
In addition, the rapid collection, reliable data and powerful structure-elucidation ability of IR spectroscopy, together with the need for a diagnosis faster than traditional biopsy (which is subject to sampling and inter-observer variability), have promoted infrared as a route to a new type of analysis that is based on optical examination and is more objective than conventional colour methods.
© 2016 Elsevier B.V. All rights reserved.

Contents
1. Introduction
2. FTIR spectroscopy in prostate cancer diagnosis, classification and imaging
2.1. Sample preparation
2.1.1. Sample formats
2.1.2. Sample thickness
2.1.3. Sample treatment
2.1.4. Substrate choice
2.2. Instrumentation and spectral acquisition
2.2.1. Modes
2.2.2. Mapping and imaging
2.2.3. Spectral acquisition
2.3. Data processing
2.3.1. Pre-processing
2.3.2. Feature extraction, clustering, classification and biomarker extraction
3. Applications
4. Perspectives
5. Conclusion
Acknowledgments
Uncited references
References

Abbreviations: ANN, Artificial Neural Networks; ATR, Attenuated Total Reflection; BPH, Benign Prostatic Hyperplasia; PCa, Prostate Cancer; EMSC, Extended Multiplicative Signal Correction; FCM, Fuzzy C-means; FSD, Fourier Self-Deconvolution; FPA, Focal Plane Array; FTIR, Fourier-Transform Infrared Spectroscopy; GA, Genetic Algorithm; GS, Gleason Score; HCA, Hierarchical Cluster Analysis; IR, Infrared; LDA, Linear Discriminant Analysis; MIR, Mid-Infrared Region; MCT, Mercury–Cadmium–Telluride; MS, Mass Spectrometry; MSC, Multiplicative Scatter Correction; NIRS, Near-infrared Spectroscopy; PCA, Principal Components Analysis; QCL, Quantum-Cascade Laser; SNR, Signal-To-Noise Ratio; SNV, Standard Normal Variate; SPA, Successive Projection Algorithm; TNM, Tumour/Node/Metastases.
* Corresponding author. Tel.: +55 84 3342 2323; Fax: +55 84 3211 9224. E-mail address: kassiolima@gmail.com (K.M.G. Lima).
http://dx.doi.org/10.1016/j.trac.2016.05.028
0165-9936/© 2016 Elsevier B.V. All rights reserved.
Trends in Analytical Chemistry 82 (2016) 208–221

1. Introduction

This paper presents a review of vibrational micro-spectral imaging classification of prostate tissue and cancer diagnosis from 2004 to 2014. Historically, the growth of interest in IR as a potential technique in various areas dates back over 60 years, to studies by Blout and Mellots (1950) and Woernley (1952), who investigated IR spectra of tissue homogenates in search of disease indicators. These studies were carried out using single beam, manually scanned instruments, which required milligram quantities of the sample and exhibited poor sensitivity and reproducibility. Furthermore, since the framework for interpreting the observed spectra had not yet been developed, the field was
In the 1980s, research on living systems started again, with many advances in instrumentation, interpretation methods and structure elucidation. In this period, the focus was on the identification of bacterial and fungal pathogens [2]. In the 1990s, the focus shifted to human cell and tissue diseases; the first study, by Wong in 1991, did not overcome the issue of tissue heterogeneity. Only when microspectroscopic methods were used did a detailed histopathological correlation between spectra and disease stage become possible [1].

The well-known rapid data collection, reliability and powerful structure-elucidation ability of Fourier-transform infrared (FTIR) spectroscopy [3], together with the need for a diagnosis more rapid than the traditional biopsy, have promoted infrared imaging as a new type of analysis based on optical examination. In addition, it is more objective than conventional colour methods, since it is possible "to read" the biochemical changes instead of relying on an approach based on morphological changes.

In the 1990s, Malins et al. [4] obtained a virtually perfect separation of clusters of points representing DNA from normal prostate tissue, benign prostatic hyperplasia and adenocarcinoma in prostate cancer using exploratory analysis (PCA) coupled with FTIR. The findings suggested that the progression of normal prostate tissue to benign prostatic hyperplasia and to prostate cancer involves structural alterations in DNA that are distinctly different. A few years later, Malins et al. [5] investigated prostate glands of certain healthy men over 55 years of age, showing that the same DNA signature exists in normal tissues adjacent to tumours.

Prostate tissue is structurally complex, primarily consisting of glandular ducts lined by epithelial cells and supported by heterogeneous stroma.
The tissue also contains blood vessels, blood, nerves, ganglion cells, lymphocytes and stones (which are comprised of luminal secretions and cellular debris) that are organized into structures measuring from tens to hundreds of microns, and that are readily observable within stained tissue using bright-field microscopy at low to medium magnifications [6].

Histopathological typing using the Gleason grading system is the standard approach for grading prostate cancer and provides an indication as to the aggressiveness of a tumour. However, this system is based upon a visual criterion of pattern recognition that is operator-dependent and subject to intra- and inter-observer variability. Thus, there is a need for molecular-based techniques to grade tissue samples in a reliable and reproducible manner. FTIR imaging of microarrays has been coupled with statistical pattern recognition techniques in order to demonstrate histopathologic characterization of prostatic tissue and to differentiate benign from malignant prostatic epithelium [6–10].

The transition of a normal cell to a diseased cell is accompanied by a change in a variety of biomolecules that can be simultaneously and indiscriminately probed by FTIR microspectroscopy, yielding spectral signatures that enable differentiation between normal and cancerous cells and tissues [6,11–13]. This demonstrates that histopathologic changes can now be defined by biochemistry-based, objective spectroscopic criteria that do not require a pathologist's interpretation. Spectroscopic imaging represents a new avenue for the chemical interpretation of tissue and offers additional capabilities for automated, statistically controlled and reproducible subtype recognition.

All of these aspects have allowed IR biospectroscopy to be applied to diagnosing many types of cancer over the years. In this paper, the diagnosis and classification of prostate cancer is specifically emphasized.

2. FTIR spectroscopy in prostate cancer diagnosis, classification and imaging

Within vibrational spectroscopy, in the last few years FTIR has been applied to metabolomics and to the diagnosis of diseases of interest far more than either Raman (which predominantly measures non-polar bonds, as opposed to the polar bonds that FTIR measures) or NIRS. In the case of NIRS, it is predominantly overtones and combination vibrations that are measured, while the FTIR spectra collected in the MIR are much more information-rich in terms of chemical content, as it is the fundamental vibrations that are being measured [6]. FTIR is based on the principle that when a sample is investigated with an infrared (IR) beam, the functional groups within the sample absorb the infrared radiation and vibrate in one of a number of ways: stretching, bending, deformation or combination vibrations. These absorptions/vibrations can then be directly correlated to (bio)chemical species, and the resultant infrared absorption spectrum can be described as an infrared 'fingerprint' characteristic of any chemical or biochemical substance [14].

Infrared (IR) spectroscopy exploits the ability of cellular biomolecules to absorb in the MIR region through vibrational transitions of chemical bonds. For most disease diagnoses, researchers have concentrated on this MIR part of the spectrum (4000–600 cm−1), because, in contrast to near-infrared spectroscopy (NIRS) (14000–4000 cm−1), the fundamental vibration is seen rather than an overtone or harmonic. Thus, MIR spectra contain many sharp peaks and are very information-rich. In biological terms, the vibrations in the 1500–1750 cm−1 wavenumber region (the amide I and II bands) are ascribable to C=O, N–H and C–N from proteins and peptides, for example.
Due to its rapidity, reproducibility, holistic nature and ability to analyse carbohydrates, amino acids, fatty acids, lipids, proteins and polysaccharides simultaneously, FTIR has been recognized as a valuable tool for metabolic fingerprinting/footprinting [14–16].

Some cellular biomolecules (or biochemicals) of interest that absorb at different wavenumbers are amide I (≈1650 cm−1), amide II (≈1550 cm−1), protein (≈1425 cm−1), amide III (≈1260 cm−1), asymmetric phosphate stretching vibrations (νasPO2−, ≈1225 cm−1), carbohydrates (≈1155 cm−1), symmetric phosphate stretching vibrations (νsPO2−, ≈1080 cm−1) and protein phosphorylation (≈970 cm−1) [17–19].

Following the progress in research on IR spectroscopy of biological material and disease diagnosis over the decade 1994–2004, as indicated by Diem et al. [1], many other studies using FTIR spectroscopy as an imaging tool, or for classifying spectral categories and distinguishing benign from malignant tumours in prostate tissue samples, have been reported by Lasch et al. [20], Gazi et al. [11,21,22], Harvey et al. [11,23], Baker et al. [24–26], Bassan et al. [27–30], Hughes et al. [31] and Malins et al. [4,5]. Further particulars of FTIR application in prostate cancer diagnosis and classification are described later in this review.

2.1. Sample preparation

2.1.1. Sample formats

The main prostate samples used for clinical IR spectroscopy are tissue samples, biofluids and cells. The focus of this paper is prostate tissue and cell lines. Table 1 shows some format considerations for each sample.

2.1.2. Sample thickness

The material should be thick enough to allow a sufficiently large absorbance intensity to be recorded.
In transmission and transflection modes, the specimen thickness needs to be adjusted appropriately; if the specimen is too thick, the detector response will be non-linear, so that the Beer-Lambert law cannot be applied. This has serious consequences for subsequent quantitative and classification analyses. When using ATR-FTIR spectroscopy, the specimen should ideally be three- or four-fold thicker than the penetration depth [26]. However, there seems to be consensus among researchers that the thickness should be between 5 and 10 μm for prostate tissue, as shown in Table 1.

2.1.3. Sample treatment

The removal of contaminants and excess solvents is an important step of the analysis, because some of these absorb at the same wavenumbers as the fingerprint of interest in studies of prostate tissue. The majority of studies covered here involve paraffinized tissue. The tissue must be dewaxed, re-hydrated and dried before the analysis, since paraffin has significant peaks in the ~2954–1373 cm−1 range; in other words, the spectral signal of paraffin includes high absorptions in a number of regions of the mid-infrared. If residual paraffin were present, large differences in spectral intensities would obscure tissue classification. Some of the procedures used are:

1. Immersion in hexane at 40°C for 48 h and/or in a sequence of 3–4 baths [6,8,31];
2. Immersion in xylene in a sequence of 3–4 baths for 5 minutes, followed by washing and cleaning in acetone for 5 minutes and air-drying [19,26];
3. Washing in an orbital mixer with Citro clear for 6 or 20 minutes, and then with acetone at 48°C for a further 6 or 20 minutes, before air-drying for 1 h under ambient conditions [11,27].

For frozen tissue, 90% ethanol is used to wash the sample for 1 minute, followed by pure ethanol for 1 minute and drying in an aspirator (3.2 kPa) for 30 minutes at room temperature [32].

Some authors, such as Bassan et al. [27,28,30], did not use any dewaxing procedure; instead, they subtracted the paraffin spectra.
Hughes et al. [31], however, compared hexane and xylene as solvents and the time required for effective deparaffinization of prostate tissue. In this analysis, the tissue samples were dipped 10 times in 3 mL of xylene or hexane, and then left immersed for 5, 15, 30, 60 and 120 minutes, and 24 h. The sample was then removed, dipped 10 times in 3 mL of ethanol and left to air-dry at room temperature for 30 minutes. The infrared image was acquired after each procedure.

In this context, Hughes et al. [31] showed that the mean spectral signals of the xylene-washed tissue specimen appeared consistent. This could suggest that all of the paraffin had been removed, down to the detection limit, after 5 minutes of solvent immersion. In the hexane specimen, only slight observable differences were noted between 5 and 10 minutes. This can be seen in the difference in peak heights between νas(CH2) (~2917 cm−1) and δs(OH) (~3289 cm−1), visually expressed by the gradient between them, as well as in a slight change in the peak shoulder shape of δ(CH2) at ~1462 cm−1. Even slight changes were not detectable after 10 minutes, indicating a steady state in accordance with that of chemically bound tissue.

In comparing dewaxing efficacy between xylene and hexane, Hughes et al. [31] found no major difference. However, as hexane is more flammable and the time for dewaxing with xylene is shorter, the authors suggest that tissue should be dewaxed for a minimum of 5 minutes with xylene.

2.1.4. Substrate choice

There appears to be a general consensus concerning the slide or matrix upon which the sample is placed and the associated preparation steps, which suggests that transmission or ATR spectroscopy measurements are more applicable to the examination of biological material. The substrate needs to be an IR-transparent material such as ZnSe, BaF2 or CaF2, instead of IR-reflective substrates such as Low-E IR slides. This is essential in order to acquire the best and most reproducible spectra [24,36].
Table 2 shows some considerations about the substrate choice.

Table 1
Sample formats and thickness of prostate samples for FTIR spectroscopy
Sample format | Thickness | Considerations
Tissue | 4 μm [32,33]; 5 μm [8,27,32]; 6 μm [20]; 10 μm [19,22,24,25,34] | Can be stored for months. The thicker the tissue, the greater the chance of probing heterogeneous layers and possibly multiple cell types, rendering the cell-type signal less pure. For fixed tissue, the sample can be affected by chemical deparaffinization. For cryosectioned tissue, although snap-freezing negates the use of fixatives such as formalin or the use of paraffin, it may damage the structural integrity of the tissue. Also, once a sample is thawed, components may start to degrade quickly.
Cell lines | 2 μm [11,23,34]; 5 μm [30] | Only small sample volumes are needed. Cells fixed and then placed onto slides may be uneven in thickness. The strong absorptivity of the water molecule can hinder single-cell microspectroscopy.
Biofluids | 50 μL [35]; ~1 μL [26,30] | Only small sample volumes are needed. No reagents are required, a profile of spectral alterations can be determined and the methods are suitable for automation. Samples that are not immediately used should be frozen and stored at −80°C.

2.2. Instrumentation and spectral acquisition

FTIR spectroscopic instrumentation combines an interferometer, an IR microscope and an array detector [37]. The FTIR spectrometer uses a polychromatic radiation source in combination with a Fourier transform instrument to measure all wavelengths at once. The infrared radiation is mixed with light from a reference laser to provide a calibrated interferogram. A variety of choices are available for the IR source, including globar, synchrotron and quantum-cascade lasers (QCLs). Once the IR beam has passed through the sample, the signal is measured using an IR-sensitive detector, such as mercury–cadmium–telluride (MCT), 2D focal plane array (FPA), linear array or single element. The signal is subjected to Fourier transformation to produce an infrared transmittance spectrum over the range of wavelengths. This can then be converted into absorbance values to provide positive peaks corresponding to high absorption by a particular molecular species [24,38]. Some of the IR spectrometers most used in prostate studies are shown in Table 3.

2.2.1. Modes

The choice between ATR, transmission and reflection sampling modes depends on the sample type and the application. Transmission and transflection sampling modes have been applied to a variety of biological specimens that can be sectioned into a thin layer, allowing for accurate spectral data acquisition. In the conventional transmission mode, the IR light is absorbed by a thin tissue section in its path, while in the transflection mode the IR light is transflected by a reflecting surface onto which the tissue is placed [40].

ATR mode differs in that the IR beam is directed through an internal reflection element with a high refractive index. In this case, the surface of the element must be in direct contact with the sample for evanescent wave penetration. This penetration typically ranges from 1 to 2 μm within the 1800–900 cm−1 region, but it should be remembered that there is still ~5% intensity at a depth of 3 μm [15,24,27,34,41].

In general, transmission and transflection imaging have been widely implemented in biological tissues. Imaging in ATR mode is a versatile option, because little sample preparation is required owing to minimal sample-thickness restrictions. It has therefore been implemented in biological fields such as pharmacology and subcellular investigation [42,43].

The majority of prostate cancer studies have been carried out in the transmission mode [4,6,17,20,22,23,28,29,31,32,44]. More recently, the reflection mode [9,21,31], the transflection mode [25,27,28,32] and ATR mode [25,30] have also been applied.
In this context, Pezzei et al. [32] tested NIR reflectance, NIR transmission, MIR reflectance and MIR transmission measurement modes on 12 μm thick prostate tissue on a CaF2 slide. The measurements in reflectance mode, in both the NIR and the MIR wavelength ranges, were unusable because they contained only noise. In transmission mode, the MIR measurements delivered better results than the NIR measurements. This confirms the preference for the transmission mode and the MIR.

2.2.2. Mapping and imaging

Mapping and imaging depend on the detector used in the analysis. Detectors can be separated into single-element, linear array and FPA detectors, which allow spectral maps or images to be constructed. These tissue maps permit a direct correlation between the spectral map and the sample histopathology; in fact, spectral imaging (or mapping) reproduces the tissue architecture in both normal and diseased tissue samples.

A single-element detector allows individual point spectra to be obtained across a whole sample (this is useful when analysing biofluids); a particular application has been to derive single cell-specific fingerprint spectra across a heterogeneous tissue section. The absorbance intensity at each spectral point in the maps becomes an individual pixel in the resultant pseudo-colour images, which can give details of how different biomolecules vary across the target area. However, spectral maps take much longer to acquire than individual point spectra, because each point spectrum is usually collected at a high signal-to-noise ratio (SNR) in order to yield high-quality spectra. The acquisition time can be reduced by accepting a lower SNR [24,44,45].

FPA and linear array detectors provide imaging using spatial multi-element detectors that allow simultaneous spectral acquisition, producing spectral images with good SNR and a lateral spatial resolution close to the diffraction limit when combined with suitable optics.
Measurements using an FPA detector are fast, since such detectors allow for the acquisition of thousands of spectra simultaneously. These can be used to generate pseudo-colour images of the target area, such as those shown in the characterization of prostate tissue [4,19,46].

In general, mapping studies typically examine samples at coarse spatial resolution (>20 μm) and investigate small numbers of patients (<50) or record small numbers of spectra per patient (<100) [6]. Spectral maps are composed of a large number of point spectra acquired in a stepwise manner, and it is necessary to set up background scans to be taken at set intervals to account for the atmospheric variation over the extended acquisition time [47].

Table 2
Considerations about substrate choice
Substrate choice | Considerations
CaF2 [20,32,33] | Less reflective loss at the substrate–sample interface. Biochemically compatible for cell growth.
BaF2 [6,8,22,24,25,34] | Not suitable for cell growth due to low cell viability.
ZnSe [36] | Higher refractive index could lead to strong interference between refracted light beams. Not suitable for cell growth due to low cell viability.
Low-e MirrIR infrared reflecting plates (Kevley Technologies, Ohio, USA) [11,19,22–24,27,29,30,32] | Suitable for in situ cell culture. Not suitable for samples where the thickness is undetermined or is less than the wavelength of the infrared light.

Table 3
Instrumentation used in prostate cancer studies from 2004 to 2014
Instrumentation | References
Bruker HYPERION Infrared Microscope (Bruker Optics, Ettlingen, Germany). | [20]
Michelson interferometer and all-reflecting infrared microscope (Perkin-Elmer Spotlight 300). | [6]
Spotlight 300 (Perkin Elmer Inc.) infrared imaging spectrometer. | [8]
Nicolet Magna system 550 spectrometer equipped with a liquid nitrogen-cooled MCT/A detector and a KBr beam splitter, attached to a microscope equipped with a video camera to view optical images (×150 magnification) of the sampling area and a programmable computerized x–y stage. Aperture size of 60 × 60 μm. | [20,22,39]
BioRad FTS 7000 spectrometer coupled with an MTEC (model 300) PA cell (MTEC Photoacoustics, USA), operated in dynamic scanning mode. | [11]
Nicolet FTIR spectrometer coupled to a Nic-Plan microscope and equipped with an MCT liquid N2-cooled detector. | [23]
FTIR microscope (Perkin-Elmer Spotlight 400, MA, USA) equipped with a liquid nitrogen-cooled mercury–cadmium–telluride (MCT) 16-element linear array detector. | [32]
Thermo Nicolet 6700 FTIR spectrometer coupled to a Nicolet Continuμm microscope (Thermo Fisher Scientific, Hemel Hempstead, UK) equipped with a KBr beam splitter and a mercury–cadmium–telluride (MCT) detector. | [19]
Varian FTS 7000 spectrometer coupled to a Varian 600 UMA FTIR microscope equipped with a germanium crystal ATR accessory. | [27]
Varian 670-IR spectrometer coupled with a Varian 620-IR imaging microscope (Agilent Technologies, Santa Clara, CA) equipped with a 128 × 128 pixel liquid nitrogen-cooled mercury–cadmium–telluride (MCT) focal plane array (FPA) detector, with or without an Agilent 600 series Ge-based slide-on ATR accessory. | [25,27,29,30]

Imaging is defined as data analysis that uses an unsupervised data processing method to reveal tissue structure in a 'spectral cube' acquired by a mapping or imaging technique. Imaging allows for the study of the shape and penetration of important histopathological features on the basis of the underlying chemistry [24,48]. For spectral images, background spectra should be acquired over a defined time period, depending on the sample acquisition time [47].

2.2.3. Spectral acquisition

Maximal information for cells and tissue can be obtained by sampling spectral data in a mapping or imaging approach in which 1,000 or 10,000 spectra are collected for individual "pixels" (spatial elements).
Analysis of such spectral "hypercubes" (consisting of pixel coordinates, wavelength and intensity information) yields astounding information on tissue architecture and on the presence or absence of disease [1].

Measurement of an FTIR absorption spectrum involves collecting a 'single-beam' spectrum. A background single-beam spectrum provides the source intensity, as modified by the instrument; placing a sample in the beam path and measuring the single beam again theoretically provides only the additional effect of the sample absorbance. A logarithm (to the base 10) of the ratio of these quantities provides the absorbance, which is directly related to concentration by Beer's law [26].

For spectral acquisition from prostate tissues, some spectral resolutions and wavenumber ranges are listed in Table 4.

The region of interest (fingerprint) or diagnostic spectral region for prostate cancer is well defined, and some areas show greater differentiation between normal and cancerous prostatic tissue and cells, as shown in Table 5.

According to Pezzei et al. [32], prostate cancer tissue is highly metabolic, possibly because of a higher proliferation rate compared to benign or stromal tissue, as shown in Fig. 1D. Fig. 1D depicts a chemical map generated by integrating the area under the absorption band at 2920–2850 cm−1, which is commonly attributed to lipids and carbohydrates. In contrast to cancer, the stroma tissue seems to have a very high production of proteins, as shown in the chemical map of Fig. 1C, which shows absorption at 1591–1483 cm−1, attributed to a secondary amide and indicating a high amount of protein production by the stroma tissue.

More spectral differentiation between normal and cancerous prostate tissue will be discussed in the applications section.

2.3. Data processing

Some analysis tools for diagnosis, imaging, and finding and identifying biomarkers are described here.
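Before any of these tools are applied, the single-beam measurements described in Section 2.2.3 have to be converted to absorbance. A minimal Python sketch of that conversion, together with the band-area integration used to build chemical maps such as those in Fig. 1, is given below. The function names, the NumPy dependency and the trapezoidal integration rule are illustrative assumptions and are not taken from the cited studies.

```python
import numpy as np

def absorbance(sample_single_beam, background_single_beam):
    """Absorbance as log10 of the background-to-sample single-beam ratio."""
    sample = np.asarray(sample_single_beam, dtype=float)
    background = np.asarray(background_single_beam, dtype=float)
    return np.log10(background / sample)

def band_area(wavenumbers, spectrum, low, high):
    """Area under one spectrum between two wavenumbers (cm-1).

    Assumption: simple trapezoidal rule, no local baseline subtraction.
    """
    w = np.asarray(wavenumbers, dtype=float)
    a = np.asarray(spectrum, dtype=float)
    keep = (w >= low) & (w <= high)
    order = np.argsort(w[keep])  # spectra are often stored from high to low wavenumber
    w, a = w[keep][order], a[keep][order]
    return float(np.sum(0.5 * (a[1:] + a[:-1]) * np.diff(w)))
```

Applying `band_area` to every pixel spectrum with limits of 2850 and 2920 cm−1 would give one value per pixel, which can then be displayed as a false-colour map.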
A diagnosis using IR spectroscopy requires a more complex framework that uses supervised classification methods [11,13,19,26,30,32,49]. The modelling process for diagnosis requires separate training and testing stages, with respective training and test data sets. The optimal size of a training data set has been under-investigated to date, but it has been suggested that it may be problem-dependent. The number of times that the classifier correctly guesses the class of a test sample should be counted to calculate a classification rate. Next, additional samples should be added and the cross-validation process repeated in order to compare the new classification rate with the old one. The process of adding samples and repeating cross-validation should continue until the classification rate stops improving [52].

It is important to note that a diagnostic framework may be set up to use either point spectra or image maps; in the latter case, the trained classification system can be used to predict tissue structure [8,20,26,48].

The data analysis steps are also described here: pre-processing, feature extraction, clustering and classification.

Table 4
Spectral resolutions and wavenumber ranges for spectral acquisition from prostate tissues
Spectral resolution | References | Wavenumber range (cm−1) | References
2 cm−1 (for cell line studies) | [11,22,23] | 650–4000 | [19]
1 cm−1 | [21] | 720–4000 | [6,8]
4 cm−1 | [6,22,23,25,29–32] | 750–4000 | [22,24,25,34,49]
8 cm−1 | [50] | 800–4000 | [23]
12 cm−1 | [49] | 850–4000 | [27]
16 cm−1 | [11,29,30] | 900–4000 | [27]
 | | 1000–4000 | [32]
 | | 1000–3800 | [33]

Table 5
Marking of fingerprint area for spectral acquisition from prostate tissues
Region | Wavenumber range
Fingerprint area [6,8,22,24,49,51] | ~1600 to ~1000 cm−1
Carbohydrate region | 900–1185 cm−1
Nucleic acid phosphates region | 1185–1300 cm−1
Secondary amide region | 1591–1483 cm−1
Lipids region | 2850–2920 cm−1

Fig. 1. (A) Detail of a measured prostate cancer slide (HE staining after IR measurement, Gleason score 5) with marked regions (1 = cancer, 2 = stroma, 3 = benign glands). (B) FTIR imaging result shown in false-colour representation; colours reflect intensities of the selected absorption at 1080–1060 cm−1, which is commonly attributed to nucleic acids. (C) FTIR imaging result shown in false-colour representation; colours reflect intensities of the selected absorption at 1591–1483 cm−1, which is commonly attributed to proteins. (D) FTIR imaging result shown in false-colour representation; colours reflect intensities of the selected absorption at 2920–2850 cm−1, which is commonly attributed to lipids and carbohydrates (Pezzei et al. [32]).

2.3.1. Pre-processing

Pre-processing improves the robustness and accuracy of subsequent multivariate analyses and increases data interpretability by correcting issues associated with spectral data acquisition [52]. The main goals of data pre-processing can be summarized as follows [52,53]:

(I) Improvement of the robustness and accuracy of subsequent quantitative or classification analyses;
(II) Improved interpretability: raw data are transformed into a format that will be better understood by both humans and machines;
(III) Detection and removal of outliers and trends;
(IV) Reduction of the dimensionality of the data mining task;
(V) Removal of irrelevant and redundant information by feature selection.

The choice of pre-processing methods may depend on the analysis goal, the physical state of the sample, and the time and computing power available [26]. Some pre-processing methods and examples used in prostate tissue studies and others are shown in Table 6.

2.3.2. Feature extraction, clustering, classification and biomarker extraction

2.3.2.1. Feature extraction.
For diagnosis, feature extraction constitutes an important data reduction step that matches the complexity of the subsequent supervised classifier to the amount of data available, so as to avoid overfitting or under-training. PCA is one particularly popular form of unsupervised feature extraction used for this purpose. PCA may be applied to the spectral data set, followed by selection of a single PCA factor for the colour gradient. The number of PCA factors to retain may be subject to optimization. One approach is to order the PCA factors from the most to the least discriminant on the basis of their P values as determined by a statistical test. The percentage of explained variance can also be taken into account [6,58]. Many studies of prostate tissue have used PCA in their analyses [11,13,19,29,30,32,49,55,59,60].

2.3.2.2. Clustering. For data from specimens and biological tissues of high molecular complexity, the discrimination between different types of biological samples requires a thorough evaluation and comparison of the similarities and dissimilarities between the spectra. In this case, when hierarchical cluster analysis is used, the results of FTIR imaging and light microscopy are highly correlated. In clustered images, the spectra of a particular cluster are encoded in a unique colour. To assemble the infrared images, the colour allocated to each cluster is displayed at the coordinates at which each pixel spectrum was collected [35,44].

Clustering methods such as hierarchical cluster analysis (HCA), k-means clustering (KMC) and, more recently, Fuzzy C-means are frequently used in IR-imaging studies to identify prostate tissue morphology [20,25,32,44,61].

Hierarchical cluster analysis (HCA) is an unsupervised "hard" clustering method, i.e. each spectrum either belongs to a given cluster or does not [20]. It reclassifies spectra into clusters by a minimal distance criterion.
The distance between clusters gives an estimation of the spectral differences, and the results are shown in a dendrogram [32]. HCA groups spectra into mutually exclusive clusters; in IR-imaging studies, HCA-based segmentation is achieved by assigning a distinct colour to the spectra in one cluster. Because each spectrum of an IR-imaging experiment has a unique spatial (x,y) position, pseudo-colour segmentation maps can be easily generated by plotting coloured pixels as a function of the spatial coordinates [61].

K-means clustering is a non-hierarchical clustering method used to group spectra with similar spectral characteristics. The method is based on minimizing the squared distances between the data and their cluster centres, and the class membership of an individual spectrum can only take the value 0 or 1. An iterative algorithm is used to update randomly selected initial cluster centres. Assuming well-defined boundaries between the clusters, this algorithm obtains the class membership for each spectrum [28–32,42,49]. The iterative K-means algorithm can be described as follows: (1) IR spectra are represented as points in a p-dimensional space; (2) k points are initially chosen, each denoting the origin of a future cluster; (3) the distances between these points and all objects (spectra) are calculated, and each object is assigned to the closest point; (4) the centroids of the clusters are calculated and the distances between the centroids and each of the objects are recalculated. If the closest centroid is not associated with the cluster to which the object currently belongs, the object is reassigned to the cluster with the closest centroid. This process continues until none of the objects are reassigned [20].

Fuzzy C-means (FCM) clustering is a non-hierarchical clustering method that differentiates objects into groups whose members reveal a certain degree of similarity.
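For comparison with the fuzzy approach, the hard-assignment K-means iteration outlined above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the cited studies: the function name `kmeans`, the use of NumPy, the Euclidean distance, the random choice of initial centres from the data and the `max_iter` cap are all assumptions made for the example.

```python
import numpy as np

def kmeans(spectra, k, max_iter=100, seed=0):
    """Hard K-means on an (n_spectra, n_wavenumbers) array.

    Returns one cluster label per spectrum and the final centroids.
    """
    X = np.asarray(spectra, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 2: pick k spectra at random as the initial cluster centres.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 3: squared Euclidean distance of every spectrum to every centre.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        # Stop when no spectrum changes cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: recompute each centroid as the mean of its members.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```

For an image, the returned labels can be reshaped to the pixel grid and plotted with one colour per cluster to obtain a pseudo-colour segmentation map.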
The output of FCM clustering is a membership function that defines the degree of membership of a given spectrum in each of the clusters. The values of the membership function can vary between 1 (highest degree of cluster membership) and 0 (no class membership), and the sum of the C cluster membership values for one object equals 1. "Soft" linguistic system variables and a continuous range of truth values in the interval [0, 1] are used. To calculate the class membership grade for each spectrum, a fuzzy iterative algorithm is used, based on minimizing an objective function. This objective function represents the distance from any given data point (spectrum) to the actual cluster centre, weighted by that data point's membership grade [18,32].

2.3.2.3. Classification. Discriminant analysis, or linear discriminant analysis (LDA), is coupled with PCA [8,11,19,23–25] and with cluster analysis [20,22,30,32–35,42,49] to increase the power of segregation. PCA-LDA is used to discriminate between the spectra. This is achieved by maximizing the inter-group variance and minimizing the intra-group variance. As PCA-LDA is a supervised technique, a way of calculating the optimum number of principal components is needed.
Table 6
Pre-processing tests used in prostate tissue studies
Pre-processing tests | References
Quality tests (test for absorption bands of atmospheric water vapour; so-called test for sample thickness; test of the spectral signal-to-noise ratio (SNR); test for specific band; bad pixel test) | [8,27,32,33,46,52–54]
Normalization (Multiplicative Scatter Correction, MSC; Extended Multiplicative Signal Correction, EMSC; Min-Max normalization; 1-norm normalization; vector normalization; standard normal variate, SNV) | [8,11,19,22,24,27,29,30,34,49,52,55]
Baseline correction (offset baseline correction; piecewise baseline correction; polynomial baseline correction; Savitzky-Golay baseline correction) | [8,19,22,24,32,34,49,55,56]
Spectral filtering (smoothing/derivatives: noise and derivative filters; Fourier self-deconvolution, FSD) | [8,11,19,23,24,33,49,52,55]
Other methods (spectral subtraction) | [11,28,29,56,57]

L.F.S. Siqueira, K.M.G. Lima / Trends in Analytical Chemistry 82 (2016) 208–221

The only robust way of estimating the correct number of PCs is some form of cross-validation; in this case, training set/test set validation. Essentially, a PCA-LDA model is built for each number of PCs up to a maximum of m PCs. The optimum number of PCs is the one that provides maximal group separation and correct identification of classes [22,39,52,62].

PCA-LDA permits the construction of a predictive model that can be used for multi-group classification and dimensionality reduction of a given data set. LDA generates a first discriminant function, a linear combination of the predictor variables that best separates the groups. A second discriminant function is then generated that is uncorrelated with the first and further separates the groups, giving maximal variance between groups and minimal variance within groups [22].
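The training set/test set selection of the number of PCs described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the procedure of any cited study: PCA is computed by singular value decomposition, LDA is implemented as assignment to the class with the smallest Mahalanobis distance under a pooled covariance and equal priors, test-set accuracy stands in for "maximal group separation", and all function names and the synthetic two-class spectra are hypothetical.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the training mean and the first PCA loadings (variables x PCs)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components].T

def lda_fit(T, y):
    """Class means and inverse pooled covariance of the PCA scores T."""
    classes = np.unique(y)
    means = np.array([T[y == c].mean(axis=0) for c in classes])
    resid = T - means[np.searchsorted(classes, y)]
    pooled = resid.T @ resid / (len(y) - len(classes))
    return classes, means, np.linalg.pinv(pooled)

def lda_predict(T, classes, means, cov_inv):
    """Assign each row of T to the class with the smallest Mahalanobis distance."""
    diff = T[:, None, :] - means[None, :, :]
    d2 = np.einsum("nci,ij,ncj->nc", diff, cov_inv, diff)
    return classes[d2.argmin(axis=1)]

def select_n_pcs(X_train, y_train, X_test, y_test, max_pcs):
    """Build one PCA-LDA model per number of PCs and score it on the test set."""
    accuracies = []
    for a in range(1, max_pcs + 1):
        mean, loadings = pca_fit(X_train, a)
        model = lda_fit((X_train - mean) @ loadings, y_train)
        predicted = lda_predict((X_test - mean) @ loadings, *model)
        accuracies.append(float(np.mean(predicted == y_test)))
    # argmax returns the first maximum, i.e. the smallest number of PCs among ties.
    return int(np.argmax(accuracies)) + 1, accuracies

# Synthetic two-class "spectra" (toy data): class 1 carries one extra band.
rng = np.random.default_rng(0)
axis = np.arange(50)
band = np.exp(-0.5 * ((axis - 20) / 3.0) ** 2)
y = np.repeat([0, 1], 30)
X = 0.05 * rng.normal(size=(60, 50)) + np.outer(y, band)
train = np.r_[0:20, 30:50]
test = np.r_[20:30, 50:60]
best, accuracies = select_n_pcs(X[train], y[train], X[test], y[test], max_pcs=5)
print(best, accuracies)
```

Choosing the smallest number of PCs among equally accurate models is a design choice of this sketch; it favours the more parsimonious model.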
PCA-LDA is a supervised method that searches for the variables with the smallest intra-group separation and the largest inter-group separation, and constructs a linear combination of variables to discriminate between the groups [23]. In this case, LDA maximizes the inter-category variance relative to the intra-category variance based on pre-set class labels, resulting in optimal class segregation [22].

Further, as part of the computational methodology, variable selection methods such as the successive projections algorithm (SPA) [63] and the genetic algorithm (GA) [64], used in conjunction with LDA, improve model performance compared with the full-spectrum model. These algorithms eliminate potential interferents and variables with a lower signal-to-noise ratio. The optimum number of variables for SPA–LDA and GA–LDA is determined from the average risk G of LDA misclassification. This cost function is calculated on the validation set as

$$G = \frac{1}{N_V} \sum_{n=1}^{N_V} g_n, \qquad (1)$$

where g_n is defined as

$$g_n = \frac{r^2\left(x_n, m_{I(n)}\right)}{\min_{I(m) \neq I(n)} r^2\left(x_n, m_{I(m)}\right)}, \qquad (2)$$

where N_V is the number of validation objects and I(n) is the index of the true class of the nth validation object x_n. In this definition, the numerator is the squared Mahalanobis distance between object x_n (of class index I(n)) and the sample mean m_{I(n)} of its true class. The denominator in Eq. (2) is the squared Mahalanobis distance between object x_n and the centre of the closest wrong class.

2.3.2.4. Biomarker extraction. Biomarker extraction is the interpretation of the internal structure of a classifier or feature extraction model to identify the wavenumber variables that are most important according to the model. Trevisan J. et al. [65,66] identified four main study goals: Pattern Finding, Biomarker Identification, Imaging, and Diagnosis.
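Returning briefly to the variable-selection cost function of Eqs. (1) and (2), the average risk G can be computed directly from its definition. The sketch below is an assumption-laden illustration: the function name `average_risk`, the dictionary of class means and the use of a single inverse covariance matrix shared by all classes (a pooled covariance is assumed here) are not specified in the text, and the two-variable data are invented.

```python
import numpy as np

def average_risk(X_val, y_val, class_means, cov_inv):
    """Average risk G of LDA misclassification over a validation set, Eqs. (1)-(2)."""
    g = []
    for x, true_class in zip(X_val, y_val):
        # Squared Mahalanobis distance from x to every class mean.
        d2 = {c: float((x - m) @ cov_inv @ (x - m)) for c, m in class_means.items()}
        # Denominator of Eq. (2): distance to the closest wrong class.
        closest_wrong = min(v for c, v in d2.items() if c != true_class)
        g.append(d2[true_class] / closest_wrong)
    return sum(g) / len(g)  # Eq. (1)

# Hypothetical example with two selected variables and an identity covariance.
class_means = {0: np.array([0.0, 0.0]), 1: np.array([4.0, 0.0])}
cov_inv = np.eye(2)
X_val = np.array([[1.0, 0.0], [3.0, 0.0]])
y_val = [0, 1]
G = average_risk(X_val, y_val, class_means, cov_inv)
print(G)
```

Each validation object here lies closer to its own class mean than to the other one, so every g_n is below 1; a subset of variables giving a smaller G separates the classes better.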
For biomarker extraction strategies, the wavenumber variables are associated with chemical bonds, which in turn are associated with cellular activity. For FTIR spectroscopy, extracting specific information from prostate tissues demands the highest standards of accuracy and reproducibility of measurement, because the expected spectral differences between healthy and diseased tissues are very small relative to the large background absorbance of the whole sample.

Malins D.C. et al. [5] investigated wavenumber variables reflecting structural differences between the DNA of younger men and the prostate cancer DNA phenotype in older men. Several selected wavenumbers appear to be of particular interest, namely 884, 886, 964, 1020, 1088, 1231, 1369, 1415, 1482, 1531, 1577, 1605, 1647 and 1689 cm−1, representing structural alterations in functional groups (e.g., C=O and NH2 of guanine and C=N of adenine) and in the ring-stretching vibration of thymine. The authors also found differences between the phosphodiester-deoxyribose structures of the DNA of younger men and the cancer DNA phenotype in older men. However, the total absence of significant differences between these spectra (normal and cancer) means that the corresponding structures are virtually identical. From this study, it is now possible to identify a prostate cancer DNA phenotype, identical to the DNA structure of tumours, in the normal tissues of some older men, as well as in histologically normal tissues adjacent to tumours. This suggests that the presence of this phenotype in apparently normal prostates is a promising early indicator of cancer risk.

Patel I.I. et al. [19] applied FTIR and Raman microspectroscopy as a novel tool to identify risk factors associated with susceptibility to adenocarcinoma in the human prostate in different demographic regions (UK and India).
Discrimination of high-risk (UK) and low-risk (India) prostate tissue using FTIR coupled with PCA-LDA revealed the biomarkers responsible for segregation between the two cohorts: 1624 cm−1 [right-hand shoulder (RHS) of Amide I]; 1667 cm−1 (Amide I); 1200, 1009 and 972 cm−1 (protein phosphorylation); 1586 cm−1 (Amide I/II); 1123 cm−1 (RNA); 1543 cm−1 (Amide II); 1458 cm−1 (proteins); and 1011 cm−1 and 1069 cm−1 (DNA/RNA). This study was important because, even though latent CaP occurs in both cohorts at a similar prevalence, clinically invasive disease arises at a much higher incidence in high-risk regions such as the UK. Furthermore, the study showed that secondary protein structure variations are the main biomolecular markers that differ in prostate tissues from differentially susceptible cohorts, indicating that these biochemical differences may provide vital clues to the aetiology of CaP and its progression.

Gazi E. et al. [67] used Fourier transform infrared (FTIR) microspectroscopy to study prostate cancer cell lines derived from different metastatic sites, as well as tissue from benign prostate and Gleason-graded malignant prostate. Chemometric treatment of the FTIR spectra with a linear discriminant algorithm proved a promising method for classifying benign and malignant tissue and for separating Gleason-graded CaP spectra. In particular, the ratio of the peak areas at 1030 and 1080 cm−1, corresponding to glycogen and phosphate vibrations respectively, suggested a potential method for differentiating benign from malignant cells.

In a similar approach, Harvey T.J. et al. [23] obtained FTIR spectra of fixed prostate cell lines of differing types, as well as of primary epithelial cells from benign prostatic hyperplasia (BPH).
Blind testing of the PCA-LDA classifier achieved very promising prediction values (>94% and >98% for sensitivity and specificity, respectively), indicating that FTIR spectroscopy can discriminate and classify prostate cell lines with a high degree of accuracy. Although the authors did not report the wavenumber variables for this study, they examined the possible influence of different factors on the discrimination and classification of prostate cell lines. Firstly, using different growth media during cell culturing did not influence the chemometric discrimination. Secondly, differences in the nucleus-to-cytoplasm (N/C) ratio were not the main reason for the discrimination and classification of prostate cancer (CaP) cell lines.

3. Applications

In the earliest study analysed, Lasch et al. [44] tested three clustering algorithms on prostate tissue imaging: hierarchical, K-means and Fuzzy C-means clustering. The results show a high degree of correspondence between hierarchical clustering and the K-means and Fuzzy C-means clustering methods. Additionally, the results of hierarchical clustering are always independent of the starting conditions, as no random initialization of the K (K-means clustering) or C (Fuzzy C-means clustering) starting points is required. Compared with the later studies, this initial study shows how procedures and imaging instrumentation have evolved.

Gazi and co-workers [21] studied prostate cancer cell lines derived from different metastatic sites and tissue samples from benign prostatic hyperplasia (BPH) and Gleason-graded malignant prostate tissue. The ratio of the peak areas at 1030 cm−1 and 1080 cm−1, ascribed to glycogen and phosphate vibrations respectively, suggested a potential method for differentiating benign cells from malignant cells.
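The glycogen/phosphate peak-area ratio mentioned above can be illustrated with a simple trapezoidal integration. This is only a sketch: the text gives the band centres (1030 and 1080 cm−1) but not the integration limits, so the ±10 cm−1 window (`half_width`), the function names and the synthetic step-shaped spectrum are assumptions, and no baseline correction is applied.

```python
def band_area(wavenumbers, absorbance, low, high):
    """Trapezoidal area under the spectrum between low and high (cm^-1)."""
    points = sorted((w, a) for w, a in zip(wavenumbers, absorbance) if low <= w <= high)
    return sum(
        (w2 - w1) * (a1 + a2) / 2.0
        for (w1, a1), (w2, a2) in zip(points, points[1:])
    )

def peak_area_ratio(wavenumbers, absorbance, half_width=10.0):
    """Ratio of the band areas centred at 1030 (glycogen) and 1080 cm^-1 (phosphate)."""
    glycogen = band_area(wavenumbers, absorbance, 1030.0 - half_width, 1030.0 + half_width)
    phosphate = band_area(wavenumbers, absorbance, 1080.0 - half_width, 1080.0 + half_width)
    return glycogen / phosphate

# Synthetic spectrum (toy data): absorbance 2.0 across the glycogen window, 1.0 elsewhere.
wavenumbers = [float(w) for w in range(1000, 1101)]
absorbance = [2.0 if 1020 <= w <= 1040 else 1.0 for w in wavenumbers]
ratio = peak_area_ratio(wavenumbers, absorbance)
print(ratio)
```

With real spectra, the integration limits and any baseline subtraction would need to follow the protocol of the study being reproduced.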
The use of this ratio in association with FTIR spectral imaging was said to provide a basis for estimating areas of malignant tissue within defined regions of a specimen. Subsequent investigations by the same group suggested that the extent to which clusters were separated from each other may be associated with the invasive properties of each cell line. They further suggested that the cluster plot could be used to determine whether inorganic ions have a negative or positive effect on invasiveness as a consequence of ion uptake, which could subsequently be confirmed and quantified through ToF-SIMS imaging [19,63].

Bhargava et al. [8] proposed employing tissue microarrays (TMAs) for high-throughput, automated investigation of common challenges in the vibrational spectroscopic analysis of tissue. TMAs consist of multiple tissue samples of uniform dimensions placed on a single substrate. Individual specimens can be acquired from hundreds of different donors, and a TMA may contain multiple samples from the same donor tissue. The following four steps were adopted: (1) selection of a model for spectroscopic image analysis, with the chosen morphology based on functionally distinct cell types and extracellular components whose characteristic physical dimensions are accessible at the system's spatial resolution; (2) data pre-processing to optimize results and computations for the chosen model, for which principal components analysis was chosen; (3) a classification procedure using a linear discriminant function based directly on Bayesian probability (metric Bayesian classification), evaluated by receiver operating characteristics; and (4) post-processing operations to isolate specific cell types or to enhance the display. The work by Bhargava et al.
[8] significantly enhances the rate and quality of spectroscopic analyses of tissue specimens, allowing statistical sampling and further numerical analysis to explore associations between molecular changes and clinico-pathological information.

German et al. [53] employed ATR spectroscopy and synchrotron radiation-based FTIR microspectroscopy to compare the spectral signature of prostate cancer-free glandular epithelial cells of the peripheral zone and transition zone with that of cells in histologically designated prostate cancer regions. Both techniques highlighted similar spectral characteristics, and good clustering was shown for all three cell regions (peripheral, transition and cancerous zones). ATR spectroscopy has the advantage of delivering a biochemical cell signature over a wider surface area. Its disadvantage is that a signature unique to a particular cell type might be lost in such an average spectrum; for this purpose, synchrotron FTIR microspectroscopy can be employed to facilitate single-cell examination.

In a comparison between cell zones from electron microscopy, German et al. [53] showed that with increasing Gleason grade, there was an increase in the differentiation compared to normal tissue. This suggests that significant differences in the biochemistry of prostate epithelial cells could be used as potential biomarkers. In this case, a more intense wavenumber peak that might be associated with increased lipid content may point to increased hormone responsiveness in the peripheral zone compared to the transition zone. Peripheral zone epithelial cells have higher carbohydrate/phosphate ratios and lower RNA/DNA ratios than cancerous cells, whereas those of the transition zone exhibit intermediate levels.
However, the epithelial cells of the transition zone possess a biochemical spectral fingerprint more similar to that of the cancerous zone than do those of the apparently more susceptible peripheral zone.

Continuing their research of 2005, Gazi et al. [22] proposed an FTIR-linear discriminant analysis model to differentiate Gleason grades of prostate tumours; the resulting cluster plot (Fig. 2A) can be considered a prototype diagnostic classifier. The clusters generated by their model were representative of each Gleason-graded disease state. The scatter plot of Gleason score versus FTIR-LDA score in Fig. 2C demonstrates a significant association (p = 0.01) between the scores produced from the malignant lesions of each of the blind-tested biopsies.

Gazi et al. [31] combined FTIR microspectroscopy with histological stains to increase molecular specificity and probe the biochemistry of metastatic CaP cells in bone marrow tissue. The dominant metabolic processes driving the proliferation of the metastatic cells were distinguished in each of the three specimens in this study.
To summarize: in specimen 1, there were significantly high (p ≤ 0.05) carbohydrate (8.23 ± 1.44 cm−1), phosphate (6.13 ± 1.5 cm−1) and lipid hydrocarbon (24.14 ± 5.9 cm−1) signals compared with the organ-confined CaP control, together with vacuolation of the cell cytoplasm; in specimen 2, glycolipid synthesis was indicated by significantly high (p ≤ 0.05) carbohydrate (5.51 ± 0.04 cm−1) and high lipid hydrocarbon (17.91 ± 2.3 cm−1) signals compared with the organ-confined CaP control, together with positive diastase-digested periodic acid Schiff staining in the majority of metastatic CaP cells; and in specimen 3, glycolysis was indicated by significantly high (p ≤ 0.05) carbohydrate (8.86 ± 1.78 cm−1) signals and significantly lower (p ≤ 0.05) lipid hydrocarbon (11.67 ± 0.4 cm−1) signals than the organ-confined CaP control, together with negative diastase-digested periodic acid Schiff staining in the majority of metastatic CaP cells.

The work by Gazi et al. [31] provides structural information as well as relative quantification of a wide range of biomolecular domains. In addition, it makes economical use of a limited source of CaP bone metastasis biopsies, which are not usually performed.

In research by Harvey et al. [11], PCA separated the data into four distinct clusters representing the four different cell lines. The first component separates only benign prostatic hyperplasia and non-malignant normal prostate epithelial cells, whereas the second component separates lymph node metastasis, bone marrow metastasis, and benign prostatic hyperplasia/non-malignant normal prostate epithelial cells. The underlying reasons for the differentiation of prostate cancer cell lines are as yet unclear, but it may be due to differences in cell morphology or biochemistry, or to a combination of both. It has also been suggested that differences in the nucleoplasmic-to-cytoplasmic (N/C) ratio between cell lines might play a role in the differentiation [9,60].
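A PCA scores analysis of the kind described for the cell-line spectra can be sketched with a singular value decomposition. The example is purely illustrative: it uses three synthetic groups rather than the four cell lines of Harvey et al. [11], and the band positions, noise level and names (`pca_scores`, `band`) are assumptions made here; real spectra would first be pre-processed as summarized in Table 6.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Mean-centre X and return PCA scores, loadings and explained-variance ratios."""
    centred = X - X.mean(axis=0)
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    loadings = vt[:n_components]
    explained = (s ** 2 / np.sum(s ** 2))[:n_components]
    return scores, loadings, explained

# Synthetic data (toy example): three groups that differ by one extra band each.
rng = np.random.default_rng(1)
axis = np.linspace(900.0, 1800.0, 120)

def band(centre, width=25.0):
    """Gaussian band on the wavenumber axis."""
    return np.exp(-0.5 * ((axis - centre) / width) ** 2)

groups = np.repeat([0, 1, 2], 10)
templates = np.array([
    band(1650.0),
    band(1650.0) + band(1080.0),
    band(1650.0) + band(1240.0),
])
X = templates[groups] + 0.01 * rng.normal(size=(30, axis.size))

scores, loadings, explained = pca_scores(X)
for g in range(3):
    print(g, scores[groups == g].mean(axis=0))  # group centroids in the PC1/PC2 plane
```

Plotting the two score columns against each other gives the familiar scores plot, while the rows of `loadings` indicate which wavenumbers drive each component.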
In a study by Baker et al. [25], FTIR spectroscopy was shown for the first time to correlate with the local bio-potential of prostate cancer. The study used the clinical stage from the TNM (tumour/node/metastases) classification, which distinguishes T1 and T2 tumours (confined to the prostate), T3 tumours (breaching the prostatic capsule or invading the seminal vesicle) and T4 tumours (extending beyond the prostate and seminal vesicle to invade local pelvic structures). FTIR PCA-LDA discrimination showed a valid biochemical difference between the T1 and T2 (less aggressive) and the T3 and T4 (more aggressive) groups of Gleason grade (Fig. 3). The spectral loadings responsible for this important discrimination contain major peaks that could be used as biomarkers of local aggressiveness: 1558 cm−1, attributable to amide II (less locally aggressive), and 1541 cm−1, attributable to amide II of α-helical structures (more locally aggressive) (Fig. 3). The results also show for the first time that a two-band criterion-based system identifies characteristics that differentiate tumours that are clinically confined to the prostate from those that are clinically invasive.

One of the limitations of the Baker et al. [62] study and the study by Gazi et al. [22] is that the Gleason grading system was used as the reference standard for developing the diagnostic algorithms. This methodology is inevitably flawed, as it tends to incorporate problems inherent in the Gleason system into the new system. However, it was felt to be important to investigate this novel area in this manner to establish proof of principle. This goal has now been achieved, with the data presented clearly showing that FTIR-based methods can identify and discriminate between different types of prostate cancer [62].

Fig. 2. (A) Combined-groups plot of linear discriminant function weights for Fourier transform infrared (FTIR) spectra taken from prostate epithelial cells of Gleason-graded malignant states in primary prostate tissue (the FTIR-linear discriminant analysis [LDA] tumour-grading model); (B) Mean IR spectra recorded from malignant epithelial cells of different Gleason-graded (GG) 2–5 tissue. Boxed area represents the IR diagnostic spectral patterns in band region 1480 to 1000 cm−1 used to train the FTIR-LDA grading model; (C) Correlation between the Fourier transform infrared-linear discriminant analysis score and the Gleason score for each of the biopsies tested. The areas of the squares are proportional to the number of points at that location (in brackets) (Gazi et al. [22]).

Fig. 3. Discriminant function plot for the vector normalised Gleason score model based upon training set data (empty shapes) and blind set data (full shapes) with a 95% confidence limit where the green circle = GS < 7, red square = GS = 7 and cyan diamond = GS > 7 (Baker et al. [25]).

In a study complementary to Harvey et al. [11], Harvey et al. [23] investigated the influence of differences in growth media during cell culturing, of the nucleus-to-cytoplasm (N/C) ratio and of biochemical differences on prostate cancer classification by FTIR. PCA applied to the first and second derivatives showed the greatest separation between malignant lymph node and benign prostatic hyperplasia. For Harvey et al. [23], of further interest is the fact that the cell lines separated despite being grown in the same media.
Therefore, this strongly suggests that using different growth media does not contribute to the discrimination. The intrinsic biochemistry of the cell is therefore not determined by the growth medium to an extent that can be detected by FTIR spectroscopy. Harvey et al. [23] also successfully applied the PCA-LDA algorithm to cell line classification. The high sensitivity and specificity values indicate that this PCA-LDA model could classify unknown prostate cancer cells with a high degree of accuracy, and therefore has promising diagnostic potential. Regarding the N/C ratio, the results show no clustering of spectra based on their N/C ratio, and the separation between cell line types is much greater than the separation between spectra of different N/C ratios from the same cell line.

In research similar to that of Gazi et al. [22] and Baker et al. [62], Baker et al. [25] also successfully used PCA-LDA to discriminate prostate cancer tissue according to a 3-band Gleason score (GS) criterion, which divided specimens into groups corresponding to GS < 7 (less aggressive), GS = 7 (intermediate aggressiveness) and GS > 7 (most likely to progress), and also used the clinical stage from the TNM (tumour/node/metastases) classification system. The success of the research by Baker et al. [25] is that FTIR combined with PCA-LDA discriminated prostate cancer tissue in both systems using an independent observer criterion.

In a subsequent study, Baker et al. [39] used IR microspectroscopy, synchrotron IR microspectroscopy and broad-beam microspectroscopy to investigate the RWPE human prostate epithelial cell line family. The RWPE cell lines, which share a common lineage, represent a unique and relevant model that mimics progression stages from localized malignancy to invasive cancer, and can be used to study carcinogenesis, progression, intervention and chemoprevention [39].
The genetic algorithm (GA) was used to find the optimum pre-processing technique from a range of pre-processing techniques combined with the PCA-LDA algorithm. The Baker et al. [39] study takes this further by suggesting that biochemical changes induced by different transformation methods are primarily responsible for the discrimination of the RWPE family of cell lines, and that it is not possible (as was the aim of the research) to model biochemical changes associated with invasiveness in prostate cancer by FTIR spectroscopy using these cell lines. The contributions of this research were: (1) it determined that the classification accuracy is better in the IR microscopy-based study than in the synchrotron-based study, primarily due to the significantly higher variance in single-cell data and the smaller datasets available; and (2) it demonstrated the potential of FTIR coupled with multivariate analysis for pathological screening applications, although further studies involving primary cells and tissue are clearly required.

Bassan et al. [50] applied a Resonant Mie scattering-extended multiplicative signal correction (RMieS-EMSC) algorithm, modified to correct the dispersion in IR spectra from synchrotron FTIR microspectroscopy of prostate cancer cell lines. This dispersion is predominantly due to resonant Mie scattering (RMieS), caused by the real part of the refractive index changing near an absorption band. This causes some degree of index matching, meaning that the efficiency with which photons are scattered at this wavenumber is reduced to almost zero, which appears visually as a sharp decrease in absorbance [28]. The new algorithm subtracts a curve that is the sum of a constant offset value, a sloping baseline and the RMieS curve, which is described by the summation term [50]. To test the new algorithm, Bassan et al. [50]
used 50 simulated spectra and compared the traditional Mie scattering-EMSC algorithm with the modified RMieS-EMSC. The original position of the amide I peak was set to 1655 ± 1 cm−1, and a band position close to this correct wavenumber value is obtained only when the RMieS-EMSC correction is performed. Therefore, although Mie scattering-EMSC may successfully enable data to be separated into groups (which is often the aim of the experiment), no biological interpretation of the data, particularly with respect to the Amide I band and the associated protein structure, can be made unless the RMieS-EMSC correction is applied [50].

In a complex study, Pezzei et al. [32] optimized tissue thickness and measurement mode and tested clustering to differentiate normal and cancerous prostate tissue. Tissue thickness was optimized, as previously mentioned, with no major difference found between the spectra of the 8, 12 or 16 μm thick tissue sections. For measurement mode optimization, the results in reflectance mode were unusable in both the NIR and the MIR wavelength ranges, because only noise was recorded. In transmission mode, the MIR mode delivered better results than the NIR mode, so all subsequent tissues were measured in MIR transmission mode [32]. The clustering analysis is presented in Fig. 4.

For Pezzei et al. [32], the cancer tissue is highly metabolic, possibly owing to a higher proliferation rate compared with benign or stromal tissue. In contrast to the cancer tissue, the stroma seems to have a very high production of proteins, as previously mentioned. The comparison between HCA, K-means and FCM cluster analysis gave the same result as in Lasch et al. [44], showing the major advantage of hierarchical clustering over the K-means and Fuzzy C-means clustering methods.

Patel et al.
[22] investigated the segregation of human prostate tissues classified as high-risk (UK) versus low-risk (India) for adenocarcinoma, using FTIR or Raman microspectroscopy coupled with principal component analysis-linear discriminant analysis (PCA-LDA). In this study, stroma and glandular tissue were each differentiated into high-risk and low-risk, and the respective biomarkers were indicated.

In the case of ATR-FTIR spectroscopy, Patel et al. [22] found significant differences (p < 0.0001) between tissue classified as high-risk (UK) and low-risk (India) using the PCA-LDA algorithm. The biomarkers responsible for the extremely significant inter-class variance and segregation between the high-risk (UK) and low-risk (India) cohorts are listed in Table 7. In discriminating the tissue types (stroma × glandular epithelium), the authors indicated that the stromal environment in some UK tissue samples is similar to that of the India cohort, based on the discriminating biochemical wavenumbers in Table 7.

Similarly, Bassan et al. [27] tested the correction power of the Resonant Mie Scattering-Extended Multiplicative Signal Correction (RMieS-EMSC) algorithm. In this complementary work, the authors (I) analysed a simulated data set to see how many iterations of the RMieS-EMSC algorithm are required to recover the correct classification; (II) corrected a spectrum with three different non-ideal reference spectra to see whether the three corrected spectra converge to a unique solution; and (III) investigated the influence of the number of iterations on the classification of images. This work follows on from the Bassan et al. [50] approach.

Schematically, Bassan et al. [27] (1) simulated 100 spectra in the region generally considered to be of most diagnostic interest (Fig. 5A); (2) plotted the corresponding PCA scores for the simulated data (Fig. 5B); (3) artificially distorted the spectra by RMieS scattering (Fig. 5C); (4) plotted the corresponding PCA scores of the scattered spectra (Fig. 5D), which show that separation and classification cannot be achieved with these data; and (5) treated the scattered spectra with the RMieS-EMSC correction algorithm, using the protein (Matrigel) reference spectrum as the first guess/reference spectrum, and then classified them by hierarchical cluster analysis (HCA), using Euclidean distances and linkage via Ward's algorithm, and by artificial neural networks (ANN1 and ANN2) (Fig. 5E-F). Fig. 6G shows the total absorbance FTIR image of prostate tissue, and Fig. 5H shows that a significant improvement in classification is obtained after 10 iterations. The colour scheme is a heat map, such that white indicates high absorbance; the green pixels are assigned to epithelial cells; the red pixels represent stroma (low level of staining); the blue pixels are regions that could not be assigned to epithelium or stroma; and the black pixels represent areas where there is no tissue.

Finally, Bassan et al. [27] found very little difference in the classification images with the choice of reference spectrum. However, the reference spectrum used should be kept constant throughout any pre-processing and analysis. These results indicate that the iterative RMieS-EMSC algorithm can improve the classification accuracy of infrared spectra and images, irrespective of the reference spectrum used, especially for applications relating to clustering of simulated and real data.

Bassan et al. [29] proposed measuring the entire sample using rapid FTIR chemical imaging at high spatial resolution, showing a 66 million pixel chemical image of a whole prostate cross section of size ca. 4 × 5 cm², where each pixel covers 5.5 × 5.5 μm² of tissue. The authors show how an entire organ such as the prostate can be chemically imaged in a clinical time frame of hours rather than days (14 hours, to be exact).

Bassan et al.
[30] continued to research ways of optimizing the application of FTIR to the classification of prostate tissue and cancer diagnosis. The authors determined that low-e microscope slides are not a suitable substrate for samples whose thickness is undetermined or is less than the wavelength of the infrared light (less than 10 μm of tissue) for MIR-ATR spectroscopy. This is a real problem for cells, whose thickness may be only 1 μm. There is a similar problem for biological fluids that have been dried onto the surface, since their thickness is uncertain [30,68], as mentioned previously.

Likewise, Bassan et al. [30] also showed that the measured absorbance spectrum of a chemically homogeneous thin film is severely distorted in transflection mode compared to transmission mode, and that a non-absorbing sample can give rise to oscillations in the spectral baseline similar in profile to those of Mie scattering. In general, data from transflection-mode FTIR using low-e slides should be treated with extreme caution to ensure that observed biochemical differences do not arise from differences in sample thickness, which will be difficult without repeating the experiment in transmission mode. Bassan et al. [28] suggest using the first or second derivative as a way to negate the problem, and a thickness of 4 μm as optimal, but it is clear that the method should still be used with caution (Fig. 6).

Fig. 4. (A) Corresponding frozen section of the measured prostate cancer slide (Gleason Score 5, HE stained, detail) with marked regions (1 = cancer, 2 = stroma, 3 = benign glands); (B) Immunohistochemical validation slide (basal cell marker p63, negative in cancer glands, positive in benign glands); (C) Measured prostate cancer slide (HE staining after IR measurement, Gleason Score 5, detail) with marked regions (1 = cancer, 2 = stroma, 3 = benign glands); (D) Hierarchical Cluster Analysis spectroscopic image of a human prostate cancer section with 5 clusters; (E) K-means clustering spectroscopic image of a human prostate cancer section with 5 clusters; (F) Fuzzy C-means clustering spectroscopic image of a human prostate cancer section with 5 clusters (Pezzei et al. [32]).

Table 7
Biomarkers identified for segregation of human prostate tissues classified as high-risk (UK) versus low-risk (India) for adenocarcinoma using FTIR (Patel et al. [22])

High-risk (UK) vs. low-risk (India) cohorts
Biomarker | Wavenumber (cm−1)
Amide I/II trough | 1586
RNA | 1123
Amide II | 1543
Proteins | 1458
DNA/RNA | 1011 and 1069

India vs. UK glandular epithelium
Biomarker | Wavenumber (cm−1)
Amide I | 1666
RHS Amide I | 1622
Amide II | 1535
RHS Amide II | 1524
DNA | 1230

India vs. UK stroma
Biomarker | Wavenumber (cm−1)
Amide I | 1664
RHS Amide I | 1620
Lipid | 1716
Amide I/II | 1587
RHS Amide II | 1529

Hughes et al. [31] continued research on the optimization of FTIR applied to disease diagnosis, along with Bamberry et al. [69] and Bassan et al. [25,27,28,30,50], to determine which is the better solvent and the optimum time for prostate tissue dewaxing, through a comparison between xylene and hexane. As previously mentioned, Hughes et al. [31] found no major difference in dewaxing efficacy between xylene and hexane. However, hexane is more flammable, and the dewaxing time with xylene is shorter, so the authors suggest that tissue should be dewaxed for a minimum of 5–10 minutes with xylene. This work is important because the spectral signal of paraffin includes high absorptions in a number of mid-range wavenumber regions. If residual paraffin had been present, large differences in spectral intensities would have obscured tissue classification.

4.
Perspectives

This review has highlighted major trends and drivers within cellular and tissue spectroscopy, and in order to look towards the future, the following areas will need to be taken into account. Beyond IR studies, other techniques have been applied in the classification and diagnosis of prostate tissue, cells and biofluids, including Raman spectroscopy and NIRS [19,23,49,59,70].

In the area of instrumentation, the evolution of more sensitive modern mass spectrometric analytical techniques, such as in-depth proteomic analysis of complex biological samples or the identification of a molecular modification that is specific to a disease state, has meant that mass spectrometry (MS) is playing a key role in biomarker discovery and evaluation studies.

In the area of classification analysis, supervised techniques coupled with variable selection methods, such as genetic algorithm–linear discriminant analysis (GA–LDA) and successive projection algorithm–linear discriminant analysis (SPA–LDA), are interesting approaches that have been successfully applied, although not in prostate cancer studies. Such models could be applied to this condition, as a major objective in treating prostate cancer is predicting which disease will remain organ-confined and which will be destined to spread, even at the time of diagnosis.

Fig. 5. (a) Four groups of simulated infrared spectra consisting of 100 spectra in each group. (b) PCA scores plot of the data shown in (a). (c) The spectra from (a) incorporating the influence of scattering. (d) The PCA scores plot of the scattered spectra in (c). (e) Classification accuracy as a function of the number of iterations of the RMieS-EMSC algorithm for three different classification models, HCA objective clustering (blue curve), ANN-1, the pure chemistry model (green curve) and ANN-2, the current iteration model (red curve). (f) The PCA scores plot for the data in (c) after 25 iterations of the RMieS-EMSC algorithm.

The application of vibrational spectroscopy (FTIR) is very much at the forefront of current research focusing on the development of practical diagnostic and/or prognostic tools. Some advantages towards clinical translation for prostate cancer diagnosis are: i) FTIR microspectroscopy permits the rapid collection of spectra obtained from small, well-defined spatial regions (of around 7 × 7 μm²), which can be scanned to provide a ‘biochemical map’ of a tissue specimen; ii) IR irradiation is a low-energy process, and the imaging of any sample can be replicated numerous times to reduce intra/inter-specimen variability; iii) FTIR equipment is relatively inexpensive, so this technique could have a role alongside conventional methods for grading prostate cancer tissue. On the other hand, because of the large number of data points acquired in FTIR spectra, together with the fact that the biochemical changes being monitored can be quite subtle, chemometric methods are helpful in identifying biochemical variance.

5. Conclusion

This retrospective study set out to explore IR microspectroscopy research related to the diagnosis and classification of prostate cancer, and to discuss the applicability of, and advances in, instrumentation, acquisition and quality of spectra, and imaging. The pre-processing and feature extraction methods utilized in these studies were also included; they demonstrate the ability to differentiate between tumours that are clinically confined to the prostate and normal prostate tissue, and can also segregate stages of tumours. The biomarkers that allowed this differentiation, as pointed out in the research, were amide I, amide II, and DNA/RNA.
Clearly, with the evolution of these techniques, most of the attention was focused on optimizing procedures (for example, the choice of sample thickness and substrates, and sample treatment such as dewaxing) and on the diversification of applications (tissue, cells, biofluids). This improvement leads to a shorter time of analysis and diagnosis, which is essential for cancer management. It is hoped that this article serves not only as a starting point for beginners in the field, but also as a source of reference for more experienced spectroscopists.

Acknowledgments

L.F.S. Siqueira would like to acknowledge the financial support from the PPGQ/UFRN/CAPES. K.M.G. Lima acknowledges the CNPq/CAPES project (Grant 070/2012 and 305962/2014-4) and FAPERN (PPP 005/2012) for financial support. We are grateful to Fabio Godoy (Bruker Optics Ltd) for excellent technical assistance in this study with the Bruker Lumos FTIR spectrometer.

References

[1] M. Diem, M. Romeo, S. Boydston-White, M. Miljkovic, C. Matthaus, A decade of vibrational micro-spectroscopy of human cells and tissue (1994–2004), Analyst 129 (2004) 880–885. [2] D. Helm, H. Labischinski, G. Schallehn, D. Naumann, Classification and identification of bacteria by Fourier-transform infrared spectroscopy, J. Gen. Microbiol. 137 (1991) 69–79. [3] W. Yang, X. Xiao, J. Tan, Q. Cai, In situ evaluation of breast cancer cell growth with 3D ATR-FTIR spectroscopy, Vib. Spectrosc. 49 (2009) 64–67. [4] D.C. Malins, N.L. Polissar, S.J. Gunselman, Models of DNA structure achieve almost perfect discrimination between normal prostate, benign prostatic hyperplasia (BPH), and adenocarcinoma and have a high potential for predicting BPH and prostate cancer, Proc. Natl. Acad. Sci. U.S.A. 94 (1997) 259–264. [5] D.C. Malins, N.K. Gilman, V.M. Green, T.M. Wheeler, E.A. Barker, K.M. Anderson, A cancer DNA phenotype in healthy prostates, conserved in tumors and adjacent normal cells, implies a relationship to carcinogenesis, Proc. Natl. Acad. Sci.
U.S.A. 102 (2005) 19093–19096. [6] D.C. Fernandez, R. Bhargava, S.M. Hewitt, I.W. Levin, Infrared spectroscopic imaging for histopathologic recognition, Nat. Biotechnol. 23 (2005) 469–474. [7] I.W. Levin, R. Bhargava, Fourier transform infrared vibrational spectroscopic imaging: integrating microscopy and molecular recognition, Annu. Rev. Phys. Chem. 56 (2005) 429–474. [8] R. Bhargava, D.C. Fernandez, S.M. Hewitt, I.W. Levin, High throughput assessment of cells and tissues: bayesian classification of spectral metrics from infrared vibrational spectroscopic imaging data, Biochim. Biophys. Acta 1758 (2006) 830–845. [9] C. Krafft, V. Sergo, Biomedical applications of Raman and infrared spectroscopy to diagnose tissues, Spectroscopy 20 (2006) 195–218. [10] M.A. Mackanos, C.H. Contag, FTIR microspectroscopy for improved prostate cancer diagnosis, Trends Biotechnol. 27 (2009) 661–663. [11] T.J. Harvey, A. Henderson, E. Gazi, N.W. Clarke, M. Brown, E.C. Faria, et al., Discrimination of prostate cancer cells by reflection mode FTIR photoacoustic spectroscopy, Analyst 132 (2007) 292–295. [12] M. Rouprêt, V. Hupertan, D.R. Yates, J.W.F. Catto, I. Rehman, M. Meuth, et al., Molecular detection of localized prostate cancer using quantitative methylation- specific PCR on urinary cells obtained following prostate massage, Clin. Cancer Res. 13 (2007) 1720–1725. [13] M.J. Walsh, M.N. Singh, H.F. Stringfellow, H.M. Pollock, A. Hammiche, O. Grude, et al., FTIR microspectroscopy coupled with two-class discrimination segregates markers responsible for inter- and intra-category variance in exfoliative cervical cytology, Biomark. Insights 3 (2008) 179–189. [14] D.I. Ellis, R. Goodacre, Metabolic fingerprinting in disease diagnosis: biomedical applications of infrared and Raman spectroscopy, Analyst 131 (2006) 875– 885. [15] W.B. Dunn, D.I. Ellis, Metabolomics: current analytical platforms and methodologies, Trends Anal. Chem. 24 (2005) 285–294. [16] D.I. Ellis, W.B. Dunn, J.L. 
Griffin, J.W. Allwood, R. Goodacre, Metabolic fingerprinting as a diagnostic tool, Pharmacogenomics 8 (2007) 1243–1266. [17] B. Stuart, Biological applications, in: Infrared Spectrosc. Fundam. Appl, Wiley, Chichester, England, 2004, pp. 137–163. [18] M.J. Walsh, M.J. German, M. Singh, H.M. Pollock, A. Hammiche, M. Kyrgiou, et al., IR microspectroscopy: potential applications in cervical cancer screening, Cancer Lett. 246 (2007) 1–11. [19] I.I. Patel, J. Trevisan, P.B. Singh, C.M. Nicholson, R.K.G. Krishnan, S.S. Matanhelia, et al., Segregation of human prostate tissues classified high-risk (UK) versus low-risk (India) for adenocarcinoma using Fourier-transform infrared or Raman microspectroscopy coupled with discriminant analysis, Anal. Bioanal. Chem. 401 (2011) 969–982. [20] P. Lasch, M. Diem, D. Naumann, FTIR microspectroscopic imaging of prostate tissue sections, Proc. SPIE. 5321 (2004) 1–9. [21] E. Gazi, J. Dwyer, N. Lockyer, P. Gardner, J. Vickerman, J. Miyan, et al., The combined application of FTIR microspectroscopy and ToF-SIMS imaging in the study of prostate cancer, Faraday Discuss. 126 (2004) 41–59. [22] E. Gazi, M. Baker, J. Dwyer, N.P. Lockyer, P. Gardner, J.H. Shanks, et al., A correlation of FTIR spectra derived from prostate cancer biopsies with Gleason grade and tumour stage, Eur. Urol. 50 (2006) 750–761.

Fig. 6. False colour classification images of the 5-class histological model for prostate tissue. Classifier trained on a database of transflection simulated spectra of 4 μm thickness (a–c), using a first (d–f) and second derivative (g–i) (Bassan et al. [32]).

[23] T.J. Harvey, E. Gazi, A. Henderson, R.D. Snook, N.W. Clarke, M. Brown, et al., Factors influencing the discrimination and classification of prostate cancer cell lines by FTIR microspectroscopy, Analyst 134 (2009) 1083–1091. [24] M.J. Baker, E. Gazi, M.D. Brown, J.H. Shanks, P. Gardner, N.W.
Clarke, FTIR-based spectroscopic analysis in the identification of clinically aggressive prostate cancer, Br. J. Cancer 99 (2008) 1859–1866. [25] M.J. Baker, E. Gazi, M.D. Brown, J.H. Shanks, N.W. Clarke, P. Gardner, Investigating FTIR based histopathology for the diagnosis of prostate cancer, J. Biophotonics. 2 (2009) 104–113. [26] M.J. Baker, J. Trevisan, P. Bassan, R. Bhargava, H.J. Butler, K.M. Dorling, et al., Using Fourier transform IR spectroscopy to analyze biological materials, Nat. Protoc. 9 (2014) 1771–1791. [27] P. Bassan, A. Sachdeva, A. Kohler, C. Hughes, A. Henderson, J. Boyle, et al., FTIR microscopy of biological cells and tissue: data analysis using resonant Mie scattering (RMieS) EMSC algorithm, Analyst 137 (2012) 1370–1377. [28] P. Bassan, H.J. Byrne, F. Bonnier, J. Lee, P. Dumas, P. Gardner, Resonant Mie scattering in infrared spectroscopy of biological materials – understanding the “dispersion artefact”, Analyst 134 (2009) 1586–1593. [29] P. Bassan, A. Sachdeva, J.H. Shanks, M.D. Brown, N.W. Clarke, P. Gardner, Whole organ cross-section chemical imaging using label-free mega-mosaic FTIR microscopy, Analyst 138 (2013) 7066–7069. [30] P. Bassan, J. Lee, A. Sachdeva, J. Pissardini, K.M. Dorling, J.S. Fletcher, et al., The inherent problem of transflection-mode infrared spectroscopic microscopy and the ramifications for biomedical single point and imaging applications, Analyst (2013) 144–157. [31] C. Hughes, L. Gaunt, M. Brown, N.W. Clarke, P. Gardner, Assessment of paraffin removal from prostate FFPE sections using transmission mode FTIR-FPA imaging, Anal. Methods 6 (2014) 1028–1035. [32] P. Bassan, A. Sachdeva, J. Lee, P. Gardner, Substrate contributions in micro-ATR of thin samples: implications for analysis of cells, tissue and biological fluids, Analyst 138 (2013) 4139–4146. [33] C. Hughes, L. Gaunt, M. Brown, N.W. Clarke, P. Gardner, Assessment of paraffin removal from prostate FFPE sections using transmission mode FTIR-FPA imaging, Anal. Methods 6 (2014) 1028. [34] E. Gazi, J. Dwyer, N.P. Lockyer, P. Gardner, J.H. Shanks, J. Roulson, et al., Biomolecular profiling of metastatic prostate cancer cells in bone marrow tissue using FTIR microspectroscopy: a pilot study, Anal. Bioanal. Chem. 387 (2007) 1621–1631. [35] J. Ollesch, S.L. Drees, H.M. Heise, T. Behrens, T. Brüning, K. Gerwert, FTIR spectroscopy of biofluids revisited: an automated approach to spectral biomarker identification, Analyst 138 (2013) 4092–4102. [36] K. Wehbe, J. Filik, M.D. Frogley, G. Cinque, The effect of optical substrates on micro-FTIR analysis of single mammalian cells, Anal. Bioanal. Chem. 405 (2013) 1311–1324. [37] J. Nallala, M.-D. Diebold, C. Gobinet, O. Bouché, G.D. Sockalingum, O. Piot, et al., Infrared spectral histopathology for cancer diagnosis: a novel approach for automated pattern recognition of colon adenocarcinoma, Analyst 139 (2014) 4005–4015. [38] S.G. Kazarian, K.L.A. Chan, Applications of ATR-FTIR spectroscopic imaging to biomedical samples, Biochim. Biophys. Acta 1758 (2006) 858–867. [39] S.E. Glassford, B. Byrne, S.G. Kazarian, Recent applications of ATR FTIR spectroscopy and imaging to proteins, Biochim. Biophys. Acta 1834 (2013) 2849–2858. [40] K. Chan, S. Kazarian, New opportunities in micro- and macro-attenuated total reflection infrared spectroscopic imaging: spatial resolution and sampling versatility, Appl. Spectrosc. 57 (2003) 381–389. [41] S.E. Holton, M.J. Walsh, R. Bhargava, Subcellular localization of early biochemical transformations in cancer-activated fibroblasts using infrared spectroscopic imaging, Analyst 136 (2011) 2953–2958. [42] P. Lasch, W. Haensch, D. Naumann, M. Diem, Imaging of colorectal adenocarcinoma using FT-IR microspectroscopy and cluster analysis, Biochim. Biophys. Acta 1688 (2004) 176–186. [43] J. Schubert, A. Mazur, B. Bird, M. Miljkovic, M. Diem, Single point vs. mapping approach for spectral cytopathology (SPC), J. Biophotonics. 3 (2010) 588–596. [44] E.A. Carter, K.K.
Tam, R.S. Armstrong, P.A. Lay, Vibrational spectroscopic mapping and imaging of tissues and cells, Biophys. Rev. 1 (2009) 95–103. [45] P. Lasch, M. Boese, A. Pacifico, M. Diem, FT-IR spectroscopic investigations of single cells on the subcellular level, Vib. Spectrosc. 28 (2002) 147–157. [46] B.R. Wood, K.R. Bambery, C.J. Evans, M.A. Quinn, D. McNaughton, A three- dimensional multivariate image processing technique for the analysis of FTIR spectroscopic images of multiple tissue sections, BMC Med. Imaging 6 (2006) 1–9. [47] D. Naumann, FT-IR spectroscopy of microorganisms at the Robert Koch Institute: experiences gained during a successful project, SPIE BiOS, 2008 1–12. [48] M.J. German, A. Hammiche, N. Ragavan, M.J. Tobin, L.J. Cooper, S.S. Matanhelia, et al., Infrared spectroscopy with multivariate analysis potentially facilitates the segregation of different types of prostate cell, Biophys. J. 90 (2006) 3783–3795. [49] C. Pezzei, J.D. Pallua, G. Schaefer, C. Seifarth, V. Huck-Pezzei, L.K. Bittner, et al., Characterization of normal and malignant prostate tissue by Fourier transform infrared microspectroscopy, Mol. Biosyst. 6 (2010) 2287–2295. [50] P. Lasch, D. Naumann, Spatial resolution in infrared microspectroscopic imaging of tissues, Biochim. Biophys. Acta 1758 (2006) 814–829. [51] J.T. Kwak, R. Reddy, S. Sinha, R. Bhargava, Analysis of variance in spectroscopic imaging data from human tissues, Anal. Chem. 84 (2012) 1063–1069. [52] P. Lasch, Spectral pre-processing for biomedical vibrational spectroscopy and microspectroscopic imaging, Chemom. Intell. Lab. Syst. 117 (2012) 100–114. [53] P. Lasch, W. Petrich, Data acquisition and analysis in biomedical vibrational spectroscopy, Appl. Sync. Infrared Microspec. (2011) 192–225. [54] P. Bassan, M.J. Weida, J. Rowlette, P. Gardner, Large scale infrared imaging of tissue micro arrays (TMAs) using a tunable Quantum Cascade Laser (QCL) based microscope, Analyst 139 (2014) 3856–3859. [55] M.J. Baker, C. Clarke, D. 
Démoulin, J.M. Nicholson, F.M. Lyng, H.J. Byrne, et al., An investigation of the RWPE prostate derived family of cell lines using FTIR spectroscopy, Analyst 135 (2010) 887–894. [56] M. Romeo, B. Mohlenhoff, M. Jennings, M. Diem, Infrared micro-spectroscopic studies of epithelial cells, Biochim. Biophys. Acta 1758 (2006) 915–922. [57] P. Bassan, A. Kohler, H. Martens, J. Lee, H.J. Byrne, P. Dumas, et al., Resonant Mie scattering (RMieS) correction of infrared spectra from highly scattering biological samples, Analyst 135 (2010) 268–277. [58] Z. Movasaghi, S. Rehman, I. ur Rehman, Fourier Transform Infrared (FTIR) spectroscopy of biological tissues, Appl. Spectrosc. Rev. 43 (2008) 134– 179. [59] M.J. Baker, M.D. Brown, E. Gazi, N.W. Clarke, J.C. Vickerman, N.P. Lockyer, Discrimination of prostate cancer cells and non-malignant cells using secondary ion mass spectrometry, Analyst 133 (2008) 175–179. [60] E. Gazi, N.P. Lockyer, J.C. Vickerman, P. Gardner, J. Dwyer, C.A. Hart, et al., Imaging ToF-SIMS and synchrotron-based FT-IR microspectroscopic studies of prostate cancer cell lines, Appl. Surf. Sci. 231–232 (2004) 452–456. [61] I.I. Patel, F.L. Martin, Discrimination of zone-specific spectral signatures in normal human prostate using Raman spectroscopy, Analyst 135 (2010) 3060–3069. [62] C. Hughes, J. Iqbal-Wahid, M. Brown, J.H. Shanks, A. Eustace, H. Denley, et al., FTIR microspectroscopy of selected rare diverse sub-variants of carcinoma of the urinary bladder, J. Biophotonics. 6 (2013) 73–87. [63] N.C. Purandare, I.I. Patel, K.M.G. Lima, J. Trevisan, M. Ma’Ayeh, A. McHugh, et al., Infrared spectroscopy with multivariate analysis segregates low-grade cervical cytology based on likelihood to regress, remain static or progress, Anal. Methods 6 (2014) 4576–4584. [64] K.M.G. Lima, K. Gajjar, G. Valasoulis, M. Nasioutziki, M. Kyrgiou, P. 
Karakitsos, et al., Classification of cervical cytology for human papilloma virus (HPV) infection using biospectroscopy and variable selection, Anal. Methods 6 (2014) 9643–9652. [65] J. Trevisan, J. Park, P.P. Angelov, A.A. Ahmadzai, K. Gajjar, A.D. Scott, et al., Measuring similarity and improving stability in biomarker identification methods applied to Fourier-transform infrared (FTIR) spectroscopy, J. Biophotonics. 7 (2014) 254–265. [66] J. Trevisan, P. Angelov, P.L. Carmichael, A. Scott, F. Martin, Extracting biological information with computational analysis of Fourier transform infrared (FTIR) biospectroscopy datasets: current practices to future perspectives, Analyst 137 (2012) 3202–3215. [67] E. Gazi, J. Dwyer, P. Gardner, A. Ghanbari-Siahkali, A.P. Wade, J. Miyan, et al., Applications of Fourier transform infrared microspectroscopy in studies of benign prostate and prostate cancer. A pilot study, J. Pathol. 201 (2003) 99–108. [68] C. Kendall, M. Isabelle, F. Bazant-Hegemark, J. Hutchings, L. Orr, J. Babrah, et al., Vibrational spectroscopy: a clinical tool for cancer diagnostics, Analyst 134 (2009) 1029–1045. [69] K.R. Bambery, B.R. Wood, D. McNaughton, Resonant Mie scattering (RMieS) correction applied to FTIR images of biological tissue samples, Analyst 137 (2012) 126–132. [70] P. Crow, A. Molckovsky, N. Stone, J. Uff, B. Wilson, L.M. Wongkeesong, Assessment of fiber optic near-infrared Raman spectroscopy for diagnosis of bladder and prostate cancer, Urology 65 (2005) 1126–1130.

CHAPTER 4

A COMPARISON OF MULTIVARIATE ANALYSIS AND VARIABLE SELECTION METHODS TO PROSTATE CANCER CLASSIFICATION FROM FT-MIR BIOMEDICAL SPECTROSCOPIC DATA.

Laurinda F. S. Siqueira, Raimundo F. Araújo Júnior, Aurigena Antunes de Araújo, Camilo L.M. Morais, Kássio M. G. Lima.

Article submitted to Analytical Methods.
Manuscript number: AY-COM-09-2016-002461

Contributions:
• I did sample treatment.
• I did spectral acquisition.
• I did data pre-processing.
• I built the multivariate classification models.
• I wrote the manuscript.

Laurinda F. S. Siqueira
Kássio M. G. Lima

Contents
ABSTRACT
1 INTRODUCTION
2 EXPERIMENTAL
3 RESULTS
4 DISCUSSION
5 CONCLUSIONS
ACKNOWLEDGEMENTS
ABBREVIATIONS
REFERENCES

GRAPHICAL ABSTRACT

ABSTRACT: Prostate cancer is the second most commonly diagnosed malignancy in males worldwide. Vibrational spectroscopy can be applied to identify a biochemical signature of susceptibility to adenocarcinoma. This study set out to determine whether FT-MIR spectroscopy combined with multivariate analysis could be utilized to classify categories of cancerous prostate tissues. PCA, SPA and GA, each followed by LDA, were applied for 3-category discriminant analysis (Gleason II, III and IV) and for Low grade (Gleason II) versus High grade (Gleason III+IV) discrimination.
GA-LDA was the best model, correctly classifying 94.9% and 81.4% of the training and test sets, respectively, with a sensitivity of 100% and a specificity of 77.8% for the 3-category discriminant analysis, and correctly classifying 96.9% and 83.3% of the training and test sets, respectively, with a sensitivity of 71.4% and a specificity of 80% for Low vs. High grades. The most important wavenumbers responsible for the classification, as indicated by the models, were related to protein secondary structure variations (amide II, ≈1550 cm−1, and amide I, ≈1630 cm−1) and DNA/RNA alterations (≈1000–1490 cm−1).

Keywords: FT-MIR, PCA, GA, SPA, LDA, tissue, prostate cancer

1. INTRODUCTION

In 2015, the projected number of deaths among men from prostate cancer was 328,982. About 75% were in men older than 70 years, with a mortality rate of 152 per 100,000, followed by more than 22% in the 50–69 years age range, with a mortality rate of 12 per 100,000. In 2030, this estimate will grow to 541,242 deaths, of which about 79% will be in men older than 70 years, with a mortality rate of 154 per 100,000, and about 19% in the 50–69 years age range, with a mortality rate of 13 per 100,000 [1]. A retrospective analysis of mortality or progressive disease state in 1,349 male patients with biopsy-confirmed prostate adenocarcinoma (International Classification of Diseases, 10th edition, code C61) was carried out in a Brazilian Health System cancer referral center in Natal, Brazil. It found that black skin color increased the chance of developing metastasis by up to three times, and that metastasis increased the risk of death within 5 years tenfold [2]. The projected increase could be associated with the risk factors for prostate cancer: mainly increasing age, ethnic origin and heredity and, to a lesser extent, environment and lifestyle [3-7].
Prostate tissue is structurally complex, consisting of glandular ducts lined by epithelial cells and supported by heterogeneous stroma; it also contains blood vessels, blood, nerves, ganglion cells, lymphocytes and stones, organized into structures measuring from tens to hundreds of microns [8]. The Gleason system is the standard approach for grading prostate cancer and provides an indication of the aggressiveness of a tumor. The original scheme, established in the 1960s–1970s, evolved into a significantly modified system after two major consensus meetings conducted by the International Society of Urologic Pathology (ISUP) in 2005 and 2014, and was adopted by the 2016 WHO classification of tumors of the prostate [9]. The classical Gleason grading system ranges from Gleason 1 (the best differentiated, correlated with the most favorable prognosis) to Gleason 5 (the least differentiated, correlated with poor prognosis). It also defines the Gleason score as the sum of the primary and secondary patterns (grades), which correlates even better with the biological behavior of prostate adenocarcinoma [9-11]. However, this system is based upon a visual criterion of pattern recognition that is operator-dependent and subject to intra- and inter-observer variability. There are other drawbacks of biopsy-based detection approaches, including sample heterogeneity, difficult preparation procedures, harm to the organs, the probability of spreading cancer, time-consuming procedures, and samples susceptible to physical damage, among others [12].

Thus, there is a need for molecular-based techniques to grade tissue samples in a reliable and reproducible manner.
The transition of a normal cell to a diseased cell is accompanied by a change in a variety of biomolecules that can be simultaneously and indiscriminately probed by FTIR microspectroscopy coupled with chemometric tools, yielding spectral signatures that enable differentiation between normal and cancerous cells and tissues [13-17]. For most disease diagnoses, researchers have concentrated on the MIR spectrum (4000–600 cm−1). In biological terms, the vibrations in the 1500–1750 cm−1 region (the amide I and II bands) are ascribable to C=O, N–H and C–N from proteins and peptides, for example. Owing to its rapidity, reproducibility, holistic nature and ability to analyze carbohydrates, amino acids, fatty acids, lipids, proteins and polysaccharides simultaneously, FTIR coupled with chemometrics has been recognized as a valuable tool for metabolic fingerprinting/footprinting [18-22].

We used FTIR spectroscopy to interrogate prostate tissue, working on the assumptions that histopathologic changes can be defined by biochemistry, through objective spectroscopic criteria that do not require a pathologist’s interpretation, and that biospectroscopy and chemometrics may play an important role in identifying structural alterations of cellular molecules based on chemical bonds and can offer additional capabilities for automated, statistically controlled and reproducible subtype recognition [18]. The resulting spectral data were analyzed using multivariate analysis and variable reduction and selection techniques in the form of Principal Component Analysis (PCA), Successive Projection Algorithm (SPA) and Genetic Algorithm (GA) followed by Linear Discriminant Analysis (LDA), resulting in the PCA-LDA, SPA-LDA and GA-LDA models. The multivariate classification results were evaluated based on sensitivity, specificity, positive (or precision) and negative predictive values, Youden index, and positive and negative likelihood ratios.
Sample preparation, spectroscopic measurement, data pre-processing, feature extraction and analytical validation were also addressed. This study was developed through a partnership between the Institute of Chemistry and the Department of Pathology of the Federal University of Rio Grande do Norte, Natal, Brazil. All experiments were performed in compliance with the relevant laws and institutional guidelines, and the institutional committee (No. 030/0030/2006) of the Liga Norte-Riograndense Contra o Cancer, Brazil, approved this research.

2. EXPERIMENTAL

Tissue collection. Prostate tissue sections were obtained from the Pathology Department of the Federal University of Rio Grande do Norte (UFRN/Brazil). The prostate tissue was formalin-fixed, dehydrated and paraffin-embedded (FFPE) in pathology blocks (n = 45) that had previously been classified by pathologists according to traditional Gleason grading. No significant changes in fixation or paraffin embedding occurred during the analysis period, and no degradation of tissue architecture was observed. These 45 tissue samples were distributed according to the classical Gleason grade into 3 categories: Gleason 2 (n = 23), Gleason 3 (n = 15) and Gleason 4 (n = 7). Five-μm-thick tissue sections were floated onto ZnSe slides (Bruker Optics Ltd., Coventry, UK). These were de-waxed by serial immersion in fresh xylene baths for 5 min, then washed and cleared in an absolute ethanol bath for another 5 min [22]. The resulting samples were allowed to air-dry and then placed in a desiccator until analysis.

FT-MIR spectroscopy. A minimum of 40 and a maximum of 100 FT-MIR spectra per tissue were collected in transmission mode using a Bruker Lumos FTIR spectrometer-microscope (Bruker Optics Ltd., Coventry, UK). FTIR spectra were collected in the mid-IR wavenumber range of 600–4,000 cm−1 with a spectral resolution of 8 cm−1 and 32 scans.
Spectra were acquired with a new background taken for every new sample; these were converted into absorbance by the Bruker OPUS software.

Data processing. The importing and pre-treatment of the spectral data and the construction of chemometric classification models were executed using PLS Toolbox 7.8 (Eigenvector Research, Inc., 3905 West Eaglerock Drive, Wenatchee, WA 98801) and MATLAB R2012b (MathWorks Inc., Natick, MA, USA). FT-MIR spectra were cut to include wavenumbers between 900 and 1,800 cm−1, the region associated with the biological spectral fingerprint [23]. The resulting dataset was subjected to Extended Multiplicative Scatter Correction to correct the baseline for scattering effects, first-order Savitzky-Golay smoothing (15 points) to emphasize relevant information and remove background noise, and normalization to the amide I band (1,650 cm−1) to reduce distortions [24].

Multivariate analysis and variable extraction methods. Variable extraction consisted of three methods: PCA, SPA and GA, all followed by LDA. Before applying each model, the spectral data were divided into training (60%), validation (20%) and prediction (20%) sets by applying the classic Kennard-Stone (KS) uniform sampling algorithm [24,25]. The training and validation datasets were used in the modelling procedures (including variable reduction and selection for LDA), whereas the prediction dataset was only used for the final classification evaluation.

PCA is employed to reduce dimensionality and generate a visualization of the data; it captures as much variability as possible. Principal components (PCs) can capture most of the variance (> 95%) present in the original dataset [24-28]. The optimum number of 10 PCs, which explained 98% of the variance, was used to classify the prostate samples according to their assigned Gleason grade.
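The 60/20/20 Kennard-Stone split described above can be illustrated with a minimal Python sketch. It is not the PLS Toolbox/MATLAB routine used in this work: the function names `kennard_stone` and `split_60_20_20` are hypothetical, Euclidean distance between spectra is assumed, and the validation set is assumed to be drawn by a second KS pass over the samples left after training selection.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Return indices of n_select rows of X chosen by the Kennard-Stone rule."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all samples (assumed metric).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Start with the two samples that are farthest apart.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_select and remaining:
        # Distance from each candidate to its nearest already-selected sample.
        nearest = dist[np.ix_(remaining, selected)].min(axis=1)
        # Add the candidate that is farthest from everything selected so far.
        pick = remaining[int(np.argmax(nearest))]
        selected.append(pick)
        remaining.remove(pick)
    return selected[:n_select]

def split_60_20_20(X):
    """Hypothetical helper: training/validation/prediction index lists."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    train = kennard_stone(X, int(round(0.6 * n)))
    rest = [k for k in range(n) if k not in train]
    # Second KS pass on the leftover samples picks the validation set.
    val = [rest[k] for k in kennard_stone(X[rest], int(round(0.2 * n)))]
    test = [k for k in rest if k not in val]
    return train, val, test
```

Because KS always starts from the two most distant samples, the training set spans the extremes of the measured spectral space, which is the usual motivation for preferring it over random partitioning.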
SPA applies projection operations to choose subsets of variables with a small degree of multi-collinearity in order to minimize redundancy and ill-conditioning problems. Unlike PCA, SPA does not modify the original data vectors; projections are used only for selection purposes. Thus, the relation between spectral variables and data vectors is preserved [27-29].

GA is commonly used as a subset and wavelength selection strategy [12,27-31]. The GA routine was carried out using 40 generations containing 80 chromosomes each. The algorithm was repeated three times, starting from different random initial populations, and the best solution resulting from the three realizations of the GA was employed.

The optimum number of variables for GA–LDA was determined using the average risk G of LDA misclassification [32]. This cost function is calculated in the validation set as:

$$G = \frac{1}{N_V} \sum_{n=1}^{N_V} g_n$$ (Eq.1)

where $N_V$ is the number of validation objects and $g_n$ is defined as:

$$g_n = \frac{r^2(x_n, m_{I(n)})}{\min_{I(m) \neq I(n)} r^2(x_n, m_{I(m)})}$$ (Eq.2)

where $I(n)$ is the index of the true class for the nth validation object $x_n$; $r^2(x_n, m_{I(n)})$ is the squared Mahalanobis distance between object $x_n$ (of class index $I(n)$) and the sample mean $m_{I(n)}$ of its true class; and the denominator, the minimum of $r^2(x_n, m_{I(m)})$ over the wrong classes $I(m) \neq I(n)$, is the squared Mahalanobis distance between object $x_n$ and the center of the closest wrong class.

LDA is usually applied using the spectral band ratios as parameters to distinguish the FTIR spectra of normal samples from those of cancerous cases. LDA scores and discriminant function (DF) values were obtained. Typically, the first LDA factor (LD1) is used to visualize the main biochemical alterations within the sample on a 1-dimensional scores plot [23-32].
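As an illustration of Eq.1 and Eq.2, the average risk G can be computed as in the following minimal Python sketch. The function name `average_risk` is hypothetical, and two choices are assumptions rather than details given in the text: the class means and a single pooled covariance matrix are estimated from the training set, and a pseudo-inverse is used so that the sketch also runs when the covariance is singular.

```python
import numpy as np

def average_risk(X_train, y_train, X_val, y_val):
    """Average risk G of LDA misclassification over a validation set."""
    X_train = np.asarray(X_train, dtype=float)
    X_val = np.asarray(X_val, dtype=float)
    y_train = np.asarray(y_train)
    y_val = np.asarray(y_val)
    classes = np.unique(y_train)
    n, p = X_train.shape

    # Class means and pooled covariance from the training set (assumption).
    means = {c: X_train[y_train == c].mean(axis=0) for c in classes}
    pooled = np.zeros((p, p))
    for c in classes:
        d = X_train[y_train == c] - means[c]
        pooled += d.T @ d
    pooled /= (n - len(classes))
    inv = np.linalg.pinv(pooled)

    def r2(x, m):
        # Squared Mahalanobis distance between object x and class mean m.
        d = x - m
        return float(d @ inv @ d)

    g = []
    for x, c in zip(X_val, y_val):
        own = r2(x, means[c])  # numerator of Eq.2
        wrong = min(r2(x, means[k]) for k in classes if k != c)  # denominator
        g.append(own / wrong)
    return float(np.mean(g))  # Eq.1
```

A value of G well below 1 indicates that validation objects lie much closer to their own class center than to the nearest wrong class, so a GA fitness based on this quantity would favor variable subsets that minimize G.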
To obtain the discriminant profile, the LDA classification score ($L_{ik}$) is calculated for a given class k by the following equation, assuming that the class covariance matrices are equal:

$$L_{ik} = (\mathbf{x}_i - \bar{\mathbf{x}}_k)^{\mathrm{T}} \boldsymbol{\Sigma}_{\mathrm{pooled}}^{-1} (\mathbf{x}_i - \bar{\mathbf{x}}_k) - 2 \log_e \pi_k \qquad \text{(Eq. 3)}$$

where $\mathbf{x}_i$ is an unknown measurement vector for a sample i; $\bar{\mathbf{x}}_k$ is the mean measurement vector of class k; $\boldsymbol{\Sigma}_{\mathrm{pooled}}$ is the pooled covariance matrix; and $\pi_k$ is the prior probability of class k 49.

The methods were applied to a three-category discriminant analysis of the pre-assigned Gleason grades (2, 3 and 4) and to a two-category discriminant analysis (low grade versus high grade: Gleason 2 versus Gleason 3 and 4, respectively), in order to test their classification power and robustness. The nonparametric Spearman rho test (2-tailed) was used to determine whether there was a significant correlation between each LDA model and the pre-assigned Gleason grades.

Validation, comparison and quality performance. To compare and evaluate the classification power and quality of the multivariate analysis and variable reduction and selection methods (PCA-LDA, SPA-LDA and GA-LDA), the following quality metrics for multivariate classification were analyzed: Sensitivity, Specificity, Positive (or Precision) and Negative Predictive Values, Youden index, and Positive and Negative Likelihood Ratios.
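The LDA classification score of Eq. 3 can be evaluated directly once the class means, the pooled covariance matrix and the priors are known. The sketch below is illustrative only: the function names `lda_scores` and `assign_class`, the class labels and the numerical values are hypothetical, and it assumes that a sample is assigned to the class with the smallest score.

```python
import numpy as np

def lda_scores(x, class_means, pooled_cov, priors):
    """Score L_ik of Eq. 3 for one sample x against every class k."""
    inv_cov = np.linalg.inv(pooled_cov)
    scores = {}
    for k, mean_k in class_means.items():
        diff = np.asarray(x, dtype=float) - mean_k
        # Squared Mahalanobis distance minus twice the log prior.
        scores[k] = float(diff @ inv_cov @ diff - 2.0 * np.log(priors[k]))
    return scores

def assign_class(scores):
    """Assumed decision rule: the class with the smallest score wins."""
    return min(scores, key=scores.get)

# Toy two-class example with equal priors.
class_means = {"low": np.array([0.0, 0.0]), "high": np.array([2.0, 0.0])}
priors = {"low": 0.5, "high": 0.5}
scores = lda_scores([0.5, 0.0], class_means, np.eye(2), priors)
assigned = assign_class(scores)
```

With equal priors the log term is the same for every class, so the rule reduces to choosing the nearest class mean in the Mahalanobis sense.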
For this comparison, the metrics are defined as follows: (1) Sensitivity is the confidence that a positive result is obtained for a sample of the label class, that is, positive in disease; (2) Specificity is the confidence that a negative result is obtained for a sample of the non-label class; (3) Positive Predictive Value (PPV) shows how many of the test positives are true positives; (4) Negative Predictive Value (NPV) shows how many of the test negatives are true negatives; (5) Youden index (YOU) evaluates the classifier's ability to avoid failure; (6) Positive Likelihood Ratio (LR+) represents the ratio between the probability of predicting an example as positive when it truly is positive and the probability of predicting an example as positive when it actually is not positive; (7) Negative Likelihood Ratio (LR-) represents the ratio between the probability of predicting an example as negative when it is actually positive and the probability of predicting an example as negative when it truly is negative 29,33,34.

Accordingly, the better models must have high sensitivity and specificity to be precise and accurate in class separation. In addition, the YOU value should be as close to 100 as possible to demonstrate the capacity of the method for classification and biomarker identification. PPV and NPV must also be high to confirm or refute the group segregation. Finally, LR+ must be high and LR- must be low, which gives an intuitive indication that the model governs the classification.

In addition, two measures of classification were calculated, namely the classification rates of the training set and of the test (or prediction) set. The classification rate of the training set is obtained by applying the models to the same set of samples used to build and optimize them. The classification rate of the test set is used to test the classification ability of the models, and it gives a more realistic representation of that ability.
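The quality metrics defined above follow directly from the counts of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). The sketch below is a hedged illustration: the function name `quality_metrics` is hypothetical, the counts are invented for the example rather than taken from the study's confusion matrices, and LR- is computed as (1 − SENS)/SPEC, which is what the verbal definition of LR- implies.

```python
def quality_metrics(tp, fp, tn, fn):
    """Classification quality metrics from confusion-matrix counts."""
    sens = tp / (tp + fn)  # fraction of true positives detected
    spec = tn / (tn + fp)  # fraction of true negatives detected
    return {
        "SENS": 100 * sens,
        "SPEC": 100 * spec,
        "PPV": 100 * tp / (tp + fp),
        "NPV": 100 * tn / (tn + fn),
        "YOU": 100 * (sens + spec - 1),  # same as SENS - (1 - SPEC), in percent
        "LR+": sens / (1 - spec) if spec < 1 else float("inf"),
        "LR-": (1 - sens) / spec if spec > 0 else float("inf"),
    }

# Hypothetical counts, for illustration only.
metrics = quality_metrics(tp=5, fp=2, tn=6, fn=1)
```

The percentage-valued entries are on a 0 to 100 scale, whereas the two likelihood ratios are dimensionless.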
Comparing the training and test classification rates helps to identify over-fitting, which is indicated when the two values differ greatly from each other. Table 1 summarizes the equations.

Table 1 – Multivariate classification quality features and equations

Validation and quality tool | Equation
Sensitivity (SENS) | (TP / (TP + FN)) × 100
Specificity (SPEC) | (TN / (TN + FP)) × 100
Positive Predictive Value (PPV) | (TP / (TP + FP)) × 100
Negative Predictive Value (NPV) | (TN / (TN + FN)) × 100
Youden's index (YOU) | SENS − (1 − SPEC)
Likelihood ratio positive (LR(+)) | SENS / (1 − SPEC)
Likelihood ratio negative (LR(-)) | (1 − SENS) / SPEC

3. RESULTS

Three-category discriminant analysis of Gleason grade system using PCA-LDA, SPA-LDA and GA-LDA

The pre-processing of the FT-MIR spectral datasets for the pre-assigned categories is shown in Figure 1. The non-pre-processed (Fig. 1A) and pre-processed (Fig. 1B) mean FT-MIR spectra derived from each spectral dataset used to train the Gleason clusters form three categories: (1) Gleason 2, green line; (2) Gleason 3, red line; and (3) Gleason 4, blue line. There is a significant differentiation between categories (P < 0.01), and visual inspection alone allows distinguishing features to be identified.

Chemometric analysis techniques were applied in order to classify the prostate samples according to Gleason grade and to determine the biochemical markers responsible for the classification. PCA, SPA and GA followed by LDA were adopted to systematically identify spectral differences between the pre-assigned categories. The scores plot (DF1 × DF2) derived from PCA-LDA of the FT-MIR spectra is displayed in Figure 1C. The PCA-LDA model was built using the first ten PCs, which explain about 98% of the variance within the sample population. Scores plots identify the similarities and dissimilarities between categories and present them as clusters of points.
The loadings plot derived from PCA-LDA, which identifies the wavenumbers that are important for separating the categories, is presented in Figure 1D. These include 960; 1,150; 1,227; 1,250; 1,280; 1,360; 1,410; 1,455; 1,545; and 1,574 cm−1. There is perceptible separation between the pre-assigned categories, and this separation was significant (P < 0.00001). The scores plot derived from SPA-LDA (Fig. 1E) showed significant segregation between the categories (P < 0.00005), owing to the tendency of spectral points from the same category to cluster together and apart from the other groups. The SPA-LDA model clearly separates the categories better than the PCA-LDA model. Applying SPA-LDA to the dataset resulted in the selection of twenty-nine wavenumbers, which were: 968; 1,109; 1,128; 1,140; 1,147; 1,157; 1,166; 1,174; 1,188; 1,197; 1,210; 1,213; 1,225; 1,237; 1,243; 1,254; 1,270; 1,295; 1,303; 1,335; 1,360; 1,402; 1,425; 1,451; 1,460; 1,486; 1,534; 1,559; 1,620; 1,630 and 1,650 cm−1 (Fig. 1F). The scores plot for classification derived from the GA-LDA model is displayed in Figure 1G. The GA model selected 15 wavenumbers (Fig. 1H): 1,035; 1,084; 1,129; 1,190; 1,222; 1,280; 1,344; 1,370; 1,433; 1,518; 1,540; 1,557; 1,630; 1,652 and 1,680 cm−1. The GA-LDA model separates the three categories better (P < 0.00005) than the PCA-LDA and SPA-LDA models.

Figure 1 – (A) Non-pre-processed and (B) pre-processed FT-MIR derived spectral dataset for 3-category discriminant analysis to Gleason grade. (C) Scores (DF1 × DF2) plot calculated by PCA-LDA. (D) Loadings plot derived from PCA-LDA. (E) Scores (DF1 × DF2) plot calculated by SPA-LDA. (F) 29 wavenumbers selected by SPA-LDA. (G) Scores (DF1 × DF2) plot calculated by GA-LDA. (H) 15 wavenumbers selected by GA-LDA.
Two-category discriminant analysis of Gleason grade system using PCA-LDA, SPA-LDA and GA-LDA

The distinction between categories was clearer in the two-category discriminant analysis (Figure 2). The non-pre-processed (Fig. 2A) and pre-processed (Fig. 2B) mean FT-MIR spectra derived from each spectral dataset used to train the Gleason clusters form two categories: (1) low grade (Gleason 2), blue line; and (2) high grade (Gleason 3 and 4), red line. The visual differentiation between categories is better, and distinguishing features are clearly identified.

As before, PCA, SPA and GA followed by LDA were applied to systematically identify spectral differences between the two categories. The scores plot derived from PCA-LDA (Fig. 2C) showed significant segregation between the two categories (P < 0.00001), and the associated loadings plot (Fig. 2D) identified the principal segregating wavenumbers: 1,155; 1,225; 1,280; 1,360; 1,380; 1,460; 1,540; 1,560; 1,575; and 1,630 cm−1. Similarly, SPA-LDA identified significant separation (P < 0.00005) between the two categories, as shown by the related scores plot (Fig. 2E). This approach used thirty wavenumbers: 980; 1,074; 1,123; 1,132; 1,142; 1,150; 1,171; 1,187; 1,199; 1,210; 1,222; 1,231; 1,242; 1,253; 1,269; 1,285; 1,303; 1,361; 1,378; 1,405; 1,419; 1,440; 1,452; 1,462; 1,491; 1,504; 1,536; 1,559; 1,620; 1,650 and 1,680 cm−1 (Fig. 2F). The GA-LDA model generated the best classification (Fig. 2G) using 15 selected variables: 1,083; 1,142; 1,194; 1,229; 1,272; 1,297; 1,333; 1,345; 1,431; 1,504; 1,526; 1,531; 1,546; 1,650 and 1,680 cm−1 (Fig. 2H). This separation is also significant (P < 0.00005).

Validation, comparison and quality performance. The performance of the models for each classification category is presented in Table 2.
The sensitivity of PCA-LDA, SPA-LDA and GA-LDA reached 57.2%, 83.3% and 100%, respectively, for the 3-category discriminant analysis, showing that these categories can be well classified by these models, especially by SPA-LDA and GA-LDA. For the 2-category discriminant analysis (low and high grade), the sensitivity values of the PCA-LDA, SPA-LDA and GA-LDA models were 60%, 66.7% and 71.4%, respectively, showing that these categories can be relatively well classified by these models, especially by GA-LDA, which gave a test-set classification rate of 83.3% (compared with 81.4% for GA-LDA in the three-category analysis). Furthermore, the specificity for both categorizations suggests that SPA-LDA and GA-LDA had better accuracy than PCA-LDA.

Figure 2 – (A) Non-pre-processed and (B) pre-processed FT-MIR derived spectral dataset for 2-category discriminant analysis (low versus high grade). (C) Scores (DF1 × DF2) plot calculated by PCA-LDA. (D) Loadings plot derived from PCA-LDA. (E) Scores (DF1 × DF2) plot calculated by SPA-LDA. (F) 30 wavenumbers selected by SPA-LDA. (G) Scores (DF1 × DF2) plot calculated by GA-LDA. (H) 15 wavenumbers selected by GA-LDA.

Table 2 – Values of quality performance features from PCA-LDA, SPA-LDA and GA-LDA models by FT-MIR spectroscopy for each classification category of prostate cancer.
Quality performance feature | PCA-LDA | SPA-LDA | GA-LDA

3-category discriminant analysis
Spearman Correlation Coefficient (%) | 60.5 | 77.5 | 79.8
Classification rate, training set (%) | 84.2 | 91.2 | 94.9
Classification rate, test set (%) | 57.2 | 71.4 | 81.4
Sensitivity (%) | 57.2 | 83.3 | 100
Specificity (%) | 57.2 | 75 | 77.8
Positive Predictive Value (PPV) | 57.2 | 71.4 | 71.4
Negative Predictive Value (NPV) | 57.2 | 85.7 | 100
Youden's index (YOU) | 14.3 | 58.3 | 77.8
Likelihood ratio positive (LR(+)) | 1.3 | 3.3 | 4.5
Likelihood ratio negative (LR(-)) | 0.7 | 0.2 | 0.1

2-category discriminant analysis
Spearman Correlation Coefficient (%) | 57.9 | 77.8 | 91.5
Classification rate, training set (%) | 87.8 | 93.8 | 96.9
Classification rate, test set (%) | 60 | 66.7 | 83.3
Sensitivity (%) | 60 | 66.7 | 71.4
Specificity (%) | 60 | 66.7 | 80
Positive Predictive Value (PPV) | 60 | 66.7 | 83.4
Negative Predictive Value (NPV) | 60 | 66.7 | 66.7
Youden's index (YOU) | 20 | 33.3 | 51.4
Likelihood ratio positive (LR(+)) | 1.5 | 2 | 3.6
Likelihood ratio negative (LR(-)) | 0.7 | 0.5 | 0.3

4. DISCUSSION

This study aimed to identify spectral differences between prostate cancer tissues according to the Gleason system, and to test the classification power of multivariate analysis and variable reduction and selection methods. For this purpose, prostate tissues were taken from formalin-fixed, dehydrated and paraffin-embedded (FFPE) pathology blocks previously classified by pathologists according to the Gleason system. The samples did not show any complicating diathermy effect. Fortunately, no contributions from paraffin vibrational modes were apparent in the low-wavenumber region of the FT-MIR spectra used in this project. However, tissue samples are notoriously fragile and are susceptible to damage and rapid degradation 18, so care is needed in sample treatment. The small number of samples in this study reflects the challenge of dealing with tissue samples; other authors, such as Patel et al. 17 and Theophilou et al. 27, also used a small number of samples per class.
We decided to use transmission mode, first of all, because a non-destructive technique was required. In addition, our research deals with tissue samples, which cannot provide the thickness of three to four times the penetration depth that is recommended for ATR measurements, for example 47. We were also interested in a technique that allows an area to be mapped, as transmission mode does, rather than a single point or a few separate points, as ATR does. Other modes can be used, such as Diffuse Reflectance (DRIFT), which is also fast, simple and non-invasive. We hope to use it in future work.

The Gleason grade 2 training specimen originated from a clinically low-stage tumor, whereas the Gleason grade 3 training specimen was derived from a patient whose tumor exhibited bone metastases at the time of biopsy, and the Gleason grade 4 training specimens were derived from tumors that exhibited more aggressive bone metastases at the time of biopsy 16,36. Interestingly, the mean FT-MIR-derived spectra from Gleason grade 3 and Gleason grade 4 (Fig. 1A-B) lie in closer proximity to one another. It can be postulated that, because both tumors can exhibit more similar and more aggressive behavior than Gleason grade 2, they may consist of cells with overlapping phenotypes, which result in overlapping spectral features in their associated IR spectra 16,17,27. For this reason, the two-category discriminant analysis was also investigated. Spectral differences could be the first evidence of phenotypic alterations 27.

A total of n = 45 tissues were mapped using FT-MIR. The spectra were investigated from 900 to 1,800 cm−1, as most biomolecular spectral signatures reside within this region 35. Multivariate analysis and variable reduction and selection methods allowed prostate tissue to be discriminated according to the Gleason grade. There was apparent separation between the clusters of different categories, and it became more pronounced as the robustness of the algorithm increased.
Three multivariate variable reduction and selection models followed by Linear Discriminant Analysis were applied to the spectra obtained by FT-MIR spectroscopy (PCA-LDA, SPA-LDA and GA-LDA). They had varying degrees of success in correctly classifying the samples according to the three-category (grades 2, 3 and 4) and two-category (low versus high grade: grade 2 versus grades 3-4, respectively) divisions of the Gleason system. Their scores plots identified spectral similarity/dissimilarity and segregation, with the corresponding loadings/cluster vector plots highlighting the segregating wavenumbers 27. The resulting scores and loadings plots provide a visual representation and interpretation of the variables responsible for any segregation 37.

For both classifications, the weakest approach was PCA-LDA, with 57.2% of the population data correctly classified into the three categories of the Gleason system and 60% correctly classified into low and high grades (Table 2). Ten PCs were used, as they captured enough variance (98%). The related scores plots (Fig. 1C and 2C) show better separation between low and high grades than in the three-category discriminant analysis of the individual Gleason grades. PCA-LDA chooses the variables responsible for the classification through a sequence of linear combinations of the original variables (wavenumbers) that have the greatest covariance. These combinations are independent and are represented by the scores and loadings matrices, which constitute new variables that explain more of the class variance. The loadings plots allow the variables (biomarkers) of the PCA-LDA model to be visualized. An important consideration when applying LDA after PCA is that, although PCA reduces dimensionality while capturing as much variability as possible, the number of PCs included influences the resulting information: too few PCs result in a lack of information, while too many increase the amount of noise in the data.
PCA-LDA therefore has the disadvantage of potential LDA over-fitting, which causes arbitrary separation and excess noise, degrades the interpretation of the loadings plots and can produce spuriously positive results; this happens when more than 20 PCs are included 24,28. It can be counteracted by using large spectral datasets, with more than five times as many spectra as variables 25.

The GA-LDA and SPA-LDA approaches revealed better segregation between the categories. The segregation capacity of SPA-LDA (Fig. 1E and 2E) ranked second for both classifications (71.4% and 66.7%, respectively) (Table 2). Both methods confirmed the closer proximity between Gleason grades 3 and 4 that was previously observed in their spectral behavior (Fig. 1A-B). Although the SPA-LDA model has high computational requirements and is more time-consuming, it selects wavenumbers whose information content is minimally redundant, solving collinearity problems 27,28,39. The SPA-LDA classifier selects the most relevant variables responsible for class segregation through a series of vector projections. Starting from a single initial variable (wavenumber), a new variable is incorporated at each iteration until a number N of wavenumbers is reached. At the end, a new matrix is formed from the variables with a small degree of multi-collinearity, in order to minimize redundancy and ill-conditioning problems. The plot of the selected variables provides a visualization of the variables (biomarkers) chosen by the method.

GA-LDA (Fig. 1G and 2G) was the best method for both discriminant analyses of the FT-MIR dataset, with 81.4% of the samples correctly classified into the three grades of the Gleason system and 83.3% of the population data correctly classified into low and high grades. Moreover, the method was more accurate (specificity of 80%) in separating low and high grades.
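The successive projections used by SPA, described above, can be sketched as a single projection chain in NumPy. This is a simplified illustration rather than the implementation used in this work: the function name `spa_chain` and the toy matrix are hypothetical, and the sketch assumes mean-centred columns with non-zero variance, a fixed starting variable and a fixed number of variables N.

```python
import numpy as np

def spa_chain(X, start, n_vars):
    """Select n_vars column indices of X by successive projections.

    Assumes every column of X has non-zero variance after mean-centring.
    """
    Xp = np.asarray(X, dtype=float)
    Xp = Xp - Xp.mean(axis=0)  # mean-centre each variable
    selected = [start]
    for _ in range(n_vars - 1):
        ref = Xp[:, selected[-1]]
        # Project every column onto the subspace orthogonal to the last pick.
        Xp = Xp - np.outer(ref, ref @ Xp) / (ref @ ref)
        norms = np.linalg.norm(Xp, axis=0)
        norms[selected] = -1.0  # never pick the same variable twice
        # The column with the largest residual norm is the least collinear.
        selected.append(int(np.argmax(norms)))
    return selected

# Toy matrix: column 1 is exactly collinear with column 0, column 2 is not.
toy = np.array([[1.0, 2.0, 1.0],
                [-1.0, -2.0, 1.0],
                [1.0, 2.0, -1.0],
                [-1.0, -2.0, -1.0]])
chain = spa_chain(toy, start=0, n_vars=2)
```

In a full SPA-LDA procedure, chains would typically be built from different starting variables and lengths, and the chain giving the best validation performance would be retained; that outer search is omitted here.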
The success of this model can be associated with the fact that it selects the variables that give a higher signal-to-noise ratio, optimizing the response function 39-41. The GA-LDA classifier selects the variables responsible for class segregation on the basis of evolutionary theory, with selection, crossing and mutation as operators. The generations define the iteration cycles of the method, and the genes contained in a chromosome represent the variables (wavenumbers). Selection picks the fittest chromosomes (the sets of variables that give the lowest prediction and validation errors) and forms a new population (a set of variables, or matrix). Within the new population, crossing or recombination consists of the random selection of individuals that are crossed to generate descendants (new sets of variables) carrying useful information. Mutation randomly changes genes of a chosen chromosome; as in nature, the mutation rate in the method tends to be low. Thus, modified chromosomes tend to have a greater chance of being selected after each generation (cycle of iterations). Once the predefined number of generations has been reached, the variables selected by the method are obtained and can be visualized in the corresponding plot.

Model validation was based on the calculation and analysis of figures of merit (also called quality metrics, quality features or indexes for evaluating models), such as sensitivity, specificity, positive (or precision) and negative predictive values, Youden index, and positive and negative likelihood ratios. In addition, the similarity between the classification rates of the training and test sets indicated well-balanced models, with no sign of over-fitting. The classification rate of the test set shows the classification ability of the models and gives a more realistic representation of that ability.
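The genetic-algorithm cycle of selection, crossing and mutation described in this section can be sketched as follows. This is a minimal illustration under several assumptions that are not stated in the text: truncation selection of the better half, one-point crossover, a 5% mutation rate, and a generic cost function to be minimized (in GA-LDA this role would be played by the average risk G). The function name `ga_select` and the toy cost are hypothetical.

```python
import numpy as np

def ga_select(cost, n_vars, n_chrom=80, n_gen=40, p_mut=0.05, seed=0):
    """Search for a boolean variable mask with low cost using a simple GA.

    Assumes n_vars >= 2 and n_chrom >= 2.
    """
    rng = np.random.default_rng(seed)
    # Each chromosome is a boolean mask over the variables (genes).
    pop = rng.random((n_chrom, n_vars)) < 0.5
    best, best_cost = None, np.inf
    for _ in range(n_gen):
        costs = np.array([cost(chrom) for chrom in pop])
        i = int(np.argmin(costs))
        if costs[i] < best_cost:
            best, best_cost = pop[i].copy(), float(costs[i])
        # Selection: the better half of the population become parents.
        parents = pop[np.argsort(costs)[: n_chrom // 2]]
        # Crossing: one-point crossover between randomly paired parents.
        a = parents[rng.integers(len(parents), size=n_chrom)]
        b = parents[rng.integers(len(parents), size=n_chrom)]
        cut = rng.integers(1, n_vars, size=n_chrom)[:, None]
        children = np.where(np.arange(n_vars) < cut, a, b)
        # Mutation: flip each gene with a small probability.
        pop = children ^ (rng.random(children.shape) < p_mut)
    return best, best_cost

# Toy cost: number of genes that differ from a hypothetical target mask.
target = np.array([1, 0, 1, 0, 0, 1, 0, 0, 1, 0], dtype=bool)
toy_cost = lambda mask: int(np.sum(mask != target))
best_mask, best_value = ga_select(toy_cost, n_vars=10)
```

The default population size and number of generations mirror the 80 chromosomes and 40 generations reported in the methods; repeating the call with three different `seed` values and keeping the lowest-cost result would mimic the three runs from different random initial populations.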
All methods showed a better and significant correlation (P < 0.01) with the pre-assigned Gleason grades in the two-category discriminant analysis. Again, the GA-LDA model generated the best correlation for the two-category (91.5%) as well as for the three-category discriminant analysis (79.8%). The other quality features confirmed the potential applicability of these methods to both classification categories of Gleason grades, in particular GA-LDA. The PPV and NPV values obtained for GA-LDA were high (71.4 and 100 for the three-category analysis; 83.4 and 66.7 for the two-category analysis; Table 2), which suggests that the method was classifying correctly. The YOU values (77.8 and 51.4, respectively) indicate that the effectiveness of the method was relatively large; this index reflects the differentiating ability of the biomarkers when equal weight is given to sensitivity and specificity. The LR+ was high and the LR- was low, which gives an intuitive indication that the result of GA-LDA governs the classification 29,33,34. The same logic can be applied to the other methods.

Table 3 lists the molecular entities associated with the wavenumbers responsible for classification by all methods. The wavenumbers responsible for classification by each model were discussed previously.

Table 3 – Discriminating wavenumbers identified by the cluster vectors of the LDA models using FT-MIR spectroscopy for prostate cancer. The bands listed show the greatest intensity variation between High grade and Low grade spectra.
Wavenumber (cm−1) | Tentative assignment
≈ 1,680 | LHS Amide I (C=O stretch; C–N stretch)
≈ 1,650 | Amide I (C=O stretch; C–N stretch)
≈ 1,620 | RHS Amide I (C=O stretch; C–N stretch)
≈ 1,585 | Amide I/II trough
≈ 1,570 | LHS Amide II (N–H bend and C–N stretch)
≈ 1,550 | Amide II (N–H bend and C–N stretch)
≈ 1,520 | RHS Amide II (N–H bend and C–N stretch)
≈ 1,455 | Protein (C–H and N–H deformation modes)
≈ 1,400 | Fatty acids and amino acids (C=O stretching of COO− groups)
≈ 1,250–1,360 | Amide III (C–N stretching)
≈ 1,230 | DNA (O–P–O asymmetric stretch)
≈ 1,120–1,180 | RNA ribose and DNA (C–O stretching)
≈ 1,080 | DNA/RNA (O–P–O symmetric stretch)
≈ 1,030 | Glycogen (C–O–H bend)
≈ 970 | Protein phosphorylation

The spectral bands relevant to prostate cancer classification, and their characteristics, were as follows, in descending order of intensity variation:

(1) Increased secondary protein structures and protein region involving amino acid conformational changes in C=O, C–O, C–H and N–H (≈ 1,591–1,483 cm−1). This spectral region showed the largest increase in variation. In fact, amide II (≈ 1,550 cm−1) and amide I (≈ 1,630 cm−1) are more sensitive to changes in the conformational substructures of tissues; this tends to become clearer when the entire populations of cellular proteins in the High grade and Low grade prostate cancer tissues are compared. In addition, this increase in intensity may also imply a reduction in the intermolecular aggregation of the tissue proteins promoted by cancer transformation 17,20,24,27,42,43,48.

(2) Increased DNA and RNA bands (≈ 1,000–1,490 cm−1).
The νsPO2− and νasPO2− frequencies were higher in the High grade than in the Low grade category of cancerous tissue, which may indicate that the intermolecular interactions between nucleic acids in the High grade category are stronger as a result of intermolecular differentiation 17,19,24-26,42,48.

(3) Increased protein phosphorylation (≈ 970 cm−1), which is widely regarded as essential to the post-translational modification of proteins and to the regulation of their metabolism 12,17,24,27,42,48. Other biomarkers, which showed minor intensity variation, are included in Table 3.

Interestingly, the chemometric approaches GA-LDA and SPA-LDA identified and highlighted marked variation in the spectral regions containing DNA and RNA bands (≈ 1,000–1,490 cm−1), involving nucleic acid, phosphate and deoxyribose modifications. Additionally, alterations in phosphate stretching vibrations, including both νsPO2− and νasPO2− (DNA/RNA), indicate changes in DNA conformation. These were also identified as the fifteen most altered biomarkers between individual sample analyses. DNA vibrational modes have been identified as key features in the discrimination of prostate grades and of susceptibility to adenocarcinoma 6,24,32-44.

It is also interesting that the SPA and GA algorithms identified wavenumbers indicating variability within the protein region involving amino acid conformational changes in C=O, C–O, C–H and N–H (≈ 1,591–1,483 cm−1) and in protein phosphorylation (≈ 970 cm−1). This could be due to post-translational modifications related to the changes evident within the DNA/RNA spectral regions. The high occurrence of protein structure variation in the data could be a result of alterations in the expression of phase I/II metabolizing enzymes 6,17,41-44. Other biomarkers are shown in Table 3.
FT-MIR spectroscopy coupled with multivariate analysis and variable reduction and selection methods allows the identification of biomarkers that can be adapted for easy discrimination between categories. Micro-environmental cellular communication plays a significant role in cancer initiation and progression 45; therefore, examination of any part of the prostate tissue may provide information that leads to a better understanding of disease alterations 46.

5. CONCLUSION

In general, GA-LDA was the most successful method in distinguishing categories, with classification rates around 83.3% and sensitivity and specificity of about 100% and 80%, respectively, demonstrating that FT-MIR spectroscopy in conjunction with powerful chemometric approaches has potential for classification. The two-category classification based on the Gleason criteria gave the greatest correlation with this method (around 91% by GA-LDA). The most important wavenumbers indicated by the multivariate models were related to secondary protein structure variations and DNA/RNA alterations. These potential biochemical markers may provide vital clues to the etiology of prostate cancer and its progression. Molecular classification using FT-MIR combined with multivariate analysis and variable reduction and selection methods has the potential to improve the histologic assessment and prediction of prostate cancer. The evidence presented here is sufficient to warrant further testing of this methodology on a larger dataset of patients, and also testing of possible correlations with factors such as age, generation, body mass index, weight, diet, alcohol consumption and comorbidities 27, with the aim of identifying prostate cancer variability.

ACKNOWLEDGEMENTS

Laurinda F.S. Siqueira and Camilo L.M. Morais would like to acknowledge the financial support from the PPGQ/UFRN/CAPES. K.M.G. Lima acknowledges the CNPq (Grant 305962/20144) for financial support.
ABBREVIATIONS

DF: Discriminant Function
EMSC: Extended Multiplicative Scatter Correction
FFPE: Formalin-Fixation and Paraffin-Embedding
FN: False negative
FP: False positive
FTIR: Fourier-Transform Infrared Spectroscopy
FT-MIR: Fourier-Transform Mid-Infrared Spectroscopy
GA: Genetic Algorithms
GA-LDA: Genetic Algorithm-Linear Discriminant Analysis
IR: Infrared
LD1: Linear Discriminant score
LDA: Linear Discriminant Analysis
LHS: Left-hand shoulder
LR-: Negative Likelihood Ratio
LR+: Positive Likelihood Ratio
MIR: Mid-Infrared
NPV: Negative Predictive Value
PC: Principal Component
PCA: Principal Components Analysis
PCA-LDA: Principal Components-Linear Discriminant Analysis
RHS: Right-hand side
SENS: Sensitivity
SPA: Successive Projections Algorithm
SPA-LDA: Successive Projections Algorithm-Linear Discriminant Analysis
SPEC: Specificity
TN: True negative
TP: True positive
WHO: World Health Organization
YOU: Youden index
νasPO2−: Asymmetric phosphate stretching vibrations
νsPO2−: Symmetric phosphate stretching vibrations

REFERENCES

1. WHO. Mortality and global health estimates. Projection of death rates for 2015-2030. WHO (2016). http://apps.who.int/gho/data/node.main.PROJRATEWORLD?lang=en.
2. B. C. de Souza, H. G. Guedes, V. C. B. Oliveira, F. A. de Araujo, C. C. O. Ramos, K. C. P. Medeiros and R. F. J. Araujo, High incidence of prostate cancer metastasis in Afro-Brazilian men with low educational levels: a retrospective observational study. BMC Public Health, 13, 537 (2013).
3. O. Bratt, J.-E. Damber, M. Emanuelsson and H. Grönberg, Hereditary prostate cancer: clinical characteristics and survival. J. Urol., 167, 2423–2426 (2002).
4. S. M. Ho, Y. K. Leung and I. Chung, Estrogens and antiestrogens as etiological factors and therapeutics for prostate cancer. Ann. N. Y. Acad. Sci., 1089, 177–193 (2006).
5. W. D. Dunsmuir, D. Hrouda and R. S. Kirby, Malignant changes in the prostate with ageing. Brit. J. Urol., 82, 47–58 (1998).
6. J. Cuzick, G. P. Swanson, G.
Fisher, A. R. Brothman, D. M. Berney, J. E. Reid, D. Mesher, V. O. Speights, E. Stankiewicz, C. S. Foster, H. Møller, P. Scardino, J. D. Warren, J. Park, A. Younus, D. D. Flake, S. Wagner, A. Gutin, J. S. Lanchbury and S. Stone, Prognostic value of an RNA expression signature derived from cell cycle proliferation genes in patients with prostate cancer: a retrospective study. Lancet Oncol., 12, 245–255 (2011).
7. C. Alberti. Hereditary/familial versus sporadic prostate cancer: few indisputable genetic differences and many similar clinicopathological features. Eur. Rev. Med. Pharmacol. Sci., 14, 31–41 (2010).
8. D.C. Fernandez, R. Bhargava, S.M. Hewitt, I.W. Levin. Infrared spectroscopic imaging for histopathologic recognition. Nat. Biotechnol., 23, 469–474 (2005).
9. N. Chen and Q. Zhou. The evolving Gleason grading system. Chin. J. Cancer Res., 28(1), 58–64 (2016).
10. P.M. Pierorazio, P.C. Walsh, A.W. Partin, et al. Prognostic Gleason grade grouping: data based on the modified Gleason scoring system. BJU Int., 111, 753–760 (2013).
11. J.I. Epstein, L. Egevad, M.B. Amin, et al. The 2014 International Society of Urological Pathology (ISUP) Consensus conference on Gleason grading of prostatic carcinoma: definition of grading patterns and proposal for a new grading system. Am. J. Surg. Pathol., 40, 244–252 (2016).
12. M. Khanmohammadi, K. Ghasemi and G.A. Bagheri. Genetic algorithm spectral feature selection coupled with quadratic discriminant analysis for ATR-FTIR spectrometric diagnosis of basal cell carcinoma via blood sample analysis. RSC Adv., 4, 41484-11490 (2014).
13. M.A. Mackanos and C.H. Contag. FTIR microspectroscopy for improved prostate cancer diagnosis. Trends Biotechnol., 27, 661–663 (2009).
14. T.J. Harvey, A. Henderson, E. Gazi, N.W. Clarke, M. Brown, E.C. Faria, et al. Discrimination of prostate cancer cells by reflection mode FTIR photoacoustic spectroscopy. Analyst, 132, 292–295 (2007).
15. M. Rouprêt, V. Hupertan, D.R. Yates, J.W.F. Catto, I. Rehman, M.
Meuth, et al. Molecular detection of localized prostate cancer using quantitative methylation specific PCR on urinary cells obtained following prostate massage. Clin. Cancer Res., 13, 1720–1725 (2007).
16. E. Gazi, M. Baker, J. Dwyer, N.P. Lockyer, P. Gardner, J.H. Shanks, et al. A correlation of FTIR spectra derived from prostate cancer biopsies with Gleason grade and tumor stage. Eur. Urol., 50, 750–761 (2006).
17. I.I. Patel, J. Trevisan, P.B. Singh, C.M. Nicholson, R.K.G. Krishnan, S.S. Matanhelia, et al. Segregation of human prostate tissues classified high-risk (UK) versus low-risk (India) for adenocarcinoma using Fourier-transform infrared or Raman microspectroscopy coupled with discriminant analysis. Anal. Bioanal. Chem., 401, 969–982 (2011).
18. L.F.S. Siqueira and K.M.G. Lima. A decade (2004–2014) of FTIR prostate cancer spectroscopy studies: an overview of recent advancements. Trends Anal. Chem., 82, 208–221 (2016).
19. D.I. Ellis, W.B. Dunn, J.L. Griffin, J.W. Allwood, R. Goodacre. Metabolic fingerprinting as a diagnostic tool. 8, 1243–1266 (2007).
20. D.I. Ellis and R. Goodacre. Metabolic fingerprinting in disease diagnosis: biomedical applications of infrared and Raman spectroscopy. Analyst, 131, 875–885 (2006).
21. W.B. Dunn and D.I. Ellis. Metabolomics: current analytical platforms and methodologies. Trends Anal. Chem., 24, 285–294 (2005).
22. C. Hughes, L. Gaunt, M. Brown, N.W. Clarke, P. Gardner. Assessment of paraffin removal from prostate FFPE sections using transmission mode FTIR-FPA imaging. Anal. Methods, 6, 1028–1035 (2014).
23. Harvey, T. J. et al. Factors influencing the discrimination and classification of prostate cancer cell lines by FTIR microspectroscopy. Analyst 134, 1083–1091 (2009).
24. Kelly, J. G. et al. Biospectroscopy to metabolically profile biomolecular structure: A multistage approach linking computational analysis with biomarkers. J. Proteome Res. 10, 1437–1448 (2011).
25. Trevisan, J., Angelov, P. P., Carmichael, P.
L., Scott, A. D. & Martin, F. L. Extracting biological information with computational analysis of Fourier-transform infrared (FTIR) biospectroscopy datasets: current practices to future perspectives. Analyst 137, 3202–15 (2012). 26. Abdi, H. & Williams, L. J. Principal component analysis. Wiley Interdiscip. Rev. Comput. Stat. 2, 433–459 (2010). 27. Theophilou, G., Lima, K. M. G., Briggs, M. & Martin-hirsch, P. L. A biospectroscopic analysis of human prostate tissue obtained from different time periods points to a trans- generational alteration in spectral phenotype. Nat. Publ. Gr. 1–13 (2015). doi:10.1038/srep13465 28. Siqueira, L. F. S. & Lima, K. M. G. MIR-biospectroscopy coupled with chemometrics in cancer studies. Analyst 4833–4847 (2016). doi:10.1039/C6AN01247G 29. Soares, S. F. C. et al. A modification of the successive projections algorithm for spectral variable selection in the presence of unknown interferents. Anal. Chim. Acta 689, 22–28 (2011). 30. Whitley, D. A Genetic Algorithm Tutorial. Stat. Comput. 4, 65–85 (1994). M.J.P. Castanho, F. Hernandes, A.M. De Ré, S. Rautenberg, A. Billis. , A. Fuzzy expert system for predicting pathological stage of prostate cancer. Expert Systems With Applications, 40(2), 466-470 (2013). 31. Purandare, N. C. et al. Infrared spectroscopy with multivariate analysis segregates low- grade cervical cytology based on likelihood to regress, remain static or progress. Anal. Methods 6, 4576–4584 (2014). 32. Fisher, S. E. Vibrational Spectroscopy: What Does the Clinician Need? SHEILA. Biomed. Appl. Synchrotron Infrared Microspectrosc. Ed. 1–28 (2011). 33. Baia, T. C., Gama, R. A., Silva de Lima, L. A. & Lima, K. M. G. FTIR microspectroscopy coupled with variable selection methods for the identification of flunitrazepam in necrophagous flies. Anal. Methods 8, 968–972 (2016). 34. Z. Movasaghi, S. Rehman, I.U. Rehman. Fourier Transform Infrared (FTIR) Spectroscopy of Biological Tissues. Appl. Spectrosc. Rev., 43, 134–179 (2008). 
83 35. Lapointe, J. et al. Gene expression profiling identifies clinically relevant subtypes of prostate cancer. Proc. Natl. Acad. Sci. U. S. A. 101, 811–6 (2004). 36. German, M. J. et al. Infrared spectroscopy with multivariate analysis potentially facilitates the segregation of different types of prostate cell. Biophys. J. 90, 3783–95 (2006). 37. Baker, M. J. et al. An investigation of the RWPE prostate derived family of cell lines using FTIR spectroscopy. Analyst 135, 887–894 (2010). 38. Beleites, C. et al. Classification of human gliomas by infrared imaging spectroscopy and chemometric image processing. Vib. Spectrosc. 38, 143–149 (2005). 39. Lima, K. M. G., Gajjar, K. B., Martin-Hirsch, P. L. & Martin, F. L. Segregation of ovarian cancer stage exploiting spectral biomarkers derived from blood plasma or serum analysis: ATR-FTIR spectroscopy coupled with variable selection methods. Biotechnol. Prog. 31, 832–839 (2015) 40. Malins, D. C. et al. Cancer-related changes in prostate DNA as men age and early identification of metastasis in primary prostate tumors. Proc. Natl. Acad. Sci. 100, 5401 (2003). 41. Patel, I. I. & Martin, F. L. Discrimination of zone-specific spectral signatures in normal human prostate using Raman spectroscopy. Analyst 135, 3060–3069 (2010). 42. Baker, M. J. et al. FTIR-based spectroscopic analysis in the identification of clinically aggressive prostate cancer. Br. J. Cancer 99, 1859–1866 (2008). 43. E. Gazi, J. Dwyer, P. Gardner, A. Ghanbari-Siahkali, A.P. Wade, J. Miyan, N.P. Lockyer, J.C. Vickerman, N.W. Clarke et al. J Pathol., 2003, 201, 99–108. 44. Pezzei, C. et al. Characterization of normal and malignant prostate tissue by Fourier transform infrared microspectroscopy. Mol. Biosyst. 6, 2287–2295 (2010). 45. D. Hanahan and R. A. Weinberg. Hallmarks of cancer: the next generation. Cell, 144, 646–674 (2011). 46. S. H. Jung, S. Shin, M. S. Kim, I. P. Baek, J. Y. Lee, S. H. Lee, T. M. Kim, S. H. Lee and Y. J. 
Chung, Genetic progression of high grade prostatic intraepithelial neoplasia to prostate cancer, Eur. Urol., 69, 823–830 (2015). 47. Baker, M. J. et al. Using Fourier transform IR spectroscopy to analyze biological materials. Nat. Protoc. 9, 1771–91 (2014). 48. Dixon, S. J. & Brereton, R. G. Comparison of performance of five common classifiers represented as boundary methods: Euclidean Distance to Centroids, Linear Discriminant Analysis, Quadratic Discriminant Analysis, Learning Vector Quantization and Support Vector Machines, as dependent on. Chemom. Intell. Lab. Syst. 95, 1–17 (2009).

CHAPTER 5

LDA vs. QDA FOR FT-MIR PROSTATE CANCER TISSUE CLASSIFICATION

Laurinda F. S. Siqueira, Raimundo F. Araújo Júnior, Aurigena Antunes de Araújo, Camilo L. M. Morais, Kássio M. G. Lima. Chemometrics and Intelligent Laboratory Systems, 2017, 162, 123–129.

Contributions:
• I did sample treatments.
• I did spectral acquisition.
• I did data preprocessing.
• I built the multivariate classification models.
• I wrote the manuscript.

Laurinda F. S. Siqueira
Kássio M. G. Lima

LDA vs. QDA for FT-MIR prostate cancer tissue classification

Laurinda F.S. Siqueira (a), Raimundo F. Araújo Júnior (b), Aurigena Antunes de Araújo (c), Camilo L.M. Morais (a), Kássio M.G. Lima (a,⁎)

(a) Biological Chemistry and Chemometrics, Institute of Chemistry, Federal University of Rio Grande do Norte, Natal 59072-970, RN, Brazil
(b) Department of Morphology, Post graduation programme in Health Science / Post graduation programme in Structural and Functional Biology, Federal University of Rio Grande do Norte, Natal 59072-970, RN, Brazil
(c) Department of Biophysics and Pharmacology, Post graduation programme in Public Health / Post graduation programme in Pharmaceutical Science, Federal University of Rio Grande do Norte, Natal 59072-970, RN, Brazil

ARTICLE INFO

Keywords: FT-MIR; LDA; QDA; Tissue; Prostate cancer

ABSTRACT

Discrimination/classification of biological material at a molecular level is one of the key aims of chemometrics applied to biospectroscopic data. Two discriminant functions, namely Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA), were considered in this study for prostate cancer classification based on FT-MIR data, and illustrated graphically as boundary methods. Principal Component Analysis (PCA) was applied as a variable/dimensionality reduction method and Genetic Algorithm (GA) as a variable selection method, followed by LDA and QDA. The performance of each method was determined using 40–100 MIR spectra per tissue sample (n=45), previously classified by pathologists according to traditional Gleason grading. The methods were used to separate two categories of prostate cancer: Low grade (Gleason grade 2) vs. High grade (Gleason grades 3 and 4). The models were optimized using a training set and their performance was evaluated using a test set. Classification rates and quality metrics (Sensitivity, Specificity, Positive (or Precision) and Negative Predictive Values, Youden's index, and Positive and Negative Likelihood Ratios) were computed for each model. QDA-based models obtained higher classification rates and quality performance than LDA-based models.
The models studied identify secondary protein structure variations and DNA/RNA alterations as the main biomolecular 'difference markers' for prostate cancer grades.

Article information: http://dx.doi.org/10.1016/j.chemolab.2017.01.021. Received 23 September 2016; received in revised form 28 December 2016; accepted 30 January 2017; available online 31 January 2017. Chemometrics and Intelligent Laboratory Systems 162 (2017) 123–129. 0169-7439/© 2017 Elsevier B.V. All rights reserved.

⁎ Corresponding author. E-mail address: kassiolima@gmail.com (K.M.G. Lima).

Abbreviations: DF, Discriminant Function; FFPE, Formalin-Fixation and Paraffin-Embedding; FN, False negative; FP, False positive; FTIR, Fourier-Transform Infrared Spectroscopy; FT-MIR, Fourier-Transform Mid-Infrared Spectroscopy; GA, Genetic Algorithm; GA-LDA, Genetic Algorithm-Linear Discriminant Analysis; GA-QDA, Genetic Algorithm-Quadratic Discriminant Analysis; IR, Infrared; LD1, Linear Discriminant score; LDA, Linear Discriminant Analysis; LHS, Left-hand shoulder; LR-, Negative Likelihood Ratio; LR+, Positive Likelihood Ratio; MIR, Mid-Infrared; EMSC, Extended Multiplicative Scatter Correction; NPV, Negative Predictive Value; PCA, Principal Component Analysis; PCA-LDA, Principal Component Analysis-Linear Discriminant Analysis; PCA-QDA, Principal Component Analysis-Quadratic Discriminant Analysis; PCs, Principal Components; QDA, Quadratic Discriminant Analysis; RHS, Right-hand side; SENS, Sensitivity; SPEC, Specificity; TN, True negative; TP, True positive; WHO, World Health Organization; YOU, Youden's index; νasPO2−, Asymmetric phosphate stretching vibrations; νsPO2−, Symmetric phosphate stretching vibrations

1. Introduction

Discrimination/classification of biological material at a molecular level is one of the key aims of chemometrics applied to biospectroscopic data. Two discriminant functions, namely Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA), were considered in this study for prostate cancer classification based on FT-MIR data.

The Gleason grade is the 'gold' standard approach for prostate cancer classification and provides an indication of the aggressiveness and biological behavior of a tumor. Grades range from Gleason grade 1 (the most differentiated, correlated with the most favorable prognosis) to Gleason grade 5 (the least differentiated, correlated with poor prognosis) [1–3]. However, this approach has drawbacks: it is based on a visual criterion of pattern recognition that is operator-dependent and subject to intra- and inter-observer variability; samples are heterogeneous; the preparation procedure is difficult and also harmful to the organs; there is a probability of spreading cancer; the procedure is time consuming; and samples are susceptible to physical damage, among others [4]. These drawbacks have opened the way for biospectroscopy coupled with chemometrics to be presented as an alternative for classifying cancer, as it can offer additional capabilities for automated, statistically controlled and reproducible subtype recognition of histopathologic changes defined by structural alterations of cellular molecules, and is based on chemical bonds and biochemistry.

LDA and QDA are boundary discriminant methods, which aim to find boundaries that separate groups or classes of samples. A boundary divides the space into regions, each corresponding to a different group or class, and its shape depends on the classifier type: LDA obtains linear boundaries, where a straight line or hyperplane divides the variable space into regions, and QDA obtains quadratic boundaries, where a quadratic curve divides the variable space into regions. Both models are based on the Mahalanobis distance, but LDA assumes a single variance–covariance matrix over all classes and QDA assumes different variance–covariance matrices for each class [5].
LDA allows discrimination of groups which have multivariate normal distributions with the same covariance matrix. LDA uses the pooled variance–covariance matrix and does not take into account different variance structures for the two classes. It is usually applied using spectral band ratios as parameters to distinguish the FTIR spectra [4–22]. QDA allows discrimination of classes which have significantly different class-specific covariance matrices and forms a separate variance model for each class, with each class population modeled as a multivariate normal distribution with its own mean and covariance matrix. The QDA classifier aims to find a transformation of the input features which is able to optimally discriminate between the classes in the dataset [5,7,8,23–27].

To reduce FT-MIR spectral data dimensionality before each discriminant analysis, it is possible to transform the full feature space into a lower dimensional space or to perform feature selection to find subsets of variables relevant for classification. In this paper, Principal Component Analysis (PCA) was applied to reduce data dimensionality, whereas Genetic Algorithm (GA) was applied as the feature selection method. PCA is employed to reduce dimensionality and generate a visualization of multivariate data; it captures as much variability as possible and preserves the information caused by the main sources of data variability, while disregarding less relevant information such as noise or collinearities in signals [8–17,25,28]. GA is a search algorithm inspired by Mendelian genetics and probability which uses a combination of selection, recombination and mutation to develop a solution to a problem; it is commonly used as a subset (wavelength) selection strategy that iteratively selects spectral regions until the quality of the classifier reaches an optimum [4,16–20].
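To make the distinction between the pooled covariance of LDA and the class-specific covariances of QDA concrete, the following is a minimal NumPy sketch, not the PLS Toolbox/MATLAB implementation used in this study. It assumes discriminant scores of the Mahalanobis form given in Section 2.3 and assigns each object to the class with the smallest score. The function names (`fit_gaussian_classes`, `lda_score`, `qda_score`, `classify`) and the two-variable toy data are hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_gaussian_classes(X, y):
    """Estimate class means, class covariances, priors and the pooled covariance."""
    classes = np.unique(y)
    n_total, n_vars = X.shape
    means, covs, priors = {}, {}, {}
    pooled = np.zeros((n_vars, n_vars))
    for k in classes:
        Xk = X[y == k]
        n_k = Xk.shape[0]
        means[k] = Xk.mean(axis=0)
        centered = Xk - means[k]
        covs[k] = centered.T @ centered / n_k   # class-specific covariance
        priors[k] = n_k / n_total               # prior probability of class k
        pooled += n_k * covs[k] / n_total       # weighted average of class covariances
    return classes, means, covs, priors, pooled


def lda_score(x, mean, pooled, prior):
    """Mahalanobis distance under the pooled covariance, minus 2*ln(prior)."""
    d = x - mean
    return d @ np.linalg.solve(pooled, d) - 2.0 * np.log(prior)


def qda_score(x, mean, cov, prior):
    """Mahalanobis distance under the class covariance, plus ln|cov|, minus 2*ln(prior)."""
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return d @ np.linalg.solve(cov, d) + logdet - 2.0 * np.log(prior)


def classify(x, model, rule):
    """Assign x to the class with the smallest discriminant score."""
    classes, means, covs, priors, pooled = model
    if rule == "lda":
        scores = [lda_score(x, means[k], pooled, priors[k]) for k in classes]
    elif rule == "qda":
        scores = [qda_score(x, means[k], covs[k], priors[k]) for k in classes]
    else:
        raise ValueError("rule must be 'lda' or 'qda'")
    return int(classes[int(np.argmin(scores))])


# Toy two-variable data (illustrative only, not spectra from the study):
# class 0 is tightly clustered, class 1 is widely spread.
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1],
              [8, 8], [12, 8], [8, 12], [12, 12]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
model = fit_gaussian_classes(X, y)

point = np.array([3.0, 3.0])
print("LDA assigns class", classify(point, model, "lda"))
print("QDA assigns class", classify(point, model, "qda"))
```

On this toy set the two rules disagree at the point (3, 3): with a pooled covariance, LDA favors the nearer mean of class 0, whereas QDA accounts for the larger spread of class 1 and assigns the point there. This is the kind of difference in class covariance structure that motivates comparing the two classifiers.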
PCA and GA followed by LDA and QDA were applied to classify prostate cancer tissues, each previously assigned to a Gleason grade class, to illustrate the effectiveness of the methods. For comparison of the classification models, quality performance metrics were considered as criteria, namely Sensitivity, Specificity, Positive (or Precision) and Negative Predictive Values, Youden's index, and Positive and Negative Likelihood Ratios. In the prostate cancer field, few works have focused on QDA-based classification in comparison to LDA-based classification; the same holds for variable selection methods in comparison to variable reduction methods [14–16]. The expectation is to provide effective measurement tools that suffer neither from observer dependence nor from the lengthy procedures of the 'gold' standard classification techniques. In addition, such tools can help the medical community to increase the quality of prostate cancer diagnosis and classification and to reduce procedure time, speeding treatment and improving prognosis.

2. Experimental

This study was developed through a partnership between the Institute of Chemistry and the Department of Pathology of the Federal University of Rio Grande do Norte, Natal, Brazil. All experiments were performed in compliance with the relevant laws and institutional guidelines, and the institutional committees (No. 030/0030/2006) of the League Against Cancer of Rio Grande do Norte (RN/Brazil) approved this research.

2.1. Tissue collection

Prostate tissue sections were obtained from the Pathology Department of the Federal University of Rio Grande do Norte (UFRN/Brazil). Prostate tissue sections were formalin-fixed, dehydrated and paraffin-embedded (FFPE) in pathology blocks (n=45), previously classified by pathologists according to traditional Gleason grading (Fig. 1C).
No significant changes in fixation or paraffin embedding occurred during the analysis period and no degradation of tissue architecture was observed. The 45 tissue samples were distributed into three grades: Gleason grade 2 (n=23), Gleason grade 3 (n=15) and Gleason grade 4 (n=7). Five-μm-thick [9,10] tissue sections were floated onto ZnSe slides (Bruker Optics Ltd., Coventry, UK). They were then de-waxed by immersion in a fresh xylene bath for 5 min and washed and cleared in an absolute ethanol bath for another 5 min. The resulting samples were allowed to air-dry and then placed in a desiccator until analysis (Fig. 1A-B) [9].

Fig. 1. Sample preparation and FT-MIR spectroscopy. (A) Detail of the ZnSe slide containing a prepared sample. (B) Micrograph of a prostate tissue sample as visualized during MIR spectroscopy. (C) H & E stained section for histological Gleason grading. (D) Unprocessed FT-MIR spectral dataset (x-axis: wavenumbers (cm−1), y-axis: absorbance). (E) Mean FT-MIR spectral dataset pre-processed by EMSC and Savitzky-Golay smoothing (x-axis: wavenumbers (cm−1), y-axis: intensity).

2.2. FT-MIR spectroscopy

A minimum of 40 and a maximum of 100 FT-MIR spectra per tissue were collected in transmission mode using a Bruker Vector 27 FTIR spectrometer (Bruker Optics Ltd., Coventry, UK). Each FTIR spectrum represents an average of 32 scans in the Mid-IR wavenumber range 600–4,000 cm−1 with a spectral resolution of 8 cm−1. Spectra were acquired with a new background taken for every new sample and were converted into absorbance by Bruker OPUS software.

2.3. Computational analysis

2.3.1. Pre-processing

The importing and pre-treatment of the spectral data and the construction of chemometric classification models were executed using PLS Toolbox 7.8 (Eigenvector Research, Inc., Wenatchee, WA, USA) in the MATLAB R2012b (MathWorks Inc., Natick, MA, USA) environment.
FT-MIR spectra were truncated to 900–1800 cm−1, since this spectral range contains the so-called fingerprint region, which proved to be the most diagnostically significant (Fig. 1D) [28]. Extended Multiplicative Scatter Correction (EMSC) and 1st order Savitzky-Golay smoothing (15 points) were performed on the resulting mean spectra dataset for de-noising (Fig. 1E) [12].

2.3.2. Methods

PCA and GA followed by LDA and QDA were used to separate the two categories of prostate cancer: Low grade (Gleason grade 2) versus High grade (Gleason grades 3 and 4). Before application of each method, spectral data were divided into training (70%), validation (15%) and prediction (15%) sets by applying the classic Kennard-Stone (KS) uniform sampling algorithm [6,12,29]. The training datasets were used in the modelling procedures, whereas the prediction dataset was only used in the final classification evaluation using the LDA and QDA discrimination approaches. The Linear and Quadratic Discriminant models result in score plots that provide class separation and compactness; also, the PCA loadings and the variables selected by GA provide distinguishing wavenumbers, which may be translated into their corresponding biochemical information and can be identified as potential biomarkers for specific categories [4–27].

2.3.3. LDA

To obtain the discriminant profile, the LDA classification score ($L_{ik}$) is calculated for a given class $k$ by the following equation, considering that the class covariance matrices are assumed to be equal:

$$L_{ik} = (\mathbf{x}_i - \bar{\mathbf{x}}_k)^{\mathrm{T}} \boldsymbol{\Sigma}_{\mathrm{pooled}}^{-1} (\mathbf{x}_i - \bar{\mathbf{x}}_k) - 2\log_e \pi_k \tag{1}$$

where $\mathbf{x}_i$ is an unknown measurement vector for a sample $i$; $\bar{\mathbf{x}}_k$ is the mean measurement vector of class $k$; $\boldsymbol{\Sigma}_{\mathrm{pooled}}$ is the pooled covariance matrix; and $\pi_k$ is the prior probability of class $k$ [5,25].

2.3.4. QDA

The QDA classification score ($Q_{ik}$) is estimated using the variance-covariance matrix for each class $k$ and an additional natural logarithm term, as follows:

$$Q_{ik} = (\mathbf{x}_i - \bar{\mathbf{x}}_k)^{\mathrm{T}} \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_i - \bar{\mathbf{x}}_k) + \log_e \lvert \boldsymbol{\Sigma}_k \rvert - 2\log_e \pi_k \tag{2}$$

where $\boldsymbol{\Sigma}_k$ is the variance-covariance matrix of class $k$; and $\log_e \lvert \boldsymbol{\Sigma}_k \rvert$ is the natural logarithm of the determinant of the variance-covariance matrix $\boldsymbol{\Sigma}_k$. The prior probability ($\pi_k$), pooled covariance matrix ($\boldsymbol{\Sigma}_{\mathrm{pooled}}$) and variance-covariance matrix ($\boldsymbol{\Sigma}_k$) are calculated as follows:

$$\pi_k = \frac{N_k}{N} \tag{3}$$

$$\boldsymbol{\Sigma}_{\mathrm{pooled}} = \frac{1}{N} \sum_{k=1}^{K} N_k \boldsymbol{\Sigma}_k \tag{4}$$

$$\boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{i=1}^{N_k} (\mathbf{x}_i - \bar{\mathbf{x}}_k)(\mathbf{x}_i - \bar{\mathbf{x}}_k)^{\mathrm{T}} \tag{5}$$

where $N_k$ is the number of objects of class $k$; $N$ is the total number of objects in the training set; and $K$ is the total number of classes [5,25].

PCA. PCA is a variable reduction method that has the ability to reduce large multivariate data matrices down to a few orthogonal Principal Components (PCs), which still contain the majority of the information held in the original raw data [5]. The optimum number of 9 PCs, accounting for 98% of explained variance, was applied in the classification. Each PC has a corresponding eigenvalue equal to the variance captured by that component, enabling the components to be ranked according to the magnitude of variance captured by each one [12].

GA. GA is a variable selection tool which iteratively selects the spectral regions until the quality of the classifier reaches an optimum [17]. The GA routine was carried out using 40 generations containing 80 chromosomes each. This algorithm was repeated three times, starting from different random initial populations. The best solution resulting from the three realizations of GA was employed. The optimum number of variables for GA-LDA and GA-QDA was determined using an average risk G of LDA and QDA misclassification [18].
This kind of cost function is calculated in the validation set as:

$$G = \frac{1}{N_v} \sum_{n=1}^{N_v} g_n \tag{6}$$

where $N_v$ is the number of validation objects and $g_n$ is defined as:

$$g_n = \frac{r^2(\mathbf{x}_n, \mathbf{m}_{I(n)})}{\min_{I(m) \neq I(n)} r^2(\mathbf{x}_n, \mathbf{m}_{I(m)})} \tag{7}$$

where $I(n)$ is the index of the true class for the $n$th validation object $\mathbf{x}_n$; $r^2(\mathbf{x}_n, \mathbf{m}_{I(n)})$ is the squared Mahalanobis distance between object $\mathbf{x}_n$ (of class index $I(n)$) and the sample mean $\mathbf{m}_{I(n)}$ of its true class; and $\min_{I(m) \neq I(n)} r^2(\mathbf{x}_n, \mathbf{m}_{I(m)})$ is the squared Mahalanobis distance between object $\mathbf{x}_n$ and the center of the closest wrong class.

2.4. Quality performance

In order to compare and evaluate the classification power and quality of the LDA and QDA models, the following multivariate classification quality metrics were analyzed: Sensitivity, Specificity, Positive (or Precision) and Negative Predictive Values, Youden's index, and Positive and Negative Likelihood Ratios. Sensitivity is the confidence that a positive result for a sample of the label class is obtained, meaning that it is positive for disease. Specificity is the confidence that a negative result for a sample of the non-label class is obtained. Positive Predictive Value (PPV) shows how many of the test positives are true positives. Negative Predictive Value (NPV) shows how many of the test negatives are true negatives. Youden's index (YOU) evaluates the classifier's ability to avoid failure. The Positive Likelihood Ratio (LR+) represents the ratio between the probability of predicting a sample as positive when it truly is positive, and the probability of predicting a sample as positive when it is actually not positive. The Negative Likelihood Ratio (LR-) represents the ratio between the probability of predicting a sample as negative when it is actually positive, and the probability of predicting a sample as negative when it truly is negative [8,30,31]. Table 1 summarizes these equations. In addition, two measures of classification were calculated, namely the training and test (or prediction) set classification rates.
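The quality metrics defined above can be computed directly from the four confusion-matrix counts (TP, FP, TN, FN). The sketch below is a plain-Python illustration rather than the routine used in this study; the function name `quality_metrics` and the example counts are hypothetical. It assumes each class contains at least one sample and receives at least one prediction, and it reports a likelihood ratio as infinity when its denominator is zero, which is a convention of this sketch only.

```python
def quality_metrics(tp, fp, tn, fn):
    """Return quality metrics from confusion-matrix counts.

    SENS, SPEC, PPV, NPV and YOU are on a 0-100 scale;
    the likelihood ratios are plain ratios.
    """
    sens = tp / (tp + fn)   # fraction of true positives detected
    spec = tn / (tn + fp)   # fraction of true negatives detected
    return {
        "SENS": 100.0 * sens,
        "SPEC": 100.0 * spec,
        "PPV": 100.0 * tp / (tp + fp),
        "NPV": 100.0 * tn / (tn + fn),
        "YOU": 100.0 * (sens - (1.0 - spec)),
        # Sketch convention: report infinity when the denominator is zero.
        "LR+": sens / (1.0 - spec) if spec < 1.0 else float("inf"),
        "LR-": (1.0 - sens) / spec if spec > 0.0 else float("inf"),
    }


# Hypothetical counts, chosen only for illustration.
metrics = quality_metrics(tp=3, fp=2, tn=3, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Keeping the likelihood ratios as plain ratios while the other metrics are percentages follows the way these quantities are usually tabulated; a different scaling would only change the multiplying constants.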
The classification rate of the training set involves applying the models to the same set of samples used to build and optimize these models. The classification rate of the test set is used to test the classification ability of the models, and it gives a more realistic representation of their classification ability.

3. Results and discussion

This study aimed to compare the performance of the two discriminant functions, namely Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA), as well as to identify spectral differences between prostate cancer tissues according to the Gleason system and to determine the biochemical markers responsible for any such classification. It is assumed that (1) there is an apparent separation between clusters of different categories which becomes more pronounced as the robustness of the algorithm increases; and (2) spectral differences could be the first evidence of phenotypic alteration [16]. To this end, variable/dimensionality reduction methods such as PCA and variable selection methods such as GA, followed by LDA and QDA, allowed for the discrimination of prostate tissue according to Low grade (Gleason grade 2) and High grade (Gleason grades 3 and 4). For application of all methods, 70% of the data were used to train the system, 15% for internal validation and 15% for external validation. The Gleason grade 2 training specimen originated from a clinically low-stage tumor, whereas the Gleason grade 3 training specimen was derived from a patient whose tumor exhibited bone metastases at the time of biopsy, and the Gleason grade 4 training specimens were derived from tumors that exhibited more aggressive bone metastases at the time of biopsy [32,33]. The prostate tissue samples (n=45) were taken from formalin-fixed, dehydrated and paraffin-embedded (FFPE) pathology blocks.
Forty to one hundred FT-MIR spectra per tissue were investigated in the spectral range of 900 to 1800 cm−1, since most bio-molecular spectral signatures reside within this area [34–38]. The samples were previously classified by pathologists according to the Gleason system. Fortunately, the samples did not show any complicating diathermy effect and no contributions of paraffin vibrational modes were apparent in the low-wavenumber region of the FT-MIR spectra used in this project. Fig. 1 shows the non-pre-processed (Fig. 1D) and the pre-processed (Fig. 1E) mean FT-MIR spectra derived from each spectral dataset used to train the Gleason clusters, generating two categories: (1) Low grade (Gleason grade 2), blue line; and (2) High grade (Gleason grades 3 and 4), red line.

The comparison of Low grade (Gleason grade 2) vs. High grade (Gleason grades 3 and 4) was made after assessment of Gleason grades 2 vs. 3, 2 vs. 4, 3 vs. 4 and 2 vs. 3 plus 4. Significant differentiation between all grades (p < 0.01) was identified, but Gleason grades 2 vs. 3, 2 vs. 4 and 2 vs. 3 plus 4 demonstrated more significant differentiation than grade 3 vs. 4. This suggests that, because grade 3 and grade 4 tumors can behave more similarly to each other and more aggressively than Gleason grade 2 tumors, they may consist of cells with overlapping phenotypes, which results in overlapping spectral features in their associated IR spectra [16,32,39].

Figs. 2B and 2D show score plots (DF1×DF2) derived from PCA-LDA and PCA-QDA of the two categories with significant segregation between them (P<0.00001). Both models were carried out using the first nine PCs, which explain about 98% of the variance within the sample population. Score plots identify the similarities and dissimilarities between different categories and present them as clusters of points. Fig. 2A shows the loading plots derived from PCA-LDA, identifying the important wavenumbers for separation of the different categories.
These include: 960; 1155; 1225; 1280; 1360; 1380; 1460; 1540; 1560; 1575; and 1630 cm−1. The associated loading plots of PCA-QDA (Fig. 2C) identify the principal segregating wavenumbers, which were: 1150; 1227; 1250; 1280; 1360; 1410; 1455; 1545; and 1574 cm−1. Loading plots identify the distinguishing wavenumbers. The main function of PCA is to reduce large multivariate data to a few orthogonal Principal Components (PCs), which still contain the majority of the information in the original raw data. In addition, the PC plots allow the class covariance structure of the data to be investigated. The class covariance structures of the data sets are clearly different, which explains why QDA is preferred to LDA.

Fig. 2. (A) Loading plot and (B) Score (DF1×DF2) plot derived from PCA-LDA. (C) Loading plot and (D) Score (DF1×DF2) plot derived from PCA-QDA.

Table 1
Quality performance tools.

| Validation and quality tools | Equations |
|---|---|
| Sensitivity (SENS) | $\frac{TP}{TP + FN} \times 100$ |
| Specificity (SPEC) | $\frac{TN}{TN + FP} \times 100$ |
| Positive Predictive Value (PPV) | $\frac{TP}{TP + FP} \times 100$ |
| Negative Predictive Value (NPV) | $\frac{TN}{TN + FN} \times 100$ |
| Youden's index (YOU) | $SENS - (1 - SPEC)$ |
| Likelihood ratio positive (LR(+)) | $\frac{SENS}{1 - SPEC}$ |
| Likelihood ratio negative (LR(-)) | $\frac{1 - SENS}{SPEC}$ |

Fig. 3B displays the score plots for classification derived from GA-LDA. The model selected 15 wavenumbers (Fig. 3A), which include: 1083; 1122; 1142; 1194; 1229; 1272; 1297; 1333; 1345; 1431; 1504; 1526; 1531; 1546; and 1680 cm−1. Fig. 3D shows the score plots for classification derived from GA-QDA. The model selected 20 wavenumbers (Fig. 3C): 970; 1041; 1047; 1057; 1083; 1103; 1160; 1230; 1255; 1276; 1297; 1308; 1322; 1404; 1409; 1411; 1455; 1502; 1561; and 1575 cm−1. These separations were significant (P<0.00005).
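The wavenumber subsets reported above were chosen by the GA according to the average risk G of Eqs. (6) and (7). As a rough illustration of how a candidate subset could be scored, the NumPy sketch below evaluates G for a given list of selected variables. It is not the routine used in this study: the function name `average_risk` and the toy data are hypothetical, the Mahalanobis distances use a pooled covariance (an LDA-type assumption made here for simplicity), and the GA search loop itself (selection, recombination, mutation) is omitted.

```python
import numpy as np


def average_risk(X_train, y_train, X_val, y_val, selected):
    """Average risk G on the validation set for the selected variables."""
    Xt, Xv = X_train[:, selected], X_val[:, selected]
    classes = np.unique(y_train)
    means = {k: Xt[y_train == k].mean(axis=0) for k in classes}

    # Pooled covariance of the training set (assumption of this sketch).
    pooled = np.zeros((Xt.shape[1], Xt.shape[1]))
    for k in classes:
        centered = Xt[y_train == k] - means[k]
        pooled += centered.T @ centered
    pooled /= Xt.shape[0]

    def r2(x, m):
        """Squared Mahalanobis distance between x and a class mean m."""
        d = x - m
        return float(d @ np.linalg.solve(pooled, d))

    risks = []
    for x, true_k in zip(Xv, y_val):
        own = r2(x, means[true_k])
        nearest_wrong = min(r2(x, means[k]) for k in classes if k != true_k)
        risks.append(own / nearest_wrong)   # g_n: small when x is close to its own class
    return float(np.mean(risks))


# Toy data (illustrative only): variables 0 and 1 separate the classes,
# variable 2 has the same distribution in both classes.
X_train = np.array([[0, 0, 1], [1, 0, 3], [0, 1, 3], [1, 1, 1],
                    [4, 4, 1], [5, 4, 3], [4, 5, 3], [5, 5, 1]], dtype=float)
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X_val = np.array([[1.0, 0.5, 1.0], [4.0, 4.5, 3.0]])
y_val = np.array([0, 1])

g_informative = average_risk(X_train, y_train, X_val, y_val, [0, 1])
g_uninformative = average_risk(X_train, y_train, X_val, y_val, [2])
print("G with variables 0 and 1:", g_informative)
print("G with variable 2 only:", g_uninformative)
```

A subset that places validation objects close to their own class mean and far from the nearest wrong class yields a small G, so a GA minimizing this cost would favor the informative variables in the toy example over the uninformative one.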
GA is a feature selection method: it selects wavelengths that are highly effective in the classification algorithm and extracts the best set [4,16].

There is perceptible separation between the pre-assigned categories in these models. However, the PCA approaches had lower quality performance than the other approaches. PCA-LDA was the weakest approach, with 60% of the test data and 87.9% of the training data correctly classified. PCA-QDA showed a subtle increase in quality metrics compared to PCA-LDA. The GA-LDA and GA-QDA approaches revealed better segregation between the categories. In particular, GA-QDA presented classification rates of about 100% and 96.7% for the test and training data sets, better than GA-LDA with 83.3% and 96.7%, respectively (Table 2). Moreover, the similar training and test set classification rates of the GA-QDA model indicate a well-balanced model. Furthermore, the specificity values suggest that GA-QDA improved accuracy in comparison to the other models. Positive and Negative Predictive Values, Youden's index, and Positive and Negative Likelihood Ratios corroborate this. The PPV and NPV values obtained were higher (close to 100), which suggests that the method was correct. YOU index values were close to 100, indicating that the method effectiveness was relatively large; thus, the biomarker's differentiating ability is optimized when equal weight is given to sensitivity and specificity. The LR+ and LR- values give an intuitive indication that GA-QDA provides the best classification [15,30,31]. The same logic can be applied to the other models.

Table 3 lists the molecular entities associated with the wavenumbers of all methods. Variation in the spectral regions was more marked and intense in the High grade category than in the Low grade category.
These spectral regions contain, in descending order of intensity variation:

(1) Secondary protein structures and the protein region involving amino acid conformational changes in C=O, C-O, C-H and N-H (≈1591–1483 cm−1). This spectral site had the greatest increase in variation. In fact, amide II (≈1550 cm−1) and amide I (≈1630 cm−1) are more sensitive to changes in conformational substructures in tissues; this tends to become clearer when the entire populations of cellular proteins in the High grade and the Low grade prostate cancerous tissues are compared. In addition, this intensity increase can also imply a reduction in the intermolecular aggregation of the tissue proteins promoted by cancer transformation [12,36,39,40];

(2) DNA and RNA bands (≈1000–1490 cm−1). The νasPO2− and νsPO2− frequencies were higher in the High grade than in the Low grade category of cancerous tissue, which can indicate that the intermolecular interactions between nucleic acids in the High grade category are stronger as a result of intermolecular differentiations [6,10,34,41,42]; and

(3) Protein phosphorylation (≈970 cm−1) and other biomarkers, which had minor intensity variation and are included in Table 3.

Variation in the spectral regions containing DNA and RNA bands involves nucleic acids, phosphate and deoxyribose modifications, and alteration in phosphate stretching vibrations including both νasPO2− and νsPO2− (DNA/RNA), which indicates changes in DNA conformation. Also, the high occurrence of protein structure variation could be a result of alterations in phase I/II metabolizing enzyme expression and of the post-translational modifications related to changes evident within the DNA/RNA spectral regions. DNA vibrational modes have been identified as the key features in the discrimination of prostate cancer grades [10,12,17,18,30,31,33,35,38–40,42–47].

Fig. 3. (A) Fifteen wavenumbers selected and (B) Score (DF1×DF2) plot derived from GA-LDA. (C) Twenty wavenumbers selected and (D) Score (DF1×DF2) plot derived from GA-QDA.

Table 2
Values of quality performance features from LDA and QDA models by FT-MIR spectroscopy for prostate cancer classification. (The better models were highlighted in the original table.)

| Quality performance features | PCA-LDA | PCA-QDA | GA-LDA | GA-QDA |
|---|---|---|---|---|
| Classification rate (%), test set | 60 | 66.7 | 83.3 | 100 |
| Classification rate (%), training set | 87.8 | 93.9 | 96.9 | 96.9 |
| Sensitivity (%) | 60 | 66.7 | 71.4 | 75 |
| Specificity (%) | 60 | 66.7 | 80 | 100 |
| Positive Predictive Value (PPV) | 60 | 66.7 | 83.4 | 100 |
| Negative Predictive Value (NPV) | 60 | 66.7 | 66.7 | 66.7 |
| Youden's index (YOU) | 20 | 33.3 | 51.4 | 75 |
| Likelihood ratio positive (LR(+)) | 1.5 | 2 | 3.6 | 4.2 |
| Likelihood ratio negative (LR(-)) | 0.7 | 0.5 | 0.3 | 0.25 |

Table 3
Discriminating wavenumbers identified by LDA and QDA models by FT-MIR spectroscopy for prostate cancer classification. (The largest intensity variations between High grade and Low grade spectra were highlighted in the original table.)

| Wavenumber (cm−1) | Tentative assignments |
|---|---|
| ≈1680 | LHS Amide I (C=O stretch; C–N stretch) |
| ≈1650 | Amide I (C=O stretch; C–N stretch) |
| ≈1620 | RHS Amide I (C=O stretch; C–N stretch) |
| ≈1585 | Amide I/II trough |
| ≈1570 | LHS Amide II (N–H bend and C–N stretch) |
| ≈1550 | Amide II (N–H bend and C–N stretch) |
| ≈1520 | RHS Amide II (N–H bend and C–N stretch) |
| ≈1455 | Protein (C–H and N–H deformation modes) |
| ≈1400 | Fatty acids and amino acids (C=O stretching of COO- groups) |
| ≈1250 to ≈1360 | Amide III (C–N stretching) |
| ≈1230 | DNA (O–P–O asymmetric stretch) |
| ≈1120 to ≈1180 | RNA ribose and DNA (C–O stretching) |
| ≈1080 | DNA/RNA (O–P–O symmetric stretch) |
| ≈1030 | Glycogen (C–O–H bend) |
| ≈970 | Protein phosphorylation |

4. Conclusion

Two boundary classification methods were compared on prostate cancer datasets which were created with specific class distributions to highlight the different characteristics of these methods.
Spectral differences between prostate cancer tissues according to the Gleason system, and the biochemical markers responsible for such classification, were also highlighted. According to the results, Low grade and High grade prostate cancer can be well classified by the methods, especially by QDA-based models and even more so by GA-QDA. QDA-based models (which use separate covariance matrices for each class) obtained higher classification rates than LDA-based models (which use a pooled covariance matrix). This can be explained by the variance–covariance structure being quite different for the two classes. This information is used by QDA when forming the boundary and may account for the differences seen in the classification by each method. The use of separate covariance matrices means that the model is able to fit the data more closely. The studied models identify secondary protein structure variations and DNA/RNA alterations as the main biomolecular 'difference markers' for the prostate cancer grades. All results are in agreement with the known characteristics of the discriminant methods and of the biochemical fingerprint. There is no general recommendation for these methods, but a simple graphical exploration of the data structure, in addition to calculations of model quality performance, can provide guidance as to the optimal approach for any specific dataset. All models proposed here showed robustness in classifying prostate cancer with effectiveness and accuracy. These methods do not suffer from observer dependence or from inter- and intra-observer variability, nor from the lengthy procedures of the standard classification techniques. For the medical community, our models offer an increase in the quality of the diagnosis and classification of prostate cancer and a reduction in procedure time, which can speed up treatment and improve prognosis and patient survival.
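The contrast between a pooled covariance (LDA) and separate per-class covariances (QDA) can be illustrated with a minimal one-dimensional sketch. This is not the code used in the study: the function names, the toy values and the reduction to a single variable (so that each covariance matrix becomes a scalar variance) are assumptions made only for illustration.

```python
import math

def fit_gaussian_1d(values):
    """Return (mean, sample variance) of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, var

def discriminant(x, mean, var, prior):
    """Gaussian log-discriminant score of x for one class."""
    return -0.5 * math.log(var) - 0.5 * (x - mean) ** 2 / var + math.log(prior)

def make_classifiers(class_a, class_b):
    """Build 1-D LDA (pooled variance) and QDA (per-class variance) rules."""
    mean_a, var_a = fit_gaussian_1d(class_a)
    mean_b, var_b = fit_gaussian_1d(class_b)
    n_a, n_b = len(class_a), len(class_b)
    prior_a, prior_b = n_a / (n_a + n_b), n_b / (n_a + n_b)
    # LDA assumes one variance shared by both classes.
    pooled = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)

    def lda(x):
        score_a = discriminant(x, mean_a, pooled, prior_a)
        score_b = discriminant(x, mean_b, pooled, prior_b)
        return "A" if score_a >= score_b else "B"

    def qda(x):
        # QDA keeps each class's own variance, giving a curved boundary.
        score_a = discriminant(x, mean_a, var_a, prior_a)
        score_b = discriminant(x, mean_b, var_b, prior_b)
        return "A" if score_a >= score_b else "B"

    return lda, qda

# Hypothetical toy data: class A is tightly clustered, class B is widely spread.
tight = [-0.5, -0.25, 0.0, 0.25, 0.5]
spread = [-4.0, -3.0, 3.0, 4.0, 1.0]
lda, qda = make_classifiers(tight, spread)
print(lda(-3.5), qda(-3.5))
```

With these toy values the two class means almost coincide, so the single linear threshold of LDA cannot recognise a point far to the left as belonging to the widely spread class, whereas QDA can because it uses the difference in variance.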
Acknowledgements

Laurinda F.S. Siqueira and Camilo L.M. Morais would like to acknowledge the financial support from the PPGQ/UFRN/CAPES. K.M.G. Lima acknowledges the CNPq (Grant 305962/2014-4) for financial support.

References

[1] N. Chen, Q. Zhou, The evolving Gleason grading system, Chin. J. Cancer Res. 28 (2016) 58–64. http://dx.doi.org/10.3978/j.issn.1000-9604.2016.02.04.
[2] P.M. Pierorazio, P.C. Walsh, A.W. Partin, J.I. Epstein, Prognostic Gleason grade grouping: data based on the modified Gleason scoring system, BJU Int. 111 (2013) 753–760. http://dx.doi.org/10.1111/j.1464-410X.2012.11611.x.
[3] J.I. Epstein, W.C.J. Allsbrook, M.B. Amin, L.L. Egevad, Proceedings of the 2005 International Society of Urological Pathology (ISUP) Consensus Conference on Gleason Grading of Prostatic Carcinoma, Am. J. Surg. Pathol. 29 (2005) 1228–1242. http://dx.doi.org/10.1097/01.pas.0000173646.99337.b1.
[4] M. Khanmohammadi, K. Ghasemi, A.B. Garmarudi, Genetic algorithm spectral feature selection coupled with quadratic discriminant analysis for ATR-FTIR spectrometric diagnosis of basal cell carcinoma via blood sample analysis, RSC Adv. 4 (2014) 41484–41490. http://dx.doi.org/10.1039/c4ra04965a.
[5] S.J. Dixon, R.G. Brereton, Comparison of performance of five common classifiers represented as boundary methods: euclidean distance to centroids, linear discriminant analysis, quadratic discriminant analysis, learning vector quantization and support vector machines, as dependent on data structure, Chemom. Intell. Lab. Syst. 95 (2009) 1–17. http://dx.doi.org/10.1016/j.chemolab.2008.07.010.
[6] J. Trevisan, P.P. Angelov, P.L. Carmichael, A.D. Scott, F.L. Martin, Extracting biological information with computational analysis of Fourier-transform infrared (FTIR) biospectroscopy datasets: current practices to future perspectives, Analyst 137 (2012) 3202–3215. http://dx.doi.org/10.1039/c2an16300d.
[7] P. Bickel, P. Diggle, S. Fienberg, K. Krickeberg, I. Olkin, N. Wermuth, S. Zeger, Principal component analysis, Springer Verlag (2002) 2812–2831. http://dx.doi.org/10.1016/0169-7439(87)80084-9.
[8] H. Abdi, L.J. Williams, Principal component analysis, Wiley Interdiscip. Rev. Comput. Stat. 2 (2010) 433–459. http://dx.doi.org/10.1002/wics.101.
[9] C. Hughes, L. Gaunt, M. Brown, N.W. Clarke, P. Gardner, Assessment of paraffin removal from prostate FFPE sections using transmission mode FTIR-FPA imaging, Anal. Methods 6 (2014) 1028–1035. http://dx.doi.org/10.1039/c3ay41308j.
[10] C. Pezzei, J.D. Pallua, G. Schaefer, C. Seifarth, V. Huck-Pezzei, L.K. Bittner, H. Klocker, G. Bartsch, G.K. Bonn, C.W. Huck, Characterization of normal and malignant prostate tissue by Fourier transform infrared microspectroscopy, Mol. Biosyst. 6 (2010) 2287–2295. http://dx.doi.org/10.1039/c0mb00041h.
[11] T.J. Harvey, A. Henderson, E. Gazi, N.W. Clarke, M. Brown, E.C. Faria, R.D. Snook, P. Gardner, Discrimination of prostate cancer cells by reflection mode FTIR photoacoustic spectroscopy, Analyst 132 (2007) 292–295. http://dx.doi.org/10.1039/b618618a.
[12] J.G. Kelly, J. Trevisan, A.D. Scott, P.L. Carmichael, H.M. Pollock, P.L. Martin-Hirsch, F.L. Martin, Biospectroscopy to metabolically profile biomolecular structure: a multistage approach linking computational analysis with biomarkers, J. Proteome Res. 10 (2011) 1437–1448. http://dx.doi.org/10.1021/pr101067u.
[13] M.J. Baker, J. Trevisan, P. Bassan, R. Bhargava, H.J. Butler, K.M. Dorling, P.R. Fielden, S.W. Fogarty, N.J. Fullwood, K.A. Heys, C. Hughes, P. Lasch, P.L. Martin-Hirsch, B. Obinaju, G.D. Sockalingum, J. Sulé-Suso, R.J. Strong, M.J. Walsh, B.R. Wood, P. Gardner, F.L. Martin, Using Fourier transform IR spectroscopy to analyze biological materials, Nat. Protoc. 9 (2014) 1771–1791. http://dx.doi.org/10.1038/nprot.2014.110.
[14] L.F.S. Siqueira, K.M.G. Lima, A decade (2004–2014) of FTIR prostate cancer spectroscopy studies: an overview of recent advancements, Trends Anal. Chem. 82 (2016) 208–221. http://dx.doi.org/10.1016/j.trac.2016.05.028.
[15] L.F.S. Siqueira, K.M.G. Lima, MIR-biospectroscopy coupled with chemometrics in cancer studies, Analyst (2016) 4833–4847. http://dx.doi.org/10.1039/C6AN01247G.
[16] G. Theophilou, K.M.G. Lima, M. Briggs, P.L. Martin-Hirsch, A biospectroscopic analysis of human prostate tissue obtained from different time periods points to a trans-generational alteration in spectral phenotype, Nat. Publ. Gr. (2015) 1–13. http://dx.doi.org/10.1038/srep13465.
[17] C. Beleites, G. Steiner, M.G. Sowa, R. Baumgartner, S. Sobottka, G. Schackert, R. Salzer, Classification of human gliomas by infrared imaging spectroscopy and chemometric image processing, Vib. Spectrosc. 38 (2005) 143–149. http://dx.doi.org/10.1016/j.vibspec.2005.02.020.
[18] N.C. Purandare, I.I. Patel, K.M.G. Lima, J. Trevisan, M. Ma’Ayeh, A. McHugh, G. Von Bünau, P.L. Martin Hirsch, W.J. Prendiville, F.L. Martin, Infrared spectroscopy with multivariate analysis segregates low-grade cervical cytology based on likelihood to regress, remain static or progress, Anal. Methods 6 (2014) 4576–4584. http://dx.doi.org/10.1039/c3ay42224k.
[19] R.C. Conceicao, M. O’Halloran, M. Glavin, E. Jones, Evaluation of features and classifiers for classification of early-stage breast cancer, J. Electromagn. Waves Appl. 25 (2011) 1–14. http://dx.doi.org/10.1163/156939311793898350.
[20] R.C. Conceicao, M. O’Halloran, M. Glavin, E. Jones, Investigation of classifiers for early-stage breast cancer based on radar target signatures, Prog. Electromagn. Res. B 105 (2010) 295–311. http://dx.doi.org/10.2528/PIERB10062407.
[21] P. Lasch, Spectral pre-processing for biomedical vibrational spectroscopy and microspectroscopic imaging, Chemom. Intell. Lab. Syst. 117 (2012) 100–114. http://dx.doi.org/10.1016/j.chemolab.2012.03.011.
[22] J. Trevisan, J. Park, P.P. Angelov, A.A. Ahmadzai, K. Gajjar, A.D. Scott, P.L. Carmichael, F.L. Martin, Measuring similarity and improving stability in biomarker identification methods applied to Fourier-transform infrared (FTIR) spectroscopy, J. Biophotonics 7 (2014) 254–265. http://dx.doi.org/10.1002/jbio.201300190.
[23] S. Ali, R. Veltri, J.I. Epstein, C. Christudass, A. Madabhushi, Selective invocation of shape priors for deformable segmentation and morphologic classification of prostate cancer tissue microarrays, Comput. Med. Imaging Graph. 41 (2015) 3–13. http://dx.doi.org/10.1016/j.compmedimag.2014.11.001.
[24] S.E. Viswanath, N.B. Bloch, J.C. Chappelow, R. Toth, N.M. Rofsky, E.M. Genega, R.E. Lenkinski, A. Madabhushi, Central gland and peripheral zone prostate tumors have significantly different quantitative imaging signatures on 3 T endorectal, in vivo T2-weighted MR imagery, J. Magn. Reson. Imaging 36 (2012) 213–224. http://dx.doi.org/10.1002/jmri.23618.
[25] W. Wu, Y. Mallet, B. Walczak, W. Penninckx, D.L. Massart, S. Heuerding, F. Erni, Comparison of regularized discriminant analysis, linear discriminant analysis and quadratic discriminant analysis, applied to NIR data, Anal. Chim. Acta 329 (1996) 257–265. http://dx.doi.org/10.1016/0003-2670(96)00142-0.
[26] E. Pamukçu, H. Bozdogan, S.Ç. J, A Novel Hybrid Dimension Reduction Technique for Undersized High Dimensional Gene Expression Data Sets Using Information Complexity Criterion for Cancer Classification, 2015. http://dx.doi.org/10.1155/2015/370640.
[27] H. Zhang, H. Wang, Z. Dai, M. Chen, Z. Yuan, Improving accuracy for cancer classification with a new algorithm for genes selection, BMC Bioinforma. 13 (2012) 298. http://dx.doi.org/10.1186/1471-2105-13-298.
[28] T.J. Harvey, E. Gazi, A. Henderson, R.D. Snook, N.W. Clarke, M. Brown, P. Gardner, Factors influencing the discrimination and classification of prostate cancer cell lines by FTIR microspectroscopy, Analyst 134 (2009) 1083–1091. http://dx.doi.org/10.1039/b903249e.
[29] R.D. Fricker, R.D. Fricker, L.E. Associates, American Society for Quality Review, 47, 2015, pp. 131–132. doi:10.2307/1266291.
[30] S.E. Fisher, Chapter 1: Vibrational spectroscopy: what does the clinician need?, Biomed. Appl. Synchrotron Infrared Microspectrosc. Ed., 2011, pp. 1–28.
[31] T.C. Baia, R.A. Gama, L.A. Silva, de Lima, K.M.G. Lima, FTIR microspectroscopy coupled with variable selection methods for the identification of flunitrazepam in necrophagous flies, Anal. Methods 8 (2016) 968–972. http://dx.doi.org/10.1039/C5AY02342D.
[32] E. Gazi, M. Baker, J. Dwyer, N.P. Lockyer, P. Gardner, J.H. Shanks, R.S. Reeve, C.A. Hart, N.W. Clarke, M.D. Brown, A correlation of FTIR spectra derived from prostate cancer biopsies with Gleason grade and tumour stage, Eur. Urol. 50 (2006) 750–761. http://dx.doi.org/10.1016/j.eururo.2006.03.031.
[33] J. Lapointe, C. Li, J.P. Higgins, M. van de Rijn, E. Bair, K. Montgomery, M. Ferrari, L. Egevad, W. Rayford, U. Bergerheim, P. Ekman, A.M. DeMarzo, R. Tibshirani, D. Botstein, P.O. Brown, J.D. Brooks, J.R. Pollack, Gene expression profiling identifies clinically relevant subtypes of prostate cancer, Proc. Natl. Acad. Sci. USA 101 (2004) 811–816. http://dx.doi.org/10.1073/pnas.0304146101.
[34] D.I. Ellis, W.B. Dunn, J.L. Griffin, J.W. Allwood, R. Goodacre, Metabolic fingerprinting as a diagnostic tool, Pharmacogenomics 8 (2007) 1243–1266. http://dx.doi.org/10.2217/14622416.8.9.1243.
[35] D.Y. Duygu, T. Baykal, İ. Açikgöz, K. Yildiz, Review Fourier transform infrared (FT-IR) spectroscopy for biological studies, Gazi Univ. J. Sci. 22 (2009) 117–121. http://www.google.de/url?sa=t&source=web&cd=1&ved=0CB0QFjAA&url=http://www.fbe.gazi.edu.tr/dergi/index.php/GUJS/article/download/113/53&ei=vUyDTceTGsyQswb21ICeAw&usg=AFQjCNHZ0a5ouS5mMgdssCI874ibe3HpAg.
[36] D.I. Ellis, R. Goodacre, Metabolic fingerprinting in disease diagnosis: biomedical applications of infrared and Raman spectroscopy, Analyst 131 (2006) 875–885. http://dx.doi.org/10.1039/b602376m.
[37] W.B. Dunn, D.I. Ellis, Metabolomics: current analytical platforms and methodologies, TrAC - Trends Anal. Chem. 24 (2005) 285–294. http://dx.doi.org/10.1016/j.trac.2004.11.021.
[38] J. Cuzick, G.P. Swanson, G. Fisher, A.R. Brothman, D.M. Berney, J.E. Reid, D. Mesher, V.O. Speights, E. Stankiewicz, C.S. Foster, H. Møller, P. Scardino, J.D. Warren, J. Park, A. Younus, D.D. Flake, S. Wagner, A. Gutin, J.S. Lanchbury, S. Stone, Prognostic value of an RNA expression signature derived from cell cycle proliferation genes in patients with prostate cancer: a retrospective study, Lancet Oncol. 12 (2011) 245–255. http://dx.doi.org/10.1016/S1470-2045(10)70295-3.
[39] I.I. Patel, J. Trevisan, P.B. Singh, C.M. Nicholson, R.K.G. Krishnan, S.S. Matanhelia, F.L. Martin, Segregation of human prostate tissues classified high-risk (UK) versus low-risk (India) for adenocarcinoma using Fourier-transform infrared or Raman microspectroscopy coupled with discriminant analysis, Anal. Bioanal. Chem. 401 (2011) 969–982. http://dx.doi.org/10.1007/s00216-011-5123-z.
[40] M.J. Baker, E. Gazi, M.D. Brown, J.H. Shanks, P. Gardner, N.W. Clarke, FTIR-based spectroscopic analysis in the identification of clinically aggressive prostate cancer, Br. J. Cancer 99 (2008) 1859–1866. http://dx.doi.org/10.1038/sj.bjc.6604753.
[41] O. Bratt, J.-E. Damber, M. Emanuelsson, H. Grönberg, Hereditary prostate cancer: clinical characteristics and survival, J. Urol. 167 (2002) 2423–2426. http://dx.doi.org/10.1097/00005392-200206000-00018.
[42] I.I. Patel, F.L. Martin, Discrimination of zone-specific spectral signatures in normal human prostate using Raman spectroscopy, Analyst 135 (2010) 3060–3069. http://dx.doi.org/10.1039/c0an00518e.
[43] M.J. German, A. Hammiche, N. Ragavan, M.J. Tobin, L.J. Cooper, S.S. Matanhelia, A.C. Hindley, C.M. Nicholson, N.J. Fullwood, H.M. Pollock, F.L. Martin, Infrared spectroscopy with multivariate analysis potentially facilitates the segregation of different types of prostate cell, Biophys. J. 90 (2006) 3783–3795. http://dx.doi.org/10.1529/biophysj.105.077255.
[44] M.J. Baker, C. Clarke, D. Démoulin, J.M. Nicholson, F.M. Lyng, H.J. Byrne, C.A. Hart, M.D. Brown, N.W. Clarke, P. Gardner, An investigation of the RWPE prostate derived family of cell lines using FTIR spectroscopy, Analyst 135 (2010) 887–894. http://dx.doi.org/10.1039/b920385k.
[45] K.M.G. Lima, K.B. Gajjar, P.L. Martin-Hirsch, F.L. Martin, Segregation of ovarian cancer stage exploiting spectral biomarkers derived from blood plasma or serum analysis: ATR-FTIR spectroscopy coupled with variable selection methods, Biotechnol. Prog. 31 (2015) 832–839. http://dx.doi.org/10.1002/btpr.2084.
[46] D.C. Malins, P.M. Johnson, E.A. Barker, N.L. Polissar, T.M. Wheeler, K.M. Anderson, Cancer-related changes in prostate DNA as men age and early identification of metastasis in primary prostate tumors, Proc. Natl. Acad. Sci. USA 100 (2003) 5401. http://dx.doi.org/10.1073/pnas.0931396100.
[47] E. Gazi, J. Dwyer, P. Gardner, A. Ghanbari-Siahkali, A.P. Wade, J. Miyan, N.P. Lockyer, J.C. Vickerman, N.W. Clarke, J.H. Shanks, L.J. Scott, C.A. Hart, M. Brown, Applications of Fourier transform infrared microspectroscopy in studies of benign prostate and prostate cancer. A pilot study, J. Pathol. 201 (2003) 99–108. http://dx.doi.org/10.1002/path.1421.

CHAPTER 6
SVM FOR FT-MIR PROSTATE CANCER CLASSIFICATION: an alternative to the traditional methods.
Laurinda F. S. Siqueira, Camilo L. M. Morais, Kássio M. G. Lima.

Article submitted to Scientific Reports - Nature

Contributions:
• I did sample treatment.
• I did spectral acquisition.
• I did data pre-processing.
• I built the multivariate classification models.
• I wrote the manuscript.

Laurinda F. S. Siqueira
Kássio M. G. Lima

Contents
ABSTRACT
1 INTRODUCTION
2 EXPERIMENTAL
3 RESULTS
4 DISCUSSION
5 CONCLUSIONS
ACKNOWLEDGEMENTS
ABBREVIATIONS
REFERENCES

GRAPHICAL ABSTRACT

ABSTRACT

PCA-SVM, SPA-SVM and GA-SVM combined with FT-MIR were presented as complements and/or alternatives to the traditional methods of prostate cancer screening and classification. These approaches were compared with independent SVM models and with traditional methods of diagnosis, according to class separation interpretability, time consumption and figures of merit. The results showed that variable reduction and selection methods followed by SVM can reduce the drawbacks of independent SVM analysis.
The potential biomarkers indicated by PCA-SVM, SPA-SVM and GA-SVM were the amide I, II and III and protein regions (≈1,400–1,585 cm−1), followed by the DNA/RNA (O–P–O symmetric stretch) (≈1,080 cm−1) and DNA (O–P–O asymmetric stretch) (≈1,230 cm−1) regions. GA-SVM was the best approach, with higher sensitivity (100%) and specificity (80%), particularly in early stages, and it performed better than the traditional methods of diagnosis. Thus, the potential diagnostic tools proposed in this paper describe a less time-consuming, observer-independent methodology that may not only diagnose prostate cancer in tissue samples with high accuracy, based on spectral differences that do not suffer from intra- and inter-observer variability, but also, apparently, detect early stages better than traditional methods.

Keywords: SVM, FT-MIR, tissue, prostate cancer

1. INTRODUCTION

Currently, prostate cancer recognition follows several phases. First, the proctologist evaluates the prostate by Digital Rectal Examination (DRE), which allows palpation of only 40–50% of the prostate and is affected mainly by intra- and inter-observer variability, resulting in DRE sensitivity and specificity of about 21–37% and 71–91%, respectively [1–5]. Subsequently or concomitantly, serum Prostatic Specific Antigen (PSA) levels are measured; this test has sensitivity and specificity of about 21% and 64%, respectively, in early stages, and of 32% and 93%, respectively, in high-grade prostate cancers. When combined with DRE, sensitivity and specificity increase to 51–68% and 92–94%, respectively [1–8]; however, this measurement can be affected by factors other than prostate cancer, such as ejaculation, bacterial prostatitis, biopsy and acute urinary retention, which may elevate PSA levels [3,5–8]. According to the results of these exams, biopsy coupled with Anatomopathological Examination can be indicated for cancer stage identification.
The Trans-rectal Ultrasound (TRUS)-guided Biopsy is currently the gold standard, characterized by sensitivity and specificity of 39–52% and 81–82%, respectively [9]. The Anatomopathological Examination by the Gleason grading system (GS) classifies biopsy samples according to tumor aggressiveness and provides a prognostic indication; it ranges from Gleason 1 (best differentiation and most favorable prognosis) to Gleason 5 (least differentiation and poor prognosis) [10–12]. This system presents sensitivity and specificity of about 22–29% and 81%, respectively, in early stages, and of 30% and 88–97%, respectively, in high-grade prostate cancers [12,13]. However, biopsy-based detection has several disadvantages that delay the diagnosis, such as sample heterogeneity, a difficult and time-consuming procedure, organ damage and a high probability of spreading cancer; in addition, pattern recognition by visual criteria suffers from operator dependence, subjectivity and high intra- and inter-observer variability [14,15].

FT-MIR is an important tool for identifying structural alterations of cellular molecules based on chemical bonds, as well as for identifying spectral biomarkers [16,17]. The spectral region from 900 to 1,800 cm−1, the so-called 'fingerprint region' and the major area of biochemical information, presents the potential biomarkers for the biochemical alterations promoted by cancer when samples are compared with the reference class. The potential biomarkers are the protein region, with bands of amide I (1,650 cm−1), amide II (1,550 cm−1), methyl groups of lipids and proteins (1,400 cm−1) and amide III (1,260 cm−1); the DNA/RNA region, with asymmetric phosphate stretching vibrations (νasPO2−; 1,225 cm−1) and symmetric phosphate stretching vibrations (νsPO2−; 1,080 cm−1); C–O groups of carbohydrates (1,155 cm−1); glycogen (1,030 cm−1); and protein phosphorylation (970 cm−1) [18–21].
Furthermore, FT-MIR is a non-invasive and non-destructive technique that offers objectivity, low operational cost and versatility; it allows qualitative and quantitative applications and involves quick and easy procedures. In cancer studies, it supplies important applications in the screening field, mostly performed in transmission and ATR modes [21,22].

Multivariate classification is used as a tool in many studies involving the discrimination of spectral data from biological samples and also exploratory analyses, particularly in cancer studies. Notable in the classification field is the application of Discriminant Analysis with linear and quadratic boundaries (namely, LDA and QDA) and, on a smaller scale, of algorithms that consider data multidimensionality and non-linear boundaries, such as SVM [21,22]. Multivariate classification involves a few steps: pre-processing, sample selection, application of variable reduction and/or selection methods coupled to classification techniques, and validation by figures of merit and/or comparison with reference samples. These chemometric tools aim to reduce, select and classify useful information, given the large and complex spectral information of biological samples (such as tissues, cells and biofluids) and their normal and abnormal biochemical processes.

Together, multivariate classification and FT-MIR tend to be a potential instrument for investigating, diagnosing, categorizing and monitoring prostate cancer and other diseases, as opposed to standard methods of detection and classification. In addition, these methodologies do not suffer from operator dependence or intra- and inter-observer variability, nor from difficult preparation and time-consuming procedures.

SVM belongs to a generation of learning methodology used for infrared classification.
The SVM algorithm [23,24] aims to separate data classes by an optimal hyperplane, operating in a kernel-induced space for linear and/or non-linear modelling [25]. The central concepts of SVM classification are the construction of two parallel support hyperplanes separating positive and negative data, and the maximization of the margin between these hyperplanes, with the final decision boundary positioned midway between them [25–34]. The advantages of SVM are related to the maximization of the inter-class geometric margin while keeping the classification error low; the possibility of non-linear modelling through the kernel function; the optimal solution and good generalization even with small datasets; and the robustness of the resulting classification models, which are less subject to dimensionality and over-fitting problems [14,25,26,29–31]. The most significant disadvantages are the time consumed on data pre-processing and model selection, and the less intuitive and interpretable character of the SVM classifier [14,25,27,28,32,35].

We also proposed the use of PCA as a variable reduction method, and of SPA and GA as variable selection methods, followed by SVM to perform the multivariate classification and to improve the interpretability of class separation.

PCA is a very popular variable reduction method. A sequence of independent linear combinations of the variables (wavenumbers) that covary most strongly with each other is represented by a new matrix, the so-called Principal Components (PCs). Each PC is the product of the scores (which contain the variance explained by each PC) and the loadings (which define the linear combination of variables that each PC represents). In the iterative process, the first PC has greater explained variance than the second, the second has greater explained variance than the third, and so on. Therefore, the choice of the number of PCs is an important step in the process to avoid either redundancy or insufficient information [36–41].
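The relationship between loadings, scores and explained variance described above can be sketched for the first PC only. This is a minimal illustration, not the PLS Toolbox routine used in the study: the power-iteration approach, the function name and the toy data are assumptions made for the example.

```python
import math

def first_principal_component(rows, n_iter=200):
    """Return (loadings, scores, explained variance fraction) of the first PC.

    rows: list of samples, each a list of variable values.
    The leading eigenvector of the covariance matrix is found by power iteration.
    """
    n, p = len(rows), len(rows[0])
    # Mean-center each variable (column).
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix of the variables.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Power iteration: repeatedly multiply by cov and renormalize.
    v = [1.0] * p
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance captured along v, relative to the total variance.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    total_variance = sum(cov[i][i] for i in range(p))
    # Scores: projection of each centered sample onto the loadings.
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores, eigenvalue / total_variance

# Hypothetical toy data: the second variable is exactly twice the first,
# so a single PC captures all of the variance.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
loadings, scores, explained = first_principal_component(data)
print(loadings, explained)
```

Subsequent PCs could be obtained by removing the contribution of the first PC from the centered data and repeating the procedure, which is why each successive PC explains less variance than the previous one.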
Some strong points of PCA are the reduction of data dimensionality and the generation of data visualizations; the capture of as much variance as possible; and the reduction of redundancy in the original data set. On the other hand, PCA modifies the original data and considers all between- and within-class variability impartially in its algorithm, so it only separates classes if there is large variance between them [21,22]. To deal with these drawbacks, effective pre-processing steps and a careful choice of the number of PCs are suggested [18,41].

SPA selects the best variables according to a sequence of projection operations involving the matrix of instrumental responses, keeping the variables with the largest resulting projection vectors in the orthogonal space. Moreover, SPA has the advantage that the subsets of selected variables have a small degree of multi-collinearity, which minimizes redundancy and ill-conditioning problems. However, its high computational requirements and longer run time can be a weakness [22,42,43].

GA is a stochastic and heuristic variable selection method based on evolutionary theory, having selection, recombination and mutation as operators. Selection elects the variables that present the lowest prediction and validation errors and forms a new matrix of variables. Recombination randomly selects individuals (variables) to be crossed and generate descendants (new variables) with useful information. Mutation is a random change of genes in a chosen chromosome, and its rate tends to be low. Each new generation (i.e., cycle of iterations) is more adapted than the previous one. The variable selection ends when the previously defined number of generations is reached [15,25,43–46]. The strengths of GA include the elimination of potential interferents and the lower signal/noise ratio generated by variable selection. In contrast, its stochastic nature can make the results realization-dependent and the variable selection not reproducible [21,22].
To solve this, the GA is usually run a few times and the run with the best performance is chosen [16,17,43].

This work aims to apply variable reduction and selection techniques combined with SVM to FT-MIR data from prostate tissue, in order to classify and detect spectral differences between early and advanced stages of prostate cancer. Principal Component Analysis (PCA) was used as the variable reduction method, while Genetic Algorithms (GA) and the Successive Projection Algorithm (SPA) were used as variable selection methods, followed by SVM to perform multivariate classification by PCA-SVM, SPA-SVM and GA-SVM models.

2. EXPERIMENTAL

This study was developed through a partnership between the Institute of Chemistry and the Department of Pathology of the Federal University of Rio Grande do Norte, Natal, Brazil. All experiments were performed in compliance with the relevant laws and institutional guidelines, and the institutional committee (No. 030/0030/2006) of the Liga Norte-Riograndense Contra o Cancer, Brazil, approved this research.

Tissue collection. The Pathology Department of the Federal University of Rio Grande do Norte (UFRN/Brazil) provided the prostate tissue sections, which were formalin-fixed, dehydrated and paraffin-embedded (FFPE) in pathology blocks (n = 45) and classified by pathologists, based on traditional Gleason grading, into three categories: Gleason 2 (n = 23), Gleason 3 (n = 15) and Gleason 4 (n = 7). Tissue sections (5 μm thick) were floated onto ZnSe slides (Bruker Optics Ltd., Coventry, UK), de-waxed by serial immersion in fresh xylene baths for 5 min, washed and cleared in an absolute ethanol bath for another 5 min [47], and then allowed to air-dry and placed in a desiccator until analysis. In our study, we chose to work with a Low grade (Gleason 2, n = 23) and High grade (Gleason 3 and Gleason 4, n = 22) categorization, in order to work with the concepts of early and advanced stages from a screening perspective.

FT-MIR spectroscopy.
For each tissue, 40–100 FTIR spectra were collected in transmission mode over the mid-IR wavenumber range of 600–4,000 cm−1, with a spectral resolution of 8 cm−1 and 32 scans, using a Bruker Lumos FTIR spectrometer-microscope (Bruker Optics Ltd., Coventry, UK), and converted into absorbance by Bruker OPUS software. A new background was taken for every new sample.

Computational analysis. Spectral data import, pre-processing and construction of the multivariate classification models were executed using PLS Toolbox 7.8 (Eigenvector Research, Inc., Wenatchee, WA, USA), the LibSVM algorithm and MATLAB R2012b (MathWorks Inc., Natick, MA, USA).

Pre-processing. FT-MIR spectra were cut to the wavenumber range between 800 and 1,800 cm−1, the area associated with the biological spectral fingerprint. The resulting dataset was subjected to Extended Multiplicative Scatter Correction (EMSC) to correct the baseline [48,49], first-order Savitzky-Golay smoothing (15 points) to emphasize relevant information and remove background noise [50,51], and normalization to the amide I peak (1,650 cm−1) to eliminate distortions [41,52]. For the application of each analytical model, the spectral data were divided into training (60%), validation (20%) and prediction (20%) sets by applying the classic Kennard-Stone (KS) uniform sampling algorithm. The training and validation datasets were used in the modelling procedures, whereas the prediction dataset was used only for the final classification evaluation [53].

SVM models. The application of linear, quadratic, Radial Basis Function (RBF) and third-order polynomial SVMs was evaluated according to different kernel parameters (bias, correct classification of training and test sets, and C parameter). The same set of values of the C parameter, which controls the trade-off between training error and margin, was considered for each kernel. C ranged from 0.01 to 50.
In addition, all combinations of the C and σ parameters were trained for RBF-SVM, where σ ranged from 0.01 to 50, totalling about 60 RBF-SVM models for each data set.

Variable reduction and selection methods coupled to SVM. For high-dimensional data such as these, the use of variable reduction and selection methods followed by SVM can maximize the predictive performance of the models and reduce over-fitting, mainly when the number of samples in the training set is small. In this work, the input space is transformed into feature space by means of the Radial Basis Function (RBF) kernel in the PCA-SVM, SPA-SVM and GA-SVM models. These models were built according to C-support vector classification (C-SVC), where C is the cost of misclassification and tends to infinity. As mentioned before, C regulates the balance between training errors and model complexity. In our models, both the C and σ parameters were automatically optimized. Variable reduction by PCA-SVM was employed after optimizing the number of PCs based on the classification rates of the training and test sets. PCA was applied to classify the samples according to linear combinations of the original variables that explain the largest variance, in order to capture and add information to the classification [36,37]. Variable selection by SPA-SVM picks the best variables according to their largest vector projection in the orthogonal space [42,43]. GA-SVM selects the best variables based on a stochastic sampling method [15,25,54]. The variable selection routine by GA-SVM was carried out using 40 generations containing 80 chromosomes each. Five independent GA-SVM runs with different random initial populations were performed and only the best individuals were kept.

Figures of merit. Model performance was evaluated based on Sensitivity, Specificity, Positive (or Precision) and Negative Predictive Values, Youden's index, and Positive and Negative Likelihood Ratios.
(1) Sensitivity is the proportion of truly positive samples that the model classifies as positive;
(2) Specificity is the proportion of truly negative samples that the model classifies as negative;
(3) Positive Predictive Value (PPV) is the proportion of test positives that are true positives;
(4) Negative Predictive Value (NPV) is the proportion of test negatives that are true negatives;
(5) Youden's index (YOU) evaluates the classifier's ability to avoid failure and also its biomarker identification capacity;
(6) The positive Likelihood Ratio (LR+) is the ratio between the probability of a positive prediction for a truly positive sample and the probability of a positive prediction for a truly negative sample;
(7) The negative Likelihood Ratio (LR-) is the ratio between the probability of a negative prediction for a truly positive sample and the probability of a negative prediction for a truly negative sample [22,55,56].

This means that the better models must have high sensitivity and specificity to be accurate in class separation. YOU must be close to 100 to demonstrate the model's capacity for classification and biomarker identification. PPV and NPV must also be high to affirm or negate the group segregation. LR+ must be high, while LR- must be low, which gives an intuitive indication that the model governs the classification. The figures of merit can be calculated by the equations summarized as follows:

$$\mathrm{SENS} = \frac{TP}{TP + FN} \times 100 \tag{Eq. 3}$$

$$\mathrm{SPEC} = \frac{TN}{TN + FP} \times 100 \tag{Eq. 4}$$

$$\mathrm{PPV} = \frac{TP}{TP + FP} \times 100 \tag{Eq. 5}$$

$$\mathrm{NPV} = \frac{TN}{TN + FN} \times 100 \tag{Eq. 6}$$

$$\mathrm{YOU} = \mathrm{SENS} - (100 - \mathrm{SPEC}) \tag{Eq. 7}$$

$$\mathrm{LR}^{+} = \frac{\mathrm{SENS}}{100 - \mathrm{SPEC}} \tag{Eq. 8}$$

$$\mathrm{LR}^{-} = \frac{100 - \mathrm{SENS}}{\mathrm{SPEC}} \tag{Eq. 9}$$

where TP is true positive, TN is true negative, FP is false positive, FN is false negative, SENS is sensitivity and SPEC is specificity, both expressed in percent. In addition, the correct classification (CC%) of the training and test sets was calculated.
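As an illustration, the figures of merit listed above can be evaluated directly from confusion-matrix counts. The following is a minimal Python sketch, not the PLS Toolbox/MATLAB code used in this study; the function name `figures_of_merit` and the example counts are hypothetical. Sensitivity and specificity are kept in percent, and the negative likelihood ratio is computed with the conventional definition (100 − SENS)/SPEC, so that low values indicate a good classifier.

```python
def figures_of_merit(tp, fn, tn, fp):
    """Figures of merit from confusion-matrix counts.

    Assumes every class and every prediction occurs at least once,
    so that none of the four basic denominators is zero.
    """
    sens = 100.0 * tp / (tp + fn)   # sensitivity (%)
    spec = 100.0 * tn / (tn + fp)   # specificity (%)
    ppv = 100.0 * tp / (tp + fp)    # positive predictive value (%)
    npv = 100.0 * tn / (tn + fn)    # negative predictive value (%)
    you = sens - (100.0 - spec)     # Youden's index on the percent scale
    # Likelihood ratios are unbounded when their denominator is zero.
    lr_pos = sens / (100.0 - spec) if spec < 100.0 else float("inf")
    lr_neg = (100.0 - sens) / spec if spec > 0.0 else float("inf")
    return {"SENS": sens, "SPEC": spec, "PPV": ppv, "NPV": npv,
            "YOU": you, "LR+": lr_pos, "LR-": lr_neg}


# Hypothetical counts, for illustration only (not data from this study).
fom = figures_of_merit(tp=4, fn=1, tn=3, fp=1)
```

With these example counts the sensitivity is 80% and the specificity is 75%, so Youden's index is 55, LR+ is 3.2 and LR- is about 0.27.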
Training set CC% involves applying the model to the samples used to build and optimize it, while test set CC% is used to test the model's classification ability. Generally, the former tends to be higher than, but close to, the latter; when this holds, the models are well balanced and free of overfitting [57]. In addition, the SVM models and the variable reduction and selection methods coupled to SVM were compared by assessing these classification rates.

3. RESULTS

The main results of this study are presented in this section, focusing on FT-MIR classification of Low and High grades of prostate cancer by the SVM classifier.

Pre-processing. The FT-MIR spectral dataset derived from the Low and High grade categorization for prostate cancer is shown in Figure 1. The raw spectral data (Fig. 1A) were cut to 1,800–800 cm⁻¹ to emphasize the fingerprint region. EMSC, 1st-order Savitzky-Golay smoothing (15 points) and normalization to the amide I peak (1,650 cm⁻¹) were performed on the mean spectra (Fig. 1B).

Figure 1 – FT-MIR spectral dataset derived from Low and High grade categorization for prostate cancer. (A) Non-pre-processed spectral dataset and (B) spectral dataset pre-processed by EMSC, 1st-order Savitzky-Golay smoothing (15 points) and normalization to the amide I peak (1,650 cm⁻¹), and cut to the fingerprint region (800–1,800 cm⁻¹).

SVM models.

RBF-SVM. Figure 2 depicts the bias and the errors in the training and test sets as the σ and C parameters are varied in the RBF-SVM models derived from the Low and High grade categorization for prostate cancer. As σ and C vary, the bias values can be described in three ways: (1) for σ ≤ 1, bias remains approximately constant and low, independently of the values of C; (2) for 1 ≤ σ ≤ 10, and for σ ≥ 15 and C ≤ 15, bias grows; (3) for σ ≥ 10 and C ≥ 10, bias drops (Fig. 2A).
Errors in the training set were low; however, they grow for σ ≥ 15 and drop to zero for σ ≤ 15, independently of the values of C. For σ ≥ 25, High grade had a higher correct classification rate in the training set (≈ 86%) than Low grade (≈ 78%) (Fig. 2B). An elevated test error can be observed for σ ≥ 15 for Low grade, while the same occurs for σ ≤ 1 for High grade, independently of the values of C (Fig. 2C-D). The maximum correct classification of the test set was ≈ 60% for both Low grade (for C ≥ 5) and High grade (for σ = 10, C ≥ 1). For most RBF-SVMs, the number of support vectors used in the classification was high and close to the number of training samples, and somewhat lower for those with σ ≥ 15 and C ≥ 10 (Table 1).

Linear, quadratic and polynomial SVMs. Figure 3 illustrates the bias and the errors in the training and test sets for the linear, quadratic and 3rd-order polynomial SVM models derived from the Low and High grade categorization for prostate cancer. Linear-SVM had larger bias values than the quadratic and 3rd-order polynomial SVMs, independently of the values of C (Fig. 3A). Errors in the training set were small (≤ 0.1) for Low and High grades, where the 3rd-order polynomial SVM had the lowest correct classification of the training set (≈ 90%) for both grades, independently of the values of C (Fig. 3B). Low grade had a larger error in the test set than High grade (Fig. 3C). Quadratic-SVM was the best model for Low grade classification, although it correctly classified only ≈ 40% of the test set. The linear and 3rd-order polynomial SVMs correctly classified ≈ 80-100% of the High grade test set. The number of support vectors used in the classification is presented in Table 1. These SVM approaches used fewer support vectors in the classification (15-20 support vectors) than the RBF-SVM models (22-25 support vectors).

Figure 2 – Low and High grade FT-MIR classification for prostate cancer by RBF-SVM. (A) Bias; (B) error in training set; and (C, D) error in test set, varying σ and C parameters.
Figure 3 – Low and High grade FT-MIR classification for prostate cancer by linear, quadratic and 3rd-order polynomial SVMs. (A) Bias, (B) error in training set, and (C) error in test set, varying the C parameter.

Table 1 – Low and High grade FT-MIR classification for prostate cancer by SVM models. Number of support vectors used by SVM models.

| Model | σ | C = 0.1 | C = 1 | C = 5 | C = 10 | C = 15 | C = 25 | C = 50 |
|---|---|---|---|---|---|---|---|---|
| RBF-SVM | 0.01 | 25 | 25 | 25 | 25 | 25 | 25 | 25 |
| RBF-SVM | 0.5 | 25 | 25 | 25 | 25 | 25 | 25 | 25 |
| RBF-SVM | 1 | 25 | 25 | 25 | 25 | 25 | 25 | 25 |
| RBF-SVM | 5 | 25 | 25 | 25 | 25 | 25 | 25 | 25 |
| RBF-SVM | 10 | 25 | 25 | 25 | 25 | 25 | 25 | 25 |
| RBF-SVM | 15 | 25 | 25 | 25 | 25 | 24 | 24 | 24 |
| RBF-SVM | 25 | 25 | 25 | 25 | 24 | 24 | 23 | 23 |
| RBF-SVM | 50 | 25 | 25 | 25 | 24 | 23 | 23 | 22 |
| Linear-SVM | - | 21 | 18 | 18 | 18 | 18 | 18 | 18 |
| Quadratic-SVM | - | 20 | 20 | 20 | 20 | 20 | 20 | 20 |
| Polynomial-SVM | - | 15 | 15 | 15 | 15 | 15 | 15 | 15 |

Variable reduction and selection methods coupled to SVM.

PCA-SVM. Figure 4 shows the influence of the number of PCs on the correct classification of the training and test sets from the Low and High grade categorization for prostate cancer. The best classification rates (≈ 80%) for both Low and High grades and for both training and test sets were found using 10 PCs. Loadings and scores plots derived from PCA-SVM are displayed in Figure 5A-B. The loadings plot identified the distinguishing wavenumbers: ≈ 975; 1,080; 1,155; 1,230; 1,270; 1,370; 1,415; 1,465; 1,555; and 1,575 cm⁻¹. The scores plot shows some separation between the Low and High grades (P < 0.001). PCA-SVM classification used 10 support vectors and had a bias of ≈ -0.156.

Figure 4 – Low and High grade FT-MIR classification for prostate cancer by PCA-SVM. Influence of the number of orthogonal components on the correct classification rates of training and test sets.

SPA-SVM and GA-SVM. Loadings and scores plots derived from SPA-SVM are presented in Figure 5C-D and those from GA-SVM in Figure 5E-F.
The SPA-SVM approach used twenty-four wavenumbers: ≈ 960; 1,000; 1,027; 1,081; 1,115; 1,134; 1,151; 1,169; 1,231; 1,296; 1,325; 1,347; 1,357; 1,376; 1,389; 1,402; 1,450; 1,468; 1,488; 1,506; 1,559; 1,595; 1,620; and 1,650 cm⁻¹ (Fig. 5C). SPA-SVM classification used 21 support vectors and had a bias of ≈ -0.098. The GA-SVM model generated the best classification (Fig. 5E) using twenty selected variables. These were ≈ 950; 1,012; 1,086; 1,226; 1,232; 1,242; 1,249; 1,268; 1,276; 1,297; 1,306; 1,330; 1,370; 1,371; 1,376; 1,399; 1,519; 1,552; 1,630 and 1,680 cm⁻¹. GA-SVM classification used 15 support vectors and had a bias of ≈ -0.74. Both variable selection approaches showed significant separation in the classification (P < 0.005).

Figures of merit. Figures of merit of the RBF-SVM, Linear-SVM, Quadratic-SVM, Polynomial-SVM, PCA-SVM, SPA-SVM and GA-SVM models for FT-MIR classification of prostate cancer are listed in Table 2 and Table 3.

In Table 2, the SVM models had sensitivity and specificity values varying over ≈ 20-100% and ≈ 0-80%, respectively, for Low and High grades. Linear-SVM had opposite sensitivity (≈ 20% vs. ≈ 100%) and specificity (≈ 80% vs. ≈ 0%) values for Low and High grades, respectively; the same occurs with the sensitivity (≈ 40% vs. ≈ 80%) and specificity (≈ 60% vs. ≈ 20%) values for Low and High grades, respectively, by quadratic-SVM. The 3rd-order polynomial SVM had ≈ 50% sensitivity and specificity for both grades. RBF-SVM (σ = 10, C ≥ 5) had ≈ 60% sensitivity and ≈ 40% specificity for both grades.

In Table 3, variable reduction and selection methods followed by SVM presented sensitivity of ≈ 80-100% and specificity of ≈ 75-80% for the Low grade category, while for High grade they showed sensitivity of ≈ 67-80% and specificity of ≈ 71-80%. GA-SVM correctly classified around 100% of the training and test sets for both grades of prostate cancer, and also had the largest figures of merit in comparison to PCA-SVM and SPA-SVM.
In turn, PCA-SVM was the second-best approach, with slightly higher classification rates and figures of merit than SPA-SVM.

Figure 5 – Low and High grade FT-MIR classification for prostate cancer by variable reduction and selection methods coupled to SVM. (A) Loadings plot derived from PCA-SVM. (B) Scores plot calculated by PCA-SVM. (C) Twenty-four wavenumbers selected by SPA-SVM. (D) Scores plot calculated by SPA-SVM. (E) Twenty wavenumbers selected by GA-SVM. (F) Scores plot calculated by GA-SVM. (LowCal: Low grade calibration set; LowVal: Low grade validation set; LowPred: Low grade test set; HighCal: High grade calibration set; HighVal: High grade validation set; HighPred: High grade test set)

Table 2 – Low and High grade FT-MIR classification for prostate cancer by SVM models. Figures of merit of RBF-, linear-, quadratic- and 3rd-order polynomial-SVM; the RBF-SVM columns refer to σ = 10, C ≥ 5. (SENS: Sensitivity. SPEC: Specificity. PPV: Positive Predictive Value. NPV: Negative Predictive Value. YOU: Youden Index. LR+: Positive Likelihood Ratio. LR-: Negative Likelihood Ratio.)

| Figure of merit | RBF-SVM, Low grade | RBF-SVM, High grade | Linear-SVM, Low grade | Linear-SVM, High grade | Quadratic-SVM, Low grade | Quadratic-SVM, High grade | Polynomial-SVM, Low grade | Polynomial-SVM, High grade |
|---|---|---|---|---|---|---|---|---|
| SENS (%) | 60 | 60 | 20 | 100 | 40 | 80 | 50 | 50 |
| SPEC (%) | 40 | 40 | 80 | 0 | 60 | 20 | 50 | 50 |
| PPV (%) | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 |
| NPV (%) | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 |
| YOU | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| LR+ | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| LR- | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

Table 3 – Low and High grade FT-MIR classification for prostate cancer by variable reduction and selection methods coupled to SVM. Figures of merit of PCA-SVM, SPA-SVM and GA-SVM. (SENS: Sensitivity. SPEC: Specificity. PPV: Positive Predictive Value. NPV: Negative Predictive Value. YOU: Youden Index. LR+: Positive Likelihood Ratio. LR-: Negative Likelihood Ratio.)
| Figure of merit | PCA-SVM, Low grade | PCA-SVM, High grade | SPA-SVM, Low grade | SPA-SVM, High grade | GA-SVM, Low grade | GA-SVM, High grade |
|---|---|---|---|---|---|---|
| Training set CC (%) | 100 | 100 | 84.62 | 66.67 | 100 | 100 |
| Test set CC (%) | 80 | 80 | 80 | 60 | 100 | 90 |
| SENS (%) | 80 | 66.67 | 80 | 80 | 100 | 80 |
| SPEC (%) | 75 | 80 | 75 | 71.43 | 80 | 80 |
| PPV (%) | 80 | 80 | 80 | 60 | 100 | 75 |
| NPV (%) | 80 | 60 | 80 | 60 | 100 | 60 |
| YOU | 60 | 41.67 | 60 | 60 | 75 | 60 |
| LR+ | 4 | 2.67 | 4 | 3.5 | 4 | 1.5 |
| LR- | 0.25 | 0.45 | 0.25 | 0 | 0.64 | 0.40 |

4. DISCUSSION

This work aimed to apply variable reduction and selection techniques followed by SVM to FT-MIR data from human tissue, in order to classify and detect spectral differences between early and advanced stages of prostate cancer. Prostate tissues taken from formalin-fixed, dehydrated and paraffin-embedded (FFPE) pathology blocks, previously staged by pathologists as Gleason II, III and IV, were used. No significant changes in fixation or paraffin embedding occurred during the analysis period, and no degradation of tissue architecture was observed. The samples showed no diathermy effect. No contributions of paraffin vibrational modes were apparent in the low-wavenumber region of the FT-MIR spectra.

FT-MIR spectra were collected in transmission mode. The motivation for using this mode was the need for a non-destructive technique that allows the tissue area to be mapped. Moreover, our research deals with tissue samples, which are complex structures. Thus, distinct locations across the samples were considered during data collection; this was the purpose of taking more than one spectrum per sample.

The influence of scattering effects, overlapping bands, noise and some reflective loss at the substrate–sample interface appears in the plot of the raw spectral data (Fig. 1A). Thus, in the following order, Extended Multiplicative Scatter Correction (EMSC) was applied for baseline correction of scattering effects, 1st-order Savitzky-Golay smoothing (15 points) was performed to remove background noise and correct the baseline, and normalization to the amide I peak (1,650 cm⁻¹) was carried out to correct distortions (Fig. 1B).
Because the spectral differentiation between Low and High grades is very subtle, it was necessary to build multivariate classification models to identify the most significant spectral markers for differentiation.

RBF-, linear-, quadratic- and polynomial-SVM models have been used in several biomedical works [34,58–70] on cancer diagnosis, prognosis and genetic profiling. However, few works [14,71–76] have applied these SVM models to IR spectroscopic data derived from cancer samples. The number of publications is even lower when considering the application of variable reduction [71,77–80] and selection [81] approaches followed by SVM to IR spectroscopic data from cancer samples.

SVM models.

RBF-SVMs. It was noted that σ was the most important parameter in the RBF-SVM classification. The bias, the errors in the training and test sets, and the number of support vectors depend mostly on the σ parameter. It should be emphasized that bias derives from differences between the model and the true behavior to be predicted, and that it is related to over-fitting problems [82,83]. The relationships between all of them are discussed below.

As σ and C vary, the bias values showed three behaviors (Fig. 2A). First, a low-bias region may be related to low values of σ (σ ≤ 1), where small or no variations in bias are observed, independently of the values of C. Secondly, a high-bias region may be related to intermediate values of σ (1 ≤ σ ≤ 10) when C ≤ 5, where bias starts to grow. A third region (a lower-bias region) can be related to high values of σ (σ ≥ 15) when C ≥ 10, where bias suddenly decreases. In these two cases, RBF-SVMs are sensitive to low values of C; in other words, if C is too low, then bias can grow quickly. In fact, bias reached its largest values when σ ≥ 15 and C ≤ 5. The error in the training set grows when bias increases.
The error grew for σ ≥ 15 and reached its largest values when C ≤ 5, which coincides with the largest value of bias; then, the error dropped to zero when σ ≤ 15 (which coincides with some low values of bias). This is more evident for Low grade than for High grade (Fig. 2B). Hence, the correct classification of the Low grade training set was 75% and that of the High grade was 85% in the high-bias region, while the correct classification of the training set was ≈ 100% for both Low and High grades in the low-bias region.

Interestingly, the errors in the test set for Low and High grades were opposite: a high error can be observed for σ ≥ 15 for the former, while the same occurs for σ ≤ 1 for the latter, independently of the values of C (Fig. 2C-D). This may mean that RBF-SVM cannot learn at these values of σ. Only for 5 ≤ σ ≤ 10 does this opposite behavior not appear, and both categories were classified. In this σ range, a maximum classification rate of ≈ 60% was found for both Low grade (when C ≥ 5) and High grade (when σ = 10 and C ≥ 1).

In addition, for most RBF-SVMs the number of support vectors was high and close to the number of training samples. A decrease in the number of support vectors can be observed for σ ≥ 15 when C ≥ 10, which coincides with a lower-bias region (Table 1).

All this information can support the hypothesis of overfitting problems at high values of σ (σ ≥ 15) when the values of C are small (C ≤ 5). This is supported by the fact that this range of σ and C matches the high-bias region, the higher values of training error and the high number of support vectors, which was close to the number of training samples [71,82,83].

All these relationships between the σ and C parameters, bias, errors in the training and test sets, and number of support vectors were important in concluding that the best RBF-SVM models are those with σ = 10 and C ≥ 5.

Linear-, Quadratic- and Polynomial-SVMs.
Next, the relationships between the C parameter, bias, errors in the training and test sets and number of support vectors in the linear, quadratic and polynomial SVM classifications are discussed. Linear-SVM had larger bias values than the quadratic and 3rd-order polynomial SVMs, independently of the values of C (Fig. 3A). Indeed, all bias values of the 3rd-order polynomial SVMs were negative. Errors in the training set were small (≤ 0.1) for Low and High grades, with the latter showing a slightly larger error than the former. All models had high correct classification for both grades. The 3rd-order polynomial SVM had a correct classification of the training set of ≈ 90% and the others had ≈ 100% for both grades, independently of the values of C (Fig. 3B).

From Figure 3C, it was clear that High grade was better classified than Low grade in the test set, independently of the C values. In fact, the same opposite behavior in the test set error found in the RBF-SVM classification occurs in these approaches. For Low grade, which had the larger error in the test set, the best classification rate was ≈ 40%, obtained by quadratic-SVM independently of the C values and by the 3rd-order polynomial SVM only when C = 1. Indeed, the 3rd-order polynomial SVM failed to classify the Low grade test set at any value of C except C = 1. Linear-SVM correctly classified only ≈ 20% of the Low grade test set. For High grade, linear-SVM correctly classified ≈ 100% of the test set, independently of the C values. The same occurs for the 3rd-order polynomial SVM, except when C = 1, for which it correctly classified ≈ 80% of the test set. Quadratic-SVM correctly classified ≈ 80% of the High grade test set, independently of the C values.

These SVM approaches used fewer support vectors in the classification than the RBF-SVM models. Quadratic-SVM classification used the largest number of support vectors (20), while the 3rd-order polynomial SVM used 15 support vectors and linear-SVM used 18 support vectors (Table 1).
The fact that linear-SVM had larger bias values, zero training error, a higher error in the test set for Low grade and a relatively high number of support vectors may indicate overfitting problems, according to our previous hypothesis. Although the 3rd-order polynomial SVM presented smaller bias values and a lower number of support vectors, it also presented a relatively high training error (compared to the others) and a higher error in the Low grade test set. On the other hand, although the quadratic-SVM classification used a larger number of support vectors, it showed low bias values and a low error in the training set, and it succeeded in classifying both the Low and High grade test sets.

Kernels comparison. The best RBF-SVM model and the linear, quadratic and polynomial SVMs were compared based on figures of merit, such as sensitivity, specificity and others, in Table 2. It was clear that the SVM models had low sensitivity and specificity, with values varying over ≈ 20-100% and ≈ 0-80%, respectively, for Low and High grades. Setting aside the linear and quadratic SVMs, which presented opposite values of sensitivity and specificity for Low and High grades (when one value was high the other was low, and vice versa), the 3rd-order polynomial SVM and RBF-SVM presented sensitivity and specificity values close to each other for Low and High grades; nevertheless, these values were low from a classification perspective. Additionally, the number of support vectors used in almost all classifications was very close to the total number of samples in the training set, an indication of possible overfitting [71]. Based on all these facts, these SVM models may be considered inefficient for our data.

Variable reduction and selection coupled to SVM. In this work, we proposed the use of PCA as a variable reduction method and of SPA and GA as variable selection methods, followed by SVM, to perform the multivariate classification and to improve interpretability for class separation.

PCA-SVM.
The best classification rates (≈ 80%) for both Low and High grades and for both training and test sets were found using 10 PCs (Figure 4), which provided ≈ 98% explained variance. The search for the optimum number of PCs aimed to avoid overfitting problems, arbitrary separation, too much noise and degradation of the interpretation of the loadings plots. The scores plot identified significant spectral similarity/dissimilarity (P < 0.001) between the Low and High grades and provided a visual representation and interpretation of these (Figure 5B). In addition, the loadings identified the most important segregating variables (wavenumbers) responsible for Low and High grade classification (Figure 5A), based on a sequence of linear combinations of the original variables that had the greatest covariance. The distinguishing wavenumbers were ≈ 975; 1,080; 1,155; 1,230; 1,270; 1,370; 1,415; 1,465; 1,555; and 1,575 cm⁻¹.

SPA-SVM. The scores plot identified significant spectral separation (P < 0.005) between the Low and High grades (Fig. 5D). This approach selected the most relevant variables (wavenumbers) responsible for Low and High grade classification based on several vector projections, until a previously chosen number of wavenumbers was reached. The loadings plot (Fig. 5C) provided a visualization of the variables selected by SPA-SVM, which had a small degree of multi-collinearity in order to minimize redundancy and ill-conditioning problems. SPA-SVM selected twenty-four wavenumbers: ≈ 960; 1,000; 1,027; 1,081; 1,115; 1,134; 1,151; 1,169; 1,231; 1,296; 1,325; 1,347; 1,357; 1,376; 1,389; 1,402; 1,450; 1,468; 1,488; 1,506; 1,559; 1,595; 1,620; and 1,650 cm⁻¹.

GA-SVM. The scores plot identified significant spectral segregation (P < 0.005) between the Low and High grades (Fig. 5F). GA-SVM selects variables through stochastic and heuristic modelling based on evolutionary theory, in which the best variables are chosen according to the lowest prediction and validation errors.
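A GA-based variable selection of this kind can be sketched in a few lines. The Python example below is an illustration under loudly stated assumptions rather than the GA-SVM routine used in this work: a nearest-centroid classifier stands in for the SVM as the fitness function, and the tournament selection, one-point crossover, mutation rate and all function names are hypothetical choices. Only the defaults of 40 generations, 80 chromosomes and five independent runs follow the settings given in the Experimental section.

```python
import random


def centroid_accuracy(mask, X_tr, y_tr, X_val, y_val):
    """Validation accuracy of a nearest-centroid classifier restricted to
    the variables flagged in mask (stand-in for the SVM; an assumption)."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0  # a chromosome selecting no variable cannot classify
    centroids = {}
    for label in set(y_tr):
        rows = [x for x, y in zip(X_tr, y_tr) if y == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in cols]
    hits = 0
    for x, y in zip(X_val, y_val):
        pred = min(centroids, key=lambda c: sum(
            (x[j] - m) ** 2 for j, m in zip(cols, centroids[c])))
        hits += pred == y
    return hits / len(y_val)


def ga_select(X_tr, y_tr, X_val, y_val,
              n_gen=40, pop_size=80, p_mut=0.05, seed=0):
    """One GA run; returns (best mask, its validation accuracy).
    Requires at least two variables and pop_size >= 3."""
    rng = random.Random(seed)
    n_vars = len(X_tr[0])

    def fit(mask):
        return centroid_accuracy(mask, X_tr, y_tr, X_val, y_val)

    # Random initial population of binary chromosomes (1 = variable kept).
    pop = [[rng.randint(0, 1) for _ in range(n_vars)] for _ in range(pop_size)]
    best = max(pop, key=fit)
    for _ in range(n_gen):
        children = [best[:]]                        # elitism
        while len(children) < pop_size:
            p1 = max(rng.sample(pop, 3), key=fit)   # tournament selection
            p2 = max(rng.sample(pop, 3), key=fit)
            cut = rng.randrange(1, n_vars)          # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - b if rng.random() < p_mut else b for b in child]
            children.append(child)
        pop = children
        best = max(pop, key=fit)
    return best, fit(best)


def best_of_runs(X_tr, y_tr, X_val, y_val, n_runs=5, **kw):
    """Repeat the GA with different seeds and keep the best individual."""
    runs = [ga_select(X_tr, y_tr, X_val, y_val, seed=s, **kw)
            for s in range(n_runs)]
    return max(runs, key=lambda r: r[1])
```

Repeating the run with different random seeds and keeping the best individual mirrors the way the stochastic nature of the GA is handled in this study; in practice the fitness function would wrap the actual SVM training and validation step.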
GA-SVM generated the best classification using twenty variables (Fig. 5E), which were ≈ 950; 1,012; 1,086; 1,226; 1,232; 1,242; 1,249; 1,268; 1,276; 1,297; 1,306; 1,330; 1,370; 1,371; 1,376; 1,399; 1,519; 1,552; 1,630 and 1,680 cm⁻¹.

Comparing variable reduction and selection methods coupled to SVM. From the figures of merit of the variable reduction and selection methods followed by SVM shown in Table 3, it was noted that the Low grade category had larger values for most figures of merit in comparison to High grade for all approaches, notably sensitivity (≈ 80-100%) and specificity (≈ 75-90%). Variable selection coupled to SVM performed by GA-SVM had the best performance, as discussed below. All models had low bias and used a relatively low number of support vectors in the classification.

The GA-SVM model generated the best classification in comparison to PCA-SVM and SPA-SVM. It correctly classified ≈ 100% of the training and test sets of the Low grade, and ≈ 100% and ≈ 90% of the training and test sets of the High grade, respectively. Its sensitivity and specificity values were ≈ 100% and ≈ 90%, respectively, for Low grade, and both were ≈ 80% for High grade. This trend of higher values for Low grade extended to the other figures of merit. These high figure of merit values (the closer to 1 or 100%, the better) confirmed the effectiveness of GA-SVM as a variable selection method followed by SVM. In addition, GA-SVM classification used a small number of support vectors compared to the other methods.

SPA-SVM correctly classified ≈ 84% and ≈ 80% of the training and test sets of Low grade, respectively, and ≈ 67% and ≈ 60% of the training and test sets of High grade, respectively. This approach was the worst in comparison to the other methods when considering only the classification rates.
SPA-SVM classification used a high number of support vectors, which was close to the number of training samples. However, SPA-SVM also had high figure of merit values, which were close to those of the other approaches.

On the other hand, the PCA-SVM variable reduction method was slightly better than SPA-SVM. PCA-SVM correctly classified ≈ 100% of the training set and ≈ 80% of the test set of the Low and High grades. It had sensitivity and specificity of ≈ 80% and ≈ 75%, respectively, for Low grade, and of ≈ 67% and ≈ 80%, respectively, for High grade. PCA-SVM classification used the smallest number of support vectors compared to the other methods. PCA-SVM was the second-best approach considering the classification rates and the other figures of merit.

Potential biomarkers and spectral differences. Table 4 presents the discriminating wavenumbers identified from the Low and High grade categorization for prostate cancer by the variable reduction and selection methods coupled to SVM. The largest intensity variations between High grade and Low grade spectra are highlighted. In addition, the spectral differences identified from the absorbance ratio between High and Low grade spectra are displayed in Figure 6. The variable reduction and selection methods coupled to SVM pointed to wavenumbers that can be related to functional groups that compose the structures of proteins and nucleic acids (Fig. 5D-5E, Table 4). The same results were found by many other studies [16,21,71,74,75,77–81,84–87].

Table 4 – Low and High grade FT-MIR classification for prostate cancer by variable reduction and selection coupled to SVM. Discriminating wavenumbers identified. The largest intensity variations between High grade and Low grade spectra are marked (dark grey: high values of absorbance ratio; grey: intermediate values of absorbance ratio; white: lower values of absorbance ratio).
| Wavenumber (cm⁻¹) | Tentative assignment | Absorbance ratio (High grade/Low grade) (a.u.) |
|---|---|---|
| ≈ 1,680 | LHS Amide I (C=O stretch; C–N stretch) | 1.05±0.001 |
| ≈ 1,650 | Amide I (C=O stretch; C–N stretch) | 1.10±0.001 |
| ≈ 1,620 | RHS Amide I (C=O stretch; C–N stretch) | 1.06±0.001 |
| ≈ 1,585 | Amide I/II trough | 1.19±0.001 |
| ≈ 1,570 | LHS Amide II (N–H bend and C–N stretch) | 1.17±0.001 |
| ≈ 1,550 | Amide II (N–H bend and C–N stretch) | 1.19±0.001 |
| ≈ 1,520 | RHS Amide II (N–H bend and C–N stretch) | 1.10±0.001 |
| ≈ 1,455 | Protein (C–H and N–H deformation modes) | 1.08±0.001 |
| ≈ 1,400 | Fatty acids and amino acids (C=O stretching of COO⁻ groups) | 1.18±0.001 |
| ≈ 1,250-1,360 | Amide III (C–N stretching) | 1.06±0.001 |
| ≈ 1,230 | DNA (O–P–O asymmetric stretch) | 1.10±0.001 |
| ≈ 1,120-1,180 | RNA ribose and DNA (C–O stretching) | 1.08±0.001 |
| ≈ 1,080 | DNA/RNA (O–P–O symmetric stretch) | 1.10±0.001 |
| ≈ 1,030 | Glycogen (C–O–H bend) | 1.08±0.001 |
| ≈ 970 | Protein phosphorylation | 1.00±0.001 |

In Figure 1B, absorbance intensities were clearly larger in High grade spectra than in Low grade spectra. Furthermore, spectral differences were most apparent in the bands attributed to amide I, II and III and protein regions (≈ 1,400-1,585 cm⁻¹), followed by the DNA/RNA (O–P–O symmetric stretch) (≈ 1,080 cm⁻¹) and DNA (O–P–O asymmetric stretch) (≈ 1,230 cm⁻¹) regions, the RNA ribose and DNA (C–O stretching) region (≈ 1,120-1,180 cm⁻¹), the glycogen (C–O–H bend) region (≈ 1,030 cm⁻¹), and the protein phosphorylation region (≈ 970 cm⁻¹) (Table 4 and Fig. 6). In fact, phenotypic alterations can first be evidenced by spectral differences [16].

Figure 6 – Low and High grade FT-MIR classification for prostate cancer by variable reduction and selection coupled to SVM. Identified spectral differences based on the absorbance ratio between High and Low grade spectra.

The spectral bands located at ≈ 1,250-1,680 cm⁻¹ can be attributed to deformation, stretching and bending modes of C–N, C=O, C–O, C–H and N–H of fatty acids, amino acids, amides I, II and III, and proteins.
Changes in amino acid conformation and the reduction in the intermolecular aggregation of tissue proteins promoted by cancer transformation tend to increase the intensity of these spectral regions in High grade. Moreover, post-translational modifications related to DNA/RNA changes and alterations in the expression of phase I/II metabolizing enzymes can also explain the high occurrence of protein spectral variation between High and Low grades 16,21,71,74,75,81,84,86.

In addition, the spectral band located at ≈ 1,120-1,180 cm⁻¹ can be attributed to RNA ribose and to C–O stretching of DNA; the band at ≈ 1,030 cm⁻¹ to the C–O–H bend of glycogen; the band at ≈ 1,080 cm⁻¹ to νsPO₂⁻ of DNA/RNA; and the band at ≈ 1,230 cm⁻¹ to νasPO₂⁻ of DNA. Glycogen, ribose, deoxyribose and phosphate groups are widely associated with the conformation and metabolism of nucleic acids. In fact, the stronger intermolecular interactions between nucleic acids that result from intermolecular differentiations and from changes in RNA/DNA conformation promoted by cancer may cause the increase in the related High grade spectral regions. Indeed, the key to discriminating prostate cancer grades has been associated with DNA vibrational modes 16,21,71,74,75,81,84–87.

The band located at ≈ 970 cm⁻¹ can be attributed to symmetric stretching of the dianionic phosphate monoester bonds of phosphorylated proteins and to vibrations of the phosphate groups of nucleic acids. Cell protein phosphorylation regulates protein metabolism, which includes the biochemical processes of cell proliferation, differentiation and growth. Post-translational modification of proteins and increases in proliferation, differentiation and cell cycle progression, possibly caused by advanced cancer, may account for the increase in the High grade spectral region related to protein phosphorylation 16,43,71,75,81,86.

SVM models vs. Variables Reduction and Selection coupled to SVM

At this point, it is relevant to note that variables extraction approaches followed by SVM clearly performed better than SVM models alone, in particular variables selection coupled to SVM (GA-SVM). SVM models showed higher bias and discrepancies and lower classification rates for Low and High grades of prostate cancer, and used a higher number of support vectors in the classification. Moreover, even considering RBF-SVM and polynomial-SVM as the best of these models, lower values were observed for all figures of merit. These results may indicate that these SVM models are inefficient for the classification of prostate cancer stages.

The advantage of RBF-SVM is that practically any boundary profile can be obtained with relatively good performance. However, the necessary optimization of the combinations of the regularization parameter C and the RBF kernel width parameter σ, which determine the boundary complexity and the classification rate, implied higher time consumption. On the other hand, linear-, quadratic- and polynomial-SVMs have the disadvantage of dealing only with specific boundary profiles. However, only the regularization parameter C had to be optimized, resulting in less time consumption 57,71.

Classification rates and figures of merit based on variables reduction and selection followed by SVM were higher. Moreover, a set of lower-bias models, and consequently more accurate predictions, was obtained. Variables extraction coupled to SVM was built on the RBF kernel, which allows practically any boundary profile to be obtained. The good performance related to this kernel was improved by the use of PCA, SPA and GA. These models automatically optimize the combinations of the C and σ parameters and the number of support vectors.
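To make the tuning cost discussed above more concrete, the following minimal sketch writes out an RBF kernel and counts the parameter combinations that a plain grid search would visit. It assumes the common parameterization K(x, y) = exp(-||x - y||² / (2σ²)); the exact kernel form, the grid values and the names `rbf_kernel`, `C_GRID` and `SIGMA_GRID` are illustrative assumptions, not the settings used in this work.

```python
import math
from itertools import product


def rbf_kernel(x, y, sigma):
    """RBF kernel exp(-||x - y||^2 / (2 * sigma^2)) for two equal-length vectors."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


# Hypothetical candidate grids (illustrative values only).
C_GRID = [0.1, 1.0, 10.0, 100.0]
SIGMA_GRID = [0.01, 0.1, 1.0, 10.0]

# RBF-SVM: every (C, sigma) pair has to be trained and evaluated.
rbf_candidates = list(product(C_GRID, SIGMA_GRID))
# Kernels tuned through C only need one model per C value.
c_only_candidates = list(C_GRID)

if __name__ == "__main__":
    a = [0.10, 0.20, 0.30]  # toy vectors, not real absorbance values
    b = [0.12, 0.18, 0.33]
    for sigma in SIGMA_GRID:
        print(f"sigma={sigma}: K(a, b)={rbf_kernel(a, b, sigma):.4f}")
    print(len(rbf_candidates), "RBF candidates vs", len(c_only_candidates), "C-only candidates")
```

Under this parameterization, a small σ makes the kernel value fall off quickly with distance, which tends to give a more complex boundary, while a large σ smooths it. The grid for the RBF kernel grows as the product of the two lists, which is one way to see why its optimization takes longer than tuning C alone.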
Furthermore, GA-SVM has the advantage of also being an optimization method, which means that it can eliminate potential interferents and improve the signal-to-noise ratio of the selected variables. SPA-SVM, in turn, can solve collinearity and redundancy problems, while PCA-SVM aims to preserve the total variance, capture as much useful information as possible and remove redundancy. In general, GA-SVM and SPA-SVM had the disadvantages of high computational requirements and longer run times. In addition, the stochastic character of GA-SVM may lead to realization-dependence and reproducibility problems, while SPA-SVM requires pre-selection steps related to the number of analytes and variables. The drawback of PCA-SVM was the need to optimize the number of PCs, which takes time. However, it is worth mentioning that the time consumed by all of these approaches was much lower than that of the SVM models alone.

Comparing Multivariate Classification and Traditional Methods applied to prostate cancer screening

Figure 7 shows the sensitivity and specificity values of the multivariate classification and of the traditional methods applied to prostate cancer screening and categorization. Sensitivity (Fig. 7A) and specificity (Fig. 7B) values derived from GA-SVM, Digital Rectal Examination (DRE), measurement of serum Prostate Specific Antigen (PSA) levels, Transrectal Ultrasound (TRUS)-guided biopsy and the Gleason grading system (GS) for Low and High grade classification are presented.

Figure 7 – Multivariate classification and traditional methods applied to prostate cancer categorization and screening. (A) Sensitivity (SENS) (%) and (B) specificity (SPEC) (%) values derived from Digital Rectal Examination (DRE), measurement of serum Prostate Specific Antigen (PSA) levels, Transrectal Ultrasound (TRUS)-guided biopsy, Gleason grading system (GS) and GA-SVM for Low and High grade.
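The sensitivity and specificity values compared in Figure 7 follow from the confusion-matrix counts (TP, TN, FP, FN) listed in the abbreviations. The sketch below computes them together with the other figures of merit used in this work, assuming the standard textbook definitions. The function names and the counts in the example are hypothetical and are not data from this study.

```python
def _ratio(num, den):
    """Divide num by den, returning NaN when the denominator is zero."""
    return num / den if den else float("nan")


def figures_of_merit(tp, tn, fp, fn):
    """Compute figures of merit from confusion-matrix counts.

    SENS, SPEC, PPV, NPV and YOU are returned in percent;
    LR+ and LR- are plain ratios (NaN when undefined).
    """
    sens = _ratio(tp, tp + fn)  # fraction of positives correctly detected
    spec = _ratio(tn, tn + fp)  # fraction of negatives correctly rejected
    return {
        "SENS": 100 * sens,
        "SPEC": 100 * spec,
        "PPV": 100 * _ratio(tp, tp + fp),   # positive predictive value
        "NPV": 100 * _ratio(tn, tn + fn),   # negative predictive value
        "YOU": 100 * (sens + spec - 1),     # Youden index
        "LR+": _ratio(sens, 1 - spec),      # positive likelihood ratio
        "LR-": _ratio(1 - sens, spec),      # negative likelihood ratio
    }


if __name__ == "__main__":
    # Hypothetical counts for a 20-sample test set (not data from this study).
    for name, value in figures_of_merit(tp=6, tn=6, fp=4, fn=4).items():
        print(f"{name}: {value:.2f}")
```

Returning NaN for undefined ratios (for example LR+ when specificity is 100%) is a design choice of this sketch; a report could equally show such cases as "not defined".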
As shown in Figure 7A, the sensitivity of GA-SVM classification was higher for Low grade than for High grade, and higher than that of the traditional methods for both grades. The specificity of GA-SVM classification (Fig. 7B) was the same for Low and High grades. However, compared to the traditional methods, GA-SVM specificity was higher only for Low grade; the traditional methods had higher specificity for High grade. These GA-SVM results support the initial idea of this work, which was to classify and detect spectral differences between early and advanced stages of prostate cancer, particularly from a screening perspective. Some studies 71,77–81 have also pointed to the high performance of variables reduction and selection coupled to SVM in the cancer classification field. In fact, a study by Baker et al. 81 confirmed the success of GA-SVM cancer classification and suggested that the academic community include this algorithm in a standard list of options, since this method provides an optimum classification.

5. CONCLUSIONS

The results showed that the combination of FT-MIR with multivariate classification performed by variables reduction and selection methods followed by support vector machine analysis can be successfully used for the detection and differentiation of Low and High grades of prostate cancer, with higher sensitivity and specificity values. The work showed that the use of variables reduction and selection methods followed by support vector machine can reduce drawbacks of independent SVM analysis, such as high time consumption in pre-processing and parameter optimization. The use of variables reduction and selection methods also makes the SVM classification more interpretable. The models applied to the MIR spectral data derived from prostate cancer samples selected the bands that are responsible for the separation between the Low and High grade spectral datasets.
The High grade spectral dataset showed higher intensity than the Low grade dataset. The potential biomarkers were the amide I, II and III and protein regions (≈ 1,400-1,585 cm⁻¹), followed by the DNA/RNA (O–P–O symmetric stretch) (≈ 1,080 cm⁻¹) and DNA (O–P–O asymmetric stretch) (≈ 1,230 cm⁻¹) regions, the RNA ribose and DNA (C–O stretching) region (≈ 1,120-1,180 cm⁻¹), the glycogen (C–O–H bend) region (≈ 1,030 cm⁻¹) and the protein phosphorylation region (≈ 970 cm⁻¹). It was demonstrated that the combination of FT-MIR data and GA-SVM (the best approach) may work as a complementary and/or alternative tool for prostate cancer screening and classification, with higher sensitivity (≈ 100%) and specificity (≈ 80%) than traditional methods of diagnosis, particularly in early stages. Thus, the potential diagnostic tool proposed in this paper is a less time-consuming and observer-independent methodology that not only may diagnose prostate cancer in tissue samples with high accuracy, based on spectral differences that do not suffer from intra- and inter-observer variability, but, mainly, can detect early stages apparently better than traditional methods. This may lead to early detection and, consequently, to less aggressive and cheaper treatments, better prognosis and decreased mortality rates.

ACKNOWLEDGEMENTS

Laurinda F.S. Siqueira and Camilo L.M. Morais would like to acknowledge the financial support from PPGQ/UFRN/CAPES and IFMA. K.M.G. Lima acknowledges CNPq (Grant 305962/2014-4) for financial support. The authors would like to acknowledge Dr. Raimundo F. Araújo Júnior from the Department of Morphology, Dr. Aurigena Antunes de Araújo from the Department of Biophysics and Pharmacology and Mr. Godoy from Bruker Inc. for their collaboration.
ABBREVIATIONS

DRE: Digital Rectal Examination
EMSC: Extended Multiplicative Scatter Correction
FFPE: Formalin-Fixation and Paraffin-Embedding
FN: False negative
FP: False positive
FTIR: Fourier-Transform Infrared Spectroscopy
FT-MIR: Fourier-Transform Mid-Infrared Spectroscopy
GA: Genetic Algorithm
GS: Gleason System
IR: Infrared
KS: Kennard-Stone algorithm
LDA: Linear Discriminant Analysis
LR-: Negative Likelihood Ratio
LR+: Positive Likelihood Ratio
MIR: Mid-Infrared
NPV: Negative Predictive Value
PCA: Principal Components Analysis
PCs: Principal Components
PPV: Positive Predictive Value
PSA: Prostate Specific Antigen
QDA: Quadratic Discriminant Analysis
RBF: Radial Basis Function
SENS: Sensitivity
SPA: Successive Projections Algorithm
SPEC: Specificity
SVM: Support Vector Machine
TN: True negative
TP: True positive
TRUS: Transrectal Ultrasound
YOU: Youden index
νasPO₂⁻: Asymmetric phosphate stretching vibrations
νsPO₂⁻: Symmetric phosphate stretching vibrations
LowCal: Low grade calibration set
LowVal: Low grade validation set
LowPred: Low grade test set
HighCal: High grade calibration set
HighVal: High grade validation set
HighPred: High grade test set

REFERENCES

1. Schroder, F. H. et al. Evaluation of the Digital Rectal Examination as a Screening Test for Prostate Cancer. J. Natl. Cancer Inst. 90, 1817–1823 (1998).
2. Wilbur, J. Prostate cancer screening: The continuing controversy. Am. Fam. Physician 78, 1377–1384 (2008).
3. Kaffenberger, S. D. & Penson, D. F. The politics of prostate cancer screening. Urol. Clin. North Am. 41, 249–255 (2014).
4. O.W., B. & D.P., A. Screening for prostate cancer. CA Cancer J. Clin. 59, 264–273 (2009).
5. Hoffman, R. M. Screening for Prostate Cancer. N. Engl. J. Med. 365, 2013–2019 (2011).
6. Wolf, A. M. et al. American Cancer Society Guideline for the Early Detection of Prostate Cancer Update 2010. Cancer Journal, The 60, 70–98 (2010).
7. Misra-Hebert, A. D. & Kattan, M. W.
Prostate Cancer Screening: A Brief Tool to Incorporate Patient Preferences in a Clinical Encounter. Front. Oncol. 6, 4–7 (2016).
8. Alberts, A. R., Schoots, I. G. & Roobol, M. J. Prostate-specific antigen-based prostate cancer screening: Past and future. Int. J. Urol. 22, 524–532 (2015).
9. Sudarshan, V. K. et al. Application of wavelet techniques for cancer diagnosis using ultrasound images: A Review. Comput. Biol. Med. 69, 97–111 (2016).
10. Iczkowski, K. A. & Lucia, M. S. Current perspectives on Gleason grading of prostate cancer. Curr. Urol. Rep. 12, 216–222 (2011).
11. Humphrey, P. A. Gleason grading and prognostic factors in carcinoma of the prostate. Mod. Pathol. 17, 292–306 (2004).
12. Chen, N. & Zhou, Q. The evolving Gleason grading system. Chin. J. Cancer Res. 28, 58–64 (2016).
13. Lattouf, J. B. & Saad, F. Gleason score on biopsy: Is it reliable for predicting the final grade on pathology? BJU Int. 90, 694–698 (2002).
14. Sattlecker, M., Baker, R., Stone, N. & Bessant, C. Support vector machine ensembles for breast cancer type prediction from mid-FTIR micro-calcification spectra. Chemom. Intell. Lab. Syst. 107, 363–370 (2011).
15. Khanmohammadi, M., Ghasemi, K. & Garmarudi, A. B. Genetic algorithm spectral feature selection coupled with quadratic discriminant analysis for ATR-FTIR spectrometric diagnosis of basal cell carcinoma via blood sample analysis. RSC Adv. 4, 41484–41490 (2014).
16. Theophilou, G., Lima, K. M. G., Briggs, M. & Martin-Hirsch, P. L. A biospectroscopic analysis of human prostate tissue obtained from different time periods points to a trans-generational alteration in spectral phenotype. Nat. Publ. Gr. 1–13 (2015). doi:10.1038/srep13465
17. Theophilou, G., Lima, K. M. G., Martin-Hirsch, P. L., Stringfellow, H. F. & Martin, F. L. ATR-FTIR spectroscopy coupled with chemometric analysis discriminates normal, borderline and malignant ovarian tissue: classifying subtypes of human cancer. Analyst 585–594 (2015).
doi:10.1039/c5an00939a
18. Kelly, J. G. et al. Biospectroscopy to metabolically profile biomolecular structure: A multistage approach linking computational analysis with biomarkers. J. Proteome Res. 10, 1437–1448 (2011).
19. Zaera, F. New advances in the use of infrared absorption spectroscopy for the characterization of heterogeneous catalytic reactions. Chem. Soc. Rev. 43, 7624–7663 (2014).
20. Ellis, D. I., Dunn, W. B., Griffin, J. L., Allwood, J. W. & Goodacre, R. Metabolic fingerprinting as a diagnostic tool. Pharmacogenomics 8, 1243–1266 (2007).
21. Siqueira, L. F. S. & Lima, K. M. G. A decade (2004–2014) of FTIR prostate cancer spectroscopy studies: An overview of recent advancements. Trends Anal. Chem. 82, 208–221 (2016).
22. Siqueira, L. F. S. & Lima, K. M. G. MIR-biospectroscopy coupled with chemometrics in cancer studies. Analyst 4833–4847 (2016). doi:10.1039/C6AN01247G
23. Cortes, C. & Vapnik, V. Support-Vector Networks. Mach. Learn. 20, 273–297 (1995).
24. Dietrich, R., Opper, M. & Sompolinsky, H. Statistical Mechanics of Support Vector Networks. Phys. Rev. Lett. 82, 2975–2978 (1999).
25. Devos, O., Downey, G. & Duponchel, L. Simultaneous data pre-processing and SVM classification model selection based on a parallel genetic algorithm applied to spectroscopic data of olive oils. Food Chem. 148, 124–130 (2014).
26. Chen, D., Tian, Y. & Liu, X. Structural nonparallel support vector machine for pattern recognition. Pattern Recognit. 60, 296–305 (2016).
27. Carrizosa, E., Nogales-Gómez, A. & Romero Morales, D. Clustering categories in support vector machines. Omega (United Kingdom) 66, 28–37 (2014).
28. Carrizosa, E., Nogales-Gómez, A. & Romero Morales, D. Heuristic approaches for support vector machines with the ramp loss. Optim. Lett. 8, 1125–1135 (2014).
29. Carrizosa, E. & Romero Morales, D. Supervised classification and mathematical optimization. Comput. Oper. Res. 40, 150–165 (2013).
30.
Carrizosa, E., Martin-Barragan, B. & Morales, D. R. Multi-group support vector machines with measurement costs: A biobjective approach. Discret. Appl. Math. 156, 950–966 (2008).
31. Carrizosa, E., Martin-Barragan, B. & Morales, D. R. Binarized support vector machines. INFORMS J. Comput. 22, 154–167 (2010).
32. Carrizosa, E., Martín-Barragán, B. & Morales, D. R. Detecting relevant variables and interactions in supervised classification. Eur. J. Oper. Res. 213, 260–269 (2011).
33. Wang, X., Huang, F. & Cheng, Y. Computational performance optimization of support vector machine based on support vectors. Neurocomputing 211, 1–6 (2016).
34. Chen, A. H. & Lin, C. H. A novel support vector sampling technique to improve classification accuracy and to identify key genes of leukaemia and prostate cancers. Expert Syst. Appl. 38, 3209–3219 (2011).
35. Zhang, H., Wang, H., Dai, Z., Chen, M. & Yuan, Z. Improving accuracy for cancer classification with a new algorithm for genes selection. BMC Bioinformatics 13, 298 (2012).
36. Abdi, H. & Williams, L. J. Principal component analysis. Wiley Interdiscip. Rev. Comput. Stat. 2, 433–459 (2010).
37. Bro, R. & Smilde, A. K. Principal component analysis. Anal. Methods 6, 2812 (2014).
38. Fernandez, D. C., Bhargava, R., Hewitt, S. M. & Levin, I. W. Infrared spectroscopic imaging for histopathologic recognition. Nat. Biotechnol. 23, 469–474 (2005).
39. Harvey, T. J. et al. Discrimination of prostate cancer cells by reflection mode FTIR photoacoustic spectroscopy. Analyst 132, 292–295 (2007).
40. Harvey, T. J. et al. Factors influencing the discrimination and classification of prostate cancer cell lines by FTIR microspectroscopy. Analyst 134, 1083–1091 (2009).
41. Baker, M. J. et al. Using Fourier transform IR spectroscopy to analyze biological materials. Nat. Protoc. 9, 1771–91 (2014).
42. Soares, S. F. C. et al. A modification of the successive projections algorithm for spectral variable selection in the presence of unknown interferents.
Anal. Chim. Acta 689, 22–28 (2011).
43. Purandare, N. C. et al. Infrared spectroscopy with multivariate analysis segregates low-grade cervical cytology based on likelihood to regress, remain static or progress. Anal. Methods 6, 4576–4584 (2014).
44. Beleites, C. et al. Classification of human gliomas by infrared imaging spectroscopy and chemometric image processing. Vib. Spectrosc. 38, 143–149 (2005).
45. Niazi, A. & Leardi, R. Genetic algorithms in chemometrics. J. Chemom. 26, 345–351 (2012).
46. Latif, A. H. M. M. & Brunner, E. A genetic algorithm for designing microarray experiments. Comput. Stat. 31, 409–424 (2016).
47. Hughes, C., Gaunt, L., Brown, M., Clarke, N. W. & Gardner, P. Assessment of paraffin removal from prostate FFPE sections using transmission mode FTIR-FPA imaging. Anal. Methods 6, 1028–1035 (2014).
48. Afseth, N. K. & Kohler, A. Extended multiplicative signal correction in vibrational spectroscopy, a tutorial. Chemom. Intell. Lab. Syst. 117, 92–99 (2012).
49. Martens, H. & Stark, E. Extended multiplicative signal correction and spectral interference subtraction: New preprocessing methods for near infrared spectroscopy. J. Pharm. Biomed. Anal. 9, 625–635 (1991).
50. Zimmermann, B. & Kohler, A. Optimizing Savitzky-Golay parameters for improving spectral resolution and quantification in infrared spectroscopy. Appl. Spectrosc. 67, 892–902 (2013).
51. Savitzky, A. & Golay, M. J. E. Smoothing and Differentiation of Data by Simplified Least Squares Procedures. Anal. Chem. 36, 1627–1639 (1964).
52. Trevisan, J., Angelov, P. P., Carmichael, P. L., Scott, A. D. & Martin, F. L. Extracting biological information with computational analysis of Fourier-transform infrared (FTIR) biospectroscopy datasets: current practices to future perspectives. Analyst 137, 3202–15 (2012).
53. Kennard, R. & Stone, L. A. Computer aided design of experiments. Technometrics 11, 137–148 (1969).
54. Padilha, C. A. D. A., Barone, D. A. C. & Neto, A. D. D.
A multi-level approach using genetic algorithms in an ensemble of Least Squares Support Vector Machines. Knowledge-Based Syst. 106, 85–95 (2016).
55. Baia, T. C., Gama, R. A., Silva de Lima, L. A. & Lima, K. M. G. FTIR microspectroscopy coupled with variable selection methods for the identification of flunitrazepam in necrophagous flies. Anal. Methods 8, 968–972 (2016).
56. Fisher, S. E. Chapter 1: Vibrational Spectroscopy: What Does the Clinician Need? Biomed. Appl. Synchrotron Infrared Microspectrosc. 1–28 (2011).
57. Dixon, S. J. & Brereton, R. G. Comparison of performance of five common classifiers represented as boundary methods: Euclidean Distance to Centroids, Linear Discriminant Analysis, Quadratic Discriminant Analysis, Learning Vector Quantization and Support Vector Machines, as dependent on. Chemom. Intell. Lab. Syst. 95, 1–17 (2009).
58. Mohapatra, P., Chakravarty, S. & Dash, P. K. Microarray medical data classification using kernel ridge regression and modified cat swarm optimization based gene selection system. Swarm Evol. Comput. 28, 144–160 (2016).
59. Peng, S. et al. Molecular classification of cancer types from microarray data using the combination of genetic algorithms and support vector machines. FEBS Lett. 555, 358–362 (2003).
60. Chen, L., Xuan, J., Riggins, R. B., Clarke, R. & Wang, Y. Identifying cancer biomarkers by network-constrained support vector machines. BMC Syst. Biol. 5, 1–20 (2011).
61. Nayyeri, M. & Sharifi Noghabi, H. Cancer classification by correntropy-based sparse compact incremental learning machine. Gene Reports 3, 31–38 (2016).
62. Devi Arockia Vanitha, C., Devaraj, D. & Venkatesulu, M. Gene expression data classification using Support Vector Machine and mutual information-based gene selection. Procedia Comput. Sci. 47, 13–21 (2014).
63. Ali, S., Veltri, R., Epstein, J. I., Christudass, C. & Madabhushi, A.
Selective invocation of shape priors for deformable segmentation and morphologic classification of prostate cancer tissue microarrays. Comput. Med. Imaging Graph. 41, 3–13 (2015).
64. Sun, T. et al. Comparative evaluation of support vector machines for computer aided diagnosis of lung cancer in CT based on a multi-dimensional data set. Comput. Methods Programs Biomed. 111, 519–524 (2013).
65. Ford, W. & Land, W. A Latent Space Support Vector Machine (LSSVM) Model for Cancer Prognosis. Procedia Comput. Sci. 36, 470–475 (2014).
66. Cao, J., Zhang, L., Wang, B., Li, F. & Yang, J. A fast gene selection method for multi-cancer classification using multiple support vector data description. J. Biomed. Inform. 53, 381–389 (2015).
67. Çınar, M., Engin, M., Engin, E. Z. & Ziya Ateşçi, Y. Early prostate cancer diagnosis by using artificial neural networks and support vector machines. Expert Syst. Appl. 36, 6357–6361 (2009).
68. Zheng, B., Yoon, S. W. & Lam, S. S. Breast cancer diagnosis based on feature extraction using a hybrid of K-means and support vector machine algorithms. Expert Syst. Appl. 41, 1476–1482 (2014).
69. Gertych, A. et al. Machine learning approaches to analyze histological images of tissues from radical prostatectomies. Comput. Med. Imaging Graph. 46, 197–208 (2015).
70. Wang, H. & Huang, G. Application of support vector machine in cancer diagnosis. Med. Oncol. 28, (2011).
71. Kelly, J. G. et al. Robust classification of low-grade cervical cytology following analysis with ATR-FTIR spectroscopy and subsequent application of self-learning classifier eClass. Anal. Bioanal. Chem. 398, 2191–2201 (2010).
72. Hughes, C. et al. FTIR microspectroscopy of selected rare diverse sub-variants of carcinoma of the urinary bladder. J. Biophotonics 6, 73–87 (2013).
73. Cheng, C. G., Tian, Y. M. & Jin, W. Y. A study on the early detection of colon cancer using the methods of wavelet feature extraction and SVM classifications of FTIR. Spectroscopy 22, 397–404 (2008).
74.
Hands, J. R. et al. Attenuated Total Reflection Fourier Transform Infrared (ATR-FTIR) spectral discrimination of brain tumour severity from serum samples. J. Biophotonics 7, 189–199 (2014).
75. Tian, P. et al. Intraoperative diagnosis of benign and malignant breast tissues by Fourier transform infrared spectroscopy and support vector machine classification. Int. J. Clin. Exp. Med. 8, 972–981 (2015).
76. Sattlecker, M., Stone, N. & Bessant, C. Current trends in machine-learning methods applied to spectroscopic cancer diagnosis. TrAC - Trends Anal. Chem. 59, 17–25 (2014).
77. Bergner, N. et al. Tumor margin identification and prediction of the primary tumor from brain metastases using FTIR imaging and support vector machines. Analyst 138, 3983–90 (2013).
78. Banerjee, S. et al. Fourier-transform-infrared-spectroscopy based spectral-biomarker selection towards optimum diagnostic differentiation of oral leukoplakia and cancer. Anal. Bioanal. Chem. 407, 7935–7943 (2015).
79. Lee, S. et al. Improving the classification accuracy for IR spectroscopic diagnosis of stomach and colon malignancy using non-linear spectral feature extraction methods. Analyst 138, 4076–82 (2013).
80. Zhang, X. et al. Profiling serologic biomarkers in cirrhotic patients via high-throughput Fourier transform infrared spectroscopy: Toward a new diagnostic tool of hepatocellular carcinoma. Transl. Res. 162, 279–286 (2013).
81. Baker, M. J. et al. An investigation of the RWPE prostate derived family of cell lines using FTIR spectroscopy. Analyst 135, 887–894 (2010).
82. Bishop, C. M. Pattern Recognition and Machine Learning. Journal of Chemical Information and Modeling 53, (2013).
83. Valentini, G. & Dietterich, T. G. Bias-Variance Analysis and Ensembles of SVM. J. Mach. Learn. Res. 5, 222–231 (2002).
84. Patel, I. I. & Martin, F. L. Discrimination of zone-specific spectral signatures in normal human prostate using Raman spectroscopy. Analyst 135, 3060–3069 (2010).
85. Patel, I. I. et al.
Segregation of human prostate tissues classified high-risk (UK) versus low-risk (India) for adenocarcinoma using Fourier-transform infrared or Raman microspectroscopy coupled with discriminant analysis. Anal. Bioanal. Chem. 401, 969–982 (2011).
86. Baker, M. J. et al. FTIR-based spectroscopic analysis in the identification of clinically aggressive prostate cancer. Br. J. Cancer 99, 1859–1866 (2008).
87. Gazi, E. et al. A Correlation of FTIR Spectra Derived from Prostate Cancer Biopsies with Gleason Grade and Tumour Stage. Eur. Urol. 50, 750–761 (2006).

CHAPTER 7 – CONCLUSIONS AND PERSPECTIVES

1. MULTIVARIATE CLASSIFICATION AND FT-MIR FOR PROSTATE CANCER SCREENING
2. THEORETICAL AND PRACTICAL SUPPORT
3. PERSPECTIVES AND RECOMMENDATIONS

1. MULTIVARIATE CLASSIFICATION AND FT-MIR FOR PROSTATE CANCER SCREENING

According to the results of the studies, it is possible to consider:

1. Prostate cancer is the second most frequent cancer in the world male population, with high mortality rates related to delayed diagnosis by traditional methods and belated treatment. These facts justified the proposal of multivariate classification combined with Fourier-Transform Mid-Infrared Spectroscopy (FT-MIR) as an accurate, fast and low-cost methodology for prostate cancer classification, particularly for early detection.

2. Prostate tissue samples taken from formalin-fixed, dehydrated and paraffin-embedded (FFPE) pathology blocks, previously classified as Gleason II, III and IV by pathologists, did not show any diathermy effect; no tissue degradation, no significant changes due to fixation or paraffin embedding and no contributions of paraffin to the FT-MIR spectra were observed. However, thorough treatment and preparation of these samples were required, owing to their notorious fragility and susceptibility to damage and degradation.

3.
The proposed multivariate classification models were applied to MIR transmission spectra obtained from prostate cancer tissue samples. They successfully differentiated Gleason II, III and IV from each other, as well as Low grade (Gleason II) from High grade (Gleason III+IV), with high classification rates and figures of merit, notably sensitivity and specificity. A performance comparison of all models is summarized in Table 7.1 below (these results refer to the general classification of both Low and High grades).

4. GA-QDA was the best multivariate model applied, QDA was the best classifier, and GA-LDA, GA-QDA and GA-SVM were the best classification approaches, according to classification rates and figures of merit. In general, the variables selection methods (GA and SPA, particularly the former) performed better than the variables reduction method (PCA), and the nonlinear classifiers (QDA and SVM, especially the latter) were better than the linear classifier (LDA). The performance improvement provided by the use of variables reduction and selection methods was remarkable.

Table 7.1 – Multivariate classification applied to Low and High grade prostate cancer categorization from FT-MIR spectral data. Figures of merit derived from variables reduction and selection methods coupled to LDA, QDA and SVM classifiers (SENS: Sensitivity; SPEC: Specificity; PPV: Positive Predictive Value; NPV: Negative Predictive Value; YOU: Youden Index; LR+: Positive Likelihood Ratio; LR-: Negative Likelihood Ratio). The best models, based on higher figures of merit, are marked (dark grey: best multivariate classification model; grey: best classification approaches).
| | PCA-LDA | PCA-QDA | PCA-SVM¹ | SPA-LDA¹ | SPA-QDA¹ | SPA-SVM¹ | GA-LDA | GA-QDA | GA-SVM¹ |
|---|---|---|---|---|---|---|---|---|---|
| Training set CC (%) | 87.8 | 93.9 | 100 | 93.8 | 83.5 | 75.65 | 96.9 | 100 | 100 |
| Test set CC (%) | 60 | 66.7 | 80 | 66.7 | 80 | 70 | 83.3 | 96.9 | 95 |
| SENS (%) | 60 | 66.7 | 72.74 | 66.7 | 83.3 | 80 | 71.4 | 75 | 90 |
| SPEC (%) | 60 | 66.7 | 77.78 | 66.7 | 50 | 71.43 | 80 | 100 | 80 |
| PPV (%) | 60 | 66.7 | 80 | 66.7 | 50 | 70 | 83.4 | 100 | 87.50 |
| NPV (%) | 60 | 66.7 | 70 | 66.7 | 83.3 | 70 | 66.7 | 66.7 | 66.67 |
| YOU | 20 | 33.3 | 50.51 | 33.3 | 20 | 60 | 51.4 | 75 | 67.5 |
| LR+ | 1.5 | 2 | 3.27 | 2 | 1 | 3.25 | 3.6 | 4.2 | 3.57 |
| LR- | 0.7 | 0.5 | 0.35 | 0.5 | 0.3 | 0.34 | 0.3 | 0.25 | 0.36 |

¹Unpublished results

5. The differences in performance between the multivariate classification models can be related to the heterogeneity of the tumor cells present in the tissue samples, considering proliferative, invasive and metastatic features, as well as the variety of differentiation stages. This heterogeneity may not be completely explained by models which consider the total variance present in the original variables dataset, such as PCA models, or by models which assume an equal variance-covariance matrix for all classes, such as LDA models.

6. On the other hand, the success of QDA and SVM as classifiers, mostly the former, can be related to the fact that these methods consider separate variance-covariance matrices for each dataset/class when establishing the decision boundary. These variance-covariance differences are taken into account in the classification by each method, which implies that the classification models may fit the data more closely. In addition, the success of GA as a variables selection method may be associated with the fact that the selection and optimization steps occur concomitantly; in other words, while the variables are selected randomly, the signal/noise ratio is adjusted to its best value and the best variables are obtained, resulting in a classification with low test and training errors.

7.
Spectral datasets showed absorbance intensities clearly larger in High grade than in Low grade, according to the wavenumber variables indicated by the multivariate classification models. Spectral differences were most apparent in bands attributed to (1) the amide I, II and III and protein regions (≈ 1,400-1,585 cm⁻¹), which can be attributed to deformation, stretching and bending modes of C–N, C=O, C–O, C–H and N–H of amides I, II and III and proteins, promoted by alterations in amino acid conformation and by the reduction in the intermolecular aggregation of tissue proteins stimulated by cancer progression; (2) the DNA/RNA (O–P–O symmetric stretch) (≈ 1,080 cm⁻¹) and DNA (O–P–O asymmetric stretch) (≈ 1,230 cm⁻¹) regions, the RNA ribose and DNA (C–O stretching) region (≈ 1,120-1,180 cm⁻¹) and the glycogen (C–O–H bend) region (≈ 1,030 cm⁻¹), which are widely associated with intermolecular differentiations and changes in RNA/DNA conformation and metabolism promoted by cancer; and, to a lesser extent, (3) the protein phosphorylation region (≈ 970 cm⁻¹), which can imply alterations in the processes of cell proliferation, differentiation and progression caused by advanced cancer. Thus, all these spectral differences are consistent with the progressive metabolic and biochemical alterations promoted by cancer.

8. In comparison to traditional methods, multivariate classification combined with FT-MIR potentially discriminated prostate cancer stages in tissue samples with better performance, based on spectral differences that do not suffer from intra- and inter-observer variability or from observer-dependence, with less time consumption, a simple procedure and low operational cost. Moreover, multivariate models showed better performance in Low grade classification than traditional methods, which gives a perspective of early detection and consequently may lead to less aggressive and cheaper treatments, better prognosis and decreased mortality rates.

2.
THEORETICAL AND PRACTICAL SUPPORT

The application of multivariate classification coupled to FT-MIR to differentiate prostate cancer stages led to the development of theoretical and practical frameworks for cancer research:

• The theoretical support provided by chemometrics applications in cancer studies (Chapter 2). This review gives a general idea of the applications of multivariate algorithms to spectral data derived from biological samples, and helps to consider the multiple options of preprocessing tests, multivariate algorithms and applications of multivariate analysis.

• The theoretical support provided by FT-MIR applications in prostate cancer studies (Chapter 3). This overview summarizes the tendencies in MIR spectroscopic applications related to prostate cancer, and presents the current trends in sample preparation and treatment, instrumental requirements and computational analysis applied in prostate cancer studies.

• The practical frameworks based on multivariate algorithms applied to FT-MIR prostate cancer classification (Chapters 4, 5 and 6). These frameworks provide an experimental basis for multivariate analysis applications in FT-MIR biomedical data, which may potentially be reproduced and extended to other types of cancer and to distinct sample formats; allow multivariate classification models to be applied and compared; reveal the strengths and drawbacks of the sample preparation and treatment, instrumental analysis and computational analysis; provide potential spectral biomarkers; and allow multivariate classification coupled to FT-MIR to be considered as a potential complement and/or alternative to traditional methods of cancer classification and screening.

3.
PERSPECTIVES AND RECOMMENDATIONS

Based on the multivariate classification applied to FT-MIR biomedical spectral data and on the potential of our studies, the use of more samples, together with improvements in experimental design and in instrumental analysis (spectral acquisition from the same sample datasets by different instruments), is extremely important, mainly to generate a spectral biomarker database and analysis protocols. Likewise, the use of other sample formats, especially biofluids, is very promising for noninvasive diagnosis and screening of prostate cancer and, as shown, this field is still little investigated. Relations between prostate cancer spectral data, socioeconomic data and comorbidity data are also little investigated and deserve attention.

Multivariate classification combined with FT-MIR is expected to be extended to specialized hospitals and laboratories, to other localities and, mainly, to other types of cancer and diseases, considering the potential for early detection by this methodology and the beneficial socioeconomic impacts of the low cost and short analysis time of these methods and of the related less aggressive and shorter cancer treatments. However, there is clearly a long road ahead and solid resistance to the acceptance of these methods by the medical community, which will only be overcome with more and more hard work.