A κ-statistical analysis of the Y-chromosome

An analysis of the coding sequence of the Y-chromosome (Homo sapiens) has been performed within the formalism of κ-statistics, which naturally encompasses long-range correlations. In this formalism, the entropy is written as a function of κ (called the deformation, or entropic, parameter). The κ-entropy has been linked directly to dimensional parameters defined for the DNA chain associated with chromosome Y. Our analysis indicates that there are certain regions of chromosome Y which exhibit linearity between entropy and sample size for some particular values of κ, implying that, in these regions, the information contained in the DNA increases linearly with the sample size and also points to an internal order.

On the other hand, the Tsallis framework has been a useful tool in the study of systems which present long-range correlations between their constituent parts [13]. In particular, there are theoretical efforts to study the DNA molecule through the Tsallis approach; e.g., the size distributions of non-coding DNA (including introns and intergenic regions) in human chromosomes have been studied by using the q-exponential distribution that emerges from the Tsallis framework [14,15]. Another effort has used the Tsallis statistics to describe the behavior of the electronic specific heat at low temperature by considering a quasi-periodic model for the DNA molecule, as well as parts of the real genomic DNA sequence [16,17]. More recently, using entropy analysis and phase-plane concepts, the DNA information has been successfully investigated in the non-extensive framework [18]. Here, it is worth asking: i) Is there another framework, beyond the Tsallis one, useful to estimate the role of long-range correlations in the DNA molecule? ii) What is the entropic effect on the geometric organisation of chromosome Y?
To address such issues, we use the generalized statistics of Kaniadakis [19-21], which is characterized by an additive entropy and by a power-law behaviour of the stationary distribution of the random variables of the system. We investigate the κ-entropic effect on the geometry of chromosome Y, as well as the role of the long-range correlations in the DNA molecule, which naturally emerge in the context of generalized statistics [13,19,20]. Specifically, we will use Kaniadakis' definition of an adapted entropy together with the concept of block entropy to describe the relationships between a set of κ parameters and the linear dimensions of chromosome Y. Here, we follow the approach proposed in [22], whose author demonstrated analytically that the static structures of deterministic Cantor sets with fractal dimension $d_f$, calculated in the framework of Tsallis' statistics, are characterized by a non-extensive q-exponent, i.e., $q = 1/(d_f - d)$. In contrast, we will approach the problem within the ambit of the generalized statistics of Kaniadakis [19-21].
This paper is organized as follows: the next section discusses the Kaniadakis framework. We then give a brief description of the Y-chromosome and outline the properties that are important to the present study, after which we present and discuss our results. Finally, in the conclusions section we summarize our main findings.
Kaniadakis framework. - On kinetic grounds, a system of particles in a non-linear regime can be governed by a principle called the Kinetic Interaction Principle (KIP), as proposed by Kaniadakis [19]. According to the KIP, it is always possible to obtain a stationary statistical distribution consistent with the system constraints, which "imposes" a specific form for the (generalized) entropy, resulting in stability and entropic maximization. Mathematically speaking, the KIP is expressed by eq. (1), in which $f$ and $f'$ represent two-particle distribution functions, respectively before and after collisions (typically, binary collisions) between the system's constituents. The arbitrary function $\gamma$ is a factor closely related to $f$ and $f'$. Also, $\kappa(f)$ is a positive, real-valued function. The analytical expression for the KIP, eq. (1), contains all the information one needs to describe the kinetics of a system; that is, it is enough to provide the form of the function $\gamma$. Thus, for example, if we choose the function $\gamma$ to be the product

$$\gamma(f, f') = f\,(1 - f'), \qquad (2)$$

we recover the expression that takes into account the Pauli exclusion principle [23]. Also, eq. (1) acts as a connection that enables us to propose new distribution functions. Following this direction, and admitting that the entropy is associated with a discrete set of microstates with probabilities $\{p_i\}$, one can define the $S_\kappa$ entropy in the form [19]

$$S_\kappa = -\sum_i \frac{p_i^{1+\kappa} - p_i^{1-\kappa}}{2\kappa}, \qquad (3)$$

where κ is the so-called entropic parameter, contextualized below. Here and hereafter, the Boltzmann constant is set equal to unity for the sake of simplicity.
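The discrete κ-entropy can be evaluated numerically in a few lines. The following is a minimal sketch, assuming the standard Kaniadakis form $S_\kappa = -\sum_i (p_i^{1+\kappa} - p_i^{1-\kappa})/(2\kappa)$; the function name `kappa_entropy` and the test distribution are illustrative choices, not taken from the original work.

```python
import math

def kappa_entropy(probs, kappa):
    """Discrete kappa-entropy (Boltzmann constant set to 1).

    Assumes the standard Kaniadakis form
    S_kappa = -sum_i (p_i**(1+kappa) - p_i**(1-kappa)) / (2*kappa).
    For kappa == 0 the Boltzmann-Gibbs-Shannon limit is used.
    Zero probabilities are skipped.
    """
    if kappa == 0.0:
        return -sum(p * math.log(p) for p in probs if p > 0.0)
    return -sum(
        (p ** (1.0 + kappa) - p ** (1.0 - kappa)) / (2.0 * kappa)
        for p in probs
        if p > 0.0
    )

# Illustrative distribution: four equiprobable states.
uniform = [0.25, 0.25, 0.25, 0.25]
print(kappa_entropy(uniform, 0.0))    # Shannon value, ln 4
print(kappa_entropy(uniform, 1e-6))   # approaches ln 4 as kappa -> 0
print(kappa_entropy(uniform, 0.5))    # 1.5 for this distribution
```

For W equiprobable states the sum collapses to $(W^{\kappa} - W^{-\kappa})/(2\kappa)$, which gives 1.5 for W = 4 and κ = 0.5 and offers a quick sanity check of the implementation.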
It is important to mention here two properties of $S_\kappa$: when we compose two independent subsystems, i.e., $p_{ij} = p_i \otimes_\kappa p_j$, with $\otimes_\kappa$ being the so-called κ-product, the κ-entropy is in general additive and extensive (for details, see [21]).
From a mathematical point of view, the κ-framework is based on the κ-exponential and the κ-logarithm functions, defined as

$$\exp_\kappa(f) = \left(\sqrt{1 + \kappa^2 f^2} + \kappa f\right)^{1/\kappa}, \qquad \ln_\kappa(f) = \frac{f^{\kappa} - f^{-\kappa}}{2\kappa},$$

with

$$\ln_\kappa\left(\exp_\kappa(f)\right) = \exp_\kappa\left(\ln_\kappa(f)\right) = f.$$

In the case κ → 0, these expressions reduce to the usual exponential and logarithmic functions. For a continuous distribution function $f$, the κ-entropy associated with the κ-framework is given by

$$S_\kappa = -\int d^3p\, f\, \frac{f^{\kappa} - f^{-\kappa}}{2\kappa} = -\int d^3p\, f \ln_\kappa f,$$

which, in the limit κ → 0, fully recovers the standard Boltzmann-Gibbs entropy,

$$S_{\kappa=0} = -\int d^3p\, f \ln f.$$

Here, $p$ and $f$ represent the momentum and its distribution function, respectively. Several physical features of the κ-distribution have also been theoretically investigated as, for instance, the self-consistent relativistic statistical theory [20,21], non-linear kinetics [23], the H-theorem from a generalization of the molecular chaos hypothesis [24,25], the κ-Weibull distribution as one model for extreme-event return intervals in finite-size systems [26], and the reexamination of the blackbody radiation in the context of the κ-framework [27]. Moreover, another investigation has shown that the κ-statistics is able to describe the entropy of a Cantor set [28].
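The two deformed functions are simple to implement and check numerically. The sketch below assumes the standard Kaniadakis definitions, $\exp_\kappa(x) = (\sqrt{1+\kappa^2x^2} + \kappa x)^{1/\kappa}$ and $\ln_\kappa(x) = (x^\kappa - x^{-\kappa})/(2\kappa)$; the function names are illustrative.

```python
import math

def exp_kappa(x, kappa):
    """kappa-exponential; reduces to exp(x) for kappa == 0."""
    if kappa == 0.0:
        return math.exp(x)
    return (math.sqrt(1.0 + kappa ** 2 * x ** 2) + kappa * x) ** (1.0 / kappa)

def ln_kappa(x, kappa):
    """kappa-logarithm for x > 0; reduces to ln(x) for kappa == 0."""
    if kappa == 0.0:
        return math.log(x)
    return (x ** kappa - x ** (-kappa)) / (2.0 * kappa)

# The two functions are mutual inverses for any fixed kappa.
x, kappa = 1.3, 0.4
print(ln_kappa(exp_kappa(x, kappa), kappa))   # recovers x up to rounding

# As kappa -> 0 the ordinary exponential and logarithm are recovered.
print(exp_kappa(1.0, 1e-6), math.e)
print(ln_kappa(2.0, 1e-6), math.log(2.0))
```

Treating κ = 0 as a separate branch avoids the division by zero in the closed forms; for small but non-zero κ the deformed functions already agree closely with the ordinary ones.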
κ-Description of DNA: chromosome Y. - The Y-chromosome comprises approximately 50 million base pairs and contains more than 400 genes, which represent no more than 2% of all pairs mapped by the Human Genome Project [29]. This chromosome contains information that is associated with, among other features, male infertility due to spermatogenic failure, growth control and sex determination.
Although there is no iteration rule that could lead us to the construction of that sequence, in the context of block entropy we can apply the block-scanning method and associate a block entropy $S_\kappa(s, L)$ with chromosome Y, describing the way each nucleotide base in the DNA sequence presents itself in relation to the rest of the chain.
The scanning of the coding sequence of chromosome Y is made by using the block-scanning method [28]. Here, we analyze the DNA sequence using blocks of size $s$, where $s = 1, 2, 4$. The choice of these values will become clear later. Then we use the functional $S_\kappa$, as expressed in eq. (3), and we calculate computationally the entropy for each coding segment of DNA, as a function of the size $L$ of the nucleotide sequence present in chromosome Y. Therefore, the κ-entropy can be written as

$$S_\kappa(s, L) = -\sum_i \frac{p_i^{1+\kappa} - p_i^{1-\kappa}}{2\kappa}, \qquad (8)$$

where $p_i$ is the probability of finding the $i$-th block, $s$ is the block size used to sweep the sequence of nucleotides, and $L$ is the length of the chain, measured in units of number of nitrogenous bases (nB). The sum in (8) depends on the value assumed for $s$. When $s = 1$, the DNA is scanned a single base at a time. In this case, the index $i$ runs from 1 to 4, which corresponds to the four nitrogenous bases existing on the chromosome under study: A, T, G and C. When $s = 2$, the bases are taken two by two (here, blocks of type XY and YX are taken as the same). Thus, there are ten different possibilities to combine the bases, namely, AA, AT, AG, AC, TT, TG, TC, GG, GC, CC. The same scheme holds when we consider $s = 4$; in this case, the number of blocks to take into account inside the string is 35: AAAA, AAAT, AAAG, AAAC, AATT, AATG, AATC, AAGG, AAGC, AACC, ATTT, ATTG, ATTC, ATGG, ATGC, ATCC, AGGG, AGGC, AGCC, ACCC, TTTT, TTTG, TTTC, TTGG, TTGC, TTCC, TGGG, TGGC, TGCC, TCCC, GGGG, GGGC, GGCC, GCCC and CCCC. Because we are performing a box-counting analysis, blocks such as ATGC and GCAT are taken as equal within our scheme, and therefore are counted just once. Although from a functional point of view these arrangements could be considered distinct from each other (see paragraph below), we will consider them statistically identical. Energetically speaking, our assumption is physically reasonable, since the energy of the block remains the same in both orders [2].
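The block classes and the block entropy described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the scan uses non-overlapping blocks (the text does not say whether the window slides), the sequence contains only the letters A, T, G and C, and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations_with_replacement

BASES = "ATGC"

def block_classes(s):
    """All order-independent blocks of size s (multisets of the four bases)."""
    return ["".join(c) for c in combinations_with_replacement(BASES, s)]

def block_probabilities(sequence, s):
    """Relative frequency of each order-independent block.

    Assumption: non-overlapping blocks; a trailing partial block is dropped.
    Each block is mapped to a canonical form (bases sorted in A, T, G, C
    order) so that, e.g., ATGC and GCAT are counted as the same block.
    """
    usable = len(sequence) - len(sequence) % s
    counts = Counter(
        "".join(sorted(sequence[i:i + s], key=BASES.index))
        for i in range(0, usable, s)
    )
    total = sum(counts.values())
    return {block: n / total for block, n in counts.items()}

def block_kappa_entropy(sequence, s, kappa):
    """S_kappa(s, L) for the first L = len(sequence) bases."""
    probs = block_probabilities(sequence, s).values()
    if kappa == 0.0:
        return -sum(p * math.log(p) for p in probs)
    return -sum(
        (p ** (1 + kappa) - p ** (1 - kappa)) / (2 * kappa) for p in probs
    )

for s in (1, 2, 4):
    print(s, len(block_classes(s)))   # 4, 10 and 35 classes, respectively

print(block_classes(2))
```

Counting multisets of size $s$ drawn from four bases gives $\binom{s+3}{s}$ classes, i.e. 4, 10 and 35 for $s = 1, 2, 4$, in agreement with the enumeration in the text.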
In this work, we applied a filter to the NCBI database for chromosome Y, and discarded each of the explicitly non-coding sequences. Also, inside the gene itself there are some sub-sequences of nucleotides which contain the instructions to produce proteins, while other sub-sequences do not follow this rule. The sub-sequences which carry the instructions are also called coding sequences. However, even in the remaining data, some of the nucleotides play no role in heredity. A few words must be said regarding the so-called junk DNA [30]. This nomenclature derives from a paradox (the "C-value paradox"): more complex organisms should have longer encoding sequences, but actually the opposite occurs [31]. For example, lungfish DNA is around 30 times larger than human DNA, and even some flowers have a much larger genomic encoding than ours [32]. So, in this context, some authors argue that a large part of a DNA sequence is composed of non-coding (junk) DNA. Although there exists some controversy about this nomenclature, and about the meaning of a functional element (see, e.g., [33,34]), in our numerical analysis we explicitly discarded the (apparently) non-coding parts. Since the role of the so-called junk DNA is still unclear, we decided to restrict our scanning to the "coding" sets.
Results and discussion. - From now on we will use eq. (3), along with the concept of block entropy, to associate with the coding sequence of human DNA (more specifically, with that of the Y-chromosome) an entropy which describes statistically the DNA arrangement, and which measures long-range statistical correlations between the n-tuple sets of genomic bases. In fig. 1 we depict our building blocks of information, namely the probabilities of finding the four given bases when scanning the Y-chromosome with a block of size $s = 1$. A similar analysis is shown in fig. 2, now considering the coupling of the ten different base pairs. With this result, together with the definition of block of information, we are able to determine the $S_\kappa$-entropy. In both figures, the probability saturates. It is noticeable that for $s = 2$, the probability of occurrence of pairs made of equal bases (TT, AA, CC, GG) is lower than that of pairs made of different ones, with the exception of GC pairs, whose low probability follows from the lower occurrence of the bases G and C in the Y-chromosome, as can be inferred from fig. 1. Now we analyze the κ-entropy associated with these probabilities. The behavior of $S_\kappa(s = 2, L)$ with $L$ is marked, initially (fig. 3(a), see also the inset), by strong oscillations. This is expected: as we proceed to scan the coding chain, any significant change in the count of a block produces a drastic change in the probability of finding it and, consequently, the entropy can oscillate strongly at this stage. Thereafter, the entropy presents a (mostly linear) increase, and then saturates. In fig. 3(b) we consider several values of the entropic parameter κ. In the region where $S_\kappa(s, L)$ increases more linearly, we have observed that the rate of increase is directly linked to the entropic parameter κ. Depending on the value of κ, the generalized entropy in this region may increase rapidly, grow slowly, or even grow linearly with $L$.
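The procedure of following $S_\kappa(s, L)$ as the scanned length $L$ grows can be sketched as below. This is only an illustration of the bookkeeping, under several assumptions: non-overlapping blocks, order-independent block counting, a synthetic random sequence standing in for the real coding data, and a plain least-squares slope as a crude linearity probe. None of the names or parameter values come from the original work, and a random toy sequence is not expected to reproduce the curves in the figures.

```python
import random
from collections import Counter

def kappa_entropy(probs, kappa):
    """Discrete kappa-entropy for kappa != 0 (Boltzmann constant = 1)."""
    return -sum(
        (p ** (1 + kappa) - p ** (1 - kappa)) / (2 * kappa)
        for p in probs
        if p > 0
    )

def entropy_vs_length(sequence, s, kappa, step):
    """Return [(L, S_kappa(s, L))], sampled whenever L is a multiple of step.

    Block counts are accumulated incrementally while scanning, so the
    probabilities at length L use all blocks seen in the first L bases.
    """
    counts = Counter()
    n_blocks = 0
    curve = []
    for start in range(0, len(sequence) - s + 1, s):
        block = "".join(sorted(sequence[start:start + s]))  # order ignored
        counts[block] += 1
        n_blocks += 1
        length = start + s
        if length % step == 0:
            probs = [n / n_blocks for n in counts.values()]
            curve.append((length, kappa_entropy(probs, kappa)))
    return curve

def least_squares_slope(points):
    """Slope of the best straight line through (x, y) points (needs >= 2)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in points)
    den = sum((x - mean_x) ** 2 for x, _ in points)
    return num / den

# Hypothetical stand-in for a coding segment: uniformly random bases.
random.seed(0)
toy_sequence = "".join(random.choice("ATGC") for _ in range(20000))

curve = entropy_vs_length(toy_sequence, s=2, kappa=0.3, step=1000)
print(len(curve), "sample points")
print("slope dS/dL over the scanned range:", least_squares_slope(curve))
```

Restricting the fit to a sub-range of $L$ and repeating it for several values of κ would be one way to locate the regions of approximately linear growth discussed in the text.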
We found that all values of κ make $S_\kappa(s, L)$ linear with respect to $L$ (extensive), at least on some of the coding regions under consideration. When we increase the box size ($s = 4$), the behavior of the κ-entropy changes accordingly, as one can see in fig. 4. Here, we decided to make an analysis of the κ-entropy for different values of the entropic parameter κ.

Conclusions. - An analysis of the Y-chromosome has been performed by using the κ-statistical formalism adapted to the block concept. DNA molecules actually display a multi-fractal pattern [35], which could lead us to consider that DNA information is characterized by the statistical concept of disorder. Indeed, the κ-statistics is suitable to analyze the long-range correlations of the DNA molecules. However, in the ambit of the κ-entropy, we can always find a suitable κ that makes the system linear with $L$ and, consequently, extensive in this sense. This means that, at least for some portions, the Y-chromosome behaves non-fractally. By non-fractal we mean a structure which presents a linear behavior, and is therefore extensive, in contrast to fractals, which feature a power-law mathematical behavior at all scales. Indeed, in a similar analysis made in [28], applied to the Cantor set, it was determined that there is only one value of κ which makes the entropy linear. Here, by contrast, there is a full range of linearity for all the values of κ. Additionally, we have found that the values of κ that make $S_\kappa$ linear with $L$ lie in a region where the κ-entropy shows no maximum, since they are outside the limit |κ| ≤ 1. Another aspect that we have noticed in this analysis, which concerns the description of the coding sequence in a thermodynamic framework, is the emerging possibility of describing the DNA within a κ-theory of ensembles. How to do this? We know that DNA is subject to various intra- and extra-structural forces which stabilize it.
One of these interactions occurs within the intra-structural ambit: the base pairs that are stacked exert forces on each other. This type of interaction is known in the literature as stacking interaction. The energies associated with each interacting pair are well known. Thereby, we could make the connection with thermodynamics through a deformed partition function $Z_\kappa$ associated, in turn, with a statistical weight of the form $\exp_\kappa(-\beta\epsilon_i)/Z_\kappa$, and hence we could obtain all the κ-thermodynamics of the system, such as the specific heat $c_\kappa$.

* * *

We acknowledge the financial support received from the Brazilian funding agency CAPES. RS is very grateful to CNPq for the grants under which this work was carried out.