# Statistics

## Extreme value theory

Extreme value theory deals with characteristics of the tail of a distribution function, such as indices describing tail decay, extreme quantiles, small exceedance probabilities and, in time series and multivariate settings, measures of extremal dependence. As such, the focus in this area is quite different from that of ‘classical’ statistical theory, which mainly concentrates on the analysis of means. Clearly, when interest is in estimating parameters related to the far tail of a distribution function, the analysis should be based on the largest observations in the sample, which in turn calls for a characterisation of the behavior of these extreme data points. Central to extreme value theory in this respect are the limiting distributions of extremes, the so-called extreme-value distributions, first introduced by Fisher and Tippett in 1928. These distributions arise as the only possible limiting forms for the distribution of normalized maxima in samples of independent and identically distributed random variables. Applications of extreme value theory can be found in disciplines like hydrology (river discharges and floods), environmental research and meteorology (concentrations of pollutants, extreme precipitation, wind speeds), (re)insurance (premium calculations), finance (Value-at-Risk estimation), geology (earthquake modelling, value of diamonds), and computer science (network traffic data, server waiting times), to mention but a few.
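For reference, the limit result for normalized maxima can be written out as follows. The notation is an assumption made here for illustration (it is the common one-parameter form, with $M_n$ the maximum of a sample of size $n$, normalizing constants $a_n > 0$ and $b_n$, and extreme value index $\gamma$), not notation taken from the text above.

$$
\lim_{n \to \infty} P\!\left( \frac{M_n - b_n}{a_n} \le x \right) = G_\gamma(x) = \exp\!\left( -(1 + \gamma x)^{-1/\gamma} \right), \qquad 1 + \gamma x > 0,
$$

where the case $\gamma = 0$ is read as $G_0(x) = \exp(-e^{-x})$. The sign of $\gamma$ separates the three classical types: $\gamma > 0$ gives the Fréchet (heavy-tailed, Pareto-type) case, $\gamma = 0$ the Gumbel case, and $\gamma < 0$ the (extreme value) Weibull case with a finite right endpoint.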

The statistics unit is active in the areas of tail index estimation and the testing of extreme-value conditions.

The estimation of the extreme value index, a parameter which indicates the nature of tail decay, is a central theme in extreme value statistics. In fact, having an estimate for the tail index is a prerequisite for tackling many other estimation problems, such as the estimation of extreme quantiles. It is well known that traditional estimators for this index may suffer seriously from bias, making a careful selection of the tail sample fraction used in the estimation a necessity. Recent research has therefore focused on the development and asymptotic study of bias-corrected estimators, which are typically more stable when considered as a function of the tail sample fraction used in the estimation. The bias correction is obtained by explicitly taking a condition on the second-order tail behavior into account at the estimation stage. The group is active in bias-corrected estimation in the classes of Pareto-type and Weibull-type models.
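The dependence of a traditional estimator on the tail sample fraction can be illustrated with a minimal sketch. The Hill estimator is used here purely as a familiar example of such an estimator; it is an assumption of this sketch, not necessarily the estimator studied by the group, and the function name, the simulated strict Pareto sample and the chosen values of `k` are all hypothetical.

```python
import math
import random


def hill_estimator(sample, k):
    """Hill estimate of a positive extreme value index.

    Uses the k largest observations, taking the (k+1)-th largest
    observation as the threshold.
    """
    x = sorted(sample)
    n = len(x)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    threshold = x[n - k - 1]
    if threshold <= 0:
        raise ValueError("the threshold must be positive")
    # Average log-excess of the k largest observations over the threshold.
    return sum(math.log(v / threshold) for v in x[n - k:]) / k


# Illustrative data (an assumption): strict Pareto with index 0.5,
# simulated as U^(-gamma) with U uniform on (0, 1].
rng = random.Random(0)
gamma_true = 0.5
sample = [(1.0 - rng.random()) ** (-gamma_true) for _ in range(5000)]

# Estimates as a function of the number of upper order statistics k.
hill_path = {k: hill_estimator(sample, k) for k in (50, 200, 1000)}
for k, estimate in hill_path.items():
    print(k, round(estimate, 3))
```

For strict Pareto data the estimates should stay close to the true index for all `k`, since there is no second-order term. For data that are only Pareto-type, the path of estimates usually drifts as `k` grows, which is the bias that the bias-corrected estimators described above aim to remove.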

Despite the widespread interest in tail index estimation, not much thought has been given to testing the underlying extreme value assumptions. However, testing the validity of these assumptions is important, as it clearly does not make sense to draw inferences about the tail index or extreme quantiles on the basis of an extreme value model if that model does not provide an adequate fit to the data. The models used in extreme-value theory are semi-parametric, and as such the goodness-of-fit testing procedures are necessarily complex, which partly explains the scarcity of testing procedures. As with tail index estimation, goodness-of-fit tests in the extreme value framework typically suffer from bias. This bias is an undesirable property because, when the assumption under test is rejected, it may be hard to judge whether this is due to an actual violation of the assumption or to excessive bias. The statistics unit has been active in the construction of a general class of kernel goodness-of-fit tests for the assumption of Pareto-type or heavy-tailed behavior, and has also introduced a general theoretical framework for bias correction of such kernel statistics. The procedures are currently being extended to the other max-domains of attraction of the extreme-value distributions.

## Exponential dispersion models

Exponential dispersion models are two-parameter families of probability distributions, where one parameter represents an exponential family (generated by exponential tilting), while the other represents a convolution semigroup. The class of exponential dispersion models contains a number of standard statistical families as special cases, such as the Poisson, binomial, gamma, normal and inverse Gaussian families. A particularly important type of exponential dispersion model is the class of Tweedie models, which are characterized by being closed under scale transformations; this implies that their variance functions are of power form. Tweedie models are limiting distributions in a special kind of central limit theorem, involving a combination of convolution, exponential tilting and scaling. The Tweedie power form for the variance function, also known as Taylor's power law, has been observed empirically in a wide variety of settings, for example in connection with the distribution of plants and animals in their habitat, in various fields of health science and genetics (HIV, childhood cancer), and in Internet traffic data, to name just a few. Theoretical work is under way on determining the domain of attraction to Tweedie distributions by means of the theory of regularly varying functions, as well as on identifying analogues of the Tweedie convergence results in other areas. Recently, we have shown that Tweedie convergence may be considered an analogue of the convergence of extremes, where the extreme value distributions turn out to be characterized by a power form for the so-called slope function. Tweedie models are particularly useful for building models for longitudinal data and so-called mixed models, and they have been used for the analysis of, for example, insurance data, fisheries data and air pollution data.
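The power form of the variance function says that the variance is proportional to a power of the mean, Var(Y) = φ·μ^p, so that log-variance is linear in log-mean with slope p. The following is a minimal sketch of how such a power could be read off from grouped data by least squares on the log scale. The function name, the simulated gamma groups (for which p = 2) and all numerical settings are assumptions made for illustration; this is not a description of the group's methodology.

```python
import math
import random
import statistics


def taylor_power_exponent(groups):
    """Least-squares slope of log(variance) on log(mean) across groups."""
    log_means = [math.log(statistics.fmean(g)) for g in groups]
    log_vars = [math.log(statistics.variance(g)) for g in groups]
    mean_x = statistics.fmean(log_means)
    mean_y = statistics.fmean(log_vars)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_means, log_vars))
    sxx = sum((x - mean_x) ** 2 for x in log_means)
    return sxy / sxx


# Illustrative data (an assumption): gamma groups with a common shape and
# different scales, so the variance equals mean^2 / shape and the power is 2.
rng = random.Random(1)
shape = 3.0  # hypothetical value, held fixed across groups
groups = [
    [rng.gammavariate(shape, scale) for _ in range(2000)]
    for scale in (1.0, 2.0, 4.0, 8.0, 16.0)
]

p_hat = taylor_power_exponent(groups)
print(round(p_hat, 2))
```

On the simulated gamma groups the fitted slope should be close to 2. Among the standard families mentioned above, the normal, Poisson, gamma and inverse Gaussian families correspond to powers 0, 1, 2 and 3 respectively.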

## Missing values

In biological anthropology, especially in studies of paleontological or prehistoric skeletal material, the use of multivariate methods is often limited or impossible because of missing values, which in turn result from imperfect preservation and/or distortion of the morphological structures in question. The research deals with metric as well as discrete morphological variables, including geometric morphometrics, i.e. landmark-based methodology. Topics under scrutiny are: proper methods for inference concerning Mahalanobis distances using Hotelling’s T² test; Mahalanobis distances modified to be applicable to discrete traits (proper test statistics are wanting even for the complete-data case!); and multiple imputation methods for inference based on specimens virtually reconstructed by geometric morphometrics.
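As background for the first topic, the textbook complete-data relation between the Mahalanobis distance and Hotelling's two-sample statistic can be stated as follows. The notation is an assumption introduced here: two samples of sizes $n_1$ and $n_2$ on $p$ metric variables, with mean vectors $\bar{x}_1$, $\bar{x}_2$ and pooled covariance matrix $S$. The missing-data methods under study are not described by these formulas.

$$
D^2 = (\bar{x}_1 - \bar{x}_2)^{\top} S^{-1} (\bar{x}_1 - \bar{x}_2), \qquad
T^2 = \frac{n_1 n_2}{n_1 + n_2}\, D^2,
$$

$$
\frac{n_1 + n_2 - p - 1}{p\,(n_1 + n_2 - 2)}\; T^2 \sim F_{p,\; n_1 + n_2 - p - 1}
$$

under the hypothesis of equal mean vectors, assuming multivariate normality and a common covariance matrix. With missing values, $\bar{x}_1$, $\bar{x}_2$ and $S$ can no longer be computed directly from complete cases without loss of specimens, which is where the need for adapted inference arises.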