Biomathematics & Statistics Scotland

RESEARCH: Statistical Methodology

Statistics is the science of the collection, analysis and use of data. To meet the demands of our collaborators and the opportunities they present, BioSS’s statistics research has focused on the analysis of digital images and of spatial and temporal data. Some highlights are illustrated below:

Colour displays for categorical images

Connected pores in soil section displayed using optimal colour labels.
Scanning electron micrograph of a cross-section of a soil aggregate (black areas are pores and lighter areas are soil).

In a categorical image, each picture element (pixel) is assigned to a unique group. Such images arise in many contexts, including when a segmentation algorithm has been used to partition the pixels into groups according to some definition of connectivity. To produce a clear display of such an image we need as distinct a colour as possible to represent each group. We have developed a method for identifying such a set of colours. We use search algorithms and simulated annealing to maximise the minimum distance between any pair of colours in a set, as measured in a perceptual colour space.
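The maximin idea described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the BioSS method: it assumes CIELAB as the perceptual colour space with plain Euclidean distance, restricts colours to the sRGB cube, and uses a simple linear cooling schedule. The function names and the parameters `steps`, `t0` and `seed` are hypothetical.

```python
import math
import random


def srgb_to_lab(rgb):
    """Convert an sRGB triple in [0, 1] to CIELAB (D65 white point)."""
    def linearise(c):
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearise(c) for c in rgb)
    # Linear RGB -> CIE XYZ for sRGB primaries
    x = 0.4124 * r + 0.3576 * g + 0.1805 * b
    y = 0.2126 * r + 0.7152 * g + 0.0722 * b
    z = 0.0193 * r + 0.1192 * g + 0.9505 * b

    def f(t):
        return t ** (1 / 3) if t > 0.008856 else 7.787 * t + 16 / 116

    fx, fy, fz = f(x / 0.95047), f(y), f(z / 1.08883)
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


def min_pairwise_distance(colours):
    """Smallest Euclidean distance in CIELAB between any two colours."""
    labs = [srgb_to_lab(c) for c in colours]
    return min(math.dist(labs[i], labs[j])
               for i in range(len(labs)) for j in range(i + 1, len(labs)))


def maximin_colours(n, steps=5000, t0=10.0, seed=0):
    """Simulated annealing that tries to maximise the minimum distance (n >= 2)."""
    rng = random.Random(seed)
    current = [tuple(rng.random() for _ in range(3)) for _ in range(n)]
    current_score = min_pairwise_distance(current)
    best, best_score = list(current), current_score
    for step in range(steps):
        progress = step / steps
        temp = t0 * (1 - progress) + 1e-9   # assumed linear cooling
        scale = 0.3 * (1 - progress) + 0.02  # proposal size shrinks over time
        i = rng.randrange(n)
        proposal = list(current)
        # Perturb one colour and clip it back into the sRGB cube
        proposal[i] = tuple(min(1.0, max(0.0, c + rng.gauss(0, scale)))
                            for c in current[i])
        score = min_pairwise_distance(proposal)
        # Always accept improvements; accept worse sets with a
        # probability that falls as the temperature drops
        if score >= current_score or \
                rng.random() < math.exp((score - current_score) / temp):
            current, current_score = proposal, score
            if score > best_score:
                best, best_score = list(proposal), score
    return best, best_score
```

Calling `maximin_colours(6)` returns six sRGB triples together with the minimum pairwise distance achieved. A production version would probably also need to respect display gamut and background colour, which this sketch ignores.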

Image analysis of plant varieties

Image of some pea stipules.
Outline of the mean shape.

The shape of leaves is an important characteristic used by botanists and plant breeders in describing and classifying different species, subspecies and cultivars, but it is not easy to quantify. Many characteristics are used to describe the overall type of shape, and a system of scores is then used to specify the extent to which each characteristic is expressed. BioSS has been involved in developing tools that do this automatically and objectively, using measurement methods from image analysis. An additional benefit is that it is often possible to define and estimate an average shape (see figure), which allows cultivar differences to be presented in a clear graphical format. We plan to continue developing these ideas, and to make the mathematical tools accessible through easy-to-use software.

Continuous monitoring of river water quality

The Water Framework Directive requires increased monitoring of Scotland’s rivers’ water quality
Water quality, as indicated by alkalinity, is strongly affected by rates of river flow

Legislation such as the European Water Framework Directive has given rise to a need for closer monitoring of water quality in river catchments. The information gathered can be used both to influence long-term policy and to suggest short-term remedial actions. Of considerable importance to hydrologists is determining the relative contributions to river water of water with short-term residence in the catchment (soil water) and water with long-term residence (ground water). BioSS has been applying Bayesian models to alkalinity data from a study of the Feugh catchment by the Universities of Aberdeen and Glasgow, recorded at 15-minute intervals over a period of one year. Because the data are recorded so densely, successive observations are highly correlated, and our statistical models have been extended to deal with correlation at more than one time-scale. This enables us to obtain better estimates of the relationship between river flow rate and water quality.
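To give a feel for what correlation at more than one time-scale means for 15-minute data, the sketch below computes the autocorrelation of a series made up of two independent first-order autoregressive components, one decaying quickly and one slowly. The text does not state the form of the BioSS models, so this two-component structure, the function name and all parameter values are assumptions chosen purely for illustration.

```python
def two_scale_autocorrelation(lag, phi_fast, phi_slow, var_fast, var_slow):
    """Autocorrelation at a given lag for the sum of two independent AR(1)
    processes with coefficients phi_fast and phi_slow and stationary
    variances var_fast and var_slow (hypothetical illustrative model)."""
    total = var_fast + var_slow
    return (var_fast * phi_fast ** lag + var_slow * phi_slow ** lag) / total


# With 15-minute sampling, lag 4 is one hour and lag 96 is one day.
one_hour = two_scale_autocorrelation(4, 0.5, 0.999, 1.0, 1.0)
one_day = two_scale_autocorrelation(96, 0.5, 0.999, 1.0, 1.0)
```

With these assumed values the fast component has almost vanished after one hour, while the slow component still contributes substantial correlation a day later. A model with a single correlation parameter cannot reproduce both features at once.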


Discovering variation during pregnancy of the relationship between weather and birth weight


Suppose we want to know how, at different times during pregnancy, weather influences birth weight. Such questions are of increased importance in the context of climate change. However, they are difficult to answer from observational data. Small populations of animals are exposed to the same weather each year, so the effective sample size in analyses relating birth weights to weather variables is the number of years. Also, because weather variables in successive intervals are positively correlated, standard multiple regression techniques lead to successive regression coefficients that are negatively correlated and have wide confidence intervals (Figure, left). More information can be extracted from the data by incorporating into the model a belief that adjacent time intervals should have similar regression coefficients. We have developed a technique for doing this using standard software for fitting linear mixed models, based on the assumption that differences between successive regression coefficients come from the same Gaussian distribution. The methods have been applied to demonstrate variation during pregnancy in the relationship between birth weights of red deer calves and temperature, this being generally positive during early to mid pregnancy and strongest in weeks 23 to 26 (Figure, right).

Regression coefficients describing the relationship between birth weight of red deer calves and mean temperature in successive fortnights during pregnancy, together with 95% confidence intervals. Deer data come from Rum during the period 1971 to 1998, and regression coefficients were estimated using standard multiple regression techniques (left) or by smoothing differences in successive regression coefficients (right).
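The effect of smoothing differences between successive coefficients can be sketched as a penalised least-squares problem. If the differences are Gaussian with variance s_d² and the residuals Gaussian with variance s², the mixed-model estimates of the coefficients minimise the residual sum of squares plus lam times the sum of squared successive differences, with lam = s²/s_d². The sketch below treats `lam` as fixed, whereas mixed-model software would estimate the variance ratio from the data. It also assumes centred data with no intercept, and uses simulated values rather than the deer data. All function names and settings are hypothetical.

```python
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def smoothed_coefficients(x_rows, y, lam):
    """Minimise sum (y - X b)^2 + lam * sum (b[j+1] - b[j])^2."""
    p = len(x_rows[0])
    # Normal equations: (X'X + lam * D'D) b = X'y, D = first-difference matrix
    xtx = [[sum(row[j] * row[k] for row in x_rows) for k in range(p)]
           for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(x_rows, y)) for j in range(p)]
    for j in range(p - 1):
        xtx[j][j] += lam
        xtx[j + 1][j + 1] += lam
        xtx[j][j + 1] -= lam
        xtx[j + 1][j] -= lam
    return solve(xtx, xty)


def roughness(b):
    """Sum of squared differences between successive coefficients."""
    return sum((b[j + 1] - b[j]) ** 2 for j in range(len(b) - 1))


def simulate(n_years=28, n_intervals=10, seed=1):
    """Simulated data: one row per year, weather correlated across intervals."""
    rng = random.Random(seed)
    true_b = [0.5 * j / (n_intervals - 1) for j in range(n_intervals)]
    x_rows, y = [], []
    for _ in range(n_years):
        row = [rng.gauss(0, 1)]
        for _ in range(n_intervals - 1):
            row.append(0.8 * row[-1] + 0.6 * rng.gauss(0, 1))
        x_rows.append(row)
        y.append(sum(b * x for b, x in zip(true_b, row)) + rng.gauss(0, 1))
    return x_rows, y
```

Setting `lam=0.0` reproduces ordinary multiple regression, and larger values pull neighbouring coefficients towards each other. Comparing `roughness(smoothed_coefficients(x_rows, y, 0.0))` with the value at, say, `lam=20.0` shows how much the penalty damps the swings between successive estimates.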


Latent Gaussian models for multivariate and compositional food intake


Data on the food eaten by consumers and on the nutritional contents of foods can help us to understand dietary risks to health. Such datasets are typically difficult to analyse because they contain many zeroes – one for each recorded food type not eaten during the observation period, or one for each component that is absent from a particular type of food. This means that, at face value, we cannot treat the data as coming from a standard distribution such as the Gaussian (or normal) distribution.

One solution is to assume that the data have been generated by a transformation of a standard underlying (latent Gaussian) distribution, with negative values of that distribution recorded as zeroes. We constructed a latent version of a factor analysis model to explore the multivariate relationships between the intake of different foods by British consumers. This model allows us to estimate the distribution of consumption of, for example, vitamins which are present in many food types. We have also developed a compositional model for describing the relative proportions of different nutrients within individual foods. Standard methods for estimating model parameters do not work in this context, so we have investigated the performance of inference using simulation-based approximations.
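The censoring idea can be illustrated for a single food type. The sketch below assumes the simplest possible case: one latent Gaussian variable with no further transformation, recorded as zero whenever it is negative. It is not the multivariate factor model or the compositional model described above, and the function names and parameter values are hypothetical.

```python
import math
import random
from statistics import NormalDist


def simulate_intake(mu, sigma, n, seed=0):
    """Draw latent Gaussian values and record negative ones as zero."""
    rng = random.Random(seed)
    return [max(0.0, rng.gauss(mu, sigma)) for _ in range(n)]


def zero_probability(mu, sigma):
    """Probability that the latent value is negative, i.e. a recorded zero."""
    return NormalDist(mu, sigma).cdf(0.0)


def censored_loglik(data, mu, sigma):
    """Log-likelihood in which each zero contributes the probability of a
    negative latent value and each positive value contributes the usual
    Gaussian density."""
    dist = NormalDist(mu, sigma)
    log_p_zero = math.log(dist.cdf(0.0))
    return sum(log_p_zero if y == 0.0 else math.log(dist.pdf(y)) for y in data)


intake = simulate_intake(mu=1.0, sigma=2.0, n=20000)
share_of_zeroes = sum(1 for y in intake if y == 0.0) / len(intake)
```

In this univariate case the likelihood is easy to evaluate. With many foods modelled jointly, the zeroes require integrating over a multivariate Gaussian, which is one reason simulation-based approximations become attractive.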

Relationship between the consumption of different food products by a sample of 2197 adult consumers during one week. Data: Dietary & Nutritional Survey of

The latent Gaussian model used to analyse the nutritional composition of a range of fish products. Data: USDA Agricultural Research Service http://www.ars.usda.gov/ba/bhnrc/ndl
