PARTIAL LEAST SQUARES REGRESSION
Partial Least Squares (PLS) regression is a multivariate data analysis technique which can be used to relate several response (Y) variables to several explanatory (X) variables.
The method aims to identify the underlying factors, or linear combinations of the X variables, which best model the Y dependent variables.
PLS can deal efficiently with data sets that contain very many highly correlated variables and substantial random noise.
Partial least squares (PLS) is thus a method of modelling relationships between one or more Y variables and one or more explanatory (X) variables.
The PLS approach was developed and adapted principally by Svante Wold and Harald Martens from an algorithm first proposed by H. Wold (1975).
An application to the study of potato variety yield data :
DATA :
Y VARIABLES
- variety yields from a number of centres (centres = variables)
X VARIABLES
- variety characteristics, e.g. disease resistance.
WHAT INFLUENCES VARIETY X CENTRE VARIATION ?
- are there variety characteristics which control how variety yields change between environments ?
HOW EFFECTIVE IS THE PLS MODEL ?
- what proportion of variation is explained by the fitted model ?
The example is taken from potato variety trials in the UK.
It is often difficult to interpret variety x environment interactions in the analysis of data from trials repeated over several locations. One approach is to calculate the principal components scores from the variety x trials data, after the main effects of variety and trials have been removed, and then try to relate these scores to other varietal characters.
A similar, but more direct, approach is to use PLS to provide a parsimonious linear model of the systematic relationship between the variety x trials data and the other variety characters.
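The first approach mentioned above (principal component scores of the variety x trials table after removing main effects) can be sketched roughly as follows. This is only an illustration: the small yield table is made up, and the function name is hypothetical.

```python
import numpy as np

def interaction_residuals(table):
    """Remove row (variety) and column (centre) main effects from a two-way table."""
    table = np.asarray(table, dtype=float)
    grand = table.mean()
    row_effect = table.mean(axis=1, keepdims=True) - grand
    col_effect = table.mean(axis=0, keepdims=True) - grand
    return table - grand - row_effect - col_effect

# Made-up table: 4 varieties (rows) x 3 centres (columns).
yields = np.array([[40.0, 45.0, 50.0],
                   [42.0, 44.0, 55.0],
                   [38.0, 47.0, 49.0],
                   [41.0, 46.0, 52.0]])
resid = interaction_residuals(yields)

# Principal component scores of the interaction table via the SVD.
u, s, vt = np.linalg.svd(resid, full_matrices=False)
pc_scores = u * s   # one row of scores per variety
```

The scores in `pc_scores` are what would then be related, in a second step, to the other varietal characters.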
POTATO VARIETY X ENVIRONMENT DATA
The data consist of potato tuber yields for nine varieties, all grown at twelve centres
throughout the UK in the years 1983-87. Each item in the table represents the mean from
five years of trials with the variety at that centre.
In addition, botanical, agronomic, disease and pest susceptibility scores were available for each of the varieties. All were rated on a scale of 1-9, where 9 means the variety possesses the characteristic to a high degree, e.g. good drought resistance, pale crisp colour, high dry matter content.
LOCATION OF POTATO VARIETY TRIALS
The 12 centres were distributed throughout England, Wales, Scotland and N. Ireland.
PLS METHOD AND V x E MODELS
PLS METHOD
Aims to identify weighted combinations of the X variables (loading vectors) which best model the Y variables.
Development of theory and expository texts.
PLS fits within a broad class of V x E modelling strategies.
There are a number of versions of the PLS method. The one described here can be used to
relate several Y variables to several X variables.
The analysis of V x E data by the fitting of linear and multiplicative models has a well established history. PLS can be fitted within this broad class of modelling strategies.
LOADING VECTOR FOR VARIETY CHARACTERS
Factor 1 loadings (ordered), 1983-87
Variety character     Loading
Dry Matter             -0.52
Crisp Colour           -0.48
Tuber Blight           -0.34
Maturity               -0.21
Gangrene               -0.08
External Damage        -0.06
Virus Y                -0.01
Foliage Blight          0.01
Foliage Cover           0.13
Dormancy                0.15
Drought Resistance      0.16
Internal Damage         0.24
Common Scab             0.31
Leaf Roll Virus         0.34
Application of PLS to the potato data suggested that there was only one factor of
relevance for prediction. It explains some 40% of the variation in the Y matrix.
The Factor 1 loadings are presented here in rank order. This places dry matter and crisp colour at one end of the scale and leaf roll virus and common scab at the other end.
Crisp colour assessment is known to be positively correlated with dry matter content. Also, leaf roll virus susceptibility is known to be negatively correlated with DM content of tubers.
PARTIAL LEAST SQUARES MODEL
The PLS method aims to identify the underlying factors, or a linear combination of the X
variables, which best model the Y dependent variables.
STAGES IN THE PLS ALGORITHM
The algorithm involves an iterative procedure.
First, estimate a linear combination of the X variables to give a latent vector with the property that all of the Y variables can be predicted from this vector optimally by least squares.
A second latent vector is then derived which represents a linear combination of the X residuals after projection onto the first latent variable. This vector has the property of optimally predicting the Y residuals from the first stage.
This procedure is continued until the contribution of a new latent vector is negligible.
Details of the PLS algorithm have been given by Aastveit and Martens (1986), amongst others.
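The stages above can be sketched in a few lines. This is a minimal illustration rather than the published algorithm: it assumes one common formulation in which each weight vector is the dominant left singular vector of the cross-product of the current X and Y residuals, and the function name, tolerance and simulated data are all hypothetical.

```python
import numpy as np

def pls_latent_vectors(X, Y, max_factors, tol=1e-8):
    """Extract PLS score vectors one at a time, deflating X and Y after each."""
    E = X - X.mean(axis=0)          # X residuals (initially the centred X)
    F = Y - Y.mean(axis=0)          # Y residuals (initially the centred Y)
    total_ss = (F ** 2).sum()
    scores, y_percent = [], []
    for _ in range(max_factors):
        # Weight vector: dominant left singular vector of E'F.
        u, _, _ = np.linalg.svd(E.T @ F, full_matrices=False)
        t = E @ u[:, 0]             # latent vector: combination of X residuals
        tt = t @ t
        if tt < tol:
            break
        p = E.T @ t / tt            # least-squares loadings of E on t
        q = F.T @ t / tt            # least-squares loadings of F on t
        gain = tt * (q @ q)         # Y sum of squares explained by this factor
        if gain < tol * total_ss:   # negligible contribution: stop
            break
        E = E - np.outer(t, p)
        F = F - np.outer(t, q)
        scores.append(t)
        y_percent.append(100.0 * (1.0 - (F ** 2).sum() / total_ss))
    T = np.column_stack(scores) if scores else np.empty((X.shape[0], 0))
    return T, np.array(y_percent)

# Simulated data with the same shape as the potato example (not the real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(9, 14))
Y = X[:, :2] @ rng.normal(size=(2, 12)) + 0.1 * rng.normal(size=(9, 12))
T, y_percent = pls_latent_vectors(X, Y, max_factors=3)
```

`y_percent` holds the cumulative percentage of Y variation explained after each factor, which is the kind of figure used to judge how effective the fitted model is.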
IDENTIFYING OPTIMAL FACTORS
Cross-validation (Stone, 1974) can be used to assess the appropriate
number of factors to fit.
The data are split into groups. One group of observations is omitted and the PLS model is fitted to the remaining data.
Predictions are made for the omitted data and the sum of squares of predicted minus observed values is calculated.
The procedure is repeated until each observation has been deleted once.
The total sum of squares of predicted minus observed values is used as a measure of the predictive usefulness of a model with that number of factors.
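A rough sketch of this cross-validation scheme is given below. It is an illustration under stated assumptions, not the procedure's own code: the PLS fit uses a simple SVD-with-deflation formulation, groups are assigned at random, and all function names and the simulated data are hypothetical.

```python
import numpy as np

def pls_coefficients(X, Y, n_factors):
    """Return (x_mean, y_mean, B) so that Y_hat = y_mean + (X - x_mean) @ B."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    E, F = X - x_mean, Y - y_mean
    W, P, Q = [], [], []
    for _ in range(n_factors):
        u, _, _ = np.linalg.svd(E.T @ F, full_matrices=False)
        w = u[:, 0]                       # weight vector for this factor
        t = E @ w                         # score vector
        tt = t @ t
        p, q = E.T @ t / tt, F.T @ t / tt
        E, F = E - np.outer(t, p), F - np.outer(t, q)   # deflate both blocks
        W.append(w); P.append(p); Q.append(q)
    W, P, Q = (np.column_stack(m) for m in (W, P, Q))
    B = W @ np.linalg.solve(P.T @ W, Q.T)  # coefficients on the centred X
    return x_mean, y_mean, B

def press(X, Y, n_factors, n_groups, seed=0):
    """Total squared prediction error when each group is left out in turn."""
    rng = np.random.default_rng(seed)
    groups = rng.permutation(X.shape[0]) % n_groups   # random balanced groups
    total = 0.0
    for g in range(n_groups):
        held = groups == g
        x_mean, y_mean, B = pls_coefficients(X[~held], Y[~held], n_factors)
        pred = y_mean + (X[held] - x_mean) @ B
        total += ((pred - Y[held]) ** 2).sum()
    return total

# Simulated data (not the potato data): 30 observations, 5 X and 3 Y variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
Y = X[:, :2] @ rng.normal(size=(2, 3)) + 0.1 * rng.normal(size=(30, 3))
press_by_rank = [press(X, Y, a, n_groups=5) for a in (1, 2, 3, 4)]
```

Comparing the entries of `press_by_rank` then suggests how many factors are worth retaining: adding a factor is only useful if it lowers the total appreciably.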
MODELLING G x E INTERACTIONS
PLS can be used to explore some of the mechanisms which make up the interaction terms in
the variety x centres data table.
BREAKING DOWN THE INTERACTION
The PLS method as applied here has similarities to principal component analysis of
interactions. PLS has the advantage of directly involving the external information in the
fitting of the multi-linear model.
In effect, PLS may be best understood as carrying out separate PCAs of X and of Y at the same time, with the component estimates chosen to provide prediction of Y from X.
ADVANTAGES/ WEAKNESSES OF PLS
PLS is a relatively flexible algorithm which produces a minimal number of factor solutions
when attempting to relate one complex matrix to another.
Although it has a well established heuristic base, the distributional properties of estimators from PLS are not known.
It can be difficult, sometimes, to interpret loadings. This is particularly the case where judgement is needed in the scaling of variables before applying PLS, for example, with binary variables where there may be no natural scale.
ALTERNATIVES TO PLS
PLS has some similarities with principal components regression (PCR), i.e. PCA followed by regression.
Both methods project the X data into a multi-dimensional space determined
by vectors or factors and use coordinates in this space as regressors with
the Y variables as regressands. But while PCR projects into a space
estimated from the X variables alone, PLS projects into a space determined
by both the X and Y variables.
Canonical correlation tries to find linear combinations of the X variables which correlate maximally with linear combinations of the Y variables. However, it is sensitive to strong correlations within either the X or the Y variables, and the number of variables must be less than the number of observations.
Ridge regression (RR) is a method for stabilizing regression estimates when there is extreme collinearity amongst the variables. Multiresponse RR applies separate RRs to each principal component's linear combination of the responses using separate ridge parameters for each one.
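To illustrate the stabilizing effect, here is a minimal sketch of ordinary single-response ridge regression (not the multiresponse variant described above). The function name, the simulated collinear data and the chosen ridge parameters are all assumptions made for the example.

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    """Ridge estimates on centred data: solve (X'X + lam*I) b = X'y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

# Two nearly collinear explanatory variables (simulated).
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + 0.01 * rng.normal(size=50)
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.5 * rng.normal(size=50)

# Larger ridge parameters shrink the coefficient vector.
coef_by_lam = {lam: ridge_coefficients(X, y, lam) for lam in (0.0, 0.1, 10.0)}
```

With `lam = 0.0` the estimates are ordinary least squares and can be erratic when the columns are nearly collinear; increasing `lam` shrinks the coefficient vector towards zero.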
The connection between PLS, PCR and RR is well described in a paper by Ildiko Frank and Jerome Friedman, "A statistical view of some chemometrics regression tools", published in Technometrics, 1993, 35(2), 109-135.
GENSTAT PLS PROCEDURE
A GENSTAT 5 procedure, PLS, written by Ian Wakeling
and Nick Bratchell, fits a partial least squares
model.
The following is an extract from their description of the procedure :
The potato data and the GENSTAT instructions for fitting the PLS model to these data are available.
THE PLS MODEL AND ITS FITTING
Procedure description by Ian Wakeling and
Nick Bratchell
If Y and X denote matrices of dependent and independent variables respectively, then the aim of PLS is to fit a bilinear model having the form T = XW, X = TP' + E and Y = TQ' + F, where W is a matrix of coefficients whose columns define the PLS factors as linear combinations of the independent variables. Successive PLS factors contained in the columns of T are selected both to minimise the residuals in E and simultaneously to have high squared covariance with a single Y variate (PLS1) or a linear combination of multiple Y variates (PLS2). The columns of T are constrained to be mutually orthogonal. See Helland (1988) or Hoskuldsson (1988) for a more comprehensive description of the PLS method.
The procedure allows the calculation of PLS1 and PLS2 models with cross-validation to assist in the determination of the correct number of dimensions to include in the model. By setting the NGROUPS option the data are randomly divided into a number of groups; samples in each group are then modelled from the remaining samples only. The sum of squares of differences between these "leave out predictions" and the observed values of Y is called PRESS. Many tests of significance for determining the correct number of dimensions are based on comparing values of PRESS for PLS models of varying rank. Values of PRESS are used in the procedure to perform Osten's (1988) test of significance and may also be plotted out in a scree diagram. In addition to the factor scores, factor loadings and residuals, the procedure also calculates a leverage measure (Martens & Naes 1989 page 276) and a single linear combination of the X variables (ESTIMATES) which summarises the entire PLS model.
To use a PLS model to make predictions from new observations on the X variables, two methods are available. Either the user may do this manually by using the model as specified in the estimates matrix, or the new X data may be specified beforehand as the pointer to variates XPREDICT and the corresponding predictions obtained as YPREDICT.
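The manual route might look like the sketch below. It assumes, without confirmation from the procedure description, that the first row of the estimates matrix holds the constant terms and the remaining rows hold the coefficients of the X variables on their original scale. The function name and the numbers are hypothetical.

```python
import numpy as np

def predict_from_estimates(estimates, x_new):
    """Predict Y from an (NX+1) x NY matrix: assumed constant row, then slopes."""
    estimates = np.asarray(estimates, dtype=float)
    x_new = np.atleast_2d(np.asarray(x_new, dtype=float))
    constant = estimates[0]       # ASSUMPTION: first row is the constant
    slopes = estimates[1:]        # one row per X variable
    return constant + x_new @ slopes

# Made-up estimates for 2 X variables and 2 Y variables.
estimates = [[1.0, 2.0],
             [0.5, 0.0],
             [0.0, -1.0]]
y_pred = predict_from_estimates(estimates, [[2.0, 3.0]])
```

If the procedure stores the constant in a different row, only the two indexing lines need to change.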
Although the PLS method is often presented in terms of an iterative algorithm (Manne 1987), the X block loadings vector for the first PLS dimension (w1) is simply the eigenvector of X'YY'X corresponding to its largest eigenvalue. To find the second and subsequent dimensions, X and Y are deflated by orthogonalising with respect to the current PLS factor (t=Xw) and the eigenanalysis repeated. The above approach was adopted by Rogers (1987) in an implementation of a Genstat 4 macro. Here we adopt a very similar approach by performing a singular value decomposition on the matrix X'Y which simultaneously obtains loading vectors for both data blocks (Hoskuldsson 1988).
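The equivalence between the eigenvector and singular value formulations for the first dimension can be checked numerically. The sketch below uses simulated, centred data (not the potato data) and compares the leading eigenvector of X'YY'X with the first left singular vector of X'Y; the two should agree up to sign.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(9, 5))
Y = rng.normal(size=(9, 4))
X = X - X.mean(axis=0)
Y = Y - Y.mean(axis=0)

S = X.T @ Y                                # cross-product matrix X'Y

# Eigenanalysis of the symmetric matrix X'YY'X (eigenvalues in ascending order).
evals, evecs = np.linalg.eigh(S @ S.T)
w_eig = evecs[:, -1]                       # eigenvector for the largest eigenvalue

# Singular value decomposition of X'Y gives loadings for both blocks at once.
u, s, vt = np.linalg.svd(S, full_matrices=False)
w_svd = u[:, 0]                            # X block loading vector (w1)
c_svd = vt[0]                              # Y block loading vector

agreement = abs(w_eig @ w_svd)             # 1.0 when the vectors match up to sign
```

The largest eigenvalue equals the square of the largest singular value, which is why the two routes pick out the same direction.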
OUTPUT FROM THE PLS MODEL AND ITS FITTING
Output from the PLS procedure can be selected using the following settings of
the PRINT option.
data          the unscaled data values (with labels).
xloadings     X-component loadings (columns of the matrix W - see above).
yloadings     variable loadings for the bilinear model of the matrix of
              dependent variables. Note that these are standardized to unit
              length and are not the same as the columns of the matrix Q
              above. To obtain Q, form the matrix C, whose columns are the
              standardized loadings, and post-multiply by the diagonal
              matrix supplied as the output parameter B.
ploadings     variable loadings for the bilinear model of the matrix of
              independent variables (columns of the matrix P - see above).
scores        X and Y component scores. The X component scores are the
              columns of the matrix T and are mutually orthogonal. The Y
              component scores, usually given the symbol u, are not in
              fact needed in the calculation of the PLS model unless an
              iterative algorithm is used (see method section). They are
              provided here for completeness, as sometimes it is useful to
              plot the Y component scores against the X component scores
              to give a visual indication of the degree of fit for each
              PLS dimension.
leverage      measure of leverage.
xerrors       residual sum of squares and residual standard deviations for
              all the independent variables. When NGROUPS>1 additional
              statistics are calculated from the cross-validated residuals,
              derived when each object is left out. The PRESS value
              is equal to the sum of squares of cross-validated standard
              deviations for each X variable multiplied by N-1, where N is
              the total number of observations. The cross-validated
              standard deviations may therefore be used to measure the
              predictive ability of the model for each of the variables.
yerrors       residual sum of squares and residual standard deviations for
              all the dependent variables (see xerrors above).
scree         scree diagram of PRESS.
xpercent      percentage variance explained for the X variables.
ypercent      percentage variance explained for the Y variables.
predictions   predicted values for any observations that were not included
              in the PLS model but were supplied using the XPREDICT
              parameter.
groups        details of groupings used for cross-validation.
estimates     estimated PLS regression coefficients.
fittedvalues  fitted values from the PLS regressions.
The default settings are estimates, xpercent, ypercent, scores, xloadings,
yloadings, ploadings.
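As an illustration of the leverage setting, the sketch below computes one commonly used leverage measure for bilinear models from the X component scores: 1/n plus, for each dimension, the squared score divided by the sum of squared scores. Whether the procedure uses exactly this form is an assumption; the function name and the stand-in score matrix are hypothetical.

```python
import numpy as np

def leverage(T):
    """Leverage per sample from a score matrix T (samples x dimensions)."""
    T = np.asarray(T, dtype=float)
    n = T.shape[0]
    return 1.0 / n + ((T ** 2) / (T ** 2).sum(axis=0)).sum(axis=1)

# Stand-in scores: orthogonal columns built from simulated centred data.
rng = np.random.default_rng(0)
Z = rng.normal(size=(9, 2))
Z = Z - Z.mean(axis=0)
T, _ = np.linalg.qr(Z)
h = leverage(T)
```

Under this definition the leverages sum to one plus the number of dimensions, so a sample whose value is well above the average of that total has an unusually strong influence on the model.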
PLS PROCEDURE OPTIONS
The data for PLS are supplied using the X and Y parameters, as pointers to
variates containing the columns of the X and Y matrices. Other parameters
allow output to be saved in appropriate data structures.
'options'
PRINT    = strings           Printed output required (data, xloadings,
                             yloadings, ploadings, scores, leverage,
                             xerrors, yerrors, scree, xpercent, ypercent,
                             predictions, groups, estimates, fittedvalues);
                             default esti,xper,yper,scor,xloa,yloa,ploa
NROOTS   = scalar            Number of PLS dimensions to be extracted
YSCALING = string            Whether to scale the Y variates to unit variance
                             (yes, no); default no
XSCALING = string            Whether to scale the X variates to unit variance
                             (yes, no); default no
NGROUPS  = scalar            Number of cross-validation groups into which to divide
                             the data; default 1 (i.e. no cross-validation performed)
SEED     = scalar or factor  A scalar indicating the seed value to use when
                             dividing the data randomly into NGROUPS groups for the
                             cross-validation, or a factor to indicate a specific set
                             of groupings to use for the cross-validation; default
                             takes the (scalar) value of NGROUPS
LABELS   = text              Sample labels for X and Y that are to be used in the
                             printed output; defaults to the integers 1...n where
                             n is the length of the variates in X and Y
PLABELS  = text              Sample labels for XPREDICT that are to be used in
                             the printed output; default uses the integers 1, 2 ...
'parameters'
Y          = pointers  Pointer to variates containing the dependent variables
X          = pointers  Pointer to variates containing the independent variables
YLOADINGS  = pointers  Pointer to variates used to store the Y component
                       loadings for each dimension extracted
XLOADINGS  = pointers  Pointer to variates used to store the X component
                       loadings for each dimension extracted
PLOADINGS  = pointers  Pointer to variates used to store the loadings for
                       the bilinear model for the X block
YSCORE     = pointers  Pointer to variates used to store the Y component
                       scores for each dimension extracted
XSCORE     = pointers  Pointer to variates used to store the X component
                       scores for each dimension extracted
B          = matrices  A diagonal matrix containing the regression
                       coefficients of YSCORE on XSCORE for each dimension
YPREDICT   = pointers  A pointer to variates used to store predicted Y values
                       for samples in the prediction set
XPREDICT   = pointers  A pointer to variates containing data for the
                       independent variables in the prediction set
ESTIMATES  = matrices  An NX+1 by NY matrix (where NX and NY are the numbers
                       of variates contained in X and Y respectively) used to
                       store the PLS regression coefficients for a PLS model
                       with NROOTS dimensions
FITTED     = pointers  Pointer to variates used to store the fitted values for
                       each Y variate
LEVERAGE   = variates  Variate used to store the leverage that each sample has
                       on the PLS model
PRESS      = variates  Variate used to contain the Predictive Residual Error
                       Sum of Squares for each dimension in the PLS model,
                       available only if cross-validation has been selected
RSS        = variates  Variate used to store the Residual Sum of Squares for
                       each dimension extracted
YRESIDUAL  = pointers  Pointer to variates used to store the residuals from the
                       Y block after NROOTS dimensions have been extracted,
                       uncorrected for any scaling applied using YSCALING
XRESIDUAL  = pointers  Pointer to variates used to store the residuals from the
                       X block after NROOTS dimensions have been extracted,
                       uncorrected for any scaling applied using XSCALING
XPRESIDUAL = pointers  Pointer to variates used to store the residuals from the
                       XPREDICT block after NROOTS dimensions have been
                       extracted
It is usual to centre all variables prior to a PLS analysis; the procedure
will automatically do so even if the XSCALING/YSCALING options are not set.
On exit from the procedure the variates pointed to by X and Y are unchanged.
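The pre-processing described here can be sketched as follows. This is only an illustration of centring with optional scaling to unit variance, working on a copy so that the input is left unchanged. The function name is hypothetical, and the use of the n-1 divisor for the variance is an assumption.

```python
import numpy as np

def centre_and_scale(M, scale=False):
    """Centre each column; optionally scale to unit variance (n-1 divisor assumed)."""
    M = np.asarray(M, dtype=float)
    out = M - M.mean(axis=0)              # new array; M itself is not modified
    if scale:
        out = out / M.std(axis=0, ddof=1)
    return out

# Made-up scores for 4 samples on 2 variables.
raw = np.array([[1.0, 10.0],
                [2.0, 30.0],
                [3.0, 20.0],
                [6.0, 40.0]])
centred = centre_and_scale(raw)
standardised = centre_and_scale(raw, scale=True)
```

Scaling matters most when the variables are measured on very different scales, since otherwise those with the largest variance dominate the fitted factors.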