Revised version published in J.Chemom., 2003, 17, 16-33
Centering and scaling in component analysis
Rasmus Bro
Chemometrics Group, Food Technology, Dept. Dairy and Food Science
The Royal Veterinary and Agricultural University
Denmark
rb@kvl.dk
&
Age K. Smilde
Process Analysis and Chemometrics
Department of Chemical Engineering
The
Abstract................................................................................................................................................................................. 4
1. Introduction................................................................................................................................................................ 5
1.1 Definitions........................................................................................................................................................ 5
1.2 Leading principles........................................................................................................................................... 8
1.3 Outline of paper............................................................................................................................................... 8
1.4 Notation............................................................................................................................................................ 8
2. Two-way centering.................................................................................................................................................... 9
2.1 Reasons for centering..................................................................................................................................... 9
2.2 How centering works.................................................................................................................................... 10
2.2.1 Centering can remove offsets because it is a projection step...................................................................... 10
2.2.2 Centering across several modes................................................................................................................... 11
2.2.3 Centering is a two-stage procedure for a least squares fitting problem...................................................... 11
2.2.4 Rank reduction and centering...................................................................................................................... 13
2.3 When centering does not work................................................................................................................... 14
2.3.1 Handling missing data................................................................................................................................. 14
2.3.2 Subtracting the grand mean......................................................................................................................... 14
2.4 Alternatives to centering............................................................................................................................. 17
2.5 Summary................................................................................................................................................................... 18
3. Two-way scaling...................................................................................................................................................... 19
3.1 Reasons for scaling....................................................................................................................................... 19
3.2 How scaling works........................................................................................................................................ 20
3.2.1 Different types of scaling........................................................................................................................... 20
3.2.2 Scaling and weighted least squares fitting................................................................................................... 21
3.3 When scaling does not work....................................................................................................................... 23
3.4 Alternatives to scaling................................................................................................................................. 23
3.5 Summary......................................................................................................................................................... 23
4. Simultaneous two-way centering and scaling................................................................................................... 23
4.1 Scaling within a mode disturbs centering across the same mode, but not across other modes....... 24
4.2 Centering across one mode disturbs scaling within all modes.............................................................. 24
4.3 Centering across and scaling within the same mode is problematic..................................................... 25
4.4 Summary......................................................................................................................................................... 25
5. Three-way preprocessing...................................................................................................................................... 26
5.1 Centering........................................................................................................................................................ 26
5.1.1 Possible three-way offsets and their proper removal................................................................................. 26
5.2 Scaling............................................................................................................................................................. 28
5.2.1 Incorrect scaling of three-way arrays.......................................................................................................... 29
5.3 Simultaneous centering and scaling........................................................................................................... 30
5.4 Summary......................................................................................................................................................... 31
6. Conclusion............................................................................................................................................................... 31
Acknowledgments............................................................................................................................................................. 33
Appendix 1: Projections................................................................................................................................................... 33
Appendix 2: Fitting bilinear model plus offsets across one mode equals fitting a bilinear model to centered data 35
In this paper, the purpose and use of centering and scaling is discussed in depth. The main focus is on two-way bilinear data analysis, but the results can easily be generalized to multiway data analysis. In fact, one of the scopes of this paper is to show that if two-way centering and scaling is understood, then multiway centering and scaling is quite straightforward. In the literature, it is often stated that preprocessing of multiway arrays is difficult, but here it is shown that most of the difficulties do not pertain to three- and higher-way modeling in particular.
It is shown that centering is most conveniently seen as a projection step, where the data are projected onto certain well-defined spaces within a given mode. This view of centering helps explaining why, for example, centering data with missing elements is likely to be suboptimal if there are many missing elements.
Building a model for data consists of two parts: postulating a structural model and using a method to estimate the parameters. Centering has to do with the first part: when centering, a model including offsets is postulated. Scaling has to do with the second part: when scaling another way of fitting the model is employed. It is shown that centering is simply a convenient technique to estimate model parameters for models with certain offsets, but this does not work for all types of offsets. It is also shown that scaling is a way to fit models with a weighted least squares loss function and that sometimes this change in objective function cannot be performed by a simple scaling step.
Further practical aspects of and alternatives to centering and scaling are discussed, and examples are used throughout to show that the conclusions in the paper are not only of theoretical interest, but can have impact on practical data analysis.
Keywords: Offset, weighted least squares, preprocessing, two-way, three-way, multiway, missing data, PCA, PARAFAC
It is important to have a concise terminology for scaling and centering. The following convention is based on suggestions from the literature [Harshman & Lundy 1984, Kiers 2000, Kruskal 1989, ten Berge 1989]. The term 'an offset' - also sometimes called an intercept - is used for a part of the model that is constant across one or several modes. An R-component bilinear model of a data matrix, X (I × J) with elements xij, may be written in terms of scalars or in matrix notation as
|
X = ΦΘT + E |
(1) |
where Φ (I × R) and Θ (J × R) hold the parameters, ir and
jr, where Greek symbols are
used to indicate population parameters. The matrix E holds the unknown
errors. Offsets may be constant across the
first mode (rows). The model associated with such offsets is:
|
X = ΦΘT + 1 |
where (J × 1) holds
the constant terms
(j=1,..,J) and where 1 is a one-vector of suitable size (I × 1 in this case). Again, the Greek symbol
indicates a population value. Offsets may also
be constant across the second mode (columns). The underlying bilinear model
with offsets across the second mode reads
|
X = ΦΘT + |
(3) |
where the vector (I × 1) is
now holding the offsets
(i=1,..I). Offsets may also be constant across
columns and across rows yielding
|
X = ΦΘT + |
|
in which the single constant is the same for all elements of X. Such a situation may arise for
example in chromatography or capillary electrophoresis where a constant offset
in the detector may appear, due to the way the
detector zero-level is determined.
Thus, for bilinear models there are two basic types of offsets: constants across one mode (columns or rows) or constants across both modes. Combinations of such offsets may also appear as is for example seen in analysis of variance settings.
As will be shown below, offsets are often handled by first centering the data and subsequently fitting the bilinear model. If the data are centered by subtracting the column-average from every element in the column, this is referred to as centering across the first mode. Mathematically it can be expressed
|
yij = xij
- |
(5) |
where yij is an element of the centered data matrix. If m (J×1) is a vector holding the jth column-average in its jth element, then centering across the first mode can also be expressed
|
Y = X |
(6) |
where 1 is an I-vector of ones and Y the matrix holding the centered data. Subtracting the row-average from each element in a row is referred to as centering across the second mode and can be expressed
|
yij = xij
- |
(7) |
or using m (I×1) as a vector holding the ith row-average in its ith element
|
Y = X |
(8) |
In general, centering across one mode is also called single-centering and performing for example a centering across the first mode and then a subsequent centering of the outcome across the second mode is called double-centering. The term slab-centering, which is sometimes seen in the literature, refers to centering by subtracting, from each slab in a three-way array, the overall average of that slab. For two-way data, this simply corresponds to subtracting the average of all elements.
For scaling, another terminology is used. When a matrix is scaled such that each row is multiplied by a specific scalar, it is called scaling within the first mode (yij = xijwi). If each column is multiplied by a certain scalar as in traditional autoscaling, it is referred to as scaling within the second mode (yij = xijwj). In matrix notation scaling within the first mode can be written
|
Y = WX |
(9) |
where W is an I × I diagonal matrix holding the scalar wi in its ith diagonal element. Scaling within the second mode can be written
|
Y = XW |
(10) |
where W is now a J × J diagonal matrix holding the scalar wj in its jth diagonal element.

Figure 1. Example of centering and scaling using sugar process data (legend: see
text).
An example of centering and scaling is shown in Figure 1. The data is from a sugar factory (see Nørgaard [1995] for more information). The variables shown are Ash (1), Color (2), Color type (3), Turbidity (4), Grain size1 (5), Grain size2 (6), SO2 (7), Invert (8), Floc (9), Insoluble residue (10), Amino-N (11) which are all measured in different units of different magnitude. Each line in a plot is the status ('spectrum') of these 11 variables at a certain time. Ninety-seven times are shown. The raw data are shown on the left side A). The different ranges of these variables will manifest themselves in a subsequent modeling of the data, where the variables with little variation will not be modeled to any significant degree. Centering (across the first mode) will not remove these scale differences, but will move the variation of each variable to the zero-level. As the differences in scales between variables are arbitrary, it is useful to scale the data so that each variable has the same initial standard deviation (and remove different measurement units). This can be achieved by scaling the centered data within the second (variable) mode. The corresponding autoscaled data are shown on the right side of the figure. It is seen that the variation of each processed variable is comparable for the autoscaled data. The outlying sample in the lower part of the plot, however, leads to a too dramatic down-weighting of, for example, variable one and should hence be excluded before preprocessing is carried out.
It may seem strange that different words are used for scaling (within) and centering (across). The explanation for this is as follows. Centering is performed across a mode in the sense that one offset is subtracted from every element in a certain vector, i.e., the data are centered across the elements of one mode. The same holds for three-way data; the average value is subtracted from each element of a vector. Scaling is performed by multiplying all elements in the array containing a certain variable (or object) by the same scalar. For two-way data, scaling therefore also pertains to vectors, but e.g. for three-way data, this means that a whole slab (corresponding, for example, to the I × K matrix of the jth wavelength in a spectral three-way array) has to be multiplied by the same scalar. Thus scaling is performed within the elements of one mode.
In the following, much use will be made of the notion of 'fit' or 'model fit'. In general terms this means what portion of the data is fitted by the model. This can be expressed by
|
|
(11) |
where X
contains the data, the fitted values of the data using the model
and E the residuals. The symbol
denotes a norm of A, here taken to be the Frobenius or Euclidian norm (square root of
the sum of squared elements of A).
The R2 statistic can take
on values between zero and one, where one means perfect fit and zero means no
fit. The fit may also be expressed in percentages between 0 and 100.
There are two leading principles in this paper. The first principle is parsimony. It is preferred that a model is as simple as possible. This means that if two models give the same fit, the model using the fewer number of parameters is preferred. This idea goes back to Willem of Ockham who lived in the Middle Ages and stated that a minimum number of assumptions should be adopted to explain a phenomenon [Weinberg 1964]. This principle is known as 'Ockham's Razor'. In statistics, the notion of parsimony is formulated in Statistical Decision Theory [Judge et al. 1985] and in chemometrics it was introduced as the 'parsimony principle' [Seasholtz & Kowalski 1993].
The second principle is that centering in this paper is not considered to estimate offsets, but to remove offsets. Estimating offsets is a different issue than removing them and estimating offsets has its own properties and problems.
In the main part of the paper only two-way (bilinear) models are considered because most results generalize straightforwardly to multiway models. Section 2 is concerned with two-way centering, Section 3 with two-way scaling, and Section 4 with the combined use of two-way centering and scaling. Section 5 explains the discussed results in terms of multiway models and finally in Section 6 some conclusions are drawn.
The following notation is used in this paper.
Two-way data are held in an I × J matrix X with typical elements xij.
Three-way data are held in an I × J × K matrix X with typical elements xijk.
Such a three-way array is often rearranged to an I × JK matrix X(I×JK)
by concatenating the K third-mode
frontal slabs after each other; i.e., X(I×JK) = [X1 X2 .. XK] where Xk
is the I × J matrix obtained by fixing the third
mode of X at k. This operation has been referred to
in a number of ways (e.g. unfolding), but it has been suggested to use the term
matricizing to avoid confusion with other techniques [Kiers 2000]. The matricized array it is often
just denoted X Instead of X(I×JK) if no confusion is possible. The
letter Y is used for preprocessed
data and is used for a model of X (be it two- or three-way). The number of components in a
model is called R.
The letter m is used for calculated offsets (e.g. averages) and μ for true offsets (population values). The character w is used for a scaling parameter. The character P is used for a projection matrix related to centering. Usually its dimension will not be specified, because it follows from the context. Similar rules apply for the diagonal matrix W holding the weights associated with scaling the array.
The
Kronecker product is denoted ,
the Hadamard (element-wise) product *, and the Khatri-Rao [Styan 1973] product which
is the column-wise Kronecker product of two matrices is denoted
. The use of
these special products makes it possible to express most three-way models with
two-way (matricized) arrays [Bro 1998]. For example, a PARAFAC
model of an I×J×K
array X can be expressed as
|
X(I×JK) = A(C |
where A (I×R), B (J×R), and C (K×R) are component matrices and E holds the residual unexplained variation. This notation is equivalent to the model
|
|
(13) |
In order to understand when and how centering works, it is important first to consider the goals of centering and to realize how these goals are achieved in practice. These aspects are described in this section.
As stated by Harshman and Lundy [1984], quite subjective and qualitative reasons are often given for performing centering. It is possible to formulate rational reasons for centering on scientific grounds. Basically, centering should be performed only if there are common offsets in the data or if modeling such offsets provides an approximately reasonable model. Thus, centering is performed to make interval-scale data behave as ratio-scale data, which is the type of data assumed in most multivariate models. Said more simply, centering should make a difference. This difference can manifest itself as
i) Reduced rank of the model
ii) Increased fit to the data
iii) Specific removal of offsets
iv) Avoidance of numerical problems
Re i). If a model of the raw data requires, say, R + 1 components to describe the data well, whereas a model of the centered data requires only R components, then centering is sensible because the model of the centered data only has R(I+J) + J parameters. The J parameters pertain to the calculated averages, assuming that centering is performed across the first mode. The alternative of fitting the (R+1)-component model to the raw data would lead to a model with (R+1)(I+J) parameters and thus would violate the parsimony principle. Re ii). In some situations, the rank of the appropriate model is not reduced upon centering, but if the fit of the model of the centered data is significantly improved, then naturally introducing extra parameters is useful. It is possible to heuristically consider the offsets introduced by centering as one extra 'half' component of which either the scores (centering across first mode) or the loadings (centering across second mode) are known a priori to be equal to one. This holds in the sense that the fit of a model with R components is poorer than the fit of a model with R components and offsets, which again is poorer than a model with R+1 components. Re iii). Centering can remove certain offsets. In some situations the offsets as such are of interest, in which case it is interesting to estimate these. This can usually not be achieved by centering. Re iv). In certain algorithms, it may be useful to center the data in order to minimize algorithmic problems. For fitting a bilinear model using principal component analysis, it is known that the ratio of the two largest eigenvalues is related to the convergence rate of the power method (and related techniques such as NIPALS). For PARAFAC it is also known that if some components are strongly correlated as evidenced through Tucker's congruence coefficient [Tucker 1951] then the fitting procedure may be complicated by so-called swamps. For both situations, it holds that centering across certain modes can be helpful in minimizing the cause of the problem because the resulting optimization problem is related to a different model with different (and hopefully better) properties with respect to numerical problems.
The following discussion pertains to centering across the first mode (ordinary column-centering) but is readily applicable to centering across the second mode as well. Understanding that centering is a special projection step within one specific mode, explains why it eliminates constant terms in the data (see Appendix 1 for details on projections).
If the vector m holds in its jth element the average of the jth column of X, then m can be expressed as
|
m = (1/I)XT1 |
(14) |
where 1 is an I-vector of ones and X (the data) has size I × J. Then centering X across the first mode (column-centering) amounts to
|
Y = X - 1mT |
where Y (IxJ) contains the centered data. As
|
1mT = (11T/I)X, |
(16) |
the centered data can also be expressed
|
Y = X - 1mT =
X - (11T/I)X = (I - (11T/I))X = (I-P)X= P |
(17) |
where PX = P[x1, ... , xJ] = [Px1, ..., PxJ]
and xj is the jth column of X. The matrix (11T/I) = P is a symmetric and idempotent (I × I)
matrix and is thus an (orthogonal) projection matrix (See Appendix 1). This shows that the column-averages are the
orthogonal projections of the columns of X
onto the direction of ones, i.e., the direction given by the vector 1. The centering matrix (I - (11T/I)) = P
is the projection matrix onto the nullspace of 1T which equals range(1)
(where range(·) is the range of a
matrix). Stated otherwise, centering may also be interpreted as providing the
residuals after regressing the columns of X
onto 1. It follows, that
centering may be viewed as the projection of the data onto a space with the
common offset (given by the I-vector 1) removed.
Mathematically, centering is a projection onto the nullspace of 1T and it is worthwhile to keep this in mind. Suppose that the true model of the data contains offsets across the first mode. Then the model can be written as
|
X = ΦΘ’ + 1μT + E |
(18) |
Projecting these data onto the nullspace of 1T leads to
|
P |
(19) |
|
P |
(20) |
where PX
= Y
is the matrix holding the centered data and where the matrix P
1μT vanishes, as 1 has no residuals when projected onto
itself. The part P
ΦΘT is a bilinear model with scores P
Φ and loadings Θ. Thus instead of fitting the
bilinear model and the offsets to the
original data, it is only necessary to fit the bilinear model to the centered
data, Y
with true structure, P
ΦΘT+ P
E. Centering also leads to models with
residuals with zero column averages (centered across the first mode), because
these are also projected onto the nullspace
of 1T, as is clear from Eq. (20).
If Φ and Θ are nonnegative, then nonnegativity constraints can be imposed on the model for X. When X is centered, e.g., across the first mode, such a constraint is not meaningful for Φ anymore, because centering destroys nonnegativity of Φ (but not of Θ).
As mentioned earlier,
centering across a given mode is called single-centering. Single-centering an
array across one mode that has previously been single-centered across another
mode is called double-centering. Performing several single-centerings (for multiway arrays as many can be
performed as the number of ways) is unproblematic in the sense that centering
across one mode, leaves the ‘centeredness’ intact in other modes [Harshman & Lundy 1984]. Further, the order of centering is
immaterial. This means that if the data are first centered across the first
mode, and subsequently the centered data are centered across the second mode,
then the average of every column and every row will be zero. This follows,
because centering across the first mode can be written as X where
is the
centering operator for the first mode. Centering across the second mode can be
written X
. Thus double-centering can be written
X
, and hence: i) the order of which centering is
performed is immaterial and ii) the double-centered array will have both
column- and row-average zero, because
X
can be
viewed as centering the matrix X
across the
first mode or centering the matrix
X
across the second mode.
Consider a two-way data set, which is generated as
|
X = ΦΘT + 1μT + E |
where Φ is (I × R), Θ (J × R) and μ (J × 1). It follows, that the data can be modeled by a bilinear model plus a common offset for each variable/column plus additional unmodeled variation held in the residual matrix E. This is the model assumed to be a valid approximation in most bilinear methods in chemometrics. The least squares loss function for the above model is
|
||X |
and this function is to be minimized directly over A, B, and n, with A, B and n being of the same dimensions as Φ, Θ and μ respectively. The matrices A, B, and n contain estimates of Φ, Θ, and μ respectively, but it is intrinsic to the problem, that these estimates do not uniquely recover the underlying parameters. For example, Φ and Θ can, at most, be found up to a rotation. As shown above, however, the parameters can be estimated in two steps. Centering the data across the first mode will remove the offsets μ, and the bilinear model is subsequently fitted to the centered data Y, thus minimizing the loss function
|
|| Y - CDT||2 |
only over C and D that are of the same dimensions as A and B above. It holds that
|
min||X |
(24) |
i.e. the fit of the model fitted directly and the bilinear model fitted to the centered data will be exactly the same (see proof in Appendix 2) even though the actual parameters will usually differ. This is an important result because it guarantees the optimality of the model even if the offsets are calculated separately from the bilinear parameters.
The
solution for minimizing Eq. (22) directly is not unique for n. That is, the two-stage solution of
centering first and then fitting the bilinear part gives one solution of many
to the problem in Eq. (22). Therefore, centering removes μ, but m is not necessarily an estimate of μ.
This non-uniqueness is explained in short. Centering involves subtracting from each column its column-average. The matrix
|
PX = PΦΘT + P1μT+ PE |
(25) |
holds in each row the vector mT containing the average value of each column of X (P is 11T/I, as above). If the part PΦΘT is a matrix of zeros, then m will be an estimate of the true offsets μ (with error PE), because
|
P1μT = 1μT |
(26) |
by definition. However, PΦΘT will only be a zero-matrix if Φ is orthogonal to P and/or Θ = 0 (assuming that Φ, Θ have full column rank). That is, PΦ=0 or ΦTP=0. Thus, the offsets will only equal the true offsets if the column-space of Φ is orthogonal to 1 (meaning that Φ is centered already) or if Θ = 0.
In some cases centering reduces rank and in
some cases not. Column centering of X
(I×J)
reduces the rank of X if and only if
1range(X), where range(X) is the range of X (see Amrhein [1998]
p. 156, and Amrhein [1996]). Intuitively this is
understandable. Centering is a projection. If the axis on which the projection
takes place is a part of the range of X,
then the residuals of this projection do not have this direction available
anymore. Hence, the rank of the matrix of residuals lowers by one. This simple
fact has several repercussions for centering across the first mode
(column-centering).
The following can be said about the noiseless case. Suppose that X (I×J) is noiseless and can be modeled as
|
|
|
where Φ is (I×R) of
full rank; Θ (J×R).
Assume that has full rank R+1 which will be fulfilled for real data. For Y, the column-centered X,
two cases can be distinguished:
1. 1 range(Φ)
rank(X) = R
rank(Y) = R -1
2. 1 range(Φ)
rank(X) = R +1
rank(Y) = R
Hence, in both cases the rank of X is reduced by one. The reverse also holds: if for the noiseless case no rank reduction of X is obtained upon column-centering, then model (27) is not valid. To summarize, in the noiseless case there is a simple relation between the validity of model (27) and rank reduction upon column-centering.
In the case with noise added things are less simple. Suppose that X (I×J) also contains noise and the model for X is
|
|
|
If I>J, then in general 1 range(X). Although 1
range(‘noiseless’
X), this property is destroyed upon
adding noise to X. In the case I ≤ J and if X has full rank I, then 1 is by definition in the range of X, whether or not there exists an
offset term 1μT. Hence, in the case with noise
there is no simple relationship anymore between the validity of (28) and mathematical rank reduction upon
column-centering.
Viewing centering as a projection step rather than as a simple subtraction of averages has more than theoretical importance. In practice, situations often occur where subtraction of averages does not work and may, in fact, lead to models that fit the original data more poorly than if the data had not been preprocessed. This will be shown in the following for two different problems: handling missing data and modeling a single common offset.
When data are missing, centering by subtracting averages from columns or rows does not lead to elimination of offsets. Rather, the offsets have to be eliminated simultaneously with the fitting of the bilinear part [Grung & Manne 1998]. This is so because the equivalence between subtracting average values and projecting onto the nullspace of vectors of ones no longer holds as the projection cannot be calculated with missing elements. As an example, consider the matrix X(1) shown below.
|
|
|
This is a rank one matrix and will remain so, even after centering across the first mode. The averages of the two columns are 4.1 and 8.2 respectively and the centered matrix reads as Y(1) in Eq. (29), which is also a rank one matrix.
Consider now a situation in which the first two elements in the first column are missing. The data then reads as X(2). This data set is naturally still perfectly modeled by a rank one bilinear model, as no new information has been added. The average of each column is now 1 and 8.2 and subtracting these values leads to centered matrix Y(2). This matrix cannot be described by a rank one model. This is easily realized by only looking at the last three rows. This is a rank two submatrix, and the addition of the first two rows cannot change this. Thus, by subtracting averages from the data with missing elements, the structure of the data has been destroyed and meaningful results cannot be expected. Prior centering no longer leads to elimination of the true offsets as centering ordinarily does.
Centering is really an extension of the bi- (or multi-) linear model where offsets are assumed to be present in the model of the data. Data with missing elements constitute one situation in which such a model cannot be fitted in a least squares sense using centering. An alternative to eliminating offsets by preprocessing is given in Section 2.4.
The traditional centering across the first mode easily leads to the belief that subtracting averages with the same structure as the offsets will generally eliminate these offsets. This holds for offsets constant across one mode but it does not hold in general.
Consider a data set structured as a bilinear part plus a constant identical for all elements; that is, all elements have the same common offset, as also shown in Eq. (4). It might seem natural to remove this offset by initially subtracting the grand mean m from the data. However, this will not simplify the subsequent modeling of the data, and in fact, it obscures interpretation of the model, because such a centering leads to artificial mathematical components that also need to be modeled.
To explain this, assume that X is perfectly described by a bilinear part plus a common offset
|
X = ΦΘT + 1I1JTμ. |
Centering by removing the overall mean of all elements of X can be written
|
Y = X - PIXPJ |
where PJ is the projection matrix of 1J (=1J1JT/J) and PI of 1I (=1I1IT/I). Then PIXPJ is a matrix of the same size as X holding the overall average of X in all its elements. Inserting the true model of Eq. (30) in Eq. (31) leads to
|
Y = ΦΘT + 1I1JTμ - PI(ΦΘT + 1I1JTμ)PJ =
ΦΘT + 1I1JTμ - PIΦΘTPJ - PI1I1JTμPJ =
ΦΘT + 1I1JTμ - PIΦΘTPJ - 1I1JTμ =
ΦΘT - PIΦΘTPJ =
ΦΘT - 1I1JTs. |
(32) |
The scalar s is the overall average of the true bilinear part ΦΘT. Even though the overall mean, λ, has been removed, a new common offset, s, has been introduced into the preprocessed data, and hence the same number of components is still necessary for modeling the data. Depending on the true parameters in the underlying model (ΦΘT), the model fitted to the preprocessed data may therefore explain less or more of the original data than if the data had not been preprocessed! Clearly, preprocessing the data by subtracting the overall mean is generally not useful.
As subtracting the overall level does not remove the offset, another approach must apparently be adopted for handling situations with one common offset. There are basically two different ways of treating the problem. The best way is to optimize the loss function of the problem directly in a one-step procedure, rather than trying to use a two-step procedure where the offsets are first removed (See Section 2.4).
Another simpler way of dealing with a constant offset follows from the observation that the model
|
X = ΦΘT + 1I1JTλ |
may equivalently be written
|
X = ΦΘT + 1IμT |
where
|
μ = 1Jμ. |
Posed in this way it is evident that a model with one global offset is a special case of the situation treated earlier where each variable (or sample) has a specific offset. Therefore, the constant offset may be eliminated by using ordinary centering across the first mode. As the offset is constant across rows, this centering removes the constant. An alternative procedure is to center across columns instead of rows, because the offset is also constant across columns.
An example is given for illustration. Consider [RB1]a spectral data set. The data have been synthesized according to a bilinear three-component part plus a scalar offset of one. Thus, the model is
|
X = ΦΘT + 11T. |
(36) |
No noise is added. Using principal component analysis to model the raw data, four components are needed for describing the variation as expected (Table 1 left). Modeling the data centered by subtracting the overall mean leads to a situation where four components still have to be used for describing all systematic variation (Table 1 right). In fact, the three-component model explains less of the original data after preprocessing in this case.
Table 1. Percentage of variation explained for different models of the raw
(left) and corrected data (right).
|
# |
Raw data |
Overall average subtracted |
|
1 |
99.19% |
98.05% |
|
2 |
99.58% |
99.08% |
|
3 |
99.95% |
99.67% |
|
4 |
100.00% |
100.00% |
|
5 |
100.00% |
100.00% |
Even though only three systematic components should be present, the loading plot (Figure 2 left) clearly shows that the first four components are 'spectral'-like. With proper preprocessing only three loading vectors will be systematic, as shown in Figure 2 (right), using centering across the first mode.

Figure 2. Left: Loading plot from principal component model of the data matrix
where the overall average is subtracted. Right: The first five loadings are
shown when the data have been correctly centered across the first mode.
Single-centering involves fitting several parameters (I or J respectively). When there is only one constant parameter in the true model, as is the case here, a risk of overfitting is introduced with this approach. It is advisable, therefore, to center across the mode of largest dimension such that as few offsets as possible need to be estimated.
To capitalize, the following rule establishes the 'correct' procedure for removing offsets before fitting the model: centering across one mode removes offsets constant across that mode as well as offsets constant across both modes [Harshman & Lundy 1984]. This important rule also extends to multiway data of arbitrary order. Thus, centering across one mode removes offsets constant across that mode as well as offsets constant across several modes involving that mode. This generalization follows from realizing that multiway models can always be considered as a constrained version of a bilinear model. Hence, offsets constant across an arbitrary number of modes can always be considered a constrained version of a model with offsets constant across one mode. Centering across one of these modes will therefore eliminate the offsets because of the projection properties of centering.
Instead of modeling the data in two steps - removing offsets by centering and fitting a model to the residuals - it is possible to fit the model in one step, alleviating the need for projecting the offsets away. Two examples are given.
The example of missing data (Section 2.3.1) can be fitted directly in the following way. Assume that the offsets are, for instance, constant across the first mode and that a principal component analysis model is sought including offsets across the first mode. Such a PCA model of the data held in the matrix X including offsets reads
|
X = ΦΘT + 1μT + E |
(37) |
where μ is a J-vector. A very simple way to fit this model to a data matrix with missing elements in a least squares sense, is by the use of an alternating least squares approach where the missing elements are continuously exchanged with their model estimates. Such an algorithm may proceed as follows.
1. Initialize missing values with reasonable values. Then the data set is complete and can be modeled by ordinary techniques.
2. Fit the model including offsets to the (now complete) data set. For PCA, this amounts to centering the data and fitting the PCA model.
3. Exchange missing values in the data matrix for model estimates. These estimates will improve the current estimates and thus provide a data set where the estimated missing elements are closer to the correct ones according to the (yet unknown) true least squares model.
4. Proceed from step 2 until convergence
This approach can be shown to converge, because it is an alternating least squares algorithm and hence has a non-increasing loss function. Upon convergence to the global minimum, the imputed missing data will have no residuals and hence no influence on the model. The model parameters computed from the complete data are exactly those that would have been obtained, had the model only been fitted to the nonmissing data directly*. This approach can be viewed as a simple special case of expectation maximization [Little & Rubin 1987]. For specific models or specific offsets, other approaches can also be feasible, but the above approach is general and easily implemented.
The problem with one common offset (Section 2.3.2) can be dealt with in the following way. The loss function for a bilinear model with constant overall offset is expressed as
|
||X - ΦΘT |
(38) |
Instead of fitting the over-parameterized model
|
X = ABT + 1mT + E |
(39) |
in a two-step procedure (see Eq.(34)), it is possible to fit a 'correct' structural model
|
X = ABT + m11T + E |
directly. Instead of J parameters in m, only one parameter, m, has to be estimated. The PCA model is used here as an example, but may be exchanged with any other structural model including a three-way model. Also, other types of offsets may be used. The loss function may be optimized in several ways leading to a least squares model [Cornelius et al. 1992, van Eeuwijk 1996]. A simple algorithm is based on alternating least squares where first the offset is set to an initial value. Then the structural model is fitted to the corrected data X - m11 T using an ordinary PCA algorithm in this case. This provides an update of the bilinear parameters. Subtracting the new interim PCA model from the data leads to
|
X - ABT = m11T + E. |
(41) |
Therefore, m may be determined conditional on A and B as the overall average of X - ABT. By alternating between updating the loading parameters and the offset, the model parameters will be estimated upon convergence.
In a three-way context using a PARAFAC model, an auxiliary benefit is that the offsets may often be uniquely determined due to the uniqueness of the PARAFAC model.
Offsets are part of the model hypothesized for
the data. Some offsets can be removed by centering before fitting the remaining
part of the model. This removal of offsets involves fitting additional
parameters. Proper centering is
defined as a centering operation that removes the offsets postulated and does
not change the structural model of the data. Proper centering is always
performed across a single mode. Sequentially centering across several modes is
allowed. Hence, proper centering can always be written as or
or
depending on whether centering is performed
across the first, second or both modes. Here
is an I
× I matrix
defined as (I - (11T/I)), where I
is an I × I identity matrix and 1 an I-vector of ones. The matrix
is defined similarly. If elements are missing
or other offsets are to be modeled, this has to be done using a one-step
modeling approach where offsets and other parameters are considered
simultaneously (usually using iterative algorithms).
Unlike centering, scaling does not change the structure of the model. Scaling is used to change the weights given to different parts of the data in fitting the model. Though scaling is important, it usually has a much less dramatic influence on the fitted model than centering, as long as the model and scaling are reasonable [Westerhuis et al. 1999]. Some issues related to scaling are discussed by Paatero & Tapper [1994].
Scaling is used for several reasons. Some important ones are
i) To adjust scale differences
ii) To accommodate for heteroscedasticity
iii) To allow for different sizes of subsets of data (block-scaling)
Re i). It is quite common to use, for instance, autoscaling (centering across the first mode and scaling to unit standard deviation within the second mode) to let the variance of each variable be identical initially. Thereby, all variables have the same variance, and as the subsequent fitting of a model is performed so as to describe as much systematic variation as possible, every variable has the same initial opportunity of entering the model. This type of scaling is especially useful when the variables are measured in different measurement units (e.g. Pascal, Centigrade,...). Re ii). The ordinary least squares fitting of a model is statistically optimal in a maximum likelihood sense, if the errors are homoscedastic, independent and Gaussian. If the variances of the distributions are not the same, though the same within, e.g., a specific variable, it is possible to accommodate the fitting procedure accordingly by initially scaling the data within the variable mode. By scaling each variable with the inverse of the standard deviation of the residual variance, the fitted model will be optimal in a maximum likelihood sense. Re iii). When the data is made up of several subsets of very different sizes, it may sometimes be advantageous to scale each block separately in order to ensure that all the different blocks are allowed to influence the model. Consider, for example, a situation in which 5000 wavelengths are measured in an infrared spectrum (absorbance between 0 and 1) and one variable is given for the temperature. Due to the huge difference in number of variables (5000 and one respectively), the total variance of the infrared spectra will be tremendous compared to the temperature. If no scaling is applied to adjust for this difference, then the model is implicitly forced to focus on the infrared data. Explaining the temperature variable will not lead to a well-fitting model, unless the model is so complex that it can fit both subsets simultaneously or because the temperature data is in accordance with the infrared data. If the infrared data and the temperature data are initially believed to be equally important, then scaling both subsets to the same total variance will provide a model that reflects this assumption. Thus, in this case scaling is used from an information point of view to ensure that all important information can enter the model, irrespective of the variance of the different sources of information.
It is important to note that even if no scaling is applied, the data are still scaled by the weight one. Thus, scaling (or the loss function in general, as will be shown in the next section) always has to be considered before fitting the model.
All the above-mentioned reasons can be put under the same heading by the term weighted least squares fitting, which is a general and broader approach to fitting models than merely the use of scaling. This will be elaborated on in the next section.
Scaling is a subject often treated in conjunction with centering. However, the purpose of scaling is very different from the purpose of centering. Scaling is a way of introducing another loss function than the least squares loss function normally used. Therefore, scaling does not change the interpretation of the model and its parameters. As for centering, scaling has to be performed in a specific way in order not to introduce artificial structure that needs to be modeled. This becomes even more apparent when going to three-way models.
Scaling is usually performed by multiplying each column or each row in the data matrix by a scalar. There are two types of scaling that are relevant for two-way matrices. One is scaling within the first mode where every row is multiplied by a specific number
|
Y = WX |
(42) |
where W is an I × I diagonal matrix with the scaling parameter for the ith row on its ith diagonal element. This is the type of scaling used, for example, in standard normal variate correction [Barnes et al. 1989, Guo et al. 1999, Helland et al. 1995] where the norm or area of each row of X is scaled to the same scalar value by using an appropriate W. It extends easily to multiway arrays, as discussed in Section 5.2.
The other main type of scaling is within the second mode where every column is multiplied by a specific number
|
Y = XW |
where W is here a J × J diagonal matrix with the scaling parameter for the jth column on its jth diagonal element. This is the scaling ordinarily used in, e.g., PCA where the weight of a specific column (variable) is often chosen to be the inverse of the standard deviation of the variable. In combination with centering across the first mode, such scaling within the second mode is often referred to as autoscaling in the two-way case.
An example of a scaling according to Eq. (43) is shown in Figure 3. One-component bilinear data with huge random residual variation in the last half of the variables (upper left) is generated. The resulting X has the spectra (as plotted in Figure 3 upper left) in its rows. The first principal component is seen to correlate well with the reference score generating the data (lower left). However, when the data are initially weighted as in Eq. (43) by the inverse of the residual standard deviation (upper right), the right part of the data is downweighted substantially. The correlation between the true known score (‘concentration’) and the estimated score of the scaled data is even higher (lower right) than for the raw data (left). Even in this extreme case, the influence of the noise is not at all as drastic as might be expected. This illustrates that scaling is not as critical as centering as long as the scaling is reasonable and the variables relevant. Note that the scaling in this example is not the usual autoscaling using the inverse standard deviation of the data. Rather, the inverse standard deviation of the residual variation, for instance, assessed by replicates, is used.
|
|
|
Figure 3. Influence of scaling on fitted model |
Given a data matrix X (I
× J)
and a model (I × J) which may, for example, be a bilinear
model (
= ABT), a standard approach for
determining the model and its parameters is to fit the model in a least squares
sense by minimizing the loss function
|
||X
- |
(44) |
which can also be expressed
|
|
(45) |
When the different elements of the data have different uncertainties or relevances, it is possible to fit the model using a weighted least squares loss function. In one of its simplest form this can be expressed
|
||(X
- |
(46) |
where (I × J) holds in its ijth element the weight of the ijth
element of X, and '*' is the
Hadamard (elementwise) product. Often the weight of an element is set equal to
the inverse of the standard deviation of the residual variation. It is also
possible to use more elaborate weights, if there are certain correlations
between the residuals [Bro et al. 2001, Wentzell
& Lohnes 1999]. The above weighted loss
function can also be expressed in scalar notation as
|
|
|
In a maximum-likelihood sense, this loss function is optimal if the weights reflect the uncertainty of the individual elements, if there is no correlation between the residuals of different elements, and if the residuals are normally distributed. If the uncertainty of a given variable is the same over all objects, the above model turns into
|
||(X
- |
where the weight matrix, W , is now a diagonal matrix which holds the column-specific weights in the diagonal. The loss function may be transformed as follows
|
||(X
-
||XW
- |
(49) |
In case of a bilinear model the above reads
|
||(X - ABT)W||2 =
||XW - ABTW||2 =
||XW - AHT||2. |
(50) |
Thus, by fitting the
bilinear model AHT to the data scaled within the second mode, XW, in
an ordinary least squares sense, the weighted loss function of Eq. (48) is automatically optimized. This is the basic
mathematical rationale behind scaling. If
the sought model, ,
has the structure ABT then fitting a bilinear model to the scaled data
provides the model in the form AHT. Thus, the score matrix (or a rotated version of
it) is directly provided in A
whereas the loadings of the problem are found by premultiplying the found
loadings, H, with W-1, as
|
BTW=HT |
(51) |
Sometimes, only the scaled data are considered and model parameters are not transformed back to the original domain. The appropriateness of this approach is still, however, governed by the fact that scaling as outlined above maintains the bilinear structure assumed reasonable for the raw data.
There is a direct connection between ||(X - ABT)W||2 and ||XW - AHT||2. However, there is no direct connection between the solution to ||(X - ABT)W||2 and ||(X - TPT)||2 unless the model has perfect fit. That is, fitting the bilinear model to scaled and unscaled data are two different problems with no direct relation.
When scaling within several modes is desired, the situation is complicated, because scaling one mode affects the scale of the other mode. For example, scaling to a standard deviation of one within both the first and second mode will generally not be possible, not even using iterative scaling [Harshman & Lundy 1984]. If scaling to a mean square of one is desired within both modes, this has to be done iteratively, until convergence [Harshman & Lundy 1984, ten Berge 1989]. Using mean-squares rather than, for instance, standard deviations for scaling has the attractive property that iterative scaling is guaranteed to converge, in case no centering is included in the iterative scheme [Harshman & Lundy 1984].
Iterative preprocessing may seem unsatisfactory, because it tends to complicate the subsequent evaluation and validation of the preprocessing, since more than one set of scaling parameters for each mode has to be used. These several matrices holding the scaling parameters from each iteration may be combined, though, into one single matrix [Harshman & Lundy 1984]. An ad hoc alternative is to skip the iterative preprocessing and perform only one scaling of each mode. The purpose of scaling is mainly to bring the level of variation of different variables to some sort of equivalent levels. Therefore, one iteration of scaling can suffice to scale the data, so that no part of the data will have unreasonably large influence on the subsequent fitting.
As shown in the preceding paragraphs, scaling can be considered to be a special case of using a weighted least squares loss function. When more complicated weights are needed, it is not always possible to fit the model indirectly by fitting the least squares model to the scaled data. in such situations an alternative to scaling is to use algorithms that directly handle a weighted least squares optimization criterion [Bro 1998, Kiers 1997, Paatero 1997]. This can be relevant, for example, when the residual variation is correlated both across rows and columns [Andrews & Wentzell 1997, Bro et al. 2001, Wentzell et al. 1997].
Scaling does not affect the structural model of the data, but the loss function used to estimate the model parameters. Proper scaling is defined as the scaling that can be expressed in terms of pre- or post-multiplied weights in the loss function of the model. Hence, proper scaling of X can always be expressed as WIX or XWJ or WIXWJ; where WI and WJ are diagonal matrices holding, for example, the inverse standard deviation of the corresponding row or column. If a certain scale is needed for both modes, then the corresponding weights have to be found iteratively. Scaling, for example, to unit standard deviation in two modes is not possible in general, whereas scaling to unit mean square variation is possible.
A complicating issue in preprocessing is the interdependence of centering and scaling [ten Berge & Kiers 1989]. Because preprocessing is mostly performed in one or a few standardized ways in two-way analysis, the problems are seldom appreciated. It is important, however, to be aware that not all combinations of centering and scaling will work as anticipated (see e.g. Harshman and Lundy [1984]). Generally, only centering across both modes is straightforward or scaling within one mode combined with centering across the other mode [Harshman & Lundy 1984, Kruskal 1983], which is exactly what e.g. autoscaling amounts to.
Scaling within one mode disturbs prior centering across the same mode, but not across other modes [ten Berge 1989]. This holds for two-way arrays as well as higher-order arrays. The reason for this is illustrated in Figure 4. The solid line shows a typical column vector and the dashed line a typical row-vector. When scaling within the first mode, the elements of any column are multiplied by different numbers and hence prior centering across the first mode is destroyed.
Consider a two-way array X (I×J). If the array is scaled within the first mode, this can be expressed
|
Y = WX |
(52) |
where Y
is the scaled array, W is an I × I diagonal matrix holding the scaling
parameters and X is the original
array. As can be seen, scaling within the first mode amounts to a
multiplication of every row by a scalar. This does not affect any centering of
the vectors across the second mode, because every element in a row-vector is
multiplied by the same number. The average of any row will be the original
average of the row scaled down accordingly, and therefore, if the average is
zero, it will stay zero. In the first mode, however, each element in a column
is multiplied by a different scalar. If centering is performed across the first
mode, these column-vectors will not necessarily preserve their zero average
after subsequent scaling within the first mode. Mathematically, the centered matrix PX becomes WP
X upon scaling. As 1TWP
X?0, the preprocessed matrix is no
longer guaranteed to be centered. Offsets constant across the first mode,
however, will still be removed,
because
|
P WP |
(53) |
Note also the interesting fact that if scaling is performed before centering, the result will be different. In that case, the original offsets will not be removed, but the data will be centered (yielding centered scores and residuals), because
|
P |
(54) |
This holds, because unlike for P1, it does
not hold that P
W1
is a zero vector in general.

Figure 4. Two-way array showing the dependency between centering and scaling
Centering across one mode disturbs scaling within all modes. This holds for two- as well as multiway arrays,
but there are certain cases for which it does not hold. One of these special
cases is the situation in which two-way data are scaled within the second mode
to a standard deviation of one and subsequently centered across the first mode
(autoscaling)*. This
subsequent centering will not disturb the scaling within the second mode
(though it would disturb scaling within the first mode, had that been
performed). The reason is that the scaling is specifically performed relative
to the center of the data (standard deviations are based on centered data).
Hence, any change in offset is immaterial for the standard deviation. Scaling
by other means than standard deviations will not have this property. In
multiway analysis it is common to use mean squares for scaling instead of
standard deviations because such scaling more often converges when implemented
in an iterative scheme and because scaling by standard deviations implicitly
assumes an offset, which may or may not be present depending on the structural
part of the model.
Centering across a mode within which scaling is
also applied or vice versa is generally not going to
retain all the properties of the two individual operations, as discussed
earlier. For example, if centering across the first mode and scaling within the
first mode is desired, then setting
|
Y = WP |
(55) |
with W
being a diagonal scaling matrix and P
a centering operator will not lead to a preprocessed matrix in which the first
mode vectors are centered, although possible offsets will be eliminated.
Conversely, setting
|
Y = P |
(56) |
will not eliminate the original offsets, even
though the preprocessed array will have centered first mode vectors. Hence, for
a specific application, a choice has to be made between these two approaches,
depending on why the centering is applied.
Proper centering is a centering operation
that correctly removes the presupposed offsets and which does not introduce
other offsets into the data. Likewise, proper
scaling introduces the correct weights into the loss function. Stated
otherwise, proper centering and scaling do what they are supposed to do. Proper
centering and proper scaling can sometimes be combined. Unproblematic
combinations can always be expressed as
|
(P |
where P
is the projection matrix that centers the data and W is a weighting matrix, both of appropriate sizes. The brackets in
Eq. (57) indicate the proper order in which the different
preprocessing steps have to be performed. If performed oppositely, offsets will
not be removed, although the data will be centered. Problematic combinations are
|
WP |
(58) |
These combinations do not retain all the
properties desired of the preprocessing steps.
The preprocessing of multiway arrays will now
be discussed using three-way arrays as an example. The basic properties
discussed thus far are unchanged. Centering has to be performed across a
specific mode and scaling has to be performed by a transformation within a
specific mode. Most difficulties in preprocessing three-way arrays arise
because of the problems outlined so far, which all generalize to multiway
arrays. The problems are sometimes enhanced, because three-way data are often
rearranged (matricized) to two-way arrays before preprocessing. This is
unfortunate, because it introduces a column-mode that is a combination of two
of the original modes. Transformation within or across this combined mode
should be avoided if multiway models are to be fitted, because the mode is not
a 'real' mode, but merely a computational construct.
The observations on centering of two-way data
are helpful in discussing centering of three- and higher-way arrays. If the basis of two-way centering is understood, then
three-way centering is quite simple. In the following, it will be assumed that
the true model of the data is a PARAFAC model plus possible offsets, but the
conclusions hold for any multilinear model.
|
|
|
Figure 5. Structure of offsets in three-way (trilinear) data when all elements
have the same offset held in the scalar λ. Two alternative but equivalent ways of
writing the PARAFAC model are shown: the slab notation using submatrices Xk and the notation
described in Eq. (12) |
Consider a
three-way array. Conceptually, offsets may occur in three different ways.
Constant across all modes (Figure
5), constant across two modes (Figure 6) or constant across one mode only, as shown in Figure 7. In
the figures, the first-mode loadings are held in the I×R matrix ,
the second-mode loadings are held in the J×R matrix
and the third-mode loadings are held in the K×R matrix
.
The matrix Dk is an R×R diagonal matrix holding the kth
row of
in its diagonal.
|
|
|
Figure 6. Structure of offsets in three-way (trilinear) data when all elements
in each vertical slab have the same offset. The offsets are held in the
J-vector λ |
|
|
|
Figure 7. Structure of offsets in
three-way (trilinear) data when all elements in each vector have the same
offset (case three). The offsets are held in a matrix Λ (J×K) whose
jk'th element holds the offset of the vertical column with second and third
mode index j and k. The vector λk
is the k'th row of the matrix Λ |
Regardless of
the structure of the offsets, the basic principle of centering is that the data
must be preprocessed, so that they are projected onto the nullspace of vectors
of ones in a particular mode. For the first mode, projecting a data array X onto the nullspace of 1T, i.e., centering across
the first mode amounts to
|
Y = P |
(59) |
where P
= I - (11T/I). For
the second and third modes, the centering can be performed similarly. As
mentioned earlier, such centering is referred to as single-centering. Centering, for example, across the first mode of
an array can thus be done by matricizing the array to an I × JK
matrix, and then centering this matrix across the first mode as in ordinary
two-way analysis:
|
|
(60) |
The column-mean is subtracted from every element, as depicted
graphically in Figure
8.
As can be seen, single-centering is similar in structure to the type of offsets
shown in Figure 7.

Figure 8. Centering across the first mode. For each column of the raw data, a
mean value is calculated and subtracted from each element in the column. Thus,
a two-way matrix of mean values is obtained. Note that this centering is identical
to matricizing the data to a two-way structure and centering these two-way data
across the first mode
Such
single-centerings performed successively across two modes, are referred to as
double-centering. That is, double-centering is performed by first centering
across one mode, and then centering the outcome across another mode. The order
of centerings is immaterial, but it is essential that they are performed
sequentially. For all three situations depicted in Figure 5-Figure 7 the above centering across the first mode will remove
the shown offsets, because in all three situations the offsets are constant
across the first mode.
As for the two-way case, only single-centering leads to the properties sought in centering (removal of offsets). Other types of centering, such as subtracting the overall mean, will introduce artifacts that have to be modeled additionally to the inherent systematic variation. This was shown for the two-way case in Section 2.3.2. Such types of incorrect centering are often used in three-way analysis. For example, matricized data are centered across a combined mode. E.g., if an array of structure (variables × time × samples) is centered by subtracting the average calculated across samples and time, then in line with the above example, artificial offsets are introduced and the subsequent model will have to fit this additional variation as well. This can obscure validation and exploration of the model and will lead to models that do not provide overall least squares solutions.
As explained for the two-way case, scaling is a
transformation of a particular variable (or object) space. Instead of fitting
the model to the original data, the model is fitted to the data transformed by
a (usually) diagonal scaling matrix in
the mode whose variables are to be scaled. This means that whole matrices
instead of columns have to be scaled by the same value in three-way analysis.
For a four-way array, three-way slabs would have to be scaled by the same
scalar. Mathematically, scaling within
the first mode can be described as
|
yijk = wixijk, i=1,..I;
j=1,..J;k=1,..K |
where Y
with elements yijk is the
scaled array and, for instance, setting
|
|
(62) |
will scale to a unit mean
square within the sample mode. Using matricized arrays, scaling may be
expressed as
|
Y(I×JK) = WX(I×JK) |
(63) |
where W
is an I × I diagonal matrix holding the scaling
values in its diagonal. The assumed structural model of the data is a
multilinear model - e.g. a PARAFAC model = A(C
B)T. With the above scaling, a similar structural
model -
= P(R
Q)T - will hold for the transformed data., because
|
||Y
- P(R ||WX
- P(R ||W(X - W-1P(R |
(64) |
The loss function minimized when
fitting a model with a weighted criterion is
|
||W(X - A(C |
(65) |
and hence it holds that the parameters of this
model can be found by fitting the scaled data and setting
|
A
= W-1P, B = Q, and C = R. |
(66) |
Thus, fitting the scaled data provides a
solution not only to the problem posed as fitting a model to Y, but to the problem of fitting a
model to X and where the first mode
loadings obtained are transformed by the scaling matrix W. As for two-way scaling, it is emphasized that the found model
parameters have no direct relations to the parameters found when fitting the
model to the raw data in a least squares sense.
Scaling has to be applied by transforming the
data within a given mode. It is not appropriate to scale an array within two
combined modes which can happen, for example, when autoscaling a matricized
array. Such an inappropriate scaling will lead to the inclusion of artificial
components in the data.
|
|
|
Figure 9. Incorrect scaling of three-way array. Spectra from one sample
measured at two conditions (slabs one and two). The top plot shows the data
before scaling and the lower one after a hypothetical scaling. |
Consider an I × 30 ×2 three-way array with I samples, thirty variables (say spectral), measured at two
conditions (e.g., two different pH values) as shown in Figure 9. In the two top plots, the profiles of the
first hypothetical sample are shown (30 variables). To the left the measured
spectrum is shown at the first condition and to the right at the second
condition. As can be seen, the shape is identical in the two plots, only a
multiplicative factor distinguishes the profiles. In this case, only one
component is necessary for describing this variation. If the array is scaled
such that different occasions of the same variable are scaled differently, then
the phenomena can no longer be described by the variation of one basic profile.
This is shown in the lower plots, where the first three wavelength variables
have been scaled differently in slab one and two. In slab one they have been
scaled by one and in slab two by 0.01. It is clear that to preserve the
multilinearity of the data, any occurrence of a given variable must be scaled
by the same amount.
To
show the influence of incorrect scaling, consider a synthetic data set with PARAFAC
structure
|
X(I×JK) = A(C |
(67) |
where A
is a 4 × 2 matrix of random numbers and B and C are defined likewise. It is not important how these matrices are
generated, as long as they have full column rank. Consider the following
alternative two-component PARAFAC models:
1.
Using
X
2.
Using
X centered across the first
mode and scaled within the combined second and third mode (autoscaled as a
two-way matrix, X(I×JK))
3.
Using
X scaled within, e.g., mode
two.
The fit values of these three models
always using two components are given below in percentages of the sum-of-squares
of the preprocessed data.
1.
100.00%
2.
98.80%
3.
100.00%
As can be readily seen, a two-component model
is appropriate and should be so, even after scaling as in case (3). However,
using ordinary two-way scaling methods as in case (2) destroys the multilinear
structure of the data and deteriorates the model.
The exact same rules for interdependence of
preprocessing steps apply for multiway data as for two-way data (Section 4), also with respect to treating missing data etc. Any preprocessed array may be written in matrix notation
using the matricized I × JK data array X and the preprocessed array Y
|
Y = MIX(MK |
(68) |
where e.g. MI is an I × I array holding either the centering or
scaling transformation matrix for the first mode or even a combination of such.
The exact content of these transformation matrices depends on the type of
preprocessing chosen and in case of iterative preprocessing M may be a
product of several matrices [Harshman & Lundy 1984].
Combined centering and
scaling in one operator M is generally not going to retain all the
properties of the two individual operations (see Section 4.3). If centering across the first mode and scaling
within the first mode is desired, centering first and scaling afterwards will not lead to an array in which the
first-mode vectors are centered, although possible offsets will be eliminated.
Scaling first and centering afterwards will not eliminate the original offsets,
even though the preprocessed array will have centered first-mode vectors.
Another example based on the
guidelines in Section 4 is a situation in which e.g. scaling within the
second and third mode is desired together with centering across the first mode.
If the preprocessing is performed so that the data are centered after the
weights are determined (iteratively) and applied
|
Y = P |
|
Y = (P |
this is not the case. In this case, the data
are first centered and then the weights are determined from the centered array
rather than from the raw data. Hence, the weights in Eq. (70) are preferred.
Proper single-centering of a multiway array can
always be expressed as
|
P |
(71) |
where X
(I × JKL...) is a multiway array rearranged
to a two-way matrix such that the mode to be centered across is the row-mode.
Hence, P
works on the non-matricized mode (I
in this case). All combinations of centering of this form are proper and will
maintain a preprocessed array with offsets removed.
Proper scaling of a multiway array
can be expressed as
|
WX |
(72) |
where X
is (I × JKL...) is a multiway array rearranged
to a two-way matrix such that the mode to be scaled within is the row-mode.
Scaling, e.g., to unit mean square variation within several modes sequentially
will not yield modes within which the variables or samples have unit mean
square variation unless iterative determination of the weights are used.
Combinations
of centering and scaling are unproblematic only when centering across several
modes is desired or scaling within one mode combined with centering across
other modes. Centering must be performed before scaling. All other combinations
will only partly fulfill the requirements. E.g. centering across a mode
followed by scaling within the same mode will not lead to zero-average vectors
in the mode, but will remove any offsets across the mode.
A number of important features of
the common preprocessing steps centering and scaling have been discussed.
·
Centering deals with the structural model; scaling with the way this
model is fitted.
·
Centering is a part of a two-stage procedure in which offsets are
removed first and the multilinear terms are estimated in the second stage. This
is only equivalent to the one-stage procedure of estimating all parameters
simultaneously, if proper centering, as defined in this paper, is used. Proper
centering is shown in Figure 10 (always across one mode at a time).
·
For offsets that cannot be removed by using proper centering, a
one-stage procedure has to be used. This holds generally for data with missing
elements.
·
Scaling provides a way to change the objective function by assuming
certain weights. Some weight arrangements can be dealt with by scaling followed
by ordinary least squares fitting. Only proper scaling is allowed. Proper
scaling is shown in Figure 10. For weighting schemes that cannot be dealt with by
scaling, weighted least squares algorithms have to be used.
·
Incorrect centering or scaling introduces artificial variation. The
amount of artificial variation introduced depends on the data and leads to
models that are suboptimal to their 'correct' (least squares) counterparts.
This is so, because the artificial variation has to be modeled additionally.
Two-way
results
·
Proper centering can always be written as PX
·
Several centerings can be performed sequentially
·
Proper scaling can always be expressed as WX
·
Several scalings can be performed sequentially, but will generally need
iterations to establish the scaling constants and this may not converge
·
Unproblematic combinations of centering and scaling can be expressed as (PX)W.
Similar results hold for transposed matrices.
Three-way
results
·
Proper centering can always be written PX where X is the three-way array matricized, so that the mode to be
centered across is the first mode.
·
Several such single-centerings may be performed sequentially across
several modes
·
Proper scaling can always be expressed WX for a matricized array as above
·
Several scalings can be performed sequentially, but will generally need
iterations and may not converge
·
Proper combinations of centering and scaling can be expressed similarly
to the two-way case. That is, scaling does not affect centering across other
modes, but centering affects scaling within all modes.
The appropriate centering
and scaling procedures can most easily be summarized as in Figure 10. Centering must be done by subtracting scalars from
individual vectors of the array, while scaling must be performed by multiplying
individual slabs.

Figure 10. Three-way array. Proper centering must be done across a mode,
exemplified here by proper centering across the second mode. From all elements
of a specific row (fixed i and k) the same scalar mik is subtracted.
Proper scaling, e.g., within the first mode is performed such that all elements
of a specific horizontal slab are multiplied by the same scalar wi.
Most algorithms used in this paper are
available from the home page of Food Technology at http://www.models.life.ku.dk.
The first author is grateful for financial support through LMC (Center for
Advanced Food Studies), EU project NwayQual GRD1-1999-10377 and AQM (Advanced Quality Monitoring) supported by the Danish
Ministries of Research and Industry. Anonymous referees are thanked for helpful comments.
In this appendix, the projection of vectors on
other vectors is explained. In Figure
11 the orthogonal projection of an I-vector b on an I-vector a is considered. The resulting projection can be expressed in the original coordinate
system or in terms of the new basis vector a.

Figure 11. Projection of b on a.
Expression of in terms of a
The score of on the new basis vector a is k. The following
equations hold
|
|
(73) |
where the second equation is called the cosine
rule for vectors. Hence,
|
|
(74) |
This expression becomes particularly simple
when a is normalized to length one
(||a|=1)
|
k = (a,b) = aTb |
Note that expressions like Eq. (75) are used in principal component analysis - PCA, where
usually the loading vectors p are
chosen to be of length one and then
|
|
|
where xiT
is a row of X (I×J). Hence, the scores of PCA are just
orthogonal projections of the rows of X
on p in coordinates of this p.
Expression of in terms of the original coordinate system
It is also possible to express in the original coordinate system.
Referring to Figure
11, it holds that
=ka. Hence,
|
|
(77) |
and the matrix P (I×I) is
special because
|
|
|
which means that P is symmetric and idempotent. Then this matrix is an orthogonal
projection matrix [Schott 1997]. It projects orthogonally on the
vector a. In the case of centering,
(see Eq. (15)), the vector 1
takes the role of a.
Orthogonal projections
in general including residuals
A matrix P=AA+ (where ‘+’ indicates the
Moore-Penrose inverse) projects orthogonally on the range (column-space) of A [Schott 1997]. It can be checked that a+=aT/(aTa), hence, Eq. (78) is a special case of the general orthogonal
projection theorem.
It
is also interesting to consider the residuals from the orthogonal projection of
b on A, that is, the vector b*.
As ,
it holds that
|
|
(79) |
and is again a symmetric and idempotent matrix,
i.e., an orthogonal projection operator. This matrix projects onto the
orthogonal complement of A, that is,
onto range(
), where range(.)
is used to indicate the range (column-space) of a matrix or vector. It holds
that range(
)=nullspace(AT), where nullspace(.) is used to indicate the
nullspace of a matrix or vector. For every x
range(
) it holds that ATx=0, hence,
x
nullspace(AT) and vice versa.
Theorem
Given
X of size I × J
and the column-dimension R of a
sought bilinear model. Then
|
min||X |
(80) |
where Y is the original data, X, with the column-averages subtracted.
Proof
The
proof has been given by Kruskal [1977] and Gabriel [1978].
Understanding that centering is a projection, it is simple to prove the above
theorem. Let the loss function be
|
||X - ABT
- 1mT||2 |
(81) |
and partition
it into two orthogonal parts
|
||P |
(82) |
using the Pythagorean fact that the
squares of two orthogonal parts equal the square of the total. This equation
can be further developed to
|
min ||P min ||P |
(83) |
because ||PX - P(ABT + 1mT)||2
will be zero by setting
|
m’ = (1T/I)(X - ABT)
since 1(1T/I) = P. |
(84) |
Setting C = PA and D = B will therefore
provide a solution with exactly the same fit as would be obtained by minimizing
the original loss function. The solution may be computed using any bilinear
algorithm for fitting a principal component analysis model of Y. The scores will automatically be
centered, because they are linear combinations of the columns of X. If the columns are centered, so are
their linear combinations.
1. Amrhein M, Reaction and flow variants/invariants for the analysis of chemical reaction data. Ph.D. thesis, École Polytechnique Fédérale de Lausanne, 1998,
2. Amrhein M, Srinivasan B, Bonvin D, Schumacher MM, On the rank deficiency and rank augmentation of the spectral measurement matrix, Chemom Intell Lab Syst, 1996, 33, 17-33.
3. Andrews DT, Wentzell PD, Applications of maximum likelihood principal component analysis: incomplete data sets and calibration transfer, Analytica Chimica Acta, 1997, 350, 341-352.
4. Barnes RJ, Dhanoa MS, Lister SJ, Standard normal variate transformation and de-trending of near-infrared diffuse reflectance spectra, Appl Spectrosc, 1989, 43, 772-777.
5. Bro R, Sidiropoulos ND, Smilde AK, Maximum likelihood fitting using simple least squares algorithms, Journal of Chemometrics, 2001, Submitted.
6. Bro R, Multi-way Analysis in the Food Industry. Models, Algorithms, and Applications. Ph.D. thesis, University of Amsterdam (NL), http://www.mli.kvl.dk/staff/foodtech/brothesis.pdf, 1998,
7. Cornelius PL, Seyedsadr M, Crossa J, Using the shifted multiplicative model to search for 'separability' in crop cultivar trials, Theoretical and Applied Genetics, 1992, 84, 161-172.
8. Gabriel KR, Least squares approximation of matrices by additive and multiplicative models, Journal Of The Royal Statistical Society Series B Statistical Methodology, 1978, 40, 186-196.
9. Grung B, Manne R, Missing values in principal component analysis, Chemom Intell Lab Syst, 1998, 42, 125-139.
10. Guo Q, Wu W, Massart DL, The robust normal variate transform for pattern recognition with near-infrared data, Analytica Chimica Acta, 1999, 382, 87-103.
11. Harshman RA, Lundy ME, Data preprocessing and the extended PARAFAC model, Research Methods for Multimode Data Analysis, (Eds. Law,HG, Snyder,CW, Jr., Hattie,J, and McDonald,RP), Praeger, New York, 1984, 216-284.
12. Helland IS, Næs T, Isaksson T, Related versions of the multiplicative scatter correction method for preprocessing spectroscopic data, Chemom Intell Lab Syst, 1995, 29, 233-241.
13. Judge GG, Griffifths WE, Carter Hill R, Lütkepohl H, Lee TC, The
theory and practice of econometrics. John Wiley & Sons,
14. Kiers HAL, Weighted least squares fitting using ordinary least squares algorithms, Psychometrika, 1997, 62, 251-266.
15. Kiers HAL, Towards a standardized notation and terminology in multiway analysis, Journal of Chemometrics, 2000, 14, 105-122.
16. Kruskal JB, Some least squares theorems for matrices and N-way
arrays,1977, Manuscript, Bell Laboratories,
17. Kruskal JB, Multilinear methods, Proceedings of Symposia in Applied Mathematics, 1983, 28, 75-104.
18. Kruskal JB, Rank, decomposition, and uniqueness for 3-way and
N-way arrays, Multiway Data Analysis, (Eds. Coppi,R and Bolasco,S),
Elsevier,
19. Little RJA, Rubin DB, Statistical analysis with missing data.
John Wiley & Sons,
20. Nørgaard L, Classification and prediction of quality and process parameters of beet sugar and thick juice by fluorescence spectroscopy and chemometrics, Zuckerindustrie, 1995, 120, 970-981.
21. Paatero P, A weighted non-negative least squares algorithm for three-way 'PARAFAC' factor analysis, Chemom Intell Lab Syst, 1997, 38, 223-242.
22. Paatero P, Tapper U, Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values, Environmetrics, 1994, 5, 111-126.
23. Schott JR, Matrix analysis for statistics. Wiley and Sons,
24. Seasholtz MB, Kowalski BR, The parsimony principle applied to multivariate calibration, Analytica Chimica Acta, 1993, 277, 165-177.
25. Styan PH, Hadamard products and multivariate statistical analysis, Linear Algebra and its Applications, 1973, 6, 217-240.
26. ten Berge JMF, Convergence of PARAFAC preprocessing procedures and the Deming-Stephan method of iterative proportional fitting, Multiway Data Analysis, (Eds. Coppi,R and Bolasco,S), Elsevier, Amsterdam, 1989, 53-63.
27. ten Berge JMF, Kiers HAL, Convergence properties of an iterative procedure of ipsatizing and standardizing a data matrix, with application to PARAFAC/CANDECOMP preprocessing, Psychometrika, 1989, 54, 231-235.
28. Tucker LR, A method for synthesis of factor analysis studies, Personnel Research Section, Report 984, Dept. of the Army, 1951,
29. van Eeuwijk FA, Between and beyond additivity and non-additivity;
the statistical modelling of genotype by environment interaction in plant
breeding. Ph.D. thesis,
30. Weinberg JR, A Short History of Medieval Philosophy.
31. Wentzell PD, Andrews DT, Hamilton DC, Faber NM, Kowalski BR, Maximum likelihood principal component analysis, Journal of Chemometrics, 1997, 11, 339-366.
32. Wentzell PD,
33. Westerhuis JA, Kourti T, MacGregor JF, Comparing alternative approaches for multivariate statistical analysis of batch process data, Journal of Chemometrics, 1999, 13, 397-413.
* If the algorithm for fitting the bilinear part of model is iterative, it is useful not to iterate until convergence in each step two. Instead only a few iterations are performed before one proceeds to step three where the complete model (AB’ + 1m’ in case of PCA) is calculated and used for obtaining better estimates of the missing values.
* Normally, centering would be performed before scaling for computational reasons, as the averages are needed for scaling by the inverse standard deviation.
[RB1]% Shows that centering with the grand mean sucks
%load claus,f=parafac(X,DimX,3,[],[2 2 2]);
%[a,b,c]=fac2let(f,DimX,3);save spec
load spec
clear
load spec
X = rand(10,3)*b([1:3:150],[1 2 3])';
X = X + .1;
[t,s,p]=svds(X,5);t=t*s;
[t1,s1,p1]=svds(X-mean(X(:)),5);t1=t1*s1;
[t2,s2,p2]=svds(mncn(X),5);t2=t2*s2;
ss=[];
for i=1:5
ss=[ss;100*(1-ssq(X-t(:,1:i)*p(:,1:i)')/ssq(X))];
end
ss1=[];
for i=1:5
ss1=[ss1;100*(1-ssq(X-t1(:,1:i)*p1(:,1:i)'-mean(X(:)))/ssq(X))];
end
subplot(1,2,1)
plot(p1(:,1:4),'b','LineWidth',2)
hold on
plot(p1(:,5),'b','LineWidth',1)
set(gca,'XTick',[0 25 50])
set(gca,'YTick',[-.5 0 0.5])
title('Loadings grand-centered data','FontWeight','Bold')
xlabel('Spectral variable')
ylabel('Loading')
hold off
subplot(1,2,2)
plot(p2(:,1:3),'b','LineWidth',2)
hold on
plot(p2(:,4:5),'b','LineWidth',1)
set(gca,'XTick',[0 25 50])
set(gca,'YTick',[-.5 0 0.5])
title('Loadings single-centered data','FontWeight','Bold')
xlabel('Spectral variable')
ylabel('Loading')
hold off
%For finding numbers
figure,plot(p1),legend('1','2','3','4','5')
figure,plot(p2),legend('1','2','3','4','5')