*Mathematical space is a collection of objects with a structure defined over them.* This definition as a whole, as well as the words used in it and their relations, may seem too vague, broad, and ambiguous. What is an “object”? What are a “collection” and a “structure”? Why is the structure “defined”, rather than intrinsically belonging to these objects or their collection? And if it is defined, who defines it, and why?

Instead of intuitively guessing answers to these questions, we might want to build our understanding of the vague and complex concepts from a limited basis of simple, narrow, well-defined, clear and distinct concepts, as René Descartes, one of the framers of modern philosophy and algebra, wanted. However, at the same time he noted that the human mind is a very poor inventor. It can’t create anything genuinely new. It is very good, though, at mixing and combining blocks already known to it. If we go down the path of excessive reduction of the number, richness, and generality of these blocks, we may end up unable to span complex concepts (if any at all) from the simple, distinct, and clear blocks.

Therefore, we may be better off defining simple, well-defined, and clear rules of combination over potentially opaque, complex, and only intuitively comprehensible building blocks. To add insult to injury, postmodernism pointed out that intuition depends on personal experience, so each person’s innate concepts will differ from everyone else’s. The question ‘is your “red” the same as my “red”?’ does not necessarily expect an affirmative answer. Still, we may attach labels or tokens (the word “red”) to real-world phenomena, hoping that everyone will sort out the relations between their inner mental images and those labels by themselves, likely on a sub/unconscious level.

In this context, Chomsky’s LAD (Language Acquisition Device) could also be decoded as the “Label” Acquisition Device, because those innate concepts of real-world understanding are embedded into natural human languages. For example, the very concept of the number 1 (and of numbers other than 1) is present in some form in all(?) grammars of human languages, as is the dynamic of going from 1 to many via 1+1=2 (though dropped in explicit form by many modern languages).

Therefore, we may not need to go down the rabbit hole of purely analytical ontologies, risking ending up with nothing: this is/consists of that and that, which, in their turn, are/consist of even smaller and simpler things, until they disappear. Instead, at some level we may start thinking in terms of more synthetic ontologies of the type: this could be thought of as that (and that, and that) in their combinations and relations (as well as equally valid other combinations of other “that”s).

Based on those deep, language-embedded concepts of natural numbers, formal mathematical axioms, like Peano’s axioms, come quite naturally: for each natural number x there is exactly one successor y = s(x) = x + 1; if successors are equal, then so are their predecessors; there is exactly one number – 1 – that is not a successor of any number; 1 together with its successors spans the whole collection of natural numbers N.
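Written out formally (one standard rendering, sketched here in LaTeX; we start from 1, as in the text above, rather than the also-common 0):

```latex
\begin{align*}
&\forall x \in \mathbb{N}\ \exists!\, y \in \mathbb{N}:\ y = s(x) = x + 1
  && \text{unique successor}\\
&s(x) = s(y) \Rightarrow x = y
  && \text{equal successors, equal predecessors}\\
&\forall x \in \mathbb{N}:\ s(x) \neq 1
  && \text{1 is not a successor}\\
&\forall S \subseteq \mathbb{N}:\ \bigl(1 \in S \land (x \in S \Rightarrow s(x) \in S)\bigr) \Rightarrow S = \mathbb{N}
  && \text{1 and its successors span } \mathbb{N}
\end{align*}
```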

Mathematics has always been an object of amusement: how can such a seemingly artificial construct out of a fantasy world be so useful when applied to problems of the real world? Meanwhile, mathematical concepts are always based, on the one hand, on real-world phenomena, and on the other hand, on humans’ patterns of world comprehension. One can reasonably expect that wherever those concepts lead us, their range will be the same – humans’ understanding of the world, i.e. humans may always find areas of their application.

The above implies that extraterrestrial or artificial intelligence, with potentially different ways of observing and comprehending the real world, may develop quite different mathematics. Even humans, using axiom sets different from the “usual” ones, have developed quite different mathematical theories, but we may expect ETs’ and AIs’ difference to be on a much larger scale.

However, let’s come back in the next chapter to the ways of thinking about “collections of objects” and “mathematical structure defined over them” from the original definition…


#horizontal matrix PCA – ECG channels
pca_model <- ks_eigen_rotate_cov(ekg2)
ds <- pca_model$ds

# 3-dimension, k-means clustering
c_model <- ks_kmeans_nd_means(ds, c("V5","V6","V7"), 6)
ds <- ks_kmeans_nd_clusters(ds, c("V5","V6","V7"), c_model$Mu)

cloud(V7 ~ V5 * V6, ds, groups=Cls, pretty=TRUE, zoom=0.9,
      screen = list(x = 90, y = -30, z = 0))

ggplot(ds, aes(x=V1))+
  geom_point(aes(y=V6, color=Cls))+
  scale_colour_gradientn(colours=rainbow(4))

Though, the V6 principal variable view with clustering (Fig. 18.2) shows much better separation of datapoints than in the 1D case (Fig. 16.1):

Still, we may want to add more continuity to the data, clustering datapoints not only by the channels’ amplitudes, but also temporally. Of course, linear time won’t be much help there, but we may use “cyclic” time, modulating it, for example, trigonometrically according to the child’s heartbeat frequency (or it could be any cyclic function), and scaling it up to the principal component amplitudes. The idea here is to make datapoints neighbours not only by the neighbouring amplitude parameters, but also by the close-occurrence-time criterion. Including the “cyclic” time variable in clustering as a 4th dimension gives us even better cluster separation (Fig. 18.3,4):

#including cyclic time
ds[,"P"] <- -sin(ds[,"V1"]/(10/22.4)*2*pi)*20

c_model <- ks_kmeans_nd_means(ds, c("V5","V6","V7","P"), 6)
ds <- ks_kmeans_nd_clusters(ds, c("V5","V6","V7","P"), c_model$Mu)
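The cyclic encoding is easy to sanity-check. A minimal Python/numpy sketch (with a hypothetical period of 10 time units and scale of 20, not the actual heartbeat parameters above): two samples exactly one period apart get identical “cyclic time” values, so they become neighbours along that 4th dimension despite being far apart in linear time.

```python
import numpy as np

# hypothetical period and amplitude scale (not the real ECG parameters)
period, scale = 10.0, 20.0

t = np.arange(0.0, 30.0, 0.5)
P = -np.sin(t / period * 2 * np.pi) * scale

# t[4] = 2.0 and t[24] = 12.0 are exactly one period apart:
# their cyclic-time features coincide, so they can cluster together
print(abs(P[4] - P[24]))   # ~0
```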

Now, we may transform the clustered dataset of perspiration and child’s heartbeat from the principal eigenbasis back into the original ECG basis, and see how these signals would look on the original ECG channels had we had filters to separate them from the primary mother’s heartbeat signal (Fig. 18.5):

ds$V2 <- 0
ds$V3 <- 0
ds$V4 <- 0
ds$V8 <- 0
ds$V9 <- 0

#inverse subset transform
#ts <- as.data.frame(ds[ds$Cls<2,"V1"])
ts <- as.data.frame(ds[ds$Cls>1,"V1"])
names(ts) <- c("V1")

#ds_p <- ds[ds$Cls<2,dim]
ds_p <- ds[ds$Cls>1,dim]
ds <- as.data.frame(as.matrix(ds_p) %*% pca_model$An1)

#put original var names and time back
names(ds) <- c("V2","V3","V4","V5","V6","V7","V8","V9")
ds[,"V1"] <- ts[,"V1"]

…

**Appendix 18**

As usual, working files are in: https://github.com/NSelitskaya/kitchen-style-r

When trying to associate our 1D “disjoint delta” measure with entities of n-dimensional space, we notice that for the 2-means case it naturally corresponds to the dot/inner product of the delta vectors **x**i – **mu’** and **x**i – **mu”** (where **x**i is the i-th datapoint, and **Mu** is the set of cluster means). In other words, “disjoint delta” is a mapping to the real field that makes a vector space an inner product space, which gives us hope that the nD, k-means extension will also have some meaning when we come to it.

This realization of the inner product association gives us the immediate understanding that in the nD case “disjoint delta” will approach null not only when a data point approaches one of the cluster means, but also when its deltas are orthogonal. That may or may not be considered a drawback, but it is definitely something to keep in mind.

dd(**x**i, **Mu**) = (**x**i – **mu’**) . (**x**i – **mu”**) = delta(**x**i, **mu’**) . delta(**x**i, **mu”**)

with delta(**x**, **mu’**) = (x1 – mu’1)**i** + (x2 – mu’2)**j** + … + (xn – mu’n)**k**

A nice, easy-to-compute “disjoint delta” results if the basis of our space is orthonormal (i.e. **i**.**i**=1, **i**.**j**=0), which, as we made sure before, is the case for the PCA eigenbasis, so we needn’t worry about that restriction:

dd(**xi**, **Mu**) = (x1 – mu1′)(x1 – mu1″) + (x2 – mu2′)(x2 – mu2″) + … + (xn – mun’)(xn – mun”) = x1^2 + x1(-mu1′ – mu1″) + mu1’*mu1″ + x2^2 + x2(-mu2′ – mu2″) + mu2’*mu2″ + … + xn^2 + xn(-mun’ – mun”) + mun’*mun”

which gives the same matrix form of system of linear equations and its solution as before:

**X2** + **X** * **c** = **dd**

**c** = – (**X**T * **X**)-1 * **X**T * **X2**

where **X2** = ( x1^2 + x2^2 +…+xn^2; … ), **X** = ( x1 x2 … xn 1 1 … 1; …), **c** = (c11 c21 … cn1 c10 c20 … cn0)T

Again, having found the vector **c** of polynomial coefficients, we may calculate its (or actually their) roots **Mu**. Of course, the matrix **X** above is not full rank, which means we won’t get a unique analytical solution for the matrix equation above. We can always find a numeric one, optimizing SSDD or another parameter, but if we want a nice clean analytical solution, we should restrict generality once again. What we can do is set one of the cluster means at the coordinate origin; then all ci0 = mui’*mui” coefficients become nulls, and with **X** = ( x1 x2 … xn; …), **c** = (c11 c21 … cn1)T

**c** = – (**X**T * **X**)-1 * **X**T * **X2**

would be easily uniquely solvable.

**nD k-means disjunctive pointwise clustering**

If we wanted to generalize “disjoint distance” to nD, k-means case in the form similar to previous lower dimensional cases:

d(**x**, **Mu**)^2 = ((x1 – mu’1)*(x1 – mu”1)*…*(x1 – muk’1))^2 + ((x2 – mu’2)*(x2 – mu”2)*…*(x2 – muk’2))^2 + … + ((xn – mu’n)*(xn – mu”n)*…*(xn – muk’n))^2

we should come up with some meaning and purpose for that definition. We can try to find it in analogies to spaces more complex than a vector space – in the already mentioned inner product space, or in metric or topological spaces. Just as in a topological space we define sets of open (or closed) sets that basically define whether points of the space are neighbours or not, we can define a set **Mu** = {(**mu’**, **mu”**, … **muk’**) : **mui’** b.t. R^n} that contains the cluster mean vectors of the original nD space. That space, with the set **Mu** and the “disjoint distance” mapping function d(**x**, **Mu**) from it to R, we can name a “cluster space” Cn = (Rn, **Mu**, d), in which the set **Mu** and the distance function d define neighbourhoods, or clusters, of the data points.

Let’s see if such a “cluster space” definition produces any useful and meaningful results. Again, a generalized matrix equation for finding the set **Mu** using the “disjoint distance” defined above for a given dataset A b.t. Rn, under the restrictions of an orthogonal basis of Rn and one centroid set at the origin of Rn, will be: **Xk** + **X** * **c** = **dd**

where:

**Xk** = ( x1^k + x2^k +…+ xn^k; … )

**X** = (x1^(k-1) x2^(k-1) … xn^(k-1) … x1^(k-2) x2^(k-2) … xn^(k-2) … x1 x2 … xn; …)

**c** = (c1(k-1) c2(k-1) … cn(k-1) …c1(k-2) c2(k-2) … cn(k-2) … c11 c21 … cn1)T

**dd** = (d(**x**1, **Mu**)^2, …)T

and the solution minimizing the sum of squares of **dd**, SSDD = **dd**T * **dd**, will be, again:

**c** = – (**X**T * **X**)-1 * **X**T * **Xk**

from which we can find members of the set **Mu** – roots of the polynomials:

xi^k + ci(k-1)*xi^(k-1) + ci(k-2)*xi^(k-2) + … + ci1*xi = (xi – mu’i)*(xi – mu”i)*…*(xi – muk’i) = 0

where **mu’** = (mu’1 mu’2 … mu’i … mu’n)T

Let’s use our ECG dataset, take principal eigen-variables V5 and V6, and cluster them using “disjoint distance” and traditional “conjoint” k-means algorithms (Fig. 17.1)

setwd("/Users/stanselitskiy/R")
ekg <- read.table("foetal_ecg.dat")

#remove time
dim <- c("V2","V3","V4","V5","V6","V7","V8","V9")
ekg2 <- ekg[, dim]

#horizontal matrix PCA – ECG channels
pca_model <- ks_eigen_rotate_cov(ekg2)
ds <- pca_model$ds

#put time and original labels back
names(ds) <- c("V2","V3","V4","V5","V6","V7","V8","V9")
ds[,"V1"] <- ekg[,"V1"]

# n-dimension, k-means clustering
for(i in 3:7){
  c_model <- ks_kmeans_nd_means(ds, c("V5","V6"), i)
  ds <- ks_kmeans_nd_clusters(ds, c("V5","V6"), c_model$Mu)

  mn <- as.data.frame(c_model$Mu)
  names(mn) <- c("V5","V6")
  mn$Cls <- seq(1,nrow(mn))

  #print() is needed for ggplot to render inside a loop
  print(ggplot(ds, aes(x=V5, y=V6, colour=Cls))+
    geom_point(alpha=0.4, show.legend=F)+
    geom_point(data=mn, alpha=1, size=2, shape=1, stroke=4, show.legend=F)+
    geom_point(data=mn, alpha=1, size=2, color="white")+
    scale_colour_gradientn(colours=rainbow(4)))
}

**Appendix 17**

As usual, working files are in: https://github.com/NSelitskaya/kitchen-style-r

So far, when talking about linear methods, we have used the terms linear/vector space and dataset in that space quite interchangeably, which is not very rigorous but excusable when obvious from the context. In any case, we were talking about methods that work with the whole dataset – for example, using the covariance matrix of the whole dataset to find an eigenbasis that nullifies the covariance rotation moment, and using that basis as the principal coordinates.

However, that’s not necessarily the only way to do the analysis. We may be interested not (only) in the structure/homomorphic projections of the whole dataset, but in revealing the structure of a particular subset or subsets in the most continuous way. Therefore, we may use multiple principal eigenbases, rotated around covariance matrices of particular subsets, into which we may transform either the whole dataset or just the subsets of interest. Here we need to cluster our data around some structural features visible either in the very original space, or revealed in the space transformed with the principal eigenbasis. At these structural features we may then take a closer look, rotating the original basis around just the clustered subset of our interest.

Clustering is a technique based on quite a vague idea, and is considered an unsupervised one; still, we should carefully think through what we want to achieve to come up with a particular algorithmic implementation. Even the quite straightforward k-means clustering technique (which we will work with) may be implemented with various aims in mind and by various means. Usually, we define into how many cluster subsets we want to partition our dataset (or, as an advanced variant, extract a recommended number from the data), and then try to arrange the means of the cluster subsets according to some logic. We may use AND (conjunction) logic – we want the distances from any data point to all means to be minimal – i.e. we use addition of the distances from a point to the means. But we may also use OR (disjunction) logic – we may want to minimize the distance from any point to at least one mean – i.e. we use multiplication of the distances from a point to the means.

**1D 2-means disjunctive clustering**

Let’s explore the latter case in more detail, and for starters in the simplest 1D case with 2 means. For the i-th datapoint let’s define a “disjoint delta” measure to the means mu’ and mu”, or actually to a vector **mu**, as follows:

ddi = (xi – mu’)*(xi – mu”) = xi^2 + (-mu’ -mu”)*xi + mu’*mu”

for all N data points we may write a system of equations (where c1 = – (mu’+mu”), and c0 = mu’*mu”) as:

( x1^2 x1 1 ) ( 1 ) (dd1)

( x2^2 x2 1 ) ( c1 ) = (dd2)

…

( xn^2 xn 1 ) ( c0 ) (ddn)

or even

( x1^2 ) ( x1 1 ) ( c1 ) (dd1)

( x2^2 ) + ( x2 1 ) ( c0 ) = (dd2)

…

( xn^2 ) ( xn 1 ) (ddn)

which gives us quite familiar, by our previous linear regression derivations, equation structure in compact matrix form:

**X2** + **X** * **c** = **dd**

Then we may define a distance from an i-th point to a mean vector **mu** as:

d(xi, **mu**) = sqrt(ddi^2), or d^2(xi, **mu**) = ddi^2

then sum of squares of those “disjoint distances” to the mean vector from all points would be:

SSDD = **dd**T * **dd** = (**X2** + **X** * **c**)T(**X2** + **X** * **c**)

and then we may want to minimize it by taking its derivative with respect to **c**, and find for which **c** it takes the null value:

@SSDD/@**c** = 2**X**T(**X2** + **X** * **c**) = 0 (with the Hessian matrix @^2SSDD/@**c**@**c**T = **X**T * **X** being full-rank positive definite, it is indeed a minimum)

**X**T * **X2** = – **X**T * **X** * **c**

**c** = – (**X**T * **X**)-1 * **X**T * **X2**

and, remembering that c1 = -mu’ – c0/mu’

**mu** = (-c1 +- sqrt(c1^2 -4c0))/2

Then we can walk all data points and assign their clusters according to the shortest distance between them and the just-found cluster means mu’ and mu”.
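The whole 1D, 2-means pipeline above can be sketched numerically (a Python/numpy illustration with made-up data, not the book’s R helpers):

```python
import numpy as np

# 1D data drawn exactly at two unknown means, mu' = 1 and mu'' = 5
x = np.array([1.0] * 5 + [5.0] * 5)

# dd_i = (x_i - mu')(x_i - mu'') = x_i^2 + c1*x_i + c0, i.e. X2 + X*c = dd
X = np.column_stack([x, np.ones_like(x)])
X2 = x ** 2
c1, c0 = np.linalg.lstsq(X, -X2, rcond=None)[0]

# roots of x^2 + c1*x + c0 via the quadratic formula are the cluster means
mu = (-c1 + np.array([1.0, -1.0]) * np.sqrt(c1**2 - 4*c0)) / 2
print(np.sort(mu))   # ~ [1, 5]

# assign each point to the nearest of the recovered means
cls = np.argmin(np.abs(x[:, None] - mu[None, :]), axis=1)
```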

**1D k-means disjunctive clustering**

We can (almost) as easily extend the logic to any number k of components of the mean vector **mu**, while still staying in the 1D original data space.

With “disjoint delta” measure for i-th datapoint:

ddi = (xi – mu1)*(xi – mu2)*…*(xi – muk) = xi^k + c(k-1)*xi^(k-1) + … + c1*xi + c0

System of equations for all data points:

( x1^k )   ( x1^(k-1) … x1 1 ) ( c(k-1) )   (dd1)

( x2^k ) + ( x2^(k-1) … x2 1 ) ( c(k-2) ) = (dd2)

…

( xn^k )   ( xn^(k-1) … xn 1 ) ( c0 )       (ddn)

or

**Xk** + **X** * **c** = **dd**

SSDD = **dd**T * **dd** = (**Xk** + **X** * **c**)T(**Xk** + **X** * **c**)

@SSDD/@**c** = 2**X**T(**Xk** + **X** * **c**) = 0

**X**T * **Xk** = – **X**T * **X** * **c**

**c** = – (**X**T * **X**)-1 * **X**T * **Xk**

Here, having found the k polynomial coefficients (the k+1-th is 1), we have to find the roots of that polynomial (factor it), and those roots will be our means. Unlike in the 2-means case (or even 3 or 4), in the general (k>=5) case we’ll have to cheat a bit on our approach of working with simple analytical solutions and find the roots numerically; conveniently, R (and Python) already have such functions available, like *polyroot(c)*.
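A numeric sketch of this 1D, k-means case (Python/numpy with made-up data at three means; here np.roots plays the role of R’s polyroot()):

```python
import numpy as np

# 1D data sitting exactly at three unknown means {-2, 1, 4}; k = 3
x = np.repeat(np.array([-2.0, 1.0, 4.0]), 10)

# dd_i = x_i^3 + c2*x_i^2 + c1*x_i + c0, i.e. Xk + X*c = dd
X = np.column_stack([x**2, x, np.ones_like(x)])
Xk = x ** 3
c = np.linalg.lstsq(X, -Xk, rcond=None)[0]

# factor the fitted monic polynomial: its roots are the cluster means
mu = np.roots(np.concatenate([[1.0], c]))
print(np.sort(mu.real))   # ~ [-2, 1, 4]
```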

For our ECG example, let’s take the V6 channel in the principal eigenbasis and find, say, 7 clusters:

ekg <- read.table("foetal_ecg.dat")

#remove time
dim <- c("V2","V3","V4","V5","V6","V7","V8","V9")
ekg2 <- ekg[, dim]

pca_model <- ks_eigen_rotate_cov(ekg2)
ds <- pca_model$ds

#put time and original labels back
names(ds) <- c("V2","V3","V4","V5","V6","V7","V8","V9")
ds[,"V1"] <- ekg[,"V1"]

Mu <- ks_kmeans_1d_means(ds, "V6", 7)
ds <- ks_kmeans_1d_clusters(ds, "V6", Mu)

ggplot(ds, aes(x=V1))+
  geom_point(aes(y=V6, color=Cls, alpha=0.5))+
  scale_colour_gradientn(colours=rainbow(5))

Which is not so impressive, so we may want to look at multi-dimensional solutions…

**Appendix 16**

As usual, working files are in: https://github.com/NSelitskaya/kitchen-style-r

When rotating the basis around the covariance matrix of our new 8-point dataset in a 2500-dimension space/basis, we would reasonably expect also to see some subspaces with minimal/minuscule “explained variance”, or homomorphisms with a lot of dropped data structure – in other words, with only negligible aspects of the data structure preserved.

After actually doing that rotation we’ll see that the result greatly exceeds our expectations – it is only a 7D subspace of the eigenbasis that preserves the most significant data structure. Or, in terms of “compression”, we “replaced” the 8 x 2500 matrix of the previous “horizontal” data representation by an 8 x 7 matrix with quite a low loss or “noise” (Fig. 15.3). Of course, in reality we didn’t get those results out of thin air: we just put most of the data structure information of the latter case into the 2500 x 2500 eigen-matrix, while for the former it was an 8 x 8 rotation eigen-matrix. That may seem a useless exchange, but when dealing with data of similar structure we may train the model “off-line” (calculate the eigen-matrix), and then do “compression” or pattern recognition faster “on-line”.
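The 7D figure is no accident: with only 8 observations, the sample covariance of the 2500 “time channels” has rank at most 8 – 1 = 7 (one degree of freedom goes to centering), so at most 7 eigenvalues can be nonzero. A quick check (Python/numpy, with 100 dimensions standing in for 2500 to keep the demo fast; the rank argument is identical):

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in for the "vertical" ECG matrix: 8 observations in a 100-dim space
A = rng.normal(size=(8, 100))

# sample covariance of the 100 variables over 8 observations
C = np.cov(A.T)                  # 100 x 100, but rank-deficient
ev = np.linalg.eigvalsh(C)

# at most 8 - 1 = 7 eigenvalues are nonzero
print((ev > 1e-10).sum())        # 7
```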

“Noise” in this model is spherical (Fig. 15.3), but correlated, and, if we take a look at individual subspaces, they have quite regularized and varying structures (for example, Fig. 15.2.9).

If, similarly to the image recognition jargon, we could call the data in the eigenspace “eigen-waves” for the horizontal matrix representation, then for the “vertical” representation they could be called “eigen-coefficients” (Fig. 15.1). Though that terminology is quite relative – we just call long series of data points “waves”, and short ones “coefficients”. Inverse transformations of the 1D subspaces of the eigenspace help to envision what PCA (or linear models in general) does (Fig. 15.2.1-5). Each 1D dimension or subspace of the original space of ECG data is combined from the same eigen-waves (well known to us from previous chapters), but multiplied by different eigen-coefficients. Depending on which data matrix we calculate the eigenspace for – the “original” or the transposed one – they will appear, pairwise, either in the eigen-matrix (the matrix of the basis rotation, or eigenvectors) or in the matrix of our data in the eigenspace, though they’ll be specific to each covariance matrix we calculate an eigenbasis for.

Let’s call our data in the original space **A**, the eigen-matrix to it **T**, and the data in the eigenspace **B**, or for the transposed case – **A**T, **M** and **C** respectively. Then **T** * **A** = **B**, **M** * **A**T = **C**, and **M**-1 = **B**T * **T**-T * **C**-1, **T**-1 = **C**T * **M**-T * **B**-1.
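These two inverse identities follow by substitution (e.g. **B**T * **T**-T * **C**-1 = **A**T * **C**-1 = **M**-1), and can be checked numerically. A Python/numpy sketch with random square invertible stand-ins for **A**, **T**, **M** (in the chapter **A** is 8 x 2500, so the inverses there should be read with appropriate care; here everything is square for the demo):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5

# A: data in the original space; T, M: two basis-change matrices
# (random invertible matrices stand in for actual eigen-matrices)
A = rng.normal(size=(n, n))
T = rng.normal(size=(n, n))
M = rng.normal(size=(n, n))

B = T @ A        # data in T's eigenspace
C = M @ A.T      # transposed data in M's eigenspace

inv = np.linalg.inv
# M^-1 = B^T * T^-T * C^-1   and   T^-1 = C^T * M^-T * B^-1
print(np.allclose(inv(M), B.T @ inv(T).T @ inv(C)))   # True
print(np.allclose(inv(T), C.T @ inv(M).T @ inv(B)))   # True
```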

#vertical matrix PCA – 2500 time channels
ekg3 <- as.data.frame(t(ekg2))
pca_model3 <- ks_eigen_rotate_cov(ekg3)
dst <- pca_model3$ds

#invariants
ds <- dst[,1:8]
names(ds) <- c("V2","V3","V4","V5","V6","V7","V8","V9")
ds[,"V1"] <- c(1,2,3,4,5,6,7,8) #ekg[1:8,"V1"]

#"remove" extra subspaces
dst_1 <- dst
dst_1[,2:ncol(dst)] <- 0
dst_1o <- as.data.frame(as.matrix(dst_1) %*% pca_model3$An1)
ds <- as.data.frame(t(dst_1o))
ds[,"V1"] <- ekg[,"V1"]

dst_2 <- dst
dst_2[,1] <- 0
dst_2[,3:ncol(dst)] <- 0
dst_2o <- as.data.frame(as.matrix(dst_2) %*% pca_model3$An1)
ds <- as.data.frame(t(dst_2o))
ds[,"V1"] <- ekg[,"V1"]

etc…

**Appendix 15**

As usual, working files are in: https://github.com/NSelitskaya/kitchen-style-r

We can naturally, isomorphically (without any loss of structure), map “things” that have the real-number arithmetic properties – or a set (your dataset column or row) in the field of real numbers itself – onto a 1-dimensional vector space. However, mapping your multi-column/row dataset, which may contain data of different natures and measurement units, is not so natural. We may construct a group/vector space from different groups having nothing in common in their underlying sets and the operations defined over them via the external direct product; however, if we want it to be isomorphic to a group constructed by the internal direct product (a product of subspaces of a bigger space, i.e. having the same nature and structure), we want those subspaces to be commutative/Abelian (and disjoint).

(Gallian 353, 185, 196, 212)

So, indeed, for vector spaces the inner and outer direct products are the same (up to isomorphism), and we may use just the term “direct product”, and use the inner and outer direct product notation interchangeably (we’ll use the x “cross” notation for simplicity). By definition, the higher-dimension linear spaces are direct products of the lower-dimension ones. For example, the real 3D space R3 = R1 x R1 x R1; however, is it the same/isomorphic space as the “really real” spatial space, say S3, which we use to envision abstract nD spaces akin to? The space in which any “thing” could be represented by at least 3 other “telescopic things” we may combine together, even if we know nothing about (real) numbers and (Cartesian) coordinates?

Continuity of the direct and inverse mappings f:R3->S3 and f-1:S3->R3 can be verified naturally (we want to check that f(m*a+n*b) = m*f(a) + n*f(b) and vice versa, i.e. neighbours by the relations “*” and “+” remain neighbours after the transformations f and f-1); however, bijectivity of f, and in particular surjectivity (onto), does not follow naturally. Basically, what we want to know is whether for each set of coordinates in R3 there exists a point in our physical space S3, and whether for each point in S3 there exist coordinates for it. This is literally a choice you make by accepting the Axiom of Choice, which states that they do, that we don’t have “cavitation-like” holes in the R3 and S3 spaces, and that therefore we may treat S3 as R3.

By stating the isomorphism of the inner and outer direct products for vector spaces and the axiom of choice, we are basically saying that we may decompose (via the inverse of the inner direct product) a multi-dimensional vector space, in multiple ways, into disjoint (normal) subspaces (each created by projecting some of the original dimensions), the sum of whose dimensions is the same as in the original space, and then compose them back (via the outer direct product) as a product space of the resulting subspaces, up to isomorphism. That may be demonstrated directly, of course, via preservation of continuity based on the definitions of the addition and multiplication operations of vector and product vector spaces (except the multiplicity of the ways, which again requires the axiom of choice), but the former statement sounds more conscious and meaningful.

Also, the same could be said in another way: each projection of the original n-dimensional space into an m-dimensional one (m<n) is a homomorphism, i.e. a preservation of some aspects of the original space structure, while the direct product of disjoint homomorphisms produces the whole original space back, up to isomorphism.

One more, pretty obvious observation from all the above is that a basis rotation transformation (square matrix) of the direct sum of disjoint subspaces in the general case results in the sum of images of those subspaces, which are members of the whole original space. I.e. for **a** b.t., for example, A6 = A2 x A2′ x A2″, therefore **a** = **a1** + **a2** + **a3** (direct product and direct sum of subspaces are also isomorphic up to “relabeling”), where **a1** b.t. A2, **a2** b.t. A2′, **a3** b.t. A2″; then T(**a**) = T(**a1**) + T(**a2**) + T(**a3**) = **b1** + **b2** + **b3** = **b**, where **b**, **b1**, **b2**, **b3** b.t. B6, T:A6->B6, where T is of course an isomorphism.

As we can see, in our particular case of the 8-dimensional space of ECG readings of a pregnant woman, we transformed the space around its covariance matrix into the eigenbasis, and then decomposed it (for programmatic simplicity we just set kernel variables to null) into three disjoint/normal subspaces by homomorphisms: the mother’s-heartbeat aspect of the data structure, the child’s heartbeat + mother’s perspiration, and error/nonlinearity/noise. Rotating those subspaces back into the original 8-channel space via the inverse transformation, we got 3 separate images of those subspaces on those 8 ECG channels, and by adding them together we got, of course, the original dataset back.
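The decompose-and-recompose property is easy to verify in general. A Python/numpy sketch with random data standing in for the ECG matrix (the zeroing of complementary eigen-variables mirrors the R code below):

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in for the 8-channel ECG matrix: n samples x 8 channels
A = rng.normal(size=(200, 8))

# rotate into the eigenbasis of the covariance matrix
w, V = np.linalg.eigh(np.cov(A.T))
B = A @ V

# split into two disjoint subspace images by zeroing complementary variables
B1, B2 = B.copy(), B.copy()
B1[:, 4:] = 0          # keep the first 4 eigen-variables
B2[:, :4] = 0          # keep the remaining 4

# rotate each image back to the original channels; their sum restores A
A1, A2 = B1 @ V.T, B2 @ V.T
print(np.allclose(A1 + A2, A))   # True
```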

pca_model <- ks_eigen_rotate_cov(ekg2)
ds <- pca_model$ds

#error/nonlinearity/noise
ds_n <- ds
ds_n$V2 <- 0
ds_n$V3 <- 0
ds_n$V4 <- 0
ds_n$V5 <- 0
ds_n$V6 <- 0
ds_n$V7 <- 0
ds_n <- as.data.frame(as.matrix(ds_n) %*% pca_model$An1)

#mother
ds_m <- ds
ds_m$V5 <- 0
ds_m$V6 <- 0
ds_m$V7 <- 0
ds_m$V8 <- 0
ds_m$V9 <- 0
ds_m <- as.data.frame(as.matrix(ds_m) %*% pca_model$An1)

#child
ds_c <- ds
ds_c$V2 <- 0
ds_c$V3 <- 0
ds_c$V4 <- 0
ds_c$V8 <- 0
ds_c$V9 <- 0
ds_c <- as.data.frame(as.matrix(ds_c) %*% pca_model$An1)

#all
ds <- ds_c + ds_m + ds_n

…

**Appendix 14.1**

As usual, working files are in: https://github.com/NSelitskaya/kitchen-style-r

As we mentioned in previous chapters, by performing PCA we implicitly find the least lossy linear models of the hidden variables that describe our data, and use them as an eigenbasis for a “linear models subspace”, while the rest of the eigenvectors compose an orthogonal “error subspace”.

However, had we known nothing about the hidden variables, we would not know which eigenbasis vectors are part of the “linear models subspace” and which of the “error subspace”. Still, having some guesses about the distributions of the linear models’ hidden experimental parameters, about whether and how non-linear the processes are, or about the noise – which comes from numerous minor sources we don’t know and don’t care about, and therefore (by the central limit theorem) should be Gaussian – we may try to identify which eigenbasis vectors carry projections of either the noise or/and the non-linearity of the linear models, as well as how many distinct models describe the process. Then we may want to either drop some dimensions, or set the data in them to 0 or another baseline, and sometimes rotate back into the original basis, muting either the noise or, in the opposite direction, the dominant models, allowing the weaker, less obvious models to stand out.

For example, in the previous chapter we took a dataset of 8-channel ECG of a pregnant woman, where on all the channels the mother’s heartbeat was dominating over noises of various degrees (Fig. 13.1), and rotated the basis of the data over the covariance matrix. That resulted in three dimensions clearly (by shape and Gamma-type pdf) associated with the mother’s heartbeat (V2 (dark blue) is the strongest and clearest variable), two dimensions combining data of the child’s heartbeat with some noise and perspiration (V6 (dark green)), and three more dimensions of noises and non-linearities of various sorts (Fig. 13.2).

If we set the dominant mother’s heartbeat in the first three variables to the base level and rotate the basis back to the original 8-channel ECG dimensions, we’ll be able to clearly see the child’s heartbeat in one dimension (V4 (dark blue)), and the mother’s perspiration in another (V6 (dark green)), on noise backgrounds (Fig. 13.3).

ekg <- read.table("foetal_ecg.dat")
dim <- c("V2","V3","V4","V5","V6","V7","V8","V9")
ekg2 <- ekg[, dim]

pca_model <- ks_eigen_rotate_cov(ekg2)
ds <- pca_model$ds

#remove mother's heartbeat and one noise channel
names(ds) <- c("V2","V3","V4","V5","V6","V7","V8","V9")
ds$V2 <- mean(ds$V2)
ds$V3 <- mean(ds$V3)
ds$V4 <- mean(ds$V4)
ds$V7 <- mean(ds$V7)

#rotate basis back
ds <- as.data.frame(as.matrix(ds) %*% pca_model$An1)

#put time back
names(ds) <- c("V2","V3","V4","V5","V6","V7","V8","V9")
ds[,"V1"] <- ekg[,"V1"]

Of course, we can always go with the standard maximal (minimal) “explained” variance criterion to choose the eigenbasis vectors we are interested in (either to preserve or to eliminate), but that is rather an “accidental” criterion. A more fundamental one is that covariance can be seen as a rotating moment applied to a pair of axes of our candidate basis, whose magnitude is defined by the residues (along these two axes) of the dataset relative to the dataset means, where residues along one axis are interpreted as “forces”, and along the other as “lever lengths” (or vice versa).

If we rotate the axes around the mean in the direction of the generated moment (covariance), there will be an equilibrium position in which the moment (covariance) becomes null, and “coincidentally” in that position one axis will be parallel to the regression line, and the other to the regression residue projection line, which is orthogonal to it. When we have dimensions with the same measurement scale (as in this example) it will be obvious which one is which; otherwise multidimensional scaling may help with that decision.

When we make all pairwise covariances in the covariance matrix equal to null, and leave only the variances on the diagonal, that means we have found an eigenbasis for the covariance transformation matrix, where the variances are the eigenvalues for the found eigenbasis vectors. It also means that all possible regression lines/planes/hyperplanes will preserve the maximal possible structure of the original dataset (because the projection vectors will be orthogonal to the regression hyperplanes), so the regression transformations will be “conditionally” continuous (neighbours in the domain and range spaces remain neighbours after forward or inverse transformations) with the smallest possible (over the various axis orientations) granularity of epsilon that still preserves continuity. That is where the maximal variance criterion comes from – the regression/projection transformations with the smallest continuity loss give the largest spreads/variances in the range space.
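The “nullified rotation moment” can be checked numerically. A Python/numpy sketch with a made-up correlated 2D cloud: in the covariance eigenbasis the off-diagonal covariances vanish, and the diagonal holds the eigenvalues/variances.

```python
import numpy as np

rng = np.random.default_rng(0)

# correlated 2D data: a regression-like cloud y ~ 0.5*x + noise
x = rng.normal(size=500)
X = np.column_stack([x, 0.5 * x + rng.normal(0.0, 0.3, size=500)])

# rotate into the eigenbasis of the covariance matrix
w, V = np.linalg.eigh(np.cov(X.T))
Y = X @ V

# the rotation moment is gone: covariance is diagonal, with eigenvalues on it
C = np.cov(Y.T)
print(np.round(C, 10))   # off-diagonals ~0; diagonal equals w
```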

Though, again, if there is some structure in the original dataset, it’s not necessarily the components of the structure that are most hom(e)omorphically transformed onto the eigen-subspaces (and therefore have the largest variances) that we are interested in – as in the example above (i.e. the mother’s heartbeat). In a way, looking at the dimensions with the maximal “explained” variances is like looking for your keys under the streetlight (as in that anecdote). Yes, it’s dark outside the light circle, but it’s dark also because the streetlight is blinding you. If you manage to polarize its light and use orthogonally polarized glasses, you may see your keys in the “dark” – or, rather, illuminated by the Moon’s or stars’ light. You may not see all the small details on the keys that you could have seen under the streetlight illumination had they been there, but they are not – they are under the dim starlight.

…

**Appendix 13.1**

As usual, working files are in: https://github.com/NSelitskaya/kitchen-style-r

Before going into more details of more sophisticated multidimensional scaling techniques, and criteria for selecting principal components/factors, let’s take a look at data that have the same measurement units in all dimensions, and for which it is relatively easy to guess what criteria we may use to distinguish the real-life principal components. The dataset we’ll be using (https://homes.esat.kuleuven.be/~smc/daisy/daisydata.html, De Moor B.L.R. (ed.), DaISy: Database for the Identification of Systems, Department of Electrical Engineering, ESAT/STADIUS, KU Leuven, Belgium) is borrowed from (A.J. Izenman 2013). These are the 8-channel ECG readings of a pregnant woman. What we may want to do with these data is to separate those mixed “crazy statistical” variables we were talking about before into proper analytical, real-life tangible variables, like readings of the mother’s and child’s heartbeats, perspiration rhythms, etc.

This is a typical “cocktail party problem”, in which we want to separate the voices of the party participants out of recordings made by differently placed microphones in the party room while all the invited people babble simultaneously. Of course, more serious applications of the problem may include submarine or drone detection out of sonar or radar readings, electric grid intrusions, sources of rocket engine explosions out of sensor data, etc. – you name it.

ekg <- read.table("foetal_ecg.dat")

dim <- c("V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9")

ekg2 <- ekg[, dim]

ds <- ks_eigen_rotate_cov(ekg2)

names(ds) <- c("V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9")

ds[, "V1"] <- ekg[, "V1"]

describe(ds)

describe(ds)

|    | vars | n    | mean | sd         | median | trimmed | mad   | min      | max    | range   | skew      | kurtosis  |
|----|------|------|------|------------|--------|---------|-------|----------|--------|---------|-----------|-----------|
| V2 | 1    | 2500 | 1.57 | **215.17** | 17.69  | 19.92   | 53.02 | -1366.55 | 474.04 | 1840.59 | **-3.69** | **18.36** |
| V3 | 2    | 2500 | 0.03 | 44.47      | -3.04  | -2.88   | 24.02 | -172.36  | 286.30 | 458.66  | 2.51      | 13.30     |
| V4 | 3    | 2500 | 0.07 | 19.66      | -0.85  | -0.60   | 14.62 | -104.64  | 147.41 | 252.05  | 0.74      | 6.58      |
| V5 | 4    | 2500 | 0.18 | 6.13       | 0.46   | 0.48    | 4.60  | -34.43   | 26.13  | 60.57   | -0.65     | 2.75      |
| V6 | 5    | 2500 | 0.23 | 5.36       | 0.86   | 0.28    | 5.04  | -25.50   | 21.63  | 47.13   | -0.04     | 1.04      |
| V7 | 6    | 2500 | 0.10 | 3.32       | 0.01   | -0.02   | 3.03  | -11.75   | 16.15  | 27.90   | 0.53      | 1.44      |
| V8 | 7    | 2500 | 0.04 | 2.23       | 0.06   | 0.06    | 2.11  | -8.35    | 7.98   | 16.33   | -0.12     | 0.39      |
| V9 | 8    | 2500 | 0.02 | 2.01       | 0.07   | 0.04    | 2.03  | -6.83    | 7.17   | 14.01   | -0.11     | -0.03     |

As one can easily see, even the simple eigenbasis rotation around the covariance matrix transformation allows us to clearly see the mother’s heartbeat (14 beats) (dark blue V2, also V3 and V4 with additional signals), the child’s heartbeat (22 beats) and perspiration rhythm (dark green V6), and the noise V9, with V5, V7, V8 as intermediate mixes of these three signals in various proportions.

While the formal parametric descriptive statistics may still be hard to use as guiding indicators (except the obvious V2 with really high variance, skewness, and fat tails), the non-Gaussian shape of the distributions hints more decisively at which variables may not be noise and may be of interest to us.
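A toy numpy sketch of the same separation idea (synthetic square-wave “heartbeats” with the beat counts from above, not the actual DaISy recordings; assuming `ks_eigen_rotate_cov` simply rotates the data into the covariance eigenbasis): after the eigen rotation, the dominant-variance component lines up with the strong source.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 10, 2000)

# Two square-wave "heartbeats" of very different power: a strong slow
# one (mother, 14 cycles) and a weak faster one (child, 22 cycles).
mother = 50.0 * np.sign(np.sin(2 * np.pi * 1.4 * t))
child = 5.0 * np.sign(np.sin(2 * np.pi * 2.2 * t))
sources = np.column_stack([mother, child])

# Mix them into two "electrode" channels with a made-up mixing matrix.
A = np.array([[0.7, 0.6],
              [0.5, -0.8]])
X = sources @ A.T + 0.1 * rng.normal(size=(len(t), 2))

# Rotate the mixed channels into the covariance eigenbasis.
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
Z = (X - X.mean(axis=0)) @ eigvecs

# The dominant-variance component lines up with the strong source.
dominant = Z[:, np.argmax(eigvals)]
print(abs(np.corrcoef(dominant, mother)[0, 1]) > 0.95)  # True
```

With comparable source powers this simple rotation would no longer unmix the signals, which is exactly why the distribution-shape criteria discussed above matter.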

…

**Appendix 12.1**

As usual, working files are in: https://github.com/NSelitskaya/kitchen-style-r

The Covariance Family metrics (including the coefficient of correlation and r squared) may be interpreted in various ways, which indicates that they may be poorly understood and, hence, inappropriately used. For example, the usual interpretation of covariance/correlation as a measure of the variables’ dependency may be quite confusing: on the one hand we may want to reduce it, to deal with independent variables; on the other, we do want to find that dependency. Of course, we may say we want independent explanatory variables and a dependent response, but that division is quite subjective, and therefore that dependency interpretation of covariance/correlation is, at least, subjective, too.

Other interpretations, such as the relation to the regression model, or paying attention to their nature as moment functions, may be more useful. We may envision the covariance family (as well as, for comparison, rss/sse or variance/std. deviation) mechanistically. For example, rss (residual sum of squares) could be envisioned as little springs/rubber bands attached on one side to the free-floating regression bar, and on the other – to the data points. The equilibrium of such a configuration will give us the regression bar position we seek – the one with minimal “internal tension” of the model. Using the same pattern, we may envision the same little springs/rubber bands attached to the data points at one end, while the other end is attached to a coordinate line with a rotation point fixed at the mean point, thus creating a rotation moment.

It’s quite interesting what the equilibrium point for this system will be. If we let our coordinate system go, it will rotate to the eigenbasis coordinate system (relative to our covariance matrix, of course) – such a system that the sum of all rotation moments from the individual data points will be *0*. I.e. intuitively we can see (we’ll demonstrate it more rigorously later) that all the springs/bands will be in a minimally stressed state, which means that one of our new axes will be parallel to the regression line (just with a different baseline of the “minimal tension”, which is defined by the mean rotation point constraint), while another (the 2D case is easier to imagine intuitively) will be orthogonal (for an orthogonal basis, which is usually the case with the bases of abstract statistical spaces).

**When Covariance Analysis may be useful**

When we model real-life processes we may encounter situations, especially in “soft” sciences like psychology or medicine, or pseudosciences like sociology or economics, when we are able to measure some parameters, but suspect that there exist more primal “hidden” or “latent” parameters which we can’t measure directly, such as Intelligence Level, Health Rank, Social Stability, or Economic “Health”. Of course, in hard sciences we measure many parameters indirectly; for example, the already mentioned forces, accelerations, or masses are usually measured via deformations (of a spring) or stresses (of a piezo crystal) with the following electromagnetic field generation, which, in its turn, can be measured via charged particle displacement/rotation (of an inductor), etc. But in hard sciences we usually have analytical models developed that can be rigorously verified and reproduced, and we can get reliable indirect inferences of one parameter via direct measurement of others, so such inference can be seen as a mere inconvenience (or even a convenience when a (more) direct measurement is possible but more complicated/costly).

If we don’t have a proven analytical model, which is usual for “soft” sciences, or in the beginning of research in a hard science, or if our culture/civilization (of statisticians and data scientists :)) has not yet developed a tradition of advanced analytical models, even the choice of the observable parameters may be incomplete or redundant. For instance, in our previous example of the “crazy statistical” linear modelling of the 2nd Newton’s law of motion (applied to throwing books out of a window), **b** + **c** = **d** + **e**, we created the model based on *1..i-th..n* observations of the **b**i, **c**i, **d**i, **e**i parameters made with respect to an abstract statistical (orthogonal) basis **i**, **j**, **k**, **l** of a statistical space A, where **b**i = bi**i**, **c**i = ci**j**, etc. Let’s imagine these parameters were measured by weird instruments (for example, by cameras with some video image processing) made by those who had no physics insight into what’s going on.

If we take a 4-dimensional space, a linear model of the type **b** + **c** = **d** + **e**, or b**i** + c**j** – d**k** – e**l** = **0** (or the typical generic form **y** = b1**x1** + b2**x2** + … + b0), would mean that out of all possible members of the space we select only those that satisfy the above equations. That actually means that for the selected data points the 4-dimensional basis is excessive, because linear model equations are also conditions for the linear dependence of the (basis) vectors. This means linear models describe subspaces of, at least, 1 less dimensionality than the original space. Which subspaces we select to project our original data onto is a separate question. We may use methods of the least deltas (residues) between our data and their projections onto a candidate subspace, or least angles between the original data hull and the subspace hyperplane, or various weighted methods; but for understanding PCA we want to work with the least-RSS algorithm, in which residues are measured in the direction orthogonal to the subspace hyperplane (say, the **b** + **c** = **d** + **e** model/subspace/hyperplane was chosen using that method).

Had we had some insight into which hidden physics parameters are involved in the 2nd Newton’s law of motion (and what it is), we could come up with mappings of our “crazy statistical” variables to the following hidden physics parameters: **b** = **F** – (m/2)**w** = F**v** – (m/2)**w**; **c** = 2**a** = 2a**u**; **d** = m**a** = ma**u**; **e** = –(m/2)**w** + 2**a** = –(m/2)**w** + 2a**u** (1), relative to some spatial basis, say **u**, **v**, **w** of a physical space B, where **u** is a basis vector of Lagrangian type attached to a book and collinear with the acceleration of the book, **v** – with the force applied to it (say, we don’t know that **u** and **v** are linearly dependent), and **w** is, say, an orientation of our mass-measuring instrument (yeah, imagine we measure a mass vector, but we are “crazy statisticians”, aren’t we?).

Knowing these hidden physics variables are related via some law of motion (which we pretend we don’t know yet), we would perhaps have wanted to shrink our initial abstract statistical basis **i**, **j**, **k**, **l** of the A-space, and rotate it to a smaller latent statistical basis **f**, **a** of an abstract statistical space C which is more conveniently mapped to the basis of hidden physics variables **u**, **v**, **w** of the B-space. Here, let’s call the unknown physics variables “hidden”, and the corresponding unknown abstract statistical variables “latent” (of course, these terms are interchangeable in the literature). Say, we want a mapping g: B->C given by a simple diagonal unit matrix, such that **f** = 1·**v**, **a** = 1·**u**, and 0·**w**. That may seem like a redundant transformation, but we still want to keep the B and C spaces distinct, for example because the abstract statistical basis is orthogonal, but the spatial **u**, **v**, **w** is not necessarily, and maybe not really a basis (linearly independent) after all.

Knowing that g∘f = h, where f: A->B and h: A->C, i.e. f(**b**+**c**–**e**) = **F** = F**v**, g(**F**) = F**f**; f(**d**) = ma**u**, g(ma**u**) = ma**a**; and that the mappings are linear (f(**a**)+f(**b**) = f(**a**+**b**)), we would expect that h(**b**+**c**–**d**–**e**) = H·(**b**+**c**–**d**–**e**) = F**f** – ma**a** = **0**.

If we first transform our dataset of **b**i, .. **e**i tuples into the C space, where they become **f**i, **a**i pairs, and then model them by a linear model that minimizes the squared sums of the **epsilon** orthogonal to the regression line, we’ll get the same model (as long as all transformations are homomorphic, or, in vector spaces, linear, which they are): **F** = C·**a**. Although we may think that C = m stands for the mass of the book in our predictive model, it won’t be a parameter of the linear model, but rather its coefficient for the parameter a. If we use the standard RSS-minimization regression, it will be an averaged mass ma that allows our model to go through the mean values of F and a of our dataset: ma = muF/mua (see below).

The regression line we just found would be parallel to one of the eigenbasis vectors (while the epsilon-line would be parallel to another eigenvector); therefore rotating our basis **f**, **a** to the one spanning the regression and epsilon lines, via, say, a transformation e (represented by a matrix E), would be a transformation to the eigenbasis **v1**, **v2**, if we also shift the origin of our coordinates. Therefore, again, because of the linearity of transformations, our regression would be e(**F** – m·**a**) = E·(**F** – m·**a**) = E·(**F**) – E(m·**a**), **reg** = (e11*F – e12*m*a)·**v1**, **eps** = (e21*F – e22*m*a)·**v2** = **0** (therefore e21 = e22 = 0); and, because the regression in eigen coordinates would be var1 = b1*var2 + b0, where b1 = 0, then e11*F – e12*m*a – C’1 = 0.

On the other hand, we could have rotated the basis of the A space, in which our data are represented by **b**n .. **e**n tuples, into the eigenbasis of a D space, in which the n-th tuple would be **var1**n, **var2**n, … **var4**n, where **var1**n = (a11*bn + a12*cn + a13*dn + a14*en)·**v1**, etc., or

(a11 a12 a13 a14) (b) **i**     (var1) **v1**

(a21 a22 a23 a24) (c) **j**  =  (var2) **v2**

(a31 a32 a33 a34) (d) **k**     (var3) **v3**

(a41 a42 a43 a44) (e) **l**     (var4) **v4**

Now, again, we could have tried to find a linear model of our data in the eigenspace D with the eigenbasis **v1**, **v2**, **v3**, **v4**. But we know that all regression lines/planes/hyperplanes in the eigenspace will be parallel to some eigenvectors (and orthogonal to others), so one of the regressions vari = b1*varj + b0, where b1 = 0, and therefore vari – C = 0, will be the same regression we found above in the C space with the **f**, **a** basis, also converted into a subspace of the eigenspace D; so we can write ai1*b + ai2*c + ai3*d + ai4*e = reg = e11*F – e12*m*a – C”1.

So, by performing PCA, i.e. finding the eigenbasis relative to the covariance matrix, we implicitly find the least lossy linear models of the hidden variables that describe our data, and use them as an eigenbasis for a “linear model subspace”, while the rest of the eigenvectors compose an orthogonal “error subspace”, whose direct sum with the former produces the whole space our data reside in. In the case above we have only one set of hidden parameters that are part of one linear model equation, therefore we have only a one-dimensional eigenbasis of the “linear model subspace”.

Had we known nothing about the hidden variables, we would rotate the basis **i**, **j**, **k**, **l** around the covariance matrix of our dataset into the same eigenbasis **e1** of the “linear model subspace”, and would get additional **e3**, **e4** eigenvectors on top of the already known **e2** eigenbasis vector of the error subspace. However, we would not know which eigenbasis vector is part of the “linear model subspace”, and which of the “error subspace”. Still, we may have some guesses about the distributions of the linear experimental parameters (for example, that we throw our books out of the window with a steadily increasing force, expecting a uniform distribution of the data in the linear model dimension); of the non-linear parameters (for example, we know that there are standard book sizes and weights, around which the modes of the book weight distribution would congregate); and of the noise, which comes from numerous minor sources we don’t know and don’t care about, and is therefore (according to the central limit theorem) Gaussian. The latter dimensions we, perhaps, want to either drop, or set the data in them to 0 or another baseline; the data in dimensions with mixed Gaussian and non-linear error we want to “de-Gaussianize”, and sometimes rotate back into the original basis.

**More rigorous on Regression and Covariance connections**

A matrix can always be seen as a particular transformation of an object from one space (with one particular basis) into another (with another basis). In particular, the covariance matrix S can be seen as a transformation, defined by our dataset, of, for example, the unit vector **x** of the basis **ijk**.. our dataset is expressed in, into an object **y** of another space – a space where the basis vectors **uvw**… are defined via sums of the **ij**, **jk**, **ki**, etc. covariances of our dataset:

**y** = **Sx**, s.t.:

(*sii^2 sij sik …*) (*x1*) **i**     (*sii^2 * x1 + sij * x2 + sik * x3 + …*) **u**

(*sji sjj^2 sjk …*) (*x2*) **j**  =  (*sji * x1 + sjj^2 * x2 + sjk * x3 + …*) **v**

(*ski skj skk^2 …*) (*x3*) **k**     (*ski * x1 + skj * x2 + skk^2 * x3 + …*) **w**

However, for a given matrix-transformation **S** we may find such a basis **i’j’k’** that the S-transformation of each of its vectors **x** is equal to a scalar (lambda) multiplication, and, hence, the off-diagonal covariances in such a basis will be null:

**y** = *lambda* * **x** (each eigenbasis vector with its own *lambda*), s.t.:

(*lambda1 0 0 …*) (*x1′*) **i’**     (*lambda1 * x1′*) **u’**

(*0 lambda2 0 …*) (*x2′*) **j’**  =  (*lambda2 * x2′*) **v’**

(*0 0 lambda3 …*) (*x3′*) **k’**     (*lambda3 * x3′*) **w’**

Actually, it’s not the space with the basis **u’v’w’** we are interested in, but the basis **i’j’k’**, in which the off-diagonal pairwise covariances are equal to 0, or the basis in which the (synthetic, or latent) variables are independent, or, as we noted above, in which the regression line(s) are parallel to the basis vectors. Though, it’s not every regression, but quite particular (least-RSS) ones. Let’s see for which regression *bij = sij / sii^2*, or, in the more usual 2D regression notation, *b1 = cov(x, y) / var(x)*. Let’s express the residual sum of squares as:

*RSS = Sum i=1..n (e^2) = Sum (y – b1*x – b0)^2*

*@RSS/@b1 = -2*Sum(x*y – b1*x^2 – b0*x)*, which we want to set to 0 to find RSS’s minimum:

*Sum(x*y)/n – b1*Sum(x^2)/n – b0*Sum(x)/n= 0*

if *cov(x, y) = E((x – E(x))(y – E(y))) = E(x*y) – E(x)*E(y) – E(x)*E(y) + E(x)*E(y) = E(x*y) – E(x)*E(y) = Sum (x*y)/n – mux*muy*

and *var(x) = E(x*x) – E(x)*E(x) = Sum (x*x)/n – mux^2*

then:

*cov(x, y) + mux*muy – b1*var(x) – b1*mux^2 – b0*mux = 0*

i.e. *b1 = cov(x, y)/var(x)* (hence *b1 = 0* when *cov(x, y) = 0*) if *muy = b1*mux + b0* or *mux = 0*,

which, with the *b0* inference:

*@RSS/@b0 = -2*Sum(y – b1*x – b0)/n = 0 = muy – b1*mux – b0*,

is exactly the case.
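The identity just derived, b1 = cov(x, y)/var(x) with the line passing through the means, can be sanity-checked against an off-the-shelf least-squares fit (a numpy sketch with made-up numbers):

```python
import numpy as np

# Made-up linear data with noise.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 500)
y = 3.0 * x + 1.0 + rng.normal(0, 2, 500)

# Slope/intercept from a standard least-squares fit...
b1_ols, b0_ols = np.polyfit(x, y, 1)

# ...and from the covariance form derived above (population moments,
# ddof=0, matching the Sum(...)/n expressions in the derivation).
b1_cov = np.cov(x, y, ddof=0)[0, 1] / np.var(x)
b0_cov = y.mean() - b1_cov * x.mean()

print(np.isclose(b1_ols, b1_cov), np.isclose(b0_ols, b0_cov))  # True True
```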

If we generalize the derivations above to multivariable (but still univariate) regression *yr = b1*x1 + b2*x2 + … + b0*, and express the dataset’s y as xn with a bn coefficient (which we always set to -1 to end up with the “traditional” regression notation), then:

*RSS = Sum i=1..n (exn^2) = Sum (bn*xn + b1*x1 + b2*x2 + … + b0)^2*

*cov(xi, xj) = Sum(xi*xj)/n – muxi*muxj*

*var(xi) = Sum(xi^2)/n – muxi^2*

partial derivatives of RSS would be:

*@RSS/@b1 = -2*Sum(x1*(bn*xn + b1*x1 + b2*x2 + … + b0)) = 0 = bn*Sum(x1*xn)/n + b1*Sum(x1^2)/n + b2*Sum(x1*x2)/n + … + b0*Sum(x1)/n *

*@RSS/@b2 = -2*Sum(x2*(bn*xn + b1*x1 + b2*x2 + … + b0)) = 0 = bn*Sum(x2*xn)/n + b1*Sum(x2*x1)/n + b2*Sum(x2^2)/n + … + b0*Sum(x2)/n*

…

*@RSS/@bn = -2*Sum(xn*(bn*xn + b1*x1 + b2*x2 + … + b0)) = 0 = bn*Sum(xn^2)/n + b1*Sum(xn*x1)/n + b2*Sum(xn*x2)/n + … + b0*Sum(xn)/n*

*@RSS/@b0 = -2*Sum(bn*xn + b1*x1 + b2*x2 + … + b0) = 0 = bn*Sum(xn)/n + b1*Sum(x1)/n + b2*Sum(x2)/n + … + b0*

or

*b1*var(x1) + b2*cov(x1,x2) + … + bn*cov(x1,xn) + mux1*(b1*mux1 + b2*mux2 + … + bn*muxn + b0) = 0*

*b1*cov(x2,x1) + b2*var(x2) + … + bn*cov(x2,xn) + mux2*(b1*mux1 + b2*mux2 + … + bn*muxn + b0) = 0*

*…*

*b1*cov(xn,x1) + b2*cov(xn,x2) + … + bn*var(xn) + muxn*(b1*mux1 + b2*mux2 + … + bn*muxn + b0) = 0*

*b1*mux1 + b2*mux2 + … + bn*muxn + b0 = 0*

then matrix form will be:

cov(**x**, **x**) * **b** = – **mux** * (**mux**T * **b** + *b0*) = **0**

where **x** = (*x1, x2, … xn*); **b** = (*b1, b2, … bn*); **mux** = (*mux1, mux2, … muxn*)

cov(**x**, **x**) * **b** = **0** means that for the covariance matrix cov(**x**, **x**) = **S** in the eigenbasis **i’j’k’**… the only non-zero members are the variances, and each *bi*var(xi) = 0* means *bi* must be null, i.e. the least-RSS regression (**S** * **b** = **0**, or, multivariate, **S** * **B** = **0**, where **B** is an NxM matrix of coefficients for M simultaneous regressions over the N dimensions) lines/planes are parallel to the eigenbasis vectors remaining after the regression (as we can see in Fig 11.1). This means that the least-RSS regression in the eigenbasis may be done just by dropping dimensions of our choice.
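A quick numpy check of the “regression by dropping dimensions” claim (synthetic 4-D data, made-up mixing, not a chapter dataset): in the covariance eigenbasis every pairwise least-RSS slope cov/var is null.

```python
import numpy as np

# Synthetic correlated 4-D dataset.
rng = np.random.default_rng(3)
L = rng.normal(size=(4, 4))
X = rng.normal(size=(2000, 4)) @ L.T

# Rotate into the covariance eigenbasis.
eigvals, V = np.linalg.eigh(np.cov(X, rowvar=False))
Z = (X - X.mean(axis=0)) @ V

# Every pairwise covariance is null, so every least-RSS slope
# cov(zi, zj)/var(zi) is null too: regressing in the eigenbasis
# amounts to just dropping (collapsing) dimensions.
slopes = [np.cov(Z[:, i], Z[:, j], ddof=0)[0, 1] / np.var(Z[:, i])
          for i in range(4) for j in range(4) if i != j]
print(np.allclose(slopes, 0.0))  # True
```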

The orthogonality of the eigenbasis of a symmetric matrix (which a covariance matrix is) comes from here:

**A** = **A**T – for a symmetric matrix

(**Av**)T = **v**T**A**T, which is obvious from the example:

( (a11 a12) (b1) )T = (b1*a11+b2*a12, b1*a21+b2*a22)

( (a21 a22) (b2) )

(b1, b2) (a11 a21) = (b1*a11+b2*a12, b1*a21+b2*a22)

. . . . . (a12 a22)

**Av1** = *lambda1***v1**, which is the eigenvector definition; we multiply it by another, distinct (with a different *lambda2*) transposed eigenvector **v2**:

*lambda1***v2**T**v1** = **v2**T*lambda1***v1** <= **v2**T**A** **v1** => **v2**T**A**T**v1** = (**Av2**)T**v1** = (*lambda2***v2**)T**v1** = *lambda2***v2**T**v1**, or

*lambda1***v2**T**v1** – *lambda2***v2**T**v1** = 0, or (*lambda1 – lambda2*)**v2**T**v1** = 0
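Since (*lambda1 – lambda2*) is non-zero for distinct eigenvalues, **v2**T**v1** must be 0. A short numpy check on an arbitrary symmetric matrix:

```python
import numpy as np

# Any symmetric matrix will do; a covariance matrix is one.
rng = np.random.default_rng(4)
B = rng.normal(size=(5, 5))
A = (B + B.T) / 2

lam, V = np.linalg.eigh(A)  # eigh is the symmetric-matrix routine

# A*vi = lambda_i*vi holds, and the eigenvectors are mutually orthogonal.
print(np.allclose(A @ V, V * lam))      # True
print(np.allclose(V.T @ V, np.eye(5)))  # True
```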

**Criteria for dimension reduction (RSS/Variance)**

However, now, the question we want to answer is which dimensions we may want to collapse, or, in the extreme projection to just one dimension, which one it would be. As we discussed before, RSS is a measure of the original dataset’s Structure preservation (or, rather, non-preservation: the level of granularity at which our transformation (regression) becomes non-homeomorphic). In our case of the least-RSS regression in the eigenbasis, the variances and RSSs will be the same. Therefore, for starters, it would be reasonable to collapse those dimensions whose variances are smallest – i.e., when doing the regression, we lose a minimal amount of the original dataset’s Structure. However, that approach depends on the variables’ measurement scale and the use of normalization, and we may not be after the aspect of the dataset Structure which gives the highest variance. Because some aspects of the data Structure may be inherently random, we may be more interested in those synthetic variables whose distribution is not Gaussian (assuming that multi-source randomness comes in a Gaussian form), using either parametric metrics (like skewness or kurtosis), or non-parametric methods like Kolmogorov-Smirnov tests, etc.
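A numpy sketch of the parametric non-Gaussianity metrics just mentioned (plain moment estimators, not the bias-corrected versions R’s describe() reports): excess kurtosis stays near 0 for Gaussian noise and is clearly positive for a heavy-tailed signal.

```python
import numpy as np

def skewness(x):
    # Third standardized moment (plain, not bias-corrected).
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

def excess_kurtosis(x):
    # Fourth standardized moment minus 3 (null for a Gaussian).
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)

rng = np.random.default_rng(5)
gaussian_noise = rng.normal(size=20000)  # candidate "error subspace"
heavy_tailed = rng.laplace(size=20000)   # candidate variable of interest

# Excess kurtosis is near 0 for the Gaussian and clearly positive
# (around 3 for a Laplace distribution) for the heavy-tailed signal.
print(abs(excess_kurtosis(gaussian_noise)) < 0.3)  # True
print(excess_kurtosis(heavy_tailed) > 1.5)         # True
```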

If we run an eigenbasis analysis based on covariance matrix for our Diabetes dataset, using all but glucose related variables, and not normalizing data:

load(url("http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/diabetes.sav"))

dim3 <- c("chol", "hdl", "ratio", "bp.1s", "bp.1d", "age", "height", "weight", "waist", "hip")

diabetes3 <- diabetes[complete.cases(diabetes[, dim3]), dim3]

ds <- ks_eigen_rotate_cov(diabetes3)

We’ll be able to see that the synthetic eigenbasis variables whose distributions somewhat differ (in central tendency and dispersion metrics) from the Glycosylated hemoglobin distribution (v1, v3, v4, v5 – especially v3, with the obvious opposite skewness) are not only the high-variance variables, and the high-variance variables (v2) do not necessarily differ (Fig 11.2). Because of the distributions’ differences, we can use those variables for clustering the high- and low-risk patients.

Those eigenbasis variables, again, show no surprises: high cholesterol and high blood pressure are very bad, age and extra weight are moderately bad, the hdl ratio is rather good (maybe in some respects). **Bold** indicates the more significant, and *italics* the “good” variables.

|     | vars | n   | mean    | **sd**    | median  | trimmed | mad   | min     | max    | range  | skew      | kurtosis |
|-----|------|-----|---------|-----------|---------|---------|-------|---------|--------|--------|-----------|----------|
| v1  | 1    | 377 | 280.76  | **45.82** | 276.67  | 278.32  | 39.69 | 136.48  | 520.84 | 384.36 | **0.81**  | **2.38** |
| v2  | 2    | 377 | 118.88  | 41.31     | 115.47  | 116.17  | 36.65 | 22.77   | 265.49 | 242.72 | 0.69      | 0.75     |
| v3  | 3    | 377 | -108.09 | **24.92** | -105.24 | -106.43 | 22.85 | -216.86 | -41.18 | 175.68 | **-0.73** | **1.14** |
| v4  | 4    | 377 | -60.25  | **16.49** | -58.30  | -59.45  | 14.79 | -132.42 | -10.14 | 122.28 | **-0.52** | **0.73** |
| v5  | 5    | 377 | 7.80    | **14.43** | 6.88    | 7.19    | 14.25 | -32.48  | 56.33  | 88.80  | **0.45**  | **0.41** |
| v6  | 6    | 377 | -11.56  | 8.72      | -12.07  | -11.74  | 8.85  | -35.94  | 19.96  | 55.90  | 0.27      | 0.11     |
| v7  | 7    | 377 | -32.98  | 4.55      | -32.83  | -32.93  | 5.04  | -43.54  | -19.71 | 23.82  | -0.06     | -0.46    |
| v8  | 8    | 377 | 47.39   | 2.50      | 47.35   | 47.38   | 2.30  | 37.25   | 57.28  | 20.04  | 0.09      | 0.99     |
| v9  | 9    | 377 | 35.88   | 2.03      | 36.00   | 35.88   | 1.76  | 27.95   | 44.25  | 16.30  | 0.04      | 1.44     |
| v10 | 10   | 377 | -2.99   | 0.67      | -2.86   | -2.91   | 0.34  | -9.71   | -1.64  | 8.07   | -4.17     | 32.36    |

where:

v1 = 0.9399***chol** +0.0475*hdl +0.01945*ratio +0.1650*bp.1s +0.0821*bp.1d +0.1083***age** *-0.0014*height* +0.2552***weight** +0.0417*waist +0.03362*hip

v3 = *0.2021***chol** -0.0128*hdl +

v4 = 0.0565*chol *-0.8737***hdl** +0.0732*ratio

v5 = *-0.0800*chol* +0.4424***hdl** *-0.0345*ratio* -0.1739*bp.1s -0.4054

or, for initially normalized data, the high variance (of the v1 and v2 variables) may be a more obvious indicator of the dataset Structure we may be interested in:

|     | vars | n   | mean  | sd       | median | trimmed | mad  | min   | max   | range | skew      | kurtosis  | se   |
|-----|------|-----|-------|----------|--------|---------|------|-------|-------|-------|-----------|-----------|------|
| v1  | 1    | 377 | 2.31  | **0.31** | 2.30   | 2.29    | 0.32 | 1.67  | 3.21  | 1.54  | **0.39**  | **-0.16** | 0.02 |
| v2  | 2    | 377 | -0.53 | **0.25** | -0.51  | -0.52   | 0.27 | -1.45 | -0.06 | 1.39  | **-0.42** | **-0.19** | 0.01 |
| v3  | 3    | 377 | -1.11 | 0.19     | -1.10  | -1.11   | 0.19 | -1.65 | -0.47 | 1.18  | -0.15     | 0.17      | 0.01 |
| v4  | 4    | 377 | -1.77 | 0.17     | -1.75  | -1.77   | 0.18 | -2.25 | -1.30 | 0.95  | -0.07     | -0.48     | 0.01 |
| v5  | 5    | 377 | 1.71  | 0.16     | 1.70   | 1.70    | 0.15 | 1.24  | 2.24  | 0.99  | 0.28      | 0.28      | 0.01 |
| v6  | 6    | 377 | 0.32  | 0.13     | 0.30   | 0.31    | 0.11 | 0.03  | 1.15  | 1.12  | 1.34      | 5.58      | 0.01 |
| v7  | 7    | 377 | 0.26  | 0.08     | 0.25   | 0.25    | 0.08 | 0.04  | 0.62  | 0.59  | 0.57      | 1.07      | 0.00 |
| v8  | 8    | 377 | -0.21 | 0.07     | -0.21  | -0.21   | 0.07 | -0.42 | 0.05  | 0.47  | 0.19      | 0.33      | 0.00 |
| v9  | 9    | 377 | -1.10 | 0.05     | -1.10  | -1.10   | 0.05 | -1.30 | -0.89 | 0.41  | 0.00      | 1.64      | 0.00 |
| v10 | 10   | 377 | -0.14 | 0.03     | -0.14  | -0.14   | 0.02 | -0.42 | -0.07 | 0.35  | -3.43     | 23.53     | 0.00 |

v1 = 0.065*chol_n *-0.1978*hdl_n* +0.1322*ratio_n +0.1549

v2 = -0.1774*chol_n -0.1444*hdl_n -0.0270*ratio_n -0.3777***bp.1s_n** -0.2264

…

**Appendix 11.1**

As usual, working files are in: https://github.com/NSelitskaya/kitchen-style-r


If we stop our stepwise regression from the previous chapter at the last two dimensions (don’t look at the *eigen* parameter yet :)), and draw a scatter plot (Fig.10.1):

load(url("http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/diabetes.sav"))

dim <- c("glyhb", "chol", "hdl", "ratio", "stab.glu", "bp.1s", "bp.1d", "age", "height", "weight", "waist", "hip")

diabetes2 <- diabetes[complete.cases(diabetes[, dim]), dim]

ds <- ks_lm_dim_red(diabetes2, dim, n_dim=2, eigen=FALSE)

names(ds)[1:2] <- c("v1", "v2")

ds$glyhb <- diabetes2$glyhb

ds_nd <- ds[ds$glyhb<7,]

ds_pd <- ds[ds$glyhb>=7,]

ggplot(ds, aes(x=v1, y=v2, colour=glyhb))+

geom_point(data=ds, alpha=0.5)+

geom_point(data=ds_nd, alpha=0.5)+

geom_point(data=ds_pd, alpha=0.5)+

scale_colour_gradientn(colours=rainbow(4))

We’ll easily see where the regression line would be (at about the bisector of the first quadrant), and what kind of pdfs the population and the “non-diabetes” and “diabetes” groups would have after projecting the data onto that line in the normal direction (instead of collapsing the dimension v2). In such a regression (the multivariate one, which we’ll talk about later) the difference in the samples’ distributions would be more noticeable than in the univariate regression case (Fig.10.2).

The reason we may prefer the normal projection onto the regression line is that, although no projection is a homeomorphic transformation, and therefore none preserves the structure of the domain space, they may fail to preserve it at different degrees of granularity. For example (Fig.10.3), if we project the point *a1* onto the line *r*, the projection *p* is a continuous transformation (neighbours in the domain space remain neighbours in the range space; i.e., if the distance between two points in the range space is less than the chosen *epsilon* – the radius of the open ball there – we can always choose such a radius of the open ball in the domain space, *delta*, that the inverse images of those range points will be inside that ball).

But the inverse image *p-1* is not continuous: if we take a point *a2* that lies on the projection line and on the very line *r*, then its projection a2′ = a2 = a1′, and if we choose *epsilon′* < d(*a1, a2*), where d is a distance (usual Euclidean) function, then there is no way to find such a small *delta′* that would bring the images *a1* = p-1(*a1′*) and *a2* = p-1(*a2′*) into a neighbourhood < *epsilon′*, because delta′ is already *0*, and d(*a1, a2*) > *epsilon′*. The same applies to the projection p″(*a1*) = *a1″* = *a3″* = *a3* onto the axis x, with *epsilon″* < d(*a1, a3*). However, epsilon″ = *epsilon′* / cos(*alpha*), *alpha* being the angle between *r* and *x*. Therefore, with *epsilon″* > *epsilon′*, the normal projection onto the regression line *r* loses structural homeomorphity only at a smaller granularity level.

It’s not only multivariate regression that can help with that – we may also rotate the basis to make one of the basis vectors span the regression line (Fig.10.4), and collapse (with our ready routines) the other dimension:

ds <- ks_lm_dim_red(diabetes2, dim, n_dim=2, eigen=FALSE)

ds <- ks_eigen_rotate(ds, std=TRUE)

ds <- ks_lm_dim_red(ds, eigen=FALSE)

What we’ve just done (the rotation to the eigenbasis) is considered a part of Principal Component Analysis. The need for it comes from the very nature of statistical modelling. I love how Noam Chomsky put it in this lecture at Google (33:50):

We can, indeed, throw a book out of a window and, instead of using analytical mathematical models (based on the 2nd Newton’s law of motion **F** = m**a**), make a bunch of video recordings and process them with statistical modelling methods; the predictive models of the Department of Statistics and Data Analysis may even be better than the ones of the Department of Physics. However, those statistical methods, because they don’t bother building causality models of the driving forces and reasons for the development of the processes, are prone to mistakenly taking a deterministic process for a random one because of not including all the necessary variables/dimensions in the model; or to including the same, “real” analytical variable, partially, into a number of “phony” statistical variables; and/or to multiplying the number of such variables above the necessary dimensionality.

For example, we may build our model by training it on throws of books of approximately the same mass, and get an **F** = C**a** model formula, where C is some constant coefficient. Then, lacking the dimension m, we’ll take a deterministic process for a random one, because, having the same **F**, we’ll get multiple **a** due to the mass difference of the books being thrown out.
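A toy numpy sketch of that mass-blind model (hypothetical numbers): the process is exactly F = m*a, yet the model F = C*a reports a large “random” residual scatter.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 300

# Hidden variable: book masses we never recorded.
m = rng.uniform(0.5, 2.0, n)
# Observed: applied force and resulting acceleration, F = m*a exactly.
F = rng.uniform(5.0, 20.0, n)
a = F / m

# The mass-blind model F = C*a, fit by no-intercept least squares.
C = np.sum(F * a) / np.sum(a * a)
residuals = F - C * a

# The process is fully deterministic, yet the model sees a large
# "random" residual scatter because the mass dimension is missing.
print(np.mean(residuals ** 2) > 1.0)  # True
```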

Or, by some caprice of the statistical mind, we may come up with the following model formula: **F** – (m/2)**i** + 2**a** = m**a** – (m/2)**i** + 2**a**; where **b** = **F** – (m/2)**i**, **c** = 2**a**, **e** = –(m/2)**i** + 2**a**, **d** = m**a**, and **i** is some unit vector of our basis.

**b** + **c** = **d** + **e**

with obviously excessive and dependent variables/dimensions. Principal Component Analysis, or, actually, covariance analysis, is meant to address exactly the last case…

**Appendix 10.1**

ks_eigen_rotate <- function(df, std=FALSE){

  ei <- eigen(cov(df))

  #print(ei$values)

  ds <- as.data.frame(as.matrix(df) %*% ei$vectors)

  colnames(ds) <- matrix_symvect_mult(t(ei$vectors), names(df))

  if(std){

    dim <- colnames(ds)

    n_ds <- lapply(dim, std_norm_ds, ds)

    names(n_ds) <- dim

    ds <- as.data.frame(n_ds)

  }

  ds

}