Kitchen Style Tutorials: Vector Spaces #1

Mathematical Space

A mathematical space is a collection of objects with a structure defined over them. This definition as a whole, as well as the words used in it and their relations, may seem too vague, broad, and ambiguous. What is an "object"? What are a "collection" and a "structure"? Why is the structure "defined", rather than belonging intrinsically to these objects or their collection? And if it is defined, who defines it, and why?

Instead of intuitively guessing answers to these questions, we might have wanted to build our understanding of the vague and complex concepts from a limited basis of simple, narrow, well-defined, clear and distinct concepts, as René Descartes, one of the framers of modern philosophy and algebra, wanted. At the same time, however, he noted that the human mind is a very poor inventor. It can't create anything genuinely new. It is very good, though, at mixing and combining blocks already known to it. If we go down the path of excessively reducing the number, richness, and generality of these blocks, we may end up unable to span complex concepts (if any at all) from the simple, distinct, and clear blocks.

Therefore, we may be better off defining simple, well-defined, and clear rules for combining the potentially opaque, complex, and only intuitively comprehensible building blocks. To add insult to injury, postmodernism pointed out that intuition depends on personal experience, so each person's innate concepts will differ from everyone else's. The question 'is your "red" the same as my "red"?' does not necessarily expect an affirmative answer. Still, we may attach labels or tokens (the word "red") to real-world phenomena, hoping that everyone will sort out the relations between their inner mental images and those labels by themselves, likely on a sub/unconscious level.

In this context, Chomsky's LAD (Language Acquisition Device) could also be decoded as the "Label" Acquisition Device, because those innate concepts of real-world understanding are embedded into natural human languages. For example, the very concept of the number 1 (and of numbers other than 1) is present in some form in all(?) grammars of human languages, as is the dynamic of going from 1 to many via 1 + 1 = 2 (though dropped in its explicit form by many modern languages).

Therefore, we may not need to go down the rabbit hole of purely analytical ontologies, risking ending up with nothing: this is/consists of that and that, which in their turn are/consist of even smaller and simpler things, until they disappear. Instead, at some level we may start thinking in terms of more synthetic ontologies of the type: this could be thought of as that (and that, and that) in their combinations and relations (as well as equally valid other combinations of other "that"s).

Based on those deep, language-embedded concepts of natural numbers, formal mathematical axioms, like Peano's axioms, come quite naturally: for each natural number x there is exactly one successor y = s(x) = x + 1; if successors are equal, then so are their predecessors; there is exactly one number, 1, that is not a successor of any number; and 1 together with its successors spans the whole collection of natural numbers N.
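
In standard notation (one common rendering, starting the naturals at 1 as above, with s the successor function):

∀x ∈ N: s(x) ∈ N
∀x, y ∈ N: s(x) = s(y) ⇒ x = y
∀x ∈ N: s(x) ≠ 1
∀S ⊆ N: (1 ∈ S ∧ ∀x(x ∈ S ⇒ s(x) ∈ S)) ⇒ S = N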

It has always been a source of amusement that mathematics, such a seemingly artificial construct out of a fantasy world, could be so useful when applied to problems of the real world. Yet mathematical concepts are always based, on the one hand, on real-world phenomena, and on the other hand, on human patterns of comprehending the world. One can reasonably expect that wherever those concepts lead us, their range will stay the same: human understanding of the world, i.e. humans may always find areas for their application.

The above implies that an extraterrestrial or artificial intelligence, with potentially different ways of observing and comprehending the real world, may develop quite different mathematics. Even humans, using axiom sets different from the "usual" ones, have developed quite different mathematical theories, but we may expect ETs' and AIs' differences to be on a much larger scale.

However, in the next chapter let's come back to the ways of thinking about the "collection of objects" and the "structure defined over them" from the original definition…


Data Science Done Right (the Kitchen Style) #18

After rotating our ECG data of a pregnant woman into the principal eigenbasis, and selecting 3 minor components of interest, showing the perspiration rate and the child's heartbeat, we may try to separate the child's heartbeat and the mother's perspiration waves into different sub-datasets using clustering. A cloud view of the 3D, 6-means "disjoint distance" clustering of the V5, V6, V7 principal components may not be very informative (Fig. 18.1):

Fig. 18.1. Clustering of the principal components V5, V6, V7, shown on 3D cloud plot

#horizontal matrix PCA – ECG channels
pca_model <- ks_eigen_rotate_cov(ekg2)
ds <- pca_model$ds

# 3-dimensional, 6-means clustering
c_model <- ks_kmeans_nd_means(ds, c("V5","V6","V7"), 6)
ds <- ks_kmeans_nd_clusters(ds, c("V5","V6","V7"), c_model$Mu)

library(lattice)
cloud(V7 ~ V5 * V6, ds, groups=Cls, pretty=TRUE, zoom=0.9,
      screen = list(x = 90, y = -30, z = 0))

library(ggplot2)
ggplot(ds, aes(x=V1)) +
  geom_point(aes(y=V6, color=Cls))

However, the V6 principal variable view with clustering (Fig. 18.2) shows much better separation of the data points than in the 1D case (Fig. 16.1):

Fig. 18.2. Clustering of the principal components V5, V6, V7, shown on a V6/time diagram

Still, we may want to add more continuity to the data, clustering data points not only by the channel amplitudes but also temporally. Of course, linear time won't be much help there, but we may use "cyclic" time, modulating it, for example, trigonometrically according to the child's heartbeat frequency (it could be any cyclic function) and scaling it up to the principal component amplitudes. The idea here is to make data points neighbours not only by the neighbouring amplitude parameters, but also by the criterion of close occurrence time. Including the "cyclic" time variable in the clustering as a 4th dimension gives us even better cluster separation (Fig. 18.3, 18.4):

Fig. 18.3. Clustering of the principal components V5, V6, V7 and Cyclic time P, shown on 3D cloud plot

#including cyclic time
ds[,"P"] <- -sin(ds[,"V1"]/(10/22.4)*2*pi)*20
c_model <- ks_kmeans_nd_means(ds, c("V5","V6","V7","P"), 6)
ds <- ks_kmeans_nd_clusters(ds, c("V5","V6","V7","P"), c_model$Mu)

Fig. 18.4. Clustering of the principal components V5, V6, V7 and Cyclic time P, shown on a V6/time diagram

Now we may transform the clustered dataset of perspiration and the child's heartbeat from the principal eigenbasis back into the original ECG basis, and see how these signals would look on the original ECG channels had we had filters to separate them from the primary mother's heartbeat signal (Fig. 18.5):

Fig. 18.5. Child's heartbeat (V2) and mother's perspiration (V5) clustered datasets transformed back into the original ECG channel basis

ds$V2 <- 0
ds$V3 <- 0
ds$V4 <- 0
ds$V8 <- 0
ds$V9 <- 0
#inverse subset transform
#ts <-[ds$Cls < 2, "V1"])
ts <-[ds$Cls > 1, "V1"])
names(ts) <- c("V1")
#ds_p <- ds[ds$Cls < 2, dim]
ds_p <- ds[ds$Cls > 1, dim]
ds <- %*% pca_model$An1)

#put original var names and time back
names(ds) <- c("V2","V3","V4","V5","V6","V7","V8","V9")
ds[,"V1"] <- ts[,"V1"]

Appendix 18
As usual, working files are in:


Data Science Done Right (the Kitchen Style) #17

nD 2-means disjunctive pointwise clustering

When trying to relate our 1D "disjoint delta" measure to entities of n-dimensional space, we notice that in the 2-means case it naturally corresponds to the dot/inner product of the delta vectors xi - mu' and xi - mu'' (where xi is the i-th data point, and Mu is the set of cluster means). In other words, "disjoint delta" is a mapping to the real field of the kind that makes a vector space an inner product space, which gives us hope that the nD, k-means extension will also have some meaning when we come to it.

This inner product association immediately tells us that in the nD case the "disjoint delta" will approach zero not only when a data point approaches one of the cluster means, but also when its deltas are orthogonal. That may or may not be considered a drawback, but it is definitely something to keep in mind.

dd(xi, Mu) = (xi - mu') . (xi - mu'') = delta(xi, mu') . delta(xi, mu'')

with delta(x, mu') = (x1 - mu'1)i + (x2 - mu'2)j + … + (xn - mu'n)k

A nice, easy-to-compute "disjoint delta" arises if the basis of our space is orthonormal (i.e. i.i = 1, i.j = 0), which in the case of the PCA eigenbasis (as we made sure before) it is, so we shouldn't worry about that restriction:

dd(xi, Mu) = (x1 - mu1')(x1 - mu1'') + (x2 - mu2')(x2 - mu2'') + … + (xn - mun')(xn - mun'') = x1^2 + x1(-mu1' - mu1'') + mu1'*mu1'' + x2^2 + x2(-mu2' - mu2'') + mu2'*mu2'' + … + xn^2 + xn(-mun' - mun'') + mun'*mun''

which gives the same matrix form of the system of linear equations, and the same solution, as before:

X2 + X*c = dd

c = -(X^T*X)^-1 * X^T*X2

where X2 = ( x1^2 + x2^2 + … + xn^2; … ), X = ( x1 x2 … xn 1 1 … 1; … ), c = ( c11 c21 … cn1 c10 c20 … cn0 )^T

Again, having found the vector c of polynomial coefficients, we may calculate the corresponding roots Mu. Of course, the matrix X above is not of full rank (its n constant columns are identical), which means we won't get a unique analytical solution for the matrix equation above. We can always find a numerical one, optimizing SSDD or another parameter, but if we want a nice, clean analytical solution we should restrict generality once again. What we can do is set one of the cluster means at the coordinate origin (the identity of the space); then all ci0 = mui'*mui'' coefficients become zero, and with X = ( x1 x2 … xn; … ), c = ( c11 c21 … cn1 )^T

c = -(X^T*X)^-1 * X^T*X2

would be easily uniquely solvable.
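
For illustration, here is a minimal R sketch of that restricted closed-form solution (a hypothetical helper, not one of the ks_* functions from the working files), assuming the data are already expressed in an orthogonal basis and shifted so that one cluster mean sits at the origin:

# hypothetical sketch: nD 2-means "disjoint delta", one mean pinned at the origin
ks_sketch_2means_nd <- function(X) {
  X  <- as.matrix(X)
  X2 <- rowSums(X^2)                      # x1^2 + x2^2 + ... + xn^2 per data point
  cv <- -solve(t(X) %*% X, t(X) %*% X2)   # c = -(X^T*X)^-1 * X^T*X2
  list(mu1 = rep(0, ncol(X)),             # the pinned mean
       mu2 = -as.vector(cv))              # since c_i1 = -(mu'_i + mu''_i) and mu'_i = 0
}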

nD k-means disjunctive pointwise clustering

If we want to generalize the "disjoint distance" to the nD, k-means case, in a form similar to the previous lower-dimensional cases:

d(x, Mu)^2 = ((x1 - mu'1)*(x1 - mu''1)*…*(x1 - muk'1))^2 + ((x2 - mu'2)*(x2 - mu''2)*…*(x2 - muk'2))^2 + … + ((xn - mu'n)*(xn - mu''n)*…*(xn - muk'n))^2

we should come up with some meaning and purpose for that definition. We can try to find it in analogies to spaces more complex than a vector space: the already mentioned inner product space, or metric or topological spaces. Just as in a topological space we define a collection of open (or closed) sets that basically determines whether points of the space are neighbours or not, here we can define a set Mu = {(mu', mu'', … , muk') : mui' ∈ R^n} that contains the cluster mean vectors of the original nD space. That space, together with the set Mu and the "disjoint distance" mapping function d(x, Mu) from it to R, we can name a "cluster space" Cn = (R^n, Mu, d), in which the set Mu and the distance function d define neighbourhoods, or clusters, of the data points.

Let's see if such a "cluster space" definition produces any useful and meaningful results. Again, the generalized matrix equation for finding the set Mu using the "disjoint distance" defined above, for a given dataset A ⊂ R^n, under the restrictions of an orthogonal basis of R^n and one centroid being set at the identity (origin) of R^n, will be: Xk + X*c = dd

Xk = ( x1^k + x2^k + … + xn^k; … )
X = ( x1^(k-1) x2^(k-1) … xn^(k-1)  x1^(k-2) x2^(k-2) … xn^(k-2) … x1 x2 … xn; … )
c = ( c1(k-1) c2(k-1) … cn(k-1)  c1(k-2) c2(k-2) … cn(k-2) … c11 c21 … cn1 )^T
dd = ( d(x1, Mu)^2, … )^T

and the solution minimizing the sum of squares of dd, SSDD = dd^T*dd, will again be:
c = -(X^T*X)^-1 * X^T*Xk

from which we can find members of the set Mu – roots of the polynomials:
xi^k + ci(k-1)*xi^(k-1) + ci(k-2)*xi^(k-2) + … + ci1*xi = (xi - mu'i)*(xi - mu''i)*…*(xi - muk'i) = 0

where mu' = ( mu'1 mu'2 … mu'i … mu'n )^T
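
As a sanity check of the derivation, here is a minimal R sketch (a hypothetical helper, not the blog's ks_kmeans_nd_means) under the same restrictions: an orthogonal basis and one centroid pinned at the coordinate origin. The columns of the design matrix are grouped per dimension rather than per power, which only permutes the elements of c:

# hypothetical sketch: nD k-means "disjoint distance" centroids via per-dimension polynomials
ks_sketch_kmeans_nd_means <- function(X, k) {
  X <- as.matrix(X)
  n <- ncol(X)
  # design matrix: for each dimension i, the columns x_i^(k-1), ..., x_i^1
  Xd <- do.call(cbind, lapply(1:n, function(i) sapply((k-1):1, function(p) X[, i]^p)))
  Xk <- rowSums(X^k)                             # x1^k + ... + xn^k per data point
  cv <- -solve(t(Xd) %*% Xd, t(Xd) %*% Xk)       # c = -(X^T*X)^-1 * X^T*Xk
  # per dimension: roots of x^k + c_(k-1)*x^(k-1) + ... + c_1*x are the centroid coordinates
  sapply(1:n, function(i) {
    ci <- cv[((i - 1)*(k - 1) + 1):(i*(k - 1))]  # c_(k-1), ..., c_1 for dimension i
    Re(polyroot(c(0, rev(ci), 1)))               # ascending coefficients; one root is 0
  })                                             # k x n matrix of centroid coordinates
}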

Let's use our ECG dataset, take the principal eigen-variables V5 and V6, and cluster them using the "disjoint distance" and the traditional "conjoint" k-means algorithms (Fig. 17.1):

Fig. 17.1. Top row – "disjoint distance" clustering; bottom row – usual k-means clustering, for 3-7 centroids, on the V5 and V6 principal eigen-variables of the pregnant woman ECG dataset.

ekg <- read.table("foetal_ecg.dat")

#remove time
dim <- c("V2","V3","V4","V5","V6","V7","V8","V9")
ekg2 <- ekg[, dim]

#horizontal matrix PCA – ECG channels
pca_model <- ks_eigen_rotate_cov(ekg2)
ds <- pca_model$ds

#put time and original labels back
names(ds) <- c("V2","V3","V4","V5","V6","V7","V8","V9")
ds[,"V1"] <- ekg[,"V1"]

# n-dimensional, k-means clustering for k = 3..7
library(ggplot2)
for(i in 3:7){
  c_model <- ks_kmeans_nd_means(ds, c("V5","V6"), i)
  ds <- ks_kmeans_nd_clusters(ds, c("V5","V6"), c_model$Mu)

  mn <-$Mu)
  names(mn) <- c("V5","V6")
  mn$Cls <- seq(1, nrow(mn))

  print(ggplot(ds, aes(x=V5, y=V6, colour=Cls)) +
    geom_point(alpha=0.4, show.legend=F) +
    geom_point(data=mn, alpha=1, size=2, shape=1, stroke=4, show.legend=F) +
    geom_point(data=mn, alpha=1, size=2, color="white"))
}

Appendix 17
As usual, working files are in:


Data Science Done Right (the Kitchen Style) #16


So far, when talking about linear methods, we have been using the terms linear/vector space and dataset in that space quite interchangeably, which is not very rigorous but excusable when it is obvious from the context. In any case, we have been talking about methods that work with the whole dataset – for example, using the covariance matrix of the whole dataset to find an eigenbasis that nullifies the covariance rotation moment, and using that basis as principal coordinates.

However, that's not necessarily the only way to do the analysis. We may be interested not (only) in the structure/homomorphic projections of the whole dataset, but in revealing the structure of a particular subset or subsets in the most continuous way. Therefore, we may use multiple principal eigenbases, rotated around the covariance matrices of particular subsets, into which we may transform either the whole dataset or just the subsets of interest. Here we will need to cluster our data around some structural features visible either in the very original space or revealed in the space transformed with the principal eigenbasis. We may then want to take a closer look at these structural features, rotating the original basis around just the clustered subset of our interest.

Clustering is a technique based on quite a vague idea and is considered an unsupervised one; still, we should carefully think through what we want to achieve in order to come up with a particular algorithmic implementation. Even the quite straightforward k-means clustering technique (which we will work with) may be implemented with various aims in mind and by various means. Usually, we define into how many cluster subsets we want to partition our dataset (or, as an advanced variant, extract a recommended number from the data), and then try to arrange the means of the cluster subsets according to some logic. We may use an AND (conjunction) logic – we want the distances from any data point to all the means to be minimal – i.e. we add the distances from a point to the means. But we may also use an OR (disjunction) logic – we want to minimize the distance from any point to at least one of the means – i.e. we multiply the distances from a point to the means.

1D 2-means disjunctive clustering

Let's explore the latter case in more detail, starting with the simplest 1D case with 2 means. For the i-th data point let's define a "disjoint delta" measure to the means mu' and mu'', or actually to a vector Mu, as follows:

ddi = (xi - mu')*(xi - mu'') = xi^2 + (-mu' - mu'')*xi + mu'*mu''

for all N data points we may write a system of equations (where c1 = -(mu' + mu'') and c0 = mu'*mu'') as:

( x1^2  x1  1 )   (  1 )     ( dd1 )
( x2^2  x2  1 ) * ( c1 )  =  ( dd2 )
(  …    …   … )   ( c0 )     (  …  )
( xn^2  xn  1 )              ( ddn )

or even

( x1^2 )   ( x1  1 )            ( dd1 )
( x2^2 ) + ( x2  1 ) * ( c1 ) = ( dd2 )
(  …   )   (  …  … )   ( c0 )   (  …  )
( xn^2 )   ( xn  1 )            ( ddn )

which gives us an equation structure in compact matrix form, quite familiar from our previous linear regression derivations:

X2 + X*c = dd

Then we may define the distance from the i-th point to the mean vector Mu as:

d(xi, mu) = sqrt(ddi^2), or d^2(xi, mu) = ddi^2

then the sum of squares of those "disjoint distances" from all the points to the mean vector would be:

SSDD = dd^T*dd = (X2 + X*c)^T * (X2 + X*c)

and then we may want to minimize it by taking its derivative with respect to c and finding the c for which that derivative becomes zero:

@SSDD/@c = 2X^T*(X2 + X*c) = 0  (with the Hessian matrix @^2SSDD/@c@c^T = 2X^T*X being full-rank positive definite, this is indeed a minimum)

X^T*X2 = -X^T*X*c

c = -(X^T*X)^-1 * X^T*X2

and, remembering that c1 = -(mu' + mu'') and c0 = mu'*mu'' (i.e. mu' and mu'' are the roots of x^2 + c1*x + c0):

mu', mu'' = (-c1 ± sqrt(c1^2 - 4*c0))/2

Then we can walk over all the data points and assign each one to a cluster according to the shortest distance between it and the just-found cluster means mu' and mu''.
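
A minimal R sketch of this closed-form recipe (a hypothetical illustration, not the ks_kmeans_1d_* functions from the working files):

# hypothetical sketch: 1D 2-means "disjoint delta" clustering in closed form
ks_sketch_2means_1d <- function(x) {
  X  <- cbind(x, 1)                                   # rows ( xi  1 )
  X2 <- x^2
  cv <- -solve(t(X) %*% X, t(X) %*% X2)               # c = -(X^T*X)^-1 * X^T*X2
  c1 <- cv[1]; c0 <- cv[2]
  mu <- (-c1 + c(1, -1)*sqrt(c1^2 - 4*c0)) / 2        # roots of x^2 + c1*x + c0
                                                      # (assumes a non-negative discriminant)
  cls <- apply(abs(outer(x, mu, "-")), 1, which.min)  # assign each point to the nearest mean
  list(Mu = mu, Cls = cls)
}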

1D k-means disjunctive clustering

We can (almost) easily extend the logic to any number k of means in the vector Mu, while still keeping the original data space 1D.

With the "disjoint delta" measure for the i-th data point:
ddi = (xi - mu1)*(xi - mu2)*…*(xi - muk) = xi^k + c(k-1)*xi^(k-1) + … + c1*xi + c0

System of equations for all data points:
( x1^k )   ( x1^(k-1) …  x1  1 )   ( c(k-1) )   ( dd1 )
( x2^k ) + ( x2^(k-1) …  x2  1 ) * ( c(k-2) ) = ( dd2 )
(  …   )   (    …     …  …   … )   (   …    )   (  …  )
( xn^k )   ( xn^(k-1) …  xn  1 )   (   c0   )   ( ddn )


Xk + X*c = dd
SSDD = dd^T*dd = (Xk + X*c)^T * (Xk + X*c)
@SSDD/@c = 2X^T*(Xk + X*c) = 0
X^T*Xk = -X^T*X*c
c = -(X^T*X)^-1 * X^T*Xk

Here, having found the k polynomial coefficients (the (k+1)-th is 1), we have to find the roots of that polynomial (factor it), and those roots will be our means. Unlike the 2-means case (or even 3 or 4), in the general case (k >= 5) we'll have to cheat a bit on our approach of working with simple analytical solutions and find them numerically; conveniently, R (and Python) already have such functions available, like polyroot().
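
A minimal sketch of that numerical step (again a hypothetical helper, not the ks_kmeans_1d_means used below), for a plain numeric vector x:

# hypothetical sketch: 1D k-means "disjoint distance" means via polyroot()
ks_sketch_kmeans_1d_means <- function(x, k) {
  X  <- sapply((k-1):0, function(p) x^p)     # columns x^(k-1), ..., x, 1
  Xk <- x^k
  cv <- -solve(t(X) %*% X, t(X) %*% Xk)      # c = -(X^T*X)^-1 * X^T*Xk
  # polynomial x^k + c_(k-1)*x^(k-1) + ... + c_0; polyroot() takes ascending coefficients
  Re(polyroot(c(rev(cv), 1)))                # the roots (may come back slightly complex)
}

Cluster labels then follow by assigning each point to the nearest of the returned means, as in the 2-means case.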

For our ECG example, let's take the V6 channel in the principal eigenbasis and find, say, 7 clusters:

ekg <- read.table("foetal_ecg.dat")

#remove time
dim <- c("V2","V3","V4","V5","V6","V7","V8","V9")
ekg2 <- ekg[, dim]

pca_model <- ks_eigen_rotate_cov(ekg2)
ds <- pca_model$ds

#put time and original labels back
names(ds) <- c("V2","V3","V4","V5","V6","V7","V8","V9")
ds[,"V1"] <- ekg[,"V1"]

Mu <- ks_kmeans_1d_means(ds, "V6", 7)
ds <- ks_kmeans_1d_clusters(ds, "V6", Mu)

library(ggplot2)
ggplot(ds, aes(x=V1)) +
  geom_point(aes(y=V6, color=Cls), alpha=0.5)

Fig. 16.1. 7 clusters of V6 principal ECG variable




This is not so impressive, so we may want to look at multi-dimensional solutions…

Appendix 16
As usual, working files are in:


Data Science Done Right (the Kitchen Style) #15

So far we've been working with the ECG dataset as a "horizontal" matrix, with the ECG channels being the information channels, or dimensions, of our vector space, and the readings at time slices being our n-dimensional (actually 8-dimensional) data points, 2500 of them. However, we may transpose the matrix and treat the time moments as information channels or dimensions (2500 of them), while the 8 ECG channels become the data points.

Fig.15.1. First 7 "eigen-coefficients" in subspaces with significant data structure homomorphisms, and V9 – the first 1D "noise" subspace

When rotating the basis around the covariance matrix of our new 8-point dataset in the 2500-dimensional space/basis, we would reasonably expect to also see some subspaces with minimal/minuscule "explained variance", i.e. homomorphisms that drop a lot of the data structure or, in other words, preserve only negligible aspects of it.

Fig.15.2.1. Inverse image of the first eigen-wave, modulated by the first eigen-coefficient V2

Fig.15.2.2. Inverse image of the second eigen-wave modulated by the second eigen-coefficient V3, etc…

Fig.15.2.3. Inverse image of the third eigen-wave modulated by the third eigen-coefficient V4, …

Fig.15.2.5. Inverse image of the fifth eigen-wave modulated by the fifth eigen-coefficient V6, etc…

Fig.15.2.9. Inverse image of the ninth, or first minor, “noise” eigen-wave

Fig.15.3. Inverse image of all minor subspaces 9-2500

After actually doing that rotation we'll see that the result greatly exceeds our expectations – only a 7D subspace of the eigenbasis preserves the most significant data structure. Or, in terms of "compression", we "replaced" the 8 x 2500 matrix of the previous "horizontal" data representation by an 8 x 7 matrix with quite a low loss, or "noise" (Fig. 15.3). Of course, in reality we didn't get those results out of thin air: we just put most of the data structure information, in the latter case, into the 2500 x 2500 eigen-matrix, while in the former it was an 8 x 8 rotation eigen-matrix. That may seem a useless exchange, but when dealing with data of similar structure we may train the model "off-line" (calculate the eigen-matrix) and do the "compression" or pattern recognition faster "on-line".

"Noise" in this model is spherical (Fig. 15.3) but correlated, and if we take a look at the individual subspaces, they have quite regular and varied structures (for example, Fig. 15.2.9).

If, borrowing image recognition jargon, we called the data in eigenspace "eigen-waves" for the horizontal matrix representation, then for the "vertical" representation they could be called "eigen-coefficients" (Fig. 15.1). That terminology is quite relative, though – we just call long series of data points "waves" and short ones "coefficients". Inverse transformations of the 1D subspaces of the eigenspace help to envision what PCA (or linear models in general) does (Fig. 15.2.1-5). Each 1D dimension, or subspace, of the original ECG data space is combined from the same eigen-waves (well known to us from previous chapters), but multiplied by different eigen-coefficients. Depending on which data matrix we calculate the eigenspace for – "original" or transposed – they will appear, pairwise, either in the eigen-matrix (the matrix of the basis rotation, or eigenvectors) or in the matrix of our data in eigenspace, though they'll be specific to each covariance matrix we calculate an eigenbasis for.

Let's call our data in the original space A, its eigen-matrix T, and the data in eigenspace B; for the transposed case, A^T, M and C respectively. Then T*A = B, M*A^T = C, and M^-1 = B^T * T^-T * C^-1, T^-1 = C^T * M^-T * B^-1.

#vertical matrix PCA – 2500 time channels
ekg3 <-
pca_model3 <- ks_eigen_rotate_cov(ekg3)
dst <- pca_model3$ds

#first 8 columns of the rotated data – the "eigen-coefficients"
ds <- dst[,1:8]
names(ds) <- c("V2","V3","V4","V5","V6","V7","V8","V9")
ds[,"V1"] <- c(1,2,3,4,5,6,7,8) #ekg[1:8,"V1"]

#"remove" all subspaces except the first one and transform back
dst_1 <- dst
dst_1[,2:ncol(dst)] <- 0
dst_1o <- %*% pca_model3$An1)
ds <-
ds[,"V1"] <- ekg[,"V1"]

#"remove" all subspaces except the second one and transform back
dst_2 <- dst
dst_2[,1] <- 0
dst_2[,3:ncol(dst)] <- 0
dst_2o <- %*% pca_model3$An1)
ds <-
ds[,"V1"] <- ekg[,"V1"]


Appendix 15
As usual, working files are in:


Data Science Done Right (the Kitchen Style) #14

The ideas in PCA of "dropping" some dimensions of the resulting eigenbasis (after rotating the basis with the aim of getting the covariance matrix into the diagonal lambda form), or of decomposing our vector space into subspaces ("linear modelling" and "error", for example) with smaller bases (but consisting of the same eigenbasis vectors), or the very use of vector spaces for statistical modelling, come from the simple notion that a vector space is also a commutative (or Abelian) group.

We can naturally and isomorphically (without any loss of structure) map "things" that have the real-number arithmetic properties, or a set (your dataset column or row) in the field of real numbers itself, onto a 1-dimensional vector space. However, mapping your multi-column/row dataset, which may contain data of different natures and measurement units, is not so natural. We may construct a group/vector space from different groups having nothing in common in their underlying sets and the operations defined over them via the external direct product; however, if we want it to be isomorphic to a group constructed by the inner direct product (a product of subspaces of a bigger space, i.e. having the same nature and structure), we want those subspaces to be commutative/Abelian (and disjoint).
(Gallian 353, 185, 196, 212)

So, indeed, for vector spaces the inner and outer direct products are the same (up to isomorphism), and we may use just the term "direct product" and use the inner and outer direct product notation interchangeably (we'll use the x "cross" notation for simplicity). By definition, the higher-dimensional linear spaces are direct products of the lower-dimensional ones. For example, the real 3D space R3 = R1 x R1 x R1; however, is it the same as/isomorphic to the "really real" spatial space, say S3, which we use to envision abstract nD spaces by analogy? The space in which any "thing" could be represented by at least 3 other "telescopic things" we may combine together, even if we know nothing about (real) numbers and (Cartesian) coordinates?

Continuity of the direct and inverse mappings f: R3 -> S3 and f^-1: S3 -> R3 can be verified naturally (we want to check that f(m*a + n*b) = m*f(a) + n*f(b) and vice versa, i.e. neighbours under the "*" and "+" relations remain neighbours after the transformations f and f^-1); however, bijectivity of f, and in particular surjectivity (being onto), does not follow naturally. Basically, what we want to know is whether for each set of coordinates in R3 there exists a point in our physical space S3, and whether for each point in S3 there exist coordinates for it. Which is literally a choice you make by accepting the Axiom of Choice, which states that they do and that we don't have "cavitation-like" holes in the R3 and S3 spaces, and therefore we may treat S3 as R3.

By stating the isomorphism of the inner and outer direct products for vector spaces and the axiom of choice, we are basically saying that we may decompose (via the inverse of the inner direct product) a multi-dimensional vector space, in multiple ways, into disjoint (normal) subspaces (each created by projecting onto some of the original dimensions) whose dimensions sum to that of the original space, and then compose them (via the outer direct product) back into a product space of the resulting subspaces, up to isomorphism. That may be demonstrated directly, of course, via preservation of continuity based on the definitions of the addition and multiplication operations of vector and product vector spaces (except for the multiplicity of ways, which again requires the axiom of choice), but the former statement sounds more concise and meaningful.

The same could also be said in another way: each projection of the original n-dimensional space onto an m-dimensional one (m < n) is a homomorphism, i.e. a preservation of some aspects of the original space's structure, while the direct product of disjoint homomorphisms produces the whole original space back, up to isomorphism.

One more, pretty obvious, observation from all of the above is that a basis rotation transformation (a square matrix) applied to the direct sum of disjoint subspaces will, in the general case, result in the sum of the images of those subspaces, which are members of the whole original space. I.e. for a ∈ A6 = A2 x A2' x A2'', for example, we have a = a1 + a2 + a3 (direct product and direct sum of subspaces are also isomorphic up to a "relabeling"), where a1 ∈ A2, a2 ∈ A2', a3 ∈ A2''; then T(a) = T(a1) + T(a2) + T(a3) = b1 + b2 + b3 = b, where b, b1, b2, b3 ∈ B6 and T: A6 -> B6, T being of course an isomorphism.
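
A quick numeric illustration of that observation in base R (a made-up 6D example, not tied to the ECG data):

# a linear transformation of a vector equals the sum of the transformations
# of its components in the three disjoint 2D subspaces of R^6
set.seed(1)
T6 <- qr.Q(qr(matrix(rnorm(36), 6, 6)))   # a random orthogonal "basis rotation" matrix
a  <- rnorm(6)
a1 <- c(a[1:2], rep(0, 4))                # component in A2
a2 <- c(rep(0, 2), a[3:4], rep(0, 2))     # component in A2'
a3 <- c(rep(0, 4), a[5:6])                # component in A2''
all.equal(as.vector(T6 %*% a),
          as.vector(T6 %*% a1 + T6 %*% a2 + T6 %*% a3))   # TRUE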

Fig. 14.1. Eigen-space as a direct product of mother heartbeat (blue), child heartbeat and perspiration (green), and noise (red) subspaces

As we can see in our particular case of the 8-dimensional space of ECG readings of a pregnant woman, we transformed the space around its covariance matrix into the eigenbasis, and then decomposed it (for programmatic simplicity we just set the kernel variables to zero) into three disjoint/normal subspaces via homomorphisms carrying the mother's heartbeat aspect of the data structure, the child's heartbeat + mother's perspiration, and the error/nonlinearity/noise. Rotating those subspaces back to the original 8-channel space via the inverse transformation, we got 3 separate images of those subspaces on the 8 ECG channels, and by adding them together we got, of course, the original dataset back.

Fig. 14.2. Inverse image of the mother’s heartbeat eigen-subspace

Fig. 14.3. Inverse image of the child's heartbeat eigen-subspace

Fig 14.4. Inverse image of the noise/error/nonlinearity eigen-subspace (not exactly usual spherical white noise model)

Fig 14.5. Original dataset in the 8-channel ECG space composed from the inverse images of the 3 eigen-subspaces above

pca_model <- ks_eigen_rotate_cov(ekg2)
ds <- pca_model$ds

#noise/error subspace: mute everything except the minor components V8, V9
ds_n <- ds
ds_n$V2 <- 0
ds_n$V3 <- 0
ds_n$V4 <- 0
ds_n$V5 <- 0
ds_n$V6 <- 0
ds_n$V7 <- 0
ds_n <- %*% pca_model$An1)

#mother's heartbeat subspace: keep V2, V3, V4
ds_m <- ds
ds_m$V5 <- 0
ds_m$V6 <- 0
ds_m$V7 <- 0
ds_m$V8 <- 0
ds_m$V9 <- 0
ds_m <- %*% pca_model$An1)

#child's heartbeat and perspiration subspace: keep V5, V6, V7
ds_c <- ds
ds_c$V2 <- 0
ds_c$V3 <- 0
ds_c$V4 <- 0
ds_c$V8 <- 0
ds_c$V9 <- 0
ds_c <- %*% pca_model$An1)

#sum of the three inverse images recovers the original dataset
ds <- ds_c + ds_m + ds_n

Appendix 14.1

As usual, working files are in:


Data Science Done Right (the Kitchen Style) #13

On possible criteria for partitioning the PCA eigenspace into the "error" and "linear modeling" subspaces

As we mentioned in previous chapters, by performing PCA, we implicitly find the least lossy linear models of the hidden variables that describe our data, and use them as an eigenbasis for a “linear models subspace”, while the rest of the eigenvectors compose an orthogonal “error subspace”.

However, had we known nothing about the hidden variables, we would not know which eigenbasis vectors are part of the "linear models subspace" and which of the "error subspace". Still, having some guesses about the distributions of the linear model's hidden experimental parameters, about whether and how non-linear the processes are, or about the noise (which comes from numerous minor sources we don't know and don't care about, and therefore, according to the central limit theorem, should be Gaussian), we may try to identify which eigenbasis vectors carry projections of either the noise and/or the non-linearity of the linear models, as well as how many distinct models describe the process. Then we may want to either drop some dimensions, or set the data in them to 0 or another baseline, and sometimes rotate back into the original basis, muting either the noise or, on the contrary, the dominant models, allowing the weaker, less obvious models to stand out.

Fig. 13.1. 8-channel ECG of a pregnant woman with the child's heartbeat mixed in.

For example, in the previous chapter we took a dataset of 8-channel ECG of a pregnant woman, where on all the channels the mother's heartbeat dominated over noise of varying degree (Fig. 13.1), and rotated the basis of the data around the covariance matrix, which resulted in three dimensions clearly (by their shape and Gamma-type pdf) associated with the mother's heartbeat (V2 (dark blue) being the strongest and clearest variable), two dimensions combining the child's heartbeat with some noise and perspiration (V6 (dark green)), and three more dimensions of noises and non-linearities of various sorts (Fig. 13.2).

Fig. 13.2. Covariance matrix based eigenbasis, with V2 – mother’s heartbeat and V6 – child’s heartbeat and mother’s perspiration.

If we set the dominant mother's heartbeat in the first three variables to a base level and rotate the basis back to the original 8-channel ECG dimensions, we'll clearly see the child's heartbeat in one dimension (V4 (dark blue)) and the mother's perspiration in another (V6 (dark green)) against noise backgrounds (Fig. 13.3).

ekg <- read.table("foetal_ecg.dat")

dim <- c("V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9")
ekg2 <- ekg[, dim]

pca_model <- ks_eigen_rotate_cov(ekg2)
ds <- pca_model$ds

#mute mother's heartbeat (V2, V3, V4) and one noise channel (V7)
names(ds) <- c("V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9")
ds$V2 <- mean(ds$V2)
ds$V3 <- mean(ds$V3)
ds$V4 <- mean(ds$V4)
ds$V7 <- mean(ds$V7)

#rotate basis back
ds <- %*% pca_model$An1)

#put time back
names(ds) <- c("V2","V3","V4","V5","V6","V7","V8","V9")
ds[,"V1"] <- ekg[,"V1"]

Fig. 13.3. 8-channel ECG of a pregnant woman with the child's heartbeat (V4 (dark blue)) and perspiration (V5 (dark green)) mixed in, after setting the data in the eigenbasis vectors associated with the mother's heartbeat to a flat mean level and rotating the basis back

Of course, we can always go with the standard maximal (or minimal) "explained" variance criterion to choose the eigenbasis vectors we are interested in (either to preserve or to eliminate), but that is rather an "accidental" criterion. A more fundamental one is that covariance can be seen as a rotating moment applied to a pair of axes of our candidate basis, whose magnitude is defined by the residuals (along these two axes) of the dataset relative to the dataset means, where the residuals along one axis are interpreted as "forces" and along the other as "lever lengths" (or vice versa).

If we rotate the axes around the mean in the direction of the generated moment (covariance), there will be an equilibrium position in which the moment (covariance) becomes zero, and "coincidentally" in that position one axis will be parallel to the regression line and the other to the regression residual projection line, which is orthogonal to it. When we have dimensions with the same measurement scale (as in this example) it will be obvious which one is which; otherwise multidimensional scaling may help with that decision.

When we make all pairwise covariances in the covariance matrix equal to zero and leave only the variances on the diagonal, that means we have found an eigenbasis for the covariance transformation matrix, where the variances are the eigenvalues of the found eigenbasis vectors. It also means that all possible regression lines/planes/hyperplanes will preserve the maximal possible structure of the original dataset (because the projection vectors will be orthogonal to the regression hyperplanes), which means that the regression transformations will be "conditionally" continuous (neighbours in the domain and range spaces remain neighbours after the forward or inverse transformations) with the smallest possible (over the various axis orientations) granularity of epsilon that still preserves continuity. That is where the maximal variance criterion comes from: regression/projection transformations with the smallest continuity loss give the largest spreads/variances in the range space.
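
A small base-R illustration of that criterion (synthetic data, not the blog's ks_eigen_rotate_cov): rotating the data into the eigenbasis of its covariance matrix leaves only the variances (the eigenvalues) on the diagonal, all pairwise "rotation moments" becoming zero:

set.seed(1)
x <- rnorm(500)
A <- cbind(x, 0.6*x + 0.4*rnorm(500))   # two correlated "channels"
E <- eigen(cov(A))$vectors              # eigenbasis of the covariance matrix
B <- A %*% E                            # data rotated into that eigenbasis
round(cov(B), 12)                       # diagonal: variances/eigenvalues; off-diagonal: ~0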

Though, again, if there is some structure in the original dataset, it's not necessarily the components of the structure that are most hom(e)omorphically transformed onto the eigen-subspaces (and therefore have the largest variances) that we are interested in, as in the example above (i.e. the mother's heartbeat). In a way, looking at the dimensions with the maximal "explained" variances is like looking for your keys under the streetlight (from that anecdote). Yes, it's dark outside the light circle, but it's dark also because the streetlight is blinding you. If you manage to polarize its light and use orthogonally polarized glasses, you may see your keys in the "dark", or rather, illuminated by the Moon's or stars' light. You may not see all the small details on those keys, as you could have seen them under the streetlight had they been there, but they are not – they are under the dim starlight 🙂

Appendix 13.1

As usual, working files are in:
