Previously, when we discussed the transformation matrix N-1 (the matrix of the basis change from the original basis to the basis of the quotient space element we project our dataset onto), we picked the simplest model just to keep the computational ball rolling, planning to come back to a more reasonable choice later. Before we start playing with word vector regressions, let's improve our transformation into a more useful form.

Our N-1 matrix was defined as:

(a 0 .. b)
(0 c .. d)
(.. .. .. ..)
(0 0 .. 0)

and we decided to go with coefficients a=b=c=d=..=1 for simplicity.

Let's see what transformation that matrix defines. And let's do it in 3D space for easier understanding; then, when we come up with better matrix coefficients, we will generalize them back to the higher-dimensional case.

Matrices that convert bases define how the unit vectors of those bases relate to each other (expressed via linear combinations). Therefore the following matrix:

N-1 **x** **y**

(1 0 1) (1) **i** (1*1 + 1*1) **u**
(0 1 1) (0) **j** = (1*1) **v**
(0 0 0) (1) **k**

basically says the following. Take **x** = 1**i** + 0**j** + 1**k**, the projection onto the **ik** plane of the vector with unit components (**k** is the dimension we are collapsing). The length of the transformed (rotated-to) vector **y** with respect to the base vector **u** is combined additively from the lengths of **x** with respect to the base vectors **i** and **k** (the same holds for **v**). So the norm of the resulting vector with respect to **u** is ||**y**||^2 = 2*2 = 4, or ||**y**|| = 2, while ||**x**||^2 = 1*1 + 1*1 = 2, or ||**x**|| = sqrt(2), giving ||**y**||/||**x**|| = sqrt(2). That leads to a "dimension inflation": the unit vector length of each synthetic dimension shrinks (or, equivalently, the coefficients of the synthetic bases grow) after each regression step (see Fig. 9.1). That is obviously inconvenient if we want to compare distributions between regression steps.
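A quick numeric sketch of that inflation (plain R; the variable names here are ours, not from the model code):

```r
# Naive basis-change matrix N^-1 with all coefficients set to 1
# (the 3D example above; k is the collapsed dimension)
Nm1 <- matrix(c(1, 0, 1,
                0, 1, 1,
                0, 0, 0), nrow = 3, byrow = TRUE)

x <- c(1, 0, 1)                 # unit components in the i-k plane
y <- Nm1 %*% x                  # c(2, 1, 0): coordinates w.r.t. u and v

ratio <- y[1] / sqrt(sum(x^2))  # u-component vs ||x||
```

The ratio comes out as sqrt(2), exactly the per-step inflation described above.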

We could have set the a, b, c, d, etc. coefficients of the N-1 matrix to something other than 1, for example keeping a=b and c=d but setting a!=c, also taking into account that the transformation R (the regression's projection) shrinks the unit coefficient of **k** to t31 (here we are looking at the transformation of the unit vector's projection onto the **ik** plane), and requiring ||**y**||=1:

N-1 **x** **y**

(a 0 a) (1) **i** (a*1 + a*t31) **u**
(0 c c) (0) **j** = (c*t31) **v**
(0 0 0) (t31) **k**

then: ||**y**||^2 = a^2*(1+t31)^2 = 1; a = sqrt(1/(1+t31)^2) = 1/(1+t31)

However, we would still assign the same weight a to the components of **i** and **k** when composing the length along **u** (even though we take only the projected part t31 of the original unit vector **k**). Without a particular reason for doing so, this distorts the contributions of the original dimensions **i** and **k** to the synthetic **u**. Therefore it makes sense to compose the length of the synthetic unit vector from the projections of the unit vector components along **i** and **k** onto our projected dimension **u**:

N-1 **x** **y**

(d*a' 0 d*b'*t31) (1) **i** (d*a'*1 + d*b'*t31) **u**
( . . . ) (0) **j** =
( . . . ) (1) **k**

where: ||**y**||^2 = d^2*(a' + b'*t31)^2 = 1;

a' = 1*cos(alpha); b'*t31 = sin(alpha)*t31;

sin(alpha) = t31/sqrt(1+t31^2); cos(alpha) = 1/sqrt(1+t31^2);

then: d = 1/sqrt(1+t31^2); d*a' = 1/(1+t31^2); d*b'*t31 = t31^2/(1+t31^2)
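As a sanity check, these two weights can be computed directly (plain R; t31 is an arbitrary slope value, and the names da and dbt are ours, matching the da1/dbt1 variables in the Appendix 9.1 code):

```r
t31 <- 0.75                   # slope of the collapsed dimension k after projection
da  <- 1 / (1 + t31^2)        # d*a'     : weight of the i component
dbt <- t31^2 / (1 + t31^2)    # d*b'*t31 : weight of the projected k component

# by construction the u-component of y is d*a'*1 + d*b'*t31*1 = 1
stopifnot(abs(da + dbt - 1) < 1e-12)
```

The identity holds for any t31, so the synthetic basis vector stays unit-length on every regression step.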

Now, with the new N-1 matrix, we get consistent, range-normalized distributions of the synthetic variables on each regression step, and the multi-modality of the synthetic eigenbasis distribution is even more pronounced:

with the synthetic eigenbasis vector v1 = -4.46389046671904*bp.1s – 0.000335763837256707*hip + 0.0197350621146874*waist + 0.0266855254863463*bp.1d + 0.00308836195333824*height + 0.866182912400952*age + 0.00535941186779068*stab.glu – 0.00021277667540017*hdl + 0.000649570489160293*chol + 0.00146620009377012*glyhb

where coefficients with absolute value less than 1e-04 were rounded to 0 (for weight and ratio).

That is our current model (though without rotating the synthetic basis to get rid of variable dependency), where: glyhb is Glycosylated Hemoglobin, chol is Total Cholesterol, hdl is High Density Lipoprotein, ratio is the Cholesterol/HDL ratio, stab.glu is Stabilized Glucose, bp.1s is Systolic Blood Pressure, bp.1d is Diastolic Blood Pressure, while the other parameters' names are self-explanatory (height, hip and waist circumferences are measured in inches, weight in pounds). The standardized value of a variable is computed as sv = (v-min)/range, where:

**vars min max range**

glyhb 2.68 16.11 13.43

chol 78.00 443.00 365.00

hdl 12.00 120.00 108.00

ratio 1.50 19.30 17.80

stab.glu 48.00 385.00 337.00

bp.1s 90.00 250.00 160.00

bp.1d 48.00 124.00 76.00

age 19.00 92.00 73.00

height 52.00 76.00 24.00

weight 99.00 325.00 226.00

waist 26.00 56.00 30.00

hip 30.00 64.00 34.00
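The standardization above can be sketched as follows (plain R; the sample values are illustrative, chosen to include the table's min and max for glyhb):

```r
# Range standardization: sv = (v - min) / range
standardize <- function(v) (v - min(v)) / (max(v) - min(v))

glyhb <- c(2.68, 7.0, 16.11)  # illustrative values, incl. the observed min/max
sv <- standardize(glyhb)      # 0 for the min, 1 for the max
```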

A v1 value greater than 0.35 seems to be a threshold between the first three modes of the distribution, which cluster data with low values in the cholesterol, glucose and blood pressure groups, and the last two or three modes, which cluster data with high such values. We can see this at Figs. 9.4-6, which overlay histograms for groups with low/high glyhb (>7, Fig. 9.4), bp.1s (>150, Fig. 9.5), and chol (>250, Fig. 9.6).

**Appendix 9.1 **

# Linear univariate regression that minimizes rss
# of variable 'yname' of 'df' dataset
ks_lm <- function (df, yname, xnames, full=1, error=0){
  all.names = c(unlist(xnames), unlist(yname))
  clean.df <- df[complete.cases(df[, all.names]), all.names]
  nr <- nrow(clean.df)
  nc <- ncol(clean.df)
  model <- list()
  model$X <- matrix(do.call(rbind, clean.df), nc)
  rownames(model$X) <- all.names
  x <- clean.df
  x[, yname] <- NULL
  x[, nc] <- rep_len(1, nr)
  y <- clean.df[, yname]
  X <- matrix(do.call(cbind, x), nrow(x))
  b1 <- solve(t(X) %*% X) %*% t(X) %*% y
  if(full | error){
    e <- X %*% b1 - y
    model$e2 <- t(e) %*% e
    model$e2n <- model$e2/nr
    model$en <- sqrt(model$e2)/nr
  }
  if(full){
    # Projection onto 0 member of quotient space X/Y
    model$R <- diag(rep_len(1, nc))
    model$R[nc, 1:nc] <- b1
    model$R[nc, nc] <- 0
    # X to Y basis change with projection-normalized coefficients:
    # d*a' = 1/(1+t^2) on the diagonal, d*b'*t = t^2/(1+t^2) in the last column
    da1 <- 1/(1 + as.vector(b1)**2)
    model$Nm1 <- diag(da1)
    dbt1 <- 1/(1/as.vector(b1)**2 + 1)
    model$Nm1[, nc] <- dbt1
    model$Nm1 <- model$Nm1[-nc, ]
    model$h0 <- rep_len(0, nc)
    model$h0[nc] <- b1[ncol(b1)]
    model$X0 <- model$R %*% model$X
    rownames(model$X0) <- all.names
    model$Xl <- model$X0[, 1:nr] + model$h0
    rownames(model$Xl) <- all.names
    model$Nm1R <- model$Nm1 %*% model$R
    model$DimNames <- matrix_symvect_mult(model$Nm1R, all.names)
    model$Y <- model$Nm1 %*% model$X0
    rownames(model$Y) <- model$DimNames
  }
  model
}

sym_var_mult <- function(item, m=1){
  item <- item_conv(item)
  mult_res <- as.double(item[1])*m
  if(abs(mult_res) < 1e-04)
    mult_res <- 0
  c(as.character(mult_res), item[2])
}


In our linear model, designed and developed (in R) in the previous chapters, we calculate the new images of the data, projected onto the chosen element of the quotient space that satisfies a particular requirement (we use minimal SSE), not only in the original space basis but also in a basis of the quotient subspace. However, we need one more function that calculates, in symbolic form, the new synthetic basis vectors composed as linear combinations of the original space basis vectors. Basically, we need to implement a symbolic multiplication of the matrix N-1R of our transformation into the quotient space element by a symbolic vector of the variable names (see those functions in Appendix 8.1).

Having the symbolic matrix-by-vector multiplication functions ready, we may start playing with our dataset, trying to reduce its higher dimensionality to a more intelligibly manageable and comprehensible lower-dimensional image. As we mentioned before, regression is a way to reduce the dimensionality of a set, of course at the price of losing complexity of the data Structure. We try to minimize that loss of continuity of the transformation function (the projection we use in regression) using some criterion (loss function), in our case SSE. So far we have a univariate multivariable regression function ready, therefore we can collapse dimensions consecutively, one by one. With this approach we need to decide how many dimensions we want to collapse and/or in what order.

There could be multiple approaches and criteria for that, and we may come back to this discussion later, but for now let's start with simple ones. For the obvious reason of saving as much Structural complexity (i.e. continuity of the transformation) as possible, we may want to collapse those dimensions whose projections give the minimal loss function of our choice (of course, we first normalize those dimensions to their ranges). On the other hand, it may not be so important if we go for the ultimate dimensionality reduction to a 1D space, basically making its basis vector an eigenvector.
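The "collapse the cheapest dimension first" ordering can be sketched independently of our own code, using the built-in lm for the per-dimension loss (a sketch with our own naming; ks_lm_dim_red in Appendix 8.3 implements the same ordering via ks_lm):

```r
# Rank dimensions by the RSS of regressing each one on all the others,
# after normalizing every column to its range
rank_by_rss <- function(df) {
  norm.df <- as.data.frame(lapply(df, function(v)
    (v - min(v)) / (max(v) - min(v))))
  rss <- sapply(names(norm.df), function(yn) {
    fit <- lm(reformulate(setdiff(names(norm.df), yn), yn), data = norm.df)
    sum(residuals(fit)^2)
  })
  names(sort(rss))   # cheapest-to-collapse first
}

set.seed(1)
d <- data.frame(a = rnorm(50))
d$b <- 2 * d$a + rnorm(50, sd = 0.01)  # b nearly duplicates a
d$c <- rnorm(50)                       # c is independent
rank_by_rss(d)                         # a and b come before c
```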

Let's take the Diabetes dataset we used for experimentation in the previous chapter, which has 12 dimensions (obviously not independent), and run consecutive univariate regressions until 1 dimension is left:

load(url("http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/diabetes.sav"))

# Dimensionality reduction
diabetes2 <- diabetes
dim <- c("glyhb", "chol", "hdl", "ratio", "stab.glu",
         "bp.1s", "bp.1d", "age", "height", "weight", "waist", "hip")
ds <- ks_lm_dim_red(diabetes2, dim, n_dim=1, reorder=1)

Now we can see what our synthetic eigenvector variable will be:

resp_name <- names(ds)[1]

resp_name

[1] "573.865492043754*chol+25.8440157742691*weight-507.345171188756*hdl+455.427566923276*hip+251.259082865673*waist-12.0580460009216*bp.1s+17.2494522189963*glyhb+142.022683378986*stab.glu+233.150806921495*height+59.1169607418513*bp.1d+29.0382173062296*age"

At first look, not many surprises: cholesterol is really bad, HDL ("good cholesterol") is good, hip circumference is bad as well, while waist circumference is about half as bad; surprisingly, height is noticeably worse than weight, and age and blood pressure matter less than the other factors, though their high values are still bad. But, anyway, that's only a formal statistical model.

Now let's see what the data distribution looks like in our new synthetic 1D eigenbasis v1:

names(ds)[1] <- "v1"

ggplot(ds, aes(x=v1))+
  geom_histogram(data=ds, aes(y=..density..), fill="blue", bins=40, alpha=0.5)+
  geom_density(data=ds, adjust=0.5)

We see it's noticeably multimodal (four or five modes). We don't know yet what that clustering means, and what it correlates with, so let's see how it relates to the usual criterion for diabetes, glycosylated hemoglobin > 7. Let's overlay the histograms of the two groups of observations divided by that criterion:

clean.diabetes <- diabetes2[complete.cases(diabetes2[, dim]), dim]
ds$glyhb <- clean.diabetes$glyhb
ds$glyhb_c <- 0
ds[ds$glyhb>=7, "glyhb_c"] <- 1
ds_nd <- ds[ds$glyhb<7,]
ds_pd <- ds[ds$glyhb>=7,]

ggplot(ds, aes(x=v1))+
  geom_histogram(data=ds, fill="blue", bins=40, alpha=0.6)+
  geom_histogram(data=ds_nd, fill="green", bins=40, alpha=0.6)+
  geom_histogram(data=ds_pd, fill="red", bins=40, alpha=0.6)

We may also look at these two subsets and the whole dataset on a scatterplot:

ggplot(ds, aes(x=v1, colour=glyhb))+

geom_point(data=ds, y=.5, alpha=0.5)+

geom_point(data=ds_nd, y=.75, alpha=0.5)+

geom_point(data=ds_pd, y=.25, alpha=0.5)+

scale_colour_gradientn(colours=rainbow(4))

As well as descriptive statistics of these datasets:

> describe(ds[[1]])
   vars   n   mean     sd median trimmed    mad    min     max   range skew kurtosis   se
X1    1 377 508.01 192.12 504.76  505.21 192.78  92.88 1215.27 1122.39  0.2    -0.14 9.89

> describe(ds_nd[[1]])
   vars   n   mean     sd median trimmed    mad    min     max   range skew kurtosis   se
X1    1 319 478.84 177.23  482.3  476.71 182.29  92.88 1041.15  948.27 0.12    -0.31 9.92

> describe(ds_pd[[1]])
   vars   n   mean     sd median trimmed    mad    min     max   range  skew kurtosis    se
X1    1  58 668.48 193.08 694.31   671.4 169.63 149.86 1215.27 1065.41 -0.13     0.27 25.35

The difference in the means between the two subsets is only around one sigma, which gives a not very impressive confidence level of statistical significance (around 60-70%), but that's not exactly the point. Glycosylated hemoglobin, and its level of 7 itself, is not the ultimate and definitive diabetes type II criterion, just one useful estimate. Diabetes type II is not just about "blood sugar"; it's a general metabolic condition, not necessarily a disorder per se. Some researchers hypothesise that it's an evolutionary adaptation (to cold climate), and when it developed, the "inconveniences" it causes in old age under the "normal", comfortable conditions of contemporary life were not an evolutionary priority.

Anyway, the clustering of observations with the positive glyhb criterion in the last two modes (v1 > 650) may be used as an indicator of such an adaptation, or of the likelihood of the diabetes type II condition. Let's see how this clustering looks in the original (cholesterol, blood pressure, glycosylated hemoglobin) space:

clean.diabetes$syn <- ds[[1]]
clean.diabetes$syn_c <- 2
clean.diabetes[clean.diabetes$syn<320, "syn_c"] <- 1
clean.diabetes[clean.diabetes$syn>=650, "syn_c"] <- 3

cloud(chol ~ glyhb * bp.1s, clean.diabetes, groups=syn_c)

What we can see on this scatterplot is that the last mode clusters around observations with the largest variability in at least one of the displayed parameters, which is a well-known risk factor for diabetes. Therefore it may indeed make sense to further investigate the criterion described above in association with diabetes type II.

It's easy to notice that many of the original 12 variables are dependent, and could be grouped into 5 more or less independent buckets (glucose related, cholesterol related, blood pressure, physical characteristics, and age). Collapsing one of the dimensions in a group may effectively collapse the others, therefore far fewer projections than the number of dependent dimensions may be enough to convert the synthetic dimensions into the eigenbasis. For our particular case, 5 projections (and the respective basis conversions) are enough to put the data on a relatively straight line, which is the eigenbasis vector we are looking for.

From the computational point of view, we are looking for a state where the mean normalized sigma of the coordinates over all the remaining dimensions is less than a particular value, say, 5%:

ds <- ks_lm_dim_red(diabetes2, dim, sd_dim=0.05, reorder=1)

or setting that we want 7 dimensions left:

ds <- ks_lm_dim_red(diabetes2, dim, n_dim=7, reorder=1)

And plotting the transformed data in the first 3 of the synthetic dimensions (in the others it will be the same), we’ll get a pretty straight line:

names(ds)[1:3] <- c("v1", "v2", "v3")
clean.diabetes <- diabetes2[complete.cases(diabetes2[, dim]), dim]
ds$glyhb <- clean.diabetes$glyhb
ds$glyhb_c <- 0
ds[ds$glyhb>=7, "glyhb_c"] <- 1

cloud(v1 ~ v2 * v3, ds, groups=glyhb_c, pretty=TRUE)

To roughly convert our dataset into the 1D eigenbasis while preserving the basis vector length (which doesn't really matter for the eigenbasis), it is enough to multiply one of the coordinates by sqrt(n), where n is the number of remaining dimensions (assuming vector length preservation in the two bases: **A** = b***u** = a***i** + a***j** + a***k** + … over the n remaining dimensions, so **A**T***A** = b^2 = n*a^2).
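That sqrt(n) adjustment is easy to verify numerically (plain R; the values of a and n are arbitrary):

```r
n <- 7                        # remaining (identical) synthetic dimensions
a <- 0.42                     # the common coordinate value
A <- rep(a, n)                # A = a*i + a*j + ... (n equal components)
b <- a * sqrt(n)              # the single 1D eigenbasis coordinate

stopifnot(abs(sum(A^2) - b^2) < 1e-12)  # A^T A = b^2 = n*a^2
```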

If we stop just one step before:

ds <- ks_lm_dim_red(diabetes2, dim, n_dim=8, reorder=1)

We’ll still have some variability of data points:

Whether we run all projections until the last dimension, or stop by the mean sigma criterion, we'll have practically the same distribution along the eigenbasis vector. In our case, if we stop earlier, the numerical values of the synthetic variable will be different (we were lazy in the previous chapters, defining the N-1 transform as a diagonal matrix with a unit last column, so after each projection the length of the basis vectors shrinks), but for the eigenbasis that's not important. The eigenbasis coordinate is just proportionally adjusted:

v1 = 23.70*chol+1.09*weight-18.89*hdl+18.57*hip+10.66*waist-0.51*bp.1s+0.73*glyhb+5.88*stab.glu+9.42*height+2.211*bp.1d+0.917*age

We'll play with the more intelligent Principal Component Analysis of the synthetic basis on projection later; for now, let's take a look at other applications of linear regression for the eigenbasis transform, in particular in such a seemingly unlikely area as Natural Language Understanding…

**Appendix 8.1 – Symbolic matrix by vector multiplication**

# Examples:
str1 <- "2.*a-3.1*b-0.4*c-d"
str2 <- ".01*c+35*b+3.24*e-5.*d"

# Symbolic multiplication of string 'str1' by number 2.5
sym_lc_mult(str1, 2.5)

# Symbolic sum of 2 strings 'str1' and 'str2'
sym_lc_add(str1, str2)

str_v <- c("a+.5*b", "b-2*c", "3*c+d")
str_u <- c("", "", "")
m <- matrix(c(1,2,3,4,5,6,7,8,9), 3, byrow=TRUE)

# Symbolic multiplication of numeric matrix 'm' by string vector 'str_v',
# result is a string vector 'str_u'
str_u <- matrix_symvect_mult(m, str_v)
str_u

# Symbolic multiplication of numeric matrix 'm' by string vector 'str_v',
# result is a string vector
matrix_symvect_mult <- function(m, str_v){
  # (body reconstructed: each output component is the symbolic sum
  # of the input components scaled by the corresponding matrix row)
  str_u <- rep("0", nrow(m))
  for(i in 1:nrow(m)){
    acc <- "0"
    for(j in 1:ncol(m))
      acc <- sym_lc_add(acc, sym_lc_mult(str_v[j], m[i, j]))
    str_u[i] <- acc
  }
  str_u
}

# Symbolic sum of 2 strings
sym_lc_add <- function(str1, str2){
  # (reconstructed) parse each linear combination into (coefficient, name) tuples
  l_tup1 <- strsplit(strsplit(gsub("-", "+-", str1), "+", fixed=TRUE)[[1]], "*", fixed=TRUE)
  l_tup2 <- strsplit(strsplit(gsub("-", "+-", str2), "+", fixed=TRUE)[[1]], "*", fixed=TRUE)
  str_out <- NULL
  for(item1 in l_tup1){
    item1 <- item_conv(item1)
    found = FALSE
    for(item2 in l_tup2){
      item2 <- item_conv(item2)
      # item in both strings
      if(identical(item1[2], item2[2])){
        found = TRUE
        mult = as.double(item1[1]) + as.double(item2[1])
        if(mult != 0){
          if(!is.null(str_out))
            str_out = paste(str_out, "+", sep="")
          str_out = paste(str_out,
                          as.character(mult),
                          "*", item1[2], sep="")
        }
      }
    }
    # item in first string but not in second
    if(!found){
      if(as.double(item1[1]) != 0){
        if(!is.null(str_out))
          str_out = paste(str_out, "+", sep="")
        str_out = paste(str_out, item1[1],
                        "*", item1[2], sep="")
      }
    }
  }
  # item in second string but not in first
  for(item2 in l_tup2){
    item2 <- item_conv(item2)
    found = FALSE
    for(item1 in l_tup1){
      item1 <- item_conv(item1)
      # item in both strings - not processing
      if(identical(item1[2], item2[2])){
        found = TRUE
      }
    }
    if(!found){
      if(as.double(item2[1]) != 0){
        if(!is.null(str_out))
          str_out = paste(str_out, "+", sep="")
        str_out = paste(str_out, item2[1],
                        "*", item2[2], sep="")
      }
    }
  }
  if(is.null(str_out))
    str_out = "0"
  str_out2 <- gsub("[+]-", "-", str_out)
  str_out2
}

# Symbolic multiplication of string 'str' by number m
sym_lc_mult <- function(str, m=1){
  # (reconstructed) parse the linear combination into (coefficient, name) tuples
  l_tup <- strsplit(strsplit(gsub("-", "+-", str), "+", fixed=TRUE)[[1]], "*", fixed=TRUE)
  res <- lapply(l_tup, sym_var_mult, m)
  str_out = NULL
  for(item in res){
    if(!identical(as.double(item[1]), 0)){
      if(!is.null(str_out))
        str_out = paste(str_out, "+", sep="")
      str_out = paste(str_out, item[1], "*", item[2], sep="")
    }
  }
  if(is.null(str_out))
    str_out = "0"
  str_out2 <- gsub("[+]-", "-", str_out)
  str_out2
}

item_conv <- function(item){
  if(length(item)==1){
    if(is.na(as.double(item[1]))){
      if(identical(substr(item[1], 1, 1), "-"))
        item = c("-1", sub("-", "", item[1]))
      else
        item = c("1", item[1])
    }
    else
      item = c("0", "0")
  }
  else if(length(item)==0){
    item = c("0", "0")
  }
  item
}

sym_var_mult <- function(item, m=1){
  item <- item_conv(item)
  c(as.character(as.double(item[1])*m), item[2])
}

**Appendix 8.2 – Linear regression model**

# Linear univariate regression that minimizes rss
# of variable 'yname' of 'df' dataset
ks_lm <- function (df, yname, xnames, full=1, error=0){
  all.names = c(unlist(xnames), unlist(yname))
  clean.df <- df[complete.cases(df[, all.names]), all.names]
  nr <- nrow(clean.df)
  nc <- ncol(clean.df)
  model <- list()
  model$X <- matrix(do.call(rbind, clean.df), nc)
  rownames(model$X) <- all.names
  x <- clean.df
  x[, yname] <- NULL
  x[, nc] <- rep_len(1, nr)
  y <- clean.df[, yname]
  X <- matrix(do.call(cbind, x), nrow(x))
  b1 <- solve(t(X) %*% X) %*% t(X) %*% y
  if(full | error){
    e <- X %*% b1 - y
    model$e2 <- t(e) %*% e
    model$e2n <- model$e2/nr
    model$en <- sqrt(model$e2)/nr
  }
  if(full){
    # Projection onto 0 member of quotient space X/Y
    model$R <- diag(rep_len(1, nc))
    model$R[nc, 1:nc] <- b1
    model$R[nc, nc] <- 0
    # X to Y Basis rotation
    model$Nm1 <- diag(rep_len(1, nc))
    model$Nm1[, nc] <- rep_len(1, nc)
    model$Nm1 <- model$Nm1[-nc, ]
    model$h0 <- rep_len(0, nc)
    model$h0[nc] <- b1[ncol(b1)]
    model$X0 <- model$R %*% model$X
    rownames(model$X0) <- all.names
    model$Xl <- model$X0[, 1:nr] + model$h0
    rownames(model$Xl) <- all.names
    model$Nm1R <- model$Nm1 %*% model$R
    model$DimNames <- matrix_symvect_mult(model$Nm1R, all.names)
    model$Y <- model$Nm1 %*% model$X0
    rownames(model$Y) <- model$DimNames
  }
  model
}

**Appendix 8.3 – ****Dimensionality**** reduction via linear regression**

# Reduce dimensionality of dataframe 'df' on variables listed in 'dim'
# until 'n_dim' dimensions left
ks_lm_dim_red <- function(df, dim, n_dim=1, reorder=1, sd_dim=0, hb=1){
  # (reconstructed) keep only complete cases of the chosen variables
  clean.df <- df[complete.cases(df[, dim]), dim]
  # normalize variables by their range
  n_ds <- lapply(dim, norm_ds, clean.df)
  names(n_ds) <- dim
  norm.df <- as.data.frame(n_ds)
  # order variables by rss of their univariate regressions
  v_dim <- sapply(dim, norm_rss_ds, norm.df)
  i_dim <- order(v_dim[2,])
  order.df <- norm.df[, i_dim]
  # run univariate regressions for variables in rss order
  # until n_dim dimensions left
  ds <- order.df
  n_max <- length(dim) - n_dim
  i <- 0
  while(i < n_max){
    if(hb)
      print(i)
    resp_name <- colnames(ds)[1]
    pred_names <- colnames(ds)[-1]
    mod <- ks_lm(ds, resp_name, pred_names)
    ds <- as.data.frame(t(mod$Y))
    i <- i+1
    if(i >= n_max)
      break
    # check if we already effectively have an eigenvector (all coordinates are the same)
    if(sd_dim > 0){
      m_sd <- mean(sapply(as.data.frame(t(ds)), norm_sd))
      print(m_sd)
      if(m_sd < sd_dim){
        # convert length to 1D eigenbasis
        col_name <- names(ds)[1]
        ds <- as.data.frame((ds[,1] * sqrt(length(dim)-i)))
        names(ds)[1] <- sym_lc_mult(col_name, sqrt(length(dim)-i))
        break
      }
    }
    # reorder synthetic variables by rss on each iteration
    if(reorder){
      tmp_dim <- colnames(ds)
      v_dim <- sapply(tmp_dim, norm_rss_ds, ds)
      i_dim <- order(v_dim[2,])
      ds <- ds[, i_dim]
    }
  }
  ds
}

# Create list of rss for univariate regression on each normalized variable
norm_rss_ds <- function(item, norm.diabetes){
  i_range <- range(norm.diabetes[, item])
  norm.diabetes[, item] <- norm.diabetes[, item]/(i_range[2] - i_range[1])
  predict_dim <- colnames(norm.diabetes)
  predict_dim <- predict_dim[which(predict_dim != item)]
  mod <- ks_lm(norm.diabetes, item, predict_dim, full=0, error=1)
  dim_rss <- c(item, mod$en)
  dim_rss
}

# Normalize a variable by its range
norm_ds <- function(item, norm.diabetes){
  i_range <- range(norm.diabetes[, item])
  norm_col <- (norm.diabetes[, item] - i_range[1])/(i_range[2] - i_range[1])
  norm_col
}

# Normalize st.dev. by mean
norm_sd <- function(item){
  sd(item)/mean(item)
}


However, it is interesting not only to find out what our quotient (or factor) space X/Y is (defined by the bi (i != 0) regression coefficients), what particular equivalence relation R relates our dataset A to its image B on Yl, and what the projection functions (**T** and **R**) onto the Yl and Y0 spaces are, but also to look at what our projected dataset B becomes in the new space Yl of reduced dimensionality (or, equivalently, in Y0, because we found out before that in a linear space **d**=**f**). Actually, that is the main goal of the whole exercise we have just done: we projected our data from a space with a less "convenient" structure into one with a more "convenient" (reduced-dimensionality) structure. By the way, this is part of Factor Analysis (no surprise, since we are dealing with factor spaces).

What we want to find (see the previous chapter) is the transform **M**. So far (via **ks_lm0**) we have the above-mentioned transform in its simplest form:

(1 0 … 1)

(0 1 … 1)

(. . … .)

(0 0 … 0)

So, let us stop for a moment at a more visually comprehensible 3D representation of our data, where we look only at the glycosylated hemoglobin (glyhb), cholesterol/hdl (ratio) and systolic blood pressure (bp.1s) dimensions, and see how our data look in 3D and 2D representations after we have collapsed the glyhb dimension and created 2 synthetic variables as linear combinations of glyhb & ratio and glyhb & bp.1s. To do that, we enhance our new **ks_lm** function so that it calculates not only the slopes and intercept, as **ks_lm0** did, but also the transformation matrices and the projected data (see the Appendix).

First, we use the *lattice* package for a 3D scatterplot of the original (blue) data X in coordinates V3=glyhb, V1=ratio, V2=bp.1s, and of the (magenta) data Xl projected onto the quotient space member Yl (the regression plane) and then converted back to the X space basis:

load(url("http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/diabetes.sav"))

ks_model <- ks_lm( diabetes, c("glyhb"), c("ratio","bp.1s") )

xdf <- as.data.frame(t(ks_model$X))
xdf[, "V4"] <- rep_len(1, nrow(xdf))

xldf <- as.data.frame(t(ks_model$Xl))
xldf[, "V4"] <- rep_len(2, nrow(xldf))

xxldf <- rbind(xdf, xldf)
cloud(V3 ~ V1*V2, xxldf, groups=V4)

And then, we use *ggplot2* package to draw a 2D scatter plot of our new synthetic variables in the Yl space:

ydf <- as.data.frame(t(ks_model$Y))
ydf[, "V3"] <- rep_len(2, nrow(ydf))

xdf <- as.data.frame(t(ks_model$X))
xdf[, "V3"] <- rep_len(1, nrow(xdf))

xydf <- rbind(ydf, xdf)

ggplot(xydf, aes(x=V1, y=V2, colour=V3))+
  geom_point(alpha=0.5)

Now, having seen visually how our linear regression function works on 3D data, how it may be used for dimensionality/factor reduction, and where in it there is a place for principal component manipulation, let us proceed to playing with it on higher-dimensional data…

**Appendix**

ks_lm <- function (df, yname, xnames){
  all.names = c( unlist(xnames), unlist(yname) )
  clean.df <- df[complete.cases( df[, all.names] ), all.names]
  nr <- nrow(clean.df)
  nc <- ncol(clean.df)
  model <- list()
  model$X <- matrix(do.call(rbind, clean.df), nc)
  x <- clean.df
  x[, yname] <- NULL
  x[, nc] <- rep_len(1, nr)
  y <- clean.df[, yname]
  X <- matrix( do.call(cbind, x), nrow(x) )
  b1 <- solve(t(X) %*% X) %*% t(X) %*% y
  # Projection onto 0 member of quotient space X/Y
  model$R <- diag(rep_len(1, nc))
  model$R[nc, 1:nc] <- b1
  model$R[nc, nc] <- 0
  # X to Y Basis rotation
  model$Nm1 <- diag(rep_len(1, nc))
  model$Nm1[, nc] <- rep_len(1, nc)
  model$Nm1 <- model$Nm1[-nc, ]
  model$h0 <- rep_len(0, nc)
  model$h0[nc] <- b1[ncol(b1)]
  model$X0 <- model$R %*% model$X
  model$Xl <- model$X0[, 1:nr] + model$h0
  model$Y <- model$Nm1 %*% model$X0
  model
}


Let's run it against a dataset we used in one of the Statistical Methods classes (statistical hypothesis testing for Diabetes Type II risk factors), also, for comparison, running the data through the standard *lm* function to make sure the results match:

> load(url("http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/diabetes.sav"))
> ks_lm0( diabetes, c("chol"), c("glyhb","ratio","bp.1s","bp.1d") )
           [,1]
[1,]  1.5868593
[2,] 11.2724797
[3,]  0.1593560
[4,]  0.3243256
[5,] 98.7703565
> cols <- c("chol","glyhb","ratio","bp.1s","bp.1d")
> clean2.diabetes <- diabetes[complete.cases(diabetes[,cols]),cols]
> View(clean2.diabetes)
> lmodel <- lm("chol ~ glyhb + ratio + bp.1s + bp.1d", clean2.diabetes)
> lmodel

Call:
lm(formula = "chol ~ glyhb + ratio + bp.1s + bp.1d", data = clean2.diabetes)

Coefficients:
(Intercept)        glyhb        ratio        bp.1s        bp.1d
    98.7704       1.5869      11.2725       0.1594       0.3243

Which, of course (no rocket science there, just simple matrix arithmetic), they do.

However, let’s not stop here, and continue visualizing and playing with our data further…

**Appendix 1**

ks_lm0 <- function (df, yname, xnames){
  all.names = c( unlist(yname), unlist(xnames) )
  clean.df <- df[complete.cases( df[, all.names] ), all.names]
  x <- clean.df
  x[, yname] <- NULL
  x[, ncol(clean.df)] <- rep_len(1, nrow(clean.df))
  y <- clean.df[, yname]
  # X <- matrix(unlist(x), nrow(x))
  X <- matrix(do.call(cbind, x), nrow(x))
  solve(t(X) %*% X) %*% t(X) %*% y
}


Let's formalize our Kitchen Style analogy from the previous post into a notation more suitable for coding, still a bit too verbose, as is typical for Kitchen talks. We'll start with the same 3D case and generalize to higher dimensions later. Let's denote scalars as lower-case italic letters (such as *a*), vectors as lower-case bold letters (such as **x**), spaces and sets as upper-case non-bold letters (such as A), non-vector members of spaces or sets as lower-case letters (such as f), transformations (in this case linear ones, i.e. matrices) as upper-case bold letters (such as **M**), and, just in case, for future use, tensors as bold italic capital letters (such as *F*).

Let's take a 3D vector space X on which we define a basis **i**, **j**, **k**, and represent our 3D data as a set A in the space X. An arbitrary member **a** of the set A ⊂ X can be represented as a linear combination *x1***i** + *x2***j** + *x3***k**. Let's choose a dimension (represented by, say, the base vector **k**) through the projection of which we are going to regress our 3D space X into a 2D space Y, or rather into one, not yet determined, member of the quotient space X/Y. Torturing the notation a bit, let's call it (X/Y)*l* (or Y*l*), where *l* ∈ I (-inf…0…+inf). Let's define a basis in Y*l* as **u** and **v**.

Let's denote by **T** the transformation of an arbitrary **a** ∈ A to **b** ∈ B, where B is the equivalence class of all projections of A onto the element Y*l* of the quotient space (the plane we are seeking in this regression): **T**(**a**) = **b** = *x1***i** + *x2***j** + *h***k** (with respect to the basis **ijk**, where *h* = *x3* - *dx3*).

(figure 1)

The same transformation may be achieved through the following chain of transformations of **a** to **f**: **P**ij(**a**) = **c** (projection along **k** onto the plane **ij**); **M**(**c**) = **d** (projection from the **ij** plane onto Y*0*, with respect to **uv**, with the origin of the **uv** basis at the **ijk** origin); **R**(**c**) = **e** (the same projection as **M**, but with respect to the **ijk** basis); **N**(**d**) = **e** (transformation back to the **ijk** basis); **Q**(**e**) = **b** (which can also be done via a linear shift *h* from Y*0* to Y*l* along **k**, i.e. the addition of **h**: the quotient space thing). Let's define **f** (ultimately the goal of all our transformations) as the representation of the same element with respect to the basis **uv**: **f** = *y1***u** + *y2***v**, where **f** ∈ Y*l*, **b** ∈ X. **d** = **f** because **b** = **h** + **N**(**f**) = **N**(**d**) + **h**, hence **f** = **d** (obviously, in a vector space, though not in the general case).

(schema 2)

We may be too talkative, defining so many transformations, but we may need them later to better understand the behaviour of our data in the reduced-dimensionality spaces themselves, and not only in the original space after the reverse transformation, as we usually do when working with regressions.

Looking at schema 2 we can see that **T**(**a**) = **QNMP**ij(**a**) = **NMP**ij(**a**) + **h** = **e** + **h** = **b** = **a** − dx3·**k**, or:

(t11 t12 t13) (x1) **i** (0) **i** (x1) **i**

(t21 t22 t23) (x2) **j** + (0) **j** = (x2) **j**

(t31 t32 t33) (x3) **k** (h) **k** (x3 − dx3) **k**

And if **P**ij(**a**)=**c**, or:

(1 0 0) (x1) **i ** (x1) **i**

(0 1 0) (x2) **j** = (x2) **j**

(0 0 0) (x3) **k** (0) **k**

**M**(**c**)=**d**, or:

(a11 a12 0) (x1) **i** (y1) **u**

(a21 a22 0) (x2) **j** = (y2) **v**

(0 0 0) (0) **k**

As we can see **MP**ij = **M**, as well as **RP**ij=**R**.
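
These identities can be checked numerically in a few lines (a minimal sketch: NumPy is assumed, and the coefficient values for **M** and **R** below are arbitrary placeholders):

```python
import numpy as np

# P_ij zeroes out the k component (projection onto the ij plane)
P_ij = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 0.0]])

# M and R have a zero third column (they ignore the k component);
# the nonzero coefficients here are made up for illustration
M = np.array([[0.8, 0.3, 0.0],
              [-0.2, 0.9, 0.0],
              [0.0, 0.0, 0.0]])
R = np.array([[0.5, 0.1, 0.0],
              [0.2, 0.7, 0.0],
              [0.3, 0.4, 0.0]])

# MP_ij = M and RP_ij = R, because P_ij only kills the component
# that M and R already ignore
assert np.allclose(M @ P_ij, M)
assert np.allclose(R @ P_ij, R)
```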

Of course, we could have done the direct projection **P**uv = **MP**ij, without the intermediate step of **P**ij, though that matrix could be less intuitive to derive. If it is not, just forget the intermediate steps (whatever your *x3* is, it won’t affect your *y1* and *y2* coordinates):

**P**uv(**a**)=**d**:

(a11 a12 0) (x1) **i** (y1) **u**

(a21 a22 0) (x2) **j** = (y2) **v**

(0 0 0) (x3) **k**

**N**(**d**)=**e**, or:

(c11 c12 0) (y1) **u** (x1) **i**

(c21 c22 0) (y2) **v** = (x2) **j**

(c31 c32 0) (e3) **k**

**NM**(**a**)=**R**(**a**)=**e**:

(c11 c12 0) (a11 a12 0) (c11*a11+c12*a21 c11*a12+c12*a22 0) (t11 t12 t13)

(c21 c22 0) (a21 a22 0) = (c21*a11+c22*a21 c21*a12+c22*a22 0) = (t21 t22 t23)

(c31 c32 0) (0 0 0) (c31*a11+c32*a21 c31*a12+c32*a22 0) (t31 t32 t33)

therefore t13=0, t23=0, t33=0

then **NP**uv(**a**)+**h**=**e**+**h**=**b**:

(t11 t12 0) (x1) **i** (0) **i** (x1) **i**

(t21 t22 0) (x2) **j** + (0) **j** = (x2) **j**

(t31 t32 0) (x3) **k** (h) **k** (x3 – dx3) **k**

therefore:

(1 0 0) (x1) **i** (0) **i** (x1) **i**

(0 1 0) (x2) **j** + (0) **j** = (x2) **j**

(t31 t32 0) (x3) **k** (h) **k** (x3 – dx3) **k**

Now, how do we choose which indexed element *l* of the quotient space X/Y (or which value of *h*, the intercept) is the best one for our purposes? On the most general level the projected Subspace and its Superspace are not hom(e)omorphic, i.e. do not possess the same *Structure*, because a projection is a continuous function, but its inverse isn’t: neighbouring elements (belonging to the same Open Ball with radius delta) in the Projected Subspace are not necessarily still neighbours (belonging to the Open Ball with radius epsilon) in the Superspace, no matter how small we make the radius delta. And that happens when dx3 > epsilon. Therefore, to reduce the loss of structure, we may want to make all the dx3 (or *e*, as we rename them later) as small as possible. Actually, there are many reasonable criteria we could use (for example, the difference in the slopes/gradients between our dataset “surface” and the Projected “plane”, which is another way to lose the Homeomorphic Relation between the Spaces), but the usual, default one is minimisation of the sum of the squares of the deltas dx3 (the residuals), i.e. *RSS* (the residual sum of squares), which looks reasonable, takes a nice compact matrix form, and gives clean analytical equations for the first and second derivatives needed to find the minimum.

Let’s forget the bases **ijk** and **uv** for a moment (we won’t depict them in the notation), and denote *i* ∈ I (1…n), where *n* is the size of our data set A. Then the *i*-th element’s delta (d**x** = **b** − **a**) can be written as follows (supposing the **T** transform is Globally Linear, and therefore is a multiplication of every *i*-th element by the same **T** matrix):

(x1i − x1i) = (0)

(x2i − x2i) = (0)

(t31*x1i + t32*x2i + h − x3i) = (dx3i)

Keeping only the non-trivial dimension **k**, the one we do the regression on, we can write the system of equations for all 1..i..n data elements in matrix form:

(*x11 x21 1*) (t*31*) (*x31*) (*dx31*)

(… ) (*t32*) (… ) (… )

(*x1i x2i 1*) ( *h *) – (*x3i*) = (*dx3i*)

(… ) (… ) (… )

(*x1n x2n 1*) (*x3n*) (*dx3n*)

or, compactly: **Xt** − **x3** = **dx**, or, as it is usually denoted in the literature, **Xb** − **y** = **e**. Let’s partially borrow that notation, for easier mental mapping of this rubric onto the books, leaving, though, **t** in place, because we use **b** for other purposes. We’ll also drop the index 3 in **t** and rename *h* to *t0*.
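
A minimal sketch of that compact form (NumPy is assumed; the data values and the candidate **t** below are made up for illustration):

```python
import numpy as np

# made-up 3D data: columns x1, x2 and the regressed dimension x3 (our y)
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([0.5, 1.5, 2.0, 3.5])
y  = np.array([2.1, 4.0, 6.2, 7.9])   # the x3 values

# design matrix: rows (x1i, x2i, 1); the trailing ones pick up the intercept t0
X = np.column_stack([x1, x2, np.ones_like(x1)])

t = np.array([1.0, 1.0, 0.5])          # some candidate (t1, t2, t0)
e = X @ t - y                          # residual vector dx
rss = e @ e                            # residual sum of squares, e^T e
```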

Having expressed *RSS* = e1*e1 + … + ei*ei + … + en*en = **e**T**e** (where **e**T is the transposed vector **e**), we can do the same with the left part of the equation as well: (**Xt** − **y**)T(**Xt** − **y**) = **e**T**e**; and we want to minimize **e**T**e** (or (**Xt** − **y**)T(**Xt** − **y**)). But take note: from one problem, mapping our data set from a space with the same dimensions as the original data elements (**T**(**a**), **ijk**) into a subspace with reduced dimensions (**b**, **uv**), we have moved to another one. Using our data as a transformation matrix, we map the coefficients of our transformation matrix (or, we can say, the transformations **T** themselves (yes, a transformation may also be a space element)) of the original problem (**X**(T), **ijk**), with the original number of dimensions (say, the T-space), into a space of deltas, or errors (say, the E-space), with the number of dimensions equal to the size of our original data set (**e**, 1..n). And we want to find such an element of the T-space (for the one-return linear regression it will be a vector **t**, while for the multi-return regression we look at later it will be a matrix **T**) which our data transform into the smallest element of the E-space.

As we usually do in Calculus, to find a minimum point (actually, a critical point, which includes maximum and saddle points) of a graph (a curve, a surface, or a generic multi-dimensional data set), we take the Gradient of *RSS* and set each of its components, the partial derivatives @*RSS*/@*tj*, to 0:

@Sum[i=1..n]((t1*x1i + … + tj*xji + … + tm*xmi + t0 − yi)^2)/@t1 = @RSS/@t1 = 0

…

@Sum[i=1..n]((Sum[j=1..m](tj*xji) + t0 − yi)^2)/@tj = @RSS/@tj = 0

…

@Sum[i=1..n]((Sum[j=1..m](tj*xji) + t0 − yi)^2)/@tm = @RSS/@tm = 0

After differentiating:

2*Sum[i=1..n](x1i*(t1*x1i + … + tj*xji + … + tm*xmi + t0 − yi)) = 0

…

2*Sum[i=1..n](xji*(t1*x1i + … + tj*xji + … + tm*xmi + t0 − yi)) = 0

…

2*Sum[i=1..n](xmi*(t1*x1i + … + tj*xji + … + tm*xmi + t0 − yi)) = 0

and converting the equations above into matrix form we’ll get (actually, the multiplication signs are not strictly necessary; they are just a visual convenience for navigating between the transpose and inversion notations and the parentheses):

2**X**T(**Xt** − **y**) = @RSS/@**t** = **0**
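
We can sanity-check this matrix form of the gradient against a numerical (central finite difference) differentiation of RSS; the data and the point **t** below are arbitrary illustrative values:

```python
import numpy as np

X = np.array([[1.0, 0.5, 1.0],
              [2.0, 1.5, 1.0],
              [3.0, 2.0, 1.0],
              [4.0, 3.5, 1.0]])
y = np.array([2.1, 4.0, 6.2, 7.9])
t = np.array([0.3, -0.1, 0.7])

def rss(t):
    e = X @ t - y
    return e @ e

# analytical gradient: 2 X^T (Xt - y)
grad_analytic = 2 * X.T @ (X @ t - y)

# numerical gradient: central finite differences, one coordinate at a time
eps = 1e-6
grad_numeric = np.zeros_like(t)
for j in range(len(t)):
    dt = np.zeros_like(t)
    dt[j] = eps
    grad_numeric[j] = (rss(t + dt) - rss(t - dt)) / (2 * eps)

assert np.allclose(grad_analytic, grad_numeric, atol=1e-4)
```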

In this transformation we map **t**, with its basis of size 1..j..m (in our original 3D case **ijk**), to the same-size basis of the partial-derivative space, looking for the **t** that is mapped to Null, i.e. for the Kernel of that mapping. Which will be:

**X**T***Xt** = **X**T**y**

and after multiplication of both sides by the inverse (**X**T***X**)-1:

(**X**T***X**)-1***X**T***Xt** =(**X**T***X**)-1***X**T**y **

where (**X**T**X**)-1*(**X**T**X**) = **I**, the identity matrix, i.e. a diagonal matrix with coefficients 1; hence **It** = **t**, and then:

**t** = (**X**T***X**)-1***X**T**y**
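
In code (a sketch with made-up data; np.linalg.solve is used instead of forming the explicit inverse, which is the numerically safer equivalent), the normal-equation solution matches NumPy’s least-squares routine:

```python
import numpy as np

X = np.array([[1.0, 0.5, 1.0],
              [2.0, 1.5, 1.0],
              [3.0, 2.0, 1.0],
              [4.0, 3.5, 1.0]])
y = np.array([2.1, 4.0, 6.2, 7.9])

# t = (X^T X)^{-1} X^T y, computed via a linear solve
t_normal = np.linalg.solve(X.T @ X, X.T @ y)

# reference: NumPy's built-in least-squares solver
t_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(t_normal, t_lstsq)
# at the solution the gradient 2 X^T (Xt - y) vanishes
assert np.allclose(2 * X.T @ (X @ t_normal - y), 0, atol=1e-8)
```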

However, we have yet to find out whether our critical point is in fact a minimum, and not a maximum or a saddle point.

As usual, and as is obvious for a 1D mapping (a function of one variable) *y* = f(*x*), we check that the second derivative is positive. If the first derivative, usually visualised as the slope of the tangent line at the point where we take it, may be more intuitively envisioned as the “speed” of moving along the graph *y* in the Y space while we are moving along *x* in the X space, then the second derivative is the “acceleration” of that movement. For example, a plane diving (say, for a zero-gravity simulation) and then climbing back up changes its vertical speed from a negative value to a positive one, transitioning through 0 at the critical point of the minimum, while its acceleration remains positive (which is what makes that critical point indeed a minimum: whatever large negative speed the plane has, sooner or later (if it doesn’t hit the ground) it will be attenuated by the positive acceleration to 0 speed, and then the plane will start climbing up). If we get rid of the time variable, we can use the derivatives at the critical point of the function graph along the domain *x*: d*y*/d*x* (the “speed” of changing *y* with changing *x*, which is 0 there) and d^2*y*/d*x*^2 (the “acceleration” of changing *y* with changing *x*, which is positive; let it be a constant *a0*) to estimate d*y* for a given d*x* using the second member of the Taylor series: d*y* = 1/2*a0*d*x*^2 (higher members will be 0 because we set *a0* constant in our example). This is positive, which means that moving from the critical point *x0* in any direction by d*x* we’ll get a positive d*y*, i.e. the critical point is indeed a minimum.
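
The diving-plane estimate in a few lines of code (a toy sketch: f(x) = 1/2*a0*x^2, with an arbitrary constant a0, so the critical point is x0 = 0 and the second Taylor term predicts dy exactly):

```python
a0 = 2.0                      # constant positive "acceleration": f''(x) = a0

def f(x):
    return 0.5 * a0 * x * x   # f'(0) = 0, so x0 = 0 is the critical point

for dx in (0.1, -0.1, 1.0, -3.0):
    dy_taylor = 0.5 * a0 * dx ** 2   # second Taylor term around x0 = 0
    # the prediction matches f exactly (higher terms vanish) and is positive,
    # so any step away from x0 increases y: the critical point is a minimum
    assert abs(f(dx) - f(0.0) - dy_taylor) < 1e-12
    assert dy_taylor > 0
```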

A similar approach works in the multi-dimensional case too, but there the measure of “acceleration” of f: Rn -> R is called the Hessian matrix **H**:

(@^2f/@x1^2 … @^2f/@x1@xi … @^2f/@x1@xn)

…

(@^2f/@xi@x1 … @^2f/@xi^2 … @^2f/@xi@xn)

…

(@^2f/@xn@x1 … @^2f/@xn@xi … @^2f/@xn^2)

We still want to make sure that whatever our movement d**x** is in our X domain, it will be mapped into a positive movement d**y** in our range space Y after applying the **H** transform (which is actually what brute-force gradient descent methods would check). In the literature such a transform (matrix) is called positive definite, but for better intuitive understanding let’s rather start from the idea of eigenvectors and eigenvalues (anyway, these things are closely related). An eigenvalue is such a multiplier *lambda* (for a transform **M**) that gives a particular vector (or vectors) **v** the same result as the regular **Mv** mapping (**Mv** = *lambda***v**). There may be multiple eigenvalues and eigenvectors for a particular transform. The nice thing about eigenvectors is that if we find enough linearly independent eigenvectors (as many as the X space has dimensions, so that they span it and form a basis of the space), we may express any vector **x** = x1·**v1** + x2·**v2** + … + xn·**vn** in the new eigenbasis. If it happens that all the eigenvalues *lambdai* (the coefficients of the diagonal matrix **L**) for that eigenbasis are positive, the **M** transform never turns a vector of our domain space X “against” itself (the quadratic form **x**T**Mx** stays positive). In the case of the **H** transform that would mean that the point with the *null* Gradient/Differential is indeed a minimum.

**Hx** = x1·**Hv1** + … + xn·**Hvn** = lambda1·x1·**v1** + … + lambdan·xn·**vn**

or, as a variation on the positive definite matrix definition (with **x** written in the coordinates of the orthonormal eigenbasis):

**x**T**Hx** = (x1, x2, …, xn)·(lambda1·x1, lambda2·x2, …, lambdan·xn)T = lambda1·x1^2 + lambda2·x2^2 + … + lambdan·xn^2 > 0

In the general case we may want to find the eigenvectors and eigenvalues anyway (they are very handy for transform analysis, and, in our case, because our data set is used as a transform, for data analysis too). However, in special cases, when we only need to figure out what type of critical point we are at, it is enough to establish whether our **H** matrix is really positive definite, or whether it can have negative lambdas.
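
A numeric illustration of the eigenbasis view (the matrix **H** below is an arbitrary symmetric positive definite example, built as ATA + I, which is one by construction):

```python
import numpy as np

# an arbitrary symmetric positive definite matrix
A = np.array([[1.0, 2.0], [0.5, 1.5]])
H = A.T @ A + np.eye(2)

lambdas, V = np.linalg.eigh(H)   # eigenvalues and orthonormal eigenvectors
assert np.all(lambdas > 0)       # positive definite: all eigenvalues positive

x = np.array([3.0, -1.0])
coords = V.T @ x                 # coordinates of x in the eigenbasis

# the quadratic form equals the eigenvalue-weighted sum of squared coordinates
assert np.isclose(x @ H @ x, np.sum(lambdas * coords ** 2))
```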

Let’s calculate the Hessian for our RSS case, which can be expressed in a compact matrix form as:

@^2RSS/@**t**@**t**T = 2**X**T**X**

This is a quadratic form, which guarantees that whatever our xij data are, xij*xij is positive, and for any d**t** of our domain space the “acceleration” value in the range space (RSS) will be positive. As in our 1D example with the diving plane, we can use the second member of the Taylor series (the first one is 0 because it’s a critical point, and the higher members are 0 too, because the second-derivative matrix **H** has all constant elements) to estimate what dRSS will be when we move from the critical point in any direction d**t** of the domain space. We can express it as:

d*RSS* = d**t**T***H***d**t**

Here, not only is the **H** transform positive (because of the quadratic form xij*xij), but dtj*dtj is too. Therefore, whatever the move d**t** is, dRSS will be positive, so our critical point is indeed a minimum.

The only thing we have to make sure of is that we have enough eigenvectors to form an eigenbasis, i.e. that there is no linear dependency in **H**; in other words, that it is not degenerate, but a full-rank matrix. Otherwise we can’t “control” the behaviour of the “acceleration” in some dimensions.
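
For our RSS Hessian the check is direct (made-up data again): **H** = 2**X**T**X** has nonnegative eigenvalues by construction, and as long as the columns of **X** are linearly independent it is full rank, hence strictly positive definite:

```python
import numpy as np

X = np.array([[1.0, 0.5, 1.0],
              [2.0, 1.5, 1.0],
              [3.0, 2.0, 1.0],
              [4.0, 3.5, 1.0]])

H = 2 * X.T @ X                      # Hessian of RSS with respect to t
lambdas = np.linalg.eigvalsh(H)      # eigenvalues of a symmetric matrix

assert np.all(lambdas >= 0)          # always positive semidefinite
# full rank here, because the columns of X are linearly independent,
# so the Hessian is strictly positive definite and the point is a minimum
assert np.linalg.matrix_rank(H) == X.shape[1]
assert np.all(lambdas > 0)
```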

Of course, we have not discovered any Americas here: all these derivations may be found in many Statistics and Data Science books, and Linear Regression functions are implemented in many libraries. But we want to experiment a bit further with the regression algorithms on real data, so let’s have our own R and/or Python library for it, to play with…

]]>

If we take a look at the general definition of the term *Regression* we will find something like: “transition to a simpler or less perfect state”. *Perfection* is quite a subjective category, and, depending on the context and point of view, the same phenomenon, by the same person, may be viewed as more or less perfect for one or another purpose. For example, a “simpler” state or model may be viewed as “less perfect” for the purposes of simulation accuracy, or “more perfect” for ease and clarity of understanding. So let’s rather stick to the clearer and more distinct “simplicity” aspect of the definition.

Applied to the Data Science meaning of “simplicity”, and especially in the context of space mappings, it would, obviously, mean reduction of the dimensionality and/or of the number and complexity of *Relations* between the *Space* objects. Which, actually, means *Projection* of our Data from a Superspace to a Subspace. As we previously noted, a *Space* is a *Set* with *Relations* defined over its members. For a *Linear Space* those would be the relations of addition and multiplication by scalars which comply with the linear properties, again, mentioned before. For a particular instance of the *Linear Space* we choose an *Identity* element (the “coordinate origin”) and a minimal number of (i.e. linearly independent) basis elements (which will create ordered pairs: basis vectors) which will span the *Space* (i.e. any of its elements may be expressed as a linear combination of them).

If we define different Linear Relations on the same Set, i.e. different bases through which we can express how one element of the Set is related to another, we may map elements of one Linear Space, expressed in its Relations/basis, to the expression in the Relations/basis of another Linear Space through matrix multiplication. Considering that the representations **a** (in the basis **ijk**) and **a’** (in the basis **uvw**) of the same Set element are representations of fundamentally the same thing, and that we can express the basis vector “things” of one basis via the “things” of the other, we can write (see notations in the next chapter):

**a**=*b***i**+*c***j**+*d***k**; **a’**=*x***u**+*y***v**+*z***w**; **a**=**a’**

**u**=*e11***i**+*e21***j**+*e31***k**;

**v**=*e12***i**+*e22***j**+*e32***k**;

**w**=*e13***i**+*e23***j**+*e33***k**;

**a’** = *x*(*e11***i**+*e21***j**+*e31***k**) + *y*(*e12***i**+*e22***j**+*e32***k**) + *z*(*e13***i**+*e23***j**+*e33***k**) = **i**(*x*e11+y*e12+z*e13*) + **j**(*x*e21+y*e22+z*e23*) + **k**(*x*e31+y*e32+z*e33*) = **a** = *b***i**+*c***j**+*d***k**

**E***(*x, y, z*)T (in respect to **uvw**) = (*b, c, d*)T (in respect to **ijk**); or **E*****a’** = **a**, or **a’** = **E**-1***a**
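
A minimal sketch of that change of basis (the **uvw** basis below, i.e. the columns of **E** expressed in **ijk**, is an arbitrary invertible example):

```python
import numpy as np

# columns: u, v, w expressed in the ijk basis (arbitrary, linearly independent)
E = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0]])

a_uvw = np.array([2.0, -1.0, 3.0])   # (x, y, z): coordinates in uvw
a_ijk = E @ a_uvw                    # (b, c, d): the same vector in ijk

# and back: a' = E^{-1} a (solve is the numerically safer inverse)
assert np.allclose(np.linalg.solve(E, a_ijk), a_uvw)
```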

If we express the coefficients of the matrix of an arbitrary transformation **F** as the coefficients of the basis-converting matrix **E** plus some deltas, then the transform will give us the same vector with respect to the range basis plus a delta that may be turned into any element of the range Space, thus producing any desirable vector **b**:

**F*****a’** = **i**(x*(e11+f11) + y*(e12+f12) + z*(e13+f13)) + **j**(x*(e21+f21) + y*(e22+f22) + z*(e23+f23)) + **k**(x*(e31+f31) + y*(e32+f32) + z*(e33+f33)) = **i**(x*e11 + y*e12 + z*e13) + **j**(x*e21 + y*e22 + z*e23) + **k**(x*e31 + y*e32 + z*e33) + **i**(x*f11 + y*f12 + z*f13) + **j**(x*f21 + y*f22 + z*f23) + **k**(x*f31 + y*f32 + z*f33) = **a** + **h** = **b**

A *Subspace* would be a *Subset* of the *Superset*’s elements with the *Superspace*’s *Relations* abridged by the *Subset*’s nature. For example, if a *Superset Relation* involves pairs of elements which (or at least one of which) no longer exist in the *Subset*, that *Relation* will be gone from the *Subspace*’s *Relations*. For a (Linear) *Subspace* of a *Linear Superspace* some of those “things” we have chosen as basis vectors will be gone, and such a *Mapping* that involves elimination of basis vectors (of *Relations* in the general case), while preserving the others, is called a *Projection*.

Tracking our thoughts back to Regression: we won’t usually know beforehand which Subspace is most suitable for our Projecting purposes, but we may have an idea about the possible variants from which we may choose, applying particular criteria, the best one. As was already mentioned, any objects may be members of a Space, including other Spaces, or Subspaces. Here the concept of Quotient (or Factor) Spaces may be useful. In such a Space the members are its disjoint (not having common elements) Subspaces.

Let’s imagine a Crepe Cake, which, as a whole, is a 3D Space, but can also be thought of as a 1D Quotient Space of the 2D Crepes. Also, let’s imagine we have Blueberries somehow stuffed in between our Crepes. And then we want to associate (via a so-called Equivalence Relation) all these Blueberries with only one Crepe, for example by protruding (*Projecting*) those Blueberries through the other Crepes with our fingers and smashing them into the One Chosen Crepe (or making sure they somehow squeeze and move through the holes made by toothpicks). All these Blue Spots on the One Chosen Crepe (or maybe on a sheet of Parchment Paper inserted in between the Cake Layers) we may call an Equivalence Class. And we may want to minimize the ruin we have just inflicted on our Cake by choosing the One Crepe that ensures that, and that will be the condition for our Equivalence Relation.

Of course, other criteria may be chosen, for example the Crepe with the biggest holes in it, or something else. Also, we may want to bore the holes in the Cake not with straight but with crooked fingers (that won’t be a Linear Transformation), or put the Cake on the edge of a table, let it bend like the Clocks in Salvador Dali’s paintings, and then bore it with straight fingers (that will give neither (Linear) Subspaces nor a Linear Transformation). We may decide those non-linear Blueberry Transformations and Subspaces are even cooler than the linear ones (for example, the crooked holes in the Cake would make it a Piece of Art), but for the Linear, One-Output-Parameter Regression from a 3D Space into a 2D Subspace we will stick to the algorithm (Linear Projections from a Vector Superspace into Vector Subspaces) described in the former chapter.

Actually, it may make sense, for a better understanding of what Linear Regression is, to stop for a moment and discuss what is not a Linear Regression. In the case of the Cake Bending over a Table Edge, though the Space it represents may look non-linear to us, looking at it from a Linear Space with a fixed basis (so the Bent Layer Space won’t be a Subspace of it), if the Bent Cake Space is defined by a vector basis that changes (relative to our unchanging basis) with the movement along it, all the linear operations relative to the Bent Cake basis may still hold. For example, we may take a sheet of parchment paper, draw on it the additions and multiplications of vectors and scalars representing all the necessary properties of a Linear Space, put that paper in the middle of the Cake and then bend it; our drawings will still be true. So it’s not necessarily the Space itself which makes the Regression Nonlinear, but our view of it, or rather the mappings from our Space to the Bent Cake Space (both the basis conversions **M** and the projections **P**ij (see the next chapter)) won’t be Globally Linear. At each given point we can indeed represent the mapping as a Matrix-by-Vector multiplication (so it will be a local Linear Transform), but Globally, moving from point to point (and from one element of the quotient Space of Cake Layers to another), that Matrix will differ. Looking at it in a utilitarian way: in the next chapter we won’t be able to write that matrix equation where we use our dataset as a Linear Transform from the space of Linear Mappings to the Space of Residuals to find the transform with the minimal Residual Sum of Squares.

In the Crooked-Fingers-Boring case we have obviously Nonlinear Transformations f(**x**) = **y** that cannot be represented by the matrix equations of the next chapter. Actually, the second case is the same as the first one, but viewed from inside the range Bent Cake Space, imagining it is “flat”, towards the domain Kitchen Space, which would now look like a Warped one. That can easily be seen using an elastic eraser bar: bend it with your fingers, draw a straight line on its side, and then let it go; the previously straight line will become crooked.

In the last case we, again, have Locally Linear transforms, but they differ from point to point, as in the first case, and, again, are not described by the next chapter’s equations. Technically, we may use a particular case of the Locally Linear, but Globally Nonlinear Transformations that vary from one Data element to another, if they are Linearly “modulated” (for example, if we get the Transform Matrix at a particular point (vector) as the multiplication of a Tensor by that vector (so we’ll be adding, first, a Linear Transformation of Vector-to-Linear-Transformation (* D**

…

]]>

When we want to model an unstructured collection of real-world phenomena we use such a mathematical abstraction as a *Set*. It can contain not just simple elements (or objects, or members; these are interchangeable terminologies) such as numbers, or more complex mathematical abstractions (for example, Sets themselves); the members of a Set could be really any possible or imaginable objects. If we want to introduce (and we usually do) a *Structure* over these objects, we use such a mathematical abstraction as a *Space*. A *Space* is a *Set* with *Relations* (or, as a special case, *Mappings* (*Functions* or *Transformations*; these, too, are interchangeable terminologies), or, an even more special case, *Operators*) defined over its members. *Functions* are *Relations* that define the correspondence of a member of the *Domain* Space (the Space we are mapping from) to exactly one member of the *Range* Space (the Space we are mapping to). Of course, there may be multiple Domain members that map to the same Range member, but we do not split them. *Operators* are Mappings to the same Space.

For example, to define a Linear (Vector) Space we have to declare which element(s) will be the Identity element(s), and we have to define Operators of the members’ Addition and scalar Multiplication in such a way that their result will still be an element of the Space (i.e. such Mappings are, indeed, Operators), and those Operators will be Associative, Commutative and Distributive; and we have to declare that every element will have its Inverse, and that their Sum will produce the (Additive) Identity, the Null element. Again, members of such a Space may be not only numbers or their lists, but also any phenomena, or even their relationships. We just have to define the Operators on them as described above, and then we can apply all the Linear Vector Space analytical apparatus to our newly created Space.

For Metric Spaces we have to define a Distance Function (which is, strictly speaking, a Relation; therefore these are more generic Spaces) that will give us a distance between any two selected elements of the Space, and we are free to define it however we want, not necessarily being bound by only the Euclidean distance calculation. For the most generic Topological Spaces we define Topologies: those Sets that basically tell us whether elements of the Space are in the Relation of being neighbours or not, and where the boundaries lie between neighbourhoods.

Representing our Data as Spaces with Structure definitions over them is, obviously, useful for finding Structural relationships between the Data elements, and sufficient for the Unsupervised Learning methods of Data Science. In addition, by defining Mappings or Relations between Spaces, we can ask (and answer) such questions as: “Can two Spaces be mapped to each other?”, “Is one of them a Subspace of the other?”, “Is that mapping continuous (isomorphic/homeomorphic)?” In terms of Data Analysis those questions and answers will tell us whether our Data sets have the same or similar Structures, allowing us to recognize Patterns and mine Data.

Those *Relations*, *Functions*, or *Operators* define the *Structure* of the Space, to which members of the *Space* can be subjected, or which is “visible” in the *Space*. Our Real World Data could have a much more sophisticated “Real Structure”, but, when modelling the Real World Data in a particular modelling *Space*, we will be able to see no more *Structure* than we defined in the model. Or maybe even less *Structure* in the Data, if our expected, model *Structure* is not present in the Real World Data. For example, a Decision Tree (which is such a Relation) formulated to pinpoint fraudulent credit card use will not make visible the authentic owner’s spending habits (for which we will need another kind of Decision Tree). Or a Linear Vector Space will make visible to us only the linear Structure of the relations between Data elements. Or, which is usually the case in Topological Data Analysis, if we Metricize a generic Topological Space, we will lose the non-metrizable relations.

Because, using Statistical Modelling, we cannot (or do not bother to) get an insight into the causes and driving forces of our Data, and we treat them like movements inside a “black box”, we are also in the darkness (of that box) about whether all the aspects (parameters or variables) of the objects we study are indeed their defining parameters (as we may think they are), and not incomplete or overlapping combinations of the “real” (independent) parameters. Because of that, we are bound to see those parameters as random and dependent on each other (welcome to the real-world, or “nasty”, or “dirty”, Data). Which is really not the Data’s problem, but ours: a problem of failed assumptions, expectations, or, in a way, ego.

If the aspects (variables) of the Data and the Structure Operators of the initial model do not give us much meaningful information, we may want to map the Data isomorphically, or at least partially, homomorphically, onto another Space with topologies more relevant and interesting to us, with different bases and different (reduced or introduced) dimensions. That may make visible those Structures we are interested in, or maybe surprised to see, eliminate or reduce variable dependencies, or even reduce the very “randomness” of the variables.

But enough wordy theorizing; let us see how the Statistics/Data Science workhorse of Linear Regression looks from the Space Mapping point of view…

]]>

So, let us start with the simple basic questions: *What are the Object and the Method of Data Science studies?* In my humble opinion (which, of course, may be naive, erroneous, or trivial, as may any other statement in the following text, for each of which I am not going to repeat this caveat, but always imply it), what we are looking for in the Data is its *Structure*, which word, by itself, though, tells or explains nothing. Let us look at it in the context of the linguistic and cultural Structuralism of the XX century (OK, OK, it is not fashionable anymore, because we live in the age of Post-Structuralism, or even Trans-Structuralism, but that changes none of the basics).

Structuralists usually define *Structure* as a mesh of *Opposition* relationships between objects of the domain of interest. This definition still leaves a lot of room for interpretation, and I prefer to look at the *Opposition* not as something adversarial, but, rather, as a state of two peers being in some kind of relationship, which may or may not be *Divisive*. For example, we may be interested in finding out whether the given objects in a pair are neighbours or not. Those relations (or we could say *Relations* in the Algebraic sense, i.e. if we have sets A and B, then a subset of the product AxB is a *Relation* defined by some criteria), really, are the fundamental *Object* of Data Science studies. That is pretty obvious for the unsupervised learning, clustering methods, but it also stands for the other, supervised ones.

Even if we take a look at Descriptive Statistics, we will see that those numbers, functions or diagrams let us peek at the various aspects of the Data *Structure* in a compact, integral form, without drowning in the excessive peculiars and mass of the Data.

Now, having got an idea of what we want to study, we may start thinking about the *Method* with which we may do it. Definitely, it will be a branch of *Mathematical Modelling*, but not the one we usually use in the “Hard”, Natural Sciences. In the Natural Sciences we also strive to uncover a *Structure*, but the *Structure* of the causes and driving forces of the data being observed. In the general case we end up with a system of (partial) differential equations that we usually cannot solve *Analytically*. Then we either linearize, simplify, or modify our models to reduce them to a form that has a known analytical solution, one that is (relatively) easy to comprehend and works fine in a wide range of initial conditions. Or, if such an approach is not possible or acceptable, we resort to the data crunching of the *Numerical Methods*, which are, basically, the same linearizations, simplifications and modifications, but applied on a small temporal or spatial scale repeatedly, which is easily machine-automated. However, obtaining a solution this way has a cost: limited convergence intervals; and, if the initial conditions are changed significantly, all bets are off that such a solution will work, not only with the required accuracy, but even on the level of general tendencies. Sounds familiar to the Data Scientists, huh?

“Soft” Science generally despises such an approach of mathematically modelling the causes and driving forces of the data it deals with (supposedly because it deals with much more complicated matters, and such deterministic analysis is practically useless; yeah, yeah, “an invisible hand of the Free Market” will sort everything out instead). What it usually looks for is the *Structure* of the Data itself, or its appearance. The useful branch of *Mathematical Modelling* in such a case is *Statistical Modelling*. Similarly to the world of Natural Science, we may be interested either in the more simplified, but universalist and insight-giving *Statistical Inference*, or in the more result-oriented, but convergence-limited *Predictive Modelling*. *Machine Learning* is closely associated, and largely overlapping, with the latter, because its resulting models are hard to interpret in the analytical sense.

*Data Mining* and *Pattern Recognition* are also associated with each other, and do exactly what the latter says: search for a Pattern, or a *Structure*, of the Data. However, the latter usually looks for Patterns by example, while the former looks for something new; and the more unexpected that *Structure* is, the better. They reside in the middle between *Inference* and *Machine Learning*, because, on the one hand, we still want to have some analytical insight, and on the other, we may greatly benefit from the power of “number crunching”. Of course, if the aim of the *Pattern Recognition* is purely utilitarian (to arrest a particular government protester, or kill a particular jihadist from a drone), then that brings it closer to *Predictive Modelling*.

Again, with the “Soft” Sciences shying away from mathematical methods, the niche occupied by the theoretical branches of the “Hard” Sciences was taken over, in the “Soft” Science realm, by the (semi)autonomous Data Science. Of course, in the real world the partition of the *Mathematical Modelling* branches described above is not strict: “Hard” Science uses a lot of Statistical methods, though they play more of a servile role in the initial processing of empirical data, before the real theorizing begins (or the verification of theories against empirical reality), while there is some place for *Analytical Mathematical Models* even in the “Soft” Sciences.

Nevertheless, what is the fundamental *Method* that lies at the foundation of all the methods of Data Science mentioned above (as well as those not mentioned)? The *Method* that might not be frequently reflected upon in real day-to-day practice?

…


Of course, the name of the blog is pretentious and plagiarized. It may look overconfident; nevertheless, I have quite humble reasons for naming it that way. Having taken Modern Algebra and Statistical Methods classes together in my graduation year, I found myself confused by the mix of two approaches: one deep, fundamental, and universalist, and the other utilitarian, mechanistic, and, honestly, appearing to have not much Science behind it.

I felt the same difference between the two ways the Linear Algebra classes were taught. The seemingly mechanistic, close-to-the-ground Linear Algebra I turned, in its Linear Algebra II incarnation, into a much deeper and more thoughtful discipline. It was a much tougher class, but, in a way, a more “mind-calming” one – it turns out there was a meaning, a reason, behind all those matrix manipulations you merely memorized in the first class. The second course was taught from Sheldon Axler’s textbook *Linear Algebra Done Right*. That title, apparently, is the inspiration for this blog rubric.

Similarly, in the following posts I hope to find out and explain to myself the deep meaning of the confusing Data Science buzz-terminology: “Data Mining”, “Machine Learning”, “Artificial Intelligence”, “Deep Learning”, “Big Data”, etc. Unlike the more rigorously defined terminology, such as “Statistical Inference”, “Predictive Modelling”, “Reinforcement Learning”, “Pattern Recognition”, the former vocabulary is fuzzily defined, redundant, and confusing even for seasoned Data Scientists.

Of course, I do not envision the blog being in any way comprehensive and exhaustive (I simply do not have the qualifications for that), but rather spotty, fragmented, touching the most “unsettling” topics (for me), and, maybe, homing in on some “calming” answers. Because this is mainly a self-directed stream-of-consciousness dump, the writing is left in a scratch-book style, hardly proof-read, and, therefore, I beg pardon from the occasional visiting readers.

I am not going to properly format citations, and I will overuse, or even abuse, capitalization and italic fonts, which is, of course, not proper *Scientific Writing* :), especially since the use of this emphasis formatting will not be consistent – the usage context will drive the choice of that abuse. Mathematical proofs, whenever used, are not rigorous, but rather illustrations of proofs, intended to make them more understandable and intuitively clear. When some code is involved, I am not going to torture readers by inlining it, but will make it freely available one way or another.

Having got some clarity about what this rubric is about and what form it takes, let us head to the most fundamental question about Data Science: “What is it, really?”

UPDT: Actually, in the course of writing I realized I can dilute my plain plagiarism with a bit of originality – make these talks Kitchen Talks. Like talks of people not rushing anywhere, relaxing in the kitchen with a cup of tea or coffee, with a piece of cake, or pastry, or other gourmet food, on which we are going to do our thought (or maybe quite physical) experiments.

…


Descartes thinks he has accumulated a lot of questionable and dubious ideas through his life. He concludes that all those ideas come from the senses, which are not trustworthy. It is easy to doubt the credibility of our senses when they work at the edge of their sensitivity – for example, in the recognition of small or far-away objects. It is much harder to doubt a bigger chunk of our senses, especially when they work within their confidence interval, because their correctness can be confirmed by experience; if we insist that some of those senses are wrong, we risk being considered mad. However, it is much easier to reject our senses as a whole. For example, in a dream we perceive all the weird events of the dream as normal and real, and we can recognize the strangeness of the dream only outside of the dream “reality”, when we wake up. That is why Descartes decides to question the reliability of the whole world of the senses.

If we accept the idea that our real world, given to us through the senses, is just another dream of a higher rank, we may want to find criteria that would allow us to detect the imperfections of dreams (especially those of lower rank) compared to the really real world. The importance of such criteria is stressed, for example, in the motion picture Inception. For a moment Descartes follows this path, suggesting that human fantasy is impaired by its scantiness: the fantastical creatures that humans make up are just combinations of parts of real animals, and images in dreams are like bleak paintings of real things. Descartes implies that if we dream a dream impressed on us by some “Architect” (in the terms of Inception), we can use the “simple and universal” invariants of the really real world (like mathematical concepts) as landmarks for detecting a dream. But Descartes quickly withdraws from this path, suggesting that the “Architect” may be an omnipotent God capable of creating a deceptive dream for us which is as perfect as the really real world (i.e., a “totem” from Inception would not work).

Descartes meets the possible counter-arguments – that God could be non-omnipotent, or could not possibly be deceptive, because deception is a manifestation of imperfection – by saying that he has no answers to these objections. Descartes started doubting the senses because he wanted to keep only certain and distinct ideas in his understanding of the world; therefore, he is willing to build it for the worst-case scenario, in which God’s task is to deceive Descartes with all his might. If Descartes is still able to infer anything about the world even under conditions of omnipotent deception, those inferences would be quite certain and unshakable.

Descartes is able to identify at least one thing about which the omnipotent evil deceiver cannot possibly trick him. The deceiver cannot make Descartes believe that he does not exist; therefore, the statement “I am, I exist” cannot be taken away from him. Descartes finds other certain qualities of his “I”, which is a “thinking thing” and which exists only while he thinks. His “I” also has senses, which does not mean that these senses are somehow real; strictly speaking, sensing is a type of thinking. The “I” can have mental images, which are modifications of thought as well.

Another classification of thought is aimed at identifying which type of thought is prone to errors. Descartes divides thoughts into three categories. Two of them – simple ideas (even unreal and imaginary ones) and emotions – cannot be judged as true or false. We can make an error only with thoughts of the third category, named judgments.

Descartes tries to analyze how the human mind works, taking the example of wax. Even though wax, in an environment usual for humans, may appear to our senses in different states of matter, we perceive it as the same substance. He concludes that our mind does not comprehend things through the senses, which give us information only about the appearance of substances; rather, our mind understands substances by their essence. This is similar to Plato’s Forms. The conclusion is encouraging because, to proceed further, Descartes has to take on the question of God’s existence, and since God cannot be comprehended by the senses, one must think about him in terms of the world of Forms.

Descartes attempts to find out which ideas could have originated from himself and which could not. Based on the premise that the cause should be greater than the effect, he concludes that the idea of an omnipotent, omniscient and perfect God can originate only from the omnipotent, omniscient and perfect God himself. This proof effectively repeats Aquinas’s proofs of God’s existence.

However, there are contradictions in Descartes’ syllogisms. He says that the idea of physical objects – extended, unintelligent things – was created by him as an opposition to the idea of his thinking, non-extended self. By the same logic, the idea of an omnipotent and perfect God could have appeared in Descartes’ mind as an opposition to his non-omnipotent and imperfect self. On the other hand, using the concept of the cause being greater than its effect, Descartes could have said that not only is it impossible for the idea of God to originate from himself, but the idea of physical objects also needs a greater cause, in the form of objective reality. In both cases – the personal dream world of Descartes, whose ideas come only from himself, or an objective world created by a perfect God – he could trust his senses, because they are either only his own, not impressed by a foreign “Architect”, or caused by objective reality.

Perhaps feeling a deficiency in his arguments, Descartes makes a second attempt to justify the existence of God, by asking whether he could have been created by anyone other than God. His answer is “No”, based on the same medieval reasoning of the necessity of the cause being greater than the effect, but he introduces a new twist to it. Descartes says his continuous existence is of the same nature as his creation. Thus his world, instead of being created once (a long, long time ago, and maybe left by God on its own), becomes a dream of God, requiring God’s constant attention in actively re-dreaming it, in its whole fullness, every next moment.

Having proven the existence of the perfect God, Descartes proceeds to analyze the question of where errors come from. He says that human free will is as great as God’s. Following Augustine’s thinking, he says his ability to understand is as good as God’s qualitatively, but is less quantitatively. When the free will of judgment exceeds the scope of understanding, an error appears. By his own standards, Descartes’ will to question his senses and his desire to prove God’s existence exceeded the capacities of his contradictory understanding, and made his subsequent conclusions about the credibility of the senses shaky.
