**Using linear regression for dimensionality reduction**

In our linear model, designed and developed in R in the previous chapters, we calculate the new images of the data, projected onto the element of the quotient space that satisfies our chosen requirement (minimal SSE), both in the basis of the original space and in a basis of the quotient subspace. However, we need one more function: one that calculates, in symbolic form, the new synthetic basis vectors as linear combinations of the original basis vectors. Essentially, we need to implement a symbolic multiplication of the matrix N⁻¹R of our transformation into the quotient space element by a symbolic vector of the variable names (see those functions in Appendix 8.1).
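To get a feel for what this symbolic multiplication produces, here is a minimal self-contained sketch; it is a simplified stand-in for the `matrix_symvect_mult` function of Appendix 8.1 (which additionally merges like terms and drops zeros via `sym_lc_add`), and the helper name is ours:

```r
# Minimal sketch: multiply a numeric matrix by a symbolic vector of
# variable names, producing one textual linear combination per row.
matrix_symnames_mult <- function(m, vars) {
  apply(m, 1, function(row) paste(row, vars, sep = "*", collapse = "+"))
}

m <- matrix(c(1, 2, 3, 4), nrow = 2, byrow = TRUE)
matrix_symnames_mult(m, c("x", "y"))
# "1*x+2*y" "3*x+4*y"
```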

With the symbolic matrix-by-vector multiplication functions ready, we may start playing with our dataset, trying to reduce its high dimensionality to a more manageable and comprehensible lower-dimensional image. As we mentioned before, regression is a way to reduce the dimensionality of a set, at the price, of course, of losing some complexity of the data structure. We try to minimize that loss of continuity of the transformation function (the projection we use in regression) using some criterion (a loss function), in our case SSE. So far we have a univariate multivariable regression function ready, therefore we can collapse dimensions consecutively, one by one. With this approach we need to decide how many dimensions to collapse and/or in what order.

There are multiple possible approaches and criteria for that, and we may come back to this discussion later, but for now let's start with a simple one. To preserve as much structural complexity (i.e. continuity of the transformation) as possible, we may want to collapse first those dimensions whose projections give the minimal value of the loss function of our choice (after, of course, normalizing those dimensions to their ranges). On the other hand, the order may not matter much if we go for the ultimate dimensionality reduction to a 1D space, essentially making its basis vector an eigenvector.
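As a concrete illustration of the normalization step (this mirrors `norm_ds` in Appendix 8.3; the helper name here is ours):

```r
# Sketch: normalize a dimension to its range before comparing per-dimension
# regression losses, so SSEs of differently scaled variables are comparable.
range_norm <- function(x) (x - min(x)) / (max(x) - min(x))

range_norm(c(10, 15, 20, 30))
# 0.00 0.25 0.50 1.00
```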

Let's take the Diabetes dataset we used for experimentation in the previous chapter, which has 12 dimensions (obviously not independent), and run consecutive univariate regressions until one dimension is left:

load(url("http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/diabetes.sav"))

# Dimensionality reduction
diabetes2 <- diabetes
dim <- c("glyhb", "chol", "hdl", "ratio", "stab.glu",
         "bp.1s", "bp.1d", "age", "height", "weight", "waist", "hip")
ds <- ks_lm_dim_red(diabetes2, dim, n_dim=1, reorder=1)

Now we can see what our synthetic eigenvector variable will be:

resp_name <- names(ds)[1]
resp_name

[1] "573.865492043754*chol+25.8440157742691*weight-507.345171188756*hdl+455.427566923276*hip+251.259082865673*waist-12.0580460009216*bp.1s+17.2494522189963*glyhb+142.022683378986*stab.glu+233.150806921495*height+59.1169607418513*bp.1d+29.0382173062296*age"

At first look, not many surprises: cholesterol is really bad, HDL ("good cholesterol") is good, hip circumference is bad as well, while waist circumference is about half as bad; surprisingly, height is noticeably worse than weight, and age and blood pressure carry less importance than the other factors, though high values are still bad. But, anyway, that's only a formal statistical model.

Now, let's see how the data distribution looks in our new synthetic 1D eigenbasis v1:

names(ds)[1] <- "v1"

ggplot(ds, aes(x=v1))+
  geom_histogram(data=ds, aes(y=..density..), fill="blue", bins=40, alpha=0.5)+
  geom_density(data=ds, adjust=0.5)

We see it's noticeably multimodal (four or five modes). We don't know yet what that clustering means or correlates with, so let's see how it relates to the usual criterion for diabetes: glycosylated hemoglobin > 7. Let's overlay the histograms of the two groups of observations divided by that criterion:

clean.diabetes <- diabetes2[complete.cases(diabetes2[, dim]), dim]
ds$glyhb <- clean.diabetes$glyhb
ds$glyhb_c <- 0
ds[ds$glyhb>=7, "glyhb_c"] <- 1
ds_nd <- ds[ds$glyhb<7,]
ds_pd <- ds[ds$glyhb>=7,]

ggplot(ds, aes(x=v1))+
  geom_histogram(data=ds, fill="blue", bins=40, alpha=0.6)+
  geom_histogram(data=ds_nd, fill="green", bins=40, alpha=0.6)+
  geom_histogram(data=ds_pd, fill="red", bins=40, alpha=0.6)

We may also look at these two subsets and the whole dataset on a scatterplot:

ggplot(ds, aes(x=v1, colour=glyhb))+
  geom_point(data=ds, y=.5, alpha=0.5)+
  geom_point(data=ds_nd, y=.75, alpha=0.5)+
  geom_point(data=ds_pd, y=.25, alpha=0.5)+
  scale_colour_gradientn(colours=rainbow(4))

As well as descriptive statistics of these datasets:

> describe(ds[[1]])
   vars   n   mean     sd median trimmed    mad    min     max   range  skew kurtosis    se
X1    1 377 508.01 192.12 504.76  505.21 192.78  92.88 1215.27 1122.39  0.20    -0.14  9.89

> describe(ds_nd[[1]])
   vars   n   mean     sd median trimmed    mad    min     max   range  skew kurtosis    se
X1    1 319 478.84 177.23 482.30  476.71 182.29  92.88 1041.15  948.27  0.12    -0.31  9.92

> describe(ds_pd[[1]])
   vars   n   mean     sd median trimmed    mad    min     max   range  skew kurtosis    se
X1    1  58 668.48 193.08 694.31  671.40 169.63 149.86 1215.27 1065.41 -0.13     0.27 25.35

The difference in the means between the two subsets is only around one sigma, which does not give a very impressive confidence level of statistical significance (around 60-70%), but that's not exactly the point. Glycosylated hemoglobin, and its level of 7 in particular, is not the ultimate and definitive criterion for diabetes type II, just one useful estimate. Diabetes type II is not just about "blood sugar"; it's a general metabolic condition, and not necessarily a disorder per se. Some researchers hypothesise it is an evolutionary adaptation (to cold climate), and when it developed, the "inconveniences" it causes in old age and in the "normal", comfortable conditions of contemporary life were not an evolutionary priority.

Anyway, the clustering of observations with the positive glyhb criterion at the last two modes (v1 > 650) may be used as an indicator of such an adaptation, or of the likelihood of a diabetes type II condition. Let's see how this clustering looks in the original (cholesterol, blood pressure, glycosylated hemoglobin) space:

clean.diabetes$syn <- ds[[1]]
clean.diabetes$syn_c <- 2
clean.diabetes[clean.diabetes$syn<320, "syn_c"] <- 1
clean.diabetes[clean.diabetes$syn>=650, "syn_c"] <- 3

cloud(chol ~ glyhb * bp.1s, clean.diabetes, groups=syn_c)

What we can see on this scatterplot is that the last mode clusters around observations with the largest variability in at least one of the displayed parameters, which is a well-known risk factor for diabetes. Therefore, it may indeed make sense to investigate the above criterion further in association with diabetes type II.

It's easy to notice that many of the original 12 variables are dependent and could be grouped into 5 more or less independent buckets (glucose-related, cholesterol-related, blood pressure, physical characteristics, and age). Collapsing one dimension in a group may effectively collapse the others, so it may take far fewer projections than the number of dependent dimensions to convert the synthetic dimensions into the eigenbasis. In our particular case 5 projections (and the respective basis conversions) are enough to put the data on a relatively straight line, which will be the eigenbasis vector we are looking for.
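One quick way to spot such dependent buckets is the correlation matrix of the complete-case variables. A sketch on synthetic data standing in for the Diabetes columns (the variable names here only mimic the real ones):

```r
# Sketch: dependent variables form blocks of high pairwise correlation.
# Synthetic stand-ins: 'glyhb' tracks 'glucose', 'waist' tracks 'weight'.
set.seed(1)
n <- 200
glucose <- rnorm(n)
glyhb   <- glucose + rnorm(n, sd = 0.3)  # glucose-related bucket
weight  <- rnorm(n)
waist   <- weight  + rnorm(n, sd = 0.3)  # physical-characteristics bucket

round(cor(cbind(glucose, glyhb, weight, waist)), 2)
# within-bucket correlations come out close to 1, across-bucket close to 0
```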

From the computational point of view, we are looking for the state when the mean relative sigma of the coordinates over all remaining dimensions is less than a particular value, say, 5%:

ds <- ks_lm_dim_red(diabetes2, dim, sd_dim=0.05, reorder=1)

or by specifying that we want 7 dimensions left:

ds <- ks_lm_dim_red(diabetes2, dim, n_dim=7, reorder=1)
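The sd_dim stopping criterion itself is simple: per observation, take the coefficient of variation (sd/mean) of its coordinates across the remaining dimensions; when the mean of those values falls below the threshold, every row is nearly proportional to (1, 1, …, 1), i.e. we are effectively in a 1D eigenbasis. A minimal sketch (cf. `norm_sd` in Appendix 8.3; the sample data are made up):

```r
# Per-observation coefficient of variation across the remaining dimensions.
cv <- function(x) sd(x) / mean(x)

# Two observations whose coordinates are nearly equal across 3 dimensions:
ds <- data.frame(v1 = c(1.00, 2.00), v2 = c(1.01, 1.98), v3 = c(0.99, 2.02))
mean(apply(ds, 1, cv))
# 0.01 -- well under a 5% (0.05) threshold, so the reduction would stop here
```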

And plotting the transformed data in the first 3 of the synthetic dimensions (in the others it will look the same), we'll get a pretty straight line:

names(ds)[1:3] <- c("v1", "v2", "v3")
clean.diabetes <- diabetes2[complete.cases(diabetes2[, dim]), dim]
ds$glyhb <- clean.diabetes$glyhb
ds$glyhb_c <- 0
ds[ds$glyhb>=7, "glyhb_c"] <- 1

cloud(v1 ~ v2 * v3, ds, groups=glyhb_c, pretty=TRUE)

To roughly convert our dataset into the 1D eigenbasis while preserving the basis vector length (which doesn't really matter for the eigenbasis), it is enough to multiply one of the coordinates by sqrt(n), where n is the number of remaining dimensions, assuming the vector length is preserved in the two bases: **A** = *b***u** = *a***i** + *a***j** + *a***k** + … (*n* equal components), so **A**ᵀ**A** = *b*² = *n*·*a*².
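A quick numeric check of that rescaling (the values here are arbitrary):

```r
# A vector with the same coordinate a in each of n dimensions has Euclidean
# length a*sqrt(n); keeping that length in a 1D basis means multiplying the
# single remaining coordinate by sqrt(n).
a <- 1.5; n <- 4
v <- rep(a, n)
sqrt(sum(v^2))  # length in the n-D basis: 3
a * sqrt(n)     # the rescaled 1D coordinate: 3
```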

If we stop just one step earlier:

ds <- ks_lm_dim_red(diabetes2, dim, n_dim=8, reorder=1)

We’ll still have some variability of data points:

Whether we run all the projections down to the last dimension or stop by the mean sigma criterion, we'll have practically the same distribution along the eigenbasis vector. In our case, if we stop earlier, the numerical values of the synthetic variable will be different (we've been lazy in the previous chapters, defining the N⁻¹ transform as a diagonal matrix with a unit last column, so after each projection the length of the basis vectors shrinks), but for the eigenbasis that's not important: the eigenbasis coordinate will just be proportionally adjusted:

v1 = 23.70*chol+1.09*weight-18.89*hdl+18.57*hip+10.66*waist-0.51*bp.1s+0.73*glyhb+5.88*stab.glu+9.42*height+2.211*bp.1d+0.917*age

We'll play with a more intelligent Principal Component Analysis of the synthetic basis of the projection later; for now, let's take a look at other applications of linear regression for the eigenbasis transform, in particular in such a seemingly unlikely area as Natural Language Understanding…

**Appendix 8.1 – Symbolic matrix by vector multiplication**

# Examples:
str1 <- "2.*a-3.1*b-0.4*c-d"
str2 <- ".01*c+35*b+3.24*e-5.*d"

# Symbolic multiplication of string 'str1' by number 2.5
sym_lc_mult(str1, 2.5)

# Symbolic sum of 2 strings 'str1' and 'str2'
sym_lc_add(str1, str2)

str_v <- c("a+.5*b", "b-2*c", "3*c+d")
str_u <- c("", "", "")
m <- matrix(c(1,2,3,4,5,6,7,8,9), 3, byrow=TRUE)

# Symbolic multiplication of numeric matrix 'm' by string vector 'str_v',
# result is a string vector 'str_u'
str_u <- matrix_symvect_mult(m, str_v)
str_u

# Symbolic multiplication of numeric matrix 'm' by string vector 'str_v',
# result is a string vector
matrix_symvect_mult <- function(m, str_v){
  nr <- nrow(m)
  nc <- ncol(m)
  str_u <- rep_len("", nr)
  for(i in 1:nr){
    for(j in 1:nc){
      str_u[i] <- sym_lc_add(str_u[i], sym_lc_mult(str_v[j], m[i,j]))
    }
  }
  str_u
}

# Symbolic sum of 2 strings
sym_lc_add <- function(str1, str2){
  str_out = NULL
  str_t1 <- gsub("-", "+-", str1)
  l_tup1 <- strsplit(strsplit(str_t1, "[+]")[[1]], "[*]")
  str_t2 <- gsub("-", "+-", str2)
  l_tup2 <- strsplit(strsplit(str_t2, "[+]")[[1]], "[*]")

  for(item1 in l_tup1){
    item1 <- item_conv(item1)
    found = FALSE
    for(item2 in l_tup2){
      item2 <- item_conv(item2)
      # item in both strings
      if(identical(item1[2], item2[2])){
        found = TRUE
        mult = as.double(item1[1]) + as.double(item2[1])
        if(mult != 0){
          if(!is.null(str_out))
            str_out = paste(str_out, "+", sep="")
          str_out = paste(str_out,
                          as.character(mult),
                          "*", item1[2], sep="")
        }
      }
    }
    # item in first string but not in second
    if(!found){
      if(as.double(item1[1]) != 0){
        if(!is.null(str_out))
          str_out = paste(str_out, "+", sep="")
        str_out = paste(str_out, item1[1],
                        "*", item1[2], sep="")
      }
    }
  }

  # item in second string but not in first
  for(item2 in l_tup2){
    item2 <- item_conv(item2)
    found = FALSE
    for(item1 in l_tup1){
      item1 <- item_conv(item1)
      # item in both strings - not processing
      if(identical(item1[2], item2[2])){
        found = TRUE
      }
    }
    if(!found){
      if(as.double(item2[1]) != 0){
        if(!is.null(str_out))
          str_out = paste(str_out, "+", sep="")
        str_out = paste(str_out, item2[1],
                        "*", item2[2], sep="")
      }
    }
  }

  if(is.null(str_out))
    str_out = "0"
  str_out2 <- gsub("[+]-", "-", str_out)
  str_out2
}

# Symbolic multiplication of string 'str' by number m
sym_lc_mult <- function(str, m=1){
  str_t <- gsub("-", "+-", str)
  l_tup <- strsplit(strsplit(str_t, "[+]")[[1]], "[*]")
  res <- lapply(l_tup, sym_var_mult, m)
  str_out = NULL
  for(item in res){
    if(!identical(as.double(item[1]), 0)){
      if(!is.null(str_out))
        str_out = paste(str_out, "+", sep="")
      str_out = paste(str_out, item[1], "*", item[2], sep="")
    }
  }
  if(is.null(str_out))
    str_out = "0"
  str_out2 <- gsub("[+]-", "-", str_out)
  str_out2
}

item_conv <- function(item){
  if(length(item)==1){
    if(is.na(as.double(item[1]))){
      if(identical(substr(item[1],1,1), "-"))
        item = c("-1", sub("-", "", item[1]))
      else
        item = c("1", item[1])
    }
    else
      item = c("0", "0")
  }
  else if(length(item)==0){
    item = c("0", "0")
  }
  item
}

sym_var_mult <- function(item, m=1){
  item <- item_conv(item)
  c(as.character(as.double(item[1])*m), item[2])
}

**Appendix 8.2 – Linear regression model**

# Linear univariate regression that minimizes rss
# of variable 'yname' of 'df' dataset
ks_lm <- function(df, yname, xnames, full=1, error=0){
  all.names = c(unlist(xnames), unlist(yname))
  clean.df <- df[complete.cases(df[, all.names]), all.names]
  nr <- nrow(clean.df)
  nc <- ncol(clean.df)

  model <- list()
  model$X <- matrix(do.call(rbind, clean.df), nc)
  rownames(model$X) <- all.names

  x <- clean.df
  x[, yname] <- NULL
  x[, nc] <- rep_len(1, nr)    # append the intercept column of ones
  y <- clean.df[, yname]
  X <- matrix(do.call(cbind, x), nrow(x))

  # normal-equations solution
  b1 <- solve(t(X) %*% X) %*% t(X) %*% y

  if(full | error){
    e <- X %*% b1 - y
    model$e2 <- t(e) %*% e
    model$e2n <- model$e2/nr
    model$en <- sqrt(model$e2)/nr
  }

  if(full){
    # Projection onto 0 member of quotient space X/Y
    model$R <- diag(rep_len(1, nc))
    model$R[nc, 1:nc] <- b1
    model$R[nc, nc] <- 0

    # X to Y basis rotation
    model$Nm1 <- diag(rep_len(1, nc))
    model$Nm1[, nc] <- rep_len(1, nc)
    model$Nm1 <- model$Nm1[-nc, ]

    model$h0 <- rep_len(0, nc)
    model$h0[nc] <- b1[nrow(b1)]    # intercept is the last coefficient

    model$X0 <- model$R %*% model$X
    rownames(model$X0) <- all.names
    model$Xl <- model$X0[, 1:nr] + model$h0
    rownames(model$Xl) <- all.names

    model$Nm1R <- model$Nm1 %*% model$R
    model$DimNames <- matrix_symvect_mult(model$Nm1R, all.names)

    model$Y <- model$Nm1 %*% model$X0
    rownames(model$Y) <- model$DimNames
  }

  model
}

**Appendix 8.3 – Dimensionality reduction via linear regression**

# Reduce dimensionality of dataframe 'df' on variables listed in 'dim'
# until 'n_dim' dimensions left
ks_lm_dim_red <- function(df, dim, n_dim=1, reorder=1, sd_dim=0, hb=1){
  clean.df <- df[complete.cases(df[, dim]), dim]

  # normalize variables by their range
  n_ds <- lapply(dim, norm_ds, clean.df)
  names(n_ds) <- dim
  norm.df <- as.data.frame(n_ds)

  # order variables by rss of their univariate regressions
  v_dim <- sapply(dim, norm_rss_ds, norm.df)
  i_dim <- order(v_dim[2,])
  order.df <- norm.df[, i_dim]

  # run univariate regressions for variables in rss order
  # until n_dim dimensions left
  ds <- order.df
  n_max <- length(dim)-n_dim
  i <- 0
  while(i < n_max){
    if(hb)
      print(i)
    resp_name <- colnames(ds)[1]
    pred_names <- colnames(ds)[-1]
    mod <- ks_lm(ds, resp_name, pred_names)
    ds <- as.data.frame(t(mod$Y))
    i <- i+1
    if(i >= n_max)
      break

    # check if we already effectively have an eigenvector
    # (all coordinates are the same)
    if(sd_dim > 0){
      m_sd <- mean(sapply(as.data.frame(t(ds)), norm_sd))
      print(m_sd)
      if(m_sd < sd_dim){
        # convert length to 1D eigenbasis
        col_name <- names(ds)[1]
        ds <- as.data.frame((ds[,1] * sqrt(length(dim)-i)))
        names(ds)[1] <- sym_lc_mult(col_name, sqrt(length(dim)-i))
        break
      }
    }

    # reorder synthetic variables by rss on each iteration
    if(reorder){
      tmp_dim <- colnames(ds)
      v_dim <- sapply(tmp_dim, norm_rss_ds, ds)
      i_dim <- order(v_dim[2,])
      ds <- ds[, i_dim]
    }
  }
  ds
}

# Create list of rss for univariate regression on each normalized variable
norm_rss_ds <- function(item, norm.diabetes){
  i_range <- range(norm.diabetes[, item])
  norm.diabetes[, item] <- norm.diabetes[, item]/(i_range[2]-i_range[1])
  predict_dim <- colnames(norm.diabetes)
  predict_dim <- predict_dim[which(predict_dim != item)]
  mod <- ks_lm(norm.diabetes, item, predict_dim, full=0, error=1)
  dim_rss <- c(item, mod$en)
  dim_rss
}

# Normalize a variable by its range
norm_ds <- function(item, norm.diabetes){
  i_range <- range(norm.diabetes[, item])
  norm_col <- (norm.diabetes[, item]-i_range[1])/(i_range[2]-i_range[1])
  norm_col
}

# Normalize st.dev. by mean
norm_sd <- function(item){
  sd(item)/mean(item)
}