Data Science Done Right (the Kitchen Style) #8

Using linear regression for dimensionality reduction

In our linear model, designed and developed (in R) in the previous chapters, we calculate the new images of the data, projected onto the chosen element of the quotient space that satisfies a particular requirement (we use minimal SSE), both in the original space basis and in a basis of the quotient subspace. However, we will need one more function that calculates, in symbolic form, the new synthetic basis vectors as linear combinations of the original space basis vectors. Basically, we need to implement a symbolic multiplication of the matrix N⁻¹R of our transformation into the quotient space element by a symbolic vector of the variable names (see those functions in Appendix 8.1).
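
As a quick illustration (a minimal sketch using the functions from Appendix 8.1; the matrix values and variable names here are made up purely for demonstration), multiplying a numeric matrix by the symbolic vector of variable names yields the symbolic expressions of the new basis vectors:

# Hypothetical 2x3 transformation matrix (values chosen only for illustration)
m <- matrix(c(0.5, -1.0, 2.0,
              1.5,  0.0, -0.3), nrow=2, byrow=TRUE)
vars <- c("chol", "hdl", "age")
# Each row of the result is a linear combination of the original variable names,
# e.g. roughly "0.5*chol-1*hdl+2*age" and "1.5*chol-0.3*age"
matrix_symvect_mult(m, vars)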

Having the symbolic matrix-by-vector multiplication functions ready, we may start playing with our dataset, trying to reduce its high dimensionality to a more manageable and comprehensible lower-dimensional image. As we mentioned before, regression is a way to reduce the dimensionality of a set, of course at the price of losing some complexity of the data structure. We try to minimize that loss of continuity of the transformation function (the projection we use in regression) using some criterion (loss function), in our case SSE. So far we have a univariate, multivariable regression function ready, therefore we can collapse dimensions consecutively, one by one. With this approach we need to decide how many dimensions we want to collapse and/or in what order.

There could be multiple approaches and criteria for that, and we may come back to this discussion later, but for now let’s start with simple ones. To preserve as much structural complexity (i.e. continuity of the transformation) as possible, we may want to collapse first those dimensions whose projections give the minimal value of the loss function of our choice (of course, we first normalize those dimensions to their ranges). On the other hand, the order may not matter so much if we go for the ultimate dimensionality reduction to 1D space, basically making its basis vector an eigenvector.
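
Here is a minimal sketch of that ordering step, using the norm_ds and norm_rss_ds helpers from Appendix 8.3 and the diabetes2 data frame and dim variable list defined in the next code block:

# Keep complete cases and normalize every variable to its range
clean.df <- diabetes2[complete.cases(diabetes2[, dim]), dim]
n_ds <- lapply(dim, norm_ds, clean.df)
names(n_ds) <- dim
norm.df <- as.data.frame(n_ds)

# Rank the variables by the residual error of their univariate regressions;
# the dimension with the smallest error is the first candidate to collapse
v_dim <- sapply(dim, norm_rss_ds, norm.df)
v_dim[, order(v_dim[2, ])]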

Let’s take the Diabetes dataset we used for experimentation in the previous chapter, which has 12 dimensions (obviously not independent ones), and run consecutive univariate regressions until only 1 dimension is left:

load(url("http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/diabetes.sav"))
# Dimensionality reduction
diabetes2 <- diabetes
dim <- c("glyhb", "chol", "hdl", "ratio", "stab.glu",
         "bp.1s", "bp.1d", "age", "height", "weight", "waist", "hip")
ds <- ks_lm_dim_red(diabetes2, dim, n_dim=1, reorder=1)

Now we can see what our synthetic eigenvector variable will be:

resp_name <- names(ds)[1]
resp_name
[1] "573.865492043754*chol+25.8440157742691*weight-507.345171188756*hdl+455.427566923276*hip+251.259082865673*waist-12.0580460009216*bp.1s+17.2494522189963*glyhb+142.022683378986*stab.glu+233.150806921495*height+59.1169607418513*bp.1d+29.0382173062296*age"

At first look, not many surprises: cholesterol is really bad, HDL (“good cholesterol”) is good, hip circumference is bad as well, while waist circumference is about half as bad; surprisingly, height comes out noticeably worse than weight, and age and blood pressure matter less than the other factors, though high values are still bad. But, anyway, that’s only a formal statistical model.

Now, let’s see what the data distribution looks like in our new synthetic 1D eigenbasis v1:

names(ds)[1] <- "v1"
ggplot(ds, aes(x=v1)) +
  geom_histogram(data=ds, aes(y=..density..), fill="blue", bins=40, alpha=0.5) +
  geom_density(data=ds, adjust=0.5)

Histogram/pdf of the diabetes dataset observations in the synthetic 1D eigenbasis

We see it’s noticeably multimodal (four or five modes). We don’t know yet what that clustering means or correlates with, so let’s see how it relates to the usual criterion for diabetes, glycosylated hemoglobin >= 7. Let’s overlay the histograms of the two groups of observations divided by that criterion:

clean.diabetes <- diabetes2[complete.cases(diabetes2[, dim]), dim]
ds$glyhb <- clean.diabetes$glyhb
ds$glyhb_c <- 0
ds[ds$glyhb>=7, "glyhb_c"] <- 1

ds_nd <- ds[ds$glyhb<7,]
ds_pd <- ds[ds$glyhb>=7,]

ggplot(ds, aes(x=v1)) +
  geom_histogram(data=ds, fill="blue", bins=40, alpha=0.6) +
  geom_histogram(data=ds_nd, fill="green", bins=40, alpha=0.6) +
  geom_histogram(data=ds_pd, fill="red", bins=40, alpha=0.6)

Histogram of the full dataset (blue), and of the subsets without (green) and with (red) glyhb >= 7, in the v1 eigenbasis

We may also look at these two subsets and at the whole dataset on a scatterplot:

ggplot(ds, aes(x=v1, colour=glyhb)) +
  geom_point(data=ds, y=.5, alpha=0.5) +
  geom_point(data=ds_nd, y=.75, alpha=0.5) +
  geom_point(data=ds_pd, y=.25, alpha=0.5) +
  scale_colour_gradientn(colours=rainbow(4))

The whole diabetes dataset (in the middle), and the subsets without (above) and with (below) glyhb >= 7, in the v1 eigenbasis

As well as descriptive statistics of these datasets:

> describe(ds[[1]])
   vars   n   mean     sd median trimmed    mad   min     max   range skew kurtosis   se
X1    1 377 508.01 192.12 504.76  505.21 192.78 92.88 1215.27 1122.39  0.2    -0.14 9.89

> describe(ds_nd[[1]])
   vars   n   mean     sd median trimmed    mad   min     max  range skew kurtosis   se
X1    1 319 478.84 177.23  482.3  476.71 182.29 92.88 1041.15 948.27 0.12    -0.31 9.92

> describe(ds_pd[[1]])
   vars  n   mean     sd median trimmed    mad    min     max   range  skew kurtosis    se
X1    1 58 668.48 193.08 694.31   671.4 169.63 149.86 1215.27 1065.41 -0.13     0.27 25.35

The difference in the means between the two subsets is just around one sigma, which does not correspond to a very impressive confidence level of statistical significance (something like 60-70%), but that’s not exactly the point. Glycosylated hemoglobin, and its level of 7 in particular, is not the ultimate and definitive criterion for type II diabetes, but just one useful estimate. Type II diabetes is not just about “blood sugar”; it’s a general metabolic condition, not necessarily a disorder per se. Some researchers hypothesise that it is an evolutionary adaptation (to a cold climate), and that when it developed, the “inconveniences” it causes in old age and in the “normal”, comfortable conditions of contemporary life were not an evolutionary priority.
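
For reference, here is a quick way to quantify that separation (a minimal sketch, not part of the original analysis): the standardized difference of the means (Cohen’s d) between the two subsets, which should come out close to one, matching the “around one sigma” estimate above:

# Standardized mean difference (Cohen's d) between the glyhb >= 7 and glyhb < 7 subsets;
# a value around 1 corresponds to the "about one sigma" separation noted above
pooled_sd <- sqrt(((nrow(ds_nd)-1)*sd(ds_nd$v1)^2 + (nrow(ds_pd)-1)*sd(ds_pd$v1)^2) /
                  (nrow(ds_nd) + nrow(ds_pd) - 2))
(mean(ds_pd$v1) - mean(ds_nd$v1)) / pooled_sd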

Anyway, the clustering of observations with the positive glyhb criterion in the last two modes (v1 >= 650) may be used as an indicator of such an adaptation, or of the likelihood of a type II diabetes condition. Let’s see how this clustering looks in the original (cholesterol, blood pressure, glycosylated hemoglobin) space:

clean.diabetes$syn <- ds[[1]]
clean.diabetes$syn_c <- 2
clean.diabetes[clean.diabetes$syn<320, "syn_c"] <- 1
clean.diabetes[clean.diabetes$syn>=650, "syn_c"] <- 3
# cloud() is the 3D scatterplot from the lattice package
cloud(chol ~ glyhb * bp.1s, clean.diabetes, groups=syn_c)

Observations grouped by the modes of the histograms above: blue – first mode, magenta – next two modes, black – the last one(s)

What we can see in this scatterplot is that the last mode clusters around the observations with the largest variability in at least one of the displayed parameters, which is a well-known risk factor for diabetes. Therefore, it may indeed make sense to investigate the criterion described above further in association with type II diabetes.

It’s easy to notice that many of the original 12 variables are dependent, and they could be grouped into 5 more or less independent buckets (glucose related, cholesterol related, blood pressure, physical characteristics, and age). Collapsing one of the dimensions in a group may effectively collapse the others, therefore it may take far fewer projections than the number of dependent dimensions to convert the synthetic dimensions into the eigenbasis. In our particular case 5 projections (and the respective basis conversions) are enough to put the data on a relatively straight line, which will be the eigenbasis vector we are looking for.
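
One way to see those buckets (a minimal sketch, not part of the original pipeline) is to look at the pairwise correlation matrix of the cleaned variables; related variables tend to show up as blocks of higher correlation:

# Pairwise correlations of the cleaned diabetes variables; related variables
# (e.g. the blood pressure pair, or the weight/waist/hip group) tend to form
# blocks of higher correlation
round(cor(clean.diabetes[, dim]), 2)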

From the computational point of view, we are looking for the state where the mean relative sigma of the coordinates across all the remaining dimensions drops below a particular value, say, 5%:

ds <- ks_lm_dim_red(diabetes2, dim, sd_dim=0.05, reorder=1)

or by setting that we want 7 dimensions left:

ds <- ks_lm_dim_red(diabetes2, dim, n_dim=7, reorder=1)

And plotting the transformed data in the first 3 of the synthetic dimensions (in the others it will look the same), we get a pretty straight line:

Our Diabetes dataset in the synthetic basis one step away from the eigenbasis transform

names(ds)[1:3] <- c("v1", "v2", "v3")
clean.diabetes <- diabetes2[complete.cases(diabetes2[, dim]), dim]
ds$glyhb <- clean.diabetes$glyhb
ds$glyhb_c <- 0
ds[ds$glyhb>=7, "glyhb_c"] <- 1
cloud(v1 ~ v2 * v3, ds, groups=glyhb_c, pretty=TRUE)

To roughly convert our dataset to the 1D eigenbasis while preserving the basis vector length (which doesn’t really matter for the eigenbasis), it is enough to multiply one of the coordinates by sqrt(n), where n is the number of remaining dimensions. This follows from requiring the vector length to be preserved in the two bases: if A = b*u = a*e1 + a*e2 + … + a*en, then AᵀA = b^2 = n*a^2, hence b = a*sqrt(n).
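
As an illustration (a minimal sketch; the ks_lm_dim_red function in Appendix 8.3 already applies this rescaling when the sd_dim stop criterion fires), the manual conversion of one of the remaining coordinates would look like this, with n = 7 taken from the n_dim=7 run above:

# Rescale one of the n nearly identical synthetic coordinates by sqrt(n)
# to approximate the 1D eigenbasis coordinate while preserving the vector length
n <- 7                        # number of remaining synthetic dimensions in the run above
v1_eigen <- ds$v1 * sqrt(n)
head(v1_eigen)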

If we stop just one step before:

ds <- ks_lm_dim_red(diabetes2, dim, n_dim=8, reorder=1)

We’ll still have some variability of data points:

Our Diabetes dataset in the synthetic basis two steps away from the eigenbasis transform

Whether we run all the projections down to the last dimension or stop by the mean sigma criterion, we’ll get practically the same distribution along the eigenbasis vector. In our case, if we stop earlier, the numerical values of the synthetic variable will be different (we’ve been lazy in the previous chapters, defining the N⁻¹ transform as a diagonal matrix with a unit last column, so after each projection the length of the basis vectors shrinks), but for the eigenbasis that’s not important. The eigenbasis coordinate will just be proportionally adjusted:

v1 = 23.70*chol+1.09*weight-18.89*hdl+18.57*hip+10.66*waist-0.51*bp.1s+0.73*glyhb+5.88*stab.glu+9.42*height+2.211*bp.1d+0.917*age

Histogram/pdf of the diabetes dataset observations in the synthetic 1D eigenbasis with the mean sigma stop parameter

We’ll play with a more intelligent Principal Component Analysis of the synthetic basis on projection later; for now, let’s take a look at another application of linear regression for the eigenbasis transform, in such a seemingly unlikely area as Natural Language Understanding…

Appendix 8.1 – Symbolic matrix by vector multiplication

#Examples:

str1 <- "2.*a-3.1*b-0.4*c-d"
str2 <- ".01*c+35*b+3.24*e-5.*d"
# Symbolic multiplication of string 'str1' by number 2.5
sym_lc_mult(str1, 2.5)
# Symbolic sum of the 2 strings 'str1' and 'str2'
sym_lc_add(str1, str2)

str_v <- c("a+.5*b", "b-2*c", "3*c+d")
str_u <- c("", "", "")
m <- matrix(c(1,2,3,4,5,6,7,8,9), 3, byrow=TRUE)
# Symbolic multiplication of numeric matrix 'm' by string vector 'str_v',
# result is a string vector 'str_u'
str_u <- matrix_symvect_mult(m, str_v)
str_u

# Symbolic multiplication of numeric matrix 'm' by string vector 'str_v',
# result is a string vector
matrix_symvect_mult <- function(m, str_v){
  nr <- nrow(m)
  nc <- ncol(m)
  str_u <- rep_len("", nr)
  for(i in 1:nr){
    for(j in 1:nc){
      str_u[i] <- sym_lc_add(str_u[i], sym_lc_mult(str_v[j], m[i,j]))
    }
  }

  str_u
}

# Symbolic sum of 2 strings
sym_lc_add <- function(str1, str2){
  str_out = NULL
  str_t1 <- gsub("-", "+-", str1)
  l_tup1 <- strsplit(strsplit(str_t1, "[+]")[[1]], "[*]")
  str_t2 <- gsub("-", "+-", str2)
  l_tup2 <- strsplit(strsplit(str_t2, "[+]")[[1]], "[*]")

  for(item1 in l_tup1){
    item1 <- item_conv(item1)
    found = FALSE

    for(item2 in l_tup2){
      item2 <- item_conv(item2)

      # item in both strings
      if(identical(item1[2], item2[2])){
        found = TRUE
        mult = as.double(item1[1]) + as.double(item2[1])

        if(mult != 0){
          if(!is.null(str_out))
            str_out = paste(str_out, "+", sep="")

          str_out = paste(str_out,
                          as.character(mult),
                          "*", item1[2], sep="")
        }
      }

    }

    # item in first string but not in second
    if(!found){
      if(as.double(item1[1]) != 0){
        if(!is.null(str_out))
          str_out = paste(str_out, "+", sep="")

        str_out = paste(str_out, item1[1],
                        "*", item1[2], sep="")
      }
    }
  }

  # item in second string but not in first
  for(item2 in l_tup2){
    item2 <- item_conv(item2)
    found = FALSE

    for(item1 in l_tup1){
      item1 <- item_conv(item1)

      # item in both strings - not processing
      if(identical(item1[2], item2[2])){
        found = TRUE
      }
    }

    if(!found){
      if(as.double(item2[1]) != 0){
        if(!is.null(str_out))
          str_out = paste(str_out, "+", sep="")

        str_out = paste(str_out, item2[1],
                        "*", item2[2], sep="")
      }
    }
  }

  if(is.null(str_out))
    str_out = "0"

  str_out2 <- gsub("[+]-", "-", str_out)
  str_out2
}

# Symbolic multiplication of string 'str' by number m
sym_lc_mult <- function(str, m=1){
  str_t <- gsub("-", "+-", str)
  l_tup <- strsplit(strsplit(str_t, "[+]")[[1]], "[*]")

  res <- lapply(l_tup, sym_var_mult, m)

  str_out = NULL
  for(item in res){
    if(!identical(as.double(item[1]), 0)){
      if(!is.null(str_out))
        str_out = paste(str_out, "+", sep="")

      str_out = paste(str_out, item[1], "*", item[2], sep="")
    }
  }

  if(is.null(str_out))
    str_out = "0"

  str_out2 <- gsub("[+]-", "-", str_out)
  str_out2
}

# Normalize a parsed term to a (coefficient, variable) pair: bare variables get
# coefficient 1 or -1; bare numbers and empty terms are mapped to ("0", "0")
item_conv <- function(item){
  if(length(item)==1){
    if(is.na(as.double(item[1]))){
      if(identical(substr(item[1],1,1), "-"))
        item = c("-1", sub("-", "", item[1]))
      else
        item = c("1", item[1])
    }
    else
      item = c("0", "0")
  }
  else if(length(item)==0){
    item = c("0", "0")
  }

  item
}

# Multiply a single (coefficient, variable) term by number m
sym_var_mult <- function(item, m=1){
  item <- item_conv(item)
  c(as.character(as.double(item[1])*m), item[2])
}

Appendix 8.2 – Linear regression model

# Linear univariate regression that minimizes rss
# of variable 'yname' of the 'df' dataset
ks_lm <- function(df, yname, xnames, full=1, error=0){
  all.names = c(unlist(xnames), unlist(yname))
  clean.df <- df[complete.cases(df[, all.names]), all.names]

  nr <- nrow(clean.df)
  nc <- ncol(clean.df)

  model <- list()
  model$X <- matrix(do.call(rbind, clean.df), nc)
  rownames(model$X) <- all.names

  x <- clean.df
  x[, yname] <- NULL
  x[, nc] <- rep_len(1, nr)

  y <- clean.df[, yname]

  X <- matrix(do.call(cbind, x), nrow(x))

  b1 <- solve(t(X) %*% X) %*% t(X) %*% y

  if(full | error){
    e <- X %*% b1 - y
    model$e2 <- t(e) %*% e
    model$e2n <- model$e2/nr
    model$en <- sqrt(model$e2)/nr
  }

  if(full){
    # Projection onto the 0 member of quotient space X/Y
    model$R <- diag(rep_len(1, nc))
    model$R[nc, 1:nc] <- b1
    model$R[nc, nc] <- 0

    # X to Y basis rotation
    model$Nm1 <- diag(rep_len(1, nc))
    model$Nm1[, nc] <- rep_len(1, nc)
    model$Nm1 <- model$Nm1[-nc, ]

    model$h0 <- rep_len(0, nc)
    model$h0[nc] <- b1[ncol(b1)]

    model$X0 <- model$R %*% model$X
    rownames(model$X0) <- all.names
    model$Xl <- model$X0[, 1:nr] + model$h0
    rownames(model$Xl) <- all.names

    model$Nm1R <- model$Nm1 %*% model$R
    model$DimNames <- matrix_symvect_mult(model$Nm1R, all.names)

    model$Y <- model$Nm1 %*% model$X0
    rownames(model$Y) <- model$DimNames
  }

  model
}
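
For reference, a minimal usage sketch of ks_lm (assuming the diabetes2 data frame and the dim list from the main text are in scope; in the main pipeline the variables are first normalized to their ranges, which is skipped here):

# Regress glyhb on the remaining variables and inspect the model pieces
mod <- ks_lm(diabetes2, "glyhb", setdiff(dim, "glyhb"))
mod$en          # normalized residual error of the fit
mod$DimNames    # symbolic names of the new quotient-space dimensions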

Appendix 8.3 – Dimensionality reduction via linear regression

# Reduce dimensionality of dataframe 'df' on the variables listed in 'dim'
# until 'n_dim' dimensions are left
ks_lm_dim_red <- function(df, dim, n_dim=1, reorder=1, sd_dim=0, hb=1){
  clean.df <- df[complete.cases(df[, dim]), dim]

  # normalize variables by their ranges
  n_ds <- lapply(dim, norm_ds, clean.df)
  names(n_ds) <- dim
  norm.df <- as.data.frame(n_ds)

  # order variables by the rss of their univariate regressions
  v_dim <- sapply(dim, norm_rss_ds, norm.df)
  i_dim <- order(v_dim[2,])
  order.df <- norm.df[, i_dim]

  # run univariate regressions for the variables in rss order
  # until n_dim dimensions are left
  ds <- order.df
  n_max <- length(dim) - n_dim
  i <- 0
  while(i < n_max){

    if(hb)
      print(i)

    resp_name <- colnames(ds)[1]
    pred_names <- colnames(ds)[-1]

    mod <- ks_lm(ds, resp_name, pred_names)
    ds <- as.data.frame(t(mod$Y))

    i <- i + 1
    if(i >= n_max)
      break

    # check if we effectively have an eigenvector already
    # (all coordinates are the same)
    if(sd_dim > 0){
      m_sd <- mean(sapply(as.data.frame(t(ds)), norm_sd))
      print(m_sd)
      if(m_sd < sd_dim){
        # convert the length to the 1D eigenbasis
        col_name <- names(ds)[1]
        ds <- as.data.frame((ds[,1] * sqrt(length(dim)-i)))
        names(ds)[1] <- sym_lc_mult(col_name, sqrt(length(dim)-i))
        break
      }
    }

    # reorder the synthetic variables by rss on each iteration
    if(reorder){
      tmp_dim <- colnames(ds)
      v_dim <- sapply(tmp_dim, norm_rss_ds, ds)
      i_dim <- order(v_dim[2,])
      ds <- ds[, i_dim]
    }

  }

  ds
}

# Create list of rss for univariate regression on each normalized variable
norm_rss_ds <- function(item, norm.diabetes){
  i_range <- range(norm.diabetes[,item])
  norm.diabetes[,item] <- norm.diabetes[,item]/(i_range[2]-i_range[1])

  predict_dim <- colnames(norm.diabetes)
  predict_dim <- predict_dim[which(predict_dim!=item)]

  mod <- ks_lm(norm.diabetes, item, predict_dim, full=0, error=1)

  dim_rss <- c(item, mod$en)
  dim_rss
}

# Normalize a variable by its range
norm_ds <- function(item, norm.diabetes){
  i_range <- range(norm.diabetes[,item])

  norm_col <- (norm.diabetes[,item]-i_range[1])/(i_range[2]-i_range[1])
  norm_col
}

# Normalize st.dev. by mean
norm_sd <- function(item){
  sd(item)/mean(item)
}
