Data Science Done Right (the Kitchen Style) #7

We just created, in the previous post, a simple univariate, multivariable regression function ks_lm0 that calculates regression slopes and intercept.

However, it would be interesting not only to find out what our quotient (or factor) space X/Y would be (which is defined by the bi (i !=0 ) regression coefficients), and what particular equivalence relation R would be of our dataset A with its image B on Yl we project it to, and what the projection function (T or R) we have to Yl and Y0 spaces respectively, but also look at what our projected dataset B becomes in the new space of the reduced dimensionality Yl (or the same in Y0, because we found out before that in the linear space d=f). Actually that is the main goal of the whole exercise we just done – we projected our data from the space with the less “convenient” structure into one with a more “convenient” (with the reduced dimensionality) structure.  BTW, that thing is a part of Factor Analysis (no surprise we are dealing with factor spaces).

What we want to find (see previous chapter) is the transform M. We just found (in ks_lm0) the above mentioned transform R (which is the diagonal matrix with one row made from bi coefficients (!= 0)), and we know that NM=R, i.e. M=N-1R. In general case, if we want to rotate uv basis relative to the ij one (BTW, which is a part of Principal Component Analysis) N-1 transform would be all rows, except one 0 row matrix, but we won’t do it now, and save it for later, and just declare that basis for our Space of new “synthetic variables” Yl would be u=ai+bk, v=cj+dk, and even a=b=c=d=1. I.e. we do not rotate new basis (yet), and do not significantly scale the unit vectors length (only do that trigonometrically). Therefore N-1:
(1 0 … 1)
(0 1 … 1)
(.  .  …  .)
(0 0 … 0)

So, let us stop for a moment at a more visually “comprehendable” 3D representation of our data, where we look only at glucosylated hemoglobin (glyhb), cholesterol/hdl (rati)o and systolic blood pressure (bp.1s) dimensions, and see how our data would look in 3D and 2D representation, after we have collapsed glyhb dimension, and created 2 synthetic variables of the linear combination of glyhb & ratio and glyhb & bp.1s. To do that we will enhance our new ks_lm function to not only calculate slopes and interception as in ks_lm0, but also transformation matrices R and N-1, and the very transformed datasets Xl and Yl.

First, we use lattice package for 3D scatter plot of the original (blue) data X in coordinates V3=glyhb, V1=ration, V2=bp.1s, and projected (magenta) data Xl  into quotient space member Yl (regression plane) and then converted back to the X space basis:

load(url(“http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/diabetes.sav”))

Scatter plot of X (original) and Xl (regressed) data in X space

ks_model <- ks_lm( diabetes, c(“glyhb”), c(“ratio”,”bp.1s”) )

xdf <- as.data.frame(t(ks_model$X))
xdf[, “V4”] <- rep_len(1, nrow(ydf))

xldf <- as.data.frame(t(ks_model$Xl))
xldf[, “V4”] <- rep_len(2, nrow(ydf))

xxldf <- rbind(xdf, xldf)

cloud(V3 ~ V1*V2, xxldf, groups=V4)

Scatter plot of X (original) and Xl (regressed) data in X space (view from another direction)

Scatter plot of X (original) and Xl (regressed) data in X space (and yet another direction)

And then, we use ggplot2 package to draw a 2D scatter plot of our new synthetic variables in the Yl space:

Projection of our data to Glyhb-bp.1s (black), and to Yl synthetic basis

ydf <- as.data.frame(t(ks_model$Y))
ydf[, “V3”] <- rep_len(2, nrow(ydf))

xdf <- as.data.frame(t(ks_model$X))
xdf[, “V3”] <- rep_len(1, nrow(ydf))

xydf <- rbind(ydf, xdf)

ggplot(xydf, aes(x=V1, y=V2, colour=V3))+
geom_point(alpha=0.5)

Now, having visually seen how our linear regression function work on 3D data, and how it may be used for dimensionality/factor reduction, and where in it is a place for principle component manipulation, let us proceed playing with it on a larger dimension data…

Appendix 

ks_lm <- function (df, yname, xnames){
all.names = c( unlist(xnames), unlist(yname) )
clean.df <- df[complete.cases( df[, all.names] ), all.names]

nr <- nrow(clean.df)
nc <- ncol(clean.df)

model <- list()
model$X <- matrix(do.call(rbind,clean.df), nc)

x <- clean.df
x[, yname] <- NULL
x[, nc] <- rep_len(1, nr)

y <- clean.df[, yname]

X <- matrix( do.call(cbind,x), nrow(x) )

b1 <- solve(t(X) %*% X) %*% t(X) %*% y

#Projection onto 0 member of quotient space X/Y
model$R <- diag(rep_len(1, nc))
model$R[nc,1:nc] <- b1
model$R[nc,nc] <- 0

#X to Y Basis rotation
model$Nm1 <- diag(rep_len(1, nc))
model$Nm1[,nc] <- rep_len(1, nc)
model$Nm1 <- model$Nm1[-nc,]

model$h0 <- rep_len(0, nc)
model$h0[nc] <- b1[ncol(b1)]

model$X0 <- model$R %*% model$X
model$Xl <- model$X0[, 1:nr] + model$h0

model$Y <- model$Nm1 %*% model$X0

model
}

Advertisements
This entry was posted in Data Science Done Right and tagged , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s