Data Science Done Right (the Kitchen Style) #9

Picking up loose ends…

When we previously talked about the transformation matrix N-1 – the matrix of the basis change from the original basis to a basis of the quotient-space element we project our dataset onto – we picked the simplest model just to keep the computational ball rolling, planning to come back and address a more reasonable choice later. Before we start playing with word vector regressions, let's improve our transformation into a bit more useful form.

Our N-1 matrix was defined as:

(a 0 .. b)
(0 c .. d)
(.. .. .. ..)
(0 0 .. 0)

and we decided to go with coefficients a=b=c=d=..=1  for simplicity.

Let's see what transformation that matrix defines. And let's do it in 3D space for easier understanding; then, when we come up with better matrix coefficients, we'll generalize it back to the higher-dimensional case.

Those matrices that convert bases define how unit vectors of these bases relate (expressed via linear combination) to each other. Therefore the following matrix:
N-1          x             y
(1 0 1)   (1) i     (1*1 + 1*1) u
(0 1 1)   (0) j  =  (1*1)       v
(0 0 0)   (1) k

basically says that if we take the projection x = 1*i + 0*j + 1*k of a unit vector onto the ik plane (k being the dimension we are collapsing), then the length of the transformed (rotated-to) vector y with respect to the basis vector u (the same holds for v) is additively composed of the lengths of x with respect to the basis vectors i and k. The norm of the resulting vector with respect to u will be ||y||^2 = y_u^2 = 2*2 = 4, or ||y|| = 2, so the unit vector with respect to the basis u will be twice as short as the transform of the vector that has unit components with respect to the basis ik. Meanwhile ||x||^2 = xT*x = 1*1 + 1*1 = 2, or ||x|| = sqrt(2), so ||y||/||x|| = sqrt(2), which leads to a "dimension inflation" – the unit vector of each synthetic dimension shrinks (or, equivalently, the coefficients of the synthetic bases grow) after each regression step (see Fig. 9.1). That is obviously inconvenient if we want to compare distributions between regression steps.
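As a quick sanity check of that inflation factor, here is a minimal R sketch (hypothetical toy numbers, not our dataset) applying the old all-ones N-1 (with its zero row for the collapsed dimension dropped, as in the appendix code) to the unit vector's projection in the ik plane:

# Old N-1 with a=b=c=d=1 (3D case), acting on x = 1*i + 0*j + 1*k
Nm1 <- matrix(c(1, 0, 1,
                0, 1, 1), nrow = 2, byrow = TRUE)
x <- c(1, 0, 1)
y <- Nm1 %*% x     # components of y in the uv basis: (2, 1)

sqrt(sum(x^2))     # ||x|| = sqrt(2)
y[1]               # u-component of y is 2, so the inflation along u is 2/sqrt(2) = sqrt(2)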

We could have set the a, b, c, d, etc. coefficients in the N-1 matrix to something other than 1, for example keeping a=b and c=d but setting a != c, also taking into account that the transformation R (the regression's projection) shrinks the unit coefficient of k to t31 (in this context we are looking at the transform of the unit vector's projection x in the ik plane), and asking for ||y|| = 1:

N-1           x                y
(a 0 a)    (1)   i     (a*1 + a*t31) u
(0 c c)    (0)   j  =  (c*t31)       v
(0 0 0)    (t31) k

then: ||y||^2=a^2*(1+t31)^2=1; a=sqrt(1/(1+t31)^2)

However, we still assign the same weight a to the i and k components when composing the length of u (even though we take only the projected part t31 of the original unit vector k), which, if we don't have a particular reason for doing so, will distort the contribution of the original dimensions i and k to the synthetic u. Therefore it makes sense to use the projections of the unit vector's i and k components onto our projected dimension u to compose the length of the synthetic unit vector:

N-1                     x              y
(d*a' 0 d*b'*t31)    (1) i      (d*a'*1 + d*b'*t31) u
( ..   ..   ..   )    (0) j  =   ( .. )               v
( ..   ..   ..   )    (1) k

Where: ||y||^2 = d^2*(a' + b'*t31)^2 = 1;
a' = 1*cos(alpha); b'*t31 = sin(alpha)*t31;
sin(alpha) = t31/sqrt(1+t31^2); cos(alpha) = 1/sqrt(1+t31^2);
then: d = 1/sqrt(1+t31^2); d*a' = 1/(1+t31^2); d*b'*t31 = t31^2/(1+t31^2)
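In code these coefficients depend only on the regression slope of the collapsed dimension; the following sketch mirrors how the updated ks_lm in Appendix 9.1 builds them (t31 here is a hypothetical example value standing in for one slope from b1):

# New N-1 row weights for a collapsed dimension with slope t31
t31  <- 0.4                      # example slope value
d_a  <- 1 / (1 + t31^2)          # weight of the original i component: d*a'
d_bt <- t31^2 / (1 + t31^2)      # weight of the projected k component: d*b'*t31
d_a + d_bt                       # u-component of the transformed unit vector is 1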

Now, with the new N-1 matrix, we get consistent, range-normalized distributions of the synthetic variables at each regression step, and the multi-modality of the synthetic eigenbasis distribution is even more pronounced:

Fig. 9.3. Histogram/pdf of diabetes dataset in normalized synthetic 1D eigenbasis

with the synthetic eigenbasis vector v1 = -4.46389046671904*bp.1s - 0.000335763837256707*hip + 0.0197350621146874*waist + 0.0266855254863463*bp.1d + 0.00308836195333824*height + 0.866182912400952*age + 0.00535941186779068*stab.glu - 0.00021277667540017*hdl + 0.000649570489160293*chol + 0.00146620009377012*glyhb

where coefficients less than 1e-04 were rounded down to 0 (for weight and ratio).

Fig. 9.4. Histogram/pdf of diabetes dataset in normalized synthetic 1D eigenbasis for groups with glyhb < 7 (green), and glyhb > 7 (red)

That is our current model (though without rotating the synthetic basis to get rid of the variable dependency), where: glyhb – Glycosylated Hemoglobin, chol – Total Cholesterol, hdl – High Density Lipoprotein, ratio – Cholesterol/HDL ratio, stab.glu – Stabilized Glucose, bp.1s – Systolic Blood Pressure, bp.1d – Diastolic Blood Pressure, while the other parameters' names are self-explanatory (height, hip and waist are measured in inches, weight in pounds). The standardized value of a variable is computed as sv = (v - min)/range, where min and range are taken from the table below (see the short R sketch after it):

Fig. 9.5. Histogram/pdf of diabetes dataset in normalized synthetic 1D eigenbasis for groups with bp.1s < 150 (green), and bp.1s > 150 (red)

vars        min     max     range
glyhb        2.68   16.11    13.43
chol        78.00  443.00   365.00
hdl         12.00  120.00   108.00
ratio        1.50   19.30    17.80
stab.glu    48.00  385.00   337.00
bp.1s       90.00  250.00   160.00
bp.1d       48.00  124.00    76.00
age         19.00   92.00    73.00
height      52.00   76.00    24.00
weight      99.00  325.00   226.00
waist       26.00   56.00    30.00
hip         30.00   64.00    34.00
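The table above can be reproduced, and the standardization applied, with a few lines of base R; a minimal sketch, assuming the diabetes data frame and the dim variable list from chapter #8:

clean.diabetes <- diabetes[complete.cases(diabetes[, dim]), dim]

# min/max/range table for the selected variables
rng <- t(sapply(clean.diabetes, range))
colnames(rng) <- c("min", "max")
cbind(rng, range = rng[, "max"] - rng[, "min"])

# standardized value sv = (v - min)/range, as norm_ds does in Appendix 8.3
sv <- lapply(clean.diabetes, function(v) (v - min(v)) / (max(v) - min(v)))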

 

Fig. 9.6. Histogram/pdf of diabetes dataset in normalized synthetic 1D eigenbasis for groups with chol < 250 (green), and chol > 250 (red)

A v1 value greater than 0.35 seems to be the threshold between the first three modes of the distribution, which cluster data with low values in the cholesterol, glucose and blood pressure groups, and the last two or three modes, which cluster data with high such values, as we can see in Figs. 9.4-9.6 with overlaid histograms for the groups with low/high glyhb (> 7, Fig. 9.4), bp.1s (> 150, Fig. 9.5), and chol (> 250, Fig. 9.6).

Appendix 9.1 

# Linear univariate regression that minimizes rss
# of variable 'yname' of 'df' dataset
ks_lm <- function (df, yname, xnames, full=1, error=0){
  all.names <- c(unlist(xnames), unlist(yname))
  clean.df <- df[complete.cases(df[, all.names]), all.names]

  nr <- nrow(clean.df)
  nc <- ncol(clean.df)

  model <- list()
  model$X <- matrix(do.call(rbind, clean.df), nc)
  rownames(model$X) <- all.names

  x <- clean.df
  x[, yname] <- NULL
  x[, nc] <- rep_len(1, nr)

  y <- clean.df[, yname]

  X <- matrix( do.call(cbind, x), nrow(x) )

  b1 <- solve(t(X) %*% X) %*% t(X) %*% y

  if(full | error){
    e <- X %*% b1 - y
    model$e2 <- t(e) %*% e
    model$e2n <- model$e2/nr
    model$en <- sqrt(model$e2)/nr
  }

  if(full){
    # Projection onto 0 member of quotient space X/Y
    model$R <- diag(rep_len(1, nc))
    model$R[nc, 1:nc] <- b1
    model$R[nc, nc] <- 0

    # X to Y basis conversion weighted by the slopes (see derivation above):
    # diagonal entries 1/(1+t^2), last column t^2/(1+t^2)
    da1 <- 1/(1 + as.vector(b1)**2)
    model$Nm1 <- diag(da1)
    dbt1 <- 1/(1/as.vector(b1)**2 + 1)
    model$Nm1[, nc] <- dbt1

    model$Nm1 <- model$Nm1[-nc, ]

    model$h0 <- rep_len(0, nc)
    model$h0[nc] <- b1[nrow(b1)]    # intercept (the last coefficient of b1)

    model$X0 <- model$R %*% model$X
    rownames(model$X0) <- all.names
    model$Xl <- model$X0[, 1:nr] + model$h0
    rownames(model$Xl) <- all.names

    model$Nm1R <- model$Nm1 %*% model$R
    model$DimNames <- matrix_symvect_mult(model$Nm1R, all.names)

    model$Y <- model$Nm1 %*% model$X0
    rownames(model$Y) <- model$DimNames
  }

  model
}

# Symbolic multiplication of a (coefficient, name) tuple by number m,
# rounding coefficients smaller than 1e-04 down to 0
sym_var_mult <- function(item, m=1){
  item <- item_conv(item)

  mult_res <- as.double(item[1])*m
  if(abs(mult_res) < 1e-04)
    mult_res <- 0

  c(as.character(mult_res), item[2])
}


Data Science Done Right (the Kitchen Style) #8

Using linear regression for dimensionality reduction

In our linear model, designed and developed (in R) in the previous chapters, we calculate the new images of the data – projected onto the chosen element of the quotient space that satisfies a particular requirement (we use minimal SSE) – not only in the original space basis, but also in a basis of the quotient subspace. However, we will need one more function that calculates, in symbolic form, the new synthetic basis vectors composed as linear combinations of the original space basis vectors. Basically, we need to implement a symbolic multiplication of the matrix N-1R of our transformation into the quotient space element by a symbolic vector of the variable names (see those functions in Appendix 8.1).

Having the symbolic matrix-by-vector multiplication functions ready, we may start playing with our dataset, trying to reduce its high dimensionality to a more manageable and comprehensible lower-dimensional image. As we mentioned before, regression is a way to reduce the dimensionality of a set, at the price, of course, of losing complexity of the data Structure. We try to minimize that loss of continuity of the transformation function (the projection we use in the regression) using some criterion (a loss function) – in our case SSE. So far we have a univariate multivariable regression function ready, therefore we can collapse dimensions consecutively, one by one. With this approach we need to decide how many dimensions we want to collapse and/or in what order.

There could be multiple approaches and criteria for that, and we may come back to this discussion later, but for now let's start with simple ones. For the obvious reason of saving as much Structural complexity (i.e. continuity of the transformation) as possible, we may want to collapse first those dimensions whose projections give the minimal value of the loss function of our choice (of course, we first normalize those dimensions to their ranges), as sketched below. On the other hand, the order may not be so important if we go for the ultimate dimensionality reduction to a 1D space, basically making its basis vector an eigenvector.
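A minimal sketch of that ordering step, mirroring what ks_lm_dim_red (Appendix 8.3) does before the first projection, and assuming the diabetes2 data frame and the dim variable list defined in the next snippet (helpers norm_ds and norm_rss_ds come from Appendix 8.3):

# Rank the range-normalized variables by the error of a univariate
# regression of each one on all the others
clean.df <- diabetes2[complete.cases(diabetes2[, dim]), dim]

n_ds <- lapply(dim, norm_ds, clean.df)
names(n_ds) <- dim
norm.df <- as.data.frame(n_ds)

v_dim <- sapply(dim, norm_rss_ds, norm.df)   # row 1: variable name, row 2: per-point error
order.df <- norm.df[, order(v_dim[2, ])]     # collapse the smallest-error variable first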

Let's take the Diabetes dataset we used for experimentation in the previous chapter, which has 12 dimensions (obviously not independent ones), and run consecutive univariate regressions until one dimension is left:

load(url("http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/diabetes.sav"))
# Dimensionality reduction
diabetes2 <- diabetes
dim <- c("glyhb", "chol", "hdl", "ratio", "stab.glu",
         "bp.1s", "bp.1d", "age", "height", "weight", "waist", "hip")
ds <- ks_lm_dim_red(diabetes2, dim, n_dim=1, reorder=1)

Now we can see what our synthetic eigenvector variable will be:

resp_name <- names(ds)[1]
resp_name
[1] "573.865492043754*chol+25.8440157742691*weight-507.345171188756*hdl+455.427566923276*hip+251.259082865673*waist-12.0580460009216*bp.1s+17.2494522189963*glyhb+142.022683378986*stab.glu+233.150806921495*height+59.1169607418513*bp.1d+29.0382173062296*age"

At first look, not many surprises – cholesterol is really bad, HDL ("good cholesterol") is good, hip circumference is bad as well, while waist circumference is about half as bad; surprisingly, height is noticeably worse than weight, and age and blood pressure have less importance than the other factors, though high values are still bad. But, anyway, that's only a formal statistical model.

Now, let's see how the data distribution looks in our new synthetic 1D eigenbasis v1:

names(ds)[1] <- "v1"
ggplot(ds, aes(x=v1))+
  geom_histogram(data=ds, aes(y=..density..), fill="blue", bins=40, alpha=0.5)+
  geom_density(data=ds, adjust=0.5)

Histogram/pdf of diabetes dataset observation in synthetic 1D eigenbasis

We see it's noticeably multimodal (four- or five-modal). We don't know yet what that clusterization means and correlates with, so let's see how it relates to the usual criterion for diabetes, glycosylated hemoglobin > 7. Let's overlay the histograms of the two groups of observations divided by that criterion:

clean.diabetes <- diabetes2[complete.cases(diabetes2[, dim]), dim]
ds$glyhb <- clean.diabetes$glyhb
ds$glyhb_c <- 0
ds[ds$glyhb>=7, "glyhb_c"] <- 1

ds_nd <- ds[ds$glyhb<7,]
ds_pd <- ds[ds$glyhb>=7,]

ggplot(ds, aes(x=v1))+
  geom_histogram(data=ds, fill="blue", bins=40, alpha=0.6)+
  geom_histogram(data=ds_nd, fill="green", bins=40, alpha=0.6)+
  geom_histogram(data=ds_pd, fill="red", bins=40, alpha=0.6)

Histogram of the full dataset (blue), and datasets w/o (green), and w/ glyhb >= 7 in v1 eigenbasis

We may also look at these two subdatasets and the whole dataset on a scatterplot:

ggplot(ds, aes(x=v1, colour=glyhb))+
  geom_point(data=ds, y=.5, alpha=0.5)+
  geom_point(data=ds_nd, y=.75, alpha=0.5)+
  geom_point(data=ds_pd, y=.25, alpha=0.5)+
  scale_colour_gradientn(colours=rainbow(4))

Whole diabetes dataset (in the middle), and subsets w/o (above), and w/ glyhb >= 7 (below) in v1 eigenbasis

As well as the descriptive statistics of these datasets:

> describe(ds[[1]])
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 377 508.01 192.12 504.76 505.21 192.78 92.88 1215.27 1122.39 0.2 -0.14 9.89

> describe(ds_nd[[1]])
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 319 478.84 177.23 482.3 476.71 182.29 92.88 1041.15 948.27 0.12 -0.31 9.92

> describe(ds_pd[[1]])
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 58 668.48 193.08 694.31 671.4 169.63 149.86 1215.27 1065.41 -0.13 0.27 25.35

The difference in the means between the two subsets is just around one sigma, which does not give a very impressive confidence level of statistical significance (around 60-70%), but that's not exactly the point. Glycosylated hemoglobin and its level of 7 is not the ultimate and definitive diabetes type II criterion, just one useful estimate. Diabetes type II is not just about "blood sugar"; it's a general metabolic condition, not necessarily a disorder per se. Some researchers hypothesise it's an evolutionary adaptation (to a cold climate), and when it developed, the "inconveniences" it causes in old age and in the "normal", comfortable conditions of contemporary life were not an evolutionary priority.

Anyway, that clustering of observations with the positive glyhb criterion at the last two modes (v1 > 650) may be used as an indicator of such an adaptation, or of the likelihood of the diabetes type II condition. Let's see how this clustering looks in the original (cholesterol - blood pressure - glycosylated hemoglobin) space:

clean.diabetes$syn <- ds[[1]]
clean.diabetes$syn_c <- 2
clean.diabetes[clean.diabetes$syn<320, "syn_c"] <- 1
clean.diabetes[clean.diabetes$syn>=650, "syn_c"] <- 3
cloud(chol ~ glyhb * bp.1s, clean.diabetes, groups=syn_c)

Observations grouped by the modes of the histograms above: blue – first mode, magenta – next two modes, black – the last one(s)

What we can see on this scatterplot is that the last mode clusters around observations with the largest variability in at least one of the displayed parameters, which is a well-known risk factor for diabetes. Therefore it may indeed make sense to investigate the above-described criterion further in association with diabetes type II.

It's easy to notice that many of the original 12 variables are dependent and could be grouped into 5 more or less independent buckets (glucose related, cholesterol related, blood pressure, physical characteristics, and age). Collapsing one of the dimensions in a group may lead to collapsing the others, therefore it may take far fewer projections than the number of dependent dimensions to convert the synthetic dimensions into the eigenbasis. For our particular case 5 projections (and the respective basis conversions) are enough to put the data on a relatively straight line, which will be the eigenbasis vector we are looking for.

From the computational point of view, we are looking for a state where the mean normalized sigma of the coordinates across all remaining dimensions is less than a particular value, say 5%:

ds <- ks_lm_dim_red(diabetes2, dim, sd_dim=0.05, reorder=1)

or setting that we want 7 dimensions left:

ds <- ks_lm_dim_red(diabetes2, dim, n_dim=7, reorder=1)

And plotting the transformed data in the first 3 of the synthetic dimensions (in the others it will be the same), we’ll get a pretty straight line:

Our Diabetes dataset in the synthetic basis one step away from the eigenbasis transform

names(ds)[1:3] <- c("v1", "v2", "v3")
clean.diabetes <- diabetes2[complete.cases(diabetes2[, dim]), dim]
ds$glyhb <- clean.diabetes$glyhb
ds$glyhb_c <- 0
ds[ds$glyhb>=7, "glyhb_c"] <- 1
cloud(v1 ~ v2 * v3, ds, groups=glyhb_c, pretty=TRUE)

To roughly convert our dataset into the 1D eigenbasis while preserving the basis vector length (which doesn't really matter for an eigenbasis), it is enough to multiply one of the coordinates by sqrt(n), where n is the number of remaining dimensions (assuming vector length preservation in the two bases: A = b*u = a*i + a*j + a*k + … + a*n, so AT*A = b^2 = n*a^2).
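A minimal sketch of that final length adjustment (it is what ks_lm_dim_red in Appendix 8.3 does when the mean-sigma stop fires), assuming ds is the almost-eigenbasis output of the reduction:

# Scale one of the (practically identical) coordinates by sqrt(n) to get
# the rough 1D eigenbasis coordinate with the vector length preserved
n_left <- ncol(ds)
v1_adj <- ds[, 1] * sqrt(n_left)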

If we stop just one step before:

ds <- ks_lm_dim_red(diabetes2, dim, n_dim=8, reorder=1)

We’ll still have some variability of data points:

Our Diabetes dataset in the synthetic basis two steps away from the eigenbasis transform

Whether we run all the projections down to the last dimension or stop by the mean sigma criterion, we'll have practically the same distribution along the eigenbasis vector. In our case, if we stop earlier, the numerical values of the synthetic variable will be different (we were lazy in the previous chapters, defining the N-1 transform as an identity-diagonal matrix with a unit last column, so after each projection the length of the basis vectors shrinks), but for an eigenbasis that's not important. The eigenbasis coordinate will just be proportionally adjusted:

v1 = 23.70*chol+1.09*weight-18.89*hdl+18.57*hip+10.66*waist-0.51*bp.1s+0.73*glyhb+5.88*stab.glu+9.42*height+2.211*bp.1d+0.917*age

Histogram/pdf of diabetes dataset observation in synthetic 1D eigenbasis with the mean sigma stop parameter

We'll play with a more intelligent Principal Component Analysis of the synthetic basis on projection later; for now let's take a look at another application of linear regression for the eigenbasis transform, in particular in such a seemingly unlikely area as Natural Language Understanding…

Appendix 8.1 – Symbolic matrix by vector multiplication

# Examples:

str1 <- "2.*a-3.1*b-0.4*c-d"
str2 <- ".01*c+35*b+3.24*e-5.*d"
# Symbolic multiplication of string 'str1' by number 2.5
sym_lc_mult(str1, 2.5)
# Symbolic sum of 2 strings 'str1' and 'str2'
sym_lc_add(str1, str2)

str_v <- c("a+.5*b", "b-2*c", "3*c+d")
str_u <- c("", "", "")
m <- matrix(c(1,2,3,4,5,6,7,8,9), 3, byrow=TRUE)
# Symbolic multiplication of numeric matrix 'm' by string vector 'str_v',
# result is a string vector 'str_u'
str_u <- matrix_symvect_mult(m, str_v)
str_u

# Symbolic multiplication of numeric matrix 'm' by string vector 'str_v',
# result is a string vector
matrix_symvect_mult <- function(m, str_v){
  nr <- nrow(m)
  nc <- ncol(m)
  str_u <- rep_len("", nr)
  for(i in 1:nr){
    for(j in 1:nc){
      str_u[i] <- sym_lc_add(str_u[i], sym_lc_mult(str_v[j], m[i,j]))
    }
  }

  str_u
}

# Symbolic sum of 2 strings
sym_lc_add <- function(str1, str2){
  str_out = NULL
  str_t1 <- gsub("-", "+-", str1)
  l_tup1 <- strsplit(strsplit(str_t1, "[+]")[[1]], "[*]")
  str_t2 <- gsub("-", "+-", str2)
  l_tup2 <- strsplit(strsplit(str_t2, "[+]")[[1]], "[*]")

  for(item1 in l_tup1){
    item1 <- item_conv(item1)
    found = FALSE

    for(item2 in l_tup2){
      item2 <- item_conv(item2)

      # item in both strings
      if(identical(item1[2], item2[2])){
        found = TRUE
        mult = as.double(item1[1]) + as.double(item2[1])

        if(mult != 0){
          if(!is.null(str_out))
            str_out = paste(str_out, "+", sep="")

          str_out = paste(str_out,
                          as.character(mult),
                          "*", item1[2], sep="")
        }
      }

    }

    # item in first string but not in second
    if(!found){
      if(as.double(item1[1]) != 0){
        if(!is.null(str_out))
          str_out = paste(str_out, "+", sep="")

        str_out = paste(str_out, item1[1],
                        "*", item1[2], sep="")
      }
    }
  }

  # item in second string but not in first
  for(item2 in l_tup2){
    item2 <- item_conv(item2)
    found = FALSE

    for(item1 in l_tup1){
      item1 <- item_conv(item1)

      # item in both strings - not processing
      if(identical(item1[2], item2[2])){
        found = TRUE
      }
    }

    if(!found){
      if(as.double(item2[1]) != 0){
        if(!is.null(str_out))
          str_out = paste(str_out, "+", sep="")

        str_out = paste(str_out, item2[1],
                        "*", item2[2], sep="")
      }
    }
  }

  if(is.null(str_out))
    str_out = "0"

  str_out2 <- gsub("[+]-", "-", str_out)
  str_out2
}

# Symbolic multiplication of string 'str' by number m
sym_lc_mult <- function(str, m=1){
  str_t <- gsub("-", "+-", str)
  l_tup <- strsplit(strsplit(str_t, "[+]")[[1]], "[*]")

  res <- lapply(l_tup, sym_var_mult, m)

  str_out = NULL
  for(item in res){
    if(!identical(as.double(item[1]), 0)){
      if(!is.null(str_out))
        str_out = paste(str_out, "+", sep="")

      str_out = paste(str_out, item[1], "*", item[2], sep="")
    }
  }

  if(is.null(str_out))
    str_out = "0"

  str_out2 <- gsub("[+]-", "-", str_out)
  str_out2
}

# Convert a parsed (coefficient, name) tuple into a canonical 2-element form
item_conv <- function(item){
  if(length(item)==1){
    if(is.na(as.double(item[1]))){
      if(identical(substr(item[1],1,1), "-"))
        item = c("-1", sub("-", "", item[1]))
      else
        item = c("1", item[1])
    }
    else
      item = c("0", "0")
  }
  else if(length(item)==0){
    item = c("0", "0")
  }

  item
}

sym_var_mult <- function(item, m=1){
  item <- item_conv(item)
  c(as.character(as.double(item[1])*m), item[2])
}

Appendix 8.2 – Linear regression model

# Linear univariate regression that minimizes rss
# of variable 'yname' of 'df' dataset
ks_lm <- function (df, yname, xnames, full=1, error=0){
  all.names <- c(unlist(xnames), unlist(yname))
  clean.df <- df[complete.cases(df[, all.names]), all.names]

  nr <- nrow(clean.df)
  nc <- ncol(clean.df)

  model <- list()
  model$X <- matrix(do.call(rbind, clean.df), nc)
  rownames(model$X) <- all.names

  x <- clean.df
  x[, yname] <- NULL
  x[, nc] <- rep_len(1, nr)

  y <- clean.df[, yname]

  X <- matrix( do.call(cbind, x), nrow(x) )

  b1 <- solve(t(X) %*% X) %*% t(X) %*% y

  if(full | error){
    e <- X %*% b1 - y
    model$e2 <- t(e) %*% e
    model$e2n <- model$e2/nr
    model$en <- sqrt(model$e2)/nr
  }

  if(full){
    # Projection onto 0 member of quotient space X/Y
    model$R <- diag(rep_len(1, nc))
    model$R[nc, 1:nc] <- b1
    model$R[nc, nc] <- 0

    # X to Y basis rotation
    model$Nm1 <- diag(rep_len(1, nc))
    model$Nm1[, nc] <- rep_len(1, nc)
    model$Nm1 <- model$Nm1[-nc, ]

    model$h0 <- rep_len(0, nc)
    model$h0[nc] <- b1[nrow(b1)]    # intercept (the last coefficient of b1)

    model$X0 <- model$R %*% model$X
    rownames(model$X0) <- all.names
    model$Xl <- model$X0[, 1:nr] + model$h0
    rownames(model$Xl) <- all.names

    model$Nm1R <- model$Nm1 %*% model$R
    model$DimNames <- matrix_symvect_mult(model$Nm1R, all.names)

    model$Y <- model$Nm1 %*% model$X0
    rownames(model$Y) <- model$DimNames
  }

  model
}

Appendix 8.3 – Dimensionality reduction via linear regression

# Reduce dimensionality of dataframe 'df' on variables listed in 'dim'
# until 'n_dim' dimensions are left
ks_lm_dim_red <- function(df, dim, n_dim=1, reorder=1, sd_dim=0, hb=1){
  clean.df <- df[complete.cases(df[, dim]), dim]

  # normalize variables by their range
  n_ds <- lapply(dim, norm_ds, clean.df)
  names(n_ds) <- dim
  norm.df <- as.data.frame(n_ds)

  # order variables by rss of their univariate regressions
  v_dim <- sapply(dim, norm_rss_ds, norm.df)
  i_dim <- order(v_dim[2,])
  order.df <- norm.df[, i_dim]

  # run univariate regressions for variables in rss order
  # until n_dim dimensions left
  ds <- order.df
  n_max <- length(dim) - n_dim
  i <- 0
  while(i < n_max){

    if(hb)
      print(i)

    resp_name <- colnames(ds)[1]
    pred_names <- colnames(ds)[-1]

    mod <- ks_lm(ds, resp_name, pred_names)
    ds <- as.data.frame(t(mod$Y))

    i <- i + 1
    if(i >= n_max)
      break

    # check if we already effectively have an eigenvector (all coordinates are the same)
    if(sd_dim > 0){
      m_sd <- mean(sapply(as.data.frame(t(ds)), norm_sd))
      print(m_sd)
      if(m_sd < sd_dim){
        # convert length to 1D eigenbasis
        col_name <- names(ds)[1]
        ds <- as.data.frame((ds[,1] * sqrt(length(dim) - i)))
        names(ds)[1] <- sym_lc_mult(col_name, sqrt(length(dim) - i))
        break
      }
    }

    # reorder synthetic variables by rss on each iteration
    if(reorder){
      tmp_dim <- colnames(ds)
      v_dim <- sapply(tmp_dim, norm_rss_ds, ds)
      i_dim <- order(v_dim[2,])
      ds <- ds[, i_dim]
    }

  }

  ds
}

# Create list of rss for univariate regression on each normalized variable
norm_rss_ds <- function(item, norm.diabetes){
  i_range <- range(norm.diabetes[,item])
  norm.diabetes[,item] <- norm.diabetes[,item]/(i_range[2] - i_range[1])

  predict_dim <- colnames(norm.diabetes)
  predict_dim <- predict_dim[which(predict_dim != item)]

  mod <- ks_lm(norm.diabetes, item, predict_dim, full=0, error=1)

  dim_rss <- c(item, mod$en)
  dim_rss
}

# Normalize a variable by its range
norm_ds <- function(item, norm.diabetes){
  i_range <- range(norm.diabetes[,item])

  norm_col <- (norm.diabetes[,item] - i_range[1])/(i_range[2] - i_range[1])
  norm_col
}

# Normalize st.dev. by mean
norm_sd <- function(item){
  sd(item)/mean(item)
}


Data Science Done Right (the Kitchen Style) #7

We just created, in the previous post, a simple univariate, multivariable regression function ks_lm0 that calculates regression slopes and intercept.

However, it would be interesting not only to find out what our quotient (or factor) space X/Y is (defined by the bi (i != 0) regression coefficients), what particular equivalence relation R our dataset A has with its image B on the Yl we project it onto, and what projection functions (T and R) we have onto the Yl and Y0 spaces respectively, but also to look at what our projected dataset B becomes in the new space of reduced dimensionality Yl (or, equivalently, in Y0, because we found out before that in the linear space d=f). Actually that is the main goal of the whole exercise we have just done – we projected our data from a space with a less "convenient" structure into one with a more "convenient" (reduced-dimensionality) structure. BTW, that is a part of Factor Analysis (no surprise, we are dealing with factor spaces).

What we want to find (see the previous chapter) is the transform M. We have just found (in ks_lm0) the above-mentioned transform R (a diagonal matrix with one row made from the bi coefficients (i != 0)), and we know that NM=R, i.e. M=N-1R. In the general case, if we want to rotate the uv basis relative to the ij one (which, BTW, is a part of Principal Component Analysis), the N-1 transform would be a full matrix except for one zero row, but we won't do that now and will save it for later; we just declare that the basis for our Space of new "synthetic variables" Yl will be u=ai+bk, v=cj+dk, and even a=b=c=d=1. I.e. we do not rotate the new basis (yet), and do not deliberately scale the unit vector lengths (they change only as a side effect of the projection geometry). Therefore N-1:
(1 0 … 1)
(0 1 … 1)
(.  .  …  .)
(0 0 … 0)
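In R that simple N-1 is built the same way as inside the ks_lm function in the Appendix below: an identity matrix whose last column is set to ones, with the last (collapsed) row dropped; a minimal 3D sketch:

# Simple N-1: u = i + k, v = j + k (no rotation, unit coefficients)
nc  <- 3
Nm1 <- diag(rep_len(1, nc))    # identity
Nm1[, nc] <- rep_len(1, nc)    # last column of ones
Nm1 <- Nm1[-nc, ]              # drop the collapsed k row
Nm1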

So, let us stop for a moment at a more visually "comprehendable" 3D representation of our data, where we look only at the glycosylated hemoglobin (glyhb), cholesterol/HDL ratio (ratio) and systolic blood pressure (bp.1s) dimensions, and see how our data look in 3D and 2D representations after we have collapsed the glyhb dimension and created 2 synthetic variables as linear combinations of glyhb & ratio and glyhb & bp.1s. To do that we will enhance our new ks_lm function to calculate not only the slopes and the intercept, as ks_lm0 did, but also the transformation matrices R and N-1, and the transformed datasets Xl and Yl themselves.

First, we use the lattice package for a 3D scatter plot of the original (blue) data X in coordinates V3=glyhb, V1=ratio, V2=bp.1s, and of the data Xl projected (magenta) onto the quotient space member Yl (the regression plane) and then converted back to the X space basis:

load(url("http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/diabetes.sav"))

Scatter plot of X (original) and Xl (regressed) data in X space

ks_model <- ks_lm( diabetes, c("glyhb"), c("ratio", "bp.1s") )

xdf <- as.data.frame(t(ks_model$X))
xdf[, "V4"] <- rep_len(1, nrow(xdf))

xldf <- as.data.frame(t(ks_model$Xl))
xldf[, "V4"] <- rep_len(2, nrow(xldf))

xxldf <- rbind(xdf, xldf)

cloud(V3 ~ V1*V2, xxldf, groups=V4)

Scatter plot of X (original) and Xl (regressed) data in X space (view from another direction)

Scatter plot of X (original) and Xl (regressed) data in X space (and yet another direction)

And then, we use ggplot2 package to draw a 2D scatter plot of our new synthetic variables in the Yl space:

Projection of our data to Glyhb-bp.1s (black), and to Yl synthetic basis

ydf <- as.data.frame(t(ks_model$Y))
ydf[, "V3"] <- rep_len(2, nrow(ydf))

xdf <- as.data.frame(t(ks_model$X))
xdf[, "V3"] <- rep_len(1, nrow(xdf))

xydf <- rbind(ydf, xdf)

ggplot(xydf, aes(x=V1, y=V2, colour=V3))+
  geom_point(alpha=0.5)

Now, having visually seen how our linear regression function works on 3D data, how it may be used for dimensionality/factor reduction, and where in it there is a place for principal component manipulation, let us proceed to playing with it on higher-dimensional data…

Appendix 

ks_lm <- function (df, yname, xnames){
  all.names <- c( unlist(xnames), unlist(yname) )
  clean.df <- df[complete.cases( df[, all.names] ), all.names]

  nr <- nrow(clean.df)
  nc <- ncol(clean.df)

  model <- list()
  model$X <- matrix(do.call(rbind, clean.df), nc)

  x <- clean.df
  x[, yname] <- NULL
  x[, nc] <- rep_len(1, nr)

  y <- clean.df[, yname]

  X <- matrix( do.call(cbind, x), nrow(x) )

  b1 <- solve(t(X) %*% X) %*% t(X) %*% y

  # Projection onto 0 member of quotient space X/Y
  model$R <- diag(rep_len(1, nc))
  model$R[nc, 1:nc] <- b1
  model$R[nc, nc] <- 0

  # X to Y basis rotation
  model$Nm1 <- diag(rep_len(1, nc))
  model$Nm1[, nc] <- rep_len(1, nc)
  model$Nm1 <- model$Nm1[-nc, ]

  model$h0 <- rep_len(0, nc)
  model$h0[nc] <- b1[nrow(b1)]    # intercept (the last coefficient of b1)

  model$X0 <- model$R %*% model$X
  model$Xl <- model$X0[, 1:nr] + model$h0

  model$Y <- model$Nm1 %*% model$X0

  model
}


Data Science Done Right (the Kitchen Style) #6

Let's implement the univariate multivariable linear regression (see the derivations in the previous chapter) in R the way it is usually used, i.e. mapping our dataset A not just onto an element of the Quotient space Yl but, after that, mapping it back into the original space X. Our ks_lm0 R function is presented in Appendix 1.

Let's run it against a dataset we used in one of the Statistical Methods classes (statistical hypothesis testing for Diabetes Type II risk factors), also, for comparison, running the same data through the standard lm function to make sure that the results match:

> load(url("http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/diabetes.sav"))

> ks_lm0( diabetes, c("chol"), c("glyhb", "ratio", "bp.1s", "bp.1d") )
            [,1]
[1,]   1.5868593
[2,]  11.2724797
[3,]   0.1593560
[4,]   0.3243256
[5,]  98.7703565

> cols <- c("chol", "glyhb", "ratio", "bp.1s", "bp.1d")
> clean2.diabetes <- diabetes[complete.cases(diabetes[,cols]), cols]
> View(clean2.diabetes)
> lmodel <- lm("chol ~ glyhb + ratio + bp.1s + bp.1d", clean2.diabetes)
> lmodel

Call:
lm(formula = "chol ~ glyhb + ratio + bp.1s + bp.1d", data = clean2.diabetes)

Coefficients:
(Intercept)       glyhb       ratio       bp.1s       bp.1d
    98.7704      1.5869     11.2725      0.1594      0.3243

Which, of course (no rocket science there, just simple matrix arithmetic), they do 🙂

However, let’s not stop here, and continue visualizing and playing with our data further…

 

Appendix 1

ks_lm0 <- function (df, yname, xnames){
  all.names <- c( unlist(yname), unlist(xnames) )
  clean.df <- df[complete.cases( df[, all.names] ), all.names]

  x <- clean.df
  x[, yname] <- NULL
  x[, ncol(clean.df)] <- rep_len(1, nrow(clean.df))

  y <- clean.df[, yname]

  # X <- matrix(unlist(x), nrow(x))
  X <- matrix(do.call(cbind, x), nrow(x))

  solve(t(X) %*% X) %*% t(X) %*% y
}

 


Data Science Done Right (the Kitchen Style) #5

A bit more formal on Linear Regression

Let's formalize our Kitchen Style analogy from the previous post a bit, into a notation more suitable for coding, while still being a bit too verbose, as is typical for Kitchen talks. We'll start with the same 3D case, and generalize it to higher dimensions later. Let's denote scalars as lower-case italic letters (s.a. a), vectors as lower-case bold letters (s.a. x), spaces and sets as upper-case non-bold letters (s.a. A), non-vector members of spaces or sets as lower-case letters (s.a. f), and transformations (in this case linear ones, i.e. matrices) as upper-case bold letters (s.a. M); and, just in case, for future use, tensors will be bold, italic, capital letters (s.a. F).

Let's take a 3D vector space X on which we define the basis i,j,k. Let's take a 3D set of our Data and represent it as a set A in the space X. An arbitrary member a of the set A <b.t.> X can be represented as a linear combination x1*i+x2*j+x3*k. Let's choose a dimension (represented by, say, the base vector k) through the projection of which we are going to regress our 3D space X into a 2D space Y, or rather into one, not yet determined, member of the quotient space X/Y. Torturing the notation a bit, let's call it (x/y)l (or Yl), where l <b.t.> I (-inf…0…+inf). Let's define a basis in Yl as u and v.

Let's denote by T a transformation of an arbitrary a <b.t.> A to b <b.t.> B, where B is an equivalence class of all projections of A onto the element of the quotient space Yl (or the plane we are seeking in this regression): T(a)=b=x1*i+x2*j+h*k (with respect to the basis ijk, where h=x3-dx3).

(figure 1)

The same transformation may be achieved through the following chain of transformations of a to f: Pij(a)=c (projection along k onto the plane ij), M(c)=d (projection from the ij plane onto Y0, with respect to uv, with the origin of the uv basis at the ijk origin), R(c)=e (the same projection as M, but with respect to the ijk basis), N(d)=e (transformation back to the ijk basis), Q(e)=b (which can also be done via a linear shift h from Y0 to Yl along k (or addition of h) – the quotient space thing). Let's define f (ultimately the goal of all our transformations) as a representation of the same element with respect to the basis uv: f=y1*u+y2*v, where f <b.t.> Yl, b <b.t.> X. And d=f, because b=h+N(f)=N(d)+h, hence f=d (obviously, because of the commutativity) in a vector space (though not in the general case).

(schema 2)

We may be too talkative, defining too many transformations, but let's see – we may need them later to better understand the behaviour of our data in the reduced-dimensionality spaces themselves, and not only in the original space after the reverse transformation, as we usually do when working with regressions.

Looking at schema 2 we can see that T(a)=QNMPij(a)=NMPij(a)+h=e+h=b=a-dx3*k, or:
(t11 t12 t13)   (x1) i   (0) i     (x1)       i
(t21 t22 t23)   (x2) j + (0) j  =  (x2)       j
(t31 t32 t33)   (x3) k   (h) k     (x3 - dx3) k

And if Pij(a)=c, or:
(1 0 0) (x1)     (x1) i
(0 1 0) (x2) j = (x2) j
(0 0 0) (x3) k   (0) k

M(c)=d, or:
(a11 a12 0)  (x1) i     (y1) u
(a21 a22 0) (x2) j = (y2) v
(0    0     0)  (0)  k

As we can see MPij = M, as well as RPij=R.

Of course we could have done direct projection Puv=MPij, without the intermediate step of Pij, though that matrix could be less intuitive to get. However if it is not, just forget the intermediate steps (whatever your x3 is, it won’t affect your y1 and y2 coordinates):

Puv(a)=d:
(a11 a12 0)  (x1) i     (y1) u
(a21 a22 0) (x2) j = (y2) v
(0    0     0)  (x3) k

N(d)=e, or:
(c11 c12 0)  (y1) u    (x1) i
(c21 c22 0) (y2) v = (x2) j
(c31 c32 0)                 (e3) k

NM(a)=R(a)=e:
(c11 c12 0) (a11 a12   0)     (c11*a11+c12*a21 c11*a12+c12*a22   0)     (t11  t12  t13)
(c21 c22 0) (a21 a22 0) = (c21*a11+c22*a21 c21*a12+c22*a22 0) = (t21 t22 t23)
(c31 c32 0) (0     0     0)     (c31*a11+c32*a21 c31*a12+c32*a22 0)    (t31 t32 t33)

therefore t13=0, t23=0, t33=0

then NPuv(a)+h=e+h=b:

(t11  t12   0)   (x1) i     (0) i    (x1)             i
(t21  t22   0)   (x2) j  + (0) j = (x2)            j
(t31 t32 0)  (x3) k    (h) k    (x3 – dx3) k

therefore:
(1     0    0)   (x1) i     (0) i    (x1)             i
(0    1     0)   (x2) j  + (0) j = (x2)            j
(t31 t32 0)  (x3) k    (h) k    (x3 – dx3) k

Now, how do we choose which indexed element l of the quotient space X/Y (i.e. which value of h (the intercept)) is the best one for our purposes? On the most general level the projected Subspace and its Superspace are not hom(e)omorphic, i.e. do not possess the same Structure (because the projection is a continuous function, but its inverse isn't: neighbouring elements (belonging to the same Open Ball with radius delta) in the Projected Subspace are not necessarily still neighbours (belonging to the Open Ball with radius epsilon) in the Superspace, no matter how small we make the radius delta). And that happens when dx3 > epsilon. Therefore, to reduce the loss of structure, we may want to make all the dx3 (or e, as we rename them later) as small as possible. Actually, there are many reasonable criteria we could use (for example the difference in slopes/gradients between our dataset's "surface" and the Projected "plane", which is another way to lose the Homeomorphic Relation between the Spaces), but the usual, default one is the minimisation of the sum of squares of the deltas dx3 (or residuals), i.e. RSS (the residual sum of squares), which looks reasonable, takes a nice form in matrix notation, and also gives nice analytical equations for the first and second derivatives needed for the minimum calculation.

Let's forget the bases ijk and uv for a moment and not depict them in our notation, and denote i <b.t.> I (1…n), where n is the size of our data set A. Then the i-th element's delta x (dx = b - a) can be written as (supposing that the T transform is Globally Linear, i.e. a multiplication of every i-th element by the same matrix T):
(x1i – x1i)                                = (0)
(x2i – x2i)                               = (0)
(t31*x1i +t32*x2i + h – x3i) = (dx3i)

Leaving only non-trivial dimension k we do the regression on, we can write system of equations for all 1..i..n data elements in a matrix form:
(x11 x21 1) (t31)    (x31)   (dx31)
(…            ) (t32)    (… )      (…   )
(x1i x2i 1)  ( ) – (x3i) = (dx3i)
(…            )              (… )      (…    )
(x1n x2n 1)            (x3n)   (dx3n)

or, compactly: Xt - x3 = dx, or, as it is usually denoted in the literature, Xb - y = e. Let's partially borrow that notation for ease of mental mapping of this rubric onto the books, keeping, though, t in place, because we use b for other purposes. We'll also drop the index 3 in t and rename h to t0.

Having expressed RSS = e1*e1 + … + ei*ei + … + en*en = eT*e (where eT is the transposed vector e), we can do the same with the left part of the equation as well: (Xt - y)T*(Xt - y) = eT*e; and we want to minimize eT*e (or (Xt - y)T*(Xt - y)). But take note – here, from one problem of mapping our data set from a space with the same dimensions as the original data elements have (T(a), ijk) into a subspace with reduced dimensions (b, uv), we moved to another problem: using our data as a transformation matrix, we map the coefficients of the transformation matrix of the original problem (or, we can say, the transformations T themselves (yes, a transformation may also be a space element)) (X(T), ijk), with the original number of dimensions (say, the T-space), into a space of deltas (or errors, say, the E-space) with the number of dimensions equal to the size of our original data set (e, 1..n). And we want to find such an element of the T-space (for the one-return linear regression it will be a vector t, while for the multi-return regression we will look at later it will be a matrix T) which our data transform into the smallest element of the E-space.

As we usually do in Calculus, to find a minimum point (actually, a critical point, which includes maximum and saddle points) of a graph (curve, surface, or a generic multi-dimensional data set, or a product of 1-dimensional ones), we take the Gradient, Grad y = @y/@x1*i + @y/@x2*j + … + @y/@xn*n, at that point, expecting it to be equal to 0. Or, for one-return-parameter functions (Rn->R), it may be intuitively easier to look for a null Differential dy = @y/@x1*dx1 + @y/@x2*dx2 + … + @y/@xn*dxn (which is, anyway, related to the Gradient: (Grad y)T*dx=dy); in either case, we end up looking for null Partial Derivatives of RSS with respect to the basis of t, and generalizing it from the initial 3D case to 1..j..m size, we get:

@Sum[i=1..n]((t1*x1i + … + tj*xji + … + tm*xmi + t0 – yi)*(t1*x1i + … + tj*xji + … + tm*xmi + t0 – yi))/@t1 = @RSS/@t1 = 0

@Sum[i=1..n]((Sum[j=1..m](tj*xji) + t0 – yi)*(Sum[j=1..m](tj*xji) + t0 – yi))/@tj = @RSS/@tj = 0

@Sum[i=1..n]((Sum[j=1..m](tj*xji) + t0 – yi)^2)/@tm = @RSS/@tm = 0

After differentiating:

2*Sum[i=1..n]((x1i )*(t1*x1i + … + tj*xji + … + tm*xmi + t0 – yi)) = 0

2*Sum[i=1..n]((xji )*(t1*x1i + … + tj*xji + … + tm*xmi + t0 – yi)) = 0

2*Sum[i=1..n]((xmi )*(t1*x1i + … + tj*xji + … + tm*xmi + t0 – yi)) = 0

and converting equations above into matrix form we’ll get (actually, multiplication signs are not strictly necessary, they are just a visual convenience to navigate between the transpose and inversion notations, and parentheses):

2*XT*(Xt - y) = @RSS/@t = 0

In this transformation we map t of 1..j..m (in our original 3D case ijk) size basis to the same size basis of the partial derivative space, looking for the Null Transform of t, or Kernel t. Which will be:

XT*Xt = XTy

and after multiplication of both sides by the inverse (XT*X)-1:

(XT*X)-1*XT*Xt =(XT*X)-1*XTy

where (XT*X)-1*XT*X = I – identity matrix, i.e. diagonal with coefficients 1, hence It=t, then:

t = (XT*X)-1*XTy
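That closed-form solution is literally the one line at the heart of the ks_lm0/ks_lm functions of chapters #6-#7; a minimal sketch, assuming X is the design matrix with a column of ones for the intercept t0 and y is the regressed variable:

# Ordinary least squares via the normal equations: t = (X^T X)^-1 X^T y
t_hat <- solve(t(X) %*% X) %*% t(X) %*% y

# residuals and RSS at that critical point
e   <- X %*% t_hat - y
rss <- as.numeric(t(e) %*% e)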

However, we yet to find out if our critical point is in fact a minimum, and not a maximum or a saddle point.

As usual, and as is obvious for the 1D mapping (a function of one variable) y=f(x), we want the second derivative to be positive. If the first derivative, usually depicted as the slope of a line at the point where we take it, may be more intuitively envisioned as the "speed" of moving along the graph y in the Y space while we are moving along x in the X space, then the second derivative is the "acceleration" of that movement. For example, a plane diving (say, for a zero-gravity simulation), and then climbing back up, changes its vertical speed from a negative value to a positive one, transitioning through 0 at the critical point of the minimum, while its acceleration remains positive (which is what makes the critical point indeed a minimum – i.e. however large a negative speed the plane has, sooner or later (if it doesn't hit the ground) it will be attenuated by the positive acceleration to 0, and then the plane will start climbing up). If we get rid of the time variable, we can use the derivatives at the critical point of the function graph along the domain x: dy/dx (the "speed" of changing y with changing x, which is 0 there) and d^2y/dx^2 (the "acceleration" of changing y with changing x, which is positive; let it be a constant a0), to estimate dy for a given dx using the second member of the Taylor series: dy = 1/2*a0*dx^2 (higher members will be 0 because we set a0 constant in our example). This is positive, and that means that moving from the critical point x0 in any direction by dx we'll get a positive dy, i.e. the critical point is indeed a minimum.

A similar approach works in the multi-dimensional case too, but there the measure of "acceleration" of f:Rn->R is called the Hessian matrix H:

(@^2f/@x1^2 … @^2f/@x1@xi … @^2f/@x1@xn)

(@^2f/@xi@x1 … @^2f/@xi^2 … @^2f/@xi@xn)

(@^2f/@xn@x1 … @^2f/@xn@xi … @^2f/@xn^2)

We still want to make sure that whatever our movement dx is in the X domain, it will be mapped into a positive movement dy in our range space Y after applying the H transform (which is actually what brute-force gradient descent methods would check). In the literature such a transform (matrix) is called positive definite, but for better intuitive understanding let's rather start from the eigenvector and eigenvalue idea (anyway, these things are closely related). An eigenvalue is such a multiplier lambda (for a transform M) that gives a particular vector(s) v the same transform as the regular mapping Mv (Mv=lambda*v). There can be multiple eigenvalues and eigenvectors for a particular transform. The nice thing about eigenvectors is that if we find enough (the number of X space dimensions, so they span it) linearly independent eigenvectors (so they form a basis of the space), we may express any vector x=x1*v1+x2*v2+…+xn*vn in the new eigenbasis. If it happens that all the eigenvalues lambdai (the coefficients of the diagonal matrix L) for that eigenvector basis are positive, the M transform will have the same sign as the vector of our domain space X. In the case of the H transform that would mean that the point with the null Gradient/Differential is indeed a minimum.
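A small numeric R illustration of that definition (an arbitrary symmetric example matrix, not our data): eigen() returns the eigenvalues lambda_i and eigenvectors v_i, and M*v = lambda*v holds for each pair:

M  <- matrix(c(2, 1,
               1, 3), nrow = 2, byrow = TRUE)
ev <- eigen(M)

v1      <- ev$vectors[, 1]
lambda1 <- ev$values[1]
all.equal(as.vector(M %*% v1), lambda1 * v1)   # TRUE: M maps v1 to lambda1*v1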

H*x = x1*H*v1 + x2*H*v2 + … + xn*H*vn = x1*lambda1*v1 + x2*lambda2*v2 + … + xn*lambdan*vn = L*x

or variation on the positive definite matrix definition:

xT*H*x = (x1, x2, …, xn)*(x1*lambda1, x2*lambda2, …, xn*lambdan)T = x1*x1*lambda1 + x2*x2*lambda2 + … + xn*xn*lambdan > 0

In the general case we may want to find the eigenvectors and eigenvalues anyway (they are very handy for the transform analysis – and, in our case, because our data set is used as a transform, for the data analysis too); however, in special cases, when we only need to figure out what type of critical point we are at, we can estimate whether our H matrix is really positive definite or whether it can have negative lambdas.

Let's calculate the Hessian for our RSS case, which can be expressed in a compact matrix form as:

@^2RSS/@t@tT = 2*XT*X

This is a quadratic form, and it guarantees that whatever our data xij are, xij*xij is positive, and for any dt of our domain space the "acceleration" value of the range space (RSS) will be positive. As in our 1D domain example with the diving plane, we can use the second member of the Taylor series (the first one is 0 because it's a critical point, and the higher members are 0 too, because the second derivative matrix H has all constant elements) to estimate what dRSS will be when we move from the critical point in any direction dt of the domain space. We can express it as:

dRSS = dtT*H*dt

Here not only the H transform is positive (because of the quadratic form xij*xij), but dtj*dtj is too. Therefore, whatever the move dt is, dRSS will be positive, so our critical point is indeed a minimum.

The only thing we have to make sure of is that we have enough eigenvectors to form an eigenbasis, i.e. that there is no linear dependency in H – in other words, that it is not degenerate, or is a full-rank matrix – otherwise we can't "control" the behaviour of the "acceleration" in some dimensions.
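In practice that check is cheap; a sketch, assuming X is the design matrix with the intercept column, verifying that the Hessian 2*XT*X has only positive eigenvalues (i.e. is positive definite) and full rank:

# Hessian of RSS with respect to t
H <- 2 * t(X) %*% X

eigen(H, symmetric = TRUE, only.values = TRUE)$values   # all should be > 0
qr(H)$rank == ncol(H)                                    # full rank: the eigenvectors span the t-space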

Of course we did not discover any Americas here – all these derivations may be found in many Statistics and Data Science books, and Linear Regression functions are implemented in many libraries – but we want to experiment a bit further with the regression algorithms on real data, so let's have our own R and/or Python library for it, to play with…


Data Science Done Right (the Kitchen Style) #4

What is a Linear Regression?

If we take a look at the general definition of the term Regression we will find something like: “transition to a simpler or less perfect state”. Perfection is quite a subjective category, and, depending on the context and point of view, the same phenomena, by the same person, may be viewed as more or less perfect for one or another purpose. For example a “more simple” state or model may be viewed as “less perfect” for purposes of the simulation accuracy, or “more perfect” for easiness and clarity of understanding. So, let’s rather stick to a more clear and distinct “simplicity” aspect of the definition.

Applied to the Data Science meaning of "simplicity", and especially in the context of space mappings, it would obviously mean a reduction of the dimensionality and/or of the number and complexity of Relations between the Space objects. Which, actually, means a Projection of our Data from a Super- to a Subspace. As we previously noted, a Space is a Set with Relations defined over its members. For a Linear Space those would be the relations of addition and multiplication by scalars which comply with the linear properties, again, mentioned before. For a particular instance of a Linear Space we choose an Identity element (the "coordinate origin") and a minimal number of (i.e. linearly independent) basis elements (which will create ordered pairs – basis vectors) which span the Space (i.e. any of its elements may be expressed as a linear combination of them).

If we define different Linear Relations on the same Set, i.e. different bases through which we can express literally how one element of the Set is related to another, we may map elements of one Linear Space, expressed in its Relations/basis, to the expression in the Relations/basis of another Linear Space through matrix multiplication. Considering that the representations a (in the basis ijk) and a' (in the basis uvw) of the same Set element are representations of fundamentally the same thing, and that we can express the basis vector "things" of one basis via the "things" of another basis, we can write (see notations in the next chapter):

a=bi+cj+dk; a’=xu+yv+zw; a=a’
u=e11i+e21j+e31k;
v=e12i+e22j+e32k;
w=e13i+e23j+e33k;

a' = x(e11i+e21j+e31k) + y(e12i+e22j+e32k) + z(e13i+e23j+e33k) = i(x*e11+y*e12+z*e13) + j(x*e21+y*e22+z*e23) + k(x*e31+y*e32+z*e33) = a = bi+cj+dk

E*(x, y, z)T (in respect to uvw) = (b, c, d)T (in respect to ijk); or E*a’ = a, or a’ = E-1*a
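A small numeric sketch of that change of basis (an arbitrary invertible E, not tied to our dataset): the columns of E are u, v, w written in the ijk basis, and solve(E) plays the role of E-1:

E <- matrix(c(1, 1, 0,
              0, 1, 1,
              0, 0, 1), nrow = 3, byrow = TRUE)   # example basis-conversion matrix

a       <- c(2, -1, 3)         # coordinates (b, c, d) with respect to ijk
a_prime <- solve(E) %*% a      # the same element with respect to uvw: a' = E-1*a
E %*% a_prime                  # back to ijk: recovers a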

If we express the coefficients of the matrix of an arbitrary transformation F as the coefficients of the basis-converting matrix E plus some deltas, then the transform will give us the same vector with respect to the range basis plus a delta that may be turned into any element of the range Space, thus giving us any desirable vector b:

F*a' = i(x*(e11+h11)+y*(e12+h12)+z*(e13+h13)) + j(x*(e21+h21)+y*(e22+h22)+z*(e23+h23)) + k(x*(e31+h31)+y*(e32+h32)+z*(e33+h33)) = i(x*e11+y*e12+z*e13) + j(x*e21+y*e22+z*e23) + k(x*e31+y*e32+z*e33) + i(x*h11+y*h12+z*h13) + j(x*h21+y*h22+z*h23) + k(x*h31+y*h32+z*h33) = a + h = b

A Subspace is a Subset of the Superset's elements with the Superspace's Relations restricted by the Subset's nature. For example, if the Superspace's Relations involve pairs of elements which (or at least one of which) no longer exist in the Subset, that Relation will be gone from the Subspace's Relations. For a (Linear) Subspace of a Linear Superspace some of those "things" we have chosen as basis vectors will be gone, and such a Mapping that involves the elimination of basis vectors (Relations, in the general case) while preserving the others is called a Projection.

Tracking back our thoughts to Regression: we won't usually know beforehand which Subspace is more suitable for our Projecting purposes, but we may have an idea about the possible variants from which we may choose the best one, applying particular criteria. As was already mentioned, any objects may be members of a Space, including other Spaces or Subspaces. Here the concept of Quotient (or Factor) Spaces may be useful. In such a Space its members are its disjoint (not having common elements) Subspaces.

Let's imagine a Crepe Cake, which, as a whole, is a 3D Space, but which can also be thought of as a 1D Quotient Space of 2D Crepes. Also, let's imagine we have Blueberries somehow stuffed in between our Crepes. And then we somehow want to associate (via a so-called Equivalence Relation) all these Blueberries with only one Crepe, for example by protruding (Projecting) those Blueberries through the other Crepes with our fingers and smashing them into One Chosen Crepe (or making sure they somehow squeeze and move through the holes made by toothpicks). All these Blue Spots on the One Chosen Crepe (or maybe on a sheet of Parchment Paper inserted in between the Cake Layers) we may call an Equivalence Class. And we may want to minimize the ruin we have just inflicted on our Cake by choosing the One Crepe that ensures that, and that will be the condition for our Equivalence Relation.

Of course, other criteria may be chosen, for example the Crepe with the biggest holes in it, or something else. Also, we may want to bore the holes in the Cake not with straight but with crooked fingers (that won't be a Linear Transformation), or put the Cake on the edge of a table, let it bend like the Clocks in Salvador Dali's paintings, and then bore it with straight fingers (that will give neither (Linear) Subspaces nor a Linear Transformation). We may decide those non-linear Blueberry Transformations and Subspaces are even cooler than the linear ones (for example, the crooked holes in the Cake would make it a Piece of Art), but for the Linear, One-Output-Parameter Regression from a 3D Space into a 2D Subspace we will stick to the algorithm (Linear Projections from a Vector Superspace into Vector Subspaces) described in the former chapter.

Actually, it may make sense, for a better understanding of what Linear Regression is, to stop for a moment and discuss what is not a Linear Regression. In the case of the Cake Bending over the Table Edge, though the Space it represents may look nonlinear to us, looking at it from a Linear Space with a fixed basis (so the Bent Layer Space won't be a Subspace of it), if the Bent Cake Space is defined by a vector basis that changes (relative to our unchanging basis) as we move along it, all the linear operations relative to the Bent Cake basis may still hold. For example, we may take a sheet of parchment paper, draw on it the additions and multiplications of vectors and scalars representing all the necessary properties of a Linear Space, put that paper in the middle of the Cake and then bend it – our drawings will still be true. So it's not necessarily the Space itself which makes the Regression Nonlinear, but our view of it, or rather the mappings from our Space to the Bent Cake Space (both the basis conversions M and the projections Pij (see the next chapter)), which won't be Globally Linear. At each given point we can indeed represent the mapping as a Matrix-by-Vector multiplication (so it will be a local Linear Transform), but Globally, moving from point to point (and from one element of the quotient Space of Cake Layers to another), that Matrix will differ. Looking at it in a utilitarian way – in the next chapter we won't be able to write that matrix equation where we use our dataset as a Linear Transform from the space of Linear Mappings to the Space of Residuals to find the transform with the minimal Residual Sum of Squares.

In the Crooked Fingers Boring case we have an obvious Nonlinear Transformation f(x)=y that cannot be represented by the matrix equations of the next chapter. Actually, the second case is the same as the first one, but as if we were looking from inside the range Bent Cake Space, imagining it's "flat", at the domain Kitchen Space, which would now look like a Warped one. That can easily be seen using an elastic eraser bar: bend it with your fingers, draw a straight line on its side, and then let it go – the previously straight line becomes crooked.

In the last case we, again, have Locally Linear transforms, but they differ from point to point, like in the first case, and, again, are not described by the next chapter's equations. Technically, we may use a particular case of Locally Linear, but Globally Nonlinear, Transformations that vary from one Data element to another, if they are Linearly "modulated" (for example, if we get the Transform Matrix at a particular point (vector) as a multiplication of a Tensor by that vector, so we'll be adding, first, a Linear Transformation of Vector-to-Linear-Transformation: (D*x+S)*x=Mx*x=y), and actually that may be a way to linearize generally Nonlinear Transformations, but that will call for a somewhat different mathematical treatment (adding one more transformation in the target subspace) than the Transformation presented in the next chapter.

 


Data Science Done Right (the Kitchen Style) #3

More on the fundamental Method of Data Science

When we want to model an unstructured collection of real-world phenomena we use such a mathematical abstraction as a Set. It can contain not just simple elements (or objects, or members – these are interchangeable terminologies) such as numbers, or more complex mathematical abstractions (for example Sets themselves) – the members of a Set can be any possible or imaginable objects. If we want to introduce (and we usually do) a Structure over these objects, we use such a mathematical abstraction as a Space. A Space is a Set with Relations (or, as a special case, Mappings (Functions or Transformations – these, too, are interchangeable terminologies), or, in an even more special case, Operators) defined over its members. Functions are Relations that define a correspondence of a member of the Domain Space (the Space we map from) to exactly one member of the Range Space (the Space we map to). Of course, there may be multiple Domain members that map to the same Range member, but we do not split them. Operators are mappings to the same Space.

For example, to define a Linear (Vector) Space we have to declare what element(s) will be Identity element(s), and we have to define Operators of the members’ Addition and scalar Multiplication in such a way that their result will be still an element of the Space (i.e. such Mappings are, indeed, Operators), and those Operators will be Associative, Commutative and Distributive, and we have to declare that every element will have its Inverse and their Sum will produce Identity (Additive Identity will be Null) element. Again, members of such a Space may be not only numbers or their lists, but also any phenomena, or even their relationships. We just have to define Operators on them as described above, and then we can apply all the Linear Vector Space analytical apparatus to our newly created Space.

For Metric Spaces we have to define a Distance Function (which is, strictly speaking, a Relation, so these are more generic Spaces) that gives us a distance between any two selected elements of the Space, and we are free to define whatever we want, not necessarily being bound by the Euclidean distance calculation only. For the most generic Topological Spaces we define Topologies – those Sets that basically tell us whether elements of the Space are in a Relation of being neighbours or not, and where the boundaries lie between neighborhoods.

Representing our Data as Spaces with Structure definitions over them is obviously useful for finding Structural relationships between the Data elements, and is sufficient for the Unsupervised Learning methods of Data Science. In addition, by defining Mappings or Relations between Spaces, we can ask (and answer) such questions as: "Can two Spaces be mapped to each other?", "Is one of them a Subspace of the other?", "Is that mapping continuous (isomorphic/homeomorphic)?" In terms of Data Analysis those questions and answers will tell us whether our Data sets have the same or similar Structures, allowing us to recognize Patterns and mine Data.

Those Relations, Functions, or Operators define the Structure of the Space to which members of the Space can be subjected, or which is "visible" in the Space. Our Real-World Data could have a much more sophisticated "Real Structure", but, when modelling the Real-World Data in a particular modelling Space, we will be able to see no more Structure than we have defined in the model. Or maybe even less Structure in the Data, if our expected model Structure is not present in the Real-World Data. For example, a Decision Tree (which is such a Relation) formulated to pinpoint fraudulent credit card use will not make visible the authentic owner's spending habits (for which we would need another kind of Decision Tree). Or, a Linear Vector Space will make visible to us only the linear Structure of the relations between Data elements. Or, which is usually the case in Topological Data Analysis, if we Metricize a generic Topological Space, we will lose the non-metrizable relations.

Because using Statistical Modeling we cannot (or do not bother to) get insight into the causes and driving forces of our Data, and we treat them like movements inside a "black box", we are also in the darkness (of that box) about whether all the aspects (parameters or variables) of the objects we study are indeed, as we may think, their defining parameters, and not incomplete or overlapping combinations of the "real" (independent) parameters. Because of that we are bound to see those parameters as random and dependent on each other (welcome to the real-world, or "nasty", or "dirty" Data). Which is really not the Data's problem but ours – a problem of failed assumptions, expectations, or, in a way, ego.

If the aspects (variables) of the Data and the Structure Operators of the initial model do not give us much meaningful information, we may want to map the Data isomorphically, or at least partially, homomorphically, onto another Space with topologies more relevant and interesting to us, with different bases and different (reduced or introduced) dimensions. That may make visible those Structures we are interested in (or maybe surprised to see), eliminate or reduce variable dependencies, or even reduce the very "randomness" of the variables.

But enough wordy theorizing – let us see how Linear Regression, the workhorse of Statistics/Data Science, looks from the Space Mapping point of view…
