A bit more formal on Linear Regression
Let’s formalize a bit our Kitchen Style analogy from the previous post, into a more suitable for coding notation, still being a bit too verbose, as it is typical for Kitchen talks. We’ll start with the same 3D case, and generalize it to bigger dimensions later. Let’s denote scalars as the lower case italic letters (s.a. a), vectors as the lower case bold letters (s.a. x), spaces and sets as the upper case non-bold letters (s.a. A), non-vector members of spaces or sets as lower case letter (s.a. f), and transformations (or in this case liner ones, or matrices) as the upper case bold letters (s.a. M).
Let’s take a 3D vector space X on which we define basis i,j,k. Let’s take a 3D set of our Data and represent it as a set A in the space X. An arbitrary member a of the set A <b.t.> X could be represented as a linear combination x1*i+x2*j+x3*k. Let’s choose a dimension ( represented by, say, the base vector k) through the projection of which we are going to regress our 3D space X into a 2D space Y, or rather one, not yet determined, member of the quotient space X/Y. Torturing a bit notation, let’s call it (x/y)l (or Yl), where l <b.t.> I (-inf…0…+inf). Let’s define a basis in Yl as u and v.
Let’s denote T as a transformation of an arbitrary a<b.t.>A to b<b.t.>B, where B is a equivalence class of all projections of A to the element of the quotient space Yl (or the plane we are seeking in this regression), as T(a)=b=x1*i+x2*j+h*k (in respect to the basis ijk, where h=x3-dx3).
The same transformation may be achieved through the chain of the following transformations of a to f: Pij(a)=c (projection of k to the plain ij), M(c)=d(projection to Y0, with the origins of uv basis in the ijk origin), N(d)=e (transformation to ijk basis back), Q(e)=b (linear shift h from Y0 to Yl along the k (or h) – the quotient space thing). Let’s define f as a representation of the same element in respect to the basis uv: f=y1*u+y2*v, where f<b.t.>Yl, b<b.t.>X. d=f because where b=h+N(f)=N(d)+h, hence f=d (obviously, because of the commutativity) in the vector space (not in general case, though).
We may be are too talkative, defining too many transformations, but let’s see, we may need them later to better understand behaviour of our data in the reduced-dimensionality spaces themselves, and not only in the original space after the reverse transformation, as we usually do it working with regressions.
Looking at the schema 2 we can see that T(a)=QNMP(a)=b=a–dx3*k=a+h–x3*k,
or T(a)-h=a–x3*k=b, or:
(b11 b12 b13) (x1) i – (0) i (x1) i
(b21 b22 b23) (x2) j – (0) j = (x2) j
(b31 b32 b33) (x3) k – (h) k (x3-dx3) k
And if Pij(a)=c, or:
(1 0 0) (x1) i (x1) i
(0 1 0) (x2) j = (x2) j
(0 0 0) (x3) k (0) k
(a11 a12 0) (x1) i (y1) u
(a21 a22 0) (x2) j = (y2) v
(0 0 0) (0) k
of course we could have done direct projection Puv=MPij, without the intermediate step of Pij, though that matrix could be less intuitive to get. However if it is not, just forget the intermediate steps:
(a11 a12 0) (x1) i (y1) u
(a21 a22 0) (x2) j = (y2) v
(0 0 0) (x3) k
(c11 c12 0) (y1) u (x1) i
(c21 c22 0) (y2) v = (x2) j
(c31 c32 0) (e3) k
(c11 c12 0) (a11 a12 0) (c11*a11+c12*a21 c11*a12+c12*a22 0) (t11 t12 t13)
(c21 c22 0) (a21 a22 0) = (c21*a11+c22*a21 c21*a12+c22*a22 0) = (t21 t22 t23)
(c31 c32 0) (0 0 0) (c31*a11+c32*a21 c31*a12+c32*a22 0) (t31 t32 t33)
(1 0 0) (x1) i (0) i (x1) i
(0 1 0) (x2) j – (0) j = (x2) j
(t31 t32 0) (x3) k (h) k (x3 – dx3) k
Now, how do we choose which indexed element l of the quotient space X/Y (or what value of h (or intercept)) is the best on for our purposes? Actually, there could be many of the reasonable criteria we can use, but the usual, default one is the minimisation of the sum of the squares of deltas (or residuals), i.e. RSS (residual square sum), which looks reasonable and is the nice looking one in the matrix form, and also gives nice analytical equations for the first and second derivatives needed for the minimum calculation.
Let’s forget for a moment bases ijk and uv, and let’s won’t depict them in our notations, and denote i <b.t.> I (1…n), where n is a size of our data set A. Then i-th element’s delta x (dx=b–a) can be written as:
(x1i – x1i) = (0)
(x2i – x2i) = (0)
(t31*x1i +t32*x2i – h – x3i) = (dx3i)
Leaving only non-trivial dimension k we do the regression on, we can write system of equations for all 1..i..n data elements in a matrix form:
(x11 x21 1) (t31) (x31) (dx31)
(… ) (t32) (… ) (… )
(x1i x2i 1) ( –h ) – (x3i) = (dx3i)
(… ) (… ) (… )
(x1n x2n 1) (x3n) (dx3n)
or, compactly: Xt–x3=dx, or, as usually it denoted in literature, Xb–y=e. Let’s partially borrow that notation for easiness of the mental mapping of this rubric to books, leaving, though, t in place, because we use b for other purposes. We’ll also drop index 3 in t and rename –h to t0.
Having expressed RSS= e1*e1 + … + ei*ei + … + en*en = eT*e (where eT is a transposed vector e), we can do the same with the left part of the equation as well: (Xt–y)T*(Xt–y)=eT*e; and we want to minimize eT*e (or (Xt–y)T*(Xt–y)). But take a note, – here, from one problem of mapping our data set from a space with the same dimensions as the original data elements have (T(a), ijk) into a subspace with the reduced dimensions (b vu), we moved to another problem, – using our data as a transformation matrix, we map coefficients of our transformation matrix (or we can say transformations T themselves (yes, transformation also may a space element)) of the original problem (X(T), ijk) (with the original number of dimensions (say, T-space)), into a space of deltas (or errors (say, E-space)) with number of dimensions equal to the size of our original data set (e, 1..n). And we want to find such an element (for the one return linear regression it will be a vector t (while for the multi-return regression we look at later, that will be a matrix T)) from the T-space, which our data would transform into the smallest element of E-space.
As usually we do that in Calculus, to find a minimum point (actually, a critical point, which includes maximum and saddle points) of a graph (curve, surface, or generic multi-dimensional data set, or product of 1-dimensional ones), we take Gradient Grad y = @y/@x1*i + @y/@x2*j + … + @y/@xn*n in that point expecting it to be equal 0. Or, for one-return parameter functions (Rn->R), it may be intuitively easier to look for a null Differential dy = @y/@x1*dx1 + @y/@x2*dx2 + … + @y/@xn*dxn (which is, anyway, related to Gradient (Grad y)T*dx=dy), and in either case, we end up looking for null Partial Derivatives of RSS in respect to the basis of t, and generalizing it from the initial 3D case to 1..j..m size, we get:
@Sum[i=1..n]((t1*x1i + … + tj*xji + … + tm*xmi + t0 – yi)*(t1*x1i + … + tj*xji + … + tm*xmi + t0 – yi))/@t1 = @RSS/@t1 = 0
@Sum[i=1..n]((Sum[j=1..m](tj*xji) + t0 – yi)*(Sum[i=1..m](tj*xji) + t0 – yi))/@tj = @RSS/@tj = 0
@Sum[i=1..n]((Sum[j=1..m](tj*xji) + t0 – yi)^2)/@tm = @RSS/@tm = 0
2*Sum[i=1..n]((x1i )*(t1*x1i + … + tj*xji + … + tm*xmi + t0 – yi)) = 0
2*Sum[i=1..n]((xji )*(t1*x1i + … + tj*xji + … + tm*xmi + t0 – yi)) = 0
2*Sum[i=1..n]((xmi )*(t1*x1i + … + tj*xji + … + tm*xmi + t0 – yi)) = 0
and converting equations above into matrix form we’ll get (actually, multiplication signs are not strictly necessary, they are just a visual convenience to navigate between the transpose and inversion notations, and parentheses):
2*XT*(Xt–y) = @RSS/@t = 0
In this transformation we map t of 1..j..m (in our original 3D case ijk) size basis to the same size basis of the partial derivative space, looking for the Null Transform of t, or Kernel t. Which will be:
XT*Xt = XTy
and after multiplication of both sides by the inverse (XT*X)-1:
where (XT*X)-1*XT*X = I – identity matrix, i.e. diagonal with coefficients 1, hence It=t, then:
t = (XT*X)-1*XTy
However, we yet to find out if our critical point is in fact a minimum, and not a maximum or a saddle point.
As usual, and obvious for the 1D mapping (function of one variable) y=f(x), we are looking that second derivative were positive. If the first derivative, usually expressed as slope line to the point where we take it, may be more intuitively envisioned as a “speed” of moving along the graph y in Y space, while we are moving along the x graph in X space, the second derivative is an “acceleration” of that movement. For example, a plane diving (for example for a zero-gravity simulation), and then climbing back up, changes its vertical speed from negative value to positive, transitioning through 0 at the critical point of minimum, its acceleration remains positive (which defines that the critical point indeed was minimum – i.e. whatever large negative speed the plain has, sooner or later (if it doesn’t hit the ground) it will be attenuated by the positive acceleration to 0 speed, and then the plane will start climbing up). If we get rid of the time variable, first derivatives along the function graph dy/dx (“speed” of changing y with changing x) to the left and right from the minimum will have the same sign as dx, while second derivatives d^2y/dx^2 (“acceleration” of changing y with changing x) will be positive. Actually the latter ensures the former, because positive “acceleration” makes dy positive (the further away from the minimum, the greater y is), and sign of the first derivative comes from the dx direction.
The similar approach works in a multi-dimensional case too, but in that case the measure of “acceleration” of the f:Rn->R is called Hessian matrix H:
(@^2f/@x1^2 … @^2f/@x1@xi … @^2f/@x1@xn)
(@^2f/@xi@x1 … @^2f/@xi^2 … @^2f/@xi@xn)
(@^2f/@xn@x1 … @^2f/@xn@xi … @^2f/@xn^2)
We still want make sure that whatever our (positive, or same sign) movement dx is in our X domain, it will be mapped in a positive movement dy in our range space Y , after applying the H transform (which is actually what brute force gradient descent methods do). In the literature such a transform (matrix) is called the positive definite one, but for better intuitive understanding we rather start from the eigenvectors and eigenvalues idea (anyway these things are closely related). Eigenvalue is such a value lambda (of a transform M) that gives to a particular vector(s) v the same transform as a regular Mv mapping. There could be multiple eigenvalues and vectors for particular transform. The nice thing about eigenvectors is that if we find enough (number of X space dimensions, so we can span it) linearly independent eigenvectors (so it will be basis of the space) we may express any vector x=x1*v1+x2*v2+…+xn*vn in the new eigenbasis. If it happens that all eigenvalues lambdai (coefficients in the diagonal matrix L) for that eigenvector basis are positive, M transform will be positive for any positive vector of our domain space X. In case of the H transform that would mean that the point with null Gradient/Differential is indeed minimum.
H*x = x1*H*v1+x2*H*v2+…+xn*H*vn = x1*lambada1*v1+x2*lambada2*v2+…+xn*lambadan*vn =L*x
or variation on the positive definite matrix definition:
xT*H*x = (x1, x2, …, xn) (x1*lambada1, x2*lambada2, …, xn*lambadan) T = x1*x1*lambada1*+x2*x2*lambada2+…+xn*xn*lambadan = lambda>0
In a general case, we may want to find eigenvectors and eigenvalues (anyway they are very handy for the transform analysis (and, in our case, because our data set is used as a transform, for the data analysis, too), however, in special cases, for figuring out only what type of the critical point we are in, we can estimate whether our H matrix is really positive definite, or it can have negative lambdas.
Let’s calculate Hessian for our RSS case, which could be expressed in a compact matrix for as:
@^2RSS/@t@tT = 2*XT*X
Which is quadratic form and it guaranties that whatever our xij data are, xij*xij is positive, and for any dt of our domain space, “acceleration” value of the range space will be positive.
The only thing we have to make sure that we have enough eigenvectors to form eigenbasis, i.e. there are no linear dependency in H, or in other words it’s not degenerate, or is a full rank matrix, otherwise we can’t “control” behaviour of the “acceleration” in some dimensions.
Of course we did not discover any Americas, and all these derivations may be found in many Statistics and Data Science books, and the Linear Regression functions are implemented in many libraries, but we want to experiment a bit further with the regression algorithms on real data, so let’s have our R or/and Python library for it, to play with them…