Data Science Done Right (Kitchen Style) #5

A bit more formal on Linear Regression

Let’s formalize our Kitchen Style analogy from the previous post a bit, into a notation more suitable for coding, while still being a bit too verbose, as is typical for kitchen talks. We’ll start with the same 3D case, and generalize it to higher dimensions later. Let’s denote scalars by lower-case italic letters (such as a), vectors by lower-case bold letters (such as x), spaces and sets by upper-case non-bold letters (such as A), non-vector members of spaces or sets by lower-case letters (such as f), and transformations (in this case linear ones, or matrices) by upper-case bold letters (such as M).

Let’s take a 3D vector space X on which we define a basis i, j, k. Let’s take a 3D set of our data and represent it as a set A in the space X. An arbitrary member a of the set A ⊂ X can be represented as a linear combination x1*i + x2*j + x3*k. Let’s choose a dimension (represented by, say, the basis vector k) through the projection of which we are going to regress our 3D space X into a 2D space Y, or rather into one, not yet determined, member of the quotient space X/Y. Torturing the notation a bit, let’s call it (X/Y)l (or Yl), where l ∈ I (-inf…0…+inf). Let’s define a basis in Yl as u and v.

Let’s denote by T a transformation of an arbitrary a ∈ A to b ∈ B, where B is the equivalence class of all projections of A onto an element of the quotient space Yl (the plane we are seeking in this regression): T(a) = b = x1*i + x2*j + h*k (with respect to the basis ijk, where h = x3 - dx3).

(figure 1)

The same transformation may be achieved through the following chain of transformations: Pij(a) = c (projection along k onto the plane ij), M(c) = d (projection onto Y0, with the origin of the uv basis at the ijk origin), N(d) = e (transformation back to the ijk basis), Q(e) = b (linear shift by h from Y0 to Yl along k (or h) – the quotient space part). Let’s define f as a representation of the same element with respect to the basis uv: f = y1*u + y2*v, where f ∈ Yl, b ∈ X. Since b = N(f) + h = N(d) + h, we have f = d (obviously, because of commutativity in a vector space, though not in the general case).

(schema 2)

We may be too talkative, defining too many transformations, but let’s see – we may need them later to better understand the behaviour of our data in the reduced-dimensionality spaces themselves, and not only in the original space after the reverse transformation, as we usually do when working with regressions.

Looking at schema 2 we can see that T(a) = QNMP(a) = b = a - dx3*k = a + (h - x3)*k, or, writing the chain as a matrix B and the shift as subtraction of h*k, T(a) = B*a - h*k = b:
(b11 b12 b13) (x1)   (0)   (x1)       i
(b21 b22 b23) (x2) - (0) = (x2)       j
(b31 b32 b33) (x3)   (h)   (x3 - dx3) k

And if Pij(a)=c, or:
(1 0 0) (x1)   (x1) i
(0 1 0) (x2) = (x2) j
(0 0 0) (x3)   (0)  k

M(c)=d, or:
(a11 a12 0) (x1)   (y1) u
(a21 a22 0) (x2) = (y2) v
( 0   0  0) (0)    (0)

Of course, we could have done the direct projection Puv = M*Pij, without the intermediate step Pij, though that matrix could be less intuitive to derive. However, if it is not, just skip the intermediate step:

(a11 a12 0) (x1)   (y1) u
(a21 a22 0) (x2) = (y2) v
( 0   0  0) (x3)   (0)

N(d)=e, or:
(c11 c12 0) (y1)   (x1) i
(c21 c22 0) (y2) = (x2) j
(c31 c32 0) (0)    (e3) k

(c11 c12 0) (a11 a12 0)   (c11*a11+c12*a21  c11*a12+c12*a22  0)   (t11 t12 0)
(c21 c22 0) (a21 a22 0) = (c21*a11+c22*a21  c21*a12+c22*a22  0) = (t21 t22 0)
(c31 c32 0) ( 0   0  0)   (c31*a11+c32*a21  c31*a12+c32*a22  0)   (t31 t32 0)
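As a quick sanity check (a NumPy sketch with made-up coefficients of my own, not part of the derivation), we can verify that chaining the projection P, the map M into the uv basis, and the map N back to ijk produces a single matrix of exactly this shape:

```python
import numpy as np

# Hypothetical coefficients for M (ijk -> uv basis) and N (uv -> ijk basis);
# the numbers are arbitrary, only the zero pattern matters here.
M = np.array([[0.8, 0.3, 0.0],
              [0.1, 0.9, 0.0],
              [0.0, 0.0, 0.0]])
N = np.array([[1.0, 0.2, 0.0],
              [0.3, 1.1, 0.0],
              [0.5, 0.7, 0.0]])
P = np.diag([1.0, 1.0, 0.0])   # projection onto the ij plane (kills k)

T = N @ M @ P                  # the combined transformation
print(T)                       # its third column is all zeros
```

The zero third column is exactly why the dimension along k is lost: no component of the input x3 survives the chain.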

then T(a) - h*k = b:
(1    0   0) (x1)   (0)   (x1)       i
(0    1   0) (x2) - (0) = (x2)       j
(t31 t32 0) (x3)    (h)   (x3 - dx3) k

Now, how do we choose which indexed element l of the quotient space X/Y (or what value of h, the intercept) is the best one for our purposes? Actually, there are many reasonable criteria we could use, but the usual, default one is minimisation of the sum of the squares of the deltas (or residuals), i.e. RSS (residual sum of squares), which looks reasonable, takes a nice form in matrix notation, and also gives nice analytical equations for the first and second derivatives needed for the minimum calculation.

Let’s forget the bases ijk and uv for a moment, drop them from our notation, and denote i ∈ I (1…n), where n is the size of our data set A. Then the i-th element’s delta (dx = b - a) can be written as:
(x1i - x1i)                   = (0)
(x2i - x2i)                   = (0)
(t31*x1i + t32*x2i - h - x3i) = (dx3i)

Leaving only the non-trivial dimension k we do the regression on, we can write the system of equations for all 1..i..n data elements in matrix form:
(x11 x21 1) (t31)   (x31)   (dx31)
( …       ) (t32)   ( … )   ( …  )
(x1i x2i 1) (-h ) - (x3i) = (dx3i)
( …       )         ( … )   ( …  )
(x1n x2n 1)         (x3n)   (dx3n)

or, compactly: X*t - x3 = dx, or, as it is usually denoted in the literature, X*b - y = e. Let’s partially borrow that notation for ease of mental mapping of this rubric to books, leaving, though, t in place, because we use b for other purposes. We’ll also drop the index 3 in t and rename -h to t0.
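In code (a NumPy sketch with a small synthetic data set; the names and numbers are my own), the design matrix and the residual vector look like:

```python
import numpy as np

# Synthetic 3D data lying near a plane x3 = 2*x1 - 1.5*x2 + 0.5
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 2.0*x1 - 1.5*x2 + 0.5 + rng.normal(scale=0.1, size=n)

# Design matrix X: columns x1, x2, and a column of ones for t0 = -h
X = np.column_stack([x1, x2, np.ones(n)])

t = np.array([2.0, -1.5, 0.5])   # some trial coefficients (t1, t2, t0)
e = X @ t - x3                   # the vector of deltas (residuals)
rss = e @ e                      # residual sum of squares
print(rss)
```

The column of ones is what turns the intercept t0 into just another coefficient, so the whole system stays a single matrix product.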

Having expressed RSS = e1*e1 + … + ei*ei + … + en*en = eT*e (where eT is the transposed vector e), we can do the same with the left part of the equation as well: (X*t - y)T*(X*t - y) = eT*e; and we want to minimize eT*e (or (X*t - y)T*(X*t - y)). But take note: here we have moved from one problem – mapping our data set from a space with the same dimensions as the original data elements (T(a), ijk) into a subspace with reduced dimensions (b, uv) – to another one. Using our data as a transformation matrix, we now map the coefficients of the transformation matrix of the original problem (or, we can say, the transformations T themselves – yes, a transformation may also be a space element) (X(t), ijk) (with the original number of dimensions – say, the T-space) into a space of deltas (or errors – say, the E-space) with the number of dimensions equal to the size of our original data set (e, 1..n). And we want to find such an element of the T-space (for one-output linear regression it will be a vector t, while for the multi-output regression we look at later, it will be a matrix T) which our data transform into the smallest element of the E-space.

As usual in Calculus, to find a minimum point (actually, a critical point, which includes maximum and saddle points) of a graph (curve, surface, or a generic multi-dimensional data set, or a product of 1-dimensional ones), we take the Gradient, Grad y = @y/@x1*i + @y/@x2*j + … + @y/@xn*n, at that point, expecting it to equal 0. Or, for one-output functions (Rn->R), it may be intuitively easier to look for a null Differential, dy = @y/@x1*dx1 + @y/@x2*dx2 + … + @y/@xn*dxn (which is, anyway, related to the Gradient: (Grad y)T*dx = dy). In either case, we end up looking for null partial derivatives of RSS with respect to the basis of t, and, generalizing from the initial 3D case to 1..j..m parameters, we get:

@Sum[i=1..n]((t1*x1i + … + tj*xji + … + tm*xmi + t0 - yi)*(t1*x1i + … + tj*xji + … + tm*xmi + t0 - yi))/@t1 = @RSS/@t1 = 0

@Sum[i=1..n]((Sum[j=1..m](tj*xji) + t0 - yi)*(Sum[j=1..m](tj*xji) + t0 - yi))/@tj = @RSS/@tj = 0

@Sum[i=1..n]((Sum[j=1..m](tj*xji) + t0 - yi)^2)/@tm = @RSS/@tm = 0

After differentiating:

2*Sum[i=1..n]((x1i )*(t1*x1i + … + tj*xji + … + tm*xmi + t0 – yi)) = 0

2*Sum[i=1..n]((xji )*(t1*x1i + … + tj*xji + … + tm*xmi + t0 – yi)) = 0

2*Sum[i=1..n]((xmi )*(t1*x1i + … + tj*xji + … + tm*xmi + t0 – yi)) = 0

and converting the equations above into matrix form we get (actually, the multiplication signs are not strictly necessary; they are just a visual convenience for navigating between the transpose and inversion notations, and the parentheses):

2*XT*(X*t - y) = @RSS/@t = 0
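We can check this gradient formula numerically against finite differences (again a NumPy sketch with synthetic data of my own):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
X = np.column_stack([rng.normal(size=n), rng.normal(size=n), np.ones(n)])
y = rng.normal(size=n)
t = rng.normal(size=3)

def rss(t):
    e = X @ t - y
    return e @ e

# Analytical gradient: 2 * X^T * (X*t - y)
grad_analytic = 2 * X.T @ (X @ t - y)

# Central finite differences over each coordinate of t
eps = 1e-6
I3 = np.eye(3)
grad_numeric = np.array([
    (rss(t + eps*I3[j]) - rss(t - eps*I3[j])) / (2*eps)
    for j in range(3)
])
print(np.max(np.abs(grad_analytic - grad_numeric)))
```

Since RSS is quadratic in t, the central difference agrees with the analytical gradient up to rounding error.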

In this transformation we map t of the 1..j..m-sized basis (in our original 3D case, ijk) to a basis of the same size in the partial-derivative space, looking for the t that is mapped to null, i.e. that belongs to the kernel. Which gives:

XT*X*t = XT*y

and after multiplication of both sides by the inverse (XT*X)^-1:

(XT*X)^-1*XT*X*t = (XT*X)^-1*XT*y

where (XT*X)^-1*XT*X = I, the identity matrix, i.e. diagonal with coefficients 1, hence I*t = t, and then:

t = (XT*X)^-1*XT*y
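This closed form translates directly into code. A sketch (synthetic data and names are my own); in practice one solves the linear system rather than explicitly inverting XT*X, with the library least-squares routine as a cross-check:

```python
import numpy as np

# Synthetic data near the plane y = 2*x1 - 1.5*x2 + 0.5
rng = np.random.default_rng(2)
n = 100
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 2.0*x1 - 1.5*x2 + 0.5 + rng.normal(scale=0.1, size=n)
X = np.column_stack([x1, x2, np.ones(n)])

# t = (X^T X)^{-1} X^T y, computed as a linear solve (more stable than inv)
t = np.linalg.solve(X.T @ X, X.T @ y)
print(t)   # close to [2.0, -1.5, 0.5]

# Cross-check against NumPy's own least-squares solver
t_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Both routes minimize the same RSS, so the two answers coincide up to floating-point noise.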

However, we have yet to find out whether our critical point is in fact a minimum, and not a maximum or a saddle point.

As usual, and obvious for the 1D mapping (function of one variable) y = f(x), we are looking for the second derivative to be positive. If the first derivative, usually pictured as the slope of the tangent line at the point where we take it, may be more intuitively envisioned as the “speed” of moving along the graph y in the Y space while we are moving along x in the X space, then the second derivative is the “acceleration” of that movement. For example, a plane diving (say, for a zero-gravity simulation) and then climbing back up changes its vertical speed from a negative value to a positive one, transitioning through 0 at the critical point of the minimum, while its acceleration remains positive (which is what makes that critical point indeed a minimum – whatever large negative speed the plane has, sooner or later (if it doesn’t hit the ground) it will be attenuated by the positive acceleration to 0, and then the plane will start climbing up). If we get rid of the time variable, the first derivatives along the function graph, dy/dx (the “speed” of changing y with changing x), to the left and right of the minimum will have the same sign as dx, while the second derivatives, d^2y/dx^2 (the “acceleration” of changing y with changing x), will be positive. Actually, the latter ensures the former, because positive “acceleration” makes dy positive (the further away from the minimum, the greater y is), and the sign of the first derivative comes from the direction of dx.

The similar approach works in a multi-dimensional case too, but in that case the measure of “acceleration” of the f:Rn->R is called Hessian matrix H:

(@^2f/@x1^2   …  @^2f/@x1@xi  …  @^2f/@x1@xn)
( …                                         )
(@^2f/@xi@x1  …  @^2f/@xi^2   …  @^2f/@xi@xn)
( …                                         )
(@^2f/@xn@x1  …  @^2f/@xn@xi  …  @^2f/@xn^2 )

We still want to make sure that whatever our (positive, or same-sign) movement dx is in our X domain, it will be mapped into a positive movement dy in our range space Y after applying the H transform (which is actually what brute-force gradient descent methods probe). In the literature such a transform (matrix) is called positive definite, but for better intuitive understanding let’s rather start from the idea of eigenvectors and eigenvalues (these things are closely related anyway). An eigenvalue is such a scalar lambda (of a transform M) that, for a particular vector (or vectors) v, scaling by it gives the same result as the regular mapping: M*v = lambda*v. There can be multiple eigenvalues and eigenvectors for a particular transform. The nice thing about eigenvectors is that if we find enough linearly independent ones (as many as the X space has dimensions, so they span it and form a basis of the space), we may express any vector x = x1*v1 + x2*v2 + … + xn*vn in the new eigenbasis. If it happens that all the eigenvalues lambdai (the coefficients of the diagonal matrix L) for that eigenvector basis are positive, the M transform will be positive for any vector of our domain space X. In the case of the H transform, that would mean that the point with the null Gradient/Differential is indeed a minimum.

H*x = x1*H*v1 + x2*H*v2 + … + xn*H*vn = x1*lambda1*v1 + x2*lambda2*v2 + … + xn*lambdan*vn = L*x

or variation on the positive definite matrix definition:

xT*H*x = (x1, x2, …, xn)*(x1*lambda1, x2*lambda2, …, xn*lambdan)T = x1*x1*lambda1 + x2*x2*lambda2 + … + xn*xn*lambdan > 0
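The eigenbasis trick is easy to see numerically (a throwaway NumPy sketch; the matrix is randomly generated, not our actual Hessian): expressing x in the orthonormal eigenbasis of a symmetric matrix turns the quadratic form into an eigenvalue-weighted sum of squares.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(3, 3))
H = A.T @ A                 # a symmetric positive semi-definite matrix

lam, V = np.linalg.eigh(H)  # eigenvalues and an orthonormal eigenbasis
x = rng.normal(size=3)
c = V.T @ x                 # coordinates of x in the eigenbasis

# x^T H x equals the eigenvalue-weighted sum of squared coordinates
quad = x @ H @ x
print(quad, np.sum(lam * c**2))
```

Since H here is of the form A^T A, all its eigenvalues are non-negative, so the quadratic form cannot go negative for any x.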

In the general case, we may want to find the eigenvectors and eigenvalues themselves (they are very handy for transform analysis anyway – and, in our case, because our data set is used as a transform, for data analysis too). However, in special cases, when we only need to figure out what type of critical point we are at, we can estimate whether our H matrix is really positive definite, or whether it can have negative lambdas.

Let’s calculate the Hessian for our RSS case, which can be expressed in a compact matrix form as:

@^2RSS/@t@tT = 2*XT*X

Which is a quadratic form, and it guarantees that, whatever our data xij are, dtT*XT*X*dt = (X*dt)T*(X*dt) is a sum of squares and hence non-negative: for any dt of our domain space, the “acceleration” value in the range space will be positive (or at least non-negative).

The only thing we have to make sure of is that we have enough eigenvectors to form an eigenbasis, i.e. that there are no linear dependencies in H – in other words, that it is not degenerate, or is a full-rank matrix; otherwise we can’t “control” the behaviour of the “acceleration” in some dimensions.
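Both checks – non-negative eigenvalues and full rank – take a couple of lines (a sketch with synthetic data of my own; a duplicated column simulates the degenerate case):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 40
X = np.column_stack([rng.normal(size=n), rng.normal(size=n), np.ones(n)])
H = 2 * X.T @ X                       # the RSS Hessian

eigenvalues = np.linalg.eigvalsh(H)   # all non-negative for H = 2 X^T X
full_rank = np.linalg.matrix_rank(H) == H.shape[0]
print(eigenvalues, full_rank)

# A duplicated column (a linear dependency in the data) makes H
# rank-deficient, and the minimum is no longer unique
X_bad = np.column_stack([X[:, 0], X[:, 0], np.ones(n)])
H_bad = 2 * X_bad.T @ X_bad
print(np.linalg.matrix_rank(H_bad))
```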

Of course, we have not discovered any Americas here, and all these derivations may be found in many Statistics and Data Science books, and the Linear Regression functions are implemented in many libraries; but we want to experiment a bit further with regression algorithms on real data, so let’s have our own R and/or Python library for it, to play with…

Posted in Data Science Done Right | Tagged , , , | Leave a comment

Data Science Done Right (Kitchen Style) #4

What is a Linear Regression?

If we take a look at a general definition of the term Regression, we will find something like: “transition to a simpler or less perfect state”. Perfection is quite a subjective category, and, depending on the context and point of view, the same phenomenon may be viewed, by the same person, as more or less perfect for one purpose or another. For example, a “simpler” state or model may be viewed as “less perfect” for purposes of simulation accuracy, or “more perfect” for ease and clarity of understanding. So, let’s rather stick to the clearer and more distinct “simplicity” aspect of the definition.

Applied to the Data Science meaning of “simplicity”, and especially in the context of space mappings, it would, obviously, mean reduction of the dimensionality and/or of the number and complexity of Relations between the Space objects. Which, actually, means Projection of our Data from a Super- to a Subspace. We won’t usually know beforehand which Subspace is more suitable for our purposes, but we may have an idea about possible variants from which we may choose, applying particular criteria, the best one. As was already mentioned, any objects may be members of a Space, including other Spaces, or Subspaces. Here, the concept of Quotient (or Factor) Spaces may be useful. In such a Space, its members are its disjoint (not having common elements) Subspaces.

Let’s imagine a Crepe Cake, which, as a whole, is a 3D Space, but which can also be thought of as a 1D Quotient Space of the 2D Crepes. Also, let’s imagine we have Blueberries somehow stuffed in between our Crepes. And then, we somehow want to associate (via a so-called Equivalence Relation) all these Blueberries with only one Crepe, for example by protruding (Projecting) those Blueberries through the other Crepes with our fingers and smashing them into the One Chosen Crepe (or making sure they somehow squeeze and move through the holes made by toothpicks). All these Blue Spots on the One Chosen Crepe we may call an Equivalence Class. And we may want to minimize the ruin we have just done to our Cake by choosing the One Crepe that would ensure that, and that will be the condition for our Equivalence Relation.

Of course, other criteria may be chosen, for example the Crepe with the biggest holes in it, or something else. Also, we may want to bore the holes in the Cake not with straight, but with crooked fingers (that won’t be a Linear Transformation), or put the Cake on the edge of a table and let it bend like the Clocks in Salvador Dali’s paintings (those won’t be Linear/Vector Subspaces), and then bore it with straight fingers. We may decide those non-linear Blueberry Transformations and Subspaces are even cooler than the linear ones (for example, the crooked holes in the Cake would make it a Piece of Art), but for the Linear, One-Output-Parameter Regression from a 3D Space into a 2D Subspace we will stick to the algorithm (Linear Projections from a Vector Superspace into Vector Subspaces) described in the former paragraph.

Technically, we may use Linear Transformations (but not Projections, which immediately eliminate dimension(s) of the original Vector Space) that vary from one Data element to another, which actually may be a way to linearize non-linear transformations (not your usual Linear Regression), but that will call for a slightly different mathematical treatment (adding one more transformation in the target subspace) of the Transformation presented in the next chapter.



Data Science Done Right (Kitchen Style) #3

More on the fundamental Method of Data Science

When we want to model an unstructured collection of real-world phenomena, we use such a mathematical abstraction as a Set. It can contain not just simple elements (or objects, or members – these are interchangeable terminologies) such as numbers, or more complex mathematical abstractions (for example, Sets themselves) – the members of a Set can be really any possible or imaginable objects. If we want to introduce (and we usually do) a Structure over these objects, we use such a mathematical abstraction as Space. A Space is a Set with Relations (or, as a special case, Mappings (Functions or Transformations – these, too, are interchangeable terminologies), or, as an even more special case, Operators) defined over its members. Functions are Relations that define a correspondence of a member of the Domain Space (the Space we map from) to exactly one member of the Range Space (the Space we map to). Of course, there may be multiple Domain members that map to the same Range member, but we do not split them. Operators are mappings to the same Space.

For example, to define a Linear (Vector) Space we have to declare which element(s) will be the Identity element(s), and we have to define Operators of member Addition and scalar Multiplication in such a way that their result is still an element of the Space (i.e. such Mappings are, indeed, Operators), and that those Operators are Associative, Commutative and Distributive; and we have to declare that every element has its Inverse, whose Sum with it produces the Identity (the Additive Identity being the Null element). Again, the members of such a Space may be not only numbers or lists of them, but also any phenomena, or even their relationships. We just have to define the Operators on them as described above, and then we can apply the whole Linear Vector Space analytical apparatus to our newly created Space.
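As a tiny illustration (a throwaway sketch of my own, not from any library), we can turn, say, real-valued functions into members of a Linear Vector Space simply by defining Addition and scalar Multiplication on them:

```python
# A sketch: real-valued functions form a vector space once we define
# pointwise addition and scalar multiplication on them.

def add(f, g):
    return lambda x: f(x) + g(x)   # (f + g)(x) = f(x) + g(x)

def scale(c, f):
    return lambda x: c * f(x)      # (c*f)(x) = c * f(x)

zero = lambda x: 0.0               # the additive Identity (the Null element)

def neg(f):
    return scale(-1.0, f)          # the additive Inverse

f = lambda x: x * x
g = lambda x: 3.0 * x

h = add(scale(2.0, f), g)          # 2*f + g is still a function
print(h(2.0))                      # 2*4 + 6 = 14.0
```

Nothing about the members being numbers was needed; only the Operators and the Identity/Inverse declarations were.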

For Metric Spaces we have to define a Distance Function (which is, strictly speaking, a Relation, therefore these are more generic Spaces) that gives us a distance between any two selected elements of the Space, and we are free to define it however we want, not necessarily being bound to only the Euclidean distance calculation. For the most generic, Topological Spaces, we define Topologies – those Sets that basically tell us whether elements of the Space are in a Relation of being neighbours or not, and where the boundaries lie between neighborhoods.
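For instance (again a throwaway sketch), nothing stops us from defining a non-Euclidean Distance Function, as long as it behaves like a distance:

```python
import math

def euclidean(a, b):
    # The usual Euclidean distance
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # A perfectly valid alternative metric: city-block distance
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q), manhattan(p, q))   # 5.0 and 7.0
```

Both functions metrize the same underlying Set, but they induce different notions of "closeness" over it.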

Representing our Data as Spaces with Structure definitions over them is, obviously, useful for finding Structural relationships between the Data elements, and sufficient for the Unsupervised Learning methods of Data Science. In addition, by defining Mappings or Relations between Spaces, we can ask (and answer) such questions as: “Can two Spaces be mapped to each other?”, “Is one of them a Subspace of the other?”, “Is that mapping continuous (isomorphic/homeomorphic)?” In terms of Data Analysis, those questions and answers will tell us whether our Data sets have the same or similar Structures, allowing us to recognize Patterns and mine Data.

Those Relations, Functions, or Operators define the Structure of the Space to which members of the Space can be subjected, or which is “visible” in the Space. Our Real-World Data could have a much more sophisticated “Real Structure”, but, when modelling the Real-World Data in a particular modeling Space, we will be able to see no more Structure than we defined in the model. Or maybe even less Structure in the Data, if our expected, model Structure is not present in the Real-World Data. For example, a Decision Tree (which is such a Relation) formulated to pinpoint fraudulent credit card use will not make visible the authentic owner’s spending habits (for which we will need another kind of Decision Tree). Or, a Linear Vector Space will make visible to us only the linear Structure of the relations between Data elements. Or, which is usually the case in Topological Data Analysis, if we Metricize a generic Topological Space, we will lose the non-metrizable relations.

Because, using Statistical Modeling, we cannot (or do not bother to) get an insight into the causes and driving forces of our Data, and we treat them like movements inside a “black box”, we are also in the darkness (of that box) about whether all the aspects (parameters or variables) of the objects we study are, indeed, their defining parameters (as we may think they are), and not incomplete or overlapping combinations of the “real” (independent) parameters. Because of that, we are bound to see those parameters as random and dependent on each other (welcome to the real-world, or “nasty”, or “dirty” Data). Which is really not the Data’s problem, but, instead, our problem of failed assumptions, expectations, or, in a way, ego.

If the aspects (variables) of the Data, and the Structure Operators of the initial model, do not give us much meaningful information, we may want to map the Data isomorphically, or at least partially, homomorphically, onto another Space with topologies more relevant and interesting for us, with different bases and different (reduced or introduced) dimensions. That may make visible those Structures we are interested, or maybe surprised, to see, eliminate or reduce variable dependencies, or even reduce the very “randomness” of the variables.

But enough wordy theorizing; let us see how Linear Regression, the workhorse of Statistics/Data Science, looks from the Space Mapping point of view…


Data Science Done Right (Kitchen Style) #2

What really is Data Science?

So, let us start with the simple basic questions: what are the Object and the Method of Data Science studies? In my humble opinion (which, of course, may be naive, erroneous, or trivial, as may any other statement in the following text, for each of which I am not going to repeat this caveat, but will always imply it), what we are looking for in the Data is their Structure, which word by itself, though, tells or explains nothing. Let us look at it in the context of the linguistic and cultural Structuralism of the XX century (OK, OK, it is not fashionable anymore, because we live in the age of Post-Structuralism, or even Trans-Structuralism, but that changes no basics).

Structuralists usually define Structure as a mesh of Opposition relationships between objects of the domain of interest. This definition still leaves a lot of room for interpretation, and I prefer to look at Opposition not as something adversarial, but, rather, as a state of two peers being in some kind of relationship, which may or may not be Divisive. For example, we may be interested in finding out whether the given objects in a pair are neighbors or not. Those relations (or we could say Relations in the Algebraic sense, i.e. if we have sets A and B, then a subset of the product AxB is a Relation defined by some criteria), really, are the fundamental Object of Data Science studies. That is pretty obvious for the unsupervised learning, clustering methods, but it also stands for the other, supervised ones.

Even if we take a look at Descriptive Statistics, we will see that those numbers, functions or diagrams let us peek at various aspects of the Data Structure in a compact, integral form, without drowning in the excessive particulars and mass of the Data.

Now, having gotten an idea of what we want to study, we may start thinking about the Method with which we may do it. Definitely, it will be a branch of Mathematical Modelling, but not the one we usually use in the “Hard”, Natural Sciences. In the Natural Sciences we also strive to uncover a Structure, but a Structure of the causes and driving forces of the data being observed. In the general case we end up with a system of (partial) differential equations that we usually cannot solve Analytically. Then we either linearize, or simplify, or modify our models to reduce them to a form that has a known analytical solution that is (relatively) easy to comprehend, and works fine in a wide range of initial conditions. Or, if such an approach is not possible or acceptable, we resort to the data crunching of Numerical Methods, which are, basically, the same linearizations, simplifications and modifications, but applied on a small temporal or spatial scale repeatedly, which is easily machine-automated. However, obtaining a solution this way has a cost – limited convergence intervals; and, if the initial conditions are changed significantly, all bets are off that such a solution will work, not only with the required accuracy, but even on the level of general tendencies. Sounds familiar to the Data Scientists, huh?

“Soft” Science generally despises such an approach of mathematically modelling the causes and driving forces of the data it deals with (supposedly because it deals with much more complicated matters, and such deterministic analysis is practically useless – yeah, yeah, “an invisible hand of the Free Market” will sort everything out instead). What it usually looks for is the Structure of the Data itself, or its appearance. The useful branch of Mathematical Modelling in such a case is Statistical Modelling. Similarly to the world of Natural Science, we may be interested either in the more simplified, but universalistic and insight-giving Statistical Inference, or in the more result-oriented, but convergence-limited Predictive Modelling. Machine Learning is closely associated, and largely overlapping, with the latter, because its resulting models are hard to interpret in the analytical sense.

Data Mining and Pattern Recognition are also associated with each other, and do exactly what the latter says – search for a Pattern, or a Structure, of the Data. However, the latter usually looks for Patterns by example, while the former looks for something new; and the more unexpected that Structure is, the better. They reside in the middle between Inference and Machine Learning because, on the one hand, we still want to have some analytical insight, and on the other, we may greatly benefit from the power of “number crunching”. Of course, if the aim of the Pattern Recognition is purely utilitarian (to arrest a particular government protester, or kill a particular jihadist from a drone), then that brings it closer to Predictive Modelling.

Again, with the “Soft” Sciences shying away from mathematical methods, the niche of the theoretical branches of the “Hard” Sciences in the “Soft” Science realm was taken over by the (semi)autonomous Data Science. Of course, in the real world the partition of the Mathematical Modelling branches described above is not strict, and “Hard” Science uses a lot of Statistical methods, though they play more of a servile role in the initial empirical data processing, before the real theorizing begins (or in the verification of theories against the empirical reality), while there is some place for Analytical Mathematical Models even in “Soft” Science.

Nevertheless, what is the fundamental Method that lies at the foundation of all the above-mentioned (as well as unmentioned) methods of Data Science? The Method that might not be frequently reflected upon in real day-to-day practice?


Data Science Done Right (Kitchen Style)

Motivation and Style

Of course, the name of the blog is pretentious and plagiarized. It may look overconfident; nevertheless, I have quite humble reasons for naming it that way. Having taken Modern Algebra and Statistical Methods classes together in my graduation year, I found myself confused by the mix of two approaches: one deep, fundamental, and universalist, and another utilitarian, mechanistic, and, honestly, appearing to have not much Science behind it.

I felt the same difference between the two ways the Linear Algebra classes were taught. The seemingly mechanistic, close-to-the-ground Linear Algebra I, in its Linear Algebra II incarnation, turned into a much deeper and more thoughtful discipline. It was a much tougher class, but, in a way, a more “mind-calming” one – it turns out there was a meaning, a reason for all those matrix manipulations you merely memorized in the first class. The second course was taught from Sheldon Axler’s textbook Linear Algebra Done Right. Apparently, that title is the inspiration for this blog rubric.

Similarly, in the following posts I hope to find out and explain to myself the deep meaning of the confusing Data Science buzz-terminology: “Data Mining”, “Machine Learning”, “Artificial Intelligence”, “Deep Learning”, “Big Data”, etc. Unlike the more rigorously defined terminology, such as “Statistical Inference”, “Predictive Modelling”, “Reinforcement Learning”, “Pattern Recognition”, the former vocabulary is fuzzily defined, redundant, and confusing even for seasoned Data Scientists.

Of course, I do not envision the blog being in any way comprehensive or exhaustive (I simply do not have the qualifications for that), but rather spotty, fragmented, touching the most “unsettling” topics (for me), and, maybe, homing in on some “calming” answers 🙂 Because this is mainly a self-directed stream-of-consciousness dump, the writing is left in a scratch-book style, hardly proof-read, and, therefore, I beg pardon from the occasional visiting readers.

I am not going to properly format citations, and I will overuse, or even abuse, capitalization and Italic fonts, which is, of course, not proper Scientific Writing :), especially since the use of this emphasis formatting will not be consistent – the usage context will drive the choice of that abuse. Mathematical proofs, whenever used, are not rigorous, but, rather, illustrations of proofs, intended to make them more understandable and intuitively clear. When some code is involved, I’m not going to torture readers by inlining it, but will make it freely available one way or another.

Having gotten some clarity about what this rubric is about and its form, let us head on to the most fundamental question about Data Science: “What really is it?”

UPDT: Actually, in the course of writing I realized I can dilute my plain plagiarism with a bit of originality – make these talks Kitchen Talks. Like talks of people not rushing anywhere, relaxing in the kitchen with a cup of tea or coffee, with a piece of cake, or pastry, or other gourmet food, on which we are going to do our thought (or maybe quite physical) experiments.


Should you trust your senses? An essay on Descartes’ Meditations.

Yes, we should trust our senses. That is the only way we live and communicate with the world outside us, and even inside our bodies. Descartes in his “Meditations” just pretends that he does not trust his senses, to gain a new insight into his mind, to make a switch in his customary form of thinking, to look at the world from a new perspective. Even though Descartes says he rejects everything he knew before, he conveniently retains memories of Classical and Medieval philosophy, which he periodically refers to.

Descartes thinks he has accumulated a lot of questionable and dubious ideas through his life. He concludes that all those ideas come from the senses, which are not trustworthy. It is easy to doubt the credibility of our senses when they work at the edge of their sensitivity – for example, in the recognition of small or far-away objects. It is much harder to doubt a bigger chunk of our senses, especially when they work within their confidence interval, because their correctness can be proved by experience. If we are going to insist that some of our senses are wrong, we risk being considered mad. However, it is much easier to reject our senses as a whole. For example, in a dream we perceive all the weird events of the dream as normal and real, but we can recognize the strangeness of the dream only outside of the dream “reality”, when we wake up. That is why Descartes decides to question the reliability of the whole world of the senses.

If we accept the idea that our real world, given to us through the senses, is just another dream of a higher rank, we may want to find criteria that would let us spot the imperfections of dreams (especially those of a lower rank) compared to the really real world. The importance of such criteria is stressed, for example, in the motion picture Inception. For a moment Descartes follows this path, suggesting that human fantasy is impaired by its scantiness: the fantastical creatures humans make up are just combinations of parts of real animals, and images in dreams are like bleak paintings of real things. Descartes implies that if we dream a dream impressed on us by some “Architect” (in terms of Inception), we can use the “simple and universal” invariants of the really real world (such as mathematical concepts) as landmarks for detecting a dream. But Descartes quickly withdraws from this path, suggesting that the “Architect” may be an omnipotent God capable of creating a deceptive dream as perfect as the really real world (i.e., the “totem” from Inception would not work).

Descartes meets the possible counter-arguments, that God might be non-omnipotent, or that God could not possibly be deceptive because deception is a manifestation of imperfection, by admitting that he has no answers to these objections. Descartes started doubting his senses because he wanted to keep only certain and distinct ideas in his understanding of the world, so he is willing to build that understanding for the worst-case scenario, in which God's very task is to deceive Descartes with all His might. If Descartes can still infer anything about the world even under conditions of omnipotent deception, those inferences will be quite certain and unshakable.

Descartes is able to identify at least one thing about which even an omnipotent evil deceiver cannot possibly trick him. The deceiver cannot make Descartes believe that he does not exist; therefore the statement “I am, I exist” cannot be taken away from him. Descartes finds other certain qualities of his “I,” which is a “thinking thing” and which exists only while he thinks. His “I” also has senses, which does not mean that these senses are somehow real; strictly speaking, sensing is a type of thinking. The “I” can have mental images, which are modifications of thought as well.

Another classification of thought aims to identify which type of thought is prone to error. Descartes divides thoughts into three categories. Two of them, simple ideas (even unreal and imaginary ones) and emotions, cannot be judged true or false. We can err only with thoughts of the third category, named judgments.

Descartes then tries to analyze how the human mind works, taking the example of wax. Even though wax, in an environment usual for humans, may appear to our senses in different states of matter, we perceive it as the same substance. He concludes that our mind does not comprehend things through the senses, which only give us information about the appearance of substances; rather, our mind understands substances by their essence. This is similar to Plato's Forms. The conclusion is encouraging because, to proceed further, Descartes has to take on the question of God's existence, and since God cannot be comprehended by the senses, one must think about Him in terms of the world of Forms.

Descartes attempts to find out which ideas could have originated from himself and which could not. Based on the principle that the cause must be greater than the effect, he concludes that the idea of an omnipotent, omniscient, and perfect God can originate only from the omnipotent, omniscient, and perfect God Himself. This proof effectively echoes Aquinas's proofs of God's existence.

However, there are contradictions in Descartes' syllogisms. He says that the idea of physical objects, which are unthinking and extended things, was created by him as the opposite of the idea of himself as a thinking and non-extended thing. By the same logic, the idea of an omnipotent and perfect God could have appeared in Descartes' mind as the opposite of his non-omnipotent and imperfect self. On the other hand, using the concept of a cause greater than its effect, Descartes could have said that not only is it impossible for the idea of God to originate from himself, but the idea of physical objects likewise needs a greater cause in the form of an objective reality. In either case, whether a personal dream world of Descartes whose ideas come only from himself, or an objective world created by a perfect God, he could trust his senses, because they are either his alone and not impressed by a foreign “Architect,” or they are caused by objective reality.

Perhaps sensing the deficiency of his arguments, Descartes makes a second attempt to justify the existence of God by asking whether he could have been created by anybody other than God. His answer is “No,” based on the same medieval principle that the cause must be greater than the effect, but he introduces a new twist to this reasoning. Descartes says that his continuous existence is of the same nature as his creation. Thus his world, instead of having been created once (long, long ago and perhaps left by God on its own), becomes a dream of God that requires God's constant attention, actively re-dreamed in its whole fullness at every next moment.

Having proven the existence of the perfect God, Descartes proceeds to analyze where errors come from. He says that human free will is as great as God's. Following Augustine's thinking, he holds that his ability to understand is qualitatively as good as God's, but quantitatively lesser. When the free will of judgment exceeds the scope of understanding, an error appears. By his own standards, Descartes' will to question his senses and his desire to prove God's existence exceeded the capacities of his contradictory understanding and made his subsequent conclusions about the credibility of the senses shaky.

Posted in Uncategorized | 5 Comments

“Eating Our Way Toward Civilization: How food processing shaped human body, mind and cultures” book announcement

Eating Our Way Toward Civilization: How food processing shaped human body, mind and cultures in paperback and $0.99 e-book:
for Kindles on Amazon
for iPad/iPhone/iPod on iTunes
for Android devices on GooglePlay
for Nook on Barnes&Noble

In this series of short essays the author surveys the work of anthropologists, archeologists, cultural sociologists, historians, and gastronomists who argue that cooking played a significant role in human evolution and history, and that sometimes whole empires “were built not by the sword but by the spoon.” Ancient and modern cuisines bear a deep imprint of the civilizations in which they appeared and developed. One can get a real insight into cultures, often long gone, by trying recipes created then and there. This maxim has not ceased to be true: contemporary cultures, too, can be judged by their cuisines. The verdict issued by this criterion, when applied to the American diet, may appear shocking: American culture is a childish, young adolescent one. As a special treat, an essay about everyday life in the Soviet Union, People Waiting in Line, is included.

Posted in Uncategorized | Leave a comment