Let’s formalize the Kitchen Style analogy from the previous post in a notation better suited to coding, though still a bit verbose, as is typical for Kitchen Talks. We’ll start with the same 3D case and generalize it to higher dimensions later. Let’s denote scalars by lowercase italic letters (such as *a*), vectors by lowercase bold letters (such as **x**), spaces and sets by uppercase non-bold letters (such as A), non-vector members of spaces or sets by lowercase letters (such as f), and transformations (in this case linear ones, i.e. matrices) by uppercase bold letters (such as **M**).

Let’s take a 3D vector space X on which we define a basis **i**, **j**, **k**. Let’s take our 3D Data and represent it as a set A in the space X. An arbitrary member **a** ∈ A ⊂ X can be represented as a linear combination *x1* **i** + *x2* **j** + *x3* **k**. Let’s choose a dimension (represented by, say, the basis vector **k**) along which we are going to project, regressing our 3D space X into a 2D space Y, or rather into one, not yet determined, member of the quotient space X/Y. Torturing the notation a bit, let’s call it (X/Y)*l* (or Y*l*), where *l* ∈ I (−∞ … 0 … +∞). Let’s define a basis in Y*l* as **u** and **v**.

Let’s denote by T the transformation of an arbitrary **a** ∈ A to **b** ∈ B, where B is the equivalence class of all projections of A onto the element Y*l* of the quotient space (the plane we are seeking in this regression): T(**a**) = **b** = *x1* **i** + *x2* **j** + *h* **k** (with respect to the basis **ijk**, where *h* = *x3* − *dx3*).

(figure 1)

The same transformation may be achieved through the following chain of transformations: Pij(**a**) = **c** (projection along **k** onto the plane **ij**), M(**c**) = **d** (projection onto Y*0*, with the origin of the **uv** basis at the **ijk** origin), N(**d**) = **e** (transformation back to the **ijk** basis), Q(**e**) = **b** (linear shift by **h** from Y*0* to Y*l* along **k**; the quotient space thing). Let’s define **f** as the representation of the same element with respect to the basis **uv**: **f** = *y1* **u** + *y2* **v**, where **f** ∈ Y*l*, **b** ∈ X. Here **d** = **f**, because **b** = **h** + N(**f**) = N(**d**) + **h**, hence **f** = **d** (by the commutativity of addition in a vector space, though not in the general case).

(schema 2)

We may be too talkative, defining so many transformations, but we may need them later to better understand the behaviour of our data in the reduced-dimensionality spaces themselves, and not only in the original space after the reverse transformation, as we usually do when working with regressions.

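Looking at schema 2 we can see that T(**a**) = QNMP(**a**) = **b** = **a** − *dx3* **k** = **a** + **h** − *x3* **k**,

or T(**a**) − **h** = **a** − *x3* **k** = **b**, or (all columns in the **ijk** basis):

$$
\begin{pmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{32} & b_{33} \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
-
\begin{pmatrix} 0 \\ 0 \\ h \end{pmatrix}
=
\begin{pmatrix} x_1 \\ x_2 \\ x_3 - dx_3 \end{pmatrix}
$$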

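And if Pij(**a**) = **c**, or (both columns in the **ijk** basis):

$$
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
=
\begin{pmatrix} x_1 \\ x_2 \\ 0 \end{pmatrix}
$$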

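M(**c**) = **d**, or (left column in the **ijk** basis, right column in the **uv** basis, whose third row is empty and is written as 0):

$$
\begin{pmatrix} a_{11} & a_{12} & 0 \\ a_{21} & a_{22} & 0 \\ 0 & 0 & 0 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ 0 \end{pmatrix}_{\mathbf{ijk}}
=
\begin{pmatrix} y_1 \\ y_2 \\ 0 \end{pmatrix}_{\mathbf{uv}}
$$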

Of course, we could have done the direct projection Puv = MPij without the intermediate step Pij, though that matrix could be less intuitive to obtain. If it is not, just forget the intermediate steps:

Puv(**a**) = **d**:

$$
\begin{pmatrix} a_{11} & a_{12} & 0 \\ a_{21} & a_{22} & 0 \\ 0 & 0 & 0 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}_{\mathbf{ijk}}
=
\begin{pmatrix} y_1 \\ y_2 \\ 0 \end{pmatrix}_{\mathbf{uv}}
$$

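N(**d**) = **e**, or (left column in the **uv** basis, right column in the **ijk** basis):

$$
\begin{pmatrix} c_{11} & c_{12} & 0 \\ c_{21} & c_{22} & 0 \\ c_{31} & c_{32} & 0 \end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ 0 \end{pmatrix}_{\mathbf{uv}}
=
\begin{pmatrix} x_1 \\ x_2 \\ e_3 \end{pmatrix}_{\mathbf{ijk}}
$$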

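NM:

$$
\begin{pmatrix} c_{11} & c_{12} & 0 \\ c_{21} & c_{22} & 0 \\ c_{31} & c_{32} & 0 \end{pmatrix}
\begin{pmatrix} a_{11} & a_{12} & 0 \\ a_{21} & a_{22} & 0 \\ 0 & 0 & 0 \end{pmatrix}
=
\begin{pmatrix}
c_{11}a_{11}+c_{12}a_{21} & c_{11}a_{12}+c_{12}a_{22} & 0 \\
c_{21}a_{11}+c_{22}a_{21} & c_{21}a_{12}+c_{22}a_{22} & 0 \\
c_{31}a_{11}+c_{32}a_{21} & c_{31}a_{12}+c_{32}a_{22} & 0
\end{pmatrix}
=
\begin{pmatrix} t_{11} & t_{12} & t_{13} \\ t_{21} & t_{22} & t_{23} \\ t_{31} & t_{32} & t_{33} \end{pmatrix}
$$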

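then T(**a**) − **h** = **b** (all columns in the **ijk** basis):

$$
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ t_{31} & t_{32} & 0 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
-
\begin{pmatrix} 0 \\ 0 \\ h \end{pmatrix}
=
\begin{pmatrix} x_1 \\ x_2 \\ x_3 - dx_3 \end{pmatrix}
$$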
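The chain above can be tried out numerically. What follows is only a minimal sketch in plain Python: the values of `t31`, `t32`, `h` and the vector `a` are hypothetical numbers picked for illustration, and the helper `mat_vec` is not part of any library. The delta is computed as the third coordinate of **b** − **a**.

```python
def mat_vec(m, v):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]

# Hypothetical coefficients and data element, for illustration only.
t31, t32, h = 0.5, -1.0, 2.0
a = [4.0, 3.0, 7.0]                 # a = x1*i + x2*j + x3*k

# Projection along k onto the plane ij.
P_ij = [[1, 0, 0],
        [0, 1, 0],
        [0, 0, 0]]

# Linear part of T: keeps x1 and x2, rebuilds the k coordinate from them.
T_lin = [[1, 0, 0],
         [0, 1, 0],
         [t31, t32, 0]]

c = mat_vec(P_ij, a)                                         # P_ij(a) = c
b = [p - q for p, q in zip(mat_vec(T_lin, a), [0, 0, h])]    # T(a) - h = b
dx3 = b[2] - a[2]                                            # delta along k

print(c, b, dx3)
```

The first two coordinates of **b** coincide with those of **a**; only the **k** coordinate is replaced by the value on the chosen plane.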

Now, how do we choose which indexed element *l* of the quotient space X/Y (i.e. what value of the intercept *h*) is the best one for our purposes? There are many reasonable criteria we could use, but the usual default is the minimisation of the sum of the squared deltas (or residuals), i.e. *RSS* (residual sum of squares). It looks reasonable, it is nice looking in matrix form, and it gives nice analytical expressions for the first and second derivatives needed to find the minimum.

Let’s forget for a moment the bases **ijk** and **uv** and drop them from our notation, and let *i* ∈ I (*1…n*), where *n* is the size of our data set A. Then the delta of the *i*-th element, d**x** = **b** − **a**, can be written as:

$$
\begin{aligned}
x_{1i} - x_{1i} &= 0 \\
x_{2i} - x_{2i} &= 0 \\
t_{31}x_{1i} + t_{32}x_{2i} - h - x_{3i} &= dx_{3i}
\end{aligned}
$$

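Leaving only the non-trivial dimension **k**, on which we do the regression, we can write the system of equations for all *1..i..n* data elements in matrix form:

$$
\begin{pmatrix}
x_{11} & x_{21} & 1 \\
\vdots & \vdots & \vdots \\
x_{1i} & x_{2i} & 1 \\
\vdots & \vdots & \vdots \\
x_{1n} & x_{2n} & 1
\end{pmatrix}
\begin{pmatrix} t_{31} \\ t_{32} \\ -h \end{pmatrix}
-
\begin{pmatrix} x_{31} \\ \vdots \\ x_{3i} \\ \vdots \\ x_{3n} \end{pmatrix}
=
\begin{pmatrix} dx_{31} \\ \vdots \\ dx_{3i} \\ \vdots \\ dx_{3n} \end{pmatrix}
$$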

or, compactly: **Xt** − **x3** = **dx**, or, as it is usually denoted in the literature, **Xb** − **y** = **e**. Let’s partially borrow that notation to make it easier to map this rubric to the books, leaving **t** in place, though, because we use **b** for other purposes. We’ll also drop the index 3 in **t** and rename −*h* to *t0*.
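To make the compact form concrete, here is a small sketch in plain Python that builds **X** with its column of ones, computes **e** = **Xt** − **y** for a candidate **t**, and sums the squares. The data set `A` and the candidate `t` are hypothetical values chosen for illustration, and the function names are mine, not taken from any library.

```python
def design_matrix(points):
    """Rows (x1i, x2i, 1): the column of ones carries the intercept t0."""
    return [[x1, x2, 1.0] for x1, x2, _ in points]

def residuals(X, t, y):
    """e = X t - y, one delta per data element."""
    return [sum(x * tj for x, tj in zip(row, t)) - yi for row, yi in zip(X, y)]

def rss(e):
    """Residual sum of squares e^T e."""
    return sum(ei * ei for ei in e)

# Hypothetical data set A of (x1, x2, x3) triples.
A = [(0.0, 0.0, 1.0), (1.0, 0.0, 3.0), (0.0, 1.0, 0.0), (1.0, 1.0, 4.0)]
X = design_matrix(A)
y = [x3 for _, _, x3 in A]          # the dimension k we regress on

t = [2.0, -1.0, 1.0]                # candidate (t1, t2, t0)
e = residuals(X, t, y)
print(e, rss(e))
```

With this candidate the first three points lie exactly on the plane and only the last one contributes to *RSS*.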

Having expressed *RSS* = *e1*·*e1* + … + *ei*·*ei* + … + *en*·*en* = **e**ᵀ·**e** (where **e**ᵀ is the transposed vector **e**), we can do the same with the left-hand side of the equation: (**Xt** − **y**)ᵀ·(**Xt** − **y**) = **e**ᵀ·**e**; and we want to minimize **e**ᵀ·**e**, i.e. (**Xt** − **y**)ᵀ·(**Xt** − **y**). But take note: we started with one problem, mapping our data set from a space with the same dimensions as the original data elements (T(**a**), **ijk**) into a subspace with reduced dimensions (**b**, **uv**), and we have moved to another problem. Using our data as a transformation matrix, we map the coefficients of the transformation of the original problem (or, we can say, the transformations T themselves; yes, a transformation may also be an element of a space), which live in a space with the original number of dimensions (say, T-space), into a space of deltas (or errors; say, E-space) whose number of dimensions equals the size of our original data set (**e**, *1..n*). And we want to find the element of T-space (for the single-output linear regression it will be a vector **t**, while for the multi-output regression we look at later it will be a matrix **T**) that our data transforms into the smallest element of E-space.

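As we usually do in Calculus, to find a minimum point (actually a critical point, which also includes maximum and saddle points) of a graph (a curve, a surface, a generic multi-dimensional data set, or a product of 1-dimensional ones), we take the Gradient of *RSS* with respect to the coefficients and equate each of its components to zero:

$$
\begin{aligned}
\frac{\partial}{\partial t_1}\sum_{i=1}^{n}\left(t_1x_{1i} + \dots + t_jx_{ji} + \dots + t_mx_{mi} + t_0 - y_i\right)^2 &= \frac{\partial RSS}{\partial t_1} = 0 \\
&\;\;\vdots \\
\frac{\partial}{\partial t_j}\sum_{i=1}^{n}\left(t_1x_{1i} + \dots + t_jx_{ji} + \dots + t_mx_{mi} + t_0 - y_i\right)^2 &= \frac{\partial RSS}{\partial t_j} = 0 \\
&\;\;\vdots \\
\frac{\partial}{\partial t_m}\sum_{i=1}^{n}\left(t_1x_{1i} + \dots + t_jx_{ji} + \dots + t_mx_{mi} + t_0 - y_i\right)^2 &= \frac{\partial RSS}{\partial t_m} = 0
\end{aligned}
$$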

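After differentiating:

$$
\begin{aligned}
2\sum_{i=1}^{n} x_{1i}\left(t_1x_{1i} + \dots + t_jx_{ji} + \dots + t_mx_{mi} + t_0 - y_i\right) &= 0 \\
&\;\;\vdots \\
2\sum_{i=1}^{n} x_{ji}\left(t_1x_{1i} + \dots + t_jx_{ji} + \dots + t_mx_{mi} + t_0 - y_i\right) &= 0 \\
&\;\;\vdots \\
2\sum_{i=1}^{n} x_{mi}\left(t_1x_{1i} + \dots + t_jx_{ji} + \dots + t_mx_{mi} + t_0 - y_i\right) &= 0
\end{aligned}
$$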

and converting the equations above into matrix form we get (the multiplication signs are not strictly necessary; they are just a visual convenience for navigating between the transpose and inversion notations and the parentheses):

$$
2\cdot\mathbf{X}^T\cdot(\mathbf{X}\mathbf{t} - \mathbf{y}) = \frac{\partial RSS}{\partial \mathbf{t}} = \mathbf{0}
$$

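In this transformation we map **t**, of basis size *1..j..m* (in our original 3D case **ijk**), to a basis of the same size in the space of partial derivatives, looking for the **t** that is mapped to the null vector (the Null Transform of **t**, or, loosely speaking, the Kernel). It is given by:

$$
\mathbf{X}^T\cdot\mathbf{X}\mathbf{t} = \mathbf{X}^T\mathbf{y}
$$

and after multiplication of both sides by the inverse $(\mathbf{X}^T\cdot\mathbf{X})^{-1}$:

$$
(\mathbf{X}^T\cdot\mathbf{X})^{-1}\cdot\mathbf{X}^T\cdot\mathbf{X}\mathbf{t} = (\mathbf{X}^T\cdot\mathbf{X})^{-1}\cdot\mathbf{X}^T\mathbf{y}
$$

where $(\mathbf{X}^T\cdot\mathbf{X})^{-1}\cdot\mathbf{X}^T\cdot\mathbf{X} = \mathbf{I}$ is the identity matrix, i.e. a diagonal matrix with coefficients *1*, hence **It** = **t**, and then:

$$
\mathbf{t} = (\mathbf{X}^T\cdot\mathbf{X})^{-1}\cdot\mathbf{X}^T\mathbf{y}
$$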
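As a sanity check of the derivation, the sketch below solves **X**ᵀ·**Xt** = **X**ᵀ**y** in plain Python and verifies that the gradient 2·**X**ᵀ·(**Xt** − **y**) vanishes at the solution. A few assumptions: the data points are hypothetical and lie exactly on the plane *x3* = 2·*x1* − *x2* + 1, the helper names (`solve`, `fit`, `gradient`) are mine, and the linear system is solved by Gauss-Jordan elimination rather than by forming the inverse explicitly, which gives the same **t** whenever **X**ᵀ·**X** is invertible.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def mat_mul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def mat_vec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def solve(a, rhs):
    """Solve a z = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(row) + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular: columns of X are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * p for x, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit(X, y):
    """Least-squares coefficients from the normal equations X^T X t = X^T y."""
    Xt = transpose(X)
    return solve(mat_mul(Xt, X), mat_vec(Xt, y))

def gradient(X, t, y):
    """dRSS/dt = 2 X^T (X t - y)."""
    e = [p - yi for p, yi in zip(mat_vec(X, t), y)]
    return [2.0 * g for g in mat_vec(transpose(X), e)]

# Hypothetical data lying exactly on the plane x3 = 2*x1 - x2 + 1.
X = [[0.0, 0.0, 1.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
y = [1.0, 3.0, 0.0, 2.0]

t = fit(X, y)
print(t, gradient(X, t, y))
```

Because the points were generated from a plane, the recovered (*t1*, *t2*, *t0*) should reproduce its coefficients up to rounding, and the gradient at that point should be numerically zero.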

However, we have yet to find out whether our critical point is in fact a minimum, and not a maximum or a saddle point.

As usual, and as is obvious for the 1D mapping (a function of one variable) *y* = f(*x*), we want the second derivative to be positive. The first derivative, usually pictured as the slope of the tangent line at the point where we take it, may be more intuitively envisioned as the “speed” of moving along *y* in the Y space while we move along *x* in the X space; the second derivative is then the “acceleration” of that movement. For example, a plane diving (say, for a zero-gravity simulation) and then climbing back up changes its vertical speed from a negative value to a positive one, passing through *0* at the critical point of the minimum, while its acceleration remains positive. That is what makes the critical point a minimum: however large the plane’s negative speed is, sooner or later (if it doesn’t hit the ground) the positive acceleration will attenuate it to *0*, and then the plane will start climbing. If we get rid of the time variable, the first derivatives along the function graph, d*y*/d*x* (the “speed” of changing *y* with changing *x*), to the left and right of the minimum will have the same sign as d*x*, while the second derivatives d^2*y*/d*x*^2 (the “acceleration” of changing *y* with changing *x*) will be positive. Actually, the latter ensures the former, because positive “acceleration” makes d*y* positive (the further away from the minimum, the greater *y* is), and the sign of the first derivative comes from the direction of d*x*.

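A similar approach works in the multi-dimensional case too, but there the measure of the “acceleration” of f: Rⁿ → R is called the Hessian matrix **H**:

$$
\mathbf{H} =
\begin{pmatrix}
\dfrac{\partial^2 f}{\partial x_1^2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_i} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\
\vdots & & \vdots & & \vdots \\
\dfrac{\partial^2 f}{\partial x_i \partial x_1} & \cdots & \dfrac{\partial^2 f}{\partial x_i^2} & \cdots & \dfrac{\partial^2 f}{\partial x_i \partial x_n} \\
\vdots & & \vdots & & \vdots \\
\dfrac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \dfrac{\partial^2 f}{\partial x_n \partial x_i} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2}
\end{pmatrix}
$$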

We still want to make sure that whatever our (positive, or same-sign) movement d**x** in the domain X is, it will be mapped to a positive movement d**y** in our range space Y after applying the **H** transform (which is actually what brute-force gradient descent methods do). In the literature such a transform (matrix) is called positive definite, but for a better intuitive understanding we would rather start from the idea of eigenvectors and eigenvalues (these things are closely related anyway). An eigenvalue of a transform **M** is a value *lambda* that, for some particular vector(s) **v**, gives the same result as the regular mapping: **Mv** = *lambda* **v**. There can be multiple eigenvalues and eigenvectors for a particular transform. The nice thing about eigenvectors is that if we find enough linearly independent ones (as many as the X space has dimensions, so that they span it and form a basis of the space), we may express any vector in the new eigenbasis: **x** = *x1* **v1** + *x2* **v2** + … + *xn* **vn**. If it happens that all eigenvalues *lambdai* (the coefficients of the diagonal matrix **L**) for that eigenvector basis are positive, the **M** transform will be positive for any non-zero vector of our domain space X. In the case of the **H** transform that would mean that the point with null Gradient/Differential is indeed a minimum.

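$$
\mathbf{H}\cdot\mathbf{x} = x_1\lambda_1\mathbf{v}_1 + x_2\lambda_2\mathbf{v}_2 + \dots + x_n\lambda_n\mathbf{v}_n
$$

or, as a variation on the positive definite matrix definition (with the coordinates taken in an orthonormal eigenbasis):

$$
\mathbf{x}^T\cdot\mathbf{H}\cdot\mathbf{x} = (x_1, x_2, \dots, x_n)\cdot(x_1\lambda_1, x_2\lambda_2, \dots, x_n\lambda_n)^T = x_1^2\lambda_1 + x_2^2\lambda_2 + \dots + x_n^2\lambda_n > 0
$$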

In the general case we may want to find the eigenvectors and eigenvalues (they are very handy for transform analysis anyway, and, because our data set is used as a transform, for data analysis too). However, in special cases, when we only need to figure out what type of critical point we are at, we can estimate whether our **H** matrix is really positive definite or whether it can have negative lambdas.

Let’s calculate the Hessian for our RSS case, which can be expressed in compact matrix form as:

$$
\frac{\partial^2 RSS}{\partial \mathbf{t}\,\partial \mathbf{t}^T} = 2\cdot\mathbf{X}^T\cdot\mathbf{X}
$$

This is a quadratic form: for any d**t** of our domain space, d**t**ᵀ·**X**ᵀ·**X**·d**t** = (**X**d**t**)ᵀ·(**X**d**t**) is a sum of squares, so whatever our *xij* data are, the “acceleration” value in the range space cannot be negative.

The only thing we have to make sure of is that there is no linear dependency in **H** (in other words, that it is not degenerate, i.e. that it is a full-rank matrix); otherwise some of its eigenvalues are zero and we can’t “control” the behaviour of the “acceleration” in some dimensions.
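The difference between the full-rank and the degenerate case can be seen on two tiny, hypothetical data matrices. The sketch below (plain Python, helper names are mine) computes **H** = 2·**X**ᵀ·**X** and the quadratic form d**t**ᵀ·**H**·d**t** along a chosen direction. In `X_deg` the second column is twice the first one, so there is a direction along which the “acceleration” is exactly zero.

```python
def hessian(X):
    """H = 2 X^T X for RSS = (X t - y)^T (X t - y)."""
    cols = list(zip(*X))
    return [[2.0 * sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

def quad_form(H, d):
    """d^T H d: the 'acceleration' of RSS along the direction d."""
    return sum(di * sum(hij * dj for hij, dj in zip(row, d)) for di, row in zip(d, H))

# Hypothetical full-rank data (columns x1, x2 and the ones for the intercept).
X_full = [[0.0, 0.0, 1.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 1.0]]

# Hypothetical degenerate data: the second column is twice the first.
X_deg = [[1.0, 2.0, 1.0], [2.0, 4.0, 1.0], [3.0, 6.0, 1.0]]

d = [2.0, -1.0, 0.0]     # direction that exploits the dependency in X_deg

print(quad_form(hessian(X_full), d))   # strictly positive
print(quad_form(hessian(X_deg), d))    # zero: a flat direction, no unique minimum
```

Along a flat direction *RSS* does not change, so the critical point is no longer a single point but a whole line (or subspace) of equally good coefficient vectors.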

Of course, we have not discovered any Americas: all these derivations can be found in many Statistics and Data Science books, and Linear Regression functions are implemented in many libraries. But we want to experiment a bit further with regression algorithms on real data, so let’s have our own R and/or Python library to play with…

]]>

If we take a look at the general definition of the term *Regression* we will find something like: “transition to a simpler or less perfect state”. *Perfection* is quite a subjective category, and, depending on the context and point of view, the same phenomenon may be viewed, even by the same person, as more or less perfect for one purpose or another. For example, a “simpler” state or model may be viewed as “less perfect” for the purposes of simulation accuracy, or “more perfect” for ease and clarity of understanding. So, let’s rather stick to the clearer and more distinct “simplicity” aspect of the definition.

Applied to Data Science, and especially in the context of space mappings, “simplicity” would, obviously, mean a reduction of the dimensionality and/or of the number and complexity of the *Relations* between the *Space* objects. That, actually, means a *Projection* of our Data from a Super- to a Subspace. We won’t usually know beforehand which Subspace is more suitable for our purposes, but we may have an idea about the possible variants from which we may choose the best one by applying particular criteria. As was already mentioned, any objects may be members of a Space, including other Spaces or Subspaces. Here the concept of Quotient (or Factor) Spaces may be useful. The members of such a Space are its disjoint (not having common elements) Subspaces.

Let’s imagine a Crepe Cake, which, as a whole, is a 3D Space, but which can also be thought of as a 1D Quotient Space of 2D Crepes. Also, let’s imagine we have Blueberries somehow stuffed in between our Crepes. And then we want to associate (via a so-called Equivalence Relation) all these Blueberries with only one Crepe, for example by pushing (*Projecting*) those Blueberries through the other Crepes with our fingers and smashing them into the One Chosen Crepe (or by making sure they are somehow squeezed and moved through holes made by toothpicks). All these Blue Spots on the One Chosen Crepe we may call an Equivalence Class. And we may want to minimize the ruin we have just done to our Cake by choosing the One Crepe that would ensure that, and that will be the condition for our Equivalence Relation.

Of course, other criteria may be chosen, for example the Crepe with the biggest holes in it, or something else. We may also choose to bore holes in the Cake not with straight but with crooked fingers (that won’t be a Linear Transformation), or put the Cake on the edge of the table and let it bend like the Clocks in Salvador Dali’s paintings (those won’t be Linear/Vector Subspaces), and then bore it with straight fingers. We may decide those non-linear Blueberry Transformations and Subspaces are even cooler than the linear ones (for example, the crooked holes in the Cake would make it a Piece of Art), but for the Linear, One Output Parameter Regression from a 3D Space into a 2D Subspace we will stick to the algorithm (Linear Projections from a Vector Superspace into Vector Subspaces) described in the previous paragraph.

Technically, we may use Linear Transformations (but not Projections, which immediately eliminate dimension(s) of the original Vector Space) that vary from one Data element to another; this may actually be a way to linearize non-linear transformations (not your usual Linear Regression). That, however, will call for a slightly different mathematical treatment (adding one more transformation in the target subspace) of the Transformation presented in the next chapter.

…

]]>

When we want to model an unstructured collection of real-world phenomena we use a mathematical abstraction called a *Set*. It can contain not just simple elements (or objects, or members; these terms are interchangeable) such as numbers, or more complex mathematical abstractions (for example, Sets themselves): the members of a Set can be really any possible or imaginable objects. If we want to introduce (and we usually do) a *Structure* over these objects, we use a mathematical abstraction called a *Space*. A *Space* is a *Set* with *Relations* (or, as a special case, *Mappings*, also called *Functions* or *Transformations*, which are interchangeable terms too, or, as an even more special case, *Operators*) defined over its members. *Functions* are *Relations* that define a correspondence of each member of the *Domain* Space (the Space we are mapping from) to exactly one member of the *Range* Space (the Space we are mapping to). Of course, there may be multiple Domain members that map to the same Range member, but no Domain member is split between several Range members. *Operators* are mappings into the same Space.
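These distinctions are easy to play with on finite sets. In the sketch below, which is only an illustration with made-up names and a made-up three-element set `S`, a Relation is a Python set of (domain member, range member) pairs, a Function is a Relation that pairs every Domain member with exactly one Range member, and an Operator is a Function whose values stay inside the same set.

```python
def is_function(relation, domain):
    """True if every member of `domain` is paired with exactly one value."""
    images = {}
    for x, y in relation:
        images.setdefault(x, set()).add(y)
    return all(len(images.get(x, ())) == 1 for x in domain)

def is_operator(relation, space):
    """True if `relation` is a function on `space` with values inside `space`."""
    return is_function(relation, space) and all(y in space for _, y in relation)

S = {0, 1, 2}

succ_mod3 = {(x, (x + 1) % 3) for x in S}            # maps S into S
less_than = {(x, y) for x in S for y in S if x < y}  # 0 is related to both 1 and 2
squash = {(x, 0) for x in S}                         # many-to-one, still a function
shift10 = {(x, x + 10) for x in S}                   # a function, but leaves S

print(is_function(succ_mod3, S), is_operator(succ_mod3, S))
print(is_function(less_than, S))
print(is_function(squash, S))
print(is_function(shift10, S), is_operator(shift10, S))
```

The `less_than` Relation fails the Function test twice over: one member is paired with two values, and another is paired with none.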

For example, to define a Linear (Vector) Space we have to declare what element(s) will be Identity element(s), and we have to define Operators of the members’ Addition and scalar Multiplication in such a way that their result will be still an element of the Space (i.e. such Mappings are, indeed, Operators), and those Operators will be Associative, Commutative and Distributive, and we have to declare that every element will have its Inverse and their Sum will produce Identity (Additive Identity will be Null) element. Again, members of such a Space may be not only numbers or their lists, but also any phenomena, or even their relationships. We just have to define Operators on them as described above, and then we can apply all the Linear Vector Space analytical apparatus to our newly created Space.

For Metric Spaces we have to define a Distance Function (which is, strictly speaking, a Relation, therefore these are more generic Spaces) that gives us the distance between any two selected elements of the Space. We are free to define it however we want, and are not bound to the Euclidean distance calculation only. For the most generic Topological Spaces we define Topologies: those Sets that basically tell us whether elements of the Space are in a Relation of being neighbours or not, and where the boundaries between neighbourhoods lie.
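The freedom in choosing a Distance Function is not unlimited: the usual metric axioms (non-negativity, zero distance only between identical elements, symmetry, and the triangle inequality) still have to hold. The sketch below checks them on a finite sample of points, which can expose a violation but cannot prove the axioms in general. The function names and the sample points are assumptions made for illustration.

```python
from itertools import product

def is_metric_on(points, dist, tol=1e-12):
    """Check the metric axioms for `dist` on a finite sample of points."""
    for p, q in product(points, repeat=2):
        d = dist(p, q)
        if d < 0 or (d == 0) != (p == q) or abs(d - dist(q, p)) > tol:
            return False
    # Triangle inequality on every triple of sample points.
    return all(dist(p, r) <= dist(p, q) + dist(q, r) + tol
               for p, q, r in product(points, repeat=3))

def manhattan(p, q):
    """City-block distance: a valid metric that is not Euclidean."""
    return sum(abs(a - b) for a, b in zip(p, q))

def squared_euclidean(p, q):
    """Squared Euclidean distance: breaks the triangle inequality."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

pts = [(0, 0), (1, 0), (2, 0), (0, 3)]

print(is_metric_on(pts, manhattan))
print(is_metric_on(pts, squared_euclidean))
```

For the squared Euclidean case the three collinear points are enough: the direct "distance" from (0, 0) to (2, 0) is 4, which exceeds the sum 1 + 1 through (1, 0).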

Representing our Data as Spaces with Structure definitions over them is, obviously, useful for finding Structural relationships between the Data elements, and is sufficient for the Unsupervised Learning methods of Data Science. In addition, by defining Mappings or Relations between Spaces, we can ask (and answer) such questions as: “Can two Spaces be mapped to each other?”, “Is one of them a Subspace of the other?”, “Is that mapping continuous (isomorphic/homeomorphic)?” In terms of Data Analysis those questions and answers will tell us whether our Data sets have the same or similar Structures, allowing us to recognize Patterns and mine Data.

Those *Relations*, *Functions*, or *Operators* define the *Structure* to which members of the *Space* can be subjected, or which is “visible” in the *Space*. Our Real World Data could have a much more sophisticated “Real Structure”, but, when modelling the Real World Data in a particular modelling *Space*, we will be able to see no more *Structure* than we defined in the model. Or we may see even less *Structure* in the Data, if our expected model *Structure* is not present in the Real World Data. For example, a Decision Tree (which is such a Relation) formulated to pinpoint fraudulent credit card use will not make visible the spending habits of the authentic owner (for which we would need another kind of Decision Tree). Or, a Linear Vector Space will make visible to us only the linear Structure of the relations between Data elements. Or, which is usually the case in Topological Data Analysis, if we Metrize a generic Topological Space, we will lose non-metrizable relations.

Because with Statistical Modelling we cannot (or do not bother to) get an insight into the causes and driving forces of our Data, and we treat them like movements inside a “black box”, we are also in the dark (of that box) about whether all the aspects (parameters or variables) of the objects we study are indeed, as we may think, their defining parameters, and not incomplete or overlapping combinations of the “real” (independent) parameters. Because of that we are bound to see those parameters as random and dependent on each other (welcome to the real-world, or “nasty”, or “dirty” Data). That is really not the Data’s problem but ours: a problem of failed assumptions, expectations, or, in a way, ego.

If the aspects (variables) of the Data and the Structure Operators of the initial model do not give us much meaningful information, we may want to map the Data isomorphically, or at least partially, homomorphically, onto another Space with topologies that are more relevant and interesting for us, with different bases and different (reduced or introduced) dimensions. That may make visible those Structures we are interested in, or maybe surprised to see, eliminate or reduce variable dependencies, or even reduce the very “randomness” of the variables.

But enough wordy theorizing, let us see how Linear Regression, the workhorse of Statistics and Data Science, looks from the Space Mapping point of view…

]]>

So, let us start with the simple basic questions: *What are the Object and Method of the Data Science studies?* In my humble opinion (which, of course, may be naive, erroneous, or trivial, as may any other statement in the following text; I am not going to repeat this caveat for every one of them, but always imply it), what we are looking for in the Data is their *Structure*, a word which by itself, though, tells or explains nothing. Let us look at it in the context of the linguistic and cultural Structuralism of the XX century (OK, OK, it is not fashionable anymore, because we live in the age of Post-Structuralism, or even Trans-Structuralism, but that does not change the basics).

Structuralists usually define *Structure* as a mesh of *Opposition* relationships between the objects of the domain of interest. This definition still leaves a lot of room for interpretation, and I prefer to look at *Opposition* not as something adversarial but, rather, as a state of two peers being in some kind of relationship, which may or may not be *Divisive*. For example, we may be interested in finding out whether the objects in a given pair are neighbours or not. Those relations (or, we could say, *Relations* in the Algebraic sense, i.e. if we have sets A and B, then a subset of the product AxB, defined by some criteria, is a *Relation*) really are the fundamental *Object* of the Data Science studies. That is pretty obvious for the unsupervised learning and clustering methods, but it also holds for the other, supervised ones.

Even if we take a look at Descriptive Statistics, we will see that those numbers, functions or diagrams let us peek at the various aspects of the Data *Structure* in a compact, integral form, without drowning in the excessive particulars and mass of the Data.

Now, having an idea of what we want to study, we may start thinking about the *Method* with which we may do it. Definitely, it will be a branch of *Mathematical Modelling*, but not the one we usually use in the “Hard”, Natural Sciences. In the Natural Sciences we also strive to uncover a *Structure*, but it is the *Structure* of the causes and driving forces of the data being observed. In the general case we end up with a system of (partial) differential equations that we usually cannot solve *Analytically*. Then we either linearize, or simplify, or modify our models to reduce them to a form that has a known analytical solution that is (relatively) easy to comprehend and works fine in a wide range of initial conditions. Or, if such an approach is not possible or acceptable, we resort to the data crunching of *Numerical Methods*, which are, basically, the same linearizations, simplifications and modifications, but applied repeatedly on a small temporal or spatial scale, which is easily machine-automated. However, obtaining a solution this way has a cost: limited convergence intervals, and, if the initial conditions are changed significantly, all bets are off as to whether such a solution will work, not only with the required accuracy but even at the level of general tendencies. Sounds familiar to Data Scientists, huh?

“Soft” Science generally despises such an approach of mathematically modelling the causes and driving forces of the data it deals with (supposedly because it deals with much more complicated matters, and such deterministic analysis is practically useless; yeah, yeah, “an invisible hand of the Free Market” will sort everything out instead). What it usually looks for is the *Structure* of the Data itself, or its appearance. The useful branch of *Mathematical Modelling* in such a case is *Statistical Modelling*. Similarly to the world of Natural Science, we may be interested either in the more simplified, but universalistic and insight-giving *Statistical Inference*, or in the more result-oriented, but convergence-limited *Predictive Modelling*. *Machine Learning* is closely associated and largely overlapping with the latter, because its resulting models are hard to interpret in the analytical sense.

*Data Mining* and *Pattern Recognition* are also associated with each other, and do exactly what the latter says: search for a Pattern, or a *Structure* of the Data. However, the latter usually looks for Patterns by example, while the former looks for something new; and the more unexpected that *Structure* is, the better. They reside in the middle between *Inference* and *Machine Learning* because, on the one hand, we still want to have some analytical insight, and on the other, we may greatly benefit from the power of “number crunching”. Of course, if the aim of the *Pattern Recognition* is purely utilitarian (to arrest a particular government protester, or kill a particular jihadist from a drone), then that brings it closer to *Predictive Modelling*.

Again, with the “Soft” Sciences shying away from mathematical methods, the niche of the theoretical branches of the “Hard” Sciences in the “Soft” Science realm was taken over by the (semi)autonomous Data Science. Of course, in the real world the partition of the *Mathematical Modelling* branches described above is not strict: “Hard” Science uses a lot of Statistical methods, though they play the more servile role of initial empirical data processing before the real theorizing begins (or of verification of theories against the empirical reality), while there is some place for *Analytical Mathematical Models* even in “Soft” Science.

Nevertheless, what is the fundamental *Method* that lies at the foundation of all the methods of Data Science mentioned above (as well as those not mentioned)? The *Method* that might not be frequently reflected upon in real day-to-day practice?

…

]]>

Of course, the name of the blog is pretentious and plagiarized. It may look overconfident; nevertheless, I have quite humble reasons for naming it that way. Having taken Modern Algebra and Statistical Methods classes together in my graduation year, I found myself confused by the mix of two approaches: one that is deep, fundamental, and universalist, and another that is utilitarian, mechanistic, and, honestly, appears to have not much Science behind it.

However, I felt the same difference between the two ways the Linear Algebra classes were taught. The seemingly mechanistic, close-to-the-ground Linear Algebra I turned, in its Linear Algebra II incarnation, into a much deeper and more thoughtful discipline. It was a much tougher class, but, in a way, a more “mind-calming” one: it turns out there was a meaning, a reason for all those matrix manipulations you merely memorized in the first class. The second course was taught from Sheldon Axler’s textbook *Linear Algebra Done Right*. Apparently, that title is the inspiration for the blog rubric.

Similarly, in the following posts I hope to find out and explain to myself the deep meaning of the confusing Data Science buzz-terminology: “Data Mining”, “Machine Learning”, “Artificial Intelligence”, “Deep Learning”, “Big Data”, etc. Unlike the more rigorously defined terms such as “Statistical Inference”, “Predictive Modelling”, “Reinforcement Learning”, and “Pattern Recognition”, the former vocabulary is fuzzily defined, redundant, and confusing even for seasoned Data Scientists.

Of course, I do not envision the blog being in any way comprehensive or exhaustive (I simply do not have the qualifications for that), but rather spotty, fragmented, touching the most “unsettling” topics (for me), and, maybe, homing in on some “calming” answers. Because this is mainly a self-directed stream-of-consciousness dump, the writing is left in the scratch-book style, hardly proofread, and, therefore, I beg the pardon of the occasional visiting readers.

I am not going to properly format citations, and I will overuse, or even abuse, capitalization and Italic fonts, which is, of course, not proper *Scientific Writing* :), especially if the use of this emphasis formatting is not consistent, which it will be: the usage context will drive the choice of that abuse. Mathematical proofs, whenever used, are not rigorous but, rather, illustrations of proofs, intended to make them more understandable and intuitively clear. When some code is involved, I’m not going to torture readers by inlining it, but will make it freely available one way or another.

Having got some clarity about what this rubric is about and about its form, let us head to the most fundamental question about Data Science: “What is it, really?”

UPDT: Actually, in the course of writing I realized I can dilute my plain plagiarism with a bit of originality and make these talks Kitchen Talks. Like the talks of people who are not rushing anywhere, relaxing in the kitchen with a cup of tea or coffee, with a piece of cake, or pastry, or other gourmet food, on which we are going to do our thought (or maybe quite physical) experiments.

…

]]>

Descartes thinks he has accumulated a lot of questionable and dubious ideas through his life. He concludes that all those ideas come from the senses, which are not trustworthy. It is easy to doubt the credibility of our senses when they work on the edge of their sensitivity, for example, in the recognition of small or faraway objects. It is much harder to doubt a bigger chunk of our senses, especially if they work within their confidence interval, because their correctness can be proved by experience. If we insist that some of our senses are wrong, we risk being considered mad. However, it is much easier to reject our senses as a whole. For example, in a dream we perceive all the weird events of the dream as normal and real, and we can recognize the strangeness of the dream only outside of the dream “reality”, when we wake up. That is why Descartes decides to question the reliability of the whole world of senses.

If we accept the idea that our real world, given to us through the senses, is just another dream of a higher rank, we may want to find criteria that would allow us to detect the imperfections of dreams (especially those of a lower rank) compared to the really real world. The importance of these criteria is stressed, for example, in the motion picture Inception. For a moment Descartes follows this path, suggesting that human fantasy is impaired by its scantiness: the fantastical creatures that humans make up are just combinations of parts of real animals, and images in dreams are like bleak paintings of real things. Descartes implies that if we dream a dream impressed on us by some “Architect” (in the terms of Inception), we can use “simple and universal” invariants of the really real world (like mathematical concepts) as landmarks for detecting a dream. But Descartes quickly withdraws from this path, suggesting that the “Architect” may be an omnipotent God capable of creating a deceptive dream for us that is as perfect as the really real world (i.e. a “totem” from Inception would not work).

Descartes meets the possible counter-arguments, that God could be non-omnipotent or could not possibly be deceptive because deception is a manifestation of imperfection, by saying that he has no answers to these objections. Descartes started doubting the senses because he wanted to leave only certain and distinct ideas in his understanding of the world; therefore he is willing to build it for the worst-case scenario, in which God’s task is to deceive Descartes with all his might. If Descartes is still able to infer anything about the world even under conditions of omnipotent deception, those inferences will be quite certain and unshakable.

Descartes is able to identify at least one thing about which an omnipotent evil deceiver cannot possibly trick him. The deceiver cannot make Descartes believe that he does not exist; therefore the statement “I am, I exist” cannot be taken away from him. Descartes finds other certain qualities of his “I”, which is a “thinking thing” and which exists only while he thinks. His “I” also has senses, which does not mean that these senses are somehow real; strictly speaking, sensing is a type of thinking. The “I” can have mental images, which are modifications of thought as well.

Another classification of thought is aimed at identifying which type of thought is prone to error. Descartes divides thoughts into three categories. Two of them, simple ideas (even unreal and imaginary ones) and emotions, cannot be judged as true or false. We can make an error only in thoughts of the third category, named judgments.

Descartes tries to analyze how the human mind works. He takes the example of wax. Even though wax, in conditions usual for humans, may appear to our senses in different states of matter, we perceive it as the same substance. He concludes that our mind does not comprehend things through the senses, which give us information only about the appearance of substances; rather, the mind understands substances by their essence. This is similar to Plato’s Forms. The conclusion is encouraging because, to proceed further, Descartes has to take on the question of God’s existence. Because God cannot be comprehended by the senses, one must think about him in terms of the world of Forms.

Descartes attempts to find out which ideas could have originated from him and which could not. Based on the idea that the cause should be greater than the effect, he concludes that the idea of an omnipotent, omniscient and perfect God can have originated only from the omnipotent, omniscient and perfect God himself. This proof effectively repeats Aquinas’s proofs of God’s existence.

However, there are contradictions in Descartes’ syllogisms. He says that the idea of physical objects, which are extended and not intelligent things, was created by him in opposition to the idea of himself as a thinking and non-extended thing. By the same logic, the idea of an omnipotent and perfect God could have appeared in Descartes’ mind in opposition to his non-omnipotent and imperfect self. On the other hand, using the concept of a cause greater than its effect, Descartes could have said not only that it is impossible for the idea of God to originate from himself, but also that the idea of physical objects needs a greater cause in the form of objective reality. In both cases, whether in the personal dream world of Descartes whose ideas come only from himself, or in an objective world created by a perfect God, he could trust his senses, because they are either only his and not impressed by a foreign “Architect”, or are caused by objective reality.

Perhaps sensing a deficiency in his arguments, Descartes makes a second attempt to justify the existence of God by asking whether he could have been created by somebody other than God. His answer is “No”, based on the same medieval reasoning that the cause must be greater than the effect, but he introduces a new twist to this reasoning. Descartes says his continued existence is of the same nature as his creation. Thus his world, instead of being created once (a long, long time ago, and maybe left by God on its own), becomes a dream of God that requires God’s constant attention in actively redreaming it in its whole fullness at every moment.

Having proven the existence of the perfect God, Descartes proceeds to analyze the question of where errors come from. He says that human free will is as great as God’s. Following Augustine’s thinking, he says his ability to understand is qualitatively as good as God’s, but quantitatively less. When the free will of judgment exceeds the scope of understanding, an error appears. By his own standards, Descartes’ will to question his senses and his desire to prove God’s existence exceeded the capacities of his contradictory understanding, and made his subsequent conclusions about the credibility of the senses shaky.

]]>

for Kindles on Amazon

for iPad/iPhone/iPod on iTunes

for Android devices on GooglePlay

for Nook on Barnes&Noble

In this series of short essays the author surveys the works of anthropologists, archeologists, cultural sociologists, historians and gastronomists who argue that cooking played a significant role in human evolution and history, and that sometimes even whole Empires “were built not by the sword but by the spoon”. Ancient and modern cuisines bear a deep imprint of the civilizations in which they appeared and developed. One can get real insight into cultures, often long gone, by trying recipes created then and there. This maxim has not ceased to be true. Contemporary cultures, too, can be judged by their cuisines. The verdict issued by this criterion, when applied to the American diet, may appear shocking: American culture is a childish, adolescent one. As a special treat, an essay about everyday life in the Soviet Union, People Waiting in Line, is included.

]]>

In his dialogue, “Apology”, Plato describes how Socrates responds to the charges against him. Socrates’ speech is an excellent example of rhetoric, rich with arguments, stories, analogies, questions, answers, and conclusions. He defends himself in the way he thinks is right and necessary. He asserts that he says only the truth; sometimes, however, this seems questionable. By analyzing the “Apology” and keeping in mind the historical context of Socrates’ time, I will show that it cannot be said with certainty whether Socrates deserves the death sentence or not.

When Socrates was brought before the Athenian court, he faced two types of accusations. The first type was based on rumors that had been circulating in Athens for a long time. We do not know exactly, but judging from Socrates’ words, he was accused of practicing natural philosophy or natural science, of making the worst theories appear like the best ones, and of teaching those concepts to others (Apology 18b).

Socrates does a bad job of defending himself. He bases his defense on a strategy of plain denial. He says that he knows nothing about natural philosophy (Apology 19c). He implies that he could have neither the worst nor the best ideas because he does not have ideas of his own at all; he just examines the ideas of others (Apology 21d-22e). He asserts that he does not have pupils because he does not charge anyone a fee, and that the young people who mimic him do so on their own initiative (Apology 19e-20c, 23c).

Were his claims true, and how helpful are they in defending him? First, because he did practice the natural sciences early in his life, and the Athenian people remembered this fact, his statement that he knows nothing about natural philosophy sounds false.

On the second point, Socrates claims that he does not have positive ideas of his own. However, in the second part of the “Apology”, he presents an abundance of positive concepts, such as that a man has to choose not between life and death, but between right and wrong (Apology 28b-d). He himself also confesses that his ideas are very important, so he must teach Athenians how to live a moral, pious, fair life (Apology 30b-c). The latter example also contradicts his assertion that he is not a teacher.

His argument that he does not have students because he does not take a fee sounds illogical, since he could teach without payment. His disavowal of the people who follow his methods looks not quite ethical, even like a betrayal.

Socrates’ arguments in his defense are not only weak but also arrogant and provocative. When he tells his story about the Oracle, he causes a public disturbance at least twice (Apology 20e, 21a). That is not a smart defensive tactic.

If the charges mentioned above played a role in his conviction, then to some degree Socrates deserves the verdict. But we cannot expect that a man as wise as Socrates would act so absurdly. Plato gives us a hint as to why Socrates chooses such a strange way of composing his speech. In Plato’s dialogue “Euthyphro”, Socrates tells his friend that he realizes the case against him is not an isolated incident, and that Meletus plans to launch a widespread campaign in Athens of prosecuting philosophers and other people who think like Socrates (Euthyphro 3a). In Crito, Socrates does not follow Crito’s advice, in part because he does not want to put his friends in danger (Crito 45a). It is quite logical to suppose that Socrates deliberately cut his ties with his pupils and friends in his speech to save them from prosecution by association.

The second type of accusation was made by Meletus: that Socrates does evil by corrupting the youth, and that he does not believe in the Athenian gods but in new ones (Apology 24b). Socrates builds his defense against the first part of Meletus’ charges on logical reasoning, such as: one naturally does not want to be hurt, but if one corrupts one’s neighbor with evil intention, one will eventually be hurt by the corrupted neighbor, which cannot be one’s desire (Apology 25c-e). He also offers the horse-breeding analogy to show that if morals get corrupted, it is not through the conspiracy of the few but through the influence of the many (Apology 24e-25b). Socrates shows magnificent oratorical skill (Apology 24d-26a), although at the beginning of the “Apology” he denies that he is an accomplished orator (Apology 17a-b).

In his defense against the second part of Meletus’ accusation, Socrates traps Meletus by provoking him to hastily change his charge, that Socrates believes in new gods, into the statement that Socrates is an atheist (Apology 26b-c). Socrates argues that all his deeds are committed in the name and by the will of god. However, he evades a direct answer as to what divine source he believes in (Apology 27c-28a). Thus, Socrates is trying to protect the right of philosophers to hold free views on religious issues. According to him, philosophers had great social significance in the life of Athens. In contrast to politicians, who used persuasion as a tool to pursue their self-interest, philosophers care about rational social organization and bring benefits to ordinary citizens.

At the end of the “Apology”, when the verdict has been announced, Socrates plays the fool, saying that he deserves a dinner in the Prytaneum, or that he could pay one mina of silver as his penalty (Apology 36d-37a, 38b). Again, it seems that he is trying to provoke an emotional response from the jury and a worse sentence. This shows, without a doubt, that he cares not for himself but for other philosophers, his pupils, friends and the citizens of Athens, so for their sake he needs to achieve either total acquittal or a cruel punishment that will make his accusers feel guilty afterwards (Apology 38c). If the jury decided on a minor penalty, the citizens of Athens would not feel remorse and would continue to persecute philosophers in the future. Socrates got what he aimed for. He got a death sentence, and the Athenians regretted their decision; they stopped persecuting other philosophers and artisans.

So it can be said that in one sense Socrates deserves the death sentence, but in another he does not. From the point of view of a modern person, Socrates’ death sentence was totally unjust, because he did not rob or kill anyone. He just questioned people and had free beliefs and his own opinion, which was different from that of others. From the point of view of the Athenian citizens of that period, he might deserve death for not believing in the Athenian gods, which was considered a major crime and a betrayal of the social order, especially since Socrates defended himself badly or even not at all. However, his death played a significant role in the history of Athens. The Athenians regretted what they did. There was no case of persecution of a philosopher after that. In the sense of the Latin etymology of the word ‘deserve,’ which means ‘to devote oneself to,’ he devoted his life to the well-being of the Athenians.

]]>

From my own experience I know very well what it is like. In the times of the Soviet Union it was normal to stand in line for many different things: food, clothes, footwear, any domestic goods, furniture, cars or even apartments (though those were more like virtual lines). It’s amazing! People (mostly women) spent almost 1/3 of their day shopping, in spite of work, studies and family.

The longest lines I remember were those for bananas or oranges from Africa, black tea from India, footwear from Italy or Germany, parkas from Sweden, and so on… The wait was rarely less than an hour. Usually it took a couple of hours, three, or even more. One day I spent 8 hours standing in line to buy an imported fur coat for my daughter. But don’t think I spent all that time in the queue. Not at all! The main thing is to remember very accurately a few of the people standing around you in the line, because almost everybody left the line and then returned to their spot. You would just tell them that you would be back and ask them insistently to remember your face, your clothing, or something else notable. Often, when a line was long, folks wrote a number on their palms for accuracy. You might go home, or return to work, or take a seat in the nearest park with a newspaper or a book to read. It was a good idea to queue in a few more lines at the same time, so you could get many scarce goods (“deficits”) at once.

Some extraverted people love to make acquaintances in lines, and somebody may even meet her or his future spouse there. People tried to relax in queues; otherwise it would be a very boring and annoying pastime. You would probably become very tired, particularly your legs, back, or even neck, especially if you were in high heels. Try not to stand in line in high heels! It’s terrible and dreadful!

But let’s leave the Soviet lines alone. We can observe some others. For example, you go traveling. The first queues you will see are those at the Airport for check-in or security. When you arrive at your destination, you will probably need to stand in a Passport Control line. The next day, when you go to see the great attractions of the city you are visiting, be prepared to line up again. It may take a couple of hours for The Louvre, an hour for The British Museum, and three more for Lenin’s Mausoleum. That’s OK if you have a lot of time and are looking forward to seeing the attraction.

And last but not least, some examples I would like to share: the gas lines when gas suddenly rose in price here in Georgia a few years ago, and the lines of cars at the KSU parking lots, especially on Mondays.

Modern life can’t exist without queues. It’s impossible!

]]>

Let’s look at what cheap American food consists of. Common ingredients of junk food are corn syrup, white flour, soy by-products, hydrogenated oils, and artificial colors and flavors. These products are high in trans fat, free radicals and acrylamide (a potent cancer-causing chemical). Michael Pollan, the American author, journalist and professor of Journalism at Berkeley, said in his interview with Bill Moyers, the public television presenter, “five crops we subsidize are corn, wheat, soy, rice, and cotton… And that our farm policy for many years has been designed to increase production of those crops [the junk food is made of] and keep the prices low”. It is common knowledge, and scientifically proven, that such a diet leads to food-related diseases such as type 2 diabetes, obesity, heart disease, and some cancers. Pollan mentions the following data:

*…this generation just being born now is expected to have a shorter lifespan than their parents, one in three Americans born in the year 2000, according to the Centers for Disease Control, will have type 2 diabetes, which is a really serious sentence* (Pollan, Interview).

The adoption of the governmental Agricultural policies that led to this explosion of chronic diseases did not come as a response to a natural disaster or an outbreak of hunger. Somebody benefits from promoting this diet: large Agricultural business, whose goal is to reap big profits from cheap produce. Governmental policies are taken hostage by Agricultural conglomerates. Michael Pollan gives an example of how it is done:

*…the World Health Organization recommends that no more than 10 percent of daily calories come from added sugar, a benchmark that the U.S. sugar lobby has worked furiously to dismantle. In 2004 it enlisted the Bush State Department in a campaign to get the recommendation changed and has threatened to lobby Congress to cut WHO funding unless the organization recants* (Pollan, *In Defense* 25).

In addition to the negative impact on public health, those monsters of Agriculture pollute the environment with fertilizers and pesticides, cause climate instability by producing greenhouse gases, and consume fossil fuel excessively (Pollan, Interview). The oligarchs of Agriculture play on naïve and greedy human nature to feed people junk food:

*…notice that the stark message to “eat less”… had been deep-sixed [i.e. thrown overboard]; don’t look for it ever again in any official U.S. government dietary pronouncement. …you are not allowed officially to tell people to eat less of it [a particular food] or the industry in question will have you for lunch. …it was easy for the take-home message of the 1977 and 1982 dietary guidelines to be simplified as follows: Eat more low-fat foods. And that is precisely what we did* (Pollan, *In Defense* 24, 51).

Why should common people be kept on the rich man’s leash? Is it possible for them to use common sense and make a domestic revolution in the name of their own health and the health of the next generation? “Eat food. Not too much. Mostly plants.” One can read this on the front cover of Michael Pollan’s bestseller *In Defense of Food: An Eater’s Manifesto*. The author believes that even a common man can make a difference by gardening, cooking, and buying local and organic food (Pollan, Interview).

Nowadays, the local food movement is becoming more and more popular. What does it mean to eat locally?

According to the definition adopted by the U.S. Congress in the 2008 Food, Conservation, and Energy Act (2008 Farm Act), the total distance that a product can be transported and still be considered a ‘locally or regionally produced agricultural food product’ is less than 400 miles from its origin, or within the State in which it is produced (United States, Department of Agriculture iii).

However, the definition is not exact and varies depending on the region, farmers and consumers (United States, Department of Agriculture iii). The USDA documentation discusses the advantages of local food systems that are supported by empirical evidence. They “include economic development impacts, health and nutrition benefits, impacts on food security, and effects on energy use and greenhouse gas emissions” (United States, Department of Agriculture 42). For example, people benefit from eating local produce because it is fresher, more nutritious, and less processed; growing crops on community lands increases food availability; and reducing the distance of food delivery saves fossil fuel.

The best way to grow local food is to grow it organically. Here in the US it is not easy. As soils are not fertile enough and the climate is friendly to pests and diseases in many states, farmers have to use fertilizers and pesticides. My family used to have a garden in Russia. We continue to cultivate organic produce here. Because we cannot find enough information about garden plants that grow well in Georgia, we experiment by planting different varieties of fruit trees and bushes, berries, and vegetables. The best practice is to plant those species that grow in this area without special effort. Fig trees, blueberries, raspberries, kiwis, pears, tomatoes, cucumbers, Jerusalem artichokes, mustards, peas, radishes, asparagus, herbs: all of that we grow organically on our modest piece of land and enjoy all year round.

However, “local” is not a panacea. Different crops behave differently depending on the climate zone. Pamela Cuthbert, the editor of a journal for Slow Food Canada, in her article “Local Food Is Not Always the Best Choice” gives a telling example of why in some cases non-local produce is preferable to local. Growing apples in the wet Ontario climate is challenging because of numerous pests and fungi, while in dry regions farmers do not have to spray apples so intensively because pests do not tolerate arid conditions (Cuthbert 26-27). Thus buying organic produce brought from farther away may be better for one’s health than buying local produce that is not well adapted to the local climate and soil (Cuthbert 26-27). Trying to cultivate crops that are not sustainable in the local climate, farmers have to use the industrial scheme of agriculture, with its harmful impacts, such as produce grown with “environment destroying fertilizers and sprayed with pesticides” (Cuthbert 25). In conclusion, the author quotes “Lori Stahlbrand, head of the organization Local Food Plus”, who defines local as a complex of such characteristics as “sustainability, animal welfare, labor practices, biodiversity and energy use” (qtd. in Cuthbert 28). Stahlbrand “pairs the words ‘local and sustainable’ as essential co-factors” (Cuthbert 28). If we compare this definition of local food with the definition given by the United States Department of Agriculture (USDA), we can see that the latter appears formal and incorrect, while the one described by Stahlbrand is better thought through. Defining local produce only in terms of mileage does not guarantee the benefits that the local food movement stands for, and leaves a back door open for questionable practices of big agricultural business.

Take up gardening and you will have fresh, healthy produce on your table, exercise, and intimate contact with nature. If you cannot do this for any reason, buy more organic and local products in your supermarket or farmers’ market; it will support the development of local food producers and undermine the production of junk food.

One more way to follow a healthy diet is to cook your own meals. Pollan encourages: “Cook. Simply by starting to cook again, you declare your independence from the culture of fast food” (Pollan, Interview). Indeed, for cooking you will need whole produce, better oils, and less salt and sugar. You can be creative and add any desired ingredient to your recipe. When you eat a donut in a rush, you don’t pay much attention to the list of ingredients. However, when you make your own cookies, you know what you put in the dough: butter, eggs, flour, sugar, and not that long, long list of chemicals.

If we look back at our ancestors, we see that they ate simple, local, whole food, and they cooked. We have evolutionarily adapted to such meals over tens or hundreds of thousands of years. Our bodies and genes don’t know what to do with the new ‘Western’ diet. Look back and gain wisdom from your great-grandparents! Think about the next generation. It is all in your hands!

Works Cited

Cuthbert, Pamela. “Local Food Is Not Always the Best Choice.” *The Local Food Movement*. Ed. Amy Francis. Farmington Hills: Greenhaven Press, 2010. 24-30. Print.

Pollan, Michael. *In Defense of Food: An Eater’s Manifesto.* New York: Penguin Books Ltd., 2008. Print.

Pollan, Michael. Interview. Bill Moyers Journal. PBS, 28 Nov. 2008. Web. 15 Mar. 2012.

United States. Department of Agriculture. *Local Food Systems: Concepts, Impacts, and Issues*. Washington: May 2010.

]]>