What Is Data Science, Really?
So, let us start with the simple, basic questions: what are the Object and the Method of Data Science? In my humble opinion (which, of course, may be naive, erroneous, or trivial, like any other statement in the following text; I am not going to repeat this caveat for every one of them, but I always imply it), what we are looking for in the Data is its Structure, a word which, by itself, tells or explains nothing. Let us look at it in the context of the linguistic and cultural Structuralism of the twentieth century (OK, OK, it is not fashionable anymore, because we live in the age of Post-Structuralism, or even Trans-Structuralism, but that does not change the basics).
Structuralists usually define Structure as a mesh of Opposition relationships between the objects of the domain of interest. This definition still leaves a lot of room for interpretation, and I prefer to look at Opposition not as something adversarial but, rather, as a state of two peers being in some kind of relationship, which may or may not be Divisive. For example, we may be interested in finding out whether the two objects in a given pair are neighbors or not. Those relations (or, we could say, Relations in the algebraic sense: if we have sets A and B, then a subset of the Cartesian product A × B, selected by some criterion, is a Relation) are really the fundamental Object of Data Science. That is pretty obvious for unsupervised learning and clustering methods, but it also holds for the supervised ones.
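To make the algebraic reading concrete, here is a minimal sketch of a "neighbor" Relation built as a subset of the product of a set with itself. The toy points, the name neighbor_relation, and the distance threshold radius are all my own illustrative assumptions, not anything prescribed by the theory.

```python
from itertools import product

# Hypothetical toy data: four labelled points on a line.
points = {"a": 0.0, "b": 0.4, "c": 3.0, "d": 3.3}

def neighbor_relation(items, radius):
    """Return the pairs (p, q) of distinct labels whose values lie within `radius`.

    The result is a subset of items x items, i.e. a Relation selected
    by the criterion |value(p) - value(q)| <= radius.
    """
    return {
        (p, q)
        for p, q in product(items, repeat=2)  # every ordered pair of labels
        if p != q and abs(items[p] - items[q]) <= radius
    }

relation = neighbor_relation(points, radius=0.5)
print(sorted(relation))
```

With this threshold the pairs fall into two groups, which is exactly the kind of Structure a clustering method would be expected to pick up.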
Even if we take a look at Descriptive Statistics, we will see that those numbers, functions, or diagrams let us peek at various aspects of the Data Structure in a compact, integral form, without drowning in the excessive particulars and the sheer mass of the Data.
Now that we have an idea of what we want to study, we may start thinking about the Method with which to study it. Definitely, it will be a branch of Mathematical Modelling, but not the one we usually use in the “Hard”, Natural Sciences. In the Natural Sciences we also strive to uncover a Structure, but it is the Structure of the causes and driving forces behind the observed data. In the general case we end up with a system of (partial) differential equations that we usually cannot solve analytically. Then we either linearize, or simplify, or modify our models to reduce them to a form that has a known analytical solution, one that is (relatively) easy to comprehend and works fine over a wide range of initial conditions. Or, if such an approach is not possible or acceptable, we resort to the data crunching of Numerical Methods, which are, basically, the same linearizations, simplifications, and modifications, but applied repeatedly on a small temporal or spatial scale, which is easy to automate on a machine. However, obtaining a solution this way has a cost: limited convergence intervals. If the initial conditions change significantly, all bets are off; such a solution may fail not only to deliver the required accuracy, but even to reproduce the general tendencies. Sounds familiar to Data Scientists, huh?
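As a small illustration of "the same linearization, applied repeatedly on a small scale", here is a sketch of the explicit Euler method on a toy decay equation. The equation dy/dt = -y, the step sizes, and the function names are my own assumptions, chosen only because the exact solution exp(-t) is known and the failure mode is easy to see.

```python
import math

def euler(f, y0, t_end, step):
    """Integrate dy/dt = f(t, y) from t = 0 to t_end with the explicit Euler method."""
    n_steps = round(t_end / step)
    t, y = 0.0, y0
    for _ in range(n_steps):
        y += step * f(t, y)  # follow the local tangent line for one step
        t += step
    return y

def decay(t, y):
    """Toy model: dy/dt = -y, with the exact solution y(t) = y0 * exp(-t)."""
    return -y

exact = math.exp(-10.0)
small = euler(decay, 1.0, 10.0, step=0.01)  # many small linear steps
large = euler(decay, 1.0, 10.0, step=2.5)   # the same method, too coarse a step

print(f"exact={exact:.3e}  small step={small:.3e}  large step={large:.3e}")
```

With the small step the result stays close to the exact decay; with the large step the very same recipe produces a value that grows in magnitude instead of decaying, so it misses even the general tendency.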
“Soft” Science generally despises such an approach of mathematically modelling the causes and driving forces of the data it deals with (supposedly because it deals with much more complicated matters, for which such deterministic analysis is practically useless; yeah, yeah, “an invisible hand of the Free Market” will sort everything out instead). What it usually looks for is the Structure of the Data itself, or of its appearance. The useful branch of Mathematical Modelling in such a case is Statistical Modelling. Similarly to the world of Natural Science, we may be interested either in Statistical Inference, which is more simplified but universal and gives us insight, or in Predictive Modelling, which is more result-oriented but limited in its range of convergence. Machine Learning is closely associated, and largely overlaps, with the latter, because its resulting models are hard to interpret in the analytical sense.
Data Mining and Pattern Recognition are also associated with each other, and do exactly what the name of the latter says: search for a Pattern, or a Structure, in the Data. However, the latter usually looks for Patterns by example, while the former looks for something new; and the more unexpected that Structure is, the better. They sit between Inference and Machine Learning because, on the one hand, we still want to have some analytical insight, and, on the other, we may greatly benefit from the power of “number crunching”. Of course, if the aim of Pattern Recognition is purely utilitarian (to arrest a particular government protester, or to kill a particular jihadist from a drone), then that brings it closer to Predictive Modelling.
Again, with the “Soft” Sciences shying away from mathematical methods, the niche that the theoretical branches occupy in the “Hard” Sciences was taken over, in the “Soft” Science realm, by the (semi)autonomous Data Science. Of course, in the real world the partition of the Mathematical Modelling branches described above is not strict. “Hard” Science uses a lot of Statistical methods, though there they play a more subordinate role: the initial processing of empirical data before the real theorizing begins, or the verification of theories against empirical reality. And there is some place for Analytical Mathematical Models even in “Soft” Science.
Nevertheless, what is the fundamental Method that lies at the foundation of all the methods of Data Science mentioned above (as well as those not mentioned)? The Method that might not be reflected upon very often in real, day-to-day practice?