Program for the Advancement of Learning Sciences: Computational Learning, Machine Learning, and Neuroscience

A letter on the philosophy of information sciences, towards their transformation into learning sciences: Philosophy of Mathematics, Philosophy of Computer Science, Philosophy of Computational Learning, and Philosophy of Neuroscience

By: A Mathematician (I will no longer be)

A letter to a student: The Code and the Binding [Translator's note: "Binding" refers to the biblical story of the Binding of Isaac] - Everything is Connected (to Learning)

Learning Philosophy of Mathematics

You think that philosophy of mathematics is not interesting, but it's actually the most truly interesting thing. Learning should have been taken as the foundations of mathematics. Not writing proofs - but learning proofs, because mathematical construction is at its core not a logical construction (that's only its linguistic surface), but a learning construction. After all, the central problem of neuroscience is thinking about the brain as a single agent, instead of understanding that there is competition in the brain - between thoughts, between modules (for example, on attention and decisions), between different memories, between neurons, and between different continuations of this sentence (and this competition parallels economic or political competition, which builds systems that learn, like democracy or capitalism or Chinese meritocracy, and it is the root of their victory). Thus, the central problem of mathematics is that it does not conceptualize within itself its multiple agents, the mathematicians, who learn it, and does not conceptualize learning at all beneath mathematics (just as in the past it did not conceptualize the logic beneath mathematics, and then Frege turned logic into the infrastructure of mathematics, so beneath logic - what activates it, and what will later become the infrastructure of mathematics - is mathematical learning). Moreover - learning should be the tool for defining basic concepts in mathematics, on which everything is built: limit, group, topology, space, proof, set, primes, probability, function, sequence, etc. And so mathematics should undergo a learning reconstruction, axiomatization and reinterpretation (like a possible learning interpretation of quantum theory, among its other interpretations). The property of composition and construction in mathematics - and especially algebra - originates in learning, and should be based on it. Let's say you've already learned how to do a, b, as a black box. 
What does it mean that you have this function? What does it mean to know, for example, a proof? How do you learn with it to reach c? A stage will come when you can no longer simply say I have a function, but unlike Brouwer's intuitionism or the axiomatic-computational construction of formalism, the construction you'll need to provide is learning-based: how you learned the function. And even if the function already exists in you (say in your brain's neurology), as a black box, knowing it doesn't mean using it, that is, knowing is not the ability to give its answer to inputs, but the meaning of knowing is the ability to learn through it, namely to compose from this black box (which you don't understand) appropriate learning continuations. Just as knowing a proof is not the ability to quote it and reach from the assumptions to the conclusions (QED), but the ability to compose additional proofs from it, that is, to continue learning through it. And understanding a proof is not something you understand within it (for example inside its sequence), but understanding how to build additional proofs from it (not just "using" it in the existing system, like in Wittgenstein, but building from it the continuation of the system and developing the system, like a poet's use of language, not a speaker's, that is, like a programmer's use of a computer, not a "user's"). And here we'll notice for example the similarity between neural networks and genetic algorithms. In neurons, the construction is mainly connecting and combining numbers (that is, a linear combination - the simplest combination - of functions, with just a minimum of necessary non-linearity above it), while in evolution the construction is connecting and combining parts (in practice, it's a linguistic combination of two sentences - two genomes, so that some of the words are from the first and some from the second. 
And finally after convergence - the sentences are very similar and there are slight variations between them, so that the sentence still has meaning. "A gardener grew grain in a garden" mates with "A gardener grew wheat in a garden". But at its base the construction in the genetic algorithm is simply to connect by swapping. And their son is "A gardener grew grain in a garden"). So beyond the specific difference between the two mechanisms of composition and construction, namely the connections, one being a quantitative size connection and the other a textual-linguistic connection, there is a deep similarity between neuronal learning and evolution: generations are layers. The basic learning components are both very numerous at each stage, and also stack on top of each other in a deep way (that is, very multiple), to create learning. Evolution is inherently deep learning, and one cannot deny this natural similarity. In other words, we see that in nature, construction is fundamental to learning - even if there may be different construction techniques in the world of learning (addition, multiplication, string concatenation, calling another code segment as a function, etc.) - and so it is even in logical and mathematical construction. For in logic too there are multiple layers of construction created by combination (in construction there are two dimensions, because it combines two or more previous things - horizontal dimension - to create something new from them - vertical dimension. That is, construction is created both from the multiplicity downwards, and from the multiplicity of possibilities beside you, like bricks in a wall). And if we return to the project of redefining mathematics above learning, we'll see that this program (the learning program of the foundations of mathematics, in the spirit of Langlands' program) is suitable not only in inherently constructive algebra, but even in analysis. 
Indeed, in algebra construction is basic, and precisely because of this, basic construction questions in it will benefit from a learning perspective. After all, what are primes? The collision between two methods of constructing numbers: one by addition - and the other by multiplication. This is the source of the enigma (Riemann as a parable), and its solution will be through a new conceptualization: learning to construct them. Learning primes - this is the royal road to the Riemann hypothesis. And so one can learn to construct a group. Or learn a set (or graph, or game, or matrix). And in analysis, what does limit mean? To approach through measurements - means to know. And topology is a generalization of limit. Limit is a learning mechanism, and when it succeeds, when one can learn (that is, as you get closer it teaches you what you're approaching), it's continuous. And when you can't learn - then it's not continuous. And this learning mechanism itself stems from the topology of continuity. That is, in topology learning is a more abstract generalization and not the basis of the definition of limit, but limit is a particular example of it. When looking at the learning mechanism itself (of the continuous) and starting the definition from it - this is topology (as a substitute for definition using filters, or open/closed sets, or other contemporary proposals). And in analysis, we can define the derivative using the idea of method, or method as a generalization of the idea of derivative. This is the learning of learning.
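The crossover-of-sentences construction described earlier in this section (two near-identical genomes mating after convergence) can be given as a minimal sketch. Single-point crossover is assumed, and all names are illustrative, not from the source:

```python
import random

def crossover(parent_a, parent_b, point=None):
    """Single-point crossover: the child takes a prefix of one parent and a
    suffix of the other - some of the words from the first, some from the
    second. This is the textual-linguistic connection, as opposed to the
    neuron's quantitative (linear) combination."""
    assert len(parent_a) == len(parent_b)
    if point is None:
        point = random.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]

# After convergence the parents differ only slightly, so the child is
# still a sentence with meaning:
a = "A gardener grew grain in a garden".split()
b = "A gardener grew wheat in a garden".split()
child = crossover(a, b, point=4)
print(" ".join(child))  # -> "A gardener grew grain in a garden"
```

Stacking such generations on top of each other is what the passage calls depth: generations are layers.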

Learning Philosophy of Computer Science

In the same way, a similar process of building the field on learning foundations can be done in computer science (and thus finally seriously establish the field of philosophy of computer science). After all, what is computation: how did a function arrive at this? (You can no longer simply define it; the definition must be constructive - computable). If so, what is learning: how did the computation arrive at this? (You need to explain how you built the algorithm, that is, how you learned it, just as before you had to explain how you built the function. This is constructiveness of constructiveness). If so, if we return to the function, what is needed is: to learn to compute a function. Proof is after all construction. And learning is how to build - to build the building itself. Hence the next algebraic stage will be addition and multiplication in learning, which will be a generalization of addition and multiplication, and therefore using learning we can define addition and multiplication of algorithms. And so they will be a generalization of multiplication (calling in a loop, in the polynomial case) and addition (performing one algorithm after another), in the learning construction. And recursion will be a generalization of exponentiation, while conditioning is a type of addition. In Turing's world of computation, infinity and the asymptotic were the analysis, and the operations - the algebra. And now we face the problem that we want to add infinities, that is, systems learning towards a limit, which is historically very similar to the problem of adding infinities that lay at the root of infinitesimal calculus. After all, learning components always approach an optimum, and this is the continuous part, of optimization. And on the other hand they are composed with and on each other, algebraically, which is the discrete part, of search and mutation - that is, the computationally expensive part. If there is no method to do this in general - only combinations remain. 
That is, it's a brute force search. And therefore we must understand that at its core, exponentiality is actually an expression of brute force and inability to understand and solve the problem, but only to formulate it. It means: not knowing how to solve. That is: beneath all the basic algebraic operations we know in mathematics, like addition and multiplication and exponentiation, there is something deeper, and computational, and even (beneath) learning. And it currently peeks out and expresses itself externally simply as a function of runtime. Exponentiation is actually searching the entire space of possibilities. It's language and not learning. Language is all possible combinations, and learning is the convergence of possibilities, and therefore enables a specific solution. A specific sentence. No sentence in the world has ever been written by language - they are all written by learning.
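The claim that exponentiality is the outward trace of brute force - searching the entire space of possible combinations - can be made concrete with a toy subset-sum search. The problem choice and names are mine, a sketch only:

```python
from itertools import product

def brute_force_subset_sum(weights, target):
    """Having no method, we search the whole space: all 2^n
    inclusion/exclusion combinations. The exponential runtime expresses
    not knowing how to solve - only how to formulate."""
    n = len(weights)
    tried = 0
    for mask in product([0, 1], repeat=n):  # every combination = "language"
        tried += 1
        chosen = sum(w for w, bit in zip(weights, mask) if bit)
        if chosen == target:                # convergence to one sentence
            return mask, tried
    return None, tried

mask, tried = brute_force_subset_sum([3, 9, 5, 2], 7)
# The search may enumerate up to 2^4 = 16 candidate combinations.
```

Language is the enumeration loop; learning would be whatever replaces that loop with a convergence toward the specific solution.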

Philosophy of Algorithmics

You learned a function or algorithm? Notice that it's similar to a limit in analysis - where the function is found (which is the limit). And instead of epsilon and delta, we have here an interaction between teacher and student. The student aspires to the limit (which is their horizon), and the teacher stands in the position of the measure in the limit, for example asking how close you are to the function's result at a certain point. That is, the teacher's side, the side that measures success, that judges your convergence, is like the criterion in NP. And what's the trouble with NP? That it's exactly the opposite of a continuous limit in analysis, because in such problems partial measurement of success doesn't help at all in achieving the goal, and doesn't assist learning, meaning you can't succeed as a student. There are no directions along the way that allow reaching the goal. Learning is the process of building from things we know how to do - something we don't know how to do. And all this against an evaluation measure. And if the evaluation is an internal criterion, not external, then this is the way - which is the method. But if there is no internal criterion at all but only external? Then you're in NP. When you learn an algorithm, is it correct to define it as learning from example or from demonstration, that is, as learning what or learning how? Do you receive only the input and output values of the function you're learning in a specific case, or do you receive a constructive building of the function in a specific input-output case? The answer should be both, because learning is exactly the decomposition of the function as built from previous functions, which is the very demonstration, but at each stage the choice of which combination of them to make depends on the example (Is a proof an example or a demonstration?). If so, NP are the problems that are easy to examine - and difficult to learn (that is, that cannot be taught - to be a teacher - in their case). 
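The asymmetry described here - easy to examine, hard to learn, with no directions along the way - can be sketched with a toy satisfiability instance. The formula and all names are illustrative assumptions, not the author's formalism:

```python
from itertools import product

# A toy CNF formula over three variables; a literal is (index, negated).
clauses = [[(0, False), (1, True)],   # x0 or not-x1
           [(1, False), (2, False)],  # x1 or x2
           [(0, True), (2, True)]]    # not-x0 or not-x2

def check(assignment):
    """The examiner's side: judging a candidate is one fast, local pass."""
    return all(any(assignment[v] != neg for v, neg in clause)
               for clause in clauses)

def partial_score(assignment):
    """Counting satisfied clauses looks like a direction along the way,
    but in hard instances this partial measurement of success does not
    reliably guide the student toward the goal."""
    return sum(any(assignment[v] != neg for v, neg in clause)
               for clause in clauses)

# The student's side: with no usable directions, only search remains.
solutions = [a for a in product([False, True], repeat=3) if check(a)]
print(solutions)  # -> [(False, False, True), (True, True, False)]
```

The external criterion (`check`) exists; what is missing in NP is an internal criterion that makes the intermediate steps learnable.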
And so exactly in the problem of primes, the question is how much you cannot learn them, how unpredictable they are (probability, which can also be redefined using learning). This is the essence of the Riemann hypothesis (and therefore is expected to have a deep connection to the problem of factorization of primes as a one-way function). What is learning in prime numbers? In every prime number you've reached on the sequence of natural numbers, what you already know is to build using multiplication numbers from all the primes before it. That is, it (the next prime) is something you haven't learned and need to learn, and the deep question is how limited your learning ability is in essence, if the learning construction is building a number using multiplication of previous numbers. That is: in the two most important conjectures in mathematics there exists a learning formulation that touches their essence - and should have been the way to go for their solution, if we hadn't encountered linguistic thinking, that is, a very primitive and combinatorial type of construction (both of natural numbers and of algorithms). In both we need to prove that a certain phenomenon is difficult to learn - that is, to find what cannot be learned. In the history of mathematics we solved basic conjectures that we didn't know at all how to approach (existence of irrational numbers, squaring the circle, the quintic equation, Gödel's theorem, etc.) always through such a new construction, which managed to capture the phenomenon - and afterwards a proof of what cannot be built using it. Let's note how much the NP problem is actually a learning problem (which was misconceptualized using language, and therefore became one that no language suits it, or is even capable of beginning to grasp its solution), and then we won't understand why we didn't realize that conceptualization using learning is its natural solution direction. 
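A minimal sketch of "learning the primes" as described above, where what you already know at each stage is how to build numbers by multiplication of the primes learned so far; the next prime is exactly what that construction cannot reach. The encoding is mine:

```python
def learn_primes(limit):
    """Walk the natural numbers; a number reachable by multiplication of
    previously learned primes (i.e. divisible by one of them) is already
    known. A number the construction cannot reach is the next prime -
    the thing not yet learned."""
    learned = []
    for n in range(2, limit + 1):
        reachable = any(n % p == 0 for p in learned)
        if not reachable:
            learned.append(n)   # a new, unlearned building block
    return learned

print(learn_primes(20))  # -> [2, 3, 5, 7, 11, 13, 17, 19]
```

The deep question in the passage is then how limited this learning is in essence, i.e. how unpredictable the sequence of "unlearned" numbers remains.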
For using the learning view, we even see the similarity of NP to evolution, where learning is the mechanism (mating and mutation) that struggles against the examiner of survival and fitness, where it's very difficult to build a living creature and innovate in it, and very easy to check if it survives or not. Biology is always in a position of difficult learning against the cruel nature, which finds it easy to judge its efforts. And here, on the way to learning, we see that beauty plays a role in guidance, so that biology can guess through shortcuts who is more fit and who is less. And so also in mathematics. A hard criterion of proof goes hand in hand with a soft criterion of beauty, which allows mathematicians to do mathematics and progress in mathematical learning, despite it being a fundamentally difficult problem. And our thinking too depends on beautiful moves. And so we also judge philosophy.

Philosophy of Complexity Theory

How is evaluation performed: as part of the definition of learning are there many layers of evaluation or just one at the end, like in NP, where it cannot be broken down into evaluation layers? Well, the two natural learning examples help understand what learning is - the brain and evolution - and in them there are countless layers of evaluation, and in fact in every layer (or generation) there is an evaluation of its predecessor (therefore women are the hidden layer - of the network - in evolution, that is, they are what turns each generation into a deep network, as an internal evaluation layer between input and output, namely children). Thus, in the same way, limit and natural numbers help us understand what is the generalized concept of learning in mathematics, in the continuous domain and in the discrete domain (and brain learning is continuous, while evolutionary learning is discrete). But beyond this abstraction itself, which reflects deep content common to all parts of mathematics (learning as the content of mathematics), one can also seek learning as the form of mathematics. What's beneath mathematics itself: how to learn mathematics. For example: to define a mathematician. Today, it is accepted that a learning algorithm should be polynomial. But the restriction on polynomiality for the learning algorithm is not correct in the general case (mathematician). Therefore we as humans, as brains, do many things for which we have an efficient algorithm, but we don't have efficient general learning, and there can't be. In general, learning is efficient only when it's very limited through use of things we learned before. And therefore we have an illusion that learning is an efficient process, because most of our learning is such, but what characterizes such special learning is that it is learning of knowledge. And therefore most learning in our world is learning of knowledge, because learning of new action and algorithm is always inefficient. 
If so, what is knowledge? When there is an efficient learning algorithm. This is its definition. Let's note that almost everything we learn is things that others know how to do, that is we use ready-made functions, and build from them, and our learning can be broken down into ready-made functions. Therefore, in the decomposition of learning into building the layers that created it, one needs to think about the structure itself of the space of all possible decompositions of a problem into sub-problems. But, any definition of learning from a teacher needs to overcome the problem of "within the system", that is, that the help won't be programming the student from outside and cheating and collusion between them, but if the decomposition is a maximal decomposition, that is into too small pieces, then it's just like programming. Is it possible to characterize the ideal decomposition, as being in the middle between absolute decomposition into crumbs equivalent to programming (maximal decomposition) and the NP problem (minimal decomposition, where there is only an examiner at the end and no evaluations in the middle)? If there is no teacher, there is development - like in evolution that builds on previous algorithms and like in mathematics that builds on previous proofs, and then the division of the problem into sub-problems is natural, because there is no one dividing it. The maximal decomposition is the algorithm, as written code, and the minimal is the problem itself, the exponent - and in the middle learning is what connects between them. That is, this transition from the problem to the algorithm is itself the learning process. Namely: adding more and more divisions (when it's from top to bottom, from the teacher's point of view) or more and more building connections (when it's from bottom to top, from the student's point of view), and when there is only a student and no teacher this is development, which is natural. 
A polynomial solution means that it can be broken down into simpler sub-problems, that is, to learn. Therefore, what can be learned characterizes the polynomial, and therefore learning is the construction that fits understanding the limitations of the polynomial (i.e. what separates it from NP). After all, learning is the construction of the polynomial from the linear, that is, from the minimum that simply allows reading all the input, and therefore the polynomials are a natural group. Therefore, we should look for a minimal decomposition that is learnable, for example a minimal decomposition into linear sub-problems, because the maximal decomposition is not interesting, as it is identical to writing the code (and linear is of course just one example of the most basic learning blocks in the algorithmic field. For example, in the field of number theory, it could be factorization in multiplication. Or any other bounded function that defines other problems in mathematics). Therefore, in our definition of learning, we can assume the ideal selection of examples (for learning, by the teacher), just as we assume the minimal decomposition. What learns - and also what teaches - does not have to be computationally limited, but it is constructively limited. And let's also note that this whole structure of building through previous functions is much more similar to human thinking (for example, from logic and language and computation and perception). We do not know how we do the things we know how to do, but we know how to do things w-i-t-h them. Learn through them. But we do not know how we learned, it's a black box. And all the functions from which we composed in our learning can be black boxes for us. That is: there are two parts to learning here. 
One part that defines and characterizes the structure that we want to learn - or the decomposition we want to make to the problem - which is the limitations on the functions: what are the basic functions and what are their allowed combinations. And there is another part here, which asks what information builds this construction precisely from all the possibilities - which are the examples. To prevent collusion between the teacher and student, does the construction need to be done in a specific learning algorithm, and not in any possible algorithm of the learner (so that it will not be possible to encode the solution within the examples)? After all, one can choose such a universal (inefficient) algorithm, using Occam's razor, as the minimal length combination that fits the examples, or perhaps some other naive search algorithm. And then you create a tree of decomposition of the problem (the learned function) into sub-problems (which are sub-functions), with the numbers of examples required to create the correct combination (the correct construction) from sub-functions at each branch split (the number of branches is as the number of sub-functions that build the branch above them). And then maybe there is a trade-off between the dimension of decomposition (like the detailed decomposition into sub-problems) and the number of examples. And then the tree can grow to infinity in an NP problem, or when the sub-bricks from which they build only approximate the solution (like in primes, which only approximate large primes, because they are not enough to span all the naturals, because there are infinitely many primes, and then one can estimate how full and good the approximation is in relation to the number of primes - and this is Riemann's question). And then using this, one can express problems of impossibility of construction. 
If you demand minimum effort from the teacher, and minimum examples, then if you already have things you've learned, you demand the minimum of the best examples to learn the next thing. And this in itself reduces the complexity of the next thing in the learning process, because for example it's better to teach a rule, and then in additional learning the exception. Therefore, if we have the perfect student and the perfect teacher, we will ask what perfect learning looks like. For example, how does the teacher indicate that it's an example that is the exception? (In order for there to be a rule at all, and not just one example for the rule and one opposite example - if they are given simultaneously, that is without serial decomposition - which can break down the rule altogether, because how will you know which of the examples is the rule and which is the exception)? Well, he doesn't. He simply first teaches the rule. And then after that, in the next construction layer, after the rule is learned, he teaches the exception. And then the shortest thing the learner can do, assuming he already has a function that is the rule, which he has already learned, is simply to add one exception to it (in certain cases). And so the decomposition can save on the number of examples. And the information in the decomposition can allow learning with less information, in certain cases, than what is even in what is being taught (because the information in the decomposition itself, which the teacher gives in the very order of the curriculum, is not counted). This is learning structuralism.
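The rule-then-exception curriculum can be sketched as two construction layers: first the rule is learned, then, in the next layer, one exception is patched on top of it as the learner's shortest move. The toy hypothesis space (rules of the form f(x) = a*x) is an assumption of mine:

```python
def learn_rule(examples):
    """Stage one: fit the simplest hypothesis to the rule examples.
    The hypothesis space here is deliberately tiny: f(x) = a * x."""
    x0, y0 = examples[0]
    a = y0 // x0
    assert all(y == a * x for x, y in examples)  # the rule must fit all
    return lambda x: a * x

def learn_exception(rule, exception_example):
    """Stage two: the learned rule is now a ready-made black box; the
    shortest continuation is to add a single patch on top of it."""
    ex_x, ex_y = exception_example
    return lambda x: ex_y if x == ex_x else rule(x)

# Curriculum order carries information: rule first, exception after.
rule = learn_rule([(1, 3), (2, 6)])   # the rule: triple the input
f = learn_exception(rule, (7, 0))     # the exception: f(7) = 0
print(f(5), f(7))  # -> 15 0
```

Given simultaneously, the three examples could not be told apart as rule versus exception; the serial decomposition is what saves examples.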

Philosophy of Computational Learning

So, you have a list of functions/algorithms/oracles and you have a function that is a limited combination of them, and you learn them from examples that are chosen as the best, when you have no computational limitations. And neither does the teacher. And the question is what is the minimum number of examples that is possible with a decomposition of the problem into sub-functions/algorithms, when you learn according to Occam's razor (for example according to the complexity of the algorithm, its length, or another simplicity criterion). If the decomposition comes for free then we look at the total number of examples, and then the decomposition is maximal, meaning the learning is as gradual as possible. Alternatively, one can look at the ratio between the examples and the decomposition (between the number of examples required and the number of sub-problems in the given decomposition), which is of course an inverse ratio. Or examine different topologies of different decomposition trees of the same problem (in how many ways can the same problem be decomposed, which are fundamentally different?). Our goal is to build the learning tree in a way that decomposes the problem into problems in a non-trivial way. Because if we look at the minimal decomposition, when the decomposition is expensive and the examples are free, we will get a trivial decomposition, meaning there is no decomposition, and we've returned to the original problem, which has only a test and examples, which is similar to NP. Therefore, we can also look at all these possible decompositions, maybe infinitely many in certain functions, and see how they themselves descend from each other, and what are the properties of forests of such trees. And then find a canonical form of decomposition, which is perhaps in a certain ratio between the amount of decompositions and the number of examples. 
In the end, it's not the examples that are interesting or their number, but the tree structures - what is the decomposition of an algorithm into sub-algorithms. Or of a problem into sub-problems. Or decomposition of a theorem into all possible proofs (and even mathematics itself can be thought of as a proof graph, which can be studied as a graph, and perhaps find connections between the structure of this graph and mathematical structures). And if the decomposition that the teacher gives is sufficiently detailed into small sub-problems, then perhaps there is an efficient algorithm for learning (that is, for finding construction combinations according to examples), and maybe even just naive search is efficient, because what's really hard to find is the decomposition. But if the decomposition stems from the minimum number of examples (meaning that the minimum number of examples doesn't necessarily require maximal decomposition) then this gives it power (in both senses). And from here we can start thinking about all kinds of different combination functions of sub-functions, which create different construction problems, when limiting what is allowed in construction. For example: only a linear combination of functions that will give the example given by the teacher, or a proof system that will prove like the proof example, or learn a group, which is also a simple function (addition), and can be learned in fewer examples than all combinations of its elements if it is decomposed into sub-problems, and maybe there will even be less information in the examples than what is in it (because as mentioned the rest of the information will hide in the decomposition). And then we can ask how much exemplary information is in a group, or in any other mathematical structure, and this can be the definition of learning information (as opposed to linguistic). 
Because generalization from examples is not justified, except based on what already exists (the functions you have already learned, that is, that were presented to you first by the teacher in the decomposition of the problem into sub-problems, which are the simpler functions, from which you learn something more complicated, like in a baby's learning or in the development of evolution - and this is a fundamental property of learning). That is, there is a kind of hint to use what you have already learned. What you already know is your priors. And in a continuous function this is extreme (because you are not allowed to complicate it unnecessarily, otherwise you will never learn even simple functions, and you are committed to simplicity first, because of Occam's razor). Therefore, you need the minimal combination of what you know - that produces the new example given by the teacher. And if you are committed to simplicity it is immune to cheating. Because if there is collusion (for example if the teacher encodes the weights required from the student within the example), then it does not meet the condition of Occam's razor. The algorithm is disqualified because it does not give the simplest. The student cannot choose an arbitrary composition but the simplest and minimal. There is an internal criterion for simplicity, which fills the evaluating, feminine side (the middle layers of evaluation), and there is also a composition function (which is different in every learning of a mathematical structure of a certain type. For example: learning graphs, learning groups, learning continuous functions - which can be built using polynomial approximations or alternatively in Fourier transform and so on, learning algorithms, learning proofs, learning games, learning topologies, learning languages, etc.). And the information that is supposedly saved, because it is not counted - is structural. 
That is: such that stems from the structural division (the decomposition), and therefore if there is no structure at all in what is being learned but only noise then the learning will need to be the transfer of all the information. That is, it is not learning but the transfer of linguistic information.
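This boundary between learning and the mere transfer of linguistic information can be sketched under an assumed toy hypothesis space: where structure exists, a short hypothesis replaces the table; where there is only noise, nothing remains but transmitting every entry.

```python
def teachability(table, hypothesis_space):
    """If some ready-made hypothesis reproduces the table, its structure
    can be taught with a few identifying examples; if nothing fits, the
    'teaching' degenerates into transmitting every entry - transfer of
    linguistic information, not learning."""
    consistent = [h for h in hypothesis_space
                  if all(h(x) == y for x, y in table.items())]
    if consistent:
        return "learnable: structure found"
    return f"noise: transfer all {len(table)} entries"

# A tiny hypothesis space of structural rules (purely illustrative):
space = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x]

structured = {x: 2 * x for x in range(10)}   # has structure
noisy = {0: 7, 1: 42, 2: 13}                 # fits no rule in the space

print(teachability(structured, space))  # -> learnable: structure found
print(teachability(noisy, space))       # -> noise: transfer all 3 entries
```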

Philosophy of Machine Learning

The basic question here, which has repeated itself throughout the history of mathematics, is: how is a function created? Maybe it is created physically in nature (ontology), maybe it is created geometrically (vision), maybe it is perceived (reason), maybe it is defined (logically), maybe it is computed, and maybe it is learned. That is: built from sub-functions. And from here, from the parts of function definition, come all the main current research areas in machine learning. When learning does not have the source of the function (its domain, in mathematical jargon), this is reinforcement learning (and then simplicity looks for the simplest source that will create the simplest function), and when there is no range of the function, this is unsupervised learning (and then simplicity looks for the simplest range that will create the simplest function). And when the simplicity of the function is considered not only from the construction of sub-functions (how complex it is) but also from its construction from the examples themselves, then this is statistical learning (the size of the distance from them is part of the simplicity calculation). The definition of learning aims to analyze the learned mathematical object - and find its internal structure. Its purpose is to build it - using hierarchy (decomposition into sub-problems) and using examples. That is: using two types of structural information, which allow combination between two structures: top-down (vertical), and from the side (horizontal) - different examples are different parallel composition possibilities, at each stage, from the floor below. And therefore everything in mathematics moves between lack of structure and excess structure. Too many degrees of freedom and too few. And therefore its boundaries are randomness and extreme complexity to the point of inability to say anything meaningful on one side, and on the other side a structure too simple and trivial and lacking information and richness. 
Therefore, one always needs to find within it the fractal boundary - there lies the beauty. And there too lies the mathematical interest, because there is the most learning information, as opposed to random and opaque information (in the sense that it cannot be deciphered), or trivial and opaque information (in the sense that there is nothing to decipher, because it is hermetically closed). And why are these fundamental properties of mathematics? Because everything is learned, and learnability is the root of structurality, and also the root of the complexity of structurality, because it is never one-dimensional structurality but two-dimensional (which turns it into construction), as we have in the numbers (addition and multiplication). And let's note that the simplicity in learning defined above is online, and not measured against the whole as in the simple Occam's razor (MDL, Solomonoff induction, or Kolmogorov complexity). That is: we look for the simplest hypothesis after the first example, and then, say, take it (this hypothesis) as another ready-made function below, and add to it the next example, and then look for the best and simplest hypothesis, considering the previous hypothesis as one that has no cost - that is, as simple. That is: the function already learned in the first stage is no longer counted in the complexity and simplicity calculation. And perhaps there may even be a universal and simplistic definition of the simplicity function - simply as the number of compositions. That is, simplicity only as a product of the idea of composition, and not as an independent measure and evaluation.
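The online simplicity described above can be sketched in code. This is only an illustration, under toy assumptions of my own: hypotheses are compositions of functions from a library, complexity is counted as the number of new compositions, and a hypothesis, once adopted, joins the library and costs nothing further.

```python
from collections import deque

def online_occam(examples, base_fns, max_depth=3):
    """Online simplicity: after each example, find the hypothesis with the
    fewest NEW compositions consistent with all examples seen so far; an
    adopted hypothesis joins the library and is no longer counted as complex."""
    library = list(base_fns)
    hypothesis = None
    for i in range(1, len(examples) + 1):
        hypothesis = simplest(library, examples[:i], max_depth)
        if hypothesis is not None and hypothesis not in library:
            library.append(hypothesis)  # learned functions become free
    return hypothesis

def simplest(library, seen, max_depth):
    # breadth-first by composition count: fewest compositions first
    frontier = deque((f, 0) for f in library)
    while frontier:
        f, depth = frontier.popleft()
        if all(f(x) == y for x, y in seen):
            return f
        if depth < max_depth:
            for g in library:
                frontier.append((lambda x, f=f, g=g: g(f(x)), depth + 1))
    return None
```

For example, with the base functions x+1 and 2x and the examples (1, 4) and (2, 6), the search settles on the single composition x into 2(x+1) - and that composition is then available, for free, as a building block for later examples.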

Philosophy of Mathematics: Summary

Using all this, we can re-characterize through learning the difference between finite and infinite as the difference between learned and not learned, which creates a more precise cut between these two categories. An algebraic, finite structure is always learned eventually, while an infinite, continuous structure can only be learned in the limit, meaning it is not finitely learned. The infinity can be horizontal, to the side (in the collection of examples at each stage), or vertical, upwards (in composition) or downwards (in the basic collection of functions from which we start in the first place). And in such a view, continuity and simplicity are related. Everything is finite but can be approximated. That is: the limit cannot be calculated, but it can be learned, so as to reduce the distance. And if we add approximation to the simplicity measure (as opposed to the exactness required in the discrete, where it is mandatory to reconstruct the examples - and this is actually the definition of discreteness), then the idea of the derivative is the linear approximation to the function (that is, if only linear construction is allowed), and so on (in higher derivatives, which are higher layers in learning, up to the power series). And continuity is a zero-order derivative - a constant. That is, what is simplicity in infinitesimal calculus? Simplicity on the examples and not on the combination (or also on the combination, as in linear regression). And the integral is the inverse problem, the teacher's problem: how to find a function that will make the student's evaluation - his approximation - look like a given function. And in the discrete world, which is controlled by the examples exactly, we find infinite problems in what cannot be learned to the end, like the primes (when the composition allowed in construction is multiplication). 
And then one can ask, for example, how complex the composition tree of the natural numbers is, on average (that is, their decomposition into primes, which is what is learned with the fewest examples). To understand how to build the set of natural numbers, when the combination is multiplication, means to know the distribution of the number of examples that the teacher needs to give in order to build the naturals up to a certain number. That is, there is a learning formulation for the basic questions in mathematics - which will allow them a learning solution, once the paradigm of language, which is what is stuck in the progress on these questions because it is an inappropriate conceptual framework, is replaced. And so philosophy can assist mathematics - and mathematical learning.

Philosophy of Computerized Learning

The next stage after the philosophy of computer science is the philosophy of computerized learning. The state of deep learning today is like the state of the personal computer before the internet. And the future is an internet network of deep learning networks and machine learning classifiers, connected to each other by protocol, creating the ability to compose them in learning construction. That is: to connect all kinds of deep learning modules, each specializing in something, into some large system, which really knows many things about the world, like the brain, and not just isolated expert systems trained on specific data. Such a network of deep networks will be a kind of market, where you pay a little money for a little classification, or any other ability or action, and a huge ecosystem of artificial learning is created. And it will be the prelude to the great intelligence - and from it artificial intelligence will grow, and not from any specific system: it will not emerge one day from some network in some laboratory, but from within the network. What will be the natural categories of such an intelligence? Just as in the world of computation the Turing machine redefined the idea of space as memory, that is, as information that takes up space, and the idea of time as operations in computation, that is, as something that takes time (and hence - efficiency), so deep learning redefines them. What is space now? Something local, as in convolutional networks, that is, a system where something affects the things close to it. And what is time? Continuous memory, as in RNN, that is, a system where something affects things far from it. The previous world, the world of computation, reduced the importance of space (because everything is in memory), and nullified its natural dimensions (memory is inherently one-dimensional), and in contrast emphasized precisely the dimension of time and speed. 
And here, in the world of deep learning, we see that there is actually room for expanding the dimension of time, which will no longer be one-dimensional, because things can exert influence from afar, from all kinds of directions - and in more than one dimension. There can certainly be a deep learning network with two or more time dimensions, that is, one connected in the time dimension to copies of itself in more than one dimension - not just recursively backwards, but recursive in two or more variables/directions. That is, if computation was a temporalization of space (everything, including money, equals time), then deep learning can be a spatialization of time (everything will be space, even time).
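A recurrence with two time dimensions can be sketched in a few lines - a toy cell, with names and dimensions of my own invention, whose state at grid position (i, j) depends on its input and on two predecessor states, one along each "time" axis:

```python
import numpy as np

def two_time_rnn(inputs, Wx, Wa, Wb, b):
    """A recurrence with two 'time' dimensions: the state at cell (i, j)
    depends on its input and on the states of two predecessors, (i-1, j)
    and (i, j-1) - time as a space with more than one dimension."""
    n, m, d = inputs.shape[0], inputs.shape[1], b.shape[0]
    h = np.zeros((n + 1, m + 1, d))  # zero-padded initial states
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            h[i, j] = np.tanh(inputs[i - 1, j - 1] @ Wx
                              + h[i - 1, j] @ Wa   # recurrence along axis 1
                              + h[i, j - 1] @ Wb   # recurrence along axis 2
                              + b)
    return h[1:, 1:]
```

The one-dimensional RNN is the special case where one of the two axes has length one; here the "past" of a cell is a whole quarter-plane rather than a line.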

Philosophy of Deep Learning

What is deep learning made of? Of the two most basic and primitive things learned in mathematics, that is, in the first semester: Linear Algebra 1 and Calculus 1. Linear algebra is the composition we talked about (and it is the simplest composition available: linear combination). And in addition there is also the derivative, which gives the guidance, according to the third Nathanian postulate (a derivative is a direction, and therefore it is the simplest guidance). That is: what does learning actually do? It replaces examples with guidance. And what makes learning deep? That all this construction is done within a system. This is the depth of the system (and the second postulate). And learning is no longer always close to the surface of the system, as in language - in the system's dialogue with external examples (at the bottom and top of the network). And in addition, each layer is female to the layer below it and male to the one above it, according to the fourth Nathanian postulate. That is, we see here the realization in the field of all the postulates (and even the first, if you notice). Just like a prophecy. And let's also note that there are two elements here, which compete with each other throughout the history of learning: guidance versus structure. Here we see them in the gradient derivative that washes over everything in backward diffusion during learning time (the guidance) versus the building of a specific model (for example, the specific architecture of the network, which is determined in advance, but even more so all kinds of ideas that are less popular today, like creating a specific learning model with strong priors for a specific problem, instead of the general approach of a deep network for every problem). And all this is just the contemporary incarnation of that old problem of environment versus heredity, of empiricism versus rationalism, and of Aristotle versus Plato. 
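The claim can be seen in miniature: nothing but matrix products (the linear combination) and gradients (the derivative as guidance), and a two-layer network learns XOR. A toy sketch with arbitrary sizes and learning rate, not a recommended implementation:

```python
import numpy as np

# Deep learning reduced to its two ingredients: linear combination
# (Linear Algebra 1) and the derivative as guidance (Calculus 1).
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR
W1 = rng.normal(size=(2, 8))
W2 = rng.normal(size=(8, 1))
lr = 0.05
for _ in range(5000):
    h = np.tanh(X @ W1)   # linear combination + nonlinearity
    out = h @ W2          # linear combination
    err = out - y         # the derivative turns examples into a direction
    W2 -= lr * h.T @ err
    W1 -= lr * X.T @ ((err @ W2.T) * (1 - h ** 2))
mse = float(np.mean((np.tanh(X @ W1) @ W2 - y) ** 2))
```

Everything else in the field - architectures, optimizers, regularization - is elaboration on these two moves: compose linearly, then follow the derivative.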
Or of free competition and the invisible hand (the world of guidance) versus socialism and the state (the world of structure), liberalism versus conservatism, and Lamarckian evolution (at the guidance extreme) versus intelligent design (at the structural extreme). At the mathematical level, guidance is continuous, and related to the world of analysis and geometry, versus structural composition, which is linguistic, and related to the world of algebra and logic. And deep learning is a tremendous victory of the guidance approach to learning at the expense of construction in this dialectic (but the counter-movement will yet come), and it is parallel to the victory of capitalism and democracy (the guidance of communication and elections versus the bureaucratic and governmental structure), or the takeover of hedonism at the expense of structure in society. Because in deep learning it turns out that structure is much less important than simply a lot of feedback and guidance (but of course there is a synthesis here, because after all, where is there as high a hierarchy as in deep learning? Only that it turns out that the details of the hierarchy are less important, and in fact everything in it is determined by guidance, and thus we have created a rather general learning mechanism, which is a kind of empirical blank slate). Therefore, to understand what learning is, maybe what is needed is to take the ratio between the number of examples required for learning and the amount of structure-giving required, and to ask how it (the ratio between them) changes. The more examples are needed, the less structure, and vice versa. And to understand what this function looks like - this is the important investigation, not whether structure is more or less important than examples. For example: is this function linear, is it polynomial, is it exponential, and so on, in different problem domains (for example in learning different mathematical objects, and also in different problems in reality). 
That is, what needs to be asked is: what is the relationship between the number of examples and the amount of priors? And this is the same problem of variance versus bias, which is at the heart of machine learning (but less at the heart of deep learning, after the great victory of variance over bias, with the countless parameters of deep learning, which far outnumber the constraints).

Philosophy of Neuroscience

What is the conceptual foundation that even allows a rule like Hebb's rule (so local, compared to the globality of deep networks), which tends towards positive or negative self-feedback (a fatally corrupt property)? How is Hebb's rule even possible, as a basic learning mechanism, which has no connection at all either to guidance or to structure, neither to the outside nor to the inside? Well, Hebb's rule is not just "neurons that fire together wire together" (the fire&wire brothers); its true formulation is that I strengthen the connection from those who predicted me, and weaken it from those I predicted. Therefore, this rule makes sense only under the assumption that neurons are not only information transmitters but also independent quality evaluators, and then this rule creates reputation, and seeks novelty in order to spread it. Additionally, it creates layers, because it works against circularity. It seeks the first to identify, and therefore creates competition over who will be the first to identify, that is: it is a competitive rule. But no single source to a neuron should exceed fifty percent, or at least some fixed threshold, otherwise it's a corrupt positive feedback loop (additionally, it's clear that this rule alone is not enough, because it's autistic, and it also needs a neurotransmitter that gives external feedback, and probably other forms of feedback). That is, Hebb's rule can only work if you (as a neuron) have an independent evaluation ability - and it hints at such an ability (!). And then there is competition over it. Therefore, it is certainly logical for a social network of humans, even more than for a network of neurons, apparently. 
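The predictive reading of Hebb's rule, together with the cap on any single source, can be sketched as a toy update rule. The names and constants here are my own illustrative assumptions: `pre_before` and `pre_after` mark whether each presynaptic neuron fired before or after the postsynaptic one.

```python
import numpy as np

def predictive_hebb(w, pre_before, pre_after, post, lr=0.1, cap=0.5):
    """Hebb's rule as prediction: strengthen incoming weights from neurons
    that fired before me (they predicted me), weaken those that fired only
    after me (I predicted them). The cap keeps any single source below a
    fixed share of the total input, blocking runaway positive feedback."""
    w = w + lr * post * pre_before   # they predicted me: strengthen
    w = w - lr * post * pre_after    # I predicted them: weaken
    w = np.clip(w, 0.0, None)
    total = w.sum()
    if total > 0:
        w = np.minimum(w, cap * total)  # no source may dominate
    return w
```

Starting from equal weights [0.2, 0.2], with the first source predicting and the second predicted, one step yields [0.2, 0.1]: the predictor is rewarded but immediately capped at half the total, exactly the anti-corruption threshold described above.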
But from any cursory glance at the connectome of the brain, or even of a single neural network, it seems that the brain goes very, very far in ways to create disorder and diversity and noise and complexity, perhaps in order not to allow Hebb's rule to converge to triviality, and to give it materials diverse, stochastic, chaotic and unstable enough to work on. Hebb's rule treats information as content, and not as an operation in a computation (as in the perception of the neural network as a kind of distributed and parallel computer). That is, when there is a lot of redundancy (as in the brain, or in many other learning systems) and everyone stands on the same line, then you need to choose the correct message, which you pass on with relatively small changes of parameters - that is, when it's more about information transfer and less about computation. And in this context, the whole story of top-down prediction in the brain (say: when each upper layer predicts the one below it, and thus for example I predict the sensory data I will receive) is probably deeply related to Hebb's rule. Because whoever I predict is redundant for me to listen to. And if so, there is a process of guessing and convergence, and less of computation. Therefore, the word prediction should be replaced with guessing. In such a perception, the brain operates through bottom-up computation and top-down guessing, and then there are the conflict points between them, or conflict layers, and whoever was right (and guessed or computed the continuation) decides there over the other. If each upper layer says what should have been below, and vice versa, then the convergence of this process allows finding the source of the error, which is the place from which the incorrect evaluation begins to jump - there is a sharp rise in the problem there. 
That is, either the computation - rising from below - was distorted at this place and became incorrect, and then caused an error in the continuation of the layers above, or the guess - descending from above - was distorted at this place and proved incorrect, and then caused an error in the guessing towards the continuation of the layers below. Therefore, a real neuron is a content evaluator, and not just evaluated. Because it decides whom to listen to. That is, it is specifically evaluated on every content it transmits, and specifically evaluates every content it receives. It is not afraid of an infinite positive or negative feedback mechanism, in which it listens only to one friend and to no one else, because it hears enough opinions with enough noise, and maybe there is also a limit to how much it listens to anyone (maybe it's logarithmic, for example). That is, we see that each neuron can have not only external feedback and guidance from above, but also intrinsic measurement methods for evaluation, such as: does it predict me, and does it fit the prediction of whoever is above me right now. The common thinking in deep learning is about the two directions in the network as separate, coming in two separate stages: the computation (forward propagation) from bottom to top, and the feedback (backward propagation) from top to bottom. But we can also think of them as waves in a system that progress in time simultaneously, asynchronously and in both directions, according to their strength - that is, sometimes a certain progression stops at an unsatisfied neuron, or at an entire such layer, and begins to return feedback backwards, and vice versa, and there are returns and echoes and a stormy sea, until it converges to a certain state, and this is the true computation mechanism of the network (and not just bottom to top). And so the training and the execution/prediction are also not two separate stages, since the backward propagation and the forward propagation occur in parallel. 
And this is likely how it happens in the brain. And if each layer predicts the one before it, then sometimes feedback will even be returned from the input layer, of the data, which doesn't happen currently in deep learning - and it's a shame, because we are missing this reverberation, and the information in the backward propagation signal disappears and is lost when it reaches the input layer (we don't use this information for comparison with the real input). But if each processing unit receives guidance from above, and independently outputs guidance downwards (and not just as part of backward propagation), then at the meeting point between bottom and top the gradient or evaluation descending from above meets what rises from below, in the computation that was. And if there is a mismatch, then there is a problem. For both sides. And if they don't agree on which direction the signal should change, then attention needs to be alerted to the mismatch, and the system's resources directed towards it - and thus it's possible to notice innovations, or surprises, or problems. For example, at the micro level, if there is a neuron that is not listened to, whose outgoing weights are close to zero, then it should receive negative feedback pushing it to become a more useful and interesting function. And if it consistently receives strongly contradictory feedbacks, then maybe it needs to split into two neurons. And if the connections of two neurons are too similar, then maybe they need to unite into one. That is, we can design the architecture according to feedbacks and mismatches. And at the macro level, this allows the system to search for surprises, and for examples where it erred in predicting the future - and this is curiosity. For example, if a layer from above erred in predicting the one below it, then we continue to investigate similar examples, until we reach a solution. Because the correct systemic thinking is about a network that has already learned (a lot). 
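The meeting of bottom-up computation with top-down guessing, and the localization of the error source at the layer of greatest mismatch, can be sketched as follows. All weights here are arbitrary illustrations (the downward "guessing" network is untrained); the point is only the mechanism of comparing the two passes:

```python
import numpy as np

def surprise_layer(x, up_weights, down_weights):
    """Bottom-up computation meets top-down guessing: each layer's upward
    activation is compared with the prediction descending from above; the
    layer with the largest mismatch is the error source - the place to
    which the system should direct its attention."""
    acts = [x]                            # upward pass: computation
    for W in up_weights:
        acts.append(np.tanh(acts[-1] @ W))
    guesses = [None] * (len(acts) - 1)    # downward pass: guessing
    g = acts[-1]
    for i in range(len(down_weights) - 1, -1, -1):
        g = np.tanh(g @ down_weights[i])
        guesses[i] = g                    # layer i+1's guess of layer i
    mismatch = [float(np.mean((acts[i] - guesses[i]) ** 2))
                for i in range(len(guesses))]
    return int(np.argmax(mismatch)), mismatch
```

Note that layer 0 - the input itself - also gets a guess and a mismatch score, which is exactly the reverberation from the input layer that the text says current deep learning throws away.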
And then it continues to learn another example, or encounters a new example that doesn't fit the past - unlike the thinking today, where the network starts learning all the examples from the beginning (infant thinking). And therefore, when we have identified a problem, the parameter space needs to be worked through as a search, and not just as optimization - as exploration. And to propose more innovations - new combinations. Once there is independent evaluation, where a layer judges the one below it by its own measure, and not just according to the guidance it received from the one above it (in backward propagation), you can also perform a search, and narrow the search space all along the way (that is, between the different layers, so that the search will not need to explode into countless combinations by brute force). The first generation of artificial intelligence research was search (as a central algorithmic paradigm, for example in logical inference), while today artificial intelligence flees from search as from fire, and has replaced it with optimization (in the tuning of continuous parameters, and in statistics); but in the future there will be a synthesis between them. Search also has something to offer (and not just explode), if managed correctly, and indeed many times search is performed in the brain, as well as in evolution, because this is a way that allows more creative innovations - through combination and the evaluation of it. After all, philosophy itself would have been very boring and sycophantic if it were just optimization against its evaluation function, and its being a search is what makes it difficult and interesting - and creative, in its struggle against its evaluation. And why is evolution faster than brute-force search? The success of evolution stems from the very ability to compose: at first the search is over simple combinations, and then, in the next layers, the search steps grow, with combinations of parts that are complex in themselves. 
And at each stage (that is, layer) there is an independent evaluation of the creature. So it's not brute force, because the previous steps in learning influence the next steps, and guide them, and therefore the search is not over the entire space of possibilities, but only within an advancing beam. If so, the phenomenon of composition and generations (=layers) is basic in learning. That is: in deep learning and in the brain and in evolution and in the definition of general learning, we have multiple components that are black boxes, and there are connections between them in construction (which needs to be characterized in each particular case: in deep learning, linear combinations with a twist of non-linearity; in evolution - mating; and so on in other systems). Upwards they compute a function, with the help of those below them. And downwards they give an evaluation (for example by means of a gradient, or perhaps a choice - of a mate, for example - do you understand?).
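The "advancing beam" - composition layer by layer, generation by generation, with an independent evaluation pruning each generation - can be sketched in a few lines. The `combine` and `evaluate` functions are illustrative placeholders for whatever the particular system uses (mating, linear combination, and so on):

```python
import heapq

def layered_beam_search(primitives, combine, evaluate, layers=3, beam=5):
    """Search as evolution: each layer (generation) combines survivors of
    the previous one, and an independent evaluation prunes the offspring
    to a narrow beam - so the search advances in a beam instead of
    exploding into brute force over the whole space of combinations."""
    population = list(primitives)
    for _ in range(layers):
        offspring = [combine(a, b) for a in population for b in population]
        # independent evaluation at every layer, not only at the end
        population = heapq.nlargest(beam, set(offspring), key=evaluate)
    return max(population, key=evaluate)
```

For example, combining the primitives 1, 2, 3 by addition, with closeness to 100 as the evaluation, three layers of a width-5 beam reach 24 - each generation's step size grows, because it combines parts that are already complex, which is exactly the claimed advantage of evolution over brute force.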

Philosophy of Network Research

What does feedback create? Simply put, partial differential equations and recursive equations, which are actually feedback mechanisms - and hence the phenomena of complexity and chaos. Therefore in the brain too, and in learning in general, feedback loops will produce similar phenomena, which are therefore natural to learning, and not its malfunctions. But what types of feedback are there? There are alternative mechanisms to the backward propagation of gradient descent (that is, descent along the slope, in optimization) for transferring evaluation backwards. For example: striving for simplicity (the evaluation is according to a measurement of how simple something is, say according to Occam's razor). Or striving for novelty. Or for variability and diversity (a certain distribution). But the most important property of feedback is not what it is according to, but what size of loop it creates, because this is a systemic property. And here the weakness of backward propagation stands out: it creates a huge feedback loop, which is very artificial in a large system - and very slow. A more reasonable, and therefore more common, alternative is short feedback loops (there is no learning system in the world, outside of artificial neural networks, that learns by backward propagation). For example in the brain, there are many backward connections between the neuron layers, in the opposite direction (which do not exist in deep learning). What is currently missing in the understanding of the brain - and likewise in deep learning - is the idea of competition, and of the spread of an idea in a population (which actually fits better with Hebb's rule). After all, at every stage, several possibilities compete in the brain, several thought continuations, and one is chosen. That is, there is competition over some evaluation, which chooses how to continue the learning. 
That is: the greatest importance of feedback is precisely in the competition it creates (exactly as in economics or democracy: the very existence of feedback is important, even if it is not ideal). But in too large a feedback loop all this is lost or becomes inefficient, compared to close competition in small loops. In Google's PageRank algorithm, too, there are hubs, which are evaluators, and this is actually its essence - analyzing the graph so that some of the vertices in the network evaluate others (and are in turn evaluated by them). All this is very similar to neural networks, and thus competition is created between the sites for ranking, and in general a quality competition in the network. And in science? Each paper cites others - that is, this is the evaluation in the network, where there are no layers but everyone is connected to everyone. And the layers are created according to publication time (each paper evaluates those published before it). That is, we have here layers that evaluate those before them, and are evaluated by those after them, and thus competition is created, by means of a very simple network mechanism. In these two cases, a large external feedback loop is not needed to create evaluation and competition; the evaluation in them is created from within. We don't necessarily need strong external evaluation as in evolution to create competition, and this is the key to unsupervised learning, which is the dominant learning in the brain, and the great flaw of deep learning, which needs an enormous number of examples (by the way, even in evolution the main competition is for the mate, that is, over the small feedback loops internal to the species, and not against the great extinction). Thus we see that precisely in networks where there is no clear external evaluation - for example on Facebook, in the stock market, in dating, and in papers - fierce competition can still exist. 
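PageRank is exactly such an internal evaluation mechanism, and its standard power-iteration form is short enough to quote as a sketch - every node's score is an evaluation it receives from the nodes linking to it, which are themselves evaluated by others, with no external feedback loop at all:

```python
import numpy as np

def pagerank(adj, damping=0.85, iters=100):
    """Power-iteration PageRank: evaluation created from within the
    network itself - each node is scored by its evaluators, who are in
    turn scored by theirs."""
    n = adj.shape[0]
    out = adj.sum(axis=1, keepdims=True)
    out[out == 0] = 1                    # guard against dangling nodes
    M = (adj / out).T                    # column-stochastic transition
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - damping) / n + damping * M @ r
    return r
```

On a tiny three-page web where pages 0 and 1 both link to page 2, page 2 ends up ranked highest: it is the most evaluated evaluator, with no judge outside the graph.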
In such networks you receive a number, like a price or likes or h-index or PageRank and Google ranking - and that is the evaluation of you. This number does not give you any guidance, only evaluation, and you need to interpret it and understand from it in what direction you need to change. And this is in contrast to the gradient that directs you in deep learning, which is a direction given to you from above. And maybe one can argue that the polynomial domain is what has moderate guidance, while NP is the class of problems without guidance - not differentiable, but chaotic and non-local. Therefore we need to learn from NP that evaluation is not enough for learning - only guidance is. After all, NP is exactly this huge feedback loop, from outside, which turns out to give nothing to the learning inside that would lead us to a solution. Such an evaluation cannot be differentiated into guidance. Is the polynomial the piecewise Lamarckian - that is, what can be decomposed into local optimization, that is, construction plus guidance? In the brain we still don't know how learning works, but in evolution we do, and we see that in it too there is a key feature: an independent evaluation function, which is why there are two sexes. That is, even though there is strong external evaluation of life and death, for learning to work there needs to be, within the system, also an independent internal evaluation - of sex. The large feedback loop must be decomposed into smaller and closer feedback loops, which are not just a derivative, in both senses, of it. Also in a cultural/political/corporate/economic network there are independent evaluation functions. Namely: there are parts whose entire function is this. And then there is competition over it, namely there is duplication and redundancy and diversity and variance and comparison between alternatives (otherwise why does the psychic redundancy exist in all learning systems? 
Why are there so many neurons in the brain, and genes in the genome, and organisms in the species - and people in the state?). If so, how does the internal evaluation work? How is it itself evaluated? Well, there are simply independent evaluation units within the system, which are independently guided, and not just one large overall feedback loop. In general, global feedback to the system is rare and expensive, and therefore learning systems rely on secondary evaluation functions. And they simply learn the evaluation functions as well. And what happens in NP? The secondary evaluations fail. In fact, the whole idea of reinforcement learning from outside the system as something that creates the system's learning (for example behaviorism) is a conceptual error, whose origin is in a simplistic philosophical picture of learning. We never have final feedback - the full reckoning is never finished.

Philosophy of Neural Networks

How else do the independent evaluations within the system help, in contrast to the external evaluation, which comes from outside the system to teach it? Because you also need to protect what you learned before from new learning that erases it. And the internal evaluation protects the learning it led to from being washed away and eroded by all-sweeping external guidance (as in backward propagation). This way it's possible to make the new feedback reach only something new, and be channeled towards it, and not towards all the old - to add, and not erase. What allows memory preservation is precisely that there is no backward learning. For example, that it is not Lamarckian but DNA learning - that is, digital, and not merely continuous-analog (which is entirely eroded by derivative and convergence in optimization). And this also allows combination. When the evaluations are independent, the learning goes backward only one layer at a time. There the magic happens - for example, of complexity, simply with another layer. In evolution too - it's always one generation. Backward propagation is the root of evil, which turned the entire field of deep learning into brute force, a black box, and therefore engineering and not science. All the problematic phenomena stem from it. And there is no natural system that learns like that. The catastrophic forgetting (the phenomenon where a deep network forgets what it learned if it is now given examples of a different type) and the inability to connect building blocks well in deep learning would have been avoided if we had chosen a model like the one presented here at the beginning, of teacher and construction. The catastrophic forgetting actually happens because there is no memory at all, but only action or learning. 
Therefore we need memory that is resistant to learning, namely: cases where the network decides that it has learned something useful, or a certain concept, and keeps it apart from further change (or greatly slows down its ability to change with regard to it). Therefore we need a way to strengthen what you did, and not just leave it unchanged - that there be a confidence parameter for each parameter, which strengthens every time you succeeded (that is, when there is almost no derivative change in the parameter's guidance - which is also valuable information that currently goes more or less to waste, although it partially affects gradient descent optimization algorithms, for example through momentum). To remember is the ability not to learn. In order to learn anything that will persist you need the ability not to learn, and not to be affected by every new piece of information like a weathervane of guidance. Any change in the backward propagation mechanism is much more fundamental than other changes in deep learning, because this is the method, the learning mechanism. And there it can be fixed. And the role of philosophy is to perform this deep conceptual analysis (which it almost doesn't do today, and therefore no one pays philosophers, despite the enormous value they could provide).
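The confidence parameter per parameter can be sketched as a small modification of gradient descent - an illustrative rule of my own, not an existing optimizer: confidence grows whenever a parameter's gradient is near zero (the guidance says "stay"), and the effective learning rate shrinks with confidence, so consolidated parameters resist being washed away by new guidance.

```python
import numpy as np

def confident_update(w, grad, confidence, lr=0.1, gain=0.05, eps=1e-3):
    """To remember is the ability not to learn: confidence accumulates for
    parameters whose gradient stays near zero, and the effective learning
    rate shrinks with confidence - consolidation instead of mere inertia."""
    confidence = confidence + gain * (np.abs(grad) < eps)
    w = w - lr * grad / (1.0 + confidence)   # confident parameters move less
    return w, confidence
```

After a few steps, a parameter whose guidance keeps saying "stay" has earned high confidence and would barely move even under a sudden new gradient, while an unconsolidated one remains a weathervane.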

The Philosophy of Deep Learning: Summary

Therefore, what is needed is a model where everything that goes down (the evaluations) is connected in one network of deep evaluation, and each layer in it has outputs and inputs to what happens in the regular deep network - that is, to the parallel layer in the computing network, which goes up. The input to the evaluation network from the computing network is the output of a layer of the computing network, which is transferred to the evaluation network - for its evaluation. And the output from the evaluating network to the computing network is its evaluation output - which is guidance. Yes, it's completely symmetrical in both directions. And therefore much more general. One network goes up, and opposite it a completely parallel network goes down. And in the particular case where they have exactly the same structure, each neuron actually has double weights, downwards and upwards, for their update. That is, it can be thought of as one network (with double action), but perhaps it's better to give the evaluating network independence in architecture - that is, two networks that control each other. And what does all this say about NP? The definition of learning here is as a decomposition into layers of evaluator and evaluated, teacher and students. And the question is whether such a decomposition exists or not for a problem, when every polynomial algorithm is such a decomposition. That is, this is a different definition of learning than the one we saw in the philosophy of computer science, and it may be more suitable for dealing with the fundamental problem of these sciences. And I have already passed the stage in my life where I am able to take these thoughts and turn them into formal ones - but maybe you will be able to.
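The proposed model - a computing network going up and a parallel evaluating network going down, meeting at every layer - might be sketched as follows. The dimensions and the random downward weights are illustrative only; the point is the symmetry of the two passes, with a local guidance signal emitted at every layer instead of one global backpropagated gradient:

```python
import numpy as np

def dual_pass(x, target, up_Ws, down_Ws):
    """One network computes upward; a mirrored network carries evaluation
    downward, emitting at each layer a guidance signal of the same shape
    as that layer's activation - two networks that control each other."""
    acts = [x]
    for W in up_Ws:                      # the computing network, upward
        acts.append(np.tanh(acts[-1] @ W))
    e = acts[-1] - target                # top-level evaluation
    guidance = [e]
    for W in reversed(down_Ws):          # the evaluating network, downward
        e = np.tanh(e @ W)
        guidance.append(e)
    return acts, guidance[::-1]          # bottom-to-top, aligned with acts
```

Because the downward network has its own weights and architecture, its guidance is an independent evaluation per layer, not the derivative of the upward pass - which is exactly the independence the text asks for.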