
I don't understand many, or most, of these words (the highest I got was college calculus), but this sounds interesting to me.


It's all really just basic calculus, with a couple nifty tricks layered on top:

1) Create a bunch of variables and initialize them to random values. We're going to add and multiply these variables. The specific way that they're added and multiplied doesn't matter so much, though it turns out in practice that certain "architectures" of addition and multiplication patterns are better than others. But the key point is that it's just addition and multiplication.

2) Take some input, i.e. a bunch of numbers that convey properties of some object, say a house (think square feet, number of bedrooms, number of bathrooms, etc.), and add/multiply them into the set of variables we created in step 1. Once we plug and chug through all the additions and multiplications, we get a number. This is the output. At first this number will be random, because we initialized all our variables to random numbers. Measure how far the output is from the expected value corresponding to the given inputs (say, purchase price of the house). This is the error or "loss". In the case of purchase price, we can just subtract the predicted price from the expected price (and then square it, to make the calculus easier).

3) Now, since all we're doing is adding and multiplying, it's very straightforward to set up a calculus problem that minimizes the error of the output with respect to our variables. The number of multiplication/addition steps doesn't even matter, since we have the chain rule. It turns out this is very powerful: it gives us a procedure to minimize the error of our system of variables (i.e. the model), by iteratively "nudging" the variables according to how they affect the "error" of the output. The iterative nudging is what we call "learning". At the end of the procedure, rather than producing random outputs, the model will produce house-price predictions that reflect the relationship between square footage, bedrooms, bathrooms, etc. and the prices we saw in the training set. A small end-to-end sketch of all three steps follows below.
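
To make the three steps concrete, here's a minimal sketch in plain Python. Everything in it is made up for illustration (the house data, the learning rate, the iteration count): a linear model whose weights start random, a squared-error loss, and gradients worked out by hand from the chain rule, used to nudge the weights.

    import random

    random.seed(0)

    # Made-up training data: [sqft in thousands, bedrooms, bathrooms] -> price in $100k
    data = [
        ([1.0, 2, 1], 2.0),
        ([1.5, 3, 2], 3.0),
        ([2.0, 3, 2], 3.6),
        ([2.5, 4, 3], 4.5),
    ]

    # Step 1: variables initialized to random values.
    w = [random.uniform(-1, 1) for _ in range(3)]
    b = random.uniform(-1, 1)

    lr = 0.01  # size of each "nudge"

    for _ in range(2000):
        for x, y in data:
            # Step 2: the model is just additions and multiplications.
            pred = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = pred - y          # the loss is err**2

            # Step 3: chain rule gives d(err**2)/dw_i = 2 * err * x_i and
            # d(err**2)/db = 2 * err, so nudge each variable against its gradient.
            w = [wi - lr * 2 * err * xi for wi, xi in zip(w, x)]
            b -= lr * 2 * err

    print("learned weights:", w, "bias:", b)
    print("predicted price for 1.8k sqft, 3 bed, 2 bath:",
          sum(wi * xi for wi, xi in zip(w, [1.8, 3, 2])) + b)

A real neural net stacks many more of these additions and multiplications, but the nudging loop is the same idea.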

In a sense, ML and AI are really just the next logical step of calculus once we have big data and computational capacity.


Calculus is all you need! Neural nets are trained to minimize their errors (what they actually output vs what we want them to output). When we build a neural net we know the function corresponding to the output error, so training them (finding the minimum of the error function) is done just by following the gradient (derivative) of the error function.
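
As a tiny illustration of "just follow the gradient" (the error function here is invented, E(w) = (w - 3)**2): because we know E in closed form, we also know its derivative, and repeatedly stepping against that derivative walks w down to the minimum.

    # Toy error function E(w) = (w - 3)**2, with derivative dE/dw = 2 * (w - 3)
    w = 0.0        # arbitrary starting point
    lr = 0.1       # step size
    for _ in range(100):
        grad = 2 * (w - 3)   # gradient of the error at the current w
        w -= lr * grad       # follow the gradient downhill
    print(w)                 # approaches 3, the minimum of E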


I think there are still open questions about this that are worth asking.

It is clear enough that following gradients of a bounded differentiable function can bring you to a local minimum of the function (unless, I suppose, there's a path heading away from the starting location, off to infinity, along which the function is always decreasing and asymptotically approaches some value, but that sort of situation can be prevented by adding loss terms that penalize parameters for being too big).

But, what determines whether it reaches a global minimum? Or, if it doesn’t reach a global minimum, what kinds of local minima are there, and what determines which kinds it is more likely to end up in? Does including momentum and stochastic stuff in the gradient descent influence the kinds of local minima that are likely to be approached? If so, in what way?
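
A toy illustration of these questions (the function and hyperparameters are made up, not anything standard): a 1-D "loss" with one shallow local minimum and one global minimum, where the starting point, and whether you add momentum, change which minimum gradient descent ends up in.

    # f has a shallow local minimum near x ~ 1.13 and the global minimum near x ~ -1.30
    def f(x):
        return x**4 - 3*x**2 + x

    def grad(x):
        return 4*x**3 - 6*x + 1

    def gd(x, lr=0.01, steps=2000):
        for _ in range(steps):
            x -= lr * grad(x)
        return x

    def gd_momentum(x, lr=0.01, beta=0.9, steps=2000):
        v = 0.0
        for _ in range(steps):
            v = beta * v - lr * grad(x)
            x += v
        return x

    print(gd(2.0))           # settles in the nearby shallow minimum (~1.13)
    print(gd(-2.0))          # same rule, different start: finds ~-1.30
    print(gd_momentum(2.0))  # with these settings the momentum "ball" rolls
                             # through the shallow basin and also reaches ~-1.30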


Local minima aren't normally a problem for neural nets, since they usually have a very large number of parameters, meaning that the loss/error landscape has a correspondingly high number of dimensions. You might be at a local minimum along one of those dimensions, but the probability of simultaneously being at a local minimum along all of them is vanishingly small.
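
One way to picture "a minimum along one dimension but not along all of them" is a saddle point. A minimal, made-up sketch (f(x, y) = x**2 - y**2, not any real network's loss): the origin is a minimum along x but a maximum along y, and gradient descent slides away along y.

    # f(x, y) = x**2 - y**2: a minimum along x, a maximum along y (a saddle point)
    def grad(x, y):
        return 2 * x, -2 * y

    x, y = 1.0, 1e-3   # start almost exactly on the "minimum-looking" axis
    lr = 0.1
    for _ in range(200):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    print(x, y)  # x shrinks toward 0, but y escapes along the descending direction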

Different learning rate schedules, as well as momentum etc., can also help you avoid getting stuck for too long in areas of the loss landscape that may not be local minima, but may still be slow to move out of. One more modern approach is to cycle between higher and lower learning rates rather than just use monotonically decreasing ones.
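
For comparison, here's roughly what the two kinds of schedule can look like (purely illustrative numbers and function names, not taken from any particular paper or library): a monotonically decaying learning rate next to a cyclic, cosine-shaped one that restarts every so many steps.

    import math

    def decayed_lr(step, base_lr=0.1, decay=0.001):
        return base_lr / (1.0 + decay * step)          # monotonically decreasing

    def cyclic_lr(step, max_lr=0.1, min_lr=0.001, cycle_len=1000):
        t = (step % cycle_len) / cycle_len             # position within the cycle, 0..1
        return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t))

    for step in (0, 500, 999, 1000, 1500):
        print(step, round(decayed_lr(step), 4), round(cyclic_lr(step), 4))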

I'm not sure what the latest research is, but things like batch size and learning rate can certainly affect the minimum found, with some settings resulting in better generalization than others.



