Machine Learning Newsletter

1. Introduction to Deep Learning with Lua - Gradients

Deep learning refers to wide and deep neural network architectures, but at its core, a network is nothing but a circuit of simple operations, whether sigmoid, multiplication, or even addition. However, in order to learn an input-output pair, one needs to know how to compute the gradient of a function to be able to optimize the weights of the network.

In this post, I will show how to compute the gradients of various functions both analytically and numerically (using the definition of the derivative). You could consider it Calculus 101 as well, since most of the concepts are a review of Calculus 101, which I will build on in the following weeks.

Basic Circuit Definitions

Let's define some very basic circuits in Lua. I will be using Torch in the following posts, so this is a nice way to expose yourself to Lua's syntax. If you are familiar with Python, I wrote a blog post comparing Python with Lua; you may want to check it out.

In [1]:
function multiplier_gate(x, y)
    return x * y
end

function adder_gate(x, y)
    return x + y
end

function sigmoid_gate(x)
    return 1 / (1 + math.exp(-x))
end

function not_gate(x)
    return not x
end

function or_gate(x, y)
    return x or y
end

function and_gate(x, y)
    return x and y
end

function xor_gate(x, y)
    return and_gate(or_gate(x, y),
                    or_gate(not_gate(x), not_gate(y)))
end

function xand_gate(x, y)
    -- XNOR: true exactly when the inputs agree
    return not_gate(xor_gate(x, y))
end

function max_gate(x, y)
    if x >= y then
        return x
    else
        return y
    end
end

function min_gate(x, y)
    if x <= y then
        return x
    else
        return y
    end
end

Let's run a few of these operations to check that they compute the right things.

In [2]:
print(max_gate(3, 5))
print(min_gate(3, 5))
print(multiplier_gate(5, 10))
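We can sanity-check the boolean gates too. Here is a minimal sketch (restating the gate definitions from above so it runs standalone) that walks xor_gate through its truth table:

```lua
-- gate definitions restated so this snippet runs on its own
function not_gate(x) return not x end
function or_gate(x, y) return x or y end
function and_gate(x, y) return x and y end
function xor_gate(x, y)
    return and_gate(or_gate(x, y),
                    or_gate(not_gate(x), not_gate(y)))
end

-- XOR is true exactly when the inputs differ
print(xor_gate(true, true))   -- false
print(xor_gate(true, false))  -- true
print(xor_gate(false, true))  -- true
print(xor_gate(false, false)) -- false
```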
In [3]:
function round(num, idp)
    local mult = 10^(idp or 0)
    return math.floor(num * mult + 0.5) / mult
end

I am using the word derivative instead of gradient, as the derivative is the gradient of a function in one dimension, and I will only be computing gradients in one dimension here. First, let's look at the implementation; I will then show what it actually computes. (Kind of backwards, but it works better, since code scares less than math.)

In [4]:
function derivative(f, x, y, perturbation)
    return (f(x, y) - f(x - perturbation, y)) / perturbation
end
In [5]:
derivative_of_x = round(derivative(multiplier_gate, 5, -6, 0.00001))
In [6]:
print(derivative_of_x)
-- prints -6, which is the same as y, as y is the derivative of xy with respect to x

That follows from the definition of the derivative, which we can compute as follows:

$$ \frac{\partial f(x, y)}{\partial(x)} = \frac{f(x,y) - f(x-h, y)}{h} $$

which in our case, $f(x, y) = xy$, and let's put the function definition inside of the definition as such:

$$ =\frac{xy - (x-h)y}{h} $$

which results in $$ =\frac{xy - xy + hy}{h} $$ $$ =\frac{hy}{h} $$

Then, it is obvious that: $$ \frac{\partial f(x, y)}{\partial(x)} = y $$

Can you do this for $ \frac{\partial f(x, y)}{\partial(y)} $? What would you expect before doing the computation?
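If you want to check your answer numerically, here is a sketch of a hypothetical derivative_y helper (not defined above; the name is mine) that applies the same backward difference to the second argument instead:

```lua
function multiplier_gate(x, y)
    return x * y
end

-- hypothetical helper: same backward difference, but perturbing y
function derivative_y(f, x, y, perturbation)
    return (f(x, y) - f(x, y - perturbation)) / perturbation
end

-- d(xy)/dy should equal x; for x = 5 this is approximately 5
print(derivative_y(multiplier_gate, 5, -6, 0.00001))
```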

However, we rarely use such simple circuits to build neural networks. Since we need to compute the gradient of each circuit, let's look at how we could solve this problem for composite circuits.

In [7]:
function square(gate_func, x, y)
    local temp = gate_func(x, y)
    return temp * temp
end

function sum_square(x, y)
    return square(adder_gate, x, y)
end

The sum_square gate is a gate which does the following operation:

$$ f(x, y) = (x+y)^2 $$

How do we compute such a function's derivative with respect to $x$? Easy peasy. We expand the function as follows: $$ f(x,y) = x^2 + 2xy + y^2 $$

Then, we compute the derivative very easily using Calculus 101: $$\frac{\partial f(x, y)}{\partial x} = 2x + 2y $$

We could differentiate the unexpanded version as well, which would give the same result.

Another approach is to assign $x+y$ to another variable ($z$) and compute the derivative of $f$ with respect to $z$. To get back the derivative with respect to $x$ or $y$, it is enough to multiply the derivative of the function with respect to $z$ by the derivative of $z$ with respect to $x$, using the Chain Rule. The formula is: $$ \frac{\partial f(x, y)}{\partial x} = \frac{\partial f(z)}{\partial z} \frac{\partial z}{\partial x}, \quad z = x + y $$

Then $f(z) = z^2$, which has derivative $2z$. $z$ itself is $x+y$, and its derivative with respect to $x$ is straightforward: it is 1. If we multiply the terms in the chain formula above, we get:

$$ \frac{\partial f(x,y)}{\partial x} = 2z \cdot 1$$

where $z$ is $x+y$. Let's remove $z$ from the equation altogether: $$ \frac{\partial f(x,y)}{\partial x} = 2x + 2y$$

This is the same as the result we got above.
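The chain rule maps directly to code: compute the local derivative of each piece and multiply. A minimal sketch (the function name is mine, just for illustration):

```lua
-- f(x, y) = z^2 where z = x + y
function chain_rule_grad_x(x, y)
    local z = x + y
    local df_dz = 2 * z   -- derivative of z^2 with respect to z
    local dz_dx = 1       -- derivative of x + y with respect to x
    return df_dz * dz_dx  -- chain rule: multiply the local derivatives
end

print(chain_rule_grad_x(5, -6))  -- -2, i.e. 2x + 2y
```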

What do you think the derivative of $z$ with respect to $y$ is?

Going back to numerical differentiation, let's verify that the analytical solution we got is actually correct. If $x=5$ and $y=-6$, our gradient should be $2x + 2y$, which is $-2$. Let's see if that is actually the case.

In [8]:
derivative_of_x = round(derivative(sum_square, 5, -6, 0.00001))
In [9]:
print(derivative_of_x)
-- prints -2
Yes, it is. We just checked our analytical solution with a numerical one.

We can do a similar derivative computation for the sigmoid gate as well. The sigmoid function is defined as: $$ \sigma(x) = \frac{e^x}{e^x + 1} $$ which is equivalent to the following (divide both numerator and denominator by $e^x$): $$ = \frac{1}{1 + e^{-x}} $$
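A quick numerical sanity check of this equivalence (a sketch; the two function names are mine):

```lua
-- the two algebraic forms of the sigmoid
local function sigmoid_a(x) return math.exp(x) / (math.exp(x) + 1) end
local function sigmoid_b(x) return 1 / (1 + math.exp(-x)) end

for _, x in ipairs({-2, -0.5, 0, 1, 3}) do
    -- the two forms agree up to floating-point error
    print(math.abs(sigmoid_a(x) - sigmoid_b(x)) < 1e-12)
end
```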

We can get the derivative of this function very easily using the quotient rule:

$$ \left(\frac{u}{v}\right)' = \frac{u'v - u v'}{v^2} $$

With $u = 1$ and $v = 1 + e^{-x}$, we have $u' = 0$ and $v' = -e^{-x}$, so:

$$ \frac{\partial \sigma(x)}{\partial x} = \frac{0 \cdot (1+e^{-x}) - 1 \cdot (-e^{-x})}{(1+e^{-x})^2} = \frac{e^{-x}}{(1+e^{-x})^2} $$

In case it is not obvious, note that $1 - \sigma(x) = \frac{e^{-x}}{1 + e^{-x}}$.

Can you think of a better way to express the derivative of the sigmoid function?

Here is one: $$ \frac{\partial \sigma(x)}{\partial x} = \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}} = \sigma(x) (1 - \sigma(x)) $$ which is much nicer to deal with and compute: as long as you have a function that computes sigma, its gradient is only a multiplication operation!

For $x=0$, the derivative should be $0.25$, since $\sigma(0) = 0.5$ and $0.5 \cdot (1 - 0.5) = 0.25$. Let's see if our numerical gradient agrees with the analytical derivation.

In [10]:
-- note: sigmoid_gate ignores the second (y) argument passed by derivative
print(round(derivative(sigmoid_gate, 0, -6, 0.00001), 3))
-- prints 0.25

Yes, it is. Pretty neat.
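The agreement is not special to $x=0$. Here is a sketch using the $\sigma(x)(1-\sigma(x))$ shortcut (sigmoid_grad and numeric_grad are my own names, not defined earlier), checked against the backward-difference derivative at several points:

```lua
local function sigmoid(x) return 1 / (1 + math.exp(-x)) end

-- analytic gradient via the sigma * (1 - sigma) identity
local function sigmoid_grad(x)
    local s = sigmoid(x)
    return s * (1 - s)
end

-- backward-difference numerical derivative, as in derivative() above
local function numeric_grad(x, h)
    return (sigmoid(x) - sigmoid(x - h)) / h
end

for _, x in ipairs({-1, 0, 2}) do
    -- analytic and numerical gradients agree to within the step size
    print(math.abs(sigmoid_grad(x) - numeric_grad(x, 0.00001)) < 1e-4)
end
```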

One of the most famous algorithms, Backprop, uses the chain rule and analytical gradient computation, as numerical gradient computation is not practical for large networks. I will show this in the next post.

What is next?

Backprop algorithm.
