Intro to Neural Networks


As I have been hinting in previous posts, hyperplanes and the logistic regression models you get from them are a key ingredient in constructing a neural network. Introductions to neural networks often begin with networks where a single logistic regression model is the basic building block.


Notation and the building block: the neuron. To get going, I will introduce a diagram representation for these building blocks. Each individual building block, or sometimes its corresponding logistic regression function, is called a (sigmoid) neuron.

Diagram of a “neuron” representing a single logistic regression model.

The diagram has vertices (the circles) and edges, each edge connecting a pair of vertices. Each edge also has a number (called a weight) associated to it. In neural networks (including even a single neuron) the vertices are organized in a sequence of layers. In basic neural networks, the edges connect vertices that are in adjacent layers.
In a neuron (see figure to the right), there are just two layers, an input layer and an output layer. The vertices in the input layer have values from the components of the input vector \( {\bf X}\). There is a single vertex in the output layer with value equal to \( \sigma({\bf W}\cdot{\bf X})\), where \( {\bf W}\) is the vector of weights coming from the edges that lead from the input to the output layer. When \( \sigma = \sigma_k\), the sigmoid function, the neuron is the same as a logistic regression model.

The function at the output can be chosen to be something else. For example, one could use \( \text{sign}\) in place of \( \sigma\), so the output would be \( \text{sign}({\bf W}\cdot{\bf X})\), a half-space model. Whatever function is used, it is called the activation function.1
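Concretely, a single neuron is just "apply the activation function to \( {\bf W}\cdot{\bf X}\)". Here is a minimal sketch in Python of that idea (the function names and the particular weights are mine, purely for illustration, and I use the standard logistic sigmoid as a stand-in for \( \sigma\)):

```python
import numpy as np

def sigmoid(z):
    # the logistic sigmoid; outputs are strictly between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

def neuron(W, X, activation=sigmoid):
    # a single neuron: apply the activation function to the dot product W . X
    return activation(np.dot(W, X))

# an input (x1, x2) with the trailing 1 for the shift coordinate,
# and a made-up weight vector W (one weight per input coordinate)
X = np.array([0.5, 1.0, 1.0])
W = np.array([2.0, -1.0, 0.5])

print(neuron(W, X, sigmoid))   # logistic regression model
print(neuron(W, X, np.sign))   # half-space model (np.sign sends 0 to 0)
```

Swapping the activation argument is all it takes to go from a logistic regression model to a half-space model.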

A neural network where inputs are points in the plane and there is one hidden layer which has 3 vertices.

A Neural Network is a function obtained by composing vector-valued functions whose component functions are neurons (and, in current practice, there are typically many such compositions). Like the individual neuron, it has an input layer \( V_0\) and an output layer \( V_L\), but it also has some vertices arranged in hidden layers \( V_1, \ldots, V_{L-1}\).2  Each vertex/neuron in a hidden layer computes the output of a logistic regression model, and that output is fed, as a component of a vector, into the neurons of the next layer to the right. Also, unless the neuron is in \( V_1\), its input is not a data point \( {\bf X}_i = (x_1, x_2,\ldots,x_d, 1)\), but rather the vector of outputs from the previous layer to the left.

An example with one hidden layer that has three vertices is shown. As an example of how the network works, suppose the input is \( (x_1, x_2) = (0.5, 1)\). If the weights (not displayed in the figure) are such that the top neuron in \( V_1\) has output \( 0.9\) on this input, the middle neuron has value \( 0.05\), and the third neuron has value \( 0.6,\) then the vector \( (0.9, 0.05, 0.6)\) will be fed into the next layer.3  Suppose the weights on the edges from \( V_1\) to \( V_2\) are \( -1, 0, 2\) (from top to bottom). Then the total output of the network is \(\sigma((-1, 0, 2)\cdot(0.9, 0.05, 0.6)) = \sigma(0.3)\).
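Just to check the arithmetic, here is that last step in Python (a quick sketch; the hidden-layer outputs 0.9, 0.05, 0.6 are simply taken as given, since the weights from \( V_0\) to \( V_1\) are not displayed):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# outputs of the three hidden neurons on the input (0.5, 1), as given in the text
hidden = np.array([0.9, 0.05, 0.6])

# weights on the edges from V_1 to V_2, listed from top to bottom
W1 = np.array([-1.0, 0.0, 2.0])

z = np.dot(W1, hidden)    # (-1)(0.9) + (0)(0.05) + (2)(0.6) = 0.3
print(z, sigmoid(z))      # 0.3, and sigma(0.3) is roughly 0.574
```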

I said above that there is a vector of weights for each building block. Think of each weight as sitting on an edge that feeds into a vertex in the layer to its right. Consider the weights on edges between layers \( V_{t-1}\) and \( V_t\), for some \(t\), and write the vertices of \(V_t\) as \( v^{t}_1, v^{t}_2, \ldots, v^{t}_{k_{t}}\) (and similarly, replacing \(t\) with \(t-1\), for the vertices of layer \( V_{t-1}\)).4  If a weight is on the edge from \( v^{t-1}_j\) to \( v^t_i\), then write \( w^{t-1}_{i,j}\) for that weight. Then \( {\bf W}^{t-1}_i = (w^{t-1}_{i,1}, \ldots, w^{t-1}_{i, k_{t-1}})\) is the vector of weights for the neuron at the \( i^{th}\) vertex \(v^t_i\) of layer \( V_t\). Write \( {\bf W}^{t-1}\) for the matrix that has \( {\bf W}^{t-1}_i\) as its \( i^{th}\) row; this weight matrix between \(V_{t-1}\) and \(V_t\) has \(k_t\) rows and \(k_{t-1}\) columns. Use \( {\bf W}\) for the list of weight matrices \( {\bf W}^0, \ldots, {\bf W}^{L-1}\).
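With the weight matrices in hand, evaluating the network is just alternating "multiply by \( {\bf W}^t\)" and "apply the activation to each coordinate". Here is a small sketch of that forward pass in Python (the function name is mine, and the first-layer weights below are made up for illustration; the second-layer weights are the \( -1, 0, 2\) from the example above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(weight_matrices, X, activation=sigmoid):
    # weight_matrices is the list W = (W^0, ..., W^{L-1}); the matrix W^t has
    # k_{t+1} rows and k_t columns, so W^t @ values is the vector of dot
    # products feeding the neurons of the next layer V_{t+1}.
    values = np.asarray(X, dtype=float)
    for W_t in weight_matrices:
        values = activation(W_t @ values)
    return values

# made-up weights from V_0 to V_1 (3 hidden neurons, inputs (x1, x2, 1)),
# and the weights from V_1 to V_2 used in the example above
W0 = np.array([[ 0.2, -1.0,  0.1],
               [ 1.5,  0.3, -0.2],
               [-0.7,  0.8,  0.0]])
W1 = np.array([[-1.0,  0.0,  2.0]])

print(forward([W0, W1], [0.5, 1.0, 1.0]))
```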


Example (Implementing the AND logical function). Using the sign function in place of \( \sigma,\) the AND function can be made with a single neuron. By the AND function, I mean the function which takes only four ordered pairs as inputs: \( (x_1, x_2) = (1, 1), (-1, 1), (1, -1),\) or \( (-1, -1)\). The output is 1 if both entries of the pair are 1, and is -1 otherwise. That is,

AND\( (1, 1) = 1\); AND\( (-1, 1) = -1\); AND\( (1, -1) = -1\); AND\( (-1, -1) = -1\).

A neuron which implements the AND function.

Let the single weight vector be \( {\bf W}^0 = (1, 1, -1)\). Then the neuron on the right (using a sign activation function) agrees exactly with the AND function. For example, if \( (x_1, x_2) = (1, -1)\) then the output is \( \text{sign}(1 - 1 - 1) = \text{sign}(-1) = -1\). It is a good exercise to try to implement some of the other logical functions with a neuron.
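For a quick sanity check, here is the computation on all four inputs in Python (a small sketch; I use np.sign, which sends 0 to 0, though that case never arises here):

```python
import numpy as np

W0 = np.array([1.0, 1.0, -1.0])      # the weight vector from the example

for x1, x2 in [(1, 1), (-1, 1), (1, -1), (-1, -1)]:
    X = np.array([x1, x2, 1.0])      # append the shift coordinate
    print((x1, x2), int(np.sign(np.dot(W0, X))))   # agrees with AND on all four inputs
```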

Exercise. Find a single neuron that agrees with the OR function on inputs \( x_1, x_2, \ldots, x_d\) (where each \(x_i = \pm1\)). This function outputs 1 in every case except when all of the \( x_i\) equal -1. Think about making the last weight, the one on the shift coordinate, equal to \( d-1\).

Exercise (More involved). Find a neural network, with sign activation functions, that agrees with the XOR function (exclusive OR, which outputs 1 when exactly one of the inputs is 1 and all others are -1). You might try it for just two inputs \( x_1, x_2\). \(\lbrack\)Hint5\(\rbrack\).


Which functions can be expressed with a neural network? All of them! 6  Let \( f:\{\pm 1\}^n \to \{\pm 1\}\) be a function, which has the set of \( n\)-component binary vectors (with \( \pm 1\) as coordinates) for input, and outputs \( \pm 1\) for each of them. I’ll explain how to construct a neural network, with one (HUGE) hidden layer, that is the exact same function as \( f\).7  This means that, if I write \( NN({\bf x})\) for the output of the neural network, then it will be true that \( NN({\bf x}) = f({\bf x})\) for every binary vector in \( \{\pm1\}^n\).

The input layer \( V_0\) will have \( n+1\) vertices, the one hidden layer \( V_1\) will have \( 2^n+1\) vertices, and the output layer \( V_2\) will have a single vertex. The activation function will be the \( \text{sign}\) function.

A neural network that will have output equal to that of a given binary function.

The neural network will look something like the one depicted to the right. In that picture, the blue edges have weight 1 and the red edge has weight \( k-1\), where \( k\) is a number determined by the function \( f\) (it is defined just below; in the picture, \( k\) is 2). For the edges drawn in black, I'll describe the weights below. Finally, an edge that is not drawn has weight 0. So there is only one edge with non-zero weight from the input layer to the last vertex, \(v^1_{2^n+1}\), of the hidden layer.
To define the weights of the black edges, let \( {\bf u}_1, \ldots, {\bf u}_k\) be all the vectors in \( \{\pm1\}^n\) that the function \( f\) sends to 1 (\( k\) is the number of them). Since there are \( 2^n\) vectors in \( \{\pm 1\}^n\), it must be that \( k \le 2^n\). Augment these vectors by \( {\bf U}_i = ({\bf u}_i, -n+1)\) (i.e. place an additional entry of \( -n+1\) at the end and use a capital letter for the result). For \( 1\le i\le k,\) set the vector of weights \( {\bf W}^0_{i} = {\bf U}_i\). For example, the weights on the edges that feed into the very top vertex in the hidden layer are the coordinates of \( {\bf U}_1\).
The rest of the weights are as depicted. To be explicit: for \( k< i \le 2^n,\) set \( {\bf W}^0_i\) equal to the zero vector; set \( {\bf W}^0_{2^n+1}\) (the weights that feed to the last vertex) to be 0 for the first \( n\) entries and 1 for its last entry. The weights between \( V_1\) and \( V_2\) are 1 (there are \( k\) of these), 0 (\( 2^n-k\) of these), and \( k-1\) (just one — the red edge).

Write \( g_i\) for the \( i^{th}\) neuron in the hidden layer. Because of the weights chosen, and writing \( {\bf X}\) for \( ({\bf x}, 1)\), you get that if \( i\le k\) then

\( g_i({\bf x}) = \text{sign}({\bf U}_i\cdot{\bf X}) = \text{sign}({\bf u}_i\cdot{\bf x} - n+1).\)

For \( {\bf x} \in \{\pm1\}^n\), since the product of a coordinate of \( {\bf x}\) and a coordinate of \( {\bf u}_i\) can only be 1 or -1, the largest that \( {\bf u}_i\cdot{\bf x}\) can be is \( n\), which happens exactly when \( {\bf x} = {\bf u}_i.\) In every other instance, \( {\bf x}\) and \( {\bf u}_i\) disagree in at least one coordinate, and each disagreement replaces a 1 with a -1 in the sum that is the dot product, lowering it by 2; so \( {\bf x}\cdot{\bf u}_i \le n-2\). That implies that \( g_i({\bf x}) = 1\) if and only if \( {\bf x} = {\bf u}_i\).

As a consequence, when the input \( {\bf x}\) is not one of the \( {\bf u}_i\), the output of the hidden layer will be a list of \( k\) entries equal to -1, followed by \( 2^n - k\) zeroes, and then a 1 in coordinate \( 2^n+1\). If \( {\bf x}\) is one of the \( {\bf u}_i\), then the \( i^{th}\) entry of the hidden layer output will be 1 instead of -1, and everything else will be the same.

To finish understanding the network, note that the weights between \( V_1\) and \( V_2\) are such that the output neuron acts as an OR function on the first \( k\) entries of the hidden layer output. This is because, if at least one of the top \( k\) vertices is \( 1\) instead of \( -1\), then the dot product going into the last activation \( \text{sign}\) function is at least \( 1 + (k-1)(-1) + k-1 = 1\), and possibly larger. If all top \( k\) neurons have value \( -1\), then the dot product is \( k(-1) + k-1 = -1\).

And so, with these weights the neural network outputs 1 when \( {\bf x} = {\bf u}_i\) and -1 otherwise. The vectors \( {\bf u}_i\) were chosen in such a way that this means that the neural network has the same values as \( f\).
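Since the construction is completely explicit, it can be written out in a few lines of code. Below is a sketch in Python that builds the weight matrices \( {\bf W}^0, {\bf W}^1\) for a given \( f\) and checks, for a small \( n\), that the network agrees with \( f\) on every input (the function names are mine, and the particular \( f\) used in the check is just an example):

```python
import itertools
import numpy as np

def build_network(f, n):
    # Build the weight matrices W^0, W^1 of the construction for f : {+-1}^n -> {+-1}.
    inputs = [np.array(x, dtype=float) for x in itertools.product([1, -1], repeat=n)]
    U = [np.append(x, -n + 1) for x in inputs if f(x) == 1]   # the augmented vectors U_i
    k = len(U)

    W0 = np.zeros((2**n + 1, n + 1))
    for i, Ui in enumerate(U):
        W0[i] = Ui                    # rows 1..k: the vectors U_i
    W0[2**n, n] = 1.0                 # last hidden vertex just reads the shift coordinate

    W1 = np.zeros((1, 2**n + 1))
    W1[0, :k] = 1.0                   # weight 1 on the top k hidden vertices
    W1[0, 2**n] = k - 1               # the red edge

    return W0, W1

def NN(W0, W1, x):
    X = np.append(x, 1.0)             # x with the shift coordinate appended
    hidden = np.sign(W0 @ X)          # sign activation; np.sign(0) = 0
    return int(np.sign(W1 @ hidden)[0])

# example check: f is 1 exactly when one coordinate is 1 (just a sample choice of f)
n = 3
f = lambda x: 1 if np.sum(x == 1) == 1 else -1
W0, W1 = build_network(f, n)
assert all(NN(W0, W1, np.array(x)) == f(np.array(x))
           for x in itertools.product([1, -1], repeat=n))
```

Of course, this only runs for tiny \( n\); the point of the next few paragraphs is precisely that the construction cannot scale.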


While I have shown that the neural network described here is built to be the same function as \(f\), I wish to stress that this is a purely theoretical result. Most problematically, the size of the hidden layer is exponential in the size of the input: it has \(2^n+1\) vertices, where \(n\) is the number of bits in the input. Your computer uses either 32 or 64 bits to represent a single number, so if you have a function that needs three numbers as input (a rather reasonable request), the size of the input is at least 96 bits. The neural network described here would then need \(2^{96}+1\) vertices in its hidden layer (not to mention storing all the needed weights).

To put into perspective how much information this is, 1 TB (a terabyte, which is 1000 gigabytes) is about \(2^{43}\) bits. So, even if you could use just one bit per vertex, you would need more than \(2^{53}\) terabytes. This is more than the estimated total storage capacity of all computers on earth (for this one neural network, which computes a single function of three variables).8  So yes, the construction described here is theoretical. For certain functions, you could come up with more efficient network structures (“architectures”) that give you the function. More importantly, usual practice is to pick the network structure and then train the weights based on a set of data. Hence, depending on the choice of structure, your hypothesis class of functions \(\mathcal H\) is likely very rich (it does much more than linear models, for example), but it is a restrictive class (it does not contain all possible functions). This is good; you have a chance, at least, of avoiding overfitting.

  1. Some more common choices include \(\tanh\), the hyperbolic tangent function, and a function called the \(\text{rectifier}\) function.
  2. Sometimes I consider the neurons to be the things in the layer, rather than just the labeled vertex.
  3. When the activation function is a sigmoid function \(\sigma\), the output value at each vertex is between 0 and 1.
  4. I label the vertices in order from top to bottom of the diagram. So, in a figure, the top-most vertex in \(V_t\) will be \(v^t_1\).
  5. It cannot be done with a single neuron; in fact, this follows from the fact that the four points \((1,1), (-1,1), (-1,-1),\) and \((1,-1)\), with labels given by multiplying the two coordinates, are not linearly separable.  If you change the sign on all weights in the AND neuron above, you get NAND (not AND), and it’s pretty easy to find weights on a neuron so it has output \( x_i\) for one of the \( i\). Using this neuron and the NAND neuron as basic pieces, you can get XOR with a neural network that has two hidden layers.
  6. The statement is almost trivially true if you make no restriction on activation functions. I will show, using a single hidden layer and \(\text{sign}\) activation functions, that there is a neural network equal to any function on binary vectors that has binary output. For functions with infinitely many possible inputs, like a function from a compact domain in \(\mathbb R^n\) to \(\mathbb R\), one can approximate any such function with a neural network that uses sigmoid activation functions (and a single hidden layer).
  7. You might feel that this is a strange kind of function to think about. However, everything that is done on a computer is a combination of such functions, or a list of functions of this type, since computers can only work with (and use as output) some number of bits.
  8. If you wanted a function that took larger input, like an image, say, then the construction above would easily require more bits than the estimated number of particles in the observable universe (which is less than \(2^{270}\)).

Hello world 🤦

Let’s begin this blog the correct way, with a confession. As an undergrad math major, I did not take any statistics. There was a required course, called something like Mathematical Statistics. But my senior year arrived and I had not taken it. Instead of biting the bullet, I spoke with my faculty advisor and he arranged with the department to waive the requirement. It is one of my regrets.

Despite my confession I have picked up basic ideas about statistics over the years. In graduate school/post-doc years I had enough conversations (and saw enough talks) that I can tell you about i.i.d. random variables, covariance, cdf’s and pdf’s. But statistical ideas and considerations never became as natural to me as algebraic and geometric/topological ones did.

Well, the day of reckoning has come. I want to barge my way into the world of mathematical applications — machine learning, particularly. And statistics will be front and center. So please, do bear with me if you see my statistical gaps. I am learning too.


In the blog, I will be drawing on a few sources. Mainly, this includes the following references.

  1. Understanding Machine Learning: From Theory to Algorithms by S. Shalev-Shwartz and S. Ben-David.
  2. Mathematics of Machine Learning lecture notes from the MIT OCW course by Philippe Rigollet.