Linear Algebra
By now, we can load datasets into tensors and manipulate these tensors with basic mathematical operations. To start building sophisticated models, we will also need a few tools from linear algebra. This section offers a gentle introduction to the most essential concepts, starting from scalar arithmetic and ramping up to matrix multiplication.
Scalars
Most everyday mathematics consists of manipulating numbers one at a time.
Formally, we call these values scalars. For example, the temperature in Orlando, Florida on a balmy day is just a single number, measured in degrees Fahrenheit.
We denote scalars by ordinary lowercase letters (e.g., $x$, $y$, and $z$).
Scalars are implemented as tensors that contain only one element. Below, we assign two scalars and perform the familiar addition, multiplication, division, and exponentiation operations.
pdl> $x = ones(1) * 3.0
pdl> $y = ones(1) * 2.0
pdl> print $x, $y
[3] [2]
pdl> print $x + $y
[5]
pdl> print $x + $y, $x * $y, $x / $y, $x ** $y
[5] [6] [1.5] [9]
Vectors
For current purposes, you can think of a vector as a fixed-length array of
scalars. As with their code counterparts, we call these scalars the elements
of the vector (synonyms include entries and components). When vectors
represent examples from real-world datasets, their values hold some real-world
significance. For example, if we were training a model to predict the risk of a
loan defaulting, we might associate each applicant with a vector whose
components correspond to quantities like their income, length of employment, or
number of previous defaults. If we were studying the risk of heart attack, each
vector might represent a patient and its components might correspond to their
most recent vital signs, cholesterol levels, minutes of exercise per day, etc.
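As a toy illustration, such a feature vector is just a handful of numbers stored together. The values below are made up purely for the example (annual income in dollars, years employed, and number of previous defaults):
pdl> $applicant = pdl(52000, 4, 0)
pdl> print $applicant
[52000 4 0]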
We denote vectors by bold lowercase letters (e.g., $\mathbf{x}$, $\mathbf{y}$, and $\mathbf{z}$).
Vectors are implemented as one-dimensional PDL objects. In PDL, as in most reasonable programming languages, vector indices start at 0, also known as zero-based indexing.
pdl> $x = sequence(3)
pdl> print $x
[0 1 2]
We can refer to an element of a vector by using a subscript. For example, $x_2$ denotes the element of $\mathbf{x}$ at index 2. Since $x_2$ is a scalar, we do not bold it. In code, we access it as follows:
pdl> print $x(2)
[2]
To indicate that a vector contains $n$ elements, we write $\mathbf{x} \in \mathbb{R}^n$. Formally, we call $n$ the dimensionality of the vector. In code, this corresponds to the tensor's length, accessible via the length function in PDL. The output of length is identical to dims, which is the generic dimensionality retrieval function. The return value of dims is a tuple that indicates a tensor's length along each dimension. Tensors with just one dimension have shapes with just one element. We can also use the shape function to return a PDL object with the dimensions.
pdl> print $x->length
3
pdl> print $x->dims
3
pdl> print $x->shape
[3]
Oftentimes, the word “dimension” gets overloaded to mean both the number of axes and the length along a particular dimension. To avoid this confusion, we use order to refer to the number of axes and dimensionality exclusively to refer to the number of components.
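To see the distinction in code (ndims is PDL's method for retrieving the number of dimensions, i.e., the order), a length-3 vector has order 1 and dimensionality 3, whereas a matrix built from six elements has order 2:
pdl> print sequence(3)->ndims
1
pdl> print sequence(3)->length
3
pdl> print sequence(6)->reshape(3,2)->ndims
2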
Matrices
Just as scalars are zeroth-order tensors and vectors are first-order tensors, matrices are second-order tensors. We denote matrices by bold capital letters (e.g., $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$). The expression $\mathbf{A} \in \mathbb{R}^{m \times n}$ indicates that a matrix $\mathbf{A}$ contains $m \times n$ real-valued scalars, arranged as $m$ rows and $n$ columns.
In code, we build a matrix by passing the desired shape to reshape. Recall that PDL requires the order of the reshape arguments to be swapped (columns first, then rows) since it uses column-major form:
pdl> $A = sequence(6)->reshape(2,3)
pdl> print $A
[
[0 1]
[2 3]
[4 5]
]
Sometimes we want to flip the axes. When we exchange a matrix's rows and columns, the result is called its transpose. Formally, we signify a matrix $\mathbf{A}$'s transpose by $\mathbf{A}^\top$: if $\mathbf{B} = \mathbf{A}^\top$, then $b_{ij} = a_{ji}$ for all $i$ and $j$. In code, we can access any matrix's transpose using the transpose function, as shown below:
pdl> print $A->transpose
[
[0 2 4]
[1 3 5]
]
Symmetric matrices are the subset of square matrices that are equal to their own
transposes:
pdl> $A = pdl([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
pdl> print $A == $A->transpose
[
[1 1 1]
[1 1 1]
[1 1 1]
]
Matrices are useful for representing datasets. Typically, rows correspond to individual records and columns correspond to distinct attributes.
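As a small made-up example, the matrix below stores three records with two attributes each; the slice method (with the column index listed first, per PDL's dimension ordering) extracts a single record or a single attribute:
pdl> $D = pdl([[25, 0], [38, 1], [61, 0]])
## the second record (row index 1)
pdl> print $D->slice(':,(1)')
[38 1]
## the first attribute (column index 0) across all records
pdl> print $D->slice('(0),:')
[25 38 61]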
Tensors
While you can go far in your machine learning journey with only scalars, vectors, and matrices, eventually you may need to work with higher-order tensors. Tensors give us a generic way of describing extensions to $n$th-order arrays. Tensors will become more important when we start working with images. Each image arrives as a third-order tensor with dimensions corresponding to the height, width, and channel, where the intensities of each color channel (red, green, and blue) are stacked along the channel dimension.
NOTE: In PDL, the order of the parameters to reshape is the reverse of that in the Python equivalent.
pdl> print sequence(24)->reshape(4,3,2)
[
[
[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
]
[
[12 13 14 15]
[16 17 18 19]
[20 21 22 23]
]
]
Basic Properties of Tensor Arithmetic
Scalars, vectors, matrices, and higher-order tensors all have some handy properties. For example, elementwise operations produce outputs that have the same shape as their operands.
pdl> $A = sequence(6)->reshape(3,2)
### Assign a copy of $A to $B by allocating new memory
pdl> $B = $A->copy
pdl> print $A, $A+$B
[
[0 1 2]
[3 4 5]
]
[
[ 0 2 4]
[ 6 8 10]
]
The elementwise product of two matrices is called their Hadamard product (denoted $\odot$):
pdl> print $A * $B
[
[ 0 1 4]
[ 9 16 25]
]
Adding or multiplying a scalar and a tensor produces a result with the same shape as the original tensor. Here, each element of the tensor is added to (or multiplied by) the scalar.
pdl> $a = 2
pdl> $X = sequence(24)->reshape(4,3,2)
pdl> print $a + $X, ($a * $X)->shape
[
[
[ 2 3 4 5]
[ 6 7 8 9]
[10 11 12 13]
]
[
[14 15 16 17]
[18 19 20 21]
[22 23 24 25]
]
]
[4 3 2]
Reduction
Often, we wish to calculate the sum of a tensor's elements. To express the sum of the elements in a vector $\mathbf{x}$ of length $n$, we write $\sum_{i=1}^{n} x_i$. There is a simple function to calculate it:
pdl> $x = sequence(3)
pdl> print $x, $x->sum
[0 1 2] 3
To express sums over the elements of tensors of arbitrary shape, we simply sum over all of their axes. For example, the sum of the elements of an $m \times n$ matrix $\mathbf{A}$ could be written $\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}$.
pdl> print $A
[
[0 1 2]
[3 4 5]
]
pdl> print $A->shape, $A->sum
[3 2] 15
By default, invoking the sum function reduces a tensor along all of its axes, eventually producing a scalar. The sumover function instead reduces along a single axis: the first one (dimension 0). To sum over all elements along the rows (dimension 0), we call sumover with no arguments. Since the input matrix reduces along dimension 0 to generate the output vector, this dimension is missing from the shape of the output.
pdl> print $A->sumover, $A->sumover->shape
[3 12] [2]
If we want to reduce along the columns instead, we first rearrange the dimensions with the mv function so that the dimension we want to reduce sits in position 0, and then call sumover. Here we move dimension 0 to position 1 (swapping the two dimensions of the matrix) and invoke sumover on the result:
pdl> print $A->mv(0,1)->sumover
[3 5 7]
## what the move looks like
pdl> print $A->mv(0,1)
[
[0 3]
[1 4]
[2 5]
]
Another way is to use transpose together with sumover:
pdl> print $A->transpose->sumover
[3 5 7]
Reducing a matrix along both rows and columns via summation is equivalent to summing up all the elements of the matrix.
pdl> print $A->sumover
[3 12]
pdl> print $A->sumover->sum
15
A related quantity is the mean, also called the average. We calculate the mean by dividing the sum by the total number of elements. Because computing the mean is so common, it gets a dedicated library function that works analogously to sum. In PDL this function is avg. Likewise, the function average calculates the mean along the first dimension, analogously to sumover.
pdl> print $A->avg
2.5
pdl> print $A->average
[1 4]
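As a quick sanity check (nelem is PDL's method for counting a tensor's elements), the mean is just the sum divided by the total number of elements:
pdl> print $A->sum / $A->nelem
2.5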
Non-Reduction Sum
Sometimes it can be useful to keep the number of dimensions unchanged when invoking the function for calculating the sum or mean. This matters when we want to use the broadcast mechanism.
pdl> $A = pdl([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
pdl> print $A
[
[1 2 3]
[2 0 4]
[3 4 5]
]
pdl> print $A->sumover
[6 6 12]
pdl> $sumA = $A->sumover->transpose
pdl> print $sumA
[
[ 6]
[ 6]
[12]
]
For instance, since $sumA keeps its two axes after summing each row, we can divide $A by $sumA with broadcasting to create a matrix in which each row sums up to 1:
pdl> print $A/$sumA
[
[ 0.16666667 0.33333333 0.5]
[ 0.33333333 0 0.66666667]
[ 0.25 0.33333333 0.41666667]
]
If we want to calculate the cumulative sum of the elements of $A along some dimension, say dimension 0 (running across each row), we can call the cumusumover function. By design, this function does not reduce the input tensor along any dimension.
pdl> print $A->cumusumover
[
[ 1 3 6]
[ 2 2 6]
[ 3 7 12]
]
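To accumulate down each column instead (dimension 1), we can combine the same trick used for sumover with cumusumover: rearrange the dimensions, accumulate, and rearrange back. A minimal sketch:
pdl> print $A->transpose->cumusumover->transpose
[
[ 1 2 3]
[ 3 2 7]
[ 6 6 12]
]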
Dot Products
So far, we have only performed elementwise operations, sums, and averages. And
if this was all we could do, linear algebra would not deserve its own section.
Fortunately, this is where things get more interesting. One of the most
fundamental operations is the dot product. Given two vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$, their dot product $\mathbf{x}^\top \mathbf{y}$ (also known as the inner product $\langle \mathbf{x}, \mathbf{y} \rangle$) is a sum over the products of the elements at the same position: $\mathbf{x}^\top \mathbf{y} = \sum_{i=1}^{d} x_i y_i$.
In PDL, the dot product of two vectors can be obtained with the inner function, so named because the dot product is also known as the inner product:
pdl> $x = sequence(3)
pdl> $y = ones(3)
pdl> print inner($x,$y)
3
Equivalently, we can calculate the dot product of two vectors by performing an elementwise multiplication followed by a sum:
pdl> print sum($x * $y)
3
Dot products are useful in a wide range of contexts. For example, given some set of values, denoted by a vector $\mathbf{x} \in \mathbb{R}^n$, and a set of weights, denoted by $\mathbf{w} \in \mathbb{R}^n$, the weighted sum of the values in $\mathbf{x}$ according to the weights $\mathbf{w}$ can be expressed as the dot product $\mathbf{x}^\top \mathbf{w}$. When the weights are nonnegative and sum to one, the dot product expresses a weighted average. After normalizing two vectors to have unit length, the dot product gives the cosine of the angle between them.
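For instance, with made-up weights that are nonnegative and sum to one, the weighted average of the values in $x (defined above as [0 1 2]) is just a dot product with the weights:
pdl> $w = pdl(0.5, 0.25, 0.25)
pdl> print inner($x, $w)
0.75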
Matrix–Vector Products
Now that we know how to calculate dot products, we can begin to understand the product between an $m \times n$ matrix $\mathbf{A}$ and an $n$-dimensional vector $\mathbf{x}$. To start, we visualize the matrix in terms of its row vectors,

$$\mathbf{A} = \begin{bmatrix} \mathbf{a}^\top_1 \\ \mathbf{a}^\top_2 \\ \vdots \\ \mathbf{a}^\top_m \end{bmatrix},$$

where each $\mathbf{a}^\top_i \in \mathbb{R}^n$ is a row vector representing the $i$th row of the matrix $\mathbf{A}$. The matrix–vector product $\mathbf{A}\mathbf{x}$ is simply a column vector of length $m$ whose $i$th element is the dot product $\mathbf{a}^\top_i \mathbf{x}$. We can think of multiplication with a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ as a transformation that projects vectors from $\mathbb{R}^n$ to $\mathbb{R}^m$.
To express a matrix–vector product in code, we use the same inner function. The operation is inferred based on the type of the arguments. Note that the column dimension of $A (its length along dimension 0) must be the same as the dimension of $x (its length).
pdl> print $x
[0 1 2]
pdl> print $A
[
[1 2 3]
[2 0 4]
[3 4 5]
]
pdl> print $A->shape, $x->shape
[3 3] [3]
pdl> print inner($A, $x)
[8 8 14]
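Equivalently, since each entry of the result is the dot product of one row of $A with $x, we can reproduce the matrix–vector product by broadcasting an elementwise multiplication over the rows and then reducing each row with sumover:
pdl> print sumover($A * $x)
[8 8 14]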
Matrix–Matrix Multiplication
Once you have gotten the hang of dot products and matrix–vector products, then matrix–matrix multiplication should be straightforward.
Say that we have two matrices $\mathbf{A} \in \mathbb{R}^{n \times k}$ and $\mathbf{B} \in \mathbb{R}^{k \times m}$. Let $\mathbf{a}^\top_i \in \mathbb{R}^k$ denote the row vector representing the $i$th row of $\mathbf{A}$, and let $\mathbf{b}_j \in \mathbb{R}^k$ denote the column vector representing the $j$th column of $\mathbf{B}$. To form the matrix product $\mathbf{C} = \mathbf{A}\mathbf{B} \in \mathbb{R}^{n \times m}$, we simply compute each element $c_{ij}$ as the dot product between the $i$th row of $\mathbf{A}$ and the $j$th column of $\mathbf{B}$: $c_{ij} = \mathbf{a}^\top_i \mathbf{b}_j$.
We can think of the matrix–matrix multiplication $\mathbf{A}\mathbf{B}$ as performing $m$ matrix–vector products, or $n \times m$ dot products, and stitching the results together. In the following snippet, we perform matrix multiplication on $A and $B. Here, $A is a matrix with two rows and three columns, and $B is a matrix with three rows and four columns. After multiplication, we obtain a matrix with two rows and four columns. In PDL, this can be accomplished by the operator x or the function matmult.
pdl> $A = sequence(6)->reshape(3,2)
pdl> print $A
[
[0 1 2]
[3 4 5]
]
pdl> $B = ones(4,3)
pdl> print $B
[
[1 1 1 1]
[1 1 1 1]
[1 1 1 1]
]
pdl> print matmult($A,$B)
[
[ 3 3 3 3]
[12 12 12 12]
]
pdl> print $A x $B
[
[ 3 3 3 3]
[12 12 12 12]
]
The term matrix–matrix multiplication is often simplified to matrix multiplication, and should not be confused with the Hadamard product.
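To make the distinction concrete, here is a small square matrix (chosen arbitrarily so that both products are defined) multiplied by itself both ways:
pdl> $C = pdl([[1, 2], [3, 4]])
## Hadamard (elementwise) product
pdl> print $C * $C
[
[ 1 4]
[ 9 16]
]
## matrix product
pdl> print $C x $C
[
[ 7 10]
[15 22]
]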
Norms
Some of the most useful operators in linear algebra are norms. Informally, the norm of a vector tells us how big it is. For instance, the $\ell_2$ norm measures the (Euclidean) length of a vector. Here, we are employing a notion of size that concerns the magnitude of a vector's components, not its dimensionality.

A norm is a function $\| \cdot \|$ that maps a vector to a scalar and satisfies the following three properties (we spot-check them numerically a little further below):

- Given any vector $\mathbf{x}$, if we scale (all elements of) the vector by a scalar $\alpha \in \mathbb{R}$, its norm scales accordingly: $\|\alpha \mathbf{x}\| = |\alpha| \|\mathbf{x}\|$.
- For any vectors $\mathbf{x}$ and $\mathbf{y}$, norms satisfy the triangle inequality: $\|\mathbf{x} + \mathbf{y}\| \le \|\mathbf{x}\| + \|\mathbf{y}\|$.
- The norm of a vector is nonnegative and it only vanishes if the vector is zero: $\|\mathbf{x}\| > 0$ for all $\mathbf{x} \ne 0$.
Many functions are valid norms and different norms encode different notions of size. The Euclidean norm that we all learned in elementary school geometry when calculating the hypotenuse of a right triangle is the square root of the sum of squares of a vector's elements. Formally, this is called the $\ell_2$ norm and is expressed as $\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$.
The method magnover calculates the $\ell_2$ norm of a vector in PDL. This is different from the norm function in PDL, which normalizes a vector (rescales it to unit length).
For matrices, in PDL you have to use the mnorm function available in the PDL::LinearAlgebra module. So it may make sense to use mnorm for vectors too.
pdl> print pdl([3, -4])->magnover
5
pdl> use PDL::LinearAlgebra
pdl> print pdl([3, -4])->mnorm
5
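Using magnover, we can also spot-check the three defining properties listed above on a couple of arbitrarily chosen vectors (an illustration, not a proof):
pdl> $u = pdl(3, -4)
pdl> $v = pdl(0, 8)
## scaling $u by -2 scales its norm by |-2| = 2
pdl> $scaled = -2 * $u
pdl> print $scaled->magnover
10
## triangle inequality: the norm of the sum is at most the sum of the norms
pdl> $uplusv = $u + $v
pdl> print $uplusv->magnover
5
pdl> print $u->magnover + $v->magnover
13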
The $\ell_1$ norm is also common and the associated measure is called the Manhattan distance. By definition, the $\ell_1$ norm sums the absolute values of a vector's elements: $\|\mathbf{x}\|_1 = \sum_{i=1}^{n} |x_i|$. Compared to the $\ell_2$ norm, it is less sensitive to outliers. To compute the $\ell_1$ norm, we compose the absolute value with the sum operation:
pdl> print pdl([[3, -4]])->abs->sumover
[7]
Both the $\ell_2$ and $\ell_1$ norms are special cases of the more general $\ell_p$ norm: $\|\mathbf{x}\|_p = \left(\sum_{i=1}^{n} |x_i|^p\right)^{1/p}$.
In the case of matrices, matters are more complicated. After all, matrices can
be viewed both as collections of individual entries and as objects that
operate on vectors and transform them into other vectors. For instance, we can
ask by how much longer the matrix–vector product $\mathbf{X}\mathbf{v}$ could be relative to $\mathbf{v}$. This line of thinking leads to what is called the spectral norm. For now, we introduce the Frobenius norm, which is much easier to compute: it is defined as the square root of the sum of the squares of a matrix's elements, $\|\mathbf{X}\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} x_{ij}^2}$, and it behaves as if it were an $\ell_2$ norm of a matrix-shaped vector. Below we compute it for a matrix of ones:
pdl> use PDL::LinearAlgebra
pdl> print ones(9,4)->mnorm
6
While we do not want to get too far ahead of ourselves, we already can plant some intuition about why these concepts are useful. In deep learning, we are often trying to solve optimization problems: maximize the probability assigned to observed data; maximize the revenue associated with a recommender model; minimize the distance between predictions and the ground truth observations; minimize the distance between representations of photos of the same person while maximizing the distance between representations of photos of different people. These distances, which constitute the objectives of deep learning algorithms, are often expressed as norms.
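For example, with made-up prediction and target vectors, the distance between predictions and targets can be written as the $\ell_2$ norm of their difference:
pdl> $pred = pdl(1, 2, 3)
pdl> $target = pdl(4, 6, 3)
pdl> $err = $pred - $target
pdl> print $err->magnover
5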
Discussion
In this section, we have reviewed all the linear algebra that you will need to understand a significant chunk of modern deep learning. There is a lot more to linear algebra, though, and much of it is useful for machine learning. For example, matrices can be decomposed into factors, and these decompositions can reveal low-dimensional structure in real-world datasets. There are entire subfields of machine learning that focus on using matrix decompositions and their generalizations to high-order tensors to discover structure in datasets and solve prediction problems. But this book focuses on deep learning. And we believe you will be more inclined to learn more mathematics once you have gotten your hands dirty applying machine learning to real datasets. So while we reserve the right to introduce more mathematics later on, we wrap up this section here.
If you are eager to learn more linear algebra, there are many excellent books and online resources. For a more advanced crash course, consider checking out Introduction to Linear Algebra by Strang (1993), Linear Algebra Review and Reference by Kolter (2008), and The Matrix Cookbook by Petersen and Pedersen (2008) (available as a PDF on archive.org).
To recap:
- Scalars, vectors, matrices, and tensors are the basic mathematical objects used in linear algebra and have zero, one, two, and an arbitrary number of axes, respectively.
- Tensors can be sliced or reduced along specified axes via indexing, or operations such as sum and mean, respectively.
- Elementwise products are called Hadamard products. By contrast, dot products, matrix–vector products, and matrix–matrix products are not elementwise operations and in general return objects having shapes that are different from those of the operands.
- Compared to Hadamard products, matrix–matrix products take considerably longer to compute (cubic rather than quadratic time).
- Norms capture various notions of the magnitude of a vector (or matrix), and are commonly applied to the difference of two vectors to measure their distance apart.
- Common vector norms include the $\ell_1$ and $\ell_2$ norms, and common matrix norms include the spectral and Frobenius norms.
Exercises
- Prove that the transpose of the transpose of a matrix is the matrix itself: $(\mathbf{A}^\top)^\top = \mathbf{A}$.
- Given two matrices $\mathbf{A}$ and $\mathbf{B}$, show that sum and transposition commute: $\mathbf{A}^\top + \mathbf{B}^\top = (\mathbf{A} + \mathbf{B})^\top$.
- Given any square matrix $\mathbf{A}$, is $\mathbf{A} + \mathbf{A}^\top$ always symmetric? Can you prove the result by using only the results of the previous two exercises?
- We defined the tensor X of shape (4, 3, 2) in this section. What is the output of length(X)? Write your answer without implementing any code, then check your answer using code.
- For a tensor X of arbitrary shape, does length(X) always correspond to the length of a certain dimension of X? What is that dimension?
- Run $A / $A->sum() and see what happens. Can you analyze the results?
- When traveling between two points in downtown Manhattan, what is the distance that you need to cover in terms of the coordinates, i.e., in terms of avenues and streets? Can you travel diagonally?
- Consider a tensor of shape (4, 3, 2). What are the shapes of the summation outputs along dimensions 0, 1, and 2?
- Feed a tensor with three or more dimensions to the mnorm function and observe its output. What does this function compute for tensors of arbitrary shape?
- Consider three large matrices, say $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$, initialized with Gaussian random variables. You want to compute the product $\mathbf{A}\mathbf{B}\mathbf{C}$. Is there any difference in memory footprint and speed, depending on whether you compute $(\mathbf{A}\mathbf{B})\mathbf{C}$ or $\mathbf{A}(\mathbf{B}\mathbf{C})$? Why?
- Consider three large matrices, say $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$. Is there any difference in speed depending on whether you compute $\mathbf{A}\mathbf{B}$ or $\mathbf{A}\mathbf{C}^\top$? Why? What changes if you initialize $\mathbf{C} = \mathbf{B}^\top$ without cloning memory? Why?
- Consider three matrices, say $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$. Construct a tensor with three axes by stacking $[\mathbf{A}, \mathbf{B}, \mathbf{C}]$. What is the dimensionality? Slice out the second coordinate of the third dimension to recover $\mathbf{B}$. Check that your answer is correct.