
Linear Algebra 101 for AI/ML – Part 1



You don't need to be an expert in linear algebra to get started in AI, but you do need to know the basics. This is part 1 of my Linear Algebra 101 for AI/ML series, which is my attempt to compress the 6+ months I spent learning linear algebra before I started my career in AI. With the benefit of hindsight, I know now that you don't need to spend 6+ months or even 6 weeks brushing up on linear algebra to dive into AI. Instead, you can quickly ramp up on the basics and get started coding in AI much faster. As you make progress in AI/ML, you can continue your math studies.

In this article, you will learn:

  • 🔢 the basics of vector and matrix math
  • 🧮 vector and matrix operations
  • 💻 the basics of PyTorch, an open source ML framework

In Part 2 of the series, we will build on this foundational knowledge by covering the dot product, the embedding, and its application to similarity search. You'll also see two interactive playgrounds (the Interactive Dot Product Playground and the Interactive Embedding Explorer), which are best viewed on laptop/desktop and were designed to help you understand the concepts.

In Part 3 of the series, we will use all the fundamental linear algebra we will have learned in Part 1 and Part 2 to build an image search engine.

As you read this guide, keep an eye out for interactive question and quiz modules to check your understanding of the material!

Without further ado, here are the topics of the article:

Basic Definitions

Scalar – A scalar is a single numerical value that represents a magnitude without direction. In programming terms, you can think of scalars as simple variables holding a single number, like an integer or float. Examples of scalars include temperature, age, and weight.

Vector – A vector is an ordered list of scalars. Why do we say it's ordered? Because the position of the scalar in the vector matters. Below is an example of a vector. Pretend $\color{cyan}{\vec{y}}$ is a vector representing the movie "Avengers: Endgame". The vector contains five numbers stacked on top of one another in a single column, each of which describes a specific attribute of the movie.

$$
{\color{cyan}{\vec{y}}} \quad = \left. \begin{bmatrix} 0.99 \\ 0.52 \\ 0.45 \\ 0.10 \\ 0.26 \\ \end{bmatrix} \quad \begin{array}{l} \text{action} \\ \text{comedy} \\ \text{drama} \\ \text{horror} \\ \text{romance} \end{array} \right\}\text{5 rows}
$$

We see that the movie has a value of 0.99 for action and 0.10 for horror. This suggests the movie is more of an action movie than a horror movie. If we were to swap the value for action with the value for horror, the vector would no longer accurately represent "Avengers: Endgame", which is not a horror movie. This is why order matters.

$$
\begin{bmatrix} {\color{cyan}{0.99}} \\ 0.52 \\ 0.45 \\ {\color{orange}{0.10}} \\ 0.26 \\ \end{bmatrix} \neq \begin{bmatrix} {\color{cyan}{0.10}} \\ 0.52 \\ 0.45 \\ {\color{orange}{0.99}} \\ 0.26 \\ \end{bmatrix} \quad \begin{array}{l} {\color{cyan}{\text{action}}} \\ \text{comedy} \\ \text{drama} \\ {\color{orange}{\text{horror}}} \\ \text{romance} \end{array}
$$
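To make the point about order concrete, here is a small sketch in plain Python. It assumes we store the vector as an ordinary list, and the names `ATTRIBUTES`, `endgame`, `swapped`, and `top_attribute` are made up for this illustration.

```python
# Hypothetical attribute order for this sketch; each position has a fixed meaning.
ATTRIBUTES = ["action", "comedy", "drama", "horror", "romance"]

endgame = [0.99, 0.52, 0.45, 0.10, 0.26]
swapped = [0.10, 0.52, 0.45, 0.99, 0.26]  # action and horror scores exchanged

# Both lists contain the same numbers, but they are not the same vector.
same_values = sorted(endgame) == sorted(swapped)
same_vector = endgame == swapped


def top_attribute(vector):
    """Return the attribute name whose position holds the largest score."""
    best_index = max(range(len(vector)), key=lambda i: vector[i])
    return ATTRIBUTES[best_index]


print(same_values, same_vector)                        # True False
print(top_attribute(endgame), top_attribute(swapped))  # action horror
```

The swapped list uses exactly the same five numbers, yet it now describes a horror movie, because each position carries its own meaning.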

Are vectors always arranged in column form? No, not necessarily. Below are vectors in either row or column form of different lengths.

$$
\color{orange}{ \overbrace{ \begin{bmatrix} 18 & 21 & 24 & 27 \end{bmatrix} }^{\text{4 columns}} } \qquad \color{cyan}{ \overbrace{ \begin{bmatrix} 18 & 21 \end{bmatrix} }^{\text{2 columns}} } \qquad \color{magenta}{ \left. \begin{bmatrix} -1.5 \\ 0.89 \\ 0.41 \\ \end{bmatrix} \right\}\text{3 rows} }
$$

Notice a vector either has one row or one column. What if you want a mathematical object that has multiple rows and multiple columns? That's where a matrix comes into play.

Matrix – If a scalar is a single number, and a vector is a one-dimensional ordered list of scalars, then a matrix is a two-dimensional array of scalars. Below, $\color{cyan}{X}$ is an example matrix. You can see it has four rows and two columns.

$$
{\color{cyan}{X}} \quad = \begin{bmatrix} {\color{magenta}{3}} & {\color{orange}{3}} \\ {\color{magenta}{4}} & {\color{orange}{3}} \\ {\color{magenta}{5}} & {\color{orange}{3}} \\ {\color{magenta}{5}} & {\color{orange}{4}} \\ \end{bmatrix} \quad \begin{array}{l} \text{123 Maple Grove Lane} \\ \text{888 Ocean View Terrace} \\ \text{100 Birch Street} \\ \text{987 Sunflower Court} \\ \end{array}
$$

Each row corresponds to the address of a single home. The first column represents the number of bedrooms in the home, and the second column represents the number of bathrooms.
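If it helps to see the rows and columns as data, here is a rough sketch that stores the same matrix as a nested Python list. This is just one possible representation, and the variable names are chosen for this example.

```python
# Rows are homes; columns are [bedrooms, bathrooms], as in the matrix X.
X = [
    [3, 3],  # 123 Maple Grove Lane
    [4, 3],  # 888 Ocean View Terrace
    [5, 3],  # 100 Birch Street
    [5, 4],  # 987 Sunflower Court
]

num_rows = len(X)      # number of homes
num_cols = len(X[0])   # number of attributes per home

birch_street = X[2]                 # one row: a single home
bedrooms = [row[0] for row in X]    # first column
bathrooms = [row[1] for row in X]   # second column

print(num_rows, num_cols)   # 4 2
print(birch_street)         # [5, 3]
print(bedrooms, bathrooms)  # [3, 4, 5, 5] [3, 3, 3, 4]
```

Reading across a row tells you everything about one home, while reading down a column tells you one attribute across all homes.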

Any mathematician might find these definitions too simplistic and overly reductionist, but they are good enough to get us started. We'll see later how vectors and matrices can hold data to be processed by machine learning models.

Element-wise Operations with PyTorch

Code Environment Setup

Now that we've established the definitions of vectors and matrices and their mathematical notation, let's play around with them in code to gain some intuition and familiarity. To do this, we're going to use an open-source machine learning framework called PyTorch. PyTorch is widely used throughout academia and industry for cutting-edge AI research and production-grade software, at institutions and companies such as OpenAI, Amazon, Meta, Salesforce, Stanford University, and thousands of startups, so building up experience with the framework is practical. Visit the official PyTorch installation instructions page to get started.

After you install PyTorch, open up your Python REPL and enter the code below:

$$
a = \begin{bmatrix} 3 \\ 4 \\ 5 \\ 5 \\ \end{bmatrix} \in \mathbb{R}^{4 \times 1}
$$
import torch

a = torch.tensor([[3], [4], [5], [5]])

Above, we see a vector with four elements in mathematical notation, followed by its equivalent in code.
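One way to check that the code really matches the math is to look at the tensor's shape. The sketch below compares a column vector, a row vector, and a flat one-dimensional tensor; the names `column`, `row`, and `flat` are just for illustration.

```python
import torch

column = torch.tensor([[3], [4], [5], [5]])  # 4 rows, 1 column
row = torch.tensor([[3, 4, 5, 5]])           # 1 row, 4 columns
flat = torch.tensor([3, 4, 5, 5])            # one-dimensional, 4 elements

print(column.shape)  # torch.Size([4, 1])
print(row.shape)     # torch.Size([1, 4])
print(flat.shape)    # torch.Size([4])

# Transposing the column vector gives the row vector.
print(torch.equal(column.T, row))  # True
```

The nested brackets are what make the first tensor a 4 × 1 column. A flat list of numbers gives a one-dimensional tensor instead.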

Set up your REPL with the following before continuing.

>>> import torch
>>> a = torch.tensor([1.0, 2.0, 4.0, 8.0])
>>> b = torch.tensor([1.0, 0.5, 0.25, 0.125])

We're going to look at a class of operations performed on vectors and matrices called element-wise operations. Element-wise operations are operations that are applied independently to each element of a vector or matrix, resulting in a new vector or matrix of the same shape. These operations include addition, subtraction, multiplication, division, and many more.

Element-wise addition

$$
\begin{bmatrix} {\color{cyan}{1}} \\ {\color{orange}{2}} \\ {\color{yellow}{4}} \\ {\color{magenta}{8}} \\ \end{bmatrix} + \begin{bmatrix} {\color{cyan}{1}} \\ {\color{orange}{0.5}} \\ {\color{yellow}{0.25}} \\ {\color{magenta}{0.125}} \\ \end{bmatrix} = \begin{bmatrix} {\color{cyan}{1}} + {\color{cyan}{1}} \\ {\color{orange}{2}} + {\color{orange}{0.5}} \\ {\color{yellow}{4}} + {\color{yellow}{0.25}} \\ {\color{magenta}{8}} + {\color{magenta}{0.125}} \\ \end{bmatrix}
$$
>>> a + b # element-wise addition
tensor([2.0000, 2.5000, 4.2500, 8.1250])

Element-wise subtraction

$$
\begin{bmatrix} {\color{cyan}{1}} \\ {\color{orange}{2}} \\ {\color{yellow}{4}} \\ {\color{magenta}{8}} \\ \end{bmatrix} - \begin{bmatrix} {\color{cyan}{1}} \\ {\color{orange}{0.5}} \\ {\color{yellow}{0.25}} \\ {\color{magenta}{0.125}} \\ \end{bmatrix} = \begin{bmatrix} {\color{cyan}{1}} - {\color{cyan}{1}} \\ {\color{orange}{2}} - {\color{orange}{0.5}} \\ {\color{yellow}{4}} - {\color{yellow}{0.25}} \\ {\color{magenta}{8}} - {\color{magenta}{0.125}} \\ \end{bmatrix}
$$
>>> a - b # element-wise subtraction
tensor([0.0000, 1.5000, 3.7500, 7.8750])

Element-wise multiplication

$$
\begin{bmatrix} {\color{cyan}{1}} \\ {\color{orange}{2}} \\ {\color{yellow}{4}} \\ {\color{magenta}{8}} \\ \end{bmatrix} \odot \begin{bmatrix} {\color{cyan}{1}} \\ {\color{orange}{0.5}} \\ {\color{yellow}{0.25}} \\ {\color{magenta}{0.125}} \\ \end{bmatrix} = \begin{bmatrix} {\color{cyan}{1}} \cdot {\color{cyan}{1}} \\ {\color{orange}{2}} \cdot {\color{orange}{0.5}} \\ {\color{yellow}{4}} \cdot {\color{yellow}{0.25}} \\ {\color{magenta}{8}} \cdot {\color{magenta}{0.125}} \\ \end{bmatrix}
$$
>>> a * b # element-wise multiplication
tensor([1., 1., 1., 1.])

Element-wise division

$$
\begin{bmatrix} {\color{cyan}{1}} \\ {\color{orange}{2}} \\ {\color{yellow}{4}} \\ {\color{magenta}{8}} \\ \end{bmatrix} \oslash \begin{bmatrix} {\color{cyan}{1}} \\ {\color{orange}{0.5}} \\ {\color{yellow}{0.25}} \\ {\color{magenta}{0.125}} \\ \end{bmatrix} = \begin{bmatrix} {\color{cyan}{1}} / {\color{cyan}{1}} \\ {\color{orange}{2}} / {\color{orange}{0.5}} \\ {\color{yellow}{4}} / {\color{yellow}{0.25}} \\ {\color{magenta}{8}} / {\color{magenta}{0.125}} \\ \end{bmatrix}
$$
>>> a / b # element-wise division
tensor([ 1.,  4., 16., 64.])
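
The examples above all use vectors, but the same operators work on matrices too. Here is a minimal sketch using two made-up 2 × 2 matrices (`P` and `Q` are hypothetical names that reuse the numbers from `a` and `b`).

```python
import torch

# Hypothetical 2x2 matrices for this sketch, built from the same numbers as a and b.
P = torch.tensor([[1.0, 2.0], [4.0, 8.0]])
Q = torch.tensor([[1.0, 0.5], [0.25, 0.125]])

total = P + Q    # each entry of P is added to the entry of Q in the same position
product = P * Q  # each entry of P is multiplied by the entry of Q in the same position

print(total)
print(product)
print(total.shape == P.shape, product.shape == P.shape)  # True True
```

The result always has the same shape as the inputs, which is the defining property of an element-wise operation.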

There are also element-wise operations that act on a vector/matrix alone. Below are two commonly used operations in machine learning.

Sigmoid


$$
\sigma \left( \begin{bmatrix} {\color{cyan}{1}} \\ {\color{orange}{2}} \\ {\color{yellow}{4}} \\ {\color{magenta}{8}} \\ \end{bmatrix} \right) = \begin{bmatrix} \sigma({\color{cyan}{1}}) \\ \sigma({\color{orange}{2}}) \\ \sigma({\color{yellow}{4}}) \\ \sigma({\color{magenta}{8}}) \\ \end{bmatrix}
$$

where $\sigma({\color{yellow}{x}}) = \frac{1}{1+e^{-{\color{yellow}{x}}}}$
>>> torch.sigmoid(a)
tensor([0.7311, 0.8808, 0.9820, 0.9997])
>>> torch.sigmoid(torch.tensor(239))
>>> torch.sigmoid(torch.tensor(0))
>>> torch.sigmoid(torch.tensor(-0.34))

The sigmoid function takes any value of $x$ and squashes it into the range $(0, 1)$. Note that the endpoints are reached only in the limit: $\sigma(x) \to 0$ as $x \to -\infty$ and $\sigma(x) \to 1$ as $x \to +\infty$. This is useful when you have arbitrarily large values and you want to condense them into the range of values between 0 and 1. It's sometimes useful to interpret the output of sigmoid as a probability.
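To connect the formula to the code, here is a rough sketch that writes sigmoid out by hand and compares it with `torch.sigmoid`. The helper `manual_sigmoid` and the tensor `wide` are names invented for this example, not part of PyTorch.

```python
import torch


def manual_sigmoid(x):
    """Apply 1 / (1 + e^(-x)) independently to every element of x."""
    return 1 / (1 + torch.exp(-x))


a = torch.tensor([1.0, 2.0, 4.0, 8.0])
print(manual_sigmoid(a))
print(torch.allclose(manual_sigmoid(a), torch.sigmoid(a)))  # True

# Inputs spread over a wide range all land between 0 and 1.
wide = torch.tensor([-1000.0, -5.0, 0.0, 5.0, 1000.0])
squashed = torch.sigmoid(wide)
print(bool((squashed >= 0).all()), bool((squashed <= 1).all()))  # True True
```

The checks use `>=` and `<=` because, in floating-point arithmetic, very extreme inputs can round to exactly 0 or 1 even though the mathematical function never reaches them.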

ReLU (Rectified Linear Unit)

$$
\text{ReLU} \left( \begin{bmatrix} {\color{cyan}{1}} \\ {\color{orange}{2}} \\ {\color{yellow}{4}} \\ {\color{magenta}{8}} \\ \end{bmatrix} \right) = \begin{bmatrix} f({\color{cyan}{1}}) \\ f({\color{orange}{2}}) \\ f({\color{yellow}{4}}) \\ f({\color{magenta}{8}}) \\ \end{bmatrix}
$$

where $f({\color{yellow}{x}}) = \max({\color{yellow}{x}}, 0)$
>>> c = torch.tensor([4, -4, 0, 2])
>>> torch.relu(c)
tensor([4, 0, 0, 2])

The ReLU function acts as a filter. Any positive input goes through it unchanged, but any negative input becomes zero. It might seem strange that such a function exists, but this simple function helps neural networks learn to recognize objects in images, and ReLU or close variants of it appear in many modern neural networks, including the language models behind ChatGPT and other sophisticated chatbots. [1]
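As with sigmoid, you can write ReLU by hand to see that it really is just max(x, 0) applied to each element. The sketch below uses `torch.clamp` for this; `manual_relu` is a hypothetical helper name, and this is only one of several ways to express the same thing.

```python
import torch


def manual_relu(x):
    """Return max(x, 0) for every element of x by clamping values below zero."""
    return torch.clamp(x, min=0)


c = torch.tensor([4, -4, 0, 2])
print(manual_relu(c))                              # tensor([4, 0, 0, 2])
print(torch.equal(manual_relu(c), torch.relu(c)))  # True
```

Only the negative entry changes. The positive entries and the zero pass through untouched.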

In addition to element-wise operations, there are other operations that operate on the entire tensor, which we'll cover in Part 2 of the series. We will also introduce the embedding and its application to similarity search.



Head over to Part 2!


  1. ReLU was popularized in 2012 by a famous neural network called AlexNet.