Fun with PyTorch - Part 1: Variables and Gradients


PyTorch is a brand new framework for deep learning, mainly conceived by the Facebook AI Research (FAIR) group, that has gained significant popularity in the ML community thanks to its ease of use and efficiency.
This is the first of a series of tutorials devoted to this framework, going from the basic building blocks up to more advanced models and techniques for developing deep neural networks. In this first tutorial, we introduce the two main PyTorch elements: variables and gradients.

Read the entire series:

  • Fun with PyTorch - Part 1: Variables and Gradients (this one)

These tutorials are also available in Italian: Alle Prese con PyTorch.

Install

PyTorch is currently released as a beta version (latest version v0.3.1) and is officially available for Linux and OS X only. However, below we provide some guidelines to install it on Windows as well. If your machine doesn't already come with the usual Python libraries, we suggest installing a package manager such as Anaconda to ease the next steps.

Linux

Assuming both Python and Anaconda have been installed, just type:

conda install pytorch -c pytorch

This version supports GPU through CUDA.

OS X

You can rely on the same command, but there will be no GPU support. If you want to enable CUDA support, you need to build PyTorch from source.

Windows

  • Option 1: an unofficial GitHub repo provides a Windows version of PyTorch. In case of any setup issue, there's a dedicated thread to follow.
  • Option 2: use the Google Colaboratory cloud service to create a Python 3 notebook and install PyTorch by typing (bear in mind that this command must be submitted every time you open the notebook, since Google Colaboratory relies on temporary virtual machines):

    !pip3 install http://download.pytorch.org/whl/cu80/torch-0.3.1-cp36-cp36m-linux_x86_64.whl
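Whichever platform and installation route is used, a quick sanity check can confirm that the library imports correctly and whether a GPU is visible. This is only a minimal sketch: the variable names are illustrative, and the CUDA flag will simply be False on a CPU-only setup.

```python
import torch

# Report the installed version and whether a CUDA-capable GPU is usable.
version = torch.__version__
cuda_ok = torch.cuda.is_available()

print("PyTorch version:", version)
print("CUDA available:", cuda_ok)
```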

Tensors: PyTorch vs NumPy

A tensor, that is, a multi-dimensional numeric array, is the main PyTorch element, just as the array is in NumPy and, more generally, in almost every Python-based scientific framework. Since PyTorch's method signatures are very close to NumPy's, let's start by comparing the two libraries (and how they interact), beginning with the definition of a new tensor:

# NumPy
import numpy as np
x = np.zeros((2, 3))

# PyTorch
import torch
y = torch.zeros(2, 3)

The syntax is more or less the same. With NumPy, the tensor's size is expressed as a single tuple, while in PyTorch every dimension is passed as a separate argument.
NumPy and PyTorch tensors can even be combined with an automatic cast:

z = x + y

However, automatic casting always hides pitfalls, as shown below:

print(type(y)) # <class 'torch.FloatTensor'>
print(type(z)) # <class 'torch.DoubleTensor'>

As you can see, the destination tensor z changed its data type with respect to the source y! NumPy tensors are initialized by default as np.float64, while PyTorch adopts a 32-bit torch.FloatTensor type, to be GPU-compliant.
When combining the two tensors, there's an automatic upcast to a 64-bit type, which can in turn lead to a variety of runtime errors when z is later mixed with 32-bit tensors. There are two possible workarounds to avoid this: (1) downcasting the destination tensor with z.float() or (2) upcasting the source tensor with y.double().
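The two workarounds can be sketched as follows. As a variation on the text, this example makes the NumPy-to-PyTorch conversion explicit with torch.from_numpy instead of relying on the automatic cast, and the names x_t, z32 and z64 are illustrative only. The type() method is used to inspect the resulting tensor types.

```python
import numpy as np
import torch

x = np.zeros((2, 3))        # float64 by default
y = torch.zeros(2, 3)       # 32-bit float by default

x_t = torch.from_numpy(x)   # explicit conversion, keeps the 64-bit type

z32 = x_t.float() + y       # (1) downcast the 64-bit side to 32 bits
z64 = x_t + y.double()      # (2) upcast the 32-bit side to 64 bits

print(x_t.type())
print(z32.type())
print(z64.type())
```

Keeping everything in 32 bits, as in the first option, is usually the more convenient choice when the tensors are later combined with other PyTorch tensors created with the default type.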

We can convert a PyTorch tensor to its corresponding NumPy version by using z.numpy(), or build a PyTorch tensor from a NumPy one through torch.from_numpy(x). Note that in both cases the result shares the allocated memory with the original: any in-place operation applied to one tensor will alter the other one as well, as illustrated below:

xx = z.numpy()
xx += 1.0
print(z)
# 1  1  1
# 1  1  1
# [torch.DoubleTensor of size 2x3]
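The same sharing applies in the other direction, with torch.from_numpy. The sketch below also shows one way to get an independent copy, by cloning the tensor before converting it; the names a, t and independent are illustrative.

```python
import numpy as np
import torch

a = np.zeros((2, 3))
t = torch.from_numpy(a)           # t and a share the same memory
t += 1.0                          # in-place update through the tensor
print(a)                          # the NumPy array now holds ones as well

independent = t.clone().numpy()   # clone() copies the data first
independent += 1.0                # does not touch t or a
print(t)
```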

Besides these remarks, almost every method signature is equivalent in the two libraries. In particular, we can index elements within a tensor through square brackets, as well as implicitly combine matrices of different sizes through broadcasting (as reported in this picture):

Broadcasting in PyTorch

torch.Tensor([3, 2]) * torch.Tensor([[0, 1], [4, 2]])
# 0  2
# 12  4
# [torch.FloatTensor of size 2x2]
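To see what broadcasting does here, the following sketch repeats the same product in NumPy, once implicitly and once with the row vector explicitly tiled to match the matrix. The variable names are illustrative.

```python
import numpy as np
import torch

row = [3.0, 2.0]
matrix = [[0.0, 1.0], [4.0, 2.0]]

# Implicit broadcasting: the row is virtually repeated for every matrix row.
np_result = np.array(row) * np.array(matrix)

# The same computation with the repetition spelled out by hand.
np_explicit = np.tile(np.array(row), (2, 1)) * np.array(matrix)

torch_result = torch.Tensor(row) * torch.Tensor(matrix)

print(np_result)
print(torch_result)
```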

Retrieving a tensor's size is almost the same, apart from the return type:

print(x.shape)   # (2, 3)
print(y.size())  # torch.Size([2, 3])

Last, but not least, a minor difference is how we address a specific dimension of a tensor:

x.mean(axis=0)
y.mean(dim=0)
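A small side-by-side comparison on the same data makes the axis/dim correspondence concrete. This is a sketch with illustrative names; the tensor is built from a copy of the array so the two do not share memory.

```python
import numpy as np
import torch

x = np.arange(6, dtype=np.float32).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]
y = torch.from_numpy(x.copy())                     # same values, separate memory

col_means_np = x.mean(axis=0)   # reduce over the rows: one mean per column
col_means_t = y.mean(dim=0)

row_means_np = x.mean(axis=1)   # reduce over the columns: one mean per row
row_means_t = y.mean(dim=1)

print(x.shape, tuple(y.size()))
print(col_means_np, row_means_np)
```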

Variables, gradients and functions

Let's jump into fancy stuff: how to automatically compute tensors' gradients (aka derivatives), given a set of functions.

We will leverage autograd, a core PyTorch package for automatic differentiation, which combines a tape-based system with a PyTorch element we are introducing here: variables.

Variables in PyTorch

A variable is a thin wrapper around a tensor, consisting of three major elements:

  • v.data references the raw tensor;
  • v.grad accumulates the gradient with respect to this variable, computed on demand through the backward pass;
  • v.grad_fn references the operation that created the variable, and is used by PyTorch to link it to the computational graph of the applied operations.
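The three elements can be inspected directly, as in the sketch below. It relies on the Variable constructor and on the backward() call, both of which are introduced later in this section; the names out and grad_before are illustrative.

```python
import torch
from torch.autograd import Variable

v = Variable(torch.ones(1, 2), requires_grad=True)
out = torch.sum(v * 3)

print(v.data)        # the raw tensor wrapped by v
grad_before = v.grad
print(grad_before)   # None: no backward pass has run yet
print(v.grad_fn)     # None: v was created by the user, not by an operation
print(out.grad_fn)   # the operation that produced out

out.backward()
print(v.grad)        # now holds d(out)/dv, i.e. 3 for every element
```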

Every single operation applied to the variable is tracked by PyTorch through the autograd tape within an acyclic graph:

Construction of the computational graph in PyTorch

This lets PyTorch compute gradients over tensors by automatically propagating all the required information through the acyclic graph, via the aforementioned tape.
Let's look at an example:

from torch.autograd import Variable
v = Variable(torch.ones(1, 2), requires_grad=True)

Note the requires_grad flag, which enables automatic gradient computation with respect to the variable v.
Let's play with the tensor, by performing the sum of squared elements:

v_fn = torch.sum(v ** 2)
print(v_fn.data)    # 2 [torch.FloatTensor of size 1]
print(v_fn.grad_fn) # <SumBackward0 object at 0x7fa959f21550>

Unlike other deep learning frameworks (e.g. TensorFlow without the brand new eager execution), PyTorch builds up the graph dynamically, while the operations are executed, so every result is available immediately. Furthermore, the grad_fn property contains an object reference to the operation originating the v_fn variable within the graph (in this case the sum function).
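One practical consequence of building the graph dynamically is that ordinary Python control flow decides, at run time, which operations end up in the graph. The sketch below uses a hypothetical function, branchy, that records a different operation depending on the data it receives; the exact class names printed for grad_fn depend on the PyTorch version.

```python
import torch
from torch.autograd import Variable

def branchy(v):
    # A plain Python "if" on the data selects which operation is recorded.
    if v.data.sum() > 0:
        return v ** 2
    return v * -1

pos = branchy(Variable(torch.Tensor([1, 2]), requires_grad=True))
neg = branchy(Variable(torch.Tensor([-1, -2]), requires_grad=True))

# Results are available right away, and each one remembers its own operation.
print(pos.data, type(pos.grad_fn).__name__)
print(neg.data, type(neg.grad_fn).__name__)
```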

To compute the gradient of v_fn with respect to v, we type:

torch.autograd.grad(v_fn, v) # Gradient of v_fn over v
# (Variable containing:
# 2  2 [torch.FloatTensor of size 1x2],)
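The result matches the analytic derivative: for the sum of squared elements, the gradient with respect to each element is twice that element. A quick check on different input values (chosen here only for illustration) could look like this:

```python
import torch
from torch.autograd import Variable

v = Variable(torch.Tensor([[1.0, -2.0, 3.0]]), requires_grad=True)
v_fn = torch.sum(v ** 2)

(grad,) = torch.autograd.grad(v_fn, v)   # returns a tuple, one entry per input
expected = 2 * v.data                    # analytic gradient of sum(v^2)

print(grad)
print(expected)
```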

A more interesting way to handle variables is shown in the following example with two variables:

v1 = Variable(torch.Tensor([1, 2]), requires_grad=True)
v2 = Variable(torch.Tensor([3]), requires_grad=True)
v_fn = torch.sum(v1 * v2)

Instead of explicitly invoking the gradient computation for each variable, we can compute, in a single call, the gradients with respect to all the involved variables whose requires_grad flag has been set:

v_fn.backward()
print(v1.grad) # Variable containing: 3 3 [torch.FloatTensor of size 2]
print(v2.grad) # Variable containing: 3 [torch.FloatTensor of size 1]

In this case the backward function is not returning any value, since the gradients are stored in the grad property of each variable. PyTorch also exposes more complex and advanced methods to auto-compute gradients, which are beyond the scope of this tutorial.
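Since the grad property accumulates, calling backward() a second time adds to the stored gradients instead of replacing them. The following sketch illustrates this on the same two variables, rebuilding the function before each pass, and then resets the gradients by hand; the names first and second are illustrative.

```python
import torch
from torch.autograd import Variable

v1 = Variable(torch.Tensor([1, 2]), requires_grad=True)
v2 = Variable(torch.Tensor([3]), requires_grad=True)

torch.sum(v1 * v2).backward()
first = v1.grad.data.clone()     # gradient after one backward pass

torch.sum(v1 * v2).backward()    # a second pass adds to the stored gradient
second = v1.grad.data.clone()

print(first, second)

v1.grad.data.zero_()             # explicit reset, in place
v2.grad.data.zero_()
print(v1.grad)
```

This is why training loops need to clear the gradients between iterations.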

Example: Linear Regression

Let's apply everything described so far to a linear regression example.

It is worth underlining that this example is only meant to re-apply the techniques introduced above. PyTorch offers much more advanced tools to accomplish the same task, which are introduced in the following tutorials.

In this example we will consider a simple one-dimensional synthetic problem (with some added noise):

X = np.random.rand(30, 1)*2.0
w = np.random.rand(2, 1)
y = X*w[0] + w[1] + np.random.randn(30, 1) * 0.05

Dataset for linear regression

To estimate the line's coefficients, we define a linear model:

W = Variable(torch.rand(1, 1), requires_grad=True)
b = Variable(torch.rand(1), requires_grad=True)

def linear(x):
  return torch.matmul(x, W) + b

Using torch.matmul is redundant in this case, but we want the function to be as general as possible to be re-used for more complex models.
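To illustrate why torch.matmul keeps the function general, here is a sketch of the same model for inputs with several features. The make_linear helper and the names linear3, W3, b3 and batch are hypothetical, introduced only for this example.

```python
import torch
from torch.autograd import Variable

def make_linear(n_features):
    """Build a linear model for inputs with n_features columns (hypothetical helper)."""
    W = Variable(torch.rand(n_features, 1), requires_grad=True)
    b = Variable(torch.rand(1), requires_grad=True)

    def linear(x):
        # (n, n_features) x (n_features, 1) -> (n, 1), then b is broadcast
        return torch.matmul(x, W) + b

    return linear, W, b

linear3, W3, b3 = make_linear(3)
batch = Variable(torch.rand(5, 3))   # 5 samples with 3 features each
out = linear3(batch)
print(out.size())
```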

Below we report how to train the model:

Xt = Variable(torch.from_numpy(X)).float()
yt = Variable(torch.from_numpy(y)).float()

for epoch in range(2500):

  # Compute predictions
  y_pred = linear(Xt)

  # Compute cost function
  loss = torch.mean((y_pred - yt) ** 2)

  # Run back-propagation
  loss.backward()

  # Update variables
  W.data = W.data - 0.005*W.grad.data
  b.data = b.data - 0.005*b.grad.data

  # Reset gradients
  W.grad.data.zero_()
  b.grad.data.zero_()

A few remarks about the code:

  1. We need to cast the dataset from NumPy (64-bit) to PyTorch (32-bit) using float().
  2. The cost function is the mean squared error.
  3. After the back-propagation step we update the weights with a gradient descent step, and we explicitly use W.data instead of W, so that the original variables are not overwritten with new ones.
  4. At the end of every iteration, gradients are reset, since the backward pass accumulates them in the grad property.
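Remarks 3 and 4 can be folded into a small reusable function. The sketch below is one possible arrangement, not the only one: sgd_step is a hypothetical helper, the random seeds are fixed only to make the run repeatable, and the loss is recorded at every epoch so that convergence can be checked.

```python
import numpy as np
import torch
from torch.autograd import Variable

def sgd_step(params, lr):
    """One gradient descent update followed by a gradient reset (hypothetical helper)."""
    for p in params:
        p.data -= lr * p.grad.data   # update the raw tensor, outside the graph
        p.grad.data.zero_()          # clear the accumulated gradient

np.random.seed(0)
torch.manual_seed(0)

# Same synthetic problem as above.
X = np.random.rand(30, 1) * 2.0
w = np.random.rand(2, 1)
y = X * w[0] + w[1] + np.random.randn(30, 1) * 0.05

Xt = Variable(torch.from_numpy(X).float())
yt = Variable(torch.from_numpy(y).float())
W = Variable(torch.rand(1, 1), requires_grad=True)
b = Variable(torch.rand(1), requires_grad=True)

losses = []
for epoch in range(2500):
    loss = torch.mean((torch.matmul(Xt, W) + b - yt) ** 2)
    loss.backward()
    sgd_step([W, b], lr=0.005)
    losses.append(float(loss.data))

print("first loss:", losses[0], "last loss:", losses[-1])
print("learned slope:", float(W.data[0][0]), "true slope:", w[0, 0])
```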

Finally, to validate our regression problem, we plot the final model:

Linear regression
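Besides the plot, the fit can be validated numerically against the closed-form least squares solution. The sketch below uses only NumPy on a freshly generated dataset of the same kind; the fixed seed and the names rng, A, slope and intercept are assumptions made for this example.

```python
import numpy as np

rng = np.random.RandomState(0)

# Same kind of synthetic dataset as in the tutorial.
X = rng.rand(30, 1) * 2.0
w = rng.rand(2, 1)
y = X * w[0] + w[1] + rng.randn(30, 1) * 0.05

# Append a column of ones so the intercept is estimated together with the slope.
A = np.hstack([X, np.ones_like(X)])
coef, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
slope, intercept = coef[0, 0], coef[1, 0]

print("estimated:", slope, intercept)
print("true:     ", w[0, 0], w[1, 0])
```

The coefficients learned by gradient descent on the same data should end up close to these values.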

In the next tutorial we will deal with more advanced models and their optimization through native PyTorch functions and classes.


If you liked this post and you would like to keep in touch with our activities, you can become a member of the Italian Association for Machine Learning, or follow us on Facebook or LinkedIn.
