Linear Algebraic Approaches to Image Classification

Aim

To train a minimal convolutional neural network on MNIST and then fully deconstruct its learned weight matrices using Singular Value Decomposition, demonstrating that every convolutional layer is equivalently an SVD operation — and to make each component of that equivalence numerically verifiable and visually interpretable.

Introduction

Convolutional Neural Networks have achieved state-of-the-art performance on image classification tasks, yet the internal mathematical structure of a trained network remains opaque in most treatments. A conv layer is commonly described as a learnable bank of spatial filters — but what does gradient descent actually discover when it optimises those filters?

The central insight of this project is that the answer lies in linear algebra. When a conv layer's weight tensor of shape (out_ch, in_ch, kH, kW) is reshaped into a 2-D matrix W of shape (out_ch, in_ch·kH·kW), its full Singular Value Decomposition gives:

W  =  U · diag(σ) · Vᵀ

This is an identity — not an approximation. The forward pass through the layer for any input patch p can therefore be written in three explicit steps:

z  =  Vᵀ p          → project patch onto input eigenfilter basis
s  =  diag(σ) z     → scale each coordinate by its singular value
y  =  U s  =  W p   → rotate into output feature space
h  =  ReLU(y)       → carve piecewise-linear subspace

When gradient descent optimises W, it is implicitly discovering {U, σ, V} — the singular vectors and values. The singular values σᵢ encode how much discriminative energy the network places in each eigenfilter direction vᵢ, and their decay rate reveals the effective dimensionality of each layer. This project implements, verifies, and visualises every claim above across an 8-phase analysis pipeline.

Literature Survey and Technologies Used

Theoretical Background

Singular Value Decomposition (Golub & Van Loan, Matrix Computations) — foundational matrix factorisation underlying the entire analysis
im2col patch extraction — the standard implementation trick used in cuDNN and PyTorch that transforms convolution into matrix multiplication
Low-rank neural network compression (Denton et al., NIPS 2014; Jaderberg et al., ICLR 2014) — empirical observation that conv layer weight matrices are approximately low-rank, motivating the spectral energy analysis
Piecewise-linear network analysis (Montufar et al., NIPS 2014) — theoretical treatment of how ReLU networks partition input space into linear regions

Dataset

MNIST (LeCun et al., 1998) — 60,000 training and 10,000 test 28×28 greyscale handwritten digit images, 10 classes. Automatically downloaded via torchvision.datasets.MNIST on first run.

Technologies

Python 3.x — core language
PyTorch / torchvision — model definition, training, MNIST loading
NumPy + SciPy — SVD computation (numpy.linalg.svd), numerical verification
Matplotlib — all diagnostic visualisations

Methodology

The Network: TinyConvNet

To keep the analysis fully tractable, a deliberately small CNN is used. Bias terms are disabled in both conv layers so that the weight matrix alone captures the entire linear transformation.

Layer	Operation	Output Shape	Weight Shape	SVD Shapes
conv1	`Conv2d(1, 8, 3×3, bias=False)`	`(B, 8, 26, 26)`	`(8, 9)`	`U:(8,8), σ:(8,), Vᵀ:(8,9)`
ReLU + MaxPool(2)		`(B, 8, 13, 13)`	—	—
conv2	`Conv2d(8, 16, 3×3, bias=False)`	`(B, 16, 11, 11)`	`(16, 72)`	`U:(16,16), σ:(16,), Vᵀ:(16,72)`
ReLU + MaxPool(2)		`(B, 16, 5, 5)`	—	—
fc	`Linear(400, 10)`	`(B, 10)`	`(10, 400)`	—

SVD Decomposition of Convolutional Layers

Each conv layer weight tensor is reshaped into a 2-D matrix W and the thin SVD is computed:

U, σ, Vᵀ = numpy.linalg.svd(W, full_matrices=False)

Component	Shape	Meaning
V (right singular vectors)	`(r, in_ch·kH·kW)`	Eigenfilters — orthogonal spatial patterns the layer is sensitive to, ranked by importance
σ (singular values)	`(r,)`	Importance weights — σᵢ measures how much energy the network places in eigenfilter direction vᵢ
U (left singular vectors)	`(out_ch, r)`	Feature directions — orthogonal basis for the output feature space

Low-Rank Approximation

A rank-k reconstruction is obtained by retaining only the top-k singular triplets:

W_k  =  (U[:, :k] * σ[:k])  @  Vᵀ[:k, :]

Sweeping k from 1 to r allows plotting reconstruction error and cumulative spectral energy vs rank, quantifying each layer's effective dimensionality.

Analytical Forward Pass

AnalyticalNet mirrors TinyConvNet but performs every computation in explicit SVD coordinates:

Layer 1:
  z1 = Vᵀ1 @ p        # project patch: (k,)
  s1 = σ1 * z1         # scale:         (k,)
  a1 = U1 @ s1 = W1@p  # lift:          (8,)
  h1 = ReLU(a1)

Layer 2 (on pooled patches f):
  z2 = Vᵀ2 @ f
  s2 = σ2 * z2
  a2 = U2 @ s2 = W2@f
  h2 = ReLU(a2)

output = W_fc @ flatten(MaxPool(h2)) + b_fc

ReLU Subspace Analysis

For each patch p, let M(p) = diag(1[Wp > 0]) be the binary activation mask. Then:

h(p)  =  ReLU(W p)  =  M(p) · W · p

Each unique mask defines a different effective linear map. Layer 1 has 8 output channels giving a theoretical maximum of 2⁸ = 256 distinct linear regions. The analysis counts unique masks across 500 MNIST patches and projects activations onto the U-basis to visualise the space contraction induced by ReLU.

The 8-Phase Pipeline

Phase	What it does
1 — Train	Train TinyConvNet on MNIST for N epochs; per-epoch loss and test accuracy logging
2 — SVD	Decompose both conv layers; compute singular values, cumulative energy, and effective rank
3 — Analytical	Build AnalyticalNet; verify W = U·Σ·Vᵀ holds to machine precision; print two-layer SVD composition
4 — Verify	Confirm analytical im2col → W·p output matches PyTorch conv output numerically (max diff < 1e-4)
5 — Low-rank	Sweep rank k = 1…r; compute relative reconstruction error and cumulative energy captured
6 — ReLU	Analyse how ReLU partitions pre-activation space into piecewise-linear regions; count unique activation masks
7 — Composition	Verify Layer 2 analytically; verify Parseval identity ‖W‖²_F = Σ σᵢ² for both layers
8 — Plots	Generate all diagnostic figures to outputs/

Results

Metric	Value	Interpretation
W = U·Σ·Vᵀ identity (Layer 1)	Max error ~8.9 × 10⁻⁸	Exact decomposition confirmed to machine epsilon
W = U·Σ·Vᵀ identity (Layer 2)	Max error ~1.8 × 10⁻⁷	Exact decomposition confirmed to machine epsilon
Test accuracy (20 epochs)	98.71%	Competitive with standard CNNs
Layer 1 singular values	[2.170, 1.753, 1.496, 1.041, 0.787, 0.623, 0.534, 0.424]	6 of 8 components capture 95% energy
Layer 2 singular values	[2.895, 2.480, 2.349, 2.070, …] (16 values)	13/16 for 95% energy; 15/16 for 99%
ReLU unique activation masks	40 out of theoretical max 256 (2⁸)	Low effective dimensionality in practice
Frobenius identity (Layer 1)	‖W‖²_F = Σ σᵢ² = 12.5733	Parseval identity verified
Frobenius identity (Layer 2)	‖W‖²_F = Σ σᵢ² = 43.7439	Parseval identity verified
Analytical vs CNN forward pass	Max diff < 1 × 10⁻⁶ (both layers)	Forward pass equivalence confirmed

Conclusions / Future Scope

Conclusions

This project establishes, numerically and visually, that every convolutional layer is exactly an SVD. Training TinyConvNet on MNIST achieves 98.71% test accuracy, and post-training analysis confirms all theoretical claims to machine epsilon:

W = U·Σ·Vᵀ holds with max error below 10⁻⁷ for both layers
AnalyticalNet forward pass agrees with PyTorch conv to within 10⁻⁶
Parseval/Frobenius identity ‖W‖²_F = Σ σᵢ² verified for both layers
Gradient descent discovers eigenfilters with interpretable spatial structure (edges, blobs, diagonals)
ReLU activates only ~40 effective linear regions out of 256 theoretical maximum

These results provide a principled explanation for why low-rank compression of conv layers works: the bottom singular components carry negligible energy and removing them barely changes network output.

Future Scope

Deeper networks — extending SVD analysis to VGG or ResNet-style architectures with batch normalisation and residual connections
SVD-parameterised training — directly optimising {U, σ, V} as separate parameter groups to enforce orthogonality constraints during training
Structured pruning — using singular value spectra to guide layer-wise rank reduction with provable output bounds
Dataset generalisation — applying the pipeline to CIFAR-10 or ImageNet crops to study how spectrum complexity scales with input difficulty
Attention mechanisms — investigating analogous SVD decompositions in transformer attention weight matrices

Virtual Expo 2026

Abstract