Linear Algebraic Approaches to Image Classification
Abstract
Abstract
Aim
To train a minimal convolutional neural network on MNIST and then fully deconstruct its learned weight matrices using Singular Value Decomposition, demonstrating that every convolutional layer is equivalently an SVD operation — and to make each component of that equivalence numerically verifiable and visually interpretable.
Introduction
Convolutional Neural Networks have achieved state-of-the-art performance on image classification tasks, yet the internal mathematical structure of a trained network remains opaque in most treatments. A conv layer is commonly described as a learnable bank of spatial filters — but what does gradient descent actually discover when it optimises those filters?
The central insight of this project is that the answer lies in linear algebra. When a conv layer's weight tensor of shape (out_ch, in_ch, kH, kW) is reshaped into a 2-D matrix W of shape (out_ch, in_ch·kH·kW), its full Singular Value Decomposition gives:
W = U · diag(σ) · Vᵀ
This is an identity — not an approximation. The forward pass through the layer for any input patch p can therefore be written in three explicit steps:
z = Vᵀ p → project patch onto input eigenfilter basis
s = diag(σ) z → scale each coordinate by its singular value
y = U s = W p → rotate into output feature space
h = ReLU(y) → carve piecewise-linear subspace
When gradient descent optimises W, it is implicitly discovering {U, σ, V} — the singular vectors and values. The singular values σᵢ encode how much discriminative energy the network places in each eigenfilter direction vᵢ, and their decay rate reveals the effective dimensionality of each layer. This project implements, verifies, and visualises every claim above across an 8-phase analysis pipeline.
Literature Survey and Technologies Used
Theoretical Background
- Singular Value Decomposition (Golub & Van Loan, Matrix Computations) — foundational matrix factorisation underlying the entire analysis
- im2col patch extraction — the standard implementation trick used in cuDNN and PyTorch that transforms convolution into matrix multiplication
- Low-rank neural network compression (Denton et al., NIPS 2014; Jaderberg et al., ICLR 2014) — empirical observation that conv layer weight matrices are approximately low-rank, motivating the spectral energy analysis
- Piecewise-linear network analysis (Montufar et al., NIPS 2014) — theoretical treatment of how ReLU networks partition input space into linear regions
Dataset
- MNIST (LeCun et al., 1998) — 60,000 training and 10,000 test 28×28 greyscale handwritten digit images, 10 classes. Automatically downloaded via
torchvision.datasets.MNISTon first run.
Technologies
- Python 3.x — core language
- PyTorch / torchvision — model definition, training, MNIST loading
- NumPy + SciPy — SVD computation (
numpy.linalg.svd), numerical verification - Matplotlib — all diagnostic visualisations
Methodology
The Network: TinyConvNet
To keep the analysis fully tractable, a deliberately small CNN is used. Bias terms are disabled in both conv layers so that the weight matrix alone captures the entire linear transformation.
| Layer | Operation | Output Shape | Weight Shape | SVD Shapes |
|---|---|---|---|---|
| conv1 | Conv2d(1, 8, 3×3, bias=False) |
(B, 8, 26, 26) |
(8, 9) |
U:(8,8), σ:(8,), Vᵀ:(8,9) |
| ReLU + MaxPool(2) | (B, 8, 13, 13) |
— | — | |
| conv2 | Conv2d(8, 16, 3×3, bias=False) |
(B, 16, 11, 11) |
(16, 72) |
U:(16,16), σ:(16,), Vᵀ:(16,72) |
| ReLU + MaxPool(2) | (B, 16, 5, 5) |
— | — | |
| fc | Linear(400, 10) |
(B, 10) |
(10, 400) |
— |
SVD Decomposition of Convolutional Layers
Each conv layer weight tensor is reshaped into a 2-D matrix W and the thin SVD is computed:
U, σ, Vᵀ = numpy.linalg.svd(W, full_matrices=False)
| Component | Shape | Meaning |
|---|---|---|
| V (right singular vectors) | (r, in_ch·kH·kW) |
Eigenfilters — orthogonal spatial patterns the layer is sensitive to, ranked by importance |
| σ (singular values) | (r,) |
Importance weights — σᵢ measures how much energy the network places in eigenfilter direction vᵢ |
| U (left singular vectors) | (out_ch, r) |
Feature directions — orthogonal basis for the output feature space |
Low-Rank Approximation
A rank-k reconstruction is obtained by retaining only the top-k singular triplets:
W_k = (U[:, :k] * σ[:k]) @ Vᵀ[:k, :]
Sweeping k from 1 to r allows plotting reconstruction error and cumulative spectral energy vs rank, quantifying each layer's effective dimensionality.
Analytical Forward Pass
AnalyticalNet mirrors TinyConvNet but performs every computation in explicit SVD coordinates:
Layer 1:
z1 = Vᵀ1 @ p # project patch: (k,)
s1 = σ1 * z1 # scale: (k,)
a1 = U1 @ s1 = W1@p # lift: (8,)
h1 = ReLU(a1)
Layer 2 (on pooled patches f):
z2 = Vᵀ2 @ f
s2 = σ2 * z2
a2 = U2 @ s2 = W2@f
h2 = ReLU(a2)
output = W_fc @ flatten(MaxPool(h2)) + b_fc
ReLU Subspace Analysis
For each patch p, let M(p) = diag(1[Wp > 0]) be the binary activation mask. Then:
h(p) = ReLU(W p) = M(p) · W · p
Each unique mask defines a different effective linear map. Layer 1 has 8 output channels giving a theoretical maximum of 2⁸ = 256 distinct linear regions. The analysis counts unique masks across 500 MNIST patches and projects activations onto the U-basis to visualise the space contraction induced by ReLU.
The 8-Phase Pipeline
| Phase | What it does |
|---|---|
| 1 — Train | Train TinyConvNet on MNIST for N epochs; per-epoch loss and test accuracy logging |
| 2 — SVD | Decompose both conv layers; compute singular values, cumulative energy, and effective rank |
| 3 — Analytical | Build AnalyticalNet; verify W = U·Σ·Vᵀ holds to machine precision; print two-layer SVD composition |
| 4 — Verify | Confirm analytical im2col → W·p output matches PyTorch conv output numerically (max diff < 1e-4) |
| 5 — Low-rank | Sweep rank k = 1…r; compute relative reconstruction error and cumulative energy captured |
| 6 — ReLU | Analyse how ReLU partitions pre-activation space into piecewise-linear regions; count unique activation masks |
| 7 — Composition | Verify Layer 2 analytically; verify Parseval identity ‖W‖²_F = Σ σᵢ² for both layers |
| 8 — Plots | Generate all diagnostic figures to outputs/ |
Results
| Metric | Value | Interpretation |
|---|---|---|
| W = U·Σ·Vᵀ identity (Layer 1) | Max error ~8.9 × 10⁻⁸ | Exact decomposition confirmed to machine epsilon |
| W = U·Σ·Vᵀ identity (Layer 2) | Max error ~1.8 × 10⁻⁷ | Exact decomposition confirmed to machine epsilon |
| Test accuracy (20 epochs) | 98.71% | Competitive with standard CNNs |
| Layer 1 singular values | [2.170, 1.753, 1.496, 1.041, 0.787, 0.623, 0.534, 0.424] | 6 of 8 components capture 95% energy |
| Layer 2 singular values | [2.895, 2.480, 2.349, 2.070, …] (16 values) | 13/16 for 95% energy; 15/16 for 99% |
| ReLU unique activation masks | 40 out of theoretical max 256 (2⁸) | Low effective dimensionality in practice |
| Frobenius identity (Layer 1) | ‖W‖²_F = Σ σᵢ² = 12.5733 | Parseval identity verified |
| Frobenius identity (Layer 2) | ‖W‖²_F = Σ σᵢ² = 43.7439 | Parseval identity verified |
| Analytical vs CNN forward pass | Max diff < 1 × 10⁻⁶ (both layers) | Forward pass equivalence confirmed |
Conclusions / Future Scope
Conclusions
This project establishes, numerically and visually, that every convolutional layer is exactly an SVD. Training TinyConvNet on MNIST achieves 98.71% test accuracy, and post-training analysis confirms all theoretical claims to machine epsilon:
- W = U·Σ·Vᵀ holds with max error below 10⁻⁷ for both layers
- AnalyticalNet forward pass agrees with PyTorch conv to within 10⁻⁶
- Parseval/Frobenius identity ‖W‖²_F = Σ σᵢ² verified for both layers
- Gradient descent discovers eigenfilters with interpretable spatial structure (edges, blobs, diagonals)
- ReLU activates only ~40 effective linear regions out of 256 theoretical maximum
These results provide a principled explanation for why low-rank compression of conv layers works: the bottom singular components carry negligible energy and removing them barely changes network output.
Future Scope
- Deeper networks — extending SVD analysis to VGG or ResNet-style architectures with batch normalisation and residual connections
- SVD-parameterised training — directly optimising {U, σ, V} as separate parameter groups to enforce orthogonality constraints during training
- Structured pruning — using singular value spectra to guide layer-wise rank reduction with provable output bounds
- Dataset generalisation — applying the pipeline to CIFAR-10 or ImageNet crops to study how spectrum complexity scales with input difficulty
- Attention mechanisms — investigating analogous SVD decompositions in transformer attention weight matrices
https://meet.google.com/imi-iqzy-szv
Report Information
Report Details
Created: May 19, 2026, 11:18 a.m.
Approved by: None
Approval date: None
Report Details
Created: May 19, 2026, 11:18 a.m.
Approved by: None
Approval date: None