License: CC0 (public domain)
Design Document: Multi-Scale Neural Network Visualization via CA, Voxels, and Fractal Compression
1. Overview
This document defines a high-performance, multi-scale visualization framework for representing the internal state of deep neural networks using:
Cellular automata (CA)
3D voxel grids
Subpixel and multi-resolution compression
Fractal-inspired scaling derived from network weights and dynamics
The framework converts high-dimensional tensors (activations, weights, gradients, attention maps) into structured, recursively compressed visual fields capable of scaling to billion-parameter models.
The system supports:
Static snapshots (single forward pass)
Time evolution (training iterations)
Layer transitions
CA-driven emergent visualizations
Recursive zoom / fractal exploration
The architecture is model-agnostic (CNNs, transformers, MLPs, diffusion models, etc.).
2. Objectives
2.1 Interpretability
Provide structured visibility into:
Activation sparsity patterns
Feature hierarchies
Attention clustering
Gradient flow and vanishing/exploding behavior
Residual path dominance
Spectral structure of weight matrices
Interpretability goal: expose structure, not raw magnitude.
2.2 Scalability
Target constraints:
Handle ≥10⁹ parameters
Maintain interactive performance (30–60 FPS for moderate models)
Support progressive refinement
Strategies:
Hierarchical spatial compression
Tensor factorization (PCA/SVD)
Block quantization
Octree voxelization
Multi-resolution caching
2.3 Artistic and Structural Insight
Neural networks inherently exhibit:
Recursive composition
Hierarchical feature reuse
Spectral decay
Self-similar clustering
Power-law distributions
The system intentionally leverages these properties to produce fractal-like representations grounded in real model statistics.
3. System Architecture
3.1 Data Sources
3.1.1 Activation Capture
Implementation (conceptually, in PyTorch):
Register forward hooks on modules
Capture:
Input tensor
Output tensor
Intermediate states (if needed)
Memory constraints:
For large models, stream activations layer-by-layer.
Use half precision (FP16/BF16).
Optionally detach and move to CPU asynchronously.
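A minimal sketch of this capture path (PyTorch; the activations dict, the leaf-module filter, and the assumption that each module returns a single tensor are illustrative simplifications):

import torch

activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Detach, cast to FP16, and move to CPU to respect the memory constraints above.
        activations[name] = output.detach().to(torch.float16).cpu()
    return hook

def register_capture(model):
    handles = []
    for name, module in model.named_modules():
        if len(list(module.children())) == 0:          # leaf modules only
            handles.append(module.register_forward_hook(make_hook(name)))
    return handles                                     # call .remove() on each handle when done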
3.1.2 Gradients
Use backward hooks (e.g. register_full_backward_hook on modules).
Store:
dL/dW
dL/dX
Gradient norms
Gradient sign maps
Optionally compute:
[
||\nabla W||_F, \quad ||\nabla X||_2
]
These become color or intensity drivers.
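A sketch of gradient capture along the same lines, storing only scalar summaries to limit memory (the grad_stats dict and leaf-module filter are illustrative):

import torch

grad_stats = {}

def make_grad_hook(name):
    def hook(module, grad_input, grad_output):
        g = grad_output[0]                              # dL/dX at this module's output
        if g is not None:
            grad_stats[name] = {
                "norm": g.norm().item(),
                "sign_ratio": (g > 0).float().mean().item(),
            }
    return hook

def register_grad_capture(model):
    for name, module in model.named_modules():
        if len(list(module.children())) == 0:           # leaf modules only
            module.register_full_backward_hook(make_grad_hook(name))

# Weight gradients (dL/dW) are available on p.grad after loss.backward(), e.g.:
# {n: p.grad.norm().item() for n, p in model.named_parameters() if p.grad is not None}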
3.1.3 Weight Statistics
Precompute per layer:
Frobenius norm
Spectral norm (via power iteration)
Singular values (top-k)
Channel norms
Kernel norms
Sparsity ratio
Weight distribution histogram
Cache results for rendering.
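A sketch of the per-layer precomputation; for simplicity it uses a full SVD (torch.linalg.svdvals) rather than power iteration, and the field names are illustrative:

import torch

def weight_stats(model, top_k=8):
    stats = {}
    for name, p in model.named_parameters():
        if p.dim() < 2:
            continue                                   # skip biases and norm parameters
        W = p.detach().float().flatten(1)              # collapse to 2D: (out, rest)
        svals = torch.linalg.svdvals(W)[:top_k]        # singular values, descending
        stats[name] = {
            "frobenius": W.norm().item(),
            "spectral": svals[0].item(),
            "top_singular_values": svals.tolist(),
            "channel_norms": W.norm(dim=1).tolist(),
            "sparsity": (W.abs() < 1e-6).float().mean().item(),
            "histogram": torch.histc(W, bins=64).tolist(),
        }
    return stats                                       # cache this for rendering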
3.1.4 Attention Matrices
For transformer layers:
Extract:
[
A \in \mathbb{R}^{H \times N \times N}
]
Where:
H = number of heads
N = sequence length
Store:
Mean across heads
Per-head matrices
Symmetrized attention
Eigenvalues of A
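A sketch of these per-layer attention statistics; the eigendecomposition is taken on the symmetrized matrix so the spectrum is real:

import torch

def attention_stats(A):
    # A: (H, N, N) attention weights for one layer.
    mean_heads = A.mean(dim=0)                         # (N, N) mean across heads
    sym = 0.5 * (mean_heads + mean_heads.T)            # symmetrized attention
    eigvals = torch.linalg.eigvalsh(sym)               # real spectrum of the symmetrized matrix
    return {"per_head": A, "mean_heads": mean_heads, "sym": sym, "eigvals": eigvals}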
3.1.5 Jacobians (Optional)
Expensive but powerful.
Approximate Jacobian norm via:
[
||J||_F^2 = \sum_i ||\frac{\partial y}{\partial x_i}||^2
]
Efficient approximation:
Hutchinson trace estimator
Random projection methods
Used to visualize sensitivity fields.
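A sketch of the Hutchinson-style estimate, using the identity ||J||_F^2 = E_v[||Jv||^2] for Rademacher probes v; f and x stand for whatever function and input point are being probed:

import torch

def jacobian_frob_sq(f, x, n_samples=8):
    # Estimate ||J||_F^2 = E_v[ ||J v||^2 ] with Rademacher probes v.
    est = 0.0
    for _ in range(n_samples):
        v = torch.sign(torch.randn_like(x))                     # Rademacher probe
        _, jv = torch.autograd.functional.jvp(f, (x,), (v,))    # forward-mode J v
        est += jv.pow(2).sum().item()
    return est / n_samples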
3.2 Processing Pipeline
Stage 1 — Tensor Acquisition
Normalize tensors per layer:
Options:
Min-max scaling
Z-score normalization
Robust scaling (median + MAD)
Log scaling for heavy-tailed distributions
Recommended default:
[
x' = \tanh(\alpha x)
]
Prevents outlier domination.
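A sketch covering the normalization options above (mode names are illustrative; epsilon terms guard against division by zero):

import torch

def normalize(x, alpha=1.0, mode="tanh"):
    if mode == "tanh":                                  # recommended default
        return torch.tanh(alpha * x)
    if mode == "zscore":
        return (x - x.mean()) / (x.std() + 1e-8)
    if mode == "robust":                                # median + MAD
        med = x.median()
        mad = (x - med).abs().median() + 1e-8
        return (x - med) / mad
    if mode == "log":                                   # heavy-tailed distributions
        return torch.sign(x) * torch.log1p(x.abs())
    return (x - x.min()) / (x.max() - x.min() + 1e-8)   # min-max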
Stage 2 — Dimensionality Compression
CNN Feature Maps
Input shape:
[
B \times C \times H \times W
]
Steps:
Aggregate batch:
mean across B
Compute:
mean activation per channel
variance per channel
Reduce channels:
PCA across C
Top 3 components → RGB
Optional:
Spatial pooling pyramid:
1×
1/2×
1/4×
1/8×
Store as mipmap pyramid.
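A sketch of the channel-PCA-to-RGB reduction plus the pooling pyramid (assumes C >= 3; the global min-max normalization is a simplification):

import torch
import torch.nn.functional as F

def cnn_to_rgb_pyramid(act, levels=4):
    # act: (B, C, H, W). Aggregate the batch, PCA over channels, map top-3 components to RGB.
    x = act.float().mean(dim=0)                          # (C, H, W)
    C, H, W = x.shape
    flat = x.reshape(C, -1).T                            # (H*W, C): pixels as samples
    flat = flat - flat.mean(dim=0, keepdim=True)
    _, _, V = torch.pca_lowrank(flat, q=3, center=False)
    rgb = (flat @ V).T.reshape(3, H, W)
    rgb = (rgb - rgb.amin()) / (rgb.amax() - rgb.amin() + 1e-8)
    pyramid = [rgb]
    for _ in range(levels - 1):                          # 1x, 1/2x, 1/4x, 1/8x mipmap levels
        pyramid.append(F.avg_pool2d(pyramid[-1].unsqueeze(0), 2).squeeze(0))
    return pyramid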
MLP Activations
Vector shape:
[
B \times D
]
Options:
Reshape D into 2D grid (nearest square)
PCA to 3 components
Use block averaging
Spectral embedding
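A sketch of the first option, the nearest-square reshape (zero padding is an arbitrary choice):

import math
import torch
import torch.nn.functional as F

def vector_to_grid(v):
    # v: (D,) activation vector. Zero-pad to the nearest square, reshape to 2D.
    side = math.ceil(math.sqrt(v.numel()))
    padded = F.pad(v, (0, side * side - v.numel()))
    return padded.reshape(side, side)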
Attention Compression
Compute recursive powers:
[
A^{(2^k)} = A^{(2^{k-1})} \cdot A^{(2^{k-1})}
]
Normalize at each step.
This amplifies long-range interactions.
Also compute the graph Laplacian:
[
L = D - A
]
Its eigenvectors are used for cluster visualization.
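A sketch of the recursive squaring and the Laplacian eigenvectors; it assumes a single (N, N) matrix (e.g. the mean over heads) and symmetrizes before forming L:

import torch

def attention_powers(A, k_max=4):
    # Repeated squaring A -> A^2 -> A^4 -> ..., renormalizing rows at each step.
    powers = [A]
    for _ in range(k_max):
        A = A @ A
        A = A / (A.sum(dim=-1, keepdim=True) + 1e-8)
        powers.append(A)
    return powers

def laplacian_eigvecs(A, k=3):
    S = 0.5 * (A + A.T)                                 # symmetrize
    L = torch.diag(S.sum(dim=1)) - S                    # L = D - A
    _, evecs = torch.linalg.eigh(L)                     # eigenvalues in ascending order
    return evecs[:, 1:k + 1]                            # skip the constant null eigenvector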
Stage 3 — Fractal Scaling
3.3.1 Weight Norm Scaling
For each layer:
[
s_L = ||W_L||_F
]
For each channel:
[
s_c = ||W_{L,c}||
]
Use scaling factor:
[
\tilde{x} = x \cdot \frac{s_c}{\max(s_c)}
]
Maps structural importance to visual prominence.
3.3.2 Spectral Scaling
Compute top singular values:
[
\sigma_1 \ge \sigma_2 \ge \dots
]
Define recursive zoom depth:
[
\text{depth} \propto \log(\sigma_1 / \sigma_k)
]
High spectral dominance → deeper fractal recursion.
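One possible mapping from the spectrum to a clamped recursion depth (the gain and cap are tuning parameters, not prescribed values):

import math

def recursion_depth(svals, k=8, gain=1.0, max_depth=6):
    # depth ~ log(sigma_1 / sigma_k), clamped to a renderable range.
    k = min(k, len(svals))
    ratio = svals[0] / (svals[k - 1] + 1e-12)
    return max(1, min(max_depth, round(gain * math.log(ratio))))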
3.3.3 Residual Path Branching
For networks with skip connections:
Represent each residual branch as a child region in CA or voxel tree.
Branch width ∝ branch weight norm.
This creates visible branching trees.
3.3.4 Jacobian Field Visualization
Map:
Jacobian norm → brightness
Largest singular vector direction → color angle
This often reveals ridge-like structures in input space.
4. Compression Techniques
4.1 Subpixel Encoding
Each pixel subdivided into:
2×2 grid or 3×3 microcells
Encode:
Mean
Variance
Gradient magnitude
Sign ratio
Use bit-packing for GPU upload:
Example:
8 bits mean
8 bits variance
8 bits gradient
8 bits sign entropy
Packed into RGBA texture.
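A sketch of the 8-bit packing; it assumes each statistic has already been normalized to [0, 1], and the resulting uint8 array uploads directly as an RGBA8 texture:

import numpy as np

def pack_rgba(mean, var, grad, sign_entropy):
    # Each input: float array in [0, 1], same shape. Quantize to 8 bits and stack as RGBA.
    def q(a):
        return np.clip(a * 255.0, 0, 255).astype(np.uint8)
    return np.stack([q(mean), q(var), q(grad), q(sign_entropy)], axis=-1)   # (..., 4) uint8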
4.2 Octree Voxelization
Data structure:
Node:
bounds
mean_activation
variance
children[8]
Merge rule:
If:
[
|a_i - a_j| < \epsilon
]
And variance below threshold → collapse children.
Provides O(N log N) construction.
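A sketch of the node layout and the merge rule (thresholds are illustrative; tree construction itself is omitted):

from dataclasses import dataclass, field

@dataclass
class OctreeNode:
    bounds: tuple                                   # ((x0, y0, z0), (x1, y1, z1))
    mean_activation: float
    variance: float
    children: list = field(default_factory=list)    # up to 8 children

def try_collapse(node, eps=1e-2, var_thresh=1e-3):
    # Merge rule: if child means agree within eps and variance is low, drop the children.
    if not node.children:
        return
    means = [c.mean_activation for c in node.children]
    if max(means) - min(means) < eps and node.variance < var_thresh:
        node.children = []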
4.3 Density-Aware Merging
Define density:
[
\rho = |\text{activation}|
]
High ρ:
Subdivide
Low ρ:
Merge
Adaptive voxel resolution.
4.4 Multi-Resolution Blending
Algorithm:
Downsample tensor via average pooling
Upsample via bilinear
Blend:
[
x_{\text{blend}} = \lambda x + (1 - \lambda) x_{\text{up}}
]
Repeat recursively.
Produces controlled fractal texture.
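A sketch of the recursive blend (lambda and the level count are tuning parameters):

import torch
import torch.nn.functional as F

def fractal_blend(x, lam=0.7, levels=3):
    # x: (1, C, H, W). Downsample, upsample back to full size, and blend, recursively.
    for _ in range(levels):
        down = F.avg_pool2d(x, 2)
        up = F.interpolate(down, size=x.shape[-2:], mode="bilinear", align_corners=False)
        x = lam * x + (1 - lam) * up
    return x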
5. Cellular Automaton Layer
Each CA cell contains:
struct Cell:
activation_mean
activation_variance
gradient_mean
weight_scale
spectral_scale
Neighborhood:
Moore (8-neighbor)
3D 26-neighbor (voxels)
Update rule example:
[
x_{t+1} = f(x_t, \text{neighbor mean}, \text{gradient}, \text{weight scale})
]
Possible update equations (applied sequentially):
[
x' = x + \alpha \cdot \Delta_{\text{neighbors}}
]
[
x'' = x' \cdot (1 + \beta \cdot \text{weight\_scale})
]
Optionally nonlinear activation (ReLU/tanh).
Can be:
Hand-crafted
Learned (Neural CA)
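A hand-crafted variant of the update rule above, as a sketch: it acts on the activation_mean field only, uses a 2D Moore neighborhood with a circular boundary, and treats alpha and beta as free parameters.

import torch
import torch.nn.functional as F

def ca_step(x, weight_scale, alpha=0.1, beta=0.05):
    # x: (1, 1, H, W) cell activations; weight_scale broadcasts to the same shape.
    kernel = torch.ones(1, 1, 3, 3, device=x.device) / 8.0
    kernel[0, 0, 1, 1] = 0.0                            # exclude the center: neighbor mean only
    neighbor_mean = F.conv2d(F.pad(x, (1, 1, 1, 1), mode="circular"), kernel)
    x = x + alpha * (neighbor_mean - x)                 # x' = x + alpha * Delta_neighbors
    x = x * (1 + beta * weight_scale)                   # modulate by weight scale
    return torch.tanh(x)                                # optional nonlinearity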
6. Voxel Rendering
6.1 Mapping Strategy
Dimension mapping examples:
X,Y → spatial
Z → channel index
Brightness → activation
Hue → gradient direction
Opacity → weight norm
6.2 GPU Rendering
Recommended:
OpenGL / Vulkan
WebGL for browser
CUDA volume ray marching
Techniques:
3D textures
Ray marching with early termination
Transfer functions for opacity
Instanced cube rendering for sparse voxels
Acceleration:
Frustum culling
Level-of-detail switching
Sparse voxel octrees
7. Color Encoding
7.1 Diverging Maps
Map:
[
x < 0 \rightarrow \text{blue}, \quad x > 0 \rightarrow \text{red}
]
Gamma correct before display.
7.2 PCA → RGB
Compute PCA:
[
X \rightarrow U \Sigma V^T
]
Take first 3 columns of UΣ.
Normalize per component.
Map to RGB.
7.3 HSV Gradient Encoding
Hue:
[
\theta = \text{atan2}(g_y, g_x)
]
Saturation:
[
||\nabla||
]
Value:
[
|\text{activation}|
]
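A sketch of the HSV assembly; an HSV-to-RGB conversion is still needed before display, and the per-image normalizations are simplifications:

import math
import torch

def hsv_encode(act, gx, gy):
    # Hue from gradient direction, saturation from gradient magnitude, value from |activation|.
    hue = (torch.atan2(gy, gx) + math.pi) / (2 * math.pi)        # map angle to [0, 1]
    mag = torch.sqrt(gx ** 2 + gy ** 2)
    sat = mag / (mag.amax() + 1e-8)
    val = act.abs() / (act.abs().amax() + 1e-8)
    return torch.stack([hue, sat, val], dim=-1)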
8. Rendering Modes
8.1 Static
Single layer spectral map
Attention fractal heatmap
Weight norm landscape
Voxel activation cloud
8.2 Animated
Training evolution over epochs
Gradient flow over time
CA emergent patterns
Recursive zoom via spectral scale
8.3 Interactive
User controls:
Layer selection
Head selection
Compression threshold
Spectral depth
Toggle raw vs scaled
Voxel slicing plane
Add inspection overlay:
Hover → show tensor statistics
Click → show singular values
9. Performance Considerations
9.1 Memory
Use FP16 where possible
Stream tensors instead of storing entire model
Compress PCA bases
9.2 Parallelism
GPU for voxel + CA
CPU for PCA/SVD (or cuSOLVER)
Async prefetch
9.3 Caching
Cache:
Downsample pyramids
PCA bases per layer
Weight norms
Spectral norms
Invalidate cache when model updates.
10. Stability & Safety
Always normalize before visualization.
Clamp extreme outliers.
Provide legends and numeric scales.
Separate aesthetic exaggeration from faithful mode.
Provide “scientific mode” toggle (no scaling distortions).
11. Future Extensions
Learned Neural CA visualizers
VR exploration of voxel space
Differentiable visualization loss
Integration with experiment tracking systems
Spectral topology analysis
Persistent homology overlays
12. Implementation Roadmap (High-Level)
Phase 1
Activation capture
PCA compression
2D heatmap renderer
Phase 2
Multi-resolution pyramid
Octree voxelization
GPU volume rendering
Phase 3
Spectral scaling
Attention recursion
CA evolution engine
Phase 4
Interactive UI
Training-time animation
VR or WebGL deployment