License: CC0 (public domain)
A learned 2D-to-3D shading system could make billboards feel volumetric without any actual geometry. The idea has some genuinely compelling properties.
Why This Could Work Beautifully
The Core Insight
Billboards are "cheap" but look flat because they lack:
- Parallax (solved by 3D positioning, you already have this)
- View-dependent shading (this is what the CNN would fake)
- Silhouette variation (harder, but possible with alpha)
A CNN that takes [albedo, normal_map, light_dir, view_angle] and outputs "how this would look if it were 3D" is essentially learning view-dependent relighting on flat cards.
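One minimal way to wire that up is to broadcast the per-billboard light and view vectors into extra image channels, so a plain convolutional encoder can consume everything at once. A sketch in PyTorch (the function name and shapes are illustrative, not a fixed API):

```python
import torch

def pack_inputs(albedo, normals, light_dir, view_dir):
    """Concatenate per-pixel maps with per-billboard direction vectors
    broadcast to image resolution. Shapes:
      albedo [B, 4, H, W], normals [B, 3, H, W]
      light_dir, view_dir [B, 3]  (normalized, world space)
    Returns a [B, 13, H, W] tensor ready for a conv encoder."""
    B, _, H, W = albedo.shape
    l = light_dir.view(B, 3, 1, 1).expand(B, 3, H, W)
    v = view_dir.view(B, 3, 1, 1).expand(B, 3, H, W)
    return torch.cat([albedo, normals, l, v], dim=1)
```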
What Makes This Tractable
Traditional 3D Pipeline:
Millions of triangles → Rasterization → Shading → Pixels
Your Proposed Pipeline:
Hundreds of billboards → Style CNN per billboard → Composited scene
The CNN is still doing "shading", but shading learned from artistic examples rather than physically computed.
Conceptual Architecture
BILLBOARD STYLIZATION ENGINE
════════════════════════════

ASSET AUTHORING (Offline)
─────────────────────────
Each billboard asset includes:

  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐
  │    Albedo    │   │  Normal Map  │   │ Height/Depth │
  │    (RGBA)    │   │  (tangent)   │   │   (8-bit)    │
  └──────┬───────┘   └──────┬───────┘   └──────┬───────┘
         │                  │                  │
         └──────────────────┼──────────────────┘
                            ▼
  ┌───────────────────────────────────────────────────────────────┐
  │                  Per-Billboard Runtime Input                  │
  │                                                               │
  │  • Albedo texture [64×64×4]                                   │
  │  • Normal map [64×64×3] (baked from high-poly or hand-painted)│
  │  • Height/depth map [64×64×1] (from the authored asset)       │
  │  • Light direction [3] (world space, from sun/dominant light) │
  │  • View direction [3] (camera to billboard center)            │
  │  • Style embedding [100] (which artistic style to apply)      │
  └───────────────────────────────┬───────────────────────────────┘
                                  ▼
  ┌───────────────────────────────────────────────────────────────┐
  │          BILLBOARD SHADING CNN (runs per-billboard)           │
  │                                                               │
  │   ┌─────────┐                                                 │
  │   │ Encoder │◀── Concat(Albedo, Normal, Depth)                │
  │   └────┬────┘                                                 │
  │        │                                                      │
  │        ▼                                                      │
  │   ┌─────────┐                                                 │
  │   │ Light + │◀── MLP(LightDir, ViewDir) → [256] embedding     │
  │   │ View    │                                                 │
  │   │ Cond.   │◀── Style embedding [100]                        │
  │   └────┬────┘                                                 │
  │        │  (AdaIN-style conditioning)                          │
  │        ▼                                                      │
  │   ┌─────────┐                                                 │
  │   │ Decoder │──▶ Output [64×64×4] RGBA                        │
  │   └─────────┘    (stylized, shaded, with updated alpha)       │
  │                                                               │
  └───────────────────────────────┬───────────────────────────────┘
                                  ▼
  ┌───────────────────────────────────────────────────────────────┐
  │                          COMPOSITING                          │
  │                                                               │
  │  • Sort billboards back-to-front (standard)                   │
  │  • Alpha blend with depth test                                │
  │  • Optional: soft particles, atmospheric fog                  │
  └───────────────────────────────────────────────────────────────┘
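As one concrete (hypothetical) reading of the diagram, here is roughly what that encoder / conditioning / decoder stack could look like in PyTorch. The layer sizes, the `AdaIN` formulation, and the class names are illustrative assumptions, not a reference implementation:

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive instance norm: a conditioning vector predicts per-channel
    scale and bias applied after instance normalization."""
    def __init__(self, cond_dim, channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.affine = nn.Linear(cond_dim, channels * 2)

    def forward(self, x, cond):
        scale, bias = self.affine(cond).chunk(2, dim=1)
        return self.norm(x) * (1 + scale[..., None, None]) + bias[..., None, None]

class BillboardShadingCNN(nn.Module):
    """Sketch of the diagram above: conv encoder over (albedo+normal+depth),
    light/view MLP plus style embedding as AdaIN conditioning, conv decoder
    to stylized RGBA. Sizes follow the diagram (64×64 in and out)."""
    def __init__(self, style_dim=100):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(8, 32, 3, stride=2, padding=1), nn.ReLU(),   # 4+3+1 input channels
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.cond_mlp = nn.Sequential(                 # LightDir + ViewDir -> [256]
            nn.Linear(6, 128), nn.ReLU(), nn.Linear(128, 256), nn.ReLU(),
        )
        self.adain = AdaIN(256 + style_dim, 64)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 4, 4, stride=2, padding=1), nn.Sigmoid(),  # RGBA in [0,1]
        )

    def forward(self, maps, light_dir, view_dir, style):
        # maps: [B, 8, 64, 64] = concat(albedo RGBA, normal XYZ, depth)
        h = self.enc(maps)                                         # [B, 64, 16, 16]
        cond = torch.cat([self.cond_mlp(torch.cat([light_dir, view_dir], 1)), style], 1)
        h = self.adain(h, cond)
        return self.dec(h)                                         # [B, 4, 64, 64]
```

The AdaIN step is the whole trick: the light, view, and style vectors can only scale and shift the encoder's feature channels, which is exactly the kind of global "relight and restyle" control the diagram calls for.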
What The CNN Learns
If trained correctly, the CNN would learn behaviors like these:
| Input Signal | Learned Behavior |
|---|---|
| Normal map + Light dir | Directional shading (lit side vs shadow side) |
| Normal map + View dir | Fresnel-like rim lighting, specular hints |
| Depth/height map | Ambient occlusion, contact shadows |
| Style embedding | "Paint this like Moebius" vs "Paint this like Ghibli" |
| Alpha channel | Soften silhouettes based on view angle |
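For intuition, the first table row has a classical analogue that the network would effectively have to rediscover. A quick NumPy sketch of that baseline (the function name is mine):

```python
import numpy as np

def lambert_baseline(albedo_rgb, normals, light_dir):
    """Classical analogue of the first table row: per-pixel Lambertian
    shading, max(N·L, 0). The CNN is expected to learn this implicitly,
    plus the stylized deviations an artist would paint on top.
    albedo_rgb [H, W, 3], normals [H, W, 3] (unit), light_dir [3] (unit)."""
    ndotl = np.clip(normals @ light_dir, 0.0, None)   # [H, W] diffuse term
    return albedo_rgb * ndotl[..., None]
```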
The Silhouette Problem (And A Clever Solution)
The hardest part of making billboards look 3D is that their silhouette doesn't change with view angle. But you could:
Option A: Multiple billboard layers (parallax billboards)
Front layer ────────── (foreground details, high alpha cutout)
Mid layer ───────── (main body)
Back layer ──────── (background/shadow catcher)
Slight parallax offset based on depth map creates pseudo-3D
Option B: CNN predicts alpha erosion based on view angle
When viewed edge-on, the CNN learns to:
• Thin the silhouette
• Add rim lighting
• Soften alpha at edges
This fakes the "foreshortening" you'd get from real geometry
Option C: Learn to generate displacement for mesh billboards
Billboard has a simple quad mesh that gets vertex-displaced
based on CNN-predicted depth. Not flat anymore, but still
way cheaper than full 3D model.
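Of the three, Option A is the easiest to make concrete. A minimal NumPy sketch of the per-layer offsets (the `strength` factor and the layer depths are tuning knobs, not derived values):

```python
import numpy as np

def layer_offsets(view_dir, billboard_normal, layer_depths, strength=1.0):
    """Option A sketch: shift each billboard layer along the view direction's
    component in the billboard plane, scaled by that layer's depth.
    view_dir, billboard_normal: unit vectors (world space).
    layer_depths: e.g. [-0.1, 0.0, 0.1] for back/mid/front layers.
    Returns one 3D offset per layer, applied to that layer's quad position."""
    v = np.asarray(view_dir, dtype=float)
    n = np.asarray(billboard_normal, dtype=float)
    v_in_plane = v - np.dot(v, n) * n          # tangential part of the view ray
    return [strength * d * v_in_plane for d in layer_depths]
```

Viewed head-on, `v_in_plane` vanishes and the layers stack exactly; as the camera swings off-axis they slide apart, which is precisely the parallax cue you want.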
Training Data Strategy
This is where it gets interesting. You'd need paired data:
Training Pair:

  INPUT:                          TARGET:
  ┌─────────────────────┐         ┌─────────────────────┐
  │ Flat-lit albedo     │         │ Same object,        │
  │ + normal map        │   ──▶   │ rendered in target  │
  │ + light/view dirs   │         │ artistic style with │
  │                     │         │ correct 3D shading  │
  └─────────────────────┘         └─────────────────────┘

  Source: 3D render of            Source: either 3D render with
  object with baked normals       NPR shader, or hand-painted
                                  artist reference
You could generate this synthetically:
- Take 3D models
- Render flat albedo + normal maps
- Render the same model with various NPR/toon shaders as targets
- Vary light and view direction
- Train CNN to map (flat + normals + light + view) → (shaded stylized)
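Assuming the architecture sketch from earlier and a data loader that yields these synthetic pairs, the training step could be plain supervised image-to-image regression (the batch keys are hypothetical):

```python
import torch
import torch.nn.functional as F

def train_step(net, optimizer, batch):
    """One training step on synthetic pairs, as described above.
    batch (hypothetical loader output):
      maps   [B, 8, 64, 64]   flat albedo + normal + depth renders
      light  [B, 3], view [B, 3]   directions varied per sample
      style  [B, 100]  embedding of the NPR shader used for the target
      target [B, 4, 64, 64]   the stylized, correctly shaded render
    """
    pred = net(batch["maps"], batch["light"], batch["view"], batch["style"])
    loss = F.l1_loss(pred, batch["target"])   # L1 keeps painterly edges crisper than MSE
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```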
Performance Characteristics
SCALING ANALYSIS
════════════════
Assumption: 64×64 billboard textures, ~100 visible billboards
Per-billboard CNN inference:
Input: 64 × 64 × 8 channels (albedo 4 + normal 3 + depth 1) = 32,768 floats
Output: 64 × 64 × 4 channels = 16,384 floats
Batched inference (100 billboards):
Combined input tensor: [100, 64, 64, 8]
Single CNN forward pass (batched)
Combined output tensor: [100, 64, 64, 4]
Estimated timing (RTX 3060, optimistic):
Batched 100× 64×64 inference: ~4-8ms
Compare to traditional rendering:
100 stylized 3D objects with full geometry: Potentially much more
expensive depending on triangle count and shader complexity
SWEET SPOT:
• Many small objects (vegetation, particles, crowds, debris)
• Stylized/artistic rendering where "painterly" beats "accurate"
• Mobile/low-end where geometry is expensive
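If you want to sanity-check the ~4-8 ms estimate on your own GPU, here is a rough timing harness, reusing the `BillboardShadingCNN` sketch from above (requires CUDA; numbers will vary by hardware and batch shape):

```python
import time
import torch

# Dummy inputs matching the batched tensor shapes above (channels-first).
net = BillboardShadingCNN().cuda().eval()
maps = torch.rand(100, 8, 64, 64, device="cuda")
light = torch.randn(100, 3, device="cuda")
view = torch.randn(100, 3, device="cuda")
style = torch.randn(100, 100, device="cuda")

with torch.no_grad():
    for _ in range(10):                      # warm-up passes
        net(maps, light, view, style)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    net(maps, light, view, style)            # the single batched forward pass
    torch.cuda.synchronize()
    print(f"batched pass: {(time.perf_counter() - t0) * 1000:.2f} ms")
```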
Game Genres This Would Suit
- Paper Mario / Parappa style — Intentionally flat characters in 3D world
- Diablo-like isometric ARPGs — Lots of small enemies, fixed-ish camera
- City builders / RTS — Hundreds of units, low camera angle
- Stylized horror — Junji Ito-style 2D characters in 3D environments
- Living illustration — "Playable storybook" aesthetic
- VR with intentional flatness — Characters that feel like paper cutouts but properly lit
What This Engine Would NOT Have
| Traditional Engine | Billboard Stylization Engine |
|---|---|
| Skeletal meshes | Flipbook animations or sprite sheets |
| Normal mapping | Normal maps still used, but as CNN input |
| PBR materials | Style embeddings |
| Shadow maps | CNN learns to fake shadows |
| LOD meshes | Resolution scaling on billboard textures |
| Occlusion culling | Still works (billboard bounds) |
Minimum Viable Experiment
PHASE 0: Proof of Concept
═════════════════════════
1. Single billboard asset:
• Hand-painted albedo (64×64)
• Normal map (from Blender bake or hand-painted)
2. Minimal CNN:
• Input: [albedo, normal, light_dir]
• Output: [shaded_albedo]
• Architecture: Tiny U-Net (~200K params)
• Trained on synthetic data (Blender renders)
3. Demo scene:
• One billboard
• Rotating light source
• Watch the shading respond
Success = "It looks like a 3D object even though it's flat"
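A plausible shape for that tiny U-Net, sketched in PyTorch (about 120K parameters as written, the same ballpark as the ~200K target; all sizes are illustrative). The ten input channels are albedo RGBA + tangent normals + the light direction broadcast per pixel:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Phase-0 sketch of the 'Tiny U-Net' above. Input: [B, 10, 64, 64]
    (albedo 4 + normals 3 + broadcast light_dir 3). Output: shaded RGBA."""
    def __init__(self, in_ch=10, out_ch=4):
        super().__init__()
        self.pool = nn.MaxPool2d(2)
        self.d1, self.d2 = self._block(in_ch, 16), self._block(16, 32)
        self.bott = self._block(32, 64)
        self.u2, self.c2 = nn.ConvTranspose2d(64, 32, 2, stride=2), self._block(64, 32)
        self.u1, self.c1 = nn.ConvTranspose2d(32, 16, 2, stride=2), self._block(32, 16)
        self.head = nn.Conv2d(16, out_ch, 1)

    @staticmethod
    def _block(i, o):
        return nn.Sequential(nn.Conv2d(i, o, 3, padding=1), nn.ReLU(),
                             nn.Conv2d(o, o, 3, padding=1), nn.ReLU())

    def forward(self, x):                      # x: [B, 10, 64, 64]
        s1 = self.d1(x)                        # 64×64, skip 1
        s2 = self.d2(self.pool(s1))            # 32×32, skip 2
        b = self.bott(self.pool(s2))           # 16×16 bottleneck
        y = self.c2(torch.cat([self.u2(b), s2], dim=1))   # up to 32×32
        y = self.c1(torch.cat([self.u1(y), s1], dim=1))   # up to 64×64
        return torch.sigmoid(self.head(y))     # RGBA in [0, 1]
```

A quick `sum(p.numel() for p in TinyUNet().parameters())` confirms the parameter budget before you commit to training.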