License: CC0 (public domain)
A learned 2D-to-3D shading system could make billboards feel volumetric without any actual geometry. The idea has some genuinely compelling properties.
Why This Could Work Beautifully
The Core Insight
Billboards are "cheap" but look flat because they lack:
- Parallax (solved by 3D positioning, you already have this)
- View-dependent shading (this is what the CNN would fake)
- Silhouette variation (harder, but possible with alpha)
A CNN that takes [albedo, normal_map, light_dir, view_angle] and outputs "how this would look if it were 3D" is essentially learning view-dependent relighting on flat cards.
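One minimal way to wire that up is to broadcast the per-billboard light and view vectors into extra image channels, so a plain convolutional encoder can consume everything at once. A sketch in PyTorch (the function name and shapes are illustrative, not a fixed API):

```python
import torch

def pack_inputs(albedo, normals, light_dir, view_dir):
    """Concatenate per-pixel maps with per-billboard direction vectors
    broadcast to image resolution. Shapes:
      albedo [B, 4, H, W], normals [B, 3, H, W]
      light_dir, view_dir [B, 3]  (normalized, world space)
    Returns a [B, 13, H, W] tensor ready for a conv encoder."""
    B, _, H, W = albedo.shape
    l = light_dir.view(B, 3, 1, 1).expand(B, 3, H, W)
    v = view_dir.view(B, 3, 1, 1).expand(B, 3, H, W)
    return torch.cat([albedo, normals, l, v], dim=1)
```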
What Makes This Tractable
Traditional 3D Pipeline:
Millions of triangles → Rasterization → Shading → Pixels
Your Proposed Pipeline:
Hundreds of billboards → Style CNN per billboard → Composited scene
The CNN is still doing "shading", but shading learned from artistic examples rather than physically computed.
Conceptual Architecture
BILLBOARD STYLIZATION ENGINE
════════════════════════════

ASSET AUTHORING (Offline)
─────────────────────────
Each billboard asset includes:

  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐
  │    Albedo    │   │  Normal Map  │   │ Height/Depth │
  │    (RGBA)    │   │  (tangent)   │   │   (8-bit)    │
  └──────┬───────┘   └──────┬───────┘   └──────┬───────┘
         │                  │                  │
         └──────────────────┼──────────────────┘
                            ▼
  ┌───────────────────────────────────────────────────────────────┐
  │                  Per-Billboard Runtime Input                  │
  │                                                               │
  │  • Albedo texture [64×64×4]                                   │
  │  • Normal map [64×64×3] (baked from high-poly or hand-painted)│
  │  • Height/depth map [64×64×1] (from the authored asset)       │
  │  • Light direction [3] (world space, from sun/dominant light) │
  │  • View direction [3] (camera to billboard center)            │
  │  • Style embedding [100] (which artistic style to apply)      │
  └───────────────────────────────┬───────────────────────────────┘
                                  ▼
  ┌───────────────────────────────────────────────────────────────┐
  │          BILLBOARD SHADING CNN (runs per-billboard)           │
  │                                                               │
  │   ┌─────────┐                                                 │
  │   │ Encoder │◀── Concat(Albedo, Normal, Depth)                │
  │   └────┬────┘                                                 │
  │        │                                                      │
  │        ▼                                                      │
  │   ┌─────────┐                                                 │
  │   │ Light + │◀── MLP(LightDir, ViewDir) → [256] embedding     │
  │   │ View    │                                                 │
  │   │ Cond.   │◀── Style embedding [100]                        │
  │   └────┬────┘                                                 │
  │        │  (AdaIN-style conditioning)                          │
  │        ▼                                                      │
  │   ┌─────────┐                                                 │
  │   │ Decoder │──▶ Output [64×64×4] RGBA                        │
  │   └─────────┘    (stylized, shaded, with updated alpha)       │
  │                                                               │
  └───────────────────────────────┬───────────────────────────────┘
                                  ▼
  ┌───────────────────────────────────────────────────────────────┐
  │                          COMPOSITING                          │
  │                                                               │
  │  • Sort billboards back-to-front (standard)                   │
  │  • Alpha blend with depth test                                │
  │  • Optional: soft particles, atmospheric fog                  │
  └───────────────────────────────────────────────────────────────┘
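As one concrete (hypothetical) reading of the diagram, here is roughly what that encoder / conditioning / decoder stack could look like in PyTorch. The layer sizes, the `AdaIN` formulation, and the class names are illustrative assumptions, not a reference implementation:

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive instance norm: a conditioning vector predicts per-channel
    scale and bias applied after instance normalization."""
    def __init__(self, cond_dim, channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.affine = nn.Linear(cond_dim, channels * 2)

    def forward(self, x, cond):
        scale, bias = self.affine(cond).chunk(2, dim=1)
        return self.norm(x) * (1 + scale[..., None, None]) + bias[..., None, None]

class BillboardShadingCNN(nn.Module):
    """Sketch of the diagram above: conv encoder over (albedo+normal+depth),
    light/view MLP plus style embedding as AdaIN conditioning, conv decoder
    to stylized RGBA. Sizes follow the diagram (64×64 in and out)."""
    def __init__(self, style_dim=100):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(8, 32, 3, stride=2, padding=1), nn.ReLU(),   # 4+3+1 input channels
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.cond_mlp = nn.Sequential(                 # LightDir + ViewDir -> [256]
            nn.Linear(6, 128), nn.ReLU(), nn.Linear(128, 256), nn.ReLU(),
        )
        self.adain = AdaIN(256 + style_dim, 64)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 4, 4, stride=2, padding=1), nn.Sigmoid(),  # RGBA in [0,1]
        )

    def forward(self, maps, light_dir, view_dir, style):
        # maps: [B, 8, 64, 64] = concat(albedo RGBA, normal XYZ, depth)
        h = self.enc(maps)                                         # [B, 64, 16, 16]
        cond = torch.cat([self.cond_mlp(torch.cat([light_dir, view_dir], 1)), style], 1)
        h = self.adain(h, cond)
        return self.dec(h)                                         # [B, 4, 64, 64]
```

The AdaIN step is the whole trick: the light, view, and style vectors can only scale and shift the encoder's feature channels, which is exactly the kind of global "relight and restyle" control the diagram calls for.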
What The CNN Learns
If trained correctly, the CNN would learn behaviors like these:
| Input Signal | Learned Behavior |
|---|---|
| Normal map + Light dir | Directional shading (lit side vs shadow side) |
| Normal map + View dir | Fresnel-like rim lighting, specular hints |
| Depth/height map | Ambient occlusion, contact shadows |
| Style embedding | "Paint this like Moebius" vs "Paint this like Ghibli" |
| Alpha channel | Soften silhouettes based on view angle |
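For intuition, the first table row has a classical analogue that the network would effectively have to rediscover. A quick NumPy sketch of that baseline (the function name is mine):

```python
import numpy as np

def lambert_baseline(albedo_rgb, normals, light_dir):
    """Classical analogue of the first table row: per-pixel Lambertian
    shading, max(N·L, 0). The CNN is expected to learn this implicitly,
    plus the stylized deviations an artist would paint on top.
    albedo_rgb [H, W, 3], normals [H, W, 3] (unit), light_dir [3] (unit)."""
    ndotl = np.clip(normals @ light_dir, 0.0, None)   # [H, W] diffuse term
    return albedo_rgb * ndotl[..., None]
```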
The Silhouette Problem (And A Clever Solution)
The hardest part of making billboards look 3D is that their silhouette doesn't change with view angle. But you could:
Option A: Multiple billboard layers (parallax billboards)
Front layer ────────── (foreground details, high alpha cutout)
Mid layer ───────── (main body)
Back layer ──────── (background/shadow catcher)
Slight parallax offset based on depth map creates pseudo-3D
Option B: CNN predicts alpha erosion based on view angle
When viewed edge-on, the CNN learns to:
• Thin the silhouette
• Add rim lighting
• Soften alpha at edges
This fakes the "foreshortening" you'd get from real geometry
Option C: Learn to generate displacement for mesh billboards
Billboard has a simple quad mesh that gets vertex-displaced
based on CNN-predicted depth. Not flat anymore, but still
way cheaper than full 3D model.
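Of the three, Option A is the easiest to make concrete. A minimal NumPy sketch of the per-layer offsets (the `strength` factor and the layer depths are tuning knobs, not derived values):

```python
import numpy as np

def layer_offsets(view_dir, billboard_normal, layer_depths, strength=1.0):
    """Option A sketch: shift each billboard layer along the view direction's
    component in the billboard plane, scaled by that layer's depth.
    view_dir, billboard_normal: unit vectors (world space).
    layer_depths: e.g. [-0.1, 0.0, 0.1] for back/mid/front layers.
    Returns one 3D offset per layer, applied to that layer's quad position."""
    v = np.asarray(view_dir, dtype=float)
    n = np.asarray(billboard_normal, dtype=float)
    v_in_plane = v - np.dot(v, n) * n          # tangential part of the view ray
    return [strength * d * v_in_plane for d in layer_depths]
```

Viewed head-on, `v_in_plane` vanishes and the layers stack exactly; as the camera swings off-axis they slide apart, which is precisely the parallax cue you want.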
Training Data Strategy
This is where it gets interesting. You'd need paired data:
Training Pair:

  INPUT:                          TARGET:
  ┌─────────────────────┐         ┌─────────────────────┐
  │ Flat-lit albedo     │         │ Same object,        │
  │ + normal map        │   ──▶   │ rendered in target  │
  │ + light/view dirs   │         │ artistic style with │
  │                     │         │ correct 3D shading  │
  └─────────────────────┘         └─────────────────────┘

  Source: 3D render of            Source: either 3D render with
  object with baked normals       NPR shader, or hand-painted
                                  artist reference
You could generate this synthetically:
- Take 3D models
- Render flat albedo + normal maps
- Render the same model with various NPR/toon shaders as targets
- Vary light and view direction
- Train CNN to map (flat + normals + light + view) → (shaded stylized)
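Assuming the architecture sketch from earlier and a data loader that yields these synthetic pairs, the training step could be plain supervised image-to-image regression (the batch keys are hypothetical):

```python
import torch
import torch.nn.functional as F

def train_step(net, optimizer, batch):
    """One training step on synthetic pairs, as described above.
    batch (hypothetical loader output):
      maps   [B, 8, 64, 64]   flat albedo + normal + depth renders
      light  [B, 3], view [B, 3]   directions varied per sample
      style  [B, 100]  embedding of the NPR shader used for the target
      target [B, 4, 64, 64]   the stylized, correctly shaded render
    """
    pred = net(batch["maps"], batch["light"], batch["view"], batch["style"])
    loss = F.l1_loss(pred, batch["target"])   # L1 keeps painterly edges crisper than MSE
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```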
Performance Characteristics
SCALING ANALYSIS
════════════════
Assumption: 64×64 billboard textures, ~100 visible billboards
Per-billboard CNN inference:
Input: 64 × 64 × 8 channels (albedo 4 + normal 3 + depth 1) = 32,768 floats
Output: 64 × 64 × 4 channels = 16,384 floats
Batched inference (100 billboards):
Combined input tensor: [100, 64, 64, 8]
Single CNN forward pass (batched)
Combined output tensor: [100, 64, 64, 4]
Estimated timing (RTX 3060, optimistic):
Batched 100× 64×64 inference: ~4-8ms
Compare to traditional rendering:
100 stylized 3D objects with full geometry: Potentially much more
expensive depending on triangle count and shader complexity
SWEET SPOT:
• Many small objects (vegetation, particles, crowds, debris)
• Stylized/artistic rendering where "painterly" beats "accurate"
• Mobile/low-end where geometry is expensive
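If you want to sanity-check the ~4-8 ms estimate on your own GPU, here is a rough timing harness, reusing the `BillboardShadingCNN` sketch from above (requires CUDA; numbers will vary by hardware and batch shape):

```python
import time
import torch

# Dummy inputs matching the batched tensor shapes above (channels-first).
net = BillboardShadingCNN().cuda().eval()
maps = torch.rand(100, 8, 64, 64, device="cuda")
light = torch.randn(100, 3, device="cuda")
view = torch.randn(100, 3, device="cuda")
style = torch.randn(100, 100, device="cuda")

with torch.no_grad():
    for _ in range(10):                      # warm-up passes
        net(maps, light, view, style)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    net(maps, light, view, style)            # the single batched forward pass
    torch.cuda.synchronize()
    print(f"batched pass: {(time.perf_counter() - t0) * 1000:.2f} ms")
```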
Game Genres This Would Suit
- Paper Mario / Parappa style — Intentionally flat characters in 3D world
- Diablo-like isometric ARPGs — Lots of small enemies, fixed-ish camera
- City builders / RTS — Hundreds of units, low camera angle
- Stylized horror — Junji Ito-style 2D characters in 3D environments
- Living illustration — "Playable storybook" aesthetic
- VR with intentional flatness — Characters that feel like paper cutouts but properly lit
What This Engine Would NOT Have
| Traditional Engine | Billboard Stylization Engine |
|---|---|
| Skeletal meshes | Flipbook animations or sprite sheets |
| Normal mapping | Normal maps still used, but as CNN input |
| PBR materials | Style embeddings |
| Shadow maps | CNN learns to fake shadows |
| LOD meshes | Resolution scaling on billboard textures |
| Occlusion culling | Still works (billboard bounds) |
Minimum Viable Experiment
PHASE 0: Proof of Concept
═════════════════════════
1. Single billboard asset:
• Hand-painted albedo (64×64)
• Normal map (from Blender bake or hand-painted)
2. Minimal CNN:
• Input: [albedo, normal, light_dir]
• Output: [shaded_albedo]
• Architecture: Tiny U-Net (~200K params)
• Trained on synthetic data (Blender renders)
3. Demo scene:
• One billboard
• Rotating light source
• Watch the shading respond
Success = "It looks like a 3D object even though it's flat"
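A plausible shape for that tiny U-Net, sketched in PyTorch (about 120K parameters as written, the same ballpark as the ~200K target; all sizes are illustrative). The ten input channels are albedo RGBA + tangent normals + the light direction broadcast per pixel:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Phase-0 sketch of the 'Tiny U-Net' above. Input: [B, 10, 64, 64]
    (albedo 4 + normals 3 + broadcast light_dir 3). Output: shaded RGBA."""
    def __init__(self, in_ch=10, out_ch=4):
        super().__init__()
        self.pool = nn.MaxPool2d(2)
        self.d1, self.d2 = self._block(in_ch, 16), self._block(16, 32)
        self.bott = self._block(32, 64)
        self.u2, self.c2 = nn.ConvTranspose2d(64, 32, 2, stride=2), self._block(64, 32)
        self.u1, self.c1 = nn.ConvTranspose2d(32, 16, 2, stride=2), self._block(32, 16)
        self.head = nn.Conv2d(16, out_ch, 1)

    @staticmethod
    def _block(i, o):
        return nn.Sequential(nn.Conv2d(i, o, 3, padding=1), nn.ReLU(),
                             nn.Conv2d(o, o, 3, padding=1), nn.ReLU())

    def forward(self, x):                      # x: [B, 10, 64, 64]
        s1 = self.d1(x)                        # 64×64, skip 1
        s2 = self.d2(self.pool(s1))            # 32×32, skip 2
        b = self.bott(self.pool(s2))           # 16×16 bottleneck
        y = self.c2(torch.cat([self.u2(b), s2], dim=1))   # up to 32×32
        y = self.c1(torch.cat([self.u1(y), s1], dim=1))   # up to 64×64
        return torch.sigmoid(self.head(y))     # RGBA in [0, 1]
```

A quick `sum(p.numel() for p in TinyUNet().parameters())` confirms the parameter budget before you commit to training.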