13 — Snake RL¶
Beyond neural rendering
Vultorch isn't just for NeRF and 3DGS. Anything you can write into a torch.Tensor — reinforcement learning, physics simulation, signal processing — can be visualized at 60 fps with zero-copy GPU tensors.
New friends in this chapter¶
| Name | What it does | Analogy |
|---|---|---|
| Snake environment | A grid-based game, pure Python + deque | OpenAI Gym, but 50 lines |
| DQN agent | 3-layer MLP that maps observations → Q-values | The simplest deep RL algorithm |
| Experience replay | Store transitions, sample random batches | A deque that makes RL stable |
| ε-greedy | Random moves with decaying probability | Exploration vs exploitation |
| Q-value heatmap | Color-coded grid showing the agent's "thinking" | Like an attention map, but for RL |
| Manual mode | Take over the snake with buttons | Debug your env by playing it yourself |
Why Snake?¶
Snake is the perfect RL demo:
- Simple rules — even non-ML people understand it instantly
- Visual feedback — you can see the agent getting smarter on the 32×16 board
- Fast training — a DQN learns basic food-seeking in ~2000 episodes
- Fits in one file — environment + agent + training + visualization
And it proves an important point: Vultorch is a general-purpose GPU visualization tool, not just a neural rendering library.
The environment — 50 lines of pure Python¶
class SnakeEnv:
    def reset(self):
        self.snake = deque([(ROWS // 2, COLS // 2)])
        self.dir = RIGHT
        self._place_food()
        ...
        return self._obs()

    def step(self, action):
        # action: 0=straight, 1=turn left, 2=turn right
        ...
        return obs, reward, done
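The elided step() body is where the reward shaping lives. As a rough sketch only — the reward values, the self.food attribute, and the _turn() helper are assumptions for illustration, not the file's actual code — the move/collide/eat logic looks like this:

# Illustrative sketch; constants and helper names are assumptions.
def step(self, action):
    self.dir = self._turn(self.dir, action)      # 0=straight, 1=left, 2=right
    head = (self.snake[0][0] + self.dir[0],
            self.snake[0][1] + self.dir[1])
    # Hitting a wall or the body ends the episode with a penalty.
    if not (0 <= head[0] < ROWS and 0 <= head[1] < COLS) or head in self.snake:
        return self._obs(), -10.0, True
    self.snake.appendleft(head)
    if head == self.food:        # ate the food: grow and reward
        reward = 10.0
        self._place_food()
    else:                        # ordinary move: pop the tail, small step penalty
        self.snake.pop()
        reward = -0.01
    return self._obs(), reward, False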
The observation is an 11-dimensional vector:
| Dims | Meaning |
|---|---|
| 0–2 | Danger ahead / left / right (wall or body) |
| 3–6 | Current direction (one-hot) |
| 7–10 | Food direction (up / right / down / left) |
This is enough for a small MLP to learn. No pixels, no CNN — just the essential spatial relationships.
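The chapter doesn't print _obs(), but a minimal sketch of how those 11 features could be assembled — the _danger() and _turn() helpers and the self.food attribute are assumptions — looks like this:

# Sketch of the 11-dim observation vector described in the table above.
def _obs(self):
    head = self.snake[0]
    obs = [
        # 0-2: danger straight / left / right of the current heading
        self._danger(self._turn(self.dir, 0)),
        self._danger(self._turn(self.dir, 1)),
        self._danger(self._turn(self.dir, 2)),
        # 3-6: current direction, one-hot (up, right, down, left)
        float(self.dir == UP), float(self.dir == RIGHT),
        float(self.dir == DOWN), float(self.dir == LEFT),
        # 7-10: which side of the head the food is on
        float(self.food[0] < head[0]),   # food is above
        float(self.food[1] > head[1]),   # food is to the right
        float(self.food[0] > head[0]),   # food is below
        float(self.food[1] < head[1]),   # food is to the left
    ]
    return torch.tensor(obs, dtype=torch.float32)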
The DQN agent — a 3-layer MLP¶
class DQN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(11, 128), nn.ReLU(True),
            nn.Linear(128, 128), nn.ReLU(True),
            nn.Linear(128, 3),   # straight, left, right
        )

    def forward(self, x):
        return self.net(x)
Three actions (straight / turn left / turn right) instead of four (up/down/left/right) — this makes learning much easier because the agent doesn't need to learn "don't reverse into yourself".
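One way to implement that relative turn — the actual helper in the file may differ; this is an assumed sketch — is a modular index into a clockwise list of directions:

# Directions listed clockwise, so +1 is a right turn and -1 is a left turn.
DIRS = [UP, RIGHT, DOWN, LEFT]      # e.g. (-1, 0), (0, 1), (1, 0), (0, -1)

def turn(direction, action):
    """action: 0 = straight, 1 = turn left, 2 = turn right."""
    i = DIRS.index(direction)
    if action == 1:
        i = (i - 1) % 4             # counter-clockwise
    elif action == 2:
        i = (i + 1) % 4             # clockwise
    return DIRS[i]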
Experience replay and training¶
class ReplayBuffer:
    def __init__(self, cap=50000):
        self.buf = deque(maxlen=cap)

    def push(self, s, a, r, s2, done):
        self.buf.append((s, a, r, s2, done))

    def sample(self, n):
        batch = random.sample(self.buf, min(n, len(self.buf)))
        ...
The DQN training step is textbook:
def train_dqn():
    s, a, r, s2, d = buf.sample(BATCH)
    with torch.no_grad():
        q2 = target(s2).max(dim=1).values
        y = r + GAMMA * q2 * (1 - d)
    q = policy(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.smooth_l1_loss(q, y)
    opt.zero_grad(); loss.backward(); opt.step()
A target network (updated every 500 steps) stabilizes learning. ε decays from 1.0 to 0.01 — the agent starts fully random and gradually shifts to exploiting what it's learned.
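Both of those pieces are one-liners inside the training loop. A sketch — only the 500-step period and the 1.0 → 0.01 range come from the chapter; step_count and the decay factor are assumptions:

# Periodically copy the online network into the frozen target network.
if step_count % 500 == 0:
    target.load_state_dict(policy.state_dict())

# Multiplicative epsilon decay toward a floor of 0.01 (decay factor assumed).
S["eps"] = max(0.01, S["eps"] * 0.9995)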
Rendering the game board¶
def render_game():
    board = torch.zeros(ROWS, COLS, 3)
    board[:, :] = torch.tensor([0.05, 0.05, 0.08])        # dark background
    # Snake body (gradient from bright to dark green)
    for i, (r, c) in enumerate(env.snake):
        if i == 0:
            board[r, c] = torch.tensor([1.0, 0.9, 0.0])   # head: yellow
        else:
            t = 1.0 - i / max(len(env.snake), 1) * 0.5
            board[r, c] = torch.tensor([0.0, t, 0.2])
    # Food: red (fr, fc is the food cell's row/column)
    board[fr, fc] = torch.tensor([1.0, 0.15, 0.15])
    # Convert the RGB board to RGBA (elided), then copy into the displayed tensor
    game_t.copy_(cpu_rgba.to(disp_dev))
The grid is 32×16 (a 2:1 landscape ratio). The filter="nearest" setting on the canvas prevents blurring — each grid cell is a crisp pixel, just like in the Conway's Game of Life example.
Q-value heatmap¶
The Q-value panel visualizes the agent's decisions:
- Green = agent wants to go straight
- Blue = agent wants to turn left
- Orange = agent wants to turn right
- The top-left 3 cells show Q-value magnitude for each action
This is functionally similar to an attention map in a transformer — it shows you what the model is thinking, not just what it does.
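A sketch of what render_qvalues() can look like — the colors follow the list above, but the panel layout and tensor names here are assumptions, not the file's actual code:

# Sketch: tint a small panel by the currently preferred action.
ACTION_COLORS = torch.tensor([
    [0.1, 0.8, 0.2],   # straight -> green
    [0.2, 0.4, 1.0],   # turn left -> blue
    [1.0, 0.6, 0.1],   # turn right -> orange
])

def render_qvalues():
    with torch.no_grad():
        q = policy(obs)                       # (3,) Q-values for the current state
    panel = torch.zeros(ROWS, COLS, 3)
    panel[:, :] = ACTION_COLORS[q.argmax()]   # color by the best action
    # Top-left 3 cells: per-action magnitude, normalized to [0, 1]
    mag = (q - q.min()) / (q.max() - q.min() + 1e-8)
    for i in range(3):
        panel[0, i] = ACTION_COLORS[i] * mag[i]
    # ... convert to RGBA and copy into the heatmap tensor, as in render_game()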
Manual mode¶
Check the Manual Mode box and three buttons appear: Left, Fwd, Right. Now you play the snake. This is invaluable for debugging RL environments — if you can't solve the game yourself, the agent probably can't either.
The training loop¶
while view.step():
    for _ in range(S["speed"]):
        # ε-greedy action selection
        if random.random() < S["eps"]:
            a = random.randint(0, 2)
        else:
            a = policy(obs).argmax().item()
        next_obs, reward, done = env.step(a)
        buf.push(obs, a, reward, next_obs, float(done))
        train_dqn()
        obs = next_obs          # advance to the next state
        if done:
            obs = env.reset()
    render_game()
    render_qvalues()
    view.end_step()
The Steps/Frame slider controls how many environment steps happen per rendered frame. Crank it up to 50 for faster training; set it to 1 to watch every move in slow motion.
Full code¶
What just happened?¶
In one Python file you built:
- A Snake game environment with reward shaping
- A DQN agent with experience replay and target network
- Live visualization — game board + Q-value heatmap + reward curves
- Manual mode — play the game yourself to debug the env
- ε-greedy exploration with live epsilon display
All rendered at 60 fps via Vultorch's zero-copy tensor display. No separate Gym renderer, no matplotlib, no logging to disk.
The takeaway: if your data lives in a tensor, Vultorch can visualize it — whether that's neural radiance fields, Gaussian splats, cellular automata, or a snake learning to find food.
Key takeaways¶
| Concept | Code | Purpose |
|---|---|---|
| Snake env | SnakeEnv class | 32×16 grid RL environment |
| DQN | nn.Sequential(11 → 128 → 128 → 3) | Deep Q-Network |
| Replay buffer | deque(maxlen=50000) | Stable off-policy learning |
| ε-greedy | random() < eps | Exploration vs exploitation |
| Nearest filter | filter="nearest" | Pixel-perfect grid display |
| Q-value heatmap | Color-coded agent decisions | Visual debugging |
| Manual mode | checkbox("Manual Mode") | Debug the environment |
| step()/end_step() | Training-loop-owned rendering | RL loop controls pace |