Agent moving from start to end position in maze grid

What if your LLM learns that doing nothing is better than trying? That’s one of the things I ran into while fine-tuning an LLM to solve grid mazes.

I took a language model, prepared the necessary data, set up logging of the metrics, and spent a LOT of time performing fine-tuning experiments, analyzing results, tweaking, fixing, ideating, etc. all in order to reach my goal. That goal was not reached, but the lessons remain. This blogpost is meant to serve anyone who is curious about training LLMs, to learn more, and for myself, to write my efforts in digital history.

TLDR

I spent 98 training runs trying to fine-tune Llama 3.2 1B with GRPO to solve 3x3 grid mazes. It didn’t work. The key takeaways:

Reward functions design is an art. Your model will find every possible shortcut you leave open. Mine learned to output 50 directions to brute-force reaching the goal, then learned to drop closing tags entirely when I started penalizing the long answers.
Start simple. Curriculum learning (smaller grids, fixed symbols, few reward functions) make it easier to pinpoint problems. Add complexity only after you’ve confirmed the model is learning something real.
Checking result graphs is not enough. I saw reward climb while the model just repeated “right” for every maze, it got lucky often enough on small grids to look like progress.
Understand the libraries you use. While working with ReasoningGym, I spotted an edge case in their maze generation and got a fix merged (PR #515). Reading library source code pays off, you’ll understand the tool better and sometimes get to give back.

Technology used:

Motivation

I’ve been interested in the idea of fine-tuning a “smallish” language model for a specific task. With the advent of newer, bigger, and better models, there’s going to be a real need for smaller, task-specific LLMs in the future. Sure, the big models can do a wide range of tasks right now, but what about the people who want to self-host things for specific use cases, without paying for APIs or renting big GPUs?

I wanted to combine two ideas: fine-tuning and reinforcement learning (RL). Lo and behold, there are already methods to train LLMs using reinforcement learning. Online, you’ll find all sorts of algorithms: REINFORCE, GRPO, PPO, DPO, Q-Learning, and more. One might say: what the hell is even all this stuff? At their core, these are just different approaches to training models. As with many things, people have different views on what works best. Some methods rely on preference data (ranked output pairs), while others don’t. Some require training an additional model alongside the one being fine-tuned, while others keep things simpler.

After exploring these options, I settled on GRPO because it doesn’t require preference data or a separate reference model, just reward functions I could define myself. This was a good fit for my use case: training an LLM to navigate a maze. I know it’s overkill when simpler pathfinding algorithms exist, but the idea of teaching an LLM to navigate an environment was exciting.

The most commonly used example for reinforcement learning is an agent learning to walk. It starts off knowing nothing, tumbling at every step, but as it learns through its reward functions (staying upright for longer, moving forward more), the agent can achieve to navigate an environment without us explicitly programming every situation, every if-else statement.

Setup

I wanted to fine-tune with a LoRA adapter due to memory limitations, so I started by looking for code examples that would run on Apple Silicon. I explored the MLX library, but it lacked LoRA support for GRPO, so I went with HuggingFace instead, which already had good examples and integrations to get started quickly.

I went with the Llama-3.2-1B-Instruct model since it could fit in my RAM. At first, I started off with gemma-3-4b-it, but training went slow and I ran into out of memory errors, so I decided to stick with Llama3.2 1B.

Setup details:

Transformers Reinforcement learning (TRL)
Weights & Biases for tracking (automatic TRL integration)
Checkpoints every 50–100 steps, kept last 3

Data prep

The first step is the data. The whole workflow I had in mind to do was:

Generate a dataset of prompts to pass to the model which include the maze and instructions on how to navigate it
Fine-tune the model (parse the model’s output and simulate walking through the maze)
Iterate with different configurations until we get to a good solution

For generating grid mazes, there must be some already made packages. I found a library called Reasoning Gym that already had ready functions to generate the prompt I need. I could just call a function specifying maze parameters such as number of rows, number of columns, and it created the prompt for me. Surprisingly, while playing around with it, I found a bug in their code. I started working on a fix and ended up in a late-night coding session to produce a pull request (PR). I got happy when it got accepted!

Example data sent to the LLM

Prompts (click to expand)

Example grid given to the LLM:

O X *
O O #
X O O

Legend: * = start, # = goal, O = open, X = blocked

System prompt:

You are an expert maze solver. Your task is to find the shortest path from the start to the destination point in a grid.

IMPORTANT - How to read the grid:
- The grid is displayed with row 0 at the TOP and the last row at the BOTTOM
- Columns go from left (column 0) to right
- To find a cell's position: count rows from the TOP (starting at 0) and columns from the LEFT (starting at 0)

First, think through your solution step by step inside <think></think> tags. Identify where * and # are located, then trace a path. Be brief (around 5 sentences).

Then, output only the sequence of directions (up, down, left, right) inside <answer></answer> tags.

Example format:
<think>
[Your reasoning here]
</think>
<answer>
[sequence of directions here]
</answer>

User prompt:

The grid is represented as a matrix with the following types of cells:
- *: your starting point
- #: your destination point
- O: an open cell
- X: a blocked cell

Therefore, you need to find the shortest path from * to #, moving only through open cells.

O X *
O O #
X O O

Experimentation

I created the necessary helper classes to generate any number of maze configurations I wanted.

For training with GRPO, we need to define reward functions. These guide the model on what’s good or bad behavior without us explicitly programming it which paths it should follow. The other Reinforcement learning methods, are suited for slightly different use cases, for example DPO is more suited when you want alignment on preferences, let’s say you have output pairs, and you tell the model, output A is better than output B. For our use case, GRPO was more suited since we could relatively easy define the reward function we had in mind.

Defining these reward functions is not trivial. Over the course of this work, I evolved the training function, iterated on them, included and excluded some.

In total, I had done 98 different training runs.

Reward functions

Here’s the full set of reward functions I ended up writing. I didn’t start with all of these at once, rather than, I created them throughout experimentation.

Function	Type	Score	Purpose
`got_to_end_reward`	Binary	+2 or 0	Did the path reach the goal? Foundation of every reward set.
`format_reward`	Structured	Up to +1.0 (+0.25 per tag)	Rewards correct `<think>`, `</think>`, `<answer>`, `</answer>` structure. Modified later to require exactly one of each.
`binary_got_closer`	Binary	+0.5 or 0	Did the path end closer to the goal than the start? Soft nudge in the right direction (Manhattan distance)
`distance_reward`	Proportional version of `binary_got_closer`	Up to +3.0	`3.0 × (distance_improvement / initial_distance)` - more reward the closer you get.
`simulate_path`	Granular	Variable	Step-by-step simulation: +0.5 per valid step, −1.0 per wall hit, +10.0 for reaching goal, −2.0 per extra move after reaching goal.
`length_reward`	Binary	+1 or 0	Does the number of directions outputted match the ground truth shortest path?
`validity_reward`	Proportional	Up to +3.0	Fraction of output tokens that are valid directions (`up`, `down`, `left`, `right`).
`no_answer_reward`	Penalty	−5.0	Fires when no answer is extractable at all.
`diversity_reward`	Regularizer	Variable	Penalizes identical completions across a GRPO batch to maintain gradient signal.

Experimentation journey

The missing chat template

On my very first test run, the model wasn’t following the prompt instructions at all, the model started producing Python code instead of outputting directions. Then I remembered: I forgot to add the chat template from the tokenizer, so none of the special tokens were being added. The model had no idea what the instructions were!

My first experiments started with varying grid sizes from 5x5 to 8x8, with varying dimensions (5x7, 6x6, etc.). This was a moving target for evaluation since we were varying the input grid mazes and this would give the model a harder time to adjust its internal “reasoning” to align with the reward functions.

I moved on to doing curriculum learning, meaning starting off with simpler inputs to see if the model can learn, and then moving on to harder mazes when the model has had opportunity to learn. This meant I just changed the grid size to always be 3x3. Also, initially before doing curriculum learning, I was using a different function from Reasoning Gym to generate my mazes, I was using maze.py, but this one generated different symbols for each cell type. This would also be like a moving target for the model when trying to learn which symbol means what, so I decided to go with a grid cell representation that is the same for all training samples. For that I used shortest_path.py instead which offered the same thing, just with a fixed symbols.

Silence beats failure

I started with the simulate_path reward function, but it did a lot of heavy lifting: it simulated wall hits, correct paths, reaching the end, all at once! in one occasion that it was actually working against itself: if the model produced no tags, the answer could not be parsed, meaning there was nothing to simulate, so it got 0 reward. But if the model did produce valid tags and performed adverse actions like walking into walls or moving away from the goal, it got a negative score like -3. Doing nothing made more sense to the model than actually trying something. Not answering was the optimal strategy. This was the first “gotcha!”.

The closing-tag exploit

Next, I wanted a richer reward signal so that we can craft the correct behavior of the model on multiple areas, reaching the goal, outputting correct XML tags, telling it how close it got to the end in case it failed. I ran many combinations of simulate_path, diversity_reward, format_reward, and length_reward reward functions. But reading the results was murky, it was difficult to tell which reward function caused improvements.

I then said okay let’s start from point 1 again, so I just started off with the binary got_to_end_reward. With just this one function, I noticed the model started generating more and more directions. Here is worth saying that my got_to_end_reward function initially had the logic that the last position where the agent is after moving using its directions, doesn’t have to be the goal position, it just has to move through it when navigating it. More steps meant more chances of accidentally landing on the goal. Model output showing an excessive number of directions to brute-force reaching the goal Here we can see that the model generated a lot of directions, and that the got_to_end_reward is 1, which means that the agent somehow managed to hit the goal on its crazy path, but now it outputs a lot of directions, does a spray-and-pray and hopes it would randomly reach the goal.

To combat this, I added length_reward alongside the existing got_to_end_reward to penalize outputs that were not the same number of directions as the real answer.

We can see at step 470 the length_reward working as it should, it outputs -9 since we have 9 directions more than what it actually takes to reach the end.

Shortly after that’s when another gotcha appeared: if the model didn’t output a closing tag, it couldn’t parse the answer and the length penalty couldn’t kick in, and by the reward function logic, if there was no parsable answer, it got a flat -3. But if it did close the tag with 30 directions inside, then the answer could be parsed and the penalty ballooned on wrong answers. The model figured this out fast, it learned to just not generate the closing tag.

Model output exploiting missing closing tag to avoid length penalty Opened view showing model learned to omit the closing answer tag At step 590, the model learned that without a closing tag, we have a fixed -2 reward.

This meant that I had to revisit the logic on how the reward functions handle unparsable answers. At first, I thought each function should specify severity for missing answers, but that scattered responsibility throughout all the reward functions, which is not elegant. Instead, each function should stay simple and unambiguous: only format_reward handles non-parsable outputs. I rewrote the logic to enforce this.

Likewise, it got me thinking, how big should the reward scales be? Should some reward functions have negative reward, should they start from 0? It was a delicate balance to create the reward scales as well, since what if for example format_reward assigns -100 for not finding the correct XML tags, while simulate_path assigns +10 for reaching the goal. In this situation, the model would be more “scared” to output badly formatted answers, rather than trying to find the goal. After some consultation, I went with having no negative rewards and just using a 0 as the baseline. This actually makes sense, since this means “I’m not punishing you for getting it wrong, but I’m also not reward you in any way”. This doesn’t update the model’s weights until it actually does a behavior that’s rewardable.

Side note: just for fun and laughs, I did one run where I put HUGE negative reward if we do not have a parsable answer. Due to GRPO having group level advantage, meaning a single completion’s gradient would not dominate too much instead, the group advantage is taken into consideration. I noticed very drastic changes in gradients when I had the first version of the simulate_path reward function. Here we did not have a clip of how far could the reward function go, since if the model outputted 100 directions, the reward function would simulate them all, and if they were all wrong, get like -100. And in the other completions in that same group, if we had more sane outputs from the model, then the reward could be like -3, and thus the advantage higher. This un-checked behavior for how far we could go in either positive or negative reward values caused instabilities during training.

The tag flood

Later there was a situation that showed me I needed to incorporate slightly different logic in the format_reward function. I was using _format_reward from policy-gradients which adds +0.25 if the completion contains a valid tag. The issue I saw was that sometimes the model started produced a LOT of valid tags, not just 1 pair of and . In this case, from which tags should we take to parse the output? The reward reached up to 1, which for the model meant valid tags are produced. But for us that wasn’t the case, the format was not right. I modified the reward function to only account for exactly one occurrence of each tag pair, to only give a reward when there is exactly one opening and one closing tag for both and .

Model output containing multiple duplicate answer tag pairs In this output we see that there are multiple pairs. Which one is the model telling us is the right path?

Side note: GRPO tricky~ness

GRPO is a bit tricky when we are dealing with sparse rewards. If all completions in a group have 0 reward, then there is no advantage, there’s no gradient update. This happened often during training, I tracked it via “frac_reward_zero_std”, the fraction of zero-reward batches. Early on, the model produces varied responses (it does have entropy) but nothing triggers the got_to_end_reward, leaving no training signal to push the model in the right direction.

Moving to hyperparameter search

After designing various reward functions and running manual experiments, I decided to try out sweeps, which is Weights & Biases’ term for automated hyperparameter search. I tested various learning rates and reward function combinations, running both LoRA fine-tuning and full fine-tuning.

With sweeps I could define a certain configuration of parameters and weights and biases would handle passing the parameter combinations to my training script and invoking it multiple times.

The parameters I explored were:

Sweep config with LoRA: learning rate values: [1e-3, 1e-4, 1e-5, 1e-6], reward set (this means which rewards functions would be used), temperature [0.4, 0.6, 0.8]
Sweep config with full fine-tune: learning rate values [1e-6, 1e-8, 1e-10] and reward sets

The learning rate results aligned with guidance from the “LoRA without regret” blogpost, which I’d used as a reference throughout. LoRA needs a much higher learning rate than a full fine-tune, roughly 10x higher.

Again, just for laughs, I tried a huge learning rate (0.01). Unsurprisingly, the model fell into an unrecoverable state. The weights simply changed too drastically and there is no chance to get the model to produce valid directions anymore.

There is a KL penalty term we can use in GRPO to prevent the model from drifting too far from the original model’s distribution. This requires loading a second copy of the model in memory, which I did not have the RAM for that.

With full fine-tune sweeps, many runs crashed due to out-of-memory errors. Counterintuitively, most crashes occurred on runs with only 1 reward function. This is completely flipped when you think about it. You would expect single-function runs to use less memory than multi-function runs, but the opposite happened.

Weights and Biases dashboard showing full fine-tune sweep results with several crashed runs

Successful and failed Wandb sweep runs

I did have some sweep runs with common parameters. But even after all these configurations, there was not a run with any kind of promising results, there are several possibilities for this:

Further experiments with reward functions, checkpointing: since it was very easy for the training to destabilize, meaning gradient norms diminishing, entropy getting too high or too low, more experiments specifically with varying the reward functions, starting off with 1 reward function, training until you reach some improvement, then before the model finds a reward hack, save the model and add another reward function.
Model limitation: the 1B model Llama 3.2 might just not have what it takes to learn the grid representation and accurately find a path from start to goal

Some areas are worth revisiting in future experiments.

Quick pitfalls

Throughout coding, I made technical mistakes that cost me some debugging time. (learn from them)

Your prompt wording really matters. After a few training runs, I noticed something off. Since I was using a prompt from the Reasoning Gym library with my own modifications, there was a leftover phrase: “Find the length of the shortest path”. This was in addition to my instructions to output the directions to the goal, so the model sometimes outputted directions, sometimes I saw it output just a number. During writing of this post, I saw that they fixed this bug in an official PR. I should have been faster this time to create a PR :P
I was not loading correctly the model at checkpoints, I noticed that after loading the model at a certain checkpoint, the learning rate starts off like it was from the beginning of the training. This was weird because I had a decay scheduler and looking at the wandb metrics I saw it shot up when continuing training, the model started with the base learning rate again. I fixed this so that the optimizer and scheduler are saved and loaded properly.
Since GRPO needs literal strings, I had to pass tokenize=False argument to the loaded tokenizer. You don’t want the input to GRPO to be token IDs like 12355, 5542, 2222, you want them to be pure prompt strings since GRPO generates a group of completions for the same input prompt.
add_generation_prompt=True is essential. The model must generate completions for the input prompts. With it set to False, the model just continues generating text without following instructions, it believes it needs to just continue responding.

Conclusions

I was at the edge of my computer’s limits and had to bend some algorithms’ intended use (e.g., fewer GRPO completions to be able to fit everything in memory). Starting with an even smaller model, or heck, even with supervised fine-tuning is a good option since I already have ground truth paths. Once that works, we can move to reinforcement learning to help the model understand 2D grid structure.

Right now, with the current setup and configuration, it was difficult to get Llama3.2 1B off the ground and improve it meaningfully for the maze solving task.

Practical tips for moving forward:

Give special thought to your reward functions. You may think your carefully crafted reward function will guide the model to achieving the goal easily, but it’s gonna find flaws in your logic where you least expect them. Start small, verify, improve.
Don’t trust reward scores alone. I had runs where the reward climbed, but the model had only learned to output one direction (e.g., “right”). Due to the small maze size, there was a decent chance the answer actually was “right”, so the reward climbed without real learning. A separate evaluation function on a held-out test set of mazes is essential to verify real learning.
Be ready to switch models. If extensive experimentation still yields poor results, consider switching models, some may be pre-trained on data with 2D grid representations, giving them a natural advantage.

If you’re thinking about starting your own RL fine-tuning journey, I hope this post helps you on your journey.

The biggest lesson: your model will exploit every loophole in your reward functions, so design them carefully and verify what the model is actually doing.

The full training code is available on GitHub.