Possibly! But the point was getting hands-on experience with RL. That said, RL can learn preferences that are hard to label explicitly - "cuteness" is easier to score than to define in training examples.
What I've learned is that RL refines what SFT taught; it doesn't teach new concepts. If the base model doesn't know how to draw penguins, RL won't help much.
Yes, it would. But this was much more fun, and I really liked the idea of a model training itself!
I actually experimented with fine-tuning the model for the review task: I created a dataset of PNGs, ranked them on various criteria, and trained a "critique" version of the model. The improvement in critique quality was so minimal that I decided to stick with a single model.
I started with SmolVLM (much smaller) but couldn't get satisfactory results. My second choice was Pixtral 12B, but it pushed memory requirements much higher.
Qwen3-VL 8B hit the sweet spot: good enough quality, fits on reasonable hardware with 4-bit quantization, and it was the newest of the three.
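For context, 4-bit loading with transformers + bitsandbytes looks roughly like this - a minimal sketch, where the checkpoint name and model class are my assumptions rather than taken from the project code:

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig

# NF4 4-bit quantization: weights stored in 4 bits, compute in bf16.
# An 8B model's weights drop to roughly 5GB, leaving headroom for activations.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "Qwen/Qwen3-VL-8B-Instruct"  # assumed checkpoint name
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
```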
GRPO (Group Relative Policy Optimization) seemed like the most practical choice: a simpler API than PPO, more memory efficient, and I'd been hearing a lot about it.
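In TRL, a GRPO run boils down to a reward function plus a trainer. A minimal sketch of the pattern (the reward logic, model name, and hyperparameters here are illustrative stand-ins, not my actual setup):

```python
from trl import GRPOConfig, GRPOTrainer

def cuteness_reward(completions, **kwargs):
    # Stand-in reward: the real pipeline would render each SVG completion
    # to a PNG and have the critique model score it.
    return [1.0 if c.strip().startswith("<svg") else 0.0 for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen3-VL-8B-Instruct",  # assumed checkpoint name
    reward_funcs=cuteness_reward,
    args=GRPOConfig(
        output_dir="grpo-svg",
        num_generations=4,  # group size: completions sampled per prompt
        max_completion_length=2048,
    ),
    train_dataset=prompt_dataset,  # prompts reused from the SFT dataset
)
trainer.train()
```

GRPO scores each group of completions relative to each other, which is what makes a fuzzy signal like "cuteness" workable as a reward.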
Generator: 1200 prompt→SVG pairs. I started with 200 Claude-generated examples and augmented them to 1200. Available on HuggingFace.
I then reused a subset of those prompts for the RL run.
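If you want to poke at the data, loading it is one line (the column names in the comment are my guess; check the dataset card):

```python
from datasets import load_dataset

ds = load_dataset("yoavf/svg-animal-illustrations", split="train")
print(ds[0])  # expect something like {"prompt": "...", "svg": "<svg ...>"}
```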
RunPod - pay-per-use cloud GPUs. Found it really easy to get started with.
SFT: An RTX A4500 is quite enough, but I also used an RTX 3090/4090/5090 depending on availability.
RL: Needs ~48GB because both models (generator + critique) stay loaded. Used 2x RTX 4090.
Total cost for a full pipeline run: ~$2.50.
Generator SFT: ~30 minutes for 5 epochs (RTX 4090)
Generator RL: ~40 minutes for 16 steps of 4 generations each (2x RTX 5090)
Most of my time went into dataset creation and debugging, not waiting for training.
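For reference, the generator SFT step can be this compact with TRL plus a LoRA adapter - a sketch with illustrative hyperparameters, not my exact configuration:

```python
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model="Qwen/Qwen3-VL-8B-Instruct",  # assumed checkpoint name
    args=SFTConfig(
        output_dir="sft-svg",
        num_train_epochs=5,              # the 5 epochs mentioned above
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        bf16=True,
    ),
    train_dataset=ds,                    # the 1200 prompt→SVG pairs
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```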
About 3 weeks from first experiment to working pipeline (about 1-2 hours a day of work). Most of that time went into dataset creation, experimentation, and debugging.
The actual training runs are fast - it's everything else that takes time.
Total cost for the full final pipeline run: ~$2.
But I've spent over $45 on experimentation, trial and error, and debugging.
Yes! The code's on GitHub: github.com/yoavf/prompet
Dataset on HuggingFace: yoavf/svg-animal-illustrations, along with the final LoRA at yoavf/prompet-cute-pet
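To try the released adapter, loading it should look roughly like this - a sketch, assuming it targets the Qwen3-VL 8B base (the HuggingFace IDs are from above; everything else is illustrative):

```python
from peft import PeftModel
from transformers import AutoModelForImageTextToText, AutoProcessor

base_id = "Qwen/Qwen3-VL-8B-Instruct"  # assumed base checkpoint
base = AutoModelForImageTextToText.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(base, "yoavf/prompet-cute-pet")
processor = AutoProcessor.from_pretrained(base_id)
```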
Cool, feel free to open an issue on the GitHub repo, or shoot me a line on X/Twitter.