PROMPET

an experiment in

reinforcement learning

FAQ

Wouldn't SFT alone get you the same (or better) results?

Possibly! But the point was getting hands-on experience with RL. That said, RL can learn preferences that are hard to label explicitly - "cuteness" is easier to score than to define in training examples.

What I've learned is that RL refines what SFT taught; it doesn't teach new concepts. If the base model doesn't know how to draw penguins, RL won't help much.
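Scoring rather than defining "cuteness" means the critique's free-text output has to become a scalar reward at some point. A minimal sketch of that step, assuming the critique replies with a rating somewhere in its text (`parse_reward` is a hypothetical helper, not the repo's actual code):

```python
import re

def parse_reward(critique: str, lo: float = 1.0, hi: float = 10.0) -> float:
    """Turn a free-text critique into a scalar reward by grabbing the
    first number in the reply and clamping it to the expected rating
    range. Falls back to the minimum if no score is found."""
    match = re.search(r"\d+(?:\.\d+)?", critique)
    if match is None:
        return lo
    return min(hi, max(lo, float(match.group())))
```

The fallback matters in practice: a VLM will occasionally ramble without emitting a number, and a crash mid-RL-run is worse than a pessimistic default.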

Why just one model? Wouldn't it be more effective to use a more powerful hosted VLM for the critique?

Yes, it would. But this was much more fun, and I really liked the idea of a model training itself!

I actually experimented with fine-tuning the model for the review task: I created a dataset of PNGs, ranked them on various criteria, and trained a "critique" version of the model. The improvement in critique quality was so minimal that I decided to stick with a single model.

Why Qwen3-VL 8B?

I started with SmolVLM (much smaller) but couldn't get satisfactory results. My second choice was Pixtral 12B, but its memory requirements were much higher.

Qwen3-VL 8B hit the sweet spot: good enough quality, fits on reasonable hardware with 4-bit quantization, and it was newer.
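The "fits with 4-bit quantization" claim is easy to sanity-check with back-of-envelope arithmetic: weight memory is roughly parameter count times bits per parameter, plus some slack for activations and framework overhead. A rough estimator (the 1.2 overhead factor is a guess, not a measurement):

```python
def model_memory_gb(n_params: float, bits_per_param: float,
                    overhead: float = 1.2) -> float:
    """Rough VRAM estimate: parameter bytes, times a fudge factor for
    activations, KV cache, and framework overhead."""
    return n_params * bits_per_param / 8 / 1024**3 * overhead

fp16 = model_memory_gb(8e9, 16)  # ~17.9 GB: tight even on a 24GB card
nf4 = model_memory_gb(8e9, 4)    # ~4.5 GB: comfortable almost anywhere
```

This is weights only; actual usage during training is higher once optimizer state and gradients for the trainable parameters are added.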

Why GRPO for RL?

GRPO (Group Relative Policy Optimization) seemed like the most practical choice: simpler API than PPO, more memory efficient, and I've been hearing a lot about it.
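The core trick that makes GRPO simpler than PPO is that it samples a group of completions per prompt, scores each one, and uses each reward's deviation from the group mean (normalized by the group's standard deviation) as the advantage, so no separate value model is needed. A minimal sketch of that group-relative computation (illustrative only, not the repo's training code):

```python
import statistics

def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantages: each completion's reward relative to its
    group's mean, normalized by the group's standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# One prompt, four sampled SVGs scored by the critique model:
advs = group_advantages([2.0, 8.0, 5.0, 5.0])
```

Completions above the group mean get positive advantages and are reinforced; those below get negative ones. Since only the generator's samples are compared with each other, there's nothing to fit besides the policy itself, which is where the memory savings over PPO come from.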

How big was your dataset?

Generator: 1200 prompt→SVG pairs. Started with 200 Claude-generated examples, augmented to 1200. Available on HuggingFace.

I then reused a subset of those prompts for the RL run.
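For concreteness, a prompt→SVG pair can be stored as one JSON record per line; this is a hypothetical record shape, and the published HuggingFace dataset may use different field names:

```python
import json

# Hypothetical record shape; the actual dataset schema may differ.
record = {
    "prompt": "a cute penguin sliding on ice",
    "svg": '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 64 64">...</svg>',
}
line = json.dumps(record)  # one JSONL line per prompt-SVG pair
```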

Where did you run this? What GPUs?

RunPod, a pay-per-use cloud GPU service. I found it really easy to get started with.

SFT: An RTX A4500 is quite enough, but I also used an RTX 3090/4090/5090 depending on availability.

RL: Needs ~48GB because both models (generator + critique) stay loaded. Used 2x RTX 4090.

How long does training take?

Generator SFT: ~30 minutes for 5 epochs (RTX 4090)

Generator RL: ~40 minutes for 16 steps of 4 generations each (2x RTX 5090)

Most of my time went into dataset creation and debugging, not waiting for training.

How long did this take to build?

About 3 weeks from first experiment to working pipeline (about 1-2 hours a day of work). Most time went into:

  • Dataset creation (generating + labeling critique examples)
  • Debugging quantization + GRPO issues
  • Building deployment automation

The actual training runs are fast - it's everything else that takes time.

How much did it cost?

Total cost for the full final pipeline run: ~$2.

But I spent over $45 on experimentation, trial and error, and debugging.

Is the code available?

Yes! The code's on GitHub: github.com/yoavf/prompet

Dataset on HuggingFace: yoavf/svg-animal-illustrations, along with the final LoRA at yoavf/prompet-cute-pet.

I have more questions!

Cool! Feel free to open an issue on the GitHub repo, or drop me a line on X/Twitter.