1. baseline llm.c to pytorch for modded gpt
prev time: 45 minutes → improved time: 31.4 minutes
1. rotary embeddings:
A new class was added to use rotary embeddings instead of classical (absolute) positional embeddings; it is later used inside the CausalSelfAttention class.
RoPE has been shown to give great improvements in transformer architectures, since rotating queries and keys in high-dimensional (complex) space encodes the relative positions of tokens better than absolute encodings do (here is a great YouTube video if you want to learn more). Since the baseline was using fixed positional encoding, this should improve the model's capabilities nicely. A little speed is lost because of the extra sine and cosine computation, but the overhead should be extremely small. RoPE also helps if we want to extend the context length of the transformer, which wouldn't work with absolute positional embeddings. Computing the rotations can add time, but a cache is also added, so with a fixed sequence length the same precomputed values are reused, helping reduce the time.
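Here is a minimal sketch of what such a rotary module with a sin/cos cache could look like; the names (`Rotary`, `apply_rotary`) and the exact tensor layout are assumptions for illustration, not the actual code from the repo:

```python
import torch
import torch.nn as nn

class Rotary(nn.Module):
    # Hypothetical rotary-embedding module: precomputes inverse frequencies
    # and caches cos/sin for a given sequence length so a fixed seq length
    # reuses the same tables every step.
    def __init__(self, head_dim: int, base: float = 10000.0):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
        self.register_buffer("inv_freq", inv_freq)
        self.seq_len_cached = None
        self.cos_cached = None
        self.sin_cached = None

    def forward(self, x):
        # x: (batch, n_head, seq_len, head_dim); rebuild cache only if T changed
        T = x.shape[2]
        if T != self.seq_len_cached:
            t = torch.arange(T, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
            freqs = torch.outer(t, self.inv_freq)          # (T, head_dim/2)
            self.cos_cached = freqs.cos()[None, None, :, :]
            self.sin_cached = freqs.sin()[None, None, :, :]
            self.seq_len_cached = T
        return self.cos_cached, self.sin_cached


def apply_rotary(x, cos, sin):
    # rotate-half variant: pairs (x1, x2) -> (x1*cos - x2*sin, x1*sin + x2*cos)
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

Inside CausalSelfAttention this would be applied to the query and key heads before the attention matmul, roughly `cos, sin = self.rotary(q); q, k = apply_rotary(q, cos, sin), apply_rotary(k, cos, sin)`.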
2. checkpoint saving change:
just a new line
parser.add_argument("--save_every", type=int, default=5000, help="every how many steps to save the checkpoint")
added to save a checkpoint every 5000 steps, which should increase speed by a millisecond or so.
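As a rough sketch of how the flag might be wired into the training loop (the helper and variable names here are assumed, not taken from the actual script):

```python
import os
import torch

def maybe_save_checkpoint(model, optimizer, step, args, out_dir="ckpts"):
    # Save only when the step hits the --save_every interval.
    if args.save_every > 0 and step > 0 and step % args.save_every == 0:
        os.makedirs(out_dir, exist_ok=True)
        torch.save(
            {"model": model.state_dict(),
             "optimizer": optimizer.state_dict(),
             "step": step},
            os.path.join(out_dir, f"ckpt_{step:06d}.pt"),
        )

# in the training loop, after the optimizer step:
# maybe_save_checkpoint(model, optimizer, step, args)
```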