this is a note on different types of attention and different ways to implement them (sliding window, flash). for reference Q=query, K=key, V=value
multi head attention
the classic one, explained in the architecture basics and in how everything works in theory. uses multiple heads, each with its own Q, K and V
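a minimal sketch of the above, assuming NumPy, a single sequence and no mask. the names (multi_head_attention, w_q, w_k, w_v, w_o) and the sizes are made up for illustration, not taken from any library. the point is that every head gets its own Q, K, V projection weights.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o):
    """x: (seq, d_model); w_q, w_k, w_v: (n_heads, d_model, d_head);
    w_o: (n_heads * d_head, d_model). hypothetical shapes for illustration."""
    # every head projects x with its own weights -> (n_heads, seq, d_head)
    q = np.einsum("sm,hmd->hsd", x, w_q)
    k = np.einsum("sm,hmd->hsd", x, w_k)
    v = np.einsum("sm,hmd->hsd", x, w_v)
    # scaled dot-product attention, done independently per head
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # (n_heads, seq, seq)
    heads = softmax(scores) @ v                               # (n_heads, seq, d_head)
    # concatenate the heads along the feature axis, then mix them
    concat = heads.transpose(1, 0, 2).reshape(x.shape[0], -1)
    return concat @ w_o

rng = np.random.default_rng(0)
seq, d_model, n_heads, d_head = 5, 32, 4, 8
x = rng.standard_normal((seq, d_model))
w_q, w_k, w_v = (rng.standard_normal((n_heads, d_model, d_head)) for _ in range(3))
w_o = rng.standard_normal((n_heads * d_head, d_model))
out = multi_head_attention(x, w_q, w_k, w_v, w_o)
print(out.shape)  # (5, 32)
```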
multi query attention
all heads in a block share one K,V but Q is diff for each head. K,V memory drops by a factor of the number of heads, but quality takes the biggest hit of the three variants here
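rough arithmetic on why sharing K,V matters at inference time, when K,V for past tokens are cached. the model sizes below are hypothetical (32 layers, 32 query heads, head dim 128, 4096 tokens, 2 bytes per value) and kv_cache_bytes is just a helper name for this note.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    # factor 2 = one K and one V entry, stored per layer, per K,V head, per token
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# hypothetical model, one sequence
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
mqa = kv_cache_bytes(n_layers=32, n_kv_heads=1, head_dim=128, seq_len=4096)
print(mha / 2**20, "MiB vs", mqa / 2**20, "MiB")  # 2048.0 MiB vs 64.0 MiB
```

so under these assumptions the cache is 32x smaller with a single shared K,V, which is the number of heads.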
grouped query attention
each head has its own Q but K,V are shared within groups of heads. the groups are architectural and fixed, not learned on the fly. the quality drop sits in between multi head and multi query
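the three variants can be sketched as one function where only the number of K,V heads changes: equal to the number of query heads gives multi head, 1 gives multi query, anything in between gives the grouped version. this assumes NumPy, already projected Q, K, V and no mask. grouped_attention and the sizes are made up for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def grouped_attention(q, k, v):
    """q: (n_heads, seq, d); k, v: (n_kv_heads, seq, d)."""
    n_heads, seq, d = q.shape
    n_kv_heads = k.shape[0]
    # the grouping is fixed by the shapes, nothing is learned here
    assert n_heads % n_kv_heads == 0, "query heads must split evenly into groups"
    group_size = n_heads // n_kv_heads
    # repeat each K,V head so query head h reads K,V head h // group_size
    k = np.repeat(k, group_size, axis=0)
    v = np.repeat(v, group_size, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)  # (n_heads, seq, seq)
    return softmax(scores) @ v                      # (n_heads, seq, d)

rng = np.random.default_rng(0)
n_heads, seq, d = 8, 4, 16
q = rng.standard_normal((n_heads, seq, d))
shapes = {}
for name, n_kv in [("multi head", 8), ("grouped", 2), ("multi query", 1)]:
    k = rng.standard_normal((n_kv, seq, d))
    v = rng.standard_normal((n_kv, seq, d))
    shapes[name] = grouped_attention(q, k, v).shape
    print(name, "| K,V heads:", n_kv, "| output shape:", shapes[name])
```

the output shape is the same in all three cases, only the amount of K,V that has to be stored changes.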