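1: depthwise convolution
   input: 56x56x32
   kernel: 3x3
   stride: 1
   padding: 1
   output: 56x56x32

   a) # of params (no bias):
      - 3 * 3 * 32 = 288
   b) # FLOPs of forward pass (1 multiply-add = 2 FLOPs):
      - one kernel application (one channel, one output position): 2 * 3 * 3 = 18 FLOPs
      - kernel (3x3) applications: with padding 1 and stride 1 the output stays 56x56, so there are 56 * 56 positions per channel, and 56 * 56 * 32 = 100352 applications in total
      - total: 18 * 100352 = 2 * 288 * 56 * 56 = 1806336

2: scaled dot-product attention
   inputs: 64x128 (n = 64 tokens, d = 128 features)
   Q, K, V: 64x128
   Wq, Wk, Wv: 128x128
   Wo: 128x128

   a) # of params (no bias):
      - 4 * 128 * 128 = 65536
   b) # FLOPs of forward pass (1 multiply-add = 2 FLOPs)
      b1) Q/K/V projection: three (64x128) * (128x128) products
          - 3 * 2 * 64 * 128 * 128 = 6291456
      b2) attention score (Q*K^T): (64x128) * (128x64) -> 64x64
          - 2 * 64 * 64 * 128 = 1048576
      b3) scaling by 1/sqrt(d): one multiplication per score
          - 64 * 64 = 4096
      b4) softmax(S)*V: (64x64) * (64x128) -> 64x128
          - 2 * 64 * 64 * 128 = 1048576
      b5) output projection: (64x128) * (128x128)
          - 2 * 64 * 128 * 128 = 2097152
   c) softmax operation types and count (applied row-wise to the 64x64 score matrix)
      - exp: 1 per element, 64 * 64 = 4096
      - sum: 63 additions per row, 64 * 63 = 4032
      - division: 1 per element, 64 * 64 = 4096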
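
The counts for both layers can be checked with a short script. This is only a sketch under stated assumptions: a multiply-add counts as 2 FLOPs, no layer has a bias term, attention is single-head with n = 64 and d = 128, and softmax is counted without the optional max-subtraction step. The function names `depthwise_conv_counts` and `attention_counts` are hypothetical helpers, not part of any library.

```python
def depthwise_conv_counts(h, w, channels, k, stride=1, padding=1):
    """Parameter and FLOP counts for a bias-free depthwise convolution."""
    # Standard output-size formula for a convolution.
    out_h = (h + 2 * padding - k) // stride + 1
    out_w = (w + 2 * padding - k) // stride + 1
    # One k x k filter per channel (assumption: no bias).
    params = k * k * channels
    # Each output element needs k*k multiply-adds, counted as 2 FLOPs each.
    flops = 2 * k * k * channels * out_h * out_w
    return {"output": (out_h, out_w, channels), "params": params, "flops": flops}


def attention_counts(n, d):
    """Counts for single-head attention on an n x d input with d x d projections."""
    # Wq, Wk, Wv, Wo (assumption: no bias).
    params = 4 * d * d
    flops = {
        "qkv_projection": 3 * 2 * n * d * d,  # three (n x d) @ (d x d) products
        "scores": 2 * n * n * d,              # (n x d) @ (d x n)
        "scaling": n * n,                     # one multiply per score
        "weighted_sum": 2 * n * n * d,        # (n x n) @ (n x d)
        "output_projection": 2 * n * d * d,   # (n x d) @ (d x d)
    }
    # Row-wise softmax over the n x n score matrix.
    softmax_ops = {"exp": n * n, "add": n * (n - 1), "div": n * n}
    return params, flops, softmax_ops


conv = depthwise_conv_counts(56, 56, 32, 3)
params, flops, softmax_ops = attention_counts(64, 128)
print(conv)
print(params, flops, softmax_ops)
```

Changing the FLOP convention (for example, counting a multiply-add as 1 FLOP) only halves the matrix-product terms; the parameter counts and the softmax operation counts are unaffected.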