1:
depthwise convolution

input: 56x56x32
kernel: 3x3
stride: 1
padding: 1
output: 56x56x32

a) # of params:
	- (3 * 3 * 32) = 288

b) # FLOPs of forward pass:
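	- one kernel application (all 32 channels, 1 multiply + 1 add per weight): 2 * 288 = 576 FLOPs
	- kernel (3x3) applications on (56x56) image: padding 1 and stride 1 keep the output at 56x56, so 56 * 56 = 3136 applications
	- total: 2 * 288 * 56 * 56 = 1806336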
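
The depthwise counts can be checked with a short script. This is a minimal sketch, assuming no bias term, 2 FLOPs per multiply-accumulate, and the standard convolution output-size formula; the function name `depthwise_conv_cost` is hypothetical and not part of the exercise.

```python
def depthwise_conv_cost(h, w, c, k, stride=1, padding=0):
    """Return (params, flops, out_h, out_w) for a depthwise conv without bias."""
    # Standard output-size formula for a convolution.
    out_h = (h + 2 * padding - k) // stride + 1
    out_w = (w + 2 * padding - k) // stride + 1
    # One k x k filter per input channel, no bias (assumption).
    params = k * k * c
    # Each output element needs k*k multiply-accumulates, counted as 2 FLOPs each.
    flops = 2 * k * k * c * out_h * out_w
    return params, flops, out_h, out_w


params, flops, out_h, out_w = depthwise_conv_cost(56, 56, 32, 3, stride=1, padding=1)
print(params, flops, out_h, out_w)  # 288 1806336 56 56
```

With padding 0 the same function gives a 54x54 output, which shows how much the application count depends on the padding setting.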

2:
scaled dot-product attention

inputs: 64x128
Q, K, V: 64x128
Wq, Wk, Wv: 128x128
Wo: 128x128

a) # of params:
	- 4 * 128 * 128 = 65536

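b) # FLOPs of forward pass:
	b1) Q/K/V projection: three (64x128) * (128x128) matmuls
		- 3 * 2 * 64 * 128 * 128 = 6291456
	b2) attention score (Q*K^T): (64x128) * (128x64) -> S: 64x64
		- 2 * 64 * 128 * 64 = 1048576
	b3) scaling by 1/sqrt(d), d = 128: one multiply per element of S
		- 64 * 64 = 4096
	b4) softmax(S)*V: (64x64) * (64x128)
		- 2 * 64 * 64 * 128 = 1048576
	b5) output projection: (64x128) * (128x128)
		- 2 * 64 * 128 * 128 = 2097152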
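
The per-step attention counts can be tallied with a small script. This is a minimal sketch, assuming a single head, no biases, 2 FLOPs per multiply-accumulate, and one multiply per score for the scaling step; `matmul_flops` and `attention_cost` are hypothetical helper names. The softmax itself is left out of the total.

```python
def matmul_flops(m, k, n):
    """FLOPs for an (m x k) @ (k x n) product, counting 2 per multiply-accumulate."""
    return 2 * m * k * n


def attention_cost(n, d):
    """Return (params, per-step FLOPs) for single-head attention on n tokens of width d."""
    steps = {
        "qkv_projection": 3 * matmul_flops(n, d, d),   # X @ Wq, X @ Wk, X @ Wv
        "scores": matmul_flops(n, d, n),               # Q @ K^T -> n x n
        "scaling": n * n,                              # one multiply per score
        "weighted_values": matmul_flops(n, n, d),      # softmax(S) @ V
        "output_projection": matmul_flops(n, d, d),    # result @ Wo
    }
    params = 4 * d * d  # Wq, Wk, Wv, Wo without biases (assumption)
    return params, steps


params, steps = attention_cost(64, 128)
for name, flops in steps.items():
    print(f"{name}: {flops}")
print("params:", params)
print("total (excluding softmax):", sum(steps.values()))  # 10489856
```

The projections dominate here because d = 128 is larger than the sequence length of 64; with longer sequences the two n x n products grow quadratically and take over.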

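c) softmax operation types and count (row-wise over S: 64x64)
	- exp: 1 per element, 64 * 64 = 4096 (counted as 1 FLOP each)
	- sum: 1 per row over that row's 64 elements, 64 * 63 = 4032 additions
	- divide: 1 per element by its row sum, 64 * 64 = 4096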
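
The softmax operation counts can be sketched the same way. This assumes a plain row-wise softmax over the 64x64 score matrix (exp, row sum, divide) with no max-subtraction step for numerical stability; `softmax_op_counts` is a hypothetical helper name.

```python
def softmax_op_counts(rows, cols):
    """Count elementary operations for a row-wise softmax on a rows x cols matrix."""
    return {
        "exp": rows * cols,        # one exponential per element
        "add": rows * (cols - 1),  # summing cols values takes cols - 1 additions per row
        "div": rows * cols,        # each element divided by its row sum
    }


counts = softmax_op_counts(64, 64)
print(counts)  # {'exp': 4096, 'add': 4032, 'div': 4096}
```

A numerically stable variant would add a per-row max and one subtraction per element, which increases the counts but not their order of magnitude.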