Combined feature representations, VQA, captioning, saliency maps

Learning Goals
- Understand tradeoffs between various architectures for combining textual and visual feature representations
- Understand advanced usages of attention

Exercises
- fast.ai: 10-11: DeViSE
- cs231n: 3-3 (15%): Network Visualization: Saliency maps, Class Visualization, and Fooling Images
- cs231n: 3-2 (30%): Image Captioning with LSTMs
- cs231n: 3-1 (25%): Image Captioning with Vanilla RNNs