Faced with a baby screaming the house down and throwing food on the floor, frazzled parents may be surprised to hear that their beloved offspring is probably the smartest learner in the known universe.
The video stream, including jumbled-up images and sounds of parents, cats, play and toys, was then processed into 600,000 video frames and 37,500 transcribed “utterances” and fed into a neural network. The challenge was to match what Sam saw during approximately 1 per cent of his waking hours with the sounds he heard, to create a multimodal AI model.
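The matching step described here is typically done with a contrastive objective: embed each video frame and each utterance into a shared vector space, then train so that co-occurring frame/utterance pairs score higher than mismatched ones. A minimal NumPy sketch of that idea, purely illustrative (the function names, the symmetric cross-entropy form, and the temperature value are common conventions from CLIP-style models, not details taken from the NYU study):

```python
import numpy as np

def log_softmax(x, axis=1):
    # Numerically stable log-softmax.
    m = x.max(axis=axis, keepdims=True)
    return x - m - np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def contrastive_loss(frame_emb, utt_emb, temperature=0.07):
    """Symmetric contrastive loss over matched frame/utterance pairs.

    Row i of frame_emb and row i of utt_emb are assumed to be a
    co-occurring pair; every other row in the batch acts as a negative.
    """
    # L2-normalise so the dot product is cosine similarity.
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    u = utt_emb / np.linalg.norm(utt_emb, axis=1, keepdims=True)
    logits = f @ u.T / temperature   # pairwise similarities, scaled
    diag = np.arange(len(f))         # matched pairs sit on the diagonal
    # Cross-entropy in both directions: frame->utterance, utterance->frame.
    loss_f = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_u = -log_softmax(logits.T, axis=1)[diag, diag].mean()
    return (loss_f + loss_u) / 2
```

Trained this way, the model needs no explicit labels: the temporal co-occurrence of sights and sounds in the headcam footage supplies the supervision, which is why so little data can still teach it word-object associations.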
“We were very surprised that the model could exhibit a pretty remarkable degree of learning given the limited data it had,” Wai Keen Vong, the lead author of the NYU paper, told me in a video interview.