March 2024

Lecture Video Retrieval-Augmented Generation Framework

Project Overview

This project aimed to enhance automatic question answering within educational lecture videos by developing a multimodal Retrieval-Augmented Generation (RAG) framework. With the rapid growth of online educational video, learners often struggle to locate precise information efficiently.

Cover Photo

Objectives

  • Efficient Information Retrieval: Quickly pinpoint accurate answers both across a library of lengthy lecture recordings and within an individual video.
  • Multimodal Integration: Leverage both the audio track (spoken content) and the visual track (on-screen text) of each lecture.
  • Question Answering: Go beyond returning relevant videos and passages by generating answers grounded in the video's contents.

Solution

  1. Advanced Multimodal RAG Framework: Developed a retrieval mechanism that combines automatic speech recognition (ASR) transcripts with on-screen text extracted via OCR to gather comprehensive contextual information.

System Diagram

  2. Contextual Re-ranking: Integrated a re-ranking step that prioritizes the most relevant retrieved segments, improving answer relevance and accuracy.

  3. Interactive User Interface: A prototype user interface lets learners pose questions and receive detailed, accurate answers alongside the retrieved video segments.

Interactive UI
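The retrieve-then-re-rank flow in steps 1 and 2 above can be sketched as follows. All names here (`first_stage_score`, `rerank_score`, `answer_context`) and the lexical scoring are illustrative stand-ins, not the published implementation: a cheap token-overlap pass stands in for the first-stage retriever, and a TF-IDF-weighted score stands in for the learned re-ranker. Each "segment" string merges ASR transcript text with OCR slide text.

```python
from collections import Counter
import math

def tokenize(text: str) -> list[str]:
    return text.lower().split()

def first_stage_score(query: str, segment: str) -> int:
    # Recall-oriented first pass: count of segment tokens shared with the query.
    q = set(tokenize(query))
    return sum(1 for t in tokenize(segment) if t in q)

def build_idf(segments: list[str]) -> dict[str, float]:
    # Inverse document frequency over all segments, so rare terms weigh more.
    n = len(segments)
    df = Counter()
    for seg in segments:
        df.update(set(tokenize(seg)))
    return {t: math.log(n / d) + 1.0 for t, d in df.items()}

def rerank_score(query: str, segment: str, idf: dict[str, float]) -> float:
    # Precision-oriented second pass: overlap weighted by IDF, so informative
    # terms dominate generic ones.
    q = set(tokenize(query))
    counts = Counter(tokenize(segment))
    return sum(counts[t] * idf.get(t, 0.0) for t in q)

def answer_context(query: str, segments: list[str], k: int = 3, top: int = 1) -> list[str]:
    """Retrieve k candidates cheaply, then re-rank and keep the best `top`
    segments as context for the answer-generating model."""
    idf = build_idf(segments)
    candidates = sorted(segments,
                        key=lambda s: first_stage_score(query, s),
                        reverse=True)[:k]
    return sorted(candidates,
                  key=lambda s: rerank_score(query, s, idf),
                  reverse=True)[:top]
```

The two-stage design mirrors the system diagram: a broad, cheap retrieval pass narrows the search space, and a more expensive re-ranking pass orders the survivors before they are handed to the language model as context.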

Outcomes

  • Improved Answer Quality: Outperformed traditional standalone language models by effectively incorporating multimodal context.
  • Enhanced Efficiency: Reduced time spent searching through videos, enabling learners to quickly locate relevant information.
  • Academic Recognition: Findings published in Springer's Lecture Notes in Computer Science, contributing valuable insights to multimodal AI applications in education.