March 2024
Lecture Video Retrieval-Augmented Generation Framework
Project Overview
This project developed a multimodal Retrieval-Augmented Generation (RAG) framework to improve automatic question answering over educational lecture videos. As online educational video libraries grow rapidly, learners often struggle to locate precise information efficiently.
Objectives
- Efficient Information Retrieval: Quickly pinpoint accurate answers both across a library of lengthy lecture recordings and within individual videos.
- Multimodal Integration: Leverage audio and visual data.
- Question Answering: Go beyond locating relevant videos and passages; generate answers grounded in the video's content.
Solution
- Advanced Multimodal RAG Framework: Developed a retrieval mechanism that combines audio transcripts (ASR) and text extracted from the visuals (OCR) to gather comprehensive contextual information.
- Contextual Re-ranking: Integrated a re-ranking step that prioritizes the most relevant retrieved segments, improving answer relevance and accuracy.
- Interactive User Interface: Built a prototype interface that lets learners pose questions and receive detailed, accurate answers alongside the retrieved video segments.
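The pipeline above can be sketched in miniature. This is an illustrative outline only, not the project's actual implementation: the `Segment` structure, the token-overlap scorer (standing in for dense embedding similarity), and the single-scorer re-ranker (where a real system would use a cross-encoder) are all simplifying assumptions.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    video_id: str
    start: float   # offset into the video, in seconds
    text: str      # ASR transcript snippet or OCR'd slide text
    source: str    # "asr" or "ocr"

def score(query: str, text: str) -> float:
    # Placeholder relevance score: token overlap. A production system
    # would use dense embeddings from a sentence encoder instead.
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / (len(q) or 1)

def retrieve(query: str, segments: list[Segment], k: int = 5) -> list[Segment]:
    # First-stage retrieval over the fused ASR + OCR segment index.
    return sorted(segments, key=lambda s: score(query, s.text), reverse=True)[:k]

def rerank(query: str, candidates: list[Segment], top_n: int = 3) -> list[Segment]:
    # Re-ranking stage: shown here with the same scorer for brevity;
    # in practice a cross-encoder reads query and segment jointly.
    return sorted(candidates, key=lambda s: score(query, s.text), reverse=True)[:top_n]

def build_prompt(query: str, context: list[Segment]) -> str:
    # Assemble the re-ranked segments into a grounded prompt for the LLM.
    lines = [f"[{s.source} @ {s.start:.0f}s] {s.text}" for s in context]
    return ("Answer using the lecture context below.\n"
            + "\n".join(lines)
            + f"\nQuestion: {query}")
```

A query would flow through `retrieve` → `rerank` → `build_prompt`, with the prompt handed to a language model; the segment timestamps let the interface jump to the matching point in the video.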
Outcomes
- Improved Answer Quality: Outperformed traditional standalone language models by effectively incorporating multimodal context.
- Enhanced Efficiency: Reduced time spent searching through videos, enabling learners to quickly locate relevant information.
- Academic Recognition: Findings published in Springer's Lecture Notes in Computer Science, contributing valuable insights to multimodal AI applications in education.