Narsee Monjee Institute of Management Studies - India
Video content is growing exponentially, making effective search within videos a crucial challenge. Traditional video search engines rely largely on metadata (titles, tags, descriptions) and transcripts, which often fail to capture the rich visual and auditory content of videos. This paper proposes an AI-based in-video search tool that indexes the actual video content to enable granular search by scene, object, action, face, and speech. We integrate multiple deep learning models in a unified pipeline: object and face recognition to identify visual entities, action recognition to detect activities, and speech-to-text to transcribe spoken dialogue. The system automatically generates timestamped metadata and summaries for video segments, allowing content-based queries (e.g., “Robin Williams eating food scene”) that go beyond traditional metadata search. We evaluate the accuracy of each component and compare the overall system against existing solutions, including text-metadata search and comparable research prototypes. Experimental results show that our content-based approach significantly improves recall of relevant moments in videos, demonstrating the advantages of AI-driven video search. We discuss challenges such as processing cost and imperfect AI predictions, and suggest future improvements such as expanded genre classification and multimodal semantic querying. The proposed system illustrates a step towards more intelligent video search engines that understand video content in depth.
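To make the pipeline's output concrete, the minimal Python sketch below shows one plausible shape for the timestamped metadata the system generates and how a content-based query could match against it. The `Segment` schema and `search` function are illustrative assumptions for this sketch, not the paper's actual implementation; the recognition models themselves are omitted and only their labeled outputs are shown.

```python
from dataclasses import dataclass, field

# Hypothetical schema for the timestamped metadata the pipeline emits.
# Each recognizer (object/face, action, speech-to-text) contributes
# labels for one time-bounded segment of the video.
@dataclass
class Segment:
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    objects: list[str] = field(default_factory=list)  # object/face labels
    actions: list[str] = field(default_factory=list)  # action labels
    transcript: str = ""                              # speech-to-text output

def search(index: list[Segment], query_terms: list[str]) -> list[Segment]:
    """Return segments whose combined metadata covers every query term.

    A production system would likely use learned embeddings for semantic
    matching; plain keyword matching is enough here to show how
    content-based, timestamped search differs from title/tag search.
    """
    hits = []
    for seg in index:
        haystack = " ".join(seg.objects + seg.actions + [seg.transcript]).lower()
        if all(term.lower() in haystack for term in query_terms):
            hits.append(seg)
    return hits

# Toy index illustrating the query from the abstract.
index = [
    Segment(12.0, 18.5, objects=["Robin Williams", "plate"],
            actions=["eating"], transcript="this is delicious"),
    Segment(40.0, 47.0, objects=["Robin Williams"],
            actions=["walking"], transcript="let's go"),
]
for seg in search(index, ["Robin Williams", "eating"]):
    print(f"match at {seg.start:.1f}s to {seg.end:.1f}s")  # -> 12.0s to 18.5s
```

Because every match carries its own start and end times, the tool can jump the viewer directly to the relevant moment rather than merely returning the whole video, which is the key practical difference from metadata-only search.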