In both the AccessWorld Spring issue and recent AccessWorld podcast episodes, I have been raving about the capabilities of Gemini 2.5, specifically its ability to provide detailed descriptions of videos. For this blog post, I wanted to share detailed information on how this process works and what you may encounter when using it. At the time of writing, the most recent version (and the one used for testing in this post) is Gemini 2.5 Pro Preview 06/05 (June 5th). The model is updated frequently, though at the moment, older versions (going back to April in my testing) are still available.
It’s easy to get up and running with Gemini 2.5. Just go to aistudio.google.com while logged into your Google account. Note that this page is primarily intended for developers, so it includes many advanced options you may not have encountered if you're mostly used to consumer-focused AI tools. Gemini 2.5 exposes tuning parameters such as temperature, but in my testing, leaving these at their defaults works perfectly well. These settings can matter more when experimenting with smaller or locally hosted models; for general use with a large, highly capable model like Gemini 2.5, most users won't need them.
Describing a video is straightforward. Find a YouTube video URL, copy it to your clipboard, and locate the edit field labeled "Type something or tab to select an example prompt." In that field, write something like "Please provide a detailed description of this YouTube video," then paste the URL from your clipboard. This automatically attaches the video to the prompt, and you’ll see its details displayed below the edit field.
Note that you only have a certain number of tokens to use, and everything you write or do consumes tokens, including adding videos. Look for the "Token Count" field, marked as a level-three heading, to see how many you have available. For me, the token count shows 1,048,576 tokens total per prompt. Videos consume varying amounts of tokens depending on length and complexity. My rough estimate is that it uses slightly under 300 tokens per second of video. So, depending on the specifics, you can potentially upload something close to an hour in length.
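As a rough sanity check, the arithmetic behind that "close to an hour" figure can be sketched in a few lines of Python. Keep in mind that the 300-tokens-per-second number is my own estimate from testing, not an official figure, and actual costs vary by video:

```python
# Back-of-the-envelope math for how much video fits in one prompt.
# TOKENS_PER_SECOND is my rough estimate, not an official rate.

TOKEN_BUDGET = 1_048_576    # tokens available per prompt in AI Studio
TOKENS_PER_SECOND = 300     # estimated cost per second of video

def max_video_minutes(prompt_tokens: int = 200) -> float:
    """Approximate longest video (in minutes) that fits alongside a short prompt."""
    remaining = TOKEN_BUDGET - prompt_tokens
    return remaining / TOKENS_PER_SECOND / 60

print(f"~{max_video_minutes():.0f} minutes of video per prompt")
# prints: ~58 minutes of video per prompt
```

That lands at roughly 58 minutes, which matches the "close to an hour" observation above.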
Once you’ve entered your prompt, select the "Run" option in Gemini to begin processing. You can monitor its progress in the "Thinking" section located above the prompt edit field.
It helps to be intentional about your prompts to get the type of information you're looking for. For example, when I describe video game gameplay videos, I often use Let’s Plays from other people and specifically ask Gemini to ignore the spoken audio, since it's usually someone talking over the game footage. You can have it focus on whatever you’d like; plain language works fine here. Personally, I like to clear the prompt every time I want something new from Gemini, rather than doing follow-ups in the same session, to start with a clean slate—especially when describing a video.
I’ve also found that Gemini works best with shorter videos; it seems able to provide more detail for them than for longer ones. The longer the video, the more likely I was to run into some sort of processing error, and processing times grew as well. Even when I didn’t encounter an error, Gemini tended to follow instructions less closely the longer the video ran; most commonly, it would simply stop describing the video partway through.
At the moment, the service seems to accept only YouTube URLs when it comes to links from the web, but you can also upload your own videos; note that uploads must go through Google Drive first. I was prompted to confirm that I had the rights to use any video I uploaded, and it wasn’t clear (to my legally untrained self) whether this referred to the result Gemini generated or to the video itself. So, I proceeded cautiously and only uploaded a video I had recorded myself for testing. Much like the YouTube videos, it was well described.
Though not directly related to audio description, I also experimented with having Gemini translate a video from Japanese to English, since I couldn’t read the subtitles. The video I used was quite chaotic, but Gemini seemed to do a good job and presented the information well. Unfortunately, I had no way of verifying the accuracy. Notably, Gemini also detected audio that wasn’t transcribed in the subtitles, which it provided alongside its translation. In segments where both the subtitles and Gemini’s translation were available, Gemini’s version seemed accurate, if a bit more literal than the more interpretive subtitles.
Though this technology is still very new, it shows a lot of promise, and I have been blown away by the accuracy and detail Gemini can already provide. So much of the video content consumed these days isn’t in theaters or on television; it comes from scrolling on your phone, whether on YouTube, as tested here, or on services like Instagram or TikTok. That content makes up a significant amount of social interaction and shared meaning in modern culture. Compared to TV shows or movies, it's extremely difficult to manually provide audio descriptions for the billions of hours of video generated daily. But with a feature like this, we're much closer to having full access to video content as blind and low vision users.
I would love to see this integrated directly into video platforms or as an overlay in the future, allowing for on-the-fly audio descriptions—even beyond the transcripts available now—for even greater accessibility.
Definitely give Gemini 2.5 a try and see what you think. It’s free and easy to get started, so there’s no downside! How did you find using the site? Do you have any tips or tricks for other users? Let us know on socials!