Accessibility is always improving, but 2023 saw one of the most significant accessibility breakthroughs since the advent of the accessible smartphone. GPT4, produced by Open AI, is a Large Language Model (LLM) that can accept both text and images. In summary, you can converse with an LLM much like you would with a person, and it will respond in a manner closely approximating human interaction. Also, the most powerful LLMs such as Open AI's GPT and Google's Bard, perform various tasks only previously possible by people. Examples include programming, writing song lyrics, composing poems and prose, providing feedback on documents, editing and more. In addition to these capabilities, GPT-4 can also accept and understand images. This means that it has massive implications for accessibility. Open AI understood this, and made Be My Eyes their partner for their image recognition technology.
Essentially, you can take a picture or screenshot, and GPT-4 will describe the image in highly accurate and thorough detail. In addition, depending on what platform you are using, you can talk to GPT-4 and ask it questions to get more information about the image.
Note that the describer can and will get details wrong. For example, it nearly always thinks my guide dog's harness is a purse or some other leather object. In some cases it may qualify that it can't quite see something that it identifies wrongly, but other times it will be confident in its assertion. This means that if you are using image recognition for something particularly delicate, you may want to have a back-up for understanding what you need described.
Usage on Mobile
GPT-4 image recognition is available for free on both iOS and Android. The first app which supports image descriptions is Be My Eyes. When in the app, there will be a tab called "Seeing AI" where you can use image recognition. When you first use this feature, you may need to pass through a screen or two of disclaimers before being brought to the interface proper.
The interface is straightforward, containing only a "Take Picture" button. After you take the picture, a description will be returned. On the page containing the description, there will be a button labeled 'Ask More' where you can inquire further about the image. Messages are sent between you and the AI similar to a text messaging app.
On iOS, and most likely Android as well, it is possible to share images from sources on your device to the app to recognize. On iOS, open the share sheet, usually simply titled "Share", and scroll through the options until you see "Recognize with Be My Eyes". When the image is recognized, you will be placed in the "Ask More" message view instead of the traditional view when using the app normally.
In addition to Be My Eyes, you can also access GPT-4 image recognition using the Seeing AI app. In Seeing AI, scroll to "Scene" and take a picture. You will be given the traditional short description but can select the "More Info" button to have it processed by GPT-4. Note that with Seeing AI, you cannot ask follow up questions as of this time. Since you can recognize items with Seeing AI using the Share Sheet, it is also possible to review images on your device with Seeing AI as well.
When an image is sent to GPT-4, it is sent along with a text prompt giving the AI instructions. Because of this, Seeing AI and Be My Eyes will return very different, but similarly accurate, results. I would recommend testing both to determine what description style you prefer.
Using GPT-4 on Windows with an NVDA Addon
It is also possible to recognize elements of your computer's screen using GPT-4 if you use the NVDA screen reader. Note that, unlike Be My Eyes or Seeing AI, you will utilize your personal account to access the describer, meaning that you must buy tokens for your account before the addon can be used. To use the describer with NVDA, you must first download and configure this addon. The addon page contains full instructions on addon setup and usage. One element not fully described is how to fund an account. If you are creating a new account, you will likely receive free credits for use with the addon. If you would like to purchase credits, you can do so by going to settings and then billing.
The addon allows you to describe various parts of your computer's screen. The addon provides options to describe the entire screen, the currently focused item, and the navigator object (if Navigator Objects are new to you, check out section 5.4 in the NVDA User Guide). In addition, you can ask the addon to take a screenshot and recognize it, or recognize an image on your clipboard. You can also recognize an image by copying the file path of the image, not necessarily the image itself.
In the NVDA settings, you can configure the prompt sent along with any image, as well as the number of tokens used in the description. This is helpful if you want to have careful control of your token usage.
The following are some examples of image descriptions taken in the real world and on my computer.
This first image is from Seeing AI and is a description of part of my living room:
The picture shows a room with two windows covered by white blinds, allowing some light to filter through. Between the windows, there's a dark wooden wall decoration that resembles a map or continent outline. Below it, to the left, there's a small wooden rack holding a folded white towel on top of a red storage box. In front of the rack, there's a small round table with a black surface, on which lies a pair of glasses. To the right, there's a black leather recliner chair. The room's walls are a light beige color, and the ceiling is white with a single visible smoke detector.
This image showcases the detail that the AI can achieve, but also where it makes mistakes. The towel and red storage box it refers to are actually two blankets on a blanket rack, one white and the other red. There is nothing on the table that would look like glasses, I can only assume it saw a headphone cord partially draped over a folded piece of tech and made a guess.
Notice that it uses some qualifiers in reference to the wooden decoration. In this case it was correct, it is a wooden carving of the state of WV.
This second example demonstrates how you can ask Seeing AI for more details:
- Seeing AI: The picture shows a view from a high vantage point looking out over a cityscape. In the foreground, there is a large flat roof covered with a patterned material, possibly for waterproofing. The edge of the roof has a metal guardrail. Beyond the roof, several buildings of varying heights can be seen, with a mix of flat and sloped roofs. In the distance, there are two tall red and white towers, possibly for telecommunications. The sky is overcast with gray clouds, suggesting a gloomy or cold day.
- Me: Please describe the buildings in greater detail.
- Seeing AI: The buildings in view are primarily industrial or commercial in style, with flat roofs and simple, functional architecture. The closest building has a series of small, square structures on its roof, likely housing mechanical equipment. Beyond that, there is a building with a large scaffold structure around it, indicating construction or maintenance work in progress. To the right, there are two large, white, rectangular structures that appear to be storage tanks or silos. The overall color palette of the buildings is muted, with greys, browns, and the red of brickwork. There are no prominent signs or distinctive features that stand out on these buildings. The area gives the impression of an industrial district rather than a residential or commercial one.
As you can see, being able to ask further questions can provide significant extra detail from the photo you took.
The following are descriptions of two images captured using the NVDA addon on my personal computer. The first image is of the entire screen:
This image shows a desktop screen of a Windows operating system with a dark wallpaper and several icons arranged in no specific order. In the top left corner, there's the "Recycle Bin," followed by other icons like "Microsoft Edge," "NVDA," "mp3DirectCut," "Kindle," and "Firefox." There are also icons for "RS Games Client," "Steam," "Play Super Liam," "Conjury," "Jarte - Shortcut," "locale emulator," "Mudlet PTB," and "Adobe Acrobat." At the bottom, there's a taskbar with icons for various apps such as a web browser, file explorer, settings, and a notification area with the date and time indicated as 12:14 PM, 2/8/2024. There's also a blue dotted selection box in the middle of the screen indicating an area that has been selected or clicked but contains no icons or objects within it.
I found this impressive, as this is completely accurate to what is displayed on screen. The AI was able to recognize every single item on my desktop.
This time, we are describing a single icon on the desktop, recognizing using the "Recognize Focus" option:
This image is of an application icon, specifically for the Kindle app. It features a silhouette of a person sitting under a tree, reading a book with light emanating from it, suggesting an electronic or digital reading source. The figure is seated with one knee up and one leg stretched out. The background is a gradient of blue, which transitions from light to dark from top to bottom. The top-right corner has a curved, dog-eared page, indicating interaction or a characteristic feature of reading. Below the silhouette, the word "Kindle" is written in white font. There is also a smaller white icon, possibly indicating a notification or a branding detail, within a small blue square superimposed on the lower left of the main image.
I personally never knew how intricate icons/logos could be and have enjoyed checking several out using the NVDA addon.
The Bottom Line
As you can tell from the samples above, the level of detail and accuracy provided by GPT-4 is groundbreaking. That being said, you will notice that it does have a tendency to misidentify less focused items or those that are less common. In these cases, it will make its best guess. Even with that limitation, this technology is incredibly useful in so many arenas. I've personally used it over OCR to recognize products, identified clothing, reviewed control panels, and more using the apps on my iOS device. On PC, there are so many images that never get descriptions, especially on social media. It would be best if knowledge and adoption of alt text was more prevalent, but using GPT-4, we have greater access to visual culture than we ever had before.
Using Be My Eyes or Seeing AI is quite straightforward and both apps are quite friendly for people new to smartphones. The fact that uses of the technology seem free and unlimited is a major boon. There are many applications for the NVDA addon but you will need to be quite comfortable with the screen reader and the setup process as well as being willing to purchase your own tokens to use it.
I personally have found GPT-4 image recognition to be a major upgrade to my accessibility toolkit and highly encourage you to give it a try.