WalkVLM: Aid Visually Impaired People Walking by Vision Language Model

Abstract

Approximately 200 million people worldwide live with varying degrees of visual impairment, making it crucial to leverage AI technology to offer them walking assistance. With the recent progress of vision-language models (VLMs), applying VLMs to this field has become a popular research topic. However, most existing methods are studied on self-built question-answering datasets, and the field lacks a unified training and testing benchmark for walk guidance. Moreover, the blind walking task requires real-time parsing of streaming video and the generation of concise yet informative reminders, which poses a great challenge for VLMs that suffer from redundant responses and low inference efficiency. In this paper, we first release a diverse, extensive, and unbiased walking awareness dataset containing 12k videos paired with manual annotations, collected in Europe and Asia, to provide a fair training and testing benchmark for the blind walking task. Furthermore, we propose WalkVLM, a model that employs chain-of-thought-based hierarchical planning to generate concise but informative reminders and temporal-aware adaptive prediction to reduce the temporal redundancy of those reminders. Finally, we establish a solid benchmark for the blind walking task and verify the advantages of WalkVLM over other VLMs in stream video processing for this task.

Background


What does the world look like to visually impaired people?


They can feel helpless and adrift, deprived of the light of life.


Technology for good: let's take action!

Walking Awareness Dataset


Annotation pipeline

The data annotation pipeline for constructing the walking awareness dataset.

Geographical distribution

Visualization of the WAD dataset sorted by region. The WAD dataset draws from a wide range of sources; the samples and categories shown are randomly drawn from the dataset. The pie chart in the lower-left corner shows the proportion of video length from each region.

Compared to other datasets

Static information comparison of different datasets on the blind walking task. The WAD dataset holds a significant advantage in terms of sample numbers, categories, and modalities.

Data samples

Weather Condition: Sunny

Reminder task

Area Type: Pedestrian Path

Danger Level: Mid

Traffic Flow Rating: Low

Summary: on the sidewalk on the right side of the road, there is a downward step on the left, a yellow car passes on the left side of the road, there is a row of green plants on the right, there is a row of trees at one o'clock direction, there is a sign at two o'clock direction, there are cars parked on the roadside at eleven o'clock direction, the current road is narrow and there are few pedestrians

Reminder: at 11 o'clock direction there is a car, at 1 o'clock direction there is a sign, be careful to avoid.

Target detection labels.

Weather Condition: Overcast

Reminder task

Area Type: Pedestrian Path

Danger Level: High

Traffic Flow Rating: High

Summary: walking on a stone bridge. left side is red, right side is paved with stone bricks. there are stone railings on both sides of the bridge. there are many trees below the opposite side and both sides of the bridge. there is a pedestrian in a black coat about five steps ahead. the large flow of people is mainly concentrated about fifteen steps ahead. there is no road nearby and the traffic flow is zero.

Reminder: at 10 o'clock direction, there are pedestrians passing by. please move slowly towards 11 o'clock direction.

Target detection labels.

Weather Condition: Overcast

QA task

Area Type: Pedestrian Path

Danger Level: Low

Traffic Flow Rating: Low

Summary: on a walking path, there are rows of cars parked in front and on the right. on the silver-grey car on the right, there is a black-and-white cow cat squatting. about 15 steps ahead, there is a pedestrian in white. on the left, there is a row of metal railings, and inside the railings are residential buildings. the road is narrow, there are few pedestrians and no vehicles passing.

Q: how about the current road situation?

A: the current road is a small one with cars parked on both sides. there are few pedestrians on the road. now walk to the left side of the road, which is relatively narrow. be careful to avoid the cars parked on the right side and pay attention to safety.

Target detection labels.

Weather Condition: Cloudy

QA task

Area Type: Pedestrian Path

Danger Level: Low

Traffic Flow Rating: Mid

Summary: on the walkway, it is sunny, with green belts on the right, shops on the left, the road is narrow, there is a tree and a sign at one o'clock, and there is a bicycle straight ahead, and there are few pedestrians on the current road.

Q: describe the scene?

A: now there is a bicycle parked on the road 5 steps ahead of you. you need to deviate towards the 11 o'clock direction.

Target detection labels.
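To make the annotation format concrete, here is a minimal sketch of how a single WAD record could be represented in code. The WADSample class and its field names are illustrative assumptions inferred from the samples above, not the dataset's official schema.

from dataclasses import dataclass
from typing import Optional

# A sketch of one WAD annotation record, inferred from the sample fields
# shown above. Names are illustrative assumptions, not the official schema.
@dataclass
class WADSample:
    video_clip: str                 # path to the source video clip
    weather: str                    # e.g. "Sunny", "Overcast", "Cloudy"
    area_type: str                  # e.g. "Pedestrian Path"
    danger_level: str               # "Low" / "Mid" / "High"
    traffic_flow: str               # "Low" / "Mid" / "High"
    summary: str                    # free-text scene description
    reminder: Optional[str] = None  # concise walking reminder (reminder task)
    question: Optional[str] = None  # question text (QA task)
    answer: Optional[str] = None    # answer text (QA task)

sample = WADSample(
    video_clip="clips/europe_0001.mp4",  # hypothetical path
    weather="Sunny",
    area_type="Pedestrian Path",
    danger_level="Mid",
    traffic_flow="Low",
    summary="on the sidewalk on the right side of the road, ...",
    reminder="at 11 o'clock direction there is a car, at 1 o'clock "
             "direction there is a sign, be careful to avoid.",
)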


WalkVLM

Framework

An overview of the proposed WalkVLM framework. WalkVLM employs CoT-based hierarchical planning to summarize static scene attributes and scene understanding, facilitating the subsequent reminder and QA tasks. In addition, temporal-aware adaptive prediction computes the trigger state of the VLM, reducing the temporal redundancy of its outputs.
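To make the hierarchical planning step concrete, below is a minimal sketch of the two-stage prompting pattern. The vlm_generate(frames, prompt) helper is a hypothetical stand-in, not WalkVLM's actual interface.

# CoT-based hierarchical planning, sketched under stated assumptions:
# vlm_generate(frames, prompt) -> str is a hypothetical VLM call.

STAGE1_PROMPT = (
    "Describe the static attributes of this street scene: "
    "weather, area type, danger level, and traffic flow rating."
)

def hierarchical_reminder(frames, vlm_generate):
    # Stage 1 (plan): extract static scene attributes first.
    attributes = vlm_generate(frames, STAGE1_PROMPT)
    # Stage 2 (act): generate a concise reminder conditioned on the plan.
    stage2_prompt = (
        f"Scene attributes: {attributes}\n"
        "Based on these attributes, give a short, informative walking "
        "reminder for a visually impaired pedestrian."
    )
    return vlm_generate(frames, stage2_prompt)

Conditioning the reminder on explicitly predicted attributes is what keeps the final output short: the model answers from a structured plan rather than re-describing the whole scene.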

Quantitative results

Quantitative comparison of different methods on the reminder and QA tasks. WalkVLM leads on almost all TF-IDF, ROUGE, and GPT Score metrics; higher is better for each metric. Bold and underline indicate the best and second-best results, respectively.
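As a reference point, the text-overlap metrics for a single prediction/reference pair could be computed as below. This is a sketch of a plausible setup; the paper's exact evaluation protocol (and the GPT Score judge) is not reproduced here.

# Requires: pip install rouge-score scikit-learn
from rouge_score import rouge_scorer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def score_pair(prediction: str, reference: str) -> dict:
    # ROUGE-L F1 between prediction and reference.
    rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = rouge.score(reference, prediction)["rougeL"].fmeasure

    # Cosine similarity between the TF-IDF vectors of the two texts.
    tfidf = TfidfVectorizer().fit([reference, prediction])
    vecs = tfidf.transform([reference, prediction])
    tfidf_sim = float(cosine_similarity(vecs[0], vecs[1])[0, 0])

    return {"rougeL_f1": rouge_l, "tfidf_cosine": tfidf_sim}

print(score_pair(
    "at 11 o'clock direction there is a car, be careful to avoid.",
    "at 11 o'clock direction there is a car, at 1 o'clock direction "
    "there is a sign, be careful to avoid.",
))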

Qualitative results

Visualization comparison of different VLMs. Compared to other models, WalkVLM generates concise and informative answers, giving users a better experience in blind walking assistance.

Streaming inference


Visualization comparison of video-stream inference on the blind walking task between WalkVLM and MiniCPM-V2.6, a model of comparable parameter scale. Our method generates more concise reminders with lower temporal redundancy, though there is still room for improvement in important-event ranking, error identification, and fine-grained recognition. See Appendix C.3 for a more detailed analysis.
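One possible shape for such a streaming loop is sketched below. WalkVLM learns its trigger state; the frame-difference score, threshold, and cooldown here are placeholder assumptions used only to illustrate how adaptive triggering suppresses redundant reminders.

import numpy as np

def trigger_score(prev_frame: np.ndarray, frame: np.ndarray) -> float:
    # Mean absolute pixel change as a crude stand-in for a learned trigger.
    return float(np.mean(np.abs(frame.astype(np.float32)
                                - prev_frame.astype(np.float32))))

def stream_reminders(frames, vlm_reminder, threshold=12.0, cooldown=30):
    """Yield (frame_index, reminder) only when the scene changes enough,
    then stay quiet for `cooldown` frames to avoid redundant outputs."""
    prev, quiet = None, 0
    for t, frame in enumerate(frames):
        if prev is not None and quiet == 0 and trigger_score(prev, frame) > threshold:
            yield t, vlm_reminder(frame)  # invoke the VLM only on trigger
            quiet = cooldown              # suppress near-duplicate reminders
        quiet = max(0, quiet - 1)
        prev = frame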

Vision

We are committed to providing walking assistance to the approximately 200 million visually impaired individuals worldwide through vision-language model (VLM) technology, improving their quality of life. We have introduced the WalkVLM model and the walking awareness dataset, both aimed at generating concise and informative walking reminders. We call on all sectors of society to pay more attention to this group, to promote technology for good, and to help them better integrate into society. Additionally, we hope to crowdsource more data and resources, making these services more comprehensive and effective.

BibTeX

If you use our work in your research, please cite:


@misc{anonymous2024WalkVLM,
  title={WalkVLM: Aid Visually Impaired People Walking by Vision Language Model},
  author={Anonymous},
  year={2024},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}