Multimodal Human-Autonomous Agents Interaction Using Pre-Trained Language and Visual Foundation Models

Publications: Conference contribution › Poster › Research (peer-reviewed)

Standard

Multimodal Human-Autonomous Agents Interaction Using Pre-Trained Language and Visual Foundation Models. / Nwankwo, Linus; Rückert, Elmar.
2024. Poster session presented at 19th ACM/IEEE International Conference on Human-Robot Interaction (HRI 2024), Boulder, Colorado, United States.


Harvard

Nwankwo, L & Rückert, E 2024, 'Multimodal Human-Autonomous Agents Interaction Using Pre-Trained Language and Visual Foundation Models', 19th ACM/IEEE International Conference on Human-Robot Interaction (HRI 2024), Boulder, Colorado, United States, 11/03/24 - 15/03/24.

APA

Nwankwo, L., & Rückert, E. (2024). Multimodal Human-Autonomous Agents Interaction Using Pre-Trained Language and Visual Foundation Models. Poster session presented at 19th ACM/IEEE International Conference on Human-Robot Interaction (HRI 2024), Boulder, Colorado, United States.

Vancouver

Nwankwo L, Rückert E. Multimodal Human-Autonomous Agents Interaction Using Pre-Trained Language and Visual Foundation Models. 2024. Poster session presented at 19th ACM/IEEE International Conference on Human-Robot Interaction (HRI 2024), Boulder, Colorado, United States.

Author

Nwankwo, Linus ; Rückert, Elmar. / Multimodal Human-Autonomous Agents Interaction Using Pre-Trained Language and Visual Foundation Models. Poster session presented at 19th ACM/IEEE International Conference on Human-Robot Interaction (HRI 2024), Boulder, Colorado, United States.

BibTeX

@conference{032c64c312354301b1bcf02c86a70b74,
title = "Multimodal Human-Autonomous Agents Interaction Using Pre-Trained Language and Visual Foundation Models",
abstract = "In this paper, we extended the method proposed in [17] to enable humans to interact naturally with autonomous agents through vocal and textual conversations. Our extended method exploits the inherent capabilities of pre-trained large language models (LLMs), multimodal visual language models (VLMs), and speech recognition (SR) models to decode the high-level natural language conversations and semantic understanding of the robot's task environment, and abstract them to the robot's actionable commands or queries. We performed a quantitative evaluation of our framework's natural vocal conversation understanding with participants from different racial backgrounds and English language accents. The participants interacted with the robot using both spoken and textual instructional commands. Based on the logged interaction data, our framework achieved 87.55% vocal commands decoding accuracy, 86.27% commands execution success, and an average latency of 0.89 seconds from receiving the participants' vocal chat commands to initiating the robot's actual physical action. The video demonstrations of this paper can be found at https://linusnep.github.io/MTCC-IRoNL/",
author = "Linus Nwankwo and Elmar R{\"u}ckert",
year = "2024",
month = mar,
day = "11",
language = "English",
note = "19th ACM/IEEE International Conference on Human-Robot Interaction (HRI 2024) ; Conference date: 11-03-2024 Through 15-03-2024",
url = "https://humanrobotinteraction.org/2024/",

}

RIS (suitable for import to EndNote)

TY - CONF

T1 - Multimodal Human-Autonomous Agents Interaction Using Pre-Trained Language and Visual Foundation Models

AU - Nwankwo, Linus

AU - Rückert, Elmar

PY - 2024/3/11

Y1 - 2024/3/11

N2 - In this paper, we extended the method proposed in [17] to enable humans to interact naturally with autonomous agents through vocal and textual conversations. Our extended method exploits the inherent capabilities of pre-trained large language models (LLMs), multimodal visual language models (VLMs), and speech recognition (SR) models to decode the high-level natural language conversations and semantic understanding of the robot's task environment, and abstract them to the robot's actionable commands or queries. We performed a quantitative evaluation of our framework's natural vocal conversation understanding with participants from different racial backgrounds and English language accents. The participants interacted with the robot using both spoken and textual instructional commands. Based on the logged interaction data, our framework achieved 87.55% vocal commands decoding accuracy, 86.27% commands execution success, and an average latency of 0.89 seconds from receiving the participants' vocal chat commands to initiating the robot's actual physical action. The video demonstrations of this paper can be found at https://linusnep.github.io/MTCC-IRoNL/

AB - In this paper, we extended the method proposed in [17] to enable humans to interact naturally with autonomous agents through vocal and textual conversations. Our extended method exploits the inherent capabilities of pre-trained large language models (LLMs), multimodal visual language models (VLMs), and speech recognition (SR) models to decode the high-level natural language conversations and semantic understanding of the robot's task environment, and abstract them to the robot's actionable commands or queries. We performed a quantitative evaluation of our framework's natural vocal conversation understanding with participants from different racial backgrounds and English language accents. The participants interacted with the robot using both spoken and textual instructional commands. Based on the logged interaction data, our framework achieved 87.55% vocal commands decoding accuracy, 86.27% commands execution success, and an average latency of 0.89 seconds from receiving the participants' vocal chat commands to initiating the robot's actual physical action. The video demonstrations of this paper can be found at https://linusnep.github.io/MTCC-IRoNL/

UR - https://human-llm-interaction.github.io/workshop/hri24/papers/hllmi24_paper_5.pdf

M3 - Poster

T2 - 19th ACM/IEEE International Conference on Human-Robot Interaction (HRI 2024)

Y2 - 11 March 2024 through 15 March 2024

ER -
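
Note: the abstract above describes a pipeline in which a speech-recognition model transcribes the user's spoken instruction, an LLM decodes it into an actionable robot command, and a VLM contributes semantic understanding of the task environment. This record contains no code, so the following is only a minimal Python sketch of such a decode-and-dispatch loop under assumed interfaces: the function names (transcribe_audio, query_llm, dispatch), the JSON command schema, and the canned return values are illustrative placeholders, not the authors' implementation (see https://linusnep.github.io/MTCC-IRoNL/ for the actual demonstrations).

# Hypothetical sketch of the vocal-command pipeline described in the abstract:
# speech recognition -> LLM decoding -> actionable robot command.
# The stubs below stand in for real SR/LLM/robot interfaces; none of this is
# taken from the authors' codebase.

import json
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobotCommand:
    action: str                    # e.g. "navigate", "describe_scene", "stop"
    target: Optional[str] = None   # e.g. a named location or object, if any


def transcribe_audio(wav_path: str) -> str:
    """Stand-in for a speech-recognition model (hypothetical)."""
    return "please drive to the charging station"


def query_llm(utterance: str) -> str:
    """Stand-in for an LLM call that returns a JSON command (hypothetical)."""
    return json.dumps({"action": "navigate", "target": "charging station"})


def decode(utterance: str) -> RobotCommand:
    """Map the LLM's JSON reply onto the assumed command schema."""
    reply = json.loads(query_llm(utterance))
    return RobotCommand(action=reply.get("action", "stop"),
                        target=reply.get("target"))


def dispatch(cmd: RobotCommand) -> bool:
    """Stand-in for handing the command to the robot's action layer."""
    print(f"executing: {cmd.action} -> {cmd.target}")
    return True


if __name__ == "__main__":
    t0 = time.time()
    text = transcribe_audio("chat_command.wav")
    command = decode(text)
    ok = dispatch(command)
    # The abstract reports average latency from receiving the vocal command to
    # initiating the physical action; this mirrors that measurement point.
    print(f"success={ok}, latency={time.time() - t0:.2f}s")

In a real system the stubs would wrap an SR backend, an LLM endpoint, and the robot's navigation or perception stack; the timing around dispatch() corresponds to the latency figure quoted in the abstract.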