Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
12-2023
Abstract
Large language models (LLMs) outperform information retrieval techniques on downstream knowledge-intensive tasks when prompted to generate world knowledge. Yet, community concerns abound regarding the factuality and potential implications of using this uncensored knowledge. In light of this, we introduce CONNER, a COmpreheNsive kNowledge Evaluation fRamework, designed to systematically and automatically evaluate generated knowledge from six important perspectives: Factuality, Relevance, Coherence, Informativeness, Helpfulness, and Validity. We conduct an extensive empirical analysis of the knowledge generated by three different types of LLMs on two widely studied knowledge-intensive tasks, i.e., open-domain question answering and knowledge-grounded dialogue. Surprisingly, our study reveals that the factuality of generated knowledge, even when lower, does not significantly hinder downstream tasks. Instead, the relevance and coherence of the outputs matter more than small factual mistakes. Further, we show how to use CONNER to improve knowledge-intensive tasks by designing two strategies: Prompt Engineering and Knowledge Selection. Our evaluation code and LLM-generated knowledge with human annotations will be released to facilitate future research.
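As an illustration of the Knowledge Selection strategy mentioned in the abstract, the sketch below ranks several LLM-generated knowledge candidates by intrinsic quality (relevance and coherence) rather than factuality alone, in line with the paper's finding. All function names and scoring heuristics here are hypothetical placeholders, not CONNER's actual learned metrics.

from dataclasses import dataclass

@dataclass
class ScoredKnowledge:
    text: str
    relevance: float  # query-knowledge match, 0..1 (toy proxy)
    coherence: float  # internal consistency, 0..1 (toy proxy)

def toy_relevance(query: str, knowledge: str) -> float:
    # Crude lexical-overlap proxy; CONNER itself uses learned evaluators.
    q = set(query.lower().split())
    k = set(knowledge.lower().split())
    return len(q & k) / max(len(q), 1)

def toy_coherence(knowledge: str) -> float:
    # Placeholder: non-empty text counts as coherent in this sketch.
    return 1.0 if knowledge.strip() else 0.0

def select_knowledge(query: str, candidates: list[str]) -> str:
    # Pick the candidate with the best relevance + coherence, reflecting
    # the finding that these matter more than small factual mistakes.
    scored = [ScoredKnowledge(c, toy_relevance(query, c), toy_coherence(c))
              for c in candidates]
    return max(scored, key=lambda s: s.relevance + s.coherence).text

if __name__ == "__main__":
    query = "who wrote On the Origin of Species"
    candidates = [
        "Charles Darwin wrote On the Origin of Species in 1859.",
        "Isaac Newton formulated the law of universal gravitation.",
    ]
    print(select_knowledge(query, candidates))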
Keywords
Comprehensive evaluation, Downstream, Empirical analysis, Evaluation framework, Informativeness, Knowledge evaluation, Knowledge-intensive tasks, Language model, Retrieval techniques, World knowledge
Discipline
Databases and Information Systems | Information Security
Research Areas
Data Science and Engineering; Information Systems and Management
Areas of Excellence
Digital transformation
Publication
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, December 6-10
First Page
6325
Last Page
6341
ISBN
9798891760608
Identifier
10.18653/v1/2023.emnlp-main.390
Publisher
Association for Computational Linguistics
City or Country
Texas
Citation
CHEN, Liang; DENG, Yang; BIAN, Yatao; QIN, Zeyu; WU, Bingzhe; CHUA, Tat-Seng; and WONG, Kam-Fai.
Beyond factuality: A comprehensive evaluation of large language models as knowledge generators. (2023). Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, December 6-10. 6325-6341.
Available at: https://ink.library.smu.edu.sg/sis_research/9117
Copyright Owner and License
Authors
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Additional URL
https://doi.org/10.18653/v1/2023.emnlp-main.390