Research Collection School Of Computing and Information Systems

Sustainable LLM inference for edge AI: Evaluating quantized LLMs for energy efficiency, output accuracy, and inference latency

Erik Johanne HUSOM
Arda GOKNIL
Merve ASTEKIN
Lwin Khin SHAR, Singapore Management University
Andre KASEN
Sagar SEN
Benedikt Andreas MITHASSEL
Ahmet SOYLU

Publication Type

Journal Article

Version

acceptedVersion

Publication Date

9-2025

Abstract

Deploying Large Language Models (LLMs) on edge devices presents significant challenges due to computational constraints, memory limitations, inference speed, and energy consumption. Model quantization has emerged as a key technique to enable efficient LLM inference by reducing model size and computational overhead. In this study, we conduct a comprehensive analysis of 28 quantized LLMs from the Ollama library, which by default applies Post-Training Quantization (PTQ) and weight-only quantization techniques, deployed on an edge device (Raspberry Pi 4 with 4 GB RAM). We evaluate energy efficiency, inference performance, and output accuracy across multiple quantization levels and task types. Models are benchmarked on five standardized datasets (CommonsenseQA, BIG-Bench Hard, TruthfulQA, GSM8K, and HumanEval), and we employ a high-resolution, hardware-based energy measurement tool to capture real-world power consumption. Our findings reveal the trade-offs between energy efficiency, inference speed, and accuracy in different quantization settings, highlighting configurations that optimize LLM deployment for resource-constrained environments. By integrating hardware-level energy profiling with LLM benchmarking, this study provides actionable insights for sustainable AI, bridging a critical gap in existing research on energy-aware LLM deployment.

Discipline

Software Engineering

Research Areas

Cybersecurity; Software and Cyber-Physical Systems

Areas of Excellence

Sustainability

Publication

ACM Transactions on Internet of Things

First Page

1

Last Page

34

ISSN

2691-1914

Identifier

10.1145/3767742

Publisher

Association for Computing Machinery (ACM)

Citation

HUSOM, Erik Johanne; GOKNIL, Arda; ASTEKIN, Merve; SHAR, Lwin Khin; KASEN, Andre; SEN, Sagar; MITHASSEL, Benedikt Andreas; and SOYLU, Ahmet. Sustainable LLM inference for edge AI: Evaluating quantized LLMs for energy efficiency, output accuracy, and inference latency. (2025). ACM Transactions on Internet of Things. 1-34.
Available at: https://ink.library.smu.edu.sg/sis_research/10489

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Additional URL

https://doi.org/10.1145/3767742

Included in

Software Engineering Commons
