EVALUATION OF LLAMA3.2 3B FOR AUTOMATIC QUESTION GENERATION WITH DEEPEVAL BASED ON THE ANSWER RELEVANCY AND HALLUCINATION METRICS

Authors

  • Thoriq Dharmawan, Universitas Mercu Buana Yogyakarta
  • Arita Witanti, Universitas Mercu Buana Yogyakarta

DOI:

https://doi.org/10.51401/jinteks.v7i1.5423

Keywords:

Model, Artificial Intelligence, LLaMA3.2, Answer Relevancy, Hallucination

Abstract

Artificial Intelligence (AI) opens new opportunities in many fields, one of which is education. This study evaluates the LLaMA3.2 3B model for generating questions for learning media; the evaluation uses DeepEval, an open-source LLM evaluation framework. Two metrics are applied: Answer Relevancy, which measures how well the generated questions match the provided material, and Hallucination, which measures the degree of error against the desired output. The test results show that LLaMA3.2 3B performs better when generating a small number of questions, with an average Answer Relevancy score of 0.813 for the 150-word dataset and 0.776 for the 650-word dataset. The model also achieves a better Hallucination score on the smaller dataset, as low as 0.05 for 150 words, compared with 0.33 for the 650-word dataset. Based on these results, it can be concluded that LLaMA 3.2 3B requires fine-tuning to improve the quality of the generated questions.
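
For readers who want to see how such scores are obtained in practice, the sketch below shows how a single generated question could be scored with DeepEval's AnswerRelevancyMetric and HallucinationMetric. It is a minimal illustration rather than the authors' exact pipeline: the source passage, the generated question, the thresholds, and the use of DeepEval's default LLM judge (which requires an evaluator model to be configured, e.g. via OPENAI_API_KEY) are placeholder assumptions.

```python
# Minimal sketch: scoring one LLaMA3.2 3B-generated question with DeepEval.
# The passage, question, thresholds, and judge configuration are placeholders,
# not the exact setup reported in the paper.
from deepeval.metrics import AnswerRelevancyMetric, HallucinationMetric
from deepeval.test_case import LLMTestCase

# Source material (e.g. a ~150-word excerpt) and the question the model produced.
material = (
    "Photosynthesis is the process by which green plants convert light energy "
    "into chemical energy stored in glucose..."
)
generated_question = "What is the main product of photosynthesis in green plants?"

test_case = LLMTestCase(
    input=f"Generate one question from the following material: {material}",
    actual_output=generated_question,
    context=[material],  # ground-truth context required by HallucinationMetric
)

# Answer Relevancy: how relevant the output is to the input (higher is better).
# Hallucination: fraction of the context the output contradicts (lower is better).
answer_relevancy = AnswerRelevancyMetric(threshold=0.7)
hallucination = HallucinationMetric(threshold=0.5)

answer_relevancy.measure(test_case)
hallucination.measure(test_case)

print(f"Answer Relevancy: {answer_relevancy.score:.3f}")
print(f"Hallucination:    {hallucination.score:.3f}")
```

Under this convention a lower Hallucination score means fewer contradictions with the supplied material, which is why the 0.05 obtained on the 150-word dataset is read as better than the 0.33 obtained on the 650-word dataset.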

Published

2025-02-10

How to Cite

[1] T. Dharmawan and A. Witanti, “EVALUASI LLAMA3.2 3B UNTUK MENGHASILKAN SOAL OTOMATIS DENGAN DEEPEVAL BERDASARKAN METRIK ANSWER RELEVANCY DAN HALLUCINATION”, JINTEKS, vol. 7, no. 1, pp. 242-248, Feb. 2025.

Issue

Vol. 7 No. 1 (2025)

Section

Articles