
Assessing LLM Type Prediction with Multi-Dimensional Similarity Metrics

MENGESHA, HILINA BIZUNEH
2024/2025

Abstract

The inference of static type hints for dynamically typed languages such as Python is a well-explored area in machine learning, notably through benchmarks such as TYPYBENCH. However, these studies focus primarily on predicting missing annotations; the more realistic scenario in which annotations exist but are semantically incorrect remains unaddressed. This thesis introduces and formalizes LLM-based semantic type correction: automatically diagnosing and fixing incorrect type hints in Python. To enable this research, we construct an evaluation benchmark by augmenting the TYPYBENCH dataset with a two-tier mutation engine that injects plausible type errors. For built-in types, mutations are guided by a formal type hierarchy and directional coercion costs; for user-defined types, a novel semantic analysis system extracts inheritance patterns, method roles, and structural signatures from real-world repositories to generate semantically related but incorrect candidates. Each mutation is annotated with a fine-grained semantic distance score (0–3), moving beyond binary correctness toward graded evaluation. Using this benchmark, we conduct a comprehensive study of the robustness of modern code models to noisy type annotations. We fine-tune encoder-decoder (CodeT5+) and decoder-only (CodeLlama) models on training datasets with varied noise levels (α = 0.1, 0.3, 0.5), in which the models learn from clean, missing, and corrupted type hints. Our experiments demonstrate that while high noise can degrade standard inference, the models learn to correct a significant portion of semantic errors. Evaluation with our distance metric reveals that models often recover partially correct types, reducing semantic distance even without exact matches. This work establishes a benchmark for type correction, provides a methodology for generating realistic type errors, and offers new insights into the robustness and capabilities of language models for code refinement.
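The mutation-and-scoring idea described above can be sketched in a few lines. This is a minimal, hypothetical illustration only: the candidate table, function names, and distance values are invented for exposition and do not reproduce the thesis's actual type hierarchy, coercion costs, or distance definitions. It shows how a fraction α of built-in type hints might be replaced by plausible wrong candidates, each tagged with a graded semantic distance rather than a binary wrong/right label.

```python
import random

# Illustrative mutation table for built-in types: each correct hint maps to
# plausible-but-wrong candidates with a semantic distance (0 = exact match,
# 3 = unrelated). These entries are examples, not the thesis's real tables.
MUTATIONS = {
    "int":  [("float", 1), ("bool", 1), ("str", 2)],
    "list": [("tuple", 1), ("set", 2), ("dict", 3)],
    "str":  [("bytes", 1), ("int", 2)],
}

def corrupt_hints(hints, alpha=0.3, rng=random):
    """Replace a fraction alpha of the type hints with a wrong candidate
    from MUTATIONS; return the noisy hints and per-hint distance labels."""
    noisy, distances = [], []
    for hint in hints:
        if hint in MUTATIONS and rng.random() < alpha:
            wrong, dist = rng.choice(MUTATIONS[hint])
            noisy.append(wrong)
            distances.append(dist)
        else:
            noisy.append(hint)       # hint left clean
            distances.append(0)      # distance 0 means unchanged/correct
    return noisy, distances
```

Under this sketch, a training set built with α = 0.3 would leave roughly 70% of annotations clean, while the distance labels give the graded (0–3) evaluation signal the abstract describes.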
Files in this item:
Final_Thesis_Cafoscari_University_of_Venice(HILINAMENGESHA) (1).pdf (553.19 kB, Adobe PDF, not available)

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14247/28804