Syntactic and Emotional Properties of AI-Generated Texts: a Corpus-Based Comparison

GERSH, VERONIKA
2024/2025

Abstract

This thesis examines the linguistic properties of texts generated by Large Language Models (LLMs) in comparison with human-authored corpora. It addresses three core questions: the extent to which AI-generated corpora approximate human texts in syntactic complexity, the influence of prompt complexity on output characteristics, and the distribution of sentiment and emotion in machine-produced language. Three corpora were created with the OpenAI API (model o1) under systematically varied prompting conditions and compared with three human reference corpora: OpenSubtitles, a Children’s Stories Text Corpus, and the Leipzig Web Corpus. The analysis combined established syntactic complexity metrics and dependency-based measures of structural depth with sentiment modelling using a pre-trained emotion detection model built on the RoBERTa architecture (Emotion English DistilRoBERTa-base). The findings show that AI-generated texts are grammatically fluent but exhibit lower syntactic variety and greater structural regularity than human corpora, resembling the simplicity of subtitles rather than the richness of children’s literature or web texts. Dependency-based measures further revealed a preference for efficient, low-cost constructions. Prompt complexity significantly shaped the outputs: more elaborate prompts elicited greater syntactic diversity, though never the full range found in human texts. Sentiment analysis indicated a strong bias toward neutral and mildly positive affect, with limited representation of negative emotions. The study contributes to computational linguistics by offering a detailed, corpus-based comparison of human and AI texts, highlighting the methodological role of prompt engineering, and underlining both the promise and the constraints of LLM outputs. It concludes that while prompt design can narrow the gap between human-authored and machine-authored texts, AI-generated corpora remain distinguishable by their syntactic regularities and emotional limitations, making them valuable but imperfect substitutes for authentic human language.
2024
Files in this item:
File: Thesis_Gersh (3)_pdfA.pdf
Access: open access
Size: 1.5 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14247/26177