- Overview: This repo contains the code, results, and data used in the EMNLP 2024 Findings paper "Can Large Language Models Identify Authorship?"
- TLDR: We propose the Linguistically Informed Prompting (LIP) strategy, which offers in-context linguistic guidance to boost LLMs' reasoning capacity for authorship verification and attribution tasks, while also providing natural language explanations.
This work explores the capabilities of Large Language Models (LLMs) in authorship analysis tasks, specifically authorship verification and authorship attribution. The primary aim is to investigate whether LLMs can accurately identify the authorship of texts, which is pivotal for verifying content authenticity and mitigating misinformation.
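As a concrete illustration, the sketch below shows what a LIP-style verification call could look like with the OpenAI Python client. The prompt wording and the linguistic feature list here are placeholders, not the paper's verbatim template; see the prompt files in this repo for the exact versions.

```python
# Illustrative sketch of a Linguistically Informed Prompting (LIP) call for
# authorship verification. Wording and feature list are placeholders, not the
# paper's exact template.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LINGUISTIC_FEATURES = (
    "phrasal verbs, modal verbs, punctuation, rare words, affixes, "
    "quantities, humor, sarcasm, typographical errors, and misspellings"
)

def lip_verify(text1: str, text2: str, model: str = "gpt-4") -> str:
    """Ask the model whether two texts share an author, with linguistic guidance."""
    prompt = (
        "Verify whether the two input texts were written by the same author. "
        "Analyze the writing styles of the input texts, disregarding the "
        "differences in topic and content, and reason based on linguistic "
        f"features such as {LINGUISTIC_FEATURES}.\n\n"
        f"Text 1: {text1}\n"
        f"Text 2: {text2}\n\n"
        "First provide your analysis, then answer True or False."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```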
Figure: A comparison between Linguistically Informed Prompting (LIP) and other prompting strategies for authorship verification. "Analysis" and "Answer" are the outputs of prompting GPT-4; only the LIP strategy correctly identifies that the two given texts belong to the same author. In the figure, text colored in orange highlights the differences compared to vanilla prompting with no guidance, while text colored in blue indicates the linguistically informed reasoning process, including text referenced from the original documents.
To cite this paper:

```bibtex
@inproceedings{huang2024authorship,
    title = "Can Large Language Models Identify Authorship?",
    author = "Huang, Baixiang and Chen, Canyu and Shu, Kai",
    editor = "Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-emnlp.26/",
    doi = "10.18653/v1/2024.findings-emnlp.26",
    pages = "445--460",
    abstract = "The ability to accurately identify authorship is crucial for verifying content authenticity and mitigating misinformation. Large Language Models (LLMs) have demonstrated exceptional capacity for reasoning and problem-solving. However, their potential in authorship analysis remains under-explored. Traditional studies have depended on hand-crafted stylistic features, whereas state-of-the-art approaches leverage text embeddings from pre-trained language models. These methods, which typically require fine-tuning on labeled data, often suffer from performance degradation in cross-domain applications and provide limited explainability. This work seeks to address three research questions: (1) Can LLMs perform zero-shot, end-to-end authorship verification effectively? (2) Are LLMs capable of accurately attributing authorship among multiple candidate authors (e.g., 10 and 20)? (3) Can LLMs provide explainability in authorship analysis, particularly through the role of linguistic features? Moreover, we investigate the integration of explicit linguistic features to guide LLMs in their reasoning processes. Our assessment demonstrates LLMs' proficiency in both tasks without the need for domain-specific fine-tuning, providing explanations into their decision making via a detailed analysis of linguistic features. This establishes a new benchmark for future research on LLM-based authorship analysis."
}
```
Traditional authorship analysis methods rely on hand-crafted writing style features and classifiers, while state-of-the-art approaches utilize text embeddings from pre-trained language models, often requiring domain-specific fine-tuning. Our approach evaluates LLMs' performance in authorship analysis without the need for fine-tuning, and explores the integration of explicit linguistic features to enhance reasoning capabilities.
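To make the attribution setting concrete, here is a hedged sketch of how a zero-shot attribution prompt over multiple candidate authors might be assembled; the function and its wording are illustrative, not the paper's exact template.

```python
# Illustrative zero-shot authorship-attribution prompt: given texts from known
# candidate authors and a query text, ask the model to name the author.
# Wording is a placeholder, not the paper's exact template.
def attribution_prompt(query_text: str, candidate_texts: dict[str, str]) -> str:
    """candidate_texts maps an author ID to one example text by that author."""
    lines = [
        "Given a set of texts with known authors and a query text, determine "
        "the author of the query text. Analyze the writing styles, "
        "disregarding differences in topic and content.",
    ]
    for author_id, text in candidate_texts.items():
        lines.append(f"Author {author_id}: {text}")
    lines.append(f"Query text: {query_text}")
    lines.append("Answer with the single most likely author ID.")
    return "\n".join(lines)
```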
For this study, texts and authors were filtered to remove duplicate texts and authors contributing fewer than two texts. Non-English texts were excluded using the py3langid tool (https://github.com/adbar/py3langid).
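A minimal sketch of this preprocessing step, assuming texts arrive as (author, text) pairs, is shown below; the repo's actual scripts are the authoritative version. py3langid keeps the familiar langid.py interface, where classify returns a (language, score) pair.

```python
# Sketch of the described preprocessing: drop duplicate texts, keep English
# only, and drop authors with fewer than two texts. Illustrative, not the
# repo's actual script.
from collections import defaultdict

import py3langid as langid  # https://github.com/adbar/py3langid

def filter_corpus(records: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Filter (author, text) pairs as described above."""
    seen: set[str] = set()
    by_author: dict[str, list[str]] = defaultdict(list)
    for author, text in records:
        if text in seen:  # remove duplicate texts
            continue
        seen.add(text)
        lang, _score = langid.classify(text)  # e.g., ('en', -56.77)
        if lang != "en":  # exclude non-English texts
            continue
        by_author[author].append(text)
    # Keep only authors contributing at least two texts.
    return [(a, t) for a, texts in by_author.items() if len(texts) >= 2 for t in texts]
```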
The datasets used in this research are publicly available on Kaggle:
- Enron Email Dataset: https://www.kaggle.com/datasets/wcukierski/enron-email-dataset
- Blog Authorship Corpus: https://www.kaggle.com/datasets/rtatman/blog-authorship-corpus
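For reference, the snippet below loads both datasets with pandas. The file and column names follow the standard Kaggle downloads and are assumptions; adjust paths and encodings to match your local copies.

```python
# Load the two Kaggle datasets. File and column names follow the standard
# Kaggle downloads and are assumptions; adjust paths as needed.
import pandas as pd

# Enron Email Dataset: one row per email, raw message text in the `message` column.
enron = pd.read_csv("emails.csv")

# Blog Authorship Corpus: one row per blog post, post body in the `text` column.
blogs = pd.read_csv("blogtext.csv")

print(len(enron), "emails;", len(blogs), "blog posts")
```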
The code accompanying this research is structured to facilitate the replication of our study and further exploration of LLMs in authorship analysis tasks. It includes scripts for data preprocessing and evaluation.
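The repo's evaluation scripts are the reference implementation; purely as a sketch, verification outputs can be scored by parsing the model's final True/False verdict and computing accuracy, as below. parse_answer is a hypothetical helper, not a function from this repo.

```python
# Minimal sketch of scoring authorship verification: parse each model response
# for a final True/False verdict and compute accuracy. Illustrative only.
def parse_answer(output: str) -> bool | None:
    """Extract the final True/False verdict from a model response."""
    tokens = output.strip().split()
    verdict = tokens[-1].strip(".,!").lower() if tokens else ""
    if verdict == "true":
        return True
    if verdict == "false":
        return False
    return None  # ambiguous or malformed response

def accuracy(outputs: list[str], labels: list[bool]) -> float:
    """Fraction of responses whose parsed verdict matches the gold label."""
    preds = [parse_answer(o) for o in outputs]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)
```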