PhD
Browsing PhD by Author "Muhammad Faraz Manzoor"
Language resource and model for intrinsic plagiarism detection for Urdu language (UMT, Lahore, 2025)
Muhammad Faraz Manzoor
In the evolving field of natural language processing (NLP), plagiarism detection has become an essential task, particularly for low-resource languages like Urdu. This PhD research addresses intrinsic plagiarism detection in Urdu texts with a framework that combines machine learning, deep learning, and language models, and it analyzes the problem at both the paragraph and sentence levels. At the paragraph level, a set of 43 stylometry features across six granularity levels was curated to capture linguistic patterns indicative of plagiarism. The selected models include traditional machine learning techniques such as Logistic Regression, Decision Trees, Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Naive Bayes, Gradient Boosting, and Voting Classifier, alongside deep learning models like GRU, BiLSTM, CNN, LSTM, and MLP, as well as Large Language Models (LLMs) such as BERT and GPT-2. Two distinct experiments were conducted: the first used the entire dataset to classify documents as intrinsically plagiarized or non-plagiarized, while the second divided the dataset into three topical types: Moral Lessons, National Celebrities, and National Events. The Random Forest Classifier achieved an accuracy of 98.81% in the first experiment, while the Extreme Gradient Boosting Classifier reached an overall accuracy of 99.00% in the second experiment, showing that stylistic features can be distinguished reliably across different topics. At the sentence level, the study uses several text representations, including TF-IDF, Word2Vec, FastText, and GloVe, in conjunction with machine learning and ensemble learning classifiers.
A balanced dataset of 2520 documents was used to evaluate these models. FastText embeddings combined with the Support Vector Classifier and with Random Forest were the top performers, achieving accuracy scores of 0.89. BiLSTM reached a competitive accuracy of 0.75, whereas the BERT model underperformed with an accuracy of 0.65, highlighting the challenges of applying LLMs to low-resource languages like Urdu. Overall, this research finds tailored stylometry features and traditional machine learning models more effective than deep learning models and LLMs for intrinsic plagiarism detection in Urdu. The findings point to further gains from expanding the datasets and developing more sophisticated language models tailored to the linguistic characteristics of Urdu.
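
To make the paragraph-level idea concrete, the following is a minimal sketch of stylometric feature extraction in Python. It is not the thesis's implementation: the abstract does not list the 43 features, so the four computed here (average sentence length, average word length, type-token ratio, punctuation rate) are illustrative assumptions, as are the function name `stylometric_features` and the chosen punctuation sets.

```python
import re

# Assumed sentence terminators: Urdu full stop (U+06D4), Arabic question
# mark (U+061F), and the Latin marks. An illustrative choice, not the thesis's.
SENTENCE_ENDERS = "\u06D4\u061F!?."
# Assumed clause-level punctuation: Arabic comma (U+060C), Arabic semicolon
# (U+061B), and the Latin equivalents.
OTHER_PUNCT = "\u060C\u061B,;:"
PUNCT = SENTENCE_ENDERS + OTHER_PUNCT

SENTENCE_SPLIT = re.compile("[" + re.escape(SENTENCE_ENDERS) + "]+")


def stylometric_features(paragraph: str) -> dict[str, float]:
    """Return a few hypothetical style features for one paragraph."""
    # Split into sentences and drop empty fragments left by the split.
    sentences = [s for s in SENTENCE_SPLIT.split(paragraph) if s.strip()]
    # Whitespace tokenization, then trim punctuation from token edges.
    words = [w.strip(PUNCT) for w in paragraph.split()]
    words = [w for w in words if w]
    n_words = len(words) or 1  # avoid division by zero on empty input
    n_sentences = len(sentences) or 1
    return {
        "avg_sentence_len": len(words) / n_sentences,
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "type_token_ratio": len(set(words)) / n_words,
        "punct_per_word": sum(ch in PUNCT for ch in paragraph) / n_words,
    }


if __name__ == "__main__":
    sample = "\u06CC\u06C1 \u0627\u06CC\u06A9 \u062C\u0645\u0644\u06C1 \u06C1\u06D2\u06D4"
    for name, value in stylometric_features(sample).items():
        print(name, round(value, 3))
```

Feature vectors of this kind, computed per paragraph or per document, are what a classifier such as Random Forest would consume; whitespace tokenization is a simplification that a real Urdu pipeline would likely replace with a proper tokenizer.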