• Machine Learning

  • NLP

Mentors :

Mentees :

  • Preferably 4, but more may be taken based on SOPs

This project aims to develop a source code plagiarism detector using Python.

The first step is to implement a basic bag-of-words approach by parsing source files. After that, language-specific preprocessing, such as normalizing renamed variables and handling macros, can be integrated to improve accuracy on a specific language (for example, C++). Based on the results, we will add machine learning techniques such as k-nearest neighbours to improve the results further.
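As a rough illustration of the first step, the sketch below turns two pieces of source code into bags of tokens and compares them with cosine similarity. The tokenizer regex, the choice of cosine similarity, and the function names `bag_of_words`, `cosine_similarity` and `similarity_percent` are assumptions made for this example, not a fixed design for the project.

```python
import math
import re
from collections import Counter

# Assumption: a token is an identifier/keyword, a run of digits, or a single
# punctuation character. A real tokenizer would be language-aware.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\w\s]")


def bag_of_words(source: str) -> Counter:
    """Count how often each token occurs in a piece of source code."""
    return Counter(TOKEN_RE.findall(source))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two token-count vectors (0.0 to 1.0)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty file shares nothing with any other file
    return dot / (norm_a * norm_b)


def similarity_percent(code_a: str, code_b: str) -> float:
    """Similarity of two source strings as a percentage."""
    return 100.0 * cosine_similarity(bag_of_words(code_a), bag_of_words(code_b))
```

Because a bag of words ignores token order, two files containing the same statements in a different order score close to 100%, while renaming every variable lowers the score. That weakness is what the language-specific preprocessing step is meant to address.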

If time permits, this project can also be extended to compare source code against code available online, using a search service such as the Google Search API.

We expect students to go through some of the references mentioned, do some research of their own, and include their own ideas related to the project topic in their proposals. More importantly, we look for enthusiasm, which will be judged by the effort students put into their proposals.


References :

  1. Bag of words approach:
  2. Basic Python tutorial:
  3. File parsing using Python:
  4. KNN:
  5. A research paper on this approach:

Tentative Timeline :

Week      Work
Week 1    Learn basics of Python and file parsing
Week 2    Implement a basic bag of words approach
Week 3    Add some preprocessing specific to language syntax
Week 4    Integrate KNN
Week 5    Final touches and presentation
Bonus     If we meet deadlines earlier than planned, we can integrate the Google Search API to search code available online.
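To make the Week 3 preprocessing step more concrete, here is a minimal sketch of one possible normalization for C++: comments are stripped and every identifier that is not a keyword is replaced by a placeholder, so that renaming variables no longer changes the token stream. The keyword set is deliberately incomplete, the `ID` placeholder and the `normalize` function are hypothetical names, and macro handling is left out entirely.

```python
import re

# Assumption: a small, incomplete set of C++ keywords, for illustration only.
CPP_KEYWORDS = {
    "int", "float", "double", "char", "void", "bool", "if", "else", "for",
    "while", "return", "class", "struct", "const", "using", "namespace",
}

# Matches // line comments and /* block */ comments. Limitation: comment
# markers that appear inside string literals are treated as comments too.
COMMENT_RE = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\w\s]")
IDENT_RE = re.compile(r"[A-Za-z_]\w*")


def normalize(source: str) -> list:
    """Strip comments and replace each non-keyword identifier with 'ID'."""
    without_comments = COMMENT_RE.sub(" ", source)
    tokens = []
    for token in TOKEN_RE.findall(without_comments):
        if IDENT_RE.fullmatch(token) and token not in CPP_KEYWORDS:
            tokens.append("ID")  # hide the programmer's choice of name
        else:
            tokens.append(token)
    return tokens
```

With this normalization, `int total = 0; // running sum` and `int s = 0; /* sum */` produce the same token list, so a bag of words built from the normalized tokens is insensitive to variable renaming and comment changes.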

Checkpoints :

Checkpoint Number    Progress
1 (4th April)        Implement bag of words with similarity percentage
Rest                 Same as the weekly schedule
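One way the first checkpoint could look in practice is sketched below: every source file in a directory is parsed into a bag of tokens, and each pair of files is reported with a similarity percentage, most similar first. The `*.cpp` pattern, the function names and the use of cosine similarity are assumptions for this example only.

```python
import itertools
import math
import re
from collections import Counter
from pathlib import Path

# Assumption: same simple tokenizer as a basic bag-of-words approach would use.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\w\s]")


def file_bag(path: Path) -> Counter:
    """Parse one source file into a bag of tokens."""
    text = path.read_text(encoding="utf-8", errors="replace")
    return Counter(TOKEN_RE.findall(text))


def cosine_percent(a: Counter, b: Counter) -> float:
    """Cosine similarity of two token-count vectors, as a percentage."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return 100.0 * dot / norm if norm else 0.0


def rank_pairs(directory: str, pattern: str = "*.cpp") -> list:
    """Return (percent, name_a, name_b) for every file pair, most similar first."""
    bags = {p.name: file_bag(p) for p in sorted(Path(directory).glob(pattern))}
    pairs = [
        (cosine_percent(bags[a], bags[b]), a, b)
        for a, b in itertools.combinations(bags, 2)
    ]
    return sorted(pairs, reverse=True)
```

Calling `rank_pairs("submissions")` on a hypothetical `submissions` directory returns a list that can be printed as a report, for example with `print(f"{a} vs {b}: {percent:.1f}%")` for each tuple. Comparing all pairs grows quadratically with the number of files, which is acceptable for a class-sized set of submissions.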