
Plagiarism Detection Architecture

Project Overview

Project Detail

  1. Copy the document you’d like to run plagiarism detection on to Amazon Simple Storage Service (Amazon S3).

  2. The Amazon S3 event triggers the start of an AWS Step Functions workflow.

  3. An AWS Lambda function extracts text from the document using Apache Tika, a content analysis toolkit that detects and extracts metadata and text from over a thousand different file types.

  4. For each paragraph in the document, the text is passed to a pre-trained Bidirectional Encoder Representations from Transformers (BERT)-based model to extract word embedding vectors.

  5. For each word embedding vector, a K-Nearest Neighbor (KNN) search is run using a cosine-similarity algorithm.

  6. An Amazon OpenSearch Service (OpenSearch Service) domain stores an index of pre-processed works that have been converted into word embedding vectors.

  7. Based on a comparison of the OpenSearch Service query result score against the configured similarity threshold, an Amazon EventBridge event is raised that identifies the source document as possibly plagiarized and references the relevant works.
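
Steps 5 to 7 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the architecture's actual implementation. It replaces the OpenSearch Service k-NN query with a brute-force in-memory search, it assumes the embedding vectors from step 4 are already available as lists of floats, and it only builds the event payload rather than sending it. The function names, the 0.9 threshold, and the Source and DetailType values are all hypothetical. A real OpenSearch Service score may also be scaled differently from raw cosine similarity.

```python
import json
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_search(query: Vector, index: Dict[str, Vector], k: int = 3) -> List[Tuple[str, float]]:
    """Brute-force stand-in for the k-NN query: top-k works by cosine similarity."""
    scored = [(work_id, cosine_similarity(query, vec)) for work_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def flag_paragraphs(paragraph_vectors: List[Vector], index: Dict[str, Vector],
                    threshold: float = 0.9, k: int = 3) -> List[dict]:
    """Keep only the matches whose score reaches the (hypothetical) threshold."""
    findings = []
    for number, vec in enumerate(paragraph_vectors):
        matches = [(w, s) for w, s in knn_search(vec, index, k) if s >= threshold]
        if matches:
            findings.append({"paragraph": number, "matches": matches})
    return findings


def build_event_entry(document_key: str, findings: List[dict]) -> dict:
    """Build one entry in the shape EventBridge PutEvents expects.

    The Source and DetailType values are made up for this sketch.
    """
    return {
        "Source": "example.plagiarism-detection",
        "DetailType": "PossiblePlagiarismDetected",
        "Detail": json.dumps({"document": document_key, "findings": findings}),
    }
```

With an index of two reference works and a two-paragraph document, flag_paragraphs returns only the paragraphs that have at least one match at or above the threshold. build_event_entry then wraps those findings in a dictionary that could be passed as one of the Entries in an EventBridge PutEvents call.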

https://docs.aws.amazon.com/architecture-diagrams/latest/plagiarism-detection-architecture/plagiarism-detection-architecture.html?did=wp_card&trk=wp_card

To learn more about this project, connect with us.
