Store a writing sample from each student as their fingerprint.
For every future assignment submission, use the fingerprint as the baseline and return a similarity score from 0 to 1, to check that students are writing their own assignments.
A similarity score below 0.3 is flagged.
We pass the fingerprint to the LLM as context, along with the content to be compared, and the LLM returns a similarity score. Sample prompt:
You are an expert university evaluator. Your aim is to determine whether two distinct essays were written by the same person.
The subject matter is different.
Employ only stylistic factors such as sentence length, general word choice, figurative and rhetorical tendencies, formatting, and use of active versus passive voice.
Pass the expected response schema to the LLM as a parameter:
{
  "name": "similarity_response",
  "description": "Check if the response is valid",
  "parameters": {
    "type": "object",
    "properties": {
      "similarity_score": {
        "type": "number",
        "description": "score from 0 to 1 where 0 is completely different and 1 is exact style match"
      },
      "score_reasoning": {
        "type": "string",
        "description": "Very short and simple non-technical human readable reasoning of what stylistic factors were compared and led to this score"
      }
    },
    "required": ["similarity_score", "score_reasoning"]
  }
}
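A minimal sketch of how a reply could be checked against the schema above before it is trusted, and how the 0.3 flagging rule could be applied. The function name `parse_similarity_response` and the `flagged` field are hypothetical, and the sketch assumes the LLM reply arrives as a JSON string; it does not call any LLM API.

```python
import json

FLAG_THRESHOLD = 0.3  # from the text: scores below 0.3 are flagged


def parse_similarity_response(raw: str) -> dict:
    """Parse the LLM's JSON reply and check the two required fields."""
    data = json.loads(raw)
    for key in ("similarity_score", "score_reasoning"):
        if key not in data:
            raise ValueError(f"missing required field: {key}")
    score = data["similarity_score"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("similarity_score must be a number")
    if not 0.0 <= score <= 1.0:
        raise ValueError("similarity_score must be between 0 and 1")
    if not isinstance(data["score_reasoning"], str):
        raise ValueError("score_reasoning must be a string")
    return {
        "similarity_score": float(score),
        "score_reasoning": data["score_reasoning"],
        "flagged": score < FLAG_THRESHOLD,  # hypothetical convenience field
    }
```

Rejecting out-of-range or missing values here keeps a malformed LLM reply from silently producing a flag.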
Disadvantages:
similarity_score as a weighted average of both (70% LLM, 30% our custom eval, etc.)
Extract datapoints from fingerprint text:
Function words ratio
Use spaCy to count function words.
ratio = (# function words) / (# total words)
Punctuation distribution
dist = (# commas) / (# sentences)
Result:
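Weighted average of all these parameters (equal weights shown here). Each term is the ratio of the sample's value to the fingerprint's, so a result close to 1 indicates a close match.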
result = (
(sentence_length_dist_sample/sentence_length_dist_fingerprint) +
(function_words_sample/function_words_fingerprint) +
(punctuation_dist_sample/punctuation_dist_fingerprint) +
(active_passive_sample/active_passive_fingerprint)
) / 4
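The custom eval could be sketched roughly as follows. Several things here are assumptions rather than the design itself: a tiny hard-coded function-word list stands in for spaCy so the sketch has no dependencies, sentences are split with a simple regex, the active/passive metric is left out because it needs a parser, and each ratio is taken as min/max instead of sample/fingerprint so the result stays between 0 and 1. All function names are hypothetical.

```python
import re

# Tiny illustrative function-word list (assumption); the text proposes spaCy instead.
FUNCTION_WORDS = {"a", "an", "the", "and", "or", "but", "of", "in", "on", "to",
                  "for", "with", "at", "by", "is", "are", "was", "were", "it", "that"}


def features(text: str) -> dict:
    """Extract the per-text datapoints used by the custom eval."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    return {
        "function_words": sum(w in FUNCTION_WORDS for w in words) / len(words),
        "punctuation_dist": text.count(",") / len(sentences),
        "sentence_length": len(words) / len(sentences),
    }


def ratio(a: float, b: float) -> float:
    """Symmetric ratio in [0, 1]: 1.0 when equal, smaller as the values diverge."""
    if a == b:
        return 1.0
    return min(a, b) / max(a, b)


def custom_score(sample: str, fingerprint: str) -> float:
    """Equal-weight average of the per-metric ratios."""
    fs, ff = features(sample), features(fingerprint)
    return sum(ratio(fs[k], ff[k]) for k in fs) / len(fs)


def combined_score(llm_score: float, custom: float, llm_weight: float = 0.7) -> float:
    """Blend the LLM score and the custom eval (70/30 by default, as in the text)."""
    return llm_weight * llm_score + (1 - llm_weight) * custom
```

For example, `combined_score(llm_score, custom_score(sample, fingerprint))` would give the blended similarity_score; the weights are a starting point to tune against the benchmark set.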
Prepare a fixed set of at least 10 sample texts for testing. These samples should not change, because they serve as the benchmark.
Real student writing is best; failing that, take text from any fiction or non-fiction book.
(Real writing is preferable because books have editors and reviewers who clean up and normalize the language.)
Each example should have a fingerprint text, at least 1 text that is expected to pass (matching_answer), and at least 1 that is expected to fail (non_matching_answer).
Every time the code or LLM version changes, rerun these samples and check whether performance is better or worse than before.
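A benchmark run over these samples might look like the sketch below. It assumes each case is a dict with exactly one `matching_answer` and one `non_matching_answer` (the text allows more), and that the scorer is any callable taking (fingerprint, text) and returning a score from 0 to 1. `run_benchmark` is a hypothetical name.

```python
from typing import Callable, Dict, List

THRESHOLD = 0.3  # scores below this are flagged, per the text


def run_benchmark(cases: List[Dict[str, str]],
                  score: Callable[[str, str], float]) -> float:
    """Return the fraction of expectations met across all cases."""
    correct = 0
    total = 0
    for case in cases:
        fp = case["fingerprint"]
        # A matching answer should NOT be flagged.
        correct += score(fp, case["matching_answer"]) >= THRESHOLD
        # A non-matching answer SHOULD be flagged.
        correct += score(fp, case["non_matching_answer"]) < THRESHOLD
        total += 2
    return correct / total if total else 0.0
```

Storing the returned fraction alongside the code and LLM version makes it easy to compare one run against the previous one.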
Get to MVP
Get to improved version
Perf verification
x inputs. This is the max load your system can handle, since the LLM is the bottleneck.
Note:
Rough calculation:
input = (prompt + fingerprint + new input text) = 3000 tokens (~1000 words)
output = 300 tokens (~100 words)
# assume LLM takes 2s for input, gives 50 tokens/s output
total_time_per_doc = 2 + 300/50 = 8 seconds ≈ 0.0022 hours
# We need to process all docs in the queue within 12 hours
# (because then we'll start getting new docs)
max_allowed_time = 12 hours = 43200 seconds
max_docs_in_allowed_time = max_allowed_time/total_time_per_doc
= 43200/8 = 5400 docs
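The same estimate can be written as a small function so the assumptions are easy to vary. Working in seconds avoids rounding the per-document time (8 s is closer to 0.0022 hours than 0.002), which is why the defaults give 5400. The `workers` parameter is an added assumption for parallel LLM calls; the calculation above is for a single sequential worker.

```python
def max_docs(window_hours: float,
             input_seconds: float = 2.0,
             output_tokens: int = 300,
             tokens_per_second: float = 50.0,
             workers: int = 1) -> int:
    """Documents that can be processed within the window."""
    # Time per document = input processing time + output generation time.
    seconds_per_doc = input_seconds + output_tokens / tokens_per_second
    # workers is a hypothetical knob: identical, independent parallel workers.
    return int(window_hours * 3600 / seconds_per_doc) * workers


print(max_docs(12))             # 5400
print(max_docs(12, workers=4))  # 21600
```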
Add servers or downgrade the LLM as needed to handle the load. Postgres/Redis/Kafka will handle this load easily.