Many existing studies measure the similarity between documents written in the same language, such as Vietnamese - Vietnamese or English - English. Recently, however, a different form of copying has appeared: English sources are translated into Vietnamese and then edited into an author's own manuscript. This is considered cross-language plagiarism. This study therefore applies a new approach: English documents are translated into Vietnamese, and each translated document is then compared with documents that were modified or copied from a translation. The study focuses on three stages: translating English documents into Vietnamese, preprocessing the documents, and determining the similarity between documents. Similarity is determined with two metrics: cosine similarity based on Term Frequency (TF) and Inverse Document Frequency (IDF), and word order similarity in the text. The two metrics are combined to give a similarity result that is more accurate and convincing. The data set covers 7 related topics with 15 documents, each from 2,000 to more than 8,000 words long. A system integrating document translation, based on the Google Translate Application Programming Interface (API), with similarity checking was successfully built, and the Precision and Recall measures show very positive results of over 80%.
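The combination of the two metrics described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the system built in the study: the abstract does not give the combination formula, so the weighted sum and the weight `alpha` are hypothetical. The word order measure uses the common formulation 1 - ||r1 - r2|| / ||r1 + r2|| over word position vectors, the IDF smoothing is one common variant, and the whitespace tokenizer stands in for proper Vietnamese word segmentation. All function names are illustrative.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (assumption); real Vietnamese text
    # would need proper word segmentation during preprocessing.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (assumption) so that terms occurring in every
    # document still receive a non-zero weight.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * idf[t] for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def word_order_similarity(a, b):
    """1 - ||r1 - r2|| / ||r1 + r2|| over first-occurrence positions."""
    joint = list(dict.fromkeys(a + b))  # joint word set, order preserved

    def positions(tokens):
        first = {}
        for i, t in enumerate(tokens, start=1):
            first.setdefault(t, i)
        # 1-based position of each joint word, 0 if the word is absent.
        return [first.get(t, 0) for t in joint]

    r1, r2 = positions(a), positions(b)
    diff = math.sqrt(sum((x - y) ** 2 for x, y in zip(r1, r2)))
    total = math.sqrt(sum((x + y) ** 2 for x, y in zip(r1, r2)))
    return 1.0 - diff / total if total else 0.0


def combined_similarity(text_a, text_b, alpha=0.8):
    # alpha is a hypothetical weight; the abstract does not state one.
    a, b = tokenize(text_a), tokenize(text_b)
    va, vb = tfidf_vectors([a, b])
    return alpha * cosine(va, vb) + (1 - alpha) * word_order_similarity(a, b)
```

In this sketch, two texts with the same words in a different order score lower than two identical texts, because only the word order term changes. That difference is the kind of signal the second metric is meant to add to a bag-of-words cosine score.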
Can Tho University Journal of Science