RETVec promises to cut down on Gmail spam and reduce false positives
Google has a shiny new tool to keep your Gmail inbox spam-free.
RETVec is short for Resilient and Efficient Text Vectorizer. Vectorization is a methodology in natural language processing that maps words or phrases to corresponding vectors of real numbers, which are then used for further analysis, predictions, and word-similarity comparisons, per Towards Data Science.
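To make the idea of vectorization concrete, here is a minimal Python sketch of the simplest possible scheme, a word-count vector. It is a toy illustration only and not how RETVec works; the corpus, the function names `build_vocabulary` and `vectorize`, and the sample texts are all made up for this example.

```python
from collections import Counter


def build_vocabulary(corpus):
    """Assign each distinct lowercase word a fixed index, in sorted order."""
    words = sorted({word for text in corpus for word in text.lower().split()})
    return {word: index for index, word in enumerate(words)}


def vectorize(text, vocabulary):
    """Map a text to a vector of word counts; words outside the vocabulary are ignored."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


# Hypothetical two-message corpus used only to build the vocabulary.
corpus = ["win a free prize", "meeting notes for monday"]
vocabulary = build_vocabulary(corpus)

clean = vectorize("free prize free", vocabulary)
obfuscated = vectorize("fr33 pr1ze fr33", vocabulary)

print(clean)
print(obfuscated)
```

In this sketch the obfuscated message maps to an all-zero vector because none of its words appear in the vocabulary, which hints at why word-level schemes struggle with deliberately misspelled text.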
With RETVec, Gmail will be better at spotting spam emails that rely on invisible characters, LEET substitution (3xpl4in3d instead of explained, for example), intentional typos, and similar tricks to slip past filters. Harmful email messages will have a tougher time making it into inboxes.
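The evasion tricks mentioned above are easy to reproduce. The following Python sketch shows how LEET substitution and invisible characters defeat a naive keyword filter. The two-letter substitution table, the choice of the zero-width space, and the `naive_filter` function are assumptions made for illustration; they do not describe Gmail's actual filtering.

```python
ZERO_WIDTH_SPACE = "\u200b"  # a character that is invisible when rendered

# Hypothetical, minimal LEET table: just enough to reproduce "3xpl4in3d".
LEET_TABLE = str.maketrans({"e": "3", "a": "4"})


def leet(word):
    """Replace selected letters with look-alike digits."""
    return word.translate(LEET_TABLE)


def hide_characters(word):
    """Insert a zero-width space between every pair of letters."""
    return ZERO_WIDTH_SPACE.join(word)


def naive_filter(text, blocked_words):
    """Flag the text if it contains any blocked word as an exact substring."""
    lowered = text.lower()
    return any(word in lowered for word in blocked_words)


blocked = ["explained"]
for variant in ("explained", leet("explained"), hide_characters("explained")):
    print(repr(variant), naive_filter(variant, blocked))
```

Only the unmodified word is flagged here. Both manipulated variants look nearly identical to a human reader but no longer match the blocked string exactly.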
More than 100 languages supported
“RETVec is trained to be resilient against character-level manipulations including insertion, deletion, typos, homoglyphs, LEET substitution, and more,” Google explains on GitHub. “The RETVec model is trained on top of a novel character encoder which can encode all UTF-8 characters and words efficiently.”
Right out of the box, RETVec will support more than 100 languages, Google said, adding that it could thus be deployed in different scenarios:
“Due to its novel architecture, RETVec works out-of-the-box on every language and all UTF-8 characters without the need for text preprocessing, making it the ideal candidate for on-device, web, and large-scale text classification deployments,” Google’s Elie Bursztein and Marina Zhang noted.
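One way to see why a character-level approach needs no per-language preprocessing is to encode each character by the bits of its Unicode code point. The Python sketch below does that. It is a simplified illustration, not RETVec's actual encoder, whose internals are not described here; the 24-bit width, the 16-character word length, and the function names are assumptions.

```python
BITS = 24  # assumption: the largest Unicode code point (0x10FFFF) fits in 21 bits
MAX_CHARS = 16  # assumption: longer words are truncated, shorter ones zero-padded


def encode_char(char, bits=BITS):
    """Return the character's code point as a list of bits, least significant first."""
    code_point = ord(char)
    return [(code_point >> i) & 1 for i in range(bits)]


def encode_word(word, max_chars=MAX_CHARS):
    """Encode a word as a fixed-size grid of max_chars rows by BITS columns."""
    rows = [encode_char(char) for char in word[:max_chars]]
    padding = [[0] * BITS for _ in range(max_chars - len(rows))]
    return rows + padding


# The same function handles Latin, Cyrillic, and Japanese text with no vocabulary.
for word in ("spam", "спам", "スパム"):
    grid = encode_word(word)
    print(word, len(grid), len(grid[0]))
```

Because every character maps to a fixed-width bit pattern, there is no vocabulary to build and no out-of-vocabulary case in this sketch; a downstream model would still have to learn which patterns matter.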
With RETVec, Google’s spam detection rate increased by 38%, the company said, adding that its false positive rate dropped by almost a fifth (19.4%).
The Tensor Processing Unit (TPU) usage of the model dropped by 83%.
“Models trained with RETVec exhibit faster inference speed due to its compact representation. Having smaller models reduces computational costs and decreases latency, which is critical for large-scale applications and on-device models,” Bursztein and Zhang added.
Spam is one of the most popular attack vectors in existence, used by virtually all cybercriminals. It’s omnipresent, cheap, and efficient, and it enables threat actors to deliver malware and steal sensitive data.