A new open-source project employs a neural network to generate names for organic compounds that comply with IUPAC nomenclature, showing the potential of this technology to handle exact algorithmic problems efficiently.
Organic compounds, which generally contain carbon-hydrogen bonds, are named according to a standardized set of rules maintained by the International Union of Pure and Applied Chemistry (IUPAC), commonly referred to as IUPAC nomenclature. Following this standard, however, can produce long and tedious names that spell out a molecule's structure. For example, table sugar, known scientifically as sucrose, has the preferred IUPAC name (2R,3R,4S,5S,6R)-2-{[(2S,3S,4S,5R)-3,4-dihydroxy-2,5-bis(hydroxymethyl)oxolan-2-yl]oxy}-6-(hydroxymethyl)oxane-3,4,5-triol.
More importantly, these names tolerate no omission of even a single digit or character, so chemists must pay close attention when writing reports and notes and must understand the nomenclature rules clearly. While off-the-shelf software tools that assist with IUPAC nomenclature are available,
The new neural network for naming organic compounds is detailed in the article "Transformer-based artificial neural networks for the conversion between chemical notations," published in the journal Scientific Reports.
Training a Neural Network for Exact Algorithmic Problems
Researchers from the Skolkovo Institute of Science and Technology (Skoltech) in Moscow, Russia, together with colleagues from Lomonosov Moscow State University and the AI startup Syntelly, developed and trained the new neural network to generate names for organic compounds in compliance with IUPAC nomenclature.
"Initially, we wanted to create an IUPAC name generator for Syntelly, our AI chemistry platform. Soon we realized that it would take us more than a year to create an algorithm by digitizing the IUPAC rules, so we decided instead to leverage our experience in neural network solutions," explains Sergey Sosnin, lead author of the study, a Skoltech researcher and co-founder of Syntelly, in a press release from the institute.
To realize this solution, the researchers used the Transformer, a neural network architecture developed by Google and one of today's most powerful approaches to machine translation. They used it as the basis for their own network, which they trained to convert the structural representation of a molecule into an IUPAC name and vice versa.
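Treating name generation as translation means both notations must first be split into token sequences. The sketch below illustrates that preprocessing step only. It assumes the structural representation is a SMILES string, which the article does not state, and it uses a regular expression of the kind commonly applied to SMILES. The function names and the character-level splitting of names are illustrative simplifications, not the authors' code.

```python
import re

# Assumption: molecules are written as SMILES strings.
# The pattern splits a SMILES string into bracket atoms, two-letter
# halogens, ring-closure labels, single-letter atoms, bonds and branches.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOSPFIbcnosp]|[=#\-\+\\/\(\)\.@:~\*\$]|\d)"
)


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into tokens; fail if any character is unknown."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


def tokenize_name(name: str) -> list[str]:
    """Hypothetical simplification: treat every character of the name as a token."""
    return list(name)


# One source/target pair for a sequence-to-sequence model: ethanol.
source = tokenize_smiles("CCO")
target = tokenize_name("ethanol")
print(source, target)
```

Swapping the source and target sequences gives a training pair for the reverse direction, from name to structure.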
Testing and Demonstrating the Network's Capacity
To test the capability of the new neural network to generate names consistent with IUPAC nomenclature, the Skoltech team used PubChem, the world's largest collection of freely accessible chemical data, which holds more than 100 million compounds. After six weeks of development, the Transformer-based neural network performed the required conversion with almost the same accuracy as rule-based, algorithmic solutions.
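The article does not say how accuracy was measured. Since an IUPAC name cannot lose a single character, one plausible way to score a name generator is exact string match against reference names. The sketch below assumes that metric; the function name and the sample names are hypothetical and are not taken from the study.

```python
def exact_match_accuracy(predicted: list[str], reference: list[str]) -> float:
    """Fraction of predicted names identical to the reference names.

    Assumption: a prediction counts only if every character matches,
    ignoring leading and trailing whitespace.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("at least one name is required")
    hits = sum(p.strip() == r.strip() for p, r in zip(predicted, reference))
    return hits / len(reference)


# Illustrative data only: the third prediction has one wrong locant.
predicted = ["ethanol", "propan-2-ol", "butan-1-ol"]
reference = ["ethanol", "propan-2-ol", "butan-2-ol"]
print(exact_match_accuracy(predicted, reference))
```

A single wrong digit, as in the third pair, counts as a complete miss under this metric, which mirrors how strict the naming rules are.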
More importantly, the study demonstrates how neural networks can potentially address exact algorithmic problems. Sosnin explains that distinguishing two images, such as a cat and a dog, is an "equally easy task" for humans and neural networks, even though no existing method can produce a "purely algorithmic solution" for it. IUPAC naming is the opposite case: it is governed by exact rules, yet the neural network handled it as well.
Skoltech researchers have already implemented the new neural network on the Syntelly platform, where it is publicly available online. They hope the new method can be used for converting between chemical notations and for other related technical tasks.