SMILES augmentation

Let’s look at an example molecule and see how we can apply different augmentation strategies to its SMILES representation.

Let’s consider the fomepizole drug. One of its SMILES is given by the following string [1]: Cc1cn[nH]c1.

[1] Gaulton A, et al. (2017) ‘The ChEMBL database in 2017.’ Nucleic Acids Res., 45(D1) D945-D954. DOI: 10.1093/nar/gkw1074

[1]:

smiles = "Cc1cn[nH]c1"
print(f"SMILES of fomepizole drug: {smiles}")

SMILES of fomepizole drug: Cc1cn[nH]c1

Let’s visualize the molecule using RDKit.

[2]:

import rdkit
from rdkit import Chem
from rdkit.Chem import Draw, AllChem

[3]:

molecule = Chem.MolFromSmiles(smiles)
molecule

[3]:

../_images/example_smiles_augmentation_4_0.png

Canonical SMILES

Let’s convert the SMILES into the canonical form using RDKit.

[4]:

from maxsmi.utils.utils_smiles import smiles_to_canonical, smiles_to_random

[5]:

canonical_smiles = smiles_to_canonical(smiles)
print(f"Canonical SMILES of the molecule:\t {canonical_smiles} \n"
      f"Length of canonical SMILES:\t\t {len(canonical_smiles)}")

Canonical SMILES of the molecule:        Cc1cn[nH]c1
Length of canonical SMILES:              11

Random SMILES

Let’s look at different variations of the SMILES.

[6]:

smiles_to_random?

Signature: smiles_to_random(smiles, int_aug=50)
Docstring:
Takes a SMILES (not necessarily canonical) and returns `int_aug` random variations of this SMILES.

Parameters
----------
smiles : str
    SMILES string describing a compound.
int_aug : int, Optional, default: 50
    The number of random SMILES generated.

Returns
-------
list
    A list of `int_aug` random (may not be unique) SMILES or None if the initial SMILES is not valid.
File:      ~/Documents/github/maxsmi/maxsmi/utils/utils_smiles.py
Type:      function

[7]:

random_smiles = smiles_to_random(smiles, int_aug=5)

[8]:

print(f"Random SMILES: {random_smiles}")

Random SMILES: ['Cc1cn[nH]c1', 'c1n[nH]cc1C', 'Cc1cn[nH]c1', 'n1[nH]cc(C)c1', 'c1c(cn[nH]1)C']

Augmentation strategies

No augmentation

As a baseline, no random SMILES are generated, just one SMILES is kept.

[9]:

from maxsmi.augmentation_strategies import no_augmentation

[10]:

one_smiles = no_augmentation(smiles)
print(f"No augmentation for SMILES: {one_smiles}")

No augmentation for SMILES: ['Cc1cn[nH]c1']

Augmentation with duplication

Here random SMILES are generated on the fly, despite duplicates.

[11]:

from maxsmi.augmentation_strategies import augmentation_with_duplication

[12]:

augmentation_with_duplication?

Signature: augmentation_with_duplication(smiles, augmentation_number)
Docstring:
Takes a SMILES and returns a list of random SMILES with possible duplicates.

Parameters
----------
smiles : str
    SMILES string describing a compound.
augmentation_number : int
    The integer to generate the number of random SMILES.

Returns
-------
list
    A list containing the given number of random SMILES, which might include duplicated SMILES.
File:      ~/Documents/github/maxsmi/maxsmi/augmentation_strategies.py
Type:      function

[13]:

duplicated_smiles = augmentation_with_duplication(smiles, 100)

Since we are keeping duplicates, we should end up with 100 SMILES:

[14]:

print(f"Number of SMILES: {len(duplicated_smiles)}")

Number of SMILES: 100

Augmentation without duplication

Here unique random SMILES are generated.

[15]:

from maxsmi.augmentation_strategies import augmentation_without_duplication

[16]:

non_duplicated_smiles = augmentation_without_duplication(smiles, 100)

[17]:

print(f"Number of (unique) SMILES: {len(non_duplicated_smiles)}")

Number of (unique) SMILES: 20

Augmentation with reduced duplication

Here random SMILES are generated and only a fraction of duplicates are kept.

[18]:

from maxsmi.augmentation_strategies import (
    augmentation_with_reduced_duplication
)

[19]:

augmentation_with_reduced_duplication?

Signature: augmentation_with_reduced_duplication(smiles, augmentation_number)
Docstring:
Takes a SMILES and returns a list of random SMILES with a reduced amount of duplicates.
The reduction is square root given the number of duplicates.

Parameters
----------
smiles : str
    SMILES string describing a compound.
augmentation_number : int
    The integer to generate the number of random SMILES.

Returns
-------
list
    A list of random SMILES with a reduced amount of duplicates.
File:      ~/Documents/github/maxsmi/maxsmi/augmentation_strategies.py
Type:      function

[20]:

reduced_duplicated_smiles = augmentation_with_reduced_duplication(smiles, 100)

[21]:

print(f"Number of SMILES: {len(reduced_duplicated_smiles)}")

Number of SMILES: 53

Augmentation with estimated maximum

Here we try to generate a large portion of the SMILES space.

[22]:

from maxsmi.augmentation_strategies import augmentation_maximum_estimation

[23]:

augmentation_maximum_estimation?

Signature: augmentation_maximum_estimation(smiles, max_duplication=10)
Docstring:
Returns augmented SMILES with estimated maximum number.

Parameters
----------
smiles : str
    SMILES string describing a compound.
max_duplication : int, Optional, default: 10
    The number of redundant SMILES that have to be generated before stopping augmentation process.

Returns
-------
list
    A list of "estimated" maximum unique random SMILES.
File:      ~/Documents/github/maxsmi/maxsmi/augmentation_strategies.py
Type:      function

[24]:

estimated_max_smiles = augmentation_maximum_estimation(smiles)

[25]:

print(f"Number of SMILES: {len(estimated_max_smiles)}")

Number of SMILES: 18

Deep Neural Network Input

Here is an example of how to numerically encode a SMILES to use as input for a neural network.

Example of one-hot encoded SMILES

Let’s look at the afatinib kinase inhibitor, given by the SMILES CN(C)CC=CC(=O)NC1=C(C=C2C(=C1)C(=NC=N2)NC3=CC(=C(C=C3)F)Cl)OC4CCOC4.

[26]:

import matplotlib.pyplot as plt
from matplotlib import colors
from maxsmi.utils.utils_encoding import (
    char_replacement,
    get_unique_elements_as_dict,
    one_hot_encode,
)

[27]:

smiles = "CN(C)CC=CC(=O)NC1=C(C=C2C(=C1)C(=NC=N2)NC3=CC(=C(C=C3)F)Cl)OC4CCOC4"
print("Length of original smile:\t\t\t", len(smiles))

# Replace double symbols
new_smiles = char_replacement([smiles])[0]
print("Length of double characters replaced smiles:\t", len(new_smiles))

# Obtain dictionary for this SMILES string
smi_dict = get_unique_elements_as_dict(new_smiles)
print("Number of unique characters: \t\t\t", len(smi_dict))

# One-hot encode SMILES string
one_hot_smiles = one_hot_encode(new_smiles, smi_dict)
print("Shape of one-hot encoded matrix: \t\t", one_hot_smiles.shape)

Length of original smile:                        67
Length of double characters replaced smiles:     66
Number of unique characters:                     12
Shape of one-hot encoded matrix:                 (12, 66)

Let’s plot the molecule using RDKit.

[28]:

molecule = Chem.MolFromSmiles(smiles)
molecule

[28]:

../_images/example_smiles_augmentation_42_0.png

We now visualize the one-hot encoded SMILES.

[29]:

f, ax = plt.subplots()
ax.imshow(one_hot_smiles, cmap=colors.ListedColormap([(1, 1, 1), "black"]))
ax.set_xticks([])
ax.set_yticks([])
ax.patch.set_edgecolor("black")
ax.patch.set_linewidth("1")
plt.title(f"one-hot encoded SMILES of afatinib drug")
plt.show()

../_images/example_smiles_augmentation_44_0.png

Data

Let’s look at some of the data available in maxsmi.

[30]:

from maxsmi.utils.utils_data import data_retrieval
from maxsmi.utils.utils_smiles import smiles_to_canonical, smiles_to_random

[31]:

data_retrieval?

Signature: data_retrieval(target_data='ESOL')
Docstring:
Retrieve data from MoleculeNet, or ChEMBL.

Parameters
----------
target_data: str
    The target data to be considered. The default is the ESOL data set.

Returns
-------
dataframe: pd.Pandas
    Pandas data frame with two columns:
        - `smiles`: SMILES encoding of the compounds.
        - `target`: the measured target values.
File:      ~/Documents/github/maxsmi/maxsmi/utils/utils_data.py
Type:      function

Currently implemented data sets:

ESOL (and ESOL_small)
FreeSolv
lipophilicity
EGFR affinity (ChEMBL28/Kinodata)

[32]:

# Read data
df = data_retrieval("ESOL")

# Canonical SMILES
df["canonical_smiles"] = df["smiles"].apply(smiles_to_canonical)

# Random SMILES
df["random_smiles"] = df["canonical_smiles"].apply(smiles_to_random, args=(2,))

print(f"Shape of dataframe: {df.shape}\nColumns: {df.columns}")

Shape of dataframe: (1128, 4)
Columns: Index(['target', 'smiles', 'canonical_smiles', 'random_smiles'], dtype='object')

[33]:

df.head(4)

[33]:

	target	smiles	canonical_smiles	random_smiles
0	-0.77	OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)...	N#CC(OC1OC(COC2OC(CO)C(O)C(O)C2O)C(O)C(O)C1O)c...	[C1(C(O)C(OCC2OC(C(C(C2O)O)O)OC(C#N)c2ccccc2)O...
1	-3.30	Cc1occc1C(=O)Nc2ccccc2	Cc1occc1C(=O)Nc1ccccc1	[Cc1c(C(=O)Nc2ccccc2)cco1, N(C(=O)c1ccoc1C)c1c...
2	-2.06	CC(C)=CCCC(C)=CC(=O)	CC(C)=CCCC(C)=CC=O	[C(CCC(=CC=O)C)=C(C)C, O=CC=C(CCC=C(C)C)C]
3	-7.87	c1ccc2c(c1)ccc3c2ccc4c5ccccc5ccc43	c1ccc2c(c1)ccc1c2ccc2c3ccccc3ccc21	[c1cc2c(ccc3c2cccc3)c2ccc3ccccc3c21, c1cccc2c3...