Feature extraction and manipulation



Copyright (c) 2023 Valence Labs, Recursion Pharmaceuticals and Graphcore Limited.

Use of this software is subject to the terms and conditions outlined in the LICENSE file. Unauthorized modification, distribution, or use is prohibited. Provided 'as is' without warranties of any kind.

Valence Labs, Recursion Pharmaceuticals and Graphcore Limited are not liable for any damages arising from its use. Refer to the LICENSE file for the full terms and conditions.


Store the parameters required to initialize a PyG graph, but as a dictionary to reduce memory consumption.

Possible keys for the dictionary:

  • adj: A sparse Tensor containing the adjacency matrix

  • ndata: A dictionary containing different keys and Tensors associated with the node features.

  • edata: A dictionary containing different keys and Tensors associated with the edge features.

  • dtype: The dtype for the floating data.

  • mask_nan: How to handle molecules for which part of the featurization fails. NaNs can occur when a property is undefined for an atom, for example for a noble gas, or when a property has not been measured for specific atoms.

    • "raise": Raise an error when there is a NaN or inf in the featurization
    • "warn": Raise a warning when there is a NaN or inf in the featurization
    • "None": DEFAULT. Don't do anything
    • "Floating value": Replace NaNs or infs by the specified value
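The four `mask_nan` modes can be illustrated with a minimal pure-Python sketch. The helper name `apply_mask_nan` is hypothetical and not part of the library; it only mirrors the behaviour described above on a flat list of floats.

```python
import math
import warnings


def apply_mask_nan(values, mask_nan=None):
    """Hypothetical helper mirroring the documented `mask_nan` modes."""
    has_bad = any(math.isnan(v) or math.isinf(v) for v in values)
    if mask_nan is None or not has_bad:
        return list(values)  # nothing to do
    if mask_nan == "raise":
        raise ValueError("NaN or inf found in the featurization")
    if mask_nan == "warn":
        warnings.warn("NaN or inf found in the featurization")
        return list(values)
    # Any other value is treated as the float that replaces NaN/inf entries
    fill = float(mask_nan)
    return [fill if (math.isnan(v) or math.isinf(v)) else v for v in values]


features = [1.0, float("nan"), float("inf")]
cleaned = apply_mask_nan(features, mask_nan=0.0)
```

Passing a float keeps the molecule in the dataset at the cost of an arbitrary fill value, while "raise" makes failures visible immediately.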

Convert the current dictionary of parameters, containing an adjacency matrix with node/edge data, into a PyG graph of torch Tensors.

**kwargs can be used to overwrite any parameter from the current dictionary. See GraphDict.__init__ for a list of parameters.

get_estimated_bond_length(bond, mol)

Estimate the bond length between two atoms from their estimated atomic radii, which depend on both the atom type and the bond type. The resulting bond length is the sum of the two radii.

Keep in mind that this function only provides an estimate of the bond length and not the true one based on a conformer. The vast majority of estimated bond lengths will have an error below 5%, while some bonds can have an error of up to 20%. This function is mostly useful when conformer generation fails for some molecules, or for increased computation speed.
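The sum-of-radii idea can be sketched as follows. The radii in this table are approximate covalent radii chosen for illustration only; they are an assumption of this sketch and not the table used by the library, and `estimate_bond_length` is a hypothetical name.

```python
# Illustrative, approximate covalent radii in Angstrom, keyed by
# (element, bond type). Assumed values for this sketch only.
RADII = {
    ("C", "single"): 0.75,
    ("C", "double"): 0.67,
    ("C", "triple"): 0.60,
    ("O", "single"): 0.63,
    ("O", "double"): 0.57,
    ("H", "single"): 0.32,
}


def estimate_bond_length(elem_a, elem_b, bond_type):
    """Sum the bond-type-specific radii of the two bonded atoms."""
    return RADII[(elem_a, bond_type)] + RADII[(elem_b, bond_type)]


cc_single = estimate_bond_length("C", "C", "single")
co_double = estimate_bond_length("C", "O", "double")
```

Because the lookup needs no 3D coordinates, it still works when conformer generation fails.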


Parameters:

  • bond (Bond): The bond whose length is estimated.
  • mol (Mol): The molecule containing the bond (used to get neighbouring atoms).

Returns:

  • bond_length (float): The bond length in Angstrom, typically a value around 1-2.

get_mol_atomic_features_float(mol, property_list, offset_carbon=True, mask_nan='raise')

Get a dictionary of floating-point arrays of atomic properties. To ensure all properties are at a similar scale, some of the properties are divided by a constant.

The properties can also be offset by the carbon value using the offset_carbon parameter.


Parameters:

  • mol: Molecule from which to extract the properties.

  • property_list: A list of atomic properties to get from the molecule, such as 'atomic-number', 'mass', 'valence', 'degree', 'electronegativity'. Some elements are divided by a factor to avoid feature explosion. Accepted properties are:

    • "atomic-number"
    • "mass", "weight"
    • "valence", "total-valence"
    • "implicit-valence"
    • "hybridization"
    • "chirality"
    • "aromatic"
    • "ring", "in-ring"
    • "min-ring"
    • "max-ring"
    • "num-ring"
    • "degree"
    • "radical-electron"
    • "formal-charge"
    • "vdw-radius"
    • "covalent-radius"
    • "electronegativity"
    • "ionization", "first-ionization"
    • "melting-point"
    • "metal"
    • "single-bond"
    • "aromatic-bond"
    • "double-bond"
    • "triple-bond"
    • "is-carbon"
    • "group"
    • "period"

  • offset_carbon: Whether to subtract the Carbon property from the desired atomic property. For example, if we want the mass of Lithium (6.941), the mass of Carbon (12.0107) will be subtracted, resulting in a value of -5.0697.

  • mask_nan: How to handle molecules for which part of the featurization fails. NaNs can occur when a property is undefined for an atom, for example for a noble gas, or when a property has not been measured for specific atoms.

    • "raise": DEFAULT (per the signature above). Raise an error when there is a NaN or inf in the featurization
    • "warn": Raise a warning when there is a NaN or inf in the featurization
    • "None": Don't do anything
    • "Floating value": Replace NaNs or infs by the specified value

Returns:

  • A dictionary where the elements of ``property_list`` are the keys and the values are np.ndarray of shape (N,). N is the number of atoms in ``mol``.
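The carbon offset amounts to a single subtraction. A minimal sketch, using the masses quoted in the description of offset_carbon (the helper name `offset_by_carbon` is hypothetical):

```python
CARBON_MASS = 12.0107  # value quoted in the parameter description


def offset_by_carbon(value, carbon_value=CARBON_MASS, offset_carbon=True):
    """Return the property relative to carbon when `offset_carbon` is set."""
    return value - carbon_value if offset_carbon else value


lithium_mass = 6.941
relative = offset_by_carbon(lithium_mass)  # approximately -5.0697
```

Centering on carbon keeps the most common atom in organic molecules close to zero for each feature.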

get_mol_atomic_features_onehot(mol, property_list)

Get the following set of features for any given atom:

  • One-hot representation of the atom
  • One-hot representation of the atom degree
  • One-hot representation of the atom implicit valence
  • One-hot representation of the atom hybridization
  • Whether the atom is aromatic
  • The atom's formal charge
  • The atom's number of radical electrons

Additionally, the following features can be set, depending on the values of the input parameters:

  • One-hot representation of the number of hydrogen atoms in the current atom's neighborhood if explicit_H is false
  • One-hot encoding of the atom chirality, and whether such configuration is even possible


Parameters:

  • mol: Molecule from which to extract the properties.

  • property_list: A list of integer atomic properties to get from the molecule. The integer values are converted to a one-hot vector. Callables are not supported by this function. Accepted properties are:

    • "atomic-number"
    • "degree"
    • "valence", "total-valence"
    • "implicit-valence"
    • "hybridization"
    • "chirality"
    • "phase"
    • "type"
    • "group"
    • "period"

Returns:

  • prop_dict (Dict[str, Tensor]): A dictionary where the elements of property_list are the keys and the values are np.ndarray of shape (N, OH). N is the number of atoms in mol and OH the length of the one-hot encoding.
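As a rough illustration of how an integer property becomes an (N, OH) array, here is a pure-Python one-hot sketch. The extra trailing slot for unknown values and the list of allowed degrees are assumptions of this example, not the library's actual vocabulary.

```python
def one_hot(value, allowed_values):
    """Encode `value` as a one-hot list over `allowed_values`.

    Assumption for this sketch: unknown values map to an extra last slot.
    """
    vector = [0] * (len(allowed_values) + 1)
    if value in allowed_values:
        index = allowed_values.index(value)
    else:
        index = len(allowed_values)
    vector[index] = 1
    return vector


# Degrees of the heavy atoms in ethanol (C-C-O) with implicit hydrogens.
degrees = [1, 2, 1]
allowed_degrees = [0, 1, 2, 3, 4]
encoded = [one_hot(d, allowed_degrees) for d in degrees]  # N = 3 rows, OH = 6 columns
```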

get_mol_conformer_features(mol, property_list, mask_nan=None)

Obtain the conformer features of a molecule.

Parameters:

  • mol: Molecule from which to extract the properties.

  • property_list: A list of conformer properties to get from the molecule. Accepted properties are:

    • "positions_3d"

Returns:

  • prop_dict (Dict[str, ndarray]): A dictionary where the elements of property_list are the keys.

get_mol_edge_features(mol, property_list, mask_nan='raise')

Get the following set of features for any given bond. See graphium.features.nmp for the allowed values in the one-hot encoding.

  • One-hot representation of the bond type. Note that you should not kekulize your molecules if you expect this to take aromatic bonds into account.
  • Bond stereo type, following CIP classification
  • Whether the bond is conjugated
  • Whether the bond is in a ring


Parameters:

  • mol (Mol): The RDKit molecule of interest.

  • property_list (List[str]): A list of edge properties to return for the given molecule. Accepted properties are:

    • "bond-type-onehot"
    • "bond-type-float"
    • "stereo"
    • "in-ring"
    • "conjugated"
    • "conformer-bond-length" (might cause problems with complex molecules)
    • "estimated-bond-length"

Returns:

  • prop_dict (Dict[str, ndarray]): A dictionary where the elements of property_list are the keys and the values are np.ndarray of shape (N,). N is the number of atoms in mol.


If the molecule has a conformer, then it will return the conformer at idx 0. Otherwise, it generates a simple molecule conformer using rdkit.Chem.rdDistGeom.EmbedMolecule and returns it. This is meant to be used in simple functions like GetBondLength, not in functions requiring complex 3D structure.


Parameters:

  • mol: RDKit molecule.

Returns:

  • conf (Union[Conformer, None]): A conformer of the molecule, or None if it fails.

mol_to_adj_and_features(mol, atom_property_list_onehot=[], atom_property_list_float=[], conformer_property_list=[], edge_property_list=[], add_self_loop=False, explicit_H=False, use_bonds_weights=False, pos_encoding_as_features=None, dtype=np.float16, mask_nan='raise')

Transforms a molecule into an adjacency matrix representing the molecular graph and a set of atom and bond features.

It also returns the positional encodings associated with the graph.


Parameters:

  • mol: The molecule to be converted.

  • atom_property_list_onehot: List of the properties used to get one-hot encoding of the atom type, such as the atom index represented as a one-hot vector. See function `get_mol_atomic_features_onehot`.

  • atom_property_list_float: List of the properties used to get floating-point encoding of the atom type, such as the atomic mass or electronegativity. See function `get_mol_atomic_features_float`.

  • conformer_property_list: List of properties used to encode the conformer information, outside of atom properties. Currently supports "positions_3d".

  • edge_property_list: List of the properties used to encode the edges, such as the edge type and the stereo type.

  • add_self_loop: Whether to add a value of `1` on the diagonal of the adjacency matrix.

  • explicit_H: Whether to consider the hydrogens explicitly. If `False`, the hydrogens are implicit.

  • use_bonds_weights: Whether to use the floating-point value of the bonds in the adjacency matrix, such that single bonds are represented by 1, double bonds by 2, triple bonds by 3 and aromatic bonds by 1.5.

  • pos_encoding_as_features: Keyword arguments for function `graph_positional_encoder` to generate positional encoding for node features.

  • dtype: The numpy data type used to build the graph.

  • mask_nan: How to handle molecules for which part of the featurization fails. NaNs can occur when a property is undefined for an atom, for example for a noble gas, or when a property has not been measured for specific atoms.

    • "raise": DEFAULT (per the signature above). Raise an error when there is a NaN or inf in the featurization
    • "warn": Raise a warning when there is a NaN or inf in the featurization
    • "None": Don't do anything
    • "Floating value": Replace NaNs or infs by the specified value

Returns, in order:

  • The torch coo sparse adjacency matrix of the molecule.

  • The concatenated node data of the atoms, based on the properties from `atom_property_list_onehot` and `atom_property_list_float`. If no properties are given, it returns `None`.

  • The concatenated edge data of the molecule, based on the properties from `edge_property_list`. If no properties are given, it returns `None`.

  • A dictionary of all positional encodings. Currently supported keys:

    • "pos_enc_feats_sign_flip": Node positional encoding that requires augmentation via sign-flip. For example, eigenvectors of the Laplacian are ambiguous up to the sign and are returned here.
    • "pos_enc_feats_no_flip": Node positional encoding that does not use sign-flip. For example, distances from the centroid are returned here.
    • "rwse": Node structural encoding corresponding to the diagonal of the random walk matrix.

  • The 3D positions of a conformer of the molecule, or 0s if none is found.
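The sign-flip augmentation mentioned for "pos_enc_feats_sign_flip" can be sketched in a few lines: because an eigenvector v and its negation -v are equally valid, each eigenvector column is multiplied by a random sign during training. This is an illustrative sketch with a hypothetical helper name, not the library's implementation.

```python
import random


def random_sign_flip(eigvecs, rng=None):
    """Flip the sign of each eigenvector (column) with probability 0.5.

    `eigvecs` is a list of rows with shape [num_nodes, num_pos].
    """
    rng = rng or random.Random()
    num_pos = len(eigvecs[0])
    signs = [rng.choice((-1.0, 1.0)) for _ in range(num_pos)]
    return [[value * sign for value, sign in zip(row, signs)] for row in eigvecs]


eigvecs = [[0.5, -0.7], [0.5, 0.0], [0.5, 0.7]]
flipped = random_sign_flip(eigvecs, rng=random.Random(0))
```

The sign is drawn once per column, so the relative signs of the entries within one eigenvector are preserved.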

mol_to_adjacency_matrix(mol, use_bonds_weights=False, add_self_loop=False, dtype=np.float32)

Convert a molecule to a sparse adjacency matrix, as a torch Tensor. Instead of using the RDKit GetAdjacencyMatrix() method, this method uses the bond ordering from the molecule object, which is the same as the bond ordering in the bond features.


Do not use Tensor.coalesce() on the returned adjacency matrix, as it will change the ordering of the bonds.


Parameters:

  • mol (Mol): A molecule in the form of a SMILES string or an RDKit molecule object.
  • use_bonds_weights (bool): If True, the adjacency matrix will contain the bond type as the value of the edge. If False, the adjacency matrix will contain 1 as the value of the edge.
  • add_self_loop (bool): If True, the adjacency matrix will contain a self-loop for each node.
  • dtype (dtype): The data type used to build the graph.

Returns:

  • adj (coo_matrix): The coo sparse adjacency matrix of the molecule.
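To see why the entry order matters, the following sketch builds COO-style index and value lists directly from a bond list. It assumes each bond contributes two consecutive entries (one per direction) and uses the bond weights 1, 2, 3 and 1.5 described for use_bonds_weights; the function and variable names are hypothetical.

```python
BOND_WEIGHTS = {"single": 1.0, "double": 2.0, "triple": 3.0, "aromatic": 1.5}


def bonds_to_coo(bonds, num_atoms, use_bonds_weights=False, add_self_loop=False):
    """Build COO-style (rows, cols, values) lists in bond order.

    `bonds` is a list of (begin_atom, end_atom, bond_type) tuples. Each bond
    contributes two consecutive entries, one per direction (an assumption of
    this sketch).
    """
    rows, cols, values = [], [], []
    for begin, end, bond_type in bonds:
        weight = BOND_WEIGHTS[bond_type] if use_bonds_weights else 1.0
        rows += [begin, end]
        cols += [end, begin]
        values += [weight, weight]
    if add_self_loop:
        for atom in range(num_atoms):
            rows.append(atom)
            cols.append(atom)
            values.append(1.0)
    return rows, cols, values


# Bonds listed in the molecule's own order, which need not be sorted by atom index.
bonds = [(1, 2, "double"), (0, 1, "single")]
rows, cols, values = bonds_to_coo(bonds, num_atoms=3, use_bonds_weights=True)
```

Sorting the (row, col) pairs, which is what coalescing a sparse tensor does, would move the entries of the second bond in front of the first, so they would no longer line up with per-bond edge features stored in bond order.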

mol_to_graph_dict(mol, atom_property_list_onehot=[], atom_property_list_float=[], conformer_property_list=[], edge_property_list=[], add_self_loop=False, explicit_H=False, use_bonds_weights=False, pos_encoding_as_features=None, dtype=np.float16, on_error='ignore', mask_nan='raise', max_num_atoms=None)

Transforms a molecule into an adjacency matrix representing the molecular graph and a set of atom and bond features, and re-organizes them into a dictionary that can be used to build a PyG graph.

Compared to mol_to_pyggraph, this function does not build the graph directly, and is thus faster, less memory heavy, and compatible with other frameworks.


Parameters:

  • mol: The molecule to be converted.

  • atom_property_list_onehot: List of the properties used to get one-hot encoding of the atom type, such as the atom index represented as a one-hot vector. See function `get_mol_atomic_features_onehot`.

  • atom_property_list_float: List of the properties used to get floating-point encoding of the atom type, such as the atomic mass or electronegativity. See function `get_mol_atomic_features_float`.

  • conformer_property_list: List of properties used to encode the conformer information, outside of atom properties. Currently supports "positions_3d".

  • edge_property_list: List of the properties used to encode the edges, such as the edge type and the stereo type.

  • add_self_loop: Whether to add a value of `1` on the diagonal of the adjacency matrix.

  • explicit_H: Whether to consider the hydrogens explicitly. If `False`, the hydrogens are implicit.

  • use_bonds_weights: Whether to use the floating-point value of the bonds in the adjacency matrix, such that single bonds are represented by 1, double bonds by 2, triple bonds by 3 and aromatic bonds by 1.5.

  • pos_encoding_as_features: Keyword arguments for function `graph_positional_encoder` to generate positional encoding for node features.

  • dtype: The numpy data type used to build the graph.

  • on_error: What to do when the featurization fails. This can change the behavior of `mask_nan`.

    • "raise": Raise an error
    • "warn": Raise a warning and return a string of the error
    • "ignore": Ignore the error and return a string of the error

  • mask_nan: How to handle molecules for which part of the featurization fails. NaNs can occur when a property is undefined for an atom, for example for a noble gas, or when a property has not been measured for specific atoms.

    • "raise": DEFAULT (per the signature above). Raise an error when there is a NaN or inf in the featurization
    • "warn": Raise a warning when there is a NaN or inf in the featurization
    • "None": Don't do anything
    • "Floating value": Replace NaNs or infs by the specified value

  • max_num_atoms: Maximum number of atoms for a given molecule. If a molecule with more atoms is given, an error is raised, but captured according to the rules of `on_error`.

Returns:

  • A dictionary `GraphDict` containing the keys required to build a graph, and which can be used to build a PyG graph. If it fails to featurize the molecule, it returns a string with the error. The keys are:

    • "adj": A sparse int-array containing the adjacency matrix
    • "data": A dictionary containing different keys and numpy arrays associated with the (node, edge & graph) features.
    • "dtype": The numpy dtype for the floating data.
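Since a failed featurization is reported as a string rather than a graph dictionary, callers need to check the type of the result. The following sketch mimics the three `on_error` modes with a hypothetical wrapper and a stand-in featurizer; it does not call the library.

```python
import warnings


def featurize_with_on_error(featurize, mol, on_error="ignore"):
    """Hypothetical wrapper: run `featurize(mol)` and apply the `on_error` rule."""
    try:
        return featurize(mol)
    except Exception as error:
        if on_error == "raise":
            raise
        if on_error == "warn":
            warnings.warn(str(error))
        # "warn" and "ignore" both return the error as a string
        return str(error)


def failing_featurizer(mol):
    """Stand-in featurizer that always fails."""
    raise ValueError(f"cannot featurize {mol!r}")


result = featurize_with_on_error(failing_featurizer, "not-a-molecule", on_error="ignore")
is_failure = isinstance(result, str)  # a string signals a failed featurization
```

Filtering on `isinstance(result, str)` is a simple way to drop failed molecules from a dataset while keeping their error messages for logging.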


Get the default arguments of mol_to_graph_dict and update them with a provided dict of arguments, in order to get the full signature of the featurizer arguments actually used to compute the features.


Parameters:

  • featurizer_args (Dict[str, Any]): A dictionary of featurizer arguments to update.

Returns:

  • A dictionary of featurizer arguments.
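One way to obtain such a full signature is to read the keyword defaults with `inspect.signature` and overwrite them with the user-provided arguments. This is a sketch under that assumption; `example_featurizer` is a stand-in with only a few of the real arguments, and `full_signature` is a hypothetical name.

```python
import inspect


def example_featurizer(mol, add_self_loop=False, explicit_H=False, mask_nan="raise"):
    """Stand-in with a few keyword arguments; not the real featurizer."""
    return mol


def full_signature(func, user_args=None):
    """Collect the keyword defaults of `func` and overwrite them with `user_args`."""
    defaults = {
        name: param.default
        for name, param in inspect.signature(func).parameters.items()
        if param.default is not inspect.Parameter.empty
    }
    defaults.update(user_args or {})
    return defaults


merged = full_signature(example_featurizer, {"explicit_H": True})
```

Storing the merged dictionary alongside cached features makes it possible to detect when a cache was produced with different featurizer settings.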

mol_to_pyggraph(mol, atom_property_list_onehot=[], atom_property_list_float=[], conformer_property_list=[], edge_property_list=[], add_self_loop=False, explicit_H=False, use_bonds_weights=False, pos_encoding_as_features=None, dtype=np.float16, on_error='ignore', mask_nan='raise', max_num_atoms=None)

Transforms a molecule into an adjacency matrix representing the molecular graph and a set of atom and bond features.

Then, the adjacency matrix and node/edge features are used to build a PyG graph with PyTorch Tensors.


Parameters:

  • mol: The molecule to be converted.

  • atom_property_list_onehot: List of the properties used to get one-hot encoding of the atom type, such as the atom index represented as a one-hot vector. See function `get_mol_atomic_features_onehot`.

  • atom_property_list_float: List of the properties used to get floating-point encoding of the atom type, such as the atomic mass or electronegativity. See function `get_mol_atomic_features_float`.

  • conformer_property_list: List of properties used to encode the conformer information, outside of atom properties. Currently supports "positions_3d".

  • edge_property_list: List of the properties used to encode the edges, such as the edge type and the stereo type.

  • add_self_loop: Whether to add a value of `1` on the diagonal of the adjacency matrix.

  • explicit_H: Whether to consider the hydrogens explicitly. If `False`, the hydrogens are implicit.

  • use_bonds_weights: Whether to use the floating-point value of the bonds in the adjacency matrix, such that single bonds are represented by 1, double bonds by 2, triple bonds by 3 and aromatic bonds by 1.5.

  • pos_encoding_as_features: Keyword arguments for function `graph_positional_encoder` to generate positional encoding for node features.

  • dtype: The numpy data type used to build the graph.

  • on_error: What to do when the featurization fails. This can change the behavior of `mask_nan`.

    • "raise": Raise an error
    • "warn": Raise a warning and return a string of the error
    • "ignore": Ignore the error and return a string of the error

  • mask_nan: How to handle molecules for which part of the featurization fails. NaNs can occur when a property is undefined for an atom, for example for a noble gas, or when a property has not been measured for specific atoms.

    • "raise": DEFAULT (per the signature above). Raise an error when there is a NaN in the featurization
    • "warn": Raise a warning when there is a NaN in the featurization
    • "None": Don't do anything
    • "Floating value": Replace NaNs by the specified value

  • max_num_atoms: Maximum number of atoms for a given molecule. If a molecule with more atoms is given, an error is raised, but captured according to the rules of `on_error`.

Returns:

  • A PyG graph, with `graph['feat']` corresponding to the concatenated node data from `atom_property_list_onehot` and `atom_property_list_float`, and `graph['edge_feat']` corresponding to the concatenated edge data from `edge_property_list`. There are also additional entries for the positional encodings.

to_dense_array(array, dtype=None)

Convert an array to a dense array.

Parameters:

  • array: The array to convert to dense.
  • dtype: The dtype of the array.

Returns:

  • The dense array.

to_dense_tensor(tensor, dtype=None)

Convert a tensor to a dense tensor.

Parameters:

  • tensor: The tensor to convert to dense.
  • dtype: The dtype of the tensor.

Returns:

  • The dense tensor.

Positional Encoding


get_all_positional_encodings(adj, num_nodes, pos_kwargs=None)

Get features positional encoding.


Parameters:

  • adj ([num_nodes, num_nodes]): Adjacency matrix of the graph.
  • num_nodes (int): Number of nodes in the graph.
  • pos_kwargs: Keyword arguments for function graph_positional_encoder to generate positional encoding for node features.

Returns:

  • pe_dict (Tuple[OrderedDict[str, ndarray]]): Dictionary of positional and structural encodings.

graph_positional_encoder(adj, num_nodes, pos_type=None, pos_level=None, pos_kwargs=None, cache=None)

Get a positional encoding that depends on the parameters.


Parameters:

  • adj ([num_nodes, num_nodes]): Adjacency matrix of the graph.

  • num_nodes (int): Number of nodes in the graph.

  • pos_type (Optional[str]): The type of positional encoding to use. If None, it must be provided by pos_kwargs["pos_type"]. Supported types are:

    • laplacian_eigvec and laplacian_eigval (these cache the connected components and the eigendecomposition)
    • rwse
    • electrostatic and commute (these cache pinvL)
    • graphormer

  • pos_level (Optional[str]): Positional level to output. If None, it must be provided by pos_kwargs["pos_level"]. Supported levels are:

    • node
    • edge
    • nodepair
    • graph

  • pos_kwargs (Optional[Dict[str, Any]]): Extra keyword arguments for the positional encoding. Can include the keys pos_type and pos_level.

  • cache (Optional[Dict[str, Any]]): Dictionary of cached objects.

Returns:

  • pe (Dict[str, ndarray]): Positional or structural encoding.
  • cache (Dict[str, Any]): Updated dictionary of cached objects.
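The fallback from explicit arguments to `pos_kwargs` can be pictured with the following sketch. The helper `resolve_pos_args` is hypothetical, and the extra `ksteps` key in the usage line is only an illustrative example of an additional keyword argument.

```python
def resolve_pos_args(pos_type=None, pos_level=None, pos_kwargs=None):
    """Hypothetical helper: fall back to `pos_kwargs` when an argument is None."""
    remaining = dict(pos_kwargs or {})  # copy so the caller's dict is untouched
    kw_type = remaining.pop("pos_type", None)
    kw_level = remaining.pop("pos_level", None)
    pos_type = pos_type if pos_type is not None else kw_type
    pos_level = pos_level if pos_level is not None else kw_level
    if pos_type is None or pos_level is None:
        raise ValueError("pos_type and pos_level must be given directly or in pos_kwargs")
    return pos_type, pos_level, remaining


resolved = resolve_pos_args(
    pos_kwargs={"pos_type": "rwse", "pos_level": "node", "ksteps": 4}
)
```

Keeping `pos_type` and `pos_level` inside `pos_kwargs` lets a whole encoding be described by a single configuration dictionary.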


get_prop_or_none(prop, n, *args, **kwargs)

Return the properties. If an error occurs, return a list of None with length n.

Parameters:

  • prop: The property to compute.
  • n: The number of elements in the property.
  • *args: The arguments to pass to the property.
  • **kwargs: The keyword arguments to pass to the property.

Returns:

  • The property, or a list of None with length n.
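The behaviour described above fits in a few lines. This is a sketch of the documented contract, not necessarily the library's exact code, and `broken_descriptor` is a made-up failing property used for the demonstration.

```python
def get_prop_or_none(prop, n, *args, **kwargs):
    """Call `prop(*args, **kwargs)`; on any error return a list of `n` Nones."""
    try:
        return prop(*args, **kwargs)
    except Exception:
        return [None] * n


def broken_descriptor(mol):
    """Stand-in property that always fails."""
    raise RuntimeError("descriptor failed")


ok = get_prop_or_none(lambda mol: [1.0, 2.0], 2, "CCO")
fallback = get_prop_or_none(broken_descriptor, 3, "CCO")
```

Returning a fixed-length list of None keeps the overall feature vector the same size even when one descriptor family fails.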

get_props_from_mol(mol, properties='autocorr3d')

Function to get a given set of desired properties from a molecule, and output a property list.


Parameters:

  • mol (Union[Mol, str]): The molecule from which to compute the properties.

  • properties (Union[List[str], str]): The list of properties to compute for each molecule. It can be the following:

    • 'descriptors'
    • 'autocorr3d'
    • 'rdf'
    • 'morse'
    • 'whim'
    • 'all'

Returns:

  • props (ndarray): np.array(float). The array of properties for the desired molecule.

  • classes_start_idx (ndarray): list(int). The list of indices specifying the start of each new class of descriptor or property. For example, if props has 20 elements, the first 5 are rotatable bonds, the next 8 are morse, and the rest are whim, then classes_start_idx = [0, 5, 13]. This is mainly useful to normalize the features of each class.

  • classes_names (ndarray): list(str). The names of the classes associated with each starting index. This is useful to understand which properties the network is learning.
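The example with 20 properties and classes_start_idx = [0, 5, 13] can be turned into a small slicing helper. The helper name and the class names below are illustrative only.

```python
def split_by_class(props, classes_start_idx, classes_names):
    """Slice `props` into one list per class using the start indices."""
    ends = list(classes_start_idx[1:]) + [len(props)]
    return {
        name: props[start:end]
        for name, start, end in zip(classes_names, classes_start_idx, ends)
    }


props = list(range(20))  # stand-in for 20 computed property values
groups = split_by_class(props, [0, 5, 13], ["rotatable_bonds", "morse", "whim"])
sizes = {name: len(values) for name, values in groups.items()}
```

Each group can then be normalized on its own, for instance by subtracting the group mean.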

Spectral PE


compute_laplacian_pe(adj, num_pos, cache, disconnected_comp=True, normalization='none')

Compute the eigenvalues and eigenvectors of the Laplacian of the graph.


Parameters:

  • adj ([num_nodes, num_nodes]): Adjacency matrix of the graph.
  • num_pos (int): Number of Laplacian eigenvectors to compute.
  • cache (Dict[str, Any]): Dictionary of cached objects.
  • disconnected_comp (bool): Whether to compute the eigenvectors for each connected component.
  • normalization (str): Normalization to apply to the Laplacian.

Returns:

  • Two possible outputs:

    • eigvals [num_nodes, num_pos]: Eigenvalues of the Laplacian, repeated for each node. This repetition is necessary in the case of disconnected components, where the eigenvalues of the Laplacian are not the same for each node.
    • eigvecs [num_nodes, num_pos]: Eigenvectors of the Laplacian.

  • base_level (str): Indicator of the output pos_level (node, edge, nodepair, graph); here, node.

  • cache (Dict[str, Any]): Updated dictionary of cached objects.

normalize_matrix(matrix, degree_vector=None, normalization=None)

Normalize a given matrix using its degree vector

Parameters:

  • matrix: torch.tensor(N, N) or scipy.sparse.spmatrix(N, N). A square matrix representing either an adjacency matrix or a Laplacian.

  • degree_vector: torch.tensor(N) or np.ndarray(N) or None. A vector representing the degree of ``matrix``. ``None`` is only accepted if ``normalization==None``.

  • normalization: str or None, default ``None``. Normalization to use on the matrix:

    • 'none' or ``None``: no normalization
    • 'sym': Symmetric normalization ``D^-0.5 L D^-0.5``
    • 'inv': Inverse normalization ``D^-1 L``

Returns:

  • matrix: torch.tensor(N, N) or scipy.sparse.spmatrix(N, N). The normalized matrix.
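The two normalizations can be checked by hand on a tiny graph. The sketch below works on dense lists of rows instead of torch or scipy matrices and assumes that all degrees are non-zero; `normalize_dense` is a hypothetical name.

```python
import math


def normalize_dense(matrix, degree_vector=None, normalization=None):
    """Normalize a dense square matrix (list of rows) by its degree vector."""
    if normalization in (None, "none"):
        return [row[:] for row in matrix]
    if degree_vector is None:
        raise ValueError("degree_vector is required unless normalization is None")
    n = len(matrix)
    if normalization == "sym":  # D^-0.5 L D^-0.5
        scale = [1.0 / math.sqrt(d) for d in degree_vector]
        return [[scale[i] * matrix[i][j] * scale[j] for j in range(n)] for i in range(n)]
    if normalization == "inv":  # D^-1 L
        return [[matrix[i][j] / degree_vector[i] for j in range(n)] for i in range(n)]
    raise ValueError(f"unknown normalization: {normalization!r}")


# Laplacian L = D - A of the 3-node path graph 0-1-2
laplacian = [[1.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]]
degrees = [1.0, 2.0, 1.0]
sym = normalize_dense(laplacian, degrees, "sym")
inv = normalize_dense(laplacian, degrees, "inv")
```

The symmetric form keeps the matrix symmetric, so its eigenvalues stay real, while the inverse form makes each row of an adjacency matrix sum to one.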

Random Walk PE

compute_rwse(adj, ksteps, num_nodes, cache, pos_type='rw_return_probs' or 'rw_transition_probs', space_dim=0)

Compute Random Walk Spectral Embedding (RWSE) for given list of K steps.


Parameters:

  • adj ([num_nodes, num_nodes]): Adjacency matrix.

  • ksteps (Union[int, List[int]]): List of numbers of steps for the random walks. If int, a list is generated from 1 to ksteps.

  • num_nodes (int): Number of nodes in the graph.

  • cache (Dict[str, Any]): Dictionary of cached objects.

  • pos_type (str): Desired output, either 'rw_return_probs' or 'rw_transition_probs'.

  • space_dim (int): Estimated dimensionality of the space. Used to correct the random-walk diagonal by a factor k^(space_dim/2). In Euclidean space, this correction means that the height of the Gaussian distribution stays almost constant across the number of steps, if space_dim is the dimension of the Euclidean space.

Returns:

  • Two possible outputs:

    • rw_return_probs [num_nodes, len(ksteps)]: Random-walk k-step landing probabilities.
    • rw_transition_probs [num_nodes, num_nodes, len(ksteps)]: Random-walk k-step transition probabilities.

  • base_level: Indicator of the output pos_level (node, edge, nodepair, graph); here, either node or nodepair.

  • cache: Updated dictionary of cached objects.
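For intuition, the return probabilities are the diagonals of the powers of the random-walk matrix P = D^-1 A, optionally scaled by k^(space_dim/2). The dense pure-Python sketch below illustrates this on a small graph; it assumes `ksteps` is sorted in ascending order and is not the library's sparse implementation.

```python
def matmul(a, b):
    """Dense matrix product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def rw_return_probs(adj, ksteps, space_dim=0):
    """Return a [num_nodes, len(ksteps)] list of k-step return probabilities."""
    n = len(adj)
    # Row-normalize the adjacency matrix: P = D^-1 A (isolated nodes keep a zero row)
    transition = [[v / sum(row) if sum(row) else 0.0 for v in row] for row in adj]
    power = [[float(i == j) for j in range(n)] for i in range(n)]  # P^0 = I
    out = [[] for _ in range(n)]
    for k in range(1, max(ksteps) + 1):
        power = matmul(power, transition)  # now holds P^k
        if k in ksteps:
            correction = k ** (space_dim / 2)
            for i in range(n):
                out[i].append(power[i][i] * correction)
    return out


adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # 3-node path graph 0-1-2
rwse = rw_return_probs(adj, ksteps=[1, 2, 3])
```

On this path graph a walker can only return after an even number of steps, so the columns for k = 1 and k = 3 are zero and the middle node returns with probability 1 after two steps.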

get_Pks(ksteps, edge_index, edge_weight=None, num_nodes=None, start_Pk=None, start_k=None)

Compute Random Walk landing probabilities for given list of K steps.


Parameters:

  • ksteps (List[int]): List of numbers of k-steps for which to compute the RW landings.
  • edge_index (Tuple[Tensor, Tensor]): PyG sparse representation of the graph.
  • edge_weight (Optional[Tensor]): Edge weights.
  • num_nodes (Optional[int]): Number of nodes in the graph.

Returns:

  • (Dict[int, ndarray]): 2D Tensor with shape (num_nodes, len(ksteps)) with RW landing probs.



Check if a string can be converted to float; return None if it can't.

Parameters:

  • string (str): The string to convert.

Returns:

  • val (float or None): The converted value, or None if the conversion fails.
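A minimal sketch of this behaviour, using a hypothetical function name since the original name is not shown here:

```python
def float_or_none(string):
    """Return `float(string)`, or None when the conversion fails."""
    try:
        return float(string)
    except (TypeError, ValueError):
        return None


parsed = [float_or_none(s) for s in ["1.5", "-2e3", "abc", ""]]
```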