graphium.features¶
Feature extraction and manipulation
Featurizer¶
graphium.features.featurizer
¶
GraphDict
¶
Bases: dict
__init__(dic)
¶
Store the parameters required to initialize a pyg.data.Data
, but
as a dictionary to reduce memory consumption.
Possible keys for the dictionary:
-
adj: A sparse Tensor containing the adjacency matrix
-
ndata: A dictionary containing different keys and Tensors associated with the node features.
-
edata: A dictionary containing different keys and Tensors associated with the edge features.
-
dtype: The dtype for the floating data.
-
mask_nan: How to handle molecules for which part of the featurization fails. NaNs can occur when a property is undefined for an atom, as can happen for a noble gas, or when a property has not been measured for specific atoms.
- "raise": Raise an error when there is a nan or inf in the featurization
- "warn": Raise a warning when there is a nan or inf in the featurization
- "None": DEFAULT. Don't do anything
- "Floating value": Replace nans or inf by the specified value
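The four `mask_nan` policies listed above can be sketched as follows. This is a minimal illustration, not graphium's implementation: the helper name `apply_mask_nan` and the use of plain Python lists are assumptions made for the example.

```python
import math
import warnings


def apply_mask_nan(values, mask_nan=None):
    """Hypothetical helper: handle NaN/inf entries according to `mask_nan`."""
    bad = [i for i, v in enumerate(values) if math.isnan(v) or math.isinf(v)]
    if not bad or mask_nan is None:
        # None policy: leave the values untouched
        return list(values)
    if mask_nan == "raise":
        raise ValueError(f"NaN or inf found at positions {bad}")
    if mask_nan == "warn":
        warnings.warn(f"NaN or inf found at positions {bad}")
        return list(values)
    # Otherwise, treat `mask_nan` as the replacement value
    fill = float(mask_nan)
    return [fill if i in bad else v for i, v in enumerate(values)]


features = [2.55, float("nan"), 3.44]
masked = apply_mask_nan(features, mask_nan=0.0)
print(masked)  # [2.55, 0.0, 3.44]
```

Passing `mask_nan="raise"` with the same input would raise a `ValueError` instead of returning a list.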
make_pyg_graph(**kwargs)
¶
Convert the current dictionary of parameters, containing an adjacency matrix with node/edge data
into a pyg.data.Data
of torch Tensors.
**kwargs
can be used to overwrite any parameter from the current dictionary. See GraphDict.__init__
for a list of parameters
get_estimated_bond_length(bond, mol)
¶
Estimate the bond length between atoms by looking at the estimated atomic radius, which depends on both the atom type and the bond type. The resulting bond length is then the sum of the two radii.
Keep in mind that this function only provides an estimate of the bond length and not the true one based on a conformer. The vast majority of estimated bond lengths will have an error below 5%, while some bonds can have an error of up to 20%. This function is mostly useful when conformer generation fails for some molecules, or for increased computation speed.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
bond |
Chem.rdchem.Bond
|
The bond whose length is estimated |
required |
mol |
dm.Mol
|
The molecule containing the bond (used to get neighbouring atoms) |
required |
Returns:
Name | Type | Description |
---|---|---|
bond_length |
float
|
The bond length in Angstrom, typically a value around 1-2. |
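The sum-of-radii idea can be sketched without RDKit. The radii table below holds approximate, illustrative covalent radii chosen for this example; it is an assumption and not the table used by `get_estimated_bond_length`, and the function name `estimate_bond_length` is hypothetical.

```python
# Illustrative covalent radii in Angstrom, keyed by (element, bond type).
# Approximate values chosen for this sketch, not graphium's table.
COVALENT_RADIUS = {
    ("C", "single"): 0.76,
    ("C", "double"): 0.67,
    ("O", "single"): 0.66,
    ("O", "double"): 0.57,
    ("H", "single"): 0.31,
}


def estimate_bond_length(elem_a, elem_b, bond_type):
    """Estimate a bond length as the sum of two bond-type-specific radii."""
    return COVALENT_RADIUS[(elem_a, bond_type)] + COVALENT_RADIUS[(elem_b, bond_type)]


print(round(estimate_bond_length("C", "C", "single"), 2))  # 1.52
print(round(estimate_bond_length("C", "O", "double"), 2))  # 1.24
```

Both estimates fall in the 1-2 Angstrom range mentioned in the return description.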
get_mol_atomic_features_float(mol, property_list, offset_carbon=True, mask_nan='raise')
¶
Get a dictionary of floating-point arrays of atomic properties. To ensure all properties are at a similar scale, some of the properties are divided by a constant.
The values can also be offset by the carbon value using
the offset_carbon
parameter.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol |
dm.Mol
|
molecule from which to extract the properties |
required |
property_list |
Union[List[str], List[Callable]]
|
A list of atomic properties to get from the molecule, such as 'atomic-number', 'mass', 'valence', 'degree', 'electronegativity'. Some elements are divided by a factor to avoid feature explosion. Accepted properties are:
|
required |
offset_carbon |
bool
|
Whether to subtract the carbon property from the desired atomic property. For example, if we want the mass of lithium (6.941), the mass of carbon (12.0107) is subtracted, resulting in a value of -5.0697 |
True
|
mask_nan |
Union[str, float, type(None)]
|
How to handle molecules for which part of the featurization fails. NaNs can occur when a property is undefined for an atom, as can happen for a noble gas, or when a property has not been measured for specific atoms.
|
'raise'
|
Returns:
Name | Type | Description |
---|---|---|
prop_dict |
Dict[str, np.ndarray]
|
A dictionary where the element of |
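The carbon offset described for `offset_carbon` is a plain subtraction. The sketch below reproduces the lithium example from the parameter table; the helper name `offset_by_carbon` and the `MASS` dictionary are assumptions made for illustration.

```python
# Atomic masses quoted in the parameter description above
MASS = {"Li": 6.941, "C": 12.0107}


def offset_by_carbon(prop_by_element, element):
    """Return the property of `element` relative to carbon."""
    return prop_by_element[element] - prop_by_element["C"]


print(round(offset_by_carbon(MASS, "Li"), 4))  # -5.0697
print(offset_by_carbon(MASS, "C"))             # 0.0
```

With this offset, carbon, the most common heavy atom in organic molecules, maps to zero for every property.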
get_mol_atomic_features_onehot(mol, property_list)
¶
Get the following set of features for any given atom
- One-hot representation of the atom
- One-hot representation of the atom degree
- One-hot representation of the atom implicit valence
- One-hot representation of the atom hybridization
- Whether the atom is aromatic
- The atom's formal charge
- The atom's number of radical electrons
Additionally, the following features can be set, depending on the value of input Parameters
- One-hot representation of the number of hydrogen atoms in the current atom's neighborhood, if explicit_H is false
- One-hot encoding of the atom chirality, and whether such a configuration is even possible
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol |
dm.Mol
|
molecule from which to extract the properties |
required |
property_list |
List[str]
|
A list of integer atomic properties to get from the molecule. The integer values are converted to a one-hot vector. Callables are not supported by this function. Accepted properties are:
|
required |
Returns:
Name | Type | Description |
---|---|---|
prop_dict |
Dict[str, Tensor]
|
A dictionary where the element of |
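Converting an integer property to a one-hot vector can be sketched as below. The list of allowed values and the final catch-all slot for unlisted values are assumptions of this example; the values graphium actually accepts are defined in its own modules.

```python
def one_hot(value, allowed_values):
    """Encode `value` as a one-hot list over `allowed_values`.

    The extra final slot catches values outside the list; that catch-all
    is an assumption of this sketch.
    """
    encoding = [0] * (len(allowed_values) + 1)
    if value in allowed_values:
        encoding[allowed_values.index(value)] = 1
    else:
        encoding[-1] = 1
    return encoding


DEGREES = [0, 1, 2, 3, 4]  # hypothetical list of allowed atom degrees
print(one_hot(2, DEGREES))  # [0, 0, 1, 0, 0, 0]
print(one_hot(7, DEGREES))  # [0, 0, 0, 0, 0, 1]
```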
get_mol_conformer_features(mol, property_list, mask_nan=None)
¶
Obtain the conformer features of a molecule.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol |
dm.Mol
|
molecule from which to extract the properties |
required |
property_list |
Union[List[str], List[Callable]]
|
A list of conformer properties to get from the molecule. Accepted properties are: "positions_3d" |
required |
Returns:
Name | Type | Description |
---|---|---|
prop_dict |
Dict[str, np.ndarray]
|
a dictionary where the element of |
get_mol_edge_features(mol, property_list, mask_nan='raise')
¶
Get the following set of features for any given bond
See graphium.features.nmp
for allowed values in one hot encoding
- One-hot representation of the bond type. Note that you should not kekulize your molecules if you expect this to take aromatic bonds into account.
- Bond stereo type, following CIP classification
- Whether the bond is conjugated
- Whether the bond is in a ring
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol |
dm.Mol
|
The RDKit molecule of interest |
required |
property_list |
List[str]
|
A list of edge properties to return for the given molecule. Accepted properties are:
|
required |
Returns:
Name | Type | Description |
---|---|---|
prop_dict |
Dict[str, np.ndarray]
|
A dictionary where the element of |
get_simple_mol_conformer(mol)
¶
If the molecule has a conformer, then it will return the conformer at idx 0
.
Otherwise, it generates a simple molecule conformer using rdkit.Chem.rdDistGeom.EmbedMolecule
and returns it. This is meant to be used in simple functions like GetBondLength
,
not in functions requiring complex 3D structure.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol |
dm.Mol
|
Rdkit Molecule |
required |
Returns:
Name | Type | Description |
---|---|---|
conf |
Union[Chem.rdchem.Conformer, None]
|
A conformer of the molecule, or |
mol_to_adj_and_features(mol, atom_property_list_onehot=[], atom_property_list_float=[], conformer_property_list=[], edge_property_list=[], add_self_loop=False, explicit_H=False, use_bonds_weights=False, pos_encoding_as_features=None, dtype=np.float16, mask_nan='raise')
¶
Transforms a molecule into an adjacency matrix representing the molecular graph and a set of atom and bond features.
It also returns the positional encodings associated to the graph.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol |
Union[str, dm.Mol]
|
The molecule to be converted |
required |
atom_property_list_onehot |
List[str]
|
List of the properties used to get one-hot encoding of the atom type,
such as the atom index represented as a one-hot vector.
See function |
[]
|
atom_property_list_float |
List[Union[str, Callable]]
|
List of the properties used to get floating-point encoding of the atom type,
such as the atomic mass or electronegativity.
See function |
[]
|
conformer_property_list |
List[str]
|
List of properties used to encode the conformer information, outside of atom properties. Currently supports "positions_3d" |
[]
|
edge_property_list |
List[str]
|
List of the properties used to encode the edges, such as the edge type and the stereo type. |
[]
|
add_self_loop |
bool
|
Whether to add a value of |
False
|
explicit_H |
bool
|
Whether to consider the hydrogens explicitly. If |
False
|
use_bonds_weights |
bool
|
Whether to use the floating-point value of the bonds in the adjacency matrix, such that single bonds are represented by 1, double bonds 2, triple 3, aromatic 1.5 |
False
|
pos_encoding_as_features |
Dict[str, Any]
|
keyword arguments for function |
None
|
dtype |
np.dtype
|
The numpy data type used to build the graph |
np.float16
|
mask_nan |
Union[str, float, type(None)]
|
How to handle molecules for which part of the featurization fails. NaNs can occur when a property is undefined for an atom, as can happen for a noble gas, or when a property has not been measured for specific atoms.
|
'raise'
|
Returns:
Name | Type | Description |
---|---|---|
adj |
Union[coo_matrix, Union[Tensor, None], Union[Tensor, None], Dict[str, Tensor], Union[Tensor, None], Dict[str, Tensor]]
|
torch coo sparse adjacency matrix of the molecule |
ndata |
Union[coo_matrix, Union[Tensor, None], Union[Tensor, None], Dict[str, Tensor], Union[Tensor, None], Dict[str, Tensor]]
|
Concatenated node data of the atoms, based on the properties from
|
edata |
Union[coo_matrix, Union[Tensor, None], Union[Tensor, None], Dict[str, Tensor], Union[Tensor, None], Dict[str, Tensor]]
|
Concatenated edge data of the molecule, based on the properties from
|
pe_dict |
Union[coo_matrix, Union[Tensor, None], Union[Tensor, None], Dict[str, Tensor], Union[Tensor, None], Dict[str, Tensor]]
|
Dictionary of all positional encodings. Current supported keys:
|
conf_dict |
Union[coo_matrix, Union[Tensor, None], Union[Tensor, None], Dict[str, Tensor], Union[Tensor, None], Dict[str, Tensor]]
|
contains the 3d positions of a conformer of the molecule or 0s if none is found |
mol_to_adjacency_matrix(mol, use_bonds_weights=False, add_self_loop=False, dtype=np.float32)
¶
Convert a molecule to a sparse adjacency matrix, as a torch Tensor.
Instead of using the Rdkit GetAdjacencyMatrix()
method, this method
uses the bond ordering from the molecule object, which is the same as
the bond ordering in the bond features.
Warning
Do not use Tensor.coalesce()
on the returned adjacency matrix, as it
will change the ordering of the bonds.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol |
dm.Mol
|
A molecule in the form of a SMILES string or an RDKit molecule object. |
required |
use_bonds_weights |
bool
|
If |
False
|
add_self_loop |
bool
|
If |
False
|
dtype |
np.dtype
|
The data type used to build the graph |
np.float32
|
Returns:
Name | Type | Description |
---|---|---|
adj |
coo_matrix
|
coo sparse adjacency matrix of the molecule |
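The bond-ordered construction, and the reason coalescing would disturb it, can be sketched with plain COO triplets. The helper `bonds_to_coo`, the tuple format of `bonds`, and the choice to emit each bond as two consecutive entries are assumptions of this example; the weights are the ones listed for `use_bonds_weights`.

```python
# Bond weights as listed for `use_bonds_weights`
BOND_WEIGHT = {"single": 1.0, "double": 2.0, "triple": 3.0, "aromatic": 1.5}


def bonds_to_coo(bonds, use_bonds_weights=False):
    """Build COO triplets (rows, cols, data) in the order the bonds are given.

    `bonds` is a list of (begin_atom, end_atom, bond_type) tuples. Each bond
    contributes two entries so that the matrix is symmetric.
    """
    rows, cols, data = [], [], []
    for begin, end, bond_type in bonds:
        weight = BOND_WEIGHT[bond_type] if use_bonds_weights else 1.0
        rows += [begin, end]
        cols += [end, begin]
        data += [weight, weight]
    return rows, cols, data


# Three heavy atoms O0=C1-C2, with the C=O bond listed first
bonds = [(1, 0, "double"), (1, 2, "single")]
rows, cols, data = bonds_to_coo(bonds, use_bonds_weights=True)
print(list(zip(rows, cols, data)))
# [(1, 0, 2.0), (0, 1, 2.0), (1, 2, 1.0), (2, 1, 1.0)]
```

The entries follow the bond list rather than sorted (row, column) order, so any operation that sorts the indices would break the alignment with per-bond edge features.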
mol_to_graph_dict(mol, atom_property_list_onehot=[], atom_property_list_float=[], conformer_property_list=[], edge_property_list=[], add_self_loop=False, explicit_H=False, use_bonds_weights=False, pos_encoding_as_features=None, dtype=np.float16, on_error='ignore', mask_nan='raise', max_num_atoms=None)
¶
Transforms a molecule into an adjacency matrix representing the molecular graph
and a set of atom and bond features, and re-organizes them into a dictionary
that can be used to build a pyg.data.Data
object.
Compared to mol_to_pyggraph
, this function does not build the graph directly,
and is thus faster, less memory heavy, and compatible with other frameworks.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol |
dm.Mol
|
The molecule to be converted |
required |
atom_property_list_onehot |
List[str]
|
List of the properties used to get one-hot encoding of the atom type,
such as the atom index represented as a one-hot vector.
See function |
[]
|
atom_property_list_float |
List[Union[str, Callable]]
|
List of the properties used to get floating-point encoding of the atom type,
such as the atomic mass or electronegativity.
See function |
[]
|
conformer_property_list |
List[str]
|
List of properties used to encode the conformer information, outside of atom properties. Currently supports "positions_3d" |
[]
|
edge_property_list |
List[str]
|
List of the properties used to encode the edges, such as the edge type and the stereo type. |
[]
|
add_self_loop |
bool
|
Whether to add a value of |
False
|
explicit_H |
bool
|
Whether to consider the hydrogens explicitly. If |
False
|
use_bonds_weights |
bool
|
Whether to use the floating-point value of the bonds in the adjacency matrix, such that single bonds are represented by 1, double bonds 2, triple 3, aromatic 1.5 |
False
|
pos_encoding_as_features |
Dict[str, Any]
|
keyword arguments for function |
None
|
dtype |
np.dtype
|
The numpy data type used to build the graph |
np.float16
|
on_error |
str
|
What to do when the featurization fails. This can change the
behavior of
|
'ignore'
|
mask_nan |
Union[str, float, type(None)]
|
How to handle molecules for which part of the featurization fails. NaNs can occur when a property is undefined for an atom, as can happen for a noble gas, or when a property has not been measured for specific atoms.
|
'raise'
|
max_num_atoms |
Optional[int]
|
Maximum number of atoms for a given molecule. If a molecule with more atoms
is given, an error is raised, but captured according to the rules of
|
None
|
Returns:
Name | Type | Description |
---|---|---|
graph_dict |
Union[GraphDict, str]
|
A dictionary
|
mol_to_graph_signature(featurizer_args=None)
¶
Get the default arguments of mol_to_graph_dict
and update it
with a provided dict of arguments in order to get a full signature
of the featurizer args actually used for the features computation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
featurizer_args |
Dict[str, Any]
|
A dictionary of featurizer arguments to update |
None
|
Returns:
Type | Description |
---|---|
Dict[str, Any]
|
A dictionary of featurizer arguments |
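Collecting a function's default arguments and overlaying user-provided values can be done with the standard `inspect` module. The sketch below uses a hypothetical stand-in function `featurize` with a shortened argument list; it shows the general pattern and is not graphium's implementation.

```python
import inspect


def featurize(mol, add_self_loop=False, explicit_H=False, mask_nan="raise"):
    """Hypothetical stand-in for a featurizer with keyword defaults."""
    return None


def full_signature(func, overrides=None):
    """Collect the defaults of `func`, then apply the user overrides."""
    params = inspect.signature(func).parameters
    args = {
        name: p.default
        for name, p in params.items()
        if p.default is not inspect.Parameter.empty
    }
    args.update(overrides or {})
    return args


print(full_signature(featurize, {"explicit_H": True}))
# {'add_self_loop': False, 'explicit_H': True, 'mask_nan': 'raise'}
```

A full signature of this kind is useful as a cache key, because two configurations that differ only in omitted defaults resolve to the same dictionary.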
mol_to_pyggraph(mol, atom_property_list_onehot=[], atom_property_list_float=[], conformer_property_list=[], edge_property_list=[], add_self_loop=False, explicit_H=False, use_bonds_weights=False, pos_encoding_as_features=None, dtype=np.float16, on_error='ignore', mask_nan='raise', max_num_atoms=None)
¶
Transforms a molecule into an adjacency matrix representing the molecular graph and a set of atom and bond features.
Then, the adjacency matrix and node/edge features are used to build a
pyg.data.Data
with pytorch Tensors.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol |
dm.Mol
|
The molecule to be converted |
required |
atom_property_list_onehot |
List[str]
|
List of the properties used to get one-hot encoding of the atom type,
such as the atom index represented as a one-hot vector.
See function |
[]
|
atom_property_list_float |
List[Union[str, Callable]]
|
List of the properties used to get floating-point encoding of the atom type,
such as the atomic mass or electronegativity.
See function |
[]
|
conformer_property_list |
List[str]
|
List of properties used to encode the conformer information, outside of atom properties. Currently supports "positions_3d" |
[]
|
edge_property_list |
List[str]
|
List of the properties used to encode the edges, such as the edge type and the stereo type. |
[]
|
add_self_loop |
bool
|
Whether to add a value of |
False
|
explicit_H |
bool
|
Whether to consider the hydrogens explicitly. If |
False
|
use_bonds_weights |
bool
|
Whether to use the floating-point value of the bonds in the adjacency matrix, such that single bonds are represented by 1, double bonds 2, triple 3, aromatic 1.5 |
False
|
pos_encoding_as_features |
Dict[str, Any]
|
keyword arguments for function |
None
|
dtype |
np.dtype
|
The numpy data type used to build the graph |
np.float16
|
on_error |
str
|
What to do when the featurization fails. This can change the
behavior of
|
'ignore'
|
mask_nan |
Union[str, float, type(None)]
|
How to handle molecules for which part of the featurization fails. NaNs can occur when a property is undefined for an atom, as can happen for a noble gas, or when a property has not been measured for specific atoms.
|
'raise'
|
max_num_atoms |
Optional[int]
|
Maximum number of atoms for a given molecule. If a molecule with more atoms
is given, an error is raised, but captured according to the rules of
|
None
|
Returns:
Name | Type | Description |
---|---|---|
graph |
Union[Data, str]
|
Pyg graph, with |
to_dense_array(array, dtype=None)
¶
Convert the input array to a dense array.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
array |
np.ndarray
|
The array to convert to dense |
required |
dtype |
str
|
The dtype of the array |
None
|
Returns:
Type | Description |
---|---|
np.ndarray
|
The dense array |
to_dense_tensor(tensor, dtype=None)
¶
Convert the input tensor to a dense tensor.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
array |
The array to convert to dense |
required | |
dtype |
str
|
The dtype of the array |
None
|
Returns:
Type | Description |
---|---|
Tensor
|
The dense array |
Positional Encoding¶
graphium.features.positional_encoding
¶
get_all_positional_encodings(adj, num_nodes, pos_kwargs=None)
¶
Get the positional encodings to use as features.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
adj |
[num_nodes, num_nodes]
|
Adjacency matrix of the graph |
required |
num_nodes |
int
|
Number of nodes in the graph |
required |
pos_encoding_as_features |
keyword arguments for function |
required |
Returns:
Name | Type | Description |
---|---|---|
pe_dict |
Tuple[OrderedDict[str, np.ndarray]]
|
Dictionary of positional and structural encodings |
graph_positional_encoder(adj, num_nodes, pos_type=None, pos_level=None, pos_kwargs=None, cache=None)
¶
Get a positional encoding that depends on the parameters.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
adj |
[num_nodes, num_nodes]
|
Adjacency matrix of the graph |
required |
num_nodes |
int
|
Number of nodes in the graph |
required |
pos_type |
Optional[str]
|
The type of positional encoding to use. If None, it must be provided by |
None
|
pos_level |
Optional[str]
|
Positional level to output. If None, it must be provided by |
None
|
pos_kwargs |
Optional[Dict[str, Any]]
|
Extra keyword arguments for the positional encoding. Can include the keys pos_type and pos_level. |
None
|
cache |
Optional[Dict[str, Any]]
|
Dictionary of cached objects |
None
|
Returns:
Name | Type | Description |
---|---|---|
pe |
Dict[str, np.ndarray]
|
Positional or structural encoding |
cache |
Dict[str, Any]
|
Updated dictionary of cached objects |
Properties¶
graphium.features.properties
¶
get_prop_or_none(prop, n, *args, **kwargs)
¶
Return the computed property. If an error occurs, return a list of None of length n.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
prop |
Callable
|
The property to compute. |
required |
n |
int
|
The number of elements in the property. |
required |
*args |
Union[dm.Mol, str]
|
The arguments to pass to the property. |
()
|
**kwargs |
Union[dm.Mol, str]
|
The keyword arguments to pass to the property. |
{}
|
Returns:
Type | Description |
---|---|
Union[List[float], List[None]]
|
The property or a list of |
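The fallback behaviour described above amounts to a guarded call. A minimal sketch follows, with the hypothetical names `prop_or_none` and `failing_descriptor` standing in for the real function and for a descriptor that cannot be computed.

```python
def prop_or_none(prop, n, *args, **kwargs):
    """Call `prop`; on any exception return a list of `n` None values."""
    try:
        return prop(*args, **kwargs)
    except Exception:
        return [None] * n


def failing_descriptor(mol):
    # Hypothetical descriptor that always fails
    raise ValueError("descriptor could not be computed")


print(prop_or_none(failing_descriptor, 3, "CCO"))      # [None, None, None]
print(prop_or_none(lambda mol: [1.0, 2.0], 2, "CCO"))  # [1.0, 2.0]
```

Returning a list of the expected length keeps the output shape stable, so a failed descriptor does not shift the positions of the ones that follow it.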
get_props_from_mol(mol, properties='autocorr3d')
¶
Function to get a given set of desired properties from a molecule, and output a property list.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol |
Union[dm.Mol, str]
|
The molecule from which to compute the properties. |
required |
properties |
Union[List[str], str]
|
The list of properties to compute for each molecule. It can be the following:
|
'autocorr3d'
|
Returns:
Name | Type | Description |
---|---|---|
props |
np.ndarray
|
np.array(float) The array of properties for the desired molecule |
classes_start_idx |
np.ndarray
|
list(int)
The list of indices specifying the start of each new class of
descriptor or property. For example, if props has 20 elements,
the first 5 are rotatable bonds, the next 8 are morse, and
the rest are whim, then |
classes_names |
np.ndarray
|
list(str) The names of the classes associated with each starting index. Useful for understanding which property the network is learning. |
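The relationship between class sizes and starting indices is a shifted cumulative sum. The class names and sizes below are hypothetical and chosen only to illustrate the bookkeeping.

```python
from itertools import accumulate

# Hypothetical descriptor classes and the number of values each one yields
class_sizes = {"class_a": 3, "class_b": 4, "class_c": 2}

classes_names = list(class_sizes)
sizes = list(class_sizes.values())
# Each class starts where the previous ones end: a cumulative sum shifted by one
classes_start_idx = [0] + list(accumulate(sizes))[:-1]

print(classes_names)      # ['class_a', 'class_b', 'class_c']
print(classes_start_idx)  # [0, 3, 7]
```

Given a flat property array, the values of the second class are then the slice from index 3 up to, but not including, index 7.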
Spectral PE¶
graphium.features.spectral
¶
compute_laplacian_pe(adj, num_pos, cache, disconnected_comp=True, normalization='none')
¶
Compute the eigenvalues and eigenvectors of the graph Laplacian.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
adj |
[num_nodes, num_nodes]
|
Adjacency matrix of the graph |
required |
num_pos |
int
|
Number of Laplacian eigenvectors to compute |
required |
cache |
Dict[str, Any]
|
Dictionary of cached objects |
required |
disconnected_comp |
bool
|
Whether to compute the eigenvectors for each connected component |
True
|
normalization |
str
|
Normalization to apply to the Laplacian |
'none'
|
Returns:
Name | Type | Description |
---|---|---|
np.ndarray
|
Two possible outputs: eigvals [num_nodes, num_pos]: Eigenvalues of the Laplacian repeated for each node. This repetition is necessary in case of disconnected components, where the eigenvalues of the Laplacian are not the same for each node. eigvecs [num_nodes, num_pos]: Eigenvectors of the Laplacian |
|
base_level |
str
|
Indicator of the output pos_level (node, edge, nodepair, graph) -> here node |
cache |
Dict[str, Any]
|
Updated dictionary of cached objects |
normalize_matrix(matrix, degree_vector=None, normalization=None)
¶
Normalize a given matrix using its degree vector
Parameters¶
matrix: torch.tensor(N, N) or scipy.sparse.spmatrix(N, N)
A square matrix representing either an Adjacency matrix or a Laplacian.
degree_vector: torch.tensor(N) or np.ndarray(N) or None
A vector representing the degree of ``matrix``.
``None`` is only accepted if ``normalization==None``
normalization: str or None, Default='none'
Normalization to use on the eig_matrix
- 'none' or ``None``: no normalization
- 'sym': Symmetric normalization ``D^-0.5 L D^-0.5``
- 'inv': Inverse normalization ``D^-1 L``
Returns¶
matrix: torch.tensor(N, N) or scipy.sparse.spmatrix(N, N)
The normalized matrix
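The 'sym' and 'inv' normalizations can be written entry by entry, since D is diagonal. The sketch below uses nested Python lists instead of tensors or sparse matrices and assumes every node has a non-zero degree; the helper names `laplacian` and `normalize` are hypothetical.

```python
import math


def laplacian(adj):
    """Combinatorial Laplacian L = D - A, returned with the degree vector."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    lap = [[(degree[i] if i == j else 0.0) - adj[i][j] for j in range(n)] for i in range(n)]
    return lap, degree


def normalize(matrix, degree, normalization=None):
    """Apply 'sym' (D^-0.5 M D^-0.5) or 'inv' (D^-1 M) scaling entry by entry."""
    n = len(matrix)
    if normalization in (None, "none"):
        return [row[:] for row in matrix]
    if normalization == "sym":
        return [[matrix[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)] for i in range(n)]
    if normalization == "inv":
        return [[matrix[i][j] / degree[i] for j in range(n)] for i in range(n)]
    raise ValueError(f"unknown normalization: {normalization}")


# Path graph 0 - 1 - 2
adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
lap, degree = laplacian(adj)
print(normalize(lap, degree, "inv")[1])  # [-0.5, 1.0, -0.5]
```

The 'sym' variant keeps the matrix symmetric, while 'inv' makes every diagonal entry of the Laplacian equal to 1 and every row sum to zero.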
Random Walk PE¶
graphium.features.rw
¶
compute_rwse(adj, ksteps, num_nodes, cache, pos_type='rw_return_probs' or 'rw_transition_probs', space_dim=0)
¶
Compute Random Walk Spectral Embedding (RWSE) for given list of K steps.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
adj |
[num_nodes, num_nodes]
|
Adjacency matrix |
required |
ksteps |
Union[int, List[int]]
|
List of numbers of steps for the random walks. If int, a list is generated from 1 to ksteps. |
required |
num_nodes |
int
|
Number of nodes in the graph |
required |
cache |
Dict[str, Any]
|
Dictionary of cached objects |
required |
pos_type |
str
|
Desired output |
'rw_return_probs' or 'rw_transition_probs'
|
space_dim |
int
|
Estimated dimensionality of the space. Used to
correct the random-walk diagonal by a factor |
0
|
Returns:
Name | Type | Description |
---|---|---|
np.ndarray
|
Two possible outputs: rw_return_probs [num_nodes, len(ksteps)]: Random-Walk k-step landing probabilities rw_transition_probs [num_nodes, num_nodes, len(ksteps)]: Random-Walk k-step transition probabilities |
|
base_level |
str
|
Indicator of the output pos_level (node, edge, nodepair, graph) -> here either node or nodepair |
cache |
Dict[str, Any]
|
Updated dictionary of cached objects |
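The return probabilities are the diagonal entries of successive powers of the random-walk transition matrix P = D^-1 A. The sketch below is a dense, pure-Python illustration that assumes no isolated nodes; it omits the cache and the `space_dim` correction, and the helper names are hypothetical.

```python
def transition_matrix(adj):
    """Row-normalize the adjacency matrix: P = D^-1 A."""
    return [[a / sum(row) for a in row] for row in adj]


def matmul(a, b):
    """Multiply two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def return_probs(adj, ksteps):
    """For every node, the probability of being back at the start after k steps."""
    if isinstance(ksteps, int):
        # An int is expanded to the list 1..ksteps, as described for `ksteps`
        ksteps = list(range(1, ksteps + 1))
    p = transition_matrix(adj)
    power = p  # holds P^k
    probs = [[] for _ in adj]
    for k in range(1, max(ksteps) + 1):
        if k > 1:
            power = matmul(power, p)
        if k in ksteps:
            for node, row in enumerate(power):
                probs[node].append(row[node])
    return probs


# Triangle graph: every node has two neighbours
adj = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(return_probs(adj, 3))
# [[0.0, 0.5, 0.25], [0.0, 0.5, 0.25], [0.0, 0.5, 0.25]]
```

The result has shape [num_nodes, len(ksteps)], matching the `rw_return_probs` output described above. Keeping the full matrix `power` at each step instead of only its diagonal would give the [num_nodes, num_nodes, len(ksteps)] transition probabilities.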
get_Pks(ksteps, edge_index, edge_weight=None, num_nodes=None, start_Pk=None, start_k=None)
¶
Compute Random Walk landing probabilities for given list of K steps.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ksteps |
List[int]
|
List of numbers of k-steps for which to compute the RW landings |
required |
edge_index |
Tuple[torch.Tensor, torch.Tensor]
|
PyG sparse representation of the graph |
required |
edge_weight |
Optional[torch.Tensor]
|
Edge weights |
None
|
num_nodes |
Optional[int]
|
Number of nodes in the graph |
None
|
Returns:
Type | Description |
---|---|
Dict[int, np.ndarray]
|
2D Tensor with shape (num_nodes, len(ksteps)) with RW landing probs |