graphium.data

module for loading datasets and collating batches

Data Module


graphium.data.datamodule

BaseDataModule

Bases: lightning.LightningDataModule

get_num_workers property

get the number of workers to use

predict_ds property writable

Get the dataset used for the prediction

__init__(batch_size_training=16, batch_size_inference=16, batch_size_per_pack=None, num_workers=0, pin_memory=True, persistent_workers=False, collate_fn=None)

Base dataset module for all datasets (to be inherited)

Parameters:

Name Type Description Default
batch_size_training int

batch size for training

16
batch_size_inference int

batch size for inference

16
num_workers int

number of workers for data loading

0
pin_memory bool

whether to pin memory

True
persistent_workers bool

whether to use persistent workers

False
collate_fn Optional[Callable]

collate function for batching

None
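
As a rough illustration (not the library's canonical usage), these keyword arguments are typically forwarded by a concrete subclass such as MultitaskFromSmilesDataModule below; the subclass name here is hypothetical:

from graphium.data.datamodule import BaseDataModule

class MyDataModule(BaseDataModule):  # hypothetical subclass, for illustration only
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

dm = MyDataModule(
    batch_size_training=32,    # batch size of the training dataloader
    batch_size_inference=64,   # batch size of the val/test/predict dataloaders
    num_workers=4,             # number of dataloader worker processes
    pin_memory=True,
    persistent_workers=True,
)
print(dm.get_num_workers)      # property documented above
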
get_dataloader(dataset, shuffle, stage)

Get the dataloader for a given dataset

Parameters:

Name Type Description Default
dataset Dataset

The dataset from which to load the data

required
shuffle bool

set to True to have the data reshuffled at every epoch.

required
stage RunningStage

Whether in Training, Validating, Testing, Sanity-checking, Predicting, or Tuning phase.

required

Returns:

Type Description
DataLoader

The dataloader to sample from

get_dataloader_kwargs(stage, shuffle, **kwargs)

Get the options for the dataloader depending on the current stage.

Parameters:

Name Type Description Default
stage RunningStage

Whether in Training, Validating, Testing, Sanity-checking, Predicting, or Tuning phase.

required
shuffle bool

set to True to have the data reshuffled at every epoch.

required

Returns:

Type Description
Dict[str, Any]

Arguments to pass to the DataLoader during initialization
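
A small sketch of how these two methods fit together, continuing with the dm instance from the earlier example; the RunningStage import path shown is the one used by recent Lightning versions and may differ, and some_dataset is a placeholder:

from lightning.pytorch.trainer.states import RunningStage

# Keyword arguments the datamodule would hand to the DataLoader for the training stage
kwargs = dm.get_dataloader_kwargs(stage=RunningStage.TRAINING, shuffle=True)
print(kwargs)

# Build a dataloader for an arbitrary dataset using those options
loader = dm.get_dataloader(dataset=some_dataset, shuffle=True, stage=RunningStage.TRAINING)
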

get_max_num_edges_datamodule(stages=None)

Get the maximum number of edges across all datasets from the datamodule

Parameters:

Name Type Description Default
datamodule

The datamodule from which to extract the maximum number of edges

required
stages Optional[List[str]]

The stages from which to extract the max num edges. Possible values are ["train", "val", "test", "predict"]. If None, all stages are considered.

None

Returns:

Name Type Description
max_num_edges int

The maximum number of edges across all datasets from the datamodule

get_max_num_nodes_datamodule(stages=None)

Get the maximum number of nodes across all datasets from the datamodule

Parameters:

Name Type Description Default
datamodule

The datamodule from which to extract the maximum number of nodes

required
stages Optional[List[str]]

The stages from which to extract the max num nodes. Possible values are ["train", "val", "test", "predict"]. If None, all stages are considered.

None

Returns:

Name Type Description
max_num_nodes int

The maximum number of nodes across all datasets from the datamodule
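
For example (assuming a datamodule whose setup() has already been called, such as the MultitaskFromSmilesDataModule described later; datamodule is a placeholder name):

# Largest graph sizes across the training and validation splits,
# e.g. to fix tensor shapes when padding batches for accelerators
max_num_nodes = datamodule.get_max_num_nodes_datamodule(stages=["train", "val"])
max_num_edges = datamodule.get_max_num_edges_datamodule(stages=["train", "val"])
print(max_num_nodes, max_num_edges)
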

predict_dataloader(**kwargs)

return the dataloader for prediction

test_dataloader(**kwargs)

return the test dataloader

train_dataloader(**kwargs)

return the training dataloader

val_dataloader(**kwargs)

return the validation dataloader

DatasetProcessingParams

__init__(task_level=None, df=None, df_path=None, smiles_col=None, label_cols=None, weights_col=None, weights_type=None, idx_col=None, sample_size=None, split_val=0.2, split_test=0.2, seed=None, splits_path=None, split_names=['train', 'val', 'test'], label_normalization=None)

object to store the parameters for the dataset processing

Parameters:

Name Type Description Default
task_level str

The task level, whether it is graph, node, edge or nodepair

None
df pd.DataFrame

The dataframe containing the data

None
df_path Optional[Union[str, os.PathLike]]

The path to the dataframe containing the data

None
smiles_col str

The column name of the smiles

None
label_cols List[str]

The column names of the labels

None
weights_col str

The column name of the weights

None
weights_type str

The type of weights

None
idx_col str

The column name of the indices

None
sample_size Union[int, float, Type[None]]

The size of the sample

None
split_val float

The fraction of the data to use for validation

0.2
split_test float

The fraction of the data to use for testing

0.2
seed int

The seed to use for the splits and subsampling

None
splits_path Optional[Union[str, os.PathLike]]

The path to the splits

None
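
A hedged sketch of how one such per-task parameter object might be filled in; the CSV path and column names below are placeholders:

from graphium.data.datamodule import DatasetProcessingParams

task_params = DatasetProcessingParams(
    task_level="graph",                   # graph-level property prediction
    df_path="data/my_task.csv",           # placeholder path to a CSV with SMILES and labels
    smiles_col="smiles",                  # placeholder column names
    label_cols=["target_1", "target_2"],
    split_val=0.2,
    split_test=0.2,
    seed=42,
)
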

FakeDataModule

Bases: MultitaskFromSmilesDataModule

A fake datamodule that generates artificial data by mimicking the true data coming from the provided dataset. It is useful to test the speed and performance of the model on a dataset without having to featurize it and wait for the workers to load it.

generate_data(label_cols, smiles_col)

Returns:

Type Description

pd.DataFrame

get_fake_graph()

Low-memory-footprint method to get the first datapoint DGL graph. The first 10 rows of the data are read in case the first one has a featurization error. If all of these first rows fail, None is returned; otherwise, the first graph that does not fail is returned.

prepare_data()

Called only from a single process in distributed settings. Steps:

  • If the cache is set and exists, reload from cache and return. Otherwise,
  • For each single-task dataset:
    • Load its dataframe from a path (if provided)
    • Subsample the dataframe
    • Extract the smiles, labels from the dataframe
  • In the previous step, we were also able to get the unique smiles, which we use to compute the features
  • For each single-task dataframe and associated data (smiles, labels, etc.):
    • Filter out the data corresponding to molecules which failed featurization.
    • Create a corresponding SingletaskDataset
    • Split the SingletaskDataset according to the task-specific splits for train, val and test
setup(stage=None)

Prepare the torch dataset. Called on every GPU. Setting state here is OK.

Parameters:

Name Type Description Default
stage str

Either 'fit', 'test', or None.

None

GraphOGBDataModule

Bases: MultitaskFromSmilesDataModule

__init__(task_specific_args, cache_data_path=None, featurization=None, batch_size_training=16, batch_size_inference=16, batch_size_per_pack=None, num_workers=0, pin_memory=True, persistent_workers=False, featurization_n_jobs=-1, featurization_progress=False, featurization_backend='loky', collate_fn=None, prepare_dict_or_graph='pyg:graph', **kwargs)

Load an OGB (Open-graph-benchmark) GraphProp dataset.

Parameters:

Name Type Description Default
task_specific_args Dict[str, Dict[str, Any]]

Arguments related to each task, with the task-name being the key, and the specific arguments being the values. The arguments must be a Dict containing the following keys:

  • "dataset_name": Name of the OGB dataset to load. Examples of possible datasets are "ogbg-molhiv", "ogbg-molpcba", "ogbg-moltox21", "ogbg-molfreesolv".
  • "sample_size": The number of molecules to sample from the dataset. Default=None, meaning that all molecules will be considered.
required
cache_data_path Optional[Union[str, os.PathLike]]

path where to save or reload the cached data. The path can be remote (S3, GS, etc).

None
featurization Optional[Union[Dict[str, Any], omegaconf.DictConfig]]

args to apply to the SMILES to Graph featurizer.

None
batch_size_training int

batch size for training and val dataset.

16
batch_size_inference int

batch size for test dataset.

16
num_workers int

Number of workers for the dataloader. Use -1 to use all available cores.

0
pin_memory bool

Whether to use pinned (page-locked) CPU memory for the dataloader.

True
featurization_n_jobs int

Number of cores to use for the featurization.

-1
featurization_progress bool

whether to show a progress bar during featurization.

False
featurization_backend str

The backend to use for the molecular featurization.

  • "multiprocessing": Found to cause less memory issues.
  • "loky": joblib's Default. Found to cause memory leaks.
  • "threading": Found to be slow.
'loky'
collate_fn Optional[Callable]

A custom torch collate function. Default is graphium.data.graphium_collate_fn

None
sample_size
  • int: The maximum number of elements to take from the dataset.
  • float: Value between 0 and 1 representing the fraction of the dataset to consider
  • None: all elements are considered.
required
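
A minimal sketch of instantiating this datamodule on one OGB dataset; the task name "tox21" is arbitrary, and additional per-task keys may be needed in practice:

from graphium.data.datamodule import GraphOGBDataModule

ogb_dm = GraphOGBDataModule(
    task_specific_args={
        "tox21": {
            "dataset_name": "ogbg-moltox21",  # one of the OGB GraphProp datasets listed above
            "sample_size": 1000,              # subsample for a quick run; None keeps every molecule
        }
    },
    batch_size_training=32,
    batch_size_inference=32,
    featurization_n_jobs=-1,   # use every available core for featurization
)
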
to_dict()

Generate a dictionary representation of the class

Returns:

Name Type Description
dict Dict[str, Any]

dictionary representation of the class

IPUDataModuleModifier

__init__(ipu_inference_opts=None, ipu_training_opts=None, ipu_dataloader_training_opts=None, ipu_dataloader_inference_opts=None, *args, **kwargs)

Wrapper functions for a DataModule to support IPUs and IPU options. To be used with dual inheritance, for example:

class IPUDataModule(BaseDataModule, IPUDataModuleModifier):
    def __init__(self, **kwargs):
        BaseDataModule.__init__(self, **kwargs)
        IPUDataModuleModifier.__init__(self, **kwargs)

Parameters:

Name Type Description Default
ipu_inference_opts Optional[poptorch.Options]

Options for the IPU in inference mode. Ignore if not using IPUs

None
ipu_training_opts Optional[poptorch.Options]

Options for the IPU in training mode. Ignore if not using IPUs

None
ipu_dataloader_training_opts

Options for the IPU dataloader in training mode. Ignore if not using IPUs

None
ipu_dataloader_inference_opts

Options for the IPU dataloader in inference mode. Ignore if not using IPUs

None
args

Arguments for the DataModule

()
kwargs

Keyword arguments for the DataModule

{}

MultitaskFromSmilesDataModule

Bases: BaseDataModule, IPUDataModuleModifier

in_dims property

Return all input dimensions for the set of graphs, including node/edge features and raw positional encoding dimensions such as eigval, eigvec, rwse, and more

num_edge_feats property

Return the number of edge features in the first graph

num_node_feats property

Return the number of node features in the first graph

__init__(task_specific_args, cache_data_path=None, processed_graph_data_path=None, featurization=None, batch_size_training=16, batch_size_inference=16, batch_size_per_pack=None, num_workers=0, pin_memory=True, persistent_workers=False, featurization_n_jobs=-1, featurization_progress=False, featurization_backend='loky', featurization_batch_size=1000, collate_fn=None, prepare_dict_or_graph='pyg:graph', **kwargs)

For parameters beginning with task_*, we have a dictionary where the key is the task name and the value is specified below.

Parameters:

Name Type Description Default
task_df

(value) a dataframe

required
task_df_path

(value) a path to a dataframe to load (CSV file). df takes precedence over df_path.

required
task_smiles_col

(value) Name of the SMILES column. If set to None, it will look for a column with the word "smile" (case insensitive) in it. If no such column is found, an error will be raised.

required
task_label_cols

(value) Name of the columns to use as labels, with different options.

  • list: A list of all column names to use
  • None: All the columns are used except the SMILES one.
  • str: The name of the single column to use
  • *str: A string starting with a * means all columns whose name ends with the specified str
  • str*: A string ending with a * means all columns whose name starts with the specified str
required
task_weights_col

(value) Name of the column to use as sample weights. If None, no weights are used. This parameter cannot be used together with weights_type.

required
task_weights_type

(value) The type of weights to use. This parameter cannot be used together with weights_col. It only supports multi-label binary classification.

Supported types:

  • None: No weights are used.
  • "sample_balanced": A weight is assigned to each sample inversely proportional to the number of positive value. If there are multiple labels, the product of the weights is used.
  • "sample_label_balanced": Similar to the "sample_balanced" weights, but the weights are applied to each element individually, without computing the product of the weights for a given sample.
required
task_idx_col

(value) Name of the columns to use as indices. Unused if set to None.

required
task_sample_size

(value)

  • int: The maximum number of elements to take from the dataset.
  • float: Value between 0 and 1 representing the fraction of the dataset to consider
  • None: all elements are considered.
required
task_split_val

(value) Ratio for the validation split.

required
task_split_test

(value) Ratio for the test split.

required
task_seed

(value) Seed to use for the random split and subsampling. More complex splitting strategies would need to be implemented separately.

required
task_splits_path

(value) A path to a CSV file containing indices for the splits. The file must contain 3 columns: "train", "val" and "test". It takes precedence over split_val and split_test.

required
cache_data_path Optional[Union[str, os.PathLike]]

path where to save or reload the cached data. The path can be remote (S3, GS, etc).

None
featurization Optional[Union[Dict[str, Any], omegaconf.DictConfig]]

args to apply to the SMILES to Graph featurizer.

None
batch_size_training int

batch size for training and val dataset.

16
batch_size_inference int

batch size for test dataset.

16
num_workers int

Number of workers for the dataloader. Use -1 to use all available cores.

0
pin_memory bool

Whether to use pinned (page-locked) CPU memory for the dataloader.

True
featurization_n_jobs int

Number of cores to use for the featurization.

-1
featurization_progress bool

whether to show a progress bar during featurization.

False
featurization_backend str

The backend to use for the molecular featurization.

  • "multiprocessing": Found to cause less memory issues.
  • "loky": joblib's Default. Found to cause memory leaks.
  • "threading": Found to be slow.
'loky'
featurization_batch_size int

Batch size to use for the featurization.

1000
collate_fn Optional[Callable]

A custom torch collate function. Default is graphium.data.graphium_collate_fn

None
prepare_dict_or_graph str

Whether to preprocess all molecules as Graph dict or PyG graphs. Possible options:

  • "pyg:dict": Process molecules as a dict. It's faster and requires less RAM during pre-processing. It is slower during training with with num_workers=0 since pyg Data will be created during data-loading, but faster with large num_workers, and less likely to cause memory issues with the parallelization.
  • "pyg:graph": Process molecules as pyg.data.Data.
'pyg:graph'
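
A hedged sketch of building a two-task datamodule. The per-task dictionaries follow the task_* fields above with the task_ prefix dropped (this mapping is an assumption here), and all paths and column names are placeholders:

from graphium.data.datamodule import MultitaskFromSmilesDataModule

task_specific_args = {
    "logp": dict(
        task_level="graph",
        df_path="data/logp.csv",   # placeholder CSV with a SMILES column and one label
        smiles_col="smiles",
        label_cols=["logp"],
        split_val=0.2,
        split_test=0.2,
        seed=42,
    ),
    "tox": dict(
        task_level="graph",
        df_path="data/tox.csv",    # placeholder CSV for a second task
        smiles_col="smiles",
        label_cols="tox_*",        # wildcard: every column whose name starts with "tox_"
        split_val=0.2,
        split_test=0.2,
        seed=42,
    ),
}

datamodule = MultitaskFromSmilesDataModule(
    task_specific_args=task_specific_args,
    batch_size_training=64,
    batch_size_inference=128,
    num_workers=4,
    featurization_n_jobs=-1,
    prepare_dict_or_graph="pyg:graph",
)
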
__len__()

Returns the number of elements of the current DataModule, which is the combined size of all single-task datasets given.

Returns:

Name Type Description
num_elements int

Number of elements in the current DataModule

__repr__()

Controls how the class is printed

calculate_statistics(dataset, train=False)

Calculate the statistics of the labels for each task, and overwrite the self.task_norms attribute.

Parameters:

Name Type Description Default
dataset Datasets.MultitaskDataset

the dataset to calculate the statistics from

required
train bool

whether the dataset is the training set

False
get_data_cache_fullname(compress=False)

Create a hash for the dataset, and use it to generate a file name

Parameters:

Name Type Description Default
compress bool

Whether to compress the data

False

Returns:

Type Description
str

full path to the data cache file

get_data_hash()

Get a hash specific to a dataset and smiles_transformer. Useful to cache the pre-processed data.

get_dataloader(dataset, shuffle, stage)

Get the poptorch dataloader for a given dataset

Parameters:

Name Type Description Default
dataset Dataset

The dataset from which to load the data

required
shuffle bool

set to True to have the data reshuffled at every epoch.

required
stage RunningStage

Whether in Training, Validating, Testing, Sanity-checking, Predicting, or Tuning phase.

required

Returns:

Type Description
Union[DataLoader, poptorch.DataLoader]

The poptorch dataloader to sample from

get_dataloader_kwargs(stage, shuffle, **kwargs)

Get the options for the dataloader depending on the current stage.

Parameters:

Name Type Description Default
stage RunningStage

Whether in Training, Validating, Testing, Sanity-checking, Predicting, or Tuning phase.

required
shuffle bool

set to True to have the data reshuffled at every epoch.

required

Returns:

Type Description
Dict[str, Any]

Arguments to pass to the DataLoader during initialization

get_fake_graph()

Low memory footprint method to get the featurization of a fake graph without reading the dataset. Useful for getting the number of node/edge features.

Returns:

Name Type Description
graph

A fake graph with the right featurization

get_label_statistics(data_path, data_hash, dataset, train=False)

Get the label statistics from the dataset, and save them to file, if needed. self.task_norms will be modified in-place with the label statistics.

Parameters:

Name Type Description Default
data_path Union[str, os.PathLike]

the path to save and load the label statistics to. If None, no saving and loading will be done.

required
data_hash str

the hash of the dataset generated by get_data_hash()

required
dataset Datasets.MultitaskDataset

the dataset to calculate the statistics from

required
train bool

whether the dataset is the training set

False
get_subsets_of_datasets(single_task_datasets, task_train_indices, task_val_indices, task_test_indices)

From a dictionary of datasets and their associated indices, subset the train/val/test sets

Parameters:

Name Type Description Default
single_task_datasets Dict[str, Datasets.SingleTaskDataset]

Dictionary of datasets

required
task_train_indices Dict[str, Iterable]

Dictionary of train indices

required
task_val_indices Dict[str, Iterable]

Dictionary of val indices

required
task_test_indices Dict[str, Iterable]

Dictionary of test indices

required

Returns:

Name Type Description
train_singletask_datasets Subset

Dictionary of train subsets

val_singletask_datasets Subset

Dictionary of val subsets

test_singletask_datasets Subset

Dictionary of test subsets

load_data_from_cache(verbose=True, compress=False)

Load the datasets from cache. First create a hash for the dataset, and verify if that hash is available at the path given by self.cache_data_path.

Parameters:

Name Type Description Default
verbose bool

Whether to print the progress

True
compress bool

Whether to compress the data

False

Returns:

Name Type Description
cache_data_exists bool

Whether the cache exists (if the hash matches) and the loading succeeded

normalize_label(dataset, stage)

Normalize the labels in the dataset using the statistics in self.task_norms.

Parameters:

Name Type Description Default
dataset Datasets.MultitaskDataset

the dataset to normalize the labels from

required

Returns:

Type Description
Datasets.MultitaskDataset

the dataset with normalized labels

prepare_data()

Called only from a single process in distributed settings. Steps:

  • If the cache is set and exists, reload from cache and return. Otherwise,
  • For each single-task dataset:
    • Load its dataframe from a path (if provided)
    • Subsample the dataframe
    • Extract the smiles, labels from the dataframe
  • In the previous step, we were also able to get the unique smiles, which we use to compute the features
  • For each single-task dataframe and associated data (smiles, labels, etc.):
    • Filter out the data corresponding to molecules which failed featurization.
    • Create a corresponding SingletaskDataset
    • Split the SingletaskDataset according to the task-specific splits for train, val and test
save_data_to_cache(verbose=True, compress=False)

Save the datasets to cache. First, create a hash for the dataset and use it to generate a file name; then save to the path given by self.cache_data_path.

Parameters:

Name Type Description Default
verbose bool

Whether to print the progress

True
compress bool

Whether to compress the data

False
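
For instance (assuming the datamodule was constructed with cache_data_path set; continuing from the datamodule sketched above):

# After prepare_data() has featurized everything, persist it for future runs
datamodule.save_data_to_cache(verbose=True, compress=False)

# On a later run, try to reload; returns False if no cache matches the data hash
cache_hit = datamodule.load_data_from_cache(verbose=True, compress=False)
print(datamodule.get_data_cache_fullname(compress=False), cache_hit)
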
setup(stage=None, save_smiles_and_ids=False)

Prepare the torch dataset. Called on every GPU. Setting state here is OK.

Parameters:

Name Type Description Default
stage str

Either 'fit', 'test', or None.

None
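
Putting it together, a typical single-process workflow for this datamodule might look like the following (continuing from the datamodule sketched earlier):

datamodule.prepare_data()                     # featurize the SMILES and build/split the single-task datasets
datamodule.setup(stage="fit")                 # assemble the torch datasets for training and validation

train_loader = datamodule.train_dataloader()  # standard Lightning-style dataloader accessors
val_loader = datamodule.val_dataloader()

print(len(datamodule))                        # combined size of all single-task datasets
print(datamodule.num_node_feats, datamodule.num_edge_feats, datamodule.in_dims)
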
to_dict()

Returns a dictionary representation of the current DataModule

Returns:

Name Type Description
obj_repr Dict[str, Any]

Dictionary representation of the current DataModule

Collate Module


graphium.data.collate

collage_pyg_graph(pyg_graphs, batch_size_per_pack=None)

Function to collate PyTorch Geometric graphs. Converts all numpy types to torch and converts edge indices to int64.

Parameters:

Name Type Description Default
pyg_graphs Iterable[Union[Data, Dict]]

Iterable of PyG graphs

required
batch_size_per_pack Optional[int]

The number of graphs to pack together. This is useful for using packing with the Transformer.

None

collate_labels(labels, labels_size_dict=None, labels_dtype_dict=None)

Collate labels for multitask learning.

Parameters:

Name Type Description Default
labels List[Data]

List of labels

required
labels_size_dict Optional[Dict[str, Any]]

Dict of the form Dict[tasks, sizes] which has task names as keys and the size of the label tensor as value. The size of the tensor corresponds to how many labels/values there are to predict for that task.

None
labels_dtype_dict Optional[Dict[str, Any]]

(Note): This is an attribute of the MultitaskDataset. A dictionary of the form Dict[tasks, dtypes] which has task names as keys and the dtype of the label tensor as value. This is necessary to ensure the missing labels are added with NaNs of the right dtype

None

Returns:

Type Description

A dictionary of the form Dict[tasks, labels] where tasks is the name of the task and labels

is a tensor of shape (batch_size, *labels_size_dict[task]).

collate_pyg_graph_labels(pyg_labels)

Function to collate PyTorch Geometric labels. Converts all numpy types to torch.

Parameters:

Name Type Description Default
pyg_labels List[Data]

Iterable of PyG label Data objects

required

get_expected_label_size(label_data, task, label_size)

Determines the expected label size based on the specific graph properties and the number of targets in the task-dataset.

graphium_collate_fn(elements, labels_size_dict=None, labels_dtype_dict=None, mask_nan='raise', do_not_collate_keys=[], batch_size_per_pack=None)

This collate function is identical to the default PyTorch collate function, but adds support for pyg.data.Data to batch graphs.

Besides the pyg graph collate, other objects are processed the same way as by the original torch collate function. See https://pytorch.org/docs/stable/data.html#dataloader-collate-fn for more details.

Note

If graphium needs to manipulate other tricky-to-batch objects, support for them should be added to this single collate function.

Parameters:

Name Type Description Default
elements Union[List[Any], Dict[str, List[Any]]]

The elements to batch. See torch.utils.data.dataloader.default_collate.

required
labels_size_dict Optional[Dict[str, Any]]

(Note): This is an attribute of the MultitaskDataset. A dictionary of the form Dict[tasks, sizes] which has task names as keys and the size of the label tensor as value. The size of the tensor corresponds to how many labels/values there are to predict for that task.

None
labels_dtype_dict Optional[Dict[str, Any]]

(Note): This is an attribute of the MultitaskDataset. A dictionary of the form Dict[tasks, dtypes] which has task names as keys and the dtype of the label tensor as value. This is necessary to ensure the missing labels are added with NaNs of the right dtype

None
mask_nan Union[str, float, Type[None]]

Deal with the NaN/Inf when calling the function make_pyg_graph. Some values become Inf when changing data type; this parameter controls how such cases are handled.

  • "raise": DEFAULT. Raise an error when there is a nan or inf in the featurization
  • "warn": Raise a warning when there is a nan or inf in the featurization
  • "None": Don't do anything
  • "Floating value": Replace nans or inf by the specified value
'raise'
do_not_collate_keys

Keys to ignore for the collate

[]
batch_size_per_pack Optional[int]

The number of graphs to pack together. This is useful for using packing with the Transformer. If None, no packing is done. Otherwise, indices are generated to map the nodes to the pack they belong to under the key "pack_from_node_idx", with an additional mask to indicate which nodes are from the same graph under the key "pack_attn_mask".

None

Returns:

Type Description
Union[Any, Dict[str, Any]]

The batched elements. See torch.utils.data.dataloader.default_collate.
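
A small sketch of plugging this collate function into a standard DataLoader; my_dataset and the contents of labels_size_dict are placeholders, and functools.partial is one way to bind the extra keyword arguments:

from functools import partial
from torch.utils.data import DataLoader
from graphium.data.collate import graphium_collate_fn

collate = partial(
    graphium_collate_fn,
    labels_size_dict={"logp": [1]},   # placeholder: one regression target for a "logp" task
    mask_nan="warn",                  # warn instead of raising on NaN/Inf in the featurization
)
loader = DataLoader(my_dataset, batch_size=32, collate_fn=collate)
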

pad_nodepairs(pe, num_nodes, max_num_nodes_per_graph)

This function zero-pads nodepair-level positional encodings to conform with the batching logic.

Parameters:

Name Type Description Default
pe torch.Tensor, [num_nodes, num_nodes, num_feat]

Nodepair pe

required
num_nodes int

Number of nodes of processed graph

required
max_num_nodes_per_graph int

Maximum number of nodes among graphs in current batch

required

Returns:

Name Type Description
padded_pe torch.Tensor, [num_nodes, max_num_nodes_per_graph, num_feat]

padded nodepair pe tensor
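
For example, with a randomly generated encoding purely to illustrate the shapes:

import torch
from graphium.data.collate import pad_nodepairs

pe = torch.rand(5, 5, 8)   # nodepair encoding of a 5-node graph with 8 features
padded = pad_nodepairs(pe, num_nodes=5, max_num_nodes_per_graph=12)
print(padded.shape)        # expected torch.Size([5, 12, 8]) per the documented output shape
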

pad_to_expected_label_size(labels, label_size)

Determine the difference between the shape of labels and the expected shape label_size, and pad with torch.nan accordingly.

Util Functions


graphium.data.utils

download_graphium_dataset(name, output_path, extract_zip=True, progress=False)

Download a Graphium dataset to a specified location.

Parameters:

Name Type Description Default
name str

Name of the Graphium dataset from graphium.data.utils.get_graphium_datasets().

required
output_path str

Directory path where the dataset will be downloaded.

required
extract_zip bool

Whether to extract the dataset if it's a zip file.

True
progress bool

Whether to show a progress bar during download.

False

Returns:

Name Type Description
str str

Path to the downloaded dataset.

graphium_package_path(graphium_path)

Return the path of a graphium file in the package.

list_graphium_datasets()

List Graphium datasets available to download.

Returns:

Name Type Description
set set

A set of Graphium dataset names.

load_micro_zinc()

Return a dataframe of micro ZINC (1000 data points).

Returns:

Type Description
pd.DataFrame

pd.DataFrame: A dataframe of micro ZINC.

load_tiny_zinc()

Return a dataframe of tiny ZINC (100 data points).

Returns:

Type Description
pd.DataFrame

pd.DataFrame: A dataframe of tiny ZINC.
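
A short usage sketch of these helpers; the dataset name and output directory below are placeholders:

import graphium.data.utils as gdu

print(gdu.list_graphium_datasets())    # names accepted by download_graphium_dataset
path = gdu.download_graphium_dataset(
    name="some-graphium-dataset",      # placeholder: pick a name from the list above
    output_path="datasets/",           # placeholder download directory
    progress=True,
)

df = gdu.load_micro_zinc()             # 1000-row dataframe of micro ZINC
print(df.shape)
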