graphium.data

module for loading datasets and collating batches

Data Module


graphium.data.datamodule

BaseDataModule

Bases: lightning.LightningDataModule

get_num_workers property

get the number of workers to use

predict_ds property writable

Get the dataset used for the prediction

__init__(batch_size_training=16, batch_size_inference=16, batch_size_per_pack=None, num_workers=0, pin_memory=True, persistent_workers=False, collate_fn=None)

Base dataset module for all datasets (to be inherited)

Parameters:

Name Type Description Default
batch_size_training int

batch size for training

16
batch_size_inference int

batch size for inference

16
num_workers int

number of workers for data loading

0
pin_memory bool

whether to pin memory

True
persistent_workers bool

whether to use persistent workers

False
collate_fn Optional[Callable]

collate function for batching

None
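
As a rough illustration (not the library's canonical usage), these keyword arguments are typically forwarded by a concrete subclass such as MultitaskFromSmilesDataModule below; the subclass name here is hypothetical:

from graphium.data.datamodule import BaseDataModule

class MyDataModule(BaseDataModule):  # hypothetical subclass, for illustration only
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

dm = MyDataModule(
    batch_size_training=32,    # batch size of the training dataloader
    batch_size_inference=64,   # batch size of the val/test/predict dataloaders
    num_workers=4,             # number of dataloader worker processes
    pin_memory=True,
    persistent_workers=True,
)
print(dm.get_num_workers)      # property documented above
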
get_dataloader(dataset, shuffle, stage)

Get the dataloader for a given dataset

Parameters:

Name Type Description Default
dataset Dataset

The dataset from which to load the data

required
shuffle bool

set to True to have the data reshuffled at every epoch.

required
stage RunningStage

Whether in Training, Validating, Testing, Sanity-checking, Predicting, or Tuning phase.

required

Returns:

Type Description
DataLoader

The dataloader to sample from

get_dataloader_kwargs(stage, shuffle, **kwargs)

Get the options for the dataloader depending on the current stage.

Parameters:

Name Type Description Default
stage RunningStage

Whether in Training, Validating, Testing, Sanity-checking, Predicting, or Tuning phase.

required
shuffle bool

set to True to have the data reshuffled at every epoch.

required

Returns:

Type Description
Dict[str, Any]

Arguments to pass to the DataLoader during initialization
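
A small sketch of how these two methods fit together, continuing with the dm instance from the earlier example; the RunningStage import path shown is the one used by recent Lightning versions and may differ, and some_dataset is a placeholder:

from lightning.pytorch.trainer.states import RunningStage

# Keyword arguments the datamodule would hand to the DataLoader for the training stage
kwargs = dm.get_dataloader_kwargs(stage=RunningStage.TRAINING, shuffle=True)
print(kwargs)

# Build a dataloader for an arbitrary dataset using those options
loader = dm.get_dataloader(dataset=some_dataset, shuffle=True, stage=RunningStage.TRAINING)
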

get_max_num_edges_datamodule(stages=None)

Get the maximum number of edges across all datasets from the datamodule

Parameters:

Name Type Description Default
datamodule

The datamodule from which to extract the maximum number of edges

required
stages Optional[List[str]]

The stages from which to extract the max num edges. Possible values are ["train", "val", "test", "predict"]. If None, all stages are considered.

None

Returns:

Name Type Description
max_num_edges int

The maximum number of edges across all datasets from the datamodule

get_max_num_nodes_datamodule(stages=None)

Get the maximum number of nodes across all datasets from the datamodule

Parameters:

Name Type Description Default
datamodule

The datamodule from which to extract the maximum number of nodes

required
stages Optional[List[str]]

The stages from which to extract the max num nodes. Possible values are ["train", "val", "test", "predict"]. If None, all stages are considered.

None

Returns:

Name Type Description
max_num_nodes int

The maximum number of nodes across all datasets from the datamodule
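
For example (assuming a datamodule whose setup() has already been called, such as the MultitaskFromSmilesDataModule described later; datamodule is a placeholder name):

# Largest graph sizes across the training and validation splits,
# e.g. to fix tensor shapes when padding batches for accelerators
max_num_nodes = datamodule.get_max_num_nodes_datamodule(stages=["train", "val"])
max_num_edges = datamodule.get_max_num_edges_datamodule(stages=["train", "val"])
print(max_num_nodes, max_num_edges)
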

predict_dataloader(**kwargs)

return the dataloader for prediction

test_dataloader(**kwargs)

return the test dataloader

train_dataloader(**kwargs)

return the training dataloader

val_dataloader(**kwargs)

return the validation dataloader

DatasetProcessingParams

__init__(task_level=None, df=None, df_path=None, smiles_col=None, label_cols=None, weights_col=None, weights_type=None, idx_col=None, sample_size=None, split_val=0.2, split_test=0.2, seed=None, splits_path=None, split_names=['train', 'val', 'test'], label_normalization=None)

object to store the parameters for the dataset processing

Parameters:

Name Type Description Default
task_level str

The task level, whether it is graph, node, edge or nodepair

None
df pd.DataFrame

The dataframe containing the data

None
df_path Optional[Union[str, os.PathLike]]

The path to the dataframe containing the data

None
smiles_col str

The column name of the smiles

None
label_cols List[str]

The column names of the labels

None
weights_col str

The column name of the weights

None
weights_type str

The type of weights

None
idx_col str

The column name of the indices

None
sample_size Union[int, float, Type[None]]

The size of the sample

None
split_val float

The fraction of the data to use for validation

0.2
split_test float

The fraction of the data to use for testing

0.2
seed int

The seed to use for the splits and subsampling

None
splits_path Optional[Union[str, os.PathLike]]

The path to the splits

None
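
A hedged sketch of how one such per-task parameter object might be filled in; the CSV path and column names below are placeholders:

from graphium.data.datamodule import DatasetProcessingParams

task_params = DatasetProcessingParams(
    task_level="graph",                   # graph-level property prediction
    df_path="data/my_task.csv",           # placeholder path to a CSV with SMILES and labels
    smiles_col="smiles",                  # placeholder column names
    label_cols=["target_1", "target_2"],
    split_val=0.2,
    split_test=0.2,
    seed=42,
)
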

FakeDataModule

Bases: MultitaskFromSmilesDataModule

A fake datamodule that generates artificial data by mimicking the true data coming from the provided dataset. It is useful to test the speed and performance of the model on a dataset without having to featurize it and wait for the workers to load it.

generate_data(label_cols, smiles_col)

Returns:

Type Description

pd.DataFrame

get_fake_graph()

Low-memory-footprint method to get the first datapoint DGL graph. The first 10 rows of the data are read in case the first one has a featurization error. If all of these first rows fail, None is returned; otherwise, the first graph that does not fail is returned.

prepare_data()

Called only from a single process in distributed settings. Steps:

  • If the cache is set and exists, reload from cache and return. Otherwise,
  • For each single-task dataset:
    • Load its dataframe from a path (if provided)
    • Subsample the dataframe
    • Extract the smiles, labels from the dataframe
  • In the previous step, we were also able to get the unique smiles, which we use to compute the features
  • For each single-task dataframe and associated data (smiles, labels, etc.):
    • Filter out the data corresponding to molecules which failed featurization.
    • Create a corresponding SingletaskDataset
    • Split the SingletaskDataset according to the task-specific splits for train, val and test
setup(stage=None)

Prepare the torch dataset. Called on every GPU. Setting state here is OK.

Parameters:

Name Type Description Default
stage str

Either 'fit', 'test', or None.

None

GraphOGBDataModule

Bases: MultitaskFromSmilesDataModule

__init__(task_specific_args, cache_data_path=None, featurization=None, batch_size_training=16, batch_size_inference=16, batch_size_per_pack=None, num_workers=0, pin_memory=True, persistent_workers=False, featurization_n_jobs=-1, featurization_progress=False, featurization_backend='loky', collate_fn=None, prepare_dict_or_graph='pyg:graph', **kwargs)

Load an OGB (Open-graph-benchmark) GraphProp dataset.

Parameters:

Name Type Description Default
task_specific_args Dict[str, Dict[str, Any]]

Arguments related to each task, with the task-name being the key, and the specific arguments being the values. The arguments must be a Dict containing the following keys:

  • "dataset_name": Name of the OGB dataset to load. Examples of possible datasets are "ogbg-molhiv", "ogbg-molpcba", "ogbg-moltox21", "ogbg-molfreesolv".
  • "sample_size": The number of molecules to sample from the dataset. Default=None, meaning that all molecules will be considered.
required
cache_data_path Optional[Union[str, os.PathLike]]

path where to save or reload the cached data. The path can be remote (S3, GS, etc).

None
featurization Optional[Union[Dict[str, Any], omegaconf.DictConfig]]

args to apply to the SMILES to Graph featurizer.

None
batch_size_training int

batch size for training and val dataset.

16
batch_size_inference int

batch size for test dataset.

16
num_workers int

Number of workers for the dataloader. Use -1 to use all available cores.

0
pin_memory bool

Whether to use pinned (page-locked) CPU memory for the dataloader.

True
featurization_n_jobs int

Number of cores to use for the featurization.

-1
featurization_progress bool

whether to show a progress bar during featurization.

False
featurization_backend str

The backend to use for the molecular featurization.

  • "multiprocessing": Found to cause less memory issues.
  • "loky": joblib's Default. Found to cause memory leaks.
  • "threading": Found to be slow.
'loky'
collate_fn Optional[Callable]

A custom torch collate function. Default is graphium.data.graphium_collate_fn

None
sample_size
  • int: The maximum number of elements to take from the dataset.
  • float: Value between 0 and 1 representing the fraction of the dataset to consider
  • None: all elements are considered.
required
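
A minimal sketch of instantiating this datamodule on one OGB dataset; the task name "tox21" is arbitrary, and additional per-task keys may be needed in practice:

from graphium.data.datamodule import GraphOGBDataModule

ogb_dm = GraphOGBDataModule(
    task_specific_args={
        "tox21": {
            "dataset_name": "ogbg-moltox21",  # one of the OGB GraphProp datasets listed above
            "sample_size": 1000,              # subsample for a quick run; None keeps every molecule
        }
    },
    batch_size_training=32,
    batch_size_inference=32,
    featurization_n_jobs=-1,   # use every available core for featurization
)
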
to_dict()

Generate a dictionary representation of the class

Returns:

Name Type Description
dict Dict[str, Any]

dictionary representation of the class

IPUDataModuleModifier

__init__(ipu_inference_opts=None, ipu_training_opts=None, ipu_dataloader_training_opts=None, ipu_dataloader_inference_opts=None, *args, **kwargs)

Wrapper functions for a DataModule to support IPUs and IPU options. To be used with dual inheritance, for example:

class IPUDataModule(BaseDataModule, IPUDataModuleModifier):
    def __init__(self, **kwargs):
        BaseDataModule.__init__(self, **kwargs)
        IPUDataModuleModifier.__init__(self, **kwargs)

Parameters:

Name Type Description Default
ipu_inference_opts Optional[poptorch.Options]

Options for the IPU in inference mode. Ignore if not using IPUs

None
ipu_training_opts Optional[poptorch.Options]

Options for the IPU in training mode. Ignore if not using IPUs

None
ipu_dataloader_training_opts

Options for the IPU dataloader in training mode. Ignore if not using IPUs

None
ipu_dataloader_inference_opts

Options for the IPU dataloader in inference mode. Ignore if not using IPUs

None
args

Arguments for the DataModule

()
kwargs

Keyword arguments for the DataModule

{}

MultitaskFromSmilesDataModule

Bases: BaseDataModule, IPUDataModuleModifier

in_dims property

Return all input dimensions for the set of graphs, including node/edge features and raw positional encoding dimensions such as eigval, eigvec, rwse, and more

num_edge_feats property

Return the number of edge features in the first graph

num_node_feats property

Return the number of node features in the first graph

__init__(task_specific_args, cache_data_path=None, processed_graph_data_path=None, featurization=None, batch_size_training=16, batch_size_inference=16, batch_size_per_pack=None, num_workers=0, pin_memory=True, persistent_workers=False, featurization_n_jobs=-1, featurization_progress=False, featurization_backend='loky', featurization_batch_size=1000, collate_fn=None, prepare_dict_or_graph='pyg:graph', **kwargs)

For parameters beginning with task_*, we have a dictionary where the key is the task name and the value is specified below.

Parameters:

Name Type Description Default
task_df

(value) a dataframe

required
task_df_path

(value) a path to a dataframe to load (CSV file). df takes precedence over df_path.

required
task_smiles_col

(value) Name of the SMILES column. If set to None, it will look for a column with the word "smile" (case insensitive) in it. If no such column is found, an error will be raised.

required
task_label_cols

(value) Name of the columns to use as labels, with different options.

  • list: A list of all column names to use
  • None: All the columns are used except the SMILES one.
  • str: The name of the single column to use
  • *str: A string starting with a * means all columns whose name ends with the specified str
  • str*: A string ending with a * means all columns whose name starts with the specified str
required
task_weights_col

(value) Name of the column to use as sample weights. If None, no weights are used. This parameter cannot be used together with weights_type.

required
task_weights_type

(value) The type of weights to use. This parameter cannot be used together with weights_col. It only supports multi-label binary classification.

Supported types:

  • None: No weights are used.
  • "sample_balanced": A weight is assigned to each sample inversely proportional to the number of positive value. If there are multiple labels, the product of the weights is used.
  • "sample_label_balanced": Similar to the "sample_balanced" weights, but the weights are applied to each element individually, without computing the product of the weights for a given sample.
required
task_idx_col

(value) Name of the columns to use as indices. Unused if set to None.

required
task_sample_size

(value)

  • int: The maximum number of elements to take from the dataset.
  • float: Value between 0 and 1 representing the fraction of the dataset to consider
  • None: all elements are considered.
required
task_split_val

(value) Ratio for the validation split.

required
task_split_test

(value) Ratio for the test split.

required
task_seed

(value) Seed to use for the random split and subsampling. More complex splitting strategies would need to be implemented separately.

required
task_splits_path

(value) A path to a CSV file containing indices for the splits. The file must contain 3 columns: "train", "val" and "test". It takes precedence over split_val and split_test.

required
cache_data_path Optional[Union[str, os.PathLike]]

path where to save or reload the cached data. The path can be remote (S3, GS, etc).

None
featurization Optional[Union[Dict[str, Any], omegaconf.DictConfig]]

args to apply to the SMILES to Graph featurizer.

None
batch_size_training int

batch size for training and val dataset.

16
batch_size_inference int

batch size for test dataset.

16
num_workers int

Number of workers for the dataloader. Use -1 to use all available cores.

0
pin_memory bool

Whether to use pinned (page-locked) CPU memory for the dataloader.

True
featurization_n_jobs int

Number of cores to use for the featurization.

-1
featurization_progress bool

whether to show a progress bar during featurization.

False
featurization_backend str

The backend to use for the molecular featurization.

  • "multiprocessing": Found to cause less memory issues.
  • "loky": joblib's Default. Found to cause memory leaks.
  • "threading": Found to be slow.
'loky'
featurization_batch_size int

Batch size to use for the featurization.

1000
collate_fn Optional[Callable]

A custom torch collate function. Default is graphium.data.graphium_collate_fn

None
prepare_dict_or_graph str

Whether to preprocess all molecules as Graph dict or PyG graphs. Possible options:

  • "pyg:dict": Process molecules as a dict. It's faster and requires less RAM during pre-processing. It is slower during training with with num_workers=0 since pyg Data will be created during data-loading, but faster with large num_workers, and less likely to cause memory issues with the parallelization.
  • "pyg:graph": Process molecules as pyg.data.Data.
'pyg:graph'
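
A hedged sketch of building a two-task datamodule. The per-task dictionaries follow the task_* fields above with the task_ prefix dropped (this mapping is an assumption here), and all paths and column names are placeholders:

from graphium.data.datamodule import MultitaskFromSmilesDataModule

task_specific_args = {
    "logp": dict(
        task_level="graph",
        df_path="data/logp.csv",   # placeholder CSV with a SMILES column and one label
        smiles_col="smiles",
        label_cols=["logp"],
        split_val=0.2,
        split_test=0.2,
        seed=42,
    ),
    "tox": dict(
        task_level="graph",
        df_path="data/tox.csv",    # placeholder CSV for a second task
        smiles_col="smiles",
        label_cols="tox_*",        # wildcard: every column whose name starts with "tox_"
        split_val=0.2,
        split_test=0.2,
        seed=42,
    ),
}

datamodule = MultitaskFromSmilesDataModule(
    task_specific_args=task_specific_args,
    batch_size_training=64,
    batch_size_inference=128,
    num_workers=4,
    featurization_n_jobs=-1,
    prepare_dict_or_graph="pyg:graph",
)
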
__len__()

Returns the number of elements of the current DataModule, which is the combined size of all single-task datasets given.

Returns:

Name Type Description
num_elements int

Number of elements in the current DataModule

__repr__()

Controls how the class is printed

calculate_statistics(dataset, train=False)

Calculate the statistics of the labels for each task, and overwrite the self.task_norms attribute.

Parameters:

Name Type Description Default
dataset Datasets.MultitaskDataset

the dataset to calculate the statistics from

required
train bool

whether the dataset is the training set

False
get_data_cache_fullname(compress=False)

Create a hash for the dataset, and use it to generate a file name

Parameters:

Name Type Description Default
compress bool

Whether to compress the data

False

Returns:

Type Description
str

full path to the data cache file

get_data_hash()

Get a hash specific to a dataset and smiles_transformer. Useful to cache the pre-processed data.

get_dataloader(dataset, shuffle, stage)

Get the poptorch dataloader for a given dataset

Parameters:

Name Type Description Default
dataset Dataset

The dataset from which to load the data

required
shuffle bool

set to True to have the data reshuffled at every epoch.

required
stage RunningStage

Whether in Training, Validating, Testing, Sanity-checking, Predicting, or Tuning phase.

required

Returns:

Type Description
Union[DataLoader, poptorch.DataLoader]

The poptorch dataloader to sample from

get_dataloader_kwargs(stage, shuffle, **kwargs)

Get the options for the dataloader depending on the current stage.

Parameters:

Name Type Description Default
stage RunningStage

Whether in Training, Validating, Testing, Sanity-checking, Predicting, or Tuning phase.

required
shuffle bool

set to True to have the data reshuffled at every epoch.

required

Returns:

Type Description
Dict[str, Any]

Arguments to pass to the DataLoader during initialization

get_fake_graph()

Low memory footprint method to get the featurization of a fake graph without reading the dataset. Useful for getting the number of node/edge features.

Returns:

Name Type Description
graph

A fake graph with the right featurization

get_label_statistics(data_path, data_hash, dataset, train=False)

Get the label statistics from the dataset, and save them to file, if needed. self.task_norms will be modified in-place with the label statistics.

Parameters:

Name Type Description Default
data_path Union[str, os.PathLike]

the path to save and load the label statistics to. If None, no saving and loading will be done.

required
data_hash str

the hash of the dataset generated by get_data_hash()

required
dataset Datasets.MultitaskDataset

the dataset to calculate the statistics from

required
train bool

whether the dataset is the training set

False
get_subsets_of_datasets(single_task_datasets, task_train_indices, task_val_indices, task_test_indices)

From a dictionary of datasets and their associated indices, subset the train/val/test sets

Parameters:

Name Type Description Default
single_task_datasets Dict[str, Datasets.SingleTaskDataset]

Dictionary of datasets

required
task_train_indices Dict[str, Iterable]

Dictionary of train indices

required
task_val_indices Dict[str, Iterable]

Dictionary of val indices

required
task_test_indices Dict[str, Iterable]

Dictionary of test indices

required

Returns:

Name Type Description
train_singletask_datasets Subset

Dictionary of train subsets

val_singletask_datasets Subset

Dictionary of val subsets

test_singletask_datasets Subset

Dictionary of test subsets

load_data_from_cache(verbose=True, compress=False)

Load the datasets from cache. First create a hash for the dataset, and verify if that hash is available at the path given by self.cache_data_path.

Parameters:

Name Type Description Default
verbose bool

Whether to print the progress

True
compress bool

Whether to compress the data

False

Returns:

Name Type Description
cache_data_exists bool

Whether the cache exists (if the hash matches) and the loading succeeded

normalize_label(dataset, stage)

Normalize the labels in the dataset using the statistics in self.task_norms.

Parameters:

Name Type Description Default
dataset Datasets.MultitaskDataset

the dataset to normalize the labels from

required

Returns:

Type Description
Datasets.MultitaskDataset

the dataset with normalized labels

prepare_data()

Called only from a single process in distributed settings. Steps:

  • If the cache is set and exists, reload from cache and return. Otherwise,
  • For each single-task dataset:
    • Load its dataframe from a path (if provided)
    • Subsample the dataframe
    • Extract the smiles, labels from the dataframe
  • In the previous step, we were also able to get the unique smiles, which we use to compute the features
  • For each single-task dataframe and associated data (smiles, labels, etc.):
    • Filter out the data corresponding to molecules which failed featurization.
    • Create a corresponding SingletaskDataset
    • Split the SingletaskDataset according to the task-specific splits for train, val and test
save_data_to_cache(verbose=True, compress=False)

Save the datasets to cache. First, create a hash for the dataset and use it to generate a file name; then save to the path given by self.cache_data_path.

Parameters:

Name Type Description Default
verbose bool

Whether to print the progress

True
compress bool

Whether to compress the data

False
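
For instance (assuming the datamodule was constructed with cache_data_path set; continuing from the datamodule sketched above):

# After prepare_data() has featurized everything, persist it for future runs
datamodule.save_data_to_cache(verbose=True, compress=False)

# On a later run, try to reload; returns False if no cache matches the data hash
cache_hit = datamodule.load_data_from_cache(verbose=True, compress=False)
print(datamodule.get_data_cache_fullname(compress=False), cache_hit)
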
setup(stage=None, save_smiles_and_ids=False)

Prepare the torch dataset. Called on every GPU. Setting state here is OK.

Parameters:

Name Type Description Default
stage str

Either 'fit', 'test', or None.

None
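
Putting it together, a typical single-process workflow for this datamodule might look like the following (continuing from the datamodule sketched earlier):

datamodule.prepare_data()                     # featurize the SMILES and build/split the single-task datasets
datamodule.setup(stage="fit")                 # assemble the torch datasets for training and validation

train_loader = datamodule.train_dataloader()  # standard Lightning-style dataloader accessors
val_loader = datamodule.val_dataloader()

print(len(datamodule))                        # combined size of all single-task datasets
print(datamodule.num_node_feats, datamodule.num_edge_feats, datamodule.in_dims)
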
to_dict()

Returns a dictionary representation of the current DataModule

Returns:

Name Type Description
obj_repr Dict[str, Any]

Dictionary representation of the current DataModule

Collate Module


graphium.data.collate

collage_pyg_graph(pyg_graphs, batch_size_per_pack=None)

Function to collate PyTorch Geometric graphs. Converts all numpy types to torch and converts edge indices to int64.

Parameters:

Name Type Description Default
pyg_graphs Iterable[Union[Data, Dict]]

Iterable of PyG graphs

required
batch_size_per_pack Optional[int]

The number of graphs to pack together. This is useful for using packing with the Transformer.

None

collate_labels(labels, labels_size_dict=None, labels_dtype_dict=None)

Collate labels for multitask learning.

Parameters:

Name Type Description Default
labels List[Data]

List of labels

required
labels_size_dict Optional[Dict[str, Any]]

Dict of the form Dict[tasks, sizes] which has task names as keys and the size of the label tensor as value. The size of the tensor corresponds to how many labels/values there are to predict for that task.

None
labels_dtype_dict Optional[Dict[str, Any]]

(Note): This is an attribute of the MultitaskDataset. A dictionary of the form Dict[tasks, dtypes] which has task names as keys and the dtype of the label tensor as value. This is necessary to ensure the missing labels are added with NaNs of the right dtype

None

Returns:

Type Description

A dictionary of the form Dict[tasks, labels] where tasks is the name of the task and labels

is a tensor of shape (batch_size, *labels_size_dict[task]).

collate_pyg_graph_labels(pyg_labels)

Function to collate PyTorch Geometric labels. Converts all numpy types to torch.

Parameters:

Name Type Description Default
pyg_labels List[Data]

Iterable of PyG label Data objects

required

get_expected_label_size(label_data, task, label_size)

Determines the expected label size based on the specific graph properties and the number of targets in the task-dataset.

graphium_collate_fn(elements, labels_size_dict=None, labels_dtype_dict=None, mask_nan='raise', do_not_collate_keys=[], batch_size_per_pack=None)

This collate function is identical to the default PyTorch collate function, but adds support for pyg.data.Data to batch graphs.

Besides the pyg graph collate, other objects are processed the same way as by the original torch collate function. See https://pytorch.org/docs/stable/data.html#dataloader-collate-fn for more details.

Note

If graphium needs to manipulate other tricky-to-batch objects, support for them should be added to this single collate function.

Parameters:

Name Type Description Default
elements Union[List[Any], Dict[str, List[Any]]]

The elements to batch. See torch.utils.data.dataloader.default_collate.

required
labels_size_dict Optional[Dict[str, Any]]

(Note): This is an attribute of the MultitaskDataset. A dictionary of the form Dict[tasks, sizes] which has task names as keys and the size of the label tensor as value. The size of the tensor corresponds to how many labels/values there are to predict for that task.

None
labels_dtype_dict Optional[Dict[str, Any]]

(Note): This is an attribute of the MultitaskDataset. A dictionary of the form Dict[tasks, dtypes] which has task names as keys and the dtype of the label tensor as value. This is necessary to ensure the missing labels are added with NaNs of the right dtype

None
mask_nan Union[str, float, Type[None]]

Deal with the NaN/Inf when calling the function make_pyg_graph. Some values become Inf when changing data type; this parameter controls how such cases are handled.

  • "raise": DEFAULT. Raise an error when there is a nan or inf in the featurization
  • "warn": Raise a warning when there is a nan or inf in the featurization
  • "None": Don't do anything
  • "Floating value": Replace nans or inf by the specified value
'raise'
do_not_collate_keys

Keys to ignore for the collate

[]
batch_size_per_pack Optional[int]

The number of graphs to pack together. This is useful for using packing with the Transformer. If None, no packing is done. Otherwise, indices are generated to map the nodes to the pack they belong to under the key "pack_from_node_idx", with an additional mask to indicate which nodes are from the same graph under the key "pack_attn_mask".

None

Returns:

Type Description
Union[Any, Dict[str, Any]]

The batched elements. See torch.utils.data.dataloader.default_collate.
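
A small sketch of plugging this collate function into a standard DataLoader; my_dataset and the contents of labels_size_dict are placeholders, and functools.partial is one way to bind the extra keyword arguments:

from functools import partial
from torch.utils.data import DataLoader
from graphium.data.collate import graphium_collate_fn

collate = partial(
    graphium_collate_fn,
    labels_size_dict={"logp": [1]},   # placeholder: one regression target for a "logp" task
    mask_nan="warn",                  # warn instead of raising on NaN/Inf in the featurization
)
loader = DataLoader(my_dataset, batch_size=32, collate_fn=collate)
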

pad_nodepairs(pe, num_nodes, max_num_nodes_per_graph)

This function zero-pads nodepair-level positional encodings to conform with the batching logic.

Parameters:

Name Type Description Default
pe torch.Tensor, [num_nodes, num_nodes, num_feat]

Nodepair pe

required
num_nodes int

Number of nodes of processed graph

required
max_num_nodes_per_graph int

Maximum number of nodes among graphs in current batch

required

Returns:

Name Type Description
padded_pe torch.Tensor, [num_nodes, max_num_nodes_per_graph, num_feat]

padded nodepair pe tensor
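
For example, with a randomly generated encoding purely to illustrate the shapes:

import torch
from graphium.data.collate import pad_nodepairs

pe = torch.rand(5, 5, 8)   # nodepair encoding of a 5-node graph with 8 features
padded = pad_nodepairs(pe, num_nodes=5, max_num_nodes_per_graph=12)
print(padded.shape)        # expected torch.Size([5, 12, 8]) per the documented output shape
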

pad_to_expected_label_size(labels, label_size)

Determine the difference between the shape of labels and the expected shape label_size, and pad with torch.nan accordingly.

Util Functions


graphium.data.utils

download_graphium_dataset(name, output_path, extract_zip=True, progress=False)

Download a Graphium dataset to a specified location.

Parameters:

Name Type Description Default
name str

Name of the Graphium dataset from graphium.data.utils.get_graphium_datasets().

required
output_path str

Directory path where the dataset will be downloaded.

required
extract_zip bool

Whether to extract the dataset if it's a zip file.

True
progress bool

Whether to show a progress bar during download.

False

Returns:

Name Type Description
str str

Path to the downloaded dataset.

graphium_package_path(graphium_path)

Return the path of a graphium file in the package.

list_graphium_datasets()

List Graphium datasets available to download.

Returns:

Name Type Description
set set

A set of Graphium dataset names.

load_micro_zinc()

Return a dataframe of micro ZINC (1000 data points).

Returns:

Type Description
pd.DataFrame

pd.DataFrame: A dataframe of micro ZINC.

load_tiny_zinc()

Return a dataframe of tiny ZINC (100 data points).

Returns:

Type Description
pd.DataFrame

pd.DataFrame: A dataframe of tiny ZINC.
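
A short usage sketch of these helpers; the dataset name and output directory below are placeholders:

import graphium.data.utils as gdu

print(gdu.list_graphium_datasets())    # names accepted by download_graphium_dataset
path = gdu.download_graphium_dataset(
    name="some-graphium-dataset",      # placeholder: pick a name from the list above
    output_path="datasets/",           # placeholder download directory
    progress=True,
)

df = gdu.load_micro_zinc()             # 1000-row dataframe of micro ZINC
print(df.shape)
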