graphium.data¶
Module for loading datasets and collating batches.
Data Module¶
graphium.data.datamodule
¶
Copyright (c) 2023 Valence Labs, Recursion Pharmaceuticals and Graphcore Limited.
Use of this software is subject to the terms and conditions outlined in the LICENSE file. Unauthorized modification, distribution, or use is prohibited. Provided 'as is' without warranties of any kind.
Valence Labs, Recursion Pharmaceuticals and Graphcore Limited are not liable for any damages arising from its use. Refer to the LICENSE file for the full terms and conditions.
ADMETBenchmarkDataModule
¶
Bases: MultitaskFromSmilesDataModule
Wrapper to use the ADMET benchmark group from the TDC (Therapeutics Data Commons).
Dependency
This class requires PyTDC to be installed.
Citation
Huang, K., Fu, T., Gao, W., Zhao, Y., Roohani, Y., Leskovec, J., Coley, C., Xiao, C., Sun, J., & Zitnik, M. (2021). Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development. Proceedings of Neural Information Processing Systems, NeurIPS Datasets and Benchmarks.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `tdc_benchmark_names` | `Optional[Union[str, List[str]]]` | Any subset of the benchmark names that make up the ADMET benchmarking group. If `None`, the full benchmarking group is used. | `None` |
| `tdc_train_val_seed` | `int` | TDC recommends a default splitting method for the train-val split. This parameter seeds that splitting method. | `0` |
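A minimal usage sketch, assuming PyTDC is installed and that the default featurization is acceptable; the benchmark names are examples and must match TDC's exact naming:

```python
from graphium.data.datamodule import ADMETBenchmarkDataModule

# Restrict the benchmark group to two tasks; passing None (the default) would
# use every benchmark in the ADMET group.
datamodule = ADMETBenchmarkDataModule(
    tdc_benchmark_names=["caco2_wang", "lipophilicity_astrazeneca"],
    tdc_train_val_seed=0,
)

datamodule.prepare_data()
datamodule.setup()
train_loader = datamodule.train_dataloader()
```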
BaseDataModule
¶
Bases: LightningDataModule
get_num_workers
property
¶
get the number of workers to use
predict_ds
property
writable
¶
Get the dataset used for the prediction
__init__(batch_size_training=16, batch_size_inference=16, batch_size_per_pack=None, num_workers=0, pin_memory=True, persistent_workers=False, multiprocessing_context=None, collate_fn=None)
¶
Base datamodule for all datasets (to be inherited).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `batch_size_training` | `int` | Batch size for training | `16` |
| `batch_size_inference` | `int` | Batch size for inference | `16` |
| `num_workers` | `int` | Number of workers for data loading | `0` |
| `pin_memory` | `bool` | Whether to pin memory | `True` |
| `persistent_workers` | `bool` | Whether to use persistent workers | `False` |
| `multiprocessing_context` | `Optional[str]` | Multiprocessing context for data worker creation | `None` |
| `collate_fn` | `Optional[Callable]` | Collate function for batching | `None` |
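These constructor arguments are shared by every datamodule in this module; a hedged sketch of forwarding them from a subclass (the subclass name and chosen values are illustrative):

```python
from typing import Callable, Optional

from graphium.data.datamodule import BaseDataModule


class MyDataModule(BaseDataModule):
    """Hypothetical subclass that only customises the dataloader options."""

    def __init__(self, collate_fn: Optional[Callable] = None, **kwargs):
        super().__init__(
            batch_size_training=32,
            batch_size_inference=64,
            num_workers=4,
            pin_memory=True,
            persistent_workers=True,
            collate_fn=collate_fn,
            **kwargs,
        )
```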
get_dataloader(dataset, shuffle, stage)
¶
Get the dataloader for a given dataset
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset` | `Dataset` | The dataset from which to load the data | required |
| `shuffle` | `bool` | Set to `True` to have the data reshuffled at every epoch | required |
| `stage` | `RunningStage` | Whether in Training, Validating, Testing, Sanity-checking, Predicting, or Tuning phase. | required |

Returns:

| Type | Description |
|---|---|
| `DataLoader` | The dataloader to sample from |
get_dataloader_kwargs(stage, shuffle, **kwargs)
¶
Get the options for the dataloader depending on the current stage.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `stage` | `RunningStage` | Whether in Training, Validating, Testing, Sanity-checking, Predicting, or Tuning phase. | required |
| `shuffle` | `bool` | Set to `True` to have the data reshuffled at every epoch | required |

Returns:

| Type | Description |
|---|---|
| `Dict[str, Any]` | Arguments to pass to the `DataLoader` for the given stage |
get_max_num_edges_datamodule(stages=None)
¶
Get the maximum number of edges across all datasets from the datamodule
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `datamodule` | | The datamodule from which to extract the maximum number of edges | required |
| `stages` | `Optional[List[str]]` | The stages from which to extract the max num edges. Possible values are ["train", "val", "test", "predict"]. If None, all stages are considered. | `None` |

Returns:

| Name | Type | Description |
|---|---|---|
| `max_num_edges` | `int` | The maximum number of edges across all datasets from the datamodule |
get_max_num_nodes_datamodule(stages=None)
¶
Get the maximum number of nodes across all datasets from the datamodule
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `datamodule` | | The datamodule from which to extract the maximum number of nodes | required |
| `stages` | `Optional[List[str]]` | The stages from which to extract the max num nodes. Possible values are ["train", "val", "test", "predict"]. If None, all stages are considered. | `None` |

Returns:

| Name | Type | Description |
|---|---|---|
| `max_num_nodes` | `int` | The maximum number of nodes across all datasets from the datamodule |
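These two helpers are typically used to size padded batches (for example, for packing); a sketch, assuming `datamodule` is an already set-up datamodule from this module:

```python
# Query the padding bounds from the training and validation splits only.
max_nodes = datamodule.get_max_num_nodes_datamodule(stages=["train", "val"])
max_edges = datamodule.get_max_num_edges_datamodule(stages=["train", "val"])
print(f"Pad graphs to at most {max_nodes} nodes and {max_edges} edges")
```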
predict_dataloader(**kwargs)
¶
return the dataloader for prediction
test_dataloader(**kwargs)
¶
return the test dataloader
train_dataloader(**kwargs)
¶
return the training dataloader
val_dataloader(**kwargs)
¶
return the validation dataloader
DatasetProcessingParams
dataclass
¶
__init__(task_level=None, df=None, df_path=None, smiles_col=None, label_cols=None, weights_col=None, weights_type=None, idx_col=None, mol_ids_col=None, sample_size=None, split_val=0.2, split_test=0.2, seed=None, epoch_sampling_fraction=1.0, splits_path=None, split_names=['train', 'val', 'test'], label_normalization=None)
¶
Object to store the parameters for the dataset processing.

Parameters:

- `task_level`: The task level, whether it is graph, node, edge or nodepair
- `df`: The dataframe containing the data
- `df_path`: The path to the dataframe containing the data. If a list, all files are read, sorted alphabetically and concatenated.
- `smiles_col`: The column name of the SMILES
- `label_cols`: The column names of the labels
- `weights_col`: The column name of the weights
- `weights_type`: The type of weights
- `idx_col`: The column name of the indices
- `mol_ids_col`: The column name of the molecule ids
- `sample_size`: The size of the sample
- `split_val`: The fraction of the data to use for validation
- `split_test`: The fraction of the data to use for testing
- `seed`: The seed to use for the splits and subsampling
- `splits_path`: The path to the splits, or a dictionary with the splits
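A sketch of building the per-task processing parameters; the CSV path and column names are placeholders:

```python
from graphium.data.datamodule import DatasetProcessingParams

task_params = DatasetProcessingParams(
    task_level="graph",                  # graph-, node-, edge- or nodepair-level task
    df_path="data/my_task.csv",          # placeholder path
    smiles_col="smiles",
    label_cols=["target_1", "target_2"],
    split_val=0.2,
    split_test=0.2,
    seed=42,
)
```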
FakeDataModule
¶
Bases: MultitaskFromSmilesDataModule
A fake datamodule that generates artificial data by mimicking the true data coming from the provided dataset. It is useful to test the speed and performance of the model on a dataset without having to featurize it and wait for the workers to load it.
generate_data(label_cols, smiles_col)
¶
Returns: pd.DataFrame
get_fake_graph()
¶
Low memory footprint method to get the first datapoint DGL graph. The first 10 rows of the data are read in case the first one has a featurization error. If all of them fail, `None` is returned; otherwise the first graph that does not fail is returned.
prepare_data()
¶
Called only from a single process in distributed settings. Steps:
- If each cache is set and exists, reload from cache and return. Otherwise,
- For each single-task dataset:
- Load its dataframe from a path (if provided)
- Subsample the dataframe
- Extract the smiles, labels from the dataframe
- In the previous step, we were also able to get the unique smiles, which we use to compute the features
- For each single-task dataframe and associated data (smiles, labels, etc.):
- Filter out the data corresponding to molecules which failed featurization.
- Create a corresponding SingletaskDataset
- Split the SingletaskDataset according to the task-specific splits for train, val and test
setup(stage=None)
¶
Prepare the torch dataset. Called on every GPU. Setting state here is OK. Parameters: stage (str): Either 'fit', 'test', or None.
GraphOGBDataModule
¶
Bases: MultitaskFromSmilesDataModule
__init__(task_specific_args, processed_graph_data_path=None, dataloading_from='ram', featurization=None, batch_size_training=16, batch_size_inference=16, batch_size_per_pack=None, num_workers=0, pin_memory=True, persistent_workers=False, multiprocessing_context=None, featurization_n_jobs=-1, featurization_progress=False, featurization_backend='loky', collate_fn=None, prepare_dict_or_graph='pyg:graph', **kwargs)
¶
Load an OGB (Open-graph-benchmark) GraphProp dataset.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `task_specific_args` | `Dict[str, Union[DatasetProcessingParams, Dict[str, Any]]]` | Arguments related to each task, with the task-name being the key, and the specific arguments being the values. The arguments must be a Dict containing the keys described by `DatasetProcessingParams`. | required |
| `processed_graph_data_path` | `Optional[Union[str, PathLike]]` | Path to the processed graph data. If None, the data will be downloaded from the OGB website. | `None` |
| `dataloading_from` | `str` | Whether to load the data from RAM or disk. Default is "ram". | `'ram'` |
| `featurization` | `Optional[Union[Dict[str, Any], DictConfig]]` | Args to apply to the SMILES-to-graph featurizer. | `None` |
| `batch_size_training` | `int` | Batch size for the training and val datasets. | `16` |
| `batch_size_inference` | `int` | Batch size for the test dataset. | `16` |
| `num_workers` | `int` | Number of workers for the dataloader. Use -1 to use all available cores. | `0` |
| `pin_memory` | `bool` | Whether to pin on paginated CPU memory for the dataloader. | `True` |
| `featurization_n_jobs` | `int` | Number of cores to use for the featurization. | `-1` |
| `featurization_progress` | `bool` | Whether to show a progress bar during featurization. | `False` |
| `featurization_backend` | `str` | The backend to use for the molecular featurization: "multiprocessing", "loky" or "threading". | `'loky'` |
| `collate_fn` | `Optional[Callable]` | A custom torch collate function. Default is to use `graphium.data.graphium_collate_fn`. | `None` |
| `sample_size` | | The size of the sample. | required |
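A hedged instantiation sketch; the exact keys expected inside each task dictionary (in particular the `"dataset"` key naming the OGB dataset) are assumptions and should be checked against the class:

```python
from graphium.data.datamodule import GraphOGBDataModule

# One entry per task; the value can be a DatasetProcessingParams or a plain dict.
# The "dataset" key naming the OGB dataset is an assumption here.
task_specific_args = {
    "molhiv": {"task_level": "graph", "dataset": "ogbg-molhiv"},
}

datamodule = GraphOGBDataModule(
    task_specific_args,
    batch_size_training=32,
    batch_size_inference=64,
)
datamodule.prepare_data()
```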
to_dict()
¶
Generate a dictionary representation of the class. Returns: dict: dictionary representation of the class
IPUDataModuleModifier
¶
__init__(ipu_inference_opts=None, ipu_training_opts=None, ipu_dataloader_training_opts=None, ipu_dataloader_inference_opts=None, *args, **kwargs)
¶
Wrapper functions for a `DataModule` to support IPUs and IPU options. To be used with dual inheritance, for example:

    class IPUDataModule(BaseDataModule, IPUDataModuleModifier):
        def __init__(self, **kwargs):
            BaseDataModule.__init__(self, **kwargs)
            IPUDataModuleModifier.__init__(self, **kwargs)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `ipu_inference_opts` | `Optional[Options]` | Options for the IPU in inference mode. Ignore if not using IPUs | `None` |
| `ipu_training_opts` | `Optional[Options]` | Options for the IPU in training mode. Ignore if not using IPUs | `None` |
| `ipu_dataloader_training_opts` | | Options for the IPU dataloader in training mode. Ignore if not using IPUs | `None` |
| `ipu_dataloader_inference_opts` | | Options for the IPU dataloader in inference mode. Ignore if not using IPUs | `None` |
| `args` | | Arguments for the base `DataModule` | `()` |
| `kwargs` | | Keyword arguments for the base `DataModule` | `{}` |
MultitaskFromSmilesDataModule
¶
Bases: BaseDataModule, IPUDataModuleModifier
in_dims
property
¶
Return all input dimensions for the set of graphs, including node/edge features and raw positional encoding dimensions such as eigval, eigvec, rwse and more.
num_edge_feats
property
¶
Return the number of edge features in the first graph
num_node_feats
property
¶
Return the number of node features in the first graph
__init__(task_specific_args, processed_graph_data_path=None, dataloading_from='ram', featurization=None, batch_size_training=16, batch_size_inference=16, batch_size_per_pack=None, num_workers=0, pin_memory=True, persistent_workers=False, multiprocessing_context=None, featurization_n_jobs=-1, featurization_progress=False, featurization_backend='loky', featurization_batch_size=1000, collate_fn=None, prepare_dict_or_graph='pyg:graph', **kwargs)
¶
Only for parameters beginning with task_*, we have a dictionary where the key is the task name
and the value is specified below.
Parameters:
task_specific_args: A dictionary where the key is the task name (for the multi-task setting), and
the value is a DatasetProcessingParams
object. The DatasetProcessingParams
object
contains multiple parameters to define how to load and process the files, such as:
- `task_level`
- `df`
- `df_path`
- `smiles_col`
- `label_cols`
dataloading_from: Whether to load the data from RAM or from disk. If set to "disk", the data
must have been previously cached with `processed_graph_data_path` set. If set to "ram", the data
will be loaded in RAM and the `processed_graph_data_path` will be ignored.
featurization: args to apply to the SMILES to Graph featurizer.
batch_size_training: batch size for training and val dataset.
batch_size_inference: batch size for test dataset.
num_workers: Number of workers for the dataloader. Use -1 to use all available
cores.
pin_memory: Whether to pin on paginated CPU memory for the dataloader.
featurization_n_jobs: Number of cores to use for the featurization.
featurization_progress: whether to show a progress bar during featurization.
featurization_backend: The backend to use for the molecular featurization.
- "multiprocessing": Found to cause less memory issues.
- "loky": joblib's Default. Found to cause memory leaks.
- "threading": Found to be slow.
featurization_batch_size: Batch size to use for the featurization.
collate_fn: A custom torch collate function. Default is to `graphium.data.graphium_collate_fn`
prepare_dict_or_graph: Whether to preprocess all molecules as Graph dict or PyG graphs.
Possible options:
- "pyg:dict": Process molecules as a `dict`. It's faster and requires less RAM during
pre-processing. It is slower during training with with `num_workers=0` since
pyg `Data` will be created during data-loading, but faster with large
`num_workers`, and less likely to cause memory issues with the parallelization.
- "pyg:graph": Process molecules as `pyg.data.Data`.
__len__()
¶
Returns the number of elements of the current DataModule, which is the combined size of all single-task datasets given. Returns: num_elements: Number of elements in the current DataModule
__repr__()
¶
Controls how the class is printed.
calculate_statistics(dataset, train=False)
¶
Calculate the statistics of the labels for each task, and overwrite the `self.task_norms` attribute.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset` | `MultitaskDataset` | The dataset to calculate the statistics from | required |
| `train` | `bool` | Whether the dataset is the training set | `False` |
get_data_cache_fullname(compress=False)
¶
Create a hash for the dataset, and use it to generate a file name
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `compress` | `bool` | Whether to compress the data | `False` |

Returns: full path to the data cache file
get_data_hash()
¶
Get a hash specific to a dataset and smiles_transformer. Useful to cache the pre-processed data.
get_dataloader(dataset, shuffle, stage)
¶
Get the poptorch dataloader for a given dataset
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset` | `Dataset` | The dataset from which to load the data | required |
| `shuffle` | `bool` | Set to `True` to have the data reshuffled at every epoch | required |
| `stage` | `RunningStage` | Whether in Training, Validating, Testing, Sanity-checking, Predicting, or Tuning phase. | required |

Returns:

| Type | Description |
|---|---|
| `Union[DataLoader, DataLoader]` | The poptorch dataloader to sample from |
get_dataloader_kwargs(stage, shuffle, **kwargs)
¶
Get the options for the dataloader depending on the current stage.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `stage` | `RunningStage` | Whether in Training, Validating, Testing, Sanity-checking, Predicting, or Tuning phase. | required |
| `shuffle` | `bool` | Set to `True` to have the data reshuffled at every epoch | required |

Returns:

| Type | Description |
|---|---|
| `Dict[str, Any]` | Arguments to pass to the `DataLoader` for the given stage |
get_fake_graph()
¶
Low memory footprint method to get the featurization of a fake graph without reading the dataset. Useful for getting the number of node/edge features.
Returns:

| Name | Type | Description |
|---|---|---|
| `graph` | | A fake graph with the right featurization |
get_label_statistics(data_path, data_hash, dataset, train=False)
¶
Get the label statistics from the dataset, and save them to file, if needed. `self.task_norms` will be modified in-place with the label statistics.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data_path` | `Union[str, PathLike]` | The path to save and load the label statistics to. If None, no saving and loading will be done. | required |
| `data_hash` | `str` | The hash of the dataset generated by `get_data_hash()` | required |
| `dataset` | `MultitaskDataset` | The dataset to calculate the statistics from | required |
| `train` | `bool` | Whether the dataset is the training set | `False` |
get_subsets_of_datasets(single_task_datasets, task_train_indices, task_val_indices, task_test_indices)
¶
From a dictionary of datasets and their associated indices, subset the train/val/test sets
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `single_task_datasets` | `Dict[str, SingleTaskDataset]` | Dictionary of datasets | required |
| `task_train_indices` | `Dict[str, Iterable]` | Dictionary of train indices | required |
| `task_val_indices` | `Dict[str, Iterable]` | Dictionary of val indices | required |
| `task_test_indices` | `Dict[str, Iterable]` | Dictionary of test indices | required |

Returns:

- `train_singletask_datasets`: Dictionary of train subsets
- `val_singletask_datasets`: Dictionary of val subsets
- `test_singletask_datasets`: Dictionary of test subsets
load_data_from_cache(verbose=True, compress=False)
¶
Load the datasets from cache. First create a hash for the dataset, and verify if that hash is available at the path given by `self.processed_graph_data_path`.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `verbose` | `bool` | Whether to print the progress | `True` |
| `compress` | `bool` | Whether to compress the data | `False` |

Returns:

| Name | Type | Description |
|---|---|---|
| `cache_data_exists` | `bool` | Whether the cache exists (if the hash matches) and the loading succeeded |
normalize_label(dataset, stage)
¶
Normalize the labels in the dataset using the statistics in `self.task_norms`.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset` | `MultitaskDataset` | The dataset to normalize the labels from | required |

Returns:

| Type | Description |
|---|---|
| `MultitaskDataset` | The dataset with normalized labels |
prepare_data(save_smiles_and_ids=False)
¶
Called only from a single process in distributed settings. Steps:
- If each cache is set and exists, reload from cache and return. Otherwise,
- For each single-task dataset:
- Load its dataframe from a path (if provided)
- Subsample the dataframe
- Extract the smiles, labels from the dataframe
- In the previous step, we were also able to get the unique smiles, which we use to compute the features
- For each single-task dataframe and associated data (smiles, labels, etc.):
- Filter out the data corresponding to molecules which failed featurization.
- Create a corresponding SingletaskDataset
- Split the SingletaskDataset according to the task-specific splits for train, val and test
setup(stage=None, save_smiles_and_ids=False)
¶
Prepare the torch dataset. Called on every GPU. Setting state here is OK. Parameters: stage (str): Either 'fit', 'test', or None.
to_dict()
¶
Returns a dictionary representation of the current DataModule Returns: obj_repr: Dictionary representation of the current DataModule
Collate Module¶
graphium.data.collate
¶
Copyright (c) 2023 Valence Labs, Recursion Pharmaceuticals and Graphcore.
Use of this software is subject to the terms and conditions outlined in the LICENSE file. Unauthorized modification, distribution, or use is prohibited. Provided 'as is' without warranties of any kind.
Valence Labs, Recursion Pharmaceuticals and Graphcore are not liable for any damages arising from its use. Refer to the LICENSE file for the full terms and conditions.
collage_pyg_graph(pyg_graphs, batch_size_per_pack=None)
¶
Function to collate PyTorch Geometric graphs. Converts all numpy types to torch, and edge indices to int64.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `pyg_graphs` | `Iterable[Union[Data, Dict]]` | Iterable of PyG graphs | required |
| `batch_size_per_pack` | `Optional[int]` | The number of graphs to pack together. This is useful for using packing with the Transformer. If None, no packing is done. | `None` |
collate_labels(labels, labels_size_dict=None, labels_dtype_dict=None)
¶
Collate labels for multitask learning.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `labels` | `List[Data]` | List of labels | required |
| `labels_size_dict` | `Optional[Dict[str, Any]]` | Dict of the form Dict[tasks, sizes] which has task names as keys and the size of the label tensor as value. The size of the tensor corresponds to how many labels/values there are to predict for that task. | `None` |
| `labels_dtype_dict` | `Optional[Dict[str, Any]]` | (Note): This is an attribute of the `MultitaskDataset`. A dictionary of the form Dict[tasks, dtypes] which has task names as keys and the dtype of the label tensor as value. | `None` |

Returns:

| Type | Description |
|---|---|
| | A dictionary of the form Dict[tasks, labels] where tasks is the name of the task and labels is a tensor of shape (batch_size, *labels_size_dict[task]). |
collate_pyg_graph_labels(pyg_labels)
¶
Function to collate PyTorch Geometric labels. Converts all numpy types to torch.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `pyg_labels` | `List[Data]` | Iterable of PyG label Data objects | required |
get_expected_label_size(label_data, task, label_size)
¶
Determines the expected label size based on the specific graph properties and the number of targets in the task-dataset.
graphium_collate_fn(elements, labels_size_dict=None, labels_dtype_dict=None, mask_nan='raise', do_not_collate_keys=[], batch_size_per_pack=None)
¶
This collate function is identical to the default PyTorch collate function but adds support for `pyg.data.Data` to batch graphs.
Besides PyG graph collate, other objects are processed the same way as the original torch collate function. See https://pytorch.org/docs/stable/data.html#dataloader-collate-fn for more details.
Note
If graphium needs to manipulate other tricky-to-batch objects, support for them should be added to this single collate function.
Parameters:
elements:
The elements to batch. See `torch.utils.data.dataloader.default_collate`.
labels_size_dict:
(Note): This is an attribute of the `MultitaskDataset`.
A dictionary of the form Dict[tasks, sizes] which has task names as keys
and the size of the label tensor as value. The size of the tensor corresponds to how many
labels/values there are to predict for that task.
labels_dtype_dict:
(Note): This is an attribute of the `MultitaskDataset`.
A dictionary of the form Dict[tasks, dtypes] which has task names as keys
and the dtype of the label tensor as value. This is necessary to ensure the missing labels are added with NaNs of the right dtype
mask_nan:
Deal with the NaN/Inf when calling the function `make_pyg_graph`.
Some values become `Inf` when changing data type. This allows to deal
with that.
- "raise": Raise an error when there is a nan or inf in the featurization
- "warn": Raise a warning when there is a nan or inf in the featurization
- "None": DEFAULT. Don't do anything
- "Floating value": Replace nans or inf by the specified value
do_not_collate_keys:
Keys to ignore for the collate
batch_size_per_pack: The number of graphs to pack together.
This is useful for using packing with the Transformer.
If None, no packing is done.
Otherwise, indices are generated to map the nodes to the pack they belong to under the key `"pack_from_node_idx"`,
with an additional mask to indicate which nodes are from the same graph under the key `"pack_attn_mask"`.
Returns:

| Type | Description |
|---|---|
| `Union[Any, Dict[str, Any]]` | The batched elements. See `torch.utils.data.dataloader.default_collate`. |
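A sketch of wiring the collate function into a standard `DataLoader` via `functools.partial`; `my_dataset` is a placeholder for any map-style dataset yielding graphium elements:

```python
from functools import partial

from torch.utils.data import DataLoader

from graphium.data.collate import graphium_collate_fn

# Fix the collate options once, then hand the partial to a standard DataLoader.
collate = partial(
    graphium_collate_fn,
    mask_nan="warn",           # warn instead of raising on NaN/Inf in the featurization
    batch_size_per_pack=None,  # no packing
)

loader = DataLoader(my_dataset, batch_size=32, collate_fn=collate)  # `my_dataset` is a placeholder
```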
pad_nodepairs(pe, num_nodes, max_num_nodes_per_graph)
¶
This function zero-pads nodepair-level positional encodings to conform with the batching logic.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `pe` | `(Tensor, [num_nodes, num_nodes, num_feat])` | Nodepair pe | required |
| `num_nodes` | `int` | Number of nodes of processed graph | required |
| `max_num_nodes_per_graph` | `int` | Maximum number of nodes among graphs in current batch | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `padded_pe` | `(Tensor, [num_nodes, max_num_nodes_per_graph, num_feat])` | Padded nodepair pe tensor |
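A small sketch of the padding behaviour described above; the shapes follow the tables, and the expected output shape is an assumption derived from them:

```python
import torch

from graphium.data.collate import pad_nodepairs

num_nodes, num_feat, max_nodes = 5, 8, 12
pe = torch.randn(num_nodes, num_nodes, num_feat)  # nodepair positional encoding
padded = pad_nodepairs(pe, num_nodes, max_nodes)
print(padded.shape)  # expected: torch.Size([5, 12, 8])
```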
pad_to_expected_label_size(labels, label_size)
¶
Determine the difference between the shape of `labels` and the expected shape `label_size`, and pad with `torch.nan` accordingly.
Util Functions¶
graphium.data.utils
¶
Copyright (c) 2023 Valence Labs, Recursion Pharmaceuticals and Graphcore Limited.
Use of this software is subject to the terms and conditions outlined in the LICENSE file. Unauthorized modification, distribution, or use is prohibited. Provided 'as is' without warranties of any kind.
Valence Labs, Recursion Pharmaceuticals and Graphcore Limited are not liable for any damages arising from its use. Refer to the LICENSE file for the full terms and conditions.
download_graphium_dataset(name, output_path, extract_zip=True, progress=False)
¶
Download a Graphium dataset to a specified location.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Name of the Graphium dataset (see `list_graphium_datasets()`). | required |
| `output_path` | `str` | Directory path where to download the dataset to. | required |
| `extract_zip` | `bool` | Whether to extract the dataset if it's a zip file. | `True` |
| `progress` | `bool` | Whether to show a progress bar during download. | `False` |

Returns:

| Name | Type | Description |
|---|---|---|
| `str` | `str` | Path to the downloaded dataset. |
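A sketch combining the listing and download helpers; the dataset name below is a placeholder, so pick a real one from the listing:

```python
from graphium.data.utils import download_graphium_dataset, list_graphium_datasets

print(list_graphium_datasets())  # names available for download

# Download one of the listed datasets into a local directory.
path = download_graphium_dataset(
    name="graphium-zinc-micro",  # placeholder; pick a name from the listing above
    output_path="./datasets",
    extract_zip=True,
    progress=True,
)
print(path)
```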
found_size_mismatch(task, features, labels, smiles)
¶
Check if a size mismatch exists between features and labels with respect to node/edge/nodepair.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `task` | `str` | The task name is needed to determine the task level (graph, node, edge or nodepair) | required |
| `features` | `Union[Data, GraphDict]` | Features/information of molecule/graph (e.g., edge_index, feat, edge_feat, num_nodes, etc.) | required |
| `labels` | `ndarray` | Target label of molecule for the task | required |
| `smiles` | `str` | SMILES string of molecule | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `mismatch` | `bool` | Boolean variable indicating if a size mismatch was found between features and labels. |
graphium_package_path(graphium_path)
¶
Return the path of a graphium file in the package.
list_graphium_datasets()
¶
List Graphium datasets available to download. Returns: set: A set of Graphium dataset names.
load_micro_zinc()
¶
Return a dataframe of micro ZINC (1000 data points). Returns: pd.DataFrame: A dataframe of micro ZINC.
load_tiny_zinc()
¶
Return a dataframe of tiny ZINC (100 data points). Returns: pd.DataFrame: A dataframe of tiny ZINC.
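Both helpers return a ready-to-use `pandas.DataFrame`; a minimal sketch:

```python
from graphium.data.utils import load_micro_zinc, load_tiny_zinc

df_micro = load_micro_zinc()  # 1000 data points
df_tiny = load_tiny_zinc()    # 100 data points
print(df_micro.shape, df_tiny.shape)
```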