## ToyMix Baseline - Test set metrics
From the paper to be released soon. Below, you can see the baselines for the ToyMix dataset, a multitasking dataset comprising QM9, Zinc12k, and Tox21. The datasets and their splits are available at this link. The following baselines are all for models with ~150k parameters.
One can observe that the smaller datasets (Zinc12k and Tox21) benefit from adding another unrelated task (QM9), where the labels are computed from DFT simulations.
NEW baselines added 2023/09/18: Multitask baselines have been added for GatedGCN and MPNN++ (sum aggregator) using 3 random seeds. They achieve the best performance by a significant margin on Zinc12k and Tox21, while sacrificing a little on QM9.
| Dataset | Model | MAE ↓ (single-task) | Pearson ↑ (single-task) | R² ↑ (single-task) | MAE ↓ (multi-task) | Pearson ↑ (multi-task) | R² ↑ (multi-task) |
|---|---|---|---|---|---|---|---|
| QM9 | GCN | 0.102 ± 0.0003 | 0.958 ± 0.0007 | 0.920 ± 0.002 | 0.119 ± 0.01 | 0.955 ± 0.001 | 0.915 ± 0.001 |
| QM9 | GIN | 0.0976 ± 0.0006 | 0.959 ± 0.0002 | 0.922 ± 0.0004 | 0.117 ± 0.01 | 0.950 ± 0.002 | 0.908 ± 0.003 |
| QM9 | GINE | 0.0959 ± 0.0002 | 0.955 ± 0.002 | 0.918 ± 0.004 | 0.102 ± 0.01 | 0.956 ± 0.0009 | 0.918 ± 0.002 |
| QM9 | GatedGCN | n/a | n/a | n/a | 0.1212 ± 0.0009 | 0.9457 ± 0.0002 | 0.8964 ± 0.0006 |
| QM9 | MPNN++ (sum) | n/a | n/a | n/a | 0.1174 ± 0.0012 | 0.9460 ± 0.0005 | 0.8989 ± 0.0008 |
| Zinc12k | GCN | 0.348 ± 0.02 | 0.941 ± 0.002 | 0.863 ± 0.01 | 0.226 ± 0.004 | 0.973 ± 0.0005 | 0.940 ± 0.003 |
| Zinc12k | GIN | 0.303 ± 0.007 | 0.950 ± 0.003 | 0.889 ± 0.003 | 0.189 ± 0.004 | 0.978 ± 0.006 | 0.953 ± 0.002 |
| Zinc12k | GINE | 0.266 ± 0.02 | 0.961 ± 0.003 | 0.915 ± 0.01 | 0.147 ± 0.009 | 0.987 ± 0.001 | 0.971 ± 0.003 |
| Zinc12k | GatedGCN | n/a | n/a | n/a | 0.1282 ± 0.0045 | 0.9850 ± 0.0006 | 0.9639 ± 0.0024 |
| Zinc12k | MPNN++ (sum) | n/a | n/a | n/a | 0.1002 ± 0.0025 | 0.9909 ± 0.0004 | 0.9777 ± 0.0014 |
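For reference, the regression metrics reported above (MAE, Pearson, R²) can be computed as in the following sketch. This uses scikit-learn and SciPy for illustration and is not the benchmark's own evaluation code; the benchmark targets are also multi-dimensional, so per-label averaging may differ.

```python
# Minimal sketch (assumption: single target column) of the regression metrics.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_absolute_error, r2_score

def regression_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    return {
        "MAE": mean_absolute_error(y_true, y_pred),  # lower is better
        "Pearson": pearsonr(y_true, y_pred)[0],      # higher is better
        "R2": r2_score(y_true, y_pred),              # higher is better
    }

# Demo with dummy predictions:
rng = np.random.default_rng(0)
y_true = rng.normal(size=1000)
y_pred = y_true + 0.1 * rng.normal(size=1000)
print(regression_metrics(y_true, y_pred))
```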
| Dataset | Model | BCE ↓ (single-task) | AUROC ↑ (single-task) | AP ↑ (single-task) | BCE ↓ (multi-task) | AUROC ↑ (multi-task) | AP ↑ (multi-task) |
|---|---|---|---|---|---|---|---|
| Tox21 | GCN | 0.202 ± 0.005 | 0.773 ± 0.006 | 0.334 ± 0.03 | 0.176 ± 0.001 | 0.850 ± 0.006 | 0.446 ± 0.01 |
| Tox21 | GIN | 0.200 ± 0.002 | 0.789 ± 0.009 | 0.350 ± 0.01 | 0.176 ± 0.001 | 0.841 ± 0.005 | 0.454 ± 0.009 |
| Tox21 | GINE | 0.201 ± 0.007 | 0.783 ± 0.007 | 0.345 ± 0.02 | 0.177 ± 0.0008 | 0.836 ± 0.004 | 0.455 ± 0.008 |
| Tox21 | GatedGCN | n/a | n/a | n/a | 0.1733 ± 0.0015 | 0.8522 ± 0.0022 | 0.4620 ± 0.0118 |
| Tox21 | MPNN++ (sum) | n/a | n/a | n/a | 0.1725 ± 0.0012 | 0.8569 ± 0.0005 | 0.4598 ± 0.0044 |
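The classification metrics (BCE, AUROC, AP) can be sketched similarly. The sketch below assumes missing labels are encoded as NaN, which is common for sparse assay data like Tox21, and the 12-task shape in the demo is illustrative; the benchmark's exact masking and averaging may differ.

```python
# Minimal sketch of per-task BCE / AUROC / AP, averaged over tasks,
# ignoring NaN (missing) labels.
import numpy as np
from sklearn.metrics import average_precision_score, log_loss, roc_auc_score

def classification_metrics(y_true: np.ndarray, y_prob: np.ndarray) -> dict:
    bce, auroc, ap = [], [], []
    for t in range(y_true.shape[1]):
        mask = ~np.isnan(y_true[:, t])                # keep only labelled molecules
        yt, yp = y_true[mask, t], y_prob[mask, t]
        bce.append(log_loss(yt, yp, labels=[0, 1]))   # binary cross-entropy
        auroc.append(roc_auc_score(yt, yp))
        ap.append(average_precision_score(yt, yp))
    return {"BCE": np.mean(bce), "AUROC": np.mean(auroc), "AP": np.mean(ap)}

# Demo with dummy labels/probabilities (12 tasks, 20% missing labels):
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(500, 12)).astype(float)
y_true[rng.random(y_true.shape) < 0.2] = np.nan
print(classification_metrics(y_true, rng.random(y_true.shape)))
```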
## LargeMix Baseline

### LargeMix test set metrics
From the paper to be released soon. Below, you can see the baselines for the LargeMix dataset, a multitasking dataset comprising PCQM4M_N4, PCQM4M_G25, PCBA_1328, L1000_VCAP, and L1000_MCF7. The datasets and their splits are available at this link. The following baselines are all for models with 4-6M parameters.
One can observe that the smaller datasets (L1000_VCAP and L1000_MCF7) benefit tremendously from the multitasking. Indeed, the lack of molecular samples means that it is very easy for a model to overfit.
While PCQM4M_G25 shows no noticeable changes, the node predictions of PCQM4M_N4 and assay predictions of PCBA_1328 take a hit, most likely due to underfitting, since the training loss also increases. It seems that 4-6M parameters are far from sufficient to capture all of the tasks simultaneously, which motivates the need for a larger model.
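To make the multi-task setup concrete, here is a minimal PyTorch sketch of the general pattern: a shared encoder feeding one prediction head per dataset, with the per-task losses summed. The MLP encoder, the `TASKS` dict, and the label counts (inferred from the dataset names) are illustrative stand-ins, not Graphium's actual architecture or API; the real models are GNNs, and the real objective mixes MSE/BCE losses and masks missing labels.

```python
# Hypothetical sketch of a shared-trunk multi-task model.
import torch
import torch.nn as nn

# Stand-in task registry: name -> number of labels (inferred from dataset names).
TASKS = {"pcqm4m_n4": 4, "pcqm4m_g25": 25, "pcba_1328": 1328}

class MultiTaskModel(nn.Module):
    def __init__(self, in_dim: int = 64, hidden: int = 128):
        super().__init__()
        # Shared trunk (a GNN in the actual benchmark; an MLP here for brevity).
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        # One lightweight prediction head per task.
        self.heads = nn.ModuleDict(
            {name: nn.Linear(hidden, n_labels) for name, n_labels in TASKS.items()}
        )

    def forward(self, x: torch.Tensor) -> dict:
        h = self.encoder(x)
        return {name: head(h) for name, head in self.heads.items()}

model = MultiTaskModel()
x = torch.randn(8, 64)  # dummy batch of 8 "molecules"
preds = model(x)
# Sum of per-task losses (MSE only here; the real benchmark also uses BCE
# for classification tasks and masks missing labels per task).
targets = {name: torch.randn(8, n) for name, n in TASKS.items()}
loss = sum(nn.functional.mse_loss(preds[name], targets[name]) for name in TASKS)
loss.backward()
```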
| Dataset | Model | MAE ↓ (single-task) | Pearson ↑ (single-task) | R² ↑ (single-task) | MAE ↓ (multi-task) | Pearson ↑ (multi-task) | R² ↑ (multi-task) |
|---|---|---|---|---|---|---|---|
| PCQM4M_G25 | GCN | 0.2362 ± 0.0003 | 0.8781 ± 0.0005 | 0.7803 ± 0.0006 | 0.2458 ± 0.0007 | 0.8701 ± 0.0002 | 0.8189 ± 0.0004 |
| PCQM4M_G25 | GIN | 0.2270 ± 0.0003 | 0.8854 ± 0.0004 | 0.7912 ± 0.0006 | 0.2352 ± 0.0006 | 0.8802 ± 0.0007 | 0.7827 ± 0.0005 |
| PCQM4M_G25 | GINE | 0.2223 ± 0.0007 | 0.8874 ± 0.0003 | 0.7949 ± 0.0001 | 0.2315 ± 0.0002 | 0.8823 ± 0.0002 | 0.7864 ± 0.0008 |
| PCQM4M_N4 | GCN | 0.2080 ± 0.0003 | 0.5497 ± 0.0010 | 0.2942 ± 0.0007 | 0.2040 ± 0.0001 | 0.4796 ± 0.0006 | 0.2185 ± 0.0002 |
| PCQM4M_N4 | GIN | 0.1912 ± 0.0027 | 0.6138 ± 0.0088 | 0.3688 ± 0.0116 | 0.1966 ± 0.0003 | 0.5198 ± 0.0008 | 0.2602 ± 0.0012 |
| PCQM4M_N4 | GINE | 0.1910 ± 0.0001 | 0.6127 ± 0.0003 | 0.3666 ± 0.0008 | 0.1941 ± 0.0003 | 0.5303 ± 0.0023 | 0.2701 ± 0.0034 |
| Dataset | Model | BCE ↓ (single-task) | AUROC ↑ (single-task) | AP ↑ (single-task) | BCE ↓ (multi-task) | AUROC ↑ (multi-task) | AP ↑ (multi-task) |
|---|---|---|---|---|---|---|---|
| PCBA_1328 | GCN | 0.0316 ± 0.0000 | 0.7960 ± 0.0020 | 0.3368 ± 0.0027 | 0.0349 ± 0.0002 | 0.7661 ± 0.0031 | 0.2527 ± 0.0041 |
| PCBA_1328 | GIN | 0.0324 ± 0.0000 | 0.7941 ± 0.0018 | 0.3328 ± 0.0019 | 0.0342 ± 0.0001 | 0.7747 ± 0.0025 | 0.2650 ± 0.0020 |
| PCBA_1328 | GINE | 0.0320 ± 0.0001 | 0.7944 ± 0.0023 | 0.3337 ± 0.0027 | 0.0341 ± 0.0001 | 0.7737 ± 0.0007 | 0.2611 ± 0.0043 |
| L1000_VCAP | GCN | 0.1900 ± 0.0002 | 0.5788 ± 0.0034 | 0.3708 ± 0.0007 | 0.1872 ± 0.0020 | 0.6362 ± 0.0012 | 0.4022 ± 0.0008 |
| L1000_VCAP | GIN | 0.1909 ± 0.0005 | 0.5734 ± 0.0029 | 0.3731 ± 0.0014 | 0.1870 ± 0.0010 | 0.6351 ± 0.0014 | 0.4062 ± 0.0001 |
| L1000_VCAP | GINE | 0.1907 ± 0.0006 | 0.5708 ± 0.0079 | 0.3705 ± 0.0015 | 0.1862 ± 0.0007 | 0.6398 ± 0.0043 | 0.4068 ± 0.0023 |
| L1000_MCF7 | GCN | 0.1869 ± 0.0003 | 0.6123 ± 0.0051 | 0.3866 ± 0.0010 | 0.1863 ± 0.0011 | 0.6401 ± 0.0021 | 0.4194 ± 0.0004 |
| L1000_MCF7 | GIN | 0.1862 ± 0.0003 | 0.6202 ± 0.0091 | 0.3876 ± 0.0017 | 0.1874 ± 0.0013 | 0.6367 ± 0.0066 | 0.4198 ± 0.0036 |
| L1000_MCF7 | GINE | 0.1856 ± 0.0005 | 0.6166 ± 0.0017 | 0.3892 ± 0.0035 | 0.1873 ± 0.0009 | 0.6347 ± 0.0048 | 0.4177 ± 0.0024 |
### LargeMix training set loss
Below is the loss on the training set. One can observe that the multi-task model always underfits relative to the single-task model, except on the two L1000 datasets. This is not surprising: the remaining datasets contain two orders of magnitude more datapoints and pose a significant challenge for the relatively small models used in this analysis. This favors the single-task setup (which uses a model of the same size), and we conjecture that larger models will bridge this gap moving forward.
| Dataset | Model | CE or MSE loss ↓ (single-task) | CE or MSE loss ↓ (multi-task) |
|---|---|---|---|
| PCQM4M_G25 | GCN | 0.2660 ± 0.0005 | 0.2767 ± 0.0015 |
| PCQM4M_G25 | GIN | 0.2439 ± 0.0004 | 0.2595 ± 0.0016 |
| PCQM4M_G25 | GINE | 0.2424 ± 0.0007 | 0.2568 ± 0.0012 |
| PCQM4M_N4 | GCN | 0.2515 ± 0.0002 | 0.2613 ± 0.0008 |
| PCQM4M_N4 | GIN | 0.2317 ± 0.0003 | 0.2512 ± 0.0008 |
| PCQM4M_N4 | GINE | 0.2272 ± 0.0001 | 0.2483 ± 0.0004 |
| PCBA_1328 | GCN | 0.0284 ± 0.0010 | 0.0382 ± 0.0005 |
| PCBA_1328 | GIN | 0.0249 ± 0.0017 | 0.0359 ± 0.0011 |
| PCBA_1328 | GINE | 0.0258 ± 0.0017 | 0.0361 ± 0.0008 |
| L1000_VCAP | GCN | 0.1906 ± 0.0036 | 0.1854 ± 0.0148 |
| L1000_VCAP | GIN | 0.1854 ± 0.0030 | 0.1833 ± 0.0185 |
| L1000_VCAP | GINE | 0.1860 ± 0.0025 | 0.1887 ± 0.0200 |
| L1000_MCF7 | GCN | 0.1902 ± 0.0038 | 0.1829 ± 0.0095 |
| L1000_MCF7 | GIN | 0.1873 ± 0.0033 | 0.1701 ± 0.0142 |
| L1000_MCF7 | GINE | 0.1883 ± 0.0039 | 0.1771 ± 0.0010 |
### NEW: LargeMix improved sweep - 2023/08/18
Unsatisfied with the prior results, we ran a Bayesian search over a broader set of parameters, including only the more expressive models, namely GINE, GatedGCN, and MPNN++. We further increased the number of parameters to 10M due to evidence of underfitting. We evaluate only the multi-task setting.
We observe a significant improvement across all tasks, with a very notable R² increase of +0.53 (0.27 → 0.80) over the best previous node-level property prediction on PCQM4M_N4.
The results below are reported over a single seed; we are currently running more seeds of the same models.
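The write-up does not name the search tool; as an illustration, a comparable Bayesian sweep could be run with Optuna's default TPE sampler, as sketched below. The search space and the `train_and_eval` helper are hypothetical placeholders, not the sweep actually used.

```python
# Hypothetical Bayesian hyperparameter sweep with Optuna (TPE sampler by default).
import optuna

def train_and_eval(cfg: dict) -> float:
    # Placeholder for the real training/validation loop: returns a fake
    # validation loss so the sketch runs end to end.
    return cfg["lr"] * cfg["depth"] * (1.0 - cfg["dropout"])

def objective(trial: optuna.Trial) -> float:
    cfg = {
        "model": trial.suggest_categorical("model", ["GINE", "GatedGCN", "MPNN++"]),
        "lr": trial.suggest_float("lr", 1e-5, 1e-3, log=True),
        "depth": trial.suggest_int("depth", 4, 16),
        "dropout": trial.suggest_float("dropout", 0.0, 0.3),
    }
    return train_and_eval(cfg)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```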
| Dataset | Model | MAE ↓ | Pearson ↑ | R² ↑ |
|---|---|---|---|---|
| PCQM4M_G25 | GINE | 0.2250 | 0.8840 | 0.7911 |
| PCQM4M_G25 | GatedGCN | 0.2457 | 0.8698 | 0.7688 |
| PCQM4M_G25 | MPNN++ (sum) | 0.2269 | 0.8802 | 0.7855 |
| PCQM4M_N4 | GINE | 0.2699 | 0.8475 | 0.7182 |
| PCQM4M_N4 | GatedGCN | 0.3337 | 0.8102 | 0.6566 |
| PCQM4M_N4 | MPNN++ (sum) | 0.2114 | 0.8942 | 0.8000 |
| Dataset | Model | BCE ↓ | AUROC ↑ | AP ↑ |
|---|---|---|---|---|
| PCBA_1328 | GINE | 0.0334 | 0.7879 | 0.2808 |
| PCBA_1328 | GatedGCN | 0.0351 | 0.7788 | 0.2611 |
| PCBA_1328 | MPNN++ (sum) | 0.0344 | 0.7815 | 0.2666 |
| L1000_VCAP | GINE | 0.1907 | 0.6416 | 0.4042 |
| L1000_VCAP | GatedGCN | 0.1866 | 0.6395 | 0.4092 |
| L1000_VCAP | MPNN++ (sum) | 0.1867 | 0.6478 | 0.4131 |
| L1000_MCF7 | GINE | 0.1931 | 0.6352 | 0.4235 |
| L1000_MCF7 | GatedGCN | 0.1859 | 0.6547 | 0.4224 |
| L1000_MCF7 | MPNN++ (sum) | 0.1870 | 0.6593 | 0.4254 |
## UltraLarge Baseline

### UltraLarge test set metrics
For UltraLarge, we provide results for the same GNN baselines as for LargeMix. Each model is trained for 50 epochs and results are averaged over 3 seeds. The remaining setup is the same as for ToyMix (Section E.1), reporting the same performance metrics for the single-task and multi-task settings. We further use the same models (in terms of size) as for LargeMix.

For now, we report only the results for a subset representing 5% of the total dataset due to computational constraints, but aim to provide the full results soon.
**Results discussion.** UltraLarge results can be found in the table below. Interestingly, on both graph- and node-level tasks we observe no advantage of multi-tasking in terms of performance. We expect that, for this ultra-large dataset, significantly larger models are needed to successfully leverage the multi-task setup. This could be attributed to underfitting, as already demonstrated for LargeMix. Nonetheless, our baselines set the stage for large-scale pre-training on UltraLarge.

The results presented used approximately 500 GPU hours of compute, with more compute used for development and hyperparameter search.

We further note that the graph-level task results are very strong. The node-level tasks, by contrast, are expected to underperform in the low-parameter regime, due to clear signs of underfitting, the very large number of labels to learn, and the susceptibility of traditional GNNs to over-smoothing.
| Dataset | Model | MAE ↓ (single-task) | Pearson ↑ (single-task) | R² ↑ (single-task) | MAE ↓ (multi-task) | Pearson ↑ (multi-task) | R² ↑ (multi-task) |
|---|---|---|---|---|---|---|---|
| PM6_83M_G62 | GCN | 0.2606 ± 0.0011 | 0.9004 ± 0.0003 | 0.7997 ± 0.0009 | 0.2625 ± 0.0011 | 0.8896 ± 0.0001 | 0.7982 ± 0.0001 |
| PM6_83M_G62 | GIN | 0.2546 ± 0.0021 | 0.9051 ± 0.0019 | 0.8064 ± 0.0037 | 0.2562 ± 0.0000 | 0.8901 ± 0.0000 | 0.8060 ± 0.0000 |
| PM6_83M_G62 | GINE | 0.2538 ± 0.0006 | 0.9059 ± 0.0010 | 0.8082 ± 0.0015 | 0.2580 ± 0.0011 | 0.9040 ± 0.0000 | 0.8048 ± 0.0001 |
| PM6_83M_N7 | GCN | 0.5803 ± 0.0001 | 0.3372 ± 0.0004 | 0.1191 ± 0.0002 | 0.5971 ± 0.0002 | 0.3164 ± 0.0001 | 0.1019 ± 0.0011 |
| PM6_83M_N7 | GIN | 0.5730 ± 0.0002 | 0.3478 ± 0.0001 | 0.1269 ± 0.0002 | 0.5831 ± 0.0001 | 0.3315 ± 0.0005 | 0.1141 ± 0.0000 |
| PM6_83M_N7 | GINE | 0.5720 ± 0.0004 | 0.3487 ± 0.0002 | 0.1266 ± 0.0001 | 0.5839 ± 0.0004 | 0.3294 ± 0.0002 | 0.1104 ± 0.0000 |
### UltraLarge training set loss
In the table below, we observe that the multi-task model only slightly underfits the single-task model, indicating that parameters can be shared efficiently between the node-level and graph-level tasks. We further note that the training loss and the test MAE are almost equal for all tasks, indicating further benefits as we scale both the model and the data.
| Dataset | Model | MAE loss ↓ (single-task) | MAE loss ↓ (multi-task) |
|---|---|---|---|
| PM6_83M_G62 | GCN | 0.2679 ± 0.0020 | 0.2713 ± 0.0017 |
| PM6_83M_G62 | GIN | 0.2582 ± 0.0018 | 0.2636 ± 0.0014 |
| PM6_83M_G62 | GINE | 0.2567 ± 0.0036 | 0.2603 ± 0.0021 |
| PM6_83M_N7 | GCN | 0.5818 ± 0.0021 | 0.5955 ± 0.0023 |
| PM6_83M_N7 | GIN | 0.5707 ± 0.0019 | 0.5851 ± 0.0038 |
| PM6_83M_N7 | GINE | 0.5724 ± 0.0015 | 0.5832 ± 0.0027 |