Data and methods

This resource combines sequence-based variant effect prediction with structure-based annotation to support mechanistic interpretation of human missense variants.

The central idea is simple. A pathogenicity score may suggest that a variant disrupts protein function. Structural context helps indicate how that disruption may occur.

The data shown here therefore combine a predicted pathogenicity score with predicted effects on stability, protein-protein interfaces, and small-molecule binding pockets.

What is shown for each variant

Each variant record is built from a small set of model outputs and derived labels.

  • AlphaMissense provides a pathogenicity probability and class.
  • ESM1b provides a sequence-model score and pathogenicity label.
  • FoldX provides a predicted change in protein stability as ΔΔG.
  • Interface annotations indicate whether the residue lies in a predicted protein-protein interface.
  • Pocket annotations indicate whether the residue lies in a predicted small-molecule binding pocket.
  • Mechanism labels summarise the most likely structural mode of disruption when supported by the available annotations.

The goal is not to collapse everything into one score. It is to present a structured view of variant effect.
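As a rough illustration, the fields listed above could be held in a record like the one below. This is a sketch only: the class name, every field name, and the example values are hypothetical and are not taken from the resource's actual data format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VariantRecord:
    """Hypothetical container for the per-variant fields (names are assumptions)."""
    alphamissense_score: float          # probability between 0 and 1
    alphamissense_class: str            # discrete AlphaMissense class
    esm1b_score: float                  # lower values suggest a deleterious effect
    esm1b_label: str                    # ESM1b pathogenicity label
    foldx_ddg: Optional[float] = None   # None where no stability prediction exists
    in_interface: bool = False          # residue in a predicted protein-protein interface
    in_pocket: bool = False             # residue in a predicted small-molecule pocket
    mechanism: Optional[str] = None     # summary mechanism label, if one was assigned


# Made-up values, for illustration only.
record = VariantRecord(
    alphamissense_score=0.91,
    alphamissense_class="pathogenic",
    esm1b_score=-11.2,
    esm1b_label="pathogenic",
    foldx_ddg=3.4,
)
```

Keeping the fields separate, rather than merging them into one number, mirrors the intent stated above.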

Sequence-based pathogenicity prediction

AlphaMissense

AlphaMissense is used as the main pathogenicity predictor. It returns a probability score between 0 and 1 together with a discrete class. In the underlying work, AlphaMissense predictions below 0.34 were treated as benign, scores above 0.564 as pathogenic, and intermediate values as ambiguous.
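The cut-offs quoted above can be written as a small helper. This is a minimal sketch: the function name is hypothetical, and treating scores exactly equal to 0.34 or 0.564 as ambiguous is an assumption that follows the "below" and "above" wording literally.

```python
def alphamissense_class(score: float) -> str:
    """Map an AlphaMissense probability to a class using the quoted cut-offs."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("AlphaMissense scores are probabilities in [0, 1]")
    if score < 0.34:
        return "benign"
    if score > 0.564:
        return "pathogenic"
    # Assumption: boundary values fall into the intermediate class.
    return "ambiguous"
```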

AlphaMissense is a functional effect predictor. It estimates whether a missense variant is likely to disrupt protein function. It does not assign a biochemical mechanism on its own.

ESM1b

ESM1b provides an additional sequence-based signal. In this resource it is shown as a likelihood-related score together with a pathogenicity label. Lower ESM1b values indicate stronger support for a deleterious effect.

ESM1b is included as a complementary sequence model, not as a replacement for structural annotation.

Structural models and residue annotation

AlphaFold2 protein structures

Structure-based annotations are derived from AlphaFold2 models of human proteins. These models provide the residue-level structural context used for stability prediction and pocket detection.

Low-confidence regions were filtered before stability analysis. In the underlying study, regions with mean pLDDT below 50 across a 10-residue window were removed before FoldX calculations. Stability predictions were then generated for positions in regions with adequate AlphaFold confidence.
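One way to picture the confidence filter is a sliding-window pass over per-residue pLDDT values. The sketch below is an assumption about how such a filter might work, not the study's implementation: it uses overlapping windows, flags every residue that belongs to at least one window with mean pLDDT below the cut-off, and leaves chains shorter than the window untouched. The function name is hypothetical.

```python
def low_confidence_mask(plddt, window=10, cutoff=50.0):
    """Return a list of booleans, True where a residue lies in a low-confidence window."""
    n = len(plddt)
    mask = [False] * n
    if n < window:
        return mask  # assumption: chains shorter than one window are not filtered

    window_sum = sum(plddt[:window])
    for start in range(n - window + 1):
        if start > 0:
            # Slide the window by one residue: add the new value, drop the old one.
            window_sum += plddt[start + window - 1] - plddt[start - 1]
        if window_sum / window < cutoff:
            for i in range(start, start + window):
                mask[i] = True
    return mask


# Example: a confident domain, a disordered stretch, then another confident domain.
example_mask = low_confidence_mask([90] * 10 + [30] * 10 + [90] * 10)
```

Because the windows overlap, a few confident residues flanking a disordered stretch are flagged as well. Whether that is desirable depends on how conservative the filter should be.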

AlphaFold2-Multimer protein complexes

Protein-protein interface annotations are derived from large-scale AlphaFold2-Multimer modelling of binary human protein interactions. The underlying study modelled 486,099 human interactions, then ranked predicted complexes by pDockQ confidence.

This means that interface labels depend on predicted complex structures rather than direct experimental observation.

Stability prediction

FoldX ΔΔG

Predicted effects on protein stability are based on FoldX calculations applied to AlphaFold2 structures. FoldX estimates the change in Gibbs free energy, ΔΔG, for a missense substitution.

In this resource, positive ΔΔG values indicate predicted destabilisation, and larger values indicate a stronger predicted effect. Values near zero suggest limited predicted stability impact. In the underlying analysis, FoldX scores above 2 kcal/mol were treated as destabilising, scores below -2 kcal/mol as stabilising, and intermediate values as neutral.
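The three-way reading of ΔΔG described above can be sketched as follows. The function name is hypothetical, and treating values exactly equal to 2 or -2 as neutral is an assumption.

```python
def foldx_class(ddg: float) -> str:
    """Classify a FoldX ΔΔG value using the thresholds quoted above."""
    if ddg > 2.0:
        return "destabilising"
    if ddg < -2.0:
        return "stabilising"
    # Assumption: values on the boundary count as neutral.
    return "neutral"
```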

A stability prediction should be read as a structural hypothesis. It estimates whether the substitution is likely to affect fold stability, not whether the variant is clinically pathogenic.

Pocket annotation

Pocket detection

Small-molecule binding pockets were identified from AlphaFold2 protein structures using AutoSite. Raw pocket predictions were then ranked using a modified pocket score that incorporated AlphaFold confidence at the residue level.

This mattered because AlphaFold2 can produce spurious pockets in uncertain regions. Including pLDDT in the scoring scheme improved recovery of known binding sites and helped prioritise higher-confidence pockets.

Pocket label

A residue is labelled as pocket-associated when it lies within a set distance of a predicted pocket. In the underlying work, that distance was 4.5 Å.
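A distance-based pocket label could be computed along the following lines. This is a sketch under stated assumptions: the pocket is represented as a list of 3D points, each residue as a list of atom coordinates, and a residue counts as pocket-associated if any of its atoms lies within the cut-off of any pocket point. The source does not specify how the distance is measured, and the function and argument names are hypothetical.

```python
from math import dist


def pocket_residues(residue_atoms, pocket_points, cutoff=4.5):
    """Return sorted residue numbers with any atom within `cutoff` Å of a pocket point.

    residue_atoms: dict mapping residue number -> list of (x, y, z) atom coordinates
    pocket_points: list of (x, y, z) points describing the predicted pocket
    """
    hits = set()
    for resnum, atoms in residue_atoms.items():
        if any(dist(atom, point) <= cutoff for atom in atoms for point in pocket_points):
            hits.add(resnum)
    return sorted(hits)


# Toy coordinates, for illustration only.
toy_atoms = {
    10: [(0.0, 0.0, 0.0)],
    11: [(10.0, 0.0, 0.0)],
    12: [(3.0, 4.0, 0.0), (20.0, 0.0, 0.0)],
}
toy_pocket = [(0.0, 0.0, 4.0), (3.0, 4.0, 2.0)]
toy_hits = pocket_residues(toy_atoms, toy_pocket)
```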

Pocket annotations therefore indicate structural proximity to a predicted ligand-binding region. They do not prove direct ligand contact in vivo.

Interface annotation

Predicted interfaces

Protein-protein interfaces are inferred from AlphaFold2-Multimer complex models. Predicted models were ranked by pDockQ, which provides a confidence estimate for the overall complex. Interface residues were then identified from the predicted structures.

In the underlying study, interface enrichment analyses used residues within 5 Å across chains, and confidence improved when low-pLDDT residues were excluded.
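The cross-chain distance criterion can be sketched in a similar way. The data layout, the function name, and the optional `min_plddt` filter are assumptions made for illustration. The source states that excluding low-pLDDT residues helped but gives no threshold here, so none is set by default.

```python
from math import dist


def interface_residues(chain_a, chain_b, cutoff=5.0, min_plddt=None):
    """Return (residues_a, residues_b) with any atom pair within `cutoff` Å across chains.

    chain_a, chain_b: dict mapping residue number -> (plddt, list of (x, y, z) atoms)
    min_plddt: optional filter; residues below this value are ignored (hypothetical)
    """
    def confident(entry):
        return min_plddt is None or entry[0] >= min_plddt

    iface_a, iface_b = set(), set()
    for res_a, entry_a in chain_a.items():
        if not confident(entry_a):
            continue
        for res_b, entry_b in chain_b.items():
            if not confident(entry_b):
                continue
            if any(dist(x, y) <= cutoff for x in entry_a[1] for y in entry_b[1]):
                iface_a.add(res_a)
                iface_b.add(res_b)
    return iface_a, iface_b


# Toy two-chain complex, for illustration only.
toy_a = {1: (90.0, [(0.0, 0.0, 0.0)]), 2: (40.0, [(0.0, 0.0, 3.0)])}
toy_b = {5: (85.0, [(0.0, 0.0, 4.0)]), 6: (85.0, [(0.0, 0.0, 30.0)])}
all_contacts = interface_residues(toy_a, toy_b)
confident_contacts = interface_residues(toy_a, toy_b, min_plddt=70.0)
```

The all-against-all loop is fine for a toy case. For full-size complexes a spatial index would be the more practical choice.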

Interface label

A residue is labelled as interface-associated when it lies in a predicted interaction surface in the selected model.

These labels are useful, but they require caution. The authors note that false positives are non-negligible, especially for low-confidence models or weakly supported interaction pairs. Interface annotations should therefore be interpreted as predicted structural context, not as definitive interaction evidence.

Mechanism labels

The mechanism label is a summary layer built from the structural annotations.

In the underlying work, likely pathogenic variants were assigned a putative mechanism using the available structural evidence. The principal mechanism categories were:

  • Stability
  • Pockets
  • Interface
  • Unassigned

A mechanism label is only as strong as the evidence available for that residue. Many variants remain unassigned because no sufficiently supported mechanism could be inferred from current structural coverage.

The absence of a mechanism label does not imply the absence of a functional effect. It means only that no specific structural mechanism could be assigned with the available annotations.
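The summary layer can be pictured as a simple combination of the per-residue annotations. The sketch below is illustrative only and is not the study's assignment procedure. It assumes that labels are given only to variants classed as pathogenic, that the destabilising threshold of 2 is reused as the stability criterion, and that more than one category may apply with no precedence between them. The function and argument names are hypothetical.

```python
def mechanism_labels(am_class, ddg=None, in_pocket=False, in_interface=False):
    """Return the structural mechanism categories supported for one variant."""
    if am_class != "pathogenic":
        return []  # assumption: only likely pathogenic variants receive a label

    labels = []
    if ddg is not None and ddg > 2.0:
        labels.append("Stability")
    if in_pocket:
        labels.append("Pockets")
    if in_interface:
        labels.append("Interface")
    # No supporting structural evidence: the variant stays unassigned.
    return labels or ["Unassigned"]
```

For example, `mechanism_labels("pathogenic", ddg=0.3)` yields `["Unassigned"]`: the variant is still predicted to be damaging, but none of the structural annotations explains how.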

Scope and coverage

The underlying study applied structure-based analysis at proteome scale.

It reported:

  • stability predictions for over 200 million missense variants in regions with sufficient AlphaFold confidence
  • 547,401 predicted pockets, of which 109,599 were retained as high-confidence pockets
  • 486,099 predicted human protein-protein interaction models
  • putative interface annotations for 10,583 human proteins at higher confidence thresholds

This creates broad coverage, but not complete coverage. Interface predictions cover only part of the human interactome, and some proteins or residues will have limited structural support.

How the labels should be interpreted

These fields are intended to be read together.

A variant may have a high pathogenicity score but no clear structural mechanism. A variant may be predicted as destabilising but not lie in a pocket or interface. A residue may lie in a pocket or interface without strong sequence-based evidence for pathogenicity.

The resource is therefore designed to support interpretation through combination, not through any single field alone.

Limitations

This resource is based largely on computational prediction.

AlphaMissense and ESM1b are predictive models. FoldX provides estimated stability effects, not direct measurements. Pocket and interface annotations rely on predicted structures. AlphaFold confidence varies across residues, and protein-protein complex predictions can include false positives.

The original study also notes that even when interface models are imperfect, the approximate interacting region within a protein may still be informative. That is useful in practice, but it is not equivalent to experimental confirmation.

For these reasons, the outputs should be treated as structured evidence for interpretation and prioritisation. They do not replace validation, biochemical follow-up, or expert review.

Source and provenance

This page reflects the methods and annotations described in:

Jänes et al. Predicted mechanistic impacts of human protein missense variants. bioRxiv (2024).

https://doi.org/10.1101/2024.05.29.596373

Related public access is available through ProtVar at EMBL-EBI:

https://www.ebi.ac.uk/ProtVar/

The underlying project also provides bulk data and code through the ProtVar FTP and the af2genomics repository described in the paper.