Part I. Sub-Workshop: Statistical models/machine learning as a complement to directed evolution in the search for new enzymes

Stefan Born, (Mark Doerr)

Start: Tue, 28.3.2023, 9:00h, Inst. f. Biochemistry, Seminar Room D213, 2nd floor.

Workshop details:

Input format for enzymes, substrates and measured properties

It would be desirable to use a standard format for an ‘enzyme search/design problem’ so that a defined pipeline can be run for any such problem. Defining such a data format is no trivial task. A further step would be to integrate these tools into the lab automation software (LARA) as part of a general workflow that plans new experiments from previous experiments and external data. For the moment we settle for a provisional solution and read enzymes, substrates and tabular data from some standard formats, which involves some manual intervention.

Representations of amino acid sequences

A module provides a variety of representations:

  1. 1-hot sequence encoding
  2. sequence encoding by physico-chemical AA properties
  3. sequence encoding using models trained on other tasks (transfer learning)
  4. vector representation of sequences of different lengths using models trained on other tasks (transfer learning)
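
As an illustration of the first two representations, here is a minimal sketch in plain Python. The function names, the alphabet order and the choice of the Kyte-Doolittle hydropathy scale as the example property are assumptions made for this sketch, not part of the workshop module.

```python
# Sketch only: alphabet order and the property table are illustrative choices.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Kyte-Doolittle hydropathy values, used here as one example property.
HYDROPATHY = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8,
    "G": -0.4, "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8,
    "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5, "R": -4.5,
    "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}


def one_hot(sequence):
    """Return a flat list of 0/1 values with 20 entries per residue."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = []
    for residue in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        row[index[residue]] = 1  # raises KeyError for non-standard residues
        encoded.extend(row)
    return encoded


def property_encoding(sequence, table=HYDROPATHY):
    """Return one property value per residue."""
    return [table[residue] for residue in sequence.upper()]
```

Both encodings produce vectors whose length depends on the sequence length, so they only suit aligned sequences of equal length. Representation 4 addresses sequences of different lengths.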

Representations of substrates (possibly)

A module provides different representations of the substrates that capture physico-chemical and geometric properties.

Model classes for the prediction of properties from representations

A module provides simple and more complex prediction models (regression/classification) to be used on some representation:

  • Ridge Regression, Lasso, Random Forest, Gradient Boosted Trees, ARD (automatic relevance determination)
  • A collection of artificial neural networks

Each model must come with a description of the hyperparameters that determine the model’s behaviour.
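
One possible shape for such a description is sketched below. The `Hyperparameter` class, its fields and the `RIDGE_SPEC` example are hypothetical names introduced for illustration. The range given for the ridge penalty is an arbitrary example, not a recommendation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Hyperparameter:
    """Hypothetical description of one tunable hyperparameter."""
    name: str
    low: float
    high: float
    default: float
    log_scale: bool = False
    description: str = ""

    def grid(self, n):
        """Return n candidate values between low and high (inclusive)."""
        if n < 2:
            return [self.default]
        if self.log_scale:
            lo, hi = math.log10(self.low), math.log10(self.high)
            return [10 ** (lo + i * (hi - lo) / (n - 1)) for i in range(n)]
        step = (self.high - self.low) / (n - 1)
        return [self.low + i * step for i in range(n)]


# Example specification for a ridge regression model (illustrative range).
RIDGE_SPEC = [
    Hyperparameter(
        name="alpha",
        low=1e-3,
        high=1e3,
        default=1.0,
        log_scale=True,
        description="Strength of the L2 penalty on the weights.",
    )
]
```

A machine-readable description of this kind would let the training infrastructure build search grids without model-specific code.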

Composition of models and representations

A module provides a simple infrastructure to compose a model out of the given components.
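
A minimal sketch of what such a composition could look like, assuming that a representation is a function from a sequence to a numeric vector and that a model offers `fit` and `predict`. All names (`ComposedModel`, `composition_vector`, `NearestNeighbourRegressor`) are hypothetical, and the 1-nearest-neighbour regressor is only a stand-in for the model classes listed above.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(sequence):
    """Count each standard residue; works for sequences of any length."""
    return [sequence.upper().count(aa) for aa in AMINO_ACIDS]


class NearestNeighbourRegressor:
    """Toy stand-in model: predicts the target of the closest training point."""

    def fit(self, X, y):
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict(self, X):
        def sq_dist(a, b):
            return sum((u - v) ** 2 for u, v in zip(a, b))

        predictions = []
        for x in X:
            nearest = min(range(len(self.X_)), key=lambda i: sq_dist(x, self.X_[i]))
            predictions.append(self.y_[nearest])
        return predictions


class ComposedModel:
    """Chain a representation (callable) with a model (fit/predict)."""

    def __init__(self, encoder, model):
        self.encoder = encoder
        self.model = model

    def fit(self, sequences, targets):
        self.model.fit([self.encoder(s) for s in sequences], targets)
        return self

    def predict(self, sequences):
        return self.model.predict([self.encoder(s) for s in sequences])
```

Usage: `ComposedModel(composition_vector, NearestNeighbourRegressor()).fit(seqs, activities).predict(new_seqs)`. Swapping the encoder or the model leaves the rest of the pipeline unchanged.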

Training and validation of models

An infrastructure for training the models is given:

  1. A metric can be chosen from a set of metrics to assess model performance.
  2. Given an annotated dataset, cross-validation is used to choose optimal hyperparameters.
  3. A list of metrics is reported to assess the performance of the optimized model in cross-validation.
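
The three steps can be sketched with a deliberately small example: a one-feature ridge regression without intercept, mean squared error as the metric, and a deterministic k-fold split. All function names are hypothetical. A real pipeline would shuffle the data and use the model classes listed above.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists; sample i is assigned to fold i % k."""
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test


def fit_ridge_1d(x, y, alpha):
    """Closed-form ridge slope for a single feature, no intercept."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + alpha)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_validate(x, y, alpha, k=3, metric=mse):
    """Average the metric over the k held-out folds."""
    scores = []
    for train, test in k_fold_indices(len(x), k):
        w = fit_ridge_1d([x[i] for i in train], [y[i] for i in train], alpha)
        predictions = [w * x[i] for i in test]
        scores.append(metric([y[i] for i in test], predictions))
    return sum(scores) / len(scores)


def select_alpha(x, y, alphas, k=3):
    """Return the alpha with the lowest cross-validated error and all scores."""
    results = {alpha: cross_validate(x, y, alpha, k) for alpha in alphas}
    best = min(results, key=results.get)
    return best, results
```

The `metric` argument corresponds to step 1, `select_alpha` to step 2, and the returned `results` dictionary is a starting point for the report in step 3.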

For very small datasets (<20?), hyperparameters of simple models can be optimized using information criteria (AIC, BIC).
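
For a least-squares model with Gaussian errors, both criteria can be computed from the residual sum of squares, up to an additive constant that is the same for all models fitted to the same data. The sketch below assumes this setting. The function name and the convention for counting parameters are choices made for illustration.

```python
import math


def gaussian_aic_bic(rss, n_samples, n_params):
    """AIC and BIC for a least-squares fit, up to a shared additive constant.

    rss: residual sum of squares on the training data
    n_samples: number of data points
    n_params: number of fitted parameters (counting convention is up to the user)
    """
    if rss <= 0 or n_samples <= 0:
        raise ValueError("rss and n_samples must be positive")
    fit_term = n_samples * math.log(rss / n_samples)
    aic = fit_term + 2 * n_params
    bic = fit_term + n_params * math.log(n_samples)
    return aic, bic
```

Lower values are better. Since ln(n) exceeds 2 once n is at least 8, BIC penalizes additional parameters more strongly than AIC for all but the tiniest datasets.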

Proposal generation and ranking

Proposals for new enzyme variants would often come from a different source. For a small number of residues, a module is provided that creates a combinatorial library of mutants. For the third use case, relevant residues could be identified from the data.
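
A combinatorial library over a few positions can be enumerated with `itertools.product`. The following is a sketch under assumed conventions: the function name, zero-based positions and the dictionary input format are not taken from the workshop module.

```python
from itertools import product


def combinatorial_library(wild_type, options):
    """Enumerate all variants for the given positions.

    wild_type: parent sequence as a string
    options: {zero-based position: string of allowed residues at that position}
    """
    positions = sorted(options)
    variants = []
    for combo in product(*(options[p] for p in positions)):
        residues = list(wild_type)
        for pos, aa in zip(positions, combo):
            residues[pos] = aa
        variants.append("".join(residues))
    return variants
```

The library size is the product of the numbers of allowed residues per position, so it grows quickly: 4 positions with all 20 residues already give 160,000 variants. The wild type is part of the library whenever its own residues are among the allowed ones.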

Rank mutants according to some ‘acquisition function’ (for non-probabilistic models this would just be the predicted activity).
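
As a sketch of such a ranking, the example below uses an upper confidence bound (predicted mean plus a multiple of the predictive standard deviation) as one possible acquisition function. The function names and the input format are assumptions. For a non-probabilistic model the standard deviation is set to zero, and the score reduces to the predicted activity.

```python
def upper_confidence_bound(mean, std=0.0, kappa=1.0):
    """Score a candidate; kappa controls how much uncertainty is rewarded."""
    return mean + kappa * std


def rank_candidates(predictions, acquisition=upper_confidence_bound):
    """Return variants sorted by acquisition score, best first.

    predictions: {variant: (predicted mean, predictive standard deviation)}
    """
    scores = {v: acquisition(mean, std) for v, (mean, std) in predictions.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

With `kappa=0` the ranking is purely exploitative. Larger values favour variants about which the model is uncertain.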