Stefan Born, (Mark Doerr)
Start: Tue, 28.3.2023, 9:00h, Inst. f. Biochemistry Seminarroom D213, 2nd floor.
Input format for enzymes, substrates and measured properties
It would be desirable to use a standard format for an ‘enzyme search/design problem’, so that a defined pipeline can be run on any such problem. Defining such a data format is not a trivial task. A further step would be to integrate these tools into lab automation software (LARA) as part of a general workflow that plans new experiments from previous experiments and external data. For the moment we settle for a provisional solution: enzymes, substrates and tabular data are read from existing standard formats, which involves some manual intervention.
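As an illustration of what the provisional solution could look like, here is a minimal sketch. It assumes FASTA for the enzyme sequences, SMILES strings for the substrates and a CSV table with the columns `enzyme`, `substrate` and `value`; none of these choices is fixed by the text above. The names `EnzymeProblem`, `read_fasta` and `read_measurements` are hypothetical.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class EnzymeProblem:
    """Hypothetical container for one enzyme search/design problem."""
    enzymes: dict            # variant id -> amino acid sequence
    substrates: dict         # substrate id -> SMILES string (assumed format)
    measurements: list = field(default_factory=list)


def read_fasta(text):
    """Parse FASTA text into {identifier: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first word after '>'.
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def read_measurements(text):
    """Parse CSV rows with the assumed columns enzyme, substrate, value."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "enzyme": row["enzyme"],
            "substrate": row["substrate"],
            "value": float(row["value"]),
        })
    return rows


# Toy input, made up for illustration only.
fasta = ">wt wild type\nMKTAY\nIAKQR\n>K2G\nMGTAYIAKQR\n"
table = "enzyme,substrate,value\nwt,s1,1.0\nK2G,s1,1.7\n"

problem = EnzymeProblem(
    enzymes=read_fasta(fasta),
    substrates={"s1": "CCO"},
    measurements=read_measurements(table),
)
```

Keeping the three inputs in one object means the later modules only need to agree on this container, not on the file formats behind it.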
Representations of amino acid sequences
A module provides a variety of different representations:
- 1-hot sequence encoding
- sequence encoding by physico-chemical AA properties
- sequence encoding using models trained on other tasks (transfer learning)
- vector representation of sequences of different lengths using models trained on other tasks (transfer learning)
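The first representation in the list, 1-hot encoding, can be sketched in a few lines. This is a minimal version that assumes the 20 standard amino acids only; the names `AMINO_ACIDS`, `one_hot` and `flatten` are illustrative, not part of an existing module.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence):
    """Return one row of length 20 per residue, with a single 1 per row."""
    rows = []
    for aa in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        row[AA_INDEX[aa]] = 1  # raises KeyError for non-standard letters such as X
        rows.append(row)
    return rows


def flatten(rows):
    """Concatenate the per-residue rows into one feature vector."""
    return [value for row in rows for value in row]


encoded = one_hot("MKT")
features = flatten(encoded)  # length 3 * 20
```

The flattened vector has length 20 times the sequence length, so this encoding only yields comparable feature vectors for sequences of equal length, for example variants of one wild type.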
Representations of substrates (possibly)
A module provides different representations of the substrates that capture physico-chemical and geometric properties.
Model classes for the prediction of properties from representations
A module provides simple and more complex prediction models (regression/classification) to be used on some representation:
- Ridge Regression, Lasso, Random Forest, Gradient Boosted Trees, ARD (automatic relevance determination)
- A collection of artificial neural networks
Each model must come with a description of the hyperparameters that determine the model’s behaviour.
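One possible way to attach such a description to a model is a small record per hyperparameter. The sketch below is an assumption about how this could be done; the class `Hyperparameter`, its fields and the range chosen for the ridge penalty are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hyperparameter:
    """Hypothetical description of one tunable hyperparameter."""
    name: str
    description: str
    default: float
    low: float
    high: float
    log_scale: bool = False

    def grid(self, n):
        """Return n candidate values between low and high."""
        if n < 2:
            return [self.default]
        if self.log_scale:
            # Constant ratio between neighbours, i.e. log-spaced values.
            ratio = (self.high / self.low) ** (1 / (n - 1))
            return [self.low * ratio ** i for i in range(n)]
        step = (self.high - self.low) / (n - 1)
        return [self.low + step * i for i in range(n)]


# Illustrative entry for the penalty strength of ridge regression.
RIDGE_ALPHA = Hyperparameter(
    name="alpha",
    description="L2 penalty strength; larger values shrink the weights more",
    default=1.0,
    low=1e-3,
    high=1e3,
    log_scale=True,
)

candidates = RIDGE_ALPHA.grid(7)  # seven log-spaced values from 1e-3 to 1e3
```

A description of this kind lets the training infrastructure build a search grid without knowing anything else about the model.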
Composition of models and representations
A module provides a simple infrastructure to compose a model out of the given components.
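The composition could be as simple as chaining a representation function with a regressor that offers `fit` and `predict`. The following is a sketch under that assumption. `ComposedModel`, `composition` and `NearestNeighbour` are hypothetical names, and the nearest-neighbour regressor is only a dependency-free stand-in, not one of the models listed above.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Representation: fraction of each standard amino acid in the sequence."""
    return [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]


class NearestNeighbour:
    """Stand-in regressor: predicts the target of the closest training point."""

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, X):
        predictions = []
        for x in X:
            # Squared Euclidean distance to every training vector.
            dists = [sum((a - b) ** 2 for a, b in zip(x, xi)) for xi in self.X]
            predictions.append(self.y[dists.index(min(dists))])
        return predictions


class ComposedModel:
    """Chains a representation (sequence -> vector) with a regressor."""

    def __init__(self, represent, regressor):
        self.represent = represent
        self.regressor = regressor

    def fit(self, sequences, targets):
        self.regressor.fit([self.represent(s) for s in sequences], targets)
        return self

    def predict(self, sequences):
        return self.regressor.predict([self.represent(s) for s in sequences])


# Toy data, made up for illustration only.
model = ComposedModel(composition, NearestNeighbour())
model.fit(["AAAA", "KKKK"], [0.1, 0.9])
prediction = model.predict(["AAAK"])
```

Because the representation and the regressor are passed in as arguments, any of the encodings and model classes described above could be swapped in without changing the training code.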
Training and validation of models
An infrastructure for training the models is given:
- A metric can be chosen from a set of metrics to assess model performance.
- Given an annotated dataset, cross-validation is used to choose optimal hyperparameters.
- A list of metrics is reported to assess the performance of the optimized model in cross-validation.
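The selection of a hyperparameter by cross-validation can be sketched without external libraries. The example below assumes the simplest possible model, ridge regression with a single feature and no intercept, and mean squared error as the metric; the function names and the synthetic data are illustrative only.

```python
import random


def fit_ridge_1d(xs, ys, alpha):
    """Ridge fit of y ~ w * x: w = sum(x*y) / (sum(x*x) + alpha)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


def mse(w, xs, ys):
    """Mean squared error of the prediction w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(xs, ys, alpha, k=5):
    """Average held-out MSE over k folds for one value of alpha."""
    scores = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], alpha)
        scores.append(mse(w, [xs[i] for i in fold], [ys[i] for i in fold]))
    return sum(scores) / len(scores)


def select_alpha(xs, ys, alphas, k=5):
    """Return the candidate alpha with the lowest cross-validated MSE."""
    return min(alphas, key=lambda a: cross_validate(xs, ys, a, k))


# Synthetic data for illustration: y = 2x plus Gaussian noise.
rng = random.Random(1)
xs = [i / 10 for i in range(1, 21)]
ys = [2 * x + rng.gauss(0, 0.3) for x in xs]
best_alpha = select_alpha(xs, ys, [0.0, 0.1, 1.0, 10.0])
```

The same loop works for any model and metric: only `fit_ridge_1d` and `mse` would need to be replaced.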
For very small datasets (fewer than about 20 samples?), hyperparameters of simple models can be optimized using information criteria (AIC, BIC).
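For a least-squares model with Gaussian errors, both criteria can be computed from the residuals alone, up to an additive constant that is the same for all models fitted to the same data. The sketch below assumes exactly this setting; `gaussian_aic_bic` is a hypothetical helper, and `n_params` is the number of fitted parameters (for penalized models an effective number of parameters would be needed instead).

```python
import math


def gaussian_aic_bic(residuals, n_params):
    """AIC and BIC for a Gaussian least-squares fit, up to a shared constant.

    AIC = n * ln(RSS / n) + 2 * k
    BIC = n * ln(RSS / n) + k * ln(n)
    """
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    fit_term = n * math.log(rss / n)
    aic = fit_term + 2 * n_params
    bic = fit_term + n_params * math.log(n)
    return aic, bic


# Toy residuals, made up for illustration only.
aic, bic = gaussian_aic_bic([1.0, -1.0, 1.0, -1.0], n_params=2)
```

Lower values are better for both criteria. BIC penalizes additional parameters more strongly than AIC as soon as the dataset has eight or more samples, since ln(n) then exceeds 2.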
Proposal generation and ranking
Proposals for new enzyme variants would often come from a different source. For a small number of residues, a module is provided that creates a combinatorial library of mutants. For the third use case, relevant residues could be identified from the data.
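A combinatorial library over a few positions is the Cartesian product of the allowed residues at each position. The following is a minimal sketch of that idea; the function name, the 1-based position convention and the `K2R/A4G` style of mutant labels are assumptions made for the example.

```python
from itertools import product


def combinatorial_library(wild_type, choices):
    """Build all combinations of the allowed residues at the given positions.

    choices maps a 1-based position to a string of allowed residues.
    Returns {mutant label: sequence}; the unmodified sequence is labelled "WT".
    """
    positions = sorted(choices)
    library = {}
    for combo in product(*(choices[p] for p in positions)):
        seq = list(wild_type)
        labels = []
        for pos, aa in zip(positions, combo):
            if seq[pos - 1] != aa:
                labels.append(f"{wild_type[pos - 1]}{pos}{aa}")
                seq[pos - 1] = aa
        library["/".join(labels) or "WT"] = "".join(seq)
    return library


# Toy example: 3 options at position 2 and 2 options at position 4.
library = combinatorial_library("MKTAY", {2: "KRH", 4: "AG"})
```

The library size is the product of the number of options per position (here 3 x 2 = 6), which is why this approach is only practical for a small number of residues.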
Rank mutants according to some ‘acquisition function’ (for non-probabilistic models this would just be the predicted activity).
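As one example of such a ranking, the sketch below uses an upper confidence bound (predicted mean plus a multiple of the predicted standard deviation). The choice of this particular acquisition function is an assumption, as are the function names and the made-up predictions. With a standard deviation of zero, or with `kappa=0`, the score reduces to the predicted activity, which is the non-probabilistic case.

```python
def upper_confidence_bound(mean, std, kappa=1.0):
    """Acquisition score: predicted mean plus kappa times predicted std."""
    return mean + kappa * std


def rank_mutants(predictions, kappa=1.0):
    """Sort mutant names by acquisition score, best first.

    predictions maps a mutant name to (mean, std); use std = 0.0 for
    models that give no uncertainty estimate.
    """
    scores = {
        name: upper_confidence_bound(mean, std, kappa)
        for name, (mean, std) in predictions.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Made-up predictions for illustration only.
predictions = {
    "K2R": (1.2, 0.1),
    "A4G": (1.0, 0.5),
    "K2R/A4G": (1.1, 0.0),
}
exploratory = rank_mutants(predictions, kappa=1.0)
greedy = rank_mutants(predictions, kappa=0.0)
```

A larger `kappa` favours mutants whose prediction is uncertain, so the next round of experiments explores more; `kappa=0` simply picks the highest predicted activity.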