LARA - semantically annotated experimentation from ground up &
robotic enzyme screening for Machine Learning applications.
mark doerr & uwe bornscheuer & KIWI / SiLA / AnIML / NFDI4Cat teams, institute for biochemistry, university greifswald
greifswald/göteborg, 2024-03-27
* lara intro
ML biocatalysis - e.g. transamination reaction
- highly selective enzymes replace conventional chemistry
Structure- and Data-Driven Protein Engineering of Transaminases for Improving Activity and Stereoselectivity
Yu-Fei Ao et al., Angewandte Chemie 2023. https://doi.org/10.1002/anie.202301660
* lara intro
3FCR transaminase substrate screening
Structure- and Data-Driven Protein Engineering of Transaminases for Improving Activity and Stereoselectivity
Yu-Fei Ao et al., Angewandte Chemie 2023. https://doi.org/10.1002/anie.202301660
* lara intro
"classical" ML approaches
Structure- and Data-Driven Protein Engineering of Transaminases for Improving Activity and Stereoselectivity
Yu-Fei Ao et al., Angewandte Chemie 2023. https://doi.org/10.1002/anie.202301660
* lara intro
the greifswald protein screening platform LARA
* lara intro
what is semantics enabled machine learning (ML) ?
- better "understanding" of the data by the ML algorithms
- semantics-guided ML model building by automated feature extraction
- improved consistency checking, because the ML algorithm can extract physical limits
- improved error handling, because the ML algorithm "knows" more about the predicted system
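A minimal sketch of the consistency-checking idea, assuming the metadata carries physical limits for each measured quantity. The `Quantity` class, the absorbance limits and the readings are hypothetical illustrations, not part of LARA:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    """Semantic description of a measured quantity (hypothetical schema)."""
    name: str
    unit: str
    lower: float  # assumed physical lower limit
    upper: float  # assumed physical upper limit


def check_consistency(values, quantity):
    """Return the indices of all values outside the physical limits."""
    return [
        i for i, v in enumerate(values)
        if not (quantity.lower <= v <= quantity.upper)
    ]


# Assumed limits for a plate-reader absorbance signal (illustration only).
absorbance = Quantity("absorbance", "AU", 0.0, 4.0)
readings = [0.12, 0.85, -0.30, 5.20, 1.75]

outliers = check_consistency(readings, absorbance)
print(outliers)
```

Because the limits travel with the data, the same check works for any quantity without hard-coding thresholds in the ML pipeline.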
*
requirements for semantics enabled machine learning
what are we trying to build ?
- generic software platform for closed-loop designs of the simple kind
- autonomous experiment design
- autonomous experiment execution
- autonomous experiment evaluation
- AI and machine learning
- redesign of the experiment
- continue this process until a desired outcome is reached or resources are used up
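The loop above can be sketched as follows. This is a toy illustration only: `design`, `execute` and the quadratic response with an optimum at 37.0 are hypothetical stand-ins for the real experiment design, robotic execution and evaluation steps, and a simple random search stands in for the ML component.

```python
import random


def design(best, step, rng):
    """Propose a new condition near the current best (hypothetical strategy)."""
    return best + rng.uniform(-step, step)


def execute(condition):
    """Stand-in for a robotic experiment: toy response with optimum at 37.0."""
    return -(condition - 37.0) ** 2


def closed_loop(start, target, budget, step=2.0, seed=0):
    """Design, execute and evaluate until the target score or the budget is hit."""
    rng = random.Random(seed)
    best, best_score = start, execute(start)
    runs = 1
    while best_score < target and runs < budget:
        candidate = design(best, step, rng)  # (re)design
        score = execute(candidate)           # execution
        runs += 1
        if score > best_score:               # evaluation
            best, best_score = candidate, score
    return best, best_score, runs


best, best_score, runs = closed_loop(start=25.0, target=-0.25, budget=200)
print(f"best condition {best:.2f} after {runs} runs")
```

The two stop conditions in the `while` clause correspond to "desired outcome reached" and "resources used up".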
*
the challenge
- full SiLA communication of all devices in all platforms
- platform independent process description language
- flexible programming language for complex processes
- dynamic and error tolerant scheduling
- full experimental metadata for machine learning applications
- storage and exchange of data between labs (KIWI biolab)
- fully semantic annotation of the metadata (ontologies)
* protein screening engineering
* finding the right enzyme in 1E5 to 1E9 variants
* lara movie
benefits for machine learning applications
- rich metadata from ground up
- ontology based (see ontology development workflow)
- allowing autonomous feature extraction
*
challenges to overcome
- heterogeneous data
- non-standardised data structure
- non-standardised metadata - no semantics
- non-standardised device communication
- black-box software - closed source
- no advanced, comprehensive data storage
*
building homogeneous infrastructure from ground up
*
* lara intro
* lara intro
... what software components are required ?
- device control
- data readout
- data transfer
- data storage
- semantic understanding of the data by the machines
- machine learning algorithms to "work on the data"
- feedback software to the instruments
- common description language
- process execution / scheduler
*
SiLA servers/devices of LARA
*
LARA, SiLA, AnIML/JSON-LD, pythonLab, pythonLabScheduler, LabDataReader
*
pythonLab
https://gitlab.com/opensourcelab/pythonLab
universal, python based, automation language
*
pythonLabOrchestrator & Scheduler
https://gitlab.com/opensourcelab/pythonlabscheduler
*
pythonLabScheduler in the wild (stefan maak)
https://gitlab.com/opensourcelab/pythonlabscheduler
*
LabDataReader
https://gitlab.com/opensourcelab/LabDataReader
- generic data reader framework for proprietary (text based) data formats
- new data formats can be added as plugins
- rich metadata support
- automatic semantic annotation of the data
- fully written in python
- output formats: pandas data frame, JSON-LD, csv, (AnIML/JSON-LD - under development)
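The plugin idea can be sketched with a small registry. This is not the LabDataReader API: the `register` decorator, the `plate_csv` format, the `to_jsonld` helper and the `https://example.org/lab#` vocabulary are all hypothetical names chosen for illustration.

```python
import csv
import io
import json

READERS = {}  # format name -> parser function (hypothetical registry)


def register(fmt):
    """Decorator that adds a parser for a new data format to the registry."""
    def decorator(func):
        READERS[fmt] = func
        return func
    return decorator


@register("plate_csv")
def read_plate_csv(text):
    """Parse a toy 'well,absorbance' export into a list of records."""
    rows = csv.DictReader(io.StringIO(text))
    return [
        {"well": r["well"], "absorbance": float(r["absorbance"])}
        for r in rows
    ]


def to_jsonld(records, device):
    """Wrap the records in a minimal JSON-LD document (illustrative vocabulary)."""
    return {
        "@context": {"@vocab": "https://example.org/lab#"},
        "@type": "Measurement",
        "device": device,
        "results": records,
    }


raw = "well,absorbance\nA1,0.12\nA2,0.85\n"
records = READERS["plate_csv"](raw)
doc = to_jsonld(records, device="hypothetical-plate-reader")
print(json.dumps(doc, indent=2))
```

Supporting another instrument format then only means registering one more parser function; the annotation and export steps stay unchanged.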
*
holistic approach of the LARA suite
- planning of experiments
- storing all required data for the planning, like literature, substances, material, devices, experimentalists ...
- generating the processes
- execution of the processes, communication with the lab devices
- collection of the data (very structured, well prepared to learn from it)
- evaluation and visualisation of the data (also DoE and machine learning)
- reporting / publishing / exchange between labs
* In the very early days of personal computing, I was wondering, why the computer was not used
overview of final architecture
fully open sourced and python based
*
ontology development for semantic search / ML
*
EMMO - Elementary Multiperspective Material Ontology
- top- & mid level ontology
- sound theoretical foundation
- rooted in historical philosophy (mereology), topology, physics and quantum physics
- all ITEMS are, e.g., SpaceTime Objects
- multiperspective and multidisciplinary
- modelling and experiments
- small -> fast reasoning
- python representation (EMMOntoPy)
*
EMMOntoPy
(github.com/emmo-repo/EMMOntoPy)
- all OWL classes and Properties are modeled as Python Objects
- generation of an ontology and reasoning can be done completely in python
- modular / object oriented modelling possible
- easy interaction / integration in own python applications
- SPARQL endpoint
- fast SQLITE triple store
- code managed in git repository
- tools for validation, documentation and visualisation
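To illustrate what a SQLite-backed triple store does, here is a minimal sketch using only Python's standard `sqlite3` module. It does not use or reproduce EMMOntoPy; the table layout and the class names (`AbsorbanceMeasurement`, `Measurement`, `Process`) are assumptions made for this example and are not EMMO terms.

```python
import sqlite3

# In-memory store with one table of (subject, predicate, object) triples.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
con.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("AbsorbanceMeasurement", "subClassOf", "Measurement"),
        ("Measurement", "subClassOf", "Process"),
        ("run_001", "type", "AbsorbanceMeasurement"),
    ],
)


def superclasses(con, cls):
    """Follow subClassOf edges transitively with a recursive query."""
    rows = con.execute(
        """
        WITH RECURSIVE up(c) AS (
            SELECT o FROM triples WHERE s = ? AND p = 'subClassOf'
            UNION
            SELECT t.o FROM triples t JOIN up ON t.s = up.c
            WHERE t.p = 'subClassOf'
        )
        SELECT c FROM up
        """,
        (cls,),
    ).fetchall()
    return [r[0] for r in rows]


print(sorted(superclasses(con, "AbsorbanceMeasurement")))
```

The transitive lookup is a very small piece of what a reasoner provides: a query for all `Process` instances could also find `run_001`, although that fact was never stored explicitly.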
*
ontology development pipeline
*
ontologies @ OpenSourceLab
*
example: OSO measurement
*
NFDI4Cat
National Research Data Infrastructure - for Catalysis
*
summary
- building a large, community driven open source infrastructure
- no black box, everything is adjustable and extendible
- advanced queries and automated feature extraction
- feedback loop driven protein engineering is possible with the tools in place
- machine learning applications will improve dramatically once the right metadata is in place
- we need a modern mind set (biochemistry and machine learning)
- we are not alone, please join and contribute !
*
acknowledgements
Stefan Maak
project partners
- Stefan Born (TU Berlin)
- Peter Neubauer's group (TU Berlin)
- Johannes Kabisch's group and associates (Uni Trondheim)
- Egon Heuson (Uni Lille)
- Uwe Bornscheuer
and our group (Univ. Greifswald)
KIWI-UG / NFDI4Cat
SiLA team
AnIML team
This work was supported by the German Federal Ministry of Education and Research through the Program “International Future Labs for Artificial Intelligence” (Grant number 01DD20002A).
We are grateful to the Deutsche Forschungsgemeinschaft (DFG, INST 292/118-1 FUGG) and the federal state Mecklenburg-Vorpommern for financing the robotic platform.