The RDM System LARA:
semantics through automation from bottom up

mark doerr, stefan maak & uwe t. bornscheuer
institute for biochemistry, university greifswald
Karlsruhe, 2023-09-14


* lara intro

the big vision

let's build ....

we made a plan ....

planning & control

*

the project and experiment planning module

*

the (zotero) literature module

*

process execution:
pythonLab, labOrchestrator, pythonLabScheduler

*

process description language : pythonLab

universal, python based, automation language
gitlab.com/opensourcelab/pythonLab

*

pythonLabOrchestrator & Scheduler

gitlab.com/opensourcelab/pythonlabscheduler

*

pythonLabScheduler in the wild (stefan maak)

gitlab.com/opensourcelab/pythonlabscheduler

*

(meta-) data:
data-transfer, storage, ontologies

*

what is SiLA ?

sila-standard.org

laboratory automation communication standard
standardised data transfer
standardised data storage (AnIML)

* feature

which data shall be stored ?

the data module

*

LabDataReader

https://gitlab.com/opensourcelab/LabDataReader

generic data reader framework for proprietary (text based) data formats
new data formats can be added as plugins
rich meta data support
automatic semantic annotation of the data
fully written in python
output formats: pandas data frame, JSON-LD, csv, (SciDat/AnIML - under development)
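The plugin idea listed above can be illustrated with a minimal sketch. It does not use the LabDataReader API: the registry `READERS`, the decorator `register_reader`, the format name `demo_plate_reader`, the tab-separated input layout and the `https://example.org/vocab#` vocabulary are all assumptions made for illustration only.

```python
import csv
import io
import json
from typing import Callable, Dict, List

# Hypothetical plugin registry: maps a format name to a parser function.
READERS: Dict[str, Callable[[str], List[dict]]] = {}


def register_reader(format_name: str):
    """Decorator that registers a parser function under a format name."""
    def decorator(func):
        READERS[format_name] = func
        return func
    return decorator


@register_reader("demo_plate_reader")
def read_demo_plate_reader(text: str) -> List[dict]:
    """Parse a made-up tab-separated export with 'well' and 'absorbance' columns."""
    rows = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [{"well": r["well"], "absorbance": float(r["absorbance"])} for r in rows]


def to_json_ld(records: List[dict], device: str) -> str:
    """Wrap parsed records in a small JSON-LD document.

    The context IRI is a placeholder, not a real ontology.
    """
    doc = {
        "@context": {
            "ex": "https://example.org/vocab#",
            "device": "ex:device",
            "results": "ex:hasResult",
            "well": "ex:well",
            "absorbance": "ex:absorbance",
        },
        "@type": "ex:MeasurementSeries",
        "device": device,
        "results": records,
    }
    return json.dumps(doc, indent=2)


# Made-up raw device export, parsed through the registered plugin.
raw = "well\tabsorbance\nA1\t0.12\nA2\t0.48\n"
records = READERS["demo_plate_reader"](raw)
document = to_json_ld(records, device="demo reader")
```

A new format would be added by writing one more decorated parser function; the code that calls `READERS[...]` stays unchanged.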

*

ontology - development - for semantic search / ML

*

EMMO - European Multiperspective Material Ontology

top- & mid level ontology
sound theoretical foundation
rooted in historical philosophy (mereology), topology, physics and quantum physics
all ITEMS are, e.g., SpaceTime Objects
multiperspective and multidisciplinary
modelling and experiments
small -> fast reasoning
python representation (EMMOntoPy)

*

EMMOntoPy (github.com/emmo-repo/EMMOntoPy)

all OWL classes and Properties are modeled as Python Objects
generation of an ontology and reasoning can be done completely in python
modular / object oriented modelling possible
easy interaction / integration in own python applications
SPARQL endpoint
fast SQLITE triple store
code managed in git repository
tools for validation, documentation and visualisation

*

ontology development pipeline

*

ontologies @ OpenSourceLab

*

example: OSO measurement

*

SciDat

gitlab.com/opensourcelab/ScientificData/SciDat

efficient data and metadata storage for scientific / machine learning needs
proper nullable data / missing data handling (pyarrow / parquet)
data modalities, like range / limits, type / continuous / categorical ...
fast exchange and loading (fastparquet, pyarrow, arrow flight)
semantic annotations / metadata in RDF compliant format
programming language agnostic
commonly used in ETL pipelines (Apache Spark, prefect, ... )
suitable for S3 file storage systems (MinIO)

- data and metadata storage for scientific / machine learning needs (semantic annotation, based on ontologies, derivatives of owlready2)
- proper nullable data / missing data handling (pyarrow / parquet)
- data modalities, like range / limits, type / continuous / categorical / variable treatment in case of range violation (parquet metadata)
- cardinality (parquet metadata)
- efficient storage (parquet)
- metadata and data stored in one place (parquet)
- metadata conservation when saving / loading / processing (parquet -> arrow)
- fast data exchange (arrow flight, MinIO active replication)
- fast loading (fastparquet, pyarrow)
- fast data processing without in-memory re-writing after loading (pandas with pyarrow backend, arrow flight, polars)
- "modalities" for the machine learning models
- semantic annotations / metadata in RDF compliant format
- for creating instances of ontology classes and SPARQL reasoning (JSON-LD, rdflib, owlready2)
- fast data processing (direct loading into pyarrow driven dataframe)
- programming language agnostic / independent (parquet)
- easy to use (SciDat / labDataReader framework, currently in implementation by me)
- commonly used in ETL pipelines (Apache Spark, prefect, ... )
- suitable for S3 file storage systems (MinIO)

NFDI4Cat

National Research Data Infrastructure - for Catalysis

TA1 ontology workgroup
vocabulary / thesaurus for (bio-) catalysis
vocabulary building pipeline

see the talk by Alexander Behr et al. (Wed, 11:00h, Enabling RDM I)

working with (meta-) data:
SPARQL, jupyter

*

built-in SPARQL interface

*

working with jupyter

*

architecture:
LARA

*

* lara intro

sharing data:
collaborations, repositories, publications

*

* lara intro

implementation:
all open source, python, gitlab

*

building homogeneous infrastructure from the ground up

*

... what software components are required ?

device control
data readout
data transfer
data storage
semantic understanding of the data by the machines
machine learning algorithms to "work on the data"
feedback software to the instruments
common description language
process execution / scheduler

*

SiLA servers/devices of LARA

*

what are we trying to build ?

generic software platform for simple closed-loop designs
autonomous experiment design
autonomous experiment execution
autonomous experiment evaluation
AI and machine learning
redesign of the experiment
continue this process until a desired outcome is reached or resources are used up
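The loop described above (design, execute, evaluate, redesign, stop on success or exhausted resources) can be sketched in a few lines. This is a toy illustration, not LARA code: `design`, `execute` and `closed_loop` are hypothetical names, the "experiment" is a made-up quadratic response with its optimum at x = 3, and the redesign step is a plain random perturbation standing in for any ML-driven proposal.

```python
import random


def design(best_x: float, step: float, rng: random.Random) -> float:
    """Propose a new candidate near the current best (random perturbation)."""
    return best_x + rng.uniform(-step, step)


def execute(x: float) -> float:
    """Stand-in for running an experiment; toy response with optimum at x = 3."""
    return -(x - 3.0) ** 2


def closed_loop(target: float, budget: int, seed: int = 0):
    """Iterate design -> execute -> evaluate until the target or the budget is hit.

    Returns (best parameter, best response, number of experiments used).
    """
    rng = random.Random(seed)
    best_x = 0.0
    best_y = execute(best_x)  # initial experiment counts against the budget
    used = 1
    while best_y < target and used < budget:
        x = design(best_x, step=1.0, rng=rng)  # redesign
        y = execute(x)                         # execution
        used += 1
        if y > best_y:                         # evaluation
            best_x, best_y = x, y
    return best_x, best_y, used


best_x, best_y, used = closed_loop(target=-0.01, budget=200)
```

The two stop conditions in the `while` line correspond to the two exits named on the slide: a desired outcome (`best_y >= target`) or used-up resources (`used == budget`).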

*

the challenge

full SiLA communication of all devices in all platforms
platform independent process description language
flexible programming language for complex processes
dynamic and error tolerant scheduling
full experimental metadata for machine learning applications
storage and exchange of data between labs (KIWI biolab)
fully semantic annotation of the metadata (ontologies)

* protein screening engineering * finding the right enzyme in 1E5 to 1E9 variants * lara movie

benefits for machine learning applications

rich metadata from ground up
ontology based (see ontology development workflow)
allowing autonomous feature extraction

*

challenges to overcome

heterogeneous data
non-standardised data structure
non-standardised metadata - no semantics
non-standardised device communication
black-box software - closed source
no advanced, comprehensive data storage

*

what is semantics enabled machine learning (ML) ?

better "understanding" of the data by the ML algorithms
semantics guided ML model building
by automated feature extraction
improved consistency checking, because the ML algorithm can extract physical limits
improved error handling, because the ML algorithm "knows" more about the predicted system
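A minimal sketch of how annotated limits could drive such a consistency check is given below. It is an illustration under assumptions, not part of LARA or SciDat: the `VariableMetadata` class and `check_consistency` function are hypothetical, and the upper limit of 373.15 K is an invented instrument limit (only the lower limit of 0 K is a physical one).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class VariableMetadata:
    """Hypothetical metadata record for one measured variable."""
    name: str
    unit: str
    lower: Optional[float] = None  # lowest allowed value, if known
    upper: Optional[float] = None  # highest allowed value, if known


def check_consistency(
    values: List[Optional[float]], meta: VariableMetadata
) -> List[Tuple[int, str]]:
    """Return (index, reason) for every missing or out-of-range value."""
    problems = []
    for i, v in enumerate(values):
        if v is None:
            problems.append((i, "missing"))
        elif meta.lower is not None and v < meta.lower:
            problems.append((i, "below lower limit"))
        elif meta.upper is not None and v > meta.upper:
            problems.append((i, "above upper limit"))
    return problems


# 0 K is a physical limit; 373.15 K is an assumed instrument limit.
temperature = VariableMetadata("temperature", "K", lower=0.0, upper=373.15)
issues = check_consistency([298.15, None, 400.0, -5.0, 310.0], temperature)
# issues == [(1, "missing"), (2, "above upper limit"), (3, "below lower limit")]
```

Because the limits travel with the data as metadata, the same check can run before training and on model predictions without hard-coding any device knowledge.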

*

outlook

simple idea, but implementation is not so simple
simplification through (mobile) web apps
voice control (e.g. OpenAI whisper)
utilisation of LLMs for queries
utilisation of evolving ontologies for consistency and error checking

*

summary

building a large, community driven open source infrastructure
no black box, everything is adjustable and extendible
advanced queries and automated feature extraction
"robot scientist" - feedback loop driven science possible
machine learning applications will improve dramatically once the right metadata is in place
we need a modern mind set (science and machine learning)
we are not alone, please join and contribute !

*

acknowledgements

Stefan Maak

project partners

Stefan Born (TU Berlin)
Peter Neubauer's group (TU Berlin)
Johannes Kabisch's group and associates (Uni Trondheim)
Egon Heuson (Uni Lille)
Uwe Bornscheuer and our group (Univ. Greifswald)

KIWI-UG / NFDI4Cat

SiLA team

AnIML team

This work was supported by the German Federal Ministry of Education and Research through the Program “International Future Labs for Artificial Intelligence” (Grant number 01DD20002A)

We are grateful to the Deutsche Forschungsgemeinschaft (DFG, INST 292/118-1 FUGG) and the federal state Mecklenburg-Vorpommern for financing the robotic platform.
