The RDM System LARA:
semantics through automation from bottom up
Mark Doerr, Stefan Maak & Uwe T. Bornscheuer
Institute for Biochemistry, University of Greifswald
Karlsruhe, 2023-09-14
SciDat
- efficient data and metadata storage for scientific / machine learning needs
- proper nullable data / missing data handling (pyarrow / parquet)
- data modalities, like range / limits, type (continuous / categorical), ...
- fast exchange and loading (fastparquet, pyarrow, arrow flight)
- semantic annotations / metadata in RDF compliant format
- programming language agnostic
- commonly used in ETL pipelines (Apache Spark, prefect, ... )
- suitable for S3 file storage systems (MinIO)
- data and metadata storage for scientific / machine learning needs (semantic annotation, based on ontologies, derivatives of owlready2)
- proper nullable data / missing data handling (pyarrow / parquet)
- data modalities, like range / limits, type (continuous / categorical), variable treatment in case of range violation (parquet metadata)
- cardinality (parquet metadata)
- efficient storage (parquet)
- metadata and data stored at one place (parquet)
- metadata conservation when saving / loading / processing (parquet -> arrow)
- fast data exchange (arrow flight, MinIO active replication)
- fast loading (fastparquet, pyarrow)
- fast data processing without in-memory re-writing after loading (pandas with pyarrow backend, arrow flight, polars)
- "modalities" for the machine learning models
- semantic annotations / metadata in RDF compliant format - for creating instances of ontology classes and SPARQL reasoning (JSON-LD, rdflib, owlready2)
- fast data processing (direct loading into a pyarrow-driven dataframe)
- programming language agnostic / independent (parquet)
- easy to use (SciDat / labDataReader framework, currently being implemented by me)
- commonly used in ETL pipelines (Apache Spark, prefect, ... )
- suitable for S3 file storage systems (MinIO)
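
The "data modalities" and "treatment in case of range violation" points above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the SciDat implementation: the `Modality` class, its field names, the `b"modality"` key and the example.org unit IRI are all hypothetical. The one grounded detail is that Parquet / Arrow key-value metadata stores byte strings, which is why the descriptor is serialized to JSON bytes.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Modality:
    """Hypothetical per-column descriptor (names are assumptions, not SciDat API)."""
    kind: str          # "continuous" or "categorical"
    unit_iri: str      # ontology IRI describing the unit (placeholder value below)
    lower: float       # lower limit of the valid range
    upper: float       # upper limit of the valid range
    on_violation: str  # "null": treat as missing, "clip": clamp to the nearest limit

    def to_metadata(self) -> dict:
        # Key-value metadata in Parquet / Arrow holds bytes, so encode as JSON bytes.
        return {b"modality": json.dumps(asdict(self)).encode("utf-8")}

    @classmethod
    def from_metadata(cls, meta: dict) -> "Modality":
        # Inverse of to_metadata: decode the JSON bytes back into a descriptor.
        return cls(**json.loads(meta[b"modality"].decode("utf-8")))


def apply_modality(values, modality):
    """Keep missing values (None) as they are and handle out-of-range values."""
    cleaned = []
    for v in values:
        if v is None or modality.lower <= v <= modality.upper:
            cleaned.append(v)
        elif modality.on_violation == "clip":
            cleaned.append(min(max(v, modality.lower), modality.upper))
        else:
            cleaned.append(None)  # out of range: treat as missing
    return cleaned


temperature = Modality(
    kind="continuous",
    unit_iri="https://example.org/unit/degree_Celsius",  # placeholder IRI
    lower=4.0,
    upper=95.0,
    on_violation="null",
)

# Round trip through the byte-valued metadata mapping, as if saved and reloaded.
restored = Modality.from_metadata(temperature.to_metadata())
cleaned = apply_modality([25.0, None, 120.0, 37.5], restored)
```

With pyarrow, a mapping like the one returned by `to_metadata()` could be attached to a column via `Field.with_metadata`, so that the descriptor travels with the data when the table is written to Parquet and read back.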