Once again, the team took centre stage. All scieneers from Karlsruhe, Cologne and Hamburg came together for a two-day autumn event at the end of September. As well as having some exciting discussions and taking part in some joint activities, we also welcomed five new colleagues. We are now a team of around 50 people!
PyData Berlin 2025 at the Berlin Congress Center was three days full of talks, tutorials, and tech community spirit. The focus was on open-source tools and agentic AI, as well as addressing the question: How can LLMs be used productively and in a controlled manner? We at scieneers gave a presentation on LiteLLM titled “One API to Rule Them All? LiteLLM in Production”.
Missense variants, that is, single amino acid substitutions in proteins, are often difficult to assess. Our machine learning workflow uses protein structure-based graph embeddings to predict the pathogenicity of such variants. In doing so, the structural information enhances existing approaches like the CADD score and provides new insights for genomic medical diagnostics.
Most of our colleagues from our three locations Karlsruhe, Cologne, and Hamburg met in Hamburg for two days. We discussed specialist and internal topics, gained new ideas, and shared experiences.
After giving an overview of Real-Time Intelligence in Microsoft Fabric in the previous article, today we’ll dive a bit deeper and take a closer look at Eventstreams.
Events and streams in general
Let’s start by taking a step back and reflecting on what an “event” or “event stream” is outside of Fabric.
For example, let’s say we are storing perishable food in a warehouse. To make sure it’s always cool enough, we want to monitor the temperature. So, we’ve installed a sensor that transmits the current temperature once a second.
Whenever that happens, we speak of an event. Over time, this results in a sequence of events that—at least in theory—never ends: a stream of events.
At an abstract level, an event is a data package that is emitted at a specific point in time and typically describes a change in state, e.g. a shift in temperature, a change in a stock price, or an updated vehicle location.
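To make this concrete, here is a minimal sketch in Python of what such an event payload might look like (the field names and values are illustrative, not a fixed schema):

import json
from datetime import datetime, timezone

# A single temperature event: emitted at a point in time, describing a state
event = {
    "sensorId": "warehouse-7",                            # hypothetical sensor name
    "timestamp": datetime.now(timezone.utc).isoformat(),  # when the event occurred
    "temperatureC": 4.2,                                  # the reported state
}
print(json.dumps(event))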
Eventstreams in Fabric
Let’s shift our focus to Microsoft Fabric. Here, an Eventstream represents a stream of events originating from (at least) one source, which is optionally transformed and finally routed to (at least) one destination.
What’s nice is that Eventstreams work without any coding: you can create and configure them easily via the browser-based user interface.
Each Eventstream is built from three types of elements, which we’ll examine more closely below.
① Sources
② Transformations
③ Destinations
① Sources
To get started, you need a data source that delivers events.
In terms of technologies, a wide range of options is supported. In addition to Microsoft services (e.g. Azure IoT Hub, Azure Event Hubs, OneLake events), these also include Apache Kafka streams, Amazon Kinesis Data Streams, Google Cloud Pub/Sub, and change data capture (CDC) sources.
If none of these are suitable, you can use a custom endpoint, which supports Kafka, AMQP, and Event Hub. You can find an overview of all supported sources here.
Tip: Microsoft offers various “sample” data sources, which are great for testing and experimentation.
② Transformations
The incoming event data can now be cleansed and transformed in various ways. To do this, you append and configure one of several transformation operators after the source. These operators allow you to filter, combine, and aggregate data, select fields, and so on.
Example: Suppose the data source transmits the current room temperature multiple times per second, but for our planned analysis a one-minute granularity would be perfectly sufficient. So we use the “Group by” transformation to calculate the average, minimum, and maximum temperature for each one-minute window. This significantly reduces the data volume (and associated costs) before storage, while still preserving all the relevant information.
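Outside Fabric, the same windowed aggregation can be sketched in a few lines of pandas. This is purely illustrative; in an Eventstream you would configure the “Group by” operator in the UI instead:

import pandas as pd

# Invented per-second temperature readings
readings = pd.DataFrame(
    {"temperatureC": [4.1, 4.3, 4.2, 4.6, 4.4, 4.5]},
    index=pd.date_range("2025-01-01 08:00", periods=6, freq="s"),
)

# Average, minimum, and maximum temperature per one-minute window
summary = readings.resample("1min").agg(["mean", "min", "max"])
print(summary)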
③ Destinations
After all transformation steps are completed, the event data is sent to a destination. Most often, this is a table in an Eventhouse. The following destinations are supported:
Eventhouses: An Eventhouse is a data store in Fabric that is optimized for event data. It supports continuous ingestion of new data and very fast analytics on that data. We will discuss Eventhouses in more detail in another blog post.
Lakehouse: A lakehouse is Fabric’s “typical” data store for traditional (batch) scenarios. It supports both structured and unstructured data.
Activator: An activator enables triggering actions based on certain conditions. For example, you could send an automatic email when the measured temperature exceeds a threshold. For more complex cases, a Power Automate flow can be triggered.
Stream: Another event stream (a “derived stream”). This means you have the ability to chain Eventstreams, which helps break down complex logic and enables reuse.
Custom Endpoint: As with sources, you can also use a custom endpoint as a destination and thus connect arbitrary third-party systems. Kafka, AMQP, and Event Hub are supported here as well.
Eventstreams also support multiple destinations. This is useful, for instance, when implementing a Lambda architecture: you store fine-grained data (e.g. on a per-second basis) in an Eventhouse for a limited time to support real-time scenarios. In parallel, you aggregate the data (e.g. per minute) and store the result in a Lakehouse for historical data analysis.
Costs
Using Eventstreams requires a paid Fabric Capacity. Microsoft recommends at least an F4 SKU (monthly prices can be found here). In practice, the appropriate capacity level depends on several factors, particularly the required compute power, data volume, and total Eventstream runtime. Further details can be found here.
If you don’t need an Eventstream for some time, you can deactivate it to avoid unnecessary load on your Fabric Capacity. This can be done separately for each source and destination.
In today’s business world, there is an increasing focus on basing decisions and processes on a solid foundation of data. The technical solution is usually a combination of data warehouses and dashboards that compile company data and present it in a visual format that is easy for everyone to understand and use.
Implementation often relies on batch processing, whereby data is collected and prepared automatically, for instance once a day, or more frequently, such as every hour or every few minutes.
This approach works well for many applications. However, it has its limitations when it comes to analysing information ‘in real time’ with a delay of a few seconds at most. Here are a few examples:
Production: monitoring sensor data to prevent machine breakdowns (‘predictive maintenance’).
Supply chain: tracking location data and weather events to recognise delivery delays early on.
Finance: monitoring and analysing share prices in real time.
IT: analysing log data to detect problems immediately after updates.
Marketing: analysing social media posts during live events.
None of this is new, but previous solutions were often challenging to implement and required considerable expertise.
This is precisely where Microsoft stepped in: in 2024, the Fabric platform was expanded to include Real-Time Intelligence and various building blocks that can be used to bypass much of the complexity and swiftly develop functioning solutions.
In future articles, we will take a closer look at the most important of these “Fabric items”. Here is a brief overview:
Eventstream
Eventstreams continuously receive real-time data (events) from various sources. This data is transformed as required and ultimately forwarded to a destination responsible for storing it. Typically, this destination is an Eventhouse. No code is required.
Eventhouse
An Eventhouse is a data store optimised specifically for events. It contains at least one KQL database, which stores event data in table form.
Real-Time Dashboard
Real-time dashboards are similar to Power BI reports, but they are a standalone solution independent of Power BI. A real-time dashboard contains tiles with visualisations, such as charts or tables. These are interactive; you can apply filters, for example. Each visual retrieves the data it needs via a database query formulated in KQL (Kusto Query Language), typically against an Eventhouse.
Activator
Activator enables you to perform an action automatically when certain conditions are met in a data source, such as a real-time dashboard or a KQL query. The simplest action is sending a message via email or Teams, but you can also trigger a Power Automate flow.
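As a purely illustrative sketch (in Python rather than in Fabric, with a made-up send_alert placeholder instead of a real notification API), the condition-action logic of such a rule boils down to this:

THRESHOLD_C = 8.0  # invented alert threshold

def send_alert(message: str) -> None:
    # Placeholder: in Fabric, Activator would send an email/Teams message
    # or trigger a Power Automate flow instead.
    print(f"ALERT: {message}")

def on_event(event: dict) -> None:
    # Fire the action whenever the condition is met
    if event["temperatureC"] > THRESHOLD_C:
        send_alert(f"Temperature {event['temperatureC']} °C exceeds {THRESHOLD_C} °C")

on_event({"temperatureC": 9.3})  # prints an alert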
So what do I need to learn to implement a solution with Real-Time Intelligence?
It essentially boils down to KQL and a basic understanding of the aforementioned Fabric items. A lot of it is no-code. KQL is less common than SQL, but the basics are easy to learn and feel natural after a short time.
We’ll be back soon with more posts on Fabric Real-Time Intelligence in our blog, delving deeper into the various topics.
At this year’s Minds Mastering Machines (M3) conference in Karlsruhe, the focus was on best practices for GenAI, RAG systems, case studies from various industries, agent systems and LLMs, as well as legal aspects of ML. We gave three talks about our projects.
Global models such as TFT and TimesFM are revolutionising district heating forecasting: they provide more accurate predictions, leverage synergies between systems, and effectively solve the cold start problem.
Is the CI/CD pipeline taking forever? Is it taking too long to build a container image locally? One possible reason could be the size of container images – they are often unnecessarily bloated. This article presents several strategies to optimise images and make them faster and more efficient. 🚀
No unnecessary dependencies
A project that has grown over time, with numerous dependencies in pyproject.toml, can quickly become confusing. Before taking the next step – containerization with Docker, for example – it is worth first checking which dependencies are still needed and which are now obsolete. This allows you to streamline the code base, reduce potential security risks, and improve maintainability.
One option would be to delete all dependencies and the virtual environment and then go through the source code file by file, adding back only the dependencies that are actually needed. The command line tool deptry offers a more efficient strategy: it takes over this tedious task and helps to quickly identify superfluous dependencies. It is installed with:
uv add --dev deptry
The analysis can then be started directly in the project folder with the following command:
deptry .
After that, deptry lists the dependencies that are no longer used:
Scanning 126 files...
pyproject.toml: DEP002 'pandas' defined as a dependency but not used in the codebase
Found 1 dependency issue.
In this case, pandas no longer appears to be used. It is recommended to verify this and then remove all dependencies that are no longer needed (along with deptry itself, once the analysis is done):
uv remove pandas
uv remove --dev deptry
Alternative index
If you are using torch directly, or a package such as docling or sparrow that has torch(vision) as a dependency, and you only want to use the CPU, you can omit installing the CUDA libraries. This can be achieved by specifying an alternative index for torch(vision) that uv searches first; the torch(vision) builds in this index have no dependencies on the CUDA libraries. To do this, add the following to pyproject.toml:
[tool.uv.sources]
torch = [
  { index = "pytorch-cpu" },
]
torchvision = [
  { index = "pytorch-cpu" },
]

[[tool.uv.index]]
name = "pytorch-cpu"
url = "https://download.pytorch.org/whl/cpu"
This is what the images look like with and without the alternative index:
REPOSITORY           TAG   IMAGE ID       CREATED              SIZE
sample_torchvision   gpu   f0f89156f089   5 minutes ago        6.46GB
sample_torchvision   cpu   0e4b696bdcb2   About a minute ago   657MB
With the alternative index, the image is only 1/10 as large!
The correct Dockerfile
Whether the Python project is just starting or has been around for a while, it is worth having a look at the sample Dockerfiles provided by uv: uv-docker-example.
These provide a reasonable base configuration and are optimised to create the smallest possible images. They are extensively commented and use a minimal base image with Python and uv preinstalled. Dependencies and the project are installed in separate commands, so that layer caching works optimally. Only the regular dependencies are installed, while dev dependencies such as the previously installed deptry are excluded.
In the multistage example, only the virtual environment and project files are copied to the runtime image, so that no superfluous build artefacts end up in the final image.
Bonus tip for Azure WebApp users
This tip will not reduce the size of the image, but it may save you some headaches in an emergency.
When deploying the Docker image in an Azure WebApp, /home or paths beneath it should not be used as WORKDIR. The /home path can be used to share data across multiple WebApp instances; this is controlled by the environment variable WEBSITES_ENABLE_APP_SERVICE_STORAGE. If it is set to true, the shared storage is mounted at /home, which means that the files contained in the image under that path are no longer visible in the container.
(If the Dockerfile is based on the uv examples, then the WORKDIR is already configured correctly under “/app”.)
Almost every company conducts research and development (R&D) to bring innovative products to market. The development of innovative products is inherently risky: since such products have never been offered before, it is usually unclear whether, to what extent, and how exactly the desired product features can be realized. The success of R&D therefore depends largely on the acquisition and use of knowledge about the feasibility of product features.
Scrum is by far the most popular form of agile project management today [1]. Some of the key features of Scrum are constant transparency of product progress, goal orientation, simple processes, flexible working methods, and efficient communication. Scrum is a flexible framework and contains only a handful of activities and artefacts [2]. Scrum deliberately leaves open how Backlog Items (BLIs) are structured, what the Definition of Done (DoD) of BLIs is, and how exactly the refinement process should work. A Scrum team has to manage these aspects itself. This flexibility is another reason why Scrum is so successful: Scrum is used in various domains, supplemented with domain-specific processes or artefacts [3].
Scrum is also well established in R&D [4]. As described above, the success of an R&D Scrum team depends on how efficiently the team acquires and uses knowledge. In Scrum, so-called spikes have become established for knowledge acquisition. Spikes are BLIs in which knowledge about the feasibility and cost of product features is gained without actually realising the features. In this article we want to show what is important when implementing spikes in Scrum and how a Scrum team can ensure that the knowledge gained from spikes is used optimally. We illustrate these good practices with concrete examples from 1.5 years of Scrum in a research project with EnBW.
The idea of spikes originates from eXtreme Programming (XP) [5].
Acquiring knowledge through spikes serves to reduce excessive risk [6]. The proportion of spikes in the backlog should therefore be proportional to the current risk in product development. In an R&D Scrum team there may be many open questions and risks, and spikes may account for more than half of a sprint, as in the Google AdWords Scrum team [7]. However, an excessive proportion of spikes inhibits a team’s productivity, because too much time is spent gaining knowledge and too little implementing the product [8]. Correctly prioritizing spikes against regular BLIs is therefore an important success factor for R&D Scrum projects.
Another important factor is the definition of quality criteria for spikes. Specifically, a Scrum team can establish a Definition of Ready (DoR) and a Definition of Done (DoD) for spikes. The DoR defines what criteria a spike must meet to be processed; the DoD defines what criteria a spike must meet to be done. Both definitions influence the quality and quantity of the generated knowledge and its usability in the Scrum process. The DoD is a mandatory part of Scrum. On the other hand, the introduction of a DoR is at the discretion of the Scrum team and is not always useful [9].
The reason for a spike is almost always an acute and concrete risk in product development. However, the lessons learned from a spike can be valuable beyond the specific reason for the spike and can help inform product development decisions in the long term. To have a long-term effect, the team needs to store the knowledge gained from spikes appropriately.
The use of spikes in R&D projects therefore raises at least three questions:
How are spikes prioritised against each other and against other BLIs?
How are DoR and DoD formulated for spikes?
How is the knowledge gained from spikes stored?
The literature on Scrum leaves these questions largely unanswered. In a multi-year R&D research project with EnBW, we tried out various answers to these questions and gained experience with them. In the process, we developed two good practices, which we present in this blog post: (1) a recurring meeting for “spike maintenance”, i.e. for sharpening concrete hypotheses in spikes, and (2) a lightweight lab book for logging findings. We also supplement the two practices with a suitable DoR and DoD.
Good Practice: A Recurring “Spike Maintenance” Meeting for the Backlog
Refinement is an ongoing activity of the whole Scrum team. BLIs from the backlog are prepared for processing: they are reformulated, divided into independent BLIs, and broken down into specific work steps. Many Scrum teams hold regular meetings to refine the backlog and build a shared understanding of it.
In our experience, collaborative refinement is a common source of spikes, because unresolved issues and risks come to light when the team discusses the backlog. Unresolved risks in a BLI often manifest as difficulty in defining a concrete DoD and as widely varying effort estimates among team members. Specific phrases used by team members may also indicate risks, e.g.:
“I don’t know if it’s possible.”
“I don’t know how to do it yet.”
“I’ll have to find out first.”
Once a risk has been identified, the team must decide whether it is large enough that it should be reduced with a spike. The spike is then created in the backlog as a BLI draft. To keep the regular refinement meeting within its timebox, we recommend not refining and prioritizing newly created spikes in that meeting. Instead, we recommend that the team meet regularly one to two days after the regular refinement to perform “spike maintenance.” In the meantime, spike drafts can be assigned to individual team members, who prepare for the spike maintenance meeting, e.g. by collecting concrete hypotheses and ideas for experiments.
Of course, spikes can also be created outside of a joint refinement meeting, e.g. while working on a BLI. Even then, it is a good idea to collect the spikes as drafts and sharpen them at the next spike maintenance meeting. This way, the spikes are not forgotten but also do not lead to scope creep in the current sprint.
In the spike maintenance meeting, the collected spikes are refined and prioritized in the backlog. In our experience, most spikes are directly related to one or more BLIs, namely those whose implementation poses acute risks. In this case, prioritization is simple: use the existing prioritization of the BLIs and place each spike directly before its associated BLI.
We suggest a lightweight lab book for storing the knowledge gained from spikes. The lab book documents the following aspects for each spike, in one or two sentences each:
Problem definition: What acute risk in product development needs to be mitigated? What questions need to be answered?
Hypotheses: hypotheses formulated as concretely as possible, so that each can either be researched or tested experimentally.
Findings: Summary of the experimental or research results.
Consequences: The significance of the findings for the product. E.g. “Product characteristic X can be achieved with Y, but not with Z”.
Links: relevant links, e.g. to detailed experimental results and associated BLIs.
Completeness and compression are crucial for the long-term added value of the lab book. Completeness means that the results of all spikes end up in the lab book; compression means that only the essential findings and their consequences for the project are recorded, briefly and concisely. At this level of quality, the lab book can serve as a point of reference in discussions.
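A filled-in entry might look like this (the content is invented purely for illustration):

Problem definition: It is unclear whether data source X is accurate enough for product feature Y. Can we reach the required forecast quality with it?
Hypotheses: (H1) Models trained on X reach at least the quality of the current baseline. (H2) The remaining gap can be closed by adding feature Z.
Findings: H1 confirmed in three experiments; H2 rejected, as feature Z brought no measurable improvement.
Consequences: Feature Y can be built on data source X; further engineering of feature Z is not worth pursuing.
Links: experiment notebooks, BLI-123.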
In the project with EnBW, we implemented the lab book as a page in the team’s project wiki. We sorted the entries on the page in descending chronological order, i.e. the findings from the most recent spikes were at the top. We used the lab book for about a year and noticed no scaling problems. The lab book was highly valued as a source of knowledge within the team and was regularly consulted in discussions.
This brings us to our final recommendation: a concrete definition of the DoR and DoD for spikes, in line with the two previous good practices.
The DoR specifies which criteria spikes must fulfil before they can be included and worked on in a sprint. As mentioned, the DoR is not part of Scrum, but a commonly used extension. Some Scrum experts are critical of the DoR because it can lead to important BLIs not being included in a sprint for “aesthetic reasons” [10]. We therefore propose a compromise: important BLIs are always included in the next sprint, but the person working on a BLI is responsible for ensuring that it fulfils the DoR before work begins. This means fulfilling the DoR criteria is the first step of any BLI.
Our DoR proposal for spikes is simple: before a spike is processed, the lab book entry for the spike must be filled in as precisely as possible. In particular, the hypotheses to be investigated must be listed. In the EnBW team, we have had good experience with one team member formulating the hypotheses before working on the spike and then having them critically reviewed by another team member. Once the hypotheses are complete, understandable, and clearly defined, work on the spike can begin.
The lab book entry provides a framework for the spike. This is similar to the well-known Test-Driven Development (TDD) in software development. Unit tests are first used to define what the software should do, and then it is implemented. The creation of a lab book entry before each spike could be described by analogy as hypothesis-driven learning (HDL): first, hypotheses are used to define what is to be learned, then the hypotheses are tested.
The analogy between TDD and HDL is also useful in describing the advantages of HDL. Just as TDD provides a constant incentive to reduce software complexity, HDL provides an incentive to keep experiments simple and focused. And just as the programmed tests in TDD replace much of the written documentation of software, the lab book entries in HDL replace much of the written documentation of knowledge from spikes.
A spike’s DoD is at least as important as its DoR. It determines which criteria must be met for the spike to be considered complete. These criteria, too, can be derived from the spike’s lab book entry. Specifically, we propose that a spike is done when all hypotheses listed in its lab book entry have been confirmed or rejected, the lab book entry has been written, and the results have been communicated to the team.
In this blog post, we have made a concrete proposal for using spikes in R&D Scrum projects. Our proposal enables an R&D Scrum team to assess the feasibility of innovative product features in time, before the features are implemented, and with reasonable effort.
At the core of our proposal are two good practices that have proven successful in our own R&D Scrum projects: first, a recurring spike maintenance meeting, and second, a lightweight lab book for logging findings. We have shown how both practices can be integrated into Scrum using a simple DoR and DoD.
With these two good practices, we fill a gap in the Scrum literature: there are hardly any concrete practices for the creation, prioritisation, quality assurance, and long-term use of spikes. The proposed practices are simple and general enough to be applicable and useful for most R&D Scrum teams. We would be delighted if other Scrum teams drew inspiration from them.