On March 28 and 29, the fourth edition of the Data Warehousing and Business Intelligence Summit took place. Concepts such as Fast data, the Internet of Data, analytics, Data Vault, data virtualization, the logical data warehouse and even logical data lakes were all discussed, but there was a lot of focus on the how: how do you consolidate existing data warehouses and data marts? How do you integrate fast and big data with the enterprise data warehouse? How do you get started with data virtualization? How do you implement a data delivery platform? How do you apply analytics successfully? And how do you do it all in an agile way? Thus, the focus of this well-attended summit was mainly on practice.
The fourth phase: Fast data
According to chairman of the day and speaker Rick van der Lans, after the classic data warehouse (first phase), self-service BI (second phase) and big data (third phase), we are now moving toward a fourth phase of Business Intelligence: the phase in which we also apply analytics to huge amounts of streaming data generated by sensors and weblogs. Of course, analyzing streaming data is nothing new; plenty of industrial companies already have the necessary experience with it. But the amount of data, the speed at which we want to analyze it and the complexity of the analyses have increased dramatically. Added to this is the challenge that we now want to use this data not only for very specific applications, but also to make it widely accessible and combine it with data from traditional Business Intelligence environments.
In his presentation, Rick gives an overview of the different types of products that can serve as puzzle pieces in the implementation of a Fast data architecture: products for transporting data (e.g., Apache Kafka and Flume), storing it (e.g., HDFS, HBase, NoSQL and NewSQL databases), analyzing it (e.g., Apache Storm, Spark Streaming), mining it (e.g., MOA, SAMOA), and monitoring and managing it (e.g., Apache NiFi, Hortonworks DataFlow). But he particularly addresses the considerations that should play a role in architecture choices. Do you want to hold on to the huge amount of data, and if so, for how long? Where will you perform the analytics: centrally, or close to the source? Do you want to embed analytics in your operational processes? How will you combine this data and these analytics with data from your traditional systems, and where will you do that? These are considerations you need to think through carefully in advance, with a good assessment of feasibility, so as not to run into problems.
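To make one of those puzzle pieces tangible, the sketch below shows how a Spark Structured Streaming job might consume streaming sensor data from Kafka and aggregate it close to the source before anything is combined with the traditional warehouse. This is a minimal illustration, not something shown at the summit; the topic name, broker address and message schema are assumptions, and it presumes a Spark installation with the Kafka connector available.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window, avg
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("fast-data-sketch").getOrCreate()

# Schema of the hypothetical sensor messages.
event_schema = StructType([
    StructField("sensor_id", StringType()),
    StructField("value", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Read the stream from Kafka (topic name and broker address are assumptions).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "sensor-events")
       .load())

# Parse the JSON payload and aggregate per sensor over 1-minute windows,
# i.e. perform the analytics close to the source instead of storing everything first.
events = (raw
          .select(from_json(col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))
windowed = (events
            .withWatermark("event_time", "10 minutes")
            .groupBy(window(col("event_time"), "1 minute"), col("sensor_id"))
            .agg(avg("value").alias("avg_value")))

# Write the aggregates to the console; in practice they could land in HDFS/HBase
# or be pushed on to the data warehouse environment.
query = (windowed.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()
```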
The central role of architecture
During his first presentation, Mark Madsen also discusses the crucial role architecture plays in the challenge of harnessing all types of data for an organization's goals. Those goals, and not the technology, are what should be central to an architecture. Hadoop is not an architecture; it is a set of technologies. Moreover, the problem is often not the technology itself, but how it is applied for a particular purpose. So we need to focus primarily on how the data is applied, and less on the data or the technology itself. In his presentation, Mark provides a generic model for applying data, which consists of the following steps:
1. Monitoring the data
2. Analyzing exceptions
3. Analyzing the causes of these exceptions
4. Making decisions based on the data
5. Taking actions based on those decisions
In step 5, a distinction is made between intervening as part of a process and taking action based on the results of a process. Actions taken as part of a process are predefined, and thus predictable; consequently, they are usually executed immediately or within the same day. This is different for actions taken on the basis of the results of a process: these often require additional analyses, which in turn often require new data.
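As a purely illustrative reading of this model (the threshold values, function names and data are assumptions, not part of Mark's presentation), the sketch below loops through incoming measurements, flags exceptions, and distinguishes predefined in-process actions from exceptions that are handed off for further analysis.

```python
EXPECTED_RANGE = (10.0, 90.0)  # what "normal" looks like for this metric

def monitor(measurements):
    low, high = EXPECTED_RANGE
    for value in measurements:                 # step 1: monitoring the data
        if not (low <= value <= high):         # step 2: analyzing exceptions
            handle_exception(value)

def handle_exception(value):
    # Steps 3-5: a predefined, in-process action can be executed right away;
    # anything else is queued for additional analysis (which may need new data).
    if value > 200.0:
        print(f"predefined action: shut off sensor feed (value={value})")
    else:
        print(f"queued for root-cause analysis (value={value})")

monitor([42.0, 95.5, 250.0, 60.1])
```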
The major role of analytics
In his presentation, Professor Bart Baesens discusses the increasing use of analytics: identifying patterns or generating mathematical models based on a prepared data set. Fraud detection, social network analysis, predicting churn, determining credit risk, website optimization, customer segmentation and predicting customer value are telling examples. He covers the steps to be taken in any analytics process and elaborates on some concrete examples.
In his highly interactive presentation, he devotes a lot of attention to the key success factors for applying analytics. One of those success factors is the trust the business side needs to have in the model developed by the analyst. His advice here is to start with simple models so that this trust can be built. A second success factor is the operational efficiency with which the model can be applied. Although the analyst will be inclined to go for the highest-scoring model, it is important to keep a constant eye on the costs involved and on how well the model can be embedded in existing processes.
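A minimal sketch of what "start with simple models" can look like in practice is given below: an interpretable logistic regression for churn prediction whose coefficients can be explained to the business before moving on to more complex techniques. The data is synthetic and the feature names are purely illustrative; none of this comes from Baesens' presentation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for a prepared data set with a binary churn label.
rng = np.random.default_rng(0)
n = 1000
tenure = rng.uniform(1, 60, n)          # months as a customer
support_calls = rng.poisson(2, n)       # calls to the helpdesk
# Churn probability rises with support calls and falls with tenure (illustrative only).
p_churn = 1 / (1 + np.exp(-(0.5 * support_calls - 0.05 * tenure)))
churned = rng.binomial(1, p_churn)

X = np.column_stack([tenure, support_calls])
X_train, X_test, y_train, y_test = train_test_split(X, churned, test_size=0.3, random_state=42)

# A simple, interpretable model: each coefficient can be explained to the business.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
for name, coef in zip(["tenure_months", "support_calls"], model.coef_[0]):
    print(f"{name}: {coef:+.3f}")       # a business-readable sign and magnitude per feature
```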
In successfully applying analytics, there are two gaps that must be bridged. The first gap is between the data and the data scientist: the data is sometimes unstructured, may be scattered across different systems, may contain errors and may change over time, while the data scientist is looking for patterns, statistical significance and predictability in the model. The second gap is between the data scientist and the business expert: whereas the data scientist must also have a strong focus on the statistical aspects of the model, the business expert is completely focused on the areas in which the model will be applied. In bridging these two gaps, the professor points out, data visualization can play a major role. On the one hand, it helps the data scientist gain more insight into the data; on the other hand, it helps communicate the developed model to the business expert, not in the form of a complex formula, but, for example, as a table or graph.
Mark Madsen also reflects on the use of analytics in his closing presentation, offering the necessary "food for thought" by drawing parallels between analytics and art. After all, both involve an abstraction of reality, and both involve choosing which perspective, or perspectives, you take. In addition, he amusingly recounts his search for the origins of the beer-and-diapers myth, which is often cited as one of the best-known examples of an early application of analytics.
Caveats to the data lake
Both Mark Madsen and Rick van der Lans address the hype surrounding data lakes. First, data lakes by no means solve all the data-related problems that organizations currently face. As with data warehouses, the governance issues of a centralized solution come into play here: who decides what data is allowed in, who gets access to what information, and how do we guarantee availability? Compared with a data warehouse, by applying schema on read (sketched in the example below) you may not have the challenges of a centrally imposed model; after all, such a schema-on-read approach is extremely flexible. But at times you also want to be able to achieve a certain repeatability and data quality.
Second, even if you consider a data lake purely as a playground for data scientists, the question is whether that is always the best solution. Data scientists themselves have never asked for a data lake; they have only asked for access to as much data as possible. And it is doubtful whether all possible data can be stored in the data lake. For example, is it feasible to start storing all Fast data? Some data is simply "too big to move," and then it may be necessary to perform the transformations and analysis where the data is produced.
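To make the schema-on-read idea from the first caveat concrete, here is a minimal sketch, assuming PySpark and purely illustrative paths and field names: the raw JSON lands in the lake untouched, and a structure is imposed only at the moment an analysis reads it.

```python
import json
import os
import tempfile
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

# Stand-in for a folder of raw files in the lake, written without any model up front.
lake_dir = tempfile.mkdtemp()
with open(os.path.join(lake_dir, "clicks.json"), "w") as f:
    for rec in [{"user_id": "u1", "page": "/home", "duration_seconds": 12.5},
                {"user_id": "u2", "page": "/home", "duration_seconds": 3.0}]:
        f.write(json.dumps(rec) + "\n")

# The structure this particular analysis cares about; another analysis could read
# the very same files with a different schema.
clickstream_schema = StructType([
    StructField("user_id", StringType()),
    StructField("page", StringType()),
    StructField("duration_seconds", DoubleType()),
])

# The schema is applied only now, at read time.
clicks = spark.read.schema(clickstream_schema).json(lake_dir)
clicks.groupBy("page").avg("duration_seconds").show()
```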
Data virtualization to the rescue
Data virtualization technology, Rick van der Lans argues, can be used to give the data scientist actual access to all information in a unified way. This brings all sorts of benefits. It allows the data scientist to access all data in a uniform manner, whether it is stored in a data warehouse or a data lake, or linked or pushed directly from the source. It also reduces the need to store the data outside the original source, which avoids all sorts of security and compliance issues. And by combining multiple local data virtualization servers with a central data virtualization server, you can even push the transformations down to the local data, avoiding the need to pump over (too) large amounts of data.
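The pushdown idea can be made concrete with a small, hypothetical sketch (an in-memory SQLite database stands in for a local source; the table and column names are assumptions, and no specific data virtualization product is implied): the aggregation runs where the data lives, so only the small result set travels to the central layer.

```python
import sqlite3

local_source = sqlite3.connect(":memory:")   # stand-in for a local data source
local_source.execute("CREATE TABLE events (sensor_id TEXT, value REAL)")
local_source.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("s1", 10.0), ("s1", 14.0), ("s2", 7.5), ("s2", 8.5)],
)

# Pushed-down query: the heavy lifting happens at the source...
aggregated = local_source.execute(
    "SELECT sensor_id, AVG(value) AS avg_value FROM events GROUP BY sensor_id"
).fetchall()

# ...so the central virtualization layer only receives one row per sensor.
print(aggregated)   # e.g. [('s1', 12.0), ('s2', 8.0)]
```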
That data virtualization is not an immature technology is made clear during the DWH and BI Summit by several practical cases in the presentations by Erik Fransen of Centennium, Jos Kuijper of Volkswagen Pon Financial Services and Kishan Shri of Erasmus MC.
This is how you do it!
During the summit there is also attention to concepts whose application is still in its infancy. In his presentation, Pieter den Hamer discusses the 'Internet of Data', in which the combination of linked data and artificial intelligence (to determine or extract the ontology of a data set) lets us integrate and analyze data in a more natural and dynamic manner.
But above all, the summit gives visitors many practical tools. William McKnight's presentations, for example, provide plenty of tips and examples on how to consolidate different enterprise data warehouses and data marts into one environment and how to take a truly agile approach to Business Intelligence and data warehouse projects. These and other presentations and case studies give summit attendees a good idea of how to successfully handle data and analytics in an organization. This is how you do it!
