In a previous post, we walked through how to implement a custom Java transformation in Oracle Big Data Discovery. While that post focused on the technical details, this follow-up post will highlight a detailed use case for the transformations and illustrate how they can be used to augment an existing dataset.
In our first post introducing Oracle Big Data Discovery, we highlighted the data transform capabilities of BDD. The transform editor provides a variety of built-in functions for transforming datasets. While these built-in functions are straightforward to use and don't require any additional configuration, they are also limited to a predefined set of transformations. Fortunately, for those looking for additional functionality during transform, it is possible to introduce custom transformations that leverage external Java libraries by implementing a custom Groovy script. The rest of this post will walk through the implementation of a basic example, and a subsequent post will go in depth with a few real-world use cases.
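To give a flavor of what "leveraging an external Java library" means in practice: because Groovy interoperates directly with Java, a custom transform script can call into any class available on the classpath. The sketch below is a hypothetical helper class of the kind one might package into a library and invoke from a transform — the class and method names are our own illustration, not part of the BDD API.

```java
// Hypothetical helper class that a custom Groovy transform script could
// call into. Nothing here is part of the BDD API; it simply illustrates
// packaging reusable Java logic for use at transform time.
public class TransformHelpers {

    // Normalize a free-text attribute: trim, collapse internal
    // whitespace, and lower-case the value.
    public static String normalizeText(String raw) {
        if (raw == null) {
            return null;
        }
        return raw.trim().replaceAll("\\s+", " ").toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(normalizeText("  Big   Data  Discovery "));
    }
}
```

From a Groovy transform, a call like `TransformHelpers.normalizeText(myAttribute)` would then behave exactly as it does in Java, which is what makes the external-library approach so convenient.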
The most exciting thing about Oracle Big Data Discovery is its integration with all the latest tools in the Hadoop ecosystem. This includes Spark, which is rapidly supplanting MapReduce as the processing paradigm of choice on distributed architectures. BDD also makes clever use of the tried-and-tested Hive as a metadata layer, meaning it has a stable foundation on which to build its complex data processing operations.
We have been anticipating the intersection of big data with data discovery for quite some time. What exactly that will look like in the coming years is still up for debate, but we think Oracle's new Big Data Discovery application provides a window into what true discovery on Hadoop might entail.
OK, now it's time to show off. Check out some of the secret sauce Ranzal brings to the solution around unstructured data and custom visualizations...
Interested in understanding how cutting-edge healthcare providers are turning to data discovery solutions to unlock the insights in their medical records? Check out this real-world demonstration of what a recent Ranzal customer is doing to unlock a 360-degree view of their clinical outcomes leveraging all of their EMR data -- both the structured and unstructured information.
Coupling disparate data sets into meaningful "mashups" is a powerful way to test new hypotheses and ask new questions of your organization's data. However, more often than not, the most valuable data in your organization has already been transformed and warehoused by IT in order to support the analytics needed to run the business. Tools that neglect these IT-managed silos don't allow your organization to tell the most accurate story possible when pursuing its discovery initiatives. Data discovery should not focus only on the new varieties of data that exist outside your data warehouse. The value of social media data and machine-generated data cannot be fully realized until it can be paired with the transactional data your organization already stockpiles.
Adjectives like "agile" and "self-service" have long been used to describe approaches to BI that enable organizations to ask their own questions and produce their own answers. Applied to both processes and products, these labels are applicable any time an organization can relax the "IT bottleneck". Over the past decade, the core tenets of the Endeca vision ("no data left behind, ease of use, and agile delivery") have shaped a product that has empowered organizations to unlock insights in their enterprise data in ways never before possible while simultaneously reducing their reliance on IT to do so. Notice I said "reduce" their reliance, not "eliminate".
A fairly common approach...
More often than not, when pulling data from a database into OEID, we need to employ incremental updates. To introduce incremental updates, we need a way to identify which records have been added, updated, or deleted since our last load. This change identification is commonly referred to as change data capture, or CDC. There is no one way to accomplish CDC, and often the best approach is dictated by the mechanisms in place in the source system. Usually the database we're pulling from isn't leveraging any explicit CDC mechanism.
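When the source does track a last-modified timestamp on each row, one simple form of CDC is to remember the high-water mark from the previous load and pull only rows past it. Here is a minimal sketch of that idea; the table and column names (`orders`, `last_modified`) are illustrative assumptions, not taken from any particular source system.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Sketch of timestamp-based CDC: build an extract query that pulls only
// rows modified since the last successful load. Table and column names
// are hypothetical; real code should also bind the timestamp as a
// parameter rather than concatenating it into the SQL string.
public class IncrementalExtract {

    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Build the incremental query from the stored high-water mark.
    public static String buildQuery(LocalDateTime lastLoad) {
        return "SELECT * FROM orders WHERE last_modified > '"
                + FMT.format(lastLoad) + "'";
    }

    public static void main(String[] args) {
        LocalDateTime lastLoad = LocalDateTime.of(2015, 3, 1, 0, 0, 0);
        System.out.println(buildQuery(lastLoad));
    }
}
```

Note the trade-off: a timestamp comparison like this catches inserts and updates, but hard deletes leave no row behind to compare, so detecting them requires a different mechanism (soft-delete flags, audit tables, or periodic full reconciliation).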