Inside Analysis

The New Generation of Database Technology Includes Semantics and Search

David Gorbet, VP of Engineering for MarkLogic, chatted with Bloor Group Principal Robin Bloor in a recent Briefing Room. This is an excerpt from the conversation.

Robin Bloor: In my view, we have reached a situation where there will be multiple “data engines.” Is that MarkLogic’s view?

David Gorbet: Yes, just as the change from hierarchical to relational didn’t instantly eliminate all mainframe systems, the new generation of database technology will not completely replace the existing generation. It’s all about what technology is appropriate for your data and use case. If you’re using relational technology for something that it shouldn’t be used for because that’s all you have, then yes, we see MarkLogic replacing that; but if your data and use case are highly structured and you have a relational system running well, we don’t suggest ripping that out. There will always be multiple data engines in an organization, because there are business silos that build their own systems or acquisitions that result in new systems. That said, many of our customers have found it advantageous to leverage MarkLogic as a metadata store across these systems so that even though they have multiple data engines, they can still query everything they have.

RB: Specifically, are there data structures or database contexts for which MarkLogic is inappropriate?

DG: We always advocate using the right tool for the job. Because MarkLogic is so flexible, we can handle data with any degree of structure, but we’re not optimized for the small, highly structured records you often find in a relational database management system (RDBMS). If you have an RDBMS running that’s giving you what you need, then you shouldn’t replace it just for the sake of it. Of course, if you want to combine your relational data with unstructured data or more complex structured data, then it can make sense to ingest it into MarkLogic so that you can benefit from unified search and query across all relevant data. Think of it as a data warehouse, but where you get to search and query all data instead of just pre-computed aggregates along predefined dimensions.

RB: In your view, is the “age of the data warehouse” over?

DG: I think there will always be a role for data warehouse technology to provide business lines with the data they need to make decisions. I do think though that the concept of a single enterprise-wide data warehouse that can be used for company-wide reporting and discovery has proven to be much harder to implement and maintain than anticipated. Billions of dollars have been spent in this area and many organizations have little to show for it. It reminds me of an expression a friend used to use: “In theory there’s no difference between theory and practice, but in practice there is.” The reality is that the organizational environment is too dynamic and the schema design, data cleansing and ETL tasks required to make the enterprise data warehouse work are too complex and take too long for this approach to succeed. Organizations are starting to look at a more federated approach, combining metadata from various data silos so that they can query across these silos. Of course metadata is extraordinarily complex as well, so not every technology is up to the task. MarkLogic has several customers doing this, and our schema-agnostic design really helps here.

RB: Which sectors/businesses are currently in MarkLogic’s “sweet spot”?

DG: MarkLogic grew up with publishing, military and intelligence applications because of the complexity of the data those industries have and the importance of search as part of the discovery process for that data. In the last few years though, we’ve been seeing a lot of interest in the civilian public sector and in the financial services industry, where we now have several major global banks as customers. We’ve also seen a lot of interest in healthcare, and one of our more interesting projects is the CMS health insurance marketplace. Data in these industries is complex and hard to model using relational technologies, but they still require a database that provides all the mission-critical “run the bank” functionality they need. In addition, we’re seeing interest in other sectors such as consumer manufacturing, oil and gas, and basically anywhere organizations have highly complex mission-critical data.

RB: Data analytics involves much more than having analytical functions in the database. It is more than 50 percent data prep (merging, cleansing, joining, transformation, etc.). How does MarkLogic accommodate that?

DG: Yes, the only way to analyze completely unstructured data is via search, and we do that, but the secret to getting the most value out of your data is to pre-process it to give it some structure. For example, we have a customer who is monitoring products and adverse effects in social media. They use our content processing framework to do some text analytics on the data on the way into MarkLogic to mark up instances of product names and adverse effects. Once that’s done, we can do all kinds of analytics on that data. The most interesting is co-occurrence, where we show them which products are most often mentioned alongside which adverse effects, so they don’t have to stab in the dark with repeated searches to figure this out. With MarkLogic, you can enrich your data on the way in (through CPF), or you can enrich it when it’s already in the database through Hadoop MapReduce. The beauty is that you don’t have to do all the enrichment or cleansing up front.
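The co-occurrence analysis described above can be illustrated with a minimal sketch. It is plain Python, not MarkLogic's API, and it assumes the enrichment step has already tagged each post with the product names and adverse effects it mentions. The post structure and the names `ProductA`, `ProductB`, `headache` and `nausea` are hypothetical.

```python
from collections import Counter
from itertools import product

def co_occurrence(posts):
    """Count how often each (product, effect) pair is tagged in the same post."""
    counts = Counter()
    for post in posts:
        # Use sets so a pair counts once per post, however often it is repeated.
        for pair in product(set(post["products"]), set(post["effects"])):
            counts[pair] += 1
    return counts

# Hypothetical, already-enriched posts (made-up names, for illustration only).
posts = [
    {"products": ["ProductA"], "effects": ["headache"]},
    {"products": ["ProductA", "ProductB"], "effects": ["nausea"]},
    {"products": ["ProductA"], "effects": ["headache", "nausea"]},
]

counts = co_occurrence(posts)
for (prod, effect), n in counts.most_common():
    print(prod, effect, n)
```

Sorting the pairs by count surfaces the strongest product/effect associations in one pass, rather than through repeated individual searches.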

If for example you have two data streams with different element names for the same data element, you can still ingest these, still discover them (via search) and write your queries to combine results across them without having to clean the data. If you later want to make these consistent, you can do it without having to change a schema and re-ingest everything.
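As a rough illustration of querying across two streams that name the same field differently, the sketch below uses Python's standard XML library as a stand-in; it is not MarkLogic's query API. The element names `customerName` and `cust_name`, and the sample documents, are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names used by two feeds for the same field.
NAME_ALIASES = ("customerName", "cust_name")

def customer_name(doc_xml, aliases=NAME_ALIASES):
    """Return the text of the first alias element found below the root, or None."""
    root = ET.fromstring(doc_xml)
    for tag in aliases:
        node = root.find(f".//{tag}")
        if node is not None:
            return node.text
    return None

# Two made-up documents, ingested as-is with no cleanup or shared schema.
feed_a = "<order><customerName>Acme</customerName></order>"
feed_b = "<record><info><cust_name>Acme</cust_name></info></record>"

names = [customer_name(d) for d in (feed_a, feed_b)]
print(names)
```

The reconciliation lives in the query (the alias list) rather than in the stored data, so the documents can be normalized later, or never, without re-ingesting them.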

RB: What is MarkLogic’s attitude to the cloud? Specifically, where would it recommend cloud deployment?

DG: We’ve had customers running MarkLogic in the cloud for years. For public cloud, it’s usually Amazon, and generally customers use this for test and development but not for production. That’s their choice, and we’re perfectly happy to support cloud deployment for production applications. Lately we’ve seen an increase in interest in cloud capabilities, both for public cloud and private cloud. In MarkLogic 7, we’re beefing up our cloud story with new features that provide elasticity: the ability to grow and shrink your cluster in the cloud to respond to changing business conditions, all while keeping the database online and serving queries with high performance. We’ll be automatically rebalancing data across nodes as you grow and shrink your cluster, while maintaining transactional consistency of the data at all times. We have about 70 companies piloting MarkLogic 7 today via our Early Access program.

RB: Which companies/products do you regard as competitors/partners?

DG: Mostly in sales situations we see the relational players – primarily Oracle – because most customers already have an Oracle enterprise agreement and Oracle says they can do anything. That’s been consistent for the last ten years. What’s changed recently is that customers are getting much more savvy about the existence of new technologies like NoSQL databases that handle complex data much better than traditional RDBMS technology, so we’re spending less time educating customers about why they should consider a different technology and more time talking about what the criteria should be for this new system. Obviously enterprise customers don’t want to give up security, reliability, transactional consistency, etc., and we don’t think they should have to.

In the NoSQL world, there are many technologies, most relatively young. All of them have their place, but none of them have the enterprise capabilities we have, so in our enterprise customer accounts, we’re not seeing a lot of adoption for other NoSQL technologies beyond the pilot or small-scale departmental use. We’re keeping an eye on this though, because it’s a fast changing market. Having been around for more than ten years, we’re way ahead, but our strategy is to keep moving fast to bring new innovations to market.

There’s still a lot of opportunity to light up new scenarios for our customers. That’s why we’re excited about our semantics capabilities in MarkLogic 7. We believe that semantics technology is the next generation of search and discovery, allowing queries based on the concepts you’re looking for and not just the words and phrases. MarkLogic 7 will be the only database to allow semantics queries combined with document search and element/value queries all in one place. Our customers are excited about this.
