I have all the data in the world…now what? InterOp Las Vegas

The Big Data Workshop at InterOp Las Vegas wrapped up the morning with a presentation on Big Data requirements by John West, CTO and Founder of Fabless Labs. John kicked off with the challenge of having your enormous data set all ready to work with, only to discover any one of the following problems:

  • Your network is too slow to handle the extra load
  • Hadoop has created a crater where your virtualized storage array used to be
  • Map/Reduce programming is slow and hard
  • At a large scale, math is really hard
  • It takes two days to load your big data cluster each week
  • Tuning some of the queries is a bear
  • The data is corrupt
  • Your Hadoop queries seem to be very network intensive
  • You failed your security audit

These aren’t a stretch: John emphasized that he encounters one or more of these challenges on a regular basis when working on Big Data projects. The truth of the matter is that Big Data is quite complicated.

John continued to expand on the simple facts that:

  • Analytics are amazing but action is usually required (OK, always)
  • Big data deployments have implications for the rest of your environments
  • Current big stacks have a lot of components
  • Hadoop is not generally a system of record. It is a data processing environment. Data has to be moved in and out to be useful (a sketch of this follows the list).
  • Hadoop is not quite like your existing data warehousing cluster database
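
Since Hadoop is a processing environment rather than a system of record, moving data in and out is a routine step. The sketch below is a minimal, hypothetical illustration that wraps the standard hdfs dfs shell commands from Python; the paths and file names are assumptions for illustration, not something John showed.

    import subprocess

    def load_into_hdfs(local_path, hdfs_dir):
        """Copy a local extract into HDFS so a Hadoop job can process it."""
        subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
        subprocess.run(["hdfs", "dfs", "-put", local_path, hdfs_dir], check=True)

    def export_results(hdfs_path, local_path):
        """Pull aggregated results back out for downstream systems of record."""
        subprocess.run(["hdfs", "dfs", "-get", hdfs_path, local_path], check=True)

    # Hypothetical paths; the actual layout depends on your cluster.
    load_into_hdfs("daily_extract.csv", "/data/incoming/2013-05-08")
    # ... run the MapReduce or Hive job here ...
    export_results("/data/results/part-r-00000", "results.csv")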

What is the deal with NoSQL?

He next gave the audience a great overview of NoSQL, Big Data’s response to unstructured and semi-structured data. West described NoSQL as “The data workhorse of Big Data.” He went on to describe it as:

  • Open source, horizontally scalable, distributed database system
  • Database platform that doesn’t adhere to the RDBMS standard
  • Designed for enormous datasets where retrieve and append operations are the norm
  • Data is sharded and replicated, no single point of failure
  • Key-value store (see the sketch after this list)
  • Hive, HBase, and similar tools give database semantics
  • In-memory systems
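
To make the “sharded and replicated key-value store” idea concrete, here is a toy sketch. It is not the API of any particular NoSQL product; it only illustrates hashing a key to a primary shard and writing copies to replica shards.

    import hashlib

    class TinyKeyValueStore:
        """Toy sketch of a sharded, replicated key-value store (not a real NoSQL engine)."""

        def __init__(self, num_shards=4, replicas=2):
            self.replicas = replicas
            self.shards = [dict() for _ in range(num_shards)]

        def _shard_ids(self, key):
            # Hash the key to pick a primary shard, then write copies to neighbouring shards.
            h = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
            primary = h % len(self.shards)
            return [(primary + i) % len(self.shards) for i in range(self.replicas)]

        def put(self, key, value):
            for shard_id in self._shard_ids(key):
                self.shards[shard_id][key] = value

        def get(self, key):
            for shard_id in self._shard_ids(key):
                if key in self.shards[shard_id]:
                    return self.shards[shard_id][key]
            return None

    store = TinyKeyValueStore()
    store.put("user:42", {"name": "Ada", "visits": 17})
    print(store.get("user:42"))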

The number one issue of Big Data

In West’s view, the single biggest problem of Big Data is ‘data provenance’, which includes the following challenges:

  • How is data stored, and how are changes tracked over time?
  • How is the data secured?
  • How will data issues be investigated?
  • Recording information about the data at its birth is not useful unless this information can be interpreted and carried along through the data analysis pipeline (see the sketch after this list)
  • If one of your key products is to “crunch data” and derive or extract value from it then you should be concerned about data provenance
  • This is true whether you are crunching your own data or third-party data
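
As a hedged illustration of carrying provenance along the analysis pipeline, the toy sketch below tags each record with where it came from and what was done to it. The field names and pipeline steps are assumptions for illustration, not anything West prescribed.

    import datetime

    def tag_provenance(record, source, transformation):
        """Attach or extend a provenance trail as a record moves through a pipeline."""
        trail = record.setdefault("_provenance", [])
        trail.append({
            "source": source,
            "transformation": transformation,
            "timestamp": datetime.datetime.utcnow().isoformat(),
        })
        return record

    # Hypothetical pipeline steps; names are illustrative only.
    record = {"customer_id": 42, "spend": 310.0}
    record = tag_provenance(record, source="pos_feed", transformation="ingest")
    record["spend"] *= 1.08  # e.g. a currency adjustment
    record = tag_provenance(record, source="pos_feed", transformation="currency_adjustment")
    print(record["_provenance"])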

This was an excellent session with a great amount of detail around Big Data requirements. John can be reached at John@FablessLabs.com.

John’s sample Big Data architecture:

Fabless Labs Big Data Environment Example

Categories: Data Analytics / Big Data

Author: Chris Taylor

Reimagining the way work is done through big data, analytics, and event processing. There's no end to what we can change and improve. I wear myself out...

5 Comments on “I have all the data in the world…now what? InterOp Las Vegas”

  1. May 8, 2013 at 10:08 am

    Not a very balanced overview, Chris. As with your other anti-big-data posts, this one has quite a few assertions that are off-base. To pick just a couple:

    “Map/Reduce programming is slow and hard” – um, why aren’t you using Pig, Jaql, or Python, visual tools that work like spreadsheets, or simply something like Cognos or R? Why not mention that SQL on Hadoop is becoming commonplace?

    “The data is corrupt” – Oh really, what did you let happen for that to be the case? Is data magically never corrupt in traditional systems if you use worst practices?

    “It takes two days to load your big data cluster each week” – Why? It appears you are doing something wrong then. Suggests you don’t actually know how to architect these solutions. And are you saying that EVERY job takes two days OR are you simply looking for problems to put forth?

    “Big data deployments have implications for the rest of your environments” – Um, OK… Any new system would, your point being?

    “Current big stacks have a lot of components” – Yes, and they mostly all install at one time and can be up and running in hours. If you are struggling with that, it suggests you don’t know the technology. That there are several components is because they provide flexibility not available elsewhere – this is a strength.

    “Data has to be moved in and out to be useful.” Completely and totally FALSE.

    “Hadoop is not quite like your existing data warehousing cluster database”, No, and that is why it is useful.

    • May 14, 2013 at 10:03 am

      Hi Tom,

      Wanted to give you a thoughtful response and, having a day job, I had to wait for a break in my schedule. As someone who is doing such projects now, I wanted to give folks a more realistic view of the topic rather than a naive one.

      So, I will address your points one at a time.
      “MapReduce programming is slow and hard” – I stand by this as a current practitioner. Even though Pig Latin exists and is a useful tool, when data is complicated or involves many join steps, the amount of code and number of jobs running can be quite large, particularly if the query has performance implications and requires techniques such as map-side joins.
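
      For anyone unfamiliar with the technique, a map-side join loads the small table into memory on each mapper and joins as records stream past, avoiding a shuffle for the join itself. A rough, hypothetical sketch of such a Streaming mapper (the file names and record layout are made up purely for illustration):

          #!/usr/bin/env python
          # Toy Hadoop Streaming mapper doing a map-side (replicated) join.
          # "countries.tsv" is a hypothetical small lookup table shipped to every
          # node, e.g. via the distributed cache; the record layout is made up.
          import sys

          # Load the small dimension table into memory once per mapper.
          lookup = {}
          with open("countries.tsv") as f:
              for line in f:
                  code, name = line.rstrip("\n").split("\t")
                  lookup[code] = name

          # Each stdin line is a hypothetical fact record: user_id<TAB>country_code<TAB>amount
          for line in sys.stdin:
              user_id, code, amount = line.rstrip("\n").split("\t")
              print("\t".join([user_id, lookup.get(code, "UNKNOWN"), amount]))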

      “Current big data stacks have a lot of components” – While installing the components is somewhat straightforward, installation is by far the smallest element of time spent in handling the components. Each has to be set up, monitored, programmed, debugged, etc. In comparison to an RDBMS environment, there are way more moving parts in a big data implementation. That does not mean that the parts are bad, but organizations should take into account the extra complexity required in big data.

      “Data has to be moved in and out to be useful” – I am not sure why this is even a question. Any data that will be processed in Hadoop has to be moved there, does it not? Getting the required volume of data and updates into a big data cluster has been the longest-lead item on most of the big data efforts I have been involved in. When you say this is completely false, what do you mean? Does the data teleport there or move by magic? Not sure why my comment about this would draw such a sharp response. Hadoop is often used to do data mining and/or data mart construction using some means of aggregation. This requires the models, coefficients, or something to be moved out to use the information found in the big data store.

      We seem to agree on the point that Hadoop is not a data warehouse. It is useful precisely because it is not the thing we had before. I work in big data quite a bit, including in-memory for some sorts of processing. Like all newish technologies, some of the final packaging is left up to the user of the system to complete. And that takes time and skilled resources to do.

      Don’t get me wrong. I love working with the newer stuff. It is really useful for many types of data and some types of analysis.

      • May 14, 2013 at 11:03 am

        Thanks for your response, John. You’re one of the smartest people I know and it is good to see you weigh in on Tom’s comments.

  2. May 8, 2013 at 1:21 pm

    Tom, I appreciate the IBM view of the world that their products are the best, but we both know that there are many use cases and many software products out there.

    I appreciate your partisan spirit but let’s keep the conversation positive.

  3. May 14, 2013 at 8:04 pm

    Hi John – thanks for an actual response on the topic. Unfortunately this still has some pretty major errors and errors of omission. To pick up a couple of them:

    John says “MapReduce programming is slow and hard.” Depends on how experienced you are, I guess. But why not mention that you have visual tools like BigSheets and Datameer, other options like Jaql, and even different ways to organize the data like Hive and HBase that have web services APIs? And, of course, there are now a number of SQL options as well, which seems kind of important to mention. Why not explain that Hadoop provides a way to execute Python (and other scripts) via the Streaming API?
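
    To make the Streaming point concrete, a mapper can be a plain Python script. The word-count sketch below is a toy example; the jar name and HDFS paths in the comment are assumptions that vary by distribution.

        #!/usr/bin/env python
        # Toy word-count mapper for Hadoop Streaming: emits "word<TAB>1" per word.
        # A matching reducer would sum the counts per word. Illustrative invocation
        # (jar name and paths are assumptions):
        #   hadoop jar hadoop-streaming.jar -input /data/text -output /data/wordcount \
        #     -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py
        import sys

        for line in sys.stdin:
            for word in line.split():
                print("%s\t%d" % (word, 1))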

    For every person writing low-level Java MapReduce you should have many (possibly hundreds or more) not having to do that. On balance, the VAST majority of people leveraging Hadoop shouldn’t have any interaction approaching that complexity. Same dynamic as the ratio of people tuning an index in a database vs. users… Pretty serious error of omission not to point that out.

    John says “In comparison to an RDBMS environment, there are way more moving parts in a big data implementation.” No, not necessarily. If you look at using Hadoop to store mixed data and do the analytics in place, it can be simpler. Compare Hadoop and Mahout to an RDBMS, an ECM system, a file system, and a separate math environment like SAS. Doing it all in Hadoop is actually simpler in that case. For the record, we do this sort of thing all the time, so this isn’t theoretical.

    Besides, do you think managing a large-scale partitioned HA database farm is easy? Would you even try to let non-experts have at that? Hadoop at the same scale is easier than that, and most clusters need only occasional admin attention.

    John says “Data has to be moved in and out to be useful.” You said “and out”, and that is simply untrue. While some Hadoop use cases are ELT (notice, not ETL), not all are; indeed most are not. That means data can be leveraged or referenced in place. There is a big difference between moving data and referencing it. And yes, you do need to load data, but if you are not able to do that efficiently you designed the system wrong, are using the wrong tools, designed the job badly, or are doing some sort of big-bang approach that you probably shouldn’t be doing.

    John says “This requires the models, coefficients, or something to be moved out to use the information found in the big data store.” Not true, John. First, most of the major modeling packages have an ability to reference data from Hadoop for jobs. Referencing is NOT the same as moving. Second, going forward nearly all of the major modeling tools will have an ability to push modeling into Hadoop for execution.

    As I’ve written elsewhere, I’m all in favor of not getting caught up in the hype, and I do my part to provide real-world level sets, but being overly negative or biased is just as wrong as baseless hype. In fact I recorded a vendor-neutral podcast on this back in January since it was becoming such nonsense (pretty much predicted some of this, unfortunately): http://www.ibmbigdatahub.com/podcast/backlash-against-big-data-backlash and a post here: http://ibmdatamag.com/2013/03/getting-past-the-big-data-hype-and-backlash/

    Anyway, good to have an actual dialog and I’d welcome a chance to talk live.
