Chapter [ ]: Big Data
What is your definition of Big Data, and what is the largest size of data you have worked with? Did you parallelize your code?
While the term “big data” is relatively new, the act of gathering and storing large amounts of information for eventual analysis is ages old. The concept gained momentum in the early 2000s when industry analyst Doug Laney articulated the now-mainstream definition of big data as the three Vs:
Volume - Organizations collect data from a variety of sources, including business transactions, social media and information from sensor or machine-to-machine data. In the past, storing it would’ve been a problem – but new technologies (such as Hadoop) have eased the burden.
Velocity - Data streams in at an unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time.
Variety - Data comes in all types of formats – from structured, numeric data in traditional databases to unstructured text documents, email, video, audio, stock ticker data and financial transactions.
Much has changed since the early 2000s. We consider two additional dimensions when it comes to big data:
Variability - In addition to the increasing velocities and varieties of data, data flows can be highly inconsistent, with periodic peaks. Is something trending in social media? Daily, seasonal and event-triggered peak data loads can be challenging to manage, even more so with unstructured data.
Complexity - Today's data comes from multiple sources, which makes it difficult to link, match, cleanse and transform data across systems. However, it’s necessary to connect and correlate relationships, hierarchies and multiple data linkages or your data can quickly spiral out of control.
How do you work with large data sets?
If the answer is simply "Hadoop," it shows that the candidate's view of solving problems is extremely narrow.
Large data problems can be solved with a combination of (a short Python sketch follows the link below):
efficient algorithms
multi-threaded applications
distributed programming
http://blog.revolutionanalytics.com/2013/12/tips-on-computing-with-big-data-in-r.html
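As a minimal illustration of the first two points, the sketch below counts matching lines in a large file by reading it in bounded chunks and spreading the work across a pool of worker processes, rather than loading everything into memory. The file name access.log, the "ERROR" pattern and the chunk size are assumptions made purely for the example.

    import multiprocessing as mp

    def count_errors(chunk):
        # Work on one chunk of lines at a time so the whole file never sits in memory.
        return sum(1 for line in chunk if "ERROR" in line)

    def read_in_chunks(path, chunk_size=100_000):
        # Yield lists of lines of bounded size instead of reading the file at once.
        chunk = []
        with open(path) as f:
            for line in f:
                chunk.append(line)
                if len(chunk) == chunk_size:
                    yield chunk
                    chunk = []
        if chunk:
            yield chunk

    if __name__ == "__main__":
        # Spread the chunks over a pool of worker processes and sum the partial counts.
        with mp.Pool() as pool:
            total = sum(pool.imap_unordered(count_errors, read_in_chunks("access.log")))
        print(total)

The same idea carries over to distributed frameworks such as Hadoop or Spark once a single machine is no longer enough.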
Write a mapper function to count word frequencies (even if it's just pseudo code)
import sys

# Mapper: emit "<word>\t1" for every word read from standard input.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print('%s\t%s' % (word, 1))
Write a reducer function for counting word frequencies (even if it's just pseudo code)
import sys

# Reducer: sum the per-word counts emitted by the mapper.
word_count = {}

for line in sys.stdin:
    line = line.strip()
    # Parse the tab-separated "<word>\t<count>" pair produced by the mapper.
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # Skip lines whose count is not a number.
        continue
    word_count[word] = word_count.get(word, 0) + count

# Emit the aggregated count for each word.
for word in word_count:
    print('%s\t%s' % (word, word_count[word]))
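Assuming the two scripts above are saved as mapper.py and reducer.py (file names chosen here for illustration), the pair can be sanity-checked locally before running it under Hadoop Streaming, with sort standing in for Hadoop's shuffle-and-sort phase:

    cat input.txt | python mapper.py | sort | python reducer.py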
What is a cron job?
A scheduled task that runs in the background on a server.
cron is a Linux utility which schedules a command or script on your server to run automatically at a specified time and date. A cron job is the scheduled task itself. Cron jobs can be very useful to automate repetitive tasks.
For example, you can set a cron job to delete temporary files every week to conserve your disk space.
Scripts executed as a cron job are typically used to modify files or databases. However, they can perform other tasks that do not modify data on the server, like sending out email notifications.
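For example, the weekly temporary-file cleanup mentioned above could be scheduled with a crontab entry like the one below (the path and the seven-day retention are illustrative assumptions); the five leading fields are minute, hour, day of month, month, and day of week:

    # Every Sunday at 03:00, remove files under /tmp that are older than 7 days (illustrative).
    0 3 * * 0 find /tmp -type f -mtime +7 -delete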
How can you make sure a Map Reduce application has good load balance? What is load balance?
Load balancing spreads work evenly across free nodes when a node is loaded above its threshold level. Although load balancing is not always significant in the execution of a MapReduce algorithm, it becomes essential when processing large files and when use of hardware resources is critical: it improves hardware utilization in resource-constrained situations and yields a modest performance gain.
A module was implemented to balance disk space usage on a Hadoop Distributed File System cluster when some data nodes become full or when new empty nodes join the cluster. The balancer (the Balancer class) is started with a threshold value; this parameter is a percentage between 0 and 100, with a default of 10 percent. It sets the target for how balanced the cluster should be: the smaller the threshold, the more balanced the cluster will be, and the longer the balancer takes to run. (Note: the threshold can be so small that the cluster never reaches a balanced state, because applications may be writing and deleting files concurrently.)
A cluster is considered balanced if, for each data node, the ratio of used space at the node to the node's total capacity (known as the utilization of the node) differs from the ratio of used space in the cluster to the cluster's total capacity (the utilization of the cluster) by no more than the threshold value.
The module moves blocks from heavily utilized data nodes to underutilized ones in an iterative fashion; in each iteration a node moves or receives no more than the threshold fraction of its capacity, and each iteration runs for no more than 20 minutes.
In this implementation, nodes are classified as highly utilized, average utilized, or underutilized. Depending on each node's utilization rating, load is transferred between nodes until the cluster is balanced.
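The balance criterion can be sketched in a few lines of Python; the node statistics and the 10 percent threshold below are made-up values used only to illustrate the check described above.

    # Illustrative node stats: name -> (used bytes, capacity bytes); the numbers are made up.
    nodes = {"dn1": (80, 100), "dn2": (40, 100), "dn3": (10, 100)}
    threshold = 0.10  # 10 percent, the balancer's default

    cluster_used = sum(used for used, _ in nodes.values())
    cluster_capacity = sum(cap for _, cap in nodes.values())
    cluster_utilization = cluster_used / cluster_capacity

    for name, (used, cap) in nodes.items():
        node_utilization = used / cap
        # A node is balanced if its utilization is within `threshold` of the cluster's.
        balanced = abs(node_utilization - cluster_utilization) <= threshold
        print(name, "balanced" if balanced else "over/under-utilized")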
Is it better to have 100 small hash tables or one big hash table in memory, in terms of access speed (assuming both fit within RAM)? What do you think about in-database analytics?
The answer here is: it depends. Often the best way to compare performance is to implement it both ways and then compare using a profiler or high-resolution timers.
Generally speaking, if you already know which of the smaller tables to access, lookups can be quicker (assuming, for example, a C++ STL map, where lookup is a binary search over fewer elements). Even if the table is a straight hash lookup, a smaller table can still have an advantage in terms of the data cache: repeated accesses to the same tiny table incur fewer cache misses.
If you have to check a list of tables on every lookup, a single larger table will likely perform better.
So the answer depends on the overhead of finding the proper table and on the type of table.
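As a rough way to measure this in Python, the sketch below (table sizes, the bucketing scheme, and the number of probes are arbitrary assumptions) times lookups against one large dict and against 100 small dicts, where the second case pays for locating the right table on every lookup.

    import random
    import timeit

    N = 100_000
    keys = [f"key{i}" for i in range(N)]

    # One big table.
    big = {k: i for i, k in enumerate(keys)}

    # 100 small tables, bucketed by a simple hash of the key.
    small = [{} for _ in range(100)]
    for i, k in enumerate(keys):
        small[hash(k) % 100][k] = i

    probe = random.sample(keys, 1000)

    def lookup_big():
        return [big[k] for k in probe]

    def lookup_small():
        # Extra work: locate the right table before the actual lookup.
        return [small[hash(k) % 100][k] for k in probe]

    print("one big table:   ", timeit.timeit(lookup_big, number=1000))
    print("100 small tables:", timeit.timeit(lookup_small, number=1000))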