Data Driven Cultures

What drives data driven cultures...besides coffee?

In a conversation with a long-standing branding expert and travelista, sifting through #realtime #bigdata #analytics catered by @dataswft, he naturally looked for the warmer, human things that drive technologies such as Dataswft. To be 'data driven', he pointed out conclusively, is a culture. He was unaware of the Tableau-sponsored Economist report, "Fostering a data-driven culture", which went on to say, "IT security is indeed a job for experts, but data are everyone's business."

The creative guide pressed with a question: is Dataswft a technical thingy for the data driven, or is it enabling data-driven cultures? Time taken to act is a key metric for being data driven, and so being data driven is an everyday matter. Lots of small questions! Let reporting handle the big questions.

What is a small question? It is what a high-adrenaline digital campaign manager asks of the data as a campaign rolls out across multiple channels, to determine where the budget should be dynamically reallocated: every 15 minutes, every hour, or whenever he chooses to intervene. Small questions require touching lots of data, in this case upwards of 60 terabytes if you consider streaming data as well as historical behaviour. Real-time technologies serve these requirements.

Data driven is a culture, first, because of the questions. Providing answers instantly is what technology does next. With the technology in place, it is the culture that realizes the potential and pushes the envelope.

Consider an investment bank that needs to run value-at-risk calculations covering a host of financial products its clients have invested in, tuples of market price data spanning months, and hundreds of sophisticated risk models designed to predict risk against different scenarios. To this person, it is important to know how much money must be set aside against the dynamic risk and how much capital can be unlocked to earn. Calculations here can run to over 15 billion, and executing them in under 30 seconds can make a huge difference to these money managers.
Data-driven cultures are those we see at the top of the culture pyramid, crunching all their data by the second, minute and hour, not in some end-of-day or end-of-week event. Both reporting for overall annual KRA measurements, led by big questions, and everyday operations, led by small questions, require dealing with big data. But as one moves higher up the pyramid, the response times the culture will accept shrink from minutes and hours to mere seconds.

Put another way, data-driven cultures are those complementing their solely heuristic decision-making process (read: gut feel) with a data-driven approach; so, what do the data say? That is not just about the quantity of data, but also about a quality of it by which heuristic rules and human cognition are likely to be overwhelmed, given the volume, variety and velocity. Data-driven cultures, to survive, need real-time performance from their analytics, which powerfully complements the answers to all their questions, both big and small.

RocksDB Installation with JNI on CentOS 6.1

RocksDB is a way to leverage SSD hardware optimally. It is a way to decongest the network. However, its single-digit microsecond performance comes from simple C++ GET/PUT calls on a key-value structure; any more complex data operation requires custom logic implementation.

This blog is all about connecting to RocksDB from a Java application (a small usage sketch appears at the end of this post). It can also be done using the Thrift API.

Download the installation packages

Set the repository location and enable C++ repo.

cd /etc/yum.repos.d
wget http://people.centos.org/tru/devtools-1.1/devtools-1.1.repo
yum --enablerepo=testing-1.1-devtools-6 install devtoolset-1.1-gcc devtoolset-1.1-gcc-c++
export CC=/opt/centos/devtoolset-1.1/root/usr/bin/gcc
export CPP=/opt/centos/devtoolset-1.1/root/usr/bin/cpp
export CXX=/opt/centos/devtoolset-1.1/root/usr/bin/c++

Set the RocksDB home, download the RocksDB package from GitHub and unzip it.

export ROCKSDB_HOME=/mnt/softwares/rocksdb/rocksdb-master
export JAVA_HOME=/usr/java/jdk1.7.0_51
ls $JAVA_HOME/lib/tools.jar
export BASEDIR=/mnt/softwares/rocksdb
cd /tmp
wget https://github.com/facebook/rocksdb/archive/master.zip
unzip master
mv rocksdb-master $ROCKSDB_HOME
cd $ROCKSDB_HOME; pwd; ls
cd ..

Install gflags 2.0

wget https://gflags.googlecode.com/files/gflags-2.0-no-svn-files.tar.gz
tar -xzvf gflags-2.0-no-svn-files.tar.gz
cd gflags-2.0
./configure && make && sudo make install

Set up Snappy compression

cd ..
wget https://snappy.googlecode.com/files/snappy-1.1.1.tar.gz
tar -xzvf snappy-1.1.1.tar.gz
cd snappy-1.1.1
./configure && make && sudo make install
cd ..

Install other compression libraries: zlib and bzip2

sudo yum install zlib
sudo yum install zlib-devel
sudo yum install bzip2
sudo yum install bzip2-devel

Build RocksDB

export LD_LIBRARY_PATH=/usr/local/lib/
cd $ROCKSDB_HOME
make clean; make
make check
make librocksdb.so
make librocksdb.a

Build RocksDB JNI

cd ..
wget -O rocksdbjni.zip https://github.com/fusesource/rocksdbjni/archive/master.zip
unzip rocksdbjni.zip
wget http://apache.tradebit.com/pub/maven/maven-3/3.1.1/binaries/apache-maven-3.1.1-bin.tar.gz
gzip -d apache-maven-3.1.1-bin.tar.gz
tar -xf apache-maven-3.1.1-bin.tar
export SNAPPY_HOME=${BASEDIR}/snappy-1.1.1; ls -alt $SNAPPY_HOME
export ROCKSDBJNI_HOME=${BASEDIR}/rocksdbjni-master; ls $ROCKSDBJNI_HOME
export LIBRARY_PATH=${SNAPPY_HOME}
export C_INCLUDE_PATH=${LIBRARY_PATH}
export CPLUS_INCLUDE_PATH=${LIBRARY_PATH}
cd ${ROCKSDB_HOME}
make librocksdb.a
mkdir ${BASEDIR}/dist/
cp librocksdb.so ${BASEDIR}/dist/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${BASEDIR}/dist/
cd ${ROCKSDBJNI_HOME}
${BASEDIR}/apache-maven-3.1.1/bin/mvn clean install
cd rocksdbjni-linux64
${BASEDIR}/apache-maven-3.1.1/bin/mvn install
cd ${ROCKSDBJNI_HOME}

Distribute the JAR File.

cp rocksdbjni/target/*.jar ${BASEDIR}/dist/

Minimum requirements on the system where the final jar is used:
1) GCC 4.7 or later (a newer gcc with C++11 support)
2) gflags-2.0
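
Once the native library is reachable at runtime and the jar is on the classpath, usage boils down to simple GET/PUT calls. The sketch below is illustrative only: it uses the official RocksJava bindings (org.rocksdb, published as a recent rocksdbjni artifact), whose API differs from the fusesource wrapper built above, and the database path and keys are made up.

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class RocksDbKvExample {
    static {
        // Loads the bundled native library; it must be resolvable at runtime.
        RocksDB.loadLibrary();
    }

    public static void main(String[] args) throws RocksDBException {
        // Open (or create) a local key-value store backed by SSD.
        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/rocksdb-demo")) {
            // Simple PUT followed by GET on the key-value structure.
            db.put("user:1001".getBytes(), "alice".getBytes());
            byte[] value = db.get("user:1001".getBytes());
            System.out.println(value == null ? "miss" : new String(value));
        }
    }
}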

The Next Generation Cache Will Move from DRAM+NETWORK to SSD+DRAM

Caching in Big Data Processing: The Last Decade

There are four fundamental components leveraged for processing big data:
  1. CPU
  2. Memory
  3. Disk
  4. Network
Architects play with these ingredients to balance operational requirements against the cost-per-byte parameter. The result: numerous hybrid architectural models, determined largely by the nature of the application in context.
 
One such application pattern in the big data world is fetching large volumes of data for analysis. To support interactive business intelligence, the data has to be served from a cache layer. An effective cache layer means more cache hits, and that is determined by analyst behaviour: is the most recently analyzed dataset requested again? In most cases it is, because during an analysis, analysts keep playing with a subset of the data to discover patterns. In these scenarios a cache layer works wonders, serving hot data from memory and saving the overhead of going to disk, looking up the inode table and seeking across cylinders.
 
Over the last decade, memcache served this layer wonderfully in distributed setups. However, stretching it started building stress on the network layer. Facebook's first attempt was to short-circuit the packet layer by running memcache over UDP, and the result was a phenomenal increase in throughput and performance. The fundamental problem of network saturation, however, remained unaddressed: with more and more distributed components hammering the network layer, it keeps approaching its saturation point.
 

Future Bigdata Cache Layer

As these distributed systems were evolving, a new class of memory hardware, the SSD, was making its way into enterprises as well as retail laptops through mass production. The maturity of SSDs, along with the price drop, made the component viable for architects to play with.

So the old model of DRAM + NETWORK is being challenged by a new generation of caches built on DRAM + SSD.
To me, RocksDB is the first step in this direction. Going forward, more and more vendors and products will explore this area.

Real-Time Big Data Architecture Patterns

Big Data Architecture Pattern 1 – Cache to Shelter, an answer to high writes.

1. The app server writes to a shared in-memory cluster with a TTL, and to a queue dispatcher (see the sketch after this list).

2. The queue dispatcher asynchronously writes to the persistent big data database (without a TTL).

3. Reads first look in the in-memory cache; on a miss, they fall back to the persistent big data database.
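
A rough illustration of this pattern follows. The CacheStore, QueueDispatcher and BigDataStore interfaces are hypothetical placeholders standing in for the shared in-memory cluster, the dispatcher and the persistent big data database; they are not a real API.

// Hypothetical interfaces for the three moving parts of the cache-to-shelter pattern.
interface CacheStore      { byte[] get(String key); void put(String key, byte[] value, int ttlSeconds); }
interface QueueDispatcher { void enqueueWrite(String key, byte[] value); } // persists asynchronously, no TTL
interface BigDataStore    { byte[] get(String key); }

class CacheToShelterClient {
    private static final int TTL_SECONDS = 300;
    private final CacheStore cache;
    private final QueueDispatcher dispatcher;
    private final BigDataStore store;

    CacheToShelterClient(CacheStore cache, QueueDispatcher dispatcher, BigDataStore store) {
        this.cache = cache;
        this.dispatcher = dispatcher;
        this.store = store;
    }

    // Steps 1 and 2: write to the in-memory cluster with a TTL and hand the same
    // record to the dispatcher, which writes it to the persistent store asynchronously.
    void write(String key, byte[] value) {
        cache.put(key, value, TTL_SECONDS);
        dispatcher.enqueueWrite(key, value);
    }

    // Step 3: read from the cache first; on a miss, fall back to the persistent store.
    byte[] read(String key) {
        byte[] cached = cache.get(key);
        return (cached != null) ? cached : store.get(key);
    }
}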

Big Data Architecture Pattern 2 – Distributed Merge & Update

1. The read-modify-write step happens in a single RPC.

2. This saves data flow over the network (see the HBase example below).
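
One concrete way to get this behaviour in HBase is the increment facility, which applies the read-modify-write on the region server so that only a single RPC crosses the network. A minimal sketch, assuming an older HTable-style client and an illustrative "metrics" table with column family "cf":

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

// Distributed merge & update: the counter is read, incremented and written back
// on the region server in one RPC, instead of a client-side get followed by a put.
public class PageViewCounter {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "metrics");   // illustrative table name
        try {
            long total = table.incrementColumnValue(
                    Bytes.toBytes("page#home"),       // row key
                    Bytes.toBytes("cf"),              // column family
                    Bytes.toBytes("views"),           // qualifier
                    1L);                              // delta, applied server side
            System.out.println("views so far: " + total);
        } finally {
            table.close();
        }
    }
}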

Big Data Architecture Pattern 3 – Distributed Merge & Update

1. Good for write-once, read-many workloads.

2. An embedded database (with TTL) serves as a local in-process cache.

3. The embedded database, used as a cache, is designed to use memory and SSD efficiently.

4. The big data database handles distributed writes, sharding and failover.

World's 10 Big HBase Database Cluster Details


For the last couple of years there have been many conferences on big data databases. HBase has emerged as the database most closely integrated with Hadoop in the ecosystem.

Especially at Facebook, hundreds of terabytes of data flow into HBase clusters month by month. I have compiled these sessions to analyze the similarities among the various implementations and configurations, and to apply the learning to productionizing HBase.

See the Google Doc for details


Data Fracking

In 2006, Clive Humby drew the analogy between crude oil and data in a blog piece titled "Data is the new Oil", which has since captured the imagination of several commentators on big data. No one doubts the value of these 'resources'; they vary only in the effort required to extract them. During a discussion, the CIO of a billion-dollar company indicated that there is a lot of data, but can you make it "analyzeable"?

Perhaps he was referring to the challenges of dealing with unstructured data in a company's communication and information systems, besides the structured data silos that are also teeming with data. In our work with a few analytics companies, we found validation of this premise. Data in log files, PDFs, images, etc. is one part of it. There is also the deep web, the part of the data not readily accessible by googling, or, as this HowStuffWorks article refers to it, 'hidden from plain sight.'

Bizosys's HSearch is a Hadoop-based search and analytics engine that has been adapted to deal with this challenge faced by data analysts, commonly referred to as Data Preparation or Data Harvesting. If finding value in data indeed poses these challenges, then Clive's analogy to crude oil is valid. Take a look at our take on this: if shale gas extraction represents the next frontier in oil extraction, employing a process known as Hydraulic Fracturing, or Fracking, then our take on that is 'data fracking', a process of making data accessible.

The Origins of Big Data

While sharing our thoughts on big data with our communications team, we were storytellers. The story around big data was impromptu! We realized that the oft-quoted Volume, Variety and Velocity can actually be mapped to Transactions, Interactions and Actions. I have represented it using a Visual.ly infographic background.

Here is a summary -

“The trend we observe is that the problems around big data are increasingly being spoken about more in business terms viz. Transactions, Interactions, Actions and less in technology terms viz. Volume, Variety, Velocity. These two represent complementary aspects and from a big data perspective promise better business-IT alignment in 2013, as business gets hungrier still for more actionable information.”

Volume - Transactions

More interestingly, as in a story, it flowed along time. Big data first appears on the scene as an IT challenge to grapple with when Volume happens, driven by a growing rate of transactions. Sometimes transactions occur at several hundreds per second; sometimes billions of database records must be processed in a single batch, where the volume is multiplied by newer, more sophisticated models being applied, as in the case of risk analysis. Big data arrives as a serious IT challenge, a project to deal with the associated issues of scale and performance over large volumes of data. Typically, these are operational in nature and internal facing.

These large volumes are often dealt with by relying on public cloud infrastructure such as Amazon, Rackspace, Azure, etc., or on more sophisticated solutions involving 'big data appliances' that combine terabyte-scale RAM at the hardware level with in-memory processing software from large companies such as HP, Oracle, SAP, etc.

Variety - Interactions

The next level of big data problems surfaces when dealing with external-facing information arising out of Interactions with customers and other stakeholders. Here one is confronted with a huge variety of information, mostly textual, captured from customer interactions with call centers, emails, or metadata around these, including videos, logs, etc. The challenge lies in the semantic analysis of huge volumes of text to determine user intent or sentiment, project brand reputation, and so on. However, despite the ability to process this volume and variety, getting a reasonably accurate measurement that is 'good enough' remains a daunting challenge.

Value - Transactions + Interactions

The third level of big data appears where some try to make sense of all the data that is available: structured and unstructured, transactions and interactions, current and historical. The aim is to enrich the information, to pollinate the raw text by extracting business entities, linking them to richer structured data and to yet other sources of external information, and to triangulate and derive a better view of the data for more Value.

Velocity - Actions

And finally, we deal with the Velocity of information as it happens. This could be for obvious uses like fraud detection, but also to surface actionable insights before the information goes stale. It requires all aspects of big data to be addressed as the data flows, within a highly crunched time frame. For example, an equity analyst or broker would like to be informed about trading anomalies or patterns detected as intraday trades happen.

The business impact of Bigdata

As a company engaged in big data before the term became as common as it is today, we are constantly having conversations around solutions to big data problems. Naturally, a lot of the talk revolves around Hadoop, NoSQL and other such technologies.

But what we notice is a pattern in how this is impacting business. One client company caters to researchers who until recently were dealing with petabytes of data; we helped implement our HSearch real-time big data search engine for Hadoop there. Before this intervention, the norm was to wait, at times up to 3 days, to receive a report for a query spanning petabytes of distributed information characterized by huge volume and a lot of variety. Today, with the big data solution in place, the norm is sub-second response times.

Similarly, in a conversation with a telecom industry veteran, we were told that the health of telecom networks has always been monitored across a large number of transmission towers, which together generate over 1 terabyte of data each day as machine logs, sensor data, etc. The norm here was to receive health reports compiled at a weekly frequency. Now some players are not satisfied with that and want these reports daily, possibly hourly, or even in real time.

Not stopping at reporting as it happens, or in near real time, the next question business is asking is: if you can tell us so fast, can you predict it will happen, especially in a world of monitoring IT systems and machine-generated data? We will leave prediction around human-generated data analytics (read: social feed analysis) out of the story for the moment. Predictive analysis could mean predicting that a certain purple-shade, large-size polo neck is soon going to run out of stock in a certain market region, given other events. Or it could mean, more feasibly, that a machine serving a rising number of visitors to a site is likely to go down soon because its current sensor data matches a historical pattern; therefore, alert the administrator or, better still, bring up a new node on demand and keep it warm and ready.

So it seems the value of big data lies in its degree of freshness and actionability; at the most basic level, simply get the analysis or insight out faster by a manifold factor!

V Power and big data

Even a decade back, when you mentioned the letter V, one thought of powerful diesel engines. Today, some say data is the new oil, and big data is best understood by the V words: Volume, Velocity, Variety. According to this IBM page, these are not enough. Apparently, business still doesn't trust the output from any appliance or architecture that deals with these three Vs, so they introduce a new V word: Veracity. With Veracity, you can establish or frame what you want from big data and, more importantly, decide to believe it. Fair enough! Now Evan Quinn of ESG Global, an IT research and advisory company, has added two new Vs to the string: Visualization and Value. Now that is a V6 engine!
Image courtesy:  Educational Technology Clearinghouse

Top Indian IT body NASSCOM awards Bizosys

Last month, at Interop 2012 held in Mumbai, India's leading IT body NASSCOM awarded Bizosys and 9 other IT companies the Top 10 "Made in India" Enterprise Software Product Companies recognition.
Bizosys was chosen for HSearch, its real-time, enterprise-grade big data search engine for Hadoop, available as open source on GitHub.

In continuation, Bizosys will also be participating in NASSCOM's flagship event, the NASSCOM Product Conclave, in Bengaluru, India on Nov 7th and 8th.


Sunil Guttula, CEO and Co-Founder at Bizosys, will participate in a panel discussion on the topic "Big Data - why enterprises are struggling and how startups can step in." Bizosys will also present its story of moving from the lab to the street, sharing its GTM approach, the challenges it faces, etc., in a session titled "Enterprise Software Startup Highlight", where eminent IT and VC representatives will hear it and provide feedback.

Starting the HBase Server from Eclipse

Why start the HBase server inside Eclipse?
  1. HBase custom filters are a powerful feature that helps move processing near the data. However, deploying a custom filter requires compiling the filter's dependent classes, packaging them in a jar and making the jar available to the region server. Any change to the custom filter code means the complete cycle of stopping the server, packaging a new jar, copying it to the hbase lib folder and restarting (a minimal sketch of such a filter appears after this list).
  2. There is no easy way to debug the code inside these custom filters by putting a breakpoint.
  3. To see the execution status, one has to watch the region server log file, which amounts to juggling Eclipse + Cygwin + Notepad just to view the logs.
    Because of these shortfalls, I decided to run HBase from Eclipse: run it as a Java application or debug it as a Java application, and set breakpoints on my filter classes to see the execution path along with the stacks.
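
    For reference, here is a minimal sketch of the kind of custom filter one ends up debugging this way. It is written against the older Writable-based filter API (roughly the HBase 0.90/0.92 line); the class name and its skip-empty-values logic are made up for illustration, and breakpoints would go inside filterKeyValue().

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.filter.FilterBase;

    // Hypothetical custom filter: drops any cell whose value is empty,
    // so the filtering happens on the region server, near the data.
    public class NonEmptyValueFilter extends FilterBase {
        @Override
        public ReturnCode filterKeyValue(KeyValue kv) {
            return (kv.getValueLength() == 0) ? ReturnCode.SKIP : ReturnCode.INCLUDE;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            // no state to serialize
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            // no state to deserialize
        }
    }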

    Steps to configure HBase for Eclipse
    What we need (All are included in the deployment package - Download HStartup.zip)
    1. chmod.exe program (32 bit is included)
    2. favicon.gif  
    3. HBaseLuncher.class
    4. An unzipped HBase release


    Setting things up
    1. Unzip the attached Eclipse project folder.
    2. Add the hbase/lib/ folder jars to the project build path libraries.
    3. Add the hbase/conf folder to the project build path libraries.
    4. Add hbase.jar to the project build path libraries.
    5. Add your project to the required projects in the Java build path.

    Starting HBase in debug mode
    1. Run HBaseLuncher in debug/run mode.
    2. In the Windows tray, you will see an HBase tray icon.
    3. Right-click on the tray icon to start/stop the HBase server.

        
    For any issues, please write to me:- abinash [at] bizosys [dot] com

    HBase Backup to Amazon S3

    HSearch is our open source, NoSQL, distributed, real-time search engine built on Hadoop and HBase. You can find more about it at http://hadoopsearch.net

    We have evaluated various options to back up data inside HBase and built a solution. This post explains the options and also provides the solution for anyone to download and use with their own HBase installation.

    Option 1: Back up the Hadoop DFS
    Pros: Block data files are backed up quickly.
    Cons: Even if there is no visible external load on HBase, internal processes such as region balancing and compaction keep updating the HDFS blocks, so a raw copy may end up in an inconsistent state. Secondly, HBase as well as Hadoop HDFS keep data in memory and flush it at periodic intervals, so a raw copy may again be inconsistent.

    Option 2: HBase Import and Export tool
    Pros: The Map-Reduce job downloads data to the given output path.
    Cons: Providing a path like s3://backupbucket/ fails the program with exceptions like: Jets3tFileSystemStore failed with AWSCredentials.

    Option 3: HBase Table Copy tools
    Pros: Another parallel, replicated setup to switch to.
    Cons: A huge investment to keep another parallel environment running just to replicate production data.

    After considering these options, we developed a simple tool which backs up data to Amazon S3 and restores it when needed. Another requirement was to take a full backup over the weekend and a daily incremental backup.

    In case of failure, the tool should first initiate a clean environment with all tables created and populated with the latest full backup data, and then apply all incremental backups sequentially. However, in this method deletes are not captured, which may leave some unnecessary data in the tables. This is a known disadvantage of this method of backup and restore.
    The backup program internally uses the HBase Import and Export tools to execute the jobs as Map-Reduce programs.

    Top 10 Features of the backup tool
    1. Export complete data for a given set of tables to an S3 bucket.
    2. Export data incrementally for a given set of tables to an S3 bucket.
    3. List all complete as well as incremental backup repositories.
    4. Restore a table from backup based on a given backup repository.
    5. Runs as Map-Reduce jobs.
    6. In case of connection failure, retries with increasing delays.
    7. Handles special characters such as _ that trip up the export and import activities.
    8. Enhances the existing Export and Import tools with detailed logging to report a failure, rather than just exiting with a program status of 1.
    9. Works with a human-readable time format (YYYY.MM.DD 24HH:MINUTE:SECOND:MILLISECOND TIMEZONE) for taking, listing and restoring backups, rather than system tick time or Unix epoch time.
    10. All parameters are taken from the command line, which allows a cron job to run the tool at regular intervals.

    Setting up the tool

    Step # 1 : Download the package  from  http://hsearch0.94.s3.amazonaws.com/hbackup.install.tar
    This package includes the necessary jar files and the source code.

    Step # 2 : Set up a configuration file. Download the hbase-site.xml file.
    Add to it the fs.s3.awsAccessKeyId, fs.s3.awsSecretAccessKey, fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey properties.

    Step # 3 : Set up the classpath with all the jars inside the hbase/lib directory, the hbase.jar file, and the java-xmlbuilder-0.4.jar, jets3t-0.8.1a.jar and hbackup-1.0-core.jar files bundled inside the downloaded hbackup.install.tar. Make sure hbackup-1.0-core.jar is at the beginning of the classpath. In addition, add to the CLASSPATH the configuration directory that holds the hbase-site.xml file.

    Running the tool

    Usage: It runs in 4 modes: [backup.full], [backup.incremental], [backup.history] and [restore]
    ----------------------------------------
    mode=backup.full tables="comma separated tables" backup.folder=S3-Path  date="YYYY.MM.DD 24HH:MINUTE:SECOND:MILLSECOND TIMEZONE"

    Ex. mode=backup.full tables=tab1,tab2,tab3 backup.folder=s3://S3BucketABC/ date="2011.12.01 17:03:38:546 IST"
    Ex. If the date is omitted, it defaults to now:
    mode=backup.full tables=tab1,tab2,tab3 backup.folder=s3://S3BucketABC/

    ----------------------------------------

    mode=backup.incremental tables="comma separated tables" backup.folder=S3-Path duration.mins=In Minutes
                Ex. mode=backup.incremental backup.folder=s3://S3BucketABC/ duration.mins=30 tables=tab1,tab2,tab3

    This will back up changes that happened in the last 30 minutes.

    ----------------------------------------

    mode=backup.history backup.folder=S3-Path

    Ex. mode=backup.history backup.folder=s3://S3BucketABC/
    This will list all past archives. Incremental ones end with .incr

    ----------------------------------------

    mode=restore backup.folder=S3-Path/ArchiveDate tables="comma separated tables"

    Ex. mode=restore backup.folder=s3://S3-Path/DAY_MON_HH_MI_SS_SSS_ZZZ_YYYY tables=tab1,tab2,tab3
    This will restore the rows archived on that date. First apply a full backup and then apply the incremental backups.

    -------------------------------------

    Some sample scripts to run the backup tool.

    $ cat setenv.sh
    for file in `ls /mnt/hbase/lib`
    do
    export CLASSPATH=$CLASSPATH:/mnt/hbase/lib/$file;
    done

    export CLASSPATH=/mnt/hbase/hbase-0.90.4.jar:$CLASSPATH

    export CLASSPATH=/mnt/hbackup/hbackup-1.0-core.jar:/mnt/hbackup/java-xmlbuilder-0.4.jar:/mnt/hbackup/jets3t-0.8.1a.jar:/mnt/hbackup/conf:$CLASSPATH



    $ cat backup_full.sh
    . /mnt/hbackup/bin/setenv.sh

    dd=`date "+%Y.%m.%d %H:%M:%S:000 %Z"`
    echo Backing up for date $dd
    for table in `echo table1 table2 table3`
    do
    /usr/lib/jdk/bin/java com.bizosys.oneline.maintenance.HBaseBackup mode=backup.full backup.folder=s3://mybucket/ tables=$table "date=$dd"
    sleep 10
    done

    $ cat list.sh
    . /mnt/hbackup/bin/setenv.sh
    /usr/lib/jdk/bin/java com.bizosys.oneline.maintenance.HBaseBackup mode=backup.history backup.folder=s3://mybucket 

    Mapping an app on Google AppEngine (GAE) to a custom domain

    After fiddling for an hour or so trying to figure out this seemingly simple stuff, I finally got both the root domain and sub-domains to map from GoDaddy to my app on Google App Engine. So here goes.


    1. Register the domain on Google Apps (this is required as App Engine uses Google Apps to handle redirection)

    2. Add the Google Apps site admin mail id as an admin for the app on GAE. When you do this, an email is sent to the site admin mail id.
    3. Add the domain in Application Settings / Domain Setup. It is important to specify the domain as mydomain.com without the www; otherwise it will error out.
    4. To link the GoDaddy domain server to Google Apps, the A record has to be set up on GoDaddy; these instructions are available on the Domain Settings page in Google Apps.
    5. Map the naked URL of mydomain.com to www using the Google info on the domain page.
    6. Add www or any sub-domain mapping to the GAE app inside Google Apps. This also requires the www CNAME record to point to ghs.google.com.
    7. Wait for the domain settings to trickle down and it should work.

    No Schema NoSQL database

    There are two parts to scale out:
    • Distributed processing
    • Distributed Storage
    Long ago, Grid technology promised this but failed to deliver, because the network in a Grid breaks down under heavy data flow over the wire. Hadoop HDFS addressed this by intelligently moving processing near the data.
    NoSQL Database Journey
    Using this technology, pure NoSQL databases like HyperTable and HBase were designed. These databases transparently split the data for distributed storage, replication and processing. The question is: why can't they use regular databases, hosting them on multiple machines, firing queries at each of them and assembling the results? Yes, it happened, and many companies took the easy path of distributed processing using the Hadoop Map-Reduce framework while arranging data in traditional databases (Postgres, MySQL); refer to the HadoopDB and AsterData products for details. This works, but availability becomes an issue. If one server's availability is 90%, the overall availability across 2 servers is 81% (90% x 90%), and this falls drastically as more servers are added to scale out. Replication is a solution, but it breaks the memory caching that many products rely on heavily for vertical scaling.
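    A quick back-of-the-envelope sketch of that availability arithmetic (illustrative only: it assumes independent failures and that a query must touch every server):

    public class AvailabilityMath {
        public static void main(String[] args) {
            double perServer = 0.90;  // 90% availability per server
            for (int servers = 1; servers <= 10; servers++) {
                // Without replication, a scatter-gather query needs every server up at once.
                double overall = Math.pow(perServer, servers);
                System.out.printf("%2d servers -> %.1f%% overall availability%n", servers, overall * 100);
            }
        }
    }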
    In the same fashion, KATA and many other products provided distributed processing using the Hadoop Map-Reduce framework over open source search engines (Lucene and Solr). These also fail to address the high availability requirement.
    Still No Freedom
    However, the rigidity that comes with data structure stays, as all these databases need a schema. Early on we envisioned a schema-free design that would allow us to keep all data in a single table and query it on a need basis. We knew search engines usually allow this flexibility. It helps users start their journey by typing the context they want to find rather than browsing a rigid menu system; a menu system is often tied to the underlying data structure.
     "Freedom comes by arranging data by value than structure"
     But most search engines failed in the enterprise, where data and permissions change very frequently. A search-heavy design fails in write-heavy transactional environments.
     We modified and extended HBase to build a distributed, scalable search engine. It allowed us a schema-free design, scaling out to handle processing load and supporting a huge volume of writes. We tested this engine by indexing the complete Wikipedia corpus, 180 million small records, with 100 concurrent users on only 4 machines, to prove its linear horizontal scalability at the Intel Innovation Lab.

    J2EE application server deployment for HBase File Locality

    "The data node that shares the same physical host has a copy of all the data the region server requires. If you are running a scan or get or any other use case, you can be sure to get the best performance." (Reference: http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html)
    Now let's consider a mail server application which stores all the mails of its users.
    Usually, to handle a large concurrent user base, a user who logs on is routed to one of the available servers. This logic could be round robin or load-based routing. Once the session is created on the allocated machine, the user starts accessing his mails from HBase. But the data may not be served from the local machine: the user might have logged on to a different machine earlier, and there is a high chance that the record was created on the region server node of that machine. As the requesting machine is not fixed, the information flows over the wire before being served.
    Co-locating the client and the original region server would minimize this network trip. This is possible by sticky-routing the user to the same machine again and again, across days, as long as the node is available. That ensures local data access: the same region server talks to the same data node and the local disk. But most load balancers are not designed like that; in reality they route based on the number of active connections, a model that works well for balancing CPU and memory. A hybrid model will work best for balancing CPU, memory and network together.
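    As a rough sketch of such sticky routing (illustrative only: the table name and helper class are made up, and the HTable/HRegionLocation calls shown are from the older, pre-1.0 client API), a routing layer could ask HBase which host serves a user's row and pin the session to the application server on that host:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HRegionLocation;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    // Sticky-routing helper: find the region server that currently holds a user's
    // mailbox row, so the load balancer can keep the session on that same host.
    public class MailboxLocator {
        public static String hostFor(String userId) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "mails");   // illustrative table name
            try {
                HRegionLocation location = table.getRegionLocation(Bytes.toBytes(userId));
                return location.getHostname();          // co-locate the session here
            } finally {
                table.close();
            }
        }
    }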
    This way of co-locating the application server, the HBase region server and the HDFS data node may pose a security risk for credit card transactional systems. Such systems may want one more firewall between the database and the application server, which under high traffic will primarily choke the network. In the best interest of security and scalability, information architects need to separate their application's sensitive data (e.g. credit card information) from the "low risk data" while creating the threat model. Based on this, a dedicated remote HBase cluster behind a firewall could be created for serving the sensitive information.

    Value-Name Columnar Structure

    The Lucene search engine allows one data structure to be defined; NUTCH and other search engines have a predefined structure (e.g. title, url, body). On the other side, in an RDBMS we build different tables for different data structures.
    How can I store various XML formats and documents together and search them? For Nutch and Lucene, we would need to remove field types from the filtering criteria. In a database, each table and each column would need to be scanned to find the data, which would be very slow.
    These constraints pushed us to develop a schema that allows search while preserving the structure and allowing write operations.
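    Purely as an illustration of the value-name idea (this is not HSearch's actual storage schema), every document, whatever its format, can be flattened into (documentId, fieldName, value) cells, so one table holds all structures and a search can match on value first and narrow by field name only when asked to:

    import java.util.ArrayList;
    import java.util.List;

    // One cell per (document, field, value) triple: "arranging data by value than structure".
    public class ValueNameCell {
        final String docId;
        final String field;   // e.g. "title", "url", "body", or any custom XML element
        final String value;

        ValueNameCell(String docId, String field, String value) {
            this.docId = docId;
            this.field = field;
            this.value = value;
        }

        // Naive scan: match on value, optionally narrowing by field name.
        static List<ValueNameCell> search(List<ValueNameCell> cells, String text, String field) {
            List<ValueNameCell> hits = new ArrayList<ValueNameCell>();
            for (ValueNameCell c : cells) {
                if (c.value.contains(text) && (field == null || c.field.equals(field))) {
                    hits.add(c);
                }
            }
            return hits;
        }
    }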

    Rise of the Algorithms

    In a recent New York Times column, "The New Algorithm of Web Marketing", Tanzina Vega reported how big brands such as Nike and P&G are increasingly relying on "programmatic buying" to deliver ads to individual consumers in a highly targeted manner, leaving the traditional cousins Print, TV, etc. far behind in the dust for the same promise. Until recently, 'algorithmic' seemed to apply only to equity trading, but now ad buying has reached that point. Publishers of online networks such as AOL and Weather.com apparently view themselves as "Marketing Engines" instead of Publishers, and are relying on automated ad purchasing technologies. Where traditional media only provided context, the new digital properties now add fine-tuned placement to it. Just as brokerages earn in fractional value, when volume is added to the equation, the results to the top line of publishers promise to be handsome, naturally.

    The question I have is this: granted, the buyer's context and great placement are done in a timely manner to multiply the chances of the viewer clicking the ad. But what next? It's an area Bizosys is interested in: how dynamic or optimized is the post-click experience? Seems ripe for innovation. Tell us!