Dataswft combines big data engineering with advanced machine learning to STORE - ANALYSE - PREDICT

With its index architecture, tiered storage, and advanced machine learning out of the box, Dataswft is well-equipped to extract reliable insights and actionable predictions from large-scale machine data, in real time and cost-effectively.

Its indexing architecture is designed to deal with different datatypes at the column level within a record. It is read-optimized (write-once), which makes it particularly well-suited for machine data at tera/petabyte scale. Dataswft does not use a secondary index; instead, it stores data and index together. Dataswft stores data seamlessly across DRAM, SSD, HDD, and object stores such as S3, giving a significant price-to-performance benefit. Dataswft benchmarks demonstrate how it complements Spark, Hive, and Tez by significantly speeding up their query performance.


Dataswft Architecture

Dataswft combines highly multithreaded computation, exploiting many-core CPUs, with a unique index architecture that leverages fast I/O from SSD+DRAM, keeping the last stage of computation in L1/L2 cache.

• Dataswft Directory Service directs queries to specific machines in a cluster instead of all machines. 

• All computations are performed on data in L1/L2 cache without exploding bytes into objects. 

• Group-level index to choose the right shards based on query predicates. 

• Fault tolerance is achieved by leveraging Hadoop replication, switching to available nodes where replicas are located. 

• Binary search and min/max computed on sorted binary data with lazy serde. 

• Leverages data colocation with hot-data caching and Hadoop short-circuit local reads. 

• Logical data partitioning on multiple fields: partitioning by columns, and partitioning by rows based on record skew size. 

• Physical file partitioning and replication using Hadoop MapFile partitions. 

• Dataswft Tiered Storage automatically distributes data across DRAM and SSD for hot data, HDD for warm data, and HDFS/S3/object stores for historical data. 

• Does not need a DataNode to keep data; queries can span real-time data through to historical data in the cloud. 

• Leverages RocksDB to cache data on DRAM/SSD.
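The group-level index above can be made concrete with a small sketch. Assuming each shard keeps a per-column (min, max) summary (a hypothetical layout for illustration, not Dataswft's documented format), a directory service can route a range predicate only to the shards whose summaries overlap it:

```python
# Sketch of group-level shard pruning: each shard stores a (min, max)
# summary per column; a range predicate is routed only to shards whose
# summaries overlap it. Illustrative only -- not Dataswft's actual layout.

def prune_shards(shard_summaries, column, lo, hi):
    """Return the ids of shards whose [min, max] range for `column`
    overlaps the query range [lo, hi]."""
    hits = []
    for shard_id, summary in shard_summaries.items():
        cmin, cmax = summary[column]
        if cmax >= lo and cmin <= hi:  # ranges overlap
            hits.append(shard_id)
    return hits

shards = {
    "shard-0": {"column168": (0.0, 3.9)},
    "shard-1": {"column168": (4.0, 9.5)},
    "shard-2": {"column168": (11.0, 454999.75)},
}

# Predicate: column168 BETWEEN 4.0 AND 10.9 -> only shard-1 qualifies.
print(prune_shards(shards, "column168", 4.0, 10.9))  # ['shard-1']
```

This is the same idea that allows a predicate such as `column168 BETWEEN 4.0 AND 10.9` in the benchmarks below to touch only a fraction of the data.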

Dataswft Features

1. Interactive queries on historical machine data - Dataswft's machine data indexing architecture allows analytics queries on historical data to run in milliseconds. Query latency is optimized by reading and scanning only exactly what is needed. 

2. Data availability - When data is kept only on HDFS, availability is based on HDFS replication. 

3. Extended data availability - Data can additionally be kept on an object store, with no need to run a DataNode to hold it. 

4. Query historical + real-time data - Automatically query data from Dataswft and real-time datastores such as HBase and MySQL. 

5. Query from .NET or Java applications - Dataswft provides a JDBC driver and a REST API to work with data. 

6. Select, aggregate, and filter data with complex boolean predicates using SQL - Dataswft provides SQL to query both tall and fat tables with hundreds of billions of records, so there is no new skill to acquire. However, it does not support JOINs. 

7. Scalable - Keep adding data to Dataswft; it is designed for petabyte scale. 

8. Auto schema generation - Given a sample dataset, schema creation is automated. 

9. Query engine integration - Data inside Dataswft can be queried using Hive+Tez or Hive+Spark. 

10. Visualization tool integration - Pentaho and Weka can query data from Dataswft, limited to a single table (however tall and fat). 

11. Input data sources - Dataswft readily takes data from HDFS or HBase as the data source, running MapReduce jobs to create indexes. 

12. Incremental updates - Dataswft readily reads incremental data updates from HBase. 

13. Remove historical records - Historical records can be easily purged from the system without running compaction. 

14. Validate batch loads, publish, or roll back - A batch load can be verified from the console by checking the loaded data boundaries, then published; any input file can be rolled back. 

15. High-accuracy predictions - iSAX algorithms integrated with sliding-window pattern detection for anomalies, discords, novelties, surprises, and deviants. 

16. Enterprise Management Console - Monitors cluster health and availability, query metrics (actual queries, response times, queries returning no data), and the data size of the cluster. 

17. Backups - Full and incremental backups. 

18. Support - Email and phone support.
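Feature 15 refers to iSAX (indexable Symbolic Aggregate approXimation). As a rough illustration of the underlying idea, and not of Dataswft's implementation, SAX z-normalizes each sliding window, reduces it with piecewise aggregate approximation (PAA), and maps segment averages to symbols; windows whose symbol strings are rare can then be flagged as discords or anomalies. The breakpoints, alphabet size, and rarity heuristic below are illustrative assumptions:

```python
# Minimal SAX sketch over sliding windows -- illustrative only, not
# Dataswft's implementation. Breakpoints split the standard normal
# distribution into quartiles for a 4-symbol alphabet.
import math
from collections import Counter

BREAKPOINTS = [-0.6745, 0.0, 0.6745]  # N(0,1) quartile boundaries

def sax_word(window, segments=4):
    """Z-normalize a window, reduce with PAA, map segment means to symbols."""
    n = len(window)
    mean = sum(window) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / n) or 1.0
    z = [(x - mean) / std for x in window]
    seg = n // segments
    word = ""
    for i in range(segments):
        avg = sum(z[i * seg:(i + 1) * seg]) / seg  # PAA mean of this segment
        word += "abcd"[sum(avg > b for b in BREAKPOINTS)]
    return word

def rare_windows(series, width=8, max_count=1):
    """Flag window start offsets whose SAX word occurs at most max_count times."""
    words = {i: sax_word(series[i:i + width])
             for i in range(len(series) - width + 1)}
    counts = Counter(words.values())
    return [i for i, w in words.items() if counts[w] <= max_count]

print(sax_word([-3.0] * 4 + [3.0] * 4))  # 'aadd' -- low half, then high half
```

On a mostly periodic series, a window containing a spike maps to a symbol string seen almost nowhere else, so it surfaces as a discord; this is the intuition behind sliding-window anomaly detection on machine data.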


Dataswft complements popular big data products, viz. Apache Spark, Hive, Tez, Cassandra, etc., in the Hadoop ecosystem to offer highly cost-effective tiered storage with sub-second response times on analytical queries, and high-accuracy predictions using advanced machine learning algorithms such as indexable Symbolic Aggregate approXimation (iSAX). Dataswft brings new capabilities to existing big data analytics and integrates readily for rapid deployment.

Dataswft with Apache Spark, Hive - Representative high-level architecture

Dataswft with Apache Spark

Dataswft with Hive


Dataswft Benchmarks

Dataswft is designed to be read-optimized (write-once workload) and well-suited for machine data, hence a great complement to write-optimized big data technologies such as HBase and Cassandra.

For this performance benchmark study, 19.5 terabytes of industrial sensor data were generated, spanning 17.551 billion rows across 202 columns, on Amazon EC2 node machines with the data stored in S3. The objective was to evaluate performance on typical SQL queries.

The Dataswft index is speed-optimized; when optimized for storage instead, its size reduces to approximately half or less. This affects select-values queries but not aggregation or filtering. For write-once workloads, query times are faster with Dataswft at the cost of a larger index. Dataswft offers around 21 different schema configurations, which can be tuned to balance index size against query speed.

Dataswft does not pre-compute or cache the result of any query. It uses RocksDB to cache data blocks, and RocksDB manages those blocks between DRAM and SSD, so it is possible to run Dataswft with only 1 GB of RAM. 
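The block-caching behaviour described above can be sketched with a minimal LRU cache. This is a toy model assuming a fixed block budget; RocksDB's real block cache is sharded and more sophisticated, spanning DRAM and SSD:

```python
# Minimal LRU block cache sketch -- illustrates how a fixed DRAM budget
# can serve hot data blocks while cold blocks are evicted. RocksDB's
# actual block cache is sharded and more sophisticated.
from collections import OrderedDict

class LRUBlockCache:
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()  # block_id -> block contents

    def get(self, block_id, load_from_storage):
        """Return a cached block, or load it, cache it, and evict if full."""
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # mark as most recently used
            return self.blocks[block_id]
        block = load_from_storage(block_id)
        self.blocks[block_id] = block
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used
        return block

loads = []
cache = LRUBlockCache(capacity_blocks=2)
fetch = lambda bid: loads.append(bid) or f"data-{bid}"
cache.get("b1", fetch)   # miss: loaded from storage
cache.get("b2", fetch)   # miss
cache.get("b1", fetch)   # hit: served from the cache
cache.get("b3", fetch)   # miss: evicts b2 (least recently used)
print(loads)  # ['b1', 'b2', 'b3']
```

Because queries that touch only hot blocks never go back to slow storage, a small DRAM budget can serve a much larger dataset, which is the essence of the 1 GB RAM claim above.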

ENVIRONMENT 

EC2 AMI used : ami-640af40c (Compute Optimized, c3.2xlarge) 

Operating System : CentOS 6.5 

vCPUs : 8 

Memory (GiB) : 15 

Instance Storage (GB) : 2 x 80 (SSD) (used with RAID Level 0)



Data size : 19.5 TB 

Columns : 202 

Rows : 17,551,000,000 (17.551 billion) 

Data in S3 

Hot storage : SSD 

Total Nodes : 10


Sql: [select min(column168) as minval1, max(column168) as maxval1, count(column168) as countcol from sensortable where partitionkey='100' AND column168 BETWEEN 4.0 AND 10.9]

minval1 | maxval1 | countcol
4.5017  | 4.5017  | 1

177 milliseconds
_____________________________________________

Sql: [select min(column168) as minval1, max(column168) as maxval1, count(column168) as countcol from sensortable where partitionkey='100']

minval1 | maxval1   | countcol
3.4364  | 454992.03 | 100094

366 milliseconds
_____________________________________________

Sql: [select min(column168) as minval1, max(column168) as maxval1, count(column168) as countcol from sensortable where partitionkey in ('100','101','102','103','104','105','106','107')]

minval1 | maxval1   | countcol
0.0257  | 454999.75 | 799735

786 milliseconds
_____________________________________________

JDBC Counter: Sql: [select count(column0) as countcol, min(column0) as mincol, max(column0) as maxcol from sensortable where column0 < 12]

countcol | mincol | maxcol
3872     | 1.0    | 11.0

10222 ms (10.22 seconds)
_____________________________________________

JDBC Counter: Sql: [select count(column0) as countcol from sensortable]

countcol
17551000000

40120 ms (40.12 seconds)
_____________________________________________

JDBC Counter: Sql: [select count(column0) as countcol, min(column0) as mincol, max(column0) as maxcol from sensortable]

countcol    | mincol | maxcol
17551000000 | 1.0    | 5.0E7

45462 ms (45.46 seconds)
_____________________________________________

Typical Dataswft PoC executed in 1 to 4 weeks 

Evaluate on Your cluster, Your data, with Your Queries 

Dataswft is a non-disruptive, 'bolt-on' deployment on top of your Hadoop cluster. 

Your current workloads co-exist alongside Dataswft serving realtime queries.

Try First, Buy Later

Dataswft is available on an annual term license based on the volume of data stored and indexed inside Dataswft ('License Slabs'). The volume is measured as raw data ingested into Dataswft during the license term and determines the License Slab. Dataswft can be deployed on your entire cluster without limitations on the number of nodes or users.

Write to us at try@dataswft.com for your 30 day evaluation copy.