System Architecture

NewTrek's Hero Platform integrates Hadoop, Spark, and Apache Phoenix with operational analytics capabilities, pub-sub event streaming, service and metrics monitoring, and distributed storage to power a new generation of big data applications. 

Open Source Projects & Tools

Hadoop

Apache Hadoop™ was born out of a need to process and store big data, both structured and unstructured. It is now widely used across industries with big data requirements, including finance, media and entertainment, government, healthcare, information services, and retail.

Spark

Apache Spark™ is a fast and general engine for large-scale data processing. Spark Streaming is an extension of the core Spark API that enables high-throughput, fault-tolerant stream processing of live data streams. Spark Streaming provides streaming computation, meaning that processing occurs in real time on data as it is streamed from a source.
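
As a rough illustration, the PySpark sketch below counts words over 5-second micro-batches of text streamed from a socket; the host, port, and batch interval are illustrative assumptions, not part of the Hero configuration.

    # Minimal Spark Streaming word count (assumes a text source on
    # localhost:9999, e.g. started with `nc -lk 9999`).
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "NetworkWordCount")
    ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split(" "))
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()  # print each batch's counts to stdout

    ssc.start()
    ssc.awaitTermination()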

Hive 

Apache Hive is a powerful data warehousing application built on top of Hadoop; it enables you to access your data using HiveQL, a language that is similar to SQL.
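
The sketch below shows what a HiveQL query looks like from Python, assuming the PyHive client and a HiveServer2 instance; the host, table, and column names are illustrative.

    # Query Hive with HiveQL from Python via HiveServer2 (illustrative names).
    from pyhive import hive

    conn = hive.connect(host="hive-host", port=10000, username="analyst")
    cursor = conn.cursor()

    # HiveQL reads much like SQL.
    cursor.execute("""
        SELECT region, COUNT(*) AS events
        FROM clickstream
        WHERE event_date = '2018-01-01'
        GROUP BY region
    """)
    for region, events in cursor.fetchall():
        print(region, events)

    cursor.close()
    conn.close()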

Pig

Apache Pig is a tool/platform used to analyze large data sets by representing them as data flows. Pig is generally used with Hadoop; all of the common data manipulation operations in Hadoop can be performed using Pig.
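
As a sketch of how such a data flow is expressed, the example below writes a small Pig Latin script from Python and runs it with the pig command-line client in local mode; the script contents and file paths are illustrative.

    # Run an illustrative Pig Latin script in local mode via the pig CLI.
    import subprocess

    script = """
    logs   = LOAD 'input/access_log.txt' USING PigStorage(' ')
             AS (host:chararray, path:chararray, status:int);
    errors = FILTER logs BY status >= 500;
    counts = FOREACH (GROUP errors BY path) GENERATE group AS path, COUNT(errors);
    STORE counts INTO 'output/error_counts';
    """

    with open("error_counts.pig", "w") as f:
        f.write(script)

    # -x local runs against the local filesystem; drop it to run on the cluster.
    subprocess.run(["pig", "-x", "local", "error_counts.pig"], check=True)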

Phoenix

Apache Phoenix enables OLTP and operational analytics in Hadoop for low-latency applications. It combines the power of standard SQL and JDBC APIs with full ACID transaction capabilities and the flexibility of late-bound, schema-on-read capabilities from the NoSQL world, leveraging HBase as its backing store. Apache Phoenix is fully integrated with other Hadoop products such as Spark, Hive, Pig, Flume, and MapReduce.
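
The sketch below shows Phoenix's SQL-over-HBase model from Python, assuming the phoenixdb client and a Phoenix Query Server; the host, table, and columns are illustrative.

    # Issue standard SQL against HBase through Apache Phoenix (illustrative names).
    import phoenixdb

    conn = phoenixdb.connect("http://phoenix-host:8765/", autocommit=True)
    cursor = conn.cursor()

    cursor.execute(
        "CREATE TABLE IF NOT EXISTS metrics ("
        " host VARCHAR NOT NULL, ts BIGINT NOT NULL, value DOUBLE"
        " CONSTRAINT pk PRIMARY KEY (host, ts))"
    )
    # Phoenix uses UPSERT rather than INSERT.
    cursor.execute("UPSERT INTO metrics VALUES (?, ?, ?)", ("web01", 1514764800, 0.42))
    cursor.execute("SELECT host, ts, value FROM metrics WHERE host = ?", ("web01",))
    print(cursor.fetchall())

    conn.close()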

Solr

Apache Solr is highly reliable, scalable, and fault-tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration, and more. Solr is integrated with HBase to provide faster query results through indexing.
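
The sketch below indexes and searches a couple of documents, assuming the pysolr client and a Solr core named "events"; the core name, fields, and query are illustrative.

    # Index and query documents in Solr (illustrative core and field names).
    import pysolr

    solr = pysolr.Solr("http://solr-host:8983/solr/events", timeout=10)

    # Index two documents and commit immediately for the example.
    solr.add([
        {"id": "1", "host_s": "web01", "message_t": "disk usage warning"},
        {"id": "2", "host_s": "web02", "message_t": "disk replaced"},
    ], commit=True)

    # Distributed, load-balanced querying happens behind this single call.
    for doc in solr.search("message_t:disk", rows=10):
        print(doc["id"], doc["host_s"])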

HDFS

HDFS is a Java-based file system that provides scalable and reliable data storage. HDFS provides high-throughput access to application data and is suitable for applications that have large data sets.
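
The sketch below reads and writes HDFS files over WebHDFS, assuming the Python hdfs package and a NameNode web endpoint; the host, port, user, and paths are illustrative.

    # Basic HDFS file operations over WebHDFS (illustrative host and paths).
    from hdfs import InsecureClient

    client = InsecureClient("http://namenode:9870", user="hadoop")

    # Write a small file, then list the directory and read the file back.
    client.write("/tmp/hero/hello.txt",
                 data=b"hello from the Hero platform\n",
                 overwrite=True)
    print(client.list("/tmp/hero"))

    with client.read("/tmp/hero/hello.txt") as reader:
        print(reader.read().decode("utf-8"))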

HBase

Apache HBase is an open-source, distributed, versioned, non-relational database that runs on a Hadoop cluster and provides random, real-time read/write access to your Big Data. Clients can access HBase data either through a native Java API or through a Thrift or REST gateway, making it accessible from any language.
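
The sketch below performs a random write and read through the HBase Thrift gateway, assuming the happybase client; the table and column family names are illustrative.

    # Random real-time read/write against HBase via Thrift (illustrative names).
    import happybase

    connection = happybase.Connection("hbase-thrift-host", port=9090)
    table = connection.table("user_events")

    # Row keys, qualifiers, and values are raw bytes in HBase.
    table.put(b"user42|2018-01-01", {b"cf:action": b"login", b"cf:ip": b"10.0.0.7"})

    row = table.row(b"user42|2018-01-01")
    print(row[b"cf:action"])

    connection.close()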

Kafka

Apache Kafka® is used for building real-time data pipelines and streaming applications. It is horizontally scalable, fault-tolerant, and extremely fast, and it lets you publish and subscribe to streams of records.
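
The sketch below publishes and consumes one record from a topic, assuming the kafka-python client; the broker address, topic, and group id are illustrative.

    # Publish to and consume from a Kafka topic (illustrative broker and topic).
    from kafka import KafkaConsumer, KafkaProducer

    producer = KafkaProducer(bootstrap_servers="kafka-broker:9092")
    producer.send("hero-events", key=b"web01",
                  value=b'{"metric": "cpu", "value": 0.73}')
    producer.flush()

    consumer = KafkaConsumer(
        "hero-events",
        bootstrap_servers="kafka-broker:9092",
        group_id="hero-analytics",
        auto_offset_reset="earliest",
    )
    for record in consumer:
        print(record.key, record.value)
        break  # stop after one record for the example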

Oozie

Apache Oozie is a valuable tool for Hadoop users to automate commonly performed tasks in order to save time and prevent user error. With Oozie, users can describe workflows to be performed on a Hadoop cluster, schedule those workflows to execute under a specified condition, and even combine multiple workflows and schedules together into a package to manage their full lifecycle.
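
The sketch below submits and starts a workflow through Oozie's REST API, assuming an Oozie server and a workflow already deployed to HDFS; the hosts, paths, and property values are illustrative.

    # Submit and start an Oozie workflow over its REST API (illustrative values).
    import requests

    config = """<?xml version="1.0" encoding="UTF-8"?>
    <configuration>
      <property>
        <name>user.name</name>
        <value>hadoop</value>
      </property>
      <property>
        <name>oozie.wf.application.path</name>
        <value>hdfs://namenode:8020/apps/hero/daily-etl</value>
      </property>
    </configuration>
    """

    resp = requests.post(
        "http://oozie-host:11000/oozie/v1/jobs?action=start",
        data=config,
        headers={"Content-Type": "application/xml"},
    )
    print(resp.json())  # the response carries the new workflow job id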

Azkaban

Azkaban is a batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves the ordering through job dependencies and provides an easy-to-use web user interface to maintain and track your workflows.

Hue

Hue (Hadoop User Experience) offers a web GUI to Hadoop users to simplify the process of creating, maintaining, and running many types of Hadoop jobs. Hue is made up of several applications that interact with Hadoop components, and has an open SDK to allow new applications to be created. 

Zookeeper

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. A distributed Apache HBase installation depends on a running ZooKeeper cluster. All participating nodes and clients need to be able to access the running ZooKeeper ensemble.
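
The sketch below stores and reads a small piece of shared configuration in ZooKeeper, assuming the kazoo client; the ensemble addresses, znode path, and payload are illustrative.

    # Read and write coordination data in ZooKeeper (illustrative hosts and path).
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
    zk.start()

    # Keep a small piece of shared configuration under a znode.
    zk.ensure_path("/hero/config")
    zk.set("/hero/config", b"ingest.batch.size=500")

    data, stat = zk.get("/hero/config")
    print(data.decode("utf-8"), stat.version)

    zk.stop()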

Cloudera

Cloudera provides a scalable, flexible, integrated platform that makes it easy to deploy and manage Apache Hadoop and related projects, manipulate and analyze your data, and keep that data secure and protected. Cloudera Manager is a sophisticated application used to deploy, manage, monitor, and diagnose issues with the Cloudera distribution of Apache Hadoop and other related open-source projects.

Prometheus

Prometheus is an open-source monitoring system that stores metrics and alerting data. Prometheus scrapes metrics from instrumented jobs, either directly or via an intermediary push gateway for short-lived jobs. It stores all scraped samples locally and runs rules over this data to either aggregate and record new time series from existing data or generate alerts.
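
The sketch below instruments a job so that Prometheus can scrape it, assuming the prometheus_client library; the port and metric names are illustrative.

    # Expose application metrics for Prometheus to scrape (illustrative names).
    import random
    import time

    from prometheus_client import Counter, Gauge, start_http_server

    REQUESTS = Counter("hero_requests_total", "Total requests handled")
    QUEUE_DEPTH = Gauge("hero_queue_depth", "Items currently queued")

    if __name__ == "__main__":
        # Serve /metrics on port 8000 for the Prometheus server to scrape.
        start_http_server(8000)
        while True:
            REQUESTS.inc()
            QUEUE_DEPTH.set(random.randint(0, 50))
            time.sleep(1)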

Grafana

Grafana is open-source software for time series analytics. It is used to visualize the metrics data collected in the Prometheus database.

[Figure: Hero Platform architecture diagram]

Platform Architecture and Flow