Apache Kylin is a data cube solution on top of Hadoop providing an ODBC interface for BI tools.
OLAP cubes boosts performance for analytics via using a subset of data, enriched with pre-calculations on specific dimensions of interest.
It enables loading dimensions from a Hive data source, therefore accelerating BI tool access via pre-calculating data and adding it to HBase.
In our example we have a large dataset of flight information:
- Daily data of flights from an origin to a destination with expected and actual flight times
- Additional metadata is available on operators, cancellations, diverts, etc.
The goal of the experiment was to install Kylin, load a subset of our couple of gigabytes worth flight data and connect it to Microsoft PowerBI.
Kylin Pros and Cons
Pros
- Fast performance
- Easy to use UI
- ODBC connection to conventional BI tools
- Free and open-source tool integrating to the Hadoop eco-system
Cons
- A cube is pre-constructed, therefore provides access to a data snapshot. It should be re-built in case of updates and built differently in case of changes in dimensions or measures
- Kylin sometimes failed to build a cube or exited during the experiment
Experiment – POC with Kylin
Platform
- 8 core VM with 16 GBs of RAM
- CentOS 6.7
- HDP 2.5
- Kylin 1.5.4.1
- Hive has flight data loaded for each months of 2015 downloadable from the Bureau of Transportation Statictics
Data of interest
We are interested in the average delays on a month of a day and on a carrier basis.
In cube terms, dimensions categorizes facts for the users’ business questions.
E.g. for flights we are interested in a daily distribution of delays, so we select the day of the month or interested in an average delay per carrier, so we select carrier as dimensions.
A measure is a property for the calculation, so in our example is the delay in minutes.
Installing Kylin
Is easy – assuming Hive and HBase is working properly,
- Download Kylin
- tar xzvf kylin.tar.gz
- export KYLIN_HOME=(UntarredFolder)
- cd $KYLIN_HOME && ./bin/kylin.sh start
Kylin logs to $KYLIN_HOME/logs/kylin.log
Using Kylin to create cubes
Access Kylin via web browser at http://hadoopIP:7070/kylin, user ADMIN and password KYLIN
Unfortunately Kylin’s docs on cube creation is outdated -? as the interface looks a bit different, than the documentation.
How to create a cube with Kylin – Step by step screenshots
Constructing the cube took 7.78 minutes.
Apache Kylin vs Apache Hive query performance
Hive
AVG delay on specific months of a day
AVG delay on specific months of a day SELECT DAYOFWEEK, AVG(DEPDELAY) FROM ON_TIME_PERFORMANCE_2015_1 GROUP BY DAYOFWEEK;
82.31 sec
AVG delay per carrier
AVG delay on specific months of a day SELECT DAYOFWEEK, AVG(DEPDELAY) FROM ON_TIME_PERFORMANCE_2015_1 GROUP BY DAYOFWEEK;
142.73 sec
Kylin
AVG delay on specific months of a day
AVG delay on specific months of a day SELECT DAYOFWEEK, AVG(DEPDELAY) FROM ON_TIME_PERFORMANCE_2015_1 GROUP BY DAYOFWEEK;
0.18 sec
AVG delay per carrier
AVG delay on specific months of a day SELECT DAYOFWEEK, AVG(DEPDELAY) FROM ON_TIME_PERFORMANCE_2015_1 GROUP BY DAYOFWEEK;
0.36 sec