Monday, August 1, 2016

Hive on Spark - additions to the Getting Started tutorial

Hive on Spark is much faster than the MapReduce alternative, and as far as I understand from the documentation, the MapReduce engine is going to be deprecated in future Hive versions.
If you want to use Apache Spark as your execution engine for Hive queries, you will find it a bit hard to configure, even though there is a very good Getting Started tutorial (https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started).
You will notice that the integration isn't trivial, since it is in active development and periodically merged into the Spark and Hive branches.
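Switching the engine itself is just one property. A minimal sketch, following the Getting Started tutorial: you can set it per session from the Hive CLI with "set hive.execution.engine=spark;", or persist it in conf/hive-site.xml:

<property>
    <name>hive.execution.engine</name>
    <value>spark</value>
    <description>Use Spark instead of MapReduce as the Hive execution engine</description>
</property>

The hard part is everything around this property.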
The integration issues I ran into:

  1. Version compatibility issues - I tried several pairs (Spark 1.6.1 + Hive 2.0.1, Spark 1.6.1 + Hive 1.2.1, Spark 1.6.2 + Hive 1.2.1, and others), and my Hive queries failed with a "return code 3" message. Reading the Hive debug logs (see the sketch after the configuration block below), I found out that right after the spark-submit, SparkClientImpl threw a java.lang.AbstractMethodError. The bottom line: you can use the following pairs: Spark 1.3.1 + Hive 1.2.1, Spark 1.4.1 + Hive 1.2.1 or Spark 1.6.2 + Hive 2.0.1.
  2. java.lang.NoClassDefFoundError: org/apache/hive/spark/client/Job (or SparkTask) - you will need to add the following configuration properties to the conf/hive-site.xml files located in the classpath of both distributions (Spark and Hive):

<property>
    <name>spark.driver.extraClassPath</name>
    <value>/usr/lib/apache-hive-2.0.1-bin/lib/hive-exec-2.0.1.jar</value>
    <description/>
</property>

<property>
    <name>spark.executor.extraClassPath</name>
    <value>/usr/lib/apache-hive-2.0.1-bin/lib/hive-exec-2.0.1.jar</value>
    <description/>
</property>
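For reference, this is how I got to the debug logs mentioned in issue 1. A minimal sketch; my_table is just a placeholder for one of your own tables:

# Run the Hive CLI with debug logging on the console, so you can see
# the spark-submit command and whatever exception follows it
# (in my case, java.lang.AbstractMethodError from SparkClientImpl).
hive --hiveconf hive.root.logger=DEBUG,console \
     -e "set hive.execution.engine=spark; select count(*) from my_table;"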


To be clear, these notes only cover the integration issues: just enough to make it work, and not more than that!
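Before re-running your queries, one cheap sanity check: make sure the jar that both extraClassPath properties point to actually exists on every node (the path below is from my setup; adjust it to your Hive version):

ls -l /usr/lib/apache-hive-2.0.1-bin/lib/hive-exec-2.0.1.jar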
Good luck
