Sunday, November 6, 2016

UTF-8 Encoding - MySQL and Spark

Let's say your data is in Hebrew or another non-Latin language, and you want to process it in Spark and store some of the results in MySQL. Cool... so you set the table charset and collation to UTF-8, either at creation time or with ALTER if the table already exists:

CREATE DATABASE name DEFAULT CHARACTER SET utf8 COLLATE utf8_bin;
CREATE TABLE table_name (column_name column_type CHARACTER SET utf8 DEFAULT NULL,...) 
ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

But that's not enough. You also need to set the MySQL JDBC client connection parameters,
either by appending the following to the URL:
.... ?useUnicode=true&characterEncoding=UTF-8

Or by setting connection properties for the DataFrame we are going to write:
...
connProps.setProperty("characterEncoding", "UTF-8")
connProps.setProperty("useUnicode", "true")
resultsetDf.write.mode(saveMode).jdbc(mysqljdbcurl, tableName, connProps)
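
Both options above can be wrapped in small helpers. The following is only a minimal sketch using the Scala/Java standard library: the object name Utf8Jdbc, the helper names, and the example host and database are made up for illustration, while the parameter names useUnicode and characterEncoding are the ones shown above.

```scala
import java.util.Properties

object Utf8Jdbc {
  // Append the two encoding parameters to a MySQL JDBC URL, using '?' or '&'
  // depending on whether the URL already carries a query string.
  def withUtf8(url: String): String = {
    val separator = if (url.contains("?")) "&" else "?"
    s"$url${separator}useUnicode=true&characterEncoding=UTF-8"
  }

  // Alternatively, build the same settings as connection properties.
  // "user" and "password" are the standard JDBC property keys.
  def utf8Props(user: String, password: String): Properties = {
    val props = new Properties()
    props.setProperty("user", user)
    props.setProperty("password", password)
    props.setProperty("useUnicode", "true")
    props.setProperty("characterEncoding", "UTF-8")
    props
  }
}
```

For example, Utf8Jdbc.withUtf8("jdbc:mysql://localhost:3306/mydb") yields a URL ending in ?useUnicode=true&characterEncoding=UTF-8, and the Properties object can be passed as the third argument of the jdbc(...) call above. Either option should be enough on its own.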


Good luck



Query that returns a large ResultSet using Hive JDBC takes ages to complete

You are trying to execute a query whose result set is huge. Its execution time using the beeline CLI is fine, but through Hive JDBC it takes ages. The Hive and Spark logs don't show any errors, but you might see lots of Kryo messages in the Hive debug logs. To get those debug logs, it's highly recommended to start Hive using the following command:
hiveserver2 --hiveconf hive.root.logger=DEBUG,console
This usually happens because of the time spent on Kryo serialization/deserialization when you have configured Hive on Spark.
In such cases I recommend executing the query directly in Spark, so that the end-to-end process is much faster and matches the beeline execution time.
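
Running the query directly in Spark could look roughly like the sketch below. It assumes Spark 2.x built with Hive support and a hive-site.xml on the classpath (on Spark 1.6 the equivalent entry point is HiveContext). The application name, table name and output path are hypothetical placeholders.

```scala
import org.apache.spark.sql.SparkSession

object LargeResultQuery {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport() lets Spark read tables registered in the Hive metastore.
    val spark = SparkSession.builder()
      .appName("large-result-query")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical table: replace with the query that was slow over Hive JDBC.
    val result = spark.sql("SELECT * FROM my_db.my_large_table")

    // Write the rows out in parallel instead of pulling them through a single
    // JDBC connection. The output path is a placeholder.
    result.write.mode("overwrite").parquet("/tmp/large_result")

    spark.stop()
  }
}
```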

Good luck

Thursday, November 3, 2016

Initiating a SparkContext throws javax.servlet.FilterRegistration SecurityException


If you are trying to load hive-jdbc, hadoop-client and jetty all in the same Scala project along with your Spark dependencies, you might not be able to start a standalone Spark application.
While trying to initiate the SparkContext, it will throw a javax.servlet.FilterRegistration SecurityException because of mixed javax.servlet dependencies pulled in with different versions from several sources.

How to avoid this conflict?
You will need to add several ExclusionRules to some of the dependencies in your build.sbt file:

// Exclude the servlet API and the old Jetty artifacts pulled in by hadoop-client
libraryDependencies += ("org.apache.hadoop" % "hadoop-client" % "2.6.4")
  .excludeAll(
    ExclusionRule(organization = "javax.servlet"),
    ExclusionRule(organization = "org.mortbay.jetty"))

// Exclude the servlet API pulled in by hive-jdbc
libraryDependencies += ("org.apache.hive" % "hive-jdbc" % "1.2.1")
  .excludeAll(ExclusionRule(organization = "javax.servlet"))

Good luck

Monday, August 1, 2016

Hive on Spark - additions to the Getting Started tutorial

Hive on Spark is much faster than the MapReduce alternative, and as far as I understand from the documentation, the MapReduce option is going to be deprecated in future versions.
If you want to use Apache Spark as the execution engine for your Hive queries, you will find it a bit hard to configure, even though there is a very good Getting Started tutorial (https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started).
You will notice that the integration isn't trivial, since it's in active development and periodically merged into the Spark and Hive branches.
The integration issues:

  1. Version compatibility issues - I tried several pairs: Spark 1.6.1 + Hive 2.0.1, Spark 1.6.1 + Hive 1.2.1, Spark 1.6.2 + Hive 1.2.1 and others. My Hive queries failed with a "return code 3" message. Reading the Hive debug logs, I found that right after the spark-submit, SparkClientImpl threw a java.lang.AbstractMethodError exception. The bottom line: you can use the following pairs: Spark 1.3.1 + Hive 1.2.1, Spark 1.4.1 + Hive 1.2.1 or Spark 1.6.2 + Hive 2.0.1.
  2. java.lang.NoClassDefFoundError: org/apache/hive/spark/client/Job or SparkTask - you will need to add the following configuration properties to the conf/hive-site.xml files on the classpath of both distributions (Spark and Hive):

<property>
    <name>spark.driver.extraClassPath</name>
    <value>/usr/lib/apache-hive-2.0.1-bin/lib/hive-exec-2.0.1.jar</value>
    <description/>
</property>

<property>
    <name>spark.executor.extraClassPath</name>
    <value>/usr/lib/apache-hive-2.0.1-bin/lib/hive-exec-2.0.1.jar</value>
    <description/>
</property>


Obviously these are just integration fixes - enough to make it work and nothing more!
Good luck

Sunday, June 26, 2016

Eclipse 4.5.0 fails to load properly on Ubuntu 16.04

It seems that Eclipse, Spring Tool Suite (STS) or any other IDE built on top of Eclipse fails to load properly (frozen, slow, 100% CPU, etc.) on Ubuntu 16.04.
You are not alone... the cause is GTK compatibility issues.
You will need to start Eclipse with GTK 2: just edit the eclipse.ini file and add the following:
--launcher.GTK_version
2
before the --launcher.appendVmargs line