Sunday, November 6, 2016

UTF-8 Encoding - MySQL and Spark

Let's say your data is in Hebrew or another non-Latin language, and you want to process it in Spark and store some of the results in MySQL. Cool... so you set the table character set and collation to UTF-8, either when creating the table or with an ALTER statement if it already exists:

CREATE DATABASE name DEFAULT CHARACTER SET utf8 COLLATE utf8_bin;
CREATE TABLE table_name (column_name column_type CHARACTER SET utf8 DEFAULT NULL,...) 
ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

but that's not enough. You also need to set the MySQL JDBC client connection parameters,
either by appending the following to the connection URL:
.... ?useUnicode=true&characterEncoding=UTF-8
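For example, a full URL might look like this (the host, port and schema name here are placeholders):

val mysqljdbcurl = "jdbc:mysql://localhost:3306/my_schema?useUnicode=true&characterEncoding=UTF-8"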

Or by setting the connection properties for the DataFrame we are going to write:

// build the JDBC connection properties (user, password, etc. are assumed to be set elsewhere)
val connProps = new java.util.Properties()
connProps.setProperty("characterEncoding", "UTF-8")
connProps.setProperty("useUnicode", "true")
resultsetDf.write.mode(saveMode).jdbc(mysqljdbcurl, tableName, connProps)


Good luck



Query that returns a large ResultSet using Hive JDBC takes ages to complete

You are trying to execute a query whose result set is huge. Its execution time through the beeline CLI is fine, but over Hive JDBC it takes ages to complete. The Hive and Spark logs don't show any errors, but you might see lots of Kryo messages in the Hive debug logs - in such cases it's highly recommended to start HiveServer2 with debug logging enabled:
hiveserver2 --hiveconf hive.root.logger=DEBUG,console
This usually happens because of the time spent on Kryo serialization/deserialization when you have configured Hive on Spark.
In such cases I recommend executing the query through Spark itself, so the end-to-end process is much faster and comparable to the beeline execution time.
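A minimal sketch of that approach, assuming Spark 1.x built with Hive support and hive-site.xml on its classpath (the application name, query and output path are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object LargeResultSetQuery {
  def main(args: Array[String]): Unit = {
    // run the same query inside Spark instead of pulling a huge ResultSet over Hive JDBC
    val sc = new SparkContext(new SparkConf().setAppName("large-resultset-query"))
    val hiveContext = new HiveContext(sc)

    // my_db.my_table is a placeholder - put your own query here
    val resultDf = hiveContext.sql("SELECT * FROM my_db.my_table")
    resultDf.write.parquet("/tmp/large_query_output") // or continue processing it in Spark
    sc.stop()
  }
}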

Good luck

Thursday, November 3, 2016

Initiating a SparkContext throws javax.servlet.FilterRegistration SecurityException


If you are trying to load hive-jdbc, hadoop-client and jetty in the same Scala project along with your Spark dependencies, you might not be able to launch a standalone Spark application.
While initiating the SparkContext, it will throw a javax.servlet.FilterRegistration SecurityException, because javax.servlet dependencies are pulled in at different versions from several sources.

How to avoid this conflict?
You will need to add a few ExclusionRules to some of the dependencies in your build.sbt file:

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.6.4" excludeAll(
  ExclusionRule(organization = "javax.servlet"),
  ExclusionRule(organization = "org.mortbay.jetty"))

libraryDependencies += "org.apache.hive" % "hive-jdbc" % "1.2.1" excludeAll ExclusionRule(organization = "javax.servlet")
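For reference, these exclusions live alongside your Spark dependency in the same build.sbt (the artifact and version below are just an example):

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.2"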

Good luck

Monday, August 1, 2016

Hive on Spark - additions to the Getting Started tutorial

Hive on Spark is much faster than the MapReduce alternative, and as far as I understand from the documentation, the MapReduce engine is going to be deprecated in future versions.
If you want to use Apache Spark as the execution engine for your Hive queries, you will find it a bit hard to configure, even though there is a very good Getting Started tutorial (https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started).
You will notice that the integration isn't trivial, since it is under active development and periodically merged into the Spark and Hive branches.
The integration issues:

  1. Version compatibility issues - I tried several pairs: Spark 1.6.1 + Hive 2.0.1, Spark 1.6.1 + Hive 1.2.1, Spark 1.6.2 + Hive 1.2.1 and others. My Hive queries failed with a "return code 3" message. Reading the Hive debug logs, I found that right after the spark-submit, SparkClientImpl threw a java.lang.AbstractMethodError. The bottom line: you can use the following pairs: Spark 1.3.1 + Hive 1.2.1, Spark 1.4.1 + Hive 1.2.1 or Spark 1.6.2 + Hive 2.0.1.
  2. java.lang.NoClassDefFoundError: org/apache/hive/spark/client/Job or SparkTask - you will need to add the following configuration properties to the conf/hive-site.xml files located in the classpath of both distributions (Spark and Hive):

<property>
  <name>spark.driver.extraClassPath</name>
  <value>/usr/lib/apache-hive-2.0.1-bin/lib/hive-exec-2.0.1.jar</value>
</property>

<property>
  <name>spark.executor.extraClassPath</name>
  <value>/usr/lib/apache-hive-2.0.1-bin/lib/hive-exec-2.0.1.jar</value>
</property>
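Beyond these classpath fixes, remember that the engine itself is selected by setting hive.execution.engine=spark (in hive-site.xml or per session), as described in the Getting Started tutorial. A hypothetical sketch of setting it per session over Hive JDBC - host, port, credentials and the query are placeholders:

import java.sql.DriverManager

object HiveOnSparkSession {
  def main(args: Array[String]): Unit = {
    // placeholder connection details - adjust host, port, database and credentials
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "")
    val stmt = conn.createStatement()
    stmt.execute("set hive.execution.engine=spark") // switch this session to the Spark engine
    val rs = stmt.executeQuery("SELECT count(*) FROM my_table") // placeholder query
    while (rs.next()) println(rs.getLong(1))
    conn.close()
  }
}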


Obviously these notes cover only the integration issues - just enough to make it work, and not more than that!
Good luck

Sunday, June 26, 2016

Eclipse 4.5.0 fails to load properly on Ubuntu 16.04

It seems like Eclipse, SpringSource Tool Suite (STS) and other IDEs built on top of Eclipse fail to load properly (they freeze, run slowly, spin at 100% CPU, etc.) on Ubuntu 16.04.
You are not alone... it's a GTK compatibility issue.
You will need to start Eclipse with GTK 2: edit the eclipse.ini file and add the following two lines:
--launcher.GTK_version
2
before the --launcher.appendVmargs line


Saturday, July 4, 2015

ClassCastException on a remote EJB call between EARs in WebSphere 8.5

We have two EARs running on the same WebSphere 8.5 Application Server, each running in its own classloader. A service in one EAR initiates a remote EJB call to a service that lives in the other EAR. Stubs have been generated automatically on startup since WAS 7.
One of our teams tried to clean up and reorder their legacy classpath.
Right after this classpath clean-up, we tried to initiate a remote EJB call, but for some reason it threw the following exception:
"...ClassCastException...unable to load ..._Stub..."
We didn't understand why this exception came up - why a ClassCastException?
We realized that our remote interfaces were not identical between the EARs and that classes were missing from the classpath. So whenever you have such an architecture (if you are still using EJBs to call your legacy systems :) ), make sure your remote interfaces are identical and that no classes are missing from the classpath of the calling EAR.

Tuesday, March 31, 2015

getpocket.com Delicious (del.icio.us) bookmark importer

If you have ever tried to import all or part of your Delicious links into getpocket.com, you have probably had difficulties using the tool provided by Pocket. Pocket's import tool lets you bring in your Delicious bookmarks, but it doesn't work well, and for some reason a large portion of your private and/or public links are not imported. I've found that it could be a parsing issue caused by the structure of the exported Delicious HTML file.
An important note from Pocket: Pocket is not a replacement for archival type bookmarking. Your list is a collection of links that you intend to view later. We strongly advise against importing thousands of items here.