SAP HANA Hadoop Integration

Anyone working in the technology industry in recent years will have heard the claim that the vast majority of all data in existence (by some estimates, around 90%) was created in just the last few years. For years, companies felt this first-hand as their storage costs climbed year after year. At the same time, with so much data to look through, searching these huge repositories quickly and efficiently became a tremendous challenge, and traditional storage and query mechanisms were found wanting. Engineers at Google responded by publishing a distributed storage and processing architecture, and the open-source Hadoop project grew out of those ideas. If you are an enterprise or individual that needs to process files that are tens of terabytes in size, you may well have been pointed towards Hadoop. Hadoop uses the Hadoop Distributed File System (HDFS), which splits huge files into smaller blocks and stores them across multiple storage nodes (commodity servers in a cluster). Processing of this distributed data is accelerated by shipping the programs to the nodes, so that, like the storage, the computation is spread across nodes that work in parallel. The distributed data does not have to be moved back and forth for processing, because the code that processes it is already available at every node.

In Hadoop, all of this is achieved through the MapReduce programming model. In the 'traditional' sense, this means the types of processing you can perform on Hadoop are restricted to the applications that are pre-installed on the different nodes; it may not work or perform well for ad hoc queries, since that is not what Hadoop was designed for. During the same period in which Hadoop and MapReduce took flight, SAP moved towards SAP HANA and its in-memory computing engine. For enterprises already using SAP HANA, combining its in-memory processing power with Hadoop's capacity for very large data volumes can be a game changer, enabling high-speed analysis of large datasets. Put simply, you might build an architecture in which Hadoop stores and pre-processes the data while SAP HANA performs the analysis and prepares summarizations of it. Hadoop scales easily by adding new nodes, so you can enlarge your storage capacity with minimal lead time, while the built-in modeling features and R integration of SAP HANA make processing that data straightforward. If you would like to take advantage of the potential of integrating SAP HANA with Hadoop, there are now several options for setting up that connection.

1. SAP HANA Spark Controller
When connecting SAP HANA with Hadoop using the Spark Controller, an SQL interface is used to connect SAP HANA to a pre-existing Hive metastore. This approach requires that the SAP HANA Spark Controller be installed on the Hadoop side of the architecture. On the SAP HANA side, a Spark SQL adapter is used to form a connection to the Spark Controller on the Hadoop cluster; the controller acts as the intermediary for all data transfer and query execution between SAP HANA and Hadoop. The first step in setting up the connection is to install the SAP HANA Spark Controller on the Hadoop cluster. It is this controller that enables in-memory access to HDFS data files through an SQL interface. The connection to HDFS is made through YARN (the resource management layer of the Hadoop environment) and the Spark assembly JAR, with Spark SQL serving as the interface between the Spark Controller and the distributed file system (HDFS) or SAP Data Lifecycle Manager (DLM). On the other side of the integration, SAP HANA uses smart data access to create a remote source based on the 'sparksql' adapter, which exposes virtual tables that SAP HANA can query directly. Once this is done, the end-to-end connectivity between SAP HANA and Hadoop is complete. For all of this to work you need YARN, Spark 1.x (with its assembly file) or 2.x (with individual JAR files), and some additional Spark libraries that are typically not bundled with Spark. The Spark Controller can be installed either manually or with the cluster management tools Cloudera Manager or Ambari.
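To give a feel for the SAP HANA side of this setup, the sketch below shows how a remote source and a virtual table are typically created through SQL. It is only a minimal illustration: the host name, port, credentials, schema and Hive table name are placeholders, and the exact configuration keys can differ between Spark Controller versions.

    -- Remote source pointing at the Spark Controller (host/port/credentials are placeholders)
    CREATE REMOTE SOURCE "SPARK_HADOOP" ADAPTER "sparksql"
      CONFIGURATION 'server=hadoop-master.example.com;port=7860;ssl_mode=disabled'
      WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=hanaes;password=<password>';

    -- Expose a Hive table as a virtual table that can be queried like any SAP HANA table
    CREATE VIRTUAL TABLE "MYSCHEMA"."VT_SALES"
      AT "SPARK_HADOOP"."<NULL>"."default"."sales";

    SELECT COUNT(*) FROM "MYSCHEMA"."VT_SALES";

Once the virtual table exists, it can be joined with native SAP HANA tables in ordinary SQL, which is what makes the approach attractive for analytics.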

2. Simba HiveServer ODBC driver
Apache Hive provides an SQL-like query language for working with data stored in the Hadoop Distributed File System (HDFS) on a Hadoop cluster. The Simba HiveServer ODBC driver acts as the connector to Apache Hive; note that this approach is restricted to Intel-based hardware platforms. Hive makes it possible to run SQL queries that combine data residing in Hadoop with data in SAP HANA. To establish the connection, the first step is to set up the ODBC driver. Once the driver is configured, the Hadoop system can be added as a remote source, and MapReduce programs can then be used to access data in Hadoop by developing a virtual function.
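For orientation, adding Hive as a remote source over ODBC usually comes down to a statement like the sketch below. The DSN name, credentials and table names are placeholders, and the example assumes the Simba driver has already been registered (for example in odbc.ini) on the SAP HANA host.

    -- Hypothetical DSN "HIVE_CLUSTER" pointing at HiveServer2 via the Simba ODBC driver
    CREATE REMOTE SOURCE "HIVE_SRC" ADAPTER "hiveodbc"
      CONFIGURATION 'DSN=HIVE_CLUSTER'
      WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=hive;password=<password>';

    -- Hive tables can then be exposed as virtual tables, as with the Spark Controller
    CREATE VIRTUAL TABLE "MYSCHEMA"."VT_WEBLOGS"
      AT "HIVE_SRC"."<NULL>"."default"."weblogs";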

3. SAP VORA
This is an interesting option because it brings in-memory processing onto the Hadoop cluster itself. Like Hadoop, SAP VORA can scale out to thousands of nodes; it was designed specifically for large clusters of distributed data and can handle big-data volumes. There are two ways of connecting SAP HANA to SAP VORA: the SAP HANA Spark Controller and the SAP VORA remote source adapter (voraodbc). The voraodbc option is available when using SAP VORA 1.3 or later. Once the standalone systems are up and running, creating the connection is as simple as running a few lines of SQL on the SAP HANA instance and then refreshing the SAP HANA system. Once the connection is established, the SAP VORA tables that need to be accessed are selected from the context menu of the remote source that represents the SAP VORA system. These tables are added as virtual tables and can be accessed just like any other SAP HANA table.
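Those "few lines of SQL" on the SAP HANA side typically resemble the sketch below. Treat it purely as an illustration: the host name, port and credentials are placeholders, and the exact CONFIGURATION keys accepted by the voraodbc adapter depend on the SAP VORA release.

    -- Hypothetical remote source pointing at a SAP VORA instance via the voraodbc adapter
    CREATE REMOTE SOURCE "VORA_SRC" ADAPTER "voraodbc"
      CONFIGURATION 'ServerNode=vora-node.example.com:2202;Driver=libodbcHDB'
      WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=vora;password=<password>';

    -- After refreshing the remote source, selected VORA tables appear as virtual tables
    SELECT TOP 10 * FROM "MYSCHEMA"."VT_VORA_SALES";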

4. Virtual Functions and MapReduce Programs
Virtual functions created in SAP HANA together with Java MapReduce jobs can also be used to connect SAP HANA and Hadoop. Simple SQL statements can query the Hadoop system through a custom MapReduce job, provided the virtual function has been added as a 'Hadoop Virtual Function' within the Database Development section of SAP HANA. Similarly, you can use .hdbmrjob repository files to push Java MapReduce jobs into SAP HANA; all the JAR files that the MapReduce job needs are included inside the repository file. The SAP HANA system has a 'Hadoop MR Jobs Archive' within the Database Management section, which is where new virtual functions are created. After the MapReduce function has been created and activated, the runtime object of the .hdbmrjob file is made available as a catalog object. From then on, SQL statements can trigger the MapReduce jobs that connect to Hadoop.
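To make the flow concrete, a virtual function wrapping a word-count style MapReduce job might be declared roughly as below. This is a hedged sketch only: the remote source "HADOOP_SRC", the repository package, the mapper/reducer class names, the input path and the configuration keys are all invented placeholders, and the exact syntax should be verified against the SAP HANA documentation for your release.

    -- Hypothetical virtual function wrapping a Java MapReduce job stored in the repository
    CREATE VIRTUAL FUNCTION "MYSCHEMA"."HADOOP_WORD_COUNT"()
      RETURNS TABLE (word NVARCHAR(200), occurrences INTEGER)
      PACKAGE "my.package::wordcount_jobs"
      CONFIGURATION '{"mapred_jobchain":[{"mapred_input":"/data/books",
                       "mapred_mapper":"com.example.WordMapper",
                       "mapred_reducer":"com.example.WordReducer"}]}'
      AT "HADOOP_SRC";

    -- Triggering the MapReduce job is then an ordinary SQL call
    SELECT * FROM "MYSCHEMA"."HADOOP_WORD_COUNT"();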

5. ETL to SAP HANA
Typically, not all the data that is available on a Hadoop system needs to be available in SAP HANA. Hadoop may even hold data at a level of granularity that the analysis performed in SAP HANA does not need. In such cases it may be sufficient to load the necessary data into SAP HANA periodically. The relevant data on the Hadoop cluster can be extracted, transformed, aggregated and loaded into SAP HANA so that it is readily available for analysis, using ETL tools such as SAP BODS. The data is first processed and stored as structured data within Hadoop itself, and only then loaded into SAP HANA. This approach has the added advantage that it requires no additional configuration on the Hadoop cluster. The downsides are that real-time access to the data is out of the question and that the approach is not suitable for very large datasets.
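As a simple illustration of the "aggregate in Hadoop first" step, a scheduled Hive query could prepare a compact summary table that an ETL tool such as SAP BODS then picks up and loads into SAP HANA. The table and column names below are invented for the example.

    -- HiveQL: build a daily summary that is small enough to load into SAP HANA
    CREATE TABLE IF NOT EXISTS sales_daily_summary AS
    SELECT sale_date,
           product_id,
           SUM(amount) AS total_amount,
           COUNT(*)    AS transactions
    FROM   raw_sales_events
    GROUP BY sale_date, product_id;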

Conclusion
The in-memory computing engine of SAP HANA combined with the scale and efficiency of Hadoop has the potential to cover most of the data engineering requirements an enterprise could have. Because the two platforms were developed by entirely different organizations, making integration possible was crucial, and thanks to the connectivity work on both sides there are now many ways to integrate SAP HANA and Hadoop, depending on the IT infrastructure you already have. The architectural complexity behind these connections is not apparent from how easy they are to implement; thanks to the interoperability of SAP HANA and Hadoop, a connection can be set up in a matter of minutes. Each approach discussed here comes with its own pros, cons and restrictions, and it is up to the system architect to understand the specific use case for the integration and pick the most suitable of the many options now available.
