Integrating Big Data and SQL Server 2016
Big Data is quickly becoming an important asset for many businesses as new technologies like IoT (Internet-Of-Things) are being adopted that drive the collection and analysis of vast amounts of data. Michael Otey explains how SQL Server 2016’s PolyBase can help.
April 21, 2016
Big Data is quickly becoming an important asset for many businesses as new technologies like IoT (Internet-Of-Things) are being adopted that drive the collection and analysis of vast amounts of data. Big Data isn’t typically like relational data in that it isn’t always collected in ordered relational sets. There are three Vs that characterize Big Data: Volume, Velocity or Variety. This means that Big Data projects are either dealing with lots of data, data that is streamed or collected very rapidly or data that comes in many shapes and sizes. Relational databases aren’t typically used for Big Data projects. Instead, you’ll typically see NoSQL databases used. Many Big Data projects are implemented using systems like Microsoft’s HDInsight or Hadoop. That said, these Big Data systems usually need to be integrated with existing data from relational databases or data warehouses.
That’s where SQL Server 2016’s PolyBase feature comes into play. PolyBase is essentially a bridge between SQL Server’s relational or data warehousing databases and unstructured Big Data databases like Hadoop. PolyBase isn’t exactly a new feature. PolyBase was formerly only found in the SQL Server Parallel Data Warehouse, which has since been renamed to the Microsoft Analytics Platform System. The Parallel Data Warehouse was typically only used by the largest of organizations. The inclusion of PolyBase into the main SQL Server 2016 line-up opens up this Big Data bridge for all SQL Server 2016 installations.
PolyBase enables you to use your existing SQL Server tools like SSMS and T-SQL to query these Big Data stores. You can access data in Hadoop clusters and even write queries that join this semi-structured Big Data with tables in SQL Server relational databases. With the initial release of SQL Server 2016, PolyBase is able to retrieve external data from the Cloudera and Hortonworks versions of Hadoop and Azure Blob Storage. Microsoft has stated that they intend to add other data sources in the future. You can see an overview of the connection to Hadoop and Azure blob storage in Figure 1.
Figure 1 – Connecting SQL Server 2016 to Big Data using PolyBase
PolyBase is a SQL Server installation option. To install PolyBase, you need to select the PolyBase Query Service for External Data option when you install SQL Server 2016. Unlike any of the other SQL Server 2016 options, PolyBase requires the Oracle Java SE Runtime Environment (JRE) version 7.51 or higher. The Java Runtime must be installed before you install the PolyBase feature.
Using PolyBase is much like using a linked server. PolyBase is transparent to the application and the external data sources are incorporated into the database schema. When you use PolyBase to query Hadoop data, the PolyBase engine determines when to generate a Hadoop map, reduces jobs on an as-needed basis and then returns the data to SQL Server.
Integrating Big Data into SQL Server extends the reach of SQL Server and its tools and enables you to use that data in your applications as well as in your reporting tools like Reporting Services, Power Query and Power BI.
Underwritten by HPE and Microsoft
About the Author
You May Also Like