How to Provision an IBM Cloud Analytics Engine Instance

Terraform — Overview

Terraform is open source software from HashiCorp that lets you provision and manage your cloud provider's resources using simple, declarative, JSON-like configuration files that define your requirements.

In general, Terraform simplifies provisioning, especially when you have to deal with multiple clouds and multiple environments. Almost all cloud and other infrastructure providers offer predefined templates and plugins that make it easy to use and navigate, so that you, as a DevOps engineer, don’t have to get into the details of the API…
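To make the idea of a declarative configuration concrete, here is a minimal sketch that generates a Terraform JSON-syntax file (`.tf.json`) from Python. It assumes the IBM Cloud provider's generic `ibm_resource_instance` resource and the service name `ibmanalyticsengine`. The instance name, plan, location, and resource group ID are hypothetical placeholders. A real Analytics Engine instance would also need cluster parameters that are left out here.

```python
import json


def analytics_engine_tf_json(name, plan, location, resource_group_id):
    """Return Terraform JSON-syntax configuration describing one service instance.

    Assumption: the IBM Cloud provider's `ibm_resource_instance` resource is used.
    Cluster sizing parameters are deliberately omitted from this sketch.
    """
    config = {
        "resource": {
            "ibm_resource_instance": {
                # "analytics_engine" is only the local Terraform name (hypothetical).
                "analytics_engine": {
                    "name": name,
                    "service": "ibmanalyticsengine",
                    "plan": plan,
                    "location": location,
                    "resource_group_id": resource_group_id,
                }
            }
        }
    }
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    # All argument values below are placeholders for illustration.
    print(analytics_engine_tf_json("my-ae", "standard-hourly", "us-south", "<resource-group-id>"))
```

The printed text can be saved as a file such as `main.tf.json`. Terraform reads JSON syntax alongside its native HCL syntax, so the same declaration could equally be written by hand in a `.tf` file.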

Introduction

Many Business Intelligence (BI) and reporting tools, such as MicroStrategy, Tableau, and SPSS, require ODBC connectivity to databases to run analytical SQL queries. This blog walks through how you can connect to IBM Analytics Engine’s Hive endpoint using any of the standard Hive ODBC drivers. Analytics Engine is built on top of Hortonworks Data Platform 3.1.5 (Apache Hive is at version 3.1.0).
The configuration samples include connecting from the following clients:
a) Linux system CLI
b) SPSS Modeler
c) Microsoft Excel
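As a rough illustration of what such a client configuration boils down to, the sketch below assembles a DSN-less ODBC connection string for HiveServer2 over HTTPS. The key names (`HiveServerType`, `ThriftTransport`, `HTTPPath`, `AuthMech`) follow the Simba-based Hive ODBC drivers and are an assumption here; other drivers use different keys, so check your driver's documentation. The driver name, host, path, and credentials are hypothetical.

```python
def hive_odbc_connection_string(driver, host, port, http_path, user, password):
    """Build a DSN-less ODBC connection string for a Hive endpoint.

    Assumption: key names match Simba-based Hive ODBC drivers.
    Values containing ';' are not escaped in this sketch.
    """
    parts = [
        ("Driver", driver),
        ("Host", host),
        ("Port", str(port)),
        ("HiveServerType", "2"),   # HiveServer2
        ("ThriftTransport", "2"),  # HTTP transport
        ("HTTPPath", http_path),
        ("SSL", "1"),              # encrypt the connection
        ("AuthMech", "3"),         # user name and password
        ("UID", user),
        ("PWD", password),
    ]
    return ";".join(f"{key}={value}" for key, value in parts)


if __name__ == "__main__":
    # Every value below is a placeholder, not a real endpoint or credential.
    print(hive_odbc_connection_string(
        "Hive ODBC Driver", "cluster.example.com", 8443,
        "gateway/default/hive", "hive_user", "hive_password"))
```

A string like this can be handed to any ODBC client that accepts DSN-less connections. On Linux the same key/value pairs would typically go into an `odbc.ini` entry instead.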

ODBC Drivers — Overview

Some of the enterprise data analytics platforms come with inbuilt drivers and connectors that can connect to all…

Yes, you can! By using transactional Hive ORC tables

Answer

One question that is often asked is: “How can I modify or delete data that is on S3 or IBM Cloud Object Storage?” The answer is surprisingly simple. You can do that with the following caveats:
- It works only with Hive Transactional tables
- It is supported only for Hive ORC format
- It is supported only from Hive. It is not yet supported from Spark, so you cannot use Spark SQL to create or work with these tables.
- It is supported only for Managed tables
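The caveats above can be sketched as a short sequence of HiveQL statements, generated here from Python so they can be pasted into a Hive client such as Beeline. The table name, columns, and values are hypothetical. The table is created without the `EXTERNAL` keyword (so it is managed), stored as ORC, and marked transactional, which is what allows the later `UPDATE` and `DELETE`.

```python
def acid_demo_statements(table="customers"):
    """Return HiveQL statements that create and modify a transactional ORC table.

    The table and column names are illustrative only.
    """
    return [
        # Managed (no EXTERNAL keyword), ORC format, transactional.
        f"CREATE TABLE {table} (id INT, name STRING) "
        "STORED AS ORC TBLPROPERTIES ('transactional'='true')",
        f"INSERT INTO {table} VALUES (1, 'alice'), (2, 'bob')",
        # Row-level changes work only because the table is transactional.
        f"UPDATE {table} SET name = 'carol' WHERE id = 2",
        f"DELETE FROM {table} WHERE id = 1",
    ]


if __name__ == "__main__":
    for statement in acid_demo_statements():
        print(statement + ";")
```

These statements must be run from Hive itself, in line with the caveat that Spark SQL cannot work with these tables.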

Hive Transactional Tables

Apache Hive supports transactions — which means you can…

A simple, quick script that you can set up to monitor your cluster

Ambari REST API based Monitoring for Analytics Engine

This article has been co-written with Chetan Bhatia, DevOps Consultant, IBM.

Overview

As a data scientist, you want to concentrate on the business logic of your program and not worry about the stability and availability of the compute engine that runs your application. That is the ideal world. In practice, several infrastructure dependencies can add up and disrupt your Spark or Hadoop jobs.

That is where monitoring with alerts and notifications plays a key role in building a solid, enterprise-grade application system. A typical organization has several environments across…
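A minimal sketch of the kind of check such a monitoring script might perform: given the JSON body returned by Ambari's services endpoint (`/api/v1/clusters/<cluster_name>/services?fields=ServiceInfo/state`), list every service that is not in the `STARTED` state. Fetching the response and sending the alert are left out, and the sample payload is made up for illustration.

```python
import json


def services_not_started(response_text):
    """Return the sorted names of services whose state is not STARTED.

    `response_text` is the JSON body of an Ambari services listing that
    includes the ServiceInfo/state field for each item.
    """
    payload = json.loads(response_text)
    return sorted(
        item["ServiceInfo"]["service_name"]
        for item in payload.get("items", [])
        if item["ServiceInfo"].get("state") != "STARTED"
    )


if __name__ == "__main__":
    # Hypothetical response; a stopped service reports the INSTALLED state.
    sample = json.dumps({
        "items": [
            {"ServiceInfo": {"service_name": "HDFS", "state": "STARTED"}},
            {"ServiceInfo": {"service_name": "SPARK2", "state": "INSTALLED"}},
            {"ServiceInfo": {"service_name": "HIVE", "state": "STARTED"}},
        ]
    })
    print(services_not_started(sample))  # prints ['SPARK2']
```

A scheduled job could call this on each poll and raise a notification whenever the returned list is not empty.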

Overview

Amadeus, the travel technology company set up by a group of European airlines to enable travel agents to carry out flight ticketing online, has turned to a NoSQL database technology to enable travellers to ask complex questions about their journey. Travel site Kayak is using Amadeus Instant Search technology to increase conversion rates from “looking” to “booking”.

The company chose NoSQL database MongoDB to help it build an “instant search” application that can browse billions of travel options across multiple criteria in real time.

Above are excerpts from an article on ComputerWeekly.com that gives an interesting take on why the…

TL;DR

This story is based on a customer use case: how you can combine Serverless with Managed Services to construct analytic workflows. We’ll see why such a combination is needed and the steps to go about it.

Why serverless when there is a service? Ben Kehoe’s excellent article Serverless is a State of Mind talks about how…

“You should use functions as the glue, containing your business logic, between managed services that are providing the heavy lifting that forms the majority of your application.”

Overview

  1. Data comes in batches and gets stored on the Cloud Object Storage. The frequency…

Overview

The benefits of using private endpoints in IBM Cloud are threefold:
1. Your data does not pass through the public network, so it is more secure.
2. You get better performance.
3. Overall, you save on cost when the data transfer is internal.

This article demonstrates how you can architect a solution around private endpoints as far as possible, to leverage these benefits.
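As one small, hedged example of preferring private endpoints, the helper below rewrites a public IBM Cloud Object Storage endpoint host into its private counterpart. It assumes the `s3.<region>.cloud-object-storage.appdomain.cloud` naming scheme, where the private variant inserts `private.` after the `s3.` prefix. Verify the exact host names for your region before relying on it.

```python
def to_private_cos_endpoint(endpoint):
    """Map a public COS endpoint host to its private equivalent.

    Assumption: hosts follow the `s3.<region>...` naming scheme and the
    private variant is `s3.private.<region>...`. Hosts that are already
    private or direct are returned unchanged.
    """
    host = endpoint.strip()
    if host.startswith(("s3.private.", "s3.direct.")):
        return host
    if not host.startswith("s3."):
        raise ValueError(f"unexpected COS endpoint host: {host!r}")
    return "s3.private." + host[len("s3."):]


if __name__ == "__main__":
    print(to_private_cos_endpoint("s3.us-south.cloud-object-storage.appdomain.cloud"))
```

Private endpoints resolve only from inside the IBM Cloud network, so a job that runs outside it would still need the public host.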

Traffic Flows

This article and its diagram concern the upstream and downstream services usually associated with IBM Analytics Engine. In the picture, there are two different instances of Analytics Engine. The Analytics…

Overview

Safeguarding business data is one of the critical pieces of any application design. It is important that data is accessed safely and securely, and only with authorized credentials. IP whitelisting is an additional layer of security that can be implemented to allow only trusted hosts to access your data.

This writeup discusses how you can whitelist the IP addresses (both private and public) of your IBM Analytics Engine cluster so that it can access your data stored either on-premises or in IBM Cloud Object Storage.
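Conceptually, a whitelist is a set of addresses or CIDR ranges that an incoming address is tested against. The following sketch shows that check with Python's standard `ipaddress` module. The addresses come from documentation and private ranges and are purely illustrative; they are not real cluster IPs.

```python
import ipaddress


def is_allowed(client_ip, allowlist):
    """Return True if client_ip matches any address or CIDR range in allowlist."""
    address = ipaddress.ip_address(client_ip)
    return any(
        # A bare address such as "203.0.113.10" is treated as a single-host network.
        address in ipaddress.ip_network(entry, strict=False)
        for entry in allowlist
    )


if __name__ == "__main__":
    # Hypothetical whitelist: one public cluster IP and one private subnet.
    allowlist = ["203.0.113.10", "10.10.0.0/16"]
    for ip in ("203.0.113.10", "10.10.4.7", "198.51.100.1"):
        print(ip, is_allowed(ip, allowlist))
```

In practice the storage service or firewall performs this check. The sketch only shows why both the public and the private IPs of the cluster need to be on the list.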

PART 1: ANALYTICS ENGINE API TO GET THE PUBLIC & PRIVATE IPs

Before we get started with whitelisting, first let us get the Public & Private IPs of…

In this blog you will learn how and why to make your IBM Analytics Engine (1.2) cluster stateless by keeping your data and (Hive) metadata outside of the cluster. We use IBM Cloud Object Storage and Databases for PostgreSQL.

Overview

Separating storage from compute is a recommended paradigm that brings in flexibility and optimization of resources. The decoupling allows you to scale up (or scale down) either of the two, independently, without impacting the other. Specifically in the case of IBM Analytics Engine, it allows you to get into the cattle-vs-pets way of thinking. Need to move from a cluster on…
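One way to picture keeping Hive metadata outside the cluster is through the standard Hive metastore connection properties, which point the metastore at an external database. The sketch below builds those properties for a PostgreSQL database. The host, port, database name, and credentials are hypothetical, and SSL-related URL parameters that a managed database would typically require are omitted.

```python
def external_metastore_properties(host, port, database, user, password):
    """Return hive-site style properties for a PostgreSQL-backed Hive metastore.

    All argument values are supplied by the caller; none are real defaults.
    """
    return {
        "javax.jdo.option.ConnectionURL": f"jdbc:postgresql://{host}:{port}/{database}",
        "javax.jdo.option.ConnectionDriverName": "org.postgresql.Driver",
        "javax.jdo.option.ConnectionUserName": user,
        "javax.jdo.option.ConnectionPassword": password,
    }


if __name__ == "__main__":
    # Placeholder connection details for illustration only.
    props = external_metastore_properties(
        "pg.example.com", 5432, "hivemeta", "metastore_user", "metastore_password")
    for key, value in props.items():
        print(f"{key}={value}")
```

Because the table definitions live in the external database and the data lives in object storage, a replacement cluster configured with the same properties can see the same tables.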

In this tutorial you will learn how analytics can be performed on a shared compute engine and shared cloud object storage, but with separate jobs and job configurations. This solution uses the Analytics Engine and Cloud Object Storage services to perform Spark analytics in separate contexts while sharing the underlying engine and storage instances.

This is a fictional scenario of an analytics team (Sportify Inc) that has two data scientists and a data engineer. To save on costs and administration, the company wants to use one instance of the processing server and a common data lake for all of their…

Mrudula Madiraju

Senior Consultant, IBM Cloud. Sharing titbits of epiphanies...
