Monthly Archives: May 2016

Greenplum Single Node Installation

Step 1: Download a CentOS 6 VM from http://virtual-machine.org/. Step 2: Download the latest Greenplum binaries for Red Hat Enterprise Linux 6 from http://network.pivotal.io. Step 3: Start the virtual machine with VMware Fusion or something similar. Memory: 8GB; Cores: 4; Disk: 50GB. You can use less memory and fewer cores, but the more you provide the VM, the… Read More »
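
Once the VM is up, the install itself is roughly the following sketch (file names are illustrative placeholders; use the exact binaries downloaded from network.pivotal.io):
$ unzip greenplum-db-<version>-RHEL6-x86_64.zip
$ ./greenplum-db-<version>-RHEL6-x86_64.bin           # accept the license; installs under /usr/local
$ source /usr/local/greenplum-db/greenplum_path.sh    # put the gp* utilities on the PATH
$ gpinitsystem -c gpinitsystem_config                 # initialize a single-node cluster from a prepared config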

Test if a directory is empty in Bash

Using Bash, there are a number of ways to test whether a directory is empty. One of those ways is to use ls -A to list all files, including those starting with ., and see if anything is printed. This can be done like so: if [ ! "$(ls -A <path>)" ]; then echo "<path>… Read More »
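
A complete, runnable version of that test, using a hypothetical path (and assuming the directory exists, since ls errors otherwise):
#!/bin/bash
dir="/tmp/somedir"                 # hypothetical directory to check
if [ ! "$(ls -A "$dir")" ]; then   # ls -A prints nothing for an empty directory
    echo "$dir is empty"
else
    echo "$dir is not empty"
fi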

Zookeeper & Kafka Install: A single node and a single broker cluster – 2016

Starting ZooKeeper In the previous chapter, we ran the ZooKeeper package that's available in Ubuntu's default repositories as a daemon (zookeeperd). Let's stop that ZooKeeper daemon if it's running: $ sudo service zookeeper stop To launch a single local ZooKeeper instance, we'll use the default configuration that Kafka provides: $ ls ~/kafka/config consumer.properties producer.properties test-log4j.properties zookeeper.properties log4j.properties server.properties… Read More »
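
Assuming Kafka is unpacked under ~/kafka as shown, launching ZooKeeper with that bundled configuration is a single command:
$ cd ~/kafka
$ bin/zookeeper-server-start.sh config/zookeeper.properties   # runs in the foreground on port 2181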

Kafka Quick Start

Step 1: Download the code. Download the 0.8 release. > tar xzf kafka-<VERSION>.tgz > cd kafka-<VERSION> > ./sbt update > ./sbt package > ./sbt assembly-package-dependency This tutorial assumes you are starting with a fresh ZooKeeper instance with no pre-existing data. If you want to migrate from an existing 0.7 installation, you will need to follow… Read More »
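
Once the sbt build finishes, the 0.8 quick start typically continues by bringing up ZooKeeper and a single broker with the bundled configs (same layout as above):
> bin/zookeeper-server-start.sh config/zookeeper.properties
> bin/kafka-server-start.sh config/server.properties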

How To Install Apache Kafka on Ubuntu 14.04

Introduction Apache Kafka is a popular distributed message broker designed to handle large volumes of real-time data efficiently. A Kafka cluster is not only highly scalable and fault-tolerant, but also offers much higher throughput than other message brokers such as ActiveMQ and RabbitMQ. Though it is generally used as a pub/sub messaging… Read More »

Advanced Spark Programming

Spark provides two types of shared variables: broadcast variables and accumulators. Broadcast variables are used to distribute large values efficiently; accumulators are used to aggregate information, such as counters and sums, from the workers back to the driver. Broadcast Variables Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping… Read More »
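
A minimal spark-shell session illustrating both, using the Spark 1.x accumulator API that was current in 2016 (the lookup map and numbers are invented for the demo):
$ spark-shell
scala> val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))   // read-only value cached once per executor
scala> sc.parallelize(Seq("a", "b", "a")).map(k => lookup.value(k)).sum()
res0: Double = 4.0
scala> val acc = sc.accumulator(0)                          // workers may only add; the driver reads the value
scala> sc.parallelize(1 to 100).foreach(x => acc += 1)
scala> acc.value
res1: Int = 100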

Apache Spark – Deployment

spark-submit is a shell command used to deploy a Spark application on a cluster. It works with all of the supported cluster managers through a uniform interface, so you do not have to configure your application for each one. Example Let us take the same word-count example we used before, using shell commands.… Read More »
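
The submit command itself looks something like this (the class, jar, and input file names here are hypothetical stand-ins for the word-count example):
$ spark-submit --class SparkWordCount --master local wordcount.jar input.txt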

Apache Spark – Core Programming

Spark Core is the base of the whole project. It provides distributed task dispatching, scheduling, and basic I/O functionalities. Spark uses a specialized fundamental data structure known as the RDD (Resilient Distributed Dataset), a logical collection of data partitioned across machines. RDDs can be created in two ways; one is by referencing datasets in… Read More »
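
The two creation paths, shown in a spark-shell session (the file name is a placeholder):
$ spark-shell
scala> val fromFile = sc.textFile("input.txt")            // reference an external dataset
scala> val fromMemory = sc.parallelize(Seq(1, 2, 3, 4))   // parallelize an existing collection
scala> fromMemory.map(_ * 2).collect()
res0: Array[Int] = Array(2, 4, 6, 8)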