Hadoop-BAM is a Java library for the manipulation of files in common bioinformatics formats using the Hadoop MapReduce framework with the Picard SAM JDK, and command line tools similar to SAMtools. The file formats currently supported are BAM, SAM, FASTQ, FASTA, QSEQ, BCF, and VCF.

Download

Hadoop is an open-source software framework from The Apache Software Foundation that lets applications handle petabytes of unstructured data on commodity hardware in a cloud environment. Because the system is based on Google's MapReduce and the Google File System (GFS), large data sets are divided into smaller blocks that a cluster can process in parallel.

Hadoop works with a distributed file system (HDFS), which allows data spread across multiple nodes to be processed by the cluster with high aggregate bandwidth.
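
As a rough illustration of this split-map-combine idea, here is a sketch in plain Python (not Hadoop itself; the input blocks and variable names are made up for the example):

from collections import Counter

# Pretend each string is a block of a much larger file.
blocks = [
    "hadoop splits large data sets into blocks",
    "blocks are processed in parallel across a cluster",
    "partial results are merged in a reduce step",
]

# Map phase: count words in each block independently;
# in Hadoop, these calls would run on different nodes.
partials = [Counter(block.split()) for block in blocks]

# Reduce phase: merge the per-block counts into one result.
totals = Counter()
for partial in partials:
    totals.update(partial)

print(totals.most_common(5))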

Given that Hadoop is relatively new for many organizations, it is important to see how these early adopters use it. Even though Hadoop is still in its early stages, and new projects that expand its range of uses are constantly being started, we already see entirely new usage patterns emerging. Let's look at these real-life situations from both a technical and a business perspective.

You can download Hadoop for Windows 10 (32-bit or 64-bit) free of charge from the official site.

Instructions tested with Windows 10 64-bit.

It is highly recommended that you use Mac OS X or Linux for this course; these instructions are only for people who cannot run Mac OS X or Linux on their computer.

Install and Setup

Spark provides APIs in Scala, Java, Python (PySpark) and R. We use PySpark and Jupyter, previously known as IPython Notebook, as the development environment. There are many articles online about Jupyter and what a great tool it is, so we won't introduce it in detail here.

This guide assumes you already have Anaconda and Gnu On Windows installed. See https://mas-dse.github.io/startup/anaconda-windows-install/

1. Go to http://www.java.com and install Java 7+.

2. Get the pre-built Spark package from the downloads page of the Spark project website.

3. Open PowerShell by pressing ⊞ Win+R, typing “powershell” in the Run dialog box, and clicking “OK”. Change your working directory to where you downloaded the Spark package.

4. Type the following commands to uncompress the Spark download. Alternatively, you can use any other decompression software you prefer.

> gzip -d spark-2.1.0-bin-hadoop2.7.tgz
> tar xvf spark-2.1.0-bin-hadoop2.7.tar

5. Type the following commands to move Spark to the C:\opt\spark directory.

> mkdir C:\opt
> move spark-2.1.0-bin-hadoop2.7 C:\opt\spark

6. Type the following commands to download winutils.exe for Spark.

> cd C:\opt\spark\bin
> curl -k -L -o winutils.exe https://github.com/steveloughran/winutils/blob/master/hadoop-2.6.0/bin/winutils.exe?raw=true

7. Create an environment variable with variable name = SPARK_HOME and variable value = C:\opt\spark. This link provides a good description of how to set environment variables in Windows.
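
Alternatively, the variable can be set for the current session from Python itself; this is a minimal sketch, assuming Spark was moved to C:\opt\spark in step 5 (unlike the Windows dialog, it only affects the running process):

import os

# Session-only setting; findspark.init() reads SPARK_HOME to locate Spark.
os.environ['SPARK_HOME'] = r'C:\opt\spark'

import findspark
findspark.init()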

8. Type the following commands to create a temporary directory.

> mkdir ~/Documents/jupyter-temp/
> cd ~/Documents/jupyter-temp/

9. Type the following commands to install, configure and run Jupyter Notebook. Jupyter Notebook will launch in your default web browser.

> conda install jupyter -y
> ipython kernelspec install-self
> jupyter notebook

First Spark Application

In our first Spark application, we will run a Monte Carlo experiment to find an estimate for $\pi$.

Here is how we are going to do it. The figure below shows a circle with radius $r = 1$ inscribed within a 2×2 square. The area of the circle is $\pi r^2 = \pi$ and the area of the square is $4$, so the ratio between the two areas is $\frac{\pi}{4}$. If we sample enough points uniformly in the square, approximately a fraction $\rho = \frac{\pi}{4}$ of these points will lie inside the circle. So we can estimate $\pi$ as $4\rho$.
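
To see the geometry in isolation before we distribute the work with Spark, here is a minimal NumPy-only sketch of the same estimator (the sample size is an illustrative choice, not from the original guide):

import numpy as np

n = 100000  # illustrative sample size
points = 2.0 * np.random.random((n, 2)) - 1.0   # uniform points in the 2x2 square
inside = np.linalg.norm(points, axis=1) <= 1.0  # True where x^2 + y^2 <= 1
rho = inside.mean()                             # fraction inside, approximately pi/4
print('Estimate of pi:', 4.0 * rho)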

1. Create a new Notebook by selecting Python 2 from the New drop down list at the right of the page.

2. First we will create the Spark Context. Copy and paste the following code into the first cell, then click the run cell button:

import os
import sys
import findspark
findspark.init()  # locate the Spark installation via SPARK_HOME
from pyspark import SparkContext
sc = SparkContext(master='local[4]')  # run locally with 4 worker threads

3. Next, we draw a sufficient number of points inside the square. Copy and paste the following code into the next cell, then click the run cell button:

import numpy as np
TOTAL = 1000000
# Each dot is a pair of coordinates drawn uniformly from [-1, 1).
dots = sc.parallelize([2.0 * np.random.random(2) - 1.0 for i in range(TOTAL)]).cache()
print('Number of random points:', dots.count())
stats = dots.stats()
print('Mean:', stats.mean())
print('stdev:', stats.stdev())

Output:

('Number of random points:', 1000000)
('Mean:', array([-0.0004401 , 0.00052725]))
('stdev:', array([ 0.57720696, 0.57773085]))

4. We can sample a small fraction of these points and visualize them. Copy and paste the following code into the next cell, then click the run cell button:

%matplotlib inline
from operator import itemgetter
from matplotlib import pyplot as plt
plt.figure(figsize = (10, 5))
# Plot 1
plt.subplot(1, 2, 1)
plt.xlim((-1.0, 1.0))
plt.ylim((-1.0, 1.0))
sample = dots.sample(False, 0.01)
X = sample.map(itemgetter(0)).collect()
Y = sample.map(itemgetter(1)).collect()
plt.scatter(X, Y)
# Plot 2
plt.subplot(1, 2, 2)
plt.xlim((-1.0, 1.0))
plt.ylim((-1.0, 1.0))
inCircle = lambda v: np.linalg.norm(v) <= 1.0
dotsIn = sample.filter(inCircle).cache()
dotsOut = sample.filter(lambda v: not inCircle(v)).cache()
# inside circle
Xin = dotsIn.map(itemgetter(0)).collect()
Yin = dotsIn.map(itemgetter(1)).collect()
plt.scatter(Xin, Yin, color = 'r')
# outside circle
Xout = dotsOut.map(itemgetter(0)).collect()
Yout = dotsOut.map(itemgetter(1)).collect()
plt.scatter(Xout, Yout)

Output:

<matplotlib.collections.PathCollection at 0x17a78780>

5. Finally, let’s compute the estimated value of $\pi$. Copy and paste the following code into the next cell, then click the run cell button:

# Fraction of dots inside the circle, scaled by 4
pi = 4.0 * (dots.filter(inCircle).count() / float(TOTAL))
print('The estimation of pi is:', pi)

Output:

Next Steps

References

  • Example Python Spark programs on the Spark Github repository