Overview

About HDInsight and Hadoop

Iterative data exploration

Data Warehouse on demand

ETL at scale

Streaming at scale

Machine learning

Batch & Interactive Processing

Run Custom Programs

Upload Data for HDInsight

Hadoop components on HDInsight

R Server

Apache Hive

Apache Spark

Apache HBase

Apache Storm

Apache Kafka

Domain-joined HDInsight clusters

Azure HDInsight and Hadoop Architecture

HDInsight Architecture

Hadoop Architecture

Lifecycle of an HDInsight Cluster

High availability

Capacity planning

Release notes

Recent

Archive

Get Started

Start with Hadoop and Hive

Start with Spark

Start with R Server

Start with HBase & NoSQL

Start with Storm

Start with Interactive Hive (Preview)

Start with Kafka (Preview)

Hadoop sandbox

Data Lake Tools with Hortonworks Sandbox

Tools for Visual Studio

HDInsight storage options

How To

Import and export data

Upload data for Hadoop jobs

Import and export tabular data with Sqoop

Run Sqoop using SSH

Run Sqoop using cURL

Run Sqoop using the .NET SDK

Run Sqoop using PowerShell

Batch process data

Use Hadoop for batch processing

Use MapReduce with HDInsight

Use MapReduce via SSH

Use MapReduce via cURL

Use MapReduce with the .NET SDK

Use PowerShell

Run the MapReduce samples

Use Hive for batch queries

Use Hive with HDInsight

Use the Hive View

Use Hive with Beeline

Use Hive with cURL

Use Hive with PowerShell

Use Hive with the .NET SDK

Use the HDInsight tools for Visual Studio

Create and deploy a Hive User-Defined Function

Use Pig for batch processing

Use Pig with HDInsight

Use Pig via SSH

Use Pig via PowerShell

Use Pig via the .NET SDK

Use Pig via cURL

Use Pig with DataFu

Use Spark for batch processing

Use Spark with HDInsight

Use Spark to process data in Data Lake Store

Use Spark to process data in Azure Storage blobs

ACTION: TODO: Where are the ADLS/WASB paths covered?

Submit Spark batch jobs using Livy

Use Spark SQL for batch queries

Use Spark SQL with HDInsight

Interactively query data

Use Spark with notebooks

Use Zeppelin notebooks with Spark

Use Jupyter notebook with Spark

Use external packages with Jupyter using cell magic

Use external packages with Jupyter using script action

Use a local Jupyter notebook

Process data in real-time

Use Spark for stream processing

What is Spark Streaming?

What is Spark Structured Streaming?

Use Spark DStreams to process events from Kafka

Use Spark DStreams to process events from Event Hubs

Use Spark Structured Streaming to process events from Kafka

Use Spark Structured Streaming to process events from Event Hubs

Creating highly available Spark Streaming jobs in YARN

Creating Spark Streaming jobs with exactly-once event processing guarantees

Use BI tools with HDInsight

Use Spark from Power BI and Tableau

Build data processing pipelines

Use Azure Data Factory

Use on-demand HDInsight clusters from Data Factory

Use Oozie

Use Oozie for workflows

Use time-based Oozie coordinators

Perform Machine Learning

Use R Server

What is R Server?

Submit jobs from Visual Studio Tools for R

Submit R jobs from RStudio Server

Analyze data from Azure Storage and Data Lake Store using R

Selecting a compute context

Use Spark for Machine Learning

Use Spark for Machine Learning

Configuring R Server on Spark

ACTION: MIGRATE: Migrate content from here

Creating SparkML pipelines

Creating SparkML models in notebooks

Use the Microsoft Cognitive Toolkit from Spark

Perform Deep Learning

Use Caffe for deep learning with Spark

Use HBase

Analyze real-time tweets

Develop an app with Java

Create HBase clusters on a virtual network

Configure HBase replication

Configure Backup and Replication for HBase and Phoenix on HDInsight

Monitor HBase with OMS

HBase storage options

Using Spark with HBase

Using the HBase REST SDK

HBase - Migrating to a New Version

Use Phoenix

Phoenix in HDInsight

Use Phoenix and SQLLine

Configure Backup and Replication for HBase and Phoenix on HDInsight

Phoenix performance monitoring

Bulk Loading with Phoenix using psql

Using Spark with Phoenix

Using the Phoenix Query Server REST SDK

Use Storm

Deploy and manage topologies

Develop data processing apps in SCP

Storm examples

Write to Data Lake Store

Develop Java-based topologies with Maven

Develop C# topologies with Hadoop tools

Determine Twitter trending topics

Process events with C# topologies

Process events with Java topologies

Use Power BI with a topology

Analyze real-time sensor data

Process vehicle sensor data

Correlate events over time

Develop topologies using Python

Use domain-joined HDInsight (Preview)

Configure domain-joined clusters

Configure Hive policies

Add ACLs at the file and folder levels

Manage domain-joined clusters

Manage authorized Ambari users

Sync Other Users from Azure Active Directory to Cluster

Use Kafka (Preview)

Replicate Kafka data

Use with Virtual Networks

Use with Spark

Use with Storm

Develop

Develop MapReduce programs

Develop C# streaming MapReduce programs

Develop Java MapReduce programs

Develop Python streaming programs

Use Python with Hive and Pig

Develop Hive applications

Hive and ETL Overview

Connect to Hive with JDBC or ODBC

Using external metadata stores

Writing Hive applications using Java

Writing Hive applications using Python

Creating user-defined functions

Process and analyze JSON documents with Hive

Hive samples

Query Hive using Excel

Analyze stored sensor data using Hive

Analyze stored tweets using Beeline and Hive

Analyze flight delay data with Hive

Analyze website logs with Hive

Develop Spark applications

Spark Scenarios

Run Spark from the Shell

Create a standalone app

Choosing between Spark RDD, DataFrame, and Dataset

Use the HDInsight Tools for Eclipse

Create apps using the Azure Toolkit for IntelliJ

Debug jobs remotely with IntelliJ

Optimizing and configuring Spark jobs for performance

Configuring Spark settings

Spark samples

Analyze Application Insights telemetry with Spark

Analyze website logs with Spark SQL

Developing Machine Learning solutions with HDInsight

SparkML samples

Creating Spark ML Pipelines

Spark MLlib samples

Predict HVAC performance

Predict food inspection results

R Server on HDInsight samples

Predicting flight delays using R Server on Spark

Mahout on HDInsight Samples

Generate recommendations with Mahout

Serialize and deserialize data

Serialize data with Avro Library

Analyze big data

Analyze using Power Query

Deep Dives

Advanced Analytics Deep Dive

ETL Deep Dive

Operationalize Data Pipelines with Oozie

Streaming and Business Intelligence

Extend clusters

Customize clusters using Bootstrap

Customize clusters using Script Action

Develop script actions

Install and use Presto

Install or update Mono

Add Hive libraries

Use Giraph

Use Hue

Use R

Use Solr

Use Virtual Network

Use Zeppelin

Build HDInsight applications

Install HDInsight apps

Install custom apps

Use REST to install apps

Publish HDInsight apps to Azure Marketplace

Install and use Cask Data Application Platform (CDAP)

Install and use Dataiku

Install and use Datameer

Install and use H2O

Install and use StreamSets

Secure

Use SSH with HDInsight

Use SSH tunneling

Restrict access to data

Add ACLs for users at the file and folder levels

Create .NET applications that run with a non-interactive identity

Manage

Manage Clusters

Key scenarios to monitor

Administering HDInsight using the Azure Portal

Ports used by Hadoop services on HDInsight

Upgrade HDInsight cluster to newer version

OS patching for HDInsight cluster

Scaling best practices

Manage Linux Clusters

Create Linux clusters

Manage Cluster Logs

Manage Linux clusters using the Ambari web UI

Manage configurations with Ambari

Use Ambari REST API

Use Azure PowerShell

Use cURL and the Azure REST API

Use the .NET SDK

Use the Azure CLI

Use the Azure portal

Use Azure Resource Manager templates

Migrate to Resource Manager development tools

Availability and reliability

Use empty edge nodes

Using a single Data Lake Store from multiple HDInsight clusters

Install RStudio Server on HDInsight

Manage Hadoop clusters

Use .NET SDK

Use Azure PowerShell

Use the Azure CLI

Add storage accounts to a running cluster

Manage Spark cluster settings

Troubleshoot

Troubleshooting a Failed or Slow HDInsight Cluster

Common Problems FAQ

Tips for Linux

Analyze HDInsight logs

Debug apps with YARN logs

Enable heap dumps

Fix errors from WebHCat

Use Ambari Views to debug Tez Jobs

More troubleshooting

Hive settings fix Out of Memory error

Optimize Hive queries

Hive query performance

Troubleshooting Spark on HDInsight

Track and debug jobs

Known issues

Reference

PowerShell

.NET (Hadoop)

.NET (HBase)

.NET (Avro)

REST

REST (Spark)

Related

Migrating from Windows clusters

Migrate Windows clusters to Linux clusters

Migrate .NET solutions to Linux clusters

Resources

Windows tools for HDInsight

Get help on the forum

Learning path