Overview About HDInsight and Hadoop Iterative data exploration Data Warehouse on demand ETL at scale Streaming at scale Machine learning Batch & Interactive Processing Run Custom Programs Upload Data for HDInsight Hadoop components on HDInsight R Server Apache Hive Apache Spark Apache HBase Apache Storm Apache Kafka Domain-joined HDInsight clusters Azure HDInsight and Hadoop Architecture HDInsight Architecture Hadoop Architecture Lifecycle of an HDInsight Cluster High availability Capacity planning Release notes Recent Archive Get Started Start with Hadoop and Hive Start with Spark Start with R Server Start with HBase & NoSQL Start with Storm Start with Interactive Hive (Preview) Start with Kafka (Preview) Hadoop sandbox Data Lake Tools with Hortonworks Sandbox Tools for Visual Studio HDInsight storage options How To Import and export data Upload data for Hadoop jobs Import and export tabular data with Sqoop Run Sqoop using SSH Run Sqoop using cURL Run Sqoop using the .NET SDK Run Sqoop using PowerShell Batch process data Use Hadoop for batch processing Use MapReduce with HDInsight Use MapReduce via SSH Use MapReduce via cURL Use MapReduce with the .NET SDK Use PowerShell Run the MapReduce samples Use Hive for batch queries Use Hive with HDInsight Use the Hive View Use Hive with Beeline Use Hive with cURL Use Hive with PowerShell Use Hive with the .NET SDK Use the HDInsight tools for Visual Studio Create and deploy a Hive User Defined Function Use Pig for batch processing Use Pig with HDInsight Use Pig via SSH Use Pig via PowerShell Use Pig via the .NET SDK Use Pig via cURL Use Pig with DataFu Use Spark for batch processing Use Spark with HDInsight Use Spark to process data in Data Lake Store Use Spark to process data in Azure Storage blobs ACTION: TODO: Where are the ADLS/WASB paths covered? 
Submit Spark batch jobs using Livy Use Spark SQL for batch queries Use Spark SQL with HDInsight Interactively query data Use Spark with notebooks Use Zeppelin notebooks with Spark Use Jupyter notebook with Spark Use external packages with Jupyter using cell magic Use external packages with Jupyter using script action Use a local Jupyter notebook Process data in real-time Use Spark for stream processing What is Spark Streaming? What is Spark Structured Streaming? Use Spark DStreams to process events from Kafka Use Spark DStreams to process events from Event Hubs Use Spark Structured Streaming to process events from Kafka Use Spark Structured Streaming to process events from Event Hubs Creating highly available Spark Streaming jobs in YARN Creating Spark Streaming jobs with exactly once event processing guarantees Use BI tools with HDInsight Use Spark from Power BI and Tableau Build data processing pipelines Use Azure Data Factory Use on-demand HDInsight clusters from Data Factory Use Oozie Use Oozie for workflows Use time-based Oozie coordinators Perform Machine Learning Use R Server What is R Server? 
Submit jobs from Visual Studio Tools for R Submit R jobs from R Studio Server Analyze data from Azure Storage and Data Lake Store using R Selecting a compute context Use Spark for Machine Learning Use Spark for Machine Learning Configuring R Server on Spark ACTION: MIGRATE suggest migrating content from here Creating SparkML pipelines Creating SparkML models in notebooks Use the Microsoft Cognitive Toolkit from Spark Perform Deep Learning Use Caffe for deep learning with Spark Use HBase Analyze real-time tweets Develop an app with Java Create HBase clusters on a virtual network Configure HBase replication Configure Backup and Replication for HBase and Phoenix on HDInsight Monitor HBase with OMS HBase storage options Using Spark with HBase Using the HBase REST SDK HBase - Migrating to a New Version Use Phoenix Phoenix in HDInsight Use Phoenix and SQLLine Configure Backup and Replication for HBase and Phoenix on HDInsight Phoenix performance monitoring Bulk Loading with Phoenix with psql Using Spark with Phoenix Using the Phoenix Query Server REST SDK Use Storm Deploy and manage topologies Develop data processing apps in SCP Storm examples Write to Data Lake Store Develop Java-based topologies with Maven Develop C# topologies with Hadoop tools Determine Twitter trending topics Process events with C# topologies Process events with Java topologies Use Power BI with a topology Analyze real-time sensor data Process vehicle sensor data Correlate events over time Develop topologies using Python Use domain-joined HDInsight (Preview) Configure domain joined clusters Configure Hive policies Add ACLs at the file and folder levels Manage domain joined clusters Manage authorized Ambari users Sync Other Users from Azure Active Directory to Cluster Use Kafka (Preview) Replicate Kafka data Use with Virtual Networks Use with Spark Use with Storm Develop Develop MapReduce programs Develop C# streaming MapReduce programs Develop Java MapReduce programs Develop Python streaming 
programs Use Python with Hive and Pig Develop Hive applications Hive and ETL Overview Connect to Hive with JDBC or ODBC Using external metadata stores Writing Hive applications using Java Writing Hive applications using Python Creating user defined functions Process and analyze JSON documents with Hive Hive samples Query Hive using Excel Analyze stored sensor data using Hive Analyze stored tweets using beeline and Hive Analyze flight delay data with Hive Analyze website logs with Hive Develop Spark applications Spark Scenarios Run Spark from the Shell Create a standalone app Choosing between Spark RDD, dataframe and dataset Use the HDInsight Tools for Eclipse Create apps using the Azure Toolkit for IntelliJ Debug jobs remotely with IntelliJ Optimizing and configuring Spark jobs for performance Configuring Spark settings Spark samples Analyze Application Insights telemetry with Spark Analyze website logs with Spark SQL Developing Machine Learning solutions with HDInsight SparkML samples Creating Spark ML Pipelines Spark MLLib samples Predict HVAC performance Predict food inspection results R Server on HDInsight samples Predicting flight delays using R Server on Spark Mahout on HDInsight Samples Generate recommendations with Mahout Serialize and deserialize data Serialize data with Avro Library Analyze big data Analyze using Power Query Deep Dives Advanced Analytics Deep Dive ETL Deep Dive Operationalize Data Pipelines with Oozie Streaming and Business Intelligence Extend clusters Customize clusters using Bootstrap Customize clusters using Script Action Develop script actions Install and use Presto Install or update Mono Add Hive libraries Use Giraph Use Hue Use R Use Solr Use Virtual Network Use Zeppelin Build HDInsight applications Install HDInsight apps Install custom apps Use REST to install apps Publish HDInsight apps to Azure Marketplace Install and use Cask Data Application Platform (CDAP) Install and use Dataiku Install and use Datameer Install and use H2O 
Install and use Streamsets Secure Use SSH with HDInsight Use SSH tunneling Restrict access to data Add ACLs for users at the file and folder levels Create .NET applications that run with a non-interactive identity Manage Manage Clusters Key scenarios to monitor Administering HDInsight using the Azure Portal Ports used by Hadoop services on HDInsight Upgrade HDInsight cluster to newer version OS patching for HDInsight cluster Scaling best practices Manage Linux Clusters Create Linux clusters Manage Cluster Logs Manage Linux clusters using the Ambari web UI Manage configurations with Ambari Use Ambari REST API Use Azure PowerShell Use cURL and the Azure REST API Use the .NET SDK Use the Azure CLI Use the Azure portal Use Azure Resource Manager templates Migrate to Resource Manager development tools Availability and reliability Use empty edge nodes Using a single Data Lake Store from multiple HDInsight clusters Install R Studio Server on HDInsight Manage Hadoop clusters Use .NET SDK Use Azure PowerShell Use the Azure CLI Add storage accounts to a running cluster Manage Spark cluster settings Troubleshoot Troubleshooting a Failed or Slow HDInsight Cluster Common Problems FAQ Tips for Linux Analyze HDInsight logs Debug apps with YARN logs Enable heap dumps Fix errors from WebHCat Use Ambari Views to debug Tez Jobs More troubleshooting Hive settings fix Out of Memory error Optimize Hive queries Hive query performance Troubleshooting Spark on HDInsight Track and debug jobs Known issues Reference PowerShell .NET (Hadoop) .NET (HBase) .NET (Avro) REST REST (Spark) Related Migrating from Windows clusters Migrate Windows clusters to Linux clusters Migrate .NET solutions to Linux clusters Resources Windows tools for HDInsight Get help on the forum Learning path