Storing a large number of small files in HDFS is inefficient for processing and puts needless pressure on NameNode metadata. This FAQ answers common questions on how to convert many small files into a larger SequenceFile and how to access it. In which use cases would the SequenceFile option be better than […]
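As a quick illustration of working with SequenceFiles from the shell (the path below is hypothetical, and creating the file itself typically requires a MapReduce job or the Java API): hadoop fs -text can decode a SequenceFile for inspection, whereas hadoop fs -cat would print raw binary:

$ hadoop fs -text /data/merged.seq | head
$ hadoop fs -cat /data/merged.seq | head    # raw bytes, for comparison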
How to run Hadoop without using SSH
The start-all.sh and stop-all.sh scripts in the hadoop/bin directory use SSH to launch some of the Hadoop daemons. If for some reason SSH is not available on the server, follow the steps below to run Hadoop without it. The goal is to replace every invocation of “hadoop-daemons.sh” with “hadoop-daemon.sh”. The “hadoop-daemons.sh” script simply runs “hadoop-daemon.sh” […]
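As a minimal sketch (assuming the Hadoop 1.x layout the post describes, with scripts under hadoop/bin), the per-node script can be run directly on each host instead of being fanned out over SSH:

$ bin/hadoop-daemon.sh start namenode    # run on the NameNode host
$ bin/hadoop-daemon.sh start datanode    # run on each DataNode host
$ bin/hadoop-daemon.sh stop datanode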
How To Modify Hadoop Log Level
By default, Hadoop’s log level is set to INFO. This can be too verbose for most installations, generating huge log files even in an environment with low to moderate traffic. Changing the root logger in Hadoop’s log4j.properties file alone will not change the log level. Follow the steps below to change the […]
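A minimal sketch of the idea, assuming the stock appender names from Hadoop’s bundled log4j.properties: the effective level is normally injected through the HADOOP_ROOT_LOGGER variable in hadoop-env.sh, which is why editing the root logger in log4j.properties alone has no effect:

# in hadoop-env.sh (WARN is just an example level)
export HADOOP_ROOT_LOGGER=WARN,RFA

A running daemon’s level can also be changed without a restart (hostname and HTTP port are illustrative):

$ hadoop daemonlog -setlevel nn-host:50070 org.apache.hadoop.hdfs.server.namenode.NameNode WARN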
Understanding the Hadoop MapReduce framework
This post introduces the MapReduce framework, which enables you to write applications that process vast amounts of data, in parallel, on large clusters of commodity hardware, in a reliable and fault-tolerant manner. In addition, this post describes the architectural components of MapReduce and lists the benefits of using it. MapReduce is a software framework […]
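To see the framework in action, Hadoop ships with example jobs such as WordCount (the jar name and HDFS paths vary by version and distribution; these are illustrative):

$ hadoop jar hadoop-mapreduce-examples.jar wordcount /user/alice/input /user/alice/output
$ hdfs dfs -cat /user/alice/output/part-r-00000 | head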
CCA131 – Configure NameNode HA
Note: This post is part of the CCA Administrator Exam (CCA131) objectives series. HDFS High Availability Overview: A single NameNode is a single point of failure in a Hadoop cluster. You can experience HDFS downtime from an unexpected NameNode crash or from planned NameNode maintenance. Having a NameNode high availability setup avoids these single points […]
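Once HA is in place, the active/standby roles can be verified from the command line; nn1 and nn2 below are example NameNode IDs as they would be defined in hdfs-site.xml:

$ hdfs haadmin -getServiceState nn1
active
$ hdfs haadmin -getServiceState nn2
standby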
CCA131 – Configuring HDFS snapshot Policy
Note: This post is part of the CCA Administrator Exam (CCA131) objectives series. What is an HDFS Snapshot Policy: You can create snapshot policies in Cloudera Manager to take automated snapshots of snapshottable paths on HDFS. The policies run at the interval the user specifies (hourly, daily, weekly, etc.). Before we can create […]
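For context, a path must be made snapshottable before a policy can cover it; as the HDFS superuser that typically looks like this (the path is an example):

$ sudo -u hdfs hdfs dfsadmin -allowSnapshot /data/important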
CCA 131 – Create/restore a snapshot of an HDFS directory (Using Cloudera Manager)
Note: This post is part of the CCA Administrator Exam (CCA131) objectives series. HDFS Snapshots: Directories in HDFS can be snapshotted, which means creating one or more point-in-time images, or snapshots, of the directory. Snapshots include subdirectories and can even cover the entire filesystem (be careful with this, for obvious reasons). Snapshots can be used […]
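The command-line equivalents look roughly like this (the path and snapshot name are illustrative):

$ hdfs dfs -createSnapshot /data/important before-upgrade
$ hdfs dfs -ls /data/important/.snapshot
$ hdfs dfs -cp /data/important/.snapshot/before-upgrade/file.txt /data/important/   # restore a deleted file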
CCA131 – Create an HDFS user’s home directory
Note: This post is part of the CCA Administrator Exam (CCA131) objectives series. In the exam, you may be asked to create a home directory on HDFS for an existing local user. You may further be asked to set specific ownership or permissions on the home directory. The process basically involves: Create a local […]
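A minimal sketch of those steps, assuming a hypothetical local user named alice:

$ sudo useradd alice
$ sudo -u hdfs hdfs dfs -mkdir /user/alice
$ sudo -u hdfs hdfs dfs -chown alice:alice /user/alice
$ sudo -u hdfs hdfs dfs -chmod 750 /user/alice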
CCA 131 – Configure HDFS ACLs
Note: This post is part of the CCA Administrator Exam (CCA131) objectives series. Hadoop Access Control Lists are based on the POSIX ACLs available on the Linux filesystem. These ACLs allow you to attach a set of permissions to a file or directory that is not limited to just the one owning user and group who […]
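For example, a user outside the owner and group can be granted access without changing either (the user name and path are hypothetical, and dfs.namenode.acls.enabled must be set to true):

$ hdfs dfs -setfacl -m user:bob:r-x /data/reports
$ hdfs dfs -getfacl /data/reports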
CCA 131 – Rebalance the cluster
Note: This post is part of the CCA Administrator Exam (CCA131) objectives series. In a long-running cluster, data can become unevenly distributed across DataNodes, for example after node failures or when new nodes are added to the cluster. To make sure that the data is equally distributed across DataNodes, it […]
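From the shell, the balancer is typically run as the HDFS superuser; the threshold (the maximum allowed deviation, in percent, from average utilization) below is just an example value:

$ sudo -u hdfs hdfs balancer -threshold 10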