Hadoop Archives Guide – How to Create an Archive in Hadoop


Hadoop archives are special-format archives that map to a file system directory. A Hadoop archive always has a *.har extension. A Hadoop archive directory contains metadata files (_index and _masterindex) and data files (part-*). The _index file records the names of the files in the archive and their locations within the part files.

To learn more about Hadoop commands, read our previous post: Hadoop Commands Guide

How to Create a Hadoop Archive

Step 1:

First, specify an archive name with the -archiveName option, for example foo.har. (.har is the extension for all archives.) Every archive name must end with the .har extension.

Step 2:

To specify the relative path against which the files should be archived, use the parent argument -p.

For Example: -p /foo/bar x/y/z

Here,

  • /foo/bar is the parent path, and
  • x/y/z are paths relative to the parent.

Always keep in mind that the archive is created by a Map/Reduce job, so a running Map/Reduce cluster is required.
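As a sketch of the relative-path form, the example above could be run as follows (the output directory /user/zoo is an assumed, illustrative location):

```shell
# Archive the directory x/y/z, taken relative to the parent /foo/bar;
# inside the resulting archive the files keep the relative name x/y/z.
# This launches a Map/Reduce job on the cluster.
hadoop archive -archiveName foo.har -p /foo/bar x/y/z /user/zoo
```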

The -r option indicates the desired replication factor; if this optional argument is not specified, a replication factor of 3 is used.

If you just want to archive a single directory /foo/bar, you can use:

hadoop archive -archiveName zoo.har -p /foo/bar -r 3 /outputdir

If the source directories do not exist yet, they can be created in HDFS with hdfs dfs -mkdir before running the archive command.
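Putting the pieces together, a full run might look like this (the local file name localfile and the HDFS paths are illustrative, following the /foo/bar and zoo.har example above):

```shell
# Create the source directory tree in HDFS (illustrative paths)
hdfs dfs -mkdir -p /foo/bar/x/y/z

# Copy some local data into it (localfile is a hypothetical file)
hdfs dfs -put localfile /foo/bar/x/y/z

# Archive everything under /foo/bar into zoo.har in /outputdir,
# with a replication factor of 3 for the archive's part files
hadoop archive -archiveName zoo.har -p /foo/bar -r 3 /outputdir
```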

How to Look Up Files in Hadoop Archives

  • URI for Hadoop Archives is

har://scheme-hostname:port/archivepath/fileinarchive

  • If no scheme is provided, the URI looks like:

har:///archivepath/fileinarchive
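Because the har URI works with the ordinary file system commands, you can browse an archive directly; for example (foo.har and /user/zoo follow the names used in this guide, dir1/file1 is a hypothetical file inside the archive):

```shell
# Recursively list the contents of the archive, using the scheme-less URI
hdfs dfs -ls -R har:///user/zoo/foo.har

# Read a single file from inside the archive
hdfs dfs -cat har:///user/zoo/foo.har/dir1/file1
```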

How to Unarchive a Hadoop Archive

To unarchive sequentially:

hdfs dfs -cp har:///user/zoo/foo.har/dir1 hdfs:/user/zoo/newdir

To unarchive in parallel:

hadoop distcp har:///user/zoo/foo.har/dir1 hdfs:/user/zoo/newdir