HADOOP ARCHIVES GUIDE
Hadoop archives are special format archives. A Hadoop archive maps to a file system directory and always has a *.har extension. A Hadoop archive directory contains metadata (in the form of _index and _masterindex files) and data (part-*) files. The _index file contains the names of the files that are part of the archive and their locations within the part files.
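To see this layout for yourself, you can list an archive directory with the ordinary HDFS shell. A sketch, assuming a running HDFS cluster and a hypothetical archive at /user/zoo/foo.har:

```shell
# List the internal files of a Hadoop archive directory.
# /user/zoo/foo.har is a placeholder; substitute your own archive path.
hdfs dfs -ls /user/zoo/foo.har
# The listing typically shows _index, _masterindex, and one or more part-* files.
```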
How to Create a Hadoop Archive
First, give the archive a name with the -archiveName option, for example foo.har. All archive names must end with the .har extension.
The -p (parent) argument specifies the parent path relative to which the files should be archived.
For example: -p /foo/bar x/y/z
- /foo/bar is the parent path and
- x/y/z are relative paths to the parent
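Putting the name and parent arguments together, a full invocation might look like the following sketch (the /foo/bar and /outputdir paths are placeholders, and the command requires a running cluster):

```shell
# Archive /foo/bar/x/y/z into foo.har, writing the archive under /outputdir.
# foo.har, /foo/bar, x/y/z, and /outputdir are all example names.
hadoop archive -archiveName foo.har -p /foo/bar x/y/z /outputdir
```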
Always keep in mind that archive creation runs as a MapReduce job, so you need a MapReduce cluster to run the command.
-r indicates the desired replication factor; if this optional argument is not specified, a replication factor of 3 will be used.
If you just want to archive a single directory /foo/bar, you can use:
hadoop archive -archiveName zoo.har -p /foo/bar -r 3 /outputdir
HDFS directories can be created using the following series of commands:
hdfs dfs -mkdir /user/zoo
hdfs dfs -mkdir /user/hadoop
hdfs dfs -mkdir /user/hadoop/dir1
hdfs dfs -mkdir /user/hadoop/dir2
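With the directories above in place, dir1 and dir2 can be archived together in one job. A sketch, assuming /user/zoo is used as the destination for the archive:

```shell
# Archive dir1 and dir2 (relative to the parent /user/hadoop) into foo.har,
# placing the resulting archive under /user/zoo.
hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo
```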
How to Look Up Files in Hadoop Archives
- The URI for a Hadoop archive is har://scheme-hostname:port/archivepath/fileinarchive
- If no scheme is provided, the underlying file system is assumed, and the URI looks like: har:///archivepath/fileinarchive
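Because an archive is exposed as a file system, the ordinary hdfs dfs commands work through the har:// scheme. A sketch, assuming the hypothetical foo.har archive under /user/zoo:

```shell
# List the top level of the archive through the har filesystem.
hdfs dfs -ls har:///user/zoo/foo.har
# Read a file inside the archive (dir1/hello.txt is a hypothetical file).
hdfs dfs -cat har:///user/zoo/foo.har/dir1/hello.txt
```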
How to Unarchive a Hadoop Archive
To unarchive sequentially:
hdfs dfs -cp har:///user/zoo/foo.har/dir1 hdfs:/user/zoo/newdir
To unarchive in parallel:
hadoop distcp har:///user/zoo/foo.har/dir1 hdfs:/user/zoo/newdir