Hadoop – What is DistCp (Distributed copy) in Hadoop

What is DistCp in Hadoop

Design Structure of DistCp (Distributed copy)

DistCp (distributed copy) is a tool for large inter-cluster and intra-cluster copying. It uses MapReduce to handle distribution, error handling and recovery, and reporting. It expands a list of files and directories into input for map tasks, each of which copies a partition of the files specified in the source list.

How to Use DistCp for Inter-Cluster Copying in Hadoop?

Use the following command to start an inter-cluster copy:
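A minimal sketch of such a command, assuming both NameNodes accept connections on port 8020 and that /bar/foo is the destination path on nn2 (the port and the destination path are illustrative assumptions). The leading echo only prints the command, so the snippet is safe to paste anywhere.

```shell
# Assumed endpoints: port 8020 and the destination path are placeholders.
SRC="hdfs://nn1:8020/foo/bar"
DST="hdfs://nn2:8020/bar/foo"

# Dry run: prints the DistCp invocation instead of executing it.
# Remove the leading "echo" on a machine with a configured Hadoop client.
echo hadoop distcp "$SRC" "$DST"
```

The source always comes first and the destination last, each given as a full URI so that DistCp knows which cluster a path belongs to.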

This command expands the namespace under /foo/bar on nn1 into a temporary file, partitions its contents among a set of map tasks, and then starts a copy on each NodeManager from nn1 to nn2.

How to Use DistCp to Copy Multiple Source Directories Between Clusters?

Use the following command to start an inter-cluster copy from multiple source directories:
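A sketch of a multi-source copy, again assuming port 8020 on both NameNodes and a hypothetical destination /bar/foo on nn2. Every argument except the last is treated as a source. The echo prefix makes this a dry run.

```shell
# Assumed endpoints: port 8020 and the destination path are placeholders.
SRC_A="hdfs://nn1:8020/foo/a"
SRC_B="hdfs://nn1:8020/foo/b"
DST="hdfs://nn2:8020/bar/foo"

# Dry run: prints the command. Remove "echo" to launch the copy.
echo hadoop distcp "$SRC_A" "$SRC_B" "$DST"
```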

This command expands the namespace under /foo/a on nn1 into a temporary file, partitions its contents among a set of map tasks, and then starts a copy on each NodeManager from nn1 to nn2. The same happens for /foo/b.
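The same source list can also be supplied from a file with the -f option listed under Command Line Options. The sketch below assumes port 8020 and a hypothetical list location hdfs://nn1:8020/srclist. It writes the list locally with one URI per line, then prints (without running) the upload and the copy.

```shell
# Write the source list, one URI per line, to a local temporary file.
SRCLIST="$(mktemp)"
printf '%s\n' "hdfs://nn1:8020/foo/a" "hdfs://nn1:8020/foo/b" > "$SRCLIST"

# Dry run: upload the list to nn1, then hand its URI to -f.
# The name "srclist" and the port are placeholders.
echo hadoop fs -put "$SRCLIST" hdfs://nn1:8020/srclist
echo hadoop distcp -f hdfs://nn1:8020/srclist hdfs://nn2:8020/bar/foo
```

A list file is easier to maintain than a long command line once the number of source directories grows.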

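When copying from multiple sources, DistCp aborts the copy with an error message if two sources collide. Collisions at the destination, on the other hand, are resolved according to the options you specify. By default, files that already exist at the destination are skipped.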

What is Update and Overwrite in Hadoop

Update

If you want to copy files from the source that don't exist at the target, or that differ from the target version, use -update.

Overwrite

If you want to overwrite files that already exist at the target, use -overwrite.

Example: How to use Update and Overwrite

Suppose you want to copy from /source/first/ and /source/second/ to /target/, and each source directory contains a few files.

When DistCp is used without -update or -overwrite, the two directories first and second are themselves created under /target, so the files end up under /target/first and /target/second.
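The default behaviour can be sketched as follows. The file names 1, 2, 10 and 20, the port 8020, and the helper default_target are all hypothetical. The helper only illustrates the path mapping and is not part of DistCp.

```shell
# Hypothetical source files on nn1; the file names are placeholders.
SOURCES="/source/first/1 /source/first/2 /source/second/10 /source/second/20"

# Dry run: DistCp with neither -update nor -overwrite.
echo hadoop distcp hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target

# Illustration only: the source directory name is kept under the target,
# so /source/first/1 maps to /target/first/1.
default_target() { echo "/target/${1#/source/}"; }

for f in $SOURCES; do
  default_target "$f"
done
```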

When DistCp is used with -update or -overwrite, only the contents of the specified source directories are copied to the target, not the source directories themselves, so the files land directly under /target.
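The same copy with -update can be sketched like this, using the same hypothetical file names and port as before. The helper update_target is illustrative only. It strips the source directory from each path to show where the file would land.

```shell
# Dry run: the same copy, this time with -update.
echo hadoop distcp -update hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target

# Illustration only: $1 is a source directory, $2 a file below it.
# The source directory itself is dropped, so /source/first/1 maps to /target/1.
update_target() { echo "/target/${2#"$1"/}"; }

update_target /source/first  /source/first/1
update_target /source/first  /source/first/2
update_target /source/second /source/second/10
update_target /source/second /source/second/20
```

Because the directory names are dropped, two source directories that both contain a file with the same name would map to the same target path. That is the kind of source collision that makes DistCp abort.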

Command Line Options

  1. -p[rbugpcaxt]: Preserve file attributes. r: replication number, b: block size, u: user, g: group, p: permission, c: checksum type, a: ACL, x: XAttr, t: timestamp.
  2. -i: Ignore failures.
  3. -log <logdir>: Write logs to <logdir>.
  4. -m <num_maps>: Maximum number of simultaneous copies.
  5. -overwrite: Overwrite files that already exist at the target.
  6. -update: Copy files from the source that don't exist at the target or that differ from the target version.
  7. -append: Incremental copy of a file with the same name but a different length.
  8. -f <urilist_uri>: Use the list at <urilist_uri> as the source list.
  9. -filters: The path to a file containing a list of pattern strings, one per line. Paths matching a pattern are excluded from the copy.
  10. -strategy {dynamic|uniformsize}: Choose the copy strategy used by DistCp.
  11. -bandwidth: Specify bandwidth per map, in MB/second.
  12. -atomic {-tmp <tmp_dir>}: Specify atomic commit, with an optional tmp directory.
  13. -numListstatusThreads: Number of threads to use for building the file listing.
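Several of these options are commonly combined in one invocation. The sketch below assumes the same placeholder endpoints as earlier (port 8020, destination /bar/foo). The map count and bandwidth values are arbitrary examples, not recommendations.

```shell
# Placeholder values: tune these for your own cluster.
MAX_MAPS=20       # -m: maximum number of simultaneous copies
BANDWIDTH_MB=10   # -bandwidth: MB/second per map

# Dry run: incremental copy (-update) that preserves user, group and
# permission (-pugp). Remove "echo" to launch the job.
echo hadoop distcp -update -pugp -m "$MAX_MAPS" -bandwidth "$BANDWIDTH_MB" \
  hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
```

Options go before the source and destination paths.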

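A FEW TIPS:

  1. To copy between two different major versions of Hadoop, you will usually use WebHdfsFileSystem, with URIs of the form webhdfs://<namenode_hostname>:<http_port>.
  2. What if DistCp runs out of memory?

Building the list of paths to copy can exhaust the client's memory when the source contains a very large number of files and directories. In that case, raise the client JVM heap size (-Xmx) through the HADOOP_CLIENT_OPTS environment variable before running DistCp.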
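A minimal sketch of that change. The heap sizes are illustrative values and /source and /target are placeholder paths. The echo prefix keeps the DistCp call itself a dry run.

```shell
# Raise the client JVM heap before launching DistCp.
# The sizes are examples; pick values that fit your client machine.
export HADOOP_CLIENT_OPTS="-Xms64m -Xmx1024m"

# Dry run: remove "echo" to start the copy with the larger heap.
echo hadoop distcp /source /target
```

The variable must be exported in the same shell session that launches DistCp, otherwise the client JVM does not see it.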
