
# Spark / Hadoop commands for pipelines


## Spark / Hadoop

Some HDFS commands useful for pipelines:

Make a directory:

```
sudo -u spark /data/hadoop/bin/hdfs dfs -mkdir -p /dwca-imports
```
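
To confirm the directory was created, a plain listing of the HDFS root works (a minimal check, using the same `hdfs` binary and `spark` user as above):

```
sudo -u spark /data/hadoop/bin/hdfs dfs -ls /
```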
Copy a file or directory to the local filesystem:

```
sudo -u spark /data/hadoop/bin/hdfs dfs -copyToLocal -f /pipelines-data/dr251/ /tmp
```
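
For the opposite direction, copying from the local filesystem into HDFS, `-copyFromLocal` works the same way (the archive name here is only illustrative):

```
sudo -u spark /data/hadoop/bin/hdfs dfs -copyFromLocal -f /tmp/dr251.zip /dwca-imports/
```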
Delete some dr:

```
sudo -u spark /data/hadoop/bin/hdfs dfs -rm -r /pipelines-data/dr*
```
or for a bigger clean:

```
sudo -u spark /data/hadoop/bin/hdfs dfs -rm -r /pipelines-data/dr* /pipelines-all-datasets/* /pipelines-clustering/* /pipelines-species/* /dwca-exports/* /pipelines-jackknife/*
```
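
Before and after removals like these, it can be useful to check how much space each dataset occupies and how much free space HDFS has overall (an extra check, not part of the original notes):

```
# per-dataset usage in human-readable units
sudo -u spark /data/hadoop/bin/hdfs dfs -du -h /pipelines-data

# overall HDFS capacity and remaining space
sudo -u spark /data/hadoop/bin/hdfs dfs -df -h /
```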

If you are trying to remove everything, perhaps:

  1. shut down the Hadoop cluster
  2. reformat the NameNode:

```
/data/hadoop/bin/hdfs namenode -format
```

(this was suggested by Dave in Slack).
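
For an overview of the cluster state (capacity, live DataNodes) before or after a wipe like this, the standard dfsadmin report can help (an extra suggestion, not from the original notes):

```
sudo -u spark /data/hadoop/bin/hdfs dfsadmin -report
```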

## Copy all duplicateKeys.csv files into /tmp

If a dr has duplicate keys it cannot be indexed, and you will see a log message like:

```
The dataset can not be indexed. See logs for more details: HAS_DUPLICATES
```

In this case a duplicateKeys.csv file is generated with details of the duplicate records. You can copy these files to the local filesystem with:

```
# list non-empty duplicateKeys.csv files in HDFS, extract the dr id from each path,
# and copy each file to /tmp as duplicateKeys-<dr>.csv
for i in $(sudo -u spark /data/hadoop/bin/hdfs dfs -ls -S /pipelines-data/dr*/1/validation/duplicateKeys.csv | grep -v "     0 " | cut -d "/" -f 3); do
  sudo -u spark /data/hadoop/bin/hdfs dfs -copyToLocal -f /pipelines-data/$i/1/validation/duplicateKeys.csv /tmp/duplicateKeys-$i.csv
done
```
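
To peek at one of these files directly in HDFS before copying it, `-cat` also works (dr251 is just the example dataset used earlier on this page):

```
sudo -u spark /data/hadoop/bin/hdfs dfs -cat /pipelines-data/dr251/1/validation/duplicateKeys.csv | head
```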