
Spark hadoop commands for pipelines


Spark / Hadoop

Some HDFS commands

Make a directory:
sudo -u spark /data/hadoop/bin/hdfs dfs -mkdir -p /dwca-imports
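To check the result, listing works the same way (a sketch using the same spark user and Hadoop path as above):
sudo -u spark /data/hadoop/bin/hdfs dfs -ls /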
Copy some file or directory to local:
sudo -u spark /data/hadoop/bin/hdfs dfs -copyToLocal -f /pipelines-data/dr251/ /tmp
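The opposite direction (pushing a local file or directory into HDFS) uses -copyFromLocal in the same way; for example, with a hypothetical local directory /tmp/dr251:
sudo -u spark /data/hadoop/bin/hdfs dfs -copyFromLocal -f /tmp/dr251 /pipelines-data/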
Delete some dr:
sudo -u spark /data/hadoop/bin/hdfs dfs -rm -r /pipelines-data/dr*
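Before deleting, -du -h shows how much space a given dataset takes up, e.g.:
sudo -u spark /data/hadoop/bin/hdfs dfs -du -h /pipelines-data/dr251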
or, for a bigger clean:
sudo -u spark /data/hadoop/bin/hdfs dfs -rm -r /pipelines-data/dr* /pipelines-all-datasets/* /pipelines-clustering/* /pipelines-species/* /dwca-exports/* /pipelines-jackknife/*
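If HDFS trash is enabled on the cluster, deleted data only frees space once the trash interval expires; adding -skipTrash (use with care) removes it immediately:
sudo -u spark /data/hadoop/bin/hdfs dfs -rm -r -skipTrash /pipelines-data/dr*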

If you are trying to remove everything, perhaps:

  1. shut down the Hadoop cluster
  2. format the NameNode:
hdfs namenode -format

(this was suggested by Dave in Slack).
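A rough sketch of the full sequence, assuming Hadoop lives under /data/hadoop and its standard sbin scripts are present (adjust paths and users for your cluster):

# stop HDFS before reformatting
sudo -u spark /data/hadoop/sbin/stop-dfs.sh
# reformat the NameNode (destroys all HDFS metadata)
sudo -u spark /data/hadoop/bin/hdfs namenode -format
# bring HDFS back up
sudo -u spark /data/hadoop/sbin/start-dfs.sh

Note that after a format the DataNodes still hold the old cluster ID in their data directories, so they may refuse to register until those directories are cleared as well.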

Copy all non-empty duplicateKeys.csv files into /tmp (the grep skips zero-byte files; cut extracts the dataset id from the path):
for i in `sudo -u spark /data/hadoop/bin/hdfs dfs -ls -S /pipelines-data/dr*/1/validation/duplicateKeys.csv | grep -v "     0 " | cut -d "/" -f 3` ; do sudo -u spark /data/hadoop/bin/hdfs dfs -copyToLocal -f /pipelines-data/$i/1/validation/duplicateKeys.csv /tmp/duplicateKeys-$i.csv; done
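Afterwards the copied files can be inspected locally, for example:
wc -l /tmp/duplicateKeys-*.csv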