
Spark and Hadoop commands for pipelines


Intro

Here we describe some HDFS and Spark commands that are useful when working with pipelines.

Make a directory

sudo -u spark /data/hadoop/bin/hdfs dfs -mkdir -p /dwca-imports
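
To check that the directory was created (or to list any HDFS path), the standard -ls subcommand can be used; the root path here is just an example:

sudo -u spark /data/hadoop/bin/hdfs dfs -ls /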

Copy a file or directory to the local filesystem

sudo -u spark /data/hadoop/bin/hdfs dfs -copyToLocal -f /pipelines-data/dr251/ /tmp
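
The opposite direction works with -copyFromLocal, for instance to upload an archive into the /dwca-imports directory created above (the file name dr251.zip is only illustrative):

sudo -u spark /data/hadoop/bin/hdfs dfs -copyFromLocal -f /tmp/dr251.zip /dwca-imports/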

Delete some dr

sudo -u spark /data/hadoop/bin/hdfs dfs -rm -r /pipelines-data/dr*

or do a bigger clean-up:

sudo -u spark /data/hadoop/bin/hdfs dfs -rm -r /pipelines-data/dr* /pipelines-all-datasets/* /pipelines-clustering/* /pipelines-species/* /dwca-exports/* /pipelines-jackknife/*
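
To see how much space each directory occupies before or after a clean-up, the -du subcommand prints a human-readable summary (a generic HDFS command, not pipelines-specific):

sudo -u spark /data/hadoop/bin/hdfs dfs -du -h /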

If you are trying to remove everything, perhaps:

  1. shut down the Hadoop cluster
  2. reformat the NameNode (this wipes all HDFS metadata):
sudo -u hdfs /data/hadoop/bin/hdfs namenode -format

(this was suggested by Dave in Slack).

Copy all duplicateKeys.csv files into /tmp

If a dr has duplicate keys, it cannot be indexed and you will see a log message like:

The dataset can not be indexed. See logs for more details: HAS_DUPLICATES

In this case a duplicateKeys.csv file is generated with the details of the duplicate records. You can copy the non-empty files to the local filesystem with:

for i in `sudo -u spark /data/hadoop/bin/hdfs dfs -ls -S /pipelines-data/dr*/1/validation/duplicateKeys.csv | grep -v "     0 " | cut -d "/" -f 3` ; do
  sudo -u spark /data/hadoop/bin/hdfs dfs -copyToLocal -f /pipelines-data/$i/1/validation/duplicateKeys.csv /tmp/duplicateKeys-$i.csv
done
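
To get a quick overview of the copied files, a line count shows roughly how many duplicate keys each dr has (plain shell, nothing HDFS-specific):

wc -l /tmp/duplicateKeys-*.csv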

Restart Spark & Hadoop

sudo -u spark /data/spark/sbin/stop-slaves.sh
sudo -u spark /data/spark/sbin/stop-master.sh
sudo -u spark rm -Rf /data/spark-tmp/*
sudo -u hdfs /data/hadoop/sbin/stop-dfs.sh
sudo -u hdfs /data/hadoop/sbin/start-dfs.sh
sudo -u spark /data/spark/sbin/start-master.sh
sudo -u spark /data/spark/sbin/start-slaves.sh 
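
After the restart it is worth checking that HDFS is healthy and that all datanodes have re-registered; dfsadmin -report prints that summary (a standard Hadoop command, using the same binary path as above):

sudo -u hdfs /data/hadoop/bin/hdfs dfsadmin -report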