Spark and Hadoop commands for pipelines
This page describes some HDFS and Spark commands that are useful when working with pipelines.
Create a directory in HDFS:
sudo -u spark /data/hadoop/bin/hdfs dfs -mkdir -p /dwca-imports
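To check that the directory exists, list the HDFS root:
sudo -u spark /data/hadoop/bin/hdfs dfs -ls /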
Copy a dataset (here dr251) from HDFS to the local filesystem:
sudo -u spark /data/hadoop/bin/hdfs dfs -copyToLocal -f /pipelines-data/dr251/ /tmp
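The reverse direction works the same way if you need to push a local copy back into HDFS (a sketch, assuming the command above left the data in /tmp/dr251):
sudo -u spark /data/hadoop/bin/hdfs dfs -copyFromLocal -f /tmp/dr251 /pipelines-data/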
Remove dataset directories from HDFS:
sudo -u spark /data/hadoop/bin/hdfs dfs -rm -r /pipelines-data/dr*
Or remove all the pipelines output at once:
sudo -u spark /data/hadoop/bin/hdfs dfs -rm -r /pipelines-data/dr* /pipelines-all-datasets/* /pipelines-clustering/* /pipelines-species/* /dwca-exports/* /pipelines-jackknife/*
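Before removing anything, it can be worth checking how much space each dataset uses; -du -h prints human-readable sizes:
sudo -u spark /data/hadoop/bin/hdfs dfs -du -h /pipelines-data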
If you are trying to remove everything, perhaps:
- shut down the Hadoop cluster
- format HDFS with:
sudo -u hdfs /data/hadoop/bin/hdfs namenode -format
(this was suggested by Dave in Slack; see the sketch below).
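A minimal sketch of that full sequence, reusing the install paths from elsewhere on this page (note that formatting the namenode erases all HDFS data):
# Stop Spark and HDFS first
sudo -u spark /data/spark/sbin/stop-slaves.sh
sudo -u spark /data/spark/sbin/stop-master.sh
sudo -u hdfs /data/hadoop/sbin/stop-dfs.sh
# Format HDFS (destroys all data stored in it)
sudo -u hdfs /data/hadoop/bin/hdfs namenode -format
# Start everything again and recreate the base directories
sudo -u hdfs /data/hadoop/sbin/start-dfs.sh
sudo -u spark /data/spark/sbin/start-master.sh
sudo -u spark /data/spark/sbin/start-slaves.sh
sudo -u spark /data/hadoop/bin/hdfs dfs -mkdir -p /dwca-imports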
If a dataset (dr) has duplicate keys, it cannot be indexed and you will see a log message like:
The dataset can not be indexed. See logs for more details: HAS_DUPLICATES
In this case a duplicateKeys.csv file is generated with details of the duplicate records. You can copy these files to the local filesystem with:
# For each non-empty duplicateKeys.csv (grep -v " 0 " skips zero-size files),
# extract the dr id (third path component) and copy the file to /tmp
for i in $(sudo -u spark /data/hadoop/bin/hdfs dfs -ls -S /pipelines-data/dr*/1/validation/duplicateKeys.csv | grep -v " 0 " | cut -d "/" -f 3); do
  sudo -u spark /data/hadoop/bin/hdfs dfs -copyToLocal -f /pipelines-data/$i/1/validation/duplicateKeys.csv /tmp/duplicateKeys-$i.csv
done
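Once copied, the files can be inspected locally, for example (dr251 is just an illustrative dataset id):
head /tmp/duplicateKeys-dr251.csv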
To restart Spark and HDFS (clearing Spark temporary files along the way):
sudo -u spark /data/spark/sbin/stop-slaves.sh
sudo -u spark /data/spark/sbin/stop-master.sh
sudo -u spark rm -Rf /data/spark-tmp/*
sudo -u hdfs /data/hadoop/sbin/stop-dfs.sh
sudo -u hdfs /data/hadoop/sbin/start-dfs.sh
sudo -u spark /data/spark/sbin/start-master.sh
sudo -u spark /data/spark/sbin/start-slaves.sh
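After the restart you can verify that HDFS is healthy and the daemons are up; hdfs dfsadmin -report shows datanode status, and jps lists the running Java processes:
sudo -u hdfs /data/hadoop/bin/hdfs dfsadmin -report
jps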