Spark hadoop commands for pipelines
Here we describe some HDFS and Spark commands useful for pipelines.
Create the HDFS directory used for DwC-A imports:
sudo -u spark /data/hadoop/bin/hdfs dfs -mkdir -p /dwca-imports
Copy a dataset (here dr251) from HDFS to the local filesystem:
sudo -u spark /data/hadoop/bin/hdfs dfs -copyToLocal -f /pipelines-data/dr251/ /tmp
Remove all datasets from /pipelines-data:
sudo -u spark /data/hadoop/bin/hdfs dfs -rm -r /pipelines-data/dr*
Remove all pipelines data (datasets, all-datasets, clustering, species, DwC-A exports and jackknife):
sudo -u spark /data/hadoop/bin/hdfs dfs -rm -r /pipelines-data/dr* /pipelines-all-datasets/* /pipelines-clustering/* /pipelines-species/* /dwca-exports/* /pipelines-jackknife/*
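Before removing anything in bulk you may want to check what is actually using space. A minimal sketch with standard hdfs dfs options (nothing specific to this setup):
sudo -u spark /data/hadoop/bin/hdfs dfs -du -s -h /pipelines-data /pipelines-all-datasets /pipelines-clustering /pipelines-species /dwca-exports /pipelines-jackknife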
If you are trying to remove everything, perhaps:
- shut down the Hadoop cluster
- format HDFS with:
hdfs namenode -format
(this was suggested by Dave in Slack).
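A minimal sketch of that sequence, reusing the start/stop scripts shown later on this page and assuming the namenode runs as the hdfs user (note that formatting the namenode wipes all HDFS data and metadata):
sudo -u hdfs /data/hadoop/sbin/stop-dfs.sh
sudo -u hdfs /data/hadoop/bin/hdfs namenode -format
sudo -u hdfs /data/hadoop/sbin/start-dfs.sh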
If some dr has duplicate keys, it cannot be indexed and you will see a log message like:
The dataset can not be indexed. See logs for more details: HAS_DUPLICATES
In this case a duplicateKeys.csv file is generated with details of the duplicate records. You can copy these files to the local filesystem with:
for i in `sudo -u spark /data/hadoop/bin/hdfs dfs -ls -S /pipelines-data/dr*/1/validation/duplicateKeys.csv | grep -v " 0 " | cut -d "/" -f 3` ; do sudo -u spark /data/hadoop/bin/hdfs dfs -copyToLocal -f /pipelines-data/$i/1/validation/duplicateKeys.csv /tmp/duplicateKeys-$i.csv; done
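Once copied, a quick way to glance at what is duplicated (plain shell, assuming the file names produced by the loop above):
for f in /tmp/duplicateKeys-dr*.csv; do echo "== $f =="; head -n 5 "$f"; done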
During the migration of uuids you can find occurrences of drs that no longer exist in your collectory. In this case you will get indexing errors for those missing drs with the message NOT_AVAILABLE.
In HDFS you only have those uuids under identifiers, so we'll delete them from biocache-store.
You have to install jq and avro-tools and follow these steps:
- Create a file with all these drs, let's call it /tmp/missing
- Copy the avro files of those drs:
for i in `cat /tmp/missing` ; do mkdir -p /tmp/missing-uuids/$i/; sudo -u spark /data/hadoop/bin/hdfs dfs -copyToLocal -f /pipelines-data/$i/1/identifiers/ala_uuid/* /tmp/missing-uuids/$i/; done
- Join all the uuids to delete into a single file:
for i in `ls /tmp/missing-uuids/dr*/*avro`; do avrocat $i | jq .uuid.string | sed 's/"//g' >> /tmp/del_uuids; done
- scp that /tmp/del_uuids file to your biocache-store server.
- Delete in biocache-store (after the sanity check sketched below) with:
biocache-store delete-records -f /tmp/del_uuids
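Before running the delete, a quick sanity check that the uuid list looks reasonable (a sketch, plain shell over the files created above):
wc -l /tmp/del_uuids
sort /tmp/del_uuids | uniq -d | head   # any duplicated uuids in the list?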
To restart Spark and HDFS, cleaning the Spark temporary directory in the process:
sudo -u spark /data/spark/sbin/stop-slaves.sh
sudo -u spark /data/spark/sbin/stop-master.sh
sudo -u spark rm -Rf /data/spark-tmp/*
sudo -u hdfs /data/hadoop/sbin/stop-dfs.sh
sudo -u hdfs /data/hadoop/sbin/start-dfs.sh
sudo -u spark /data/spark/sbin/start-master.sh
sudo -u spark /data/spark/sbin/start-slaves.sh
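After the restart you can check that HDFS is healthy again; a minimal sketch using standard hdfs commands (nothing specific to this setup):
sudo -u hdfs /data/hadoop/bin/hdfs dfsadmin -report | head -n 20
sudo -u spark /data/hadoop/bin/hdfs dfs -ls /pipelines-data | head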
#!/bin/bash
# Find and delete all 'ala_uuid_backup' in any sub-directory of '/pipelines-data/*/1/identifiers/'
sudo -u spark /data/hadoop/bin/hdfs dfs -ls /pipelines-data/*/1/identifiers/ | grep 'ala_uuid_backup' | awk '{print $8}' | while read -r file
do
if [[ -n "$file" ]]; then
sudo -u spark /data/hadoop/bin/hdfs dfs -rm -r "$file"
fi
done
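If you want to see first what would be removed, the listing part of the script works as a dry run:
sudo -u spark /data/hadoop/bin/hdfs dfs -ls /pipelines-data/*/1/identifiers/ | grep 'ala_uuid_backup' | awk '{print $8}'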