Pipelines extra steps
If you have a medium or big-sized LA portal, it is recommended to use the pipelines Jenkins service.
The inventories are generated with the default values of a typical production pipelines cluster, but you should adapt these values to your cluster size and hardware. This can be done with Ansible by customizing some variables. Typically, copy the values of the spark and pipelines sections of your inventories into your local-extra inventory, so they take precedence, and adapt them there.
For instance:
[spark:vars]
# http://spark-configuration.luminousmen.com/
# SPARK-DEFAULTS.CONF
# but with underscores instead of dots
spark_default_parallelism = 144
spark_executor_memory = 22G
spark_executor_instances = 17
spark_driver_cores = 6
spark_executor_cores = 8
spark_driver_memory = 6G
spark_driver_maxResultSize = 8G
spark_driver_memoryOverhead = 819
spark_executor_memoryOverhead = 819
spark_dynamicAllocation_enabled = false
spark_sql_adaptive_enabled = true
# Recommended configuration:
spark_memory_fraction = 0.8
spark_scheduler_barrier_maxConcurrentTasksCheck_maxFailures = 5
spark_rdd_compress = true
spark_shuffle_compress = true
spark_shuffle_spill_compress = true
spark_serializer = org.apache.spark.serializer.KryoSerializer
spark_executor_extraJavaOptions = -XX:+UseG1GC -XX:+G1SummarizeConcMark
spark_driver_extraJavaOptions = -XX:+UseG1GC -XX:+G1SummarizeConcMark
interpret_spark_parallelism = {{ spark_default_parallelism }}
interpret_spark_num_executors = {{ spark_executor_instances }}
interpret_spark_executor_cores = {{ spark_executor_cores }}
interpret_spark_executor_memory = '{{ spark_executor_memory }}'
interpret_spark_driver_memory = '{{ spark_driver_memory }}'
image_sync_spark_parallelism = {{ spark_default_parallelism }}
(...)
sensitive_spark_executor_memory = '{{ spark_executor_memory }}'
sensitive_spark_driver_memory = '{{ spark_driver_memory }}'
index_spark_parallelism = 500
index_spark_num_executors = {{ spark_executor_instances }}
index_spark_executor_cores = {{ spark_executor_cores }}
index_spark_executor_memory = '{{ spark_executor_memory }}'
index_spark_driver_memory = '{{ spark_driver_memory }}'
jackknife_spark_parallelism = 500
jackknife_spark_num_executors = {{ spark_executor_instances }}
jackknife_spark_executor_cores = {{ spark_executor_cores }}
jackknife_spark_executor_memory = '{{ spark_executor_memory }}'
jackknife_spark_driver_memory = '{{ spark_driver_memory }}'
clustering_spark_parallelism = 500
clustering_spark_executor_memory = '{{ spark_executor_memory }}'
clustering_spark_driver_memory = '{{ spark_driver_memory }}'
solr_spark_parallelism = 500
solr_spark_num_executors = 6
solr_spark_executor_cores = 8
solr_spark_executor_memory = '20G'
solr_spark_driver_memory = '6G'
# ALA uses 24 cores and 28G; tune this in your local inventories to adjust it
spark_env_extras = { 'SPARK_WORKER_CORES': 24, 'SPARK_WORKER_MEMORY': '24g', 'SPARK_LOCAL_DIRS': '/data/spark-tmp', 'SPARK_WORKER_OPTS': '-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=300 -Dspark.worker.cleanup.appDataTtl=300 -Dspark.network.timeout=180000' }
[pipelines:vars]
(...)
pipelines_jvm_def_options = -Xmx16g -XX:+UseG1GC
(...)
Common problems if you don't do this part correctly:
- If a job is configured with more resources than your workers have, the job never starts and stays in "WAITING".
- If you configure Spark with too much memory, the jobs are killed by the kernel with OOM errors. You have to lower the memory used by Spark for that job or similar jobs.
If you are using Cassandra in your LA portal and you want to migrate to pipelines, you'll want to preserve the occurrence UUIDs. There is a Jenkins job for that, but you will need several things to run it. In your production Cassandra:
- Create this directory:
# mkdir /data/uuid-exports/
# chown someuser:someuser /data/uuid-exports/ # optionally
- Copy this script to /data/uuid-exports/uuid-export.sh and give it execution permissions:
# chmod +x /data/uuid-exports/uuid-export.sh
- Be sure that you can connect via passwordless SSH from your pipelines Jenkins to your Cassandra host, from the spark user to someuser in Cassandra, who can run the previous script (see the sketch after this list).
- Adapt the migration-uuid job to fit your infrastructure and users.
- Copy the migration jar to your pipelines master. Currently this is a WIP. See this issue.
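As a quick check, something like the following (run as the spark user on the pipelines Jenkins host) should work. The hostname cassandra-host, the user someuser and the exported file pattern are placeholders for your own infrastructure:

```bash
# Sketch (run as the spark user on the pipelines Jenkins/master host).
# cassandra-host, someuser and the exported file pattern are placeholders.
ssh -o BatchMode=yes someuser@cassandra-host 'echo passwordless SSH OK'
# Run the export script on the Cassandra host:
ssh someuser@cassandra-host '/data/uuid-exports/uuid-export.sh'
# Copy the export back (adjust the pattern to whatever the script produces):
scp 'someuser@cassandra-host:/data/uuid-exports/*' /data/uuid-exports/
```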
Create a tar archive with your layers and put it on a public website (a sketch for generating the archive and its checksum follows the example below). Configure your inventories to use them:
[pipelines:vars]
(...)
geocode_state_province_layer = provincias
geocode_state_province_field = nameunit
pipelines_shapefiles_extra_url = https://datos.gbif.es/others/layers/extra-layers.tgz
pipelines_shapefiles_extra_checksum = sha1:1f04efd7b03690f9bc5c27e42315d3be69cadbb1
sds_layers_with_bitmaps_url = https://datos.gbif.es/others/layers/extra-layers.tgz
sds_layers_with_bitmaps_checksum = sha1:1f04efd7b03690f9bc5c27e42315d3be69cadbb1
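A rough sketch of how to produce that archive and the sha1 checksum used above; the layer file names are placeholders, and every component of each shapefile (.shp, .shx, .dbf, .prj, plus any bitmap) should be included:

```bash
# Sketch: package the extra layers and compute the checksum for the inventory.
# File names below are placeholders; include all the files of each layer.
cd /data/layers
tar -czf extra-layers.tgz provincias.shp provincias.shx provincias.dbf provincias.prj
# Use this value in pipelines_shapefiles_extra_checksum, prefixed with "sha1:":
sha1sum extra-layers.tgz
# Finally, upload extra-layers.tgz to the public URL set in pipelines_shapefiles_extra_url.
```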
We use bitmaps of layers to decrease the number of WS calls. See this PR and this script for how to generate them.
The same applies to the SDS layers.
The stateProvince centre points file (see below) is a CSV (tab-delimited) with the name of your states or provinces, the centroid latitude, the longitude, and 4 more values with the bounding box of the province/state.
A typical stateProvince centroids file and its bbox (centroid marker in blue, SW corner in black, NE corner in red). [screenshot omitted]
FIXME: add a snippet to calculate the centroid and the bbox using a table of provinces/states and their geojson.
[pipelines:vars]
(...)
state_province_centre_points_file = {{inventory_dir}}/files/stateProvinceCentrePoints.txt
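As a rough answer to the FIXME above, and assuming GDAL/OGR built with SQLite/Spatialite support plus a hypothetical provincias.geojson layer with a nameunit attribute, something like this can generate the centroids and bounding boxes (check the layer and geometry column names with ogrinfo, and adapt the column order to what your portal expects):

```bash
# Sketch: derive per-province centroid and bounding box from a GeoJSON layer.
# provincias.geojson, the layer name and the nameunit attribute are assumptions;
# check them first with: ogrinfo provincias.geojson
ogr2ogr -f CSV -dialect SQLite \
  -sql "SELECT nameunit,
               ST_Y(ST_Centroid(geometry)) AS lat,
               ST_X(ST_Centroid(geometry)) AS lon,
               MbrMinY(geometry) AS sw_lat, MbrMinX(geometry) AS sw_lon,
               MbrMaxY(geometry) AS ne_lat, MbrMaxX(geometry) AS ne_lon
        FROM provincias" \
  -lco SEPARATOR=TAB \
  /vsistdout/ provincias.geojson > stateProvinceCentrePoints.txt
# Remove the header row afterwards if the pipelines file should not contain one.
```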
You can install the la-pipelines package on your computer: https://github.com/AtlasOfLivingAustralia/documentation/wiki/Testing-Debian-Packaging#testing-la-pipelines-package. If you want to play with it, it's easy. You'll also need a nameindex service and SOLR if you want to do the whole process: https://github.com/gbif/pipelines/tree/dev/livingatlas#running-la-pipelines (see the sketch after the sample configuration below).
An example /data/la-pipelines/config/la-pipelines-local.yaml:
collectory:
  wsUrl: https://collections.l-a.site/ws/
  httpHeaders:
    Authorization: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
imageService:
  wsUrl: https://images.l-a.site
  httpHeaders:
    apiKey: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
speciesListService:
  wsUrl: https://lists.l-a.site
sampling:
  baseUrl: https://spatial.l-a.site/ws/
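With that configuration in place, a local run for a single dataset looks roughly like the sketch below. The dataset id dr893 is just an example, and the subcommand/option names follow the livingatlas README linked above; confirm them for your version with la-pipelines --help:

```bash
# Sketch of a local run for one dataset (dr893 is an example id).
# Verify the exact steps and options for your version with: la-pipelines --help
la-pipelines dwca-avro dr893             # convert the DwCA to verbatim Avro
la-pipelines interpret dr893 --embedded  # interpretation with embedded Spark
la-pipelines uuid dr893 --embedded       # assign stable occurrence UUIDs
la-pipelines index dr893 --embedded      # build the index records
la-pipelines solr dr893 --embedded       # load them into SOLR
```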