This Flask application provides a basic interface between WebDAV on one side and an arbitrary filesystem interface on the other.
The WebDAV implementation is compliant with RFC 4918, WebDAV compliance class 1. It implements Apache's ModDAV strategy for partial file updates, using Content-Range headers as defined in RFC 9110.
The operations which are currently supported on the WebDAV side are:
- file: read, write, (partial) update, delete
- directory: listings, creation, deletion
- lock: (not supported at the moment)
The Python filesystem uses the regular built-in filesystem operations provided by Python itself. It supports file reads, writes, (partial) updates and deletes, as well as directory listing, creation and deletion. The implementation relies mainly on calls from the pathlib library.
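As an illustration of how a ModDAV-style partial update could be applied to a local file, here is a minimal sketch. The function names `parse_content_range` and `apply_partial_update` are hypothetical and not taken from this application; only the `bytes <first>-<last>/<total>` header form from RFC 9110 is assumed, and the target file is assumed to exist already.

```python
import re
from pathlib import Path

# Matches "bytes <first>-<last>/<total or *>" (Content-Range form from RFC 9110).
_CONTENT_RANGE = re.compile(r"^bytes (\d+)-(\d+)/(\d+|\*)$")


def parse_content_range(header: str) -> tuple[int, int]:
    """Return the inclusive (first, last) byte positions of a Content-Range value."""
    match = _CONTENT_RANGE.match(header.strip())
    if match is None:
        raise ValueError(f"unsupported Content-Range: {header!r}")
    first, last = int(match.group(1)), int(match.group(2))
    if last < first:
        raise ValueError(f"invalid byte range: {header!r}")
    return first, last


def apply_partial_update(path: Path, header: str, body: bytes) -> None:
    """Overwrite the byte range named by the header with the request body."""
    first, last = parse_content_range(header)
    if len(body) != last - first + 1:
        raise ValueError("body length does not match the declared range")
    # "r+b" keeps the existing content; only the addressed range is rewritten.
    with path.open("r+b") as handle:
        handle.seek(first)
        handle.write(body)
```

A request body of `b"WORLD"` sent with `Content-Range: bytes 6-10/11` would replace the last five bytes of an 11-byte file and leave the rest untouched.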
Invenio data repository software is a repository platform for storing experiment data.
It is supported for file reads, writes, (partial) updates, deletes, directory listing, creation and deletion.
Invenio itself supports only complete reads and writes, because it stores data in object-based Amazon S3 storage.
To work around this limitation, two additional data abstraction layers are used to provide WebDAV partial updates:
- REST data layer for data retrieval and upload
- Caching layer for temporary data local storage
Note
There will always be a tradeoff between space usage on the local device and data transmission to the data repository. E.g. if the machine does not have enough space for storing temporary data, data must be transferred over the network more frequently to save local space.
The size of the local data storage (the upper limit on occupied space) and the time to live of cached data can be changed in the config to tune the application to its environment.
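The interplay of a size limit and a time to live can be sketched as follows. This is not the application's caching layer: the class `BoundedTTLCache`, its parameters and the oldest-first eviction order are assumptions made purely for illustration, and the data is kept in memory rather than on disk.

```python
import time
from collections import OrderedDict


class BoundedTTLCache:
    """Hypothetical cache limited by total size in bytes and by entry age."""

    def __init__(self, max_bytes, ttl_seconds, clock=time.monotonic):
        self.max_bytes = max_bytes
        self.ttl_seconds = ttl_seconds
        self._clock = clock            # injectable so the behaviour can be tested
        self._entries = OrderedDict()  # key -> (stored_at, data), oldest first
        self._used = 0

    def _remove(self, key):
        _, data = self._entries.pop(key)
        self._used -= len(data)

    def _evict_expired(self):
        now = self._clock()
        expired = [key for key, (stored_at, _) in self._entries.items()
                   if now - stored_at >= self.ttl_seconds]
        for key in expired:
            self._remove(key)

    def put(self, key, data):
        if len(data) > self.max_bytes:
            raise ValueError("entry is larger than the whole cache")
        if key in self._entries:
            self._remove(key)
        self._evict_expired()
        # Free space by dropping the oldest entries until the new one fits.
        while self._used + len(data) > self.max_bytes:
            self._remove(next(iter(self._entries)))
        self._entries[key] = (self._clock(), data)
        self._used += len(data)

    def get(self, key):
        self._evict_expired()
        entry = self._entries.get(key)
        return None if entry is None else entry[1]
```

A smaller `max_bytes` or `ttl_seconds` saves local space, but a missing entry then has to be fetched from the repository again, which is the tradeoff described in the note above.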
There are, in general, two methods to install / deploy the application.
To install the service using this method, you must have Docker installed.
First, clone the repository from GitHub and build the Docker image yourself:
# clone the repository from GitHub
git clone <REPO URL>
# change directory to the cloned folder
cd flask-webdav
# build the image, Docker build tag will be "flask-webdav"
docker build . -t flask-webdav
These steps will create a new Docker image with the tag flask-webdav.
Warning
The Docker image build compiles its own Python interpreter.
Be aware that this uses a considerable amount of data (around 1 GB) and takes some time (around 15 minutes) to build!
Create a Docker Compose file docker-compose.yaml and use the image flask-webdav.
services:
  flask-webdav:
    image: flask-webdav
    restart: unless-stopped
    environment:
      FLASK_RUN_PORT: 8001
    ports:
      - "8001:8000"
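With the mapping above, port 8000 of the container is published on port 8001 of the host (Docker port mappings have the form "host:container"). To expose the service on a different host port, the line - "8001:8000"
must be changed to
- "<desired port>:8000"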
Run the service with
docker compose up
To install the service using this method, you must have Python version 3.11 or later installed.
First, clone the repository from GitHub and create a Python virtual environment:
# clone the repository from GitHub
git clone <REPO URL>
# change directory to the cloned folder
cd flask-webdav
# create a virtual environment in the folder "venv"
python -m venv venv
# use the newly created Python virtual environment
source ./venv/bin/activate
# install the requirements
python -m pip install -r requirements.txt
Then, execute the flask command, which will run app.py:
# change directory to src
cd src
# run the application
flask --app app run
The service runs on port 8001. To change this, edit the following call in app.py:
app.run(host="0.0.0.0", port=<desired_port>, debug=False)
Warning
Be aware that this method does not use uWSGI or any other WSGI server. This is not recommended, as it can be a potential security risk!
To configure the service, there are three (two) main files in which you can change the variables to the desired values.
This file configures the app itself. Each variable has its own description, which helps to understand its purpose.
This file configures the uWSGI server (only applicable to the Docker installation). Configuring this server is recommended for experienced users only.
A regular user may be interested in two of the variables:
# number of processes on which the service runs (not to be confused with threads)
processes = 4
# Unix domain socket
socket = /tmp/uwsgi.sock
The Unix socket can be replaced by the http directive, which serves the application on a specific port instead of the Unix socket.
# the local address and port of the service
http = 127.0.0.1:8001
The abstract class AbstractFileSystem is an interface intended for extension, so that access to different filesystems can be implemented.
It offers various filesystem-like calls for this communication; the most commonly used ones (not an exhaustive list) are open(), close(), read(), write(), seek(), etc.
The application was created with further use in mind, and this list is open to extension.
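To give a feel for what such an extension might look like, here is a hedged sketch. The real signatures of AbstractFileSystem are not shown in this document, so the base class below is a stand-in named `AbstractFileSystemSketch`, and both its method signatures and the `InMemoryFileSystem` subclass are assumptions for illustration only.

```python
import io
from abc import ABC, abstractmethod


class AbstractFileSystemSketch(ABC):
    """Stand-in for AbstractFileSystem; the method signatures are assumed."""

    @abstractmethod
    def open(self, path): ...

    @abstractmethod
    def close(self): ...

    @abstractmethod
    def read(self, size=-1): ...

    @abstractmethod
    def write(self, data): ...

    @abstractmethod
    def seek(self, offset): ...


class InMemoryFileSystem(AbstractFileSystemSketch):
    """Toy backend that keeps file contents in a dictionary."""

    def __init__(self):
        self._files = {}     # path -> bytes
        self._path = None
        self._buffer = None

    def open(self, path):
        self._path = path
        # Start from the stored content so that partial writes keep the rest.
        self._buffer = io.BytesIO(self._files.get(path, b""))

    def close(self):
        self._files[self._path] = self._buffer.getvalue()
        self._path = self._buffer = None

    def read(self, size=-1):
        return self._buffer.read(size)

    def write(self, data):
        return self._buffer.write(data)

    def seek(self, offset):
        return self._buffer.seek(offset)
```

A backend for another storage system would follow the same shape: subclass the abstract class and map each call to the corresponding operation of that system.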
Onedata is a data management system designed for heavy computations and big datasets. It supports a wide variety of data operations, which makes it suitable for larger research projects.
To use this application in a Onedata environment, Onedata must first be set up correctly. The setup may vary between Onedata versions (this setup works for version 21.02). Onedata will use this connector as a Storage backend on Oneprovider. The Storage backend can be created by opening the Onezone interface, navigating to Clusters, selecting a Oneprovider, navigating to Storage backends, and choosing Add storage backend (Onezone -> Clusters -> Storage backends -> Add storage backend). Configuration example:
Key | Value | Explanation | Exemplary values |
---|---|---|---|
Type | WebDAV | WebDAV interface is used for communication | WebDAV |
Name | user defined | Name of the Storage backend defined by user, can be anything | webdav-backend |
Endpoint | user defined | URL with scheme (http/https), port and path to this connector application | https://webdav.sbo.sk:8000/invenio |
Credentials | user defined | TBD | TBD |
Range write support | ModDAV | Apache's ModDAV is used for partial reads and writes to the storage | ModDAV |
Connection pool size | 1 | Maximum number of parallel connections | 1 |
Timeout [ms] | 1 200 000 | Number of milliseconds Onedata is willing to wait for a response. | 1 200 000 |
After adding a new Storage backend by pressing Add, it can be assigned to a space in the usual way (Space support).
For optimal use, a few restrictions need to be taken into account.
- Connection pool size: Currently, at most one parallel connection is supported when using the connector, because the connector does not support threading.
- Timeout: The best tradeoff was found to be 1 200 seconds (20 minutes). The longest response times in communication with e.g. Invenio occur on file retrieval, especially when a partial read is requested at the end of the file. This is because the whole file must be downloaded before the data can be provided. If the file is large enough, the response may take a long time.
Note
The time needed to download a whole file from a network location can be estimated numerically.
The formula is t = ( s * 8 ) / b, where t is the time in seconds, s is the size of the file in bytes and b is the bandwidth of the network connection (download speed) in bits per second.
This formula gives a lower bound for the time t.
In reality, it will take longer because the real throughput of the network is always lower than the bandwidth b.
E.g. for a 60 GiB file on a 1 Gbit/s connection, t = (60 * 1024 * 1024 * 1024 * 8) / 1 000 000 000, so t ≈ 515 seconds (~9 minutes) to download the whole file.
On 100 Mbit/s it would be about 5154 seconds (~86 minutes or ~1.5 hours), which would require a longer timeout or a different domain (smaller sizes of individual files).
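The estimate above can be reproduced with a few lines of Python. The function name `min_download_seconds` is a hypothetical helper, not part of this application, and the result is only the theoretical lower bound given by the formula.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def min_download_seconds(size_bytes, bandwidth_bits_per_second):
    """Lower bound t = (s * 8) / b for downloading size_bytes at the given bandwidth."""
    if bandwidth_bits_per_second <= 0:
        raise ValueError("bandwidth must be positive")
    return size_bytes * 8 / bandwidth_bits_per_second


# 60 GiB over 1 Gbit/s and over 100 Mbit/s
print(round(min_download_seconds(60 * GIB, 1_000_000_000)))  # 515
print(round(min_download_seconds(60 * GIB, 100_000_000)))    # 5154
```

Comparing the result with the configured timeout gives a quick indication of whether files of a given size can be served over the available connection at all.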