Issues with List API reliability #156

@storema3

Description

The POST /api/v1/list API is the main method for getting message data out of a Kiwi News node. Having worked with it over a longer time span, I have run into more and more issues that make it hard to work with.

The API allows retrieving messages page by page, by specifying a start index and the number of messages to return. To retrieve all messages of a node, the method must be called repeatedly.
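
For reference, the download loop looks roughly like the following sketch. Only the `from` parameter is named in this issue; the `amount` parameter name, the response shape, and the node URL are assumptions.

```ts
// Minimal pagination sketch (Node 18+, global fetch). Only `from` is named
// in this issue; `amount`, the response shape, and the URL are assumptions.
const BASE_URL = "http://localhost:8000"; // hypothetical node address
const PAGE_SIZE = 100;

async function listPage(from: number, amount: number): Promise<unknown[]> {
  const res = await fetch(`${BASE_URL}/api/v1/list`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ from, amount }),
  });
  if (!res.ok) throw new Error(`list failed: HTTP ${res.status}`);
  const body = await res.json();
  return Array.isArray(body.data) ? body.data : []; // assumed response shape
}

async function downloadAll(): Promise<unknown[]> {
  const messages: unknown[] = [];
  for (;;) {
    const page = await listPage(messages.length, PAGE_SIZE);
    if (page.length === 0) break; // stop when no more data is returned
    messages.push(...page);
    // Pause between invocations so small nodes (4 GB CAX11) don't hang.
    await new Promise((r) => setTimeout(r, 15_000));
  }
  return messages;
}
```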

Known limitations of the API

The API walks the trie and is expensive in terms of memory and computation. On smaller nodes (4 GB RAM, see the Hetzner CAX11 for a reference) it is necessary to pause 15–30 seconds between invocations of the API; otherwise the node might hang or crash.

The API also seems to be highly dependent on the state and load of the system. Sometimes clients get responses without any data, although the next invocation returns data. On other occasions, one client repeatedly gets HTTP 500 responses about missing leaf nodes, while another client happily retrieves the very data the first one asked for.

The order of the retrieved data items is not always guaranteed. After node crashes or re-initialization of a node, a complete re-download of all messages is therefore often necessary.

While the re-download was no problem when there were only a few thousand messages, as message counts grow this is becoming a problem for systems relying on the data.

ETL process to retrieve message data and missing data

The ETL process to retrieve data consists of two steps:

  1. Get all current messages by downloading them with repeated invocations of the list API, until no more data is returned.
  2. Periodically, check for new messages. The ETL process knows how many messages it has retrieved, and uses this as the from (start index) parameter.

Step (1) normally works; occasional server errors can be corrected by repeating the API calls, as in the sketch below. On a CAX11 VM this process can take an hour.
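
A sketch of that retry behavior, wrapping the hypothetical `listPage` helper from above (the attempt count and backoff are arbitrary choices, not part of the API):

```ts
// Retry a single page on transient errors, e.g. the HTTP 500 "missing leaf
// node" responses described above. Reuses listPage from the earlier sketch.
async function listPageWithRetry(
  from: number,
  amount: number,
  maxAttempts = 5,
): Promise<unknown[]> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await listPage(from, amount);
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
      // Back off before retrying; the node often recovers after a pause.
      await new Promise((r) => setTimeout(r, 30_000));
    }
  }
}
```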

However, it has happened that the number of messages retrieved after a complete re-initialization of a node was lower than its previous message count. A system that had previously exported 25112 messages from a node got only 25048 after the node's data had been deleted and re-synced.

Step (2) is problematic. It gets all new messages from the node, but over time it also gets duplicate messages that should not be there:

[Figure: kn-import-duplicate]

The figure shows the number of new messages the ETL process exports per call (green line). The yellow line is the number of duplicates the process received. Here, at 10:00, the ETL process received a message with an already existing message index, although it should have gotten only new items! Is there new data somewhere that changed the sequence?

These duplicates never go away, and their number increases over time (1–4 messages per occasion). After 37 days of operation, there were 10 duplicates.

This behavior affects the download of amplify and comment messages.
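
Since the duplicates arrive with an already existing message index, an importer can at least detect and count them by keying on that index. A sketch under the same assumptions as above (the `index` field name is hypothetical):

```ts
// Step (2) with duplicate detection. The issue states messages arrive with
// an already existing message index; the `index` field name is an assumption.
interface ExportedMessage {
  index: string;
}

const seen = new Set<string>();
let duplicateCount = 0;

async function importNewMessages(): Promise<void> {
  // `from` = number of messages already retrieved, as the ETL process does.
  const page = await listPageWithRetry(seen.size, PAGE_SIZE);
  for (const msg of page as ExportedMessage[]) {
    if (seen.has(msg.index)) {
      duplicateCount++; // should never happen if `from` only yields new items
      continue;
    }
    seen.add(msg.index);
    // ...persist msg to the downstream store...
  }
}
```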
