
Connection parameters for mTLS-enabled Milvus are missing; data mismatch between Spark DataFrame and Milvus #28

@rohitreddy1698

Description


Hello,

I am trying to connect to a Milvus instance that has mTLS enabled and is hosted on Kubernetes.
When trying to connect, we noticed that the configuration options for providing client certificates are missing.

So we extended the code by adding parameters for the client and CA certificates in MilvusOptions.scala, and by applying them when the connection is initiated in MilvusConnection.scala:

```scala
val secure: Boolean = config.getBoolean(MILVUS_SECURE, false)
val caCert: String = config.getOrDefault(MILVUS_CA_CERT, "")
val clientKey: String = config.getOrDefault(MILVUS_CLIENT_KEY, "")
val clientCert: String = config.getOrDefault(MILVUS_CLIENT_CERT, "")
```
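
The corresponding change in MilvusConnection.scala roughly passes these options through to the Java SDK's ConnectParam builder. The sketch below reflects how I wired it up locally (the field names on MilvusOptions such as host, port, secure, caCert come from my change, not upstream), so treat it as illustrative rather than final:

```scala
import io.milvus.client.MilvusServiceClient
import io.milvus.param.ConnectParam

// Sketch: build the client with the new TLS options when milvus.secure is set.
def acquire(options: MilvusOptions): MilvusServiceClient = {
  val builder = ConnectParam.newBuilder()
    .withHost(options.host)
    .withPort(options.port)

  if (options.secure) {
    builder
      .withSecure(true)
      .withCaPemPath(options.caCert)         // CA certificate (PEM)
      .withClientKeyPath(options.clientKey)  // client private key (PEM)
      .withClientPemPath(options.clientCert) // client certificate (PEM)
  }

  new MilvusServiceClient(builder.build())
}
```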

With these new configuration options the certificates were picked up correctly and the mTLS connection to the Milvus instance succeeded.

But once past this, I ran into another issue: a data mismatch.
For functional testing I ran the quickstart example, quickstart.py, where sample data from a Spark DataFrame is loaded into Milvus.

**Spark DataFrame**

```python
data = [(1, "a", [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]),
        (2, "b", [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]),
        (3, "c", [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]),
        (4, "d", [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])]
```


**Data in the Milvus collection after ingestion**

```python
[(1, "a", [0.0, 1.875, 0.0, 2.0, 0.0, 2.125, 0.0, 2.25]),
 (2, "b", [0.0, 1.875, 0.0, 2.0, 0.0, 2.125, 0.0, 2.25]),
 (3, "c", [0.0, 1.875, 0.0, 2.0, 0.0, 2.125, 0.0, 2.25]),
 (4, "d", [0.0, 1.875, 0.0, 2.0, 0.0, 2.125, 0.0, 2.25])]
```

On deeper exploration I found that this happens because of a data-type mismatch between Spark and Milvus.
It arises when the Spark DataFrame field marked as the vector field contains an array of Double values instead of Float values; the conversion that happens in that case corrupts the values rather than casting them numerically.
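
The specific corrupted values are consistent with the 8-byte doubles being copied byte-for-byte and read back as pairs of little-endian 4-byte floats, rather than being converted numerically. A quick check (my own reasoning, not taken from the connector code) reproduces exactly the numbers seen in the collection:

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Reinterpret the raw little-endian bytes of each double as two floats.
val reinterpreted = Seq(1.0, 2.0, 3.0, 4.0).flatMap { d =>
  val buf = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN)
  buf.putDouble(d)
  buf.flip()
  Seq(buf.getFloat(), buf.getFloat())
}

// Prints: 0.0, 1.875, 0.0, 2.0, 0.0, 2.125, 0.0, 2.25
println(reinterpreted.mkString(", "))
```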

This is happening because in MilvusCollection.scala arrays of both Double and Float are mapped to a Float Vector in Milvus, since Milvus does not have a Double vector type of its own.
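
The mapping I am referring to is roughly of the following shape (paraphrased, not a verbatim copy of MilvusCollection.scala):

```scala
import org.apache.spark.sql.types.{ArrayType, DataType => SparkDataType, DoubleType, FloatType}
import io.milvus.grpc.{DataType => MilvusDataType}

// Both float and double arrays end up as a Milvus FloatVector,
// since Milvus has no double-precision vector type.
def toMilvusVectorType(dt: SparkDataType): MilvusDataType = dt match {
  case ArrayType(FloatType, _)  => MilvusDataType.FloatVector
  case ArrayType(DoubleType, _) => MilvusDataType.FloatVector
  case other => throw new IllegalArgumentException(s"Unsupported vector field type: $other")
}
```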

To work around this, I created a UDF that is applied when the Spark DataFrame column marked as the vector field contains values of type Double; it explicitly casts the column from an array of Double to an array of Float. With this in place, ingestion into Milvus succeeds.
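
The workaround looks roughly like this (vec is a placeholder column name; a plain col(vectorField).cast("array<float>") would probably also work, but the UDF route is what I tested):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.{ArrayType, DoubleType}

// Explicitly convert a double-array vector column to a float array
// before handing the DataFrame to the connector.
val toFloatArray = udf((xs: Seq[Double]) => xs.map(_.toFloat))

def castVectorField(df: DataFrame, vectorField: String): DataFrame =
  df.schema(vectorField).dataType match {
    case ArrayType(DoubleType, _) => df.withColumn(vectorField, toFloatArray(col(vectorField)))
    case _                        => df
  }

val fixedDf = castVectorField(df, "vec")
```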
After this change, the data in Milvus:

    (1, "a", [1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0]),
    (2, "b", [1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0]),
    (3, "c", [1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0]),
    (4, "d", [1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0])

Can you please help me understand if my approach is right?

Thank you
