[Feature] Integration test case with Data Science Pipeline, CodeFlare and KubeRay #425

Open
@yuanchi2807

Name of Feature or Improvement

Create an integration test case that validates Data Science Pipeline (DSP), CodeFlare, and KubeRay working together.

Describe the Solution You Would Like to See

Test environment assumptions:

  1. Data Science Pipeline v1.
  2. The Ray cluster shall consist of no more than 2 worker pods, each with 2 CPU cores and less than 6 GB of memory available.
  3. Total execution time of an integration test shall be less than 20 minutes.
  4. S3 storage may be available, if needed.
  5. Free of proprietary intellectual property.
  6. Public data only.
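
The resource limits above could be checked before a cluster is requested. The following is a minimal sketch, not the project's actual tooling: `WorkerSpec`, `check_limits`, and `ray_cluster_manifest` are hypothetical names, the image string is a placeholder, the CPU figure is treated as an upper bound, and the `ray.io/v1` API version is an assumption that depends on the installed KubeRay release (older releases use `ray.io/v1alpha1`).

```python
from dataclasses import dataclass

# Limits taken from the test environment assumptions above.
MAX_WORKERS = 2
MAX_CPUS_PER_POD = 2
MEMORY_CEILING_GIB = 6  # pods must stay strictly below this value


@dataclass(frozen=True)
class WorkerSpec:
    """Hypothetical description of the Ray worker group."""
    replicas: int
    cpus: int
    memory_gib: int


def check_limits(spec: WorkerSpec) -> list[str]:
    """Return a list of violations; an empty list means the spec fits."""
    problems = []
    if not 1 <= spec.replicas <= MAX_WORKERS:
        problems.append(f"replicas={spec.replicas} exceeds {MAX_WORKERS} worker pods")
    if spec.cpus > MAX_CPUS_PER_POD:
        problems.append(f"cpus={spec.cpus} exceeds {MAX_CPUS_PER_POD} cores per pod")
    if spec.memory_gib >= MEMORY_CEILING_GIB:
        problems.append(f"memory={spec.memory_gib}Gi is not below {MEMORY_CEILING_GIB}Gi")
    return problems


def ray_cluster_manifest(name: str, spec: WorkerSpec, image: str) -> dict:
    """Build a RayCluster manifest as a dict, refusing specs that break the limits."""
    problems = check_limits(spec)
    if problems:
        raise ValueError("; ".join(problems))
    resources = {"cpu": str(spec.cpus), "memory": f"{spec.memory_gib}Gi"}
    return {
        "apiVersion": "ray.io/v1",  # assumption: depends on the KubeRay version
        "kind": "RayCluster",
        "metadata": {"name": name},
        "spec": {
            "headGroupSpec": {
                # Advertise zero CPUs on the head so actors land on worker pods.
                "rayStartParams": {"num-cpus": "0"},
                "template": {"spec": {"containers": [
                    {"name": "ray-head", "image": image},
                ]}},
            },
            "workerGroupSpecs": [{
                "groupName": "workers",
                "replicas": spec.replicas,
                "minReplicas": spec.replicas,
                "maxReplicas": spec.replicas,
                "rayStartParams": {},
                "template": {"spec": {"containers": [{
                    "name": "ray-worker",
                    "image": image,
                    "resources": {"requests": resources, "limits": resources},
                }]}},
            }],
        },
    }
```

For example, `ray_cluster_manifest("dsp-it", WorkerSpec(replicas=2, cpus=2, memory_gib=5), "<test-image>")` returns a manifest that fits the assumptions, while a spec with 3 replicas raises `ValueError`.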

Proposed test case: the "Clustering text documents using k-means" example from the scikit-learn documentation.

https://scikit-learn.org/stable/auto_examples/text/plot_document_clustering.html

Data Science Pipeline stages:

  1. Download the test data (https://scikit-learn.org/stable/auto_examples/text/plot_document_clustering.html#loading-text-data).
  2. Launch a Ray cluster with two worker pods.
  3. The Ray driver launches two Ray actors, one per worker pod. The first actor runs TfidfVectorizer, followed by KMeans clustering and evaluation. The second actor runs HashingVectorizer, followed by KMeans clustering and evaluation.
  4. The Ray driver collects the evaluation results from the two actors and reports a summary.
  5. The Ray cluster is stopped and shut down.
  6. The pipeline run completes.
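
Stages 3 and 4 could look roughly like the driver below. This is a sketch under several assumptions rather than the proposed test code: `ClusteringWorker`, `summarize`, and `run_driver` are hypothetical names, the four newsgroup categories and vectorizer parameters are borrowed from the scikit-learn example, the hashing branch omits the TF-IDF and LSA steps that example applies, and requesting `num_cpus=2` per actor only places one actor per pod if each worker pod advertises exactly 2 CPUs and the head advertises none. Ray and scikit-learn are imported lazily so the pure helper can be used without them.

```python
def summarize(results):
    """Format one line per actor result, highest V-measure first."""
    ordered = sorted(results, key=lambda r: r["v_measure"], reverse=True)
    return "\n".join(
        f"{r['vectorizer']}: homogeneity={r['homogeneity']:.3f} "
        f"completeness={r['completeness']:.3f} v_measure={r['v_measure']:.3f}"
        for r in ordered
    )


class ClusteringWorker:
    """Plain class; run_driver() wraps it with ray.remote() to make it an actor."""

    def __init__(self, vectorizer_name):
        self.vectorizer_name = vectorizer_name

    def run(self, documents, labels, n_clusters):
        from sklearn import metrics
        from sklearn.cluster import KMeans
        from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

        if self.vectorizer_name == "tfidf":
            vectorizer = TfidfVectorizer(max_df=0.5, min_df=5, stop_words="english")
        else:
            vectorizer = HashingVectorizer(stop_words="english", n_features=50_000)
        features = vectorizer.fit_transform(documents)
        kmeans = KMeans(n_clusters=n_clusters, n_init=5, random_state=0).fit(features)
        return {
            "vectorizer": self.vectorizer_name,
            "homogeneity": metrics.homogeneity_score(labels, kmeans.labels_),
            "completeness": metrics.completeness_score(labels, kmeans.labels_),
            "v_measure": metrics.v_measure_score(labels, kmeans.labels_),
        }


def run_driver():
    """Connect to the running Ray cluster, fan out two actors, report results."""
    import ray
    from sklearn.datasets import fetch_20newsgroups

    # Reads from the local scikit-learn cache if stage 1 already downloaded the data.
    categories = ["alt.atheism", "talk.religion.misc", "comp.graphics", "sci.space"]
    data = fetch_20newsgroups(
        subset="all",
        categories=categories,
        remove=("headers", "footers", "quotes"),
        shuffle=True,
        random_state=42,
    )

    ray.init(address="auto")  # assumption: the driver runs inside the cluster
    try:
        # Store the shared inputs once in the object store.
        docs_ref = ray.put(data.data)
        labels_ref = ray.put(data.target)

        worker_cls = ray.remote(ClusteringWorker)
        actors = [
            worker_cls.options(num_cpus=2).remote(name)
            for name in ("tfidf", "hashing")
        ]
        futures = [
            actor.run.remote(docs_ref, labels_ref, len(data.target_names))
            for actor in actors
        ]
        print(summarize(ray.get(futures)))
    finally:
        ray.shutdown()
```

Calling `run_driver()` from the pipeline step that follows cluster launch would cover stages 3 and 4. Stopping the Ray cluster itself (stage 5) is left to the CodeFlare step, since `ray.shutdown()` only disconnects the driver.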

Expected test assets:

  1. DSP pipeline YAML to deploy and kick off test runs.
  2. Test image with Ray and document clustering code.
  3. CodeFlare image to deploy the test image.
  4. Preconfigured credentials and configmaps in the test environment.
