ACM SIGMOD 2023 Programming Contest

Task Details

The task is to build an approximate K-NN Graph for a set of vectors. i.e., for each vector, find its approximate k nearest neighbors in a limited time. For this year's task, k is set to be 100.

A sample dataset will be provided. It contains millions of high-dimensional vectors. We will refer to it as D.

The data in Dataset D is in a binary format that starts with a 4-byte integer, num_vectors(uint32_t), followed by num_vectors x 100 (num_dimension) x sizeof(float32) bytes of data stored one vector after another.

We provide both the reading function(ReadBin) to load the dataset and the writing function(SaveKNNG) to generate the output file in the io.h file.

Your goal is to design an efficient algorithm for approximate K-NN Graph construction. For each vector, your output must contain k consecutive ids of the nearest neighbors determined by your algorithm. These neighbor lists are stored one by one and stored in a binary file.

During evaluation, we will replace the sample dataset D with a hidden test dataset. The hidden test dataset is randomly drew from the same dataset where D was sampled from. We will evaluate your algorithms using the hidden test dataset. More details are available in the Evaluation section.

More details about the datasets can be found in the dedicated Datasets section.

output.bin format: The evaluation process expects "output.bin" to be a binary file containing |S| x 100 x id(uint32_t) . |S| is the number of random sample vectors in dataset S, 100 is the number of nearest neighbors and id is the index of 100-nearest neighbors in the given dataset S.

Please format "output.bin" accordingly. You can check out our provided baseline solution on how to produce a valid "output.bin".