The 2021 Kidney and Kidney Tumor Segmentation challenge (abbreviated KiTS21) is a competition in which teams compete to develop the best system for automatic semantic segmentation of renal tumors and surrounding anatomy.
Kidney cancer is one of the most common malignancies in adults around the world, and its incidence is thought to be increasing. Fortunately, most kidney tumors are discovered early, while they’re still localized and operable. However, important questions concerning the management of localized kidney tumors remain unanswered, and metastatic renal cancer remains almost uniformly fatal.
Kidney tumors are notorious for their conspicuous appearance in computed tomography (CT) imaging, and this has enabled important work by radiologists and surgeons to study the relationship between a tumor's size, shape, and appearance and its prospects for treatment [4,5,6]. It’s laborious work, however, and it relies on assessments that are often subjective and imprecise.
Automatic segmentation of renal tumors and surrounding anatomy (Fig. 1) is a promising tool for addressing these limitations: segmentation-based assessments are objective and necessarily well-defined, and automation eliminates all effort save for the click of a button. Expanding on the 2019 Kidney Tumor Segmentation Challenge, KiTS21 aims to accelerate the development of reliable tools to address this need, while also serving as a high-quality benchmark for comparing segmentation methods generally.
| Date (2021) | Event |
|---|---|
| Mar 1 - Jul 1 | Annotation, Release, and Refinement of Training Data |
| Aug 9 | Deadline for Intention to Submit & Required Paper |
| Aug 16 - 30 | Submissions Accepted |
| Sep 1 | Results Announced |
| Sep 27 or Oct 1 | Satellite Event at MICCAI 2021 |
The KiTS21 cohort includes patients who underwent partial or radical nephrectomy for suspected renal malignancy between 2010 and 2020 at either an M Health Fairview or Cleveland Clinic medical center. A retrospective review of these cases was conducted to identify all patients who had undergone a contrast-enhanced preoperative CT scan that includes the entirety of all kidneys.
Each case's most recent corticomedullary preoperative scan was (or will be) independently segmented three times for each instance of the following semantic classes.
We are hard at work collecting and annotating data for KiTS21. At this point it's difficult to predict the total number of patients that KiTS21 will include. We had originally aimed to segment 800 cases, but unfortunately the COVID-19 pandemic has delayed our progress. We plan to continue collecting and annotating training cases right up until the training set is "frozen" on July 1, 2021. After this point, we will continue to collect and annotate cases for the test set until submissions begin in the middle of August. Training set annotation progress can be tracked using the Browse feature, described in detail in the section below.
In an effort to be as transparent as possible, we've decided to perform our training set annotations in full view of the public. This is the primary reason for creating an independent website for KiTS21 on top of our grand-challenge.org entry, which will be used only to manage the submission process.
You may have noticed a link on the top-right of this page labeled "Browse". This will take you to a list of the KiTS21 training cases, where each case has indicators for its status in the annotation process (e.g., see Fig. 3). The meaning of each symbol is as follows:
When you click on an icon, you will be taken to an instance of the ULabel Annotation Tool where you can see the raw annotations made by our annotation team. Only logged-in members of the annotation team can submit their changes to the server, but you may make edits and save them locally. The annotation team's progress is synced with the KiTS21 GitHub repository about once per week.
It's important to note the distinction between what we call "annotations" and what we call "segmentations". We use "annotations" to refer to the raw vectorized interactions that the user generates during an annotation session. A "segmentation," on the other hand, refers to the rasterized output of a postprocessing script that uses "annotations" to define regions of interest.
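To make the distinction concrete, the following is a minimal sketch of turning a vectorized outline into a rasterized mask using the even-odd rule. It is only an illustration: the polygon, the grid size, and the function names `point_in_polygon` and `rasterize` are hypothetical, and the actual postprocessing script may work quite differently.

```python
def point_in_polygon(x, y, polygon):
    """Even-odd rule: toggle 'inside' for each edge crossed by a ray cast to the right."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def rasterize(polygon, height, width):
    """Return a height x width binary mask; a pixel is set if its center lies inside."""
    return [
        [1 if point_in_polygon(col + 0.5, row + 0.5, polygon) else 0
         for col in range(width)]
        for row in range(height)
    ]


# Hypothetical "annotation": a square outline given as (x, y) vertices.
annotation = [(1.0, 1.0), (4.0, 1.0), (4.0, 4.0), (1.0, 4.0)]

# The corresponding "segmentation": a 5 x 5 grid of 0/1 pixel labels.
segmentation = rasterize(annotation, height=5, width=5)
for row in segmentation:
    print(row)
```

Here the annotation is a short list of vertices, while the segmentation is a dense grid whose resolution is fixed by the image it will be compared against.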
We placed members of our annotation team into three categories:
Broadly, our annotation process is as follows.
Our postprocessing script uses thresholds and fairly simple heuristic-based geometric algorithms. Its source code is available on the KiTS21 GitHub repository under /annotation/postprocessing.py.
Put simply, the model that produces the "best" segmentations of kidneys, tumors, cysts, arteries, veins, and ureters for the patients in the test set will be declared the winner. Unfortunately, "best" can be difficult to define.
Before metrics are discussed, we need to discuss the regions that they will be applied to. A common choice in multiclass segmentation is to simply compute the metric for each semantic class and take the average. For this challenge, we don't think this is the best approach. To illustrate why, consider a case in the test set where a single kidney holds both a tumor and a cyst. If a submission has a very high-quality kidney segmentation, but struggles to differentiate kidney voxels from those belonging to the masses, we believe this deserves a high score for the kidney region, but low scores for each mass. Similarly, suppose the masses were segmented nicely, but the system confuses the cyst with the tumor. We don't think it is ideal to penalize the submission twice here (once for the tumor region, once for the cyst region) when in fact it has done a very good job segmenting masses.
To address this, we use what we call "Hierarchical Evaluation Classes" (HECs). In an HEC, classes that are considered subsets of another class are combined with that class for the purposes of computing a metric for the superset. For KiTS21, the following HECs will be used.
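The effect of HECs on the cyst/tumor confusion example can be sketched in a few lines of Python. The integer label ids, the three entries of the `HECS` mapping (inferred from the kidney, tumor, and cyst example), and the helper names are illustrative assumptions, not the challenge's actual label encoding or evaluation code.

```python
# Hypothetical integer label ids; the challenge's actual encoding may differ.
BACKGROUND, KIDNEY, TUMOR, CYST = 0, 1, 2, 3

# Assumed HECs: each is the union of a class with the classes treated as its subsets.
HECS = {
    "kidney_and_masses": {KIDNEY, TUMOR, CYST},
    "masses": {TUMOR, CYST},
    "tumor": {TUMOR},
}


def hec_voxels(labels, hec_name):
    """Return the set of voxel indices whose label belongs to the given HEC."""
    members = HECS[hec_name]
    return {i for i, label in enumerate(labels) if label in members}


def dice(pred, ref):
    """Sørensen-Dice overlap of two voxel sets (defined as 1.0 when both are empty)."""
    if not pred and not ref:
        return 1.0
    return 2 * len(pred & ref) / (len(pred) + len(ref))


# A flattened toy volume: the prediction swaps the tumor and the cyst,
# but outlines the kidney and the masses perfectly.
reference = [0, 1, 1, 2, 2, 3, 3, 1, 0]
prediction = [0, 1, 1, 3, 3, 2, 2, 1, 0]

scores = {
    name: dice(hec_voxels(prediction, name), hec_voxels(reference, name))
    for name in HECS
}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

In this toy case the confusion between cyst and tumor lowers only the tumor score, while the two superset regions keep a perfect score, which is the behavior the HEC design is aiming for.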
In 2019, we used a simple Sørensen-Dice ranking with the "Kidney and Tumor" and "Tumor" HECs. The decision to use Dice alone was made to prioritize simplicity and ease of interpretation. We still believe that these things are important, but we also recognize that Dice scores have their limitations. One limitation was the outsized influence that small tumors had on the rankings. This was because segmentation errors occur overwhelmingly on the borders of regions, and small regions have a higher ratio of border voxels to interior voxels, leading to lower values for volumetric overlap scores like Dice.
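The border-to-interior argument can be checked with a toy experiment. The sketch below assumes cubic regions and a prediction that misses exactly a one-voxel-thick shell of the reference; the shapes, sizes, and function names are illustrative only.

```python
def cube(side, offset=0):
    """Set of voxel coordinates of an axis-aligned cube with the given side length."""
    rng = range(offset, offset + side)
    return {(x, y, z) for x in rng for y in rng for z in rng}


def dice(pred, ref):
    """Sørensen-Dice overlap of two non-empty voxel sets."""
    return 2 * len(pred & ref) / (len(pred) + len(ref))


def dice_with_one_voxel_border_error(side):
    """Dice when the prediction misses a one-voxel-thick shell of the reference."""
    ref = cube(side)
    pred = cube(side - 2, offset=1)  # same center, one voxel removed from every face
    return dice(pred, ref)


small = dice_with_one_voxel_border_error(10)  # small region
large = dice_with_one_voxel_border_error(40)  # large region
print(f"side 10: Dice = {small:.3f}")
print(f"side 40: Dice = {large:.3f}")
```

The same one-voxel border error yields a Dice of roughly 0.68 for the small cube and roughly 0.92 for the large one, so a ranking based on Dice alone is driven disproportionately by cases with small regions.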
The test set is being annotated using the same workflow as the training set, and so it too will have several segmentations per case. We've decided to take advantage of this in order to address some of the limitations of a single-reference Dice approach. Our planned approach is similar to that of Heimann et al. for liver segmentation in the 2007 MICCAI Challenge Workshop. In particular, we will be computing gauged scores for each predicted region, which are adjusted according to the average error observed between human raters in that region.
Formally, let be the error between the prediction and reference, where the subscripts represent:
Similarly, let represent the average error between rater and the remaining two raters. Submissions will be ranked according to a total score
The intent of this transformation is to normalize scores so that predictions with roughly equal quality to that of our trainees will achieve a score of 90. Note that negative scores are possible when the error is at least 10 times as high as it is between trainees on average. The six metrics to be used (tentatively) for this computation are as follows:
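As a rough illustration of the gauging step, the sketch below assumes a linear form, 100 - 10 * E / Ê, where E is a prediction's error against one rater's reference and Ê is that rater's average error against the remaining raters. This form is an assumption chosen only because it matches the stated properties (about 90 at human-level error, 0 at ten times that error); the function names and the error values are hypothetical.

```python
from statistics import mean


def inter_rater_error(pairwise_errors, rater):
    """Average error between `rater` and each of the remaining raters.

    `pairwise_errors[(a, b)]` holds the error of rater a measured against rater b.
    """
    return mean(err for (a, _b), err in pairwise_errors.items() if a == rater)


def gauged_score(error, reference_error):
    """Assumed linear gauging: 100 at zero error, 90 when error equals reference_error.

    A reference_error of zero would need special handling and is not covered here.
    """
    return 100.0 - 10.0 * error / reference_error


# Hypothetical errors (e.g. 1 - Dice) between three raters on one region of one case.
pairwise_errors = {
    (0, 1): 0.04, (0, 2): 0.06,
    (1, 0): 0.04, (1, 2): 0.05,
    (2, 0): 0.06, (2, 1): 0.05,
}

e_hat = inter_rater_error(pairwise_errors, rater=0)
for label, error in [("human-level", e_hat), ("twice human", 2 * e_hat), ("10x human", 10 * e_hat)]:
    print(f"{label}: {gauged_score(error, e_hat):.1f}")
```

A total score would then average such per-term scores over metrics, regions, cases, and raters; how exactly that aggregation is weighted is left open in this sketch.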
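| Metric | Description |
|---|---|
| 1 - Dice | A common measure of discrepancy in volumetric overlap. If we let $P$ be the set of predicted voxels and $R$ be the set of reference voxels, $1 - \mathrm{Dice} = 1 - \frac{2\lvert P \cap R\rvert}{\lvert P\rvert + \lvert R\rvert}$. |
| 1 - IoU | Sometimes referred to as 1 - Jaccard. Another measure of discrepancy in volumetric overlap, with a greater emphasis on cases with high disagreement: $1 - \mathrm{IoU} = 1 - \frac{\lvert P \cap R\rvert}{\lvert P \cup R\rvert}$. |
| Symmetric Relative Volume Difference | A measure of difference in total predicted volume, but slightly modified from traditional Relative Volume Difference. |
| Absolute Volume Difference | Another measure of difference in total predicted volume, but without normalizing for the size of the reference. Using our notation from above, $\mathrm{AVD} = \big\lvert \lvert P\rvert - \lvert R\rvert \big\rvert$. |
| Average Symmetric Surface Distance | The average distance from each prediction boundary voxel to the nearest boundary voxel of the reference, and vice versa. Boundary voxels are defined as any voxel in a region that has a voxel outside of that region in its 18-neighborhood. Let $B_P$ and $B_R$ represent the sets of boundary voxels on the prediction and reference sets respectively, and let $d(x, B) = \min_{b \in B} \lVert x - b \rVert$. Then $\mathrm{ASSD} = \frac{\sum_{p \in B_P} d(p, B_R) + \sum_{r \in B_R} d(r, B_P)}{\lvert B_P\rvert + \lvert B_R\rvert}$. |
| RMS Symmetric Surface Distance | Similar to average symmetric surface distance, but with greater emphasis on boundary areas with large disagreement: $\mathrm{RMSSD} = \sqrt{\frac{\sum_{p \in B_P} d(p, B_R)^2 + \sum_{r \in B_R} d(r, B_P)^2}{\lvert B_P\rvert + \lvert B_R\rvert}}$. |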
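For readers who want to experiment before the official evaluation code is available, here is a minimal, brute-force sketch of the overlap, volume, and surface-distance errors using their standard textbook definitions on sets of voxel coordinates. The function names and the `voxel_volume` and `spacing` parameters are assumptions made for illustration; Symmetric Relative Volume Difference is omitted because its exact modification is not specified here, and empty regions are not handled.

```python
import math
from itertools import product


def dice_error(pred, ref):
    return 1.0 - 2 * len(pred & ref) / (len(pred) + len(ref))


def iou_error(pred, ref):
    return 1.0 - len(pred & ref) / len(pred | ref)


def absolute_volume_difference(pred, ref, voxel_volume=1.0):
    return abs(len(pred) - len(ref)) * voxel_volume


# 18-neighborhood: offsets that change one or two coordinates (6 faces + 12 edges).
OFFSETS_18 = [o for o in product((-1, 0, 1), repeat=3) if 1 <= sum(map(abs, o)) <= 2]


def boundary(region):
    """Voxels of `region` that have at least one 18-neighbor outside the region."""
    return {
        (x, y, z) for (x, y, z) in region
        if any((x + dx, y + dy, z + dz) not in region for dx, dy, dz in OFFSETS_18)
    }


def _scale(voxels, spacing):
    """Convert voxel indices to physical coordinates."""
    return [tuple(c * s for c, s in zip(v, spacing)) for v in voxels]


def surface_distances(pred, ref, spacing=(1.0, 1.0, 1.0)):
    """Nearest-boundary distances in both directions (brute force, small inputs only)."""
    bp = _scale(boundary(pred), spacing)
    br = _scale(boundary(ref), spacing)
    forward = [min(math.dist(p, r) for r in br) for p in bp]
    backward = [min(math.dist(r, p) for p in bp) for r in br]
    return forward + backward


def assd(pred, ref, spacing=(1.0, 1.0, 1.0)):
    d = surface_distances(pred, ref, spacing)
    return sum(d) / len(d)


def rmssd(pred, ref, spacing=(1.0, 1.0, 1.0)):
    d = surface_distances(pred, ref, spacing)
    return math.sqrt(sum(x * x for x in d) / len(d))


def cube(side, offset_x=0):
    return {(x + offset_x, y, z) for x in range(side) for y in range(side) for z in range(side)}


ref = cube(4)
pred = cube(4, offset_x=1)  # same shape, shifted by one voxel along x
print(dice_error(pred, ref), iou_error(pred, ref), absolute_volume_difference(pred, ref))
print(assd(pred, ref), rmssd(pred, ref))
```

The shifted cube has zero volume difference but nonzero overlap and surface errors, which illustrates why volume-based and boundary-based metrics are reported alongside Dice. For full CT volumes the brute-force nearest-neighbor search would be far too slow, and a distance transform or spatial index would be the usual substitute.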
In the occasional cases where either a reference HEC or a predicted HEC is empty but not both, the metrics will be computed as follows:
The above treatment of the volume-difference-based scores was derived such that human annotators would be expected to achieve a total score of roughly 90 on cases with empty references, since that is also the intention with nonempty cases. It's also somewhat intuitive since it consists of a flat penalty for the false positive/negative plus an additional term that increases with the size of the region that was missed or erroneously predicted. In cases where both the predicted and reference HEC are empty, all errors will be set to zero.
The code that will be used to compute these metrics on the test set will be available on the GitHub repository under /evaluation/ (in preparation).