Use Case 6: Invasive Ductal Carcinoma (IDC) Segmentation - Andrew Janowczyk

This blog post explains how to train a deep learning Invasive Ductal Carcinoma (IDC) classifier in accordance with our paper "Deep learning for digital pathology image analysis: A comprehensive tutorial with selected use cases".

Please note that there has been an update to the overall tutorial pipeline, which is discussed in full here.

This text assumes that Caffe is already installed and running. For guidance on that, you can reference this blog post, which describes how to install it in an HPC environment (and can easily be adapted for local Linux distributions).

As mentioned in the paper, the goal of this project was to create a single deep learning approach which worked well across many different digital pathology tasks. On the other hand, each tutorial is intended to be able to stand on its own, so there is a large amount of overlapping material between Use Cases.

Since the data was provided to us at the patch level, we are able to reproduce the exact training and test sets of [8]. As a result, we don't perform any k-fold testing, so we have one less step than the previous use cases.

Invasive Ductal Carcinoma (IDC) is the most common subtype of all breast cancers. To assign an aggressiveness grade to a whole mount sample, pathologists typically focus on the regions which contain the IDC. As a result, one of the common pre-processing steps for automatic aggressiveness grading is to delineate the exact regions of IDC inside of a whole mount slide.

We obtained the exact dataset, down to the patch level, from the authors of [8] to allow for a head-to-head comparison with their state-of-the-art approach, and recreate the experiment using our network. The challenge, simply stated, is: can our smaller, more compact network produce comparable results? Our approach is at a notable disadvantage, as their network accepts patches of size 50 x 50 while ours uses 32 x 32 (1,024 vs. 2,500 pixels), providing roughly 60% fewer pixels of context to the classifier.

Overview

We break down this approach into 4 steps:

Step 1: Patch Extraction (Matlab): extract patches from all images of both the positive and negative classes and generate the training and test lists.

Step 2: Database Creation (Bash): using the patches and training lists created in the previous step, create leveldb training and testing databases, with mean files, for high-performance DL training.

Step 3: Training of DL classifier (Bash): run the provided prototxt files (solver and architecture) to train the classifier using Caffe.

Step 4: Generating Output on Test Images (Python): use the final model to generate the output.

There are, of course, other ways of implementing a pipeline like this (e.g., use Matlab to directly create a leveldb, or skip the leveldb entirely and use the images directly for training). I've found that the above pipeline fits most easily into the tools available inside of Caffe and Matlab, and thus requires the least maintenance and reduces complexity for less experienced users. If you have a suggested improvement, I'd love to hear it!

The original dataset consisted of 162 whole mount slide images of Breast Cancer (BCa) specimens scanned at 40x. From that, 277,524 patches of size 50 x 50 were extracted (198,738 IDC negative and 78,786 IDC positive).

Each patch’s file name is of the format:

u_xX_yY_classC.png → example: 10253_idx5_x1351_y1101_class0.png

Where u is the patient ID (10253_idx5), X is the x-coordinate of where this patch was cropped from, Y is the y-coordinate of where this patch was cropped from, and C indicates the class, where 0 is non-IDC and 1 is IDC.
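
If you need to script against these names, the fields are easy to recover with a small parser. The sketch below is not part of the original pipeline; the regex and the helper name are mine, and the optional trailing letters account for augmented copies such as the "...class0r.png" files shown further down.

import re

# Hypothetical helper (not part of the tutorial code): parse u_xX_yY_classC.png.
PATCH_RE = re.compile(r"^(?P<patient>.+)_x(?P<x>\d+)_y(?P<y>\d+)_class(?P<label>[01])[a-z]*\.png$")

def parse_patch_name(fname):
    """Return (patient_id, x, y, label) parsed from a patch filename."""
    m = PATCH_RE.match(fname)
    if m is None:
        raise ValueError("unexpected filename: %s" % fname)
    return m.group("patient"), int(m.group("x")), int(m.group("y")), int(m.group("label"))

print(parse_patch_name("10253_idx5_x1351_y1101_class0.png"))
# -> ('10253_idx5', 1351, 1101, 0)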

The data and training/test set partitions are located here (1.6G).

Examples of these images can be seen below

We refer to step1_make_patches_and_list_all_types.m, which is fully commented and contains options for the three versions discussed in the paper: (a) cropped version, (b) resized version, (c) resized with additional rotations for class balancing.

A high level understanding is provided here:

  1. Load the training, validation and test files, which indicate which patients were used for which stage.
  2. For each image, load its respective patches and either resize them to fit our architecture or crop them (50 x 50 → 32 x 32).
  3. Save the modified patches to disk. At the same time, we write 6 files indicating which patches belong in which set (training, validation, testing). These files are:

train_w32_parent_1.txt: This contains a list of the patient IDs which have been used as part of the training set. This is similar to valid_w32_parent_1.txt and test_w32_parent_1.txt, for the validation and testing sets respectively. An example of the file content is:

10304
9346
9029
12911

train_w32_1.txt: contains the filenames of the patches which should go into the training set (and the test and validation sets when using test_w32_1.txt and valid_w32_1.txt, respectively). The file format is [filename] [tab] [class], where class is either 0 (non-IDC) or 1 (IDC). An example of the file content is:

12909_idx5_x101_y1301_class0.png 0
12909_idx5_x101_y1301_class0r.png 0
12909_idx5_x1151_y1351_class0.png 0
12909_idx5_x1151_y1351_class0r.png 0
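
The actual implementation is the Matlab script mentioned above; purely as an illustration of the same logic, here is a rough Python sketch under a few assumptions (Pillow installed, the patient ID sets already loaded from the parent files, and a helper name of my own). For brevity it writes only the three patch lists, not the *_parent_1.txt files.

import os
from PIL import Image

# Illustrative Python sketch of the Matlab step above (not the tutorial's actual code).
def prepare_patches(src_dir, out_dir, train_ids, valid_ids, test_ids, mode="crop"):
    lists = {"train": [], "valid": [], "test": []}
    for fname in sorted(os.listdir(src_dir)):
        if not fname.endswith(".png"):
            continue
        patient = fname.split("_x")[0]               # e.g. "10253_idx5"
        label = fname.rsplit("class", 1)[1][0]       # "0" or "1"
        split = ("train" if patient in train_ids else
                 "valid" if patient in valid_ids else
                 "test" if patient in test_ids else None)
        if split is None:
            continue
        img = Image.open(os.path.join(src_dir, fname))
        if mode == "crop":                           # central 32 x 32 out of 50 x 50
            img = img.crop((9, 9, 41, 41))
        else:                                        # resize 50 x 50 down to 32 x 32
            img = img.resize((32, 32), Image.BILINEAR)
        img.save(os.path.join(out_dir, fname))
        lists[split].append("%s\t%s" % (fname, label))
    for split, rows in lists.items():
        with open("%s_w32_1.txt" % split, "w") as f:
            f.write("\n".join(rows) + "\n")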

All done with the Matlab component!

Now that we have both the patches saved to disk and the training and testing lists, we need to get the data ready for consumption by Caffe. It is possible, at this point, to use an Image layer in Caffe and skip this step, but it comes with 2 caveats: (a) you need to make your own mean-file and ensure it is in the correct format, and (b) an image layer is not designed for high throughput. Also, having 100k+ files in a single directory can bring the system to its knees in many cases (for example, "ls", "rm", etc.), so it's a bit more handy to compress them all into 3 databases and use Caffe's tool to compute the mean-file.

For this purpose, we use this bash file: step3_make_dbs.sh

We run it in the “subs” directory (“./” in these commands), which contains all of the patches. As well, we assume the training lists are in “../”, the directory above it.

In this use case, we have a validation set given to us, which was used initially in [8] to determine learning variables but was subsequently added into the training set. In our paper we concatenated the two files to create a single, larger train_w32_1.txt, as our learning parameters and iterations are fixed and thus do not require a validation set.
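
In practice that concatenation is just appending one list file to the other; a minimal sketch, assuming the two list files produced in step 1:

# Minimal sketch: fold the validation list into the training list to form
# a single, larger train_w32_1.txt (file names as produced in step 1).
with open("train_w32_1.txt", "a") as train_f, open("valid_w32_1.txt") as valid_f:
    train_f.write(valid_f.read())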

Here we'll briefly discuss the general idea of the commands; the script itself has additional functionality (for example, it computes everything in parallel).

Creating Databases

We use the Caffe-supplied convert_imageset tool to create the databases using this command:

~/caffe/build/tools/convert_imageset -shuffle -backend leveldb ./ ../train_w32_1.txt DB_train_1

We first tell it that we want to shuffle the lists; this is very important. Our lists are in patient and class order, making them unsuitable for stochastic gradient descent. Since the database stores files sequentially, as supplied, we need to permute the lists. Either we can do it manually (e.g., using sort --random-sort), or we can just let Caffe do it 🙂

We specify that we want to use a leveldb backend instead of an lmdb backend. My experiments have shown that leveldb can actually compress the data much better without incurring a large amount of computational overhead, so we choose to use it.

Then we supply the directory with the patches, supply the training list, and tell it where to save the database. We do this similarly for the test set.
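
If you prefer not to use the bash script, the same two calls can be driven from Python; a minimal sketch, assuming the layout described above (patches in "./", lists in "../", Caffe under ~/caffe):

import os
import subprocess

# Sketch paralleling step3_make_dbs.sh: build the training and testing leveldbs.
convert = os.path.expanduser("~/caffe/build/tools/convert_imageset")
for split in ("train", "test"):
    subprocess.check_call([convert, "-shuffle", "-backend", "leveldb",
                           "./", "../%s_w32_1.txt" % split, "DB_%s_1" % split])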

Creating mean file

To zero the data, we compute a mean file, which holds the mean value of each pixel as seen across all the patches of the training set. During training/testing time, this mean value is subtracted from each pixel to roughly "zero" the data, improving the efficiency of the DL algorithm.

Since we used a leveldb database to hold our patches, this is a straightforward process:

~/caffe/build/tools/compute_image_mean DB_train_1 DB_train_w32_1.binaryproto -backend leveldb

Supply it the name of the database to use, the mean filename to use as output, and specify that we used a leveldb backend. That's it!
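
For reference, here is how that binaryproto would typically be read back and applied at deploy time with pycaffe; a sketch only, assuming pycaffe is importable and the mean file sits in the current directory:

import numpy as np
import caffe

# Sketch: load the mean file and subtract it from a patch,
# mirroring what Caffe's data transformer does with a mean_file.
blob = caffe.proto.caffe_pb2.BlobProto()
with open("DB_train_w32_1.binaryproto", "rb") as f:
    blob.ParseFromString(f.read())
mean = caffe.io.blobproto_to_array(blob)[0]        # (3, 32, 32), BGR, 0-255 range

patch = np.zeros((3, 32, 32), dtype=np.float32)    # placeholder for a real patch
zeroed = patch - mean                               # roughly "zero"-centered input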

Setup files

Now that we have the databases, and the associated mean-file, we can use Caffe to train a model.

There are two files involved, which may need to be slightly altered, as discussed below:

BASE-alexnet_solver.prototxt: This file describes various learning parameters (iterations, learning method (AdaGrad), etc.).

On lines 1 and 10 change: “%(kfoldi)d” to “1”, since we have only 1 fold.

On line 2: change "%(numiter)d" to number_test_samples/128. This is to have Caffe iterate through the entire test database. It's easy to figure out how many test samples there are using:

wc -l test_w32_1.txt

BASE-alexnet_traing_32w_db.prototxt: This file defines the architecture.

We only need to change lines 8, 12, 24, and 28 to point to the correct fold (again, replace "%(kfoldi)d" with 1). That's it!

Note, these files assume that the prototxts are stored in a directory called ./model and that the DB files and mean files are stored in the directory above (../). You can of course use absolute file paths when in doubt.
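
If you are doing these edits by hand rather than with the submission script below, a small sketch of the substitution might look like the following (output file names are illustrative, and the test batch size of 128 is taken from the formula above):

import math

# Sketch: count the test samples, derive the test-iteration count, and fill in
# the placeholders in both prototxt files (naming of the outputs is illustrative).
with open("test_w32_1.txt") as f:
    num_test = sum(1 for _ in f)
num_iter = int(math.ceil(num_test / 128.0))

for name in ("BASE-alexnet_solver.prototxt", "BASE-alexnet_traing_32w_db.prototxt"):
    with open(name) as f:
        text = f.read()
    text = text.replace("%(kfoldi)d", "1").replace("%(numiter)d", str(num_iter))
    with open(name.replace("BASE-", "1-"), "w") as f:
        f.write(text)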

In our case, we had access to a high performance computing cluster, so we used a python script (step4_submit_jobs.py) to submit the training process to the HPC. This script automatically does all of the above work, but you need to provide the working directory on line 11. I use this (BASE-qsub.pbs) PBS script to request resources from our Torque scheduler, which is easily adaptable to other HPC environments.

Initiate training

If you’ve used the HPC script above, things should already be queued for training. Otherwise, you can start the training simply by saying:

~/caffe/build/tools/caffe train --solver=1-alexnet_solver_ada.prototxt

Run this in the directory which holds the prototxt files. That's it! Now wait until it finishes 600,000 iterations. 🙂

At this point, you should have a model available to generate some output images. Don't worry if you don't; you can use mine.

Here is a Python script to generate the test output for the associated k-fold (step5_create_output_images_kfold.py).

It takes 2 command line arguments: the base directory and the fold. Make sure to edit line 88 to apply the appropriate scaling or cropping, depending on your training protocol.

The base directory is expected to contain:

BASE/images: a directory which contains the tif images for output generation

BASE/models: a directory which holds the learned model

BASE/test_w32_parent_1.txt: the list of parent IDs to use in creating the output for fold 1, created in step 1

BASE/DB_train_w32_1.binaryproto: the binary mean file for fold 1 created in step 2

It generates 2 output images for each input: a "_class" image and a "_prob" image. The "_prob" image is a 3-channel image which contains the likelihood that a particular pixel belongs to each class. In this case, the red channel represents the likelihood that a pixel belongs to the non-IDC class, and the green channel represents the likelihood that a pixel belongs to the IDC class. The two channels sum to 1. The "_class" image is a binary image using the argmax of the "_prob" image.


The annotated image on the left shows in green where the pathologist has identified IDC. On the right, we overlay a heatmap onto the same image, where the more red the pixel is, the more likely it is IDC. We note that the regions at twelve o’clock are not actually false positives, but were too small to be deemed interesting by the pathologist, thus they were not originally labeled.

Typically, you'll want to use a validation set to determine an optimal threshold, as it is often not .5 (which is equivalent to argmax). Subsequently, use this threshold on the "_prob" image to generate a binary image.
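
As a sketch of that post-processing, assuming scikit-image is available and using a made-up output filename and threshold:

import numpy as np
from skimage import io

# Sketch: derive a binary IDC mask from a "_prob" output image.
prob = io.imread("10253_idx5_prob.png") / 255.0    # hypothetical output filename
idc_likelihood = prob[:, :, 1]                     # green channel = p(IDC), per the text above

mask_argmax = idc_likelihood >= 0.5                # equivalent to the "_class" image
mask_tuned = idc_likelihood >= 0.3                 # 0.3 is a placeholder validation-derived threshold
io.imsave("10253_idx5_mask.png", (mask_tuned * 255).astype(np.uint8))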

Efficiency in Patch Generation

Writing a large number of small, individual files to a hard drive (even an SSD) is likely going to take a very long time. Thus, for Step 1 and Step 2, I typically employ a RAM disk to drastically speed up the process. Regardless, make sure Matlab does not have the output directory in its search path; otherwise it will likely crash (or come to a halt) while trying to update its internal list of available files.

As well, opening a Matlab pool (matlabpool open) spins up numerous workers, which also greatly speeds up the operation and is recommended.

Efficiency in Output Generation

It will most likely take a long time to apply the classifier pixel-wise to an image to generate the output. In actuality, there are many ways to speed up this process. The easiest way in this case is to simply use a larger stride, such that you compute only every 2nd or 3rd pixel, since IDC segmentation doesn't require nearly as much precision as, say, nuclei segmentation. Another technique is to simply white-threshold away the background regions, which often take up a significant portion of the images.
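
A rough sketch of both ideas, where classify_patch stands in for a forward pass through the trained network (a hypothetical helper, not part of the tutorial code):

import numpy as np

# Sketch: stride-based evaluation plus white-thresholding of background pixels.
def fast_probability_map(img, classify_patch, stride=3, half=16, white_thresh=220):
    h, w = img.shape[:2]
    prob = np.zeros((h, w), dtype=np.float32)
    for r in range(half, h - half, stride):
        for c in range(half, w - half, stride):
            if img[r, c].min() > white_thresh:      # near-white pixel: almost certainly background
                continue
            patch = img[r - half:r + half, c - half:c + half]   # 32 x 32 context window
            prob[r:r + stride, c:c + stride] = classify_patch(patch)
    return prob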

Keep an eye out for future posts where we delve deeper into this and provide the code which we use!

Magnification

It is very important to use the model on images of the same magnification as the training magnification. This is to say, if your patches are extracted at 10x, then the test images need to be done at 10x as well.

Code is available here.

Data is available here (1.6G).
