I have created a blood cell detection dataset containing 2400+ labelled blood cell images, which can be used to detect white and red blood cells. Below, I briefly describe what I did along the way.
One of the conditions for developing a good artificial intelligence model is having good data. Collecting, cleaning, and labeling data is therefore an important task in itself. I had been thinking about a blood cell detection data set for a long time, and I prepared this one to experience that process from the very beginning to the very end. In this article, I will briefly talk about these stages.
Dataset Links: GitHub and Kaggle
What Does the Blood Cell Detection Data Set Do?
A peripheral blood smear is a slide that is prepared from a peripheral blood sample for examining the patient’s blood under a light microscope. It is used to examine the morphology of the erythrocytes, leukocytes, and thrombocytes in the sample.
This slide helps us in the differential diagnosis of some diseases in the clinic based on its different features such as the number, shape, and color of the cells.
This data set was created by collecting images of blood smears under the microscope at high magnification and resolution and labeling them as white and red blood cells. Using this data set, an artificial intelligence model can be developed that can automatically detect white and red cells in a blood smear.
Data Collection
I collected the microscope images in the multidisciplinary laboratory of Ankara Yıldırım Beyazıt University Faculty of Medicine, using my faculty’s microscopes and the laboratory’s own peripheral smear preparations. I recorded 200+ images in total.
Since the first images I took were in very high resolution and in TIFF format, they took up a lot of space. I cropped the images to 256 × 256 pixels and converted them to PNG format. After screening the images, I reduced their number to 100.
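The cropping and conversion step can be sketched in Python as follows. This is a minimal sketch, not the exact script used for the dataset: it assumes the Pillow library is installed and that a centered crop is taken (the article does not say which region of each image was kept). The function names are made up for illustration.

```python
from pathlib import Path


def center_crop_box(width, height, size=256):
    """Return (left, top, right, bottom) for a centered size x size crop."""
    if width < size or height < size:
        raise ValueError("image is smaller than the requested crop")
    left = (width - size) // 2
    top = (height - size) // 2
    return (left, top, left + size, top + size)


def tiff_to_cropped_png(src_path, dst_dir, size=256):
    """Crop one TIFF image around its center and save it as a PNG."""
    # Assumption: Pillow is available. It is imported here so that the
    # pure-Python helper above can be used without it.
    from PIL import Image

    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst_path = dst_dir / (Path(src_path).stem + ".png")
    with Image.open(src_path) as img:
        box = center_crop_box(img.width, img.height, size)
        img.crop(box).save(dst_path, format="PNG")
    return dst_path
```

Calling `tiff_to_cropped_png("raw_tiff/smear_001.tif", "cropped_png")` would write `cropped_png/smear_001.png`; both paths are hypothetical. A centered crop discards most of a high-resolution frame, so cutting several tiles per image would be a reasonable alternative.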
Data Tagging
There are many online and offline tools for annotating data. I chose VoTT, developed by Microsoft, and in general I was satisfied with it. Importing and exporting data is very easy: you can tag all your data once and then export it in the format you need. Labeled data can be exported as Pascal VOC, CSV, or TFRecords with one click, so you don’t have to worry about using a converter all the time.
At the end of the labeling, 103 white blood cells and 2237 red blood cells were labeled. I am aware that red blood cells greatly outnumber white blood cells, but this reflects normal physiology. This class imbalance is a problem that can be addressed during pre-processing and model training.
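One common way to handle such an imbalance during training is to weight each class inversely to its frequency. The snippet below is only an illustration of that idea using the label counts above; the class names `wbc` and `rbc` and the weighting formula are my assumptions, not part of the dataset.

```python
def inverse_frequency_weights(counts):
    """Map each class to total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Label counts reported in this article; the class names are illustrative.
counts = {"wbc": 103, "rbc": 2237}
weights = inverse_frequency_weights(counts)
for label, weight in weights.items():
    print(f"{label}: {weight:.2f}")
```

With these counts, a white blood cell gets a weight of about 11.36 and a red blood cell about 0.52, so both classes contribute equally to a weighted loss overall. Oversampling images that contain white blood cells is another option.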
Publishing
I exported the data I tagged with VoTT in CSV format and prepared a brief information document about the data set. Then I made the necessary arrangements and shared the data set on GitHub and Kaggle.
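For anyone who wants to load the annotations, here is a small sketch that reads such a CSV using only the Python standard library. It assumes one bounding box per row with the columns `image`, `xmin`, `ymin`, `xmax`, `ymax`, and `label`, which is the layout I expect from a VoTT CSV export; please check the header of the actual file. The rows and label names below are made up.

```python
import csv
import io
from collections import Counter, defaultdict


def load_boxes(csv_file):
    """Group bounding boxes by image name from a CSV with one box per row."""
    boxes_by_image = defaultdict(list)
    for row in csv.DictReader(csv_file):
        box = tuple(float(row[key]) for key in ("xmin", "ymin", "xmax", "ymax"))
        boxes_by_image[row["image"]].append((row["label"], box))
    return boxes_by_image


# Made-up rows standing in for the exported annotation file.
sample = io.StringIO(
    '"image","xmin","ymin","xmax","ymax","label"\n'
    '"image-1.png",10.0,12.5,48.0,50.0,"rbc"\n'
    '"image-1.png",100.0,90.0,160.0,155.0,"wbc"\n'
    '"image-2.png",5.0,5.0,40.0,42.0,"rbc"\n'
)
boxes = load_boxes(sample)
label_counts = Counter(label for items in boxes.values() for label, _ in items)
print(dict(label_counts))
```

To use it with the real file, open the CSV with `open(path, newline="")` and pass the file object to `load_boxes`.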
Summary
As a result, I have created a data set that can be used to detect white and red blood cells in peripheral smears. It may not be an excellent-quality data set, but it was my first end-to-end experience in preparing one. It is of particular importance to me because I did everything myself in the laboratory, from microscopy and image acquisition to publishing. It was a good step toward preparing better datasets in the future.
You can also make your contributions through GitHub.