Breast Ultrasound Lesion Classification with PyRadiomics and Scikit-Learn
In the previous post, we explored Radiomics and its Workflow step-by-step. Radiomics is a powerful tool for quantifying imaging features from medical images to extract clinically meaningful patterns. In this post, we’ll dive into a practical example: developing a radiomics and machine learning solution to classify breast ultrasound lesions into benign and malignant categories using Python, PyRadiomics, and Scikit-Learn.
Dataset Overview
We will use the Breast Ultrasound Images Dataset [1]. The data, collected at baseline in 2018, comprise breast ultrasound images from 600 female patients aged 25 to 75 years. The dataset consists of 780 PNG images with an average size of 500×500 pixels. Ground-truth mask images are provided alongside the original images. The images are categorized into three classes: normal, benign, and malignant. For this tutorial, we focus on the benign and malignant cases.
Download the dataset and extract the zip archive, then move the “Dataset_BUSI_with_GT” folder into your project’s main folder. The folder structure will look like this:
Dataset_BUSI_with_GT/
    benign/
    malignant/
    normal/
Each lesion category contains corresponding mask images, which outline the regions of interest (ROIs) for feature extraction. You can see an example of a malignant image and its mask image below.
Step 1: Dataset Preparation
The first step is preparing the dataset for analysis. The dataset contains breast ultrasound images categorized into three classes: normal, benign, and malignant. Each lesion also has an accompanying mask file that outlines the region of interest (ROI) within the image.
First, we must traverse the dataset directory, identify valid image and mask file pairs, and store their file paths for further processing. This ensures that we only include images with corresponding masks, which are crucial for radiomics feature extraction. The masks define the specific areas in the image to focus on, allowing us to analyze the lesions accurately. During this step, we also label each image as benign or malignant, discarding normal cases, as these are not relevant for this classification task.
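    import os

    images_list = []
    masks_list = []
    labels_list = []

    base_dir = "Dataset_BUSI_with_GT"

    # sorted() keeps the file order, and therefore later splits, reproducible.
    for label in sorted(os.listdir(base_dir)):
        label_dir = os.path.join(base_dir, label)
        # Skip stray files and the "normal" class, which is not used here.
        if not os.path.isdir(label_dir) or label == "normal":
            continue

        for image_name in sorted(os.listdir(label_dir)):
            # Keep only original images; mask files are matched by name below.
            if not image_name.endswith(".png") or "_mask" in image_name:
                continue

            base_name = image_name[:-4]  # strip the ".png" extension
            mask_path = os.path.join(label_dir, f"{base_name}_mask.png")

            # Store the pair only if the corresponding mask exists.
            if os.path.exists(mask_path):
                images_list.append(os.path.join(label_dir, image_name))
                masks_list.append(mask_path)
                labels_list.append(label)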
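Before moving on, it can help to check how many image/mask pairs were found per class, since the benign and malignant groups are not guaranteed to be balanced. The sketch below is a minimal check; `summarize_labels` is a hypothetical helper name, and the short `example_labels` list is a made-up stand-in for the `labels_list` built above.

```python
from collections import Counter


def summarize_labels(labels):
    """Return {label: (count, fraction)} for a list of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        label: (count, count / total)
        for label, count in sorted(counts.items())
    }


# Made-up stand-in for labels_list; pass the real list in practice.
example_labels = ["benign", "benign", "malignant", "benign"]

for label, (count, fraction) in summarize_labels(example_labels).items():
    print(f"{label}: {count} pairs ({fraction:.0%})")
```

A strongly skewed split is worth keeping in mind later, when reading the per-class metrics.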
Step 2: Feature Extraction with PyRadiomics
With the dataset prepared, the next step is feature extraction using PyRadiomics. Radiomics involves extracting quantitative features from medical images, such as shape, texture, and intensity patterns. PyRadiomics simplifies this process by providing an out-of-the-box tool to compute hundreds of features from defined ROIs.
We use the ultrasound images and their corresponding masks to extract features. Each mask highlights the lesion area in the image, and PyRadiomics calculates relevant metrics for this region. These metrics include shape descriptors (e.g., lesion size), intensity statistics (e.g., mean and variance of pixel values), and texture features (e.g., gray-level co-occurrence matrix). By focusing on these features, we turn medical images into structured data that machine learning models can analyze.
    import SimpleITK as sitk
    from radiomics import featureextractor

    # Extractor with default settings.
    extractor = featureextractor.RadiomicsFeatureExtractor()


    def read_single_channel(path, pixel_type):
        """Read an image and return its first channel cast to pixel_type."""
        image = sitk.ReadImage(path)
        if image.GetNumberOfComponentsPerPixel() > 1:
            # Multi-channel PNG (e.g. RGB): keep the first channel only.
            return sitk.VectorIndexSelectionCast(image, 0, pixel_type)
        return sitk.Cast(image, pixel_type)


    features = []
    for img_path, msk_path, label in zip(images_list, masks_list, labels_list):
        if label == "normal":
            continue

        sitk_image = read_single_channel(img_path, sitk.sitkFloat32)
        sitk_mask = read_single_channel(msk_path, sitk.sitkUInt8)

        # The lesion is marked with pixel value 255 in the mask.
        result = extractor.execute(sitk_image, sitk_mask, label=255)

        extracted_features = dict(result)
        extracted_features["label"] = label
        features.append(extracted_features)
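To make the feature classes more concrete, here is a toy NumPy sketch of what two of them measure inside a masked region: first-order statistics (mean and variance) and a gray-level co-occurrence matrix (GLCM) for horizontally adjacent pixels. It is an illustration only, not PyRadiomics' implementation, which also discretizes intensities into bins and aggregates over several directions. The function names and the tiny 4×4 image are made up for the example.

```python
import numpy as np


def first_order_stats(image, mask):
    """Mean and variance of the pixel values inside the mask."""
    roi = image[mask > 0].astype(float)
    return {"mean": roi.mean(), "variance": roi.var()}


def glcm_horizontal(image, mask, levels):
    """Symmetric, normalized GLCM for horizontally adjacent ROI pixels."""
    glcm = np.zeros((levels, levels), dtype=float)
    left, right = image[:, :-1], image[:, 1:]
    # Count a pair only when both neighbours lie inside the mask.
    both_inside = (mask[:, :-1] > 0) & (mask[:, 1:] > 0)
    for i, j in zip(left[both_inside], right[both_inside]):
        glcm[i, j] += 1
        glcm[j, i] += 1  # symmetric: count each pair in both directions
    return glcm / glcm.sum()


# Toy image with gray levels 0..2 and a mask covering the two central columns.
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 2, 2, 2],
                  [2, 2, 2, 2]])
mask = np.zeros_like(image)
mask[:, 1:3] = 255  # lesion pixels marked with 255, as in the extraction call

stats = first_order_stats(image, mask)
glcm = glcm_horizontal(image, mask, levels=3)

# GLCM contrast: large when neighbouring pixels differ strongly in gray level.
i_idx, j_idx = np.indices(glcm.shape)
contrast = ((i_idx - j_idx) ** 2 * glcm).sum()
print(stats, contrast)
```

Each such number becomes one column in the feature table, which is how an image ends up as a row of structured data.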
Once features are extracted, we store them in a structured format, such as a CSV file, for later use. This step is important because feature extraction can be computationally expensive, especially when dealing with large datasets. Saving the features allows us to reuse them without having to process the raw images again. Additionally, the CSV format makes it easy to visualize the data and prepare it for machine learning tasks, as each row represents an image, and each column represents a specific feature.
    import pandas as pd

    # One row per image, one column per feature (plus the label).
    df = pd.DataFrame(features)
    df.to_csv("radiomics_features.csv", index=False)
Step 3: Model Building with Scikit-Learn
The extracted features are now ready for classification. Using Scikit-Learn, we split the dataset into training (75%) and testing (25%) sets, so that the model is evaluated on data it has not seen during training. For the classifier, we use a Random Forest. Random Forests are well suited to this task because they handle high-dimensional data effectively, are relatively robust to overfitting, and provide feature importance scores, which can help interpret the results. The model is trained on the training set to learn patterns that differentiate benign from malignant lesions.
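    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier

    # Use the radiomic features only: drop PyRadiomics' "diagnostics_"
    # metadata columns and the label column.
    feature_columns = [
        col for col in df.columns
        if not col.startswith("diagnostics_") and col != "label"
    ]
    X = df[feature_columns].astype(float)
    y = df["label"]

    # Hold out 25% of the images for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42
    )

    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X_train, y_train)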
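The feature importance scores mentioned above are available on the fitted model as `feature_importances_`. The sketch below shows one way to rank them. It runs on synthetic data from `make_classification` so that it is self-contained; the names `X_demo`, `y_demo`, `demo_names`, and `forest` are placeholders, not part of the tutorial pipeline.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the radiomics table: 200 samples, 6 features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=42
)
demo_names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # placeholder names

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_demo, y_demo)

# Impurity-based importances are non-negative and sum to 1 across features.
importances = pd.Series(forest.feature_importances_, index=demo_names)
ranking = importances.sort_values(ascending=False)
print(ranking.head())
```

For the model trained above, the same idea would be `pd.Series(clf.feature_importances_, index=X_train.columns)`. Impurity-based importances can be spread across strongly correlated features, which is common among radiomic features, so the ranking is best read as a rough guide.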
Step 4: Model Evaluation with Scikit-Learn
Once trained, we test the model on the testing set and evaluate its performance using precision, recall, and F1-score, together with the confusion matrix. These metrics show how well the model classifies the lesions and whether it achieves a good balance between sensitivity (detecting malignancies) and specificity (avoiding false positives).
    from sklearn.metrics import classification_report, confusion_matrix

    y_pred = clf.predict(X_test)

    print("Classification Report:")
    print(classification_report(y_test, y_pred))

    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))
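The classification report does not print sensitivity and specificity under those names. In a two-class problem with malignant as the positive class, sensitivity is the recall of the malignant class and specificity is the recall of the benign class. The sketch below computes both directly from the labels; `sensitivity_specificity` is a hypothetical helper, and the two short label lists are made up for illustration rather than taken from the model.

```python
def sensitivity_specificity(y_true, y_pred, positive="malignant"):
    """Return (sensitivity, specificity), treating `positive` as the disease class."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1  # malignant lesion correctly flagged
            else:
                fn += 1  # malignant lesion missed
        else:
            if pred == positive:
                fp += 1  # benign lesion wrongly flagged
            else:
                tn += 1  # benign lesion correctly cleared
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Made-up labels for illustration only (not model output).
true_labels = ["malignant", "malignant", "malignant", "benign",
               "benign", "benign", "benign", "malignant"]
pred_labels = ["malignant", "benign", "malignant", "benign",
               "benign", "malignant", "benign", "malignant"]

sens, spec = sensitivity_specificity(true_labels, pred_labels)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

With the real model, the call would be `sensitivity_specificity(y_test, y_pred)`.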
Conclusion
In this tutorial, we explored how to classify breast ultrasound lesions using a radiomics and machine learning pipeline. By combining PyRadiomics for feature extraction and Scikit-Learn for model building, we demonstrated how to transform medical imaging data into structured, actionable insights. This approach highlights the power of radiomics in quantifying lesion characteristics and its potential to enhance diagnostic accuracy when integrated with machine learning.
While the Random Forest classifier is a solid baseline, there is room for further exploration, such as using deep learning for automatic feature extraction or augmenting the dataset to improve performance. Ultimately, this workflow showcases how computational tools can support radiologists in making more informed, data-driven decisions, contributing to better patient care.
References
- [1] Al-Dhabyani W, Gomaa M, Khaled H, Fahmy A. Dataset of breast ultrasound images. Data in Brief. 2020 Feb;28:104863. DOI: 10.1016/j.dib.2019.104863.