Medical Dataset Clustering Analysis Using Python

 



By Bharani Dharan N
 CB.BU.P2ASB23046



1. Importing Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Preprocessing utilities
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin

# Clustering, dimensionality reduction, and evaluation
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn_extra.cluster import KMedoids  # requires the scikit-learn-extra package

Explanation :


This code snippet outlines a data preparation and clustering analysis pipeline. pandas handles data loading and cleaning, scikit-learn's SimpleImputer fills missing values, and MinMaxScaler rescales features so that the distance calculations used in clustering are not dominated by any single column. FunctionTransformer leaves room for custom feature engineering, while Pipeline and ColumnTransformer streamline the processing workflow and can apply different preprocessing to different column types.

On the modeling side, KMeans is the primary candidate for grouping data points, with AgglomerativeClustering and KMedoids available as alternatives. Principal Component Analysis (PCA) can reduce dimensionality for high-dimensional data, the silhouette score evaluates clustering quality, and matplotlib.pyplot provides the visualizations used to inspect the resulting clusters.
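The PCA mention deserves a concrete illustration. As a minimal, hypothetical sketch (the analysis below keeps only two features, so PCA is not actually applied here): given a wider scaled feature matrix, PCA could compress it to two components before clustering.

# Hypothetical sketch: PCA before clustering, assuming a scaled feature
# matrix scaled_features with more than two columns (not defined in this post)
pca = PCA(n_components=2)
reduced = pca.fit_transform(scaled_features)
print("Explained variance ratio:", pca.explained_variance_ratio_)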

2. Preprocessing :


# Load the dataset
df = pd.read_csv("D:/RStudio/Medicaldataset.csv")

# Check the dataframe
print(df.head())

# Check the data types
print(df.info())

# Summary statistics
print(df.describe())

# Check for missing values
print(df.isna().sum())

# Remove rows with missing values
df = df.dropna()

# Check again for missing values
print(df.isna().sum())

# Select columns for analysis
df2 = df[['Age', 'Heartrate']]

# Preprocessing pipeline
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', MinMaxScaler())
])

# Fit and transform the data
data = pipeline.fit_transform(df2)



Explanation :



The initial steps of this code focus on preparing the data for clustering analysis using techniques from pandas and scikit-learn libraries. First, pandas' read_csv function loads the data from a CSV file into a DataFrame named df. We then explore this DataFrame using methods like head(), info(), and describe() to understand data types, get summary statistics, and perform initial data quality checks.

Next, the code tackles missing values. It identifies them using isna().sum(), but then employs a potentially aggressive approach by removing entire rows containing missing data with df.dropna(). Depending on the amount and distribution of missing data, this might not be ideal, and alternative imputation strategies could be considered.
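As a minimal sketch of such an alternative (reusing the SimpleImputer imported in section 1; the median strategy here is a suggestion, not part of the original analysis), the missing values could be filled in rather than dropped:

# Sketch: impute missing values instead of dropping whole rows
imputer = SimpleImputer(strategy='median')  # median is less sensitive to outliers than mean
df[['Age', 'Heartrate']] = imputer.fit_transform(df[['Age', 'Heartrate']])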

We then move on to data selection and preprocessing. The features relevant to the clustering analysis ('Age' and 'Heartrate') are chosen and stored in a new DataFrame df2. To streamline the preprocessing steps, a scikit-learn Pipeline is created, combining two important transformations:

  1. SimpleImputer(strategy='mean') addresses missing values by replacing them with the mean of each column, which is appropriate for numerical features like 'Age' and 'Heartrate'.
  2. MinMaxScaler() scales the features in df2 to a range between 0 and 1, which is crucial for clustering algorithms that are sensitive to the scale of different features.

Finally, the fit_transform method of the pipeline is applied to the chosen data (df2). This step fits the imputer and scaler on the data (learning parameters such as the column means and min/max ranges) and then transforms the data using those learned parameters. Note that because rows with missing values were already dropped, the imputer has nothing to fill here; it simply keeps the pipeline robust if it is later reused on data that does contain gaps. The resulting transformed data, with scaled features, is stored in the variable data.
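One design benefit worth noting: the fitted pipeline can be reused on new records via pipeline.transform, which applies the already-learned means and min/max ranges rather than refitting. A minimal sketch, with a hypothetical new-patient frame:

# Sketch: reuse the fitted pipeline on new data (hypothetical values)
new_patients = pd.DataFrame({'Age': [54, 61], 'Heartrate': [72, np.nan]})
print(pipeline.transform(new_patients))  # fills the NaN with the learned mean, then scales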

By following these initial data preparation steps, we ensure the data is in a suitable format for further clustering analysis. Techniques like KMeans clustering, dimensionality reduction (PCA), and evaluation (silhouette score) could be used next to explore the inherent groupings within the medical dataset.


3. Clustering :



# Choose the number of clusters using the silhouette score
silhouette_scores = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(data)
    score = silhouette_score(data, kmeans.labels_)
    silhouette_scores.append(score)

# k starts at 2, so shift the index of the best score by 2
best_k = silhouette_scores.index(max(silhouette_scores)) + 2
print("Best number of clusters according to silhouette score:", best_k)

# Fit KMeans clustering with the chosen k
kmeans = KMeans(n_clusters=best_k, n_init=10, random_state=42)
kmeans.fit(data)
labels = kmeans.labels_
centers = kmeans.cluster_centers_


Explanation :

This code section focuses on finding the optimal number of clusters (k) for your KMeans analysis using the Silhouette Score metric. It employs an iterative approach:

  1. Exploring Potential Clusters: The code examines a range of k values, typically starting from a low number (like 2) and incrementing up to a predefined limit (here, 10). This allows the code to consider various clustering scenarios.

  2. Silhouette Score Evaluation: For each k value, a KMeans model is created and fitted to the data. The Silhouette Score is then calculated to assess the quality of the resulting clustering. This score reflects how well data points are assigned to their clusters, with higher scores indicating better separation and well-defined clusters.

  3. Identifying the Best K: The Silhouette Scores for each k value are stored in a list. The code then analyzes this list to find the k value that corresponds to the highest Silhouette Score. Since Python indexing starts from 0, 2 is added to this index to reflect the actual number of clusters (as k starts from 2). This process automates the selection of the optimal k value that leads to the best separation between clusters based on the Silhouette Score.
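As a quick sanity check on step 3, the stored scores can be plotted against k, so the automated pick can be confirmed visually as the peak of the curve. A minimal sketch using the silhouette_scores list from the code above:

# Sketch: visualize silhouette scores across candidate k values
plt.plot(range(2, 11), silhouette_scores, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette score')
plt.title('Silhouette Score by k')
plt.show()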

Following the identification of the optimal k, a KMeans model is created and fitted to the data using this chosen k value. The resulting model provides two key pieces of information:

  • Cluster Labels (labels_): This array stores the cluster labels assigned to each data point, revealing which cluster a particular data point belongs to in the final clustering solution.

  • Cluster Centers (cluster_centers_): These represent the centroids, the central points of each identified cluster. Analyzing these centroids can provide insights into the characteristics of each cluster, helping you understand the underlying structure and groupings within your medical dataset.
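Because the centroids live in the 0-1 space produced by MinMaxScaler, they are easier to interpret after being mapped back to original units. A minimal sketch, reusing the pipeline fitted in section 2:

# Sketch: express the centroids in original Age / Heartrate units
centers_original = pipeline.named_steps['scaler'].inverse_transform(centers)
print(centers_original)  # approximate Age and Heartrate at each cluster center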

By employing the Silhouette Score metric, this code automates the process of finding the best number of clusters for your KMeans analysis. This data-driven approach ensures your clustering effectively groups your data into meaningful categories, optimizing the cluster analysis for your medical dataset.
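The imports in section 1 also bring in AgglomerativeClustering and KMedoids. As a minimal sketch (an optional extension, not part of the original analysis), either can be run on the same preprocessed data and compared using the same silhouette metric:

# Sketch: alternative clustering algorithms on the same data
agg_labels = AgglomerativeClustering(n_clusters=best_k).fit_predict(data)
print("Agglomerative silhouette:", silhouette_score(data, agg_labels))

kmed_labels = KMedoids(n_clusters=best_k, random_state=42).fit_predict(data)
print("KMedoids silhouette:", silhouette_score(data, kmed_labels))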



4. Data Visualization :

# Plot data before clustering
plt.figure(figsize=(10, 5))
plt.scatter(data[:, 0], data[:, 1], c='blue', label='Data points')
plt.title('Before Clustering')
plt.xlabel('Age (scaled)')
plt.ylabel('Heartrate (scaled)')
plt.legend()
plt.show()

# Plot data after clustering
plt.figure(figsize=(10, 5))
plt.scatter(data[:, 0], data[:, 1], c=labels, cmap='viridis', label='Clustered data')
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='X', s=200, label='Centroids')
plt.title('After Clustering with Centroids')
plt.xlabel('Age (scaled)')
plt.ylabel('Heartrate (scaled)')
plt.legend()
plt.show()


Explanation :


This code snippet generates two visualizations to depict the data distribution before and after applying KMeans clustering. These visualizations help us understand how the data is grouped based on the chosen features ('Age' and 'Heartrate').


Visualization Before Clustering:

  • This visualization utilizes Matplotlib to create a scatter plot.
  • It displays the pre-clustering distribution of the scaled data points for the two features: 'Age' on the x-axis and 'Heartrate' on the y-axis.
  • All data points are colored blue for better visual distinction.
  • The title "Before Clustering" emphasizes the ungrouped state of the data.

Visualization After Clustering:

  • A second scatter plot is generated to showcase the data after clustering.
  • Data points are assigned colors based on the cluster they belong to, creating a visual representation of the formed groups. A colormap named 'viridis' is used for this purpose.
  • Additionally, red 'X' markers represent the centroids, which are the central points of each cluster. These provide insights into the cluster characteristics.
  • The title "After Clustering with Centroids" highlights the results of the clustering process.

By comparing these visualizations, we can gain valuable insights. We can see how the clustering algorithm has organized the data points into distinct clusters based on similarities in their 'Age' and 'Heartrate' values. The positions of the centroids further aid our understanding of the characteristics of each cluster, potentially revealing underlying patterns or relationships within the medical dataset.
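To back the visual impression with numbers, a minimal sketch that summarizes each cluster in original units (assuming the df2 and labels variables from the earlier steps):

# Sketch: per-cluster sizes and feature means in original units
summary = df2.copy()
summary['cluster'] = labels
print(summary.groupby('cluster').agg(['mean', 'count']))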








Conclusion : 

In this post, we performed a clustering analysis on a medical dataset using Python's pandas and scikit-learn libraries. We first imported the required libraries, then handled missing values, selected the relevant columns, and scaled the data. KMeans clustering was applied, with the ideal number of clusters chosen using the Silhouette Score metric. Finally, we visualized the data before and after clustering to understand how the algorithm grouped the data points.
Overall, clustering this medical dataset allowed us to distinguish groups based on age and heart rate. Healthcare workers can use this information to better understand patient characteristics and potentially tailor interventions or treatments.


