Explore the details of my software engineering internship, which focused on classifying hematological diseases such as sickle cell disease and spherocytosis using machine learning.
During my Software Engineering Internship at KovaDx, a Yale InnovateHealth accelerated startup, I designed and developed machine learning models for the classification of sickle cell disease and spherocytosis. The objective was to build a set of models that could predict, from a patient's blood sample, whether that patient was at risk of a vaso-occlusive crisis. The project involved analyzing 900,000+ cellular phase images to determine disease stages, using classification techniques together with dimensionality reduction methods such as t-distributed Stochastic Neighbor Embedding (t-SNE).
The main challenge in this project was the scale of the data (hundreds of thousands of images), which required optimizing the training process. To overcome this, I leveraged distributed computing techniques and model parallelism. I also addressed class imbalance with techniques such as the Synthetic Minority Over-sampling Technique (SMOTE), which balances the data by generating synthetic samples for under-represented classes and can improve model accuracy on those classes.
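To give a feel for how SMOTE-style oversampling works, here is a minimal NumPy sketch of its core idea: each synthetic point is placed at a random position on the line segment between a minority-class sample and one of its nearest minority-class neighbors. This is an illustration only, not the code used in the project; the function name `smote_like_oversample`, the toy two-dimensional data, and the choice of `k=5` neighbors are all assumptions made for the example.

```python
import numpy as np


def smote_like_oversample(X_minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by interpolating between neighbors.

    Hypothetical helper for illustration; a simplified take on the SMOTE idea.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    n = X.shape[0]
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, n - 1)

    # Pairwise squared distances between minority samples.
    diffs = X[:, None, :] - X[None, :, :]
    d2 = np.einsum("ijk,ijk->ij", diffs, diffs)
    np.fill_diagonal(d2, np.inf)  # a point is never its own neighbor

    # Indices of the k nearest minority neighbors of each sample.
    neighbors = np.argsort(d2, axis=1)[:, :k]

    # Pick a random base sample and one of its neighbors for each new point.
    base = rng.integers(0, n, size=n_synthetic)
    chosen = neighbors[base, rng.integers(0, k, size=n_synthetic)]

    # Interpolate at a random fraction of the way from base to neighbor.
    gap = rng.random((n_synthetic, 1))
    return X[base] + gap * (X[chosen] - X[base])


# Toy imbalanced data (assumed for illustration): 200 majority vs 20 minority.
data_rng = np.random.default_rng(42)
X_major = data_rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X_minor = data_rng.normal(loc=3.0, scale=0.5, size=(20, 2))

# Create enough synthetic minority samples to match the majority class size.
X_new = smote_like_oversample(X_minor, n_synthetic=len(X_major) - len(X_minor))
X_minor_balanced = np.vstack([X_minor, X_new])
```

Because every synthetic point is a blend of two real minority samples, the new points stay inside the region the minority class already occupies. In practice, oversampling of this kind is normally applied to the training split only, so that synthetic samples do not leak into evaluation data.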
The machine learning models I developed classified the disease stages with high clustering accuracy, which could significantly help a medical team detect disease early. The t-SNE embeddings gave clear visual representations of the data clusters, making the patterns and results easier to interpret.
Since the healthcare data used in this project is private, I have created a Google Colab script that demonstrates the power of t-SNE on a public dataset. This example uses the MNIST dataset of handwritten digits and applies t-SNE to visualize how it can effectively reduce dimensionality and group similar data points based on their features.
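For readers who want something to run locally, here is a minimal sketch of this kind of demonstration. It is not the Colab script itself: it assumes scikit-learn is installed and, to avoid a large download, uses scikit-learn's small bundled digits dataset (8x8 images) as a stand-in for full MNIST. The helper names `embed_digits` and `plot_embedding`, the subset size of 300 samples, and the perplexity of 30 are choices made for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE


def embed_digits(n_samples=300, perplexity=30.0, seed=0):
    """Project a subset of the bundled digits dataset to 2D with t-SNE."""
    digits = load_digits()
    # Pixel intensities in this dataset range from 0 to 16; scale to [0, 1].
    X = digits.data[:n_samples] / 16.0
    y = digits.target[:n_samples]

    tsne = TSNE(
        n_components=2,         # target dimensionality for plotting
        perplexity=perplexity,  # roughly, the effective neighborhood size
        init="pca",             # PCA initialization tends to be more stable
        random_state=seed,      # fixed seed for reproducible layouts
    )
    return tsne.fit_transform(X), y


def plot_embedding(embedding, labels):
    """Scatter-plot the 2D embedding, colored by digit label."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    fig, ax = plt.subplots(figsize=(6, 6))
    points = ax.scatter(
        embedding[:, 0], embedding[:, 1], c=labels, cmap="tab10", s=12
    )
    ax.legend(*points.legend_elements(), title="digit", fontsize="small")
    ax.set_title("t-SNE embedding of handwritten digits")
    plt.show()


embedding, labels = embed_digits()
# Call plot_embedding(embedding, labels) to view the clusters.
```

In the resulting plot, images of the same digit should tend to land near each other even though t-SNE never sees the labels. Distances between well-separated clusters are not reliable in a t-SNE plot, so the figure is best read for local grouping rather than global geometry.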