projects
Manufacturing Lot-Based Analysis to Detect Adverse Event Signals



Every drug and vaccine approved by the FDA is required to track and report all Adverse Events (AEs) that occur as a result of its use. Drugs and vaccines are manufactured in factory “lots”. Each lot is identified by a unique alphanumeric sequence that allows pharmaceutical companies to trace a dose back to its origin. Because most drugs and vaccines are made from biological substances, contamination or transportation issues directly influence the type and prominence of side effects across all doses produced by the lot in question. Accurate lot number reporting is therefore crucial to tracing and ameliorating AEs.

Unfortunately, a substantial percentage (>50%) of lot IDs, whether manually curated by providers or reported by patients, are incorrect. Lot numbers are often written illegibly or erroneously at the time of vaccination, patients have a hard time deciphering the handwriting, phone communication can distort them further, and the doctors and PAs on site are often busy or distracted, so their handwriting may be misread on the regulatory end.

During my contract with AstraZeneca, I devised a method to map a misspelled raw lot ID to what a human reviewer would deem the closest valid lot in existence. The approach uses NLP spelling correction over a customized vocabulary in a specialized domain. Once all raw lot numbers have been corrected, I apply an Observed-to-Expected (O/E) analysis with a Gamma-Poisson model to detect abnormal AE signals for every lot. If a specific manufacturing lot shows a significant signal for one or more AEs, the lot is flagged and reported to the Patient Safety team and subsequently to the regulatory agency for closer examination.
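
As a rough illustration of both stages, here is a minimal Python sketch, with difflib's string-similarity matching standing in for the custom spelling-correction model and a plain Poisson tail test standing in for the full Gamma-Poisson analysis; the lot vocabulary and report data are hypothetical.

```python
from collections import Counter
import difflib

from scipy.stats import poisson

VALID_LOTS = ["AB1234", "AB1243", "XY9876"]  # hypothetical vocabulary of real lot IDs

def correct_lot(raw_lot: str, cutoff: float = 0.75) -> str | None:
    """Map a misspelled raw lot ID to the closest valid lot, if any is close enough."""
    matches = difflib.get_close_matches(raw_lot.upper(), VALID_LOTS, n=1, cutoff=cutoff)
    return matches[0] if matches else None

def flag_lots(reports: list[tuple[str, str]], alpha: float = 0.001) -> set[tuple[str, str]]:
    """Observed-to-Expected screen: flag (lot, AE) pairs whose observed count
    is improbably high under a Poisson null with rate equal to the expected count."""
    lot_totals = Counter(lot for lot, _ in reports)
    ae_totals = Counter(ae for _, ae in reports)
    n = len(reports)
    observed = Counter(reports)
    flagged = set()
    for (lot, ae), obs in observed.items():
        # Expected count under independence of lot and AE
        expected = lot_totals[lot] * ae_totals[ae] / n
        # Tail probability of seeing >= obs events if this lot were typical
        if poisson.sf(obs - 1, expected) < alpha:
            flagged.add((lot, ae))
    return flagged
```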
Data-Driven Nutrition and Workout Recommendation System



InsideTracker (Segterra) provides personalized nutrition, supplement, and exercise recommendations by analyzing users’ DNA and monitoring their diet and activity metrics. The company has collected Fitbit data from tens of thousands of customers over the course of a decade, including metrics such as REM/deep sleep duration, exercise calories and METs, exercise type and time, and resting heart rate.

While diet and exercise recommendations are delivered to customers through the app, these recommendations are currently prescriptive: they are based on rules authored by nutritionists and sports scientists. The millions of data points collected are not utilized in the recommendations, nor in the user interface that customers interact with on a day-to-day basis.

I was approached by InsideTracker’s CTO to change this: to ground every analysis presented to the user in data-driven insights and to deliver evidence-based recommendations to customers in real time. To achieve this, I preprocessed InsideTracker’s Fitbit dataset, grouping records by user, date, and workout activity type. I then built correlation models and implemented an anomaly detection module to flag abnormal resting heart rates and sleep patterns. The correlations show users which changes in their habits (diet, exercise) can have the greatest impact on their health, while the anomaly detection alerts users to live events that are affecting their well-being.
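
A minimal sketch of the anomaly-detection idea follows, assuming a hypothetical pandas DataFrame with user_id, date, and resting_hr columns; it flags days where a user's resting heart rate deviates sharply from their own rolling baseline (the production module covered sleep patterns as well).

```python
import pandas as pd

def flag_rhr_anomalies(df: pd.DataFrame, window: int = 28, z_thresh: float = 3.0) -> pd.DataFrame:
    """Flag days where a user's resting heart rate deviates strongly from
    their own recent baseline (per-user rolling z-score)."""
    df = df.sort_values(["user_id", "date"]).copy()
    grouped = df.groupby("user_id")["resting_hr"]
    # Per-user rolling mean and spread; require at least a week of history
    baseline = grouped.transform(lambda s: s.rolling(window, min_periods=7).mean())
    spread = grouped.transform(lambda s: s.rolling(window, min_periods=7).std())
    df["rhr_z"] = (df["resting_hr"] - baseline) / spread
    df["anomaly"] = df["rhr_z"].abs() > z_thresh
    return df
```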

In addition, I trained several regression-model proofs-of-concept to demonstrate that InsideTracker can go even further, improving users’ blood biomarkers by making recommendations about their nutritional intake, workout routine, and sleep habits. These insights support pivoting the business from a rule-based expert system relying on humans to a data-driven, empirical, automated, real-time recommender system.
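
For flavor, a hedged proof-of-concept sketch along those lines: a scikit-learn regression of a blood biomarker on activity and sleep features. The column names and the choice of gradient boosting are illustrative assumptions, not the models actually shipped.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Hypothetical feature and target columns for illustration only
FEATURES = ["rem_minutes", "deep_minutes", "exercise_calories", "mets", "resting_hr"]
TARGET = "ldl_cholesterol"

def fit_biomarker_model(df: pd.DataFrame):
    """Fit a regressor predicting a blood biomarker from lifestyle features,
    reporting cross-validated R^2 as a sanity check on predictive value."""
    X, y = df[FEATURES], df[TARGET]
    model = GradientBoostingRegressor(random_state=0)
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    model.fit(X, y)
    return model, scores.mean()
```
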
LSTM-enabled Clinical Trial Matching for Precision Medicine



  • Project Link

I led a team of eight cross-functional specialists to develop and deploy Philips' IntelliSpace Genomics product at the nation's tier-one cancer centers. This included a $6 million USD project for which I was responsible: a clinical trial matching tool for cancer patients.

The problem of clinical trial matching is to extract the relevant eligibility criteria from more than 40,000 clinical trial protocols and then match them against a given cancer patient's profile. The relevant matching criteria include cancer type, staging, genetic mutations, the patient's demographics, and comorbidities. This is an extremely complex NLP problem applied at big-data scale.

With my team, I evolved a naive Elasticsearch approach into a pipeline combining Named Entity Recognition (NER) with logical satisfiability. We successfully trained a Long Short-Term Memory neural network (LSTM) with a Conditional Random Field (CRF) output layer, using word embeddings informed by clinical-domain corpora. As a result of our work, my team achieved more than 95% accuracy in automated clinical trial matching, as validated by pathologists and oncologists.
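
For reference, a minimal BiLSTM-CRF tagger in PyTorch might look like the sketch below. It assumes the third-party pytorch-crf package; the production model, embeddings, and label set were substantially richer than this.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class BiLSTMCRF(nn.Module):
    """Sketch of the architecture named above: embeddings -> BiLSTM -> CRF."""

    def __init__(self, vocab_size: int, num_tags: int, emb_dim: int = 100, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden // 2, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(hidden, num_tags)  # per-token emission scores
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, tokens: torch.Tensor, tags: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        emissions = self.proj(self.lstm(self.embed(tokens))[0])
        return -self.crf(emissions, tags, mask=mask)  # negative log-likelihood

    def decode(self, tokens: torch.Tensor, mask: torch.Tensor) -> list[list[int]]:
        emissions = self.proj(self.lstm(self.embed(tokens))[0])
        return self.crf.decode(emissions, mask=mask)  # Viterbi-best tag sequences
```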

The project yielded impressive results that saved tremendous time for clinicians. It is now a commercial success, replacing IBM Watson and deployed as part of the Philips IntelliSpace Genomics solution at the nation's top cancer institutions, such as Dana-Farber, MD Anderson, Westchester Medical Center, and Boston Children's Hospital.
Insurance Billing Code Extraction through Hybrid NLP Approaches



ICD-10 code extraction is an essential component of insurance risk adjustment in the United States private insurance industry. We built a system that processes more than 10,000 Electronic Health Records (EHRs) daily while classifying among over 70,000 distinct ICD-10 codes.

I led a team of seven to build a three-layered hybrid NLP ICD-10 extraction microservice infrastructure. First, all PDF scans of EHRs are OCR'ed into text using Tesseract; the content of each EHR is stored as JSON in MongoDB. The texts then pass through a rule-based, hashmap-like layer in which every clinical term, acronym, and synonym is mapped to its corresponding ICD-10 code. This approach alone has a high false-positive rate, albeit with extremely high recall (>98%). A Transformer layer using ClinicalBERT and BioBERT is then applied on top of the rule-based output to perform binary classification, eliminating false positives and yielding highly accurate, context-sensitive, and clinically relevant ICD-10 codes.
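
A condensed sketch of the first two layers is shown below: a dictionary pass for recall followed by a transformer-based verifier for precision. The term map, the checkpoint name (our-org/icd10-candidate-verifier), and its output labels are hypothetical stand-ins, not the production assets.

```python
from transformers import pipeline

# Tiny hypothetical slice of the term -> ICD-10 hashmap (the real one covers
# tens of thousands of terms, acronyms, and synonyms)
TERM_TO_ICD10 = {"myocardial infarction": "I21.9", "type 2 diabetes": "E11.9"}

# Hypothetical binary classifier fine-tuned from ClinicalBERT/BioBERT to judge
# whether a candidate code is actually asserted in the surrounding context
verifier = pipeline("text-classification", model="our-org/icd10-candidate-verifier")

def extract_codes(ehr_text: str) -> set[str]:
    """High-recall dictionary lookup, then transformer filtering of false positives."""
    text = ehr_text.lower()
    codes = set()
    for term, code in TERM_TO_ICD10.items():  # rule-based candidate generation
        if term in text:
            result = verifier(f"{term} [SEP] {text[:512]}")[0]  # precision layer
            if result["label"] == "ASSERTED":  # label scheme of the hypothetical model
                codes.add(code)
    return codes
```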

This project, along with the startup company behind it, has since been successfully acquired.
A Method and Apparatus for Genome Spelling Correction and Acronym Standardization



At Philips, I developed a genomic biomarker spelling-correction module. It preprocesses clinical trial protocol texts as a prerequisite before named entity recognition can be performed. The module is an integral part of the workflow within the IntelliSpace Precision Medicine (ISPM) platform, commercially available at various cancer hospitals around the USA.

Given a sentence under preprocessing, the method performs the following steps:
  1. Storing a first adjacent word to an unknown word and a second adjacent word to the unknown word
  2. Generating a plurality of candidate words for the unknown word
  3. Forming a plurality of trigrams, joining the first adjacent word and the second adjacent word to each of the plurality of candidate words
  4. Constructing a trigram table from the entirety of the clinical trial database, where each trigram carries the frequency count with which it occurs in the database
  5. Searching the trigram table for each of the plurality of trigrams constructed in step 3
  6. Outputting the candidate word from the trigram with the highest count in the trigram table


The correction module resolves words in context: for example, on encountering “EGGR gene” in a text pertaining to non-small cell lung cancer, it resolves it to “EGFR gene”.
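
A minimal sketch of the trigram lookup in steps 1 through 6 might look as follows, assuming a precomputed trigram frequency table built from the clinical trial corpus and a candidate generator (e.g., edit-distance neighbors drawn from the biomarker vocabulary).

```python
from collections import Counter

def correct_unknown(prev_word: str, unknown: str, next_word: str,
                    candidates: list[str], trigram_counts: Counter) -> str:
    """Pick the candidate whose trigram (prev, candidate, next) occurs most
    frequently in the corpus; fall back to the unknown word if none is seen."""
    best, best_count = unknown, 0
    for cand in candidates:
        count = trigram_counts[(prev_word, cand, next_word)]
        if count > best_count:
            best, best_count = cand, count
    return best

# e.g. correct_unknown("the", "EGGR", "gene", ["EGFR", "EGR1"], trigram_counts)
# returns "EGFR" when ("the", "EGFR", "gene") dominates the corpus counts.
```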