Classifying Medical Abstract Sentences
The purpose of this notebook is to replicate the NLP model from Neural networks for joint sentence classification in medical paper abstracts, making medical abstracts easier to read by labelling each sentence with one of 5 categories:
Background
Objective
Methods
Results
Conclusion
Model Input
For example, can we train an NLP model which takes the following input (note: the following sample has had all numerical symbols replaced with "@"):
To investigate the efficacy of @ weeks of daily low-dose oral prednisolone in improving pain , mobility , and systemic low-grade inflammation in the short term and whether the effect would be sustained at @ weeks in older adults with moderate to severe knee osteoarthritis ( OA ). A total of @ patients with primary knee OA were randomized @:@ ; @ received @ mg/day of prednisolone and @ received placebo for @ weeks. Outcome measures included pain reduction and improvement in function scores and systemic inflammation markers. Pain was assessed using the visual analog pain scale ( @-@ mm ). Secondary outcome measures included the Western Ontario and McMaster Universities Osteoarthritis Index scores , patient global assessment ( PGA ) of the severity of knee OA , and @-min walk distance ( @MWD )., Serum levels of interleukin @ ( IL-@ ) , IL-@ , tumor necrosis factor ( TNF ) - , and high-sensitivity C-reactive protein ( hsCRP ) were measured. There was a clinically relevant reduction in the intervention group compared to the placebo group for knee pain , physical function , PGA , and @MWD at @ weeks. The mean difference between treatment arms ( @ % CI ) was @ ( @-@ @ ) , p < @ ; @ ( @-@ @ ) , p < @ ; @ ( @-@ @ ) , p < @ ; and @ ( @-@ @ ) , p < @ , respectively. Further , there was a clinically relevant reduction in the serum levels of IL-@ , IL-@ , TNF - , and hsCRP at @ weeks in the intervention group when compared to the placebo group. These differences remained significant at @ weeks. The Outcome Measures in Rheumatology Clinical Trials-Osteoarthritis Research Society International responder rate was @ % in the intervention group and @ % in the placebo group ( p < @ ). Low-dose oral prednisolone had both a short-term and a longer sustained effect resulting in less knee pain , better physical function , and attenuation of systemic inflammation in older patients with knee OA ( ClinicalTrials.gov identifier NCT@ ).
Model Output
And returns the following output:
['###24293578\n',
'OBJECTIVE\tTo investigate the efficacy of @ weeks of daily low-dose oral prednisolone in improving pain , mobility , and systemic low-grade inflammation in the short term and whether the effect would be sustained at @ weeks in older adults with moderate to severe knee osteoarthritis ( OA ) .\n',
'METHODS\tA total of @ patients with primary knee OA were randomized @:@ ; @ received @ mg/day of prednisolone and @ received placebo for @ weeks .\n',
'METHODS\tOutcome measures included pain reduction and improvement in function scores and systemic inflammation markers .\n',
'METHODS\tPain was assessed using the visual analog pain scale ( @-@ mm ) .\n',
'METHODS\tSecondary outcome measures included the Western Ontario and McMaster Universities Osteoarthritis Index scores , patient global assessment ( PGA ) of the severity of knee OA , and @-min walk distance ( @MWD ) .\n',
'METHODS\tSerum levels of interleukin @ ( IL-@ ) , IL-@ , tumor necrosis factor ( TNF ) - , and high-sensitivity C-reactive protein ( hsCRP ) were measured .\n',
'RESULTS\tThere was a clinically relevant reduction in the intervention group compared to the placebo group for knee pain , physical function , PGA , and @MWD at @ weeks .\n',
'RESULTS\tThe mean difference between treatment arms ( @ % CI ) was @ ( @-@ @ ) , p < @ ; @ ( @-@ @ ) , p < @ ; @ ( @-@ @ ) , p < @ ; and @ ( @-@ @ ) , p < @ , respectively .\n',
'RESULTS\tFurther , there was a clinically relevant reduction in the serum levels of IL-@ , IL-@ , TNF - , and hsCRP at @ weeks in the intervention group when compared to the placebo group .\n',
'RESULTS\tThese differences remained significant at @ weeks .\n',
'RESULTS\tThe Outcome Measures in Rheumatology Clinical Trials-Osteoarthritis Research Society International responder rate was @ % in the intervention group and @ % in the placebo group ( p < @ ) .\n',
'CONCLUSIONS\tLow-dose oral prednisolone had both a short-term and a longer sustained effect resulting in less knee pain , better physical function , and attenuation of systemic inflammation in older patients with knee OA ( ClinicalTrials.gov identifier NCT@ ) .\n', '\n']
Get the data
The data is from PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts
It's already been split into train, test, valid sets. Nice.
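A minimal way to grab it (a sketch assuming a notebook environment with git available; the directory name comes from the dataset's GitHub repo):

# Clone the PubMed RCT dataset repo
!git clone https://github.com/Franck-Dernoncourt/pubmed-rct

# The 20k subset with numbers replaced by "@" matches the sample above
import os
data_dir = "pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/"
os.listdir(data_dir)  # expect train.txt, dev.txt (our valid set), test.txt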
Preprocess and Visualize the Data
So each line pairs a label and a sentence separated by a tab, each abstract starts with a line like ###24293578 (its ID), and abstracts are separated by blank lines
Hmm... let's turn this into a list of dictionaries?
Preprocessing Function
Steps:
Remove all the '\n'
Split the lines by abstract ID number? Or have a while loop that resets the total_lines counter each time we see a ###? Or we could remove the abstract IDs and reset the counter whenever we hit a plain '\n'
Split each text line on the tab and put the first part into target and the second part into text (sketched below)
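Here's a sketch of that function; the dictionary keys mirror what we'll want as DataFrame columns (the names are my choice):

def preprocess_text_with_line_numbers(filename):
    """Reads filename, returns a list of dicts holding each sentence's
    label, text, position in its abstract, and abstract length."""
    with open(filename, "r") as f:
        input_lines = f.readlines()
    abstract_samples = []
    abstract_lines = ""
    for line in input_lines:
        if line.startswith("###"):    # new abstract ID: reset the collector
            abstract_lines = ""
        elif line.isspace():          # blank line: abstract finished, parse it
            for line_number, abstract_line in enumerate(abstract_lines.splitlines()):
                target, text = abstract_line.split("\t")  # label <tab> sentence
                abstract_samples.append({
                    "target": target,
                    "text": text.lower(),
                    "line_number": line_number,
                    "total_lines": len(abstract_lines.splitlines()) - 1,
                })
        else:
            abstract_lines += line    # still inside the current abstract
    return abstract_samples

train_samples = preprocess_text_with_line_numbers(data_dir + "train.txt")
val_samples = preprocess_text_with_line_numbers(data_dir + "dev.txt")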
Visualize
     target       text                                                line_number  total_lines
0    OBJECTIVE    to investigate the efficacy of @ weeks of dail...   0            11
1    METHODS      a total of @ patients with primary knee oa wer...   1            11
2    METHODS      outcome measures included pain reduction and i...   2            11
3    METHODS      pain was assessed using the visual analog pain...   3            11
4    METHODS      secondary outcome measures included the wester...   4            11
5    METHODS      serum levels of interleukin @ ( il-@ ) , il-@ ...   5            11
6    RESULTS      there was a clinically relevant reduction in t...   6            11
7    RESULTS      the mean difference between treatment arms ( @...   7            11
8    RESULTS      further , there was a clinically relevant redu...   8            11
9    RESULTS      these differences remained significant at @ we...   9            11
10   RESULTS      the outcome measures in rheumatology clinical ...   10           11
11   CONCLUSIONS  low-dose oral prednisolone had both a short-te...   11           11
12   BACKGROUND   emotional eating is associated with overeating...   0            10
13   BACKGROUND   yet , empirical evidence for individual ( trai...   1            10
Is the dataset balanced?
Should be somewhat okay; OBJECTIVE will probably have the worst accuracy since it's one of the rarer classes
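A quick check (a sketch; it assumes the sample dicts from the preprocessing step above):

import pandas as pd

train_df = pd.DataFrame(train_samples)
val_df = pd.DataFrame(val_samples)
train_df["target"].value_counts()  # how many sentences per class?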

Get lists of sentences
Make numeric labels
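A sketch using scikit-learn's encoders (note: sparse_output was called sparse before scikit-learn 1.2):

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

train_sentences = train_df["text"].tolist()
val_sentences = val_df["text"].tolist()

# Integer labels for sklearn models
label_encoder = LabelEncoder()
train_labels_encoded = label_encoder.fit_transform(train_df["target"])
val_labels_encoded = label_encoder.transform(val_df["target"])

# One-hot labels for Keras models trained with categorical crossentropy
one_hot_encoder = OneHotEncoder(sparse_output=False)
train_labels_one_hot = one_hot_encoder.fit_transform(train_df["target"].to_numpy().reshape(-1, 1))
val_labels_one_hot = one_hot_encoder.transform(val_df["target"].to_numpy().reshape(-1, 1))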
Baseline Model - Naive Bayes with Tfidf
Explanation of Tfidf: https://monkeylearn.com/blog/what-is-tf-idf/
Explanation of Naive Bayes: https://heartbeat.comet.ml/understanding-naive-bayes-its-applications-in-text-classification-part-1-ec9caea4baae
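A sketch of one way to wire this up with scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Turn sentences into TF-IDF features, then fit Multinomial Naive Bayes on them
model_0 = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])
model_0.fit(train_sentences, train_labels_encoded)
model_0.score(val_sentences, val_labels_encoded)  # mean accuracy on the validation set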
72% accuracy is pretty good for a baseline!
Model 1: Conv1D with token embeddings
Tokenization
Embedding
Make sure embeddings work
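A first pass (a sketch; the vocab size and sequence length are placeholder guesses we'll improve later):

import tensorflow as tf
from tensorflow.keras import layers

# Map raw strings to padded integer token sequences
text_vectorizer = layers.TextVectorization(max_tokens=10000, output_sequence_length=55)
text_vectorizer.adapt(train_sentences)

# Map token ids to trainable 128-dimensional vectors
token_embed = layers.Embedding(input_dim=len(text_vectorizer.get_vocabulary()), output_dim=128)

# Sanity check: one sentence should come out as a (1, 55, 128) tensor
print(token_embed(text_vectorizer([train_sentences[0]])).shape)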
Set up model
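Roughly this (a sketch; the filter count and kernel size are arbitrary choices, not the paper's):

inputs = layers.Input(shape=(1,), dtype=tf.string)  # raw sentence strings in
x = text_vectorizer(inputs)                         # tokenize
x = token_embed(x)                                  # embed
x = layers.Conv1D(filters=64, kernel_size=5, padding="same", activation="relu")(x)
x = layers.GlobalAveragePooling1D()(x)              # collapse the sequence dimension
outputs = layers.Dense(5, activation="softmax")(x)  # 5 classes
model_1 = tf.keras.Model(inputs, outputs)
model_1.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])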
Hmm, maybe the input shape is incorrect? Let's put these into tf.data datasets
Oh wait, it's expecting batches; that's why it wants 3 dimensions
Create tensorflow prefetched datasets
tf.data: https://www.tensorflow.org/guide/data
data performance: https://www.tensorflow.org/guide/data_performance
Also, do not shuffle the data because the sequence order matters
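A sketch of the batched, prefetched datasets:

# Pair sentences with one-hot labels, batch, and prefetch (no .shuffle(): order matters here)
train_dataset = (tf.data.Dataset.from_tensor_slices((train_sentences, train_labels_one_hot))
                 .batch(32)
                 .prefetch(tf.data.AUTOTUNE))
valid_dataset = (tf.data.Dataset.from_tensor_slices((val_sentences, val_labels_one_hot))
                 .batch(32)
                 .prefetch(tf.data.AUTOTUNE))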
That's pretty good! I rushed through the tokenization and embeddings, so let's go back and improve those. Also notice that the model is overfitting.
Improved Tokenization
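Main improvement: derive the sequence length from the data instead of guessing (a sketch; the 95% coverage target and the max_tokens value are my assumptions):

import numpy as np

# How long does a sequence need to be to cover ~95% of sentences?
sent_lens = [len(sentence.split()) for sentence in train_sentences]
output_seq_len = int(np.percentile(sent_lens, 95))

text_vectorizer = layers.TextVectorization(max_tokens=68000,  # roughly the training vocab size (assumption)
                                           output_sequence_length=output_seq_len)
text_vectorizer.adapt(train_sentences)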

Improved Embedding Layer
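And a matching embedding layer; mask_zero=True tells mask-aware downstream layers to ignore the padding:

token_embed = layers.Embedding(input_dim=len(text_vectorizer.get_vocabulary()),
                               output_dim=128,
                               mask_zero=True,  # padding tokens carry no signal
                               name="token_embedding")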
This did better than our baseline, likely because the sentences are pretty long, so a deep learning model has more to learn from
Model 2: Pretrained Universal Sentence Feature Extractor
The paper uses GloVe embeddings, but those aren't on TensorFlow Hub, so let's try the Universal Sentence Encoder instead
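Loading it from TensorFlow Hub (USE v4 maps whole sentences straight to 512-dimensional vectors, so no tokenizer is needed):

import tensorflow_hub as hub

tf_hub_embedding_layer = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4",
                                        trainable=False,  # keep the pretrained weights frozen
                                        name="universal_sentence_encoder")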
Hmm I have two thoughts on why it didn't perform as well as the Conv1D:
We are working with scientific text
Model is pretty simple
Try a model with basically the same setup as model 1 but a different embedding
LOL it performed terribly. Maybe I should stick with having dense layers after the encoder
Denser model
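Something like this (a sketch; the width of 128 is a guess):

inputs = layers.Input(shape=[], dtype=tf.string)  # USE takes raw strings
x = tf_hub_embedding_layer(inputs)                # (batch, 512) sentence embeddings
x = layers.Dense(128, activation="relu")(x)       # dense layer on top of the frozen encoder
outputs = layers.Dense(5, activation="softmax")(x)
model_2 = tf.keras.Model(inputs, outputs)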
Model 3: Conv1D with character embeddings
The paper uses token+character embeddings. Let's try just character embeddings first
Create character tokenizer
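A sketch: space-join the characters so TextVectorization's default whitespace split works per character (the sequence length and embedding size are my assumptions):

import string

def split_chars(text):
    return " ".join(list(text))  # "hello" -> "h e l l o"

train_chars = [split_chars(sentence) for sentence in train_sentences]
val_chars = [split_chars(sentence) for sentence in val_sentences]

alphabet = string.ascii_lowercase + string.digits + string.punctuation
char_vectorizer = layers.TextVectorization(max_tokens=len(alphabet) + 2,  # +2 for padding and OOV tokens
                                           output_sequence_length=290)    # ~95th percentile of char lengths (assumption)
char_vectorizer.adapt(train_chars)

char_embed = layers.Embedding(input_dim=len(char_vectorizer.get_vocabulary()),
                              output_dim=25,   # small dims suit a tiny alphabet
                              mask_zero=True,
                              name="char_embedding")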

Build Model
Before we fit this model, do you think that character embeddings would outperform word embeddings?
The character embedding model performed pretty badly, but still better than random guessing (a random guess would get ~20% accuracy because there are 5 classes)
It's still mindblowing that the model was able to learn from just characters!
Just for fun: trying without char standardization
Interesting! Why do you think it did better?
I think it did better because certain punctuation, such as parentheses or colons, shows up more often in particular sections, like RESULTS
Model 4: Pretrained token embeddings + character embeddings
Multimodal models
Create token-level embedding (model 2)
Create character-level embedding (model 3)
Combine 1 & 2 with a concatenate layer (layers.Concatenate)
Add output layers (same as paper)
Here's the paper again: Neural networks for joint sentence classification in medical paper abstracts
Building model
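A sketch of the wiring (the dense width, LSTM units, and dropout rate are my guesses, not the paper's exact values):

# 1. Token branch: USE sentence embeddings + a dense layer
token_inputs = layers.Input(shape=[], dtype=tf.string, name="token_input")
token_output = layers.Dense(128, activation="relu")(tf_hub_embedding_layer(token_inputs))
token_model = tf.keras.Model(token_inputs, token_output)

# 2. Char branch: char embeddings fed through a bidirectional LSTM
char_inputs = layers.Input(shape=(1,), dtype=tf.string, name="char_input")
char_embeddings = char_embed(char_vectorizer(char_inputs))
char_output = layers.Bidirectional(layers.LSTM(25))(char_embeddings)
char_model = tf.keras.Model(char_inputs, char_output)

# 3. Concatenate both branches, then classify
concat = layers.Concatenate(name="token_char_hybrid")([token_model.output, char_model.output])
x = layers.Dropout(0.5)(concat)
outputs = layers.Dense(5, activation="softmax")(x)
model_4 = tf.keras.Model(inputs=[token_model.input, char_model.input], outputs=outputs)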

Preparing data
Both of these methods work to build the dataset
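One of them, via tf.data.Dataset.zip (the other is passing nested tuples straight to from_tensor_slices):

# Zip the two input streams together, then pair them with the labels
train_token_char_data = tf.data.Dataset.from_tensor_slices((train_sentences, train_chars))
train_token_char_labels = tf.data.Dataset.from_tensor_slices(train_labels_one_hot)
train_token_char_dataset = (tf.data.Dataset.zip((train_token_char_data, train_token_char_labels))
                            .batch(32)
                            .prefetch(tf.data.AUTOTUNE))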
Fitting a model on token and character-level sequences
Why do you think it didn't perform as well as model 1? Maybe we could try it without a pretrained embedding layer?
Model 5: Pretrained token embeddings + character embeddings + positional (feature) embeddings - the full model from the paper
The order of the sentences is important! For example, sentences near the beginning are generally background/objective sentences
Let's do some feature engineering to add this information into the model!
Prepare Dataset
Build positional model
Try different positional model using one-hot

Let's try one-hot encoding the line numbers (because line 2 isn't "double" line 1 in any meaningful sense; position is categorical, not a magnitude)
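tf.one_hot makes this easy (a sketch; depth=15 assumes abstracts rarely exceed 15 lines, and any line number beyond that becomes an all-zero vector):

# One-hot the positions so the model treats them as categories, not magnitudes
train_line_numbers_one_hot = tf.one_hot(train_df["line_number"].to_numpy(), depth=15)
val_line_numbers_one_hot = tf.one_hot(val_df["line_number"].to_numpy(), depth=15)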
Let's try keeping total_lines separate

Simple model with just total lines to see what happens
Tri-input model
Create token model
Create character-level model
Create line number model
Create total_lines model
Concatenate 1 & 2
Concatenate 3, 4, & 5
Create an output layer that accepts the output of 6 and outputs label probabilities (see the sketch below)
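Here's a sketch of those steps wired together (the one-hot depths, layer widths, and dropout rate are my assumptions):

# 1. Token branch
token_inputs = layers.Input(shape=[], dtype=tf.string, name="token_input")
token_output = layers.Dense(128, activation="relu")(tf_hub_embedding_layer(token_inputs))
token_model = tf.keras.Model(token_inputs, token_output)

# 2. Char branch
char_inputs = layers.Input(shape=(1,), dtype=tf.string, name="char_input")
char_output = layers.Bidirectional(layers.LSTM(25))(char_embed(char_vectorizer(char_inputs)))
char_model = tf.keras.Model(char_inputs, char_output)

# 3. Line-number branch (one-hot vectors in)
line_number_inputs = layers.Input(shape=(15,), dtype=tf.float32, name="line_number_input")
line_number_model = tf.keras.Model(line_number_inputs,
                                   layers.Dense(32, activation="relu")(line_number_inputs))

# 4. Total-lines branch (one-hot vectors in)
total_lines_inputs = layers.Input(shape=(20,), dtype=tf.float32, name="total_lines_input")
total_lines_model = tf.keras.Model(total_lines_inputs,
                                   layers.Dense(32, activation="relu")(total_lines_inputs))

# 5. Concatenate token + char embeddings
combined_embeddings = layers.Concatenate(name="token_char_hybrid")([token_model.output, char_model.output])
z = layers.Dropout(0.5)(layers.Dense(256, activation="relu")(combined_embeddings))

# 6. Concatenate the positional branches with the combined embeddings
tribrid = layers.Concatenate(name="tribrid")([line_number_model.output, total_lines_model.output, z])

# 7. Output label probabilities
outputs = layers.Dense(5, activation="softmax", name="output")(tribrid)
model_5 = tf.keras.Model(inputs=[line_number_model.input, total_lines_model.input,
                                 token_model.input, char_model.input],
                         outputs=outputs)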

What is label smoothing? It stops the model from assigning 100% confidence to a single class, instead spreading some of that probability across the other classes
So instead of [0, 0, 1],
the model predicts something like [0.01, 0.01, 0.98], which helps it generalize better!
See this article: https://www.pyimagesearch.com/2019/12/30/label-smoothing-with-keras-tensorflow-and-deep-learning/
Note: in that article they build label smoothing by hand
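In Keras it's just an argument on the loss, no custom build needed:

# label_smoothing=0.2 softens the one-hot targets (true class ~0.84, others ~0.04 across 5 classes)
model_5.compile(loss=tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.2),
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])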
Prepare datasets
Fitting and evaluating model
Ideas for further experiments:
Try different positional model setups
Copy the paper exactly and have a second bidirectional LSTM layer
Try GloVe instead of USE
Tune different hyperparameters (label smoothing, increase neurons, different depths of one hot encodings, etc.)
Try adding more layers
Fine tune USE layer
Comparing model results
model                      accuracy (%)  precision  recall    f1
baseline                   72.183238     0.718647   0.721832  0.698925
custom_token_embed_conv1d  80.034423     0.798406   0.800344  0.798453
pretrained_token_embed     71.243215     0.712945   0.712432  0.709484
custom_char_embed_conv1d   65.282007     0.647531   0.652820  0.644323
hybrid_char_token_embed    74.175824     0.743937   0.741758  0.739004
tribrid_embed              84.582285     0.844667   0.845823  0.844824

Note: could've used TensorBoard to compare these runs

The authors' model achieves an F1-score of 90.0 on the 20k RCT dataset, versus our F1-score of ~84.
There are some things to note about this difference:
Our models (with the exception of the baseline) were trained on only 10% of the training data
We evaluated on validation set, not test set
Saving model
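One line with Keras (the model name is my choice; on Keras 3 you'd add a .keras suffix):

model_5.save("skimlit_tribrid_model")  # TF 2.x SavedModel directory
loaded_model = tf.keras.models.load_model("skimlit_tribrid_model")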
Conclusion
Well that was a long notebook
Additional things to add:
Looking at model's worst predictions
Fitting on all training+valid data and then evaluating on test data
Grabbing our own abstract and seeing how the model performs on it
Additional Future Project:
Can we add a button that lets you evaluate whether the predictions were good or bad? Then the model could learn from human evaluations