Classifying Medical Abstract Sentences


The purpose of this notebook is to replicate the NLP model from Neural networks for joint sentence classification in medical paper abstracts, making medical abstracts easier to read by labelling each sentence with one of 5 categories:

  • Background

  • Objective

  • Methods

  • Results

  • Conclusion


Model Input

For example, can we train an NLP model which takes the following input (note: the following sample has had all numerical symbols replaced with "@"):

To investigate the efficacy of @ weeks of daily low-dose oral prednisolone in improving pain , mobility , and systemic low-grade inflammation in the short term and whether the effect would be sustained at @ weeks in older adults with moderate to severe knee osteoarthritis ( OA ). A total of @ patients with primary knee OA were randomized @:@ ; @ received @ mg/day of prednisolone and @ received placebo for @ weeks. Outcome measures included pain reduction and improvement in function scores and systemic inflammation markers. Pain was assessed using the visual analog pain scale ( @-@ mm ). Secondary outcome measures included the Western Ontario and McMaster Universities Osteoarthritis Index scores , patient global assessment ( PGA ) of the severity of knee OA , and @-min walk distance ( @MWD )., Serum levels of interleukin @ ( IL-@ ) , IL-@ , tumor necrosis factor ( TNF ) - , and high-sensitivity C-reactive protein ( hsCRP ) were measured. There was a clinically relevant reduction in the intervention group compared to the placebo group for knee pain , physical function , PGA , and @MWD at @ weeks. The mean difference between treatment arms ( @ % CI ) was @ ( @-@ @ ) , p < @ ; @ ( @-@ @ ) , p < @ ; @ ( @-@ @ ) , p < @ ; and @ ( @-@ @ ) , p < @ , respectively. Further , there was a clinically relevant reduction in the serum levels of IL-@ , IL-@ , TNF - , and hsCRP at @ weeks in the intervention group when compared to the placebo group. These differences remained significant at @ weeks. The Outcome Measures in Rheumatology Clinical Trials-Osteoarthritis Research Society International responder rate was @ % in the intervention group and @ % in the placebo group ( p < @ ). Low-dose oral prednisolone had both a short-term and a longer sustained effect resulting in less knee pain , better physical function , and attenuation of systemic inflammation in older patients with knee OA ( ClinicalTrials.gov identifier NCT@ ).

Model Output

And returns the following output:

['###24293578\n',

'OBJECTIVE\tTo investigate the efficacy of @ weeks of daily low-dose oral prednisolone in improving pain , mobility , and systemic low-grade inflammation in the short term and whether the effect would be sustained at @ weeks in older adults with moderate to severe knee osteoarthritis ( OA ) .\n',

'METHODS\tA total of @ patients with primary knee OA were randomized @:@ ; @ received @ mg/day of prednisolone and @ received placebo for @ weeks .\n',

'METHODS\tOutcome measures included pain reduction and improvement in function scores and systemic inflammation markers .\n',

'METHODS\tPain was assessed using the visual analog pain scale ( @-@ mm ) .\n',

'METHODS\tSecondary outcome measures included the Western Ontario and McMaster Universities Osteoarthritis Index scores , patient global assessment ( PGA ) of the severity of knee OA , and @-min walk distance ( @MWD ) .\n',

'METHODS\tSerum levels of interleukin @ ( IL-@ ) , IL-@ , tumor necrosis factor ( TNF ) - , and high-sensitivity C-reactive protein ( hsCRP ) were measured .\n',

'RESULTS\tThere was a clinically relevant reduction in the intervention group compared to the placebo group for knee pain , physical function , PGA , and @MWD at @ weeks .\n',

'RESULTS\tThe mean difference between treatment arms ( @ % CI ) was @ ( @-@ @ ) , p < @ ; @ ( @-@ @ ) , p < @ ; @ ( @-@ @ ) , p < @ ; and @ ( @-@ @ ) , p < @ , respectively .\n',

'RESULTS\tFurther , there was a clinically relevant reduction in the serum levels of IL-@ , IL-@ , TNF - , and hsCRP at @ weeks in the intervention group when compared to the placebo group .\n',

'RESULTS\tThese differences remained significant at @ weeks .\n',

'RESULTS\tThe Outcome Measures in Rheumatology Clinical Trials-Osteoarthritis Research Society International responder rate was @ % in the intervention group and @ % in the placebo group ( p < @ ) .\n',

'CONCLUSIONS\tLow-dose oral prednisolone had both a short-term and a longer sustained effect resulting in less knee pain , better physical function , and attenuation of systemic inflammation in older patients with knee OA ( ClinicalTrials.gov identifier NCT@ ) .\n', '\n']

Get the data

Data is from PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts

It's already been split into train, test, valid sets. Nice.
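One way to grab the files (the dataset lives on the authors' GitHub; the folder name below assumes the 20k variant with numbers replaced by "@", which matches the sample above):

```python
# Clone the PubMed RCT dataset repo (from the paper's authors)
!git clone https://github.com/Franck-Dernoncourt/pubmed-rct.git

# Use the 20k abstracts variant where numbers have been replaced with "@"
data_dir = "pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/"

import os
os.listdir(data_dir)  # expect something like: train.txt, dev.txt, test.txt
```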

Preprocess and Visualize the Data

So each label and its sentence are separated by a tab, each abstract starts with a numeric ID line (###...), and abstracts are separated by blank lines

Hmm... let's turn this into a list of dictionaries?

Preprocessing Function

Steps:

  1. Remove all the '\n'

  2. Split the lines into abstracts: either reset a line counter each time we see an abstract ID line (starting with ###), or drop the ID lines and reset the counter whenever we hit a blank '\n' line

  3. Split each text line by tab and put the first part into target and second part into text
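A minimal sketch of those three steps (the file and function names are just what I'm using here):

```python
def get_lines(filename):
  """Read a text file and return its lines as a list of strings."""
  with open(filename, "r") as f:
    return f.readlines()

def preprocess_text_with_line_numbers(filename):
  """Return a list of dicts: label, sentence text, position in abstract, abstract length."""
  input_lines = get_lines(filename)
  abstract_lines = ""    # running text of the current abstract
  abstract_samples = []  # one dict per sentence

  for line in input_lines:
    if line.startswith("###"):   # abstract ID line -> start a new abstract
      abstract_lines = ""
    elif line.isspace():         # blank line -> abstract finished, parse its sentences
      abstract_line_split = abstract_lines.splitlines()
      for line_number, abstract_line in enumerate(abstract_line_split):
        target, text = abstract_line.split("\t")  # label and sentence are tab-separated
        abstract_samples.append({
            "target": target,
            "text": text.lower(),
            "line_number": line_number,                   # where the sentence sits
            "total_lines": len(abstract_line_split) - 1,  # last index in the abstract
        })
    else:
      abstract_lines += line

  return abstract_samples

train_samples = preprocess_text_with_line_numbers(data_dir + "train.txt")
val_samples = preprocess_text_with_line_numbers(data_dir + "dev.txt")
test_samples = preprocess_text_with_line_numbers(data_dir + "test.txt")
```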

Visualize


| | target | text | line_number | total_lines |
| --- | --- | --- | --- | --- |
| 0 | OBJECTIVE | to investigate the efficacy of @ weeks of dail... | 0 | 11 |
| 1 | METHODS | a total of @ patients with primary knee oa wer... | 1 | 11 |
| 2 | METHODS | outcome measures included pain reduction and i... | 2 | 11 |
| 3 | METHODS | pain was assessed using the visual analog pain... | 3 | 11 |
| 4 | METHODS | secondary outcome measures included the wester... | 4 | 11 |
| 5 | METHODS | serum levels of interleukin @ ( il-@ ) , il-@ ... | 5 | 11 |
| 6 | RESULTS | there was a clinically relevant reduction in t... | 6 | 11 |
| 7 | RESULTS | the mean difference between treatment arms ( @... | 7 | 11 |
| 8 | RESULTS | further , there was a clinically relevant redu... | 8 | 11 |
| 9 | RESULTS | these differences remained significant at @ we... | 9 | 11 |
| 10 | RESULTS | the outcome measures in rheumatology clinical ... | 10 | 11 |
| 11 | CONCLUSIONS | low-dose oral prednisolone had both a short-te... | 11 | 11 |
| 12 | BACKGROUND | emotional eating is associated with overeating... | 0 | 10 |
| 13 | BACKGROUND | yet , empirical evidence for individual ( trai... | 1 | 10 |

Is the dataset balanced?

It should be somewhat okay; OBJECTIVE will probably have the worst accuracy


Get lists of sentences

Make numeric labels
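Sketching both of those steps, assuming the samples from the preprocessing function are loaded into DataFrames like the one shown above:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

train_df = pd.DataFrame(train_samples)
val_df = pd.DataFrame(val_samples)

# Lists of sentences
train_sentences = train_df["text"].tolist()
val_sentences = val_df["text"].tolist()

# Integer labels (for the scikit-learn baseline)
label_encoder = LabelEncoder()
train_labels_encoded = label_encoder.fit_transform(train_df["target"].to_numpy())
val_labels_encoded = label_encoder.transform(val_df["target"].to_numpy())

# One-hot labels (for the Keras models); sklearn >= 1.2 uses sparse_output, older versions use sparse
one_hot_encoder = OneHotEncoder(sparse_output=False)
train_labels_one_hot = one_hot_encoder.fit_transform(train_df["target"].to_numpy().reshape(-1, 1))
val_labels_one_hot = one_hot_encoder.transform(val_df["target"].to_numpy().reshape(-1, 1))

num_classes = len(label_encoder.classes_)  # 5
```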

Baseline Model - Naive Bayes with TF-IDF

Explanation of TF-IDF: https://monkeylearn.com/blog/what-is-tf-idf/

Explanation of Naive Bayes: https://heartbeat.comet.ml/understanding-naive-bayes-its-applications-in-text-classification-part-1-ec9caea4baae
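A typical scikit-learn setup for this kind of baseline (a sketch, not necessarily the exact hyperparameters used here):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# TF-IDF features feeding a Multinomial Naive Bayes classifier
model_0 = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

model_0.fit(train_sentences, train_labels_encoded)
model_0.score(val_sentences, val_labels_encoded)  # accuracy on the validation set
```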

72% accuracy is pretty good for a baseline!

Model 1: Conv1D with token embeddings

Tokenization

Embedding

Make sure embeddings work

Set up model
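Roughly what the token model looks like (the vocab size, sequence length and filter sizes here are placeholders, not tuned values):

```python
import tensorflow as tf
from tensorflow.keras import layers

max_tokens = 68000      # placeholder vocabulary size
output_seq_len = 55     # placeholder sequence length

# Map raw strings to integer token IDs
text_vectorizer = layers.TextVectorization(max_tokens=max_tokens,
                                           output_sequence_length=output_seq_len)
text_vectorizer.adapt(train_sentences)

# Learnable token embedding
token_embed = layers.Embedding(input_dim=max_tokens, output_dim=128)

# Conv1D over the token embeddings
inputs = layers.Input(shape=(1,), dtype=tf.string)
x = text_vectorizer(inputs)
x = token_embed(x)
x = layers.Conv1D(filters=64, kernel_size=5, padding="same", activation="relu")(x)
x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(num_classes, activation="softmax")(x)

model_1 = tf.keras.Model(inputs, outputs)
model_1.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
```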

Hmm, maybe the input shape is incorrect? Let's put these into tf.data datasets.

Oh wait, it's expecting batches; that's why it wants 3 dimensions.

Create tensorflow prefetched datasets

tf.data: https://www.tensorflow.org/guide/data

data performance: https://www.tensorflow.org/guide/data_performance

Also, do not shuffle the data because the sequence order matters
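Something along these lines (batch size 32 is just a reasonable default):

```python
# Wrap sentences and one-hot labels in tf.data datasets, then batch + prefetch
train_dataset = tf.data.Dataset.from_tensor_slices((train_sentences, train_labels_one_hot))
valid_dataset = tf.data.Dataset.from_tensor_slices((val_sentences, val_labels_one_hot))

train_dataset = train_dataset.batch(32).prefetch(tf.data.AUTOTUNE)
valid_dataset = valid_dataset.batch(32).prefetch(tf.data.AUTOTUNE)

model_1.fit(train_dataset, epochs=3, validation_data=valid_dataset)
```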

That's pretty good! I rushed through the tokenization and embeddings, so let's go back and improve those. Also notice that the model is overfitting.

Improved Tokenization
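One improvement is to stop guessing the sequence length and base it on the data, e.g. pick a length that covers 95% of sentences:

```python
import numpy as np

# How long is a typical sentence (in tokens)?
sent_lens = [len(sentence.split()) for sentence in train_sentences]
output_seq_len = int(np.percentile(sent_lens, 95))  # length that covers ~95% of sentences

# Re-create the vectorizer with the data-driven sequence length
text_vectorizer = layers.TextVectorization(max_tokens=68000,
                                           output_sequence_length=output_seq_len)
text_vectorizer.adapt(train_sentences)
```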


Improved Embedding Layer

This did better than our baseline, likely because the sentences are fairly long, so the deep learning model can pick up more signal from the token sequences

Model 2: Pretrained Universal Sentence Feature Extractor

The paper uses GloVe embeddings, but those aren't on TensorFlow Hub, so let's just try the Universal Sentence Encoder instead
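The Universal Sentence Encoder (USE) lives on TensorFlow Hub and drops in as a Keras layer that maps each sentence to a 512-dimensional embedding:

```python
import tensorflow_hub as hub

# Pretrained sentence-level feature extractor (frozen for now)
tf_hub_embedding_layer = hub.KerasLayer(
    "https://tfhub.dev/google/universal-sentence-encoder/4",
    trainable=False,
    name="universal_sentence_encoder")
```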

Hmm I have two thoughts on why it didn't perform as well as the Conv1D:

  • We are working with scientific text, which is quite different from the general text the encoder was pretrained on

  • The model on top of the encoder is pretty simple

Try a model with basically the same setup as model 1, but with the pretrained embedding

LOL, it performed terribly. Maybe I should stick with having dense layers after the encoder.

Denser model

Model 3: Conv1D with character embeddings

The paper uses token+character embeddings. Let's try just character embeddings first
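A simple way to get a character tokenizer out of TextVectorization is to space-separate every character first (the max_tokens and sequence length below are placeholders):

```python
def split_chars(text):
  """'hello' -> 'h e l l o' so TextVectorization can split on whitespace."""
  return " ".join(list(text))

train_chars = [split_chars(sentence) for sentence in train_sentences]
val_chars = [split_chars(sentence) for sentence in val_sentences]

# Small vocab: letters, digits, punctuation, plus padding/OOV tokens
char_vectorizer = layers.TextVectorization(max_tokens=70,
                                           output_sequence_length=290)
char_vectorizer.adapt(train_chars)
```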

Create character tokenizer


Build Model

Before we fit this model, do you think that character embeddings would outperform word embeddings?

The character embedding model performed pretty badly, but still better than random guessing (a random guess would get ~20% accuracy because there are 5 classes)

It's still mindblowing that the model was able to learn from just characters!

Just for fun: trying without char standardization

Interesting! Why do you think it did better?

I think it did better because certain punctuation, such as parentheses or colons, shows up more often in particular sections (e.g. results)

Model 4: Pretrained token embeddings + character embeddings

Multimodal models

  1. Create token-level embedding (model 2)

  2. Create character-level embedding (model 3)

  3. Combine 1 & 2 with concatenate layer (layers.Concatenate)

  4. Add output layers (same as paper)

Here's the paper again: Neural networks for joint sentence classification in medical paper abstracts
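A sketch of that combination, reusing the USE layer from model 2 for the token branch and a bidirectional LSTM over character embeddings (layer sizes are illustrative, not the paper's exact values):

```python
# 1. Token branch: pretrained sentence embedding -> dense layer
token_inputs = layers.Input(shape=[], dtype=tf.string, name="token_inputs")
token_embeddings = tf_hub_embedding_layer(token_inputs)
token_outputs = layers.Dense(128, activation="relu")(token_embeddings)
token_model = tf.keras.Model(token_inputs, token_outputs)

# 2. Character branch: char vectorizer -> embedding -> bidirectional LSTM
char_inputs = layers.Input(shape=(1,), dtype=tf.string, name="char_inputs")
char_vectors = char_vectorizer(char_inputs)
char_embeddings = layers.Embedding(input_dim=70, output_dim=25)(char_vectors)
char_outputs = layers.Bidirectional(layers.LSTM(24))(char_embeddings)
char_model = tf.keras.Model(char_inputs, char_outputs)

# 3. Concatenate both branches
combined = layers.Concatenate(name="token_char_hybrid")([token_model.output, char_model.output])

# 4. Output layers
x = layers.Dropout(0.5)(combined)
outputs = layers.Dense(num_classes, activation="softmax")(x)
model_4 = tf.keras.Model(inputs=[token_model.input, char_model.input], outputs=outputs)
```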

Building model


Preparing data

Both of these methods work to build the dataset
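For example (both versions produce equivalent ((token_input, char_input), label) batches):

```python
# Method 1: zip a dataset of inputs with a dataset of labels
train_inputs = tf.data.Dataset.from_tensor_slices((train_sentences, train_chars))
train_labels = tf.data.Dataset.from_tensor_slices(train_labels_one_hot)
train_char_token_dataset = tf.data.Dataset.zip((train_inputs, train_labels))

# Method 2: build it in one go from a ((inputs), labels) tuple
train_char_token_dataset = tf.data.Dataset.from_tensor_slices(
    ((train_sentences, train_chars), train_labels_one_hot))

train_char_token_dataset = train_char_token_dataset.batch(32).prefetch(tf.data.AUTOTUNE)
```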

Fitting a model on token and character-level sequences

Why do you think it didn't perform as well as model 1? Maybe we could try it without the pretrained embedding layer?

Model 5: Pretrained token embeddings + character embeddings + positional (feature) embeddings - the full model from the paper

The order of the sentences is important! For example, sentences at the beginning are generally background/objective sentences

Let's do some feature engineering to add this information into the model!
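A sketch of that feature engineering with tf.one_hot (the depths 15 and 20 are illustrative cut-offs that cover the vast majority of abstracts; out-of-range values just become all-zero vectors):

```python
# One-hot encode each sentence's position and its abstract's length
train_line_numbers_one_hot = tf.one_hot(train_df["line_number"].to_numpy(), depth=15)
train_total_lines_one_hot = tf.one_hot(train_df["total_lines"].to_numpy(), depth=20)

val_line_numbers_one_hot = tf.one_hot(val_df["line_number"].to_numpy(), depth=15)
val_total_lines_one_hot = tf.one_hot(val_df["total_lines"].to_numpy(), depth=20)
```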

Prepare Dataset

Build positional model

Try different positional model using one-hot


Let's try one-hot encoding (because line 2 isn't "twice as much" as line 1; line numbers are categorical positions, not magnitudes)

Let's try keeping total_lines separate


Simple model with just total lines to see what happens

Tribrid model

  1. Create token model

  2. Create character-level model

  3. Create line number model

  4. Create total_lines model

  5. Concatenate 1 & 2

  6. Concatenate 3,4,5

  7. Create output layer to accept output from 6 and output label probabilities
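Putting those steps together, a rough sketch (again, layer sizes are illustrative and the token/char branches come from model 4):

```python
# 3. Line-number branch
line_number_inputs = layers.Input(shape=(15,), name="line_number_input")
line_number_outputs = layers.Dense(32, activation="relu")(line_number_inputs)
line_number_model = tf.keras.Model(line_number_inputs, line_number_outputs)

# 4. Total-lines branch
total_lines_inputs = layers.Input(shape=(20,), name="total_lines_input")
total_lines_outputs = layers.Dense(32, activation="relu")(total_lines_inputs)
total_lines_model = tf.keras.Model(total_lines_inputs, total_lines_outputs)

# 5. Concatenate the token and character branches
combined_embeddings = layers.Concatenate(name="token_char_embeddings")(
    [token_model.output, char_model.output])
z = layers.Dense(256, activation="relu")(combined_embeddings)
z = layers.Dropout(0.5)(z)

# 6. Concatenate the positional branches with the hybrid embedding
tribrid = layers.Concatenate(name="token_char_positional")(
    [line_number_model.output, total_lines_model.output, z])

# 7. Output layer producing label probabilities
output_layer = layers.Dense(num_classes, activation="softmax", name="output_layer")(tribrid)

model_5 = tf.keras.Model(
    inputs=[line_number_model.input, total_lines_model.input,
            token_model.input, char_model.input],
    outputs=output_layer)
```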


What is label smoothing? It stops the model from assigning 100% confidence to a single class by spreading some of that probability mass over the other classes.

So instead of a hard target like [0, 0, 1],

the model is trained towards something like [0.01, 0.01, 0.98], which helps it generalize better!

See this article: https://www.pyimagesearch.com/2019/12/30/label-smoothing-with-keras-tensorflow-and-deep-learning/

Note: in that article they build label smoothing by hand
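In tf.keras you don't have to roll it yourself; the categorical cross-entropy loss takes a label_smoothing argument (the 0.2 below is just an example value):

```python
# Compile the tribrid model with label smoothing baked into the loss
model_5.compile(
    loss=tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.2),
    optimizer=tf.keras.optimizers.Adam(),
    metrics=["accuracy"],
)
```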

Prepare datasets

Fitting and evaluating model

Ideas for further experiments:

  • Try different positional model setups

  • Copy the paper exactly and have a second bidirectional LSTM layer

  • Try GloVe instead of USE

  • Tune different hyperparameters (label smoothing, increase neurons, different depths of one hot encodings, etc.)

  • Try adding more layers

  • Fine tune USE layer

Comparing model results


| | accuracy (%) | precision | recall | f1 |
| --- | --- | --- | --- | --- |
| baseline | 72.183238 | 0.718647 | 0.721832 | 0.698925 |
| custom_token_embed_conv1d | 80.034423 | 0.798406 | 0.800344 | 0.798453 |
| pretrained_token_embed | 71.243215 | 0.712945 | 0.712432 | 0.709484 |
| custom_char_embed_conv1d | 65.282007 | 0.647531 | 0.652820 | 0.644323 |
| hybrid_char_token_embed | 74.175824 | 0.743937 | 0.741758 | 0.739004 |
| tribrid_embed | 84.582285 | 0.844667 | 0.845823 | 0.844824 |
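For reference, a table like this can be assembled from per-model results dictionaries (assuming each *_results below holds accuracy/precision/recall/f1 from some evaluation helper; the names are illustrative):

```python
import pandas as pd

all_model_results = pd.DataFrame({
    "baseline": baseline_results,
    "custom_token_embed_conv1d": model_1_results,
    "pretrained_token_embed": model_2_results,
    "custom_char_embed_conv1d": model_3_results,
    "hybrid_char_token_embed": model_4_results,
    "tribrid_embed": model_5_results,
}).transpose()

# Plot F1-scores side by side
all_model_results["f1"].plot(kind="bar");
```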


Note: we could've used TensorBoard to compare these runs instead


The authors' model achieves an F1-score of 90.0 on the PubMed 20k RCT dataset, versus our F1-score of ~84.

There are some things to note about this difference:

  • Our models (with the exception of the baseline) were trained on only 10% of the data

  • We evaluated on the validation set, not the test set

Saving model
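Saving in the TensorFlow SavedModel format keeps the TextVectorization and TF Hub layers bundled with the weights (the path below is illustrative):

```python
# Save the best model as a SavedModel directory
model_5.save("skimlit_tribrid_model")

# Reload it later without rebuilding the architecture
loaded_model = tf.keras.models.load_model("skimlit_tribrid_model")
```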

Conclusion

Well that was a long notebook

Additional things to add:

  • Looking at the model's worst predictions

  • Fitting on all training+valid data and then evaluating on test data

  • Grabbing our own abstract and seeing how the model performs on it

Additional Future Project:

  • Can we add a button that lets you evaluate whether the predictions were good or bad? Then the model could learn from human evaluations
