For my #30DaysofLearning initiative, I chose to take the four-part specialization in TensorFlow offered through Coursera. It covers Dense Neural Networks, Convolutional Neural Networks, Recurrent Neural Networks, and their applications to processing images, natural language, and time-series data.
As part of the segment on natural language processing, we go through an example of how to perform sentiment analysis on a corpus of movie reviews from IMDB, and an example of generating text based on a corpus of Shakespeare sonnets. So I thought it could be fun (and instructive) to try to merge the two together so as to train a model to generate movie reviews — specifically, bad movie reviews.
Most of the code is lifted directly from one of the two aforementioned notebooks, and therefore much better explanations can be found in the course lectures, but I’ll walk through the code nonetheless, as there are a few things that I needed to change to make this work.
After importing some standard objects and methods, we load in the imdb_reviews dataset. We set as_supervised=True because we’ll need the labels to distinguish between good and bad reviews.
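As a rough sketch of that loading step, assuming the tensorflow_datasets package is installed and the dataset can be downloaded: the helper name load_imdb_train is my own, and the import sits inside the function only so that the snippet can be defined without the package present.

```python
def load_imdb_train():
    """Return (reviews, labels) for the IMDB training split as plain Python lists."""
    # Imported lazily: tensorflow_datasets is a heavy, optional dependency.
    import tensorflow_datasets as tfds

    # as_supervised=True makes each example a (text, label) pair.
    imdb = tfds.load("imdb_reviews", as_supervised=True)

    reviews, labels = [], []
    for text, label in tfds.as_numpy(imdb["train"]):
        reviews.append(text.decode("utf-8"))  # text arrives as bytes
        labels.append(int(label))             # assumption: 0 = negative, 1 = positive
    return reviews, labels
```

Calling load_imdb_train() triggers the download on first use, so it is best run once and cached.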
This dataset is already split into train_data, so we can grab all of it like so. To generate a corpus of sentences, we first split each review into sentences using review.split('.'). As this will also split abbreviations like Dr. into their own “sentences”, we keep only those “sentences” that have more than three characters.
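The splitting and character filter described above might look something like the following. This is a sketch rather than the notebook's exact code: the function names are made up, stripping whitespace before measuring length is my assumption, and so is the convention that label 0 marks a negative review.

```python
def split_into_sentences(review, min_chars=3):
    """Split a review on '.' and keep fragments longer than min_chars characters."""
    fragments = (piece.strip() for piece in review.split('.'))
    return [f for f in fragments if len(f) > min_chars]


def negative_sentences(reviews, labels, negative_label=0):
    """Collect sentences from reviews whose label equals negative_label."""
    sentences = []
    for review, label in zip(reviews, labels):
        if label == negative_label:  # assumption: 0 means a bad review
            sentences.extend(split_into_sentences(review))
    return sentences


print(split_into_sentences("Dr. Smith was awful. I left early."))
```

Note that the fragment "Dr" is dropped, while the rest of its sentence survives as "Smith was awful".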
Furthermore, we keep only those with fewer than 10 words. In part, this is because simpler sentences, with fewer clauses, should be easier to generate. If we need more data, we can do the same with the
We use Keras’s Tokenizer object to assign each word in our corpus an integer value. By default, it lowercases all the words and ignores punctuation. So, if a sentence contains the phrase “P T Cruiser”, we end up with “p” and “t” as individual words. That’s why we add the for-loop to remove these.
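To make the tokenizing behavior concrete, here is a plain-Python stand-in that imitates what is described above. It is not the Keras Tokenizer itself. It lowercases, treats punctuation other than apostrophes as whitespace, drops stray single letters, and numbers words from 1 by descending frequency. Keeping "a" and "i" is my assumption, based on the generated examples containing both; where the original for-loop does its removal is not shown.

```python
import string
from collections import Counter

# Punctuation to treat as whitespace; apostrophes are kept inside words.
_PUNCT = string.punctuation.replace("'", "")
_TABLE = str.maketrans({ch: " " for ch in _PUNCT})
KEEP_SINGLE = {"a", "i"}  # assumption: real one-letter words worth keeping


def tokenize(sentence):
    """Lowercase, strip punctuation, and drop stray single letters."""
    words = sentence.lower().translate(_TABLE).split()
    return [w for w in words if len(w) > 1 or w in KEEP_SINGLE]


def build_word_index(sentences):
    """Map each word to an integer, most frequent first, starting at 1."""
    counts = Counter(w for s in sentences for w in tokenize(s))
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


print(tokenize("I drove a P T Cruiser."))
```

Index 0 is deliberately left unused so that it can serve as the padding value later on.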
Here is where we convert the training data into its final form. The texts_to_sequences method converts each sentence into an array of integers. For example, “This was classic Steven Seagal.” might become [32 87 491 2336 15445]. With each sequence, we can then create n-gram sequences of the form [32 87], [32 87 491], . . ., [32 87 491 2336 15445]. Again, we only take those arrays that have fewer than 10 elements.
The reason for filtering a second time is that hyphenated words get counted as a single word in our sentence, but are split up once tokenized. For instance, this real example, “I mean, bad-movie-merchandising-made-into-a-worse-game-in-two-weeks-and-then-shipped-out-and-bought-by-morons-and-their-parents-for-christmas horrid”, got counted as 4 words in a sentence, but 27 words once tokenized. (The reason for filtering in the first place is so that our n-gram sequences don’t get too long, as this will cause our session to crash.)
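The n-gram expansion and the second length filter can be sketched in a few lines of plain Python. The function name and the max_len parameter are illustrative, and I am assuming the filter is applied to the whole tokenized sentence before it is expanded.

```python
def ngram_sequences(token_ids, max_len=10):
    """Return every prefix of length >= 2, or nothing if the sentence is too long."""
    if len(token_ids) >= max_len:
        return []  # e.g. a heavily hyphenated sentence that grew once tokenized
    return [token_ids[:end] for end in range(2, len(token_ids) + 1)]


for seq in ngram_sequences([32, 87, 491, 2336, 15445]):
    print(seq)
```

A five-word sentence therefore contributes four training sequences, which is why long sentences inflate the dataset so quickly.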
The pad_sequences function will then prepend 0’s to the front of each n-gram sequence to make them all a uniform length.
For a given n-gram sequence we take the last element to be the y-value, and all but the last element to be the x-value. to_categorical will then convert the y-value into a one-hot encoded vector. We’re now ready to construct our model.
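For readers who want to see what the padding, the x/y split, and the one-hot step do to the numbers, here is a plain-Python imitation. These helpers are stand-ins written for illustration, not the Keras pad_sequences and to_categorical functions, and num_classes is assumed to be the vocabulary size plus one, since index 0 is reserved for padding.

```python
def pad_left(sequences, pad_value=0):
    """Prepend pad_value so every sequence matches the longest one."""
    width = max(len(s) for s in sequences)
    return [[pad_value] * (width - len(s)) + list(s) for s in sequences]


def split_xy(padded):
    """Use the last element of each row as the target, the rest as input."""
    xs = [row[:-1] for row in padded]
    ys = [row[-1] for row in padded]
    return xs, ys


def one_hot(index, num_classes):
    """Return a list of zeros with a single 1 at position index."""
    vec = [0] * num_classes
    vec[index] = 1
    return vec


padded = pad_left([[32, 87], [32, 87, 491]])
xs, ys = split_xy(padded)
print(xs, ys)
```

In the real pipeline the one-hot targets are as wide as the whole dictionary, which is the main reason memory use grows with vocabulary size.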
We first embed each word into a 32-dimensional space and feed the result to a bidirectional LSTM layer. That then goes into a forward LSTM layer, followed by two dense layers, where our last layer has as many nodes as there are words in our dictionary.
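One way to write that architecture down in Keras is sketched below. Only the 32-dimensional embedding and the layer order come from the description above. The unit counts, the ReLU activation, the Adam optimizer, and the function name build_model are my guesses, and the import is placed inside the function so the snippet can be defined without TensorFlow installed.

```python
def build_model(vocab_size, lstm_units=64, dense_units=64):
    """Embedding -> bidirectional LSTM -> LSTM -> two dense layers."""
    # Imported lazily: TensorFlow is a heavy dependency.
    import tensorflow as tf

    layers = tf.keras.layers
    model = tf.keras.Sequential([
        layers.Embedding(vocab_size, 32),
        # return_sequences=True so the next LSTM receives one vector per time step.
        layers.Bidirectional(layers.LSTM(lstm_units, return_sequences=True)),
        layers.LSTM(lstm_units),
        layers.Dense(dense_units, activation="relu"),  # assumed size and activation
        # One output node per word in the dictionary (including the padding index).
        layers.Dense(vocab_size, activation="softmax"),
    ])
    model.compile(loss="categorical_crossentropy",
                  optimizer="adam",  # assumed optimizer
                  metrics=["accuracy"])
    return model
```

The categorical cross-entropy loss matches the one-hot encoded targets produced earlier.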
There are a lot of hyperparameters to experiment with here, as well as the architecture itself. After much trial and error, I found that there was little gain in training beyond 150 epochs, and I never got above 74% accuracy.
Finally, to generate text, we define the following function:
We essentially take in some “seed text” and the number of words to generate, then iteratively predict the next word given our current sentence. “Sentence” is a loose term as there is no punctuation, and rarely do these word collections actually make grammatical sense.
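The loop described above can be sketched without any framework by passing the model in as a function. Here predict_probs is a hypothetical stand-in for a call to the trained model that returns one probability per word index, and max_len is assumed to be the input length the model was trained on. Picking the single most likely word at each step is also an assumption.

```python
def generate_text(seed_text, n_words, word_index, predict_probs, max_len):
    """Repeatedly append the most probable next word to seed_text."""
    index_word = {i: w for w, i in word_index.items()}
    text = seed_text
    for _ in range(n_words):
        # Encode the current text, skipping words outside the vocabulary.
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        ids = ids[-max_len:]
        padded = [0] * (max_len - len(ids)) + ids  # left-pad, as in training
        probs = predict_probs(padded)
        next_id = max(range(len(probs)), key=probs.__getitem__)  # argmax
        next_word = index_word.get(next_id)
        if next_word is None:  # index 0 is padding, not a word
            break
        text += " " + next_word
    return text


def fake_model(padded):
    """Toy predictor: always picks the word after the last one, cyclically."""
    probs = [0.0] * 4
    probs[padded[-1] % 3 + 1] = 1.0
    return probs


vocab = {"bad": 1, "movie": 2, "really": 3}
print(generate_text("bad", 3, vocab, fake_model, max_len=4))
```

Swapping fake_model for a wrapper around the trained network is all that is needed to generate from the real model.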
I experimented with hyperparameters, as well as different types of seed text and then cherry-picked my favorites. Note: all the punctuation was added by me after the text generation.
A few sound like something from a legitimate movie review. (seed text is in bold)
second thought, she was a bad actress.
went terribly wrong with the film at reality.
- In the eighties
good character development was annoying.
- Dolph Lundgren
just looked positively bored. Feels wrong, too.
Others almost sound like poetry.
- The movie opens with a big
commoner perspective called garbage.
- Batman and Robin.
War is a decent musical.
the public, gathering dust calls us love triangle.
sneaks me in the first movie. I remained untouched.
- Dolph Lundgren
is over, and scuttle is chic.
- Kurt Russell.
Brothers: him, I apologize for -- the male girl potion pro-life show.
But mostly it just returned a bunch of words in random order. Next time, we’ll look at another approach to generating bad movie reviews and will hopefully have some more fun examples!
~ Nathan Salzar