In my last blog post, we discussed how to generate bad movie reviews using a text generator, where the input data was based on sequences of words. But you can also generate text by using an input sequence of individual characters, with a character-based RNN. This is particularly useful when trying to generate text that isn’t in a standard format, e.g. plays or movie scripts.
TensorFlow has a great hands-on tutorial on how to build a character-based text generator. But rather than build my own, I decided to use one that someone else has already built and trained. This is an example of what’s called “Transfer Learning”.
What is transfer learning?
Transfer learning is the practice of taking a model that someone else has already designed and trained, and then retraining some or all of its weights on new data. Usually, the original model was trained on a much larger dataset, by someone with more expertise in choosing hyperparameters. This is especially useful when you only have a small dataset to work with. For our example, we’ll use a model called textgenrnn.
What is textgenrnn?
Textgenrnn is a character-based RNN that has already been pretrained “on hundreds of thousands of text documents from Reddit submissions, from a very diverse variety of subreddits”. You can read more about the architecture here.
When training a pre-existing model you can choose to lock most of the layers, so that only a few of them actually learn new parameters. With textgenrnn, however, when you train on a new dataset, it relearns ALL the weights. But rather than starting the training with randomly initialized weights, it starts with the weights that were already learned. Thus, the training begins with a fairly high level of accuracy, and it’s not necessary to train for a lot of epochs.
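As a toy illustration of why starting from learned weights helps, the sketch below fits a one-weight model with gradient descent from two starting points. Everything in it is hypothetical (the data, the "pretrained" value of 2.5, the "random" value of -4.0), and a real character RNN has vastly more weights, but the idea is the same.

```python
def mse(w, data):
    """Mean squared error of the one-weight model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def train(w, data, epochs, lr=0.05):
    """Plain gradient descent that updates every weight (here, just one)."""
    losses = [mse(w, data)]
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
        losses.append(mse(w, data))
    return w, losses


# Hypothetical "new" dataset whose best weight is 3.0.
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0)]

# Assumed starting points: a weight learned on related data vs. an arbitrary one.
pretrained_w, pretrained_losses = train(2.5, data, epochs=3)
scratch_w, scratch_losses = train(-4.0, data, epochs=3)

print("pretrained start:", [round(loss, 3) for loss in pretrained_losses])
print("random start:    ", [round(loss, 3) for loss in scratch_losses])
```

Both runs update the same weight with the same rule. The only difference is the starting point, and the run that starts near a good solution has a much lower loss after the same number of epochs.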
The author has made the model as user-friendly as possible by creating a Google Colab with step-by-step instructions and comments. The only thing you need to provide is your own data (as a .txt file). Other than that, you can experiment with hyperparameters, or just accept the defaults.
Writing the bad movie reviews into a text file is similar to how we got our data in the last post.
The line to clear the file is there in case you run this cell more than once. Otherwise, duplicates of the reviews will keep getting appended to the end of the file.
Another change this time is the replacement of the “<br />” elements. When doing a word-based RNN they’re not too problematic, since only the most common words are chosen for text generation. But in a character-based RNN, you need to be more careful. Also, we add a new line after each review. Then, when configuring our model, we set line_delimited=True to signal that each line constitutes a data point.
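The cell could look something like the sketch below. It is not the original notebook code: the `write_reviews` helper, the sample reviews, and the output path are made up for illustration, and it opens the file in write mode (which overwrites on every run) rather than clearing the file in a separate step.

```python
import os
import tempfile


def write_reviews(reviews, path):
    """Write one cleaned review per line, overwriting any previous contents."""
    # "w" mode truncates the file, so re-running never appends duplicates.
    with open(path, "w", encoding="utf-8") as f:
        for review in reviews:
            # Swap HTML line breaks for a space, then collapse all whitespace
            # (including embedded newlines) so each review stays on one line.
            cleaned = " ".join(review.replace("<br />", " ").split())
            f.write(cleaned + "\n")


# Hypothetical sample data standing in for the real bad reviews.
sample_reviews = [
    "This movie was awful.<br /><br />The acting was worse.",
    "I want my two hours back.",
]

out_path = os.path.join(tempfile.gettempdir(), "bad_reviews.txt")
write_reviews(sample_reviews, out_path)
write_reviews(sample_reviews, out_path)  # second run overwrites, no duplicates

with open(out_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

Keeping each review on exactly one line matters here, because with line-delimited training every line is treated as its own data point.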
After downloading the file to our computer, we can upload it into the Colab walkthrough mentioned above, and we have everything we need to start generating bad movie reviews!
When generating text there is a “temperature” setting, typically set between 0 and 1. The lower it is, the less creative the output. That is, the model sticks more closely to the most probable next character, so it tends to reproduce phrasing it saw often in the training data. This often leads to loops.
I have no idea what the movie was. A little bit of a script that is so bad that the movie is a little bit of a mistake. The plot is so amateur and the story is absolutely awful. The story is absolutely awful. The acting is bad and the acting is absolutely horrible. The story is a bad movie.
For higher temperatures, the generator is bolder, and the output is usually very entertaining.
I love Christianity, unfortunately, no one was entitled to sound writing.
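To make temperature a bit more concrete, here is a minimal sketch of how temperature sampling is commonly done: each probability is raised to the power 1/T and the result is renormalized. The character probabilities are invented, and I'm assuming textgenrnn does something along these lines rather than quoting its code.

```python
import math
import random


def apply_temperature(probs, temperature):
    """Rescale a distribution as p_i ** (1 / T), then renormalize to sum to 1."""
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample_char(chars, probs, temperature, rng):
    """Draw one character from the temperature-adjusted distribution."""
    weights = apply_temperature(probs, temperature)
    return rng.choices(chars, weights=weights, k=1)[0]


# Hypothetical next-character probabilities, invented for illustration.
chars = ["e", "a", "o", "z"]
probs = [0.5, 0.3, 0.15, 0.05]

cold = apply_temperature(probs, 0.2)
neutral = apply_temperature(probs, 1.0)

rng = random.Random(0)  # fixed seed so repeated runs give the same samples
for t in (0.2, 0.5, 1.0):
    adjusted = [round(p, 3) for p in apply_temperature(probs, t)]
    sample = "".join(sample_char(chars, probs, t, rng) for _ in range(20))
    print(t, adjusted, sample)
```

At 0.2 nearly all of the probability lands on the single most likely character, which is why low temperatures fall into loops. At 1.0 the distribution is left unchanged, so rare characters still get picked now and then.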
When I trained the model, the first epoch finished with a loss of about 1.4, so the pretrained weights already give it a good head start. As it trains, the model will generate text at 3 different “temperatures”: 0.2, 0.5, and 1.0. This provides a better sense of how well the machine is learning, as opposed to just looking at the loss metric. Once the loss got to about 1.1, the 1.0 temperature put out this gem:
The television picture is an experiment of the story. It was put out of the theater in the film. I am a DVD destruction of a TV film with a high school
We can even specify seed text as we did before. Notice that because this is a character-based RNN, the punctuation is part of the output, so there’s no need to manually insert it later. (Seed text is in bold)
This movie is about a great movie and the movie is a shame that the movie is a good movie. The story is a stranger who was shot and starts to start to watch the film and the movie was a movie that should be a good movie about the story of the story and the story is a stupid movie.
The protagonist is exciting and the one word of the corpse.
Steven Seagal was a movie that is so bad that the story is pathetic and completely unnecessary to be so creepy and unrealistic. The film is simply the worst movie I have ever seen.
Steven Seagal has the bendwork of a killer mother
Batman and Robinson or the House Into Ms. X's character - Darkness Man was cop.
When we set the temperature to 1, it’s more likely to output unpronounceable words:
SCENNOT OF THIS STO KS BET BERK. Sidney Gambre Harlands in order. But that is a poorly ring stinker. It would have been a startling monster.
But it’s also a great way to generate fun character names and movie titles.
Black Jane Blake
"Bride The Baker Picture" (1952)
"Green Dean of the Anti-Friend"
Night Pizza Martha
Captain Duvike Party
Lone Landi Banderneaus
an ancient Detective Machine (Anal Da McFistin)
Just for some extra fun, we can feed these names (or even entire movie reviews) into an AI-powered text-to-image generator. I put in a few to generate what could be movie posters. However, they all turned out to be blurry, surreal, and hauntingly nightmarish.