Test-driving Rust: array slices and lifetimes

by Max Galkin

Last month I took an exciting new class on Coursera called Bioinformatics: finding hidden messages in DNA, sequencing antibiotics, stuff like that. In the class they introduce algorithms to work with DNA sequences, which, from the algorithm’s perspective, are just vectors of characters-nucleotides. The graded exercises provide you with some input files, you run your implementation of the algorithm on the inputs and submit the output. But in addition to that they also have an automated system where you can verify your implementation, or solve extra problems, if you want to, and they support a number of languages there, including Rust. I’ve been looking for an opportunity to test-drive Rust for quite some time and this turned out to be a great domain to apply Rust array slices, and that’s what this blog post is about.

In Rust, slices “are a view into a block of memory represented as a pointer and a length“. To make this discussion practical, let’s apply them in a simple algorithm — let’s say we have the DNA string (which we can represent as an array of single-byte UTF-8 characters) and we want to count the number of occurrences of each n-gram (substring of length n) in that string. And let’s say we want to turn this algorithm into a utility function returning a hash map from each distinct n-gram to the count.

One way to do that would be to return something like a map from string to int. But let’s say we want something more efficient that that, because we need to handle many large DNA strings, and we don’t want to waste memory creating a copy of each n-gram string to use as keys into the map.

In Rust, we can actually avoid any copying by returning a map from a slice to the count, like this:

// "&[u8]" means "a slice of byte array"
fn count_ngrams(text : &str, n : usize) -> HashMap<&[u8], usize> {
   let mut ngram_map = HashMap::new();

   for i in 0..text.len()+1-n {
      // key is a slice of "n" bytes starting from index "i"
      let key = &text.as_bytes()[i..i+n];
      
      *ngram_map.entry(key).or_insert(0) += 1;
   }

   ngram_map
}

Now wait a minute, a C++ developer might say, are you seriously suggesting to make a utility function that returns a collection of raw pointers/references into a string parameter your code doesn’t own?? That’s insane reckless programming, sir, what about the lifetime of the string???!!!11!!1!

But no worries. That’s where Rust’s brilliant reference lifetime tracking comes into play, and it makes this implementation completely safe in Rust with zero run-time cost: no reference counting, “smart” pointers or garbage collection or anything like that needed at all! If your code compiles, Rust guarantees there’s no “use after free”.

Koala does not believe

Yes, indeed, now let’s try to check that this really works as promised… :-)

To verify this, we just need one nested scope in the main function: we’ll make a local string instance inside the scope, then we’ll count n-grams in it, and we’ll let the hash map escape the scope to try to invalidate the references. I’ll also add another function called print_most_frequent_ngram, which takes a reference to the hash map and prints the most frequent n-gram to try to use these references. Here I’ll show only the code in main, but you can see the full code at the end of the blog post.

Ok, this is our attempt #1 to break Rust:

fn main() {
   let bigram_map;

   {
      let text = "abbcccdddddcccc";
      bigram_map = count_ngrams(text, 2);
   }

   print_most_frequent_ngram(&bigram_map);
}

We run the compiler… and it builds, and the executable runs and prints the correct answer: “The most frequent 2-gram is ‘cc’: 5 occurrence(s).”

Wait, what did just happen?! Use after free?

I was very confused for a moment there, I have to say, until I figured out what’s going on. When we have a bare string literal in code, it doesn’t have scope lifetime, it actually has program lifetime (‘static lifetime in Rust terminology). So our text varaible doesn’t “own” that string’s memory. That string’s memory can outlive the scope, and thus the map is free to escape it too, all the references will remain valid! Well played, Rust.

But in real use case, of course, the string won’t be a static literal, we’d be reading it from a file or from the console, and that’s what we want to verify. One way to emulate that is to simply make a local copy of the static string literal inside the scope. Let’s change the definition of the text variable and store a new string instance in it. That would definitely produce dangling references. Hence, our attempt #2 to break Rust:

fn main() {
   let bigram_map;

   {
      // let text = "abbcccdddddcccc";
      let text = String::from_str("abbcccdddddcccc");
      bigram_map = count_ngrams(text.as_str(), 2);
   }

   print_most_frequent_ngram(&bigram_map);
}

And, hurray, Rust has caught the problem and here’s the compiler output we see, very descriptive:

lifetimes.rs:35:33: 35:37 error: `text` does not live long enough
lifetimes.rs:35       bigram_map = count_ngrams(text.as_str(), 2);
                                                ^~~~
lifetimes.rs:29:19: 39:2 note: reference must be valid for the block suffix following statement 0 at 29:18...
lifetimes.rs:29    let bigram_map;
lifetimes.rs:30
lifetimes.rs:31    {
lifetimes.rs:32       //let text = "abbcccdddddcccc";
lifetimes.rs:33       let text = String::from_str("abbcccdddddcccc");
lifetimes.rs:34       bigram_map = count_ngrams(text.as_str(), 2);
                     ...
lifetimes.rs:33:54: 36:5 note: ...but borrowed value is only valid for the block suffix following statement 0 at 33:53
lifetimes.rs:33       // let text = "abbcccdddddcccc";
lifetimes.rs:34       let text = String::from_str("abbcccdddddcccc");
lifetimes.rs:35       bigram_map = count_ngrams(text.as_str(), 2);
lifetimes.rs:36    }

If you want to learn more about references, borrowing and lifetimes in Rust, check out the book:

Full code used for this post:

#![feature(iter_cmp)]
use std::collections::HashMap;

fn count_ngrams(text : &str, n : usize) -> HashMap<&[u8], usize> {
   let mut ngram_map = HashMap::new();

   for i in 0..text.len()+1-n {
      let key = &text.as_bytes()[i..i+n];
      *ngram_map.entry(key).or_insert(0) += 1;
   }

   ngram_map
}

fn print_most_frequent_ngram(ngram_map : &HashMap<&[u8], usize>)
{
   let (most_frequent_ngram, occurrences) =
      ngram_map.iter().max_by(|&(_, value)| value).unwrap();

   println!(
      "The most frequent {}-gram is '{}': {} occurrence(s).",
      most_frequent_ngram.len(),
      std::str::from_utf8(most_frequent_ngram).unwrap(), 
      occurrences
   );
}

fn main() {
   let bigram_map;

   {
      // let text = "abbcccdddddcccc";
      let text = String::from_str("abbcccdddddcccc");
      println!("Analyzing text '{}'...", text.as_str());
      bigram_map = count_ngrams(text.as_str(), 2);
   }

   print_most_frequent_ngram(&bigram_map);
}