Test-driving Rust: array slices and lifetimes
Last month I took an exciting new class on Coursera called Bioinformatics: finding hidden messages in DNA, sequencing antibiotics, stuff like that. In the class they introduce algorithms to work with DNA sequences, which, from the algorithm’s perspective, are just vectors of characters-nucleotides. The graded exercises provide you with some input files, you run your implementation of the algorithm on the inputs and submit the output. But in addition to that they also have an automated system where you can verify your implementation, or solve extra problems, if you want to, and they support a number of languages there, including Rust. I’ve been looking for an opportunity to test-drive Rust for quite some time and this turned out to be a great domain to apply Rust array slices, and that’s what this blog post is about.
In Rust, slices “are a view into a block of memory represented as a pointer and a length“. To make this discussion practical, let’s apply them in a simple algorithm — let’s say we have the DNA string (which we can represent as an array of single-byte UTF-8 characters) and we want to count the number of occurrences of each n-gram (substring of length n
) in that string. And let’s say we want to turn this algorithm into a utility function returning a hash map from each distinct n-gram to the count.
One way to do that would be to return something like a map from string to int. But let’s say we want something more efficient that that, because we need to handle many large DNA strings, and we don’t want to waste memory creating a copy of each n-gram string to use as keys into the map.
In Rust, we can actually avoid any copying by returning a map from a slice to the count, like this:
// "&[u8]" means "a slice of byte array" fn count_ngrams(text : &str, n : usize) -> HashMap<&[u8], usize> { let mut ngram_map = HashMap::new(); for i in 0..text.len()+1-n { // key is a slice of "n" bytes starting from index "i" let key = &text.as_bytes()[i..i+n]; *ngram_map.entry(key).or_insert(0) += 1; } ngram_map }
Now wait a minute, a C++ developer might say, are you seriously suggesting to make a utility function that returns a collection of raw pointers/references into a string parameter your code doesn’t own?? That’s insane reckless programming, sir, what about the lifetime of the string???!!!11!!1!
But no worries. That’s where Rust’s brilliant reference lifetime tracking comes into play, and it makes this implementation completely safe in Rust with zero run-time cost: no reference counting, “smart” pointers or garbage collection or anything like that needed at all! If your code compiles, Rust guarantees there’s no “use after free”.
Yes, indeed, now let’s try to check that this really works as promised… :-)
Read the rest of this entry »