
Sooo... I've decided to compare the performance of Rust's and C#'s implementations of Russian declension, and, well... everything is so explicit and safe in Rust that I've spent several months learning the language and figuring out the best way to implement all this.

I'm also addicted to Rust now.

UTF-8 strings

To start off, Rust's strings are UTF-8-encoded, while C#/.NET's are UTF-16, meaning Russian characters occupy two code units instead of one. So I simply made a struct containing the two UTF-8 bytes, and matched letters against it.

#[repr(align(2))] // two-byte alignment, so a Letter can be read as a single u16
pub struct Letter {
    utf8: [u8; 2],
}
impl Letter {
    pub const А: Self = Self::from_char('а');
    /* ... */
}
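
from_char isn't shown above; it just splits the character's two-byte UTF-8 encoding into its code units. A minimal const sketch of it (assuming nothing fancier is going on):

impl Letter {
    /// Encodes a character from the two-byte UTF-8 range (U+0080..=U+07FF,
    /// which covers the Cyrillic block) into its two code units.
    pub const fn from_char(ch: char) -> Self {
        let code = ch as u32;
        assert!(0x80 <= code && code <= 0x7FF);
        Self { utf8: [0xC0 | (code >> 6) as u8, 0x80 | (code & 0x3F) as u8] }
    }
}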

Everything was going smoothly, and I even fully implemented the noun declension, but then, way too late in the process, I realized that memory alignment is not just some transparent processor optimization... but something that you actually need to consider as a developer.

let s = String::from("abcСЛОВО"); // pointer e.g. 0xf54ba7c0, align=1
let word = &s[3..];

// |-|-|-|-|-|-|-|-|
// |a|b|c| С | Л |О|
// |      ^^^ ^^^ ^|
// |О| В | О |-----|
// |^ ^^^ ^^^      |

// This would result in a misaligned pointer 0xf54ba7c3, align=2
let letters: &[Letter] = unsafe {
    std::slice::from_raw_parts(word.as_ptr().cast(), word.len() / 2)
};

Both the compiler and the processor rely on the data being properly aligned to perform certain optimizations. Rust, of course, panicked here, informing me of the misalignment (in a debug build, that is, where the standard library's unsafe-precondition checks are enabled; a release build would just silently hit undefined behavior).
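
One safe way around it (not necessarily what the library ends up doing) is to never reinterpret the buffer at all, and copy each byte pair out instead, since reading individual u8s has no alignment requirement:

fn letters_of(word: &str) -> impl Iterator<Item = Letter> + '_ {
    // chunks_exact(2) assumes the input consists only of two-byte characters
    word.as_bytes()
        .chunks_exact(2)
        .map(|pair| Letter { utf8: [pair[0], pair[1]] })
}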

Enum values

Rust is incredibly good at optimizing enums and structs! Seriously! If, for example, an enum's values range from 0 to 15, and you're matching on them, returning other values from 0 to 15, Rust's compiler will transform code like this:

pub enum AnyStress {
    A, Ap, B, Bp, C, Cp, Cpp,
    D, Dp, E, Ep, F, Fp, Fpp,
}

impl AnyStress {
    pub const fn unprime(self) -> Self {
        match self {
            Self::A | Self::Ap => Self::A,
            Self::B | Self::Bp => Self::B,
            Self::C | Self::Cp | Self::Cpp => Self::C,
            Self::D | Self::Dp => Self::D,
            Self::E | Self::Ep => Self::E,
            Self::F | Self::Fp | Self::Fpp => Self::F,
        }
    }
}

Into something like this:

pub const fn unprime(self) -> Self {
    // Each nibble of LOOKUP holds the result for one discriminant,
    // so the whole match becomes a single shift-and-mask.
    const LOOKUP: u64 = 0x0636543216543210;
    let bits = (LOOKUP >> (self as u64 * 4)) & 0xF;
    // The compiler knows `bits` is always a valid discriminant.
    unsafe { std::mem::transmute(bits as u8) }
}

This optimization blew my mind. In C#, enums are simply wrappers over integers (defaulting to signed 32-bit), and I never even imagined that a compiler could do something like that! Now that I'm aware of this kind of optimization, I'm gonna use it everywhere I can, in C# and in any other language, too.

Niche optimizations

Another incredible thing about Rust is that Option<T> doesn't have to be larger than T. If T is an enum with values ranging from 0 to 15, then Option::None simply becomes value 16. And you can keep nesting Options as much as you want, until all of the byte's 256 possible values are exhausted.

In C#, on the other hand, the Nullable<T> struct is always at least 1 byte larger than T (and often more, once alignment padding kicks in), since it needs to separately store a bool hasValue field in there. So even something as simple as bool? uses two entire bytes instead of one.

Also, references in Rust are always considered to be non-null, avoiding unnecessary runtime null checks! That also means Option<&T> can be the same size as &T, with Option::None occupying the value 0/null.
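
Both niches are easy to observe with size_of (using a toy enum for demonstration, not one of the library's types):

use std::mem::size_of;

// 16 variants occupy the discriminants 0..=15, leaving 16..=255 as niches.
enum Hex { H0, H1, H2, H3, H4, H5, H6, H7, H8, H9, HA, HB, HC, HD, HE, HF }

fn main() {
    assert_eq!(size_of::<Hex>(), 1);
    assert_eq!(size_of::<Option<Hex>>(), 1);         // None is stored as 16
    assert_eq!(size_of::<Option<Option<Hex>>>(), 1); // the outer None as 17
    assert_eq!(size_of::<Option<&u8>>(), size_of::<&u8>()); // None is null
}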

And another thing: Rust is so much more aggressive at inlining! C#'s JIT draws a line at a certain code size or complexity, while Rust inlines practically every small function within a crate (and, with #[inline] or LTO, across crates too). I can rest assured that all of the tiny helper functions I've written are inlined as expected!

Const evaluation

When I learned about const fns, I immediately started doing a bunch of stuff with them, like parsing and formatting my structs and enums from and into stack memory, and storing results of complex computations as constants. I wanted to make everything "const-eval"able.
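
For a taste of it, here's a toy sketch (reusing AnyStress from above, not the library's real API) that parses a stress schema entirely at compile time:

const fn parse_stress(s: &str) -> AnyStress {
    match s.as_bytes() {
        [b'a'] => AnyStress::A,
        [b'a', b'\''] => AnyStress::Ap,
        [b'b'] => AnyStress::B,
        /* ... */
        _ => panic!("unknown stress schema"),
    }
}

// A typo here becomes a compile-time error instead of a runtime one:
const SCHEMA: AnyStress = parse_stress("a'");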

Unfortunately (for the project), as I was writing the library, Rust's team kept updating and improving const-eval on the nightly version, introducing const traits and impls. So, hopping from one nightly to another, I had to keep adjusting the code for everything to still work and look nice, and that distracted me from actually working on the library.

But, check this out! Instead of manually compiling a table of all the endings and then encoding the values through some hand-written script (that wouldn't even be included with the source), const-eval, along with macros, lets you write something like this:

// All endings of nouns, adjectives and pronouns in one 54-char slice
const ENDINGS: &[u8] = "аямимиееговымихемуюьююыевяяхамийогойоейомуыхыйёвахёйём".as_bytes();

// [case:6] [number:2] [gender:3] [stem type:8] = [total:288]
// (`null` is the empty ending; it and `encode_ending` are defined elsewhere)
#[rustfmt::skip]
pub(crate) const NOUN_LOOKUP: [(u8, u8); 288] = [
    //    stem types: 1,    2,   3,    4,    5,    6,   7,   8
    /* nom sg masc */ null, ь,   null, null, null, й,   й,   ь,
    /* nom sg n    */ о,    е_ё, о,    е_о,  е_о,  е_ё, е_ё, о,
    /* nom sg fem  */ а,    я,   а,    а,    а,    я,   я,   ь,
    //    stem types: 1, 2, 3, 4, 5, 6, 7, 8
    /* nom pl masc */ ы, и, и, и, ы, и, и, и,
    /* nom pl n    */ а, я, а, а, а, я, я, а,
    /* nom pl fem  */ ы, и, и, и, ы, и, и, и,

    /* ... */
];

macro_rules! define_endings {
    // Simple endings: one constant per ending, encoded from its spelling.
    // `encode_ending` locates the ending inside ENDINGS at compile time.
    ($($ident:ident)*) => ($(
        const $ident: (u8, u8) = encode_ending(stringify!($ident));
    )*);
    // Paired endings like е_ё: built from two already-defined endings,
    // the unstressed and the stressed variant.
    ($($x:ident($un_str:ident, $str:ident)),* $(,)?) => ($(
        const $x: (u8, u8) = ($un_str.0, $str.0);
    )*);
}

define_endings! {
    о е ов ы ей й ё ём ой ёй а ам ами и я ям ями ем у ю ах ях ом ев ёв ь ью // nouns
    ое его ого ые ее ий ая ие ему ую юю яя ый ых ым ыми их ому им ими // pronouns, adjectives
}
define_endings! {
    // nouns
    е_ё(е, ё), е_о(е, о), и_е(и, е), ев_ёв(ев, ёв), ев_ов(ев, ов), ем_ём(ем, ём),
    ем_ом(ем, ом), ей_ёй(ей, ёй), ей_ой(ей, ой), ь_ей(ь, ей), null_ей(null, ей),
    // pronouns, adjectives
    ее_ое(ее, ое), ый_ой(ый, ой), ий_ой(ий, ой), его_ого(его, ого), ему_ому(ему, ому),
}
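
encode_ending itself isn't shown here; a minimal sketch of the idea (assuming the match's position and length are packed into one byte: 54 character positions fit in 6 bits, and no ending is longer than 3 characters, so its length fits in the remaining 2):

const fn encode_ending(ending: &str) -> (u8, u8) {
    let hay = ENDINGS;
    let needle = ending.as_bytes();
    let mut start = 0;
    while start + needle.len() <= hay.len() {
        let mut i = 0;
        while i < needle.len() && hay[start + i] == needle[i] {
            i += 1;
        }
        if i == needle.len() {
            // Offset and length in characters (every character here is 2 bytes).
            let packed = (start / 2 << 2 | needle.len() / 2) as u8;
            // A simple ending looks the same whether it's stressed or not.
            return (packed, packed);
        }
        start += 2;
    }
    panic!("ending not found in ENDINGS");
}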

All of the endings are computed and encoded at compile time, resulting in zero runtime overhead and maximum clarity in the source code! You could even simplify it further by using a transparent wrapper type over (u8, u8) and const-implementing the Div trait for it, allowing you to write е/ё instead of е_ё (which is what I'm planning to do in the next iteration of the library).
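
A sketch of that idea, assuming nightly's const_trait_impl feature (whose exact syntax has been shifting between versions):

#![feature(const_trait_impl)]
use std::ops::Div;

#[derive(Clone, Copy)]
#[repr(transparent)]
struct Ending(pub (u8, u8));

impl const Div for Ending {
    type Output = Ending;
    // е / ё: the unstressed byte from the left operand, the stressed one from the right.
    fn div(self, rhs: Ending) -> Ending {
        Ending(((self.0).0, (rhs.0).1))
    }
}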

Benchmarks

After implementing noun declension in Rust, I benchmarked it and compared the results with those of the C# implementation. For a fairer comparison, I also modified GrammarSharp's code to format the word forms directly into stack memory, without allocating and copying into new strings. For the input любовь ж 8*b', the results were roughly this:

Benchmark              C#      Rust    ⚡
Parsing word & info    95 ns   60 ns   −36.8%
Declining 12 forms     335 ns  262 ns  −21.8%
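
For reference, here's roughly what such a benchmark looks like with criterion (a sketch; Noun::from_word_and_info is a hypothetical stand-in for the real parsing entry point):

use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

fn parsing(c: &mut Criterion) {
    c.bench_function("parse word & info", |b| {
        b.iter(|| Noun::from_word_and_info(black_box("любовь ж 8*b'")))
    });
}

criterion_group!(benches, parsing);
criterion_main!(benches);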

So, it's pretty good. But I think it could be better. That's why I'm gonna rewrite the library from scratch, this time accounting for all the issues I ran into during the first iteration (memory alignment, APIs, stack memory), as well as doing some more optimizations. I would also like to generalize the declension a bit, making it possible to decline both UTF-8- and UTF-16-encoded words, as well as any other data format you could need.

After properly rewriting all of it, I might finally be able to start working on numeral declension! Or maybe not. I'm not sure where exactly my life is going at the moment, but hopefully I can more or less finish this project, put it behind me, and start doing something new and cooler.

Subscribe to my blog's RSS feed to stay tuned.