String Interning in Python
Every programming language aims to be performant in its niche and achieving superior performance requires a bunch of compiler and interpreter level optimizations. Since character Strings are an integral part of any programming language, having the ability to perform string operations quickly elevates the overall performance.
In this essay, we dive deep into Python internals and find out how Python makes its interpreter performant using a technique called String Interning. This essay not only aims to put forth Python internals but also aims to make the reader comfortable in navigating through Python's source code; so expect a lot of code snippets taken from CPython.
String Interning
String Interning is a compiler/interpreter optimization method that makes common string processing tasks space and time efficient by caching them. Instead of creating a new copy of string every time, this optimization method dictates to keep just one copy of string for every appropriate immutable distinct value and use the pointer reference wherever referred.
The single copy of each string is called its intern and hence the name String Interning. The lookup of string intern, may or may not be exposed as a public interfaced method. Modern programming languages like Java, Python, PHP, Ruby, Julia, and many more, performs String Interning to make their compilers and interpreters performant.
Why should Strings be interned?
String Interning speeds up string comparisons. Without interning if we were to compare two strings for equality the complexity of it would shoot up to O(n)
where we examine every character from both the strings to decide their equality. But if the strings are interned, instead of checking every character, equal strings will have the same object reference so just a pointer quality check would be sufficient to say if two string literals are equal. Since this is a very common operation, this is typically implemented as a pointer equality test, using just a single machine instruction with no memory reference at all.
String Interning reduces the memory footprint. Instead of filling memory with redundant String objects, Python optimizes memory footprint by sharing and reusing already defined objects as dictated by the flyweight design pattern.
String Interning in Python
Just like most other modern programming languages, Python also does String Interning to gain a performance boost. In Python, we can find if two objects are referring to the same in-memory object using the is
operator. So if two string objects refer to the same in-memory object, the is
operator yields True
otherwise False
.
We can use this particular operator to test which all strings are interned and which are not. In CPython, String Interning is implemented through the following function, declared in unicodeobject.h and defined in unicodeobject.c.
In order to check if a String is interned, CPython implements a macro named PyUnicode_CHECK_INTERNED
, again defined in unicodeobject.h. The macro suggests that the Python maintains a member named interned
in PyASCIIObject
structure whose value suggests if the corresponding String is interned or not.
Internals of String Interning
In CPython, the String references are stored, accessed, and managed using a Python dictionary named interned
. This dictionary is lazily initialized upon the first String Intern invocation and holds the reference to all the interned String objects.
Interning the String
The core function responsible for interning the String is named PyUnicode_InternInPlace
defined in unicodeobject.c that upon invocation lazily builds the main dictionary interned
to hold all interned strings and then registers the object into it with the key and the value both set as the same object reference. The following function snippet shows the String Interning process as implemented in Python.
Cleanup of Interned Strings
The cleanup function iterates over all the Strings held in the interned
dictionary, adjusts the reference counts of the object, and marks them as NOT_INTERNED
allowing them to be garbage collected. Once all the strings are marked as NOT_INTERNED
, the interned
dictionary is cleared and deleted. The cleanup function is defined in unicodeobject.c by the name _PyUnicode_ClearInterned
.
String Interning in Action
Now that we understand the internals of String Interning and Cleanup, we find out what all Strings are interned in Python. To discover the spots all we do is grep for the function invocation for PyUnicode_InternInPlace
in the CPython source code and peek at the neighboring code. Here is a list of interesting spots where String Interning happens in Python.
Variables, Constants, and Function Names
CPython performs String Interning on constants such as Function Names, Variable Names, String Literals, etc. Following is the snippet from codeobject.c that suggests that when a new PyCode
object is created the interpreter is interning all the compile-time constants, names, and literals.
Dictionary Keys
CPython also interns thee Strings which keys of any dictionary object. Upon putting an item in the dictionary the interpreter String Interning on the key against which item is stored. The following code is taken from dictobject.c showcasing the exact behavior.
Fun Fact: There is a comment next to the PyUnicode_InternInPlace
function call that suggests if we really need to intern all the keys in all the dictionaries.
Attributes of any Object
Objects in Python can have attributes that can be explicitly set using setattr
function or are implicitly set as part of Class members or as pre-defined functions on data types. CPython interns all these attribute names, so as to make lookup blazing fast. Following is the snippet of the function PyObject_SetAttr
responsible for setting a new attribute to a Python object, as defined in the file object.c.
Explicit Interning
Python also allows explicit String Interning through the function intern
defined in sys
module. When this function is invoked with any String object, the provided String is interned. Following is the code snippet from the file sysmodule.c that shows String Interning happening in sys_intern_impl
.
Extra nuggets on String Interning
Only compile-time strings are interned. Strings that are specified during interpretation or compile-time are interned while dynamically created strings are not.
Strings having ASCII letters and underscores are interned. During compile time when string literals are observed for interning, CPython ensures that it only interns the literals matching the regular expression [a-zA-Z0-9_]*
as they closely resemble Python identifiers.
References
Other essays you might like
If you liked what you read, consider subscribing to my weekly newsletter at arpitbhayani.me/newsletter where, once a week, I write an essay about programming languages internals, or a deep dive on some super-clever algorithm, or just a few tips on building highly scalable distributed systems.
You can always find me browsing through Twitter @arpit_bhayani.
If my work adds value, consider supporting me