Some demos and experiments
Let's start with something easy:
//
// Duplication of references ==> &s1 == &s2
//
string s1 = "Hello World";
Console.WriteLine("s1 = {0}", s1);
Console.Write("&s1 = "); AddressOf(s1);
string s2 = s1;
Console.WriteLine("s2 = {0}", s2);
Console.Write("&s2 = "); AddressOf(s2);
Console.WriteLine("(&s2 == &s1) == {0}\n", Object.ReferenceEquals(s1, s2));
No surprise that s1 = s2 results in two identical addresses (or references, whatever terminology you like most). In the end, System.String is a reference type, thus assignment is pointer/address/reference assignment:
s1 = Hello World
&s1 = 01591DB8
s2 = Hello World
&s2 = 01591DB8
(&s2 == &s1) == True
So, we now have a string s1 containing "Hello World". What happens if we allocate another string with the same contents? An experiment:
//
// Another string with the same value ==> &s1 == &s3 (!)
//
string s3 = "Hello World";
Console.WriteLine("s3 = {0}", s3);
Console.Write("&s3 = "); AddressOf(s3);
Console.WriteLine("(&s3 == &s1) == {0}\n", Object.ReferenceEquals(s1, s3));
Spitting out the following:
s3 = Hello World
&s3 = 01591DB8
(&s3 == &s1) == True
This, again, is string interning at your service. What happens is that the JIT checks the table with interned strings in the current appdomain (to be precise: on the module level, see SSCLI code later) and if the string is found in the (hash)table, that reference is returned.
Now, what if we append a character to s1 and assign it back to s1? Because strings are immutable, we'll end up with a new string object (and thus a new address/reference):
//
// Strings are immutable; changing them allocates a new string ==> different address!
//
s1 += "!";
Console.WriteLine("s1 = {0}", s1);
Console.Write("&s1 = "); AddressOf(s1);
In IL, this looks like this:
IL_00fa: ldloc.2
IL_00fb: ldstr "!"
IL_0100: call string [mscorlib]System.String::Concat(string, string)
IL_0105: stloc.2
You see, there is no such thing as ldstr "Hello World!" this time. The compiler can't foresee this concatenation (would be even useless if it could in this case, because we've used the unmodified s1 object before), so a runtime String.Concat call is emitted to IL, effectively creating a new string. Allocation of this string doesn't happen using ldstr but through a so-called "framed allocation" that directly calls into the GC for memory allocation (see SSCLI code further on).
Because the newly created string (obtained through concatenation) isn't allocated through ldstr (see further for SSCLI analysis), it doesn't end up in the (hash)table of interned strings. Therefore, calling ldstr "Hello World!" subsequently won't find a match in the interned strings table and will result in a new allocation (which on its turn does end up in the interned strings table!):
//
// Another string, equal to the modified s1 string ==> &s1 != &s4
//
string s4 = "Hello World!";
Console.WriteLine("s4 = {0}", s4);
Console.Write("&s4 = "); AddressOf(s4);
Console.WriteLine("(&s4 == &s1) == {0}\n", Object.ReferenceEquals(s1, s4));
performs
IL_0124: ldstr "Hello World!"
IL_0129: stloc.s s4
and results in
s1 = Hello World!
&s1 = 015973C8
s4 = Hello World!
&s4 = 01591CC8
(&s4 == &s1) == False
How can we solve this? The answer: intern s1. Why? To reduce memory footprint caused by duplicate strings. What happens? The runtime finds s4 that already has the contents of s1 at the moment, thus the String.Intern(s1) call will return the address of s4.
Note: You might wonder what the implications are of interning and having multiple variables point to the same string in the back. Doesn't that cause problems when someone tries to change the string through one variable? The answer is of course no, because string are immutable. Changing a string isn't possible without creating another one (i.e. all instance methods on System.String return a new string instance).
//
// String interning ==> &s1 == &s4
//
Console.WriteLine("\n*** INTERNING s1 ***\n");
s1 = String.Intern(s1);
Console.WriteLine("s1 = {0}", s1);
Console.Write("&s1 = "); AddressOf(s1);
Console.WriteLine("(&s4 == &s1) == {0}\n", Object.ReferenceEquals(s1, s4));
This displays (compare the new value for &s1 with &s4 in the previous experiment):
*** INTERNING s1 ***
s1 = Hello World!
&s1 = 01591CC8
(&s4 == &s1) == True
Finally we allocate yet another string that contains "Hello World!", this time via ldstr directly:
//
// Another string, equal to the modified but interned s1 string ==> &s1 == &s5
//
string s5 = "Hello World!";
Console.WriteLine("s5 = {0}", s5);
Console.Write("&s5 = "); AddressOf(s5);
Console.WriteLine("(s5 == s1) == " + Object.ReferenceEquals(s1, s5));
which displays
s5 = Hello World!
&s5 = 01591CC8
(s5 == s1) == True
Going much deeper - enter the SSCLI
You're a geek? So you do want to know how this really works behind the scenes? The SSCLI will help you out. First download it here and extract it using some ZIP/RAR tool. Download the .vcproj file created by shawnfa too if you want some comfortable browsing through the massive code base.
Part 1 - The JITter interns strings
Let's show how strings get interned when strings are loaded in IL. It all starts by having a call to create a new string literal:
jithelpers.h
CORINFO_HELP_STRCNS, // create a new string literal
...
JITHELPER(CORINFO_HELP_STRCNS, JIT_StrCns)
This relies on the JITHELPER macro defined in jitinterface.h. Just for your reference:
jitinterface.h
// enum for dynamically assigned helper calls
enum DynamicCorInfoHelpFunc {
#define JITHELPER(code, pfnHelper)
#define DYNAMICJITHELPER(code, pfnHelper) DYNAMIC_##code,
#include "jithelpers.h"
DYNAMIC_CORINFO_HELP_COUNT
};
Back on the right track, we find that jithelpers.cpp specifies the JIT_StrCns operation:
jithelpers.cpp
HCIMPL2(Object *, JIT_StrCns, unsigned metaTok, CORINFO_MODULE_HANDLE scopeHnd)
hndStr = ConstructStringLiteral(scopeHnd, metaTok);
return *(Object**)hndStr;
This piece of code relies on the HCIMPL2 macro, defined in fcall.h ("fast call"):
fcall.h
#define FC_COMMON_PROLOG(target, assertFn) FCALL_TRANSITION_BEGIN()
#define HCIMPL_PROLOG(funcname) LPVOID __me; __me = 0; FC_COMMON_PROLOG(funcname, HCallAssert)
#define HCIMPL2(rettype, funcname, a1, a2) rettype F_CALL_CONV funcname(a1, a2) { HCIMPL_PROLOG(funcname)
That being said, on to ConstructStringLiteral which we find further in the jithelpers.cpp file:
jithelpers.cpp
OBJECTHANDLE ConstructStringLiteral(CORINFO_MODULE_HANDLE scopeHnd, mdToken metaTok)
Module* module = GetModule(scopeHnd);
if (module->HasNativeImage() && module->IsNoStringInterning())
{
return module->ResolveStringRef(metaTok, module->GetAssembly()->Parent(), true);
}
return module->ResolveStringRef(metaTok, module->GetAssembly()->Parent(), false);
As you can see, the current module's ResolveStringRef method is called, which is defined in ceeload.cpp:
ceeload.cpp
OBJECTHANDLE Module::ResolveStringRef(DWORD token, BaseDomain *pDomain, bool bNeedToSyncWithFixups)
string = (OBJECTHANDLE)pDomain->GetStringObjRefPtrFromUnicodeString(&strData);
return string;
This one calls into the appdomain implementation:
appdomain.hpp
//****************************************************************************************
// Methods to retrieve a pointer to the COM+ string STRINGREF for a string constant.
// If the string is not currently in the hash table it will be added and if the
// copy string flag is set then the string will be copied before it is inserted.
STRINGREF *GetStringObjRefPtrFromUnicodeString(EEStringData *pStringData);
The comment reveals the intentions of this long-named method and mentions the hash table used for interning. The implementation is rather straightforward too:
appdomain.cpp
STRINGREF *BaseDomain::GetStringObjRefPtrFromUnicodeString(EEStringData *pStringData)
if (m_pStringLiteralMap == NULL)
{
LazyInitStringLiteralMap();
}
return m_pStringLiteralMap->GetStringLiteral(pStringData, TRUE, !CanUnload() /* bAppDOmainWontUnload */);
And finally we end up in the "string literal map" class:
stringliteralmap.cpp
StringLiteralEntry *GlobalStringLiteralMap::GetStringLiteral(EEStringData *pStringData, DWORD dwHash, BOOL bAddIfNotFound)
HashDatum Data;
StringLiteralEntry *pEntry = NULL;
if (m_StringToEntryHashTable->GetValue(pStringData, &Data, dwHash))
{
pEntry = (StringLiteralEntry*)Data;
// If the entry is already in the table then addref it before we return it.
if (pEntry)
pEntry->AddRef();
}
else
{
if (bAddIfNotFound)
pEntry = AddStringLiteral(pStringData);
}
return pEntry;
The hashtable implementation itself is of no further interest to us.
Part 2 - String.Concat doesn't take care of interning
As I told you before, String.Concat doesn't check whether the constructed string is interned. Doing so would require an expensive hashtable lookup and consistency would demand other methods like ToUpper, ToLower, etc to do the same. An overview of the stack:
string.cs
public static String Concat(String str0, String str1)
...
String result = FastAllocateString(str0Length + str1.Length);
...
retun result;
The FastAllocateString method that's called is implemented elsewhere in unmanaged code. Therefore it's declared "extern" and decorated with a MethodImplOptions.InternalCall type of method implementation attribute as shown below:
string.cs
[MethodImplAttribute(MethodImplOptions.InternalCall)]
private extern static String FastAllocateString(int length);
The real implementation lives in ecall.cpp which contains a big mapping from various "internal call" methods to the corresponding supporting methods in the back:
ecall.cpp
ECClass gECClasses[] =
{
...
FCClassElement("String", "System", gStringFuncs)
...
};
FCFuncStart(gStringFuncs)
...
FCFuncStart(gStringFuncs)
FCDynamic("FastAllocateString", CORINFO_INTRINSIC_Illegal, ECall::FastAllocateString)
...
FCFuncEnd()
This implements so-called "fast calls" (abbreviated FCall). Ecall.h comes to a rescue for dynamically assigned FCalls as shown below:
ecall.h
#define DYNAMICALLY_ASSIGNED_FCALLS() \
DYNAMICALLY_ASSIGNED_FCALL_IMPL(FastAllocateString, FramedAllocateString) \
The end is near. Now we end up in a JIT helper, just as we did before in part 1. However, this time the implementation calls into SlowAllocateString stuff.
jithelpers.cpp
HCIMPL1(StringObject*, FramedAllocateString, DWORD stringLength)
...
result = SlowAllocateString(stringLength+1);
result->SetStringLength(stringLength);
return((StringObject*) OBJECTREFToObject(result));
The end of our journey: the GC implementation. Never far away in .NET as you can see. Basically, the gcscan piece of it is asked to allocate a string:
gcscan.cpp
STRINGREF SlowAllocateString( DWORD cchArrayLength )
...
orObject = (StringObject *)Alloc( ObjectSize, FALSE, FALSE );
...
return( ObjectToSTRINGREF(orObject) );
The Alloc method (calling into the heap allocation mechanism and possibly triggering a garbage collection) would lead us too far, outside the scope of our current investigation. There's much more in here (masked by the ellipses ...) such as method table initialization and profiling stuff. Check it out if you got curious :-).
Part 3 - How String.Intern works
In one of our experiments we made an explicit call to String.Intern. Let's see how this one works now:
string.cs
public static String Intern(String str) {
if (str==null) {
throw new ArgumentNullException("str");
}
return Thread.GetDomain().GetOrInternString(str);
}
appdomain.cpp
STRINGREF *BaseDomain::GetOrInternString(STRINGREF *pString)
{
if (m_pStringLiteralMap == NULL)
{
LazyInitStringLiteralMap();
}
_ASSERTE(m_pStringLiteralMap);
return m_pStringLiteralMap->GetInternedString(pString, TRUE, !CanUnload() /* bAppDOmainWontUnload */);
}
And finally back to the "string literal map" class as in part 1. Oh yes, the m_pStringLiteralMap type is:
// The AppDomain specific string literal map.
AppDomainStringLiteralMap *m_pStringLiteralMap;
Other resources
Other articles on interning include a post on Chris Brumme's weblog and a VB2TheMax article on DevX by Francesco Balena. Even Wikipedia knows the concept. So why wouldn't you too? I hope to have wet your SSCLI appetite too by means of this article.
Enjoy your internal string kitchen :-)