Earlier today Brad Wilson and I were discussing a G+ post by Jostein Kjønigsen where he says, "see if you can spot the O(n^2) bug in this code."
public IEnumerable<Process> GetProcessesForSession(string processName, int sessionId){ var processes = Process.GetProcessByName(processName); var filtered = from p in processes where p.SessionId == sessionId select p; return filtered;}
This is a pretty straightforward method that calls a .NET BCL (Base Class Library) method and filters the result with LINQ. Of course, when any function calls another one that you can't see inside (which is basically always) you've lost control. We have no idea what's going on in GetProcessesByName.
Let's look at the source to the .NET Framework method in Reflector. Our method calls Process.GetProcessesByName(string).
public static Process[] GetProcessesByName(string processName){ return GetProcessesByName(processName, ".");}
Looks like this one is an overload that passes "." into the next method Process.GetProcessesByName(string, string) where the second parameter is the machineName.
This next one gets all the processes for a machine (in our case, the local machine) then spins through them doing a compare on each one in order to build a result array to return up the chain.
public static Process[] GetProcessesByName(string processName, string machineName){ if (processName == null) { processName = string.Empty; } Process[] processes = GetProcesses(machineName); ArrayList list = new ArrayList(); for (int i = 0; i < processes.Length; i++) { if (string.Equals(processName, processes[i].ProcessName, StringComparison.OrdinalIgnoreCase)) { list.Add(processes[i]); } } Process[] array = new Process[list.Count]; list.CopyTo(array, 0); return array;}
if we look inside GetProcesses(string), it's another loop. This is getting close to where .NET calls Win32 and as these classes are internal there's not much I can do to fix this function other than totally rewrite the internal implementation. However, I think I've illustrated that we've got at least two loops here, and more likely three or four.
public static Process[] GetProcesses(string machineName){ bool isRemoteMachine = ProcessManager.IsRemoteMachine(machineName); ProcessInfo[] processInfos = ProcessManager.GetProcessInfos(machineName); Process[] processArray = new Process[processInfos.Length]; for (int i = 0; i < processInfos.Length; i++) { ProcessInfo processInfo = processInfos[i]; processArray[i] = new Process(machineName, isRemoteMachine, processInfo.processId, processInfo); } return processArray;}
This code is really typical of .NET circa 2002-2003 (not to mention Java, C++ and Pascal). Functions return arrays of stuff and other functions higher up filter and sort.
When using this .NET API and for looping over the results several times, I'm going for(), for(), for() in a chain, like O(4n) here.
Note: To be clear, it can be argued that O(4n) is just O(n), cause it is. Adding a number like I am isn't part of the O notation. I'm just saying we want to avoid O(cn) situations where c is a large enough number to affect perf.
Sometimes you'll see nested for()s like this, so O(n^3) here where things get messy fast.
LINQ is more significant than people really realize, I think. When it first came out some folks said "is that all?" I think that's unfortunate. LINQ and the concept of "deferred execution" is just so powerful but I think a number of .NET programmers just haven't taken the time to get their heads around the concept.
Here's a simple example juxtaposing spinning through a list vs. using yield. The array version is doing all the work up front, while the yield version can calculate. Imagine a GetFibonacci() method. A yield version could calculate values "just in time" and yield them, while an array version would have to pre-calculate and pre-allocate.
public void Consumer(){ foreach (int i in IntegersList()) { Console.WriteLine(i.ToString()); } foreach (int i in IntegersYield()) { Console.WriteLine(i.ToString()); }}public IEnumerable<int> IntegersYield(){ yield return 1; yield return 2; yield return 4; yield return 8; yield return 16; yield return 16777216;}public IEnumerable<int> IntegersList(){ return new int[] { 1, 2, 4, 8, 16, 16777216 };}
Back to our GetProcess example. There's two issues at play here.
First, the underlying implementation where GetProcessesInfos eventually gets called is a bummer but it's that way because of how P/Invoke works and how the underlying Win32 API returns the data we need. It would certainly be nice if the underlying API was more granular. But that's less interesting to me than the larger meta-issue of a having (or in this case, not having) a LINQ-friendly API.
The second and more interesting issue (in my option) is the idea that the 2002-era .NET Base Class Library isn't really setup for LINQ-friendliness. None of the APIs return LINQ-friendly stuff or IEnumerable<anything> so that when you change together filters and filters of filters of arrays you end up with O(cn) issues as opposed to nice deferred LINQ chains.
When you find yourself returning arrays of arrays of arrays of other stuff while looping and filtering and sorting, you'll want to be aware of what's going on and consider that you might be looping inefficiently and it might be time for LINQ and deferred execution.
Here's a simple conversion attempt to change the first implementation from this classic "Array/List" style:
ArrayList list = new ArrayList();for (int i = 0; i < processes.Length; i++){ if (string.Equals(processName, processes[i].ProcessName, StringComparison.OrdinalIgnoreCase)) { list.Add(processes[i]); }}Process[] array = new Process[list.Count];list.CopyTo(array, 0);return array;
To this more LINQy way. Note that returning from a LINQ query defers execution as LINQ is chainable. We want to assemble a chain of sorting and filtering operations and execute them ONCE rather than for()ing over many lists many times.
if (processName == null) { processName = string.Empty; }Process[] processes = Process.GetProcesses(machineName); //stop here...can't go farther?return from p in processes where String.Equals(p.ProcessName, processName, StringComparison.OrdinalIgnoreCase) select p; //the value of the LINQ expression being returned is an IEnumerable<Process> object that uses "yield return" under the hood
static void Main(string[] args){ var myList = GetProcessesForSession("chrome.exe", 1);}public static IEnumerable<Process> GetProcessesForSession(string processName, int sessionID){ //var processes = Process.GetProcessesByName(processName); var processes = HanselGetProcessesByName(processName); //my LINQy implementation var filtered = from p in processes where p.SessionId == sessionID select p; return filtered;}private static IEnumerable<Process> HanselGetProcessesByName(string processName){ return HanselGetProcessesByName(processName, ".");}private static IEnumerable<Process> HanselGetProcessesByName(string processName, string machineName){ if (processName == null) { processName = string.Empty; } Process[] processes = Process.GetProcesses(machineName); //can't refactor farther because of internals. //"the value of the LINQ expression being returned is an IEnumerable<Process> object that uses "yield return" under the hood" (thanks Mehrdad!) return from p in processes where String.Equals(p.ProcessName == processName, StringComparison.OrdinalIgnoreCase) select p; /* the stuff above replaces the stuff below */ //ArrayList list = new ArrayList(); //for (int i = 0; i < processes.Length; i++) //{ // if (string.Equals(processName, processes[i].ProcessName, StringComparison.OrdinalIgnoreCase)) // { // list.Add(processes[i]); // } //} //Process[] array = new Process[list.Count]; //list.CopyTo(array, 0); //return array;}
This is a really interesting topic to me and I'm interested in your opinion as well, Dear Reader. As parts of the .NET Framework are being extended to include support for asynchronous operations, I'm wondering if there are other places in the BCL that should be updated to be more LINQ friendly. Or, perhaps it's not an issue at all.
Your thoughts?
Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. I am a failed stand-up comic, a cornrower, and a book author.
ProcessInfo[] a = GetProcessInfos()Process[] b = Loop_MakeProcessInfosIntoProcesses(a)Process[] c = Loop_FilterProcessesByName(b)IEnumerable<Process> d = Loop_FilterProcessesBySession(c)
We need to separate O(n^2) from O(2n) from the actual performance impact. O(2n) is O(n), because constants aren't ever part of big-O notation. So in the strictest sense, this reuse of arrays and loops is O(n). In practical terms, though, the constant does matter, because it does have an impact on performance (and memory consumption!), so using deferred execution eliminates the constant (or at least reduces it, until you run into internal APIs that you can't/don't want to re-write, as in Scott's example blog post).This has a strong parallel with the Build conference this week: everything in WinRT which can run a long time is now async only. The Windows (and Visual Studio) team realized that the problems with writing code which is and/or consumes async things were basically tooling problems. C# 5 and VB 5 added await and async, and we all win. Deferred execution is the same thing: linear code gets churned and turned into a finite state machine that implements IEnumerable; while it was possible to hand-craft enumerator implementations which did deferred execution in the past, it was really LINQ and .NET 3.5's addition of "yield return" that made it doable by mortals.
The reason is simple, when you're not the consumer of your enumeration, deferred execution bites you in the maintenance butt. The problem is people expect a method to give back a snapshot. It gets worse when the underlying source is constantly changing, the less experienced quickly introduce bugs that only occur under load as the runtime complains about enumerating a modified collection. I've seen that particular bug in production code too often. (The TPL concurrent classes have helped - by effectively snapshotting for you).
What have you trained your developers to think about the differences between IQueryable<T> and IEnumerable<T>?
static void Main (string [] args) { var items = Enumerable.Range (0, 100000000); var predicates = new Predicate<int> [] { i => (i % 5) == 0, i => (i % 3) == 0, i => (i % 2) == 0, }; Stopwatch sw = new Stopwatch (); sw.Start (); var results1 = new List<int> (); foreach (var item in items) { var fMatch = false; foreach (var predicate in predicates) { if (!predicate (item)) { fMatch = false; break; } } if (fMatch) results1.Add (item); } sw.Stop (); Console.WriteLine (sw.ElapsedMilliseconds); sw.Restart (); foreach (var predicate in predicates) { var tmp = new List<int> (); foreach (var item in items) { if (predicate (item)) tmp.Add (item); } items = tmp; } sw.Stop (); Console.WriteLine (sw.ElapsedMilliseconds); }
var result = new List<int>();foreach (var item in ints) { if (x % 2 == 0 && item < 100) { result.Add(item); }}
// first .Where():Func<int, bool> pred1 = x => x % 2 == 0;var where1 = new WhereEnumerable();where.source = ints;where.preds.Add(pred1);// second .Where():var where2 = new WhereEnumerable()Func<int, bool> pred2 = x => x < 100;// Chained Where()s will not chain their WhereEnumerables,// but they *will not* merge their predicates.where2.source = where1.source;where2.preds.AddRange(where1.preds);where2.preds.Add(pred2);// source.ToList() is simply new List(), foreach .Add()// (assuming source is not an array or IList? or was it ICollection)// However, foreach uses .MoveNext(), which for a WhereEnumerable is:bool WhereEnumerator.MoveNext() { while (source.MoveNext()) { foreach (var pred in preds) { if (!pred(source.Current)) goto next; } return true; next: } return false;}// Therefore, .ToList(), with the IEnumerable culled, is *really*:var result = new List();while (source.MoveNext()) { foreach (var pred in preds) { if (!pred) goto next; } result.Add(source.Current);next:}
// Minimum heap allocations: 8internal class HanselmanProgram{ private static void Main() { var processes = GetProcessesForSession("chrome.exe", 1); // 1: IEnumerator for 'processes' foreach (var process in processes) Console.WriteLine(process.ProcessName); } public static IEnumerable<Process> GetProcessesForSession( string processName, int sessionId) { // 1: array returned by Process.GetProcesses() var processes = Process.GetProcesses(".") // 3: IEnumerator for array, delegate, compiler-generated IEnumerable .Where(p => processName == null || p.ProcessName == processName) // 3: IEnumerator, delegate, compiler-generated IEnumerable .Where(p => p.SessionId == sessionId); return processes; }}// Minimum heap allocations: 6internal class AlternateProgram{ private static void Main() { var processes = ("chrome.exe", 1); // 1: IEnumerator for 'processes' foreach (var process in processes) Console.WriteLine(process.ProcessName); } public static Process[] GetProcessesByName( string processName, string machineName) { // 1: array returned by Process.GetProcesses() Process[] processes = Process.GetProcesses(machineName); // 2: 'list' and the array wrapped by 'list' ArrayList list = new ArrayList(); for (int i = 0; i < processes.Length; i++) { if (processName == null || processName == processes[i].ProcessName) list.Add(processes[i]); } // 1: 'results' array Process[] results = new Process[list.Count]; list.CopyTo(results, 0); return results; } // 1: Compiler-generated IEnumerable (result of rewriting the iterator method) public static IEnumerable<Process> GetProcessesForSession( string processName, int sessionId) { var processes = GetProcessesByName(processName, "."); for (var i = 0; i < processes.Length; i++) { var process = processes[i]; if (process.SessionId == sessionId) yield return process; } }}
Disclaimer: The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.