Lost Password? No account yet? Register
Member Area

Software Installation Technologies

May 17th

Home arrow Community arrow External Blogs arrow Scripting

Scripting
Fabulous Adventures In Coding
Eric Lippert's Blog

  • Reading Code Over the Telephone

    In my youth I once attended a lecture given by Brian Kernighan on the subject of code quality, which was very influential on my attitudes towards writing legible code. One of the things that Kernighan recommended was to endeavour write code that was so clear that it could be easily read over the phone and understood. Most people find code much easier to comprehend when read than when heard; if you can make it clear enough to be understood when heard, it's probably pretty clear code.

    I was reminded of this when I got the following question from a colleague by email:

    Subject: Stupid C# 3.0 lambda expression question

    How does one read the => operator?

    First off, I told my colleague that there are no stupid questions, only stupid people. No stupid people work here, so don't stress about it. This is a perfectly sensible question.

    As far as I know, we do not have an "official" line on how to read this operator over the phone. In the absense of any other context, I personally would say c=>c+1 as "see goes to see plus one". Some variations that I've heard:

    For a projection, (Customer c)=>c.Name: "customer see becomes see dot name"

    For a predicate, (Customer c)=>c.Age > 21: "customer see such that see dot age is greater than twenty-one"

    An unfortunate conflation is that the => operator looks a lot like ?, the "implies" operator in mathematics. Since => does not have the same semantics as ?, it is probably a bad idea to read => as "implies". (x?y would have the semantics of !x|y in C#.)

    Incidentally, it is a little known fact that VB6 and VBScript implemented the ? operator with the Imp keyword and the ? operator with the Eqv keyword. They disappeared in VB.NET. Where did they go? It is a mystery!



  • Mutating Readonly Structs

    Consider this program which attempts to mutate a readonly mutable struct. What happens?

    struct Mutable {
        private int x;
        public int Mutate() {
            this.x = this.x + 1;
            return this.x;
        }
    }

    class Test {
        public readonly Mutable m = new Mutable();
        static void Main(string[] args) {
            Test t = new Test();
            System.Console.WriteLine(t.m.Mutate());
            System.Console.WriteLine(t.m.Mutate());
            System.Console.WriteLine(t.m.Mutate());
        }
    }

    There are a number of things this program could do. Does it:

    1) Print 1, 2, 3 -- because m is readonly, but the "readonly" only applies to m, not to its contents.
    2) Print 0, 0, 0 -- because m is readonly, x cannot be changed. It always has its default value of zero.
    3) Throw an exception at runtime, when the attempt is made to mutate the contents of a readonly field.
    4) Do something else

    ?

    People are frequently surprised to learn that the answer is (4). In fact, this prints 1, 1, 1. 

    Why?

    Because, remember, accessing a value type gives you a COPY of the value. When you say t.m, you get a copy of whatever is presently stored in m. m is immutable, but the copy is not. The copy is then mutated, and the value of x in the copy is returned. But m remains untouched.

    The relevant section of the specification is 7.5.4, which states that when resolving "E.I" where E is an object and I is a field...

    ...if the field is readonly and the reference occurs outside an instance constructor of the class in which the field is declared, then the result is a value, namely the value of the field I in the object referenced by E.

    The important word here is that the result is the value of the field, not the variable associated with the field. Readonly fields are not variables outside of the constructor. (The initializer here is considered to be inside the constructor; see my earlier post on that subject.)

    Great. What about that second dot, as in ".Mutate()"?  We look at section 7.4.4 to find out how to invoke E.M():

    If E is not classified as a variable, then a temporary local variable of E's type is created and the value of E is assigned to that variable. E is then reclassified as a reference to that temporary local variable. The temporary variable is accessible as this within M, but not in any other way. Thus, only when E is a true variable is it possible for the caller to observe the changes that M makes to this.

    And there you go. Value semantics are tricky!

    This is yet another reason why mutable value types are evil. Try to always make value types immutable.



  • Trivial Projections Are (Usually) Optimized Away

    OK, computers aren't entirely dumb when it comes to LINQ. Here's an example of a place where we're a bit smarter.

    Consider the following query:

    IEnumerable<int> query = from n in number_array orderby n select n;

    Does this get transformed by the compiler into

    IEnumerable<int> query = number_array.OrderBy(n => n);

    or

    IEnumerable<int> query = number_array.OrderBy(n => n).Select( n => n );

    ?

    This question caused tremendous debate amongst the members of the C# design and implementation teams. I am quite happy with the compromise we achieved.

    The sections of the C# specification which cover this topic are sections 7.15.2.3, and 7.15.2.5:

    ***

    A query expression of the form ?from x in e select v? is translated into ?( e ) . Select ( x => v )? except when v is the identifier x, the translation is simply ?( e )?

    A query expression of the form ?from x in e select x? is translated into ?( e ) . Select ( x => x )?.

    A degenerate query expression is one that trivially selects the elements of the source. A later phase of the translation removes degenerate queries introduced by other translation steps by replacing them with their source. It is important however to ensure that the result of a query expression is never the source object itself, as that would reveal the type and identity of the source to the client of the query. Therefore this step protects degenerate queries written directly in source code by explicitly calling Select on the source. It is then up to the implementers of Select and other query operators to ensure that these methods never return the source object itself.

    ***

    It might sound like the second statement contradicts the first. The first statement says that the ?.Select(x=>v)? is omitted when ?v? is ?x?, and the second statement says that no, the ?.Select(x=>x)? is in fact generated.

    The spec is not particularly clear on this point, I agree, though the explanatory text about ?degenerate query expressions? does help clear it up somewhat. Basically what we are saying here is that the optimization to remove degenerate selects is only performed on the result of an earlier query translation.

    Some examples will help.

    If you have from n in number_array orderby n select n then this is first translated into from n in (number_array).OrderBy(n=>n) select n and then, since the selector is the same as the range variable, this is translated into ((number_array).OrderBy(n=>n)).

    We optimize away the Select(n=>n) because from n in (number_array).OrderBy(n=>n) select n is the intermediate result of an earlier query translation.

    If you had from n in number_array select n then this would be considered a degenerate query expression requiring a Select. We cannot simply translate it into (number_array) because then you could assign that to a variable and have referential identity with the collection. The result of the query should not ?leak? information about the collection out; it should just be an enumerable object of the appropriate type, and not have an identity relationship. Therefore, we take the performance hit and protect the identity of the collection by emitting a degenerate query translation (number_array).Select(n=>n).

    This is important because the collection might be a mutable collection. One does not expect that the result of a query over that collection could be used to mutate the collection! The query results should be read-only.

    We do not need to append the degenerate Select(n=>n) when you have an ?orderby? in there because the OrderBy(n=>n) call already protects the identity of the original collection. This saves on the performance cost of the no-op Select.

    I think we?ve made a reasonable compromise in the case of degenerate queries. If you are crazy enough to write ?from x in xs select x? when you mean ?(xs)?, then you?ll have to live with the decreased performance. But if you are introducing a degenerate select at the end of a meaningful query, then it will be optimized away.



  • Computers are dumb

     

    A few short takes today, from questions I've received recently about LINQ in C# 3.0.

    The first question was "in the following code, does it really check every single non-negative integer, or does it use the knowledge that once you're beyond ten, you can stop iterating?"

    var smallNumbers = Enumerable.Range(0, int.MaxValue).Where(n => n < 10);
    foreach (int i in smallNumbers)
        Console.WriteLine(i);

    The former. You asked LINQ to Objects to apply a predicate to a sequence and that's what it does. If the sequence has two billion elements in it, it runs the predicate two billion times.

    You certainly could build your own range object which had a Where method that took an expression tree. Your Where method could then analyze the range and the expression tree in an attempt to build a new range object that iterated a smaller range. Analyzing simple predicates would be pretty easy; more complex predicates would have to be turned back into delegates.

    Of course, if it hurts when you do that, don't do that. if you already know the range you want, just restrict it in the range constructor in the first place!

    The second question was "which is the preferred coding style?" :

    if (0 != customers.Count()) ...

    vs

    if (customers.Any()) ...

    On efficiency grounds alone the latter is far preferable. Suppose you have a box which may contain between zero and two billion pennies. You want to take some action if there are any pennies in the box, but you don't care how many there are, only whether there are zero or more. In that scenario, clearly it makes no sense whatsoever to count them all. The Count method doesn't know that all you care about is whether it's going to return zero or not. You asked it to count, so it counts.

    (I am reminded of an old joke: on a bank teller's first day, he is asked to count a stack of twenties and verify that there are one hundred in the stack. He starts counting them from one pile into another: one, two, three, ... but when he gets to fifty, he stops and says to his boss "I'll stop here -- if it's right all the way up to fifty, odds are good it's right the whole way.")

    Even leaving the obvious efficiency concern aside, the latter code reads better. It more clearly reflects the intention of the programmer.

    The third question was "when iterating through a collection, I noticed that the former code is much faster than the latter code. Any insight as to why?"

    myEventLogEntryCollection.CopyTo(myEntriesArray,0);
    for(int i=0; i < myEntriesArray.Length, i++) {...}

    vs

    foreach(EventLogEntry myEntry in myEventLogEntryCollection) {...}

    This made no sense to me. Sure, there will be differences. The former code is using more memory because it makes a copy of the collection; initializing that memory is going to take significant time if the collection is large, but I could see how maybe the array access could be faster than the collection iterator in the loop. Perhaps the improvement in iteration is fast enough that for a large collection, the larger initialization cost was made up. But the user was reporting "much faster", not "slightly faster". I asked for more details.

    What exactly were the timings? Three seconds vs sixteen seconds. Holy goodness! I had figured we were talking about microseconds of difference here, not actual humans sitting around watching the clock time. What on earth was the user doing? Accessing event logs from a remote machine over a slow network.

    Aha! Now it is clear what is going on here. The iteration time is being dominated by the round trip to the remote machine; the former code makes one very expensive remote call that takes three seconds. The latter code makes hundreds of cheaper remote calls, possibly several per loop iteration, that take a fraction of a second each but which add up to more expense in the long run.

    My "insights" were therefore:

    • You can sometimes trade increased memory use for decreased time. 
    • The cost of fetching n items by making m requests might be dominated by m, not n.
    • If you don't actually describe the relevant features of the problem, it is impossible to get a correct analysis.

    I blogged about these three questions not because I got them all in the same day, but because there is a common thread to the answers: computers are dumb. They do exactly what you tell them. If you want performance savings by eliminating unnecessary work through logical deductions about what work can be eliminated, you're going to have to be the one to do make those deductions!



  • Covariance and Contravariance, Part Eleven: To infinity, but not beyond

    UPDATE: Andrew Kennedy, author of the paper I reference below, was good enough to point out some corrections and omissions, which I have addressed. Thanks Andrew!

    As I've discussed at length in this space, we are considering adding covariance and contravariance on delegate and interface types parameterized with reference types to a hypothetical future version of C#. (See my earlier articles in this series for an explanation of the proposed feature.)

    Variance is highly useful in mainstream cases, but exposes a really irksome problem in certain bizarre corner cases. Here's just one example.

    Consider a "normal" interface contravariant in its sole generic type parameter, and a "crazy" invariant interface which extends the normal interface in a strange way:

    public interface IN<in U> {}
    public interface IC<X> : IN<IN<IC<IC<X>>>> {}

    This is a bit weird, but certainly legal.

    Before I go on, I want to digress and talk a bit about why this is legal. Most people when they first see such a beast immediately say "but an interface cannot inherit from itself, that's an illegal circularity in the inheritance chain!"

    First off, no, that is not correct. Nowhere does the C# specification make this kind of inheritance illegal, and in fact, a weak form of it must be legal. You want to be able to say:

    interface INumber<X> : IComparable<INumber<X>> { ... }

    that is, you want to be able to express that one of the guarantees of the INumber<X> contract is that you can always compare one number with another one. Therefore, it must be legal to use a type's name in a type argument of a generic base type.

    However, all is not rosy. This particularly gross kind of inheritance that I give as an example is in fact illegal in the CLR, even though it is not illegal in C#. This means that it is possible to have the C# compiler generate an interface type which then cannot be loaded by the CLR. This unfortunate mismatch is troubling, and I hope in a future version of C# to make the type definition rules of C# as strict or stricter than those of the CLR. Until then, if it hurts when you do that, don't do it.

    Second, unfortunately, the C# compiler presently has numerous bugs in its cycle detector such that sometimes things which kinda look like cycles but are in fact not cycles are flagged as cycle errors. This just makes it all the more difficult for people to understand what is a legal cycle and what isn't. For example, the compiler today will incorrectly report that this is an illegal base class cycle, even though it clearly is not:

    public class November<T> {}
    public class Romeo : November<Romeo.Sierra.Tango> {
       public class Sierra {
           public class Tango {}
        }
    }

    I have devised a new (and I believe correct!) cycle detection algorithm implementation, but unfortunately it will not make it into the service release of the C# 3 compiler. It will have to wait for a hypothetical future release. I hope to address the problem of bringing the legal type checker into line with the CLR at the same time.

    Anyway, back to the subject at hand: crazy variance. We have the interfaces defined as above, and then give the compiler a little puzzle to solve:

    IC<double> bar = whatever;
    IN<IC<string>> foo = bar;  // Is this assignment legal?

    I am about to get into a morass of impossible-to-read generic names, so to make it easier on all of us, I am going to from now on abbreviate IN<IC<string>> as NCS. IC<double> will be abbreviated as CD. You get the idea I'm sure.

    Similarly, I will notate "is convertible to by implicit reference conversion" by a right-pointing arrow. So the question at hand is true or false: CD?NCS ?

    Well, let?s see. Clearly CD does not go to NCS directly. But (the compiler reasons) maybe CD?s base type does.

    CD?s base type is NNCCD. Does NNCCD?NCS? Well, N is contravariant in its parameter so therefore this boils down to the question, does CS?NCCD ?

    Clearly not directly. But perhaps CS has a base type which goes to NCCD. The base type of CS is NNCCS. So now we have the question does NNCCS?NCCD ?

    Well, N is contravariant in its parameter, so this boils down to the question does CCD?NCCS ?

    Let?s pause and reflect a moment here.

    The compiler has ?reduced? the problem of determining the truth of CD?NCS to the problem of determining the truth of CCD?NCCS! If we keep on ?reducing? like this then we?ll get to CCCD?NCCCS, CCCCD?NCCCCS, and so on.

    I have a prototype C# compiler which implements variance ? if you try this, it says ?fatal error, an expression is too complex to compile?.

    I considered implementing an algorithm that is smarter about determining convertibility; the paper I reference below has such an algorithm. (Fortunately, the C# type system is weak enough that determining convertibility of complex types is NOT equivalent to the halting problem; we can find these bogus situations both in principle and in practice. Interestingly, there are type systems in which this problem is equivalent to the halting problem, and type systems for which the computability of convertibility is still an open question.) However, given that we have many other higher priorities, it?s easier to just let the compiler run out of stack space and have a fatal error. These are not realistic scenarios from which we must sensibly recover.

    This is just a taste of some of the ways that the type system gets weird. To get a far more in-depth treatment of this subject, you should read this excellent Microsoft Research paper




Visitors: 501134

Extended Menu