The Underestimated Log in Software Systems, Part I

If you go through recent OSDI and SOSP proceedings, you will find a very productive professor, Ding Yuan of the University of Toronto, who has published two papers at almost every conference. Most of his papers focus on logs in software. I began to wonder why logs are so attractive that people can publish so many papers about them. At first I wanted to write one summary covering all of his log-related papers, but that would contain too much. So this is part one, which covers only the work on single-machine software; the research on distributed-system logs is in part two.

When I first learned to write code, the question I asked my teacher most often was: can you help me debug my toy program? My teacher would ask me the meaning of every variable and print them out in the middle of the program to find the bug. In large software, developers also print things that are not the final result; this output is called the log. The only purpose of a log is debugging.

Debugging is hard for three reasons: missing inputs, missing runtime libraries, and timing (some bugs only show up under a particular schedule or interleaving). This is a general challenge, and the only practical solution may be to throw more human effort at it.

To understand more about logs, Prof. Yuan has four papers: SherLog (ASPLOS'10), LogEnhancer (ASPLOS'11), Errlog (OSDI'12), and a characterization study (ICSE'12). Improvements to debugging are very hard to quantify; I feel that if I were the author, I would definitely be stuck on the evaluation part.

A log entry indicates that the statement printing it was executed, and it may also carry printed values that describe the context. So the question is: can we reconstruct the execution path and the runtime environment, as far as possible, from the logs alone? That is exactly what SherLog does. It first finds the most likely execution path and then infers the values in memory along that path for better debugging. SherLog maps each log entry to its potential source-code locations, then searches for the execution path that can print the longest sequence of log entries. The interesting trick is that mapping a log entry to a source location is keyword-first: for example, the entry "No file found" could be matched either by a near-wildcard format string (effectively the regular expression ".*") or by an error-code string such as error(0) whose text is exactly "No file found". It should match error(0), because "No file found" is a distinctive keyword. Trying all the candidate paths and selecting a good one is also clever. My first instinct was that it is very hard to reconstruct the execution path directly from logs; it turns out it is better to start from the execution path. The printed values then help to infer more values through data-flow analysis.
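To make the keyword-first idea concrete, here is a minimal sketch of the disambiguation step. This is my own toy code, not SherLog's implementation: it scores printf-style candidates against a runtime message, where each literal character that must match adds a point and a %-conversion consumes text greedily, so the exact string beats a near-wildcard format.

```c
#include <stdio.h>
#include <string.h>

/* Score how well a printf-style format string explains a log message:
 * each literal character that matches adds one point, and a %-conversion
 * consumes text up to the next literal character. Returns -1 when the
 * literals cannot be aligned with the message at all. */
static int match_score(const char *fmt, const char *msg) {
    int score = 0;
    while (*fmt) {
        if (fmt[0] == '%' && fmt[1]) {
            char next = fmt[2];         /* literal following the conversion */
            fmt += 2;
            if (next == '\0')
                msg += strlen(msg);     /* conversion at the end eats the rest */
            else
                while (*msg && *msg != next)
                    msg++;              /* crude, non-backtracking skip */
        } else {
            if (*msg != *fmt)
                return -1;
            score++; fmt++; msg++;
        }
    }
    return *msg == '\0' ? score : -1;
}

int main(void) {
    const char *msg = "No file found";
    const char *candidates[] = { "%s", "No %s found", "No file found" };
    int best = 0, best_score = -1;
    for (int i = 0; i < 3; i++) {
        int s = match_score(candidates[i], msg);
        if (s > best_score) { best_score = s; best = i; }
    }
    /* prints: best candidate: "No file found" (score 13) */
    printf("best candidate: \"%s\" (score %d)\n", candidates[best], best_score);
    return 0;
}
```

The real analysis is of course much richer (it works on the program's actual format strings and error-code tables, and must check path feasibility), but the preference is the same: the candidate that pins down the most distinctive constant text wins.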

These reconstructed execution paths and environments MAY help reproduce bugs. In which situations do they fail? One is that the printed values are not enough; another is that the closest log entry is still far from the root cause. In other words, we need more variables in each log entry (LogEnhancer) and more log entries (Errlog).

LogEnhancer tries to put more variables into each log entry. The question is which variables are valuable to log. LogEnhancer defines a valuable variable as one whose value is not determined by the log entry itself but affects the conditions on the path reaching the log point. For every assigned variable it builds a conditional expression, like a switch statement C1:V1, C2:V2, .... Then, given a log entry, it knows the conditions for reaching that log point were satisfied, and it filters out all the variables whose values are thereby determined. The remaining uncertain variables are the ones that should be printed. There may still be too many uncertain variables, so LogEnhancer has two more rules to filter out the useless ones, which are explained in the paper.
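Here is a hypothetical example of what that buys (the function and helper names are mine, not from the paper). Reaching the log point proves the failure condition held, but not which branch produced it; those still-uncertain values are exactly what LogEnhancer would append.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helpers, assumed to set errno and return -1 on failure. */
int write_sync(int fd, const void *buf, size_t len);
int write_async(int fd, const void *buf, size_t len);

/* Reaching the log point below proves ret < 0 held (a determined fact),
 * but not which branch computed ret: `mode`, `ret`, and `errno` remain
 * uncertain, so they are the kind of values LogEnhancer would add. */
int do_write(int fd, const void *buf, size_t len, int mode) {
    int ret;
    if (mode == 1)
        ret = write_sync(fd, buf, len);
    else
        ret = write_async(fd, buf, len);
    if (ret < 0) {
        /* original message was just "write failed"; enhanced version: */
        fprintf(stderr, "write failed (mode=%d, ret=%d, errno=%d)\n",
                mode, ret, errno);
        return -1;
    }
    return 0;
}
```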

The first half of the Errlog paper is a study that draws the conclusion that most errors in software are actually caught; but since there is no log statement in the error handler, the failures are hard to reproduce. It is quite interesting that Errlog did a bug study while LogEnhancer did not; I feel it would not be difficult to show that most hard-to-reproduce errors are likewise caused by missing logged variables. Anyway, once we accept the conclusion, the next step is to reach the location where the error is caught and log it. Errlog analyzes all the conditions and figures out the patterns of error checks. These include trivial checks (non-NULL checks) and indirect checks (tmp = malloc(); check tmp later). It does not apply machine learning to find the patterns; it just defines several simple, easy-to-check rules. This is also the only paper of the four that conducts a user study to measure the improvement in debugging.
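To illustrate the two patterns with a toy example of my own (not code from the paper): a direct non-NULL check of a library call, and an indirect check through a temporary. In both cases the handler was originally silent, and the fprintf lines are the kind of statement Errlog would insert.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Indirect check: the result of malloc() is stored in tmp and tested. */
char *copy_string(const char *s) {
    char *tmp = malloc(strlen(s) + 1);
    if (tmp == NULL) {
        /* previously silent handler; Errlog-style inserted log: */
        fprintf(stderr, "malloc of %zu bytes failed at %s:%d\n",
                strlen(s) + 1, __FILE__, __LINE__);
        return NULL;
    }
    strcpy(tmp, s);
    return tmp;
}

/* Direct non-NULL check of a library call's return value. */
int open_config(const char *path) {
    FILE *f = fopen(path, "r");
    if (f == NULL) {
        fprintf(stderr, "fopen(\"%s\") failed at %s:%d\n",
                path, __FILE__, __LINE__);
        return -1;
    }
    fclose(f);
    return 0;
}
```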

Last is the history of logs: the ICSE'12 study of how log statements change over time. The data shows that logs speed up debugging by 2.2x, and that people hardly ever delete existing log statements. Most modifications to a log statement add or change the logged variables, the static content, or the verbosity level. Surprisingly, log statements are rarely moved.