JS Ext
Thursday, June 27, 2013
Lost Another Hard Disk
Another hard disk died recently: a 2TB Seagate that I installed in February 2012. At first, I started getting I/O errors. I forced a remount and it was running again. I wanted to run xfs_check on it, but the forced unmount left the filesystem looking like it was still mounted, so xfs_check refused to run. I decided to reboot, but the computer didn't come back up. I moved my HDMI cable over (I don't have a switch yet) and saw that it was hung on the "Mounting local filesystems" screen. I decided to shut off the computer and take out the problem hard disk. The computer had 6 hard disks, but luckily the case made it easy to take disks out and identify which one was which. I tried plugging the disk into a SATA->USB converter, but it started beeping at me. On a second try, I could hear the disk attempt to start up before the beeping began again. Luckily, I didn't lose a lot of data; most of it was mirrored onto other hard disks.
Wednesday, June 26, 2013
Showing pride in your work
A while back, I was conducting an interview. I love interviewing people because I find the gap between what people claim to know and what they actually know fascinating. I am pretty good at figuring that out by asking pointed questions rather than giving quizzes. One thing I had been looking for, but hadn't seen yet, was someone showing pride in their work. Then I saw it in a recent interview.
I like to look for obscure items on people's resumes. Everyone asks about knowing Java; if you list something like the JSch library, I'm going to ask about it. For those of you who don't know, JSch is an SSH client implementation written completely in Java. I have used this library many times, but most people haven't heard of it. During my pre-interview review, I noticed the candidate listed JSch. During the interview, I asked him what he did with it. He took a second and started to smile. I could tell he was proud of the work he did. He described a packaging system where the end of the build process would copy the result of the build to various Linux servers, streamlining the testing process.
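For a sense of what that kind of JSch code looks like, here is a minimal sketch. The hostnames, credentials, and paths are hypothetical, and this shows the general pattern, not the candidate's actual code:

import com.jcraft.jsch.ChannelSftp;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;
import java.io.FileInputStream;

public class DeployArtifact {
    public static void main(String[] args) throws Exception {
        JSch jsch = new JSch();
        jsch.addIdentity("/home/build/.ssh/id_rsa");       // key-based auth
        Session session = jsch.getSession("build", "qa-server-01", 22);
        session.setConfig("StrictHostKeyChecking", "no");  // fine for a lab, not for production
        session.connect();
        ChannelSftp sftp = (ChannelSftp) session.openChannel("sftp");
        sftp.connect();
        // Copy the result of the build to the remote server.
        sftp.put(new FileInputStream("build/app.war"), "/opt/apps/app.war");
        sftp.exit();
        session.disconnect();
    }
}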
Pride in your work is a very important thing. When you are proud of what you are working on, you work harder on it, you spend more time on it, and you put more thought into it. You often produce better software when you are proud of what you are doing. These are the types of things that let me know what kind of worker you are going to be. We hired this person.
Tuesday, June 25, 2013
YouTube Playlists on the Android App
I noticed something on the YouTube Android app that I hadn't noticed before. It seems like a recent update allows you to watch a user's playlists. This is something I have wanted for a while now. A lot of YouTube organizations upload multiple shows to a single channel, and the channel owners often make playlists for each show, making it easier to watch older episodes of a show or a series of episodes in succession. The problem was, the YouTube app didn't support that. I had to select a channel, then hunt for all the episodes of that show. There is a dropdown on the top left of the screen that originally had "Uploads" and "Activity" as possible values. Now, "Playlists" is in that dropdown. I can finally watch older Crash Course History or Tabletop episodes with ease!
Monday, June 24, 2013
XML Parsers: Part 3
I have made a case to use stream-based parsers. There are occasions where I will use XPath instead, however. Sometimes, performance isn't an issue. Sometimes, it is better to have user-friendly and maintainable code instead of fast-running code. For example, we had a problem with the web.xml files that were provided by the developers. We would deploy the war/web.xml to Tomcat and WebSphere. If there was a problem with the web.xml, Tomcat and WebSphere would blow up in different but spectacular ways. They often didn't give good error messages. Tomcat would give NullPointerExceptions for a lot of these errors. WebSphere would give JNDI lookup errors (since the JNDI system didn't initialize due to a webapp startup failure).
That is when I decided to perform some deploy-time web.xml validation. The better place for this validation would have been build time, but I didn't control the build processes, and there were multiple of them. It was much easier to validate at deploy time. Now, I don't know the web.xml spec like the back of my hand. I only know the problems that we have been hit with: filter mappings without filters, filters without filter mappings, and duplicate servlet names.
The solution I came up with was an XPath-based parser exposed as a JavaScript/Rhino API. Then, I could write a batch JavaScript script that parsed the web.xml and searched for known problems. If a known problem was detected, the war/ear deployment would fail. One key thing to remember when designing a solution like this is the discovery of new problems. This solution won't catch all web.xml errors, but it was easily modifiable whenever a new problem pattern was discovered. Not only that, almost anyone could make the change; it didn't require a complicated development cycle to update the problem detector. In this scenario, we were concerned with parsing a single web.xml at deploy time. If the web.xml had a lot of entries, the detector would run slower, but so would the web app. At that point, the developers would have more problems to worry about (a slow-running app vs. a slow-deploying app).
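The real checks lived in JavaScript on top of the Rhino-exposed API; as a plain-Java sketch of just one of those checks (a filter mapping that references no filter), it might look like the following. XML namespaces are ignored for brevity:

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class WebXmlCheck {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File(args[0]));  // path to web.xml
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList names = (NodeList) xpath.evaluate(
                "//filter-mapping/filter-name", doc, XPathConstants.NODESET);
        for (int i = 0; i < names.getLength(); i++) {
            String name = names.item(i).getTextContent();
            double count = (Double) xpath.evaluate(
                    "count(//filter[filter-name='" + name + "'])",
                    doc, XPathConstants.NUMBER);
            if (count == 0) {
                // The mapping references a filter that was never declared.
                System.err.println("filter-mapping without filter: " + name);
                System.exit(1);  // fail the deployment
            }
        }
    }
}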
Thursday, June 20, 2013
XML Parsers: Part 2
After going over the types of parsers, I went into some of the history of using parsers on our team. There was a batch process that took about 20 minutes to run. It frustrated the people who used it, because if you gave it a bad argument, it would take 15 minutes to tell you that you passed in the wrong arguments! People complained constantly to the developer, but he said there was nothing that could be done. They came to me asking my opinion, and I would say there was no reason the process should take 20 minutes.
Finally, the developer decided to talk to me. Instead of asking for advice on making the process faster, he reiterated the fact that he couldn't make it faster. The problem was one of the xml files that needed to be parsed. It was a large xml file that had a complicated structure. He said it took 5 minutes just for the XPath parser to return the top of the object tree. From there, he ran dozens of XPath queries against the large tree. In his mind, the performance problem didn't exist in the code that he wrote. Therefore, he couldn't do anything about it. It wasn't his code that was slow!
That is when I brought up different parser types. He complained that DOM was too hard for him to understand and that he didn't think there would be that much of a performance improvement. I explained the advantages and disadvantages of all three types of parsers. He had never heard of stream parsers before. The conversation ended with him looking into DOM parsing, but he wouldn't promise anything, since he thought it was a waste of time to make the change.
Out of personal frustration, since I was both a user of the batch process and a developer who felt he could do better, I wrote a proof of concept that used SAX instead of DOM or XPath. The funny part was that it only took me 4 hours to write a proof of concept that did the entire 20-minute process in about 500ms. That is about half a second! When word got around to the developer, he rushed his DOM changes in. Migrating from XPath to DOM took the process time from 20 minutes to 2 minutes. That is still much longer than my proof of concept's 500ms, but fast enough that people were relatively happy.
Wednesday, June 19, 2013
XML Parsers: Part 1
One of the junior developers on my team was trying to find an easier way of parsing xml. His complaints were based on some code he saw that used DOM to traverse the xml structure. The xml structure of that particular file was a little complicated, so the DOM traversal got complicated. He brought up using some XPath-based parsers. This is when I chimed in. I gave him an overview of the 3 general types of xml parsers and some of the history within our team.
Document Object Model (DOM)
The first parser I will talk about is the DOM-based parser. This parser is easy to use (especially for client-side web developers). First, the entire xml file is parsed and an intermediate representation is created. This representation contains Nodes, Attributes, Comments, and other objects that represent xml components. Once the entire xml file is in the intermediate representation, you can start making calls like getElementById() and getElementsByTagName(). You can also call getChildNodes() and getParentNode(). You traverse the object tree, node by node. I have found that this method of xml parsing is easy to learn but can be tedious. It is also middle of the road when it comes to performance, which is something I will dive deeper into later.
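As a minimal Java sketch of the DOM style (the file and element names are hypothetical):

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class DomExample {
    public static void main(String[] args) throws Exception {
        // The whole file is parsed into an in-memory tree first...
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File("servlets.xml"));
        // ...then you traverse the tree, node by node.
        NodeList servlets = doc.getElementsByTagName("servlet");
        for (int i = 0; i < servlets.getLength(); i++) {
            Element servlet = (Element) servlets.item(i);
            NodeList names = servlet.getElementsByTagName("servlet-name");
            if (names.getLength() > 0) {
                System.out.println(names.item(0).getTextContent());
            }
        }
    }
}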
XPath
The next type of parser is the XPath-based parser. It is usually not a parser by itself; XPath usually refers more to the traversal of the DOM object tree than to actually parsing the xml. XPath is very user friendly. Writing an XPath-based parser is easy, and it is very supportable, in the sense that the parsing code tends to be very readable. The main problem with XPath is performance: it is the slowest xml parsing method available.
Stream Parser
The final parser I'm going to talk about is the stream-based parser. The most famous of these is the blazing-fast Expat C library, but the Java SAX parsers are another example. Stream parsers are the fastest parsers. As the name implies, your handler code executes WHILE the xml file is being parsed; the other two parser types read the entire xml file into memory first and hold an object tree for the entire file. For junior developers, stream parsers are very difficult to understand, and when it comes to parsing complicated xml, stream-based parsers become almost unreadable. They are really fast, though. In fact, the other two parser types are usually implemented under the hood using a stream parser.
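Here is the same servlet-name extraction as the DOM sketch above, redone as a hedged SAX example. Nothing is held in memory beyond the text of the current element:

import java.io.File;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxExample {
    public static void main(String[] args) throws Exception {
        SAXParserFactory.newInstance().newSAXParser().parse(new File("servlets.xml"),
            new DefaultHandler() {
                private final StringBuilder text = new StringBuilder();
                @Override
                public void startElement(String uri, String local, String qName,
                                         Attributes attributes) {
                    text.setLength(0);  // callbacks fire as the file streams by
                }
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
                @Override
                public void endElement(String uri, String local, String qName) {
                    if ("servlet-name".equals(qName)) {
                        System.out.println(text);
                    }
                }
            });
    }
}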
Nine times out of ten, I will use a stream parser. It is rare that I find xml formats that are really complicated. The only time I don't use a stream parser is when performance is not a consideration. There is an interesting side effect of constantly using stream parsers: when I am designing the structure for a new xml format, I tend to make sure the structure supports stream parsing. These formats tend to be easier to parse with all the parser types above. They are easier to read and easier to understand. In my opinion, using stream parsers makes you a better xml designer.
Tuesday, June 18, 2013
Frustrations with Windows
I recently had to re-install Windows and it brought back a lot of frustrating memories. The re-install frustrated me so much, I decided to blog about it.
The first thing that frustrated me was how quickly Windows hides some error messages. I was only partially paying attention to the restart when I noticed that it restarted a second time, this time into recovery mode. I rebooted it again and noticed that a blue screen occurred. Before I had a chance to read anything, it rebooted automatically. Instead of letting me know what the problem was, it decided that it was much more user friendly to reboot into recovery mode.
Frustration two came about due to recovery mode. It was pretty useless in this situation. It would sit there for a few minutes, then tell me it could not fix the problem. It said the problem probably occurred because of a hardware change, but it didn't tell me what hardware was causing the problem, just that I should "undo" the hardware change that I made. In my situation (going from virtual to physical hardware), I can't just "undo" the change.
Frustration three is related to network drivers. I'm not talking about wireless drivers; I am talking about a Realtek Gigabit Ethernet controller. This particular install didn't need the drivers, but my next one might. The main reason I am "allowing" a physical Windows install is because Windows 7 Pro supports network backup/restore functionality. In theory, you can pop in the Windows 7 Pro install CD and restore from a previous snapshot. This is very similar to the Qemu qcow2 snapshotting feature. If the entire hard disk dies for some reason, I can restore from a snapshot. That is where the lack of network drivers comes in: the install CD won't work the way I want it to since it won't be able to access the network share.
Monday, June 17, 2013
VGA Passthrough Overhead
I never had the opportunity to test the overhead of VGA Passthrough. I had a gaming VM that I recently turned into a physical computer. During the VM days, the performance of the games that I played was fine; I had no complaints. I tried to run Dolphin to emulate the Gamecube and the Wii, but I could only get around 45 FPS on the lowest resolution. I also played around with Bitcoin mining from inside the VM, and this is where I am getting my benchmark for overhead. When the VGA card was being passed through to a VM, I was generating approximately 184 Mhash/s. Now that there is no passthrough/virtualization layer, I get around 358 Mhash/s. That is about a 94% improvement. These numbers come from snapshots in time; they do not correspond to an average of any sort. Although the GPU is exactly the same (AMD Radeon HD 7870), the CPUs were different. The VM host had an AMD Phenom(tm) II X6 1090T processor running at 3.2 GHz, and the VM only got 2 cores. The new setup is an AMD FX 4100 running at 3.6 GHz; since there is no virtualization, the computer gets all 4 cores. Also, the VM only had 4GB of RAM while the physical machine has 12GB. The CPU and RAM shouldn't have that much of an impact on an OpenCL program, though. Empirically, this is a comparison of an OpenCL program running on the same video card in two different computers, one of them virtualized. This means you shouldn't quote this blog as claiming that all PCI/VGA Passthrough carries a 49% performance penalty.
Thursday, June 13, 2013
Giving up on my gaming VM
In a previous post, I talked about how PCI Passthrough stopped functioning on the one computer that I used it with. I had 2 VMs that used PCI Passthrough. One was a multi-function Windows VM, to which I passed through a USB controller card. The only reason I do this is because Xen's USB passthrough isn't very functional. I have 2 USB devices that I pass through: an X10 Firecracker wireless transmitter and an Eaton Home Heartbeat base station. This VM also hosts the Nightowl software to view my security cameras. It is my home automation and security "computer". I never really had the X10 and Home Heartbeat working, however. I wrote a python script that allowed me to send emails when water was detected in my basement, but it got shifted lower in priority.
My second VM was my gaming VM, with a video card and a USB controller card passed through. Although I haven't been gaming as much, the computer has still been used. We have been getting a lot of visitors recently, and a lot of times, people want to look at pictures. Although the MK802 can do that, people tend to like the Windows interface (for some reason). They want to use a mouse and control everything. It also came in handy when we wanted to compare photos.
Because this VM still gets regular use, and its primary function won't work anymore, I had to give up on the VM. I started re-arranging hardware and turned my Ubuntu desktop (which I used only on occasion) into the gaming/living room PC. It already had a 120GB SSD. I dd'ed the QCOW2 image onto the SSD. I tried turning on the computer. The Windows boot screen came up, but it blue screened. After booting into recovery mode, Windows told me that it could not repair the problem. It suggested that a hardware change caused the problem, and that I should "undo" the hardware change. I guess changing from a virtual to a physical computer was too big of a change for Windows. I ended up reinstalling on top of the existing Windows install.
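A side note for anyone attempting the same migration: a byte-for-byte copy of a qcow2 file is still in qcow2 container format. Unless the image was already raw, a conversion step along these lines (target device path hypothetical) is normally what produces a bootable disk:

qemu-img convert -O raw win7.qcow2 /dev/sdb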
Overall, I am sad that I am no longer using a VM. The QCOW2 image format provided the BEST backup system I have ever used. I was able to copy the entire disks over the network, then take a snapshot. I did this weekly. Now, I have to rely on Windows Backup and Restore. Backup and Restore seems to be a lot better on Windows 7, though. The install DVD lets me restore an OS from a backup. I had other issues getting Windows 7 running, but those aren't related to the Xen conversion, so I will leave that rant for another post.
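For reference, that kind of weekly routine can be driven by stock qemu-img commands (paths and snapshot tags hypothetical):

qemu-img snapshot -c weekly-2013-06-01 /var/lib/xen/win7.qcow2   # create a snapshot
qemu-img snapshot -l /var/lib/xen/win7.qcow2                     # list snapshots
qemu-img snapshot -a weekly-2013-06-01 /var/lib/xen/win7.qcow2   # revert to a snapshot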
Wednesday, June 12, 2013
On the Importance of (Proper) Logging
I was recently in a situation where I had to defend my work. First, some background. There are 3 JVMs; let's call them A, B, and C. I wrote/maintain C, but I also wrote code that gets included in A. C is a service that a bunch of JVMs contact, including A and B. Since so many JVMs connect to my service, I ended up adding some decent logging to the system. The main thing that gets logged is the start/stop times for every service call; the log line for the stop time also has the amount of time the call took. On top of that, I wrote code that monitors all the web container threads: if a servlet takes longer than 3.5 seconds, the stack trace of that servlet gets dumped to the log. This code was written as a library, so JVM A also has the servlet stack trace functionality in it. JVM B is currently getting some large code changes, A is getting minor changes, and C is not getting any changes.
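As a rough sketch of how that kind of watchdog filter can be structured (the names and exact mechanics here are illustrative, not the production code):

import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import javax.servlet.*;

public class SlowServletFilter implements Filter {
    private static final long THRESHOLD_MS = 3500;
    private static final Map<Thread, Long> inFlight = new ConcurrentHashMap<Thread, Long>();
    private static final ScheduledExecutorService watchdog =
            Executors.newSingleThreadScheduledExecutor();

    static {
        // Once a second, dump the stack of any request running past the threshold.
        watchdog.scheduleAtFixedRate(new Runnable() {
            public void run() {
                long now = System.currentTimeMillis();
                for (Map.Entry<Thread, Long> entry : inFlight.entrySet()) {
                    if (now - entry.getValue() > THRESHOLD_MS) {
                        System.err.println("Slow servlet on " + entry.getKey().getName());
                        for (StackTraceElement frame : entry.getKey().getStackTrace()) {
                            System.err.println("    at " + frame);
                        }
                        inFlight.remove(entry.getKey());  // dump once per request
                    }
                }
            }
        }, 1, 1, TimeUnit.SECONDS);
    }

    public void init(FilterConfig config) {}
    public void destroy() {}

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        inFlight.put(Thread.currentThread(), System.currentTimeMillis());
        try {
            chain.doFilter(req, res);
        } finally {
            inFlight.remove(Thread.currentThread());
        }
    }
}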
The situation started when someone was running a test case on the QA instance of A. They went through a flow, then all of a sudden, the page took about 2 minutes to load. After some investigation, the team making the changes was able to reliably reproduce the problem. That is the good news. The bad news is JVM B cannot be run on someone's desktop.
This is when the head developer of A and B started to tell people that C was causing the performance issue. He didn't talk to me; no, he told other people. He wanted to know what had changed in C. This is where logs come in handy. I logged into the QA server and looked at the logs. In the previous 7 days, only one request took over 1000ms. This means it never took long enough to even trigger a thread dump! In my mind, it was proven: C did not cause the problem. There weren't even enough calls to add up to 2 minutes of processing time (on the QA instance of C, not Prod)!
After a few days, I started hearing more over the cube wall. Once again, the developer for A and B was saying that C was causing the issue. I started taking a look at A's logs. Luckily, A has the servlet filter I wrote that dumps a servlet's stack trace at the 3.5s mark. At that point in the call, A was calling B! More proof that C wasn't causing the issue.
This kept going back and forth until I was called in by the junior (junior in the management pecking order, not in actual seniority) members of the A/B dev team to figure out why B had a performance issue. The problem persisted because the head A/B developer considered it a C problem, so he delegated to someone else to "solve" it. The first thing I asked was whether anyone had put the performance monitoring code I wrote (the thread dump servlet filter) into B. They said no. I asked which person had B running on their desktop so that I could run JProfiler or JProbe against the JVM. Apparently, B won't run on a desktop. You have to check in code, build it, and test it on the DEV server.
At this point, I decided to do the kill -3 trick. On Unix systems, sending kill -3 (SIGQUIT) to a JVM causes it to print a stack trace for every thread to standard out. I wrote a quick while loop in Korn shell to perform a kill -3 on the DEV JVM once every second. I had the developer run the test. After about 10 seconds, I killed the while loop and started looking at the logs. The worker thread was at the same point in every stack trace.
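The loop itself was nothing fancy; something along these lines (the PID is hypothetical):

while true; do
    kill -3 12345   # the DEV JVM's pid; traces land in the JVM's stdout log
    sleep 1
done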
The method was a simple (but poorly written) pad method. It was padding a string to be 50 characters wide. It did not pre-allocate a buffer; it just kept prepending the pad string over and over again until the string had a length of 50. As we went up the stack, things looked just as bad. The calling method was a toString() call. This call iterated over a map, appending the pad() of each map entry value to a StringBuffer that was not pre-allocated. That toString() method was called because the caller was calling log.debug("The value is " + value). This log line was inside a tight loop, and the method containing the loop was being called about 300 times for the users that had a performance problem. It was called only once for the people without the problem. I assumed someone had added the logging line and that caused the performance problem.
What is interesting is that it took a few tries to explain to the developers why the toString() was being called. The way they understood it (at a high level) was that since Log4j wasn't configured to output debug messages, the line shouldn't have done anything. I had to explain that the argument has to be constructed (in this case, toString() gets called as part of the string concatenation) before the method is even invoked; the logic that determines whether the message should be printed lives inside the method. Nobody fessed up to adding the log line. We have revision control to figure out that kind of thing, but I try to concentrate on the "why did it happen" and the "how do we fix it" rather than the "who caused it".
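In Log4j 1.x terms, the difference looks like this (class and logger names hypothetical):

import org.apache.log4j.Logger;

public class PadLoggingExample {
    private static final Logger log = Logger.getLogger(PadLoggingExample.class);

    static void logValue(Object value) {
        // Problem line: "The value is " + value is built, and value.toString()
        // runs, before log.debug() is ever invoked, even with debug logging off.
        log.debug("The value is " + value);

        // Guarded version: the concatenation (and the expensive toString())
        // only happens when debug logging is actually enabled.
        if (log.isDebugEnabled()) {
            log.debug("The value is " + value);
        }
    }
}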
This story really tells us two things about logging. It is very helpful: done right, it proved that JVM C wasn't causing a performance issue. Done wrong, it caused the performance issue in JVM B.
Tuesday, June 11, 2013
VTech Digital Baby Monitor
The VTech Digital Baby Monitor is the first half of my "smart" baby monitor system. This device is a functioning baby monitor, which is more than can be said about some other baby monitor systems out there. The VTech supports two parent units. This was a must have feature. We keep one parent unit in the bedroom and another in the living room. The volume is loud enough that between the two parent units, we can hear the baby's room from anywhere in the house.
The VTech microphone is phenomenal. It picks up the slightest noise. It is probably too sensitive for a baby monitor, but it doesn't bother me at all. It picks up footsteps from outside of the baby's room. The audio quality is great. The microphones on the parent units are really good as well. The battery doesn't seem like it can handle a full day of being on, so we keep them plugged in overnight. The LED lights that display amplitude are useful when you are trying to gauge how loud your child is being. So far, I am super pleased with the devices.
That being said, there are some bad reviews for the VTech. Most of them start out sounding just like this review, talking about how great the device functions, but there seems to be a high rate of device failure after a few months of use. Since I have only been using the devices for a little over a week, I can't report on that. If the devices do fail, I will be sure to write another post and update this one to mention the failure. In the meantime, I am very happy with the VTech Digital Baby Monitor.
Monday, June 10, 2013
Advanced Baby Monitors
As a technology guy, I went on a search for the latest in baby monitor technology. I was sad to see that most advanced baby monitors failed at one major feature: functioning as baby monitors. That may sound weird, but it is true. The problem was that the advanced monitors acted more as nanny-cams than as baby monitors. They worked well in the nanny-cam space, but not as baby monitors.
Here is what I was looking for in an advanced baby monitor. First, it should function as a baby monitor; I will get to this point later. Second, I wanted multiple "parent units". If my wife and I are in different rooms, I want us both to have the ability to monitor the baby. Third, I wanted the option to see video of my baby. This is the nanny-cam concept. I want it for two sub-tasks: 1) allowing me to see why my baby is fussing and 2) allowing grandparents to see the baby.
Now, a little background on what baby monitors are supposed to do. This may sound weird....everyone should know what a baby monitor does. The problem is, I don't think the makers of the advanced baby monitors know what baby monitors are supposed to do. Baby monitors allow parents to listen to a baby to hear if they need to pay the baby a visit. This fact is important because of how most advanced baby monitors work: they are sound alarms. In the event that too much sound occurs, they send a text or an email to your phone. Unfortunately, this tells you that a sound was made, not that your baby needs you. Those two things can be different. I don't want to get a text because my baby farted really loud (I'm a parent now; I can make fart jokes). True baby monitors transmit audio so that I can tell the difference between loud farts and cries for attention.
That brings us to the next reason why advanced baby monitors fail at being baby monitors. They don't work for an extended period of time. That "extended" period of time can be measured in seconds. Two issues tend to pop up. First, the audio/video stream buffers, causing you to not hear anything. Second, the stream completely shuts down. I have read reviews of various other monitors, and they tend to say the Withings Smart Monitor that I linked to above is one of the more stable products on the market.
In the end, I decided to buy two cheaper products that ended up costing less than a single "smart" baby monitor. I bought the VTech Digital Baby Monitor and a Foscam FI8918W IP camera. I will write up a review for the VTech in a future post. Between these two products, I get something that fits all my requirements. First, the VTech is a functioning baby monitor with two parent units. Second, I can use my MK802, phone or tablet to access the Foscam. I can also give grandparents access to log into the Foscam.
Thursday, June 6, 2013
Still can't get cgminer to use my GeForce GTS 250
I have been struggling with this for a few weeks now. I still can't get cgminer to use my GPU. The program's -n argument does list the GPU, but cgminer can't use it for some reason. I figured I would post here and see if someone else has the same problem.
# cgminer -n
[2013-05-22 19:40:30] CL Platform 0 vendor: NVIDIA Corporation
[2013-05-22 19:40:30] CL Platform 0 name: NVIDIA CUDA
[2013-05-22 19:40:30] CL Platform 0 version: OpenCL 1.1 CUDA 4.2.1
[2013-05-22 19:40:30] Platform 0 devices: 1
[2013-05-22 19:40:30] 0 GeForce GTS 250
[2013-05-22 19:40:30] 1 GPU devices max detected
# cgminer output
[2013-05-22 19:39:19] Started cgminer 2.7.4
[2013-05-22 19:39:19] Started cgminer 2.7.4
[2013-05-22 19:39:20] Probing for an alive pool
[2013-05-22 19:39:20] Long-polling activated for **********************************
[2013-05-22 19:39:20] Error -2: Creating Context. (clCreateContextFromType)
[2013-05-22 19:39:20] Failed to init GPU thread 0, disabling device 0
[2013-05-22 19:39:20] Restarting the GPU from the menu will not fix this.
[2013-05-22 19:39:20] Try restarting cgminer.
Press enter to continue:
Wednesday, June 5, 2013
Python's lack of design
Many years ago, my dad was going on and on about a new programming language. He was talking about how simple it is to create an http server and serve up content. I was very dismissive because to me, having an easy to use standard library doesn't make a language great. That language was Python.
I have never been a fan of Python. One of the areas where I felt it lacked was the design (or lack thereof) surrounding the Python standard library. Let's go back to the http server example from before. We will use the Python 3 example, since the Python 2 example was even worse. It is super easy to create an http server: you just instantiate http.server.HTTPServer and pass it the http.server.SimpleHTTPRequestHandler class. That handler will serve up files in the current working directory. Seems simple enough. My project needs it to serve files in a sandbox directory, however. So, we should be able to override the web root of the bundled http server, right? Actually, no. SimpleHTTPRequestHandler hard codes the web root to be the current working directory. In order to change that, you have to override the SimpleHTTPRequestHandler.translate_path() method, which means copying and pasting the entire contents of that method and changing one line to add a custom path.
Once you do that, you decide to test http streaming of large files. While a large file is streaming, you decide to test something else and you notice something: your second connection to the http server freezes. Whoops. You can't just instantiate HTTPServer; that is a single-threaded server. You have to instantiate socketserver.ThreadingTCPServer instead.
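Putting those two fixes together gives a sketch like the following (Python 3; the sandbox path is hypothetical). Re-anchoring the parent's translate_path() result is one way to avoid the full copy-and-paste, though it still leans on implementation behavior:

import os
import socketserver
from http.server import SimpleHTTPRequestHandler

SANDBOX = "/srv/sandbox"  # hypothetical web root

class SandboxRequestHandler(SimpleHTTPRequestHandler):
    def translate_path(self, path):
        # Let the parent map the URL under os.getcwd(), then re-anchor
        # the result under SANDBOX instead.
        default = super().translate_path(path)
        relative = os.path.relpath(default, os.getcwd())
        return os.path.join(SANDBOX, relative)

# HTTPServer handles one request at a time; ThreadingTCPServer gives each
# connection its own thread, so one large streaming download no longer
# freezes every other client.
server = socketserver.ThreadingTCPServer(("", 8000), SandboxRequestHandler)
server.serve_forever()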
Once your server supports multiple threads, let's add an XMLRPC server in there as well. Python supports XMLRPC, so it should be as simple as invoking some XMLRPC code from an overridden method of SimpleHTTPRequestHandler, right? Well, no. The XMLRPC server code is written as a class that extends http.server.BaseHTTPRequestHandler, and that implementation assumes you are using the xmlrpc.server.SimpleXMLRPCServer class. If you aren't using SimpleXMLRPCServer, things get weird really fast.
It seems natural that since Python supports an http server that can serve up static files, and an xmlrpc server, it should be able to handle a single server that does both things. These two sub-libraries don't intermingle, though. Without any kind of API or underlying design, you have to hack them together. This is what I mean by Python's lack of design. The standard library supports many different protocols and features, but they are not written as building blocks. It is not easy to use these libraries inside of a larger application.
For my project, I ended up subclassing http.server.SimpleHTTPRequestHandler and creating an instance of xmlrpc.server.SimpleXMLRPCDispatcher. I started overriding methods to delegate to the SimpleXMLRPCDispatcher when a POST to /RPC2 occurred. Then I started the copy game: I kept copying methods from http/server.py and xmlrpc/server.py into my program until both classes worked together. I finally got XMLRPC and static file serving working in the same server.
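The heart of that hack ends up looking something like this sketch (simplified, error handling omitted, and the registered function is hypothetical). Note that _marshaled_dispatch() is exactly the kind of undocumented method discussed below:

from http.server import SimpleHTTPRequestHandler
from xmlrpc.server import SimpleXMLRPCDispatcher

dispatcher = SimpleXMLRPCDispatcher(allow_none=True, encoding=None)
dispatcher.register_function(lambda a, b: a + b, "add")  # hypothetical RPC method

class CombinedRequestHandler(SimpleHTTPRequestHandler):
    # GET requests fall through to SimpleHTTPRequestHandler's static file code.
    def do_POST(self):
        if self.path != "/RPC2":
            self.send_error(404)
            return
        # Hand the XML-RPC payload to the dispatcher and write its response.
        body = self.rfile.read(int(self.headers["Content-Length"]))
        response = dispatcher._marshaled_dispatch(body)  # undocumented/"private"
        self.send_response(200)
        self.send_header("Content-Type", "text/xml")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)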
To get all of this to work, I had to use "private" methods. I consider the methods private because the Python documentation doesn't list them; Python doesn't have a concept of private vs. protected vs. public. To me, this means future versions of Python could break my code, since I had to program to the implementation, not the interface. This is a design anti-pattern.
My next step is to modify the code to support the Accept-Ranges header since the stock Python http server doesn't support it. Until next time!
Tuesday, June 4, 2013
Xen PCI Passthrough stopped working
Somehow, after my Monday morning reboot, all PCI Passthrough stopped functioning. The PCI devices still show up on xl pci-assignable-list. When I start the VM (without the VGA Passthrough), the PCI-E USB card does show up, but it is listed as not functioning. That is why the VM with the VGA card is blue screening. The card is visible, just not functional. Given that the USB card no longer works, I'm assuming it is not related to the VGA card. At this point, I'm wondering if something happened to the motherboard.
When I start the VM, here is what I get:
dom0 ~ # xl create /etc/xen/dom1
Parsing config from /etc/xen/dom1
xc: info: VIRTUAL MEMORY ARRANGEMENT:
Loader: 0000000000100000->000000000019dd88
TOTAL: 0000000000000000->000000007f800000
ENTRY ADDRESS: 0000000000100000
xc: info: PHYSICAL MEMORY ALLOCATION:
4KB PAGES: 0x0000000000000200
2MB PAGES: 0x00000000000003fb
1GB PAGES: 0x0000000000000000
libxl: error: libxl_pci.c:960:do_pci_add: xc_assign_device failed
libxl: error: libxl_pci.c:960:do_pci_add: xc_assign_device failed
Daemon running with PID 19919
dom0 ~ #
When I run xl dmesg, I see a bunch of these lines:
(XEN) traps.c:2595:d0 Domain attempted WRMSR 00000000c0000408 from 0xc000000001000000 to 0xc008000001000000.
(XEN) pt_irq_create_bind failed (-3) for dom1
Here is what the two cards looked like in Device Manager (screenshot not reproduced here), and here are the relevant log entries from qemu-dm-dom1.log:
dm-command: hot insert pass-through pci dev
register_real_device: Assigning real physical device 02:00.0 ...
register_real_device: Disable MSI translation via per device option
register_real_device: Disable power management
pt_iomul_init: Error: pt_iomul_init can't open file /dev/xen/pci_iomul: No such file or directory: 0x2:0x0.0x0
pt_register_regions: IO region registered (size=0x00002000 base_addr=0xfe900004)
pt_msix_init: get MSI-X table bar base fe900000
pt_msix_init: table_off = 1000, total_entries = 8
pt_msix_init: mapping physical MSI-X table to 7f3166ea9000
pci_intx: intx=1
register_real_device: Error: Binding of interrupt failed! rc=-1
register_real_device: Real physical device 02:00.0 registered successfuly!
IRQ type = INTx
dm-command: hot insert pass-through pci dev
register_real_device: Assigning real physical device 03:00.0 ...
register_real_device: Disable MSI translation via per device option
register_real_device: Disable power management
pt_iomul_init: Error: pt_iomul_init can't open file /dev/xen/pci_iomul: No such file or directory: 0x3:0x0.0x0
pt_register_regions: IO region registered (size=0x00002000 base_addr=0xfe800004)
pt_msix_init: get MSI-X table bar base fe800000
pt_msix_init: table_off = 1000, total_entries = 8
pt_msix_init: mapping physical MSI-X table to 7f3166ea8000
pci_intx: intx=1
register_real_device: Error: Binding of interrupt failed! rc=-1
register_real_device: Real physical device 03:00.0 registered successfuly!
IRQ type = INTx
I verified that IOMMU is enabled in the BIOS. This setup has worked for about 18 months now. I don't know what changed that caused it to start failing all of a sudden.
Monday, June 3, 2013
Xen Broke, then (Mostly) Fixed
Somehow, Xen stopped working for me. My VMs were running, I used an RDP client to connect to one of them, and the Dom-0 froze. When the Dom-0 came back up, the VMs weren't running like they should be. I tried starting them. Xen claimed they started, but I couldn't rdesktop in or ping the VMs, and they would eventually die before the OS came up. I looked at the Xen logs, and here is what I saw:
Domain 5 has shut down, reason code 1 0x1
dom0 xen # cat xl-dom1.log
Waiting for domain dom1 (domid 1) to die [pid 19733]
Domain 1 has shut down, reason code 1 0x1
Action for shutdown reason code 1 is destroy
Domain 1 needs to be cleaned up: destroying the domain
Done. Exiting now
dom0 xen #
I had no clue what those messages meant. Xen log messages are really meant for the developers, and for something as complicated as Xen, I don't mind that fact as much. I tried a few things, like turning off the Dom-0 for a while, then upgrading Xen from 4.2.1 to 4.2.2. Nothing helped. Finally, it occurred to me that I had a VNC server listening as well. Only one of my VMs had its emulated video card outputting to VNC, though; the other had VGA Passthrough as the primary video card. I connected to the VM without VGA Passthrough and there was a Windows Startup Recovery screen. It was asking a question! I answered the question and told it to reboot. It rebooted into the Startup Recovery screen again. Luckily, all this happened the day I performed a backup. Three minutes to revert one disk and 8 minutes to revert another (I love snapshots!), and I was able to boot.
Unfortunately, PCI Passthrough had stopped functioning. The VM with the VGA card was getting a blue screen of death, and the VM without it started, but without the USB controller card.