Ten years of KVM

This article was contributed by Amit Shah

We recently celebrated 25 years of the Linux project. KVM, or Kernel-based Virtual Machine, a part of the Linux kernel, celebrated its 10th anniversary in October. KVM was first announced on 19 October 2006 by its creator, Avi Kivity, in a post to the Linux kernel mailing list.

That first version of the KVM patch set had support for the VMX instructions found in Intel CPUs that were just being introduced around the time of the announcement. Support for AMD’s SVM instructions followed soon after. The KVM patch set was merged in the upstream kernel in December 2006, and was released as part of the 2.6.20 kernel in February 2007.

Background

Running multiple guest operating systems on the x86 architecture was quite difficult without the new virtualization extensions: there are instructions that can only be executed from the highest privilege level, ring 0, and such access could not be given to each operating system without it also affecting the operation of the other OSes on the system. Additionally, some instructions do not cause a trap when executed at a lower privilege level — despite them requiring a higher privilege level to function correctly — so running a “hypervisor” that ran in ring 0, while running other OSes in lower-privileged rings was also not a solution.

The VMX and SVM instructions introduced a new ring, ring -1, to the x86 architecture. This is the privilege level where the virtual machine monitor (VMM), or the hypervisor, runs. This VMM arbitrates access to the hardware for the various operating systems so that they can continue running normally in the regular x86 environment.

There are several reasons to run multiple operating systems on one hardware system: deployment and management of OSes becomes easier with tools that can provision virtual machines (VMs). It also leads to lower power and cooling costs by consolidating multiple OSes and their corresponding applications and services onto newer, more capable hardware. Moreover, running legacy operating systems and applications on newer hardware without any changes to adapt to the newer hardware now becomes possible by emulating older hardware via the hypervisor.

The functionality of KVM itself is divided into multiple parts: the generic host kernel KVM module, which exposes the architecture-independent functionality of KVM; the architecture-specific kernel module in the host system; the user-space part that emulates the virtual machine hardware that the guest operating system runs on; and optional guest additions that make the guest perform better on virtualized systems.

At the time KVM was introduced, Xen was the de facto open source hypervisor. Since Xen was introduced before the virtualization extensions were available on x86, it had to use a different design. First, it needed to run a modified guest kernel in order to boot virtual machines. Second, Xen took over the role of the host kernel, relegating Linux to only manage I/O devices as part of Xen’s special “Dom0” virtual machine. This meant that the system couldn’t truly be called a Linux system; even the guest operating systems were modified Linux kernels with (at the time) non-upstream code.

Kivity started KVM development while working at Israeli startup Qumranet to fix issues with the Xen-related work the company was doing. The original Qumranet product idea was to replicate machine state across two different VMs to achieve fault tolerance. It was soon apparent to the engineers at Qumranet that Xen was too limiting and a poor model for their needs. The virtualization extensions were about to be introduced in AMD and Intel CPUs, so Kivity started a side-project, KVM, that was based on the new hardware virtualization specifications and would be used as the hypervisor for the fault-tolerance solution.

Development model

From the beginning, Kivity wrote the code with upstreaming in mind. One of the goals of the KVM model was as much reuse of existing functionality as possible: using Linux to do most of the work, with KVM just being a driver that handled the new virtualization instructions exposed by hardware. This enabled KVM to gain any new features that Linux developers added to the other parts of the system, such as improvements in the CPU scheduler, memory management, power management, and so on.

This model worked well for the rest of the Linux ecosystem as well. Features that started their life with only virtualization in mind began being useful and widely-adopted in general use cases as well, like transparent huge pages. There weren’t two separate communities for the OS and for the VMM; everyone worked as part of one project.

Also, management of the VMs would be easier as each VM could be monitored as a regular process — tools like top and ps worked out of the box. These days, perf can be used to monitor guest activity from the host and identify bottlenecks, if any. Further chipset improvements will also enable guest process perf measurement from the host.

The other side of KVM was in user space, where the machine that is presented to the guest OS is built. kvm-userspace was a fork of the QEMU project. QEMU is a machine emulator — it can run unmodified OS images for a variety of architectures that it supports, and emulate those architecture’s instructions for the host architecture it runs on. This is of course very slow, but the advantage of the QEMU project was that it had quite a few devices already emulated for the x86 architecture — such as the chipset, network cards, display adapters, and so on.

What kvm-userspace did was short-circuit the emulation code to only allow x86-on-x86 and use the KVM API for actually running the guest OS on the host CPU. When the guest OS performed a privileged operation, the CPU would exit to the VMM code. KVM took over; if it could service the request itself, it would do so and give control back to the guest. This was a “lightweight exit”. For requests that the KVM code couldn’t serve, such as device emulation, it would defer to QEMU. This implied exiting to user space from the host Linux kernel, and hence this was called a “heavyweight exit”.

One of the drawbacks in this model was the maintenance of the fork of QEMU. The early focus of the developers was on stabilizing the kernel module, and getting more and more guests to work without a hitch. That meant much less developer time was spent on the device emulation code, and hence the work to redo the hacks to make them suitable for upstream remained at a lower priority.

Xen too used a fork of QEMU for its device emulation in its HVM mode (the mode where Xen used the new hardware virtualization instructions). In addition, QEMU had its own non-upstream Linux kernel accelerator module (KQEMU) for x86-on-x86 that eliminated the emulation layer, making x86 guests run faster on x86 hardware. Integrating all of this required a maintainer who would understand the various needs from all the projects. Anthony Liguori stepped up as a maintainer of the QEMU project, and he had the trust of the Xen and KVM communities. Over time, in small bits, the forks were eliminated, and now KVM as well as Xen use upstream QEMU for their device model emulation.

The “do one thing, do it right” mantra, along with “everything is a file”, was exploited to the fullest. The KVM API allows one to create VMs — or, alternatively, sandboxes — on a Linux system. These can then run operating systems inside them, or just about any code that will not interfere with the running system. This also means that there are other user-space implementations that are not as heavyweight or as featureful as QEMU. Tools that can quickly boot into small applications or specialized OSes with a KVM VM started showing up — with kvmtool being the most popular one.
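
As a rough illustration of the file-based API described above, the following sketch opens /dev/kvm, queries the API version, and asks for an empty VM. It relies only on the KVM_GET_API_VERSION and KVM_CREATE_VM ioctls from <linux/kvm.h>; the function name kvm_probe and its return convention are made up for this example, and a real VMM would go on to map guest memory and create virtual CPUs.

```c
#include <assert.h>
#include <fcntl.h>
#include <linux/kvm.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns the KVM API version, or -1 if /dev/kvm
 * cannot be opened (no KVM support, or insufficient permissions).
 * *vm_created is set to 1 if an empty VM could be created, else to 0.
 */
int kvm_probe(int *vm_created)
{
    assert(vm_created != NULL);
    *vm_created = 0;

    /* The hypervisor is exposed as an ordinary character device. */
    int kvm_fd = open("/dev/kvm", O_RDWR);
    if (kvm_fd < 0)
        return -1;

    int version = ioctl(kvm_fd, KVM_GET_API_VERSION, 0);

    /* A VM is a file descriptor too; closing it destroys the VM. */
    int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);
    if (vm_fd >= 0) {
        *vm_created = 1;
        close(vm_fd);
    }

    close(kvm_fd);
    return version;
}
```

On a host without KVM the helper simply returns -1, so it can be called unconditionally; on a host with KVM the returned version is expected to equal KVM_API_VERSION from the header.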

Developer Interest

Since the original announcement of the KVM project, many hackers were interested in exploring KVM. It helped that hacking on KVM was very convenient: a system reboot wasn’t required to install a new VMM. It was as simple as re-compiling the KVM modules, removing the older modules, and loading the newly-compiled ones. This helped immensely during the early stabilization and improvement phases. Debugging was a much faster process, and developers much preferred this way of working, as contrasted with compiling a new VMM, installing it, updating the boot loader, and rebooting the system. Another advantage, perhaps of lower importance on development systems but nonetheless essential for my work-and-development laptop, was that root permissions were not required to run a virtual machine.

Another handy debugging trick that was made possible by the separation of the KVM module and QEMU was that if something didn’t work in KVM mode, but worked in emulated mode, the fault was very likely in the KVM module. If some guest didn’t work in either of the modes, the fault was in the device model or QEMU.

The early KVM release model helped with a painless development experience as well: even though the KVM project was part of the upstream Linux kernel, Kivity maintained the KVM code on a separate release train. A new KVM release was made regularly that included the source of the KVM modules, a small compatibility layer to compile the KVM modules on any of the supported Linux kernels, and the kvm-userspace piece. This ensured that a distribution kernel, which had an older version of the KVM modules, could be used unchanged by compiling the modules from the newest KVM release for that kernel.

The compatibility layer required some effort to maintain. It needed to ensure that the new KVM code that used newer kernel APIs that were not present on older kernels continued to work, by emulating the new API. This was a one-time cost to add such API compatibility functions, but the barrier to entry for new contributors was significantly reduced. Hackers could download the latest KVM release, compile the modules against whichever kernel they were running, and see virtual machines boot. If that did not work, developers could post bug-fix patches.

Widespread adoption

Chip vendors started taking interest and porting KVM to their architectures: Intel added support for IA64 along with features and stability fixes to x86; IBM added support for s390 and POWER architectures; ARM and Linaro contributed to the ARM port; and Imagination Technologies added MIPS support. These didn’t happen all at once, though. ARM support, for example, came rather late (“it’s the reality that’s not timely, not the prediction”, quipped Kivity during a KVM Forum keynote, having predicted the previous year that an ARM port would materialize).

Developer interest could also be seen at the KVM Forum, an annual gathering of people interested in KVM virtualization. The first KVM Forum in 2007 had a handful of developers in a room where many discussions about the current state of affairs, and where to go in the future, took place. One small group, headed by Rusty Russell, took over the whiteboard and started discussions on what a paravirtualized interface for KVM would look like. This is where VIRTIO started to take shape. These days, the KVM Forum is a whole conference with parallel tracks, tens of speakers, and hundreds of attendees.

As time passed, it was evident the KVM kernel modules were not where most of the action was — the instruction emulation, when required, was more or less complete, and most distributions were shipping recent Linux kernels. The focus had then switched to the user space: adding more device emulation, making existing devices perform better, and so on. The KVM releases then focused more on the user-space part, and the maintenance of the compatibility layer was eased. At this time, even though the kvm-userspace fork existed, effort was made to ensure new features went into the QEMU project rather than the kvm-userspace project. Kivity too started feeding in small changes from the kvm-userspace repository to the QEMU project.

While all this was happening, Qumranet had changed direction, and was now pursuing desktop virtualization with KVM as the hypervisor. In September 2008, Red Hat announced it would acquire Qumranet. Red Hat had supported the Xen hypervisor as its official VMM since the Red Hat Enterprise Linux 5.0 release. With the RHEL 5.4 release, Red Hat started supporting both Xen and KVM as hypervisors. With the release of RHEL 6.0, Red Hat switched to only supporting KVM. KVM continued enjoying out-of-the-box support in other distributions as well.

Present and future

Today, there are several projects that use KVM as the default hypervisor: OpenStack and oVirt are the more popular ones. These projects concern themselves with large-scale deployments of KVM hosts and several VMs in one deployment. These come with various use cases, and hence ask different things of KVM. As guest OSes grow larger (more RAM and virtual CPUs), they become more difficult to live-migrate without incurring too much downtime; telco deployments need low-latency network packet processing, so realtime KVM is an area of interest; and faster disk and network I/O is always an area of research. Keeping everything secure and reducing the hypervisor footprint are also being worked on. The ways in which a malicious guest can break out of its VM sandbox, and how to mitigate such attacks, are also a prime area of focus.

A lot of advancement happens with new hardware updates and devices. However, a lot of effort is also spent in optimizing the current code base, writing new algorithms, and coming up with new ways to improve performance and scalability with the existing infrastructure.

For the next ten years, the main topics of discussion may well not be about the development of the hypervisor. More interesting will be to see how Linux gets used as a hypervisor, bringing better sandboxing for running untrusted code, especially on mobile phones, and running the cloud infrastructure, by being pervasive as well as invisible at the same time.

From: http://lwn.net/Articles/705160/

SeaBIOS study (1)

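SeaBIOS [1] is an open-source BIOS implementation for the x86 architecture. It can act as the payload that runs after coreboot [2] has initialized the hardware, and implements the boot logic.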

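After the CPU is initialized, the address of the first instruction it executes, as placed in EIP, is 0xFFFFFFF0. This is a hack (an inelegant but effective solution to a computing problem) in Intel CPUs, known as the reset vector (reset_vector). The instructions at 0xFFFFFFF0~0xFFFFFFFF (the last 16 bytes below 4G) make the CPU jump to the entry address of the system BIOS, 0xF0000. The system BIOS is preloaded at 0xF0000~0xFFFFF (960k~1M).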
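
The address arithmetic in the paragraph above can be double-checked with a few lines of C. This is only a sanity-check sketch: the constant and function names are invented for the example, and the values are the ones quoted in the text.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Values quoted in the text; the names are made up for this sketch. */
#define RESET_VECTOR UINT64_C(0xFFFFFFF0)
#define FOUR_GIB     (UINT64_C(1) << 32)
#define BIOS_START   UINT64_C(0xF0000)
#define BIOS_END     UINT64_C(0xFFFFF)   /* inclusive */
#define KIB          UINT64_C(1024)

/* Bytes between the reset vector and the 4 GiB boundary. */
uint64_t reset_vector_room(void)
{
    return FOUR_GIB - RESET_VECTOR;
}

/* Size of the legacy BIOS window in KiB. */
uint64_t bios_window_kib(void)
{
    assert(BIOS_END >= BIOS_START);
    return (BIOS_END - BIOS_START + 1) / KIB;
}

/* Print the figures for comparison with the text. */
void print_bios_layout(void)
{
    printf("room above reset vector: %llu bytes\n",
           (unsigned long long)reset_vector_room());
    printf("BIOS window: %lluK to %lluK (%llu KiB)\n",
           (unsigned long long)(BIOS_START / KIB),
           (unsigned long long)((BIOS_END + 1) / KIB),
           (unsigned long long)bios_window_kib());
}
```

Calling print_bios_layout() reports 16 bytes above the reset vector and a 64 KiB window from 960K to 1024K, which matches the figures given above.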

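There are only two assembly files under SeaBIOS/src/:
>>> seabios/src/entryfuncs.S : macro definitions for calling C functions from assembly
>>> seabios/src/romlayout.S :
The BIOS entry function is “entry_post”; POST stands for Power-On Self-Test. The top of the file uses DECLFUNC to define several functions: entry points for interrupt handling, 32/16big/16 mode transitions, resume, PMM (POST Memory Manager), PnP (Plug and Play), APM (Advanced Power Management), PCI-BIOS, BIOS32, ELF, and so on. Inside entry_post, the various devices are set up one after another through a series of jumps and interrupts.
“jmp entry_19” jumps to the entry_19 function, which, through the macros defined in entryfuncs.S, actually calls handle_19() in src/boot.c to load and boot the operating system.
entry_18/handle_18() handles a failed boot (INT 19).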

(Quoted from: http://en.wikipedia.org/wiki/BIOS_interrupt_call)
+ 18h Execute Cassette BASIC: True IBM computers contain BASIC in the ROM to be interpreted and executed by this routine in the event of a boot failure (called by the BIOS)
+ 19h After POST this interrupt is used by BIOS to load the operating system.

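QEMU [4] parses the boot parameters (for example: “-boot order=ndc,menu=on,boot-timeout=1000 ..”) and passes them to the BIOS through fw_cfg files in the ROM; the BIOS reads these files and applies the parameters. SeaBIOS is not used by QEMU alone, however, so the defaults for the boot parameters and the boot policy differ between users.
There is a specification for BIOS booting (BIOS Boot Specification) [5] http://www.scs.stanford.edu/nyu/04fa/lab/specsbbs101.pdf. It has to stay compatible with, and support, a lot of hardware, so it is fairly complex. I am still reading the specification…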
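
To make the “order=ndc” part of the example concrete, the sketch below decodes the drive letters that QEMU documents for the x86 PC target: a for floppy, c for the first hard disk, d for the first CD-ROM, and n for network boot. The function names are hypothetical. The sketch only illustrates the command-line notation; it does not show how QEMU encodes the order in fw_cfg or how SeaBIOS consumes it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Map one "-boot order=" letter to a device class (x86 PC letters). */
const char *boot_letter_name(char letter)
{
    switch (letter) {
    case 'a':
        return "floppy";
    case 'c':
        return "hard disk";
    case 'd':
        return "CD-ROM";
    case 'n':
        return "network";
    default:
        return "unknown";
    }
}

/*
 * Fill out[] with device names in boot priority order.
 * Returns the number of entries written (at most max).
 */
size_t decode_boot_order(const char *order, const char **out, size_t max)
{
    assert(order != NULL && out != NULL);

    size_t n = strlen(order);
    if (n > max)
        n = max;
    for (size_t i = 0; i < n; i++)
        out[i] = boot_letter_name(order[i]);
    return n;
}
```

With the string "ndc" from the example, the decoded priority is network first, then CD-ROM, then hard disk.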

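The BIOS provides the DSDT (Differentiated System Description Table) for the system's ACPI, so that ACPI can initialize and configure different types of devices through one uniform interface. The description is written in ASL (ACPI Source Language); the compiled hexadecimal file can be used by any standard operating system. Take hot-plug as an example: the BIOS describes PCI devices (such as network cards) in the DSDT and defines the power-management callbacks, with the _EJ0 method used to remove a device. Correspondingly, inside the operating system a PCI driver handles hot-plug of PCI devices (code: linux/drivers/pci/hotplug*), from probing PCI devices at fixed I/O ports, through registration, initialization, and management, to the final teardown.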

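Other interesting things:
SMBIOS (System Management BIOS): a unified specification that motherboard and operating system vendors follow to present product management information
DMI (Desktop Management Interface): a management system that helps collect information about a computer system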

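Several things are mixed together here, but they basically all revolve around SeaBIOS. SeaBIOS seems to have very little documentation, while coreboot is fairly well documented. The code and the mailing list are there, though, and those are the best documentation.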

[1] http://www.seabios.org
[2] http://www.coreboot.org
[3] http://en.wikipedia.org/wiki/BIOS_interrupt_call
[4] http://www.qemu.org
[5] http://www.scs.stanford.edu/nyu/04fa/lab/specsbbs101.pdf

Real Time and Linux, Part 2: The Preemptible Kernel

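Translator of this article:
Kang Hua holds a master's degree in computer science and works mainly on research and development in the areas of the Linux kernel, Linux technical standards, computer security, and software testing. He currently works at the MII-HP Linux Software Lab, which belongs to the Software and Integrated Circuit Promotion Center of the Ministry of Information Industry. He can be reached at kanghua151@msn.com if needed.
Note: text written in brown is a translator's note.
If anything is unclear, please refer to the original article.

Kevin continues his real-time journey. This time he focuses on how modifying the Linux kernel can bring real-time performance to applications.

In the January/February 2002 issue of Embedded Linux Journal, we explored the basics of real time and Linux. This time I concentrate on modifying the Linux kernel to bring real-time performance to applications. So far, the work has centered on improving kernel responsiveness, that is, shortening system response time by reducing preemption latency, because we know that preemption latency in Linux is long.

By improving the kernel (merely stripping out some functionality of the standard kernel) without changing or adding to the kernel API, applications can obtain faster response. The advantage is obvious: ISVs (independent software vendors) do not need to develop different versions for different real-time requirements. A DVD player, for example, can run more reliably on an improved kernel without having to know that the kernel is an improved version.

Background and history

Since the release of the 2.2 kernel, kernel preemption has been a hot topic. Paul Barton-Davis and Benno Senoner wrote a letter to Linus Torvalds (later signed by many others) asking that preemption latency be reduced significantly in the 2.4 kernel.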

Anatomy of Linux synchronization methods

FROM: http://www.ibm.com/developerworks/cn/linux/l-linux-synchronization.html (Some mechanisms are not covered there. The 2.6 synchronization mechanisms include: per-cpu variables, atomic operations, memory barriers, spin locks, semaphores, seqlocks, local interrupt disabling, local softirq disabling, and read-copy-update. See ULK3 for details.)

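M. Tim Jones is an embedded software engineer and the author of GNU/Linux Application Programming, AI Application Programming, and BSD Sockets Programming from a Multilanguage Perspective. His engineering background ranges from kernel development for geosynchronous spacecraft to embedded architecture design and network protocol development. Tim is a consultant engineer at Emulex Corp. in Longmont, Colorado.

Summary: While learning Linux®, you may have come across concurrency, critical sections, and locking, but how are these concepts used in the kernel? This article discusses the locking mechanisms available in the 2.6 kernel, including atomic operators, spinlocks, reader/writer locks, and kernel semaphores. It also explores where each mechanism is best applied in order to build safe and efficient kernel code.

Published: 19 November 2007

This article discusses many of the synchronization or locking mechanisms available in the Linux kernel. It presents the application program interfaces (APIs) for many of the methods available in the 2.6.23 kernel. Before digging into the APIs, though, you first need to understand the problem being solved.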

Concurrency and locking

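Synchronization methods are necessary wherever concurrency exists. Concurrency exists when two or more processes execute over the same period of time and interact with one another (for example, by sharing the same resources).

Concurrency can occur on uniprocessor (UP) hosts, where multiple threads share the same CPU and preemption creates race conditions. Preemption shares the CPU by temporarily interrupting one thread so that another can execute. A race condition occurs when two or more threads manipulate a shared data item and the result depends on the timing of execution. Concurrency also exists on multiprocessor (MP) machines, where threads that share the same data execute simultaneously on different processors. Note that in the MP case there is true parallelism, because the threads really do execute at the same time. In the UP case, parallelism is created by preemption. Getting concurrency right is difficult in both modes.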
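
The race condition described above can be replayed deterministically without any real threads. The sketch below is a single-threaded simulation, not kernel code: it splits an increment into its load and store steps and interleaves two hypothetical workers by hand, the way a badly timed preemption would. All names are invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One "thread" doing counter++ as two separate steps: load, then store. */
struct worker {
    int tmp;    /* private copy of the shared value */
};

static void load(struct worker *w, const int *shared)
{
    assert(w != NULL && shared != NULL);
    w->tmp = *shared;
}

static void store_incremented(const struct worker *w, int *shared)
{
    assert(w != NULL && shared != NULL);
    *shared = w->tmp + 1;
}

/* X is "preempted" between its load and its store; Y runs in the gap. */
int run_preempted(void)
{
    int counter = 0;
    struct worker x, y;

    load(&x, &counter);              /* X reads 0 */
    load(&y, &counter);              /* Y reads 0 while X is paused */
    store_incremented(&y, &counter); /* Y writes 1 */
    store_incremented(&x, &counter); /* X writes 1: Y's update is lost */
    return counter;
}

/* Each increment completes before the next one starts. */
int run_serialized(void)
{
    int counter = 0;
    struct worker x, y;

    load(&x, &counter);
    store_incremented(&x, &counter);
    load(&y, &counter);
    store_incremented(&y, &counter);
    return counter;
}
```

Two increments yield 1 in the preempted interleaving and 2 in the serialized one; which result a real program gets depends purely on timing, and that dependence is the race.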
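
The Linux kernel supports concurrency in both modes. The kernel itself is dynamic, and there are many ways in which race conditions can be created. The Linux kernel also supports multiprocessing, known as symmetric multiprocessing (SMP). You can learn more about SMP in the Resources section later in this article.

The concept of the critical section arose to deal with race conditions. A critical section is a portion of code that is protected against multiple access. This code may manipulate shared data or a shared service (such as a hardware peripheral). Critical sections operate on the principle of mutual exclusion: while one thread is inside a critical section, all other threads are kept out of it.

One problem that has to be solved with critical sections is deadlock. Consider two separate critical sections, each protecting a different resource. Each resource has a lock, called A and B in this example. Suppose two threads need access to these resources: thread X takes lock A, and thread Y takes lock B. While holding these locks, each thread tries to take the lock currently held by the other (thread X wants lock B, and thread Y wants lock A). The threads are now deadlocked, because each holds one lock and wants the other. A simple solution is to always take the locks in the same order, which allows one of the threads to complete. Other solutions are needed to detect this situation. Table 1 defines some important concurrency terms used here.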
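Table 1. Important definitions in concurrency

Term                Definition
------------------  ----------------------------------------------------
Race condition      Situation where simultaneous manipulation of a
                    resource by two or more threads causes inconsistent
                    results.
Critical section    Segment of code that coordinates access to a shared
                    resource.
Mutual exclusion    Property of software that ensures exclusive access
                    to a shared resource.
Deadlock            Special condition created by two or more processes
                    and their resource locks that keeps the processes
                    from doing productive work.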
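
The "always take locks in the same order" rule from the deadlock example can be sketched as follows. This uses user-space pthread mutexes as a stand-in for kernel locks, and ordering by address is just one common convention, assumed here for illustration; the helper names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Pick which of two locks must be taken first: the lower address wins. */
pthread_mutex_t *first_lock(pthread_mutex_t *a, pthread_mutex_t *b)
{
    assert(a != b);
    return ((uintptr_t)a < (uintptr_t)b) ? a : b;
}

/* Acquire both locks in the global order, whatever the argument order. */
int lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_t *first = first_lock(a, b);
    pthread_mutex_t *second = (first == a) ? b : a;

    int err = pthread_mutex_lock(first);
    if (err != 0)
        return err;
    err = pthread_mutex_lock(second);
    if (err != 0)
        pthread_mutex_unlock(first);   /* back out on failure */
    return err;
}

/* Release both locks; the unlock order does not matter for deadlock. */
void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_unlock(a);
    pthread_mutex_unlock(b);
}
```

If thread X calls lock_pair(&A, &B) and thread Y calls lock_pair(&B, &A), both end up requesting the same lock first, so the situation where each holds one lock and waits for the other cannot arise.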

Linux synchronization methods

Debugging the MAC address of a bridge

# strace brctl addbr br0
...
socket(PF_FILE, SOCK_STREAM, 0)         = 3
ioctl(3, SIOCBRADDBR, 0x7fff0b1240c6)   = 0
exit_group(0)                           = ?

# ifconfig br0 (got a random address)
br0       Link encap:Ethernet  HWaddr 5a:34:f1:ba:8f:40 

# strace brctl addif br0 eth0 (no ioctl related to changing the MAC)
...
ioctl(4, SIOCGIFINDEX, {ifr_name="eth0", ifr_index=2}) = 0
close(4)                                = 0
ioctl(3, SIOCBRADDIF, 0x7fff57c335c0)   = 0
exit_group(0)                           = ?

# ifconfig br0 (got same addr as eth0)
br0       Link encap:Ethernet  HWaddr 00:22:68:16:c9:e8  

# grep SIOCBRADDIF -nr net/bridge/
net/bridge/br_ioctl.c:409:      case SIOCBRADDIF:
net/bridge/br_ioctl.c:411:              return add_del_if(br, rq->ifr_ifindex, cmd == SIOCBRADDIF);

net/bridge/br_ioctl.c:
int br_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
{
...
        case SIOCBRDELIF:
                return add_del_if(br, rq->ifr_ifindex, cmd == SIOCBRADDIF);
                           ^
                           |
                           |
net/bridge/br_ioctl.c      v
static int add_del_if(struct net_bridge *br, int ifindex, int isadd)
{
...
        if (isadd)
                ret = br_add_if(br, dev);
                           ^
                           |
                           |
net/bridge/br_if.c:        v
int br_add_if(struct net_bridge *br, struct net_device *dev)
{
...
        changed_addr = br_stp_recalculate_bridge_id(br);
                           ^
                           |
                           |
net/bridge/br_stp_if.c:    v
bool br_stp_recalculate_bridge_id(struct net_bridge *br)
{       
        const unsigned char *br_mac_zero =
                        (const unsigned char *)br_mac_zero_aligned;
        const unsigned char *addr = br_mac_zero;
        struct net_bridge_port *p;
        
        /* user has chosen a value so keep it */
        if (br->flags & BR_SET_MAC_ADDR)
                return false;
        
        list_for_each_entry(p, &br->port_list, list) {
                if (addr == br_mac_zero ||
                    memcmp(p->dev->dev_addr, addr, ETH_ALEN) < 0)
                        addr = p->dev->dev_addr; <---- the lowest ('min') MAC address in port_list is assigned to the bridge
        }