Meet Casper

Casper was my colleague at Red Hat; we hadn't seen each other for about four years. We had mutual friends in open-source communities even before he joined Red Hat. He now works at Aliyun on system software.

His team wants to cooperate with universities: sharing their technology, providing guidance, and offering positions (intern or full-time) to students.

They have investigated and applied machine learning in system software. One successful use case is key-value search, which is quicker than a hash table, but only for read-mostly workloads, which are common in their production environment.

Ten years of KVM

This article was contributed by Amit Shah

We recently celebrated 25 years of the Linux project. KVM, or Kernel-based Virtual Machine, a part of the Linux kernel, celebrated its 10th anniversary in October. KVM was first announced on 19 October 2006 by its creator, Avi Kivity, in this post to the Linux kernel mailing list.

That first version of the KVM patch set had support for the VMX instructions found in Intel CPUs that were just being introduced around the time of the announcement. Support for AMD’s SVM instructions followed soon after. The KVM patch set was merged in the upstream kernel in December 2006, and was released as part of the 2.6.20 kernel in February 2007.

Background

Running multiple guest operating systems on the x86 architecture was quite difficult without the new virtualization extensions: there are instructions that can only be executed from the highest privilege level, ring 0, and such access could not be given to each operating system without it also affecting the operation of the other OSes on the system. Additionally, some instructions do not cause a trap when executed at a lower privilege level — despite them requiring a higher privilege level to function correctly — so running a “hypervisor” that ran in ring 0, while running other OSes in lower-privileged rings was also not a solution.

The VMX and SVM instructions introduced a new ring, ring -1, to the x86 architecture. This is the privilege level where the virtual machine monitor (VMM), or the hypervisor, runs. This VMM arbitrates access to the hardware for the various operating systems so that they can continue running normally in the regular x86 environment.

There are several reasons to run multiple operating systems on one hardware system: deployment and management of OSes becomes easier with tools that can provision virtual machines (VMs). It also lowers power and cooling costs by consolidating multiple OSes and their corresponding applications and services onto newer, more capable hardware. Moreover, legacy operating systems and applications can run unchanged on newer hardware, with the hypervisor emulating the older hardware they expect.

The functionality of KVM itself is divided into multiple parts: the generic host kernel KVM module, which exposes the architecture-independent functionality of KVM; the architecture-specific kernel module in the host system; the user-space part that emulates the virtual machine hardware that the guest operating system runs on; and optional guest additions that make the guest perform better on virtualized systems.

At the time KVM was introduced, Xen was the de facto open source hypervisor. Since Xen was introduced before the virtualization extensions were available on x86, it had to use a different design. First, it needed to run a modified guest kernel in order to boot virtual machines. Second, Xen took over the role of the host kernel, relegating Linux to only manage I/O devices as part of Xen’s special “Dom0” virtual machine. This meant that the system couldn’t truly be called a Linux system — even the guest operating systems were modified Linux kernels with (at the time) non-upstream code.

Kivity started KVM development while working at Israeli startup Qumranet to fix issues with the Xen-related work the company was doing. The original Qumranet product idea was to replicate machine state across two different VMs to achieve fault tolerance. It was soon apparent to the engineers at Qumranet that Xen was too limiting and a poor model for their needs. The virtualization extensions were about to be introduced in AMD and Intel CPUs, so Kivity started a side-project, KVM, that was based on the new hardware virtualization specifications and would be used as the hypervisor for the fault-tolerance solution.

Development model

From the beginning, Kivity wrote the code with upstreaming in mind. One of the goals of the KVM model was to reuse as much existing functionality as possible: using Linux to do most of the work, with KVM just being a driver that handled the new virtualization instructions exposed by the hardware. This enabled KVM to gain any new features that Linux developers added to the other parts of the system, such as improvements in the CPU scheduler, memory management, power management, and so on.

This model worked well for the rest of the Linux ecosystem too. Features that started their life with only virtualization in mind began being useful and widely adopted in general use cases as well, like transparent huge pages. There weren’t two separate communities for the OS and for the VMM; everyone worked as part of one project.

Also, management of the VMs would be easier as each VM could be monitored as a regular process — tools like top and ps worked out of the box. These days, perf can be used to monitor guest activity from the host and identify bottlenecks, if any. Further chipset improvements will also enable guest process perf measurement from the host.

The other side of KVM was in user space, where the machine that is presented to the guest OS is built. kvm-userspace was a fork of the QEMU project. QEMU is a machine emulator — it can run unmodified OS images for a variety of architectures that it supports, and emulate those architectures’ instructions on the host architecture it runs on. This is of course very slow, but the advantage of the QEMU project was that it had quite a few devices already emulated for the x86 architecture — such as the chipset, network cards, display adapters, and so on.

What kvm-userspace did was short-circuit the emulation code to only allow x86-on-x86 and use the KVM API for actually running the guest OS on the host CPU. When the guest OS performs a privileged operation, the CPU exits to the VMM code and KVM takes over; if it can service the request itself, it does so and gives control back to the guest. This is a “lightweight exit”. For requests that the KVM code can’t serve, like any device emulation, it defers to QEMU. That implies exiting to user space from the host Linux kernel, and hence is called a “heavyweight exit”.
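
To make the exit flow concrete, here is a minimal sketch of such a user-space run loop, written against the documented KVM ioctl interface; the function and messages are illustrative, not QEMU's actual code:

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* vcpu_fd comes from KVM_CREATE_VCPU; run is the vCPU's mmap'ed
 * struct kvm_run (see the setup sketch further below). */
void run_guest(int vcpu_fd, struct kvm_run *run)
{
	for (;;) {
		/* Enter the guest; KVM_RUN returns only on an exit the
		 * kernel could not handle itself (a heavyweight exit). */
		if (ioctl(vcpu_fd, KVM_RUN, 0) < 0) {
			perror("KVM_RUN");
			return;
		}
		switch (run->exit_reason) {
		case KVM_EXIT_IO:
			/* Guest touched an emulated I/O port: user space
			 * emulates the device, then re-enters the guest. */
			break;
		case KVM_EXIT_MMIO:
			/* Memory-mapped device access, handled the same way. */
			break;
		case KVM_EXIT_HLT:
			printf("guest halted\n");
			return;
		default:
			printf("unhandled exit reason %u\n", run->exit_reason);
			return;
		}
	}
}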

One of the drawbacks in this model was the maintenance of the fork of QEMU. The early focus of the developers was on stabilizing the kernel module, and getting more and more guests to work without a hitch. That meant much less developer time was spent on the device emulation code, and hence the work to redo the hacks to make them suitable for upstream remained at a lower priority.

Xen too used a fork of QEMU for its device emulation in its HVM mode (the mode where Xen used the new hardware virtualization instructions). In addition, QEMU had its own non-upstream Linux kernel accelerator module (KQEMU) for x86-on-x86 that eliminated the emulation layer, making x86 guests run faster on x86 hardware. Integrating all of this required a maintainer who would understand the various needs from all the projects. Anthony Liguori stepped up as a maintainer of the QEMU project, and he had the trust of the Xen and KVM communities. Over time, in small bits, the forks were eliminated, and now KVM as well as Xen use upstream QEMU for their device model emulation.

The “do one thing, do it right” mantra, along with “everything is a file”, was exploited to the fullest. The KVM API allows one to create VMs — or, alternatively, sandboxes — on a Linux system. These can then run operating systems inside them, or just about any code that will not interfere with the running system. This also means that there are other user-space implementations that are not as heavyweight or as featureful as QEMU. Tools that can quickly boot into small applications or specialized OSes with a KVM VM started showing up — with kvmtool being the most popular one.
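
A minimal sketch of that API (documented /dev/kvm ioctls; error handling is mostly trimmed, and loading guest code is left out) shows how little is needed to create such a sandbox:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

int main(void)
{
	int kvm = open("/dev/kvm", O_RDWR);
	if (kvm < 0 || ioctl(kvm, KVM_GET_API_VERSION, 0) != 12) {
		fprintf(stderr, "KVM unavailable\n");
		return 1;
	}

	int vm = ioctl(kvm, KVM_CREATE_VM, 0);

	/* Back guest physical address 0 with 64 KB of anonymous memory. */
	void *mem = mmap(NULL, 0x10000, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	struct kvm_userspace_memory_region region = {
		.slot = 0,
		.guest_phys_addr = 0,
		.memory_size = 0x10000,
		.userspace_addr = (unsigned long)mem,
	};
	ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region);

	/* One vCPU; its control block is mmap'ed from the vcpu fd. */
	int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);
	long mmap_size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
	struct kvm_run *run = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
				   MAP_SHARED, vcpu, 0);

	/* Copying guest code into mem and calling KVM_RUN are omitted. */
	printf("VM and vCPU ready; kvm_run mapped at %p\n", (void *)run);
	return 0;
}

Everything else the VM needs, scheduling, memory management, and so on, comes from the host kernel, which is exactly the reuse described above.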

Developer interest

Since the original announcement of the KVM project, many hackers have been interested in exploring KVM. It helped that hacking on KVM was very convenient: a system reboot wasn’t required to install a new VMM. It was as simple as re-compiling the KVM modules, removing the older modules, and loading the newly-compiled ones. This helped immensely during the early stabilization and improvement phases. Debugging was a much faster process, and developers much preferred this way of working, as contrasted with compiling a new VMM, installing it, updating the boot loader, and rebooting the system. Another advantage, perhaps of lower importance on development systems but nonetheless essential for my work-and-development laptop, was that root permissions were not required to run a virtual machine.

Another handy debugging trick that was made possible by the separation of the KVM module and QEMU was that if something didn’t work in KVM mode, but worked in emulated mode, the fault was very likely in the KVM module. If some guest didn’t work in either of the modes, the fault was in the device model or QEMU.

The early KVM release model helped with a painless development experience as well: even though the KVM project was part of the upstream Linux kernel, Kivity maintained the KVM code on a separate release train. A new KVM release was made regularly that included the source of the KVM modules, a small compatibility layer to compile the KVM modules on any of the supported Linux kernels, and the kvm-userspace piece. This ensured that a distribution kernel, which had an older version of the KVM modules, could be used unchanged by compiling the modules from the newest KVM release for that kernel.

The compatibility layer required some effort to maintain. It needed to ensure that the new KVM code that used newer kernel APIs that were not present on older kernels continued to work, by emulating the new API. This was a one-time cost to add such API compatibility functions, but the barrier to entry for new contributors was significantly reduced. Hackers could download the latest KVM release, compile the modules against whichever kernel they were running, and see virtual machines boot. If that did not work, developers could post bug-fix patches.
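
As an illustration of the pattern, a compatibility shim along these lines (a hypothetical sketch in the spirit of that layer, with an illustrative version cutoff, not the actual kvm-kmod code) supplies a fallback when the running kernel predates an API:

/*
 * Hypothetical compat shim: if the kernel being built against predates
 * an API the new KVM code uses, provide an equivalent fallback so the
 * sources still compile. The version cutoff here is illustrative.
 */
#include <linux/version.h>
#include <linux/mm.h>
#include <linux/sched.h>

#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 27)
/* Older kernels lack get_user_pages_fast(); emulate it with the
 * slower path that takes mmap_sem explicitly. */
static inline int get_user_pages_fast(unsigned long start, int nr_pages,
				      int write, struct page **pages)
{
	int ret;

	down_read(&current->mm->mmap_sem);
	ret = get_user_pages(current, current->mm, start, nr_pages,
			     write, 0, pages, NULL);
	up_read(&current->mm->mmap_sem);
	return ret;
}
#endif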

Widespread adoption

Chip vendors started taking interest and porting KVM to their architectures: Intel added support for IA64 along with features and stability fixes to x86; IBM added support for s390 and POWER architectures; ARM and Linaro contributed to the ARM port; and Imagination Technologies added MIPS support. These didn’t happen all at once, though. ARM support, for example, came rather late (“it’s the reality that’s not timely, not the prediction”, quipped Kivity during a KVM Forum keynote when he had predicted the previous year that an ARM port would materialize).

Developer interest could also be seen at the KVM Forum, an annual gathering of people interested in KVM virtualization. The first KVM Forum in 2007 had a handful of developers in a room, where many discussions about the current state of affairs, and where to go in the future, took place. One small group, headed by Rusty Russell, took over the whiteboard and started discussions on what a paravirtualized interface for KVM would look like. This is where VIRTIO started to take shape. These days, the KVM Forum is a whole conference with parallel tracks, tens of speakers, and hundreds of attendees.

As time passed, it was evident the KVM kernel modules were not where most of the action was — the instruction emulation, when required, was more or less complete, and most distributions were shipping recent Linux kernels. The focus had then switched to the user space: adding more device emulation, making existing devices perform better, and so on. The KVM releases then focused more on the user-space part, and the maintenance of the compatibility layer was eased. At this time, even though the kvm-userspace fork existed, effort was made to ensure new features went into the QEMU project rather than the kvm-userspace project. Kivity too started feeding in small changes from the kvm-userspace repository to the QEMU project.

While all this was happening, Qumranet had changed direction, and was now pursuing desktop virtualization with KVM as the hypervisor. In September 2008, Red Hat announced it would acquire Qumranet. Red Hat had supported the Xen hypervisor as its official VMM since the Red Hat Enterprise Linux 5.0 release. With the RHEL 5.4 release, Red Hat started supporting both Xen and KVM as hypervisors. With the release of RHEL 6.0, Red Hat switched to only supporting KVM. KVM continued enjoying out-of-the box support in other distributions as well.

Present and future

Today, there are several projects that use KVM as the default hypervisor: OpenStack and oVirt are the more popular ones. These projects concern themselves with large-scale deployments of KVM hosts and several VMs in one deployment. These come with various use cases, and hence ask different things of KVM. As guest OSes grow larger (more RAM and virtual CPUs), they become more difficult to live-migrate without incurring too much downtime; telco deployments need low-latency network packet processing, so realtime KVM is an area of interest; and faster disk and network I/O is always an area of research. Keeping everything secure and reducing the hypervisor footprint are also being worked on. The ways in which a malicious guest can break out of its VM sandbox, and how to mitigate such attacks, are also a prime area of focus.

A lot of advancement happens with new hardware updates and devices. However, a lot of effort is also spent in optimizing the current code base, writing new algorithms, and coming up with new ways to improve performance and scalability with the existing infrastructure.

For the next ten years, the main topics of discussion may well not be about the development of the hypervisor. More interesting will be to see how Linux gets used as a hypervisor, bringing better sandboxing for running untrusted code, especially on mobile phones, and running the cloud infrastructure, by being pervasive as well as invisible at the same time.

From: http://lwn.net/Articles/705160/

Play Cassandra in Docker

The last time I played with Docker on a Mac, I used boot2docker [1], a lightweight Linux distribution based on http://tinycorelinux.net/. Today I wanted to deploy a multi-node Cassandra cluster. I could have copied my existing Fedora virtual machine a couple of times, but there wasn't enough disk space; no matter how much I cleaned up it still wasn't enough, mainly because I had compiled so much in that VM. For comparison, a default minimal install of the Fedora 24 Beta netinst image [2] takes only about 1.5 GB.

So I figured Docker would be more lightweight. I went through the Get Started documentation on the official site [3] and installed Docker Toolbox [4], which bundles the Docker engine, its components, a host virtual machine, and Kitematic (Beta) [5], a graphical management tool Docker acquired last year. Building management tools for low-level open-source system software looks like a promising path!

Low-level open-source system software tends to be technically impressive, but commercial deployments need graphical management interfaces that make deployment and operations easier for ordinary users. Systematic, automated, enterprise-grade testing and certification of open-source software is just as important. These are the two open-source-based business models and profit points I have seen so far.

The graphical tool and the docker command line can be used together: first start the engine and the host VM, then create your own containers and deploy applications.

The concrete Cassandra deployment steps follow [6]:

Starting "default"...
....
Copying certs to the local machine directory...
Copying certs to the remote machine...
Setting Docker configuration on the remote daemon...

                        ##         .
                  ## ## ##        ==
               ## ## ## ## ##    ===
           /"""""""""""""""""___/ ===
      ~~~ {~~ ~~~~ ~~~ ~~~~ ~~~ ~ /  ===- ~~~
           ______ o           __/
                          __/
              ___________/


docker is configured to use the default machine with IP 192.168.99.100
For help getting started, check out the docs at https://docs.docker.com
# Create a new container, pulling and running Cassandra from the official Docker repository
Jerusalem:~ amoskong$ docker run --name vm4 -d cassandra
09753a7f8c615229e0cfba97d46a9a557c0f189dab4584482ba915191892e222

# Get a shell in the container, then run cqlsh to connect to the database
Jerusalem:~ amoskong$ docker exec -it vm4 sh

# cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.5 | CQL spec 3.4.0 | Native protocol v4]
Use HELP for help.
cqlsh> 

Repeat the commands above with different names to create more containers, one per Cassandra node. To join multiple nodes into one cluster you need to configure the Cassandra seeds, which can be done through the configuration file, the docker command line, or the settings page of the Kitematic GUI.

# Specify the Cassandra seed IP addresses when creating a new node's container
# (note: -e must come before the image name, and the container needs a fresh name)
Jerusalem:~ amoskong$ docker run --name vm5 -d -e CASSANDRA_SEEDS="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' vm1)" cassandra

Jerusalem:~ amoskong$ docker run --name vm5 -d -e CASSANDRA_SEEDS="172.17.0.2,172.17.0.3,172.17.0.4" cassandra

Cassandra in the third container kept failing to start due to insufficient memory. After some searching online, I finally got it working by setting the MAX_HEAP_SIZE and HEAP_NEWSIZE environment variables (passed with -e when creating the container, as above).

[1] http://boot2docker.io/
[2] https://mirrors.ustc.edu.cn/fedora/linux/releases/test/24_Beta/Everything/x86_64/iso/Fedora-Everything-netinst-x86_64-24_Beta-1.6.iso
[3] https://docs.docker.com/mac/
[4] https://www.docker.com/products/docker-toolbox
[5] http://www.lupaworld.com/article-251196-1.html
[6] https://hub.docker.com/_/cassandra/

Berlin in Germany (柏林) Part 3

I spent six days in Berlin. The first day was a subway tour; the rest of the time I got around on Jinpu's bicycle. The city is not big, and by bike you can reach many small places and discover interesting things: stop whenever you like, no constant map-checking, no hunting forever for a bus or subway stop. I planned to ride to Potsdam, so the first two days were for getting used to the traffic, and for building up stamina.

Jinpu's home is right next to the airport in northwest Berlin; ride east, turn south, and you reach Alexanderplatz, where he works. It is still quite far, more than half an hour by bike. Jinpu is very fit, so his earlier claim that it was "close" took on new meaning, and not charging straight to Potsdam on day one was clearly wise. Jinpu said Potsdam was "close" too 0:-)

The streets are wide and the traffic consistently orderly, which fits the German character. One afternoon, though, I did run into a few street racers, with a police car chasing them, siren blaring. Even around seemingly remote little villages the public facilities are complete: private gardens, public lawns, leisure parks, four fenced football pitches. People living here are truly fortunate, though I did wonder whether there were enough residents to justify four big pitches. I rode past the places my subway rides had passed, stopping to take photos, trying to leave as many impressions in my mind as I could. Many rivers cut the city into small pieces, and development is uneven; a few spots felt messy and grubby.

I basically rode from east to west, west to south, then east again, looping around the city many times: the Berlin Cathedral, the museums, the TV tower, the Memorial to the Murdered Jews of Europe, the remains of the Berlin Wall, Humboldt University, I took them all in. Even more, though, it was the nameless, ordinary things of everyday life: the small church near a residential area, playing table tennis with smoking schoolkids, wandering a flea market, chatting with a Turkish restaurant owner. The Berlin Cathedral and the museums are almost too solemn; the old architecture commands respect. Berlin has lately been busy "renovating", restoring its old buildings behind scaffolding everywhere, presumably to develop tourism and give the old face a new look.

My first visit to Humboldt University was at night; I saw nothing and met hardly anyone. I had read my teacher's blog posts about visiting Humboldt and how great it was, so I went looking for traces of Marx, Einstein, Schrödinger, Hertz, Hegel, Schopenhauer, Engels, Heine, Bismarck, Born, and Zhou Enlai, and felt nothing at all; having no culture is a scary thing. Early the next morning I went back, and still didn't see much. It turns out Humboldt has several campuses, and the students were all away on holiday, so... so I could only leave. The buildings are old, but you can tell the research buildings on the north side are very modern, with people still working in the labs in the middle of the night. The university is completely open: no gates, no guards, classroom buildings free to walk into, and plenty of free brochures for visitors introducing the school and its free lectures.

Once riding I would forget to eat, and I kept going until I stopped at a corner of some neighborhood in the city's southwest and locked up the bike. I found a small Turkish kebab shop; they had soft flatbread remarkably close to what we make back home, delicious with the vegetables, plus beer, a football match, and slot machines. The owner is from Turkey; his wife and children are back home, and he returns only every two or three years. With the rent for the shop and his lodging so high, he clears the equivalent of a bit over 300,000 RMB a year; prices in Berlin are slightly higher than Beijing (basically comparable), and business has been getting worse these past few years. He was resigned: there is nothing for him to do back home, so he keeps going here even as things deteriorate. I told him to come to China, that people in my hometown love this kind of meat-stuffed flatbread and business would probably be better than here; he said sure, sure. I was tipsy by then, spouting English wildly: life is hard, we are friends. We kept chatting until the buzz wore off, and only then did I leave. He told me to come find him next time I'm in Berlin, better yet if I come here to work. He really is a good man; a year later, all I can do is wish him and his family well.

On the fourth day everything was ready: the ride to Potsdam. All I knew about Potsdam was its famous Proclamation; Jinpu said the town is small, surrounded by rivers, and very beautiful. Early in the morning I set off from northwest Berlin along the west side of the lake, passing wild ducks at the shore, people out exercising, artsy signposts, industrial areas. Almost the whole way had bike lanes, very safe, and riders even say sorry when overtaking. The second half had climbs and a headwind, which was tiring, so I bought some snacks at a supermarket. The instant noodles, it turned out, couldn't be eaten dry, too hard, they had to be boiled; the tin I opened was all soup, and in the biting wind my heart just sank... you can't shop on imagination alone. On my first day I had bought noodles and vegetables, found the noodles too hard, and had to stir-fry them a second time; the chili powder tasted like dirt... altogether hopeless. Better to shop at a proper Asian supermarket; and if you plan to live abroad, train up your cooking first so you won't starve. Recipe apps like Xiachufang are very useful.

Halfway there I took a wrong turn at an intersection and ended up at a hunting forest park. For a moment I panicked: what if some stray bullet found me, alone in a foreign country... And at that very moment my GPS magically started working; in Düsseldorf and during my first days in Berlin it hadn't worked at all, and I had assumed the phone or the service was the problem. Yet right then it worked: it showed I needed to keep going south, while I had strayed off to the west. A sign nearby said the park held deer and some harmless small animals, oh well. Just then someone came by, and I asked again to confirm that I should continue south.

Riding on south, I tried a shortcut (riding along the road the whole way was dull) and ended up deep in the forest, the paths buried under rotting leaves, yet I stubbornly pressed on rather than backtrack; I met only a father and daughter the whole way. I was a little scared: the trees were tall but not dense, I could occasionally hear cars, but I simply couldn't find a road. Having studied the map earlier, I figured that heading southwest would get me out. I walked and walked, and walked some more, until at last I saw a house and a proper road. Following it for a good while brought me to a small village; my phone showed I had been drifting northwest the whole time and had burned two hours in the forest. The rest was easy, a straight run into the town of Potsdam. The moment the town came into view everything brightened; no more rash moves next time.

Potsdam really is full of water; it feels like several small islands pieced together, and on a few especially small "islands" a single park fills the whole thing, easy to roam by bike. Eventually a light rain began. In a public exercise area I rested for a while on a rope swing, and despite the chill it felt wonderful: curled up as in a hammock, watching the blue sky and an unfamiliar city, letting my sore calves recover. The rain soon stopped and I wove back and forth through a few more blocks. There is a freshly renovated cathedral, yellowish-white walls and a verdigris roof, very beautiful, with a museum next to it that I didn't enter. On the city's west side is a small hill; probably because the town is small, industry and residential areas are interleaved, a little messy, and the supermarkets are pricier than downtown, presumably because of transport costs.

The way back was much easier, along the east side of the lake on a dedicated bike path, straight through to where I had taken the boat on the first day, past the Victory Column, and, since it wasn't late, onward north. I had fried rice at a Thai restaurant and a glass of mango juice, wonderfully sweet and fresh. At checkout I noticed the bill was 0.5 euro more than the menu price; the owner pointed at a price posted elsewhere, and I said I could only pay what the menu said. He didn't argue, took off the 0.5 euro, and even rounded away some small change. I felt a little guilty :-/

The last two days were also mostly in the city, covering odd corners and shopping, though shops here don't open on the weekend, so I didn't buy much. The airport is right by Jinpu's home, so I went there specially to watch planes take off and land; quite a few people were doing the same, standing on a platform, taking photos and video. I couldn't tell which of them were tourists, and since it was behind a quiet residential area, nobody really talked to each other.

On the last day Jinpu and Xiaodan came back from Turkey. Meeting in person for the first time, they were very friendly and we talked a lot. Xiaodan made a call about a SIM card, and you could hear her pronunciation was excellent; her English is probably excellent too. Jinpu is a classic IT guy, honest and warm, and he generously shared his experience of life in Germany: how people here work, the social and educational benefits, how colleagues get along. In short, school is free here; local families are close-knit, so people seldom visit each other's homes, and Chinese who move here can feel a little lonely. But the natural environment is good, social relations are simple, material conditions are excellent, and life is comfortable. If you love traveling and European culture, all the better. Xiaodan studied biology, and to work in her field she would have to go to a big laboratory in another city, as Berlin doesn't have the facilities; for now she is mainly studying German.

After just a few days of wandering, leaving brought a touch of sadness. I don't know how long it will be before I come back, whether the next time will be for travel or for work, or whether I will never make it back at all. Berlin, as a rebuilt capital, doesn't have the spectacular scenery or famous sights of some other cities, but cycling through it let me feel how people live here, and gave me new understanding of, and reflection on, my own life.

Berlin in Germany (柏林) Part 2

At the end of 2013 I got to know Jinpu and Xiaodan, who work in Germany. Jinpu found me on LinkedIn and introduced me to a good job opportunity in Germany; although I didn't end up going to work there, we became good friends. Through their WeChat moments I often saw their life, study, work, and travel abroad, and I really envied the way they lived. But we had never met in person.

I contacted them before going to Germany, but their plans were already set: the Turkey trip could not be moved. Still, they invited me to visit Berlin, drew up detailed travel suggestions for me, introduced many good places and activities they had discovered, let me eat and stay at their home, and lent me their mountain bike. Such sincere treatment and trust made me very grateful; after the customary Chinese round of polite refusals, I happily accepted 🙂

While attending LinuxCon / KVM Forum in Düsseldorf I ran into Sebastian from Jinpu's company. I had only ever seen photos of him, and some of his impressive work on Facebook and LinkedIn (kernel development projects, aircraft simulation and construction, games, and so on), yet I recognized him at a glance as we walked past each other. He seemed a touch shy in conversation, but the moment I asked about their projects he poured out idea after idea, with evident pride. I also greeted a few other people from his company and introduced myself.

Alone, I left Düsseldorf by train early in the morning, heading north. At first the carriage was empty and the windows pitch dark; later the train filled up and beautiful scenery appeared outside: scattered villages, endless lawns, quiet factories, rivers. Germany borders the sea in the north and slopes from high in the south to low in the north; moist air from the sea pushes south, temperatures fall and rainfall is heavy, so vegetation grows lush and overcast, rainy days are common. Perhaps that is why Germans handle things so seriously and rigorously, with world-leading manufacturing and a developed economy.

Mediterranean Spain, by contrast, is struggling with its economic crisis. Last time I went to Barcelona, half the city felt like a ghost town: hardly anyone on the main streets in daytime, shops shut, strikes and demonstrations frequent, robbery and pickpocketing rampant. People there get up very late; shops are open only a few hours a day; dinner is usually after ten at night, and if you go early the place either hasn't opened yet or is nearly empty. After dinner people usually have several gatherings to attend, staying out very late. With the seaside climate, beach and sunbathing come free, and with some seafood and beer it is even better. My guess is everyone is so busy enjoying themselves that work gets forgotten, hence the economic crisis ;-?

Berlin's main train station is huge, several floors, with convenient transfers to subway and bus as long as you read the signs carefully. I bought a one-day pass; on first use you have to validate it with a timestamp. Nobody checks tickets at all, it runs entirely on honesty, but if you are caught the fine is severe. I found a McDonald's hoping to use the WC, only to find it cost 3.5 euros; labor here is expensive, free services are few, and free WiFi is very scarce across the city. The subway network is dense; no station is very large, and the equipment is not as new as China's, it is a develop-ED country after all. After a few transfers I felt I had reached the eastern outskirts, graffiti all along the way and not much that looked brand new. First I went to get the key to Jinpu's place from Sebastian. Just as I stepped off the subway I happened to meet Sebastian's Indian colleague, whom I had met at the conference; he and his wife and children had moved here less than a year ago, and he took me to the company. I had already scouted the area on Google Street View; it is very close to the bustling Alexanderplatz shopping district. From outside it is still a European-style building, not a bright, fashionable office tower, more like a residential block. Through the front door is a long corridor, then a large courtyard, and only through the windows do you see all the modern companies inside. The meeting rooms are small but the lounge is large, stocked with all kinds of beer, desserts, and games; yet I didn't see a single person lounging, and many people bring their own lunch from home.

Key in hand, I took the subway on to Jinpu's place in the northwest, easy to find since I had looked it all up beforehand; the greenery there is even better, and hardly anyone was about in the residential area. I dropped my things, rested briefly, and went out again to a lake district in the south for a boat ride, covered by the same subway ticket. On the train I chatted with a very friendly older lady; she used to live in Düsseldorf and had moved here with her son, who works here, less than a year ago, so many places were still unfamiliar to her too.

The boat dock was right across from the subway station. I can only roughly puzzle out German; many words are close to English, for example Kontakt, contact. A high-school student sitting on the bank reading saw my black hair and yellow skin and asked if I needed help. I confirmed the boarding point and the route with him, then asked about school, English teaching, and hands-on practice; the kid's English was excellent.

Both the subway and the boat carry bicycles, convenient for people coming here to ride. The boat was not slow, but the lake was so big that the trip still took a long, long time, and I photographed the whole way. There were people paddling kayaks on the lake, and a few "girls" of sixty or seventy took funny selfies the whole trip.

(To be continued...)

Looking for a SW Engineer to join the Avocado team (Power8)

Hi folks. (from: https://www.redhat.com/archives/autotest-kernel/2014-September/msg00012.html)

We’re looking for a SW Engineer to work on the KVM Test Automation team (currently working on Avocado), focusing on Power8.

The primary assignment of this person will be to work on the Power8 port of Avocado and, in the long term, to be the person responsible for anything related to Power8 in our KVM test automation infrastructure.

If you know of somebody who would fit the description below and is willing to join our team, please point them to the Red Hat job site or contact me in private.

The position is public at this URL:
http://jobs.redhat.com/jobs/descriptions/software-engineer-automating-testing-brno-jihomoravsky-kraj-czech-republic-job-1-4786737 (don't worry about the listed work location)

Company Description:
    At Red Hat, we connect an innovative community of customers,
    partners, and contributors to deliver an open source stack of
    trusted, high-performing technologies that solve business
    problems. We’re a billion dollar S&P 500 company offering
    solutions from Linux to middleware, storage to cloud,
    together with award-winning global customer support,
    consulting, and implementation services. 

Job summary:
    The Red Hat Engineering team in Brno is looking for a
    Software Engineer to join the KVM team in Brno, Czech
    Republic. Focusing on Power8 (ppc64), you will work on the
    development of testing frameworks and tools to automate the
    test coverage of Red Hat virtualization technologies running
    on the IBM Power8 architecture. As a Software Engineer,
    you'll work with a globally distributed team to develop
    critical technology for Red Hat products while collaborating
    with the open source community on many projects. 

Primary job responsibilities:
    * Design and implement new features in open source testing
      frameworks like Avocado and Autotest
    * Integrate multiple automation tools in a Continuous
      Integration (CI) system for KVM on Power8
    * Assist the Quality Assurance team in developing complex
      tests involving virtualization technologies
    * Promote a culture of test automation internally within Red
      Hat

Required skills:
    * 2-3 years of significant software development experience on
      Linux
    * Good understanding of the inner workings of a Linux
      distribution
    * Experience with test automation or continuous integration
      systems
    * Familiarity with development languages like Python and C
    * Experience with open source projects and development tools
    * Familiarity with IBM's Power8 architecture
    * Bachelor's degree in computer science or equivalent

Thanks!

Shrink a QEMU virtual machine's disk image (收缩QEMU虚拟机磁盘空间)

My camera has produced a lot of photos that would be a pity to delete, so I only post a selection online; the many scenery shots all went to my Baidu network disk. Add a pile of virtual machine disk images, and my hard drive is constantly short of space. Besides deleting heaps of unremarkable photos, the other place to save space is the VM images.

A qcow2 disk image occupies very little space when first created and grows as the guest actually uses its disk. Even after files are deleted inside the guest, the image does not shrink automatically. The method below can be used to reclaim the unused space.

+ Reclaiming space from my old guest images

  1) Fill unused space with zeros:
    In Linux guests:
     # dd if=/dev/zero of=./a.out bs=1M
     # rm a.out

    In Windows guests:
     # download sdelete from http://technet.microsoft.com/en-us/sysinternals/bb897443.aspx
     # sdelete -c c:

  2) # qemu-img convert -p -O qcow2 orig.qcow2 new.qcow2
     # rm orig.qcow2

After shrinking a few images that had been in use for a long time, my laptop has 40+ GB free again; I feel rich 🙂

level-triggered interrupt & edge-triggered interrupt (电平触发中断 与 边沿触发中断)

(Thanks to Wanpin Li for the feedback on this post, which helped me rediscover some issues.)

The two trigger modes exist to implement flow control when multiple devices share an interrupt line.

  • Edge-triggered: the interrupt fires when the level changes (usually from high to low); if the CPU is busy, the interrupt may be lost.
  • Level-triggered: the device holds the interrupt request pin (the interrupt line) at the preset active level (usually high), so the interrupt keeps being requested and is not lost even while the CPU is busy. To avoid handling the same request repeatedly, the handler must mark the irq, then ACK it (reset the request pin to the inactive level so other requests can be accepted), handle the interrupt, and finally clear the irq mark.

KVM initially supported only edge-triggered interrupts, using an irqfd for notification; the falling edge can be simulated in a single injection. In irqfd_inject() below, kvm_set_irq() is called twice in a row, and the second-to-last argument is the level.

Level-triggered injection for the VM still relies on user space: QEMU emulates the PCI devices, the IOAPIC and LAPIC, and the PIC (8259). The KVM_IRQFD ioctl is the interrupt-injection interface; its single argument, a struct kvm_irqfd, contains an irqfd, a GSI (global system interrupt) number, and flags.

In 2012, KVM also added emulation for level-triggered interrupts [1]: a resamplefd (an eventfd) was added to struct kvm_irqfd, signaled when the guest acknowledges the interrupt, so the irq mark can be cleared and the source resampled. Requests from all guests and devices are kept on a linked list, and all resampler irqfds share a single IRQ source ID, so the de-assert (resetting the interrupt request pin to the inactive level, i.e. clearing the interrupt status register, the ISR) needs to happen only once, while every irqfd registered for that GSI is signaled individually (to notify each resampler using that GSI).

static void
irqfd_inject(struct work_struct *work)
{
	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
	struct kvm *kvm = irqfd->kvm;

	if (!irqfd->resampler) {  //edge-triggered
		kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1, false);
		kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0, false);
	} else //level-triggered
		kvm_set_irq(kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID,
			    irqfd->gsi, 1, false); //assert irq
}
/*
 * Since resampler irqfds share an IRQ source ID, we de-assert once
 * then notify all of the resampler irqfds using this GSI.  We can't
 * do multiple de-asserts or we risk racing with incoming re-asserts.
 */
static void
irqfd_resampler_ack(struct kvm_irq_ack_notifier *kian)
{
	...
	kvm_set_irq(resampler->kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID,
		    resampler->notifier.gsi, 0, false); //de-assert irq for level-triggered interrupt

	list_for_each_entry_rcu(irqfd, &resampler->list, resampler_link)
		eventfd_signal(irqfd->resamplefd, 1); //notify resamplers
	...
}

[1] commit 7a84428af  [PATCH] KVM: Add resampling irqfds for level triggered interrupts
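
From user space, registering such a resampling irqfd takes two eventfds and one ioctl. Here is a minimal sketch against the documented API, assuming an already-created VM file descriptor; the function name is illustrative:

#include <stdio.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int register_resample_irqfd(int vm_fd, unsigned int gsi)
{
	int irq_fd = eventfd(0, 0);      /* written to assert the GSI */
	int resample_fd = eventfd(0, 0); /* signaled after the guest acks */

	struct kvm_irqfd irqfd = {
		.fd = irq_fd,
		.gsi = gsi,
		.flags = KVM_IRQFD_FLAG_RESAMPLE,
		.resamplefd = resample_fd,
	};
	if (ioctl(vm_fd, KVM_IRQFD, &irqfd) < 0) {
		perror("KVM_IRQFD");
		return -1;
	}
	/* Writing 1 to irq_fd asserts the line. When the guest acks the
	 * interrupt, KVM de-asserts the line once and signals resample_fd,
	 * letting the device model re-assert if the source is still raised. */
	return 0;
}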

Reference:
linux/virt/kvm/eventfd.c:
  irqfd_inject(struct work_struct *work)
  irqfd_shutdown(struct _irqfd *irqfd)
  irqfd_resampler_ack(struct kvm_irq_ack_notifier *kian)
  struct kvm_irqfd
  struct _irqfd
  struct _irqfd_resampler
linux/virt/kvm/irqchip.c:
  kvm_set_irq(kvm, irq_source_id, irq, level, line_status)

linux/virt/kvm/irq_comm.c:
  kvm_request_irq_source_id(struct kvm *kvm)

Wrap a sendkey() function to send a whole string to a guest (封装一个sendkey函数向虚拟机一次性发送字符串)

QEMU provides a sendkey monitor command that sends a single key, or a key combination, to the virtual machine. It only handles one key at a time because characters such as space, Ctrl, and Enter need translation, and raw keycode input must also be supported.
libvirt's virsh wraps QEMU's sendkey command to offer the same functionality to upper layers. A new requirement, though, is to type a whole string in one go. This could be implemented at three different levels: in QEMU, in libvirt, or in the user's own scripts and programs.

I prefer wrapping a conversion function in the user's program to send a string in one call; it is more flexible and easier to extend. QEMU and libvirt already accept several kinds of keys, and adding string conversion to them would complicate the interface and create semantic conflicts.

[amos@amosk qemu]$ cat sendkey.sh 
DOM=rhel6u5_x64

# Wrap a sendkey() function that calls virsh send-key to send a string to the guest
function sendkey() {
    str=$@
    length=`expr length "$str"`
    for ((i=1; i<=$length; i++)); do
        char=`expr substr "$str" $i 1`
        if [ "$char" = " " ];then
            char="spc"
        fi
        echo virsh send-key $DOM "$char"
    done
}

sendkey "root"
echo virsh send-key $DOM kp_enter
sendkey "shutdown -h now"
echo virsh send-key $DOM kp_enter
[amos@amosk qemu]$ bash sendkey.sh 
virsh send-key rhel6u5_x64 r
virsh send-key rhel6u5_x64 o
virsh send-key rhel6u5_x64 o
virsh send-key rhel6u5_x64 t
virsh send-key rhel6u5_x64 kp_enter
virsh send-key rhel6u5_x64 s
virsh send-key rhel6u5_x64 h
virsh send-key rhel6u5_x64 u
virsh send-key rhel6u5_x64 t
virsh send-key rhel6u5_x64 d
virsh send-key rhel6u5_x64 o
virsh send-key rhel6u5_x64 w
virsh send-key rhel6u5_x64 n
virsh send-key rhel6u5_x64 spc
virsh send-key rhel6u5_x64 -
virsh send-key rhel6u5_x64 h
virsh send-key rhel6u5_x64 spc
virsh send-key rhel6u5_x64 n
virsh send-key rhel6u5_x64 o
virsh send-key rhel6u5_x64 w
virsh send-key rhel6u5_x64 kp_enter

SeaBIOS study (1)

SeaBIOS [1] is an open-source BIOS implementation for the x86 architecture. It can work like a payload that runs after coreboot [2] has initialized the hardware, implementing the boot logic.

After the CPU is initialized, the address of the first instruction it executes (loaded into EIP) is 0xFFFFFFF0. This is an Intel CPU hack (an inelegant but effective solution to a computing problem) called the reset vector. The instruction at 0xFFFFFFF0~0xFFFFFFFF (the last 16 bytes below 4 GB) jumps the CPU to the system BIOS entry point at 0xF0000; the system BIOS is pre-loaded at 0xF0000~0xFFFFF (960 KB~1 MB).

There are only two assembly files under SeaBIOS/src/:
>>> seabios/src/entryfuncs.S : macros for calling C functions from assembly
>>> seabios/src/romlayout.S :
The BIOS entry point is "entry_post"; POST stands for Power-On Self-Test. The top of the file uses the DECLFUNC macro to define several entry functions: interrupt handlers, 32/16big/16 mode transitions, resume, PMM (POST Memory Manager), PnP (Plug and Play), APM (Advanced Power Management), PCI BIOS, BIOS32, ELF, and so on. entry_post configures the various classes of devices in turn, through jump functions and interrupts.
"jmp entry_19" jumps to the entry_19 function which, through the macros in entryfuncs.S, actually calls handle_19() in src/boot.c to load and boot the operating system.
entry_18/handle_18() handles boot (INT 19) failure.

(Quoted from: http://en.wikipedia.org/wiki/BIOS_interrupt_call)
+ 18h Execute Cassette BASIC: True IBM computers contain BASIC in the ROM to be interpreted and executed by this routine in the event of a boot failure (called by the BIOS)
+ 19h After POST this interrupt is used by BIOS to load the operating system.

QEMU [4] parses its boot options (for example "-boot order=ndc,menu=on,boot-timeout=1000 ..") and passes them to the BIOS through fw_cfg files in the ROM; the BIOS reads those files and applies the parameters. SeaBIOS is not used only by QEMU, though, so the boot-parameter defaults and boot policies differ between consumers.
There is a BIOS Boot Specification [5] (http://www.scs.stanford.edu/nyu/04fa/lab/specsbbs101.pdf) covering booting; since it has to remain compatible with and support a great deal of hardware, it is rather complex. I am still reading it...

The BIOS provides the DSDT (Differentiated System Description Table) for the system's ACPI, so ACPI can initialize and configure different kinds of devices through a uniform interface. The table is written in the ASL language; the compiled hex file can be consumed by any standard operating system. Take hotplug as an example: the BIOS describes PCI devices (such as NICs) in the DSDT and defines power-management callbacks, with the _EJ0 method used to eject a device. Correspondingly, inside the operating system, the PCI drivers handle PCI hotplug (code: linux/drivers/pci/hotplug*): probing PCI devices at fixed I/O ports, registering and initializing them, managing them, and finally tearing them down.

Other interesting things:
SMBIOS (System Management BIOS): the unified specification motherboard and OS vendors follow for exposing product management information
DMI (Desktop Management Interface): a management framework for collecting information about a computer system

This post mixes several things together, but they all revolve around SeaBIOS. SeaBIOS seems to have little documentation, while coreboot's is fairly complete; but the code and the mailing list are there, and that is the best documentation.

[1] http://www.seabios.org
[2] http://www.coreboot.org
[3] http://en.wikipedia.org/wiki/BIOS_interrupt_call
[4] http://www.qemu.org
[5] http://www.scs.stanford.edu/nyu/04fa/lab/specsbbs101.pdf