Hand-held Augmented Reality for Facility Maintenance



Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1412

Hand-held Augmented Reality for Facility Maintenance FEI LIU


ISSN 1651-6214 ISBN 978-91-554-9669-2 urn:nbn:se:uu:diva-301363

Dissertation presented at Uppsala University to be publicly examined in Room 2446, Lägerhyddsvägen 2, House 2, Uppsala, Friday, 7 October 2016 at 13:15 for the degree of Doctor of Philosophy. The examination will be conducted in English. Faculty examiner: Professor Xiangyu Wang (Curtin University).

Abstract
Liu, F. 2016. Hand-held Augmented Reality for Facility Maintenance. Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1412. 81 pp. Uppsala: Acta Universitatis Upsaliensis. ISBN 978-91-554-9669-2.

Buildings and public infrastructures are crucial to our societies in that they provide habitations, workplaces, commodities and services indispensable to our daily life. As vital parts of facility management, operations and maintenance (O&M) ensure that a facility continues to function as intended; they take up the longest time in a facility's life cycle and demand great expense. Therefore, computers and information technology have been actively adopted to automate traditional maintenance methods and processes, making O&M faster and more reliable. Augmented reality (AR) offers a new approach to human-computer interaction by directly displaying information related to the real objects that people are currently perceiving. People's sensory perceptions are enhanced (augmented) with information of interest naturally, without deliberately turning to computers. Hence, AR has been shown to further improve O&M task performance. The research motif of this thesis is user evaluations of AR applications in the context of facility maintenance. The studies look into invisible-target designation tasks assisted by the developed AR tools in both indoor and outdoor scenarios. The focus is on examining user task performance, which is influenced by both AR system performance and human perceptive, cognitive and motoric factors. Target designation tasks for facility maintenance entail a visualization-interaction dilemma.
Two AR systems built upon consumer-level hand-held devices using an off-the-shelf AR software development toolkit are evaluated indoors with two disparate solutions to the dilemma: remote laser pointing and the third person perspective (TPP). In the study with remote laser pointing, the parallax effect associated with AR "X-ray vision" visualization is also an emphasis. A third hand-held AR system developed in this thesis overlays infrared information on façade video and is evaluated outdoors. Since marker-based tracking is less desirable in an outdoor environment, an infrared/visible image registration method is developed and adopted by the system to align the infrared information correctly with the façade in the video. This system relies on the TPP to overcome the aforementioned dilemma.

Keywords: Augmented reality, Façade, Image registration, Thermal infrared imaging, Facility management, Third person perspective, Target designation, Precision study, Experiment

Fei Liu, Department of Information Technology, Computerized Image Analysis and Human-Computer Interaction, Box 337, Uppsala University, SE-751 05 Uppsala, Sweden. Department of Information Technology, Division of Visual Information and Interaction, Box 337, Uppsala University, SE-751 05 Uppsala, Sweden.

© Fei Liu 2016
ISSN 1651-6214
ISBN 978-91-554-9669-2
urn:nbn:se:uu:diva-301363 (http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-301363)

Dedicated to my family

List of papers

This thesis is based on the following papers, which are referred to in the text by their Roman numerals.

I Fei Liu and Stefan Seipel. Detection of line features in digital images of building structures. In IADIS International Conference Computer Graphics, Visualization, Computer Vision and Image Processing 2012 (CGVCVIP 2012), Lisbon, Portugal, pages 163–167, July 2012.

II Fei Liu and Stefan Seipel. Detection of façade regions in street view images from split-and-merge of perspective patches. Journal of Image and Graphics, 2(1):8–14, June 2014.

III Fei Liu and Stefan Seipel. Infrared-visible image registration for augmented reality-based thermographic building diagnostics. Visualization in Engineering, 3(16):1–15, 2015.

IV Fei Liu and Stefan Seipel. On the precision of third person perspective augmented reality for target designation tasks. To appear in Multimedia Tools and Applications, September 2016.

V Fei Liu and Stefan Seipel. Precision study on augmented reality-based visual guidance for facility management tasks. Submitted for journal publication, April 2016.

VI Fei Liu, Stefan Seipel and Torsten Jonsson. Augmented reality-based building diagnostics using natural feature registration and third person perspective. Manuscript prepared for journal submission, August 2016.

Reprints were made with permission from the publishers.

Related work

In addition to the papers included in this thesis, the author has also contributed to the following publication.

1. Julia Åhlén, Stefan Seipel, and Fei Liu. Evaluation of the automatic methods for building extraction. International Journal of Computers and Communications, 8:171–176, 2014.

Contributions to each work

Paper I Fei designed and implemented the detection method. Stefan offered advice through discussions during the process. Fei was the principal author, while Stefan contributed to the revision.

Paper II Fei designed and implemented the detection pipeline. Stefan offered advice through discussions during the process, suggested the experiment for validating the detection method and carried it out with Fei. Fei was the principal author of the paper, while Stefan contributed to the revision.

Paper III Fei designed and implemented the registration method while gathering both visible and infrared image data and performing tests on the data. Stefan contributed to the evolution of the method through discussions and advice for improvement, in particular the addition of statistical analysis of the test results. Fei was the principal author of the paper and Stefan contributed to the revision.

Paper IV The method presented in this paper was jointly designed by Fei and Stefan. The implementation of the software and the experimental procedure were carried out by students as part of a Bachelor-level project at the University of Gävle, where Fei acted as a co-supervisor. The results of the study were statistically analyzed by Stefan and Fei. Fei was the principal author of the paper with contributions from Stefan in the revision.

Paper V This paper was based on ideas of Fei and Stefan. Fei implemented the experimental AR application and conducted the user experiments. The statistical analysis was performed by Stefan and Fei. Fei was the principal author of the paper with major contributions from Stefan in the sections on results and discussion. Additionally, Stefan also revised the paper.

Paper VI Fei implemented the AR system and conducted user experiments with Torsten to collect data. Torsten designed the façade heating rig and oversaw its production process. Fei, Stefan and Torsten together designed the experiment procedures. Stefan conducted the statistical analysis of the results and discussed its implications and conclusions with Fei and Torsten. Fei was the principal author of the paper. Stefan contributed the sections on results, discussion and conclusions, while Torsten co-authored the sections on thermal anomaly simulation and user experiments with Fei.



1 Introduction
  1.1 Background
  1.2 Motivations and objectives
  1.3 Thesis structure

2 Augmented reality
  2.1 Definition
  2.2 Brief history
  2.3 Fundamental system components
    2.3.1 Display
    2.3.2 Tracking and registration
    2.3.3 Virtual content creation and rendering
    2.3.4 Input and interaction
  2.4 Evaluation of AR systems
    2.4.1 Challenges for AR evaluations
    2.4.2 AR user evaluations

3 Digital image processing
  3.1 Digital images
  3.2 Scope of digital image processing
  3.3 Image registration
    3.3.1 Overview
    3.3.2 General steps for image registration

4 Facility maintenance
  4.1 AR and the AEC/FM industry
  4.2 AR and facility maintenance

5 Summary of the papers

6 Conclusions
  6.1 Facility maintenance tasks
  6.2 Hand-held AR tools
  6.3 User performance




1. Introduction

1.1 Background

Buildings and public infrastructures are of vital importance to sustain our societies and to enable their continuous development for the well-being of their members. As a sector that designs and produces these artifacts as well as providing myriad other related services, the architecture, engineering, construction and facility management (AEC/FM) industry plays a significant role in most countries' economies, both in terms of employment and investment [80]. Not surprisingly, projects in the industry are inherently huge and complex, with large budgets, long durations divided into multiple phases, and a variety of interested parties. Taking these parties as an example, a typical construction project can comprise architects, structural designers, schedulers, fabricators, facility users, insurers, government agents, etc. All these people differ from each other with regard to skills, knowledge, experience and agendas. Given the likelihood of their geographical separation as well, collaboration and information sharing are among the many challenges of construction project management. Consequently, the AEC/FM industry has always been active in adopting information technology (IT) to improve project performance [67].

A typical kind of interaction between computers and their users nowadays follows this pattern: whenever users want to access information stored in computers, they pause the work at hand and divert their attention to the computers. They move mice or type on keyboards to interact with user interface (UI) control elements displayed by the computers to retrieve the desired information. After that, they switch their attention back to the previously interrupted work and continue. There is a seam between the physical world we live in and the cyberspace we heavily rely on [50], and the constant attention switches dictated by the conventional computer UI are by no means natural.
Without exception, computer users in the AEC/FM industry also experience this seam, and they may have to perform additional demanding mental activities at the same time. For instance, an on-site construction worker can obtain 2D drawings of the design of a wall via her laptop, but she needs to mentally map the 2D design (usually with different views as well) to the 3D space constantly during the construction. This mapping process is time-consuming and error-prone and thus requires a lot of practical experience. Is there a technology that can bridge these two worlds and thus lighten the mental workload of the aforementioned worker? The answer is augmented reality (AR). AR is an emerging UI technology that displays virtual information directly on the real environment. The virtual information can potentially address any human sense, such as vision or audition, but it is predominantly visual. The information relates to its corresponding real-world objects through tracking of the position and orientation of our relevant body parts. With AR, our sensory perceptions are enhanced naturally with the virtual information while we are perceiving the physical world. Consequently, the construction worker can readily see a 3D model of the wall erected on the construction site via AR and use it to guide her through the construction process.

1.2 Motivations and objectives

As with many other information technologies, the benefits of adopting AR for AEC/FM activities were soon acknowledged and embraced by practitioners and researchers. Plenty of AR applications have been proposed for almost all phases of a building's life cycle, such as architectural design and construction, for the purpose of facilitating collaboration and improving performance, although the majority of these systems are prototypes which have not been put into practice. FM concerns all the post-completion operations and services that ensure that a facility continues to function as intended. By nature, FM spans the longest period of time in the facility's life cycle and incurs the greatest expense. Therefore, substantial benefits can be gained by employing AR in FM if task performance is improved and overall costs are reduced. Unfortunately, this combination has not yet received much attention in the AEC/FM community [112].

Thanks to the advances of microelectronics, computers are being continuously miniaturized. While their sizes diminish, their processing power does not suffer. A case in point is the mobile devices, like smartphones and tablet computers, that people carry around in their pockets or briefcases every day. Most of them not only feature a multi-gigahertz multi-core central processing unit, a dedicated graphics processing unit, large memory and a large, high-resolution screen but are also equipped with a suite of sensors such as mega-pixel cameras, inertial measurement units, WiFi interfaces and GPS receivers. The popularity of powerful portable computers signals the arrival of ubiquitous computing, and AR is indisputably a better UI choice for it since the user's view of the world and the computer interface literally become one through AR [39].
Indeed, together with the availability of full-fledged AR software development toolkits, the application domains of AR are diversifying, permeating traditional professional areas as well as the mass market. As more AR applications are developed, it is increasingly necessary to evaluate them in order to identify issues before they reach their potential users. However, driven by the novelty and the vast unexplored design space of AR, researchers have been more keen on discovering new applications and inventing new hardware, interaction techniques, etc. Together with several challenges (detailed in Section 2.4), this has meant that the evaluation of developed AR systems has largely been left behind [23, 33].

In view of the research deficiencies discussed above, my research motif has been set to user evaluations of AR applications in the FM domain. The focus is on examining user task performance, taking into consideration human perceptive and cognitive factors. It is my belief that as technologies advance and the field evolves, AR systems will boast accurate tracking, convincing integration of virtual objects with their real counterparts and sophisticated yet natural interaction mechanisms. But in the end, all these systems have to be put into the hands of their users to realize their value, and therefore it is these human-machine systems that determine the ultimate task performance, not the AR systems alone, however well-crafted. More specifically, in this thesis I looked into maintenance tasks in the built environment with the assistance of hand-held AR and addressed the following research questions:

• Infrared thermography (IRT) has been a very popular technology for building diagnostics thanks to its capability of sensing temperature as well as its non-destructive and non-contact nature. How do we align infrared information with visible images (video frames) of the related façades in order to augment them? Can image registration techniques be applied here? If so, what are suitable features given these two very different modalities, and how can they be matched for estimating the transformation models?

• A maintenance task often involves a step of locating hidden or invisible objects. For instance, they can be pipes, ducts and wires concealed by walls or ceilings which need maintaining (or, similarly, need to avoid being damaged by the maintenance process), and AR can be used to visualize these objects.
However, given the limited field of view of the built-in cameras of mobile smart devices and the often large size of those occluding building elements, there is a dilemma between staying at a distance for more visual context and getting close enough to interact with the building elements, e.g., for marking up the objects. What display and/or interaction techniques can be employed to overcome this dilemma? What about outdoor scenarios? Can the solutions devised for the indoor environment be transferred directly to the outdoor one?

• The accuracy and/or precision of locating the hidden objects can be crucial for certain maintenance tasks [95]. Suppose our AR tools are meant for this kind of maintenance task. What are the locating errors? How are they influenced by both system implementations and user factors such as human perception and cognition? What can we do to alleviate these influences and thus improve the task performance?
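To make the phrase "estimating the transformation models" concrete, the following sketch shows the simplest planar case: estimating a homography between two views of a planar façade with the direct linear transform (DLT). This is not the thesis's actual infrared/visible registration method; it assumes point correspondences across the two images are already given (finding them across modalities is the hard part), and all numbers in the example are invented for illustration.

```python
import numpy as np

def estimate_homography(src_pts, dst_pts):
    """Estimate a 3x3 planar homography H mapping src -> dst with the DLT.

    Each correspondence (x, y) -> (u, v) contributes two linear equations
    in the nine entries of H; the solution is the SVD null vector.
    Needs at least four correspondences, no three collinear.
    """
    rows = []
    for (x, y), (u, v) in zip(src_pts, dst_pts):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    h = vt[-1].reshape(3, 3)
    return h / h[2, 2]  # fix the arbitrary scale (assumes H[2,2] != 0)

def apply_homography(h, pt):
    """Map a 2D point through h using homogeneous coordinates."""
    p = h @ np.array([pt[0], pt[1], 1.0])
    return p[:2] / p[2]

# Synthetic check with a made-up ground-truth homography:
H_true = np.array([[1.2, 0.1, 5.0],
                   [-0.05, 0.9, -3.0],
                   [4e-4, 2e-4, 1.0]])
src = [(0, 0), (100, 0), (100, 100), (0, 100), (50, 25)]
dst = [apply_homography(H_true, p) for p in src]
H_est = estimate_homography(src, dst)
```

Production pipelines (e.g., OpenCV's `findHomography`) add coordinate normalization and RANSAC-style outlier rejection on top of this basic DLT, which matters once real, noisy correspondences are used.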


1.3 Thesis structure

Structurally, the remainder of this thesis is divided into three parts. The first part comprises Chapter 2, Augmented reality, and Chapter 3, Digital image processing, which provide a broader theoretical background to my work. Chapter 4 alone forms the second part of the thesis, wherein I describe the targeted application field, facility maintenance. In the last part, I summarize the research included in this thesis along with my contributions to the field (Chapter 5) and present concluding remarks (Chapter 6).


2. Augmented reality

Computers have been well integrated into our daily lives in this age of information. As hardware technology advances, computers are becoming faster, smaller and cheaper. The first electronic general-purpose computer, ENIAC, occupied an area of around 167 m² and ran at only 5000 machine cycles per second. An example of today's computers, the Raspberry Pi 3 Model B, has the size of a credit card (85.6 mm × 56.5 mm) but is equipped with a quad-core central processing unit running 1.2 billion cycles per second. This continuous miniaturization allows computers to be portable and ubiquitous.

Traditional UIs such as the graphical UI rest upon the fact that information stored in computers is separated from its related objects in the real world. To interact with the information, users need to divert their attention to the computers. This distraction creates a gap between the computer world and the real world [92], which is undesirable now that computers permeate our surroundings and information is readily accessible at any time. AR offers a new approach to human-computer interaction by directly displaying information related to the real objects that people are currently perceiving. People's sensory perceptions are enhanced (augmented) with information of interest naturally, without deliberately turning to computers. All human senses, e.g., vision, audition and olfaction, can potentially be enhanced, but most research has focused on visual and auditory AR [44], especially the former, since vision provides the largest amount of information perceived by human beings. Therefore, visual AR is also the emphasis of this thesis.

2.1 Definition

Two definitions are commonly referred to in the AR literature. One was proposed by Milgram and Colquhoun [72], which describes AR from the perspective of mixing real and virtual environments. According to them, real and virtual environments are not simply alternatives to each other but rather the two ends of a Reality-Virtuality (RV) continuum (Figure 2.1). The AR section starts from the real environment end and expands towards the virtual environment; its counterpart, originating from the virtual environment extreme, is named augmented virtuality (AV). Together, AR and AV make up mixed reality, which spans the entire continuum except its two ends.

Figure 2.1. The RV continuum and mixed reality, adapted from [72]: real environment, augmented reality, augmented virtuality, virtual environment.

The other well-known definition was brought forward by Azuma in his influential survey paper on AR [9]. He defined an AR system through three characteristics: 1) it combines real and virtual objects; 2) it is interactive in real time; and 3) it is registered in 3D. It is worth noting that, according to the third characteristic, Azuma did not consider 2D overlays on live video to be AR, thereby excluding many applications for tourism [37, 42], navigation [8, 34, 55] and maintenance [75, 43] in which text bubbles and/or 2D icons provide users with additional information related to the real world. This limitation was removed from the later version of the definition proposed in a subsequent survey paper [6]; the modified third characteristic only requires that real and virtual objects are registered with each other. Based on Azuma's definition, it is natural to think of AR applications as only adding computer-generated objects to the perceived world, but in fact there is a less common variant of AR which removes objects in the real world from our view. This type of AR is sometimes referred to as mediated or diminished reality [120].

2.2 Brief history

Although AR is a new type of computer UI and has been garnering more and more attention in recent years, the idea of AR dates back approximately 50 years. This section presents some of the milestone events which occurred across the span of those years.

In 1968, computer graphics pioneer Ivan Sutherland and his students developed the first AR prototype system [101]. The system employed an optical see-through head-worn display which was tracked by either a mechanical tracker or an ultrasonic tracker. Very simple wireframe geometries were drawn on the display in real time. Subsequently, during the 1970s and the 1980s, a handful of researchers from, e.g., the U.S. Air Force's Armstrong Laboratory, the NASA Ames Research Center and the University of North Carolina at Chapel Hill worked on AR [39]. The term "augmented reality" was coined in 1992 by Thomas Caudell and David Mizell from Boeing. In their paper [26] they discussed the advantages of AR over virtual reality, such as less processing power being required, but they also pointed out the more stringent registration AR requires. Feiner et al. [38] published their results on the first AR-based maintenance assistance system, KARMA, in 1993, which aided printer users in performing some straightforward tasks such as refilling the paper tray and replacing the toner cartridge. As already discussed in the previous section, the first survey paper on AR [9] was published in 1997 by Ronald Azuma, in which he defined the three characteristics of an AR system. In the same year, Feiner et al. [37] presented the first mobile AR system, the Touring Machine, which was designed to be used in outdoor, unprepared environments. The definition of AR through the Reality-Virtuality continuum was brought forward by [72] in 1999, as mentioned earlier in Section 2.1. Furthermore, the AR community no longer had to build applications from the ground up, because Hirokazu Kato and Mark Billinghurst released the first open-source software development kit for AR, called ARToolKit, in the same year [52]. ARToolKit utilizes marker-based tracking for camera pose recovery, which was first introduced by Jun Rekimoto [91] in 1996. The toolkit is still being maintained and widely adopted today. By the end of the 1990s, the importance of AR and its profound influence on the whole computer UI landscape had been well recognized. Consequently, several international conferences emerged during that period to provide AR researchers and practitioners with platforms for exchanging ideas and results.

The first decade of the new century saw a boom in AR research thanks to the increased availability of hardware (cheaper yet more powerful). The hardware platforms most AR applications were built upon have since shifted to portable devices such as personal digital assistants (PDAs) and mobile phones. The Archeoguide from Vlahakis et al. [109] in 2001 was an early endeavor in AR cultural heritage on-site guidance based on a server-client architecture. Three types of mobile units were offered as clients for scalability: a laptop computer, a flat panel PC and a Pocket PC. The mobile units were connected to the server through a wireless network.
In 2003, Wagner and Schmalstieg [111] presented the first stand-alone hand-held AR system, built on an unmodified PDA with a commercial camera attached. The system was capable of vision-based self-tracking. As the hardware advanced, it also became possible to run sophisticated and computationally intensive computer vision algorithms on consumer-level mobile devices. For example, Wagner et al. [110] implemented a heavily modified version of the well-known SIFT (Scale Invariant Feature Transform) and Ferns classification on mobile phones in 2008, and in 2009 Klein and Murray [54] achieved real-time performance running SLAM (Simultaneous Localization and Mapping) on an iPhone. The capability of running these algorithms effectively on mobile devices paved the way for tracking in natural, unprepared environments, which has been a desirable goal for AR as a UI technology catering to ubiquitous computing.

Today, ever faster computer hardware continues to be miniaturized. As a result, it is rather common for personal computing devices such as smartphones and tablet computers to possess high-speed processors and high-resolution displays while being integrated with various sensors, for instance mega-pixel cameras, a WiFi network adapter, an inertial measurement unit (IMU), a compass and GPS [30]. On the other hand, the availability of convenient off-the-shelf software development toolkits also helps lower the barrier to creating AR-based software. Combining these two factors, applications of AR have not only become more sophisticated in traditional areas like medicine, manufacturing and the military but have also reached out to the general public. AR-related systems and products have constantly been reported in research publications and commercial advertisements in recent years. An up-to-date survey on this subject can be found in [94]. As computers continue to intertwine with every aspect of our lives, this new breed of interface will certainly prosper in the future.

2.3 Fundamental system components

From the glimpse of history presented above, it is not difficult to observe that AR systems have evolved considerably since Ivan Sutherland's prototype in 1968. However, the foundation stones of any AR system have not altered much during that course. As shown in Figure 2.2, display, tracking and registration, input and interaction, and virtual content creation and rendering still remain the four core building blocks upon which an AR system is structured. In this section, I will examine each of these four pieces more closely in turn and discuss how they relate to my research, particularly as described in Papers IV, V and VI.

Figure 2.2. Elements of AR experience: the user, the physical world, and the AR system with its building blocks (tracking & registration, input & interaction, virtual content creation & rendering).

2.3.1 Display

Display devices convey computer-generated information to users. In a broad sense, they refer to output hardware that presents information in any modality that we humans can perceive, such as visual, auditory and haptic. Since the subject of this thesis is visual AR, I will focus on visual displays.

Visual depth cues

Before we delve into the discussion of visual displays, it is necessary to first provide an overview of visual depth cues. Because we live in a 3D world, the sense of depth is crucial to our understanding of the 3D structure of the environment. As Milgram and Drascic [31] pointed out, almost all visual spatial perception requires the sense of depth. For instance, depth information helps us interact with surrounding objects as well as navigate in the environment. The real world we live in possesses abundant sources from which we extract depth information. Such sources are called depth cues. Since AR concerns interacting with virtual worlds, understanding depth cues ensures that creators of AR experiences can correctly simulate them in the virtual worlds in order to give AR users the intended perception of the virtual space. There are four categories of visual depth cues: monoscopic image, physiology, stereoscopic image (stereopsis) and motion [96].

Monoscopic image depth cues are those that can be inferred from a static image of a scene through one eye. Interposition, or occlusion, is the cue we obtain when an object blocks our view of another; in such a case, we infer that the occluding object is probably closer to us. Shadows tell viewers the positional relationship between objects. Additionally, a brighter object is generally considered to be closer to the viewer. We compare the relative sizes of a group of objects to judge their distances from us: usually, the smaller an object is, the farther away it is. On the other hand, when it comes to a familiar object, its absolute size alone may be enough to convey its distance. Linear perspective is the phenomenon that parallel lines seem to converge at a distant point in an image.
Although this depth cue was not exploited for conveying depth information in a virtual scene in Paper II and III, we extracted parallel line segments from façade images based on the observation that façades commonly exhibit abundant linear features which are either horizontal or vertical. Another depth cue comes from surface textures, as they appear less detailed at a distance than up close; this is the texture gradient depth cue. Height in the visual field can also serve as a depth cue: since the horizon in an image is higher than where viewers stand, objects farther from the viewers will appear higher in the view. The last but not least monoscopic depth cue is aerial perspective, or atmospheric attenuation. Due to light scattering and absorption by the atmosphere, a near object has more vibrant colors whereas a distant object looks duller and dimmer. Physiological depth cues are generated by muscle movements of our eyes, comprising accommodation and convergence. Accommodation refers to the focusing process of an eye caused by changing the shape of its lens. When we perceive an object, each eye generates a slightly different image of it. Such a difference is referred to as binocular disparity. To avoid double vision (diplopia), both eyes are rotated to fuse their respective images into one. This process is called convergence, and together with accommodation, the object is brought into focus. The amount of muscle movement from both

processes provides the brain with distance information about objects in the view. The aforementioned binocular disparity is processed by our brains to create yet another depth cue called stereopsis. Unlike monoscopic image depth cues, stereopsis usually requires special displays to synthesize [107]. Meanwhile, stereoscopic depth information functions best when the objects are roughly within arm's reach, yet current stereoscopic displays still suffer from problems like frame cancellation and accommodation-convergence conflict due to their failure to display objects at different focal planes [113]. The visual cues discussed so far can all be perceived statically. Motion depth cues, on the other hand, only appear when viewers and/or objects are in motion. The main depth cue in this category is motion parallax, which refers to the phenomenon that objects close to the viewers move more rapidly across the visual field than objects farther away. This depth cue was heavily used in early 2D cartoons and video games. It is worth noting that apart from the various depth cues we synthesize when creating virtual scenes, we can also introduce artificial spatial cues that are not inherent to the scenes. A fine example demonstrated in [113] is adding perpendicular dropping lines from objects in space to the ground plane, if one is available. Through this cue, the spatial relationship of these objects is immediately discernible. We adopted this idea in Paper V by adding a ground plane to convey the depth difference between a pipe and its projection on the wall plane. As shown in Figure 2.3, it is clear that the yellow line (the guide) is closer to the viewer than the green line (the pipe), but if we cover the ground plane with a hand, it becomes impossible to judge their spatial relationship.
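The geometry behind this artificial dropping-line cue is simple enough to sketch in a few lines of code: the "foot" of the line is the object's position projected straight down onto the ground plane, and the segment between object and foot conveys both height above the plane and position on it. The coordinate convention (y pointing up) and the example values below are illustrative assumptions, not the actual implementation of Paper V.

```python
import numpy as np

def dropping_line(point, ground_y=0.0):
    """Return the two endpoints of a perpendicular dropping line from
    a 3D point (x, y, z) to the horizontal ground plane y = ground_y."""
    top = np.asarray(point, dtype=float)
    foot = top.copy()
    foot[1] = ground_y          # project straight down onto the plane
    return top, foot

# A hypothetical pipe point 1.2 m above the ground, 3 m into the scene:
top, foot = dropping_line([0.5, 1.2, 3.0])
print(foot)   # [0.5 0.  3. ]
```

Rendering the segment from `top` to `foot` (plus a visible ground plane) is what makes the pipe's depth judgeable in Figure 2.3.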

Figure 2.3. The ground plane cue used in Paper V

Reality-combining technologies
Since AR integrates real-world objects with virtual information in users' vision, one basic requirement for AR displays is a mechanism to combine the reality and the virtuality. There are three common types of combining technologies: optical see-through, video see-through and projective.


Figure 2.4. Conceptual illustration of optical see-through display technology

Figure 2.4 depicts the concept of optical see-through displays. The centerpiece of this type of display is a special component called an optical combiner, which is often a half-silvered mirror that can both transmit and reflect light. Users are able to see the real world through the combiner due to its transmission of light. Simultaneously, if the combiner is properly oriented in front of users, it also reflects the virtual information from the computer display into the users' line of sight, thus achieving the integration of the real and the virtual scenes. The video see-through approach, on the other hand, utilizes video cameras to capture the real world and then combines the video stream with the rendered virtual scenes using video mixing techniques. Finally, the composite imagery is displayed to users (see Figure 2.5).


Figure 2.5. Conceptual illustration of video see-through display technology

Both see-through technologies have their own advantages and disadvantages. Since users perceive the real world directly with optical see-through displays, the resolution of the view and the field of view over the real world are unaltered. Additionally, compared to video see-through, no eye-camera offset is introduced, and people can still see their surroundings when an optical see-through display malfunctions, which eliminates certain safety concerns. The major drawbacks of optical see-through also stem from the rigidness of viewing the real world directly. Because our natural environment has a much larger dynamic range of luminance than most display devices can produce, it is often difficult to match the brightness between the real and the virtual objects.

Since the display system does not readily possess knowledge of the real world, it is challenging to render virtual objects occluded by real ones. Conversely, the occluding effects which virtual objects have on real ones are impaired as well, due to their translucent appearance on the combiner. Lastly, because all virtual objects are displayed on the optical combiner, users need to focus on the combiner plane in order to see them clearly. This means it might not be possible for users to focus on a real object and its intended virtual counterpart at the same time. In contrast to optical see-through, the availability of the real world as video streams enables video see-through displays to alleviate or even overcome most of the aforementioned problems. Firstly, the mismatch of brightness is less pronounced with the video see-through approach because the dynamic range of the real world is first limited by the video cameras and then again by the display device itself. Secondly, both real and virtual imagery can undergo a wide variety of graphical and image processing algorithms to produce myriad blending or occluding effects within the video combiner, whose role is most likely assumed by the graphics processing units (GPUs) of modern computers along with customized shaders. Finally, there is no conflict as to focusing on virtual and real objects simultaneously because they are all displayed on the same physical plane. The main shortcoming of video see-through compared with optical see-through is the poorer visual experience, especially of the real world. Such a shortcoming can be ascribed to low display resolution and limited field of view, but as hardware technology advances, the resulting negative effects will become less influential in AR applications. Projective displays employ video projectors to project virtual objects onto their corresponding real objects, thus achieving the combination of virtuality and reality.
This type of display eliminates the need for see-through combiners. Since graphics are directly displayed in the real environment, users are able to perceive them naturally. Moreover, the projection can act as a means to alter the surface properties of real objects, such as color, texture, shape (to a small degree) and even transparency [70]. According to Bimber and Raskar [21], some challenges of realizing AR applications with projective display technology include handling shadows cast by real objects and users, restrictions of the display area due to its physical properties (e.g., lack of peripheral areas for projection), support for only a single user when virtual objects have non-zero parallax, the constant focal length of common projectors, and increased alignment and calibration complexity with additional projectors.

Image-forming positions
In the previous section, we classified AR displays in terms of the technologies with which the reality and the virtuality are combined and also went through the pros and cons of each technology. Another popular way of categorizing AR displays is via the positions where composite images are formed along the optical path between a user and the real objects to be augmented [21]. There are hence three types of AR displays — head-based, hand-held and spatial,

which are illustrated in Figure 2.6. The translucent planes along the optical path in the figure represent see-through displays. Both optical and video see-through technologies have been adopted at every position. Likewise, as shown in the figure, projectors can be placed at any of the three positions as well for projection-based AR.

Figure 2.6. Various display positions in relation to AR user (adapted from [21])

Head-based displays are the most widely recognized display form for AR, as AR is historically rooted in the field of virtual reality. Different styles of head-based displays include head-worn displays with miniature monitors or projectors, counterweighted displays on mechanical arms, and retinal displays. One notable advantage all head-based displays share is that their field of regard covers the entire spherical space around the user, which makes for a seamless AR viewing experience. Retinal displays scan low-power lasers directly onto the retina, so they can offer much brighter images with sharp contrast and high resolution [62]. Together with low power consumption, retinal displays have the potential to become the future solution for outdoor AR. However, at the moment it is still very expensive to realize full-color retinal displays (especially the blue and the green colors). Additionally, since this kind of display bypasses the ocular motor system, the focal length is constant [21]. Besides the disadvantages inherited from the reality-combining technologies they use, head-based displays often have rather limited display resolution and field of view. In recent years, the popularity of head-based displays in the AR landscape seems to have given way to hand-held displays due to the steep price drop of mobile smart devices. However, with the release of the latest consumer-level virtual reality systems such as PlayStation VR, Oculus Rift and HTC Vive, shown in Figure 2.7, we may see more AR applications built upon these headsets come into being in the future. Hand-held AR displays are positioned within arm's reach, farther away from users compared to head-based displays. Early AR systems adopting this type of display were often realized with PDAs, such as the systems in [76] and [109]. Due to limited processing power and the lack of fundamental sensors,

these PDAs were mainly used as input/output devices for back-end systems. Since Wagner and Schmalstieg [111] first introduced a stand-alone hand-held AR system in 2003, hand-held AR has seen a steady increase in popularity in the field. Especially in the recent decade, the combination of cheap yet powerful personal mobile devices and off-the-shelf AR software toolkits has greatly relieved researchers from technical implementation details, which lowers the barrier for conducting studies such as human factors of AR and usability of AR in specific application domains. Video see-through is the predominant display technology in this category, but there are reports on some early systems which utilized other reality-combining technologies [87, 18, 88]. The disadvantages of hand-held displays include comparatively small screen size, the limited field of view of built-in cameras, misalignment between user view and device view [85, 13], rapid battery consumption and failure to support tasks that require both hands.
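At its core, the video combiner of such a hand-held video see-through display performs per-pixel alpha blending of the rendered virtual layer over the camera frame. On real devices this runs on the GPU, but the operation itself can be illustrated with a short NumPy sketch (array shapes, value ranges and the straight-alpha convention here are illustrative assumptions):

```python
import numpy as np

def composite(camera_rgb, virtual_rgba):
    """Alpha-blend a rendered virtual layer (H x W x 4 RGBA, floats in
    [0, 1]) over a camera frame (H x W x 3 RGB, floats in [0, 1])."""
    alpha = virtual_rgba[..., 3:4]            # per-pixel opacity
    return virtual_rgba[..., :3] * alpha + camera_rgb * (1.0 - alpha)

# A 2x2 black camera frame and a virtual layer covering one pixel:
frame = np.zeros((2, 2, 3))
layer = np.zeros((2, 2, 4))
layer[0, 0] = (1.0, 0.0, 0.0, 1.0)           # one opaque red virtual pixel
out = composite(frame, layer)
```

Where the virtual layer is fully transparent (`alpha = 0`), the camera frame passes through unchanged, which is exactly what lets the real world remain visible around the augmentations.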



Figure 2.7. Latest virtual reality headsets on the market¹: (a) Sony Project Morpheus (b) Oculus Rift (c) HTC VIVE

While head-based and hand-held displays are attached to users directly, AR systems based on spatial displays move most, if not all, of the system components away from the users, thus becoming a part of the environment. This makes for a more natural AR experience and opens up opportunities for AR-based collaborations. Although both video and optical see-through spatial displays have been proposed, for example the video see-through systems reported in [20, 51] and optical see-through in [19, 79, 78], projection-based technology seems to be associated with spatial displays more often [70, 15, 114]. In general, this type of display arrangement does not manifest major flaws other than those inherited from the adopted reality-combining technologies, which have been discussed previously. However, most spatial displays are installed within the environment to be augmented, so they are not suitable for mobile applications.

Display centricity
Thanks to the use of video cameras, video see-through AR has one more unique and interesting capability than the other reality-combining technologies, namely the detachment of the camera from the user's current view. The

¹ Images: Sony Project Morpheus at Game Developers Conference 2014 by Official GDC, licensed under CC BY 2.0; Oculus Rift and HTC VIVE, courtesy of Fredrik Nysjö.


Figure 2.8. The display centricity continuum adapted from [72]

relationship between a camera and its user is termed display centricity by Milgram and Colquhoun [72]. Like their definition of AR, they also conveyed the concept of centricity via a continuum, which is illustrated in Figure 2.8. The ends of the continuum are egocentric and exocentric respectively, with intermediate cases summarized as egomotion [105]. An egocentric viewpoint refers to the situation where the camera view is the same as the user's, namely ego-referenced. This viewpoint is often known as the first person perspective (FPP) and is the most frequently adopted viewing perspective in AR applications. With the egocentric view, both the real and the virtual worlds are presented from the user's perspective. An exocentric viewpoint, however, is different from and independent of the user's viewpoint. In this case, the camera is detached from the user and fixed in the world. Therefore, the real world and its augmentations are presented from another viewing perspective, known as the third person perspective (TPP). Egomotion viewpoints, which span the continuum, are also TPP views at heart, but the difference lies in that the camera is still related ("tethered") to the user, which is represented by a line between the camera and the user in Figure 2.8. A few existing works have studied the benefits of employing different viewpoints in AR applications. For example, Sukan et al. [100] prototyped a furniture layout application which allows users to view virtual furniture from several pre-stored viewpoints without physically moving to those viewing positions. Similarly, a mobile hand-held AR urban exploration system overlays a model of a real-world object in sight on its user's current view. The users can choose among various pre-stored viewing positions of the model, some of which are not even physically reachable [104]. Mulloni et al.
[74] presented two zooming-based centricity transitioning techniques: one transitions from an ordinary egocentric view to a 360-degree egocentric panoramic view, while the other transitions from an egocentric view to an exocentric top-down view. Egocentric and exocentric views are also employed as camera view transitioning techniques in an AR-based multi-camera environment monitoring system

[108]. In the following text, I will present the display aspect of my research in regard to the various characteristics discussed thus far. At the start of this chapter, I already motivated why AR would be a suitable interface to today's gradually smaller and more intimate computers. Personal mobile devices such as smartphones and tablet computers are the representative products of this trend given their enormous popularity nowadays. With powerful hardware specifications and off-the-shelf software support, it is opportune to explore AR applications on these mobile smart devices [1]. Therefore, I placed my research emphasis on hand-held video see-through AR enabled by these devices. Among the research projects which required developing an AR application, Paper IV employed a Sony Xperia Z3 Compact smartphone, the hardware platform for Paper V was a Microsoft Surface Pro 3 CI5 tablet, and the AR application in Paper VI ran on a Google Nexus 9 tablet. Since objects to be inspected and maintained in the built environment can cover large areas, e.g., piping and ducts for heating or cooling systems, built-in cameras of mobile hand-held devices can only capture a small portion of them at a close distance due to the limited field of view [57]. In order to obtain more context about the objects to be maintained, as well as to access more features (artificial or natural) for the AR system to track, maintenance workers have to back off from the objects under inspection, and most of the time the distance can be significant. This makes it difficult to mark object positions on a wall or façade for subsequent maintenance operations. One solution to this dilemma is to leave the marks remotely with a laser pointer, as explored in Paper V. However, we found through testing that laser dots were not visually discernible on AR displays in an outdoor scenario due to the much stronger ambient lighting. In view of this drawback, we resorted to TPP AR in Paper IV and VI.
In this exocentric setup, a remote camera captures the objects in question as well as the user. The real video stream is then augmented with virtual information and transmitted to a mobile device held by the user. The user can remain directly in front of the real object while relying on the TPP video to carry out the interaction with it, such as marking target positions. In Paper IV, we conducted a pilot study on user performance and acceptance under the circumstance of TPP AR with an abstract 2D target designation task. To follow up, we carried out another user performance study with an infrared target designation task on a real façade in Paper VI, which combines TPP AR with the infrared/visible image registration method developed in Paper III.

2.3.2 Tracking and registration

In the section above, I discussed AR displays, one of the system components for presenting the combination of reality and virtuality to users. However, the result of simply mixing visuals of the real and the virtual world would most

likely be meaningless and thus useless, let alone a genuine AR experience. The missing piece here is a link between the virtual information and the real objects to be augmented, a spatial relationship between them [21]. Such a link is called registration, which is a crucial element in enabling AR technology. Figure 2.9 gives an example of the effect registration has on the visual mixture of real and virtual objects. Comparing Figure 2.9 a and b, it is sound



Figure 2.9. Demonstration of registration in AR. (a) The virtual champagne glass is arbitrarily inserted into the real scene. (b) The same virtual glass is inserted into the real scene taking into account its spatial relationship with the table.

to argue that Figure 2.9 b presents a more convincing illusion of the coexistence of the virtual champagne glass and the real objects.

Registration accuracy
Although registration is indispensable to AR, the accuracy requirement for registration varies between application purposes. For example, the AR navigation system for brain surgery described in [64] and the underground utility surveying system in [95] require high registration accuracy. By contrast, accurate registration is less crucial in the following examples: the tourism-assisting AR system by Feiner et al. [37], which shows text labels of building names in the user's view; the museum AR guide in [73], which displays descriptive texts and plays animations when visitors point the hand-held guidance PC at exhibits; and the AR tool presented in [117], which draws image-based assembly instructions on its display. In fact, all these systems only need to know what real objects are currently in the user's view so that corresponding virtual information can be retrieved and rendered; where exactly to render it is of less importance, which loosens the registration requirements. When precise registration is required, registration errors are more discernible in AR applications than in virtual reality owing to the presence of the real world, which provides a spatial ground truth. Hence, AR can impose more stringent requirements on registration accuracy. According to [9], registration errors can be classified into two groups, static and dynamic. Static errors are inherent to AR systems, so they exist even when the users or the real objects are completely still. Dynamic errors, on the other hand, are mainly caused by system delays in responding to the movement of the users or the real objects. This type of error is especially prominent with optical see-through devices, since video-based displays can attempt to delay the real video stream in order to match the virtual one.

Camera parameters for registration
As discussed above, the minimum requirement of registration is knowing what virtual objects need to be superimposed. For applications that call for accurate registration, correct views of the virtual objects with respect to the current view of the real world must be generated. So how do we meet these two requirements in order to maintain the link between the real and the virtual world? Essentially, the answer lies in the connection between the visual vantage points of both worlds. The AR system, or rather, the graphical rendering sub-system needs to have knowledge of the real vantage point. Such a vantage point can be a user's eyes or a video camera, depending on which reality-combining technology is employed in the display device. I will address both as camera in the following text for simplicity. The structure of the aforementioned knowledge is shown in Figure 2.10: the camera's external parameters (position and orientation) and its optical properties (focal length and principal point).

Figure 2.10. Knowledge about the real camera which needs to be obtained for registration

Within a pre-determined world coordinate system, the camera position and orientation can each be expressed as

a 4 × 4 homogeneous matrix, whose elements are collectively called external parameters. We can put all these external parameters together to form a pose matrix

$$
M_{pose} = RT =
\begin{bmatrix}
r_{11} & r_{12} & r_{13} & 0 \\
r_{21} & r_{22} & r_{23} & 0 \\
r_{31} & r_{32} & r_{33} & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix}
1 & 0 & 0 & T_1 \\
0 & 1 & 0 & T_2 \\
0 & 0 & 1 & T_3 \\
0 & 0 & 0 & 1
\end{bmatrix}
=
\begin{bmatrix}
r_{11} & r_{12} & r_{13} & -t_1 \\
r_{21} & r_{22} & r_{23} & -t_2 \\
r_{31} & r_{32} & r_{33} & -t_3 \\
0 & 0 & 0 & 1
\end{bmatrix}
\tag{2.1}
$$

where R is a rotation matrix representing the orientation of the camera in the world, while matrix T expresses its position in the world. The camera pose information lets the AR system know what the camera is currently looking

at, which is sufficient for applications that only need to superimpose, e.g., text labels in the user's view. For applications requiring registration of high accuracy, however, the optical properties of the camera are equally crucial, which can also be bundled into a matrix

$$
M_{optical} =
\begin{bmatrix}
f_x & 0 & x_0 \\
0 & f_y & y_0 \\
0 & 0 & 1
\end{bmatrix}
\tag{2.2}
$$

where $f_x$ and $f_y$ are the focal length (the distance between C and p in Figure 2.11) of the camera in terms of image pixel counts in the x and y directions, while $x_0$ and $y_0$ are the image coordinates of the principal point (p in Figure 2.11). $M_{optical}$ is also known as the camera calibration matrix, whose elements are

Figure 2.11. Coordinate systems of a camera and its image. C is the camera center and p is the principal point

called internal parameters; Equation 2.2 represents that of a CCD (charge-coupled device) camera. While external parameters provide the system with knowledge of what virtual information to render, internal parameters further define the rendering details, such as the field of view and the pixel positions within the image coordinate system.

Camera parameter acquisition
The acquisition of internal parameters is through a calibration process, whose actual procedures differ between adopted reality-combining technologies. Since the real environment is perceived by users directly with optical see-through AR systems, the common calibration process for them is manual and iterative in nature [7, 106]. Users are often asked to align virtual targets with real ones multiple times and to inform the system each time the alignment is achieved. As for video see-through systems, the problem is essentially calibrating the video cameras of the system, which has been well studied in the field of computer vision. Detailed procedures and related theories of camera

calibration can be found in [41]. In terms of projective AR systems, the principles of calibrating projectors are similar to those of calibrating cameras, a dual problem at heart. The work of Bimber and Raskar [21] gives a thorough treatment of this subject. AR systems employ an assortment of tracking technologies to acquire the external parameters. All these technologies fall into two categories: sensor-based and vision-based. Each technology has its advantages and limitations, so AR systems that require high registration accuracy often employ different tracking technologies to compensate for the weaknesses of each individual technology.

Sensor-based tracking. This type of tracking utilizes a wide range of positioning sensors functioning on various measuring principles, which include mechanics, magnetism, sound, inertia and radio waves (e.g. wireless local area networks and navigation satellites), to attain positions and orientations of cameras and/or other physical objects of interest in the environment. Each technology has its own tracking accuracy and range, and the measured pose information can either be absolute with respect to a global coordinate system or relative. Thanks to virtual reality, sensor-based tracking technologies were well developed by the time AR started to garner attention in the research community. Comprehensive discussions on this topic can be found in [93, 71].

Vision-based tracking. The second type of tracking is based on a video stream of the physical environment, hence the name. Although optical sensors, such as simple webcams, are still required for video recording, it mainly relies on analyzing the video frames using image processing and computer vision techniques, rather than on the sensor hardware itself, to report the pose information.
Bajura and Ulrich [10] point out that the advantage of vision-based tracking over sensor-based tracking is that the system possesses views of the real world, which can be used to correct registration errors and thus achieve much higher registration accuracy. Vision-based tracking provides a means of bringing feedback to the system, hence a "closed-loop" approach. Unlike the sensor-based approach, which tracks the camera directly, vision-based techniques typically depend on detecting and tracking visual features of the real objects to be augmented in the images (video frames) to achieve the same goal. Essentially, this type of tracking seeks to establish the relationship between feature positions in 2D image space and in 3D world space so that computer vision techniques can be applied to estimate the camera pose. The most popular as well as most successful vision-based tracking technique utilizes 2D square fiducial markers attached to the real objects to be augmented. These black-on-white markers, which function as artificial features, were invented by [91] (a twenty-year history by now) and popularized by the ARToolKit library [52]. Figure 2.12 contains two examples of these square markers from ARToolKit. The four corners of the marker frame are detected and used for tracking, while the binary pattern inside a marker encodes the identification of


the marker (also its orientation) so that the system knows what virtual objects are associated with it.
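The first step most marker trackers take with the four detected corners can be illustrated concretely: estimating the homography that maps the marker's known corner coordinates onto their detected image positions, from which the camera pose is subsequently decomposed. Below is a minimal direct linear transform (DLT) sketch with hypothetical pixel coordinates; it is not ARToolKit's actual implementation.

```python
import numpy as np

def homography_dlt(src, dst):
    """Estimate the 3x3 homography H with dst ~ H @ src from at least
    four point correspondences, via the direct linear transform (DLT)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    # The homography is the null-space vector of the stacked system,
    # obtained from the last right singular vector of the SVD.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    h = vt[-1]
    return h.reshape(3, 3) / h[-1]

# Marker corners in marker coordinates (unit square) and their detected
# image positions (hypothetical pixel values):
marker = [(0, 0), (1, 0), (1, 1), (0, 1)]
image = [(120, 80), (300, 90), (290, 260), (110, 250)]
H = homography_dlt(marker, image)

# H maps the first marker corner onto its detected position:
p = H @ np.array([0.0, 0.0, 1.0])
print(p[:2] / p[2])      # ≈ [120.  80.]
```

With exactly four correspondences in general position the solution is exact; with more, the SVD yields the least-squares estimate.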

Figure 2.12. Examples of square markers used by ARToolKit

Although fiducial markers are easy to track and often lead to robust pose estimations, their downsides are also obvious. The foremost one is that the placement of markers might not always be feasible in uncontrolled environments, e.g., outdoors [90]. Moreover, the number of markers which can be attached to each tracked object is rather limited. The resulting small number of features makes tracking susceptible to occlusion. Besides, placing too many markers will undoubtedly disturb users' perception of the real world. In view of the shortcomings of artificial features, the involved research communities, especially in computer vision, are actively looking for naturally occurring features in the environment for tracking. Prominent visual features such as edges and corner points are crucial to image interpretation; they are easy to detect and usually abundant in a scene. Therefore, these features have become the foundation of most tracking techniques based on natural features. For instance, the works of [28, 81, 116] draw on edge information to model real objects of interest and then, by tracking the models across video frames, the camera pose is computed. As a current research trend in the AR community, SLAM is yet another natural-feature-based tracking technique. In some realizations of the technique, e.g. [54], corner points are used as landmarks. So far, I have covered the category of vision-based tracking whose main objective is to estimate camera pose indirectly through tracking features of real objects which are to be augmented. By contrast, there is also a type of vision-based tracking akin to the sensor-based counterpart in the sense that the cameras are tracked directly. These tracking techniques rely on rigid body targets, each of which is a collection of small light-emitting (active) or light-reflecting (passive) objects that form specific 2D or 3D patterns.
Before tracking, the cameras of AR users are equipped with these targets, similar to the fiducial markers attached to scene objects. Optical sensors are then deployed to capture these targets for pose analysis and estimation, with the aid of the pattern in which each target is arranged. An example of this kind of tracking technique is employed in an infrared-based tracking system for virtual reality and AR by Pintaric and Kaufmann [82]. The targets consist of small nylon spheres wrapped in retro-reflective material that reflects infrared light. The

spatial arrangement of the spheres for each target is unique so that the tracked object it is attached to can be identified. Four cameras with infrared band-pass filters are used to record images of these targets. The geometrical relationship between the target positions in a user-defined world and the image space is established to estimate the poses of tracked objects. Marchand et al. [69] provide an up-to-date survey of various existing vision-based tracking techniques for AR.

Hybrid tracking. As mentioned at the beginning of this subsection, each tracking technology has its strengths and weaknesses. Hybrid tracking combines more than one tracking technology in order to overcome each component technology's weaknesses so that the overall tracking performance may be improved. To reduce the registration error, Azuma and Bishop [7] employed inertial sensors to predict head motion while the display was already tracked by an optoelectronic tracker. The first mobile AR system [37] used the built-in magnetometer of the head-worn display for tracking the user's head orientation, while the position of the user was tracked by differential GPS. The hybrid tracking system in [90] put together edge-based vision tracking and measurements of rotational velocity, 3D acceleration and the 3D magnetic field vector to provide a real-time, accurate urban AR experience. Since today's smart devices are usually equipped with a suite of different sensors, it is natural practice to take advantage of these sensors when AR applications are developed. RescueMe [2] is a smartphone-based indoor AR evacuation system which utilizes the built-in camera, wireless network interface, accelerometer and digital compass to generate exit routes for users. Certainly, the tracking system becomes more complex to build as more sensor types are incorporated, but the added complexity will pay off if the tracking performance is significantly improved.
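The sensor fusion underlying such hybrid setups can be illustrated with a one-axis complementary filter: the gyroscope's integrated rate tracks fast motion but drifts over time, while the accelerometer's gravity-derived tilt is noisy but drift-free; blending the two yields a stable orientation estimate. This is a simplified sketch with hypothetical sensor values, not the actual filters used in the systems cited above.

```python
import numpy as np

def complementary_filter(gyro_rates, accel_angles, dt, k=0.98):
    """Fuse one-axis gyroscope rates (rad/s) with accelerometer-derived
    tilt angles (rad) into a drift-corrected orientation estimate."""
    angle = accel_angles[0]
    estimates = []
    for rate, accel in zip(gyro_rates, accel_angles):
        # Trust the integrated gyro on short time scales (weight k) and
        # let the accelerometer slowly pull out the drift (weight 1 - k).
        angle = k * (angle + rate * dt) + (1.0 - k) * accel
        estimates.append(angle)
    return np.array(estimates)

# Device held still at 0.1 rad while the gyro reports a constant bias:
est = complementary_filter(
    gyro_rates=np.full(500, 0.01),    # biased gyro (rad/s)
    accel_angles=np.full(500, 0.1),   # noiseless accelerometer tilt (rad)
    dt=0.01,
)
# Pure gyro integration would drift without bound; the fused estimate
# stays bounded near the true 0.1 rad.
```

Real AR systems fuse full 3D orientation (and often position) with more elaborate estimators such as Kalman filters, but the trade-off between fast, drifting and slow, absolute measurements is the same.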
The tracking technology used in my research projects is vision-based, and both artificial and natural features have been explored. Paper IV and V rely on the marker-based tracking functionality provided by the Vuforia AR software development kit (SDK). Marker-based tracking is easy to implement and can achieve satisfactory tracking results, especially in indoor environments, so it is included in almost all AR software packages. The fiducial markers from Vuforia are called frame markers (see Figure 2.13). The working principle of them

Figure 2.13. Marker 0 and Marker 1 of Vuforia frame markers



is the same as that of the 2D square markers used by ARToolKit; the difference between them is the way they encode the marker identities. As shown in Figure 2.13, Vuforia frame markers utilize the black-and-white “teeth” along the edges to identify individual markers, while users can print any human-readable information inside the marker for distinguishing purposes. We stuck these markers on whiteboards (Paper IV) and a wall (Paper V) to display virtual objects associated with them. Paper VI is concerned with augmenting a façade with its thermal infrared information. Since it is less desirable or even impractical to stick large markers across the façade in question for tracking purposes, I turned to an edge-based natural feature, whose development started with Paper I and evolved through Paper III. Moreover, given the 2D nature of the virtual content used in Paper VI, namely infrared images of the façade, we did not have to estimate the camera pose, as other works do, in order to render the virtual information properly. Instead, we took advantage of the fact that edge-based features are equally prominent in infrared images [29] and employed image registration techniques for aligning the virtual information with the façade in the video. Details about image registration will be presented in Chapter 3.

2.3.3 Virtual content creation and rendering

While previous subsections are mainly related to the hardware aspect of an AR system, this subsection will be more software-oriented since it is concerned with the actual substance of an AR experience, the virtual content. Like displays, the modality of virtual content depends on the human sense to be augmented, and I will focus on the virtual content for vision here. The visual representation of the virtual world can be 2D or 3D, static or animated. It can be a realistic, concrete model of an object from the real world or a manifestation of abstract, imaginary concepts such as force, temperature and dragons. The application purpose of an AR system determines what the virtual content should be and how to present it together with the reality. In the remainder of this subsection I will, drawing on my research projects, expound on the creation and rendering of the virtual content. Virtual objects are mostly created with special computer software. 3D models of, for instance, the dinosaurs shown in AR interactive books for children from the Popar series may be created and animated by artists using Maya, 3ds Max or Blender (Figure 2.14 a). The textures of the dinosaur models can be drawn with raster graphics editors like Photoshop or GIMP (Figure 2.14 b), and these tools can also be used to generate 2D virtual content such as icons, geometric shapes and text-based artworks for AR. 3D models in AR applications regarding industrial products such as buildings or cars are usually created with computer-aided design software, e.g. AutoCAD. Finally, outputs





Figure 2.14. Examples of virtual content creators. (a) Blender (b) GIMP

from other sensors such as laser scanners and infrared cameras can be employed as virtual content for specific AR applications, too. The virtual objects in Paper IV comprise a grid, a red cross and four red squares. Since they were simple to create, we modeled them directly in the game engine Unity, which will be described in greater detail shortly. The ground plane in Paper V was modeled in AutoCAD and imported into Unity afterwards, while the remaining virtual objects (the pipe, the hint object and two confining ribbons) were created directly within Unity. We used infrared information to augment the video of a façade in Paper VI, so the virtual content was generated by an infrared camera. After acquiring the virtual content, we need to render it and present it properly with the real world to AR users. Figure 2.15 depicts the relation between the renderer and the other components within an AR system.

Figure 2.15. The renderer of an AR system combines the virtual content with the real world based on various inputs (the real world, tracking and user active inputs, and the virtual content) and presents the composition to the user's sensory organs via a display.

Based on tracking information and user inputs, the renderer produces frames of the virtual content and combines them with visuals of the real world for final display. The rendering and compositing process takes place on the graphical units of computers. Methods and techniques for creating digital imagery of virtual content are topics covered in computer graphics, a sub-field of computer science. A discourse on this field is beyond the scope of this thesis, so I will only present the graphical pipeline, which summarizes the processing procedure of modern real-time rendering. Interested readers can resort to excellent sources such as [47], [99] and [68].

Figure 2.16. The graphical pipeline for modern real-time rendering: vertex data → vertex processor → primitive assembly → clip and project → viewport transform → rasterization → fragment processor → frame buffer.





Figure 2.17. The rendering process of a triangle in the graphical pipeline

As shown in Figure 2.16, the process starts with the vertex data of models created with the software discussed previously. The data contain, e.g., vertex positions, colors and normals, exemplified in Figure 2.17 a. They are first sent to the vertex processor, whose most common operations are a series of coordinate transformations that arrange the spatial relations between the models and bring them into the view of the virtual camera. These operations are instructed by a short program called a vertex shader, which is executed once for each vertex. The processed vertex data continue to the primitive assembly phase, in which related vertices are associated together to form triangle meshes, namely the surfaces of the models (see Figure 2.17 b). Models outside the camera view are clipped, and the remaining scene is reduced to 2D through projection. The viewport transformation determines the actual size of each model in terms of the resolution of the targeted display area, which is crucial for the subsequent rasterizing operation. Rasterization essentially seeks to represent the geometric shapes (triangles) that make up the surfaces of the models with discrete screen elements, illustrated in Figure 2.17 c. Such screen elements, called fragments, are closely related to pixels on the screen since they contain the information necessary to generate the pixels. The fragment processor provides developers with an opportunity to dictate how individual fragments should be colored and eventually stored in the frame buffer, becoming pixels ready for display (Figure 2.17 d). Developers achieve this through writing another short program called a fragment shader. Similar to the vertex shader, the fragment shader also runs once for each fragment. Once the imagery of the virtual scene is generated, it is ready to be composited with the visuals of the real world and, thanks to tracking, the virtual and the real scene should align (register) well with each other. Implementing a renderer for AR applications is itself not a trivial task, to say nothing of the various other supporting components, such as a virtual content loader, a resource manager and interfaces to AR SDKs. To better focus my attention on the study of the AR applications, I opted for a ready-made software framework as a platform upon which I could develop my AR applications with ease. More specifically, I chose the Unity game engine for my projects (see Figure 2.18).

Figure 2.18. The UI of the Unity 4 editor. A screenshot of the application development for Paper V is shown here.

As a game engine, Unity is capable of providing developers with virtually every tool they need for creating games and other 3D real-time applications. The core functionality of a game engine includes digital content (termed assets in game development) management, scene modeling/editing, 2D/3D rendering, animation, networking, memory/file management and so on. By employing a game engine like Unity, I could shift my focus to designing the AR experiments and rapidly realizing the desired applications rather than being entangled in technical details. Unity's support for the popular AR SDK Vuforia is another merit. In Paper IV and V, the rendering and composition process was controlled by Vuforia and I, as the application developer, only needed to define the association between the virtual content and the frame markers. As for Paper VI, since registration was achieved through an image-based approach, I did not use Vuforia in the project but implemented the rendering and composition process myself using shaders within Unity.
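The coordinate transformations performed in the vertex processor can be made concrete with a small numerical sketch. The following NumPy example (an illustration under assumed OpenGL-style matrix conventions, not the actual shader code used in my projects) pushes one homogeneous vertex through a model, view and projection matrix and then performs the perspective division:

```python
import numpy as np

def translate(tx, ty, tz):
    """4x4 homogeneous translation matrix."""
    m = np.eye(4)
    m[:3, 3] = (tx, ty, tz)
    return m

def perspective(f, near, far):
    """A minimal symmetric perspective projection matrix (OpenGL-style)."""
    m = np.zeros((4, 4))
    m[0, 0] = m[1, 1] = f
    m[2, 2] = (far + near) / (near - far)
    m[2, 3] = 2 * far * near / (near - far)
    m[3, 2] = -1.0
    return m

# A model vertex in homogeneous coordinates (x, y, z, w).
v = np.array([1.0, 0.0, 0.0, 1.0])
model = translate(0, 0, -5)      # place the model 5 units in front of the camera
view = np.eye(4)                 # camera at the origin, looking down -z
proj = perspective(f=1.0, near=0.1, far=100.0)

clip = proj @ view @ model @ v   # what a vertex shader would output
ndc = clip[:3] / clip[3]         # perspective division to normalized coordinates
```

After this stage, the normalized device coordinates are mapped to actual pixel positions by the viewport transformation described above.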



2.3.4 Input and interaction

The ultimate goal of AR is to transform our real world itself into a UI and thus support ubiquitous computing. As an emerging technology, AR also needs to overcome obstacles in input methods and interaction techniques in order to reach that goal. This is because not all traditional computer UI components can be transferred directly to AR. For example, using input devices such as mice and keyboards requires a flat surface, which contradicts the idea of ubiquitous computing, to say nothing of the inconvenience of carrying them around. While familiar WIMP (windows, icons, menus, pointers) UI controls and their related interaction techniques are serviceable for AR-based information browsing and simple virtual object interactions such as the ones demonstrated in [49], [77] and [98], they are, after all, designed for 2D interfaces and are not well suited to interactions in 3D space. Since AR links our physical world with the digital one, an ideal UI for AR should take advantage of such a link. An early attempt was made by Ishii and Ullmer with so-called “Tangible UIs” [50]. The idea was to associate everyday physical objects with virtual information, thus making the UI truly ubiquitous and invisible. One of the examples they gave in [50] is metaDESK, which runs a prototype application called “Tangible Geospace”. Users of the application can place physical models of landmarks (called phicons, short for physical icons) on the desk surface, and the surface will display regional maps related to the landmarks with the positions of the landmarks on the map right beneath the models. The users can also pan and rotate the map view by translating and rotating the phicons. Kato and his colleagues further developed this idea for AR and proposed “Tangible AR” [53]. They demonstrated the concept with a collaborative card matching game where each card is tracked by a unique fiducial marker and presents a virtual object to the players.
Another early example of “Tangible AR” is Tiles, an AR authoring interface [84]. Each tile is a paper card with an attached fiducial marker. In the augmented view, the tiles amount to the icons of WIMP, and users of the system can interact with virtual information by physically manipulating the tiles rather than with an extra device dedicated to input. Although “Tangible AR” with mediating physical objects brings us one step closer to AR's ultimate goal, it is still not natural compared to the way we interact with our world in everyday life. Therefore, researchers have sought to eliminate the mediating objects, which leads to input based purely on users' hands. Interacting with virtual objects directly using hands is very intuitive since we do the same with our surroundings in real life. This type of input requires tracking users' hands and recognizing individual gestures as pre-defined commands [17]. In addition to hand gestures, gaze interaction [59] and voice commands [48] have also been successfully applied to AR systems. All these new input methods and interaction techniques can certainly be combined to create multimodal interfaces [16], which are more natural and intuitive

for the users while propelling AR in the right direction towards its ultimate goal. Input and interaction for AR systems were not on my research agenda during the span of my study, so I did not have the opportunity to explore this aspect of AR in depth. The applications developed in Paper IV, V and VI exploited AR for visualizing virtual objects, and no direct interaction with them was involved. Hence, only a few buttons were needed to enable users to control the flow of each experiment, and they interacted with these UI controls via the touch screens of the mobile smart devices.

2.4 Evaluation of AR systems

As the hardware and the software for developing AR applications mature, the development barriers have been constantly lowered. As a result, AR is moving out of labs and approaching the general public. Like any other interactive products, AR applications ought to create user experiences that enhance the way people work, communicate and interact. Therefore, user-driven design is becoming increasingly important for AR system development, and the end systems should be evaluated with actual users as well [33]. In this section, I will describe some challenges faced by AR system evaluations and briefly introduce user-based evaluations, which can have various objectives.

2.4.1 Challenges for AR evaluations

Although many novel applications of AR have been brought forth over its short history, formal evaluations of the systems are comparatively few [35, 102]. The reason can perhaps be ascribed to several challenges in evaluating AR.

Diverse implementations
As we have seen from the preceding discussions on AR's fundamental components, there are many implementations for each of them. For example, the output device can take the form of a hand-held touch screen, a desktop monitor, a head-worn display, a projector etc. Tracking can be done with or without sensors. If sensor-based tracking is chosen, there are different types of sensors with disparate working principles. The situation for vision-based tracking is even more complex as new algorithms based on different theories are continuously being devised. Consequently, it is challenging to define universal evaluation techniques and metrics for AR interfaces. One possible solution, as suggested by [33], is to design common evaluation methods within a smaller scope, for instance, creating guidelines for mobile phone AR systems. Additionally, given the diversity of system implementations, it is also

not safe to generalize existing evaluation results of individual systems, which also contributes to the lack of common design and evaluation principles.

Elusive end users
Since AR is a relatively new type of UI, major research interest has been placed on exploring the vast design space — discovering new applications, developing new fundamental technologies (e.g. display and tracking), designing new interaction techniques and metaphors etc. The target users for these new inventions are often not known or well understood, which results in solutions looking for problems [23, 33]. Without knowledge of the end users, it is difficult to evaluate a system meaningfully, let alone design an effective one.

Difficulties in applying traditional evaluation methods
While many evaluation methods have been developed for traditional UIs such as WIMP, they are not directly suitable for AR systems. First of all, AR features interactions such as object selection and manipulation in 3D space, whereas traditional UIs only deal with interactions in 2D space. Other reasons can be inferred from the challenges discussed previously. For example, multimodal input methods like voice recognition and gaze tracking do not exist for traditional UIs. Heuristics for AR interfaces are lacking due to the novelty of AR as a whole and the idiosyncrasies of individual existing systems. Even if heuristics could be designed for AR, it would be difficult to identify experts to perform the evaluations. All in all, one obstacle preventing AR applications from reaching the wide audiences that applications with traditional UIs enjoy is the lack of rigorous evaluation metrics and methods. Without them, usable products are not guaranteed. Perhaps these metrics and methods can be developed and standardized in the form of guidelines, performance prediction models and frameworks within a specific application domain whose common task goals are well defined and whose implementation details thus do not vary much.

2.4.2 AR user evaluations

As I mentioned at the start of this section, user-driven design has become the focus of AR system development, and so has user-based evaluation of AR systems. According to [33], there are four categories of AR user evaluations in terms of their purpose:
• Studying human perception and cognition
• Examining user task performance
• Examining collaboration between users

• System usability and design evaluation.
The evaluation methods can have objective and/or subjective measures. Objective measures are usually quantitative observations such as accuracy, precision and task completion times, while subjective measures relate to the subjective judgment of users or evaluators, such as ratings and rankings, based on quantitative and/or qualitative data obtained during the experiments. User evaluations of related AR systems were incorporated in Paper IV, V and VI. All these evaluations are concerned with user task performance in the context of facility maintenance, which comprises human perceptive, cognitive and motoric aspects as well as AR system performance. The tasks involve invisible target designation aided by AR tools (refer to Chapter 4), and the major results are represented as objective measures of target designation accuracy and/or precision as well as task completion times. These objective measures of performance have relevant implications for real-life applications, since erroneously locating hidden utilities can waste time and money and even endanger lives [103, 95].
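To make the distinction between accuracy and precision concrete, the following sketch computes both objective measures for a set of target designation attempts. The data points are entirely hypothetical and do not come from the papers; accuracy is taken as the mean distance to the true target, precision as the mean spread around the attempts' centroid:

```python
import numpy as np

# Hypothetical designation attempts (x, y in cm) around a true target at the origin.
attempts = np.array([[1.2, -0.8], [0.9, 0.4], [1.5, -0.2], [1.1, 0.1]])
target = np.array([0.0, 0.0])

# Accuracy: how close the attempts are to the true target position.
errors = np.linalg.norm(attempts - target, axis=1)
accuracy = errors.mean()

# Precision: how tightly the attempts cluster, regardless of the target.
centroid = attempts.mean(axis=0)
precision = np.linalg.norm(attempts - centroid, axis=1).mean()
```

A tool can thus be precise (tight clustering) yet inaccurate (systematically offset), which is exactly why both measures are reported separately in the evaluations.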


3. Digital image processing

Digital image processing lies at the foundation of myriad computer vision applications, among which is the theme of this thesis, namely AR. We have already seen that vision-based tracking for AR relies heavily on image processing and computer vision techniques. The registration of infrared images with live video of a façade reported in Paper VI was also built upon elementary image processing procedures. In this chapter, I will provide a concise introduction to digital image processing and one of its applications pertaining to my research, i.e. image registration.

3.1 Digital images

Images are essentially two-dimensional recordings of the electromagnetic radiation of a “scene”. Modern imaging sensors cover almost the whole electromagnetic spectrum from gamma rays to radio waves. Irrespective of the sensor type, three main steps are required to convert continuous energy signals into digital images: spatial sampling, temporal sampling and signal value quantization [25]. Spatial sampling converts the energy signals received by sensors into a spatially discrete representation. Take the imaging sensors most familiar to us, digital cameras, as an example: their sensor elements are arranged into a 2D array and thus tessellate the sensor plane into a regular grid. Each grid cell (a sensor element) is responsible for one image element, known as a pixel. Temporal sampling refers to the measurement of energy signals at fixed time intervals by the sensors. For instance, a CCD (charge-coupled device, one type of sensor typically used in digital cameras) carries out this step by triggering the charging process and measuring the electrical charge which has built up over a specific amount of time. Each temporal sampling results in one image of the “scene”. The amount of energy received by each sensor element during the temporal sampling needs to be converted to digital format, usually an integer scale, so that it can be stored in and processed by computers. This final digitization step is signal value quantization. The integer scale is determined by the number of bits used for encoding a pixel. For example, each pixel of an 8-bit gray-level image is capable of displaying 2^8 = 256 shades of gray. A true color image uses 24 bits for each pixel, namely 8 bits for each color channel (red, green and blue), and therefore one pixel can take on 2^24 different colors. Figure 3.1 demonstrates the discrete nature of a digital image. According to the discussion so far, digital images can in fact be viewed

Figure 3.1. Results of sampling and quantization can be viewed by zooming in on a digital image.

as 2D matrices of N rows and M columns with each pixel corresponding to an element of the matrix. If we assign a 2D Cartesian coordinate system to a digital image (Figure 3.2), we can also view it as a 2D function which maps from the natural number domain ℕ × ℕ to a range of possible pixel values P, i.e.

I = f(x, y),  x, y ∈ ℕ and I ∈ P.  (3.1)

Figure 3.2. A digital image coordinate system: the x axis spans the M − 1 columns, the y axis spans the N − 1 rows, and f(x, y) denotes the pixel value at position (x, y).

Digital images used to be handled by small groups of specialists due to the high costs of acquisition and processing hardware. Today, digital images have pervaded our lives thanks to the popularity of digital cameras. Common mobile phones or tablet computers with built-in cameras, for example, can easily capture and store images of several megapixels. Equipped with high-speed processors and large storage, our personal computers are capable of performing a wide range of image editing and processing tasks. Additionally, various image file formats were developed to facilitate the transmission and sharing of images, which further contributed to their popularity. While this technological advancement has brought digital images closer to the general public, it has also undoubtedly widened their applications in the professional domains. Nowadays the use of digital images can be found in medicine, electronic manufacturing, the entertainment industry, astronomy, surveying, meteorology — to name but a few. The omnipresence of digital images calls for knowledge and skills in processing them, especially in professional settings. Although it is unrealistic to give a thorough introduction to the concepts and techniques of digital image processing within this text, I will delimit its scope and exemplify some of the tasks/operations associated with digital image processing in the following section. For a firm grasp of the field's basis, readers are referred to the excellent textbooks [40] and [25].
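The matrix view of Equation (3.1) and the quantization step described above can be illustrated with a few lines of NumPy (the signal values here are hypothetical, chosen only for demonstration):

```python
import numpy as np

# A digital image as an N x M matrix of quantized values (Equation 3.1):
# here N = 3 rows and M = 4 columns, with pixel value I = f(x, y).
image = np.arange(12, dtype=np.uint8).reshape(3, 4)
value = image[1, 2]            # row y = 1, column x = 2, i.e. f(2, 1)

# Signal value quantization: an 8-bit encoding yields 2**8 gray levels,
# and a 24-bit true-color pixel can take on 2**24 different colors.
levels = 2 ** 8
colors = 2 ** 24

# Quantizing hypothetical normalized sensor readings in [0, 1) to 8 bits.
signal = np.array([0.0, 0.1234, 0.5, 0.9999])
quantized = np.floor(signal * levels).astype(np.uint8)
# quantized -> [0, 31, 128, 255]
```

Note the indexing convention: the first matrix index is the row (the y coordinate of Figure 3.2), the second is the column (the x coordinate).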

3.2 Scope of digital image processing

According to [40], digital image processing comprises processes whose inputs are images while outputs can be images or a set of attributes extracted from images. It is closely related to the fields of digital image analysis and computer vision, but there is no clear-cut boundary between them. Rather, the authors define their relationships in terms of a continuum, much like the way AR is defined with respect to mixed reality in Figure 2.1. This continuum has digital image processing and computer vision at its two ends and contains three levels of computerized processing, as shown in Figure 3.3.

Figure 3.3. The digital image processing-computer vision continuum from [40], spanning low-level processes (digital image processing), mid-level processes (digital image analysis) and high-level processes (computer vision).

Beginning at the digital image processing end of the continuum, the level of processing increases as we travel along it. Low-level processes are applied to images once they are obtained from the sensors and prepare them for subsequent operations. These processes are primitive and their outputs are also images. Example processes at this level are image enhancement and image restoration. Mid-level processes generally extract image attributes and objects (e.g., edges, lines, shapes) and describe them in a way that computers can process. Image segmentation and classification are common tasks in this category. Finally, high-level processes attempt to understand the content of an image based on the attributes and objects recognized at the previous level, thus stepping into the domain of image analysis and eventually computer vision. Table 3.1 lists the specific processing methods/techniques employed in Paper I, II and III with respect to the processing categories they belong to. Although digital image processing was certainly applied in my other papers, e.g. vision-based tracking in Paper V, I did not develop any methods there myself. Therefore, these papers are excluded from the table.

Table 3.1. Digital image processing methods used in some papers included in this thesis. L: low-level processing; M: mid-level processing according to [40]

                                    Paper I               Paper II                                   Paper III
Image acquisition (L)               digital camera        digital camera                             digital camera, infrared camera
Image enhancement (L)               noise reduction       noise reduction                            noise reduction, histogram equalization
Color processing (L)                RGB to gray-level     RGB to gray-level, RGB to HSV              RGB to gray-level
Image restoration (L)               -                     -                                          geometric transformations
Image segmentation (M)              edge, line detection  edge, line detection, region segmentation  edge, line detection, region segmentation
Representation and description (M)  -                     Fourier descriptor, texture descriptors    region area
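Two of the low-level operations listed in the table, RGB-to-gray-level conversion and histogram equalization, can be sketched in a few lines of NumPy. This is an illustrative toy version under assumed ITU-R BT.601 luminance weights, not the actual code from the papers:

```python
import numpy as np

def rgb_to_gray(rgb):
    """Luminance-weighted RGB -> gray-level conversion (BT.601 weights assumed)."""
    weights = np.array([0.299, 0.587, 0.114])
    return np.rint(rgb @ weights).astype(np.uint8)

def equalize_histogram(gray):
    """Spread the gray-level histogram over the full 0..255 range via the CDF."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min())   # normalize to [0, 1]
    lut = (cdf * 255).astype(np.uint8)                  # lookup table per gray level
    return lut[gray]

# A tiny low-contrast test image: two gray levels close together.
rgb = np.full((4, 4, 3), 100, dtype=np.uint8)
rgb[2:, :, :] = 110
gray = rgb_to_gray(rgb)
equalized = equalize_histogram(gray)
# The narrow range [100, 110] is stretched toward the full dynamic range.
```

As the comment notes, equalization stretches the two nearby gray levels far apart, which is precisely why it helps reveal subtle intensity variations, e.g. in infrared façade images.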

3.3 Image registration

Previous discussions have shown that lower-level digital image processing serves as the foundation for image analysis and/or computer vision applications. One such application, developed in my research, is image registration (Paper III); it was later employed in the façade AR experiment (Paper VI). In this section I will introduce various aspects of image registration and how they are addressed in the method described in Paper III.

3.3.1 Overview

Images of the same scene are rarely identical. Because of different viewing positions, the same scene objects can appear in different sizes, at different positions within the images, rotated and/or with different perspectives. These differences may also result from different imaging times or sensors, although in these cases the pixel values of the same scene objects can differ, too. Image registration uses image processing and analysis techniques to overcome the aforementioned geometric differences so that the same scene objects in different images (often called the reference and sensed images) are aligned with each other when the images are overlaid. Image registration is useful in a broad range of application domains. By registering images of a scene taken from different viewing positions, photographers or artists can stitch these images together to form a panorama of the scene, which is normally not possible with a single shot. As touched upon in Section 3.1, there are imaging sensors working on virtually every frequency band of the electromagnetic spectrum, and different imaging sensors usually reveal different aspects of the imaged objects. By registering images from various sensors, people can obtain more complete knowledge of the objects in question. For example, registration of magnetic resonance images (MRI) with images of single photon emission computed tomography (SPECT) helps doctors accurately locate the positions of illness so that effective treatment plans can be made; registration of multi-band satellite images enables earth scientists to understand a phenomenon more thoroughly; and aligning thermal infrared images with their visible counterparts allows facility maintenance workers to pinpoint building

defects such as heat or water leakage. Lastly, registering time-variant images facilitates change detection over time, which is widely used in medicine, remote sensing and surveillance. Since video see-through AR needs to align virtual information with videos, which are essentially real-time image sequences, registering time-variant images can be applied to AR as well. Given the wide variety of image contents stemming from diverse application domains and multiple modalities, it is not possible to devise a universal method that registers all sorts of images. However, four general steps can be identified in the majority of image registration methods: feature detection, feature matching, transformation model estimation, and image resampling and transformation [119]. Their goals, challenges and common approaches will be described in the following subsection.

3.3.2 General steps for image registration

Feature detection
Features are salient and distinctive image entities that are matched across images to align the scene objects. Good features should be unique, numerous and spread all over an image. They were traditionally selected manually by experts; nowadays it is preferable that they be detected automatically. There are features based on image regions, lines or points. Region features are often closed-boundary image areas representing scene objects after segmentation processes, such as buildings, highways and lakes in satellite images. Line features are usually extracted from contours or edges of scene objects. Point features are perhaps the most commonly used type of feature, with examples like corners, the popular SIFT (Scale Invariant Feature Transform) feature [66] and its many variants. Regardless of their type, features are often represented by points called control points (CPs) to facilitate the subsequent steps. Examples of CPs are the vertices or centers of gravity of a region feature, the end points of a line etc. The challenges of feature detection lie in deciding what features to use for a specific registration task and in designing the detection algorithms themselves, which have to be fast, accurate and robust since all the remaining steps rest upon them. It is also worth mentioning that there is one type of image registration method based on image regions of pre-defined size (not to be confused with region features) or even the whole image. These so-called area-based methods skip the feature detection step and instead estimate the image correspondence directly from those regions in the subsequent feature matching step. The design of a suitable feature in Paper III began with line features, because we were registering visible images with thermal infrared (TIR) ones and only strong intensity discontinuities such as edges are preserved in both modalities [29].
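Edge features of this kind are typically extracted with simple gradient convolutions. The following pure-NumPy sketch uses a Sobel operator on a synthetic step edge; it is only an illustration of the principle, not the actual edge detector developed in Papers I and III:

```python
import numpy as np

def sobel_edges(gray):
    """Approximate the gradient magnitude of a gray-level image with Sobel kernels."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = gray.shape
    mag = np.zeros((h, w))
    padded = np.pad(gray.astype(float), 1, mode="edge")
    for y in range(h):
        for x in range(w):
            window = padded[y:y + 3, x:x + 3]
            gx = (window * kx).sum()       # horizontal intensity change
            gy = (window * ky).sum()       # vertical intensity change
            mag[y, x] = np.hypot(gx, gy)
    return mag

# A vertical step edge, e.g. the boundary of a window frame on a façade.
img = np.zeros((5, 6), dtype=np.uint8)
img[:, 3:] = 255
edges = sobel_edges(img)
# The response peaks along the step (columns 2 and 3) and vanishes elsewhere.
```

Because such intensity discontinuities survive in both visible and TIR imagery, edge responses like these form a sensible starting point for cross-modal registration.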
We further observed that façades contained plenty of rectangles, mainly resulting from windows and entrances, and most of these rectangles

were formed by horizontal and vertical lines. To take advantage of this fact, we grouped line segments originating from edges into quadrilaterals and used them as features for registration. These region features are at a higher level than their linear counterparts, and most of them represent salient façade elements such as windows. This added knowledge facilitated the design of the subsequent steps in the process. I used the edge center points of each quadrilateral feature as CPs and hence each feature is represented by four CPs.

Feature matching
The correspondence between detected features in a sensed image and those in the reference image is established in this step. There are in general two approaches to this task. The first is an iterative process of hypothesis and evaluation, while the second models features as descriptors, sets of numerical attributes that can be directly compared with each other. No matter which approach is adopted, a criterion or similarity measure is required to determine whether correspondence should be established. For the first approach, matching hypotheses are often made according to neighboring pixel values, feature spatial relations or a priori domain knowledge. The hypotheses are then evaluated by a designated similarity measure. Examples of similarity measures in this category are cross-correlation of pixel values and mutual information [83]. Many feature descriptors have been proposed for image registration as well as for tracking in computer vision, which I have discussed in Chapter 2. Common ones are based on, e.g., shape/topology descriptions [63], statistical moments of pixel intensity [46] and histograms of pixel gradient directions within feature regions [66]. Since features are represented in numerical form, the similarity measures for this approach can be as straightforward as testing the equality of these quantities or the distance between two positions in the feature space.
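As a toy illustration of the descriptor-based approach (not the hypothesis-and-evaluation scheme actually used in Paper III), nearest-neighbour matching with Euclidean distance as the similarity measure might look like this; the descriptor vectors and the `max_distance` threshold are assumed:

```python
import numpy as np

def match_features(desc_sensed, desc_ref, max_distance=0.5):
    """Greedy nearest-neighbour matching of feature descriptors.

    Each descriptor is a fixed-length numeric vector; Euclidean distance
    serves as the similarity measure. Returns (sensed_idx, ref_idx) pairs.
    """
    matches = []
    for i, d in enumerate(desc_sensed):
        dists = np.linalg.norm(desc_ref - d, axis=1)
        j = int(np.argmin(dists))
        if dists[j] <= max_distance:        # reject implausible correspondences
            matches.append((i, j))
    return matches

# Hypothetical 3-dimensional descriptors for reference and sensed features.
ref = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.5, 0.5, 0.5]])
sensed = np.array([[0.1, 0.9, 0.0], [0.6, 0.4, 0.5]])
pairs = match_features(sensed, ref)
# pairs -> [(0, 0), (1, 2)]
```

Real matchers additionally apply cross-checks or ratio tests to weed out ambiguous matches, but the distance-threshold idea is the same.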
The major challenges for this step stem from different imaging conditions: the same scene objects can take on different shapes in different images, be displayed with vastly different pixel values and/or be fully or partially occluded. Hence, feature descriptors and similarity measures have to be designed and applied so that they are robust or invariant to these changes. Since matching usually involves searching through a large number of features, the algorithms need to be efficient as well. The feature matching in Paper III adopted the first approach described above. Since the hypothesis module is closely tied to the evaluation one, which takes advantage of the transformation model estimation, I will discuss them together in the next step.

Transformation model estimation

Once a group of features from the sensed image are matched with the ones in the reference image, the geometric differences between these two images may be modeled. As presented earlier, a coordinate system can be associated with a digital image (see Figure 3.2). The difference model is then expressed by a series of transformations applied to the pixel coordinates. The most common geometric transformations operated on images are 2D projective transformations, i.e., linear transformations on homogeneous 3-vectors represented by non-singular 3 × 3 matrices:

$$\begin{pmatrix} x_1' \\ x_2' \\ x_3' \end{pmatrix} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}, \qquad (3.2)$$

where $(x_1, x_2, x_3)^T$ is the homogeneous coordinate of a pixel, obtained by simply extending its 2D Cartesian coordinate with one more element set to 1. The 3 × 3 matrix is called a homography, and the various image geometric differences mentioned at the beginning of this section can be modeled by imposing different constraints on the values of its elements. The corresponding geometric transformations form a hierarchy, as shown in Figure 3.4. From the bottom up, the transformations become more and more general as fewer constraints are imposed. In the following text, I will go through this generalization process by presenting the corresponding homographies and demonstrating the effects of translation, rotation, scaling, affine and projective transformations on an example image.
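Applying Equation 3.2 to a pixel amounts to one matrix multiplication followed by division by the third homogeneous component. A minimal NumPy sketch (`apply_homography` is a hypothetical helper name):

```python
import numpy as np

def apply_homography(H, points_xy):
    """Apply a 3x3 homography to an array of 2D Cartesian pixel
    coordinates (Equation 3.2): extend each point to a homogeneous
    3-vector, multiply by H, then divide by the third component."""
    pts = np.asarray(points_xy, dtype=float)
    homo = np.hstack([pts, np.ones((len(pts), 1))])   # (x, y, 1)
    mapped = homo @ H.T                               # x' = H x
    return mapped[:, :2] / mapped[:, 2:3]             # back to Cartesian
```

For example, with a pure translation homography (tx = 20, ty = 50), the point (0, 0) maps to (20, 50).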









Figure 3.4. The hierarchy of projective transformations

Translation displaces an image in a given direction and is expressed by

$$H_T = \begin{bmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \\ 0 & 0 & 1 \end{bmatrix}, \qquad (3.3)$$

where $t_x$ and $t_y$ are the translation amounts in the x and y directions respectively. Figure 3.5 b displays the result of translating the example image (Figure 3.5 a) with $t_x = 20$ and $t_y = 50$. Translation has two degrees of freedom, which means two independent parameters, namely $t_x$ and $t_y$, need to be specified to determine the transformation.

Image rotation around its origin is expressed by

$$H_R = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}, \qquad (3.4)$$

where $\theta$ is the rotation angle, with positive values indicating clockwise rotation and negative ones counter-clockwise. In Figure 3.5 c, the example image is rotated 20 degrees. Rotation has only one degree of freedom. Combinations of translations and rotations model the motion of a rigid object, and hence they are also called rigid transformations, whose homography is

$$H_{rig} = \begin{bmatrix} \cos\theta & -\sin\theta & t_x \\ \sin\theta & \cos\theta & t_y \\ 0 & 0 & 1 \end{bmatrix}. \qquad (3.5)$$

Rigid transformations have three degrees of freedom.

Scaling enlarges or shrinks an image in the direction of x and/or y by respective scaling factors. The homography for scaling is

$$H_S = \begin{bmatrix} s_x & 0 & 0 \\ 0 & s_y & 0 \\ 0 & 0 & 1 \end{bmatrix}, \qquad (3.6)$$

where $s_x$ and $s_y$ are the scaling factors in the x and y directions respectively. The result of scaling the example image with $s_x = 0.5$ and $s_y = 1.5$ is shown in Figure 3.5 e. In this case, the image is shrunk by half in the x direction while being enlarged by half in the y direction. Scaling thus has two degrees of freedom. If $s_x = s_y$, the scaling is uniform or isotropic. Similarity transformations combine a rigid transformation with a uniform scale and thus have the form

$$H_{sim} = \begin{bmatrix} s\cos\theta & -s\sin\theta & t_x \\ s\sin\theta & s\cos\theta & t_y \\ 0 & 0 & 1 \end{bmatrix}. \qquad (3.7)$$

Similarity transformations have four degrees of freedom.

An affine transformation (or simply an affinity) can be expressed by any non-singular matrix of the form

$$H_{aff} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ 0 & 0 & 1 \end{bmatrix}. \qquad (3.8)$$

As depicted in Figure 3.4, affine transformations are the most general transformations below the projective ones in the hierarchy, and an affinity can represent any combination of the transformations I have discussed so far. Figure 3.5 f shows the example image transformed by an affinity combining two rotations and a non-uniform scaling. An affine transformation has six degrees of freedom.

Comparing the homography in Equation 3.8 with the one in Equation 3.2, we can see that the last constraint imposed on an affinity is removed for projective transformations: the last row no longer has to be [0 0 1]. Thus, we have arrived at the general form of a non-singular 3 × 3 matrix introduced at the start. Being the most general transformations in the hierarchy, projective transformations can model not only translations, rotations and scalings of an image but also the perspective distortions caused by different imaging angles, which is crucial for registering the images used in my research. Although the matrix has nine elements, only their ratios are significant; hence, a projective transformation has eight degrees of freedom. An example result of the transformation is given in Figure 3.5 d.







Figure 3.5. Examples of 2D projective transformations on an image. (a) Original image (b) Translation (c) Rotation (d) Projective (e) Scaling (f) Affine
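The matrices in this hierarchy are straightforward to build and compose. The sketch below (illustrative NumPy code, not from the papers) constructs the elementary homographies and composes a rotation, a uniform scale and a translation into the similarity form of Equation 3.7:

```python
import numpy as np

def translation(tx, ty):
    """Eq. 3.3: two degrees of freedom."""
    return np.array([[1, 0, tx], [0, 1, ty], [0, 0, 1]], float)

def rotation(theta):
    """Eq. 3.4: one degree of freedom."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]], float)

def scaling(sx, sy):
    """Eq. 3.6: two degrees of freedom."""
    return np.array([[sx, 0, 0], [0, sy, 0], [0, 0, 1]], float)

theta = np.deg2rad(20)
# Rigid transformation (Eq. 3.5): rotate, then translate.
H_rig = translation(20, 50) @ rotation(theta)
# Similarity (Eq. 3.7): additionally apply a uniform scale s = 2.
H_sim = translation(20, 50) @ rotation(theta) @ scaling(2, 2)
```

All of these compositions keep the last row [0, 0, 1] and are therefore affine; a general projective homography is obtained precisely by relaxing that constraint.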

All the projective transformation models described so far are applied to images globally, i.e., an entire image is transformed with one model. This approach fails to handle locally varying image changes, which occur frequently in medical imaging [119] due to the non-rigid shapes of the scene objects in question. To cope with this shortcoming, local registration techniques have been proposed, of which there are in general two variants. The first is piecewise image registration [5, 115], in which images are divided into subregions and the global transformations or registration methods are applied to each subregion to achieve the overall registration goal. The second variant employs a transformation model that takes local deformations into consideration. These models are usually non-linear, which means they cannot be represented by matrix multiplication, nor do they preserve lines. The prime example of these models is the family of radial basis functions (RBFs) [118, 4]. The local nature of RBFs is reflected in the fact that their function values depend on the distance between the pixel in question and the CPs, rather than on the pixel position itself. Since the scene objects in my study are planar (façades), the global transformation models suffice and I did not explore local non-linear transformations further.

Assuming features detected in the first step are accurate and matched correctly in the second step, the main challenge for transformation model estimation is to figure out which transformation model relates the images to be registered. Fortunately, the answer can often be obtained from domain knowledge and experience. For example, in my studies I worked with planar façades, and it is known that two images of a plane taken from different viewing angles are related by a homography [41]. The actual estimation can then be carried out with well-established methods such as the normalized direct linear transformation (DLT) or the Gold Standard algorithm [41].
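The core of the DLT can be sketched compactly. The version below (illustrative, unnormalized; production code should first normalize the points as in the normalized DLT of [41] for numerical stability) estimates a homography from n ≥ 4 point correspondences by finding the null vector of the stacked linear constraints via SVD:

```python
import numpy as np

def dlt_homography(src, dst):
    """Estimate a 3x3 homography from n >= 4 point correspondences
    with the direct linear transformation (DLT). Each correspondence
    (x, y) -> (u, v) contributes two linear equations in the nine
    entries of H; the solution is the right singular vector of the
    coefficient matrix with the smallest singular value."""
    src = np.asarray(src, float)
    dst = np.asarray(dst, float)
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.asarray(rows)
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]   # fix the overall scale (eight degrees of freedom)
```

With exactly four correspondences in general position the solution is exact; with more, the SVD gives the algebraic least-squares estimate.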
As mentioned previously, feature matching and transformation model estimation are combined in Paper III through the hypothesis-evaluation approach. The formulation of a matching hypothesis is based on the assumption that the position of a façade does not alter drastically between a pair of sensed (TIR) and reference (visible) images, so features within vicinity in both images are hypothetically matched, provided certain geometric properties are the same. The forward selection algorithm is then employed to iteratively estimate a homography from selected hypothetical matches and verify it with a feature area-based similarity measure. At the end of the process, the hypothetical matches that yield the highest similarity score are considered to be true matches and the resulting homography is used for the final registration.

Image resampling and transformation

The actual transformations of the sensed images take place in this step. While the source pixel coordinates are always expressed as integers, their transformed coordinates are almost always non-integer. Therefore, image resampling through interpolation is required to determine the destination pixel values. Common interpolation methods used for image registration are nearest neighbor, linear and cubic. Nearest-neighbor interpolation is the fastest to compute but gives the poorest visual results, while cubic interpolation is the opposite. Consequently, the challenge for this step is to select an interpolation method that strikes a balance between efficiency and quality. More details on the topic of interpolation can be found in [61].

The non-integer transformed coordinates also bring about another practical issue concerning image transformation. If we transform an image naturally, namely from source to destination (also known as forward transformation), we encounter the difficulty of determining the contribution of a transformed pixel to its neighboring pixels on the destination grid. As we can see in Figure 3.6, a pixel at P is transformed to P′ on the destination grid, and it is then not obvious how this pixel should contribute its value to the pixels at Q1′, Q2′, Q3′ or Q4′ given a certain interpolation method. To overcome this, the backward transformation (destination to source) is usually performed for image resampling, and Figure 3.7 illustrates this idea. In this case, to determine the value of a pixel at P′, its coordinate is transformed into the source grid via the inverse transformation model. Now, since all the neighboring pixels required for interpolation, e.g. Q1, Q2, Q3 and Q4, are known, the value of the pixel at P′ can be computed readily.












Figure 3.6. Forward transformation has difficulty determining the values of pixels at destination (Q1′ to Q4′) from a transformed pixel P′.










Figure 3.7. The value of a pixel at destination P′ can be computed by looking up the neighboring pixels at source, e.g., Q1 to Q4, after backward transformation.

Image resampling and transformation in Paper III was implemented with the MATLAB Image Processing Toolbox, where cubic interpolation was chosen.
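The backward-mapping resampling described above can be sketched in a few lines of NumPy (an illustrative re-implementation using bilinear rather than cubic interpolation; `warp_backward` is a hypothetical name, not the toolbox function used in Paper III):

```python
import numpy as np

def warp_backward(src_img, H, out_shape):
    """Resample src_img under homography H by backward mapping: for
    every destination pixel, apply the inverse of H to find the source
    location and interpolate bilinearly among its four neighbours
    (Q1..Q4), as in Figure 3.7. Cubic interpolation would use a 4x4
    neighbourhood instead. Pixels mapping outside the source are 0."""
    h_out, w_out = out_shape
    Hinv = np.linalg.inv(H)
    ys, xs = np.mgrid[0:h_out, 0:w_out]
    dst = np.stack([xs, ys, np.ones_like(xs)]).reshape(3, -1).astype(float)
    src = Hinv @ dst
    sx = src[0] / src[2]
    sy = src[1] / src[2]
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    fx, fy = sx - x0, sy - y0
    h, w = src_img.shape
    valid = (x0 >= 0) & (x0 < w - 1) & (y0 >= 0) & (y0 < h - 1)
    out = np.zeros(h_out * w_out)
    x0v, y0v = x0[valid], y0[valid]
    fxv, fyv = fx[valid], fy[valid]
    img = src_img.astype(float)
    out[valid] = (img[y0v, x0v] * (1 - fxv) * (1 - fyv)
                  + img[y0v, x0v + 1] * fxv * (1 - fyv)
                  + img[y0v + 1, x0v] * (1 - fxv) * fyv
                  + img[y0v + 1, x0v + 1] * fxv * fyv)
    return out.reshape(h_out, w_out)
```

For an integer translation, the warp simply shifts the image; non-integer coordinates are blended from the four surrounding source pixels.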

4. Facility maintenance

4.1 AR and the AEC/FM industry

Buildings and public infrastructures form the backbone of our societies, as they support our daily life by providing habitations, workplaces, commodities and services. Constructing and maintaining them are, not surprisingly, large and complex undertakings. Related projects often span long periods of time and cost large sums of money. They tend to generate a large amount of data and information, which must be accessed by a great number of people with different experience and agendas who are often located in various geographical places. A concrete example from the BIM Handbook [36] indicates that nearly all large-scale projects cost more than 10 million US dollars; such a project can involve 850 people from 420 companies and generate 50 different types of documents, amounting to 56,000 pages or 3,000 megabytes of data if scanned. It is thus hardly surprising that the AEC/FM industry has been actively adopting IT to help manage its projects effectively and efficiently. A comprehensive review of IT in the industry by Lu et al. [67] reports that web technology, wireless technology, mixed reality (including virtual reality and AR), electronic data interchange/electronic data management systems and BIM (building information modeling) are the top five information technologies utilized by AEC organizations, based on the literature they surveyed between 1998 and 2012. According to the review, mixed reality technology is mostly used for collaboration/coordination, followed by decision making and performance. More specific to AR, Shin and Dunston [97] have identified eight work tasks in industrial construction that are suitable for AR applications: layout, excavation, positioning, inspection, coordination, supervision, commenting and strategizing. Chi et al. summarized research trends and possible research opportunities of AR in AEC in terms of localization, natural UI, cloud computing and mobile devices [27].
Up-to-date surveys on existing AR applications in AEC/FM can be found in [86, 14].

4.2 AR and facility maintenance

While the exact definition, nature and scope of facility management (FM) are still under continuous debate [32], the general consensus is that FM can be categorized into “hard” FM, “soft” FM and business support services. “Hard” FM refers to physical infrastructure maintenance, e.g., building structure/shell maintenance, building fabric refurbishment and mechanical/electrical maintenance. “Soft” FM incorporates services that enable and sustain the operations of a building or facility, for example cleaning, reception and security. Finally, business support services facilitate and form the foundation of business development, for instance property leasing/renting services, human resources and FM contract control.

In view of the contents of FM, for a building, FM mainly takes place at the use stage of the building’s life cycle [24], which is by far the longest stage. Moreover, the cost of FM, the so-called “ownership cost”, also accounts for the largest portion (60% to 80% according to [89]) of the total life cycle costs, much more than the initial acquisition cost associated with the AEC activities. Given this pivotal role played by FM in a building’s life cycle in terms of time and cost, applying IT to FM to increase operational efficiency and reduce operational cost promises substantial benefits. Indeed, software systems such as computer-aided facility management (CAFM) and computerized maintenance management (CMM) have recently been developed and even integrated with information from other stages of the life cycle through BIM to digitally support facility operators [3].

AR for FM, on the other hand, remains largely unexplored territory [112]. This is unfortunate because, according to the few existing works on the subject, overcoming the interaction seam between the physical world and cyberspace [50] with AR shows great potential to improve the efficiency of operations and maintenance (O&M) activities. Lee and Akin reported that O&M field workers could save over 50% of the time, on average, in locating operation destinations with the navigation function of their AR-based fieldwork facilitator compared with conventional methods [60]. Similarly, Koch et al. [55] applied AR to indoor navigation and maintenance instruction support, but their focus was the study of natural marker performance for tracking, so the efficiency gain for maintenance work with AR was not presented. The evaluation of an AR tool for piping assembly in [45] revealed that AR visualization was able to lower workers’ cognitive workload, which shortened the task completion time by 50% and also reduced assembly errors by 50%. These improvements naturally lead to reduced costs for labor and false-assembly correction.

The scarcity of AR research in FM provides the rationale for me to investigate this specific application domain, in particular facility maintenance. With the maturity of AR software development frameworks and personal hand-held devices (refer to the discussions in Chapter 2), I was able to carry out the studies reported in Papers V and VI. Both studies are essentially concerned with invisible target designation tasks aided by AR tools. The reason for this choice is simple: maintenance workers need to first locate the intended utility or equipment, and most likely mark up its position, before they can carry out the actual maintenance. We evaluated user task performance in terms of precision and/or accuracy as well as task completion times in those studies, as mentioned in Section 2.4.

In Paper V, the designation task was specified as locating hidden pipes behind a wall utilizing the AR “X-ray vision” visualization metaphor [12, 65]. Since there is often space between the pipes and the wall (the depth of the pipes), if the current AR view is not perpendicular to the actual pipe position, marking the pipe directly on the wall where the AR tool indicates it will result in parallax errors. The paper studies the interplay between user marking precision, viewing angle and depth of the pipes behind the wall. Our results show that marking precision worsens as either viewing angle or depth increases, and therefore an explicit visual guide on the wall plane must be included for this type of AR tool.

The designation task in Paper VI was specified as locating a group of infrared heat signatures on a façade, made visible through an AR interface developed for the study. IRT (infrared thermography) has been a very popular non-destructive and non-contact testing technology for building diagnostics [11, 58]. The capability of infrared sensors to display temperature makes IRT very effective for detecting building anomalies such as heat loss, damaged or missing thermal insulation, thermal bridges, and air and/or water leakages. Conventionally, building inspectors need to mentally relate thermal readouts to the actual location being examined. Given the popularity of IRT, we believe the elimination of these mental processes through AR will benefit diagnostic practice even more. Since we needed to work outdoors, the laser-based designation used in Paper V would not work due to the strong natural lighting, as already discussed in Section 2.3.1. The AR system in Paper VI was thus implemented with the TPP.
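The parallax error studied in Paper V follows from simple trigonometry: a pipe lying at depth d behind the wall surface, viewed at angle θ off the wall normal, appears displaced on the wall by roughly d·tan θ. A minimal sketch of this geometry (illustrative only, not the exact error model of Paper V):

```python
import math

def parallax_error(depth_cm, viewing_angle_deg):
    """Displacement on the wall plane (cm) when marking a pipe lying
    depth_cm behind the wall while viewing at viewing_angle_deg off
    the wall normal: error = depth * tan(angle)."""
    return depth_cm * math.tan(math.radians(viewing_angle_deg))
```

For instance, a pipe 10 cm behind the wall viewed 30° off-normal yields a displacement of about 5.8 cm, consistent with the observed trend that error grows with both viewing angle and depth.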
Another change was that the AR system adopted the image registration method developed in Paper III for aligning the infrared information with the façade video, since it is impractical to cover a surface as large as a façade with fiducial markers. In this study, we also recruited more potential end users of the system as test subjects: over half of them came from fields related to the built environment. The analysis of the results from the user experiments and four independent benchmark tests shows that the system factor, namely the image registration method, accounts for the largest part of the designation error compared with the influence of human factors, such as perception and cognition with TPP AR and placement of the markers. In general, the study results favor the application of TPP AR in thermographic building inspections.


5. Summary of the papers

Overview

The papers included in this thesis can be sorted into two lines of research, which converge at Paper VI (see Figure 5.1). The first line consists of Papers I, II and III. The research goal here is to develop a type of feature that is abundant and reliable in both visible and infrared building images, so that the images can be registered by matching the features. The development started with Paper I. During the process I quickly noticed that building images contain many straight lines and developed a method to detect both long and short line segments in the images. The detected line segments are further grouped according to their orientations in Paper II. Line segments from a façade are mostly horizontal and vertical, so they converge toward two major vanishing points in an image. With the help of this perspective information, I was able to segment façade regions in images. Although the coarse outlines of façades segmented in Paper II turned out to be inadequate for registering visible and infrared images, the grouping of line segments in two major orientations led to the discovery of the quadrilateral features in Paper III. The four sides of a quadrilateral feature are formed by two pairs of intersecting horizontal and vertical line segments. The experiments conducted in the paper show promising registration results with this type of feature.






Figure 5.1. Relations between papers included in this thesis

The second line of research involves actual implementations of AR systems and user-based evaluations of them. The goal is twofold: first, we would like to explore the viability of AR tools for facility maintenance utilizing consumer-level hand-held smart devices and off-the-shelf software development toolkits; second, we would like to find out how users perform and which factors (both human and system) affect that performance, given the developed tools and the tasks. More specifically, Paper V examines user performance in designating the positions of hidden pipes through the AR “X-ray vision” visualization metaphor in an indoor environment. To overcome the visualization-interaction dilemma described at the end of Section 2.3.1, the users were given a laser pointer to designate the pipe positions. The measures we applied in the study include designation precision and task completion time. While working on Paper V, it quickly dawned on us that the laser pointer approach would not work outdoors due to the disparate lighting conditions. Therefore, we turned to TPP AR and designed a pilot study to evaluate user performance with it, which is detailed in Paper IV. Finally, in Paper VI we conducted one more user performance evaluation, which combines the results and findings from these two lines of research. Similar to the tasks in Papers IV and V, users were instructed to designate the positions of infrared heat signatures on a façade with the aid of a TPP AR tool. The infrared information is aligned with the façade video using the visible/infrared image registration method developed in the first line of my research, while the design of the AR tool, the user experiments and the error analysis methods benefited from the experience and insights we gained in the second line. Figure 5.2 shows the connections between each paper and the elements of AR experience depicted in Figure 2.2. According to the figure, the first line of research is concerned with one of the four core enabling technologies of AR, i.e. tracking and registration, while the second line and Paper VI incorporate the user end of the experience.


Figure 5.2. Relations between the papers included in this thesis and the elements of AR experience


Paper I

Façades contain plentiful straight lines, which are often parallel to each other and form regular shapes. In this paper we developed a line segment detection method that paves the way for devising higher-level image features specific to buildings. Our method starts with computing the moduli of image gradients using Sobel filters (both horizontal and vertical). The gradient maps tell us the locations in an image where abrupt pixel intensity changes have occurred, which often signify object edges and boundaries. We then ran the Canny edge detector on the gradient maps to obtain an edge map of the original image. From here, our method is divided into two parts: long and short line segment detection. The long line segments usually derive from building silhouettes, long ledges and other large patterns on façades. They can be used to delimit the building region in an image. We used the straightforward Hough transform on the edge map to detect the long line segments. On the other hand, the frames of windows and entrances often consist of short line segments whose spatial relations may be used to characterize a specific façade. To detect these line segments, we computed the connected components in the horizontal and the vertical gradient maps respectively and then confined the Hough transform locally to these connected components to identify line segments that represent them.

Contributions

We combined classic edge and line detection algorithms to extract both long and short line segments in building images. Due to insufficient votes, the standard Hough transform fails to detect short line segments, which are important for characterizing façade details. We alleviated this disadvantage by running it only within individual connected components of a gradient map.

Paper II

The paper describes a pre-processing method for street-view images of buildings. After applying the method, a coarse segmentation of the façade region is obtained. The purpose of this pre-processing is to filter out the non-building background of an image so that subsequent processes can focus only on the relevant part of the image.

The development of this method is based on two observations about most façades: 1) there are plenty of horizontal and vertical linear features on façades; 2) façades often possess repetitive patterns in these two orientations. Therefore, we transformed the problem of detecting façade regions into detecting horizontal and vertical regions with repetitive patterns (named homogeneous regions in the paper). The overview of our method is illustrated in Figure 5.3.

Figure 5.3. Processing flow of the method described in Paper II (edge line segment extraction, horizontal/vertical vanishing point detection, scan line generation, homogeneous region detection and façade region identification)

Since parallel lines of the same orientation converge on a vanishing point, we begin by grouping image edge line segments into horizontal and vertical directions. From here, we introduce a novel scan-line approach for homogeneous region detection. The scan lines are constructed by connecting the centers of grouped line segments with their respective vanishing points. After merging overlapping lines, we sample the hue channel of the image along each scan line to obtain 1D profiles. Within the same orientation group, the profiles are similar inside a homogeneous region but differ from those in other homogeneous regions. Hence, the homogeneous regions are delimited by scan lines with dissimilar profiles. We use frequency components to represent and compare these 1D profiles. The detected homogeneous regions of orthogonal directions are intersected, and similar homogeneous regions are further merged into larger coherent image regions. According to observation 1) introduced earlier, the resulting image region containing the highest count of horizontal and vertical line segments is selected as the façade region of the image.

The method was tested on the publicly available ZuBuD building dataset. We chose 201 images, one for each of the 201 buildings contained in the dataset. First, a human observer (O1) manually segmented the predominant façades in these images to establish ground truth. Subsequently, both our method and another human observer (O2) performed the same segmentation task. Finally, the segmentation results of O2 and of our method were each compared against the ground truth in terms of correctly and incorrectly classified pixel counts. The resulting confusion matrices showed that our method achieved performance rather similar to that of O2.
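The frequency-based comparison of 1D profiles can be sketched as follows (a hypothetical signature and distance, illustrating the idea rather than the paper's exact measure): representing each profile by normalized Fourier magnitudes makes the comparison insensitive to where a repetitive pattern starts along the scan line.

```python
import numpy as np

def profile_signature(profile, n_coeffs=8):
    """Represent a 1D scan-line profile by the magnitudes of its first
    few Fourier coefficients (skipping the DC term), normalized to unit
    length so the signature ignores overall brightness and contrast."""
    spectrum = np.abs(np.fft.rfft(np.asarray(profile, float)))
    sig = spectrum[1:1 + n_coeffs]
    norm = np.linalg.norm(sig)
    return sig / norm if norm > 0 else sig

def profile_distance(p, q):
    """Profiles from the same homogeneous region should be close;
    profiles from different regions should be far apart."""
    return float(np.linalg.norm(profile_signature(p) - profile_signature(q)))
```

Because the magnitude spectrum is invariant to circular shifts, two scan lines crossing the same repetitive pattern at different offsets still produce nearly identical signatures.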
The limitations of our method, on the other hand, also stem from the two observations on which it is based: first, the linear boundaries of homogeneous regions cannot capture irregular façade boundaries, such as diagonal or curved pediments; second, the method is prone to fail on façades that do not exhibit apparent repetitive patterns.

Contributions

We contributed a novel building region segmentation method that harnesses both image perspective information and the observed fact that façades contain repetitive patterns. More specifically, these patterns are identified through a unique approach of comparing 1D profiles, which result from scanning image pixels along lines constructed with the major vanishing points. The detected building regions can be used for further image analysis tasks such as building recognition.

Paper III

This paper presents an original method for registering TIR (thermal infrared) and visible images of façades. The purpose of designing such a method is to pave the way for AR-based thermographic building diagnostics, where the TIR readings of a façade under inspection are correctly superimposed on digital images or videos of the façade.

Our method brings forward a new quadrilateral feature and a complete registration pipeline (see Figure 5.4) revolving around this feature. The idea of this feature is an extension of the observations presented in Paper II.

Figure 5.4. The proposed registration pipeline (edge line segment extraction, horizontal/vertical line segment grouping, line segment intersection detection, quadrilateral feature assembly, possible feature correspondence establishment, feature pair selection, transformation model estimation and registration quality measurement, iterated until enough feature pairs have been evaluated, after which the registration result and feature correspondences are returned)

A façade typically possesses many windows, whose frames are the major sources of the horizontal and the vertical line segments observed on the façade. Although a lot of visual detail is lost in TIR images, these edge line segments are largely present in both modalities. By grouping these separate line segments, not only can our quadrilateral features be detected across the two modalities, but higher-level knowledge, such as aspect ratio and area, also becomes available to the registration process. Similar to the workflow of Paper II, our method starts with the detection of horizontal and vertical edge line segments with the help of the vanishing points in these two orientations. The quadrilateral features are constructed from the intersecting horizontal and vertical line segments. We use the center point of each edge of a quadrilateral as a control point (CP), hence 4 CPs per feature.

To estimate the transformation model between a pair of TIR and visible images, a hypothesis-evaluation framework is proposed. Quadrilaterals from both TIR and visible images are hypothetically paired based on their spatial proximity (defined by the CPs) as well as their aspect ratios. Note that for a single quadrilateral, there can be more than one hypothetical match. During the evaluation phase, we adopt the forward selection algorithm to enumerate a subset of all possible hypothetical sets of matches. For each set, a tentative transformation model is estimated from the hypothetical quadrilateral pairs and a score is assigned to the resulting registration, calculated from the area ratios of the related quadrilaterals. Finally, the transformation model giving the best registration score is regarded as the true geometric relationship between the current TIR/visible pair.

We tested the method on 41 pairs of images gathered on the university campus. Among them, 33 successfully went through the registration pipeline with visually unnoticeable "ghosting" effects. To quantify the registration error, we manually selected 10 pairs of corresponding points for each pair of images; the overall error of registering them was 3.23 pixels on average. Meanwhile, registering those 10 pairs of points using the transformation model estimated from the points themselves still yielded an average error of 1.08 pixels. This finding tells us that 1) the former error of 3.23 pixels should be interpreted together with this 1.08-pixel baseline, and 2) even a human observer cannot designate corresponding image points perfectly, and our method is inferior to a human observer by only about two pixels on average. The eight image pairs that failed the registration revealed the limitations of our method and, on the flip side, showed that registering TIR and visible images is a challenging task. The main factor limiting the proposed method is that some image pairs do not possess enough rectangular façade elements from which to derive registration features.
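An area-ratio score of the kind used in the evaluation phase can be sketched with the shoelace formula (an illustrative stand-in; the exact measure in Paper III may differ):

```python
import numpy as np

def quad_area(corners):
    """Area of a quadrilateral from its 4 corners (x, y) given in
    order, via the shoelace formula."""
    pts = np.asarray(corners, float)
    x, y = pts[:, 0], pts[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def area_ratio_score(quad_ref, quad_warped):
    """Similarity in (0, 1]: 1 when a hypothetically matched
    quadrilateral, warped by the tentative homography, has the same
    area as its reference counterpart."""
    a, b = quad_area(quad_ref), quad_area(quad_warped)
    if max(a, b) == 0:
        return 0.0
    return min(a, b) / max(a, b)
```

Averaging such scores over all hypothetically matched quadrilaterals gives a single registration quality number that the forward selection loop can maximize.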

Contributions
Registering visible and TIR images is a challenging task in itself, and to our knowledge, registration methods specifically for façade images of these modalities have rarely been reported. This paper expounds a complete registration pipeline built around a new type of feature designed for façade images. Our test of the method yielded satisfactory registration results, and given the popularity of infrared thermography in building diagnostics, the proposed method can contribute to computer-assisted infrared building inspection tasks.

Paper IV
This paper reports a user performance study of an AR-assisted target designation task. The task requires users to mark targets on a 2D work area displayed by our AR tool. This simplified task is an abstraction of operations often performed in construction and maintenance, for example marking positions for columns and beams or planning pipe installation in a ceiling section. While most AR applications running on mobile smart devices adopt a first-person perspective (FPP), the AR tool we developed for this study works entirely from a third-person perspective (TPP). User experiments were carried out to evaluate task performance with the tool. Quantitatively, we measured both the precision and the time spent by subjects in marking the

targets. A post-session questionnaire was also presented to subjects to gather subjective opinions on the tool as well as on the task itself. The concept of TPP AR is illustrated in Figure 5.5. The real object to be augmented and the user are captured together by a remote camera.

Figure 5.5. Conceptual illustration of the TPP AR system (a remote camera and computer capture the object to be augmented together with the user holding the smart device)

The video frames are sent to the smart device held by the user and composited with virtual information. Since the user can see herself in the video, she can draw on her hand's position relative to the real object to perform interaction tasks. In our implementation, however, we added a laptop PC and offloaded the rendering and composition process from the smart device to improve system performance. The smart device also sends user inputs back to the laptop PC for control purposes. This process flow is shown in Figure 5.6.

Figure 5.6. Process flow of the TPP AR system (the real scene is captured and composited with the virtual scene on a laptop PC running Vuforia and Unity; the smartphone, running Unity, displays the augmented scene and returns user inputs)

The experiments were conducted at two places, with 12 people in the first group (G1) and 10 in the second group (G2). We set up a whiteboard as the work area at each place. Every subject performed three trials, in which they were asked to mark target positions (expressed by their 2D coordinates) on the work area. The subjects relied only on a ruler to locate the targets in the first trial but gained access to the AR tool in the other two trials. Through the AR tool, a cross-shaped target was superimposed on the work area, with an additional virtual grid rendered specifically for the third trial. We added the grid to test whether this explicit coordinate system would improve designation precision. Furthermore, the three trials used different sets of target positions.

Contributions
The original aspect of this study is the evaluation of TPP AR applied to a facility maintenance-related task. The important findings from this study can be summarized as follows: 1) it is viable to build a TPP AR solution for target designation tasks in a 2D work space using consumer-level hand-held smart devices and off-the-shelf AR software development toolkits; 2) the TPP AR approach was on average only half a centimeter less precise than manual measurement; 3) subjects could complete the target designation task significantly faster with the AR approach; 4) TPP AR tools appear intuitive and easy to use without causing user discomfort. Therefore, TPP AR shows great promise to replace manual target designation, and further efforts should be made to integrate it into similar tasks in building construction and maintenance.

Paper V
Apart from displaying virtual information related to the surface of a physical object, AR is also capable of virtually showing occluded real objects. This unique "X-ray vision" visualization metaphor makes AR an excellent tool for facility maintenance, where utilities are often concealed by walls and ceilings, for example. Similar to Paper IV, this paper investigates user performance of a facility maintenance-related task aided by an AR tool developed with common hardware and software. Hidden utilities are normally not situated exactly in the planes of walls or ceilings but are offset in depth from those planes. This offset results in a parallax effect when the utilities are viewed from a non-perpendicular position. Figure 5.7 depicts such an effect and the related error E_p.

Figure 5.7. Illustration of the parallax effect in terms of the object P behind a wall

While factors affecting user task performance stem from both the AR system and the users, the focus of

this paper has been placed upon those originating from the users. More specifically, we studied the relationship between task performance and the spatial factors that cause E_p, namely the depth d of the object behind its occluder and the horizontal offset h of the object from the viewer, both denoted in Figure 5.7. We developed a marker-based AR system that allows users to see a virtual pipe hidden behind a real wall. The virtual pipe can appear at one of 12 different locations behind the wall. As shown in Figure 5.8 (a), these pre-determined













Figure 5.8. (a) The 12 pipe locations used in the study and their relations to the user (b) The user experiment in session

locations differ from each other in both depth and horizontal offset. During the experiments, subjects were asked to designate the pipe positions on the wall. To overcome the visualization-interaction dilemma, we employed a laser pointer for remote position designation instead of the TPP AR used in Paper IV (see Figure 5.8 (b)). Each subject performed the designation task under two different conditions. The first condition involved only the virtual objects necessary for visual depth perception, comprising a pair of red ribbons on the wall plane and a ground grid. For the second condition, we rendered a perpendicular projection of the pipe on the wall plane as an enhanced visual guide in addition to the virtual objects from the first condition. For the analysis, we measured the world coordinates of the user-designated positions as well as the designation time for each position.
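The geometry behind E_p can be made concrete with a small sketch. Note that this pinhole-style similar-triangles model is our own reading of Figure 5.7, not a formula quoted from the paper: a viewer at horizontal offset h from the object and distance w from the wall sights an object at depth d behind the wall, and the line of sight meets the wall E_p away from the object's perpendicular projection.

```python
# Hedged sketch of the parallax error (our assumed geometry for
# Figure 5.7, not the paper's stated formula): the sight line from a
# viewer at offset h and wall distance w to an object at depth d crosses
# the wall E_p short of the object's perpendicular projection.

def parallax_error(h, d, w):
    """E_p = h * d / (w + d), by similar triangles."""
    return h * d / (w + d)

# Example values are made up; the error grows with both depth d and
# horizontal offset h, matching the trend reported in the study.
print(parallax_error(h=1.5, d=0.4, w=2.0))  # 0.25
```

Under this model E_p vanishes when the viewer stands perpendicular to the object (h = 0) or when the object lies exactly in the wall plane (d = 0).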

Contributions
Firstly, this paper presented a laser-based target designation method and validated its feasibility for AR-based interaction with augmented objects beyond arm's reach. Secondly, through the analysis of the experiment results, we found that people are able to mentally compensate for the parallax error given sufficient depth cues for spatial understanding (the experiments under the first condition), but this ability is very unreliable, especially as either the depth d or the horizontal offset h increases. While much research has been done to promote the importance of creating correct visual depth cues for AR

“X-ray vision” so that users can better understand the spatial relationship between occluded objects and their occluders, the inevitable parallax effect intrinsic to this visualization metaphor has been largely overlooked. Through this paper, we have established that the parallax effect is an influential factor for precision positioning tasks within FM, and we further suggest that dedicated visual guides aligned with the real occluding structures should be provided for such AR-based tasks. According to the data analysis, at the largest horizontal offset (1.5 m), adding the dedicated visual guide reduced the designation error from 71.8 mm to 3.8 mm, while at the largest depth (0.4 m), the designation error was reduced from 83.4 mm to 13.3 mm.

Paper VI
IRT (infrared thermography) is widely adopted for building diagnostics due to its non-destructive and non-contact nature and its ability to sense object temperature. However, thermal inspectors have to switch their focus frequently between objects in the real world and the thermal images in order to comprehend the heat distribution on the surfaces of those objects, which can result in loss of efficiency and even errors. In this paper, we prototyped an AR inspection tool which overlays infrared information directly on the video of the related façade and conducted a user task performance study with the tool. Before the experiments, we manufactured a heating rig fitted with 13 well spread-out heating devices. The positions of these heating devices in relation to the rig were manually measured and used as the ground truth. With this rig, we created 13 thermal targets on a chosen façade to simulate thermal anomalies. Following the system design of Paper IV, our prototype tool also adopted TPP AR as a solution to the visualization-interaction dilemma. Since it is impractical to cover a surface as large as a façade with fiducial markers for tracking and registration, we employed our previous work on infrared/visible façade image registration (Paper III) to align the infrared information with the façade video. The user task involves designating these infrared targets on the façade visualized by our AR tool. More specifically, each subject aligned markers (customized total station reflectors) with the infrared targets one at a time, guided by the TPP video (see the top image of Figure 5.9). There were 23 volunteers participating in the experiments, and over half of them had a professional background related to the built environment. We recorded the designation time for each target, and after a subject had designated all the infrared targets, we measured the marker positions using a total station for analysis.
While the comparison between the measured marker positions and the ground truth gives us the user performance in terms of designation accuracy, both the system and the human factors contributing to this performance remain unclear. To analyze these factors further, we conducted four benchmark tests on human perceptive and motoric capabilities specific to the task, as well as on the image registration error regarding the façade in question.

Figure 5.9. Experimental environment from the study in Paper VI (the AR system approx. 8 m from the working area, with a total station for reference measurements)

Contribution
The novelty of this study lies in the design of the AR tool, which employs TPP and utilizes image registration techniques to align infrared information with visible façade videos. Moreover, we evaluated it with a group of users in a real working environment. The analysis of the experiment results shows that the tool is applicable to inspections of building elements larger than one decimeter, given that the average positioning accuracy is around 7.6 cm. Further analysis based on the four benchmark tests reveals that the major error source in fact originates from the image registration method employed in this work. Excluding its influence, the human perceptive, cognitive and motoric factors result in an error of only 2.2 cm on average, which indicates the suitability of TPP AR for similar facility maintenance tasks.
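The error decomposition above can be illustrated numerically if one assumes (our assumption for illustration, not necessarily the paper's analysis) that independent error sources combine in quadrature, i.e. total² = registration² + human².

```python
# Illustration only: assuming independent error sources add in
# quadrature, estimate the component of the 7.6 cm overall error that
# is NOT explained by the 2.2 cm human perceptive/cognitive/motoric
# error, i.e. the share attributable to image registration.
import math

def residual_error(total, known):
    """Error remaining after removing one independent source."""
    return math.sqrt(total**2 - known**2)

print(round(residual_error(7.6, 2.2), 2))  # 7.27
```

Under this assumption the registration component (about 7.3 cm) clearly dominates the human component, consistent with the conclusion drawn from the benchmark tests.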


6. Conclusions

O&M (operations and maintenance) ensure the intended functions of a facility and therefore must be performed during the whole use stage of the facility's life cycle, which spans a long period of time. Such long-lasting practices naturally entail large expenses. Hence, facility managers are constantly seeking means to improve the efficiency of O&M and thus bring down the overall costs. As in many other economic sectors, computers and IT have been widely adopted in FM services to automate traditional maintenance methods and processes, making O&M faster and more reliable. This thesis investigates the potential of AR, a new UI technology which superimposes computer-generated information directly on the user's vision, for further boosting the performance of facility maintenance field tasks, given its unique way of displaying information.

6.1 Facility maintenance tasks
The tasks employed in the user experiments of the various studies in this thesis have a common archetype, namely target designation aided by AR tools. This task archetype is crucial to maintenance operations in real scenarios. For example, we need to designate target utilities for subsequent maintenance, or to avoid damaging them during a maintenance operation; after examining a piece of equipment in a routine preventive inspection, we may want to mark it to signify its status. Additionally, this task may also assist decision making during a maintenance planning phase. For instance, a facility manager can assess the fitness of an assortment of valve samples with respect to currently operational pipework behind a wall by aligning the samples with the pipes in an AR view. Certainly, there are many other facility maintenance tasks that could benefit from AR but are not examined in this thesis. For example, facility managers and owners can discuss a maintenance project through AR-based collaborative planning tools; AR can be used to navigate field workers to the operation site; and instructions and/or schematics of target utilities can be presented using AR to guide the actual maintenance.

6.2 Hand-held AR tools
Personal mobile devices today have processing power comparable to that of a modern desktop personal computer. Together with their miniaturized

sizes and affordable prices, they have become some of the most popular artifacts on the planet. In view of these factors, we chose mobile smart devices such as smartphones and tablet computers as the hardware platforms of our AR tools, and we envision that every maintenance field worker could simply take her mobile smart device out of her pocket and start the intended maintenance operation aided by AR. However, there are some obstacles to realizing that vision. One of them is reliable tracking and registration. Since AR overlays virtual information on the real world, a certain degree of spatial relation between the virtual information and the real objects of interest needs to be established. Sometimes this spatial relation must be an accurate alignment between the virtual and the real objects; the target designation task performed for facility maintenance is one such case. Tracking and registration are the key technologies by which AR obtains these spatial relations, and in this thesis we investigated vision-based tracking based on both fiducial markers and natural features. Marker-based tracking is well established and is capable of achieving registration with high accuracy. The success in carrying out the user experiments of Papers IV and V corroborates the validity of employing fiducial markers in a controlled environment and when the object to be augmented is not too large. Therefore, indoor facility maintenance tasks can rely on marker-based tracking for high registration accuracy. We also studied an AR-assisted target designation task in an outdoor environment, where marker-based tracking is less desirable.
Firstly, large outdoor objects such as façades may require many markers to cover them, which are laborious to prepare; secondly, harsh weather conditions can damage the markers if the maintenance task takes time to finish; lastly, in order to acquire more visual context of a large object, AR users tend to move away from it, and the increased viewing distance may cause visibility problems for the markers. Therefore, natural feature-based tracking is a more viable choice for outdoor maintenance tasks. However, it is trickier to design natural features due to the complex visual appearance of natural and man-made objects, in addition to other variations introduced by the imaging process. For Paper VI, since the virtual information consists of infrared images, we aligned them with façade video frames using the image registration method developed in Paper III. While the success of the user experiments does support the preceding statement that natural feature-based tracking is a more viable choice for outdoor maintenance tasks, the aforementioned challenges also manifest themselves as limitations of our registration method. For example, the complex characteristics of natural features require more computational power and more careful implementation, which is why our algorithm cannot run in real time in its current state. Another limitation is that the registration method does not work with non-planar façades or façades with very few horizontal and vertical lines. To sum up, applying AR to outdoor maintenance tasks requires fast and reliable natural features for tracking and registration.

Facility maintenance usually involves large physical objects, e.g. walls and façades, and consequently the related virtual information can cover a wide area as well. To capture more augmented content in the view, users of hand-held AR have to move away from the augmented objects due to the small field of view of built-in cameras. In fact, being able to see more content is almost a necessity for both the AR system and the designation task: on the one hand, the AR system has more features (artificial or natural) for better tracking results; on the other, users obtain more context, which is clearly helpful for locating targets. Unfortunately, moving away from the augmented object means maintenance workers are unable to physically interact with it, namely to designate targets on it. To address this visualization-interaction dilemma, we proposed two solutions in this thesis: remote laser pointing (Paper V) and TPP (Papers IV and VI). The dedicated precision test of the laser pointer-based designation method shows that this approach is viable for target designation tasks within the context of facility maintenance. The drawback of this approach is that the laser dot becomes less visible in camera videos when the ambient lighting is strong, e.g. in an outdoor environment. In light of this weakness, TPP AR was brought forth as an alternative, and the study in Paper VI confirms the applicability of TPP AR in target designation tasks related to facility maintenance.
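The trade-off between field of view and physical reach can be quantified with a back-of-the-envelope calculation. The numbers below are illustrative, not from the thesis: they show the distance at which a hand-held camera with a given horizontal field of view must stand to fit an object of a given width in frame.

```python
# Back-of-the-envelope sketch of the visualization-interaction dilemma:
# the stand-back distance needed to fit an object of width `width_m` in
# a camera with horizontal field of view `fov_deg`. Example values are
# illustrative assumptions, not measurements from the thesis.
import math

def required_distance(width_m, fov_deg):
    """Distance at which the full object width fits in the frame."""
    return (width_m / 2) / math.tan(math.radians(fov_deg) / 2)

# A 10 m façade with a typical ~60° phone-camera FOV already forces the
# user several arm's lengths away from the surface to be marked.
print(round(required_distance(10.0, 60.0), 2))  # 8.66
```

Since an arm's reach is well under one metre, any viewpoint that captures the whole façade necessarily precludes direct physical interaction, which is precisely the dilemma the laser pointer and TPP solutions address.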

6.3 User performance
As more AR applications are conceived and developed, the number of potential end users is growing steadily. Evaluating AR systems with actual users has thus become an important step in bringing AR into people's everyday lives. In view of this, we evaluated all the AR tools in this thesis through user experiments. User task performance was measured by target designation precision and/or accuracy as well as task completion time. Paper IV shows that with the assistance of an AR tool, the subjects could finish the required target designation task three times as fast as with the manual measuring approach, and the precision of the AR tool was around 1 cm on average in the worst case (only 0.5 cm less precise than the manual approach). In Paper V, if we disregard the influence of the parallax effect for the time being, namely with the aid of the enhanced visual guide, the worst precision was 1.38 cm at depth = 0.2 m. Lastly, the outdoor study with an actual façade yielded a 7.6 cm error (both in accuracy and precision). Although all our targets are fictitious (without real-world counterparts), the implications of the errors listed above for real applications can be inferred by relating them to the error reported in [95]. There, the authors studied a similar AR application and evaluated it with experts in real use cases. They reported an overall error, in terms of re-projection accuracy, of around 5 cm, which indicates that our tools are certainly relevant for real-world applications.

We also investigated the human aspects that affect user performance, and in our studies we identified two design decisions that could potentially be limited by human perceptive, cognitive and motoric capabilities, namely remote laser pointing and TPP. As discussed above, the former is viable for designation tasks. While the parallax effect does worsen the designation precision as the depth of the virtual object or the viewing angle increases, adding an enhanced visual guide can largely negate the effect. Regarding the cognitively demanding TPP AR, the results of Papers IV and VI indicate that it contributes little to the overall designation errors, as also discussed above.



Acknowledgements

I would like to express my biggest gratitude to my main supervisor Stefan Seipel for his trust in me, his guidance and his patience over these years. I also appreciate the support from my co-supervisors Julia Åhlén and Ewert Bengtsson. I am grateful to the University of Gävle and Uppsala University for making this PhD opportunity possible for me. My thanks go to all the colleagues in both places; thank you for all the help you gave me as well as for those interesting conversations. I would especially like to thank Torsten "Totte" Jonsson at the University of Gävle for his tremendous contributions to my later research. Last but certainly not least, I want to thank my parents for letting me fly away when most of the parents I know only want to bind their children to themselves. To those family members who have been looking after them in my absence, I am deeply in your debt.


Summary in Swedish

Buildings and public infrastructure are crucial to functioning societies in that they provide the basic functionality for people's housing, work, transportation and the services necessary for our daily lives. Effective management of the operation and maintenance of buildings ensures their continuous function and sustainability. Operation and maintenance activities go on during the largest part of a building's life cycle and therefore incur very large costs. In the building sector, the development of modern information technology has been driven within the framework of BIM (building information modelling), among other things to automate traditional maintenance methods and processes, making building maintenance more efficient and more reliable. One of the greatest challenges in the use of modern information technology (IT) in the building sector is that people's work must be carried out outside the office environments to which IT systems are traditionally adapted. Work on buildings requires mobility and access to information exactly where the work is taking place. These working environments place high demands on the flexibility of the information systems in terms of available space, the way of interacting with the computer, or the lighting conditions, to name a few. Augmented reality (AR) offers a new approach to human-computer interaction by directly displaying information related to the real objects that people currently perceive and want to interact with. Augmented reality benefits from recent years' developments in information technology, characterized by powerful miniaturized computer systems and lightweight displays with very high resolution and brightness. AR enhances people's sensory impressions with information of interest in a natural way, i.e. by overlaying an image of reality with the desired information on site and often in real time.

Since the user of AR does not need to deliberately turn her attention to a computer, this technology carries great potential for further improving work processes in, among other areas, building maintenance. The rapid development of the technology has recently led to a large number of AR applications that have attracted considerable media attention. AR has become very popular above all in games and entertainment, which usually do not place particularly high demands on positioning accuracy. In building maintenance, however, it is in many situations important that the virtual information is displayed spatially correctly on top of the real objects to which it belongs, and also that the user is able to associate position-bound virtual information correctly with the corresponding positions in reality. The purpose of this thesis has been to investigate various (technical and human) factors in the use of AR and to analyze how these affect the accuracy of transferring spatial information to reality in work situations relevant to building maintenance. The methods developed and used in the various substudies of the thesis comprise new algorithms for image-based registration of building façades, as well as experiments (both indoors and outdoors) to determine the margins of error in positioning when using AR. The thesis contributes new so-called features (visually salient properties in the image) that describe distinctive traits of different façades. These form the basis for geometrically correct overlay of images of a façade with other augmenting information related to the same building. In the case studies of the thesis, thermographic images were used to add information that is not naturally visible. Through experimental user studies in controlled lab environments, it was investigated whether human perceptual and cognitive abilities limit the possibility of positionally correct designation of structures of interest. These experiments also examined different ways of viewing the world through the AR interface (visual perspectives) as well as different techniques for marking (interacting with) physical objects. In an outdoor experiment, the knowledge gained from the previous studies was integrated into an application prototype for AR-based positioning of thermographic defects in building façades, and its usability was evaluated in a realistic usage scenario. The substudies show that images of building façades can be overlaid with thermographic images using the registration techniques developed in the thesis, which are based on natural properties of the image. The resulting overlay achieves a precision that makes the technique useful for positioning tasks with moderate accuracy requirements in a real environment. This is an important contribution, as this form of registration does not require the artificial markers common in many current AR applications.

By using a third-person perspective instead of the traditional first-person perspective in hand-held AR, it is possible to overcome the interaction dilemma that arises when the user wants to physically interact with the façade at arm's length while the entire façade context must simultaneously be visible from a greater distance. The cognitive load required to interpret the image perspective in the user's egocentric context caused no substantial errors in the determination of positionally correct points.



References

[1] Nur Intan Adhani and Rambli Dayang Rohaya Awang. A survey of mobile augmented reality applications. In 1st International Conference on Future Trends in Computing and Communication Technologies, pages 89–96, 2012.
[2] Junho Ahn and Richard Han. An indoor augmented-reality evacuation system for the smartphone using personalized pedometry. Human-Centric Computing and Information Sciences, 2(18):1–23, 2012.
[3] A. Akcamete, X. Liu, B. Akinci, and J. H. Garrett. Integrating and visualizing maintenance and repair work orders in BIM: lessons learned from a prototype. In Proceedings of the 11th International Conference on Construction Applications of Virtual Reality, pages 639–649, 2011.
[4] Giampietro Allasia, Roberto Cavoretto, and Alessandra De Rossi. Local interpolation schemes for landmark-based image registration: a comparison. Mathematics and Computers in Simulation, 106:1–25, 2014.
[5] Vicente Arevalo and Javier Gonzalez. Improving piecewise linear registration of high-resolution satellite images through mesh optimization. IEEE Transactions on Geoscience and Remote Sensing, 46(11):3792–3803, 2008.
[6] Ronald Azuma, Yohan Baillot, Reinhold Behringer, Steven Feiner, Simon Julier, and Blair MacIntyre. Recent advances in augmented reality. IEEE Computer Graphics and Applications, 21(6):34–47, 2001.
[7] Ronald Azuma and Gary Bishop. Improving static and dynamic registration in an optical see-through HMD. In Proceedings of the 21st Annual Conference on Computer Graphics and Interactive Techniques, pages 197–204. ACM, 1994.
[8] Ronald Azuma, Jong Weon Lee, Bolan Jiang, Jun Park, Suya You, and Ulrich Neumann. Tracking in unprepared environments for augmented reality systems. Computers & Graphics, 23(6):787–793, 1999.
[9] Ronald T. Azuma. A survey of augmented reality. Presence: Teleoperators and Virtual Environments, 6(4):355–385, 1997.
[10] Michael Bajura and Ulrich Neumann. Dynamic registration correction in video-based augmented reality systems. IEEE Computer Graphics and Applications, 15(5):52–60, 1995.
[11] C. A. Balaras and A. A. Argiriou. Infrared thermography for building diagnostics. Energy and Buildings, 34(2):171–183, 2002.
[12] Ryan Bane and Tobias Höllerer. Interactive tools for virtual x-ray vision in mobile augmented reality. In Mixed and Augmented Reality, 2004. ISMAR 2004. Third IEEE and ACM International Symposium on, pages 231–239. IEEE, 2004.
[13] Domagoj Baričević, Tobias Höllerer, Pradeep Sen, and Matthew Turk. User-perspective augmented reality magic lens from gradients. In Proceedings of the 20th ACM Symposium on Virtual Reality Software and Technology, pages 87–96. ACM, 2014.


[14] Amir H. Behzadan, Suyang Dong, and Vineet R. Kamat. Augmented reality visualization: a review of civil infrastructure system applications. Advanced Engineering Informatics, 29(2):252–267, 2015.
[15] Hrvoje Benko, Ricardo Jota, and Andrew Wilson. MirageTable: freehand interaction on a projected augmented reality tabletop. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 199–208. ACM, 2012.
[16] Mark Billinghurst, Hirokazu Kato, and Seiko Myojin. Advanced interaction techniques for augmented reality applications. In International Conference on Virtual and Mixed Reality, pages 13–22. Springer, 2009.
[17] Mark Billinghurst, Thammathip Piumsomboon, and Bai Huidong. Hands in space: gesture interaction with augmented-reality interfaces. IEEE Computer Graphics and Applications, 34(1):77–81, 2014.
[18] Oliver Bimber, L. Miguel Encarnação, and Dieter Schmalstieg. Augmented reality with back-projection systems using transflective surfaces. In Computer Graphics Forum, volume 19, pages 161–168. Wiley Online Library, 2000.
[19] Oliver Bimber, Bernd Fröhlich, Dieter Schmalstieg, and L. Miguel Encarnação. The virtual showcase. IEEE Computer Graphics and Applications, 21(6):48–55, 2001.
[20] Oliver Bimber, Stephen M. Gatesy, Lawrence M. Witmer, Ramesh Raskar, and L. Miguel Encarnação. Merging fossil specimens with computer-generated information. Computer, 35(9):25–30, 2002.
[21] Oliver Bimber and Ramesh Raskar. Spatial Augmented Reality: Merging Real and Virtual Worlds. CRC Press, 2005.
[22] BIPM, IEC, IFCC, ILAC, ISO, IUPAC, IUPAP, and OIML. International vocabulary of metrology – basic and general concepts and associated terms, 2008. JCGM 200, 2008.
[23] Doug A. Bowman, Ernst Kruijff, Joseph J. LaViola Jr, and Ivan Poupyrev. 3D User Interfaces: Theory and Practice. Addison-Wesley, 2004.
[24] Ignacio Zabalza Bribián, Alfonso Aranda Usón, and Sabina Scarpellini.
Life cycle assessment in buildings: state-of-the-art and simplified LCA methodology as a complement for building certification. Building and Environment, 44(12):2510–2520, 2009.
[25] Wilhelm Burger and Mark J. Burge. Digital Image Processing: An Algorithmic Introduction Using Java. Springer, 2nd edition, 2016.
[26] Thomas P. Caudell and David W. Mizell. Augmented reality: an application of heads-up display technology to manual manufacturing processes. In System Sciences, 1992. Proceedings of the Twenty-Fifth Hawaii International Conference on, volume 2, pages 659–669. IEEE, 1992.
[27] Hung-Lin Chi, Shih-Chung Kang, and Xiangyu Wang. Research trends and opportunities of augmented reality applications in architecture, engineering, and construction. Automation in Construction, 33:116–122, 2013.
[28] Andrew I. Comport, Éric Marchand, and François Chaumette. A real-time tracker for markerless augmented reality. In Proceedings of the 2nd IEEE/ACM International Symposium on Mixed and Augmented Reality, page 36. IEEE Computer Society, 2003.
[29] Kristin J. Dana and P. Anandan. Registration of visible and infrared images. In Optical Engineering and Photonics in Aerospace Sensing, pages 2–13. International Society for Optics and Photonics, 1993.
[30] Pasquale Daponte, Luca De Vito, Francesco Picariello, and Maria Riccio. State of the art and future developments of the augmented reality for measurement applications. Measurement, 57:53–70, 2014.
[31] David Drascic and Paul Milgram. Perceptual issues in augmented reality. In Electronic Imaging: Science & Technology, pages 123–134. International Society for Optics and Photonics, 1996.
[32] Bernard Drion, Frans Melissen, and Roy Wood. Facilities management: lost, or regained? Facilities, 30(5/6):254–261, 2012.
[33] Andreas Dünser and Mark Billinghurst. Evaluating augmented reality systems. In Handbook of Augmented Reality, pages 289–307. Springer, 2011.
[34] Andreas Dünser, Mark Billinghurst, James Wen, Ville Lehtinen, and Antti Nurminen. Exploring the use of handheld AR for outdoor navigation. Computers & Graphics, 36(8):1084–1095, 2012.
[35] Andreas Dünser, Raphaël Grasset, and Mark Billinghurst. A survey of evaluation techniques used in augmented reality studies. Human Interface Technology Laboratory New Zealand, 2008.
[36] Chuck Eastman, Charles M. Eastman, Paul Teicholz, Rafael Sacks, and Kathleen Liston. BIM Handbook: A Guide to Building Information Modeling for Owners, Managers, Designers, Engineers and Contractors. John Wiley & Sons, 2011.
[37] Steven Feiner, Blair MacIntyre, Tobias Höllerer, and Anthony Webster. A touring machine: prototyping 3D mobile augmented reality systems for exploring the urban environment. Personal Technologies, 1(4):208–217, 1997.
[38] Steven Feiner, Blair MacIntyre, and Dorée Seligmann. Knowledge-based augmented reality. Communications of the ACM, 36(7):53–62, 1993.
[39] Steven K. Feiner. Augmented reality: a new way of seeing. Scientific American, pages 48–55, 2002.
[40] Rafael C. Gonzalez and Richard E. Woods. Digital Image Processing. Pearson Education, Inc., 3rd edition, 2008.
[41] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision.
Cambridge university press, 2003. Anne-Cecilie Haugstvedt and John Krogstie. Mobile augmented reality for cultural heritage: A technology acceptance study. In Mixed and Augmented Reality (ISMAR), 2012 IEEE International Symposium on, pages 247–255. IEEE, 2012. Steven J Henderson and Steven Feiner. Evaluating the benefits of augmented reality for task localization in maintenance of an armored personnel carrier turret. In Mixed and Augmented Reality, 2009. ISMAR 2009. 8th IEEE International Symposium on, pages 135–144. IEEE, 2009. Tobias Höllerer and Steve Feiner. Mobile augmented reality. Telegeoinformatics: Location-Based Computing and Services. Taylor and Francis Books Ltd., London, UK, 21, 2004. Lei Hou, Xiangyu Wang, and Martijn Truijens. Using augmented reality to facilitate piping assembly: an experiment-based evaluation. Journal of Computing in Civil Engineering, 29(1):05014007, 2013.


[46] Ming-Kuei Hu. Visual pattern recognition by moment invariants. IRE Transactions on Information Theory, 8(2):179–187, 1962.
[47] John F. Hughes, Andries van Dam, James D. Foley, and Steven K. Feiner. Computer Graphics: Principles and Practice. Pearson Education, 2014.
[48] Sylvia Irawati, Scott Green, Mark Billinghurst, Andreas Dünser, and Heedong Ko. An evaluation of an augmented reality multimodal interface using speech and paddle gestures. In Advances in Artificial Reality and Tele-Existence, pages 272–283. Springer, 2006.
[49] Javier Irizarry, Masoud Gheisari, Graceline Williams, and Bruce N. Walker. InfoSPOT: A mobile augmented reality method for accessing building information through a situation awareness approach. Automation in Construction, 33:11–23, 2013.
[50] Hiroshi Ishii and Brygg Ullmer. Tangible bits: towards seamless interfaces between people, bits and atoms. In Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems, pages 234–241. ACM, 1997.
[51] Seokhee Jeon, Hyeongseop Shim, and Gerard Jounghyun Kim. Viewpoint usability for desktop augmented reality. International Journal of Virtual Reality, 5(3):33–39, 2006.
[52] Hirokazu Kato and Mark Billinghurst. Marker tracking and HMD calibration for a video-based augmented reality conferencing system. In Proceedings of the 2nd IEEE and ACM International Workshop on Augmented Reality (IWAR '99), pages 85–94. IEEE, 1999.
[53] Hirokazu Kato, Mark Billinghurst, Ivan Poupyrev, Kenji Imamoto, and Keihachiro Tachibana. Virtual object manipulation on a table-top AR environment. In Proceedings of the IEEE and ACM International Symposium on Augmented Reality (ISAR 2000), pages 111–119. IEEE, 2000.
[54] Georg Klein and David Murray. Parallel tracking and mapping on a camera phone. In 2009 8th IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pages 83–86. IEEE, 2009.
[55] Christian Koch, Matthias Neges, Markus König, and Michael Abramovici. Natural markers for augmented reality-based indoor navigation and facility maintenance. Automation in Construction, 48:18–30, 2014.
[56] Christian Koch, Matthias Neges, Markus König, and Michael Abramovici. Natural markers for augmented reality-based indoor navigation and facility maintenance. Automation in Construction, 48:18–30, 2014.
[57] Stan Kurkovsky, Ranjana Koshy, Vivian Novak, and Peter Szul. Current issues in handheld augmented reality. In 2012 International Conference on Communications and Information Technology (ICCIT), pages 68–72. IEEE, 2012.
[58] Angeliki Kylili, Paris A. Fokaides, Petros Christou, and Soteris A. Kalogirou. Infrared thermography (IRT) applications for building diagnostics: A review. Applied Energy, 134:531–549, 2014.
[59] Jae-Young Lee, Hyung-Min Park, Seok-Han Lee, Soon-Ho Shin, Tae-Eun Kim, and Jong-Soo Choi. Design and implementation of an augmented reality system using gaze interaction. Multimedia Tools and Applications, 68(2):265–280, 2014.
[60] Sanghoon Lee and Ömer Akin. Augmented reality-based computational fieldwork support for equipment operations and maintenance. Automation in Construction, 20(4):338–352, 2011.
[61] Thomas Martin Lehmann, Claudia Gönner, and Klaus Spitzer. Survey: Interpolation methods in medical image processing. IEEE Transactions on Medical Imaging, 18(11):1049–1075, 1999.
[62] John R. Lewis. In the eye of the beholder. IEEE Spectrum, 41(5):24–28, 2004.
[63] Hui Li, B. S. Manjunath, and Sanjit K. Mitra. A contour-based approach to multisensor image registration. IEEE Transactions on Image Processing, 4(3):320–334, 1995.
[64] Hongen Liao, Takashi Inomata, Ichiro Sakuma, and Takeyoshi Dohi. 3-D augmented reality for MRI-guided surgery using integral videography autostereoscopic image overlay. IEEE Transactions on Biomedical Engineering, 57(6):1476–1486, 2010.
[65] Mark A. Livingston, Arindam Dey, Christian Sandor, and Bruce H. Thomas. Pursuit of "X-ray vision" for augmented reality. In Human Factors in Augmented Reality Environments, pages 67–107. Springer, New York, NY, 2013.
[66] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[67] Yujie Lu, Yongkui Li, Miroslaw Skibniewski, Zhilei Wu, Runshi Wang, and Yun Le. Information and communication technology applications in architecture, engineering, and construction organizations: A 15-year review. Journal of Management in Engineering, 31(1):A4014010-1–A4014010-19, 2014.
[68] Frank Luna. Introduction to 3D Game Programming with DirectX 11. Mercury Learning and Information, 2012.
[69] E. Marchand, H. Uchiyama, and F. Spindler. Pose estimation for augmented reality: a hands-on survey. IEEE Transactions on Visualization and Computer Graphics, 2016.
[70] Michael R. Marner, Ross T. Smith, James A. Walsh, and Bruce H. Thomas. Spatial user interfaces for large-scale projector-based augmented reality. IEEE Computer Graphics and Applications, 34(6):74–82, 2014.
[71] Rainer Mautz. Indoor Positioning Technologies. 2012.
[72] Paul Milgram and Herman Colquhoun. A taxonomy of real and virtual world display integration. Mixed Reality: Merging Real and Virtual Worlds, 1:1–26, 1999.
[73] Tsutomu Miyashita, Peter Meier, Tomoya Tachikawa, Stephanie Orlic, Tobias Eble, Volker Scholz, Andreas Gapel, Oliver Gerl, Stanimir Arnaudov, and Sebastian Lieberknecht. An augmented reality museum guide. In Proceedings of the 7th IEEE/ACM International Symposium on Mixed and Augmented Reality, pages 103–106. IEEE Computer Society, 2008.
[74] Alessandro Mulloni, Andreas Dünser, and Dieter Schmalstieg. Zooming interfaces for augmented reality browsers. In Proceedings of the 12th International Conference on Human-Computer Interaction with Mobile Devices and Services, pages 161–170. ACM, 2010.
[75] Ulrich Neumann and Anthony Majoros. Cognitive, performance, and systems issues for augmented reality applications in manufacturing and maintenance. In Proceedings of the IEEE 1998 Virtual Reality Annual International Symposium, pages 4–11. IEEE, 1998.
[76] Joseph Newman, David Ingram, and Andy Hopper. Augmented reality in a wide area sentient environment. In Proceedings of the IEEE and ACM International Symposium on Augmented Reality, pages 77–86. IEEE, 2001.
[77] Manuel Olbrich, Holger Graf, Svenja Kahn, Timo Engelke, Jens Keil, Patrick Riess, Sabine Webel, Ulrich Bockholt, and Guillaume Picinbono. Augmented reality supporting user-centric building information management. The Visual Computer, 29(10):1093–1105, 2013.
[78] Alex Olwal, Jonny Gustafsson, and Christoffer Lindfors. Spatial augmented reality on industrial CNC machines. In Electronic Imaging, volume 6804, page 680409. International Society for Optics and Photonics, 2008.
[79] Alex Olwal and Tobias Höllerer. POLAR: portable, optical see-through, low-cost augmented reality. In Proceedings of the ACM Symposium on Virtual Reality Software and Technology, pages 227–230. ACM, 2005.
[80] Oscar Ortiz, Francesc Castells, and Guido Sonnemann. Sustainability in the construction industry: A review of recent developments based on LCA. Construction and Building Materials, 23(1):28–39, 2009.
[81] Antoine Petit, Eric Marchand, and Keyvan Kanani. Augmenting markerless complex 3D objects by combining geometrical and color edge information. In 2013 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pages 287–288. IEEE, 2013.
[82] Thomas Pintaric and Hannes Kaufmann. Affordable infrared-optical pose-tracking for virtual and augmented reality. In Proceedings of the Trends and Issues in Tracking for Virtual Environments Workshop, IEEE VR, pages 44–51, 2007.
[83] Josien P. W. Pluim, J. B. Antoine Maintz, and Max A. Viergever. Mutual-information-based registration of medical images: a survey. IEEE Transactions on Medical Imaging, 22(8):986–1004, 2003.
[84] Ivan Poupyrev, Desney Tan, Mark Billinghurst, Hirokazu Kato, Holger Regenbrecht, and Nobuji Tetsutani. Tiles: A mixed reality authoring interface. In INTERACT 2001 Conference on Human-Computer Interaction, pages 334–341, 2001.
[85] Klen Copic Pucihar, Paul Coulton, and Jason Alexander. Evaluating dual-view perceptual issues in handheld augmented reality: device vs. user perspective rendering. In ICMI, pages 381–388, 2013.
[86] Sara Rankohi and Lloyd Waugh. Review and analysis of augmented reality literature for construction industry. Visualization in Engineering, 1(1):1–18, 2013.
[87] Ramesh Raskar, Matt Cutts, Greg Welch, and Wolfgang Stuerzlinger. Efficient image generation for multiprojector and multisurface displays. In Rendering Techniques '98, pages 139–144. Springer, 1998.
[88] Ramesh Raskar, Jeroen van Baar, Paul Beardsley, Thomas Willwacher, Srinivas Rao, and Clifton Forlines. iLamps: geometrically aware and self-configuring projectors. In ACM SIGGRAPH 2006 Courses, pages 7–16. ACM, 2006.
[89] Rabee M. Reffat, J. Gero, and Wei Peng. Using data mining on building maintenance during the building life cycle. In Proceedings of the 38th Australian and New Zealand Architectural Science Association (ANZASCA) Conference, pages 91–97, 2004.
[90] Gerhard Reitmayr and Tom Drummond. Going out: robust model-based tracking for outdoor augmented reality. In Proceedings of the 5th IEEE and ACM International Symposium on Mixed and Augmented Reality, pages 109–118. IEEE Computer Society, 2006.
[91] Jun Rekimoto. Augmented reality using the 2D matrix code. In Interactive Systems and Software IV, pages 199–208. Kindaikagaku-sha, 1996.
[92] Jun Rekimoto and Katashi Nagao. The world through the computer: Computer augmented interaction with real world environments. In Proceedings of the 8th Annual ACM Symposium on User Interface and Software Technology, pages 29–36. ACM, 1995.
[93] Jannick P. Rolland, Larry Davis, and Yohan Baillot. A survey of tracking technology for virtual environments. Fundamentals of Wearable Computers and Augmented Reality, 1(1):67–112, 2001.
[94] Andrea Sanna and Federico Manuri. A survey on applications of augmented reality. Advances in Computer Science: An International Journal, 5(1):18–27, 2016.
[95] Gerhard Schall, Stefanie Zollmann, and Gerhard Reitmayr. Smart Vidente: advances in mobile augmented reality for interactive visualization of underground infrastructure. Personal and Ubiquitous Computing, 17(7):1533–1549, 2013.
[96] William R. Sherman and Alan B. Craig. Understanding Virtual Reality: Interface, Application, and Design. Elsevier, 2002.
[97] Do Hyoung Shin and Phillip S. Dunston. Identification of application areas for augmented reality in industrial construction based on technology suitability. Automation in Construction, 17(7):882–894, 2008.
[98] Do Hyoung Shin and Phillip S. Dunston. Evaluation of augmented reality in steel column inspection. Automation in Construction, 18(2):118–129, 2009.
[99] Dave Shreiner, Graham Sellers, John M. Kessenich, and Bill Licea-Kane. OpenGL Programming Guide: The Official Guide to Learning OpenGL, Version 4.3. Addison-Wesley, 2013.
[100] Mengu Sukan, Steven Feiner, Barbara Tversky, and Semih Energin. Quick viewpoint switching for manipulating virtual objects in hand-held augmented reality using stored snapshots. In 2012 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pages 217–226. IEEE, 2012.
[101] Ivan E. Sutherland. A head-mounted three dimensional display. In Proceedings of the December 9–11, 1968, Fall Joint Computer Conference, Part I, pages 757–764. ACM, 1968.
[102] J. Edward Swan and Joseph L. Gabbard. Survey of user-based experimentation in augmented reality. In Proceedings of the 1st International Conference on Virtual Reality, pages 1–9, 2005.
[103] Sanat A. Talmaki, Suyang Dong, and Vineet R. Kamat. Geospatial databases and augmented reality visualization for improving safety in urban excavation operations. In Construction Research Congress 2010, pages 91–101, 2010.


[104] Markus Tatzgern, Raphael Grasset, Eduardo Veas, Denis Kalkofen, Hartmut Seichter, and Dieter Schmalstieg. Exploring real world points of interest: Design and evaluation of object-centric exploration techniques for augmented reality. Pervasive and Mobile Computing, 18:55–70, 2015.
[105] Marcus Tönnis, David A. Plecher, and Gudrun Klinker. Representing information – classifying the augmented reality presentation space. Computers & Graphics, 37(8):997–1011, 2013.
[106] Mihran Tuceryan and Nassir Navab. Single point active alignment method (SPAAM) for optical see-through HMD calibration for AR. In Proceedings of the IEEE and ACM International Symposium on Augmented Reality (ISAR 2000), pages 149–158. IEEE, 2000.
[107] Hakan Urey, Kishore V. Chellappan, Erdem Erden, and Phil Surman. State of the art in stereoscopic and autostereoscopic displays. Proceedings of the IEEE, 99(4):540–555, 2011.
[108] Eduardo Veas, Alessandro Mulloni, Ernst Kruijff, Holger Regenbrecht, and Dieter Schmalstieg. Techniques for view transition in multi-camera outdoor environments. In Proceedings of Graphics Interface 2010, pages 193–200. Canadian Information Processing Society, 2010.
[109] Vassilios Vlahakis, John Karigiannis, Manolis Tsotros, Michael Gounaris, Luis Almeida, Didier Stricker, Tim Gleue, Ioannis T. Christou, Renzo Carlucci, and Nikos Ioannidis. ARCHEOGUIDE: first results of an augmented reality, mobile computing system in cultural heritage sites. In Virtual Reality, Archeology, and Cultural Heritage, pages 131–140, 2001.
[110] Daniel Wagner, Gerhard Reitmayr, Alessandro Mulloni, Tom Drummond, and Dieter Schmalstieg. Pose tracking from natural features on mobile phones. In Proceedings of the 7th IEEE/ACM International Symposium on Mixed and Augmented Reality, pages 125–134. IEEE Computer Society, 2008.
[111] Daniel Wagner and Dieter Schmalstieg. First steps towards handheld augmented reality. In Proceedings of the 7th IEEE International Symposium on Wearable Computers, pages 127–135. IEEE Computer Society, 2003.
[112] Xiangyu Wang, Mi Jeong Kim, Peter E. D. Love, and Shih-Chung Kang. Augmented reality in built environment: Classification and implications for future research. Automation in Construction, 32:1–13, 2013.
[113] Colin Ware. Information Visualization: Perception for Design. Elsevier, 2012.
[114] Andrew Wilson, Hrvoje Benko, Shahram Izadi, and Otmar Hilliges. Steerable augmented reality with the Beamatron. In Proceedings of the 25th Annual ACM Symposium on User Interface Software and Technology, pages 413–422. ACM, 2012.
[115] Zhou Wu and Ardeshir Goshtasby. Adaptive image registration via hierarchical Voronoi subdivision. IEEE Transactions on Image Processing, 21(5):2464–2473, 2012.
[116] Harald Wuest, Didier Stricker, and Jens Herder. Tracking of industrial objects by using CAD models. Journal of Virtual Reality and Broadcasting, 4(1):1–9, 2007.
[117] M. L. Yuan, S. K. Ong, and A. Y. C. Nee. Augmented reality for assembly guidance using a virtual interactive tool. International Journal of Production Research, 46(7):1745–1767, 2008.


[118] Lyubomir Zagorchev and Ardeshir Goshtasby. A comparative study of transformation functions for nonrigid image registration. IEEE Transactions on Image Processing, 15(3):529–538, 2006.
[119] Barbara Zitová and Jan Flusser. Image registration methods: a survey. Image and Vision Computing, 21(11):977–1000, 2003.
[120] Siavash Zokai, Julien Esteve, Yakup Genc, and Nassir Navab. Multiview paraperspective projection model for diminished reality. In Proceedings of the Second IEEE and ACM International Symposium on Mixed and Augmented Reality, pages 217–226. IEEE, 2003.


Acta Universitatis Upsaliensis Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1412 Editor: The Dean of the Faculty of Science and Technology A doctoral dissertation from the Faculty of Science and Technology, Uppsala University, is usually a summary of a number of papers. A few copies of the complete dissertation are kept at major Swedish research libraries, while the summary alone is distributed internationally through the series Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology. (Prior to January, 2005, the series was published under the title “Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology”.)

Distribution: publications.uu.se urn:nbn:se:uu:diva-301363

