I Introduction
Omnidirectional cameras are widely used on mobile robots. In [1], visual cues from panorama images help the robot achieve homing, exploiting the omni-camera's advantage of a large field of view (FOV). In [2], panorama images, which provide rich feature points, are exploited to implement a bearings-only SLAM system. In addition, several meaningful applications of panorama cameras are mentioned in [3]. In all of these cases, pose estimation is one of the important topics. Optimization and geometric methods are two common ways to achieve this task. The former is usually used in feature-based methods [4], while the latter is one of the basic parts of direct methods [5]. These two approaches are often combined to achieve better performance: Engel et al. combine both methods to realize real-time robust visual odometry in [6], and [7] uses geometric methods for the front-end of a SLAM system and optimization for the back-end. Recently, deep learning has also gained attention in pose estimation [8, 9]. Since deep learning methods depend heavily on training images and may adapt poorly to different environments given limited training datasets, we do not take them into account here.

Optimization methods are usually used to minimize photometric and geometric errors. In direct methods based on photometric consistency, the motion between different frames is estimated by minimizing the photometric error, as in [5, 6, 10]. Geometric error optimization helps to further improve pose estimation, e.g. through bundle adjustment [11] and graph optimization [12, 13].
Geometric methods fall into two main groups: pose estimation from 2D-to-2D correspondences and from 3D-to-2D correspondences. The eight-point [14] and five-point [15] algorithms are widely used to estimate the relative pose between two frames from given 2D-2D correspondences: first, the essential or fundamental matrix is calculated via epipolar geometry; then the relative rotation and an up-to-scale translation are extracted from that matrix. Methods like [16], which estimate the relative pose from 3D-2D correspondences, avoid the scale ambiguity and calculate the absolute pose directly, because the 3D points provide the real scale information.
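As a quick illustration of the 2D-2D pipeline, the sketch below (our own synthetic example, not taken from the paper) builds an essential matrix E = [t]×R from a known, hypothetical relative pose and verifies the epipolar constraint x2ᵀ E x1 = 0 on normalized image coordinates; the five- and eight-point algorithms solve the inverse problem, recovering E from such correspondences:

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]x."""
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0]])

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

rng = np.random.default_rng(0)
R = rot_z(0.1)                      # hypothetical relative rotation (yaw)
t = np.array([0.5, 0.2, 0.1])       # hypothetical translation (defined only up to scale)
E = skew(t) @ R                     # essential matrix

P1 = rng.uniform(-1, 1, (10, 3)) + [0.0, 0.0, 5.0]  # scene points in camera-1 frame
P2 = P1 @ R.T + t                   # the same points in camera-2 frame
x1 = P1 / P1[:, 2:]                 # normalized image coordinates
x2 = P2 / P2[:, 2:]
residuals = np.einsum('ij,jk,ik->i', x2, E, x1)
print(np.max(np.abs(residuals)))    # numerically ~0: epipolar constraint holds
```

Note that the constraint is insensitive to the norm of t, which is exactly why 2D-2D methods only recover an up-to-scale translation.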
In addition to feature-based and direct methods, spectral methods are also used for motion estimation. On 2D images, the iFMI algorithm [17, 18] uses the Fast Fourier Transform (FFT) to calculate the translation, yaw and scaling between two images. It has been applied to image mosaicking [19] and motion detection [20]; a related approach is also used for the registration of 3D range data [21].

If we only consider pose estimation between two frames, optimization is used in direct methods, while geometric vision helps feature-based methods to calculate the camera's motion. Our proposed method belongs to the latter category, but it differs from the two approaches mentioned above by directly exploiting the properties of catadioptric omnidirectional cameras. Figure 1 shows the omnidirectional camera, mounted on a simple robot, that we use to collect omni-images. We also show the image as captured by the smartphone and its unwrapped panorama image. From Figure 1(c) we can see that the catadioptric panorama image has the disadvantage of low resolution, as mentioned in [3], which increases the difficulty for feature-based methods. It is therefore important to process these images with more robust algorithms. To meet this requirement, we combine the improved Fourier-Mellin invariant (iFMI) algorithm with our proposed method, since it proves to be more robust than SIFT. Our main contributions are summarized as:

proposing a novel relative pose estimation method based on geometric vision and fitting of pixel displacement values to sinusoidal functions;

exploiting the 2D frequency-based algorithm to estimate the 3D pose of the omni-camera;

comparing our algorithm against commonly used epipolar geometry methods with different feature detectors.
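For background on the spectral registration underlying iFMI, plain phase correlation is the core building block. The following minimal sketch (our own, covering only the pure-translation case, whereas the full iFMI additionally recovers yaw and scale via the Fourier-Mellin transform) estimates the pixel shift between two images:

```python
import numpy as np

def phase_correlation(a, b):
    """Estimate the circular shift d such that b ~ np.roll(a, d, axis=(0, 1))."""
    cross = np.fft.fft2(b) * np.conj(np.fft.fft2(a))
    cross /= np.abs(cross) + 1e-12          # keep only the phase
    corr = np.fft.ifft2(cross).real         # delta-like peak at the shift
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # map peak indices to signed shifts (handle wrap-around)
    return tuple(p if p <= s // 2 else p - s for p, s in zip(peak, corr.shape))

img = np.random.default_rng(1).random((64, 64))
shifted = np.roll(img, (5, -3), axis=(0, 1))
print(phase_correlation(img, shifted))      # → (5, -3)
```

Because the match is computed over the whole spectrum rather than at sparse keypoints, such methods remain usable on low-texture images where feature matching struggles.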
The rest of this paper is organized as follows: the theoretical foundation of our proposed method is introduced in Sec. II, including the motion model, its proof and parameter estimation; we then explain our implementation details in Sec. III; the experiments and analysis are presented in Sec. IV; finally, we conclude our work in Sec. V.
II Problem Formulation
II-A Notations

u, v-axis: image coordinates

x, y, z-axis: camera coordinates

I^i: an omni-image; the superscript i denotes the frame

P^i: a panorama image

(u, v): coordinates in a panorama image

W, H: width and height of the panorama image

Δu(u): the translation (in pixels) along the u-axis of a column u between two panorama images

Δv(u): the translation (in pixels) along the v-axis of a column u between two panorama images

R: 3D rotation of the camera between two panorama images

t: 3D translation of the camera between two panorama images

t_xy: projection of t onto the x-y plane

t_z: projection of t onto the z-axis

θ_xy: rotation component of R about an axis in the x-y plane

θ_z: rotation component of R about the z-axis

φ: the angle between the axis of the x, y translation and the x-axis

s: the unknown scale factor of the translation

α: the known opening angle of a pixel; supposing pixels are square, α = 2π/W

r: the radius of the cylinder

h: the height of the cylinder, and thus the height of the corresponding panorama image

T: transformation matrix in 3D space
II-B Modeling of Pose Estimation
II-B1 Cylinder Camera Model
As mentioned in [22], an image captured by a catadioptric omnidirectional camera is usually unwrapped into a cylindrical panorama image. We therefore model the pose estimation problem based on the cylindrical camera model in this work. Figure 2 gives an intuitive description of the mapping from the spherical to the cylinder model, which can be implemented by interpolation. The transformation between the cylinder and the panorama image can be described as
(1a)  u = atan2(y, x) / α

(1b)  v = z

and

(2a)  x = r cos(αu)

(2b)  y = r sin(αu)

(2c)  z = v
Note that in the panorama image there will be a column u associated with the positive x-axis of the camera (e.g. the front of the robot) and another column, W/4 pixels to the right, which is associated with the positive y-axis of the camera (e.g. the left of the robot). The negative sides of the axes will be on the opposite side of the panorama image, i.e. W/2 pixels away. In order to simplify the formulation, this paper assumes that the x-axis is always located at u = 0 (grey dashed line in Figure 2), and the y-axis is thus at u = W/4 (green line in Figure 2).
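A minimal sketch of this unwrapping relation, assuming square pixels with α = 2π/W and the x-axis at column u = 0 (the function names are ours):

```python
import numpy as np

def pano_to_cylinder(u, v, W, r):
    """Panorama pixel (u, v) -> 3D point on a cylinder of radius r (Eq. 2 form).
    Assumes square pixels (alpha = 2*pi/W) and the x-axis at column u = 0."""
    alpha = 2 * np.pi / W
    return np.array([r * np.cos(alpha * u), r * np.sin(alpha * u), float(v)])

def cylinder_to_pano(p, W):
    """Cylinder point -> panorama pixel (Eq. 1 form)."""
    x, y, z = p
    alpha = 2 * np.pi / W
    return (np.arctan2(y, x) % (2 * np.pi)) / alpha, z

W, r = 720, 1.0
p = pano_to_cylinder(180, 40, W, r)   # column W/4 lies on the positive y-axis
u, v = cylinder_to_pano(p, W)
print(p, u, v)                        # p ≈ [0, 1, 40]; round trip gives (180, 40)
```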
II-B2 Resolution Consistency
When unwrapping omni-images to panorama images with Eqs. 1 and 2, we cannot guarantee that each pixel is square, i.e. there may be a resolution inconsistency: the incident angle per pixel may differ between the u- and v-directions. Thus we use the calibration method described in [23] to find the ratio between the angles per pixel in the u- and v-directions. For the remainder of this paper we assume a perfectly calibrated cylinder model with square pixels; a detailed discussion of this calibration is out of the scope of this paper.
II-B3 Motion Model
We propose a sinusoidal function to describe the motion model of catadioptric omnidirectional cameras as follows:
(3a)  f(u) = a + b sin(αu + ψ)

(3b)  Δv(u) = a_v + b_v sin(αu + ψ_v)

(3c)  Δu(u) = a_u + b_u sin(αu + ψ_u)
We model the motion of the camera using the sinusoid function Eq. 3a with offset a, amplitude b and phase shift ψ, while α is just a calibration factor (the field of view of a pixel). In order to recover the full six degrees of freedom (DoF) of the camera motion we apply this function twice: once to the vertical motion Δv(u) of the columns of the panorama image (Eq. 3b) and once to the horizontal motion Δu(u) of the columns (Eq. 3c).

We now analyze the motion model in four different cases to explain why we choose a sinusoidal function to represent it:

Translation along the z-axis (t_z): the shift along the v-axis of each column u is the same for all u;

Rotation around the x- and y-axes (θ_xy) (roll and pitch, respectively; using pitch = 0 and some roll as an example): the closer a column u is to the (positive or negative) x-axis of the cylinder model, the smaller the shift along the v-axis; the closer u is to the (positive or negative) y-axis, the larger the shift along the v-axis;

Rotation around the z-axis (θ_z) (yaw): the shift along the u-axis in each column u is the same for all u;

Translation along the x- and y-axes (t_xy) (using translation along the x-axis as an example): the closer a column u is to the (positive or negative) x-axis of the cylinder model, the smaller the shift along the u-axis in that column; the closer u is to the (positive or negative) y-axis, the larger the shift along the u-axis.
When the camera rotates around an arbitrary axis in the x-y plane, the shift along v of the columns follows a certain pattern: the closer the column is to that rotation axis, the smaller the shift. Thus we use the phase shift ψ_v to model the angle of the rotation axis to the x-axis in Eq. 3b. The phase shift ψ_u in Eq. 3c is derived following the same argument.
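The role of the model parameters in these cases can be illustrated numerically. In this sketch (the symbol names a, b, ψ for offset, amplitude and phase of Eq. 3 are assumed), a pure z-translation yields a constant v-shift, while a pure roll yields a zero-offset sinusoid peaking at the y-axis column u = W/4:

```python
import numpy as np

def sinusoid_model(u, a, b, psi, W):
    """Motion model in the form of Eq. 3: offset a, amplitude b, phase psi."""
    alpha = 2 * np.pi / W
    return a + b * np.sin(alpha * u + psi)

W = 720
u = np.arange(W)
dv_tz = sinusoid_model(u, 3.0, 0.0, 0.0, W)    # pure z-translation: constant v-shift
dv_roll = sinusoid_model(u, 0.0, 5.0, 0.0, W)  # pure roll: sinusoidal v-shift
print(dv_tz.std())              # 0.0: identical shift in every column
print(u[np.argmax(dv_roll)])    # 180: largest shift at the y-axis column (u = W/4)
```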
In the following we give a rigorous proof of the motion model. We have two panorama images P^1 and P^2 and a transform T between them. Assume there is an arbitrary point p = (u, v) in the second panorama image P^2; its cylinder coordinate is

(4)  P = (r cos(αu), r sin(αu), v)^T.

Then we analyze the shifts of each row or column when the camera moves as in the four cases above. Firstly, the 3D point P is transformed to P' with the specified transformation matrix T; secondly, we find the intersection q between the ray from the origin through P' and the cylinder, which is then unwrapped into the point p' in the panorama image P^1; finally, we calculate the shift between p and p' in the row and column direction, respectively.
Translation along the z-axis:

The transformation is a pure translation t = (0, 0, t_z)^T:

(5a)  T = [ I  t ; 0  1 ];

the transformed point is

(5b)  P' = (r cos(αu), r sin(αu), v + t_z)^T;

since P' already lies on the cylinder, the intersection is

(5c)  q = P';

finally we get the shift in the column direction:

(5d)  Δv(u) = t_z,

which is the same for every column u.
Rotation around the x- and y-axes (for example the x-axis, roll):

The transformation matrix is the rotation

(6a)  T = R_x(θ) = [ 1  0  0 ; 0  cos θ  −sin θ ; 0  sin θ  cos θ ];

the transformed point is

(6b)  P' = (r cos(αu), r sin(αu) cos θ − v sin θ, r sin(αu) sin θ + v cos θ)^T;

the intersection is

(6c)  q = λ P', where λ = r / sqrt(x'^2 + y'^2);

finally we get the shift in the column direction:

(6d)  Δv(u) = λ (r sin(αu) sin θ + v cos θ) − v.

The denominator in Eq. 6d expands into sqrt(r² cos²(αu) + (r sin(αu) cos θ − v sin θ)²), which can be approximated by r when θ is small. Under this condition, cos θ ≈ 1 and sin θ ≈ θ. Thus the shift becomes

(6e)  Δv(u) ≈ r θ sin(αu).
Rotation around the z-axis:

The transformation matrix is

(7a)  T = R_z(θ_z) = [ cos θ_z  −sin θ_z  0 ; sin θ_z  cos θ_z  0 ; 0  0  1 ];

the transformed point is

(7b)  P' = (r cos(αu + θ_z), r sin(αu + θ_z), v)^T;

the intersection remains

(7c)  q = P';

finally we get the shift in the row direction:

(7d)  Δu(u) = θ_z / α,

which is the same for every column u.
Translation along the x- and y-axes (for example along the x-axis):

The transformation is a pure translation t = (t_x, 0, 0)^T:

(8a)  T = [ I  t ; 0  1 ];

the transformed point is

(8b)  P' = (r cos(αu) + t_x, r sin(αu), v)^T;

the intersection is

(8c)  q = λ P', where λ = r / sqrt((r cos(αu) + t_x)² + r² sin²(αu)) = r / sqrt(r² + 2 r t_x cos(αu) + t_x²);

finally we get the shift in the row direction:

(8d)  u' = atan2(r sin(αu), r cos(αu) + t_x) / α,

(8e)  Δu(u) = u' − u.

In Eq. 8e, since t_x ≪ r, the denominator approximately equals r, and t_x/r is small enough to take the first-order Taylor approximation. Thus the shift of each row can be described as:

(8f)  Δu(u) ≈ −(t_x / (α r)) sin(αu).
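The small-angle approximation of the roll case can be checked numerically. The sketch below (our own verification code, with the same assumption as the derivation that scene points lie on the cylinder) computes the exact v-shift under a roll of θ and compares it with the first-order sinusoid r θ sin(αu):

```python
import numpy as np

def v_shift_under_roll(u, v, theta, W, r):
    """Exact v-shift of panorama pixel (u, v) for a camera roll by theta,
    assuming the scene point lies on the cylinder itself."""
    alpha = 2 * np.pi / W
    x, y, z = r * np.cos(alpha * u), r * np.sin(alpha * u), v
    y2 = y * np.cos(theta) - z * np.sin(theta)   # rotate the point about the x-axis
    z2 = y * np.sin(theta) + z * np.cos(theta)
    scale = r / np.hypot(x, y2)                  # re-project onto the cylinder
    return scale * z2 - v

W, r, theta = 720, 100.0, 0.02
u = np.arange(W)
exact = v_shift_under_roll(u, 10.0, theta, W, r)
approx = r * theta * np.sin(2 * np.pi * u / W)   # first-order sinusoid model
print(np.max(np.abs(exact - approx)))            # small relative to r*theta = 2.0
```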
II-C Curve Fitting
There are some unknown parameters in Eq. 3 that we estimate using curve fitting. We model this as a nonlinear least-squares problem and estimate the parameters with a nonlinear optimization algorithm. The rotation and translation can then be calculated from the fitted parameters.
To find the corresponding parameters in Eq. 3b and Eq. 3c, we build the following two objective functions:
(9a)  F_v = Σ_i L_v( Δv̂(u_i) − Δv(u_i) )

(9b)  F_u = Σ_i L_u( Δû(u_i) − Δu(u_i) )

(9c)  L_v(e) = e²

(9d)  L_u(e) = e²

where F_v and F_u are the cost functions, Δv̂ and Δû are the shifts in the v- and u-direction measured with the iFMI algorithm [19], and L_v and L_u are loss functions in the standard least-squares form. Afterwards, the nonlinear Levenberg-Marquardt optimization algorithm [24] is used to minimize the objective functions; it could also be replaced with other optimization methods.

(10a)  L_v(e) = δ² ( sqrt(1 + (e/δ)²) − 1 )

(10b)  L_u(e) = δ² ( sqrt(1 + (e/δ)²) − 1 )
We use two methods to handle outliers. First we employ a median filter on the data to remove obvious outliers. Then we use a robust loss function to reduce the influence of the remaining outliers; the Huber loss has been proven to be less sensitive to outliers than the L2 loss [25]. Because the Pseudo-Huber loss function [26] combines the best properties of the L2 and Huber losses, being strongly convex close to the minimum and less steep for outliers, we use it to replace the standard least-squares loss, as in Eq. 10a and Eq. 10b.

III Implementation
The implementation of our proposed method is described in Algorithm 1.
We slide a square window along the u-direction over the panorama images P^1 and P^2. For each window pair, we use the iFMI method to find the 2D motion. This yields the sets of shifts Δû and Δv̂ versus the column index u. Afterwards, we fit these values with the sinusoidal functions to estimate the parameters, as line 7 of Algorithm 1 describes. The parameter δ of Eq. 10 is chosen empirically in our implementation. Figure 3(c) displays an example of curve fitting between two frames; the optimization algorithm fits the sinusoids effectively.
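A sketch of the fitting step under our assumptions: since a + b sin(αu + ψ) = a + p sin(αu) + q cos(αu) is linear in (a, p, q), a robust fit can even be obtained with iteratively reweighted least squares (IRLS) using Huber weights. This IRLS substitution, and all names below, are ours; it stands in for the paper's Levenberg-Marquardt minimization of the Pseudo-Huber loss:

```python
import numpy as np

def fit_sinusoid_irls(u, y, W, delta=1.0, iters=20):
    """Robustly fit y ~ a + b*sin(alpha*u + psi) via the linear
    reparametrization b*sin(x + psi) = p*sin(x) + q*cos(x)."""
    alpha = 2 * np.pi / W
    A = np.column_stack([np.ones_like(u, dtype=float),
                         np.sin(alpha * u), np.cos(alpha * u)])
    w = np.ones(len(u))
    coef = np.zeros(3)
    for _ in range(iters):
        sw = np.sqrt(w)[:, None]
        coef, *_ = np.linalg.lstsq(A * sw, y * sw[:, 0], rcond=None)
        resid = np.abs(y - A @ coef)
        w = np.minimum(1.0, delta / np.maximum(resid, 1e-12))  # Huber weights
    a, p, q = coef
    return a, np.hypot(p, q), np.arctan2(q, p)   # offset, amplitude, phase shift

rng = np.random.default_rng(2)
W = 720
u = np.arange(0, W, 8)
y = 1.5 + 4.0 * np.sin(2 * np.pi * u / W + 0.6) + 0.05 * rng.standard_normal(u.size)
y[::10] += 20.0                                  # inject gross outliers
a, b, psi = fit_sinusoid_irls(u, y, W)
print(a, b, psi)                                 # close to (1.5, 4.0, 0.6)
```

The outliers injected above barely perturb the recovered amplitude and phase, which mirrors why a robust loss matters for the noisy per-column iFMI shifts.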
In Figure 3(c) we see maxima in the Δv curve at around u = W/4 and u = 3W/4. Following the definition from above that the y-axis is at u = W/4, this means there is a large vertical motion in the image where the y-axis is. It thus follows that there has been a rotation around the x-axis of the camera (roll).
IV Experiments
In this section we compare our method with feature-based algorithms in experiments on three different datasets. We use images from indoor (office) and outdoor (street) scenes as well as a dataset with ambiguous features (grass), which poses big challenges to feature-matching algorithms (samples are shown in Figure 4). All images are captured by a smartphone (OnePlus 5) fitted with a low-cost omni-lens (Kogeto Dot lens), as shown in Figure 1(a). All computations are conducted on a PC with an Intel Core i7-6700 CPU and 16 GB memory.
| Method | grass (roll / pitch / yaw) | office (roll / pitch / yaw) | street (roll / pitch / yaw) | avg. roll | avg. pitch | avg. yaw | avg. total | time [s] |
|--------|----------------------------|-----------------------------|-----------------------------|-----------|------------|----------|------------|----------|
| ORB    | 0.196 / 0.683 / 0.239      | 0.225 / 0.163 / 0.191       | 0.313 / 0.339 / 0.328       | 0.245     | 0.395      | 0.253    | 0.300      | 0.11     |
| AKAZE  | 0.130 / 0.414 / 0.113      | 0.171 / 0.110 / 0.073       | 0.164 / 0.312 / 0.106       | 0.155     | 0.279      | 0.097    | 0.177      | 0.75     |
| Ours   | 0.088 / 0.123 / 0.129      | 0.054 / 0.047 / 0.021       | 0.027 / 0.032 / 0.143       | 0.056     | 0.067      | 0.098    | 0.074      | 0.40     |
In this experiment, we choose the ORB and AKAZE feature detectors to implement the feature-based algorithms, since ORB is one of the fastest detectors and AKAZE is designed for detecting features in nonlinear scale space according to [27]. Besides, the Stewénius five-point algorithm, which is widely used and performs robustly, is employed to estimate the relative pose.
To simplify the comparison, the experiments are done with controlled variables: we only rotate the camera around one axis (x, y or z) in each single test. Experiments for the translation are omitted in this work for two reasons. First, for estimating the correct translation, the sinusoid fitting model assumes that all points have an equal distance to the camera; this restriction is easily violated, making the translation estimate error-prone, whereas the rotation does not require this assumption. Secondly, the translation estimate is in any case only determined up to an unknown scale factor, which adds another difficulty. We use the IMU of the smartphone as ground truth. In the following, we analyze the results in both qualitative and quantitative ways.
IV-A Results Analysis
Figure 5 gives an intuitive view of the pose estimation performance on the three datasets. We find that our method achieves better performance than the feature-based methods in general. Thanks to the robust iFMI algorithm, our method also works well on the grass dataset, while the feature-based methods struggle to describe the ambiguous features; in particular, the ORB-based algorithm fails to detect roll on the grass dataset. Both the proposed algorithm and the feature-based methods show good performance on the office dataset, where the latter benefit from the rich features in this scenario, e.g. chairs, books and so on; even there, the proposed method outperforms the two feature-based ones. Lastly, the street dataset shows that our method also performs well regarding accuracy and robustness in a general outdoor scenario.
Table I provides a rigorous comparison of the three methods, using the root mean square error (RMSE) as the error metric. The table shows that the total average error of our proposed method is less than half that of the AKAZE-based algorithm and about one quarter that of the ORB-based algorithm. It also shows that the AKAZE-based method outperforms ours by a slight margin in the yaw test; however, we suspect that the ground-truth yaw (from the smartphone IMU) may not be reliable even for the short interval between frames, so the yaw experiments should not be over-interpreted. The runtime of the AKAZE-based algorithm exceeds that of our proposed and the ORB-based methods, as the last column of Table I shows.
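For reference, the error metric of Table I is simply the following (a sketch, not the authors' evaluation code):

```python
import numpy as np

def rmse(estimate, ground_truth):
    """Root mean square error, the metric used in Table I."""
    e = np.asarray(estimate, dtype=float) - np.asarray(ground_truth, dtype=float)
    return float(np.sqrt(np.mean(e ** 2)))

print(rmse([0.1, -0.2, 0.3], [0.0, 0.0, 0.0]))   # ≈ 0.216
```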
In summary, our method is the most robust of the three algorithms across the different application scenarios, performing especially well in environments that are feature-deprived or have repetitive features, such as grass. Even though our method is not the fastest, it is cost-effective considering the balance between accuracy and runtime.
V Conclusion
In this paper we proposed a novel pose estimation method based on fitting sinusoid functions to the pixel displacements in panoramic images of omnidirectional cameras. We calculated the average pixel displacements of the image columns using the iFMI algorithm. Experimental results for the 3D rotation show that our method outperforms feature-based approaches, which we attribute to the fact that we use a spectral method and then fit its many results to (sinusoid) functions, which takes care of outliers of the spectral method. The iFMI method alone can only estimate 2D transforms of pinhole images, while our approach can, in principle, estimate the full 6-DoF 3D camera pose (up to a scale factor) from panoramic images.
From the experimental results, we can see that the accuracy of our proposed method is about two to four times better than that of the feature-based approaches for roll and pitch. The speed of our method is almost twice that of the more accurate AKAZE-based algorithm. Thus our method is the most cost-effective of the three.
We envision our algorithm being used for fast visual odometry rather than for visual SLAM with big displacements between frames; in this setting we expect that the translation part of our sinusoid fitting model will also perform reasonably well. In future work we will run experiments on the translation part of our algorithm and use a tracking system to generate more precise ground-truth data. Furthermore, we will attempt to incorporate the scaling and rotation of the sliding window into our model, so that we can use all results that iFMI provides and possibly obtain even better and more robust results. We also plan to investigate other means of generating the per-column pixel displacements (e.g. optical flow, 1D spectral methods). Finally, we aim to integrate our method into a full omni-visual odometry and SLAM framework.
References
 [1] A. A. Argyros, K. E. Bekris, S. C. Orphanoudakis, and L. E. Kavraki, “Robot homing by exploiting panoramic vision,” Autonomous Robots, vol. 19, no. 1, pp. 7–25, 2005.
 [2] T. Lemaire and S. Lacroix, "SLAM with panoramic vision," Journal of Field Robotics, vol. 24, no. 1-2, pp. 91–111, 2007.
 [3] R. Benosman, S. Kang, and O. Faugeras, Panoramic Vision. Springer-Verlag New York, Berlin, Heidelberg, 2000.
 [4] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse, "MonoSLAM: Real-time single camera SLAM," IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 6, pp. 1052–1067, 2007.

 [5] J. Engel, T. Schöps, and D. Cremers, "LSD-SLAM: Large-scale direct monocular SLAM," in European Conference on Computer Vision. Springer, 2014, pp. 834–849.
 [6] J. Engel, V. Koltun, and D. Cremers, "Direct sparse odometry," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 3, pp. 611–625, 2018.
 [7] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, "ORB-SLAM: A versatile and accurate monocular SLAM system," IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.

 [8] A. Handa, M. Bloesch, V. Pătrăucean, S. Stent, J. McCormac, and A. Davison, "gvnn: Neural network library for geometric computer vision," in European Conference on Computer Vision. Springer, 2016, pp. 67–82.
 [9] G. Costante, M. Mancini, P. Valigi, and T. A. Ciarfuglia, "Exploring representation learning with CNNs for frame-to-frame ego-motion estimation," IEEE Robotics and Automation Letters, vol. 1, no. 1, pp. 18–25, 2016.
 [10] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, “Dtam: Dense tracking and mapping in realtime,” in 2011 international conference on computer vision. IEEE, 2011, pp. 2320–2327.
 [11] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, “Bundle adjustment—a modern synthesis,” in International workshop on vision algorithms. Springer, 1999, pp. 298–372.
 [12] R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard, "g2o: A general framework for graph optimization," in 2011 IEEE International Conference on Robotics and Automation. IEEE, 2011, pp. 3607–3613.
 [13] M. Pfingsthorn and A. Birk, “Generalized graph slam: Solving local and global ambiguities through multimodal and hyperedge constraints,” International Journal of Robotics Research (IJRR), vol. 35, 2016.
 [14] R. I. Hartley, "In defence of the 8-point algorithm," in Proceedings of the IEEE International Conference on Computer Vision. IEEE, 1995, pp. 1064–1070.

 [15] H. Stewénius, D. Nistér, F. Kahl, and F. Schaffalitzky, "A minimal solution for relative pose with unknown focal length," in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 2. IEEE, 2005, pp. 789–794.
 [16] Y. Zheng, Y. Kuang, S. Sugimoto, K. Astrom, and M. Okutomi, "Revisiting the PnP problem: A fast, general and optimal solution," in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2344–2351.
 [17] H. Bülow and A. Birk, "Fast and robust photomapping with an unmanned aerial vehicle (UAV)," in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2009, pp. 3368–3373.
 [18] S. Schwertfeger, H. Bülow, and A. Birk, "On the effects of sampling resolution in improved Fourier Mellin based registration for underwater mapping," in 7th Symposium on Intelligent Autonomous Vehicles (IAV). IFAC, 2010.
 [19] A. Birk, B. Wiggerich, H. Bülow, M. Pfingsthorn, and S. Schwertfeger, "Safety, security, and rescue missions with an unmanned aerial vehicle (UAV): Aerial mosaicking and autonomous flight at the 2009 European Land Robots Trials (ELROB) and the 2010 Response Robot Evaluation Exercises (RREE)," Journal of Intelligent and Robotic Systems, vol. 64, no. 1, pp. 57–76, 2011.
 [20] S. Schwertfeger, A. Birk, and H. Bülow, "Using iFMI spectral registration for video stabilization and motion detection by an unmanned aerial vehicle (UAV)," in IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR). IEEE Press, 2011, pp. 1–6.
 [21] H. Bülow and A. Birk, “Spectral 6dof registration of noisy 3d range data with partial overlap,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 35, pp. 954–969, 2013.
 [22] D. Scaramuzza, “Omnidirectional camera,” Computer Vision: A Reference Guide, pp. 552–560, 2014.
 [23] T. L. Conroy and J. B. Moore, “Resolution invariant surfaces for panoramic vision systems,” in Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 1. IEEE, 1999, pp. 392–397.
 [24] D. W. Marquardt, “An algorithm for leastsquares estimation of nonlinear parameters,” Journal of the society for Industrial and Applied Mathematics, vol. 11, no. 2, pp. 431–441, 1963.
 [25] P. J. Huber, “Robust estimation of a location parameter,” in Breakthroughs in statistics. Springer, 1992, pp. 492–518.
 [26] P. Charbonnier, L. BlancFéraud, G. Aubert, and M. Barlaud, “Deterministic edgepreserving regularization in computed imaging,” IEEE Transactions on image processing, vol. 6, no. 2, pp. 298–311, 1997.
 [27] S. A. K. Tareen and Z. Saleem, "A comparative analysis of SIFT, SURF, KAZE, AKAZE, ORB, and BRISK," in 2018 International Conference on Computing, Mathematics and Engineering Technologies (iCoMET). IEEE, 2018, pp. 1–10.