In sports science research, many topics utilize athletes’ body motion extracted by motion capture systems, since motion information is valuable data for improving an athlete’s skills. However, one of the unsolved challenges in motion capture is the extraction of athletes’ motion information during an actual game or match, as placing markers on athletes during game play is impractical. In this research, the authors propose a method for acquiring motion information without attaching markers, utilizing computer vision technology. In the proposed method, the three-dimensional world joint positions of the athlete’s body can be acquired using just two cameras without any visual markers. Furthermore, the athlete’s three-dimensional joint positions during game play can also be obtained without complicated preparations. Camera calibration, which estimates the projective relationship between the three-dimensional world and two-dimensional image spaces, is one of the principal processes for three-dimensional image processing tasks such as three-dimensional reconstruction and three-dimensional tracking. A strong-calibration method, which requires setting up landmarks with known three-dimensional positions, is a common technique. However, as the target space expands, landmark placement becomes increasingly complicated. Although a weak-calibration method does not need known landmarks, its estimation precision depends on the accuracy of the correspondences between captured images. When multiple cameras are arranged sparsely, sufficient detection of corresponding points is difficult. In this research, the authors propose a calibration method that bridges multiple sparsely distributed cameras using mobile camera images. Appropriate spacing between the images was confirmed through comparative experiments evaluating camera calibration accuracy while changing the number of bridging images.
Furthermore, the proposed method was applied to multiple capturing experiments in a large-scale space to verify its robustness. As a relevant example, the proposed method was applied to the three-dimensional skeleton estimation of badminton players. Subsequently, a quantitative evaluation was conducted on the camera calibration for the three-dimensional skeleton. The average reprojection error over the parts of the skeleton and its standard deviation were approximately 2.72 and 0.81 mm, respectively, confirming that the proposed method is highly accurate when applied to camera calibration. Additionally, a quantitative evaluation compared the proposed calibration method with a calibration method using the coordinates of eight manually specified points. In conclusion, the proposed method stabilizes calibration accuracy in the vertical direction of the world coordinate system.

In sports science research, many topics utilize athletes’ body motion extracted by motion capture systems, since motion information is valuable data for improving an athlete’s skills.1 However, in order to acquire motion information, markers must be attached to the athlete’s body. These markers are prone to issues (size, mass, etc.) that hinder the athlete’s mobility. In particular, sports research often investigates high-level skills, which cannot be measured accurately when the athlete is wearing markers. Sports venues are large spaces, such as fields, rinks, and arenas. Motion capture digitizes such a space by surrounding it with multiple cameras, and creating a capturable space is a necessary preliminary task, and one of the more complicated ones. In addition, one of the unsolved challenges for motion capture is extracting an athlete’s motion information during an actual game or match, owing to the difficulty of putting markers on athletes during game play.

In this research, the authors propose a method for acquiring motion information without attaching markers, utilizing computer vision technology. The proposed method acquires the three-dimensional (3D) world joint positions of the athlete’s body using just two cameras without markers. Furthermore, the athlete’s 3D joint positions during game play can also be obtained without complicated preparations. In this way, the joint positions acquired by the proposed method are expected to be suitable for use with athletes. Although the proposed method is not suitable for all types of sports analysis, it can be applied to various sports held in environments similar to that of the experiments in this paper. If the measurement accuracy of the proposed method is high, it can be used for general body-motion analysis, such as obtaining joint angles. However, instantaneous motion analysis cannot be applied because the error is too large for measurements such as reaction velocity. If the measurement accuracy of the proposed method is low, the data are not valid for analyzing body movement. However, the data can still be used as a positioning sensor by calculating the center of gravity of the estimated joint positions, which can be applied to strategy and performance analysis.

3D image processing approaches, such as 3D tracking and 3D reconstruction, are active research topics in computer vision, and 3D positional estimation in large-scale spaces is being scrutinized for various scenes.2 For such processes, the projective relationship between the 3D world and the 2D image space, established by the camera parameters of the capturing camera, must be obtained. In camera calibration, landmarks with known 3D positions are set up in the space, and the projective transformation matrix is estimated from the correspondence between the 3D points and their observed positions in the 2D image plane. This is called strong calibration.3 However, calibrating a large-scale space can be problematic due to the increased time and effort required for landmark installation. Alternatively, the weak-calibration method (or self-calibration)4 does not require landmark placement. The relative positions and orientations of multiple cameras, as well as their intrinsic parameters, can be estimated from the correspondence information among multiple viewpoint images. However, when the cameras are arranged sparsely, sufficient corresponding points cannot be obtained, and the estimation precision of the projective relationship is reduced. In large-scale spaces such as gymnasiums or stadiums, where 3D image processing is often implemented, a dense camera arrangement is increasingly difficult.

In sports analysis, the 3D positions of athletes, implements, and projectiles are basic data crucial for improving performance, and will eventually be applied to the automation of game officiating and management. The 3D position estimation of a subject using game images is currently being researched. As a target application for 3D image processing, the authors focused on badminton games in a relatively large-scale space, where installing 3D tracking equipment to capture players and the shuttlecock is difficult. Since the target space is too large to install enough cameras to guarantee the accuracy of weak calibration, strong calibration is typically employed. However, setting up many landmarks for an accurate strong calibration is time-consuming in such a space. Moreover, in official international competitions, setting up landmarks in the target space is even more difficult because permission must be obtained from the governing association before measurements with landmarks can be performed prior to an official match. Setting up the landmarks in such a large space could take half a day to several days, which can be expensive.

The authors have proposed a method to combine the advantages of strong and weak calibrations by bridging multiple sparsely distributed cameras with a mobile camera.5 As illustrated in Figure 1, the target space was captured while moving among sparse multi-view cameras using a mobile camera, so that densely captured multi-view images (interpolation images) could be acquired virtually. By utilizing the captured images, a simple and accurate calibration method is realized.


                        figure

Figure 1. Dense images captured in the proposed method. Mobile camera captures the target scene while moving among sparsely fixed multi-view cameras.

However, in the previous paper by Shishido and Kitahara,5 the demonstrated experiment was insufficient to portray the effectiveness of the proposed method due to its limited application range. In this paper, the authors also conduct an evaluation experiment on calibration accuracy using manually and automatically acquired interpolation images to show the effectiveness of the proposed method. Furthermore, the proposed method was applied to the 3D skeleton estimation of a badminton player to confirm the possibility of this application. Subsequently, a quantitative evaluation was conducted on the camera calibration for the 3D skeleton. Moreover, a quantitative evaluation compared the proposed calibration method with a calibration method using the coordinates of eight manually specified points.

Strong calibration using known landmarks (e.g. checkerboards) is one common approach to camera calibration. Scaramuzza et al.6 proposed a flexible technique for single-viewpoint omnidirectional camera calibration using checkerboards that calibrated a panoramic camera with a vertical field of view over 200°. Chen et al.7 proposed a refractive calibration method for an underwater stereo camera system in which both cameras look through multiple parallel flat refractive interfaces. In research on improving calibration accuracy, the estimation error has been minimized by calculating the epipolar geometry from dynamic silhouettes and color codes. The motion barcode of Ben-Artzi et al.8 is a binary temporal sequence for lines that indicates the existence of at least one foreground pixel on each line; the search for corresponding epipolar lines is limited to lines with similar barcodes. Schillebeeckx and Pless9 introduced a calibration object based on a flat lenticular array that creates a color-coded light field in which the observed color changes depending on the viewing angle. Other studies have addressed environments where camera calibration is difficult, such as medical endoscopes. Nishimura et al.10 proposed a camera calibration algorithm for camera systems involving distortions caused by unknown refraction and reflection processes. Melo et al.11 proposed a completely software-based system that calibrates and corrects radial distortion in clinical endoscopy in real time. All of the above approaches and methods target relatively small spaces. In contrast, because this research targets a large-scale space, significant labor would be required, as landmarks need to be set up to cover the entire space.
To solve such problems, Workman et al.12 proposed a camera calibration method that uses the geometry of a rainbow, describing the minimum set of constraints sufficient for estimating camera calibration and presenting both semi-automatic and fully automatic calibration methods. However, rainbows are relatively rare, and applying them in a large indoor space is difficult. Calibration methods that utilize corresponding information between multi-view images without the installation of landmarks have been studied extensively.13–15 By analyzing the motion field of radially distorted images, Wu14 found critical surface pairs that can render the same motion field under different radial distortions and possibly different camera motions.

Cohen et al.16 described an example of robust calibration by adding corresponding points and proposed a combinatorial approach for solving this variant by automatically stitching multiple sides of a building together. However, when obtaining sufficient corresponding points is difficult and the cameras are installed sparsely, the estimation accuracy readily decreases. In weak calibration, the relative position, orientation information, and intrinsic camera parameters are estimated from the correspondence information of the multi-view images. Therefore, the scale parameter between the captured 3D space (the world coordinate system) and the reconstructed 3D space obtained by weak calibration (the camera coordinate system) is difficult to estimate. As a countermeasure, the camera coordinate system was converted to the world coordinate system using 3D information defined by each individual sport (e.g. court size).

Chen et al.17 proposed a camera calibration method for soccer video. First, two manually specified 3D–2D point correspondences are acquired, and the focal length is estimated from them. Next, the initial pan/tilt angle is estimated using one point. Finally, both points are used to minimize the reprojection error. The pan-tilt-zoom (PTZ) parameters were then optimized, and subsequently integrated and applied to the calibration of sports cameras. Obtaining corresponding points on the soccer field lines requires complicated manual work. To solve this problem, automatic calibration of corresponding 3D–2D points was realized by detecting the soccer field lines from images.18 In addition, rendering football fields and athletes from YouTube video frames with a 3D viewer/augmented reality (AR) device was accomplished by combining the automatic calibration method using soccer field lines with a depth estimation method for players using a trained deep network.19 These calibration methods can be conducted using corresponding 3D–2D points on the soccer field. The calibration accuracy decreases with increasing distance from the ground, but this influence is negligible in soccer because much of the information used, such as player foot positions, is near the ground. However, the calibration accuracy in badminton is subject to greater influence because much of the information used, such as player arm movement, is far from the ground.

In this research, the authors propose a solution for accurate calibration in the vertical direction on the court, which has not been achieved in previous research.

In general, utilization of the court lines is a valuable mechanism in camera calibration of sports scenes. However, calibration based on the court lines proves unstable for positions far from the ground (at high heights). Therefore, a calibration method is proposed that stabilizes accuracy for positions high off the ground by utilizing weak calibration.

Acquisition of projective transformation matrix using weak calibration

As depicted in Figure 1, multi-view images are captured by sparsely installed fixed cameras. At the same time, a video sequence is captured using a mobile camera moving among, and facing the same direction as, the fixed cameras. In this way, the mobile camera visually bridges the sparsely arranged multi-view cameras, and dense multi-view images, including the images captured by the fixed cameras, are acquired. By applying weak calibration to these images, the projective transformation matrix can be estimated for all multi-view images, including those of the sparsely fixed cameras, without setting landmarks. This is salient because sufficient detection of corresponding points is necessary for improving the estimation accuracy. The authors assume that the observed image features are sufficient for obtaining corresponding points in the captured images of the target space, and that the size of at least one object in the space is known in order to estimate the scale parameters.

Calculation of 3D coordinates

In this research, a weak-calibration method was adopted that uses correspondence between multiple viewpoint images without setting landmarks.13,14 This method is called structure-from-motion and is hereinafter referred to as sfm. The homogeneous 3D coordinate of an arbitrary point in the weak-calibration (sfm) coordinate system is defined as Msfm = [Xs, Ys, Zs, 1]T. When m = [u, v, 1]T is observed in the camera image, the projective relationship between the weak-calibration coordinate system and the camera image is expressed by equation (1). The projective transformation matrix (P) of the camera in the weak-calibration coordinate system, acquired by the method in section “Acquisition of projective transformation matrix using weak calibration,” is employed

λm = P Msfm        (1)

The projective relationship is similarly estimated for the other viewpoint images. The 3D coordinates are then calculated from the observed image coordinates by the stereo vision method, using the estimated projective transformation matrices. The stereo vision method acquires 3D coordinates, including depth, from multiple images with parallax, applying the principle of triangulation.
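The stereo computation described here can be sketched with a direct linear transform (DLT) triangulation. This is a generic formulation, not the authors’ implementation; the function name and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def triangulate_dlt(P1, P2, m1, m2):
    """Triangulate one 3D point from two 2D observations via the
    direct linear transform (DLT). P1, P2 are 3x4 projective
    transformation matrices; m1, m2 are (u, v) image coordinates."""
    # Each observation contributes two linear constraints of the
    # form u * (p3 . X) - (p1 . X) = 0 on the homogeneous point X.
    A = np.array([
        m1[0] * P1[2] - P1[0],
        m1[1] * P1[2] - P1[1],
        m2[0] * P2[2] - P2[0],
        m2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector with the smallest
    # singular value; dehomogenize to obtain [Xs, Ys, Zs].
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]
```

With two exact projections of the same point, the recovered coordinates match the original point up to numerical precision; with noisy detections, the SVD yields the algebraic least-squares solution.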

Transformation from weak-calibration coordinate system to world coordinate system

A weak-calibration coordinate system is defined by the distribution of the observed corresponding points. Therefore, the origin and direction of each axis are different for each calibration process. As shown in Figure 2, the capturing space was originally assigned the weak-coordinate system, but it was transformed into a world coordinate system to unify measurements across different capturing data.


                        figure

Figure 2. Geometric relationship among camera coordinate system, weak-calibration coordinate system, and world coordinate system.

An arbitrary point of the world coordinate system is defined as Mworld=[Xw,Yw,Zw]T. The transformation from a weak-calibration coordinate system to a world coordinate system is expressed by a transformation matrix using rotation matrix R and translation vector t, represented by equation (2)

Mworld = R Msfm + t        (2)

Here, 3D transformation matrix D is shown in equation (3)

D = [ R  t
      0  1 ]        (3)

Equation (4) is expressed using transformation matrix D

M̃world = D M̃sfm        (4)

As shown in Figure 3(a) and (b), the origin of the world coordinate system was defined to satisfy the following two conditions: two straight lines (edges X0 and Y0) intersect perpendicularly there in the captured scene of the multi-view video, and an object whose size is known exists in the scene.


                        figure

Figure 3. World coordinate system’s image-capturing environment (a and b).

S0 corresponds to the origin of the world coordinate system in the weak-calibration coordinate system. Vector t is the parallel translation from point S0 to the origin osfm. In addition, the scale is obtained from the size ratio between the weak-calibration coordinate system and an object whose size is known in the world coordinate system. The orthonormal vectors of the weak-calibration coordinate system, given by equation (5), are calculated from points Sx, Sy, and Sz in the weak-calibration coordinate system that correspond to points on the X-, Y-, and Z-axes of the world coordinate system. Rotation matrix R is obtained from the components of each vector ei

ei = (Si − S0) / |Si − S0|   (i = x, y, z)        (5)

Through transformation from a weak-calibration coordinate system into a world coordinate system, the authors calculated the 3D position of the subject in the 3D world coordinate system.
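Equations (2)–(5) can be sketched as follows, assuming the scale factor has already been obtained from the known object size; the function name and the NumPy formulation are illustrative assumptions, not the authors’ implementation.

```python
import numpy as np

def sfm_to_world_transform(S0, Sx, Sy, Sz, scale):
    """Build the sfm-to-world transform of equations (2)-(5).
    S0, Sx, Sy, Sz are points in the weak-calibration (sfm)
    coordinate system corresponding to the world origin and to
    points on the world X-, Y-, and Z-axes; scale converts sfm
    units to world units."""
    S0 = np.asarray(S0, float)
    # Rows of R are the unit vectors e_i of equation (5).
    R = np.array([
        (np.asarray(Si, float) - S0) / np.linalg.norm(np.asarray(Si, float) - S0)
        for Si in (Sx, Sy, Sz)
    ])
    # t maps the scaled, rotated S0 onto the world origin, so that
    # Mworld = scale * R * Msfm + t as in equation (2).
    t = -scale * R @ S0
    return lambda M_sfm: scale * R @ np.asarray(M_sfm, float) + t
```

Applying the returned transform to S0 yields the world origin, and applying it to Sx yields a point on the world X-axis, as the construction requires.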

As illustrated in Figure 1, two fixed cameras were installed in a gymnasium to capture badminton scenes. Figure 3(a) and (b) are images taken by each camera. The origin is set at the corner of the court, and the X- and Y-axes are set along the court lines, as defined by standard badminton regulations. The distance between ① and ② is 6.1 m, and the distance between ① and ③ is 13.4 m, as shown in Figure 3(a) and (b), respectively. The scale parameters are estimated based on these distances.

The two videos were captured using digital video cameras (Sony FDR-AX1) with 3840 × 2160 pixel resolution at 30 frames/second. Video of the same space was also captured by moving (bridging) between the two fixed cameras using a camera with identical specifications. The bridging images were obtained by extracting the individual frames. In this experiment, the mobile camera was moved along the gymnasium’s layout, as shown in Figure 1.

To evaluate the relationship between camera calibration accuracy and the frame interval (i.e. the bridging gap), the bridging gap was set to 1.0°, 1.5°, 2.5°, 6°, 12°, 21°, and 26°, corresponding to 300, 150, 75, 40, 20, 10, and 5 captured images, respectively. The interpolation images (a)–(g) in Figure 4 were acquired automatically by frame extraction from the moving-camera video. In addition, based on the position and orientation information of the interpolation images estimated in (c), the interpolation images (h)–(k) in Figure 4 were acquired manually with equal spacing.


                        figure

Figure 4. Results of estimating camera parameters using the mobile camera images outlined in Figure 1. Number of bridging images for automatic acquisition (a)–(g) and manual acquisition (h)–(k): (a) 300, (b) 150, (c) 75, (d) 40, (e) 20, (f) 10, (g) 5, (h) 40, (i) 20, (j) 10, and (k) 5.

As displayed in Figure 4(a)–(k), weak-calibration processing is applied to the captured bridging image. Using the estimated camera parameters, ① origin: oworld, ② Xo, and ③ Yo of the world coordinate system are calculated and applied in Figure 3(a) and (b). The authors further verified the estimation accuracy of the 3D position.

As detailed in Figure 5, strong-calibration processing was conducted with the known badminton court coordinates to evaluate the accuracy of the camera calibration. The court’s lines were defined based on badminton regulations (Figure 5). The court coordinates of the world coordinate system and the pole tip position (nos 1–18 in Figure 5) were calculated based on the origin’s position defined in Figure 3. Similarly, the specified position coordinates of each image were acquired, and strong calibration was performed using this information. Thus, the camera parameters, ① the origin of the world coordinate system: oworld, ② Xo, and ③ Yo, presented in Figure 3(a) and (b), were calculated and the estimation accuracy of the 3D position was verified.


                        figure

Figure 5. Badminton court coordinates of the world coordinate system and pole tip position (nos 1–18).

Estimation error versus number of bridging images

The authors compared the values defined in the world coordinate system (① oworld, ② Xo, and ③ Yo) with those defined in the badminton regulations (6.1 m between ① and ②, and 13.4 m between ① and ③). The calculated error of the Euclidean distance (X0 versus ground truth and Y0 versus ground truth) is delineated in Figure 6. The estimated error of strong calibration is indicated by the orange bar, and the estimated errors associated with manual and automatic acquisition are represented by blue and yellow bars, respectively. The average estimated error of strong calibration using the badminton court coordinates was 2.2 cm. The average errors at convergence angles of 1.0°, 1.5°, and 2.5° are approximately 4.3, 3.1, and 5.8 cm, respectively. These results indicate that the proposed method has approximately the same accuracy as strong calibration.


                        figure

Figure 6. Calculated error of Euclidean distance by each bridging gap (left: X0 and ground truth, right: Y0 and ground truth).

The estimation error from automatic acquisition increases drastically from a convergence angle of 6° (approximate average error of 15 cm). In contrast, the estimation error from manual acquisition exhibits a similarly abrupt increase only from a convergence angle of 21° (approximate average error of 20 cm).

For both manual and automatic acquisition, the estimation error monotonically increases as the convergence angle increases. For interpolation images acquired manually, erroneous images that do not capture the subject can be excluded. In addition, manual acquisition enables the selection of only well-focused images. Consequently, the selected images may exhibit many correct correspondences between adjacent images. Therefore, the convergence angle at which the estimation error drastically increased was wider than that of automatic acquisition. However, manual acquisition of interpolation images requires significant labor.

As a result, to estimate the position and orientation of the fixed cameras with sufficient precision, the proposed method must sample bridging images at convergence angles of less than 6°. In the experiment, the total path length of the mobile camera motion was approximately 40 m (a 20 m translation along the Y-axis followed by a 20 m translation along the X-axis). Accordingly, bridging images sampled at 1 frame/second from a camera moving at 1 m/s were deemed sufficient for this experiment.
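The sampling-rate reasoning above can be illustrated with a small geometric helper. This is a hypothetical function, and the roughly 10 m subject distance used below is an assumed figure for illustration only: the convergence angle between adjacent bridging images is approximated by the angle their two camera positions subtend at the subject.

```python
import math

def convergence_angle_deg(spacing_m, subject_distance_m):
    """Approximate convergence angle (degrees) between two adjacent
    bridging images: the angle subtended at the subject by camera
    positions spacing_m apart, at subject_distance_m from the subject."""
    return math.degrees(2.0 * math.atan(spacing_m / (2.0 * subject_distance_m)))

# At 1 m/s and 1 frame/second the spacing between bridging images is
# 1 m; with the subject roughly 10 m away (assumed), the gap stays
# below the 6 degree threshold found in the experiment.
```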

3D skeleton estimation using the proposed method

3D pose estimation of badminton players is one of the promising applications of the proposed method. To explore this potential, an experiment was conducted to capture badminton scenes in a gymnasium. As shown in Figure 7(a), two fixed cameras were installed parallel to the X-axis of the world coordinate system. As shown in Figure 7(b), camera positions and orientations were estimated by applying weak calibration (fixed cameras 1 and 2, and the interpolation images). The distance between the two cameras was approximately 10 m. The positions of the origin, X-axis, and Y-axis were established as outlined in section “Multiple camera calibration method.” Multi-view videos were captured using digital video cameras (Blackmagic Studio Camera 4K) with 3840 × 2160 pixel resolution at 30 frames/second. The two cameras captured images using synchronizing signals. The authors also captured a video sequence of the same space by moving (bridging) between the two fixed cameras using a camera with the same specifications as the fixed cameras. Interpolation images were acquired by splitting the captured video into individual frames; as a result, 234 interpolation images were generated in this experiment. The badminton scene used for estimating the 3D skeleton position totaled 17 frames (nos 111–127), from the start of hitting the shuttlecock to the end.


                        figure

Figure 7. (a) Two fixed cameras installed parallel to the X-axis of the world coordinate system. (b) Camera position and orientation estimated by applying weak calibration (fixed camera 1, 2, and interpolation image). Weak calibration estimates the position and orientation of the camera and, at the same time, estimates the 3D point cloud. Dots represent an estimated 3D point cloud.

To estimate the subject’s pose in the captured images, a convolutional neural network (CNN) pose estimation method was applied.20 Figure 8 (left) shows the result of applying convolutional pose machines20 to the captured multi-view images. The resulting 3D pose, estimated from the pose information detected at the two viewpoints, is shown in the center of Figure 8. The projective transformation matrices for this stereo processing were estimated by the proposed method.


                        figure

Figure 8. Left: the result of applying convolutional pose machines20 to the captured multi-view images. Middle: the estimation result of the 3D pose position from the pose information detected at two viewpoints. Right: the estimated trajectories of the wrist, elbow, head, and neck.

As shown in Figure 8 (right), the trajectories of the wrist, elbow, head, and neck were estimated. The orange, green, purple, and yellow plots illustrate the trajectories of the right wrist, right elbow, head, and neck, respectively. As shown by the neck and head Z values in Figure 8, the estimated head is never lower than the neck. Similarly, the right elbow and right wrist Z values trace parabolas, reflecting the swinging motion of the racket. Thus, the estimated skeleton does not contradict human body structure, and it was confirmed that the estimated values did not lose continuity during the racket swing. Based on these results, the skeleton position from the first hit of the shuttlecock to the end was well estimated.

Accordingly, the Euclidean distances of the right wrist and the right elbow were calculated as shown in Figure 9 (nos 111–127). The average distance and standard deviation were approximately 21.5 and 1.9 cm, respectively. With this method, the 3D skeleton position can be calculated with less labor using the two fixed cameras to capture and produce interpolation images. The estimation data of the 3D skeleton position can contribute to improvement of an athlete’s technical skills, through applications such as calculation of the skeleton’s movement and corresponding data analysis.


                        figure

Figure 9. Euclidean distances of the right wrist and the right elbow (nos 111–127).

Subsequently, a quantitative evaluation was conducted on the camera calibration for the 3D skeleton. In this experiment, 3D key points were difficult to annotate, so the authors evaluated reprojection errors. The badminton scene used in this experiment comprised 17 frames (nos 111–127), from the first hit of the shuttlecock to the end. First, 2D skeleton positions were acquired from the two viewpoints by convolutional pose machines.20 Second, the proposed method was applied to the two 2D skeleton positions in order to calculate the 3D skeleton position. Third, the calculated 3D skeleton position was projected onto each camera image. Fourth, the proposed method was applied again to the projected 2D skeleton positions of the two viewpoints in order to calculate the reprojected 3D skeleton position. Finally, the Euclidean distance between the first and second 3D skeleton positions (calculated and reprojected, respectively) was computed as the reprojection error. The results are reported in Table 1. The reprojection error specifies the average error over the 17 frames for each part of the skeleton. As shown in Table 1, the average of the reprojection errors and the standard deviation were approximately 2.72 and 0.81 mm, respectively. Thus, it was confirmed that the proposed method is highly accurate when applied to camera calibration.
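Steps 3–5 of this evaluation can be sketched as follows, with a generic DLT-based stereo triangulation standing in for the proposed method’s stereo processing; all function names are illustrative assumptions, not the authors’ implementation.

```python
import numpy as np

def _triangulate(P1, P2, m1, m2):
    """Generic DLT triangulation of one 3D point from two views."""
    A = np.array([m1[0] * P1[2] - P1[0], m1[1] * P1[2] - P1[1],
                  m2[0] * P2[2] - P2[0], m2[1] * P2[2] - P2[1]])
    X = np.linalg.svd(A)[2][-1]
    return X[:3] / X[3]

def _project(P, M):
    """Project a 3D point M with 3x4 matrix P; return (u, v)."""
    m = P @ np.append(M, 1.0)
    return m[:2] / m[2]

def reprojection_error(P1, P2, M_calc):
    """Steps 3-5 of the evaluation: project the calculated 3D joint
    position onto both camera images, re-triangulate the projected
    2D positions, and return the Euclidean distance between the
    calculated and reprojected 3D positions."""
    m1, m2 = _project(P1, M_calc), _project(P2, M_calc)
    return float(np.linalg.norm(M_calc - _triangulate(P1, P2, m1, m2)))
```

Note that with exact arithmetic and noise-free projections this distance vanishes, so in practice the reported millimetre-scale errors reflect the precision of the projected 2D coordinates and of the pose detections.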

Table

Table 1. The Euclidean distance between the calculated and reprojected 3D skeleton positions.


Furthermore, Figure 10 presents the results of executing steps 1–3 of the reprojection-error procedure described above, exemplified by frames 123 and 126. The yellow line indicates the 2D skeleton positions estimated by convolutional pose machines,20 and the blue line represents the 2D coordinates projected onto each camera from the 3D position estimated by the proposed method. Evidently, the lines in each frame (123 and 126) are nearly identical. However, the segment of the line from the left wrist to the left elbow does not overlap. This is due to the player’s self-occlusion: the whole body can be observed in the camera 1 image, but in the camera 2 image, the left hand is hidden in front of the player’s body and cannot be observed. Convolutional pose machines20 estimate the skeleton position even under self-occlusion, but the estimation accuracy is low for skeleton parts that cannot be observed. Therefore, the estimation precision of the 3D position by the proposed method is low in 2D skeleton regions where the estimation accuracy of convolutional pose machines20 is low. A possible solution to this problem is positioning the cameras so that self-occlusion does not occur, as opposed to the camera placement exhibited in Figure 7.


                        figure

Figure 10. 2D skeleton positions estimated from two viewpoints by convolutional pose machines20 and by projection onto each camera from the 3D skeleton position estimated by the proposed method (frame nos 123 and 126).

Moreover, a quantitative evaluation compared the proposed calibration method with a calibration method using the coordinates of eight manually specified points (hereinafter referred to as the 8-points calibration method). The positions of points 1, 2, 3, 10, 11, 16, 17, and 18, as shown in Figure 5, were manually acquired for the 8-points calibration method, and the camera parameters were calculated from the correspondence of these eight points with the 3D field. As shown in the lower part of Figure 11 (yellow plot), the data used for the quantitative evaluation were the racket positions manually acquired from the player’s images (21 frames). The authors calculated the 3D racket length by applying both the proposed method and the 8-points calibration method to the top and bottom of the racket, acquired from the two viewpoints. The actual length of the racket was 674 mm. The estimated racket length is outlined in the upper part of Figure 11, where the 8-points calibration and proposed methods are indicated by the blue and orange plots, respectively. The results indicate that the average racket length estimation error was nearly equivalent between the two methods: 12.69 and 11.86 mm for the 8-points calibration and proposed methods, respectively. Similarly, the standard deviations of the estimated racket length over frames 1–10 were commensurate: 6.21 and 6.27 mm for the 8-points calibration and proposed methods, respectively. However, over frames 11–21, the standard deviations of the 8-points calibration and proposed methods were 12.30 and 8.47 mm, respectively. These results indicate that the estimation errors over frames 1–10 did not differ between methods, but the estimation errors over frames 11–21 were less dispersed in the proposed method. This difference is explained by the fact that in frames 1–10 the racket is lower than the badminton court net, while in frames 11–21 the racket is above the net.
Therefore, the accuracy of the 8-points calibration method decreases at positions higher than the badminton court net. In contrast, the proposed method provides stable calibration accuracy regardless of the position relative to the net. This stability arises because the image features used by the proposed method are distributed uniformly in 3D space. In conclusion, the proposed method stabilizes calibration accuracy in the vertical direction of the world coordinate system.
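For context, an 8-points calibration of this kind corresponds to a standard direct linear transformation (DLT) fit of a 3 × 4 projection matrix to the eight 2D–3D correspondences. The sketch below illustrates that standard procedure under assumed conventions; it is not the authors' code:

```python
import numpy as np

def dlt_calibrate(world_pts, image_pts):
    """Estimate a 3x4 projection matrix from n >= 6 3D-2D correspondences
    (the 8-points calibration uses n = 8 field points)."""
    A = []
    for (X, Y, Z), (u, v) in zip(world_pts, image_pts):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u])
        A.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    return Vt[-1].reshape(3, 4)  # defined only up to scale
```

When the control points span little height, the linear system is poorly constrained in the vertical direction, which is consistent with the accuracy loss observed above the net; features distributed throughout the 3D space avoid this degeneracy.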



Figure 11. Quantitative evaluation of the proposed method and the 8-points calibration method. Upper: estimation result of racket length using the proposed method and the 8-points calibration method. Lower: racket positions manually acquired from the player’s image (21 frames).

A method was introduced that calibrates multiple sparsely distributed cameras by bridging them with mobile camera images. Experiments that evaluated camera calibration accuracy while changing the convergence angles between the bridging images verified the effectiveness of the proposed method. Even when the distance between the sparsely installed cameras increased, the proposed method performed with high accuracy and less labor.

In addition, an accuracy evaluation was executed using interpolation images acquired both manually and automatically to verify the effectiveness of the proposed method. Furthermore, the proposed method was applied to the 3D skeleton estimation of badminton players to confirm its applicability. These results expand the range of application of the proposed method and demonstrate its effectiveness.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by Japan Society for the Promotion of Science (JSPS) KAKENHI (grant number: 17K13180), Japan Science and Technology Agency (JST) CREST (grant number: JPMJCR16E3), including AIP Challenge Program, Japan.

ORCID iD
Hidehiko Shishido https://orcid.org/0000-0001-8575-0617

1. Van der Kruk, E, Reijne, MM. Accuracy of human motion capture systems for sport applications; state-of-the-art review. Eur J Sport Sci 2018; 18(6): 806–819.
2. Xu, Y, Liu, X, Liu, Y, et al. Multi-view people tracking via hierarchical trajectory composition. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Las Vegas, NV, 27–30 June 2016, pp.4256–4265. New York: IEEE.
3. Shishido, H, Kameda, Y, Ohta, Y, et al. Visual tracking method of a quick and anomalously moving badminton shuttlecock. ITE Trans Media Technol Appl 2017; 5(3): 110–120.
4. Kanatani, K, Ohta, N, Shimizu, Y. 3D reconstruction from uncalibrated-camera optical flow and its reliability evaluation. Syst Comput Jpn 2002; 33(9): 1–10.
5. Shishido, H, Kitahara, I. Calibration method for sparse multi-view cameras by bridging with a mobile camera. In: Proceedings of the 2017 seventh international conference on image processing theory, tools and applications (IPTA), Montreal, QC, Canada, 28 November–1 December 2017, pp.1–6. New York: IEEE.
6. Scaramuzza, D, Martinelli, A, Siegwart, R. A flexible technique for accurate omnidirectional camera calibration and structure from motion. In: Proceedings of the fourth IEEE international conference on computer vision systems (ICVS’06), New York, 4–7 January 2006, pp.45–45. New York: IEEE.
7. Chen, X, Yang, YH. Two-view camera housing parameters calibration for multi-layer flat refractive interface. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Columbus, OH, 23–28 June 2014, pp.524–531. New York: IEEE.
8. Ben-Artzi, G, Kasten, Y, Peleg, S, et al. Camera calibration from dynamic silhouettes using motion barcodes. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Las Vegas, NV, 27–30 June 2016, pp.4095–4103. New York: IEEE.
9. Schillebeeckx, I, Pless, R. Single image camera calibration with lenticular arrays for augmented reality. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Las Vegas, NV, 27–30 June 2016, pp.3290–3298. New York: IEEE.
10. Nishimura, M, Nobuhara, S, Matsuyama, T, et al. A linear generalized camera calibration from three intersecting reference planes. In: Proceedings of the IEEE international conference on computer vision (ICCV), Santiago, Chile, 7–13 December 2015, pp.2354–2362. New York: IEEE.
11. Melo, R, Barreto, JP, Falcao, G. A new solution for camera calibration and real-time image distortion correction in medical endoscopy-initial technical evaluation. IEEE T Biomed Eng 2012; 59(3): 634–644.
12. Workman, S, Mihail, RP, Jacobs, N. A pot of gold: rainbows as a calibration cue. In: Proceedings of the 13th European conference on computer vision (ECCV), Zurich, 6–12 September 2014, pp.820–835. Cham: Springer.
13. Wu, C. Towards linear-time incremental structure from motion. In: Proceedings of the 2013 international conference on 3D vision—3DV 2013, Seattle, WA, 29 June–1 July 2013, pp.127–134. New York: IEEE.
14. Wu, C. Critical configurations for radial distortion self-calibration. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Columbus, OH, 23–28 June 2014, pp.25–32. New York: IEEE.
15. Wilson, K, Snavely, N. Network principles for SfM: disambiguating repeated structures with local context. In: Proceedings of the IEEE international conference on computer vision (ICCV), Sydney, NSW, Australia, 1–8 December 2013, pp.513–520. New York: IEEE.
16. Cohen, A, Sattler, T, Pollefeys, M. Merging the unmatchable: stitching visually disconnected SfM models. In: Proceedings of the IEEE international conference on computer vision (ICCV), Santiago, Chile, 7–13 December 2015, pp.2129–2137. New York: IEEE.
17. Chen, J, Zhu, F, Little, JJ. A two-point method for PTZ camera calibration in sports. In: Proceedings of the IEEE workshop on applications of computer vision (WACV), Lake Tahoe, NV, 12–15 March 2018, pp.287–295. New York: IEEE.
18. Carr, P, Sheikh, Y, Matthews, I. Point-less calibration: camera parameters from gradient-based alignment to edge images. In: Proceedings of the IEEE workshop on applications of computer vision (WACV), Breckenridge, CO, 9–11 January 2012, pp.377–384. New York: IEEE.
19. Rematas, K, Kemelmacher-Shlizerman, I, Curless, B, et al. Soccer on your tabletop. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Salt Lake City, UT, 18–23 June 2018, pp.4738–4747. New York: IEEE.
20. Wei, SE, Ramakrishna, V, Kanade, T, et al. Convolutional pose machines. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Las Vegas, NV, 27–30 June 2016, pp.4724–4732. New York: IEEE.
