Expanded photo-model-based stereo vision pose estimation using a shooting distance unknown photo

Model-based stereo vision pose estimation depends on the establishment of the model. The photo-model-based method simplifies the model-building process, requiring just one photo; the shapes, colors, and patterns of objects do not need to be predefined in a programming language. In the past, however, it was necessary to calculate a pixel per metric ratio, that is, the number of pixels per millimeter of the object, from the photo's shooting distance, so that the generated photo-model had the same size (length and width) as the actual object. This restricted real-world application. The proposed method extends the traditional photo-modeling algorithm and relaxes the photo prerequisite for target pose determination. Various pixel per metric ratios are assumed to generate 3D photo-models of different sizes, and these models are then employed in stereo vision image matching to detect the pose of the target object. Since it is not a data-driven method, it does not require many pictures or pretraining time. This article applies the algorithm to the cleaning of seaports and aquaculture sites, aiming to locate dead or diseased marine life on the water surface before collection. Pose estimation experiments have been conducted to detect an object's pose and a prepared photo's pixel per metric ratio in real application scenarios. The results show that the expanded photo-model stereo vision method can estimate the pose of a target with one photo whose pixel per metric ratio is unknown.


Introduction
Robots play an important role in marine biological fishing, invasive species control, and surface collection. Using robots such as autonomous underwater vehicles (AUVs) for precisely positioning and targeting marine creatures can reduce the capture of nontarget species and minimize the impact on the marine environment. They are also used for water surface cleaning and retrieval in marine ports and aquaculture.
Estimating the six degrees of freedom (DOF) pose of a marine creature floating on the water surface is crucial for autonomous robots to track or grasp it effectively. For automated robots, visual information is expected to allow them to adapt to different environments. 1,2 For a robot with vision sensors, such as cameras, it has so far been difficult to accurately detect the 3D pose of a target object, especially if the target cannot be predefined because its shape or size is arbitrary. On the other hand, some vision-based AUVs have been researched, for example, for eliminating invasive species with a monocular camera. 3 In contrast to the current state of robot vision research on land, applying robot vision in water is at a lower stage.
Since monocular vision has a single, cheap hardware setup, it is widely utilized for visual pose detection. 4,5 A fish-catching robot has been developed to deal with the target's 3-DOF recognition. 6 However, it cannot avoid the disadvantage that its depth measurement is inaccurate. Many studies have used Red, Green, Blue, and Depth (RGB-D) cameras, composed of one RGB camera and a depth sensor with infrared light, to improve the distance detection ability of monocular vision. 7,8 However, depth information is not always readily available, for example, in systems operating outdoors, where common depth sensors suffer from noisy and sparse depth information. 9,10 Other range sensors can measure distance more precisely, 13,14 but these sensors often come with a higher cost. With the development of deep-learning technology, monocular RGB images can also achieve pose recognition. 9,15 However, this requires many pictures and pretraining time.
Unlike monocular vision and RGB-D methods, stereo vision is another method for estimating 3D pose, and it perceives a greater variety of target material properties and light conditions. 16 When it comes to pose detection, stereo vision methods can be roughly categorized into two types: stereo-matching and model-matching methods.
Stereo matching, also known as disparity estimation, utilizes epipolar geometry to compute the 3D coordinates of a physical point (2D-3D method). Among stereo-matching methods, feature-based approaches are commonly used for pose detection. These methods extract feature points and match them using techniques such as FAST, 17 SIFT, 18 and SURF. 19 However, mismatches are inevitable, and removing mismatched noise points is a complex problem.
Point-cloud-based methods use a scene point cloud generated by stereo matching, 20 which can be seen as a global extension of feature-based methods. However, it is generally necessary to organize and structure the 3D discrete points into a higher-level representation, such as voxels. 21 One of the challenges in this process is removing noise points that do not correspond to the target objects.
Additionally, identifying and segmenting the desired objects within the point clouds is a complex task. 22 In short, mismatches are inevitable in stereo vision, and addressing them is crucial for achieving accurate results.
Model-matching methods, also known as model-based recognition methods, have the advantage of avoiding mismatches and are especially effective in handling occlusion. Although monocular model-based methods can estimate a 6D pose, their distance measurement accuracy is lower than that of stereo vision because they cannot exploit parallax displacement. 23,24 These methods first acquire the object model, projecting all points of a solid 3D model onto the stereo vision image planes. The projected points are then matched with their corresponding counterparts on the actual target, leading to accurate pose estimation (3D-2D method).
Traditional model-based pose estimation methods mainly rely on handcrafted models. Such models are made according to the style and size of the original object, so they are generally used when the target size is given. An application utilizing a fixed 3D marker has been developed for AUV navigation and battery recharging. 25,26 In other situations, however, aquatic organisms are always on the move, making it difficult to measure and model them accurately. An innovative approach using deformable models and stereo vision was employed to accurately measure the size of tuna. 27 However, the complexity of model building limits the generality of this method in detection.
A photo-model-based pose estimation method has been proposed 28 to overcome the disadvantages encountered in constructing models. It simplifies the model-making process since the object's shape, color, pattern, and coding do not need to be predefined in the programming language. 28,29 This method belongs to the model-based recognition category. It creates 3D models from 2D photos and then projects these models onto binocular images to match actual objects for pose estimation (2D-3D-2D).
The photos employed for photo-model generation are pre-prepared instead of being captured on-site, minimizing the environmental constraints at the specific application site. In the pose estimation process, the generated models are matched with binocular images taken on-site to find the best match. Regarding deformable objects, such as clothes, and the issue of partial occlusion, previous studies have investigated various environmental factors that influence the handling of such objects. 29,30 These studies conducted experiments and provided empirical evidence supporting the effectiveness of the photo-model approach. Additionally, a 3D target object's pose could be estimated and tracked in real time by using stereo vision and its 2D photo. 31 Moreover, a visual servoing system for catching marine creatures was developed using the photo-model approach. 24 However, previous photo-model-based algorithms rely on camera calibration to obtain the pixel per metric (PM) ratio and calculate the target size from the shooting distance. The previous method uses a photo with a known PM ratio to generate a photo-model of the same size as the object; it cannot use photos with unknown PM ratios. The PM ratio measures the number of pixels per unit length of an object. 33,34 Existing studies usually rely on camera calibration with reference objects of known size to ensure this ratio. 35,36 However, in practical scenarios, the size of targets on the ocean surface is often unknown, which poses a challenge for model-based approaches. To address this issue, the proposed expanded method overcomes the limitations of the previous size-fixed photo-model approach by assuming different PM ratios. This enables the generation of photo-models of different sizes from the same photo. By utilizing this approach, spatial model matching of an object with unknown size can be facilitated, ultimately leading to accurate estimation of the target's pose.
On the other hand, while data-driven methods with deep-learning techniques utilize images for 3D pose detection, they necessitate a considerable amount of training data and pretraining time. 37,38 In contrast, the photo-model-based method, which belongs to the model-based approach, can accurately recognize the object's pose with just a single photo. 31 This approach eliminates the need for large amounts of training data and simplifies the model-building process.
The main purpose of this article is to verify whether it is possible to use photo-models of assumed dimensions, that is, models generated from different PM ratios, to perform model-based stereo vision pose estimation for objects of uncertain dimensions. This article aims to enhance the current photo-model-based algorithm and propose a convenient pose detection method using stereo vision. The proposed method will serve as a foundation for future research on visual servoing in marine aquaculture, specifically for the collection of deceased or ill marine creatures on the surface of the water. This article does not discuss recognizing targets underwater.
In this study, we assume the PM ratio for model generation. To verify the generality of the algorithm, in addition to taking pictures of the target, we also downloaded a photo with unknown shooting distance from Bing Images. Using the two photos in separate pose estimation experiments, the results confirmed that the target pose can be recognized using stereo vision and a photo of the same species taken at an unknown shooting distance.
More precisely, the contributions of this article are as follows: (1) This article proposes a method to estimate a target's pose by using stereo vision and a photo whose shooting distance is unknown. (2) In the case where a model of the same size as the object cannot be generated, 3D planar models of different sizes are generated by assuming the PM ratio of the pixel length to the actual length of the object. (3) The target pose and PM ratio estimation problem is transformed into an optimization problem, so the pose and the ratio can be solved simultaneously.
The rest of this article is organized as follows: The second section presents expanded photo-model generation and the photo-model-based pose estimation method. The third section discusses the adaptability of the proposed method for recognizing an object's pose according to the pose-ratio fitness distribution and the pose estimation experimental results. The conclusions and future work are described in the final section.

Expanded photo-model-based pose detection
This section introduces the methodology of the expanded photo-model-based recognition method with a variable PM ratio. The developed photo-model-based stereo vision system is shown in Figure 1. The coordinate systems are as follows: S_H: end-effector (hand) coordinate system; S_M: target object coordinate system; S_CL, S_CR: left and right camera coordinate systems.
S_CL and S_CR are inclined at an angle of approximately 14° with respect to the vertical line. The baseline between the two cameras is 323.4 mm. S_H is located above the center of the line connecting the two cameras. z_H is oriented downward along the vertical direction. The center of S_M is the geometric center of the target, and z_M is perpendicular to the target's back.
Figure 2 shows the perspective projection of the stereo vision system. The coordinate systems and symbols are as follows: S_IL, S_IR: left and right 2D image coordinate systems; S_Mj: jth model coordinate system; ^M r_i^j: position of the ith point on the jth 3D model in S_Mj; ^CL r_i^j, ^CR r_i^j: positions of the ith point on the jth 3D model based on S_CL and S_CR; ^IL r_i^j, ^IR r_i^j: projected 2D positions on S_IL and S_IR of the ith point on the jth 3D model.

2D pixel photo-model
This subsection describes the 2D pixel photo-model generation before explaining the 3D photo-model. The hue value in the HSV color representation is used for extracting the target color. The advantage of HSV is that each of its attributes corresponds directly to basic color concepts, which makes it conceptually simple; the program for the image matching process is therefore easy to understand. In addition, the hue of the HSV color system shows good robustness against changes in lighting intensity.
The model generation process is represented in Figure 3. Figure 3(a) is scanned from outside to inside, and the part of the image with target hue values is determined as the photo-model frame. As shown in Figure 4(a), the model is generated based on a photo. The coordinate system of the model, S_P, is shown in Figure 4(b). The length of the model frame is L_P (pixel) and the width is B_P (pixel); together they define the model size. The outer portion is set for image matching, and its size is larger than the model size. Sampling points are taken in the model at regular pixel intervals and used for model matching. The coordinate of the ith sampling point in the 2D pixel coordinate system S_P is ^P r_i = [^P x_i, ^P y_i]^T.
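As a rough illustration of this extraction step, the sketch below (not the authors' implementation; the hue tolerance, the axis-wise scan, and the sampling interval are assumptions) locates the target-hue frame in a hue image and takes sampling points at regular pixel intervals:

```python
import numpy as np

def hue_mask(hue_img, target_hue, tol=20):
    """Mark pixels whose hue lies within `tol` of the target hue (0-179 scale)."""
    diff = np.abs(hue_img.astype(int) - target_hue)
    diff = np.minimum(diff, 180 - diff)  # hue wraps around
    return diff <= tol

def model_frame(mask):
    """Scan the mask from outside to inside and return the bounding
    frame (top, bottom, left, right) of target-hue pixels."""
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    return rows[0], rows[-1], cols[0], cols[-1]

def sample_points(top, bottom, left, right, step=10):
    """Take sampling points at regular pixel intervals inside the frame."""
    ys, xs = np.mgrid[top:bottom + 1:step, left:right + 1:step]
    return np.stack([xs.ravel(), ys.ravel()], axis=1)
```

The frame dimensions returned here correspond to L_P and B_P, and the sampled (x, y) pairs to the ^P r_i points used for matching.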

3D photo-model with specified PM ratio
To explore the object, the photo-model needs to be converted from a 2D pixel model to a 3D spatial plane model. For the jth 3D spatial plane model, the length and width are calculated as L_j = L_P / a_j and B_j = B_P / a_j, where a_j is the PM ratio of the jth model, in units of (pixel/mm). It is the ratio of the 2D pixel model to the 3D spatial plane model. a_M is defined as the real ratio of the 2D pixel model to the target object.
The coordinate ^M r_i^j of the ith point of the jth model in the coordinate system S_Mj in the 3D searching space is obtained as follows. Because the 3D searching model is generated from a 2D photo, the thickness of the target is unknown; therefore ^M z_i = 0, and the generated 3D photo-model is a 3D spatial plane. Equation (4) gives the conversion of the ith sampling point between S_P shown in Figure 4 and S_Mj in Figure 2:

^M x_i^j(a_j) (mm) = ^P x_i (pixel) / a_j
^M y_i^j(a_j) (mm) = ^P y_i (pixel) / a_j        (4)

Therefore, ^M r_i^j can be described as a function of a_j, that is, ^M r_i^j(a_j).
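Equation (4) amounts to a per-point division by the PM ratio. A minimal sketch, assuming the sampling points are stored as (x, y) pixel pairs:

```python
import numpy as np

def pixel_to_spatial(points_px, a_j):
    """Convert 2D pixel-model sampling points into the jth 3D spatial
    plane model via equation (4): x and y are divided by the PM ratio
    a_j (pixel/mm), and z is fixed at 0 because the model is a plane."""
    pts = np.asarray(points_px, dtype=float) / a_j   # pixel -> mm
    z = np.zeros((len(pts), 1))                      # ^M z_i = 0
    return np.hstack([pts, z])                       # ^M r_i^j(a_j), in mm
```

For the 386 × 152 pixel squid model mentioned later, a_j = 2 yields a 193 × 76 mm plane, matching the numbers quoted in the text.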

Projective transformation of the photo-model
This subsection introduces the basic component of the projective transformation of the photo-model as follows.
More details have been presented in the literature. 28,30,39 It should be noted that, in the past, ^M r_i^j was a fixed value for a specific object. In this article, from step 1 to step 2 in Figure 5, the photo-model ^M r_i^j(a_j) is not a fixed value but a function, equation (4), of the PM ratio a_j.
As shown in Figure 1, the pose of S_M based on S_H, including three position variables and three orientation variables in quaternion, is ^H Φ_M. Based on S_H, the pose of the jth 3D model, ^H Φ_M^j, is defined in the same form. For simplicity, ^H Φ_M^j is written as Φ_M^j. The homogeneous transformation ^H T_Mj based on the hand coordinate system S_H can be calculated from the pose Φ_M^j of the jth model. 40 Regarding stereo vision, the position ^CL r_i^j of the ith point on the jth 3D model based on S_CL can be calculated through equation (7). The position vector ^IL r_i^j of the ith point in the left camera image coordinates can then be calculated by using the projective transformation matrix P_CL from the 3D space S_CL to the 2D image space S_IL, as in equations (8) to (10). The projective transformation process is summarized in Figure 5. ^IR r_i^j is described in the same manner as ^IL r_i^j.
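The chain from model coordinates to image coordinates can be sketched as follows. The homogeneous transform stands in for ^H T_Mj composed with the camera extrinsics, and the focal length and principal point are assumed pinhole intrinsics for illustration, not values from the paper:

```python
import numpy as np

def project_points(r_model, T_cam_model, f=600.0, cx=320.0, cy=240.0):
    """Transform jth-model points into the camera frame (eq. (7)) and
    apply a pinhole projection standing in for the matrix P_CL.
    f, cx, cy are assumed intrinsics."""
    r_model = np.asarray(r_model, dtype=float)
    homo = np.hstack([r_model, np.ones((len(r_model), 1))])  # homogeneous coords
    r_cam = (T_cam_model @ homo.T).T[:, :3]                  # ^CL r_i^j
    u = f * r_cam[:, 0] / r_cam[:, 2] + cx                   # ^IL r_i^j (pixel)
    v = f * r_cam[:, 1] / r_cam[:, 2] + cy
    return np.stack([u, v], axis=1)
```

With a pure translation of 500 mm along the optical axis, the model origin lands at the principal point, and off-axis points scale by f/z, which is the perspective size effect discussed for Figure 8.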

Photo-model-based matching in 3D space
As shown in Figure 6, 3D toys of marine creatures are prepared; the squid is used for explaining the photo-model-based matching. Figure 7 illustrates the experimental setup for the fitness distribution experiment, which will be elaborated upon in the subsequent subsection. The target object and manipulator remain stationary while the 3D photo-models vary in distance and size.
In Figure 7, two example models of equal size, generated from the same photo, are displayed; they therefore have the same PM ratio. The first model's projection transformation result is depicted in Figure 8(a), while the second model's result and additional examples of 3D model projection onto the stereo vision image planes are illustrated in the remaining panels of Figure 8.
In Figure 8(c), through the forward projection equation (9), the 3D model in the 3D searching space is projected onto the left and right camera images.
The matching degree between the projected model and the images of the real target object captured by the dual-eye cameras is evaluated by a fitness function F(Φ_M^j). As shown in Figure 4, when the prepared photo is 640 × 480 pixels, the divided squid model is 386 × 152 pixels. According to equation (4), in Figure 8(a) the photo-model spatial size is 193 × 76 × 0 mm with a = 2. In Figure 8(c), the model spatial size is 321.7 × 126.7 × 0 mm with a = 1.2, bigger than that in (a). Comparing subfigures (c) and (a): since the model in (a) is nearer to the cameras than that in (c), through perspective projection the photo-model projection results in the right and left images have similar sizes.
However, because of the stereo vision, the projection results in Figure 8(c) are closer to the respective centers of the left and right images than those in (a). Comparing (b) and (d) leads to the same conclusion. As shown in (b), when the models completely overlap the target object, a is considered the correct ratio of the prepared photo, and the pose of the model is the same as that of the target object.
Compared to monocular vision, which only observes the projection results of the left camera, binocular vision exhibits a greater positional difference when projecting the model onto the images; stereo vision is therefore more helpful in accurately identifying pose and size. As shown in Figure 9, when the distance between the model and the object is small and their sizes are similar, the coincidence is higher. This characteristic inspired us to create a fitness function that utilizes coincidence to describe the resemblance in pose and size between a photo-model and the target object. The projected sampling points ..., ^IL r_{i-1}^j, ^IL r_i^j, ^IL r_{i+1}^j, ..., are indicated by white dots in the inside area S_{L,in} and in the outside strip S_{L,out}.

Definition of the fitness function
The 2D model is composed of dots whose relative positions are predefined and fixed. Figure 9(b) shows another situation, where the overlap area between the real target and the model is increased compared to the situation depicted in (a).
The correlation between the projected model and the captured left and right 2D images is calculated by equations (11) to (13), where N_in and N_out are the total numbers of inner-portion and outer-portion sampling points, respectively. The evaluation of every point in the input image that lies inside the model inner portion, ^IL r_i^j ∈ S_{L,in}(Φ_M^j), or in the outside area, ^IL r_i^j ∈ S_{L,out}(Φ_M^j), is represented as p_{L,in}(^IL r_i^j) and p_{L,out}(^IL r_i^j), respectively; equations (12) and (13) are used for their calculation. The evaluation values in equations (12) and (13) are tuned experimentally.
Calculating p for each sampling point (equations (12) and (13)) based on color similarity takes constant time, O(1). Therefore, for the jth photo-model, the complexity of the fitness calculation in equation (11) is linear in the total number of sampling points of that photo-model.
In equation (12), if the hue value of a captured-image point lying inside the surface model frame S_{L,in} is similar to the hue value of the corresponding model point, the fitness value increases by the voting value e_1. These sampling points are represented by the dots designated (A) in Figure 9(b). The fitness value decreases by e_2 for every inner-portion model point whose hue value differs from that of the corresponding pixel in the left camera image; this indicates that the model does not precisely overlap the target in the input image, represented by (B) in Figure 9(b).
Similarly, in equation (13), if the hue value of a left-camera-image point in S_{L,out} is near the hue value of the background, with a tolerance of 20, the fitness value increases by e_3. This means the S_{L,out} strip area surrounding S_{L,in} overlaps the background, expressing that the model and the target overlap rather correctly, as (C) in Figure 9(b). Otherwise, the fitness value decreases by e_4, representing points in S_{L,out} that overlap the real target, as (D) in Figure 9(b).
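A minimal single-image sketch of equations (11) to (13) might look like the following. The voting values e_1 to e_4 and the normalization are illustrative placeholders, since the paper tunes them experimentally:

```python
import numpy as np

def fitness(img_hue, pts_in, pts_out, model_hues, bg_hue,
            e=(1.0, -0.5, 0.2, -0.5), tol=20):
    """Sketch of eq. (11)-(13) for one camera image.
    pts_in: projected inner-portion sampling points (x, y);
    pts_out: projected outer-strip points; model_hues: hue of each
    inner model point; e = (e1, e2, e3, e4) are assumed voting values."""
    e1, e2, e3, e4 = e
    f = 0.0
    # Inner portion: reward hue agreement with the model (eq. (12)).
    for (x, y), h_model in zip(pts_in, model_hues):
        h = img_hue[y, x]
        f += e1 if abs(int(h) - int(h_model)) <= tol else e2
    # Outer strip: reward hue agreement with the background (eq. (13)).
    for (x, y) in pts_out:
        h = img_hue[y, x]
        f += e3 if abs(int(h) - int(bg_hue)) <= tol else e4
    return f / (len(pts_in) + len(pts_out))  # normalized fitness
```

In the full method this score is computed for both camera images and summed, and the model pose with the highest score is taken as the estimate.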
Likewise, the functions p_{R,in}(^IR r_i^j) and p_{R,out}(^IR r_i^j) are defined for the right camera image.

Stereo image acquisition
The photo-model-based stereo vision system is shown in Figures 1 and 7. The manipulator used in the system is a PA-10 robot arm manufactured by Mitsubishi Heavy Industries, Tokyo, Japan, with two CCD cameras mounted on the end effector. The resolution of the stereo images is 640 × 480 pixels. The PC is a Yoga Pro 13s (CPU: Core(TM) i5-1135G7, 2.42 GHz; RAM: 16 GB).

Fitness distribution experiment
Using still pictures of the target captured by the left and right cameras, the fitness value F(Φ_M^j) is calculated with the assumed photo ratio and the model's pose varied as parameters. We call this the "pose-ratio fitness distribution." Figures 10 to 12 illustrate the pose-ratio fitness distribution results for the C01 crab and the C02 squid, which are depicted in Figure 6.
The true pose of the object is set as

^H Φ_M = [0, 0, 500 (mm), 0, 0, 0]^T        (14)

In this experiment, the target objects' sizes and poses do not change. To imitate photos taken at different heights, photos of different sizes are prepared; therefore, corresponding to the different-size photos, the true photo ratios are as given in equation (15). The relationship between S_H and S_M based on equation (14) is depicted in Figure 7. In the experiment, the target S_M and the end-effector S_H do not move. Even though the fitness distribution is made by an exhaustive search, it is impossible to calculate all possibilities. In this experiment, the position increment of the fitness scan is set to 2.0 mm, the orientation increment is 0.02 (quaternion components have no unit), and the PM ratio increment is 0.05. The search ranges of the fitness distribution are set as follows. Position: ^H x_M and ^H y_M ∈ [−100, 100] mm, ^H z_M ∈ [400, 600] mm; orientation: ^H e_1M, ^H e_2M, and ^H e_3M ∈ [−0.3, 0.3]; model size ratio: a ∈ [0.5, 3] in Figure 10 and a ∈ [1, 4] in Figures 11 and 12.
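The exhaustive z-a scan described above can be sketched as follows, with the 2.0 mm and 0.05 increments from the text; the fitness function itself is passed in as a black box, with the other pose variables assumed to be held at their true values:

```python
import numpy as np

def pose_ratio_scan(fitness_fn, z_range=(400.0, 600.0), a_range=(0.5, 3.0),
                    dz=2.0, da=0.05):
    """Exhaustive z-a scan for one slice of the pose-ratio fitness
    distribution. fitness_fn maps (z, a) to a fitness value."""
    zs = np.arange(z_range[0], z_range[1] + dz / 2, dz)  # 2.0 mm position steps
    As = np.arange(a_range[0], a_range[1] + da / 2, da)  # 0.05 ratio steps
    grid = np.array([[fitness_fn(z, a) for a in As] for z in zs])
    iz, ia = np.unravel_index(np.argmax(grid), grid.shape)
    return zs[iz], As[ia], grid                          # peak location + grid
```

A well-behaved fitness function produces a grid whose peak sits at the true distance and ratio, which is what Figures 10 to 12 visualize.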
Figure 10(a) shows the left and right camera images of the C02 squid, and the size of the prepared photo in (b) is smaller than that in Figure 11(b). Figure 10(c) to (e) shows the fitness distribution results with the position and photo ratio scanned, and (f) to (h) shows the results with the orientation and photo ratio scanned. All the fitness distributions (c) to (h) have peaks whose poses and ratios are near the actual values given by equations (14) and (15).
When the size of the prepared photo changes in Figure 11, the same conclusion can be drawn. As shown in Figure 11(e), when a changes from 1 to 4, the size of the squid model changes from 386 × 152 × 0 mm to 96.5 × 38 × 0 mm. Although z is still 480 mm, the fitness changes dramatically with a. Only when a is close to the actual ratio, that is, when the model size is close to the object's actual size, does the fitness reach a high value.
For the C01 crab, as for the squid, Figure 12(c) to (e) shows the position-ratio fitness distribution, and (f) to (h) shows the orientation-ratio fitness distribution. All the pose-ratio fitness distributions (c) to (h) also have peaks near the true values.
In this section, the fitness distribution experiment verified that the fitness function, equation (11), can transform the PM ratio estimation and target pose detection problems into optimization problems. It also confirmed that the proposed method can estimate the 3D target pose by using stereo vision and one photo with an unknown PM ratio.

Pose estimation experiment with genetic algorithm and different photos
To verify the detection ability of the proposed expanded photo-model-based algorithm, pose and ratio detection experiments were conducted with different photos in real application scenarios. As shown in Figure 1, in this experiment the squid object floats on the water in the pool without pose constraints. The distance between S_H and S_M in the vertical direction is ^H z_M = ^W z_H − ^W z_M = 680 mm; in the other directions, ^H x_M and ^H y_M are unknown. While the fitness function transforms the main problem of recognizing the pose of an object and the ratio of a prepared photo into an optimization problem, computing full pose-ratio fitness distributions involves much computation. We choose the genetic algorithm (GA) as the optimization method to find the maximum fitness value because of its simplicity and effectiveness. 31,41 Because of limited space, the GA will not be introduced in detail here.
In previous studies, 31,41,42 the chromosome in the GA consisted of six variables representing possible pose solutions. However, for PM ratio detection, as shown in equation (16), each chromosome is elongated and now comprises seven variables. Thirty individuals are used in this experiment, where the chromosome of an individual consists of 68 bits. The first three variables (bits 1-30) are the jth model's position (^H x_M^j, ^H y_M^j, ^H z_M^j) in 3D space, and the middle three variables (bits 31-60) are the orientation (^H e_1M^j, ^H e_2M^j, ^H e_3M^j) based on S_H. The last variable (bits 61-68) is the PM ratio a_j of the jth model, represented by the 8-bit segment in equation (16). The GA specification is as follows: the number of genes is 30, the selection rate is 20%, the mutation rate is 50%, crossover is two-point, and the evolutionary strategy is elitism preservation. These parameters were adjusted through experimental tuning. As shown in Figure 13(a), the cameras capture the left and right images at an arbitrary moment. Each experiment was performed using one prepared photo. Figure 13(c.2) was downloaded from Bing Images (http://cn.bing.com/images); its PM ratio is unknown. Driven by the GA, the 3D models with random poses and ratios generated from the prepared photos (b.2) and (c.2) converge to the target objects in 3D space. The GA stops evolving after the 500th generation, and (b.1) shows the estimation results. [Caption residue of Figures 10 to 12: (c)-(e) fitness distribution with position-ratio scan, that is, x-a, y-a, and z-a, respectively; (f)-(h) fitness distribution with orientation-ratio scan, that is, e_1-a, e_2-a, and e_3-a, respectively. The rotations e_1, e_2, e_3 are quaternion components around the axes x_H, y_H, z_H of S_H depicted in Figure 7.]
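Decoding a 68-bit individual into the seven variables of equation (16) could look like the following sketch; the bit widths follow the text (10 bits per pose variable, 8 bits for the ratio), while the search ranges are assumptions for illustration:

```python
def decode(bits, pos_range=(-100.0, 100.0), z_range=(400.0, 600.0),
           ori_range=(-0.3, 0.3), a_range=(0.5, 4.0)):
    """Decode a 68-bit GA individual into (x, y, z, e1, e2, e3, a).
    bits is a list of 0/1 values; the ranges are assumed, not from
    the paper."""
    def to_real(chunk, lo, hi):
        # Map an unsigned binary chunk linearly onto [lo, hi].
        v = int("".join(map(str, chunk)), 2)
        return lo + (hi - lo) * v / (2 ** len(chunk) - 1)
    x  = to_real(bits[0:10],  *pos_range)   # ^H x_M^j
    y  = to_real(bits[10:20], *pos_range)   # ^H y_M^j
    z  = to_real(bits[20:30], *z_range)     # ^H z_M^j
    e1 = to_real(bits[30:40], *ori_range)   # orientation (quaternion)
    e2 = to_real(bits[40:50], *ori_range)
    e3 = to_real(bits[50:60], *ori_range)
    a  = to_real(bits[60:68], *a_range)     # PM ratio a_j (8 bits)
    return x, y, z, e1, e2, e3, a
```

Each generation, every individual is decoded this way, its photo-model is projected onto both images, and the fitness of equation (11) drives selection, crossover, and mutation.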
Table 1 summarizes the GA estimation results of the two experiments. The distance between the end effector and the target floating on the pool's water surface is 680 mm. The length and width of the target were measured with a manual tape measure. The experimental results for the two different photos are close to the actual values. Since they are two different-size photos corresponding to the same target object, their PM ratios differ. Photo 1 is the target's own photo, and its estimation result is closer to the true value than that with photo 2. Although photo 2 is not the target's photo and its shooting distance is unknown, the detection result using it is also close to the true value; the detected distance, object length, and width approach the true values. We can see that the expanded photo-model-based algorithm can detect the pose of objects using photos with unknown PM ratios in a practical application scenario.
Table 2 presents the relative error of the GA in estimating the distance and size of objects on the water surface. (Through perspective transformation, the projection results of the two models on the left and right images, corresponding to the estimated pose and ratio, are shown in Figure 13(b.1) and (b.2), respectively. The last row shows the measurement of the target with the tape measure. Even though photo 2 is not the target's photo, the detection result is near the actual value.) Based on the Table 1 detection results, when photo 1 is used, the distance detection absolute and relative errors and the length detection absolute and relative errors are calculated as listed in Table 2. The calculation of the relative error in the last row, when photo 2 is used, is performed in the same manner. However, it should be noted that photo 2 in Table 2 (Figure 13(c.2)) and the target object C02 in Figure 6(a) have some shape differences, although they belong to the same species. While using a photo of the same species can still recover the pose and size, there may be a slight increase in the width error ΔB.
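The relative errors in Table 2 follow the usual definition. For instance, a detected distance of 676.6 mm against the true 680 mm (illustrative values consistent with photo 1's −3.4 mm absolute error) gives −0.5%:

```python
def relative_error(detected, true):
    """Relative error (%) of a detected distance or size
    with respect to the tape-measured true value."""
    return (detected - true) / true * 100.0
```

The same formula applies to the length and width errors listed for both photos.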
For comparison, the performance of the expanded photo-model method is evaluated against other existing methods. The most common related research is size detection of marine creatures. Tuna research 27 makes use of stereo vision technology with finely constructed models. Fish research 14,34 utilizes laser sensors for dimensional measurement, offering increased precision at the expense of higher cost. In billfish and tuna research, 43 the size of caught fish on board is detected using a reference object of known size.
Data on the size and positioning detection of related binocular products in air have been included for comparison, 14,34 and similar studies on agricultural products in air have also been incorporated. 44,45 The results show that our research achieves high positioning accuracy but only average size detection accuracy. Overall, as can be seen from Table 2, our method is a low-cost and practical approach in terms of distance and size measurement.
In the previous subsection, the fitness distribution experiments verified the feasibility of the proposed expanded photo-model-based recognition method. The pose-ratio fitness distributions of the fitness function in Figures 10 to 12 have maximum peaks at the true poses of the targets and the true ratios of the prepared photos. These results prove that the problem of detecting the pose of a marine creature from a picture with unknown shooting conditions can be transformed into an optimization problem. Through the pose estimation experimental results in this subsection, it is confirmed that (1) the proposed expanded photo-model-based method can estimate a target object's pose by using stereo vision and only one photo with an unknown PM ratio, and (2) the fitness function, equation (11), can transform the target pose and PM ratio estimation problem into an optimization problem that the GA can solve.
We conducted the experiments with one target and two different photos to clarify that (3) the proposed method can detect both the pose and the size of an object in an actual application using just a single pre-prepared photo whose shooting distance is unknown.
The above three points are the contributions of this article and are verified by the pose estimation experiments.

Conclusion and future work
This study presents an expanded photo-model-based pose estimation method that overcomes the limitations of the previous fixed PM ratio approach. By utilizing photos with unknown PM ratios taken at unknown distances, the proposed method allows photo-models of different sizes to be generated from the same photo. Experimental results have demonstrated the effectiveness of this approach in detecting the pose and size of objects. Moving forward, further research and optimization efforts are necessary to improve the performance and efficiency of the method.
Indeed, the proposed expanded photo-model-based method is still in its early stages. The model shape can be further refined. The addition of the new PM ratio parameter has increased the computational complexity compared with the previous method, so the amount of calculation required for pose estimation also increases. It is therefore important to explore ways to optimize the algorithm and minimize the computational burden while maintaining accurate pose and size detection. Reducing the number of sampling points can improve speed, but accuracy may be affected. The real-time tracking performance of GA needs to be investigated further. A wider variety of experimental objects, both on the water surface and underwater, should be included to enhance the generalizability of the findings. In addition, marine organisms are treated as rigid bodies in this study. Although death weakens their deformation, they still deform on the water surface, so further experiments are crucial for determining the reliability of the proposed method.
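The speed-versus-accuracy trade-off above can be made concrete: per generation, every individual in the population scores every sampling point, so reducing the point count reduces cost proportionally. The sketch below is an illustrative assumption (the function names and population size are not from the paper):

```python
def downsample(points, step):
    # Keep every `step`-th sampling point: a simple way to trade
    # matching accuracy for evaluation speed.
    return points[::step]

def evaluations_per_generation(n_points, population_size):
    # Each individual scores every sampling point once per generation,
    # so cost grows linearly in both factors.
    return n_points * population_size

points = list(range(448))        # N_in + N_out = 448, as in Figure 13(b.2)
halved = downsample(points, 2)   # 224 points -> roughly half the cost
```

Halving the sampling points halves per-generation evaluation cost, at the price of a coarser hue-matching signal in the fitness function.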

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Figure 2. Perspective projection of the stereo vision system. In the 3D search space, the spatial plane model is projected onto the left and right images through perspective projection.

Figure 3. The 2D pixel photo-model generation process, described in (a)-(d): (a) a photograph with a target object (the squid) and the background, (b) a model surface space S_in constituted by the inner points group, (c) a model outside space S_out that envelops S_in, and (d) the generated model.

Figure 4. The prepared photo and the generated 2D pixel photo-model. (a) The photo size is 640 × 480 pixels. The photo-model is composed of the inner portion and the outer portion with sampling points. (b) The 2D pixel photo-model is only a small part of the photo that includes the target; the whole photo is not a model. Sampling points are collected at a certain interval. Its coordinate system is S_P, in pixels. The 2D model size in pixels is L_P = 386 (pixel) and B_P = 152 (pixel) in this situation.
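To make the role of the PM ratio concrete: with a pixels-per-millimeter ratio a, a pixel-model dimension converts to millimeters as size_mm = size_px / a. In the expanded method the ratio is unknown, so each candidate ratio yields a candidate metric model size. The function below is an illustrative sketch, not the paper's code:

```python
def model_size_mm(size_px, pm_ratio):
    # pm_ratio is pixels per millimeter; dividing converts a pixel-model
    # dimension into a candidate metric dimension.
    return size_px / pm_ratio

# For the 386 x 152 pixel model above, three candidate ratios give
# three candidate metric model sizes (length, width) in millimeters:
candidates = [(model_size_mm(386, a), model_size_mm(152, a))
              for a in (1.0, 2.0, 4.0)]
```

Each candidate size produces a differently scaled 3D photo-model, and the stereo-vision matching then selects the scale (and pose) that best fits the camera images.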

Figure 5. Summary of the calculation process from 2D pixel photo-model generation to the 3D photo-model's stereo vision perspective projection.

Figure 6. (a) Marine biological models. The three labels correspond to model number, English name, and size (unit: cm). (b) Photos of the marine biological models. Note that the model is only the part of each photo that includes the target, that is, the inside of the black frame in the photo.

Figure 7. Experimental environment for 3D model projection onto the stereo vision image planes. The pose of the object is fixed at H_M = [0, 0, 500, 0, 0, 0]^T. Two 3D photo-models of the same size are positioned at distances of 300 mm and 500 mm, respectively.

As Figure 8(b) shows, when the distance between the model and the object is close and their sizes are similar, the coincidence is higher. This characteristic inspired us to create a fitness function that uses coincidence to describe the resemblance in pose and size between a photo-model and the target object.

Figure 9(a) shows the left-image projection example of the jth model; the evaluation points of hue value are classified as illustrated in Figure 9(b).

Figure 9. Calculation of the matched degree of each point in the model space (S_L,in and S_L,out). (a) Evaluation position ^IL r_i^j, that is, the ith point of the jth model projected on the left image. (b) Classification of evaluation points (A)-(D): (A) points that satisfy the first case of equation (12), |H_IL(^IL r_i^j) − H_ML(^IL r_i^j)| ≤ 20, meaning that the inner model S_L,in overlaps the real target; (B) the inner model S_L,in overlaps the background; (C) the outer model S_L,out overlaps the background; and (D) the outer portion S_L,out overlaps the real target.
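A minimal sketch of the four-way classification in Figure 9(b), assuming the hue tolerance of 20 from the first case of equation (12). The scoring weights, the plain absolute hue difference (real hue is circular, e.g. 0-179 in 8-bit HSV), and the function names are illustrative assumptions, not the paper's equation:

```python
def classify(region, hue_image, hue_model, tol=20):
    # region: "in" for a point of S_L,in, "out" for a point of S_L,out.
    # A/B: inner point lands on the target / on the background.
    # D/C: outer point lands on the target / on the background.
    match = abs(hue_image - hue_model) <= tol
    if region == "in":
        return "A" if match else "B"
    return "D" if match else "C"

def point_score(label):
    # Illustrative weights: reward inner-on-target (A) and
    # outer-on-background (C); penalize outer-on-target (D).
    return {"A": 1.0, "B": 0.0, "C": 0.5, "D": -0.5}[label]
```

Summing point_score over all projected sampling points rewards models whose inner portion covers the target and whose outer envelope covers only background, which is the coincidence behavior the fitness function is built on.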

Figure 10. Fitness distribution of the C02 squid listed in Figure 6. (a) Left and right camera images. (b) Prepared photo for photo-model generation. (c)-(e) Fitness distribution with the position-ratio scan, that is, x-a, y-a, and z-a, respectively. (f)-(h) Fitness distribution with the orientation-ratio scan, that is, e_1-a, e_2-a, and e_3-a, respectively. The rotations e_1, e_2, e_3 are represented by quaternion components around the axes corresponding to x_H, y_H, z_H of S_H depicted in Figure 7.

Figure 11. Fitness distribution of the C02 squid. The size of the prepared photo (b) differs from that in Figure 10. (c)-(e) Fitness distribution with the position-ratio scan. (f)-(h) Fitness distribution with the orientation-ratio scan.

Figure 12. Fitness distribution of the C01 crab. The size of the prepared photo (b) is the same as that in Figure 11. (c)-(e) Fitness distribution with the position-ratio scan. (f)-(h) Fitness distribution with the orientation-ratio scan. In each subfigure of (c)-(h), the maximum fitness value and the coordinate giving that maximum are shown in text boxes.
(b.1) is the estimation result of the model generated from the photo (b.2), and (c.1) is the estimation result of the model generated from the photo (c.2). The average one-generation evolution time of the model generated from Figure 13(b.2) is 0.213 s with N_in + N_out = 448. Similarly, the average one-generation evolution time of the model generated from Figure 13(c.2) is 0.208 s with a total of 420 sampling points.

Figure 13. The 3D pose estimation results with GA and two different prepared photos. Figure 1 shows the experimental environment. (a) shows the original stereo images at one moment. The distance between S_H and S_M in the vertical direction is ^H z_M = ^W z_H − ^W z_M = 680 mm. At moment (a), with photo (b.2), the estimation result of GA is shown in (b.1). Using a photo (c.2) of the same category as the target, downloaded from Bing Images, the pose estimation result is shown in (c.1).

Table 1. The target detection results of GA.

Table 2. Relative error in distance (mm) and size (mm) for different methods.a
Billfish43 5.01
Tuna43 4.24
Fish14,34 10 1
Eggplant44 2.15 5.97
Pineapple45 1.17
Photo 1 (ours) −3.4 −0.5 −13.2 −6.3 −5 −6.1
Photo 2 (ours) −0.3 −0.04 −12.6 −6.0 −25.6 −31.0
a The most common research is the size detection of marine creatures. Since this research is specifically aimed at detecting objects on the water surface, we have also included data on binocular product size and positioning detection for comparison. The results demonstrate that our research achieves a high level of accuracy in positioning; in terms of size detection, our results fall within the average range.