Learning hand latent features for unsupervised 3D hand pose estimation

Journal: Journal of Autonomous Intelligence. DOI: 10.32629/jai.v2i1.36

Jamal Firmat Banzi1, Isack Bulugu2, Zhongfu Ye3

1. School of Information Science and Technology, University of Science and Technology of China, 230026, China
2. Sokoine University of Agriculture, Morogoro, 3167, Tanzania
3. College of Information and Communication Technology, University of Dar es Salaam, Dar es Salaam, 33335, Tanzania

Abstract

Recent hand pose estimation methods require large amounts of annotated training data to extract dynamic information from a hand representation. However, precise and dense annotations on real data are difficult to obtain, and the volume of information that must be passed to the training algorithm is correspondingly large. This paper presents an approach to developing a hand pose estimation system that accurately regresses a 3D pose in an unsupervised manner. The process is performed in three stages. First, the hand is modelled by a novel latent tree dependency model (LTDM), which transforms internal joint locations into an explicit representation. Second, predictive coding of image sequences of hand poses is performed to capture the latent features underlying a given image without supervision, and a mapping is then established between the depth image and the generated representation. Third, the hand joints are regressed using convolutional neural networks to estimate the latent pose from a given depth map, while an unsupervised error term, which is part of the recurrent architecture, ensures smooth estimation of the final pose. To demonstrate the performance of the proposed system, a complete experiment is conducted on three challenging public datasets: ICVL, MSRA, and NYU. The empirical results show that our method performs comparably to or better than state-of-the-art approaches.
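For concreteness, the sketch below illustrates the general shape of such a pipeline: a convolutional encoder over depth maps, a recurrent predictor whose next-frame prediction error serves as the unsupervised training signal, and a small head that regresses 3D joint positions from the latent features. This is a minimal illustrative sketch, not the authors' implementation; the layer sizes, the 21-joint skeleton, the 96x96 depth resolution, and the use of a GRU as the recurrent predictor are assumptions introduced here for exposition.

```python
# Minimal sketch (assumptions throughout; not the authors' released code) of an
# unsupervised predictive-coding encoder over depth-map sequences plus a CNN
# head that regresses 3D hand joints.
import torch
import torch.nn as nn

NUM_JOINTS = 21          # assumed hand-joint count (ICVL/NYU-style skeleton)
LATENT_DIM = 128         # assumed latent feature size


class DepthEncoder(nn.Module):
    """Convolutional encoder mapping a 1x96x96 depth map to a latent vector."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, LATENT_DIM)

    def forward(self, depth):
        return self.fc(self.features(depth).flatten(1))


class PredictiveCodingPose(nn.Module):
    """Recurrent predictor of the next latent state; its prediction error gives
    the unsupervised term, and a linear head regresses the 3D pose."""
    def __init__(self):
        super().__init__()
        self.encoder = DepthEncoder()
        self.rnn = nn.GRU(LATENT_DIM, LATENT_DIM, batch_first=True)
        self.pose_head = nn.Linear(LATENT_DIM, NUM_JOINTS * 3)

    def forward(self, depth_seq):
        # depth_seq: (batch, time, 1, H, W)
        b, t = depth_seq.shape[:2]
        z = self.encoder(depth_seq.flatten(0, 1)).view(b, t, -1)
        z_pred, _ = self.rnn(z)                                   # predicted next latent state
        pred_error = (z_pred[:, :-1] - z[:, 1:]).pow(2).mean()    # unsupervised loss term
        poses = self.pose_head(z).view(b, t, NUM_JOINTS, 3)       # per-frame 3D joints
        return poses, pred_error


if __name__ == "__main__":
    model = PredictiveCodingPose()
    seq = torch.randn(2, 8, 1, 96, 96)        # toy batch of depth sequences
    poses, unsup_loss = model(seq)
    print(poses.shape, unsup_loss.item())      # torch.Size([2, 8, 21, 3])
```

In this sketch the prediction error over latent states plays the role of the unsupervised smoothing term described in the abstract: it can be minimised on unlabeled depth sequences alone, or combined with a supervised joint-position loss when annotations are available.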

Keywords

Hand pose estimation; Convolutional neural networks; Recurrent neural networks; Human-machine interaction; Predictive coding; Unsupervised learning


Copyright © 2019 Jamal Firmat Banzi, Isack Bulugu, Zhongfu Ye

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.