Improving vision-based robotic manipulation with affordance understanding
The objective of this thesis is to improve robotic manipulation via vision-based affordance understanding, which would advance the application of robotics to industrial use cases and benefit the area of assistive robotics. Research literature related to manipulation focuses primarily on the grasp affordance because of its essential role in robotic manipulation. Recent methods for grasp recognition emphasize data-driven machine learning techniques, which improve the generalizability of grasp representations to novel objects but fall short of real-time robotic application due to computational cost. Beyond the grasp affordance, real-world applications of robotic manipulation involve exploiting more general affordances. Recent studies on detecting object affordances in images address the problem at the pixel level. Though it achieves state-of-the-art performance, the per-pixel segmentation approach requires labor-intensive manual annotation. Furthermore, existing affordance datasets contain a relatively small number of objects compared to classification datasets. The learned affordances thus apply to a limited object set, while a larger variety of objects is likely to be encountered in the real world. Lastly, a single object part may possess multiple affordances in practice, as affordances define action opportunities for manipulation. Recent literature focuses on single-affordance prediction, whereas ranked affordances for an object part offer greater flexibility in achieving goal-oriented robotic manipulation tasks. In this thesis, we focus on vision-based manipulation for real-time robotic application. A series of methods is proposed to improve applicability to practical scenarios. Specifically, we tackle the problem of identifying viable candidate robotic grasps of objects, and seek more general affordance map prediction methods with reduced annotation costs.
We also aim to generalize learned affordances to unseen categories and to predict multiple ranked affordances for each object part. We aim to narrow the gap between visual detection and robotic manipulation by linking action primitives to task execution in the real world. To account for the various shapes and poses of objects in universal grasp identification, a CNN-based architecture is adopted to learn grasp representations without hand-engineered features. Unlike regression methods, this architecture breaks the identification of grasp configurations into a grasp detection process followed by a more refined grasp orientation classification process, with both processes embedded within two coupled networks. To reduce the labor-intensive annotation cost, learning from labelled synthetic data together with unlabelled real images is considered. To maintain the advantage of jointly optimizing detection and affordance prediction, labelled synthetic data is applied and jointly adapted to unlabelled real images for detection and affordance segmentation. To preserve the advantages of an object-based method while generalizing to unseen categories, a binary classification mode is added for objectness detection and localization. The proposed architecture further adopts KL-divergence to learn affordance distributions, instead of cross entropy against a single-label ground truth on each pixel, enabling prediction of multiple ranked affordances for one object part. Affordance prediction is further improved by the proposed branch-wise attention module and attribute-like auxiliary task. A system combining the proposed affordance detector with a pre-trained object detector illustrates its use with the Planning Domain Definition Language (PDDL) in practical robotic manipulation applications. Through this research, we study vision-based robotic affordance learning for real-world manipulation scenarios.
Methods for identifying graspable areas as well as general affordances are made applicable to robotic manipulation, reducing training overhead and improving inference efficiency. In real-world scenarios, the functionalities or tools desired for a goal-oriented task may be absent. To compensate, methods for affordance ranking, generalization to unseen categories, and vision architecture improvements are studied, enhancing flexibility in practical manipulation. In Chapter 2, a multi-object grasping architecture is introduced to handle situations where no, one, or multiple objects are seen, while achieving state-of-the-art results on standard benchmarks and in physical robot experiments. In Chapter 3, an affordance segmentation architecture is introduced that enables unsupervised adaptation of annotations from synthetic data while achieving performance comparable to fully supervised approaches. Considering more realistic scenarios where one object part may support multiple affordances, Chapter 4 extends the approach to multiple ranked affordances and generalizes it to unseen categories. In Chapter 5, a branch-wise attention module and an attribute-like auxiliary task are introduced to improve detection performance on unseen categories. Further integration with an object detector and PDDL is introduced to demonstrate applicability in real-world robotic manipulation. Lastly, a case study of the system design is presented in Chapter 6, in which the proposed components are evaluated with human subjects to complete this research.
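
The abstract contrasts KL-divergence over a per-pixel affordance distribution with cross entropy against a single label. The following is a minimal sketch of that idea for one pixel, not the thesis's implementation: the affordance labels, the soft target values, and the logits are hypothetical, and the helper names (`softmax`, `kl_divergence`, `ranked_affordances`) are introduced here for illustration only.

```python
import math

# Hypothetical affordance classes, chosen for illustration only.
LABELS = ["grasp", "cut", "pound", "contain"]

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions.

    Terms with zero target mass contribute nothing to the sum.
    """
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, predicted)
        if t > 0
    )

def ranked_affordances(probs, labels):
    """Return labels sorted from most to least probable."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [labels[i] for i in order]

# Assumed soft target for one pixel: several affordances share the mass,
# which encodes a ranking rather than a single hard label.
target = [0.6, 0.1, 0.3, 0.0]

# Assumed network output (logits) for the same pixel.
logits = [2.0, 0.1, 1.2, -1.5]
pred = softmax(logits)

loss = kl_divergence(target, pred)
ranking = ranked_affordances(pred, LABELS)

# With a one-hot target, the KL term reduces to the usual cross entropy
# (negative log-probability of the labelled class).
one_hot_loss = kl_divergence([1.0, 0.0, 0.0, 0.0], pred)
```

Under these assumptions, a soft target lets the loss reward the whole ordering of affordances on an object part, whereas a one-hot target only rewards the single labelled class.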