Recently, Professor Zhang Zhenyue of the SMBU Faculty of Computational Mathematics and Cybernetics achieved a significant scientific breakthrough, publishing the research paper “Global understanding via local extraction for data clustering and visualization” in Patterns, a leading international journal and a sub-journal of Cell, as first author and with Shenzhen MSU-BIT University as the primary affiliation. The research addresses the critical challenges of clustering and visualizing complex unlabeled data. Through category-consistent local extraction, global propagation and self-learning, the GULE framework proposed in the paper achieves high-precision clustering (such as cell type identification from RNA-seq data) and topology-preserving visualization, offering new solutions for biomedicine and other fields and supporting the discovery of patterns in multi-disciplinary data.

In the era of big data, extracting latent categorical information from complex data is a major challenge for scientific research. Whether classifying cell types in biomedicine or analyzing user behavior in social networks, conventional clustering methods often rely on strong assumptions about data structure or distribution. Real data, however, are often highly complex and lack a clear distribution pattern, which limits the accuracy and robustness of existing algorithms. How to extract latent categories from local correlations in the original data, without depending on such assumptions, has become an urgent problem.

Fig. 1: GULE Framework Overview
Based on the core principle of “local consistency extraction - global propagation”, the GULE (Global Understanding via Local Extraction) framework adopts a two-layer self-learning structure to analyze category structure. The process has two core steps: 1. local extraction: discover category consistency from local correlations in the data, without prior assumptions about data structure; 2. global propagation: form the complete category partition through global diffusion and self-learning of the locally discovered consistency information. Through theoretical analysis, the researchers demonstrate that GULE can accurately recover the latent categories in data. The method also applies to data visualization, preserving the topological structure of categories during dimensionality reduction. Experimental results show that GULE achieves higher clustering accuracy and more reliable visualization than conventional methods, especially on complex data such as biomedical data.
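
The general idea of going from local relationships to a global partition can be illustrated with a small sketch. This is not the GULE algorithm: it swaps in a standard spectral bipartition (sign of the second eigenvector of a normalized graph Laplacian) as a stand-in, and the function names, the Gaussian affinity and the toy data are all assumptions made for illustration.

```python
import numpy as np

def make_two_blobs(n_per_blob=30, separation=6.0, seed=0):
    """Toy data (hypothetical): two 2-D Gaussian blobs, stacked one after the other."""
    rng = np.random.default_rng(seed)
    a = rng.normal(loc=0.0, scale=0.5, size=(n_per_blob, 2))
    b = rng.normal(loc=0.0, scale=0.5, size=(n_per_blob, 2))
    b[:, 0] += separation  # shift the second blob along the x-axis
    return np.vstack([a, b])

def local_affinity(X, sigma=1.0):
    """Local step: pairwise weights that decay quickly with distance."""
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W

def global_split(W):
    """Global step (stand-in, not GULE): two-way spectral partition of the graph."""
    degree = W.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(degree)
    laplacian = np.eye(W.shape[0]) - inv_sqrt[:, None] * W * inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    fiedler = eigvecs[:, 1] * inv_sqrt      # undo the degree scaling
    return (fiedler > 0).astype(int)

X = make_two_blobs()
labels = global_split(local_affinity(X))
```

No distribution model is fitted here; the partition is derived only from pairwise similarities, which is the property the article emphasizes.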

Fig. 2: Performance of brain cell clustering methods for mouse brain datasets
GULE provides a new solution for complex data processing through three key technologies. The first is adaptive graph partitioning (Acut), which uses the parameter β to balance maximizing intra-category connectivity against minimizing inter-category connectivity, so that it can adapt to datasets with different densities and structures. The second is progressive learning, which gradually refines category consistency through two layers of projections: the first layer processes sparse graphs built from the raw data, and the second layer further refines dense graphs built from the low-dimensional projections, thereby increasing clustering precision. The third is topology-preserving visualization, which integrates with t-SNE and similar techniques by combining the raw data with GULE projections, preserving topological structure during dimensionality reduction, such as the circular structure of the COIL20 dataset and the linear structure of the PIE dataset.
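
To make the role of a trade-off parameter such as β concrete, the sketch below scores a partition of a small weighted graph as intra-cluster weight minus β times inter-cluster weight. This is a hypothetical toy objective, not the Acut criterion from the paper, whose exact form is not given here; the function name and the example graph are assumptions.

```python
import numpy as np

def partition_score(W, labels, beta):
    """Hypothetical score (not the paper's Acut): intra weight - beta * inter weight.

    W is a symmetric weight matrix with a zero diagonal.
    """
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    # Each undirected edge appears twice in a symmetric matrix, so halve the sums.
    intra = W[same].sum() / 2.0
    inter = W[~same].sum() / 2.0
    return intra - beta * inter

# Two tightly linked pairs (0-1 and 2-3) joined by one weak edge (0-2).
W = np.array([
    [0.0, 1.0, 0.1, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])

good = partition_score(W, [0, 0, 1, 1], beta=1.0)  # cuts only the weak edge
bad = partition_score(W, [0, 1, 0, 1], beta=1.0)   # cuts both strong edges
```

A larger β penalizes cut edges more heavily. A practical criterion would also need a balance or normalization term, since placing every node in a single cluster trivially maximizes this toy score.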

Fig. 3: Topology-enhanced data visualization via GULE projection
The core breakthrough of GULE lies in freeing itself from conventional assumptions about data distribution and extracting global patterns purely from local correlations. This concept offers a new approach to processing unstructured real-world data. The research advances unsupervised learning and provides a practical tool for processing complex data across disciplines. In the future, GULE may become an important foundation for data-driven research, offering new insights for diverse applications in biology, medicine and other fields.
Link to the paper:
https://www.cell.com/patterns/fulltext/S2666-3899(25)00114-X