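  • Information-Science Principles of Machine Learning: A Causal-Chain Meta-Framework Based on Formalized Information Mapping

    Category: Computer Science >> Integration Theory of Computer Science Submitted: 2025-07-14

    Abstract: [Purpose] This paper addresses the lack of a unified formal theoretical framework, of interpretability, and of ethical safety guarantees in current machine learning. [Method] It first builds a formal information model, using sets of well-formed formulas to explicitly define the ontological states and carrier mappings of the typical stages of machine learning, and introduces learnable and processable predicates together with learning and processing functions to analyze the logical deduction and constraint rules of the model's causal chain. [Results] A machine learning theory meta-framework, MLT-MF, is constructed. On this basis, universal definitions of model interpretability and ethical safety are established, and three important theorems are proved, covering model interpretability and information recoverability, ethical safety guarantees, and generalization error estimation. [Limitations] The current framework assumes noise-free enabling mappings of information under ideal conditions and mainly targets model learning and processing logic in static scenarios; it does not yet cover information fusion and conflict resolution across ontological spaces in multimodal and multi-agent systems. [Conclusions] The paper moves beyond the limits of fragmented research and provides a unified theoretical basis for systematically addressing the key problems that machine learning currently faces.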

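  • The Emotional Foundations of AI-Human Interaction: Theoretical Insights from Evolutionary Continuity and Interspecies Emotional Communication

    Category: Psychology >> Applied Psychology Category: Computer Science >> Integration Theory of Computer Science Submitted: 2025-05-10

    Abstract: The approaching era of artificial general intelligence (AGI) prompts us to reassess the interaction between AI and humans, particularly through emotional communication. This study synthesizes insights from evolutionary biology, comparative psychology, and AI development, and advocates a paradigm shift beyond traditional human-like cognitive processes. It emphasizes the universality of emotional pathways, which is evident across species. We introduce three models of emotional interaction: the emotional threshold model, the dynamic set-point model, and the emotional schema model, all derived from an in-depth analysis of interspecies emotional interaction phenomena and their possible mechanisms. These models provide a roadmap for designing AI interfaces that fit human emotional experience and clarify how trust, intuition, and mutual recognition can be established between machines and humans. By further clarifying the concept of a "large emotion model," we envision a future in which AI can not only interpret but also understand the emotions of its human partners, paving the way for a revolutionary paradigm of collaboration between AI and humans.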

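  • Logographic AI vs. Phonographic AI: A New AI Paradigm and a Decolonization Manifesto

    Category: Linguistics and Applied Linguistics >> Linguistics and Applied Linguistics Category: Computer Science >> Integration Theory of Computer Science Submitted: 2025-03-07

    Abstract: Questioning existing theoretical frameworks of AI, this paper proposes dividing the concept of "AI" into logographic AI (LAI, Logographic AI) and phonographic AI (PAI, Phonographic AI). Existing AI theory is built on phonographic writing, so logographic scripts (such as Chinese characters) are forced into a colonial compromise and cannot realize their inherent advantages. The paper proposes a new theoretical framework for logographic AI, introducing core concepts such as the morpho-root (M-Root, Morpho-Root), Morpho-Structural Entropy, and the Hanzi Entropy Field (HEF), and reveals the advantages of logographic scripts in information density, cultural adaptability, and cognitive efficiency. Logographic AI not only resists the linguistic hegemony of phonographic AI but also lays the foundation for a Chinese-language "dimensionality-reduction strike" in the global AI landscape. The paper holds that diversity of writing systems is diversity of intelligence, and it proposes parallel form-sound algorithms and chip designs to promote the complementarity of logographic AI and phonographic AI and to achieve a quantum leap of civilization.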

  • Solving the all pairs shortest path problem after minor update of a large dense graph

    Category: Computer Science >> Integration Theory of Computer Science Submitted: 2025-02-05

    Abstract: The all-pairs shortest path problem is a fundamental optimization problem in graph theory. We deal with re-calculating the all-pairs shortest path (APSP) matrix after a minor modification of a weighted dense graph, e.g., adding a node, removing a node, or updating an edge. We assume the APSP matrix for the original graph is already known. The graph can be directed or undirected. A cold-start calculation of the new APSP matrix by traditional algorithms, like the Floyd-Warshall algorithm or Dijkstra’s algorithm, needs $O(n^3)$ time. We propose two algorithms for warm-start calculation of the new APSP matrix. The best-case complexity of a warm-start calculation is $O(n^2)$, and the worst-case complexity is $O(n^3)$. We implemented the algorithms and tested their performance experimentally. The results show that a warm-start calculation can save a large portion of the calculation time compared with a cold-start calculation.
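
    The warm-start idea can be illustrated with a minimal Python sketch. It is not the paper's two algorithms, which the abstract does not spell out. It covers only one easy case, under stated assumptions: a single directed edge u -> v is added or has its weight lowered, and the graph has no negative cycles. The names `floyd_warshall` and `apsp_after_edge_decrease` are hypothetical.

```python
INF = float("inf")


def floyd_warshall(weights):
    """Cold start: O(n^3) APSP from a weight matrix (INF means no edge)."""
    n = len(weights)
    dist = [row[:] for row in weights]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def apsp_after_edge_decrease(dist, u, v, w):
    """Warm start: O(n^2) update after edge u -> v is added or lowered to w.

    Assumption: no negative cycles, so a new shortest path uses the
    changed edge at most once and has the form i ~> u -> v ~> j.
    """
    n = len(dist)
    new_dist = [row[:] for row in dist]
    for i in range(n):
        via_u = dist[i][u] + w  # best cost of reaching v through the edge
        for j in range(n):
            candidate = via_u + dist[v][j]
            if candidate < new_dist[i][j]:
                new_dist[i][j] = candidate
    return new_dist


if __name__ == "__main__":
    W = [
        [0, 4, INF, 10],
        [INF, 0, 1, INF],
        [INF, INF, 0, 1],
        [INF, INF, INF, 0],
    ]
    old = floyd_warshall(W)
    updated = apsp_after_edge_decrease(old, 0, 2, 1)
    print(updated[0])
```

    An increased edge weight or a removed node is harder, because previously optimal paths may become invalid; that is presumably where the $O(n^3)$ worst case mentioned in the abstract comes from.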

  • Multiple Mutation Strategies Differential Evolution With the Best Individuals Allocated to the Best Performer Among the Strategies

    Category: Computer Science >> Integration Theory of Computer Science Submitted: 2024-12-10

    Abstract: Real-parameter single-objective optimization has been a prominent field for decades. Recently, long-term search for real-parameter single-objective optimization has attracted wide attention, because solving difficulty always scales exponentially with the dimensionality of the solution space. So far, a number of population-based metaheuristics have been proposed. Among these algorithms, IMODE - a differential evolution algorithm based on three mutation strategies and binomial or exponential crossover - demonstrates good performance. In this paper, by revising IMODE, we propose multiple mutation strategies Differential Evolution with the Best Individuals allocated to the Best performer among the Strategies - BIBSDE. Altogether, we make five revisions to algorithm behavior and one change to the parameter setting. The most important revision is that, during execution, the current best individuals are allocated, for the next generation, to the best performer among the three mutation strategies as a reward. Experimental results show that BIBSDE performs better than, or at least no worse than, existing population-based metaheuristics for long-term search. In addition, each measure we propose is effective in improving performance.

  • An efficient implementation for solving the all pairs minimax path problem in an undirected dense graph

    Category: Computer Science >> Integration Theory of Computer Science Submitted: 2024-12-05

    Abstract: We provide an efficient $O(n^2)$ implementation for solving the all-pairs minimax path problem, or widest path problem, in an undirected dense graph. It is a code implementation of Algorithm 4 (MMJ distance by Calculation and Copy) from a previous paper. The distance matrix is also called the all points path distance (APPD). We conducted experiments to test the implementation and algorithm, and compared it with several other algorithms for solving the APPD matrix. The results show that Algorithm 4 works well for solving the widest path or minimax path APPD matrix. It can drastically improve the efficiency of computing the APPD matrix. There are several theoretical results which claim the APPD matrix can be solved exactly in $O(n^2)$. However, they are impractical because there is no code implementation of these algorithms. Algorithm 4 appears to be the first algorithm with an actual code implementation for solving the APPD matrix of the minimax path or widest path problem in $O(n^2)$ in an undirected dense graph.
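
    For readers unfamiliar with the problem, the sketch below shows one well-known way to reach $O(n^2)$ for all-pairs minimax distances in a dense undirected graph. It is not claimed to be the paper's Algorithm 4. It relies on the standard fact that a minimax path between two vertices lies on a minimum spanning tree, and it fills the matrix while growing the tree with Prim's algorithm. The function name `all_pairs_minimax` is hypothetical, and the graph is assumed to be connected and given as a symmetric weight matrix.

```python
def all_pairs_minimax(weights):
    """Return the matrix of minimax path distances for a connected graph.

    weights: symmetric n x n matrix of edge weights (dense graph).
    Each new vertex v joins the tree through its cheapest edge (p, v);
    for every vertex t already in the tree, the tree path t ~> v is the
    path t ~> p followed by that edge, so
    minimax[t][v] = max(minimax[t][p], weight(p, v)).
    """
    n = len(weights)
    inf = float("inf")
    minimax = [[0] * n for _ in range(n)]
    in_tree = [False] * n
    best = [inf] * n    # cheapest edge from each outside vertex to the tree
    parent = [-1] * n   # tree endpoint of that cheapest edge
    order = []          # vertices already in the tree
    best[0] = 0
    for _ in range(n):
        # O(n) selection keeps the whole algorithm at O(n^2).
        v = min((x for x in range(n) if not in_tree[x]), key=lambda x: best[x])
        p = parent[v]
        if p >= 0:
            for t in order:
                m = max(minimax[t][p], best[v])
                minimax[t][v] = m
                minimax[v][t] = m
        in_tree[v] = True
        order.append(v)
        for x in range(n):
            if not in_tree[x] and weights[v][x] < best[x]:
                best[x] = weights[v][x]
                parent[x] = v
    return minimax


if __name__ == "__main__":
    W = [
        [0, 1, 4, 7],
        [1, 0, 2, 6],
        [4, 2, 0, 3],
        [7, 6, 3, 0],
    ]
    for row in all_pairs_minimax(W):
        print(row)
```

    Each of the n iterations does O(n) work for selection, matrix filling, and key updates, which gives the $O(n^2)$ total.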

  • FairSort: Learning to Fair Rank for Personalized Recommendations in Two-Sided Platforms

    Category: Computer Science >> Integration Theory of Computer Science Submitted: 2024-12-03

    Abstract: Traditional recommendation systems focus on maximizing user satisfaction by suggesting their favorite items. This user-centric approach may lead to unfair exposure distribution among the providers. On the contrary, a provider-centric design might become unfair to the users. Therefore, this paper proposes a re-ranking model, FairSort, to find a trade-off solution among user-side fairness, provider-side fairness, and personalized recommendations utility. Previous works habitually treat this issue as a knapsack problem, incorporating both-side fairness as constraints. In this paper, we adopt a novel perspective, treating each recommendation list as a runway rather than a knapsack. In this perspective, each item on the runway gains a velocity and runs within a specific time, achieving re-ranking for both-side fairness. Meanwhile, we ensure the Minimum Utility Guarantee for personalized recommendations by designing a Binary Search approach. This can provide more reliable recommendations compared to the conventional greedy strategy based on the knapsack problem. We further broaden the applicability of FairSort, designing two versions for online and offline recommendation scenarios. Theoretical analysis and extensive experiments on real-world datasets indicate that FairSort can ensure more reliable personalized recommendations while considering fairness for both the provider and user.

  • A New Index for Clustering Evaluation Based on Density Estimation

    Category: Computer Science >> Integration Theory of Computer Science Submitted: 2024-06-18

    Abstract: A new index for internal evaluation of clustering is introduced. The index is defined as a mixture of two sub-indices. The first sub-index $I_a$ is called the Ambiguous Index; the second sub-index $I_s$ is called the Similarity Index. Calculation of the two sub-indices is based on density estimation for each cluster of a partition of the data. An experiment is conducted on a set of 145 datasets to test the performance of the new index and to compare it with six other internal clustering evaluation indices: the Calinski-Harabasz index, Silhouette coefficient, Davies-Bouldin index, CDbw, DBCV, and VIASCKDE. The results show that the new index significantly improves on the other internal clustering evaluation indices.

  • Federated Learning based on Pruning and Recovery

    Category: Computer Science >> Integration Theory of Computer Science Submitted: 2024-03-16

    Abstract: A novel federated learning training framework for heterogeneous environments is presented, taking into account the diverse network speeds of clients in realistic settings. This framework integrates asynchronous learning algorithms and pruning techniques, effectively addressing the inefficiencies of traditional federated learning algorithms in scenarios involving heterogeneous devices, as well as tackling the staleness issue and inadequate training of certain clients in asynchronous algorithms. Through the incremental restoration of model size during training, the framework expedites model training while preserving model accuracy. Furthermore, enhancements to the federated learning aggregation process are introduced, incorporating a buffering mechanism to enable asynchronous federated learning to operate akin to synchronous learning. Additionally, optimizations in the process of the server transmitting the global model to clients reduce communication overhead. Our experiments across various datasets demonstrate that: (i) significant reductions in training time and improvements in convergence accuracy are achieved compared to conventional asynchronous FL and HeteroFL; (ii) the advantages of our approach are more pronounced in scenarios with heterogeneous clients and non-IID client data.

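  • Handwritten Digit Recognition Based on Deep Convolutional Networks

    Category: Computer Science >> Integration Theory of Computer Science Submitted: 2024-01-07

    Abstract: Artificial neural networks can describe highly nonlinear relationships, a property that has led to their increasingly wide study and application, with classification as the main application area. Classification rests on classifying features, so the features of the samples must be extracted first. A typical convolutional neural network consists of an input layer, convolutional layers, pooling layers, activation layers, and fully connected layers connected in a certain order. The input layer provides the input to the whole network; in this design, the training and inference data are single-channel grayscale images of 30×30 pixels.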

  • Delving into Semantic Scale Imbalance

    Category: Computer Science >> Integration Theory of Computer Science Submitted: 2023-02-16

    Abstract: Model bias triggered by long-tailed data has been widely studied. However, measures based on the number of samples cannot explain three phenomena simultaneously: (1) Given enough data, the classification performance gain from additional samples is marginal. (2) When data are insufficient, classification performance decays precipitously as the number of training samples decreases. (3) Models trained on sample-balanced datasets still have different biases for different classes. In this work, we define and quantify the semantic scale of classes, which is used to measure the feature diversity of classes. We find experimentally that there is a marginal effect of semantic scale, which describes the first two phenomena well. Further, a quantitative measurement of semantic scale imbalance is proposed, which can accurately reflect model bias on multiple datasets, even on sample-balanced data, revealing a novel perspective for the study of class imbalance. Due to the prevalence of semantic scale imbalance, we propose semantic-scale-balanced learning, including a general loss improvement scheme and a dynamic re-weighting training framework that overcomes the challenge of calculating semantic scales in real time during iterations. Comprehensive experiments show that dynamic semantic-scale-balanced learning consistently enables the model to achieve superior performance on large-scale long-tailed and non-long-tailed natural and medical datasets, which is a good starting point for mitigating the prevalent but unnoticed model bias. In addition, we look ahead to future challenges.

  • Geometric Prior Guided Feature Representation Learning for Long-Tailed Classification

    Category: Computer Science >> Integration Theory of Computer Science Submitted: 2023-02-16

    Abstract: Real-world data are long-tailed, and the lack of tail samples significantly limits the generalization ability of the model. Although numerous class re-balancing approaches perform well for moderate class imbalance problems, additional knowledge needs to be introduced to help the tail class recover the underlying true distribution when the observed distribution from a few tail samples does not represent its true distribution properly, thus allowing the model to learn valuable information outside the observed domain. In this work, we propose to leverage the geometric information of the feature distribution of the well-represented head class to guide the model to learn the underlying distribution of the tail class. Specifically, we first systematically define the geometry of the feature distribution and the similarity measures between the geometries, and discover four phenomena regarding the relationship between the geometries of different feature distributions. Then, based on these four phenomena, feature uncertainty representation is proposed to perturb the tail features by utilizing the geometry of the head class feature distribution. It aims to make the perturbed features cover the underlying distribution of the tail class as much as possible, thus improving the model's generalization performance in the test domain. Finally, we design a three-stage training scheme enabling feature uncertainty modeling to be successfully applied. Experiments on CIFAR-10/100-LT, ImageNet-LT, and iNaturalist2018 show that our proposed approach outperforms other similar methods on most metrics. In addition, the experimental phenomena we discovered are able to provide new perspectives and theoretical foundations for subsequent studies. The code will be available at https://github.com/mayanbiao1234/Geometric-Prior

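  • Design and Implementation of an FPGA-Based QoS Guarantee Algorithm for SDN

    Category: Computer Science >> Integration Theory of Computer Science Submitted: 2023-02-15 Partner journal: Journal of Guilin University of Electronic Technology

    Abstract: Traditional networks increasingly struggle to cope with complex network structures, which gave rise to a new network architecture, software-defined networking (SDN). Traffic in SDN data centers consists mainly of long flows and short flows. Long flows last a long time, are insensitive to delay, and require high bandwidth; short flows last a short time, are highly delay-sensitive, and require little bandwidth. Short flows carry less than 20% of the total traffic volume but account for more than about 80% of the total number of flows; long flows carry more than 80% of the total traffic volume but account for less than 20% of the total number of flows. Studies have found that long flows are often ahead of short flows in egress port queues, so short flows wait a long time and network congestion arises easily. Based on the characteristics of the two kinds of flows, a queuing mechanism and a route-optimization guarantee mechanism are proposed: short flows are placed in a high-priority queue, which the SDN controller schedules first; long flows are placed in a low-priority queue and compensated by a route guarantee algorithm. The route guarantee algorithm first removes links that do not satisfy the long flow's bandwidth requirement and then computes the minimum-delay path. To improve the efficiency of the algorithm, an FPGA and 10-Gigabit Ethernet are used to simulate traffic flows in SDN, and simulation on the FPGA verifies the design's optimization of network delay and bandwidth as well as the advantage of FPGA parallel computing.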
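
    The two-step route guarantee described in the abstract (drop links that cannot carry the long flow, then find the minimum-delay path) can be sketched in software as follows. This is only an illustration under assumptions: the paper's implementation targets an FPGA, and the link tuple layout `(u, v, bandwidth, delay)`, the function name `min_delay_path`, and the use of Dijkstra's algorithm for the second step are choices made here, not details taken from the source.

```python
import heapq


def min_delay_path(links, src, dst, required_bw):
    """Return (total_delay, path) for a flow needing required_bw, or None.

    links: iterable of directed links (u, v, bandwidth, delay), delay >= 0.
    Step 1 removes links whose bandwidth is below the requirement.
    Step 2 runs Dijkstra's algorithm on delay over the remaining links.
    """
    graph = {}
    for u, v, bandwidth, delay in links:
        if bandwidth >= required_bw:  # step 1: bandwidth filter
            graph.setdefault(u, []).append((v, delay))

    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:  # step 2: minimum-delay search
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        if node == dst:
            break
        for nxt, delay in graph.get(node, []):
            nd = d + delay
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))

    if dst not in dist:
        return None  # no path satisfies the bandwidth requirement
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[dst], path


if __name__ == "__main__":
    example_links = [
        ("A", "B", 10, 1),
        ("B", "D", 2, 1),
        ("A", "C", 10, 2),
        ("C", "D", 10, 3),
    ]
    print(min_delay_path(example_links, "A", "D", required_bw=5))
```

    With the example links, a flow that needs 5 units of bandwidth cannot use the B -> D link, so it is routed over the slower A -> C -> D path.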

  • Toward Training and Assessing Reproducible Data Analysis in Data Science Education

    Category: Computer Science >> Integration Theory of Computer Science Submitted: 2022-11-29 Partner journal: Data Intelligence

    Abstract: Reproducibility is a cornerstone of scientific research, and data science is no exception. In recent years, scientists have been concerned about the large number of irreproducible studies. Such a reproducibility crisis could severely undermine public trust in science and science-based public policy. Recent efforts to promote reproducible research have focused mainly on established scientists and much less on student training. In this study, we conducted action research on data science students to evaluate to what extent they are ready to communicate reproducible data analysis. The results show that although two-thirds of the students claimed they were able to reproduce results in peer reports, only one-third of the reports provided all the information necessary for replication. The actual replication results also include conflicting claims; some lacked comparisons of original and replication results, indicating that some students did not share a consistent understanding of what reproducibility means and how to report replication results. The findings suggest that more training is needed to help data science students communicate reproducible data analysis.

  • Paving the Way to Open Data

    Category: Computer Science >> Integration Theory of Computer Science Submitted: 2022-11-29 Partner journal: Data Intelligence

    Abstract: It is easy to argue that open data is critical to enabling faster and more effective research discovery. In this article, we describe the approach we have taken at Wiley to support open data and to start enabling more data to be FAIR data (Findable, Accessible, Interoperable and Reusable) with the implementation of four data policies: Encourages, Expects, Mandates, and Mandates and Peer Reviews Data. We describe the rationale for these policies and levels of adoption so far. In the coming months we plan to measure and monitor the implementation of these policies via the publication of data availability statements and data citations. With this information, we'll be able to celebrate adoption of data-sharing practices by the research communities we work with and serve, and we hope to showcase researchers from those communities leading in open research.

  • Playing Well on the Data FAIRground: Initiatives and Infrastructure in Research Data Management

    Category: Computer Science >> Integration Theory of Computer Science Submitted: 2022-11-29 Partner journal: Data Intelligence

    Abstract: Over the past five years, Elsevier has focused on implementing FAIR and best practices in data management, from data preservation through reuse. In this paper we describe a series of efforts undertaken in this time to support proper data management practices. In particular, we discuss our journal data policies and their implementation, the current status and future goals for the research data management platform Mendeley Data, and clear and persistent linkages to individual data sets stored on external data repositories from corresponding published papers through partnership with Scholix. Early analysis of our data policies implementation confirms significant disparities at the subject level regarding data sharing practices, with most uptake within disciplines of Physical Sciences. Future directions at Elsevier include implementing better discoverability of linked data within an article and incorporating research data usage metrics.

  • Knowledge Graph Construction and Applications for Web Search and Beyond

    Category: Computer Science >> Integration Theory of Computer Science Submitted: 2022-11-29 Partner journal: Data Intelligence

    Abstract: Knowledge graphs (KGs) have played an important role in enhancing the performance of many intelligent systems. In this paper, we introduce the solution for building a large-scale multi-source knowledge graph from scratch at Sogou Inc., including its architecture, technical implementation and applications. Unlike previous works that build knowledge graphs with graph databases, we build the knowledge graph on top of SogouQdb, a distributed search engine developed by the Sogou Web Search Department, which can be easily scaled to support petabytes of data. As a supplement to the search engine, we also introduce a series of models to support inference and graph-based querying. Currently, the data of the Sogou knowledge graph, which are collected from 136 different websites and constantly updated, consist of 54 million entities and over 600 million entity links. We also introduce three applications of the knowledge graph at Sogou Inc.: entity detection and linking, knowledge-based question answering, and a knowledge-based dialogue system. These applications have been used in Web search products to help users acquire information more efficiently.

  • Building a Holistic Taxonomy Model for OGD-Related Risks: Based on a Lifecycle Analysis

    Category: Computer Science >> Integration Theory of Computer Science Submitted: 2022-11-29 Partner journal: Data Intelligence

    Abstract: For many government departments, uncertainty aversion is a source of barriers in the advancement of data openness. A more active response to potential risks is needed and necessitates an in-depth examination of risks related to open government data (OGD). With a cross-case study in which three cases from the United Kingdom, the United States and China are examined, this study identifies potential risks that might emerge at different stages of the lifecycle of OGD programs and constructs a taxonomy model for them. The taxonomy model distinguishes the risks from OGD from the risks to OGD, which can help government departments make better responses. Finally, risk response strategies are suggested based on the research results.

  • Faster Zero-shot Multi-modal Entity Linking via Visual-Linguistic Representation

    Category: Computer Science >> Integration Theory of Computer Science Submitted: 2022-11-28 Partner journal: Data Intelligence

    Abstract: Multi-modal entity linking plays a crucial role in a wide range of knowledge-based modal-fusion tasks, such as multi-modal retrieval and multi-modal event extraction. We introduce the new ZEro-shot Multi-modal Entity Linking (ZEMEL) task. Its format is similar to multi-modal entity linking, but multi-modal mentions are linked to unseen entities in the knowledge graph, and the purpose of the zero-shot setting is to realize robust linking in highly specialized domains. At the same time, the inference efficiency of existing models is low when there are many candidate entities. We therefore propose a novel model that leverages visual-linguistic representation through a co-attentional mechanism to deal with the ZEMEL task, considering the trade-off between the performance and efficiency of the model. We also build a dataset named ZEMELD for the new task, which contains multi-modal data resources collected from Wikipedia, and we annotate the entities as ground truth. Extensive experimental results on the dataset show that our proposed model is effective, as it significantly improves the precision from 68.93% to 82.62% compared with baselines in the ZEMEL task.

  • Uncovering Topics of Public Cultural Activities: Evidence from China

    Category: Computer Science >> Integration Theory of Computer Science Submitted: 2022-11-28 Partner journal: Data Intelligence

    Abstract: In this study, we uncover the topics of Chinese public cultural activities in 2020 with a two-step short text clustering (self-taught neural networks and graph-based clustering) and topic modeling approach. The dataset we use for this research is collected from 108 websites of libraries and cultural centers, containing over 17,000 articles. With the novel framework we propose, we derive 3 clusters and 8 topics from 21 provincial-level regions in China. By plotting the topic distribution of each cluster, we are able to show the unique tendencies of local cultural institutes, namely free lessons and lectures on art and culture, entertainment and service for socially vulnerable groups, and the preservation of intangible cultural heritage, respectively. The findings of our study provide decision-making support for cultural institutes, thus promoting public cultural service from a data-driven perspective.