EP-4742045-A1 - SEMANTIC SEGMENTATION OF A SEQUENCE OF CONTENT CAPTURES IN A DESKTOP ENVIRONMENT
Abstract
The techniques disclosed herein provide a system for segmenting a sequence of content captures (e.g., screenshots) of a desktop environment based on a semantic relationship between individual content captures. Generally described, the system generates a numerical representation (e.g., an embedding) of a content capture in the sequence. The numerical representation is then compared against numerical representations of neighboring content captures to detect changes in user activity such as switching activities. Accordingly, the system calculates a difference metric that quantifies the level of change between content captures and compares these difference metrics against a threshold difference metric to identify such changes in user activity. In the event at least one difference metric satisfies the threshold difference metric, the system partitions the sequence of content captures to generate at least a first segment and a second segment. The segments are then rendered in an interactive timeline interface.
Inventors
- KRAL, Kyle Thomas
- PURI, Yohann
- ZHONG, Si Cheng
Assignees
- Microsoft Technology Licensing, LLC
Dates
- Publication Date
- 20260513
- Application Date
- 20251103
Claims (15)
- A method for segmenting a sequence of content captures (104A-104C) depicting a desktop environment based on a semantic relationship between individual content captures (104A-104C) of the sequence of content captures (104A-104C), the method comprising: receiving the sequence of content captures (104A-104C) from a content capture generation component (106); for an individual content capture (104A-104C) of the sequence of content captures (104A-104C): generating a numerical representation (110B) of a semantic content (106) depicted in the individual content capture (104B); comparing the numerical representation (110B) of the individual content capture (104B) against a preceding numerical representation (110A) of a preceding content capture (104A) and a subsequent numerical representation (110C) of a subsequent content capture (104C); calculating a difference metric (112) for the numerical representation (110B) of the individual content capture (104B) based on the comparison against the preceding numerical representation (110A) and the subsequent numerical representation (110C), wherein the difference metric (112) quantifies a level of change between the individual content capture (104B), the preceding content capture (104A), and the subsequent content capture (104C); determining that the difference metric (112) for the individual content capture (104B) satisfies a threshold difference metric (114) indicating a substantive change in the desktop environment based on a comparison between the difference metric (112) and the threshold difference metric (114); and in response to determining that the difference metric (112) for the individual content capture (104B) satisfies the threshold difference metric (114), partitioning the sequence of content captures (104A-104C) at the individual content capture (104B) into at least a first segment (116A) and a second segment (116B); and rendering at least the first segment (116A) and the second segment (116B) within an interactive timeline user interface (302).
- The method of claim 1, wherein the content capture generation component generates an individual content capture at a regular time interval.
- The method of claim 1 or claim 2, wherein: the first segment is associated with a first grouping of content captures; and the second segment is associated with a second grouping of content captures.
- The method of claim 3, wherein the sequence of content captures is a first sequence of content captures, the method further comprising: determining that a third segment within a second sequence of content captures is substantially similar to the first segment based on a numerical representation of content captures within the third segment; in response to determining that the third segment is substantially similar to the first segment, associating the third segment with the first grouping of content captures.
- The method of claim 3, further comprising: detecting an unassigned content capture within the first segment that is associated with an undetermined grouping of content captures; determining that a number of content captures within the first segment associated with the first grouping of content captures satisfies a threshold number; and in response to determining that the number of content captures within the first segment associated with the first grouping of content captures satisfies the threshold number, associating the unassigned content capture with the first grouping of content captures.
- The method of any one of claim 1 through claim 5, wherein: the first segment is rendered within the interactive timeline user interface in a first color; and the second segment is rendered within the interactive timeline user interface in a second color.
- The method of any one of claim 1 through claim 6, wherein the numerical representation of the individual content capture is a vector embedding of onscreen content and system metadata.
- The method of any one of claim 1 through claim 7, further comprising: receiving an external request for an additional analysis of the sequence of content captures; and in response to the external request, providing the sequence of content captures to an advanced analysis model.
- The method of any one of claim 1 through claim 8, further comprising: assigning a first semantic profile to the first segment based on a semantic content of the first segment; assigning a second semantic profile to the second segment based on a semantic content of the second segment; detecting a third segment having the first semantic profile; and rendering a suggestion interface element in association with the interactive timeline interface, the suggestion interface element surfacing a semantic relationship between the first segment and the third segment based on the first semantic profile.
- A system for segmenting a sequence of content captures depicting a desktop environment based on a semantic relationship between individual content captures of the sequence of content captures, the system comprising: a processing system; and a computer-readable medium having encoded thereon computer-readable instructions that when executed by the processing system cause the system to perform operations comprising: receiving the sequence of content captures (104A-104C) from a content capture generation component (106); for an individual content capture (104A-104C) of the sequence of content captures (104A-104C): generating a numerical representation (110B) of a semantic content (106) depicted in the individual content capture (104B); comparing the numerical representation (110B) of the individual content capture (104B) against a preceding numerical representation (110A) of a preceding content capture (104A) and a subsequent numerical representation (110C) of a subsequent content capture (104C); calculating a difference metric (112) for the numerical representation (110B) of the individual content capture (104B) based on the comparison against the preceding numerical representation (110A) and the subsequent numerical representation (110C), wherein the difference metric (112) quantifies a level of change between the individual content capture (104B), the preceding content capture (104A), and the subsequent content capture (104C); determining that the difference metric (112) for the individual content capture (104B) satisfies a threshold difference metric (114) indicating a substantive change in the desktop environment based on a comparison between the difference metric (112) and the threshold difference metric (114); and in response to determining that the difference metric (112) for the individual content capture (104B) satisfies the threshold difference metric (114), partitioning the sequence of content captures (104A-104C) at the individual content capture (104B) into at least a first segment (116A) and a second segment (116B); and rendering the first segment (116A) and the second segment (116B) within an interactive timeline user interface (302).
- The system of claim 10, wherein: the first segment is associated with a first grouping of content captures; and the second segment is associated with a second grouping of content captures.
- The system of claim 11, wherein the sequence of content captures is a first sequence of content captures, the operations further comprising: determining that a third segment within a second sequence of content captures is substantially similar to the first segment based on a numerical representation of content captures within the third segment; in response to determining that the third segment is substantially similar to the first segment, associating the third segment with the first grouping of content captures.
- The system of claim 11, wherein the operations further comprise: detecting an unassigned content capture within the first segment that is associated with an undetermined grouping of content captures; determining that a number of content captures within the first segment associated with the first grouping of content captures satisfies a threshold number; and in response to determining that the number of content captures within the first segment associated with the first grouping of content captures satisfies the threshold number, associating the unassigned content capture with the first grouping of content captures.
- The system of any one of claim 10 through 13, wherein the operations further comprise: assigning a first semantic profile to the first segment based on a semantic content of the first segment; assigning a second semantic profile to the second segment based on a semantic content of the second segment; detecting a third segment having the first semantic profile; and rendering a suggestion interface element in association with the interactive timeline interface, the suggestion interface element surfacing a semantic relationship between the first segment and the third segment based on the first semantic profile.
- A computer-readable storage medium having encoded thereon computer-readable instructions that when executed by a system cause the system to perform operations comprising: receiving a sequence of content captures (104A-104C) from a content capture generation component (106); for an individual content capture (104A-104C) of the sequence of content captures (104A-104C): generating a numerical representation (110B) of a semantic content (106) depicted in the individual content capture (104B); comparing the numerical representation (110B) of the individual content capture (104B) against a preceding numerical representation (110A) of a preceding content capture (104A); calculating a difference metric (112) for the numerical representation (110B) of the individual content capture (104B) based on the comparison against the preceding numerical representation (110A), wherein the difference metric (112) quantifies a level of change between the individual content capture (104B) and the preceding content capture (104A); determining that the difference metric (112) for the individual content capture (104B) satisfies a threshold difference metric (114) indicating a substantive change in the desktop environment based on a comparison between the difference metric (112) and the threshold difference metric (114); in response to determining that the difference metric (112) for the individual content capture (104B) satisfies the threshold difference metric (114), partitioning the sequence of content captures (104A-104C) at the individual content capture (104B) into at least a first segment (116A) and a second segment (116B); and rendering the first segment (116A) and the second segment (116B) within an interactive timeline user interface (302).
Description
BACKGROUND More and more of daily life occurs through computing devices, from completing assignments for work and school to planning vacations and shopping online. As such, a user may utilize a diverse array of software applications to accomplish various tasks. Moreover, a given software application can serve different purposes in different contexts. For instance, an internet browser can be utilized to look up nearby restaurants at one moment and to research information for a presentation at another. Consequently, the user may lose track of what they were doing at a given moment as well as the context of that activity. To aid users in retracing their steps, many software applications include features for searching and retrieving content and/or activity, such as the browsing history in an internet browser and/or a listing of recent files in a file explorer. However, existing features such as keyword-based searches, folder hierarchies, and app-specific organization tools may lack the ability to record context and decipher user intent. For example, a user may attempt a keyword search to recover a source of information for citation in a presentation. Unfortunately, the lack of specificity in existing approaches may prevent the user from finding the information for which they are looking. Moreover, such features place an additional burden on the user to remember exact details about their past activity, such as the name of a website, the title of an article, or other information. Manual recollection can be especially challenging due to the sheer amount of information the user generates and interacts with. That is, many existing systems place the onus on the user to spend time manually organizing, categorizing, and documenting information rather than accomplishing the tasks they wish to complete. It is with respect to these and other considerations that the disclosure made herein is presented.
SUMMARY The techniques disclosed herein provide a partitioning system for segmenting a sequence of content captures (e.g., screenshots) utilizing a semantic relationship between individual content captures to detect changes in user activity and intent. As mentioned above, the sheer volume of user activity that occurs on computing devices (e.g., laptops, desktops, tablets) can render manual activity recollection overly burdensome and even unfeasible. To that end, end user experiences have streamlined activity recall operations by collecting, with the consent of the user, records of user activity such as content captures of a desktop environment. Content captures enable an accurate recollection of moments of interest in past user activity, thereby enhancing user engagement and productivity. In addition, content captures can be grouped, for example, in an interactive user activity timeline that renders such groups as various segments, each representing a user activity session, that is, a period of substantially continuous user interaction with a given software application, for example. However, generating groups of content captures may involve a difficult balance between grouping accuracy and quick processing times. For instance, accurately grouping content captures by topic (e.g., vacation planning, online shopping) may require significant processing by advanced artificial intelligence models (e.g., a small language model, a large language model). Conversely, grouping content captures more generally, such as by software application (e.g., a text editor, a web browser, a music player), incurs far lower processing costs but may also obscure semantically distinct activities that warrant their own segments despite originating from the same application. For example, a user may open a web browser to shop for clothes and subsequently watch a movie via the web browser at a later point in time.
Intuitively, these are two distinct activities that should be represented as separate segments despite originating from the same software application. As such, the techniques presented herein enable segmenting content captures based on semantic relationships between individual content captures without requiring the elevated processing costs of advanced artificial intelligence models. That is, the present system segments sequences of content captures without requiring knowledge of the human-readable visual content of the content captures. Within the context of the present disclosure, a sequence of content captures is a plurality of individual content captures that are ordered with respect to time. Stated another way, the sequence of content captures, when received by the partitioning system, is organized chronologically by when each content capture was generated. Generally described, a content capture is a recording of a current state of a desktop environment during a given moment of interest that captures the content (e.g., images, text, audio) that the user was interacting with. Moreover, the desktop environment is a graphical user interface abstraction of an operating system
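For illustration only, the partitioning described above can be sketched in Python. The sketch follows the single-neighbor variant (comparing each capture's numerical representation against the preceding capture only, as recited in claim 15); the choice of cosine distance as the difference metric, and the function names used here, are assumptions for the sketch rather than limitations of the disclosure.

```python
import numpy as np

def segment_captures(embeddings, threshold):
    """Partition a chronological sequence of capture embeddings into segments
    wherever the difference metric against the preceding capture satisfies a
    threshold (illustrative sketch; cosine distance is an assumed metric)."""
    def cosine_distance(a, b):
        return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    boundaries = [
        i for i in range(1, len(embeddings))
        # A large distance to the preceding capture signals a substantive
        # change in the desktop environment: partition the sequence here.
        if cosine_distance(embeddings[i], embeddings[i - 1]) >= threshold
    ]

    # Split the sequence of capture indices at each detected boundary.
    segments, start = [], 0
    for b in boundaries:
        segments.append(list(range(start, b)))
        start = b
    segments.append(list(range(start, len(embeddings))))
    return segments
```

Given three captures from one activity followed by three from a semantically different activity, the sketch yields two segments, corresponding to the first segment and second segment rendered in the interactive timeline user interface.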