For a curation system, working at all pretty much implies containing clusters of loose consensus: places with many cycles of trust and reciprocal trust (I think the word is "neighborhoods"). If those clusters are far enough from the user running a query, we can approximate the web of trust graph by replacing them with cluster approximation nodes. A cluster approximation node aggregates the posts and the ingoing/outgoing trust relations of the set of users it covers. Cluster node state is maintained globally, so that it can be shared. This pattern lets us assemble a simplified representation of the entire web of trust for each user: accurate in the places they expect it to be, resorting to shared approximations where complete accuracy would be infeasible. That drastically reduces the number of nodes that have to be visited in the most common types of query.
I don't know anything about that sort of thing.
A helpful curator is selecting for some quality about which they must have mutual knowledge with their audience. In a web of trust, mutual knowledge of the quality in question manifests as cyclical structures: people trusting each other, or trusting enough mutual friends that by transitivity they will come to trust each other. We expect that trust webs based on shared taste will often form a number of broad tangles of cycles of consensus, likely corresponding to demographics or communities who are looking for the same thing. We call those clusters, and they seem like they'd be very useful to be able to detect, even if only for anthropological surveys.
Clusters of loose consensus will give you mostly the same recommendations no matter where you enter them, no matter which specific member you trust: everything any member posts will propagate through all of the others. If a region of the network that is fairly far from the user opens into a cluster, it may be adequate to approximate the user's connection to it with a connection to a prepared cluster approximation node.
A cluster approximation node can be shared by many users' web models, and this lets them share pre-computed results when querying over the vast outer fringes of the web. The unique computation that needs to occur when a user queries their view of the web is drastically reduced.
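To make this concrete, here's a minimal sketch of what a cluster approximation node's state might look like. All names and fields here (`ClusterNode`, `inbound_trust`, and so on) are my own illustrative assumptions, not anything defined above:

```python
from dataclasses import dataclass, field

@dataclass
class ClusterNode:
    """Approximates a neighborhood of loose consensus as a single shared node."""
    members: frozenset                                 # user ids covered by this approximation
    posts: list = field(default_factory=list)          # aggregated posts of all members
    inbound_trust: dict = field(default_factory=dict)  # external user id -> trust weight into the cluster
    outbound_trust: dict = field(default_factory=dict) # external user id -> trust weight out of the cluster

    def absorbs(self, user_id) -> bool:
        """True if this user's individual node has been folded into the cluster."""
        return user_id in self.members
```

Because the state is global and shared, any user's query that reaches the cluster's fringe can stop at this node instead of traversing every member.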
Where there is no cluster, there is no consensus. Where there is no consensus (which will manifest as very divergent branching with very few cycles), it is somewhat unlikely that any true sense of the web's inclusion criterion is understood there. At the least, if there were a true sense, it seems it would be difficult for a distant judge to recognize it anyway, given that no nearby judges have.
Given that, it's hard to imagine a natural web of trust that couldn't be approximated pretty decently with neighborhood approximation via cluster nodes.
It's worth naming a few different ways of talking about the accuracy of an approximation AG of a web G.
Where C(G) is the set of users considered to be trusted in the neighborhood of the querier, in the web G
InclusionError(AG, G), the extent to which AG contains users that it shouldn't, := |C(AG)\C(G)| / |C(AG)∪C(G)|
ExclusionError(AG, G), the extent to which AG is missing users that it's supposed to have, := |C(G)\C(AG)| / |C(AG)∪C(G)|
TotalError(AG, G) = InclusionError(AG, G) + ExclusionError(AG, G)
(These produce numbers in the unit range, retain meaning with respect to the absolute sizes of the graphs, and generally feel truthy to me.)
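As a sanity check, the three error measures can be written down directly as set operations (a minimal sketch; `c_ag` and `c_g` stand for C(AG) and C(G)):

```python
def inclusion_error(c_ag: set, c_g: set) -> float:
    """|C(AG) \\ C(G)| / |C(AG) ∪ C(G)|: users the approximation wrongly includes."""
    union = c_ag | c_g
    return len(c_ag - c_g) / len(union) if union else 0.0

def exclusion_error(c_ag: set, c_g: set) -> float:
    """|C(G) \\ C(AG)| / |C(AG) ∪ C(G)|: users the approximation wrongly misses."""
    union = c_ag | c_g
    return len(c_g - c_ag) / len(union) if union else 0.0

def total_error(c_ag: set, c_g: set) -> float:
    return inclusion_error(c_ag, c_g) + exclusion_error(c_ag, c_g)

# True trusted set {a,b,c,d}; the approximation yields {b,c,d,e}:
# one wrong inclusion (e) and one wrong exclusion (a) over a union of 5,
# so each error is 1/5 and the total is 0.4.
assert total_error({"b", "c", "d", "e"}, {"a", "b", "c", "d"}) == 0.4
```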
It seems that different webs will need to focus on minimizing different kinds of error. Minimizing both is ideal, of course, but wherever we face a tradeoff:
cases when it is more important to minimize inclusion error
A web selecting for "respectful, polite" would be extremely useful, in that it would enable more high-profile people to keep their inboxes open without excess risk of receiving any dick pics or death threats.
An "absolute cream of the crop" web, where every entry is supposed to be guaranteed to be a great use of attention, would be very useful to time-deficient people, who can't really afford to waste even 10 minutes on a mediocre recommendation. Since they are only going to see a limited number of articles anyway, it will often be acceptable for the web to run a risk of missing articles at the expense of avoiding including any that shouldn't be in there.
and OTOH, places where it's more important to minimize exclusion error
Anything where exclusion is, you know, personally very harmful to the people who might be excluded, so for instance most of the basic human qualities like
isn't a spambot, or
uses basic tags correctly.
An adequate reviewing industry, for instance, wants to hear about every single new release in its genre, and would not be especially harmed by being shown a few articles that actually weren't new releases.
Exclusion is harmful to the subjects too; if a release is unjustly dropped at this stage of the collective information processing pipeline, it might never get recognized.
Likely, in these cases, once an article has been evaluated, it will end up being forwarded to some more exclusive web, for instance,
game releases/pre-release →(read and processed by game reviewers)→
game reviews →(read and processed by very hungry game likers and other reviewers)→
surprising, entertaining or important game reviews →(read by all game likers)→
personal game recommendations →(a much more subjective web, read by extremely bespoke personal social networks of game likers)
I guess I'd recommend just keeping both kinds of approximations in cache, instead of having approximation parameters be part of a web's definition or whatever.
It may be worth measuring the degree of connectedness: roughly, the average number of steps it takes to get from any node in the neighborhood to any other node. This will vary. As long as paths are short enough, we can get a pretty good guarantee that all of the nodes in the cluster would be included in any query extending far enough into it.
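This connectedness measure, mean shortest-path length over all ordered reachable pairs, can be sketched with plain breadth-first search (function and parameter names here are illustrative):

```python
from collections import deque

def avg_path_length(adj: dict) -> float:
    """Mean shortest-path length over all ordered reachable pairs
    in a directed trust graph, given as {node: [trusted nodes]}."""
    def bfs(src):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return dist

    total, pairs = 0, 0
    for src in adj:
        d = bfs(src)
        for dst in adj:
            if dst != src and dst in d:
                total += d[dst]
                pairs += 1
    return total / pairs if pairs else float("inf")

# A tight directed 3-cycle: each node reaches one neighbor in 1 step
# and the other in 2, so the average is 1.5.
ring = {"a": ["b"], "b": ["c"], "c": ["a"]}
print(avg_path_length(ring))  # 1.5
```

Unreachable pairs are simply left out of the average here; a stricter variant might instead treat them as disqualifying the neighborhood from being a cluster at all.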
These things all feel solvable to me:
Drawing the borders of clusters correctly, and picking clusters whose approximations are likely to be reused by many users. I would know a correct solution when I saw it, but it's not obvious to me which of the standard clustering algorithms does it. We should look at https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm; just looking at various gifs of clusterings of two-dimensional points, it seemed like the only one that puts the borders in the right places for clusters of varying sizes.
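For intuition about EM specifically, here's a minimal pure-Python sketch of EM for a 1-D, two-component Gaussian mixture. It's only a toy: a real cluster-border finder would run on graph structure or embeddings, and every name here is my own. The relevant property is that EM fits a separate size (variance) and weight per cluster, which is what lets it place borders sensibly between clusters of varying size:

```python
import math
import random

def gauss(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_1d(xs, iters=50):
    """EM for a two-component 1-D Gaussian mixture. Returns (means, variances, weights)."""
    # init the means at the extremes so the two components start apart
    mu = [min(xs), max(xs)]
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            p = [pi[k] * gauss(x, mu[k], var[k]) for k in range(2)]
            s = sum(p)
            resp.append([pk / s for pk in p])
        # M-step: re-estimate mixing weights, means, and per-cluster variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk, 1e-6)
    return mu, var, pi

random.seed(0)
# a big tight cluster near 0 and a smaller, looser one near 6
data = [random.gauss(0, 1) for _ in range(300)] + [random.gauss(6, 2) for _ in range(60)]
mu, var, pi = em_1d(data)
```

A hard-borders method like k-means with a shared radius would tend to bite chunks out of the big cluster near the boundary; because EM learns that the second component is smaller but wider, the border lands between the two populations.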
Making good approximations to minimize some combination of model size and the various types of error.
Think more about the trust weights and what they mean. I currently understand them as meaning "lower bound of similarity", or "lower bound of the accuracy of this person as a detector for the web's defining quality of interest". Multiplying these weights down as the traversal extends outwards from the user gives very much a lower bound, to an extreme extent that I'm not sure I like (could we think in distributions here? Maybe means and variances instead of point probabilities?). Note, for non-bayesians: P(A and B) is only equal to P(A)P(B) if A and B are completely causally unrelated (having no shared causes, and not causing each other), and within a serendipitously connected neighborhood, among humans, that's a very unlikely assumption; different detectors will be entangled in ways that the multiplied similarity scores consistently undermeasure, conservative to the point of being wrong. Think about how giving these trust weights coherent meanings might naturally lead to a way of getting the property of increasing approximation roughness at the fringes of the web. I have a leading feeling about this. Right now I'm not sure I see how approximating more at the fringes than at the center is even truthlike behaviour, despite it seeming useful, and being what we want and expect.
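The entanglement point can be made with a toy numeric example. Suppose two detectors are both always right on "easy" items and each guesses independently at chance on "hard" ones; the shared cause (item easiness) makes the naive product P(A)P(B) undermeasure the true joint accuracy. The numbers below are made up purely for illustration:

```python
# 80% of items are "easy": both detectors always get those right.
# On the remaining 20%, each detector independently guesses at 50%.
p_easy = 0.8
p_a = p_easy + (1 - p_easy) * 0.5       # marginal accuracy of detector A: 0.9
p_b = p_a                               # B has the same marginal accuracy
p_joint = p_easy + (1 - p_easy) * 0.25  # true P(A and B both correct): 0.85

print(f"independence estimate P(A)P(B): {p_a * p_b:.2f}")  # 0.81
print(f"true joint accuracy:            {p_joint:.2f}")    # 0.85
```

The product rule is only exact when the detectors share no causes; here the shared cause makes the multiplied score a strict underestimate, which is the same conservatism that multiplying trust weights along a path bakes in.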