POSSE post discovery to support link-less syndicated content#130
snarfed merged 13 commits into snarfed:master
Conversation
This might be interesting to @aaronpk @kartikprabhu @gRegorLove @tantek
oh very cool! This basically means bridgy will be more proactive about finding comments from silos?
i wonder if we should make this the key id instead of a property. that would guarantee there's only ever one SyndicatedPost per syndicated URL, which would avoid duplicates even in the (rare) case of two parallel poll tasks running for the same source.
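A minimal sketch of the key-id idea. The `SyndicatedPost` name, `get_or_insert` helper, and URLs are illustrative, and a plain dict stands in for the datastore; in ndb, making the syndicated URL the key id gives this uniqueness for free:

```python
# Sketch: using the syndication URL as the entity's key id guarantees at most
# one SyndicatedPost per URL, similar to ndb's Model.get_or_insert semantics.
# A plain dict stands in for the datastore here.

datastore = {}

def get_or_insert(syndication_url, original=None):
    """Return the existing entity for this key id, or create it."""
    key = ('SyndicatedPost', syndication_url)
    if key not in datastore:
        datastore[key] = {'original': original}
    return datastore[key]

first = get_or_insert('https://twitter.com/schnarfed/status/123',
                      original='https://snarfed.org/2014-03-01_foo')
second = get_or_insert('https://twitter.com/schnarfed/status/123')
assert first is second    # duplicates are impossible by construction
assert len(datastore) == 1
```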
agree a key would be a good idea. hmm, right now in the case where an original post doesn't have any syndication urls, I store original=permalink, syndication=None to avoid refetching it every time we process the h-feed ... a couple of ideas:
- a compound key of the two properties
- or, only store non-null SyndicatedPost -> non-null OriginalPost relationships in the database, and remember "recently searched syndicated posts" and "recently searched permalinks" in the cache.
also, consider adding created and updated timestamp properties, like in Response and other models. that'll help with debugging if/when we hit someone who changes a syndication url, or similar cases.
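A small sketch of that bookkeeping, with a plain class standing in for the ndb model; in ndb the same thing is usually expressed declaratively with `DateTimeProperty(auto_now_add=True)` and `DateTimeProperty(auto_now=True)`:

```python
from datetime import datetime, timezone

class SyndicatedPost:
    """Plain-Python stand-in. In ndb these would be:
        created = ndb.DateTimeProperty(auto_now_add=True)
        updated = ndb.DateTimeProperty(auto_now=True)
    """
    def __init__(self, original, syndication):
        self.original = original
        self.syndication = syndication
        self.created = datetime.now(timezone.utc)  # set once, at first write
        self.updated = self.created                # refreshed on every write

    def put(self):
        # auto_now behavior: bump updated on each save, so a changed
        # syndication url leaves a visible trail for debugging
        self.updated = datetime.now(timezone.utc)

post = SyndicatedPost('https://snarfed.org/foo', None)
first_created = post.created
post.put()
assert post.updated >= post.created
assert post.created == first_created  # created never changes after insert
```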
...and maybe a source property too.
sorry, you're right, i did see later that SyndicatedPosts may have null permalinks as well as null syndication urls. that's probably better than either a compound key or not storing posts w/o syndication urls.
another alternative is to store SyndicatedPosts as children of their Source entity, so we can use ancestor queries and write them inside a transaction:
https://developers.google.com/appengine/docs/python/ndb/entities#Overview
https://developers.google.com/appengine/docs/python/ndb/queries#ancestor
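A sketch of the parent/child idea, with tuples of (kind, id) pairs standing in for ndb keys; in ndb the child key would look like `ndb.Key('Twitter', source_id, 'SyndicatedPost', syndication_url)`, which puts all of one source's posts in one entity group so they can be written in a single transaction and found with an ancestor query:

```python
# Sketch: SyndicatedPost as a child of its Source. Keys are modeled as
# tuples of (kind, id) pairs; a parent key prefix defines the entity group.

entities = {}

def child_key(source_key, syndication_url):
    """Build a child key under the given Source key."""
    return source_key + (('SyndicatedPost', syndication_url),)

def ancestor_query(source_key):
    """Everything under one Source, like query(ancestor=source.key)."""
    n = len(source_key)
    return [value for key, value in entities.items() if key[:n] == source_key]

source = (('Twitter', 'schnarfed'),)
entities[child_key(source, 'https://twitter.com/schnarfed/status/1')] = 'post 1'
entities[child_key(source, 'https://twitter.com/schnarfed/status/2')] = 'post 2'
assert sorted(ancestor_query(source)) == ['post 1', 'post 2']
```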
This is one thing I'm a little concerned about leaving as a TODO, since it seems like it could be a little difficult to change later.
One question: it could be that you are doing original-post-discovery on a twitter post, and find rel=syndication links to facebook posts. Is it possible to look up a Source given the author's domain_url and the silo url of the syndicated content? maybe Source.query(domain_url=...), and then filter manually by checking source.AS_CLASS.DOMAIN?
And I think I basically understand the difference between adding a source property to SyndicatedPost and making SyndicatedPost a child of the Source, but not the tradeoffs of one approach over the other. I'd be very open to an executive decision if you have a preference ;)
yes, that Source lookup is possible. the catch is that you can't query across entity kinds (equivalent to database tables), and each silo is a separate kind. (to oversimplify, kind is determined by class name.)
even so, you did basically come up with the answer. make a constant dict that maps AS_CLASS.DOMAIN to Source subclass like the SOURCES dict in publish.py, use that to look up class by silo domain, then query by Source.domain. (domain_url since domain is normalized, lowercased, etc. amusingly, there are existing bridgy users with mixed case domain names in their silo profiles, e.g. https://www.brid.gy/twitter/gRegorLove)
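A sketch of that lookup, modeled on the SOURCES dict in publish.py. The `Twitter`/`Facebook` classes here are stand-ins for bridgy's actual Source subclasses, and the follow-up datastore query by domain_url is elided:

```python
from urllib.parse import urlparse

class Twitter:
    DOMAIN = 'twitter.com'

class Facebook:
    DOMAIN = 'facebook.com'

# like the SOURCES dict in publish.py: silo domain -> Source subclass.
# needed because ndb can't query across entity kinds, so we pick the
# kind first and only then query, e.g. by Source.domain_url.
SOURCES = {cls.DOMAIN: cls for cls in (Twitter, Facebook)}

def source_class_for(silo_url):
    """Map a syndicated post's URL to the Source kind to query."""
    domain = urlparse(silo_url).netloc.lower()
    return SOURCES.get(domain)

assert source_class_for('https://twitter.com/gRegorLove/status/1') is Twitter
assert source_class_for('https://example.com/post') is None
```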
good call that this is important to decide up front if we can. properties are easy to change after the fact, but parent/child relationships aren't. and good question; the main reason to define parent/child relationships is to support transactions (including transactional queries) across multiple entities, like here where we want to enforce uniqueness.
...there's also a spatial locality argument for parent/child relationships, but it's really subtle and usually not important. forget i mentioned it. :P
background reading and videos for if you get really bored:
https://developers.google.com/appengine/docs/python/datastore/structuring_for_strong_consistency
https://sites.google.com/site/io/under-the-covers-of-the-google-app-engine-datastore
https://snarfed.org/transactions_across_datacenters_io.html
heading out, more tonight. i'm getting excited! btw, again, after i look at the rest, i'm totally happy to do all the big things (moving to a separate file, new functionality, big new tests) in a separate PR. holler whenever you're happy with this one and i can merge it.

also, just to note from irc: in the common case, we're going to try to make this add either zero or one http fetches to original post discovery. the two common cases are 1) a domain_url that fails to fetch or has no webmention endpoint, and 2) we already have a matching SyndicatedPost.
The first time a syndicated post is encountered, fetch the author's homepage.
For each h-entry u-url, if it has not been seen on a previous iteration, fetch the
permalink and check for rel=syndication/u-syndication links. Store any discovered
relationships to support future comment backfeed.
See http://indiewebcamp.com/posse-post-discovery#Discover_POSSE_copies_via_rel-syndication
for more detail.
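The flow above can be sketched as follows. The fetch helpers and URLs are stand-ins for real HTTP fetches and mf2 parsing, and `seen` stands in for the stored SyndicatedPost relationships:

```python
# Sketch of the discovery flow, using canned data instead of real fetches.

# permalink -> list of u-syndication urls found on that permalink
FAKE_PERMALINKS = {
    'https://example.com/post/1': ['https://twitter.com/example/status/1'],
    'https://example.com/post/2': [],  # no syndication links
}

def fetch_entry_urls(author_url):
    """Stand-in for fetching the author's h-feed and collecting u-url values."""
    return list(FAKE_PERMALINKS)

def fetch_syndication_urls(permalink):
    """Stand-in for fetching a permalink and parsing rel/u-syndication links."""
    return FAKE_PERMALINKS[permalink]

seen = {}  # permalink -> syndication urls; an empty result is stored too

def discover(author_url):
    """Process the h-feed; return how many permalinks had to be fetched."""
    fetches = 0
    for permalink in fetch_entry_urls(author_url):
        if permalink in seen:
            continue  # stored on a previous poll; no refetch
        seen[permalink] = fetch_syndication_urls(permalink)
        fetches += 1
    return fetches

assert discover('https://example.com/') == 2  # first poll fetches everything
assert discover('https://example.com/') == 0  # second poll: all cached
```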
Remaining questions: should we limit the number of h-entries that we'll search
for a given h-feed? I'm currently storing relationships indefinitely in the
datastore; should they be refetched every so often?
Possible todo: add a cron job to poll author's h-feeds once a day or so.
P.S. Sorry that it's kind of a huge PR... This felt like the smallest reasonable incremental chunk.