POSSE post discovery to support link-less syndicated content#130
snarfed merged 13 commits into snarfed:master
Conversation
This might be interesting to @aaronpk @kartikprabhu @gRegorLove @tantek
oh very cool! This basically means bridgy will be more proactive about finding comments from silos?
i wonder if we should make this the key id instead of a property. that would guarantee there's only ever one SyndicatedPost per syndicated URL, which would avoid duplicates even in the (rare) case of two parallel poll tasks running for the same source.
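A minimal sketch of the key-id idea. The `SyndicatedPost` name, `get_or_insert` helper, and URLs are illustrative, and a plain dict stands in for the datastore; in ndb, making the syndicated URL the key id gives this uniqueness for free:

```python
# Sketch: using the syndication URL as the entity's key id guarantees at most
# one SyndicatedPost per URL, similar to ndb's Model.get_or_insert semantics.
# A plain dict stands in for the datastore here.

datastore = {}

def get_or_insert(syndication_url, original=None):
    """Return the existing entity for this key id, or create it."""
    key = ('SyndicatedPost', syndication_url)
    if key not in datastore:
        datastore[key] = {'original': original}
    return datastore[key]

first = get_or_insert('https://twitter.com/schnarfed/status/123',
                      original='https://snarfed.org/2014-03-01_foo')
second = get_or_insert('https://twitter.com/schnarfed/status/123')
assert first is second    # duplicates are impossible by construction
assert len(datastore) == 1
```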
agree a key would be a good idea. hmm, right now in the case where an original post doesn't have any syndication urls, I store original=permalink, syndication=None to avoid refetching it every time we process the h-feed ... a couple of ideas:
- a compound key of the two properties
- or, only store non-null SyndicatedPost -> non-null OriginalPost relationships in the database, and remember "recently searched syndicated posts" and "recently searched permalinks" in the cache.
also, consider adding created and updated timestamp properties, like in Response and other models. that'll help with debugging if/when we hit someone who changes a syndication url, or similar cases.
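A small sketch of that bookkeeping, with a plain class standing in for the ndb model; in ndb the same thing is usually expressed declaratively with `DateTimeProperty(auto_now_add=True)` and `DateTimeProperty(auto_now=True)`:

```python
from datetime import datetime, timezone

class SyndicatedPost:
    """Plain-Python stand-in. In ndb these would be:
        created = ndb.DateTimeProperty(auto_now_add=True)
        updated = ndb.DateTimeProperty(auto_now=True)
    """
    def __init__(self, original, syndication):
        self.original = original
        self.syndication = syndication
        self.created = datetime.now(timezone.utc)  # set once, at first write
        self.updated = self.created                # refreshed on every write

    def put(self):
        # auto_now behavior: bump updated on each save, so a changed
        # syndication url leaves a visible trail for debugging
        self.updated = datetime.now(timezone.utc)

post = SyndicatedPost('https://snarfed.org/foo', None)
first_created = post.created
post.put()
assert post.updated >= post.created
assert post.created == first_created  # created never changes after insert
```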
...and maybe a source property too.
sorry, you're right, i did see later that SyndicatedPosts may have null permalinks as well as null syndication urls. that's probably better than either a compound key or not storing posts w/o syndication urls.
another alternative is to store SyndicatedPosts as children of their Source entity, so we can use ancestor queries and write them inside a transaction:
https://developers.google.com/appengine/docs/python/ndb/entities#Overview
https://developers.google.com/appengine/docs/python/ndb/queries#ancestor
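A sketch of the parent/child idea, with tuples of (kind, id) pairs standing in for ndb keys; in ndb the child key would look like `ndb.Key('Twitter', source_id, 'SyndicatedPost', syndication_url)`, which puts all of one source's posts in one entity group so they can be written in a single transaction and found with an ancestor query:

```python
# Sketch: SyndicatedPost as a child of its Source. Keys are modeled as
# tuples of (kind, id) pairs; a parent key prefix defines the entity group.

entities = {}

def child_key(source_key, syndication_url):
    """Build a child key under the given Source key."""
    return source_key + (('SyndicatedPost', syndication_url),)

def ancestor_query(source_key):
    """Everything under one Source, like query(ancestor=source.key)."""
    n = len(source_key)
    return [value for key, value in entities.items() if key[:n] == source_key]

source = (('Twitter', 'schnarfed'),)
entities[child_key(source, 'https://twitter.com/schnarfed/status/1')] = 'post 1'
entities[child_key(source, 'https://twitter.com/schnarfed/status/2')] = 'post 2'
assert sorted(ancestor_query(source)) == ['post 1', 'post 2']
```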
This is one thing I'm a little concerned about leaving as a TODO, since it seems like it could be a little difficult to change later.
One question: it could be that you are doing original-post-discovery on a twitter post, and find rel=syndication links to facebook posts. Is it possible to look up a Source given the author's domain_url and the silo url of the syndicated content? maybe Source.query(domain_url=...), and then filter manually by checking source.AS_CLASS.DOMAIN?
And I think I basically understand the difference between adding a source property to SyndicatedPost and making SyndicatedPost a child of the Source, but not the tradeoffs of one approach over the other. I'd be very open to an executive decision if you have a preference ;)
yes, that Source lookup is possible. the catch is that you can't query across entity kinds (equivalent to database tables), and each silo is a separate kind. (to oversimplify, kind is determined by class name.)
even so, you did basically come up with the answer. make a constant dict that maps AS_CLASS.DOMAIN to Source subclass like the SOURCES dict in publish.py, use that to look up class by silo domain, then query by Source.domain. (domain_url since domain is normalized, lowercased, etc. amusingly, there are existing bridgy users with mixed case domain names in their silo profiles, e.g. https://www.brid.gy/twitter/gRegorLove)
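A sketch of that lookup, modeled on the SOURCES dict in publish.py. The `Twitter`/`Facebook` classes here are stand-ins for bridgy's actual Source subclasses, and the follow-up datastore query by domain_url is elided:

```python
from urllib.parse import urlparse

class Twitter:
    DOMAIN = 'twitter.com'

class Facebook:
    DOMAIN = 'facebook.com'

# like the SOURCES dict in publish.py: silo domain -> Source subclass.
# needed because ndb can't query across entity kinds, so we pick the
# kind first and only then query, e.g. by Source.domain_url.
SOURCES = {cls.DOMAIN: cls for cls in (Twitter, Facebook)}

def source_class_for(silo_url):
    """Map a syndicated post's URL to the Source kind to query."""
    domain = urlparse(silo_url).netloc.lower()
    return SOURCES.get(domain)

assert source_class_for('https://twitter.com/gRegorLove/status/1') is Twitter
assert source_class_for('https://example.com/post') is None
```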
good call that this is important to decide up front if we can. properties are easy to change after the fact, but parent/child relationships aren't. and good question; the main reason to define parent/child relationships is to support transactions (including transactional queries) across multiple entities, like here where we want to enforce uniqueness.
...there's also a spatial locality argument for parent/child relationships, but it's really subtle and usually not important. forget i mentioned it. :P
background reading and videos for if you get really bored:
https://developers.google.com/appengine/docs/python/datastore/structuring_for_strong_consistency
https://sites.google.com/site/io/under-the-covers-of-the-google-app-engine-datastore
https://snarfed.org/transactions_across_datacenters_io.html
heading out, more tonight. i'm getting excited! btw, again, after i look at the rest, i'm totally happy to do all the big things (moving to a separate file, new functionality, big new tests) in a separate PR. holler whenever you're happy with this one and i can merge it.

also, just to note from irc: in the common case, we're going to try to make this add either zero or one http fetches to original post discovery. the two common cases are 1) a domain_url that fails to fetch or has no webmention endpoint, and 2) we already have a matching SyndicatedPost.
The first time a syndicated post is encountered, fetch the author's homepage.
For each h-entry u-url, if it has not been seen on a previous iteration, fetch the
permalink and check for rel=syndication/u-syndication links. Store any discovered
relationships to support future comment backfeed.
See http://indiewebcamp.com/posse-post-discovery#Discover_POSSE_copies_via_rel-syndication
for more detail.
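The flow above can be sketched as follows. The fetch helpers and URLs are stand-ins for real HTTP fetches and mf2 parsing, and `seen` stands in for the stored SyndicatedPost relationships:

```python
# Sketch of the discovery flow, using canned data instead of real fetches.

# permalink -> list of u-syndication urls found on that permalink
FAKE_PERMALINKS = {
    'https://example.com/post/1': ['https://twitter.com/example/status/1'],
    'https://example.com/post/2': [],  # no syndication links
}

def fetch_entry_urls(author_url):
    """Stand-in for fetching the author's h-feed and collecting u-url values."""
    return list(FAKE_PERMALINKS)

def fetch_syndication_urls(permalink):
    """Stand-in for fetching a permalink and parsing rel/u-syndication links."""
    return FAKE_PERMALINKS[permalink]

seen = {}  # permalink -> syndication urls; an empty result is stored too

def discover(author_url):
    """Process the h-feed; return how many permalinks had to be fetched."""
    fetches = 0
    for permalink in fetch_entry_urls(author_url):
        if permalink in seen:
            continue  # stored on a previous poll; no refetch
        seen[permalink] = fetch_syndication_urls(permalink)
        fetches += 1
    return fetches

assert discover('https://example.com/') == 2  # first poll fetches everything
assert discover('https://example.com/') == 0  # second poll: all cached
```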
Remaining questions: should we limit the number of h-entries that we'll search
for a given h-feed? I'm currently storing relationships indefinitely in the
datastore; should they be refetched every so often?
Possible todo: add a cron job to poll author's h-feeds once a day or so.
P.S. Sorry that it's kind of a huge PR... This felt like the smallest reasonable incremental chunk.