POSSE post discovery to support link-less syndicated content #130

Merged
snarfed merged 13 commits into snarfed:master from karadaisy:posse-post-discovery
Apr 26, 2014

Conversation

@karadaisy
Contributor

The first time a syndicated post is encountered, fetch the author's homepage.
For each h-entry u-url, if it has not been seen on a previous iteration, fetch the
permalink and check for rel=syndication/u-syndication links. Store any discovered
relationships to support future comment backfeed.

See http://indiewebcamp.com/posse-post-discovery#Discover_POSSE_copies_via_rel-syndication
for more detail.
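
A minimal sketch of the discovery loop described above, not the code in this PR. It assumes a caller-supplied `fetch(url)` that returns HTML, and it uses a deliberately naive stdlib parser that does not scope `u-url` to its enclosing `h-entry`; a real implementation would presumably use a proper microformats2 parser. The names `discover`, `seen`, and `relationships` are hypothetical.

```python
from html.parser import HTMLParser


class _LinkCollector(HTMLParser):
    """Collects u-url permalinks and rel=syndication / u-syndication links."""

    def __init__(self):
        super().__init__()
        self.permalinks = []
        self.syndication = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get('href')
        if tag not in ('a', 'link') or not href:
            return
        classes = (attrs.get('class') or '').split()
        rels = (attrs.get('rel') or '').split()
        if 'u-url' in classes:
            self.permalinks.append(href)
        if 'syndication' in rels or 'u-syndication' in classes:
            self.syndication.append(href)


def _parse(html):
    parser = _LinkCollector()
    parser.feed(html)
    return parser


def discover(homepage_url, fetch, seen, relationships):
    """Fetch the homepage, then each permalink not seen before.

    seen: set of permalinks already searched (hypothetical stand-in for
      whatever storage remembers them).
    relationships: dict mapping syndicated URL -> original permalink.
    """
    for permalink in _parse(fetch(homepage_url)).permalinks:
        if permalink in seen:
            continue  # searched on a previous iteration, skip the fetch
        seen.add(permalink)
        for synd in _parse(fetch(permalink)).syndication:
            relationships[synd] = permalink
    return relationships
```

Passing `fetch` in as a parameter keeps the sketch independent of any HTTP library and makes it easy to count how many fetches a given run costs.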

Remaining questions: should we limit the number of h-entries that we'll search
for a given h-feed? Also, I'm currently storing relationships indefinitely in NDB;
should they be refetched every so often?

Possible todo: add a cron job to poll authors' h-feeds once a day or so.

P.S. Sorry that it's kind of a huge PR...This felt like the smallest reasonable incremental chunk.

@karadaisy
Contributor Author

This might be interesting to @aaronpk @kartikprabhu @gRegorLove @tantek

@aaronpk
Contributor

aaronpk commented Apr 19, 2014

oh very cool! This basically means bridgy will be more proactive about finding comments from silos?

Comment thread models.py Outdated
Owner

i wonder if we should make this the key id instead of a property. that would guarantee there's only ever one SyndicatedPost per syndicated URL, which would avoid duplicates even in the (rare) case of two parallel poll tasks running for the same source.
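
A rough illustration of the key-id idea, using an in-memory stand-in rather than the datastore. The `SyndicatedPostStore` class is hypothetical; in NDB the analogous call would presumably be `Model.get_or_insert` with the syndicated URL as the key id. The lock here stands in for the transaction.

```python
import threading


class SyndicatedPostStore:
    """Hypothetical in-memory store keyed by syndicated URL."""

    def __init__(self):
        self._by_syndication = {}
        self._lock = threading.Lock()

    def get_or_insert(self, syndication_url, original_url):
        # Because the syndicated URL is the key, a second writer gets the
        # existing record back instead of creating a duplicate.
        with self._lock:
            return self._by_syndication.setdefault(syndication_url, original_url)

    def __len__(self):
        return len(self._by_syndication)
```

With the URL as the key, two poll tasks racing on the same syndicated post both end up with the same single record.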

Contributor Author

agree a key would be a good idea. hmm, right now in the case where an original post does not have any syndication urls, I store original=permalink, syndication=None to avoid refetching it every time we process the h-feed ... couple ideas:

  1. a compound key of the two properties
  2. or, only store non-null SyndicatedPost -> non-null OriginalPost relationships in the database, and remember "recently searched syndicated posts" and "recently searched permalinks" in the cache.

Owner

also, consider adding created and updated timestamp properties, like in Response and other models. that'll help with debugging if/when we hit someone who changes a syndication url, or similar cases.

Owner

...and maybe a source property too.

Owner

sorry, you're right, i did see later that SyndicatedPosts may have null permalinks or null syndicated urls. that's probably better than either a compound key or not storing posts w/o syndicated urls.

another alternative is to store SyndicatedPosts as children of their Source entity, so we can use ancestor queries and write them inside a transaction:
https://developers.google.com/appengine/docs/python/ndb/entities#Overview
https://developers.google.com/appengine/docs/python/ndb/queries#ancestor

Contributor Author

This is one thing I'm a little concerned about leaving as a TODO, since it seems like it could be a little difficult to change later.

One question: it could be that you are doing original-post-discovery on a twitter post, and find rel=syndication links to facebook posts. Is it possible to look up a Source given the author's domain_url and the silo url of the syndicated content? maybe Source.query(domain_url=...), and then filter manually by checking source.AS_CLASS.DOMAIN?

And I think I basically understand the difference between adding a source property to SyndicatedPost and making SyndicatedPost a child of the Source, but not the tradeoffs of one approach over the other. I'd be very open to an executive decision if you have a preference ;)

Owner

yes, that Source lookup is possible. the catch is that you can't query across entity kinds (equivalent to database tables), and each silo is a separate kind. (to oversimplify, kind is determined by class name.)

even so, you did basically come up with the answer. make a constant dict that maps AS_CLASS.DOMAIN to Source subclass, like the SOURCES dict in publish.py, use that to look up the class by silo domain, then query by Source.domain. (domain rather than domain_url, since domain is normalized, lowercased, etc. amusingly, there are existing bridgy users with mixed case domain names in their silo profiles, e.g. https://www.brid.gy/twitter/gRegorLove)

good call that this is important to decide up front if we can. properties are easy to change after the fact, but parent/child relationships aren't. and good question; the main reason to define parent/child relationships is to support transactions (including transactional queries) across multiple entities, like here where we want to enforce uniqueness.

...there's also a spatial locality argument for parent/child relationships, but it's really subtle and usually not important. forget i mentioned it. :P
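
a rough sketch of the lookup described above, in plain Python rather than NDB. everything here is a hypothetical stand-in: the `Source`, `Twitter`, and `Facebook` classes, the `DOMAIN` attribute (standing in for AS_CLASS.DOMAIN), the `SOURCES_BY_DOMAIN` dict, and the list filter that takes the place of a per-kind datastore query on the normalized domain.

```python
from urllib.parse import urlparse


class Source:
    """Hypothetical stand-in for a stored silo account."""
    DOMAIN = None  # silo domain, standing in for AS_CLASS.DOMAIN

    def __init__(self, domain):
        # normalized (lowercased) domain of the author's own site
        self.domain = domain.lower()


class Twitter(Source):
    DOMAIN = 'twitter.com'


class Facebook(Source):
    DOMAIN = 'facebook.com'


# constant dict: silo domain -> Source subclass
SOURCES_BY_DOMAIN = {cls.DOMAIN: cls for cls in (Twitter, Facebook)}


def source_class_for(syndication_url):
    """Pick the Source subclass for a syndicated URL, or None."""
    host = urlparse(syndication_url).hostname or ''
    if host.startswith('www.'):
        host = host[len('www.'):]
    return SOURCES_BY_DOMAIN.get(host)


def find_sources(syndication_url, author_domain, stored_sources):
    """Filter one kind by normalized author domain.

    stored_sources is a plain list here; in the datastore this would be
    a query on a single kind, since queries can't span kinds.
    """
    cls = source_class_for(syndication_url)
    if cls is None:
        return []
    wanted = author_domain.lower()
    return [s for s in stored_sources
            if type(s) is cls and s.domain == wanted]
```

picking the class first and then filtering within that one kind mirrors the constraint that each silo is a separate kind.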

@snarfed
Owner

snarfed commented Apr 20, 2014

heading out, more tonight. i'm getting excited! btw, again, after i look at the rest, i'm totally happy to do all big things (moving to a separate file, new functionality, big new tests) in a separate PR. holler whenever you're happy with this one and i can merge it.

also, just to note from irc, in the common case, we're going to try to make this add either zero or one http fetches to original post discovery. the two common cases are 1) domain_url that fails fetch or has no webmention endpoint, and 2) we already have a matching SyndicatedPost
