[SubstackBridge] Add Substack bridge #4174

Merged: 5 commits, Jul 31, 2024

Conversation

SqrtMinusOne
Contributor

This adds a bridge for Substack.

I noticed that their RSS returns the full content if fetched with the right set of cookies, so this bridge leverages that. I've got no clue whether it's intended or not.

This also required:

  • Adding the header parameter to FeedExpander.collectExpandableDatas;
  • Adding support for the content extension to FeedParser.
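
A rough sketch of what sending extra headers with a feed request can look like. This is not the FeedExpander code from this PR: `buildCookieHeader`, `buildFeedContext`, the cookie names and their values are placeholders made up for illustration. Only PHP's built-in `stream_context_create` is assumed.

```php
<?php
// Hypothetical helpers for illustration; not part of RSS-Bridge.

// Turn ['name' => 'value', ...] into a single Cookie request header.
// Assumes the values are already safe to put in a cookie as-is.
function buildCookieHeader(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return 'Cookie: ' . implode('; ', $pairs);
}

// Build a stream context that sends the given header lines with the request.
function buildFeedContext(array $headers)
{
    return stream_context_create([
        'http' => [
            'method' => 'GET',
            'header' => implode("\r\n", $headers),
        ],
    ]);
}

// Placeholder cookie names and values; the real session cookie is not shown here.
$cookieHeader = buildCookieHeader(['session' => 'PLACEHOLDER', 'locale' => 'en']);
$context = buildFeedContext([$cookieHeader, 'Accept: application/rss+xml']);

// The feed would then be fetched with something like:
// $xml = file_get_contents($feedUrl, false, $context);
```

Keeping the header list as a plain array means a caller that needs no cookies can pass an empty array and get the default behaviour.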

github-actions bot commented Jul 31, 2024

Pull request artifacts

Bridge Context Status
Substack 1 untitled (pr) ✔️

last change: Wednesday 2024-07-31 19:56:58

@dvikan
Contributor

dvikan commented Jul 31, 2024

This is nice.

But I don't like the increased complexity of FeedExpander for HTTP headers.

Also, the content key on $item is mutated by others. There might be unintended consequences here, I'm not sure.

@dvikan
Contributor

dvikan commented Jul 31, 2024

There are two earlier locations where $item['content'] is written (both of which will be overwritten by this PR):

foreach ($feedItem as $k => $v) {
    $hasChildren = count($v) !== 0;
    if (!$hasChildren) {
        $item[$k] = (string) $v;
    }
}

and

if (isset($feedItem->description)) {
    $item['content'] = (string) $feedItem->description;
}

This applies only to RSS 2.0.

The content namespace is later excluded when enumerating all modules.

I think this change is good, but we must be clear that we are possibly overwriting prior values in $item['content'].

@dvikan
Contributor

dvikan commented Jul 31, 2024

the full content is already present (using curl):

<item>
      <title><![CDATA[The biggest-ever global outage: lessons for software engineers]]></title>
      <description><![CDATA[Cybersecurity vendor CrowdStrike shipped a routine rule definition change to all customers, and chaos followed as 8.5M machines crashed, worldwide. There are plenty of learnings for developers.]]></description>
      <link>https://newsletter.pragmaticengineer.com/p/the-biggest-ever-global-outage-lessons</link>
      <guid isPermaLink="true">https://newsletter.pragmaticengineer.com/p/the-biggest-ever-global-outage-lessons</guid>
      <dc:creator><![CDATA[Gergely Orosz]]></dc:creator>
      <pubDate>Tue, 23 Jul 2024 15:20:09 GMT</pubDate>
      <enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/aeebe9e0-97a3-4ea0-9c23-24275a9623d6_1280x946.png" length="0" type="image/jpeg"/>
      <content:encoded><![CDATA[<p><em>&#128075; Hi, this is Gergely with the monthly, free issue of the Pragmatic Engineer Newsletter. In every issue, I cover challenges at Big Tech and startups through the lens of engineering managers and senior engineers. To get issues like this in your inbox, sign up here:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://newsletter.pragmaticengineer.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://newsletter.pragmaticengineer.com/subscribe?"><span>Subscribe now</span></a></p><p>Unless you were under a rock since last week, you likely heard about the CrowdStrike / Windows outage that took down critical services like airlines, banks, supermarkets, police departments, hospitals, TV channels, and more, around the world. Businesses saw their Windows machines crash with the &#8220;Blue Screen of Death,&#8221; and no obvious fixes &#8211; at least not initially. The incident was unusual in size and scale, and also because it involved software running at the kernel-level; a factor which gives this us all the more reason to take a look at it.</p><p>Today, we cover:</p><ol><li><p><strong>Recap</strong>. 8.5M Windows machines impacted across industries</p></li><li><p><strong>Root cause. </strong>An update to naming rule

@SqrtMinusOne
Contributor Author

SqrtMinusOne commented Jul 31, 2024

Overwriting prior values in $item['content'] was indeed intended - by default it's set to description (as in normal RSS 2.0), and only replaced by the content namespace if the latter is present.

The first foreach can only retrieve a plain <content> tag, which is present neither in the RSS standard nor in the content extension.
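
To make the precedence described above concrete, here is a minimal standalone sketch. It is not the actual FeedParser code: `parseItemContent` and the sample feed are invented for illustration, and it assumes the standard RSS content module namespace URI.

```php
<?php
// Hypothetical sketch; the real FeedParser does more than this.

// Namespace URI of the RSS content module (content:encoded).
const CONTENT_NS = 'http://purl.org/rss/1.0/modules/content/';

function parseItemContent(SimpleXMLElement $feedItem): array
{
    $item = [];

    // Default: plain RSS 2.0 description.
    if (isset($feedItem->description)) {
        $item['content'] = (string) $feedItem->description;
    }

    // Override: content:encoded replaces the description when present.
    $content = $feedItem->children(CONTENT_NS);
    if (isset($content->encoded)) {
        $item['content'] = (string) $content->encoded;
    }

    return $item;
}

// Made-up sample feed: one item with content:encoded, one without.
$xml = <<<XML
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <description>Short summary</description>
      <content:encoded>&lt;p&gt;Full text&lt;/p&gt;</content:encoded>
    </item>
    <item>
      <description>Summary only</description>
    </item>
  </channel>
</rss>
XML;

$channel = simplexml_load_string($xml)->channel;
$first = parseItemContent($channel->item[0]);   // content comes from content:encoded
$second = parseItemContent($channel->item[1]);  // content falls back to description
```

Because namespaced children are only reachable through `children()` with the namespace, a plain property lookup or `foreach` over the item never sees `content:encoded`, which is why the override has to be an explicit step.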

And regardless, I've rebased this PR to include #4178 and it works as intended, so I removed my change to FeedParser. Edit: ah, your change included mine...

As for FeedExpander - I can just copy collectExpandableDatas to SubstackBridge, but this seems more troublesome, e.g. because it would also duplicate the value of $problematicStrings and the "Failed to parse xml from %s: %s" message. If it were Lisp, I'd override getContent or some variable inside it, but I don't know a better way to do it here.

@dvikan
Contributor

dvikan commented Jul 31, 2024

The cookie lifetime is 3 months, you say, but the session might die after e.g. 7 days, or after e.g. 24 hours of inactivity.

So this bridge is kind of a hack indeed.

Don't they have some kind of API, so we could create a SubstackApiBridge or something?

The added $headers in FeedExpander isn't that bad. I think maybe I'm okay with it.

Maybe change the description, because it works without a cookie too for non-paywalled content (but those could use the feed directly, without rss-bridge).

For more URL manipulation you could check out the PHP class Url contained in this project.

@SqrtMinusOne
Contributor Author

Rephrased the description.

I'll see how long the session lives, it's certainly longer than a few days 🙂 And I have some hope because this hack worked with The Economist.

Substack doesn't have an official public API, and their "normal" API is authorized with the same session cookie, which requires passing a CAPTCHA to obtain. So I think this is the simplest way.

@dvikan dvikan merged commit b505667 into RSS-Bridge:master Jul 31, 2024
7 checks passed
@dvikan
Contributor

dvikan commented Jul 31, 2024

ok cool

NotsoanoNimus pushed a commit to Solstice-Software/better-rss-bridge that referenced this pull request Aug 8, 2024
* [SubstackBridge] Add Substack

* [SubstackBridge] Add docs

* [SubstackBridge] Fix lint

* [SubstackBridge] Update description

* [SubstackBridge] Update description (x2)