[chore] remove nollamas middleware for now (after discussions with a security advisor) (#4433)

i'll keep this on a separate branch while i experiment with other possible alternatives, but for now both our hacky implementation (especially), and more popular ones (like anubis), aren't looking too great on the deterrent front: https://github.com/eternal-flame-AD/pow-buster

Co-authored-by: tobi <tobi.smethurst@protonmail.com>
Reviewed-on: https://codeberg.org/superseriousbusiness/gotosocial/pulls/4433
Co-authored-by: kim <grufwub@gmail.com>
Co-committed-by: kim <grufwub@gmail.com>
Authored and committed by kim, 2025-09-17 14:16:53 +02:00
commit 6801ce299a
28 changed files with 207 additions and 1395 deletions


@@ -10,6 +10,6 @@ You can allow or disallow crawlers from collecting stats about your instance from
The AI scrapers come from a [community maintained repository][airobots]. It's manually kept in sync for the time being. If you know of any missing robots, please send them a PR!
A number of AI scrapers are known to ignore entries in `robots.txt` even if it explicitly matches their User-Agent. This means the `robots.txt` file is not a foolproof way of ensuring AI scrapers don't grab your content. In addition to this, you might want to look into blocking User-Agents via [requester header filtering](request_filtering_modes.md), and enabling a proof-of-work [scraper deterrence](../advanced/scraper_deterrence.md).
A number of AI scrapers are known to ignore entries in `robots.txt` even if it explicitly matches their User-Agent. This means the `robots.txt` file is not a foolproof way of ensuring AI scrapers don't grab your content. In addition to this, you might want to look into blocking User-Agents via [requester header filtering](request_filtering_modes.md).
[airobots]: https://github.com/ai-robots-txt/ai.robots.txt/


@@ -1,19 +0,0 @@
# Scraper Deterrence
GoToSocial provides an optional proof-of-work based scraper and automated HTTP client deterrence that can be enabled on profile and status web views. The way
it works is that it generates a unique but deterministic challenge for each incoming HTTP request based on client information and current time, that-is a hex encoded SHA256 hash. It then asks the client to find an integer addition to a portion of this that will generate an expected encoded hash result. This is served to the client as a minimal holding page with a single JavaScript worker that computes a solution to this.
The number of hash encode rounds the client is required to complete can be configured: higher values take the client longer to solve, and vice-versa. We also add a certain amount of jitter to make it harder for scrapers to "game" the algorithm. If your challenges take too long to solve, you may deter users from accessing your web pages; conversely, the longer a solution takes to find, the greater the cost you impose on scrapers (in some cases causing their operations to time out). That balance is up to you to configure, which is why this is an advanced feature.
Once a solution to this challenge has been provided, by refreshing the page with the solution in the query parameter, GoToSocial will verify it and, on success, return the expected profile / status page along with a cookie that provides challenge-less access to the instance for up to the next hour.
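To make that flow concrete, here is a minimal Go sketch of one way such a challenge / verification scheme can be put together. This is not the removed nollamas middleware: the function names, the hourly time bucket, and the leading-zeroes difficulty target are all assumptions for illustration (the real scheme speaks of configurable hash encode rounds and only uses a portion of the challenge).

```go
// Hypothetical sketch only: derive a deterministic per-client challenge and
// verify a client-supplied nonce against a simple leading-zeroes target.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
	"time"
)

// challenge derives a deterministic, hex-encoded SHA256 challenge from
// client details and an hourly time bucket, so the same client sees the
// same challenge for roughly the lifetime of the access cookie (assumed).
func challenge(clientIP, userAgent string, now time.Time) string {
	bucket := now.UTC().Truncate(time.Hour).Unix()
	sum := sha256.Sum256([]byte(clientIP + "|" + userAgent + "|" + strconv.FormatInt(bucket, 10)))
	return hex.EncodeToString(sum[:])
}

// verify checks the nonce the client sends back in the query parameter:
// hashing challenge+nonce must produce a digest with `difficulty` leading
// zero hex characters (one common proof-of-work criterion, assumed here).
func verify(chal string, nonce uint64, difficulty int) bool {
	sum := sha256.Sum256([]byte(chal + strconv.FormatUint(nonce, 10)))
	return strings.HasPrefix(hex.EncodeToString(sum[:]), strings.Repeat("0", difficulty))
}

func main() {
	chal := challenge("203.0.113.7", "ExampleBrowser/1.0", time.Now())
	fmt.Println("challenge:", chal)
	fmt.Println("nonce 12345 accepted?", verify(chal, 12345, 4))
}
```

On a successful verification the handler would then set the hour-scoped cookie described above, so subsequent requests from that client skip the challenge entirely.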
The outcome of this, when enabled, is that it should make scraping of your instance's profile / status pages economically unviable for automated data gathering (e.g. by AI companies or search engines). The only downside is that it requires JavaScript to be enabled for people to access your profile / status web views.
This was heavily inspired by the great project that is [anubis], but ultimately we determined we could implement it ourselves with only the features we require, minimal code, and finer-grained integration with our existing authorization / authentication procedures.
The GoToSocial implementation of this scraper deterrence is still incredibly minimal, so if you're looking for more features or fine-grained control over your deterrence measures, then by all means keep ours disabled and stand up a service like [anubis] in front of your instance!
!!! warning
    This proof-of-work scraper deterrence does not protect user profile RSS feeds due to the extra complexity involved. If you rely on your RSS feed being exposed, this is one such case where [anubis] may be a better fit!
[anubis]: https://github.com/TecharoHQ/anubis


@@ -182,23 +182,4 @@ advanced-csp-extra-uris: []
# Options: ["block", "allow", ""]
# Default: ""
advanced-header-filter-mode: ""
# Bool. Enables a proof-of-work based deterrence against scrapers
# on profile and status web pages. This will generate a unique but
# deterministic challenge for each HTTP client to complete before
# accessing the above mentioned endpoints, on success being given
# a cookie that permits challenge-less access within a 1hr window.
#
# The outcome of this is that it should make scraping of these
# endpoints economically unfeasible, while having a negligible
# performance impact on your own instance.
#
# The downside is that it requires JavaScript to be enabled.
#
# For more details please check the documentation at:
# https://docs.gotosocial.org/en/latest/admin/scraper_deterrence
#
# Options: [true, false]
# Default: true
advanced-scraper-deterrence: false
```


@@ -10,6 +10,6 @@ GoToSocial provides a `robots.txt` file on the main domain. The file contains
The AI scrapers come from a [community maintained repository][airobots]. It is manually kept in sync for the time being. If you know of any missing scrapers, please send them a PR.
Many AI scrapers are known to ignore the corresponding rules in `robots.txt` and keep scraping content even when it explicitly disallows their User-Agent. This means the `robots.txt` file is not a foolproof way of ensuring AI scrapers don't grab your content. Beyond this, you may also want to consider blocking those User-Agents via [request header filtering](request_filtering_modes.md), and enabling the proof-of-work based [scraper deterrence](../advanced/scraper_deterrence.md).
Many AI scrapers are known to ignore the corresponding rules in `robots.txt` and keep scraping content even when it explicitly disallows their User-Agent. This means the `robots.txt` file is not a foolproof way of ensuring AI scrapers don't grab your content. Beyond this, you may also want to consider blocking those User-Agents via [request header filtering](request_filtering_modes.md).
[airobots]: https://github.com/ai-robots-txt/ai.robots.txt/


@@ -1,18 +0,0 @@
# Scraper Deterrence
GoToSocial provides an optional, proof-of-work based deterrence against scrapers and automated HTTP clients, which can be enabled on the web views of profiles and statuses.
The way it works is that, for each incoming HTTP request, a unique challenge (a hex-encoded SHA256 hash) is generated from client information and the current time. The client is then asked to find an added value for a portion of this challenge, such that the new SHA256 hash (again hex-encoded) computed over the combination of added value and challenge portion contains at least 4 leading '0' characters. The challenge is presented to the client as a minimal holding page containing a single JavaScript worker that computes a solution.
Once the client has provided a solution to this challenge, by refreshing the page with the solution in a query parameter, GoToSocial verifies it. On success, the server returns the profile or status page the user wanted to access, and sets a cookie that allows challenge-less access to the instance for up to the next hour.
The purpose of enabling this feature is to make it economically unviable for automated data gathering (e.g. by AI companies or search engines) to scrape your instance's profile and status pages. The only downside is that users must have JavaScript enabled to access your profile and status web views.
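To give a rough sense of those economics, here is a purely illustrative solver sketch, written in Go rather than the JavaScript actually served, and assuming the simple 4-leading-zero-characters criterion described above rather than the exact scheme of the removed middleware. With that criterion a client needs on average 16^4 ≈ 65,536 SHA256 computations per page view: negligible for one human visitor, but a real cost for anyone scraping at volume.

```go
// Illustrative brute-force solver: the kind of work the holding page's
// worker performs. Names and the difficulty criterion are assumptions,
// not the removed implementation.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
)

// solve tries successive nonces until sha256(challenge + nonce) has a hex
// digest starting with `difficulty` zero characters, and reports how many
// attempts that took.
func solve(challenge string, difficulty int) (nonce, attempts uint64) {
	target := strings.Repeat("0", difficulty)
	for ; ; nonce++ {
		attempts++
		sum := sha256.Sum256([]byte(challenge + strconv.FormatUint(nonce, 10)))
		if strings.HasPrefix(hex.EncodeToString(sum[:]), target) {
			return nonce, attempts
		}
	}
}

func main() {
	// Each extra zero character multiplies the expected work by 16.
	nonce, attempts := solve("example-challenge", 4)
	fmt.Printf("solved with nonce=%d after %d attempts (expected ~65536)\n", nonce, attempts)
}
```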
This feature was heavily inspired by the excellent [anubis] project, but we ultimately decided to implement it ourselves, with only the features we need, minimal code, and finer-grained integration with our existing authorization / authentication flow.
GoToSocial's implementation of this scraper deterrence is still extremely minimal. So if you need more features or finer-grained control over your deterrence measures, you can absolutely keep our built-in feature disabled and stand up a service like [anubis] in front of your instance!
!!! warning "Warning"
    This proof-of-work based scraper deterrence does not protect user profile RSS feeds, as doing so would involve extra complexity. If you need your RSS feed to remain accessible, [anubis] may be a better fit in that case!
[anubis]: https://github.com/TecharoHQ/anubis


@@ -149,21 +149,4 @@ advanced-csp-extra-uris: []
# 选项: ["block", "allow", ""]
# 默认: ""
advanced-header-filter-mode: ""
# Bool. Enables a proof-of-work based scraper deterrence
# on profile and status web pages. This will generate a unique,
# deterministic challenge that each HTTP client must complete
# before accessing the above mentioned endpoints. On completion,
# the client is given a cookie that permits challenge-less access
# within a 1 hour window.
#
# The result is that it should, in theory, make scraping of these
# endpoints economically unviable, while having a negligible
# performance impact on your own instance.
#
# The downside is that it requires clients to have JavaScript enabled.
#
# For more details please check the documentation at:
# https://docs.gotosocial.org/zh-cn/latest/admin/scraper_deterrence
#
# Options: [true, false]
# Default: true
advanced-scraper-deterrence: false
```


@@ -80,7 +80,6 @@ nav:
- "advanced/tracing.md"
- "advanced/metrics.md"
- "advanced/replicating-sqlite.md"
- "advanced/scraper_deterrence.md"
- "advanced/sqlite-networked-storage.md"
- "适用进阶场景的构建":
- "advanced/builds/nowasm.md"