The Wikimedia Foundation, stewards of the finest projects on the web, have written about the hammering their servers are taking from the scraping bots that feed large language models.
[…]
When we talk about the unfair practices and harm done by training large language models, we usually talk about it in the past tense: how they were trained on other people’s creative work without permission. But this is an ongoing problem that’s just getting worse.
The worst of the internet is continuously attacking the best of the internet. This is a distributed denial of service attack on the good parts of the World Wide Web.
If you’re using the products powered by these attacks, you’re part of the problem. Don’t pretend it’s cute to ask ChatGPT for something. Don’t pretend it’s somehow being technologically open-minded to continuously search for nails to hit with the latest “AI” hammers.
If you’re going to use generative tools powered by large language models, don’t pretend you don’t know how your sausage is made.
Denial
FOSS infrastructure is under attack by AI companies
in LibreNews
Three days ago, Drew DeVault, founder and CEO of SourceHut, published a blog post called "Please stop externalizing your costs directly into my face", in which he complained that LLM companies were crawling data without respecting robots.txt and causing severe outages at SourceHut.
[…]
Then, yesterday morning, KDE GitLab infrastructure was overwhelmed by another AI crawler, with IPs from an Alibaba range; this made GitLab temporarily inaccessible to KDE developers.
[…]
By now, it should be pretty clear that this is no coincidence. AI scrapers are getting more and more aggressive, and, since FOSS projects depend on public collaboration in a way that private companies don't, this puts an extra burden on Open Source communities.
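The crawler directives these posts accuse the scrapers of ignoring are just plain-text rules served at the site root. A minimal robots.txt that asks known LLM crawlers to stay away might look like this (GPTBot, CCBot, and Google-Extended are user-agent tokens documented by OpenAI, Common Crawl, and Google respectively; this is a sketch, not a complete list of AI crawlers):

```
# robots.txt: ask (not force) LLM crawlers to skip this site
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Everyone else may crawl as normal
User-agent: *
Allow: /
```

As both posts point out, robots.txt is purely advisory: it only works when crawlers choose to honour it, and that voluntary compliance is exactly what these scrapers are skipping.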
Configuring Firefox
Really good tips here, including a couple I'd not heard about and promptly followed:
This is the bare minimum necessary to configure Firefox so that it behaves in a reasonable manner.
This document was last updated on 27 January 2025 and was tested with a clean install of Firefox 134.
Verify these steps each time Firefox is updated.
- Go to uBlock Origin and click Add to Firefox

  This will filter out most of the advertisements on websites, saving you a shitload of network traffic (and if your computer is slow, not having to show all that crap is a big speedup). Once you get it set up you can just ignore it, but if you care it will tell you how much stuff it's blocked on your behalf.

- Go to LocalCDN and click Add to Firefox

  Most websites load the same files over and over from the same places -- primarily Google servers. This thing puts all that right in your browser, making for less network traffic and denying Google the privilege of inspecting your usage patterns. Once it's installed you can ignore it.

[…]
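Since the document asks you to re-verify these settings after every Firefox update, it's worth knowing that about:config preferences can be pinned in a user.js file in your Firefox profile folder, which Firefox re-applies at every startup. A small sketch: the preference names below are real Firefox prefs, but treating them as the linked document's exact recommendations is an assumption on my part.

```
// user.js -- place in the Firefox profile folder; Firefox re-applies
// these prefs on every startup, so an update can't silently revert them.
user_pref("privacy.trackingprotection.enabled", true);   // enhanced tracking protection
user_pref("extensions.pocket.enabled", false);           // disable Pocket integration
user_pref("browser.newtabpage.activity-stream.showSponsored", false);         // no sponsored stories
user_pref("browser.newtabpage.activity-stream.showSponsoredTopSites", false); // no sponsored shortcuts
```

Extensions like uBlock Origin and LocalCDN still have to be installed by hand, but the prefs side of the checklist stops needing manual re-verification.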
Mozilla's Original Sin
Some will tell you that Mozilla's worst decision was to accept funding from Google, and that may have been the first domino, but I hold that implementing DRM is what doomed them, as it led to their culture of capitulation. It demonstrated that their decisions were the decisions of a company shipping products, not those of a non-profit devoted to preserving the open web.
Those are different things and are very much in conflict. They picked one. They picked the wrong one.
[…]
In my humble but correct opinion, Mozilla should be doing two things and two things only:
- Building THE reference implementation web browser, and
- Being a jugular-snapping attack dog on standards committees.
- There is no 3.
Vision for W3C
for World Wide Web Consortium (W3C)
A pithy little declaration.
This document articulates W3C’s mission, its values, its organizational principles, and our vision for W3C as an organization in the context of our vision for the Web itself. The goal of this vision is not to predict the future, but to define shared principles to guide our decisions.
The goals of this document are to:
- Help the world understand what W3C is, what it does, and why it matters
- Communicate shared values and principles of the W3C community
- Be opinionated enough to provide a framework for making decisions, particularly on controversial issues
- Be timeless enough to guide W3C yet flexible enough to evolve when needed
Paramount Is Taking Down Decades Worth of Old TV Clips from the Web
in IndieWire
A rep for Paramount told IndieWire: “As part of broader website changes across Paramount, we have introduced more streamlined versions of our sites, driving fans to Paramount+ to watch their favorite shows.”
For now though, many of these series, such as “The Colbert Report” or “The Nightly Show,” are not currently available on Paramount+. Even “The Daily Show” has only its two most recent seasons, 2023 and 2024, available, despite decades of the show’s history. “South Park” clips used to be hosted on Comedy Central’s website, but the only place to watch full episodes of those is Max, not Paramount+.
The likely reason for this? Cost cutting. In a town hall this week, Paramount’s “Office of the CEO,” including co-chiefs George Cheeks, Chris McCarthy, and Brian Robbins, expressed plans to save $500 million in order to stave off profit drops and one day make Paramount+ profitable.
Is Google Getting Worse? A Longitudinal Investigation of SEO Spam in Search Engines
Many users of web search engines have been complaining in recent years about the supposedly decreasing quality of search results. This is often attributed to an increasing amount of search-engine-optimized but low-quality content. Evidence for this has always been anecdotal, yet it’s not unreasonable to think that popular online marketing strategies such as affiliate marketing incentivize the mass production of such content to maximize clicks. Since neither this complaint nor affiliate marketing as such have received much attention from the IR community, we hereby lay the groundwork by conducting an in-depth exploratory study of how affiliate content affects today’s search engines. We monitored Google, Bing and DuckDuckGo for a year on 7,392 product review queries. Our findings suggest that all search engines have significant problems with highly optimized (affiliate) content—more than is representative for the entire web according to a baseline retrieval system on the ClueWeb22. Focussing on the product review genre, we find that only a small portion of product reviews on the web uses affiliate marketing, but the majority of all search results do. Of all affiliate networks, Amazon Associates is by far the most popular. We further observe an inverse relationship between affiliate marketing use and content complexity, and that all search engines fall victim to large-scale affiliate link spam campaigns. However, we also notice that the line between benign content and spam in the form of content and link farms becomes increasingly blurry—a situation that will surely worsen in the wake of generative AI. We conclude that dynamic adversarial spam in the form of low-quality, mass-produced commercial content deserves more attention.
The End of Indie Web Browsers: You Can (Not) Compete
A good explainer:
In 2017, the body responsible for standardizing web browser technologies, W3C, introduced Encrypted Media Extensions (EME)—thus bringing with it the end of competitive indie web browsers.
No longer is it possible to build your own web browser capable of consuming some of the most popular content on the web. Websites like Netflix, Hulu, HBO, and others require content protection for copyrighted media, which is available only to browser vendors that have license agreements with large corporations.
[…]
These roadblocks were primarily introduced to appease the media industry.
[…]
Since the introduction of EME to web standards, the ability for new browsers to compete has become restricted by gatekeepers, which goes against the promises of the platform.
Privacy First: A Better Way to Address Online Harms
for Electronic Frontier Foundation (EFF)
The truth is many of the ills of today’s internet have a single thing in common: they are built on a system of corporate surveillance. Multiple companies, large and small, collect data about where we go, what we do, what we read, who we communicate with, and so on. They use this data in multiple ways and, if it suits their business model, may sell it to anyone who wants it—including law enforcement. Addressing this shared reality will better promote human rights and civil liberties, while simultaneously holding space for free expression, creativity, and innovation than many of the issue-specific bills we’ve seen over the past decade.
In other words, whatever online harms you want to alleviate, you can do it better, with a broader impact, if you do privacy first.
Marking the Web’s 35th Birthday: An Open Letter
for World Wide Web Consortium (W3C)
There are two clear, connected issues to address. The first is the extent of power concentration, which contradicts the decentralised spirit I originally envisioned. This has segmented the web, with a fight to keep users hooked on one platform to optimise profit through the passive observation of content. This exploitative business model is particularly grave in this year of elections that could unravel political turmoil. Compounding this issue is the second, the personal data market that has exploited people’s time and data with the creation of deep profiles that allow for targeted advertising and ultimately control over the information people are fed.
How has this happened? Leadership, hindered by a lack of diversity, has steered away from a tool for public good and one that is instead subject to capitalist forces resulting in monopolisation. Governance, which should correct for this, has failed to do so, with regulatory measures being outstripped by the rapid development of innovation, leading to a widening gap between technological advancements and effective oversight.
The future hinges on our ability to both reform the current system and create a new one that genuinely serves the best interests of humanity. To achieve this, we must break down data silos to encourage collaboration, create market conditions in which a diversity of options thrive to fuel creativity, and shift away from polarising content to an environment shaped by a diversity of voices and perspectives that nurture empathy and understanding.