A poster’s guide to who’s selling your data to train AI

If you’ve ever posted anything on the internet, chances are that your data has already been scraped, collected, and used to train AI systems like the ones powering ChatGPT, Midjourney, and Sora. Generative AI is designed to succeed as a generalist, and learning to do so, OpenAI has said, requires “internet-scale” data to train on.

You probably don’t need me to tell you what happened when companies used scraped public data — often without the permission of those who created it — from news articles, books, and creative projects to teach AI tools how to, say, generate news articles, books, and creative projects.

The New York Times is currently suing OpenAI for allegedly using its expansive archives without permission to train chatbots (in a recent filing, OpenAI accused the Times of hiring “someone to hack” ChatGPT in order to prove that the chatbot was stealing its content). Getty Images sued Stability AI, the company behind Stable Diffusion, for copyright infringement. Other lawsuits from authors and creators, angry to find that their works were used to train AI models, have faced setbacks in court.

Other companies have decided to make deals. The Associated Press has licensed part of its archives to OpenAI. Shutterstock, the stock photo archive, has signed a six-year deal with OpenAI to provide training data, which includes access to its photo, video, and music databases.

The ways AI systems use the work of journalists, musicians, and photographers have pretty consequential implications for our information and cultural ecosystem and for the people who work in the fields that AI companies seem dead-set on developing tools to replace. The need to gather more and more training data with as little fuss as possible means that anyone who posts online, whether it’s a fandom Tumblr account, an active Reddit presence, or a personal blog, could see access to their content being sold by the platforms hosting it to one of these big AI companies.

Below is a quick guide to what we know right now about who might be selling your best posts as training data.

Tumblr and WordPress.com

Earlier this week, 404 Media reported that Automattic, the parent company of Tumblr and WordPress.com, was preparing to announce deals selling user data to OpenAI and Midjourney. According to 404’s reporting, which describes such a deal as “imminent,” the data seems likely to include user posts on Tumblr and on WordPress.com. On Wednesday, a day after 404’s report, Automattic announced a way for users to opt out of sharing their public content with third parties.

The Tumblr staff announcement on the change framed the whole thing as a sign that the company was working to protect its users. “We already discourage AI crawlers from gathering content from Tumblr and will continue to do so,” the announcement read, “save for those with which we partner.”

Automattic said in a statement that it was “working directly with select AI companies as long as their plans align with what our community cares about: attribution, opt-outs, and control,” but has not provided any further information on the reported deals with OpenAI and Midjourney.

Although Tumblr’s cultural heft has waned over the past decade, it’s still a pretty important platform for fandom content, including fanfiction and fan art. There are also plenty of artists who use Tumblr to host their original work and take commissions.

Reddit

Reddit’s enormous archives of posts are driven by the labor of volunteers: Unpaid subreddit moderators oversee communities of unpaid users. Their collective efforts on Reddit make the platform valuable.

So when Reddit announced that it was launching an IPO, the company reached out to a selection of mods and frequent posters to offer them the opportunity to buy stock early. Some of those who received the offer were not super enthusiastic about it. But Reddit does not need buy-in from its users to profit from their work: It has already sold access to their posts to Google.

Just before the IPO announcement, Reddit and Google entered into a $60 million deal that would give Google access to Reddit’s API in order to, among other things, train its generative AI models.

Everything else, to be honest

The reported deals above are just the few that have become public. But the absence of an announced deal doesn’t mean that large AI models aren’t already being trained on your posts elsewhere on the internet.

Last year, the Washington Post examined one of the massive data sets of scraped public internet data used to train generative AI models and found everything from World of Warcraft message boards to Patreon and Kickstarter and several huge repositories of personal blogs. And it should not be a surprise that Meta uses public posts from Facebook and Instagram to train its AI models.
