AI Content Copyright and Data Licensing: Investment Research

Core Conclusions

AI content copyright and data licensing has shifted from an "abstract compliance question" into one of the upstream supply constraints on foundation models, AI search, enterprise RAG, and multimodal generation. But the revenue that has actually become public and verifiable concentrates mainly in data that is high-value, structured, traceable, and rights-clear, not the entire open internet. What landed first was not "universal copyright payment" but top news-archive licensing, UGC/API licensing, professional-database/educational-content licensing, image-library safe generation, and enterprise private-data governance.
Scenarios that have already produced real revenue, with the strongest public evidence, include: AP–OpenAI news-archive licensing, FT–OpenAI, Axel Springer–OpenAI, News Corp–OpenAI, NYT–Amazon, Reuters–Meta, Reddit–Google/OpenAI, Stack Overflow–OpenAI/Google/Moveworks, Informa/Taylor & Francis–Microsoft, Wiley's AI licensing of research content, Shutterstock–OpenAI, and Getty's "commercially safe" generation and AI-platform partnerships. Most deal values are undisclosed, but these deals have at least passed the "contract signed" stage and reached "revenue landed."
The highest revenue certainty today does not lie with media broadly but with professional, workflow-type database companies: Thomson Reuters, RELX/LexisNexis, Wolters Kluwer, Pearson, S&P Global, Moody's, FactSet, and Bloomberg. These firms own content and data that is high-quality, continuously updated, rich in metadata, and embedded in critical decision flows. Their AI commercialization more often takes the form of AI-enhanced subscriptions / workflow products rather than "selling raw training corpora," which carries higher gross margin and retention and stronger defensibility.
The areas still clearly stuck in litigation, policy contention, or gray private deals are mainly large-scale public web pretraining, general book corpora, unlicensed music training, film/anime/character training, general code scraping, low-transparency data brokerage, and most personality-rights/voiceprint/likeness training. The core problem in these areas is not "whether there is value" but that rights boundaries, proof of provenance, market-substitution harm, and cross-jurisdiction compliance remain unsettled.
Legal signals have moved from "is AI training inherently fair use" toward "is the data source lawful, was it acquired for payment, is it high-value structured content, and does it cause provable substitution harm." Several key 2025 rulings diverged sharply: Anthropic won a favorable ruling on "lawfully acquired books used for training" but retains high risk on pirated book libraries; Meta prevailed in the authors' case; and Thomson Reuters v. Ross Intelligence rejected the fair-use defense for using Westlaw headnotes to train/compare. Taken together, these rulings mean that general internet text will not necessarily require payment across the board, but professional databases and pirated sources carry markedly higher legal risk.
News publishers' AI licensing currently looks more like "defensive monetization" than a mature new core business. Top brands can sign large deals, but the vast majority of deal values are confidential, revenue is often blended into "licensing/other revenue," and AI search continues to erode traffic and summary attribution. So top news groups can win compensation while mid-tier publishers may not.
The music industry is moving from "suing generative platforms" toward "selective licensing + revenue sharing + artist-consent mechanisms." In 2024 the three major labels sued Suno/Udio; by the second half of 2025, licensing deals and partnerships had begun to appear, including WMG with Suno/Udio and the three majors with Klay. This indicates that music copyright will not simply be consumed by models for free but will more likely evolve into a combined model of licensed catalogs, controllable style/voice, subscription revenue sharing, copyright filtering, and royalty accounting. Even so, the public financial contribution remains clearly weaker than that of news and professional databases.
The image-library and visual track already shows a clearer productization path than news: Getty explicitly turns "commercially safe," "indemnification," and "contributor compensation" into enterprise selling points; Shutterstock both supplies training data and extends the chain through a Contributor Fund and its OpenAI partnership; Visual China Group, in the Chinese market, emphasizes copyright transactions and AI creative customization built on "commercial use + traceability + platform service fees." The long-term profit pool in the visual track will more likely sit with content libraries and trading platforms that have releases, metadata, and commercial-safety guarantees, rather than one-off training licenses.
UGC and community data is among the earliest AI raw materials to be repriced. Reddit licensed its Data API to Google and OpenAI, with Google explicitly using the API to display, train on, and understand Reddit content; Stack Overflow packages its public Q&A corpus, API, and enterprise knowledge products together as "Knowledge Solutions / Data Licensing." The core value of this data lies beyond the text itself, in its freshness, structure, community validation, and question–answer graph.
AI-native challengers are moving in on traditional copyright-management firms' positions, but most are still at the strong-narrative, weak-proof-of-scale stage. Cloudflare has pushed default AI-crawler blocking, Pay Per Crawl, and content-signal tools into the mainstream; RSL introduced a machine-readable licensing standard; TollBit already has "transactions live"; ProRata offers a 50% revenue-share framework; Created by Humans modularizes book training/RAG rights; Vermillio and Loti focus on likeness/voice protection and licensing. The issue is that standardization capability has emerged, but sustainable large-scale revenue has not been fully disclosed.
The long-term profit pool will more likely settle with three kinds of companies: first, professional databases and workflow platforms; second, content platforms with clear rights and commercially safe output; third, data-governance/compliance/provenance infrastructure. By contrast, pure model companies may prefer to compress licensing costs onto a few key sources of content rather than pay broadly for the general internet.
From a valuation standpoint, the market already assigns a substantial value to the "data-licensing option" at Reddit and some top AI-narrative platforms; pricing remains divergent for News Corp's multi-LLM licensing capability, Wiley/Informa's AI content monetization, Getty/Shutterstock's compliant visual-asset revaluation, and the professional-information giants upgrading content libraries into AI workflow products. In relative terms, visual-asset platforms such as Getty/Shutterstock are priced markedly below professional-database companies, while Reddit, NYT, and similar names have already priced in a meaningful portion of AI expectations.
The biggest catalysts over the next twelve to twenty-four months are not the win/loss of any single lawsuit but three things: whether the EU's GPAI training-data summary and copyright-enforcement rules genuinely land, whether U.S. copyright and fair-use case law continues to diverge, and whether AI search forms a quantifiable publisher revenue-share / citation-traffic system. These will determine whether AI copyright licensing stays a handful of big deals or evolves into a long-term cost and infrastructure market.

Value-Chain Landscape and Commercialization Stages

The most important segmentation in this track is not "content industry vs. AI industry" but five stages: litigation claim, licensing negotiation, contract signed, revenue landed, and sustainable at-scale licensing. So far, very few have truly reached the fifth stage; the most mature is the professional-database subscription-type AI product, the second most mature is UGC/API data licensing, and only third comes the AI licensing of top news and image libraries. Music, film/TV IP, the long tail of book copyright, personality rights, and general web scraping remain broadly stuck in the first four stages.

Value-chain position	Sub-segment	Core products/services	AI demand driver	Main revenue model	Content/copyright/governance barrier	Regulatory/litigation risk	Commercialization stage	Margin profile	Representative companies	Benefit intensity	Investment elasticity
News publishing	News-archive and real-time news licensing	Archives, real-time feeds, summary/display licensing	LLM training, AI search, real-time Q&A	Multi-year fixed license fees, API fees, summary-display fees, partial revenue share	Brand trust, original reporting, attribution need, paywall	High: NYT/OpenAI, publisher–AI-search relations unsettled	Contract signed → revenue landed	High incremental margin, but uneven sustainability	AP, News Corp, NYT, FT, Reuters, Axel Springer	High	High
Academic publishing	TDM/corpus licensing	Journal full text, metadata, citation networks	Training, professional search, RAG	Data-access fees, enterprise licensing, one-off + deferred payment	Peer review, citation metadata, institutional relationships	Medium-high: author consent and contract boundaries	Contract signed → revenue landed	High margin, but large political/public-opinion friction	Informa/T&F, Wiley, Springer Nature	High	Medium-high
Professional databases	Legal/tax/risk/science/finance databases	Search libraries, citators, knowledge graphs, AI copilots	Enterprise agents, professional RAG, workflow automation	Subscription, seat, usage, workflow software	High update frequency, structured, embedded in processes, compliant	Medium: but professional-content rights are the strongest	Sustainable at-scale licensing	Best margin and retention	Thomson Reuters, RELX, Wolters Kluwer, S&P Global, Moody's, FactSet, Bloomberg	Very high	Medium-high
Music copyright	Recording/composition/voice/likeness licensing	Catalogs, styles, voice rights, filtering and revenue splits	AI music generation, voice cloning, remix	License fees, subscription revenue share, royalty allocation, style/likeness licensing	Complex rights chain but high concentration	Very high: litigation and personality rights run in parallel	Litigation claim → selective signing	High margin if standardization succeeds	UMG, WMG, Sony, Merlin, Klay, Suno, Udio	Medium-high	Very high
Image libraries and video footage	Rights-cleared visual data	Images/video/3D, releases, metadata	Image/video training, enterprise generation, brand-safe creation	Subscription, usage, training licensing, image-generation fees	Model/property release, metadata, copyright indemnification	High: Getty v. Stability and others	Revenue landed → scale exploration	Potential high-margin "safe generation" products	Getty, Shutterstock, Adobe Stock, Visual China Group	High	High
Books and authors	Book training/RAG	Full-text books, summaries, translation/audiobook rights	LLM training, writing assistants, knowledge Q&A	Single-book/bulk licensing, platform opt-in	Long-tail rights, fragmented contracts	Very high: active author rights enforcement	Litigation claim → early platformization	High license margin, but high rights-clearance cost	Authors Guild, Created by Humans, publisher alliances	Medium	High
UGC platforms	Forum/community/comment data	Data API, structured dialogue, real-time discussion	Training, search augmentation, RAG	API fees, annual fees, data licensing	Freshness, discussion context, user signals	Medium-high: user consent / platform terms	Revenue landed	High margin, low incremental cost	Reddit, Stack Overflow, Quora/RSL ecosystem	Very high	Very high
Enterprise private data	Enterprise RAG and permissioned data	Documents, support logs, code repositories, CRM	Enterprise agents, internal Q&A, automation	SaaS, usage, data-governance add-on fees	Permission systems, data lineage, privacy and audit	Medium: mainly privacy/security	Sustainable at-scale licensing	Strong SaaS margin	Snowflake, Databricks, MongoDB, Elastic	Very high	Medium-high
Data exchanges	Dataset marketplace / exchange	Data distribution, rights metadata, audit	Training, vertical models, agent memory	Platform commission, subscription, transaction fees	Supply-organizing capability, rights metadata	Medium-high: proof of provenance is core	Early revenue validation	Potential high platform margin	TollBit, ProRata, Hugging Face, DataCite	Medium-high	Very high
Annotation/RLHF	Human feedback and evaluation	Annotation, red-teaming, preference data, evaluation	Post-training alignment, fine-tuning, model evaluation	Project fees, long-term service contracts	Human network and quality control	Medium: price competition and automation substitution	Mature but more service-oriented	Medium margin	Scale AI, Appen, TELUS Digital, Defined.ai	Medium	Medium
Content provenance/watermarking/detection	Provenance & authenticity	C2PA, metadata, detection and forensics	Compliance, brand safety, infringement management	SaaS, enterprise editions, platform integration	Network effects and standard compatibility	Medium: technical effectiveness needs validation	Early commercialization	High software margin	CAI/C2PA, Truepic, Vermillio, Loti	Medium-high	High
Crawler control/licensing standards	Access control	robots/RSL/pay-per-crawl	AI search, training, agent scraping	Platform service fees, transaction take	Infrastructure coverage	Medium: needs AI-company cooperation	Early to mid stage	High software-infrastructure margin	Cloudflare, Fastly, Akamai, RSL Collective	Medium-high	High

Viewed across two dimensions, "already real revenue" and "still contested," the split is more direct. Revenue already realized comes mainly from top news brands, UGC APIs, institutional licensing of academic/educational content, professional-database AI subscriptions, rights-cleared visual generation, and enterprise data governance. Still contested are mainly open-web pretraining, pirated or unsourced books, unlicensed music/film training, long-tail author revenue sharing, personality-rights training, and training-data transparency across multiple jurisdictions.

Business Models and Profit Pools

The core of AI content copyright and data licensing is not "how much content is worth" but which use case is willing to pay over the long term. Model pretraining values volume, diversity, and marginal cost; AI search values timeliness, authority, and summary/citation rights; enterprise RAG values permissions, update frequency, and audit trails; music and visual generation value commercial safety, person/voice consent, and downstream royalty accounting. So even though it is all "licensing," the pricing logic differs entirely.

Business model	Typical scenario	Pricing logic	Pros	Cons	Better-suited suppliers
One-off license fee	Archive training, historical corpora, bulk content access	Corpus scale, exclusivity, litigation deterrence	Fast to land, high margin	Not sustainable, customers push back on price	Top news, academic publishers
Annual/multi-year fixed fee	News libraries, UGC API, enterprise content access	Authority, update frequency, API availability	Predictable, fits budgets	Renewal price subject to negotiation	AP, News Corp, Reddit, T&F
Per-call/per-token/per-API	Real-time news, RAG, retrieval augmentation	Query volume, latency, SLA	Scales with usage	Volatile cost	Reuters Connect, UGC API, enterprise data platforms
Charge by training use	Foundation-model/multimodal training	Training rounds, scope of use, re-licensing limits	Easy to lock in large customers	Hard to sustain renewals after training is done	Shutterstock, Getty, some academic/news archives
Citation/summary/search revenue share	AI search, answer-page citations	Display volume, clicks, ad/subscription revenue share	Close to the traffic logic	Hard attribution, data black box	Publisher alliances, ProRata, Perplexity model
Output revenue share/royalties	Music, voice, characters, AI creation	Downloads, plays, subscriptions, generation counts	Can lock in creators long term	Complex rights clearance	UMG/WMG/Sony, Vermillio, collective management organizations
AI-enhanced subscription/workflow	Legal, tax, finance, education, medicine	Customer ROI, time saved, compliance value	Highest retention, best profit	Long build cycle	TRI, RELX, WKL, Pearson
Collective licensing/standardized licensing	Web scraping, long-tail creators	Coverage scope, standardized protocols, enforcement	Solves long-tail rights clearance	Needs network effects and a neutral enforcement layer	RSL, TollBit, Created by Humans, ProRata
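
The trade-off between the fixed-fee and per-call rows above can be made concrete with a break-even calculation. The sketch below is illustrative only: the fee levels, the class names, and the function `break_even_calls` are assumptions introduced here, not figures or terms from any disclosed deal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FixedFee:
    """Annual/multi-year fixed fee: cost does not depend on usage."""
    annual_fee: float

    def annual_cost(self, calls: int) -> float:
        return self.annual_fee


@dataclass(frozen=True)
class PerCall:
    """Per-call/per-API pricing: cost scales linearly with query volume."""
    price_per_call: float

    def annual_cost(self, calls: int) -> float:
        return self.price_per_call * calls


def break_even_calls(fixed: FixedFee, metered: PerCall) -> float:
    """Return the call volume at which both models cost the buyer the same."""
    if metered.price_per_call <= 0:
        raise ValueError("price_per_call must be positive")
    return fixed.annual_fee / metered.price_per_call


# Hypothetical numbers, chosen only to illustrate the arithmetic.
fixed = FixedFee(annual_fee=1_000_000.0)
metered = PerCall(price_per_call=0.25)

if __name__ == "__main__":
    threshold = break_even_calls(fixed, metered)
    for calls in (1_000_000, 4_000_000, 10_000_000):
        print(calls, fixed.annual_cost(calls), metered.annual_cost(calls),
              "fixed fee cheaper" if calls > threshold else "metered not more expensive")
```

Below the break-even volume the buyer pays less under metered pricing; above it the fixed fee is cheaper for the buyer. That is one way to read the table's contrast between "predictable, fits budgets" and "volatile cost."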

Where the profit pool ultimately lands depends on the scenario:

First, the training-data profit pool will not be distributed evenly across all rights holders. For general pretraining, model companies will use public data, existing paid agreements, user data, and synthetic data to lower cost as much as possible; what can truly be priced separately over the long term is high-value "gap-filling" data, not the entire web corpus. The outcomes of the Anthropic and Meta authors' cases reinforce this further.

Second, the AI-search and RAG profit pool will more likely land with a few high-authority content libraries and the interface layer. The reason is that search/Q&A needs fresh, traceable, citable, and correctable content; enterprise RAG further requires permissions and audit. So productized databases such as Reuters, Factiva, LexisNexis, Westlaw, Wolters Kluwer, and Pearson are easier to charge for over the long term than ordinary news pages and general web text.

Third, the profit pool for music, voice, characters, and likeness will more likely tilt toward "rights management + filtering + revenue splits" rather than the raw model, because the commercial-use risk on the user side is higher and personality-rights/consent mechanisms cannot simply be substituted. Large-scale financial landing is still hard to see in public markets, but the industry direction has already shifted from "whether to license" to "who licenses, who filters, who clears."

Fourth, the visual-content profit pool will most likely concentrate on platforms with "clear rights + complete metadata + indemnification capability." Getty directly productizes "uncapped indemnification" and contributor compensation; Shutterstock sells training data on one hand and operates generation tools and a contributor fund on the other. This business model is closer to long-term high-margin SaaS/subscription than to one-off data sales.

On the question of whether AI copyright licensing is a long-term cost for model companies, my judgment is scenario-dependent: general-pretraining licensing looks more like a transitional, strategic cost; high-value professional data, RAG permissioned data, AI-search real-time citation, and commercial-use music/visual/likeness licensing look more like long-term structural cost. The EU AI Act's requirements on training-content summaries and copyright policy, and the rise of access-control layers such as Cloudflare/RSL/TollBit, are both pushing "transparent provenance + conditional payment" to become the norm.
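
The access-control layer mentioned above starts from a simple machine-readable signal. As a minimal sketch, the snippet below uses Python's standard `urllib.robotparser` to check whether a given crawler is allowed under a robots.txt policy. The policy text, the `example.com` URL, and the helper name `crawler_allowed` are hypothetical. robots.txt is only a voluntary signal, and this sketch does not model Pay Per Crawl, RSL license terms, or any payment flow.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block one AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

# Hypothetical article URL used only for the check.
URL = "https://example.com/articles/1"


def crawler_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return bool(parser.can_fetch(user_agent, url))


if __name__ == "__main__":
    for agent in ("GPTBot", "ExampleBrowser"):
        print(agent, crawler_allowed(ROBOTS_TXT, agent, URL))
```

Because compliance with robots.txt depends on the crawler's cooperation, enforcement and payment still need a network or contract layer on top of this signal.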

Dimension	Conservative	Base	Aggressive
Key assumption	U.S. fair use stays broad; transparency requirements limited	U.S. case law keeps diverging; EU transparency lands; top deals increase	Training-data transparency requirements tighten; platforms enforce licensing standards
Copyright-litigation trajectory	General training mostly protected, pirated sources excepted	Lawfully acquired / professional content better protected	Rights holders significantly strengthened
AI-company licensing willingness	Buy only the hardest-to-substitute content	Willing to pay for real-time, authoritative, compliant data	Training, search, and generation all more broadly licensed
Content-side bargaining power	Only top content libraries have leverage	Top brands and platform intermediaries strengthen	Collective licensing / standardized markets form
AI-search traffic impact	Negative-leaning for publishers	Needs partial revenue sharing to offset	Traffic decline partly offset by licensing revenue
Benefiting segments	Professional databases, enterprise RAG, UGC API	Top news licensing, professional databases, visual safe-gen, UGC, compliance infrastructure	Licensing standards, clearing, royalties, provenance, rights tech
Main beneficiary companies	TRI, RELX, WKL, Reddit, Stack Overflow	News Corp, NYT, Wiley, Informa, Getty, Cloudflare, TollBit	RSL/TollBit/ProRata/Created by Humans/Vermillio, plus large rights libraries
Main pressured companies	Free-traffic-dependent media, long-tail authors, low-differentiation image libraries	Mid-tail publishers, content sites without interface capability	Model companies unable to clear rights and gray data brokers

Across these three scenarios, the model most worth investing in over the long term is not the one-off big deal but upgrading traditional subscription/copyright revenue into AI-native workflow revenue. This is also why companies like RELX, Thomson Reuters, Wolters Kluwer, and Pearson often offer higher investment quality, even though their "AI licensing narrative" runs less hot than media's.

Track Depth and Competitive Landscape

Below, the thirty sub-tracks originally listed are compressed by investable logic into fifteen-plus "profit-pool units." Scores indicate research priority and are not buy/sell recommendations.

Track	Track logic	Current commercialization stage	Main customers	Pricing model	Margin trend	Copyright clarity	Regulatory/litigation risk	Future catalysts	Investment appeal
News content licensing	Top news brands supply authoritative content to model and search platforms	Signed, but sustainability to be proven	OpenAI, Amazon, Meta, Perplexity	Fixed fee + summary display	High incremental margin	Medium-high	High	AI-search revenue-share mechanisms, more LLM deals	7/10
AI-search citation licensing	Citations, traffic, and revenue share become core	Early trials	AI search/answer engines	rev-share / citation fee	Undetermined	Medium	High	RSL, Pay-per-crawl, platform disclosure	6/10
Academic TDM licensing	Institutional corpora and metadata are scarce	Revenue already generated	Microsoft, research-tool vendors, institutions	One-off + deferred	High	High	Medium-high	Author/publisher contract standardization	8/10
Professional-database licensing	Use AI to enhance existing subscription workflows	At scale	Law firms, investment banks, tax, enterprises	High-price subscription + module fee	Best	Very high	Medium	Enterprise-agent landing	10/10
Legal databases	Legal search, citator, drafting and review	At scale	Law firms / legal departments	seat + usage	Very high	Very high	Medium	Professional-agent adoption rate	10/10
Medical databases	Clinical decision support and medical RAG	Early-mid	Hospitals, pharma, healthcare SaaS	Subscription/API	High	High	High	Medical regulation and liability frameworks	8/10
Financial data licensing	Data + research + factors + workflow	At scale	Buy-side, sell-side, corporate finance	Terminal/license/API	Very high	Very high	Medium	Buy-side copilot penetration	9/10
Music AI licensing	Catalogs, voice, style, royalties	From litigation to signing	AI music platforms, streaming, brands	License + revenue share	High if it works	Medium-high but complex	Very high	More deals from majors/independent catalogs	8/10
Voice and likeness rights	likeness/voice become standalone assets	Early	Film/TV, advertising, AI audio	Licensing + monitoring + revenue share	High	Medium	Very high	NO FAKES / consent standards	7/10
Image libraries and video footage	Rights-cleared, indemnified generation	Already landed	Brands, advertisers, creative tools	Subscription/usage/training	High	High	High	Enterprise safe-generation penetration	9/10
Film/TV IP/game assets	Characters, scenes, performance rights	Mostly still early	Video models, game platforms, studios	franchise license	Potentially high	Medium	Very high	Hollywood/major-game-studio licensing templates	6/10
Books and author copyright	Long-tail rights clearance sets the scale ceiling	Litigation + platformization coexist	Writing AI, model companies, publishers	Single-book/bulk licensing	High	Low to medium	Very high	Collective licensing or platformization	6/10
UGC community data	High freshness, authentic expression, discussion chains	Already landed	Models, search, agents	API/annual fee	Very high	Medium	Medium-high	More communities adopting paid APIs	9/10
Code data	High training value, but complex litigation and open-source licensing	Contested period	Copilot/coding-agent vendors	API/data license/enterprise knowledge base	High	Medium-low	High	Code-licensing case law	6/10
Enterprise RAG data	Permissions, lineage, audit are core	At scale	Mid-to-large enterprises	SaaS/usage	High	Very high	Medium	Agent productionization	10/10
Data exchanges	Standardized supply-demand matching	Early	Model vendors, publishers, enterprises	Platform take	Potentially high	Depends on metadata	High	Enforcement standards and network effects	7/10
Annotation and RLHF	Still a training necessity, but more labor-like	Mature	Foundation and enterprise models	Project-based/long-term contracts	Medium	High	Medium	Higher-end evaluation/red-teaming	6/10
Synthetic data	Reduce reliance on real copyrighted data	Maturity rising	Autonomous driving, industrial, AI training	Software/data packs	High	High	Low to medium	Expanding regulatory-permitted scope	7/10
Content provenance/watermarking/detection	Not selling content directly, but selling trust	Early	Platforms, media, brands, governments	SaaS/API	High	N/A	Medium	C2PA adoption	8/10
Copyright fingerprinting/clearing/royalty allocation	Rights clearing after music/visual/character output	Early-mid	Platforms, labels, collective organizations	SaaS + revenue share	High	Depends on rights database	Medium-high	AI output monetization	8/10
Model compliance audit/training transparency	New infrastructure driven by regulation	Early	Model companies, enterprises, regulation-bound industries	Audit fee/subscription	High	N/A	Low to medium	EU template enforcement, enterprise procurement rules	8/10
AI copyright legal tech	Rights clearance, contract automation, discovery	Early	Publishers, entertainment, law firms, platforms	SaaS/case services	High	N/A	Medium	Continued large volume of AI copyright disputes	7/10

The competitive landscape can be summarized in four main threads. First, media groups' strategies differ: News Corp takes a hybrid strategy of "sign + keep negotiating + sue when necessary"; NYT sued first and then signed, choosing Amazon rather than OpenAI for its first deal; AP entered licensing cooperation earlier; Reuters chose to license trusted news content to tech platforms such as Meta; Perplexity tries to win over publishers through rev-share. Top news brands have bargaining power, but that power is highly concentrated.

Second, professional-information companies prefer to turn content assets into AI workflow products rather than sell raw corpora to general models. Thomson Reuters has explicitly stated that third-party model partners may not use customer data to train models; its news business once recorded "generative AI related content licensing revenue," but its overall strategic focus is on professional products such as CoCounsel. RELX, Wolters Kluwer, and Pearson likewise embed AI into existing workflows and emphasize in their filings the value of trust, verification, evaluation, and embedded data.

Third, music and visual content follow two different copyright economics. The music rights chain is more complex but more concentrated, making it easy to form a "license–filter–revenue split" loop; the visual content rights chain is relatively clear, and as long as there are releases, metadata, and indemnification capability, it is easier to form enterprise safe-generation products. The former's core is catalog control and royalty systems; the latter's core is commercial safety and metadata.

Fourth, AI companies' content strategies also clearly diverge. OpenAI is more active in signing top licensing deals and announcing them loudly; Google buys both content and community data; Meta is later and more selective on news and social content; Anthropic faces relatively heavy pressure in public copyright litigation; Perplexity, ProRata, and TollBit represent the new path that "AI search/AI agents must pay content owners directly."
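
The "AI search must pay content owners directly" path can be illustrated with a simple pro-rata split. The sketch below assumes a 50% publisher share, as cited for ProRata's framework, and divides that pool by per-publisher attribution weights. The weights, publisher names, and the function `split_revenue` are hypothetical; how any real platform computes attribution is not described here.

```python
def split_revenue(
    revenue: float,
    citation_weights: dict[str, float],
    publisher_share: float = 0.5,
) -> dict[str, float]:
    """Split the publisher pool in proportion to attribution weights.

    revenue: total revenue attributed to AI answers in the period.
    citation_weights: hypothetical per-publisher attribution weights (>= 0).
    publisher_share: fraction of revenue paid out to publishers.
    """
    if not 0.0 <= publisher_share <= 1.0:
        raise ValueError("publisher_share must be between 0 and 1")
    if any(weight < 0 for weight in citation_weights.values()):
        raise ValueError("citation weights must be non-negative")

    pool = revenue * publisher_share
    total_weight = sum(citation_weights.values())
    if total_weight == 0:
        # Nothing was attributed, so nothing is paid out.
        return {publisher: 0.0 for publisher in citation_weights}
    return {
        publisher: pool * weight / total_weight
        for publisher, weight in citation_weights.items()
    }


if __name__ == "__main__":
    # Hypothetical period: 100 units of revenue, two cited publishers.
    print(split_revenue(100.0, {"publisher_a": 3.0, "publisher_b": 1.0}))
```

The arithmetic is trivial; the hard part is producing the weights, which is the "hard attribution, data black box" problem noted for citation and search revenue sharing.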

Investment Targets and Company Tiers

The table below prioritizes coverage of high-credibility companies with existing public evidence; for companies that do not separately disclose AI licensing revenue, it clearly notes "not separately disclosed" or "mainly defensive."

Company	Code/status	Sub-segment	AI copyright/data benefit path	Public evidence	Current view
Thomson Reuters	TRI / US stock	Legal/tax/professional information	Uses high-value databases to build AI workflow subscriptions; occasional news-content AI licensing is only a side line	Q1 2025 Reuters News revenue declined partly due to a high prior-year AI content licensing base; CoCounsel/multi-model strategy keeps advancing	Tier A: platform-type winner
RELX	RELX / UK listed	Legal/science/risk databases	Converts proprietary content into AI-enhanced subscriptions and agent tools	2025 annual revenue £9.59bn, up 7%; management says GenAI tools keep driving growth	Tier A: high moat
Wolters Kluwer	WKL / Netherlands listed	Legal/medical/tax databases	AI-enhanced professional software and content suites	2025 annual report and full-year results emphasize AI innovation and continued margin improvement	Tier A: defensive + growth
News Corp	NWSA / US stock	News/professional news/books	Licenses top news and Factiva/WSJ/Dow Jones content to LLMs and platforms at multiple points	Signed a global multi-year agreement with OpenAI; annual report explicitly notes generative-AI platform content licensing; filings repeatedly show higher content licensing revenues	Tier A: direct beneficiary
Reddit	RDDT / US stock	UGC data	Charges Google and OpenAI for the Data API, turning community content into AI raw material	Google expanded cooperation and obtained the Data API; OpenAI integrated the Data API; the 10-K lists content licensing under other revenue	Tier A: high elasticity but high valuation
Wiley	WLY / US stock	Academic publishing	Research-content licensing + efficiency improvement	FY2025 AI licensing revenue about US$11 million, repeatedly cited by management as a growth driver	Tier A: small but real
Informa	INF / UK listed	Academic publishing/exhibition data	T&F content and data licensing, plus internal AI applications	Microsoft agreement 2024–2027, first year US$10 million, with subsequent deferred payments; the company says the agreement highlights the value of its IP	Tier A: undervalued academic licensing
Getty Images	GETY / US stock	Image library/visual data	Rights-cleared generation, training-set partnerships, AI-platform access	Contributors compensated as content is included in AI training sets; 2025 "Other revenue" up 35.2%, with mention of two important AI-platform partnerships	Tier B: high elasticity, high risk
Shutterstock	SSTK / US stock	Image library/training data	Multi-year OpenAI training-data partnership, contributor fund, generation tools	Signed a six-year agreement with OpenAI; the Contributor Fund exists publicly
Pearson	PSO / UK and US	Education/assessment/corporate learning	Turns content, assessment, and corporate learning into AI-enhanced subscriptions and solutions	2025 sales £3.577bn, adjusted operating profit £614m; deep cooperation with Microsoft/AWS/Google Cloud; expansion of enterprise customers and AI products	Tier B: more workflow than raw licensing
New York Times	NYT / US stock	Premium news/sports/cooking	Sued first then signed; monetizes through Amazon's first GenAI license	Signed a multi-year agreement with Amazon; continues to sue OpenAI/Microsoft and bears litigation costs	Tier B: strong brand, but heavier defensive attribute
Adobe	ADBE / US stock	Creative software/Stock ecosystem	Enhances Creative Cloud with safe training sets and content credentials, rather than selling corpora directly	Firefly is commercially usable for enterprises; Adobe Stock contributors receive a Firefly bonus	Tier B: shovel seller
Warner Music Group	WMG / US stock	Music copyright	Shifts from litigation toward licensing, revenue sharing, and artist likeness	Participated in suing Suno/Udio; later showed licensing and platform cooperation with Suno/Udio/Klay and others	Tier B: high mid-to-long-term elasticity
Universal Music Group	UMG / Europe listed	Music copyright	Same as above, with stronger catalog and publishing rights	Participated in the Suno/Udio litigation and entered licensing arrangements such as Klay	Tier B: strong rights holder
Visual China Group	Visual China / A-share	Image library/copyright trading	"AI intelligence + content data + application scenarios," emphasizing commercial use, traceability, and platform service fees	Investor relations and annual-report summaries both emphasize AI-empowered copyright trading, creative customization, platform revenue sharing, and long-term agreements	Tier B: scarce China sample
Stack Overflow	Private	Developer UGC/enterprise knowledge	Data licensing + enterprise knowledge products + public-corpus API	Official data-licensing page, OpenAI partnership, Knowledge Solutions transformation	Tier A: high-quality private target
ProRata	Private	News citation/revenue-share platform	Attributes AI answers and shares revenue with publishers	News/Media Alliance framework agreement, 50% of revenue shared with publishers	Tier B: new model, scale to be proven
TollBit	Private	Content-payment gateway/exchange	AI agents pay websites directly	At Series A claimed transactions live on product, with multiple publishers and AI companies integrated	Tier B: infrastructure option
Cloudflare	NET / US stock	Crawler control/licensing infrastructure	Default blocking of AI crawlers, Pay Per Crawl	Already blocks AI crawlers by default and launched a paid-crawl pilot; supports RSL/content signals	Tier B
Created by Humans	Private	Book rights platform	Authors select training/RAG licensing by use	The platform supports ISBN/upload claiming and AI-rights settings; partners with the Authors Guild	Tier C: right direction, early validation
Vermillio	Private	Likeness/voice protection and licensing	Provides monitoring, licensing, and protection for celebrities/IP	Sony Music participated in the investment; TraceID is used for licensing and infringement identification	Tier C: high potential, high uncertainty
Loti	Private	Likeness protection	Face/voice monitoring and takedown	Officially positioned as likeness protection for everyone	Tier C: event-driven

Based on public evidence, companies can be split into five tiers:

Tier A: core direct beneficiaries of AI copyright/data licensing. Thomson Reuters, RELX, Wolters Kluwer, News Corp, Reddit, Wiley, Stack Overflow. The common thread: either they already have clear licensing revenue, or they upgrade high-barrier content directly into AI workflow subscriptions.
Tier B: clear beneficiaries, but with valuation, litigation, regulatory, or sustainability risk. NYT, Getty, Shutterstock, Pearson, Adobe, WMG, UMG, Visual China, Cloudflare, ProRata, TollBit. The common thread: the commercialization direction is clear, but either revenue share is still small, or the market has partly reflected it, or it depends on new-standard adoption.
Tier C: AI licensing is mainly a defensive tool, with weak near-term profit elasticity. Integrated education platforms and some large content platforms, along with Created by Humans, Vermillio, and Loti, which lean more toward infrastructure and rights management. Their direction is right, but scale validation is early.
Tier D: strong narrative, but lacking verifiable licensing revenue in this public dataset. Most AI music/video generation startups, some "AI copyright concept" A-share software stocks, and companies that emphasize "AI content cooperation" but do not separately disclose contract amounts, revenue contribution, or customer expansion should stay on a watch list rather than be treated directly as beneficiaries. The most common problem in public materials is disclosing only "cooperation," "exploration," or "integration" without disclosing financial contribution.
Tier E: may be hit by AI-generated content, AI search, or unlicensed substitution. Free-traffic-dependent mid-to-long-tail media, low-differentiation footage libraries, visual-content platforms lacking releases and metadata, and general data brokers without clear proof of provenance will face stronger price pressure and a compliance discount in the AI era. This risk is directly flagged in the risk disclosures of companies such as News Corp, Getty, and Stack Overflow.

Risk, Valuation, and Final Conclusions

First, valuation and market expectations. As of May 19, 2026, NYT's P/E was about 32.4x, Reddit's about 44.7x, Wiley's about 14.6x, Warner Music's about 20.5x, and S&P Global's about 26.5x. Shutterstock's market cap was only about US$584 million and Getty's about US$404 million, against about US$15.1 billion for News Corp and about US$31.7 billion for Reddit. Purely from market pricing, the "AI content option" on Reddit and NYT is already not cheap; the revaluation room for Wiley, News Corp, and Getty/Shutterstock depends more on subsequent contract renewals and a rising revenue share; and the professional-database leaders look more like high-quality compounders: their valuations are not cheap, but their logic is the most stable.

Using the user's weighting framework, my suggested positive scoring model is as follows: direct exposure to AI licensing revenue 20%; content-asset and copyright clarity 25%; customer quality and bargaining power 15%; data governance/metadata/API capability 10%; litigation and regulatory risk management 10%; financial quality and margins 10%; valuation reasonableness 10%. Under this model, the group with the highest current priority is usually not the "hottest" media stocks but companies that combine all three elements: content barriers, workflow, and compliance.

Rank	Company	Directional total	Rationale
1	RELX	84	Legal/science content library + AI products + high margin + high update frequency
2	Thomson Reuters	83	Professional database + CoCounsel + legal case-law environment friendlier to professional content
3	Wolters Kluwer	82	Regulatory/medical/tax workflow content + AI embedding
4	News Corp	80	Top news brands + multiple AI licensing deals signed + Dow Jones/Factiva assets
5	Reddit	78	Real API licensing revenue + extremely strong content freshness, but high valuation
6	Wiley	77	Disclosed AI licensing revenue; small in scale but the most solid evidence
7	Pearson	76	AI + assessment + enterprise customers, leaning toward productized subscription rather than one-off licensing
8	Getty Images	75	Rights-cleared visual library + AI cooperation, but high litigation/financial risk
9	Adobe	74	Sells shovels through safe generation and content credentials, not a pure-licensing beneficiary
10	Informa	73	Academic/exhibition data licensing is real, but disclosure remains insufficient
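
To make the arithmetic behind a "directional total" concrete, the following is a minimal sketch of how the positive weighting could be applied. The weights are the ones stated above; the factor key names, the 0-100 sub-score scale, and the example sub-scores are assumptions introduced purely for illustration. They are not the inputs behind the table, which the report does not disclose.

```python
# Weights (in percent) taken from the positive scoring model described above.
# The dictionary key names are hypothetical labels chosen for this sketch.
POSITIVE_WEIGHTS = {
    "ai_licensing_revenue_exposure": 20,
    "content_asset_and_copyright_clarity": 25,
    "customer_quality_and_bargaining_power": 15,
    "data_governance_metadata_api": 10,
    "litigation_and_regulatory_risk_management": 10,
    "financial_quality_and_margins": 10,
    "valuation_reasonableness": 10,
}


def weighted_score(sub_scores, weights):
    """Return the weighted total of 0-100 sub-scores; weights are percentages."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100")
    missing = set(weights) - set(sub_scores)
    if missing:
        raise KeyError(f"missing sub-scores: {sorted(missing)}")
    return sum(weights[k] * sub_scores[k] for k in weights) / 100


# Hypothetical sub-scores for an illustrative company (NOT from the report).
example_sub_scores = {
    "ai_licensing_revenue_exposure": 70,
    "content_asset_and_copyright_clarity": 90,
    "customer_quality_and_bargaining_power": 80,
    "data_governance_metadata_api": 85,
    "litigation_and_regulatory_risk_management": 80,
    "financial_quality_and_margins": 90,
    "valuation_reasonableness": 60,
}

total = weighted_score(example_sub_scores, POSITIVE_WEIGHTS)
print(total)  # 80.0
```

Because content-asset and copyright clarity carries the largest weight (25%), a company with a strong rights position can outscore one with larger but less defensible licensing revenue, which is consistent with professional databases ranking above media names.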

The corresponding reverse risk-scoring model can be set as: insufficient licensing-revenue sustainability 20%; copyright-litigation and regulatory uncertainty 20%; high content substitutability 20%; AI companies' bargaining power too strong 15%; generated content depressing original-content value 15%; valuation too high 10%. Under this model, the highest risk usually lies not with professional databases but with narrative-heavy news, long-tail books, and the music/film/likeness tracks: areas that are not yet covered by standardized licensing and have not formed stable revenue-share mechanisms.
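
As a rough illustration of how the reverse model could be read, the sketch below breaks a risk total into per-factor contributions so that the dominant driver is visible. The weights come from the paragraph above; the key names, the 0-100 scale (higher means riskier), and the sample values are hypothetical and do not describe any specific company.

```python
# Weights (in percent) from the reverse risk-scoring model described above.
# Key names are hypothetical labels chosen for this sketch.
RISK_WEIGHTS = {
    "licensing_revenue_sustainability": 20,
    "litigation_and_regulatory_uncertainty": 20,
    "content_substitutability": 20,
    "ai_company_bargaining_power": 15,
    "generated_content_pressure": 15,
    "valuation_too_high": 10,
}


def risk_breakdown(sub_risks, weights=RISK_WEIGHTS):
    """Return (total, contributions); each contribution is weight% x sub-risk."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100")
    contributions = {k: weights[k] * sub_risks[k] / 100 for k in weights}
    return sum(contributions.values()), contributions


# Hypothetical sub-risk levels for an illustrative rights holder (NOT from the report).
hypothetical_sub_risks = {
    "licensing_revenue_sustainability": 80,
    "litigation_and_regulatory_uncertainty": 90,
    "content_substitutability": 50,
    "ai_company_bargaining_power": 60,
    "generated_content_pressure": 40,
    "valuation_too_high": 30,
}

risk_total, contributions = risk_breakdown(hypothetical_sub_risks)
top_driver = max(contributions, key=contributions.get)
print(risk_total)   # 62.0
print(top_driver)   # litigation_and_regulatory_uncertainty
```

Looking at contributions rather than only the total shows whether a high score comes from litigation exposure, weak renewal prospects, or valuation, which call for different follow-up research.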

There are mainly six categories of systemic risk. First, U.S. courts continuing to expand fair-use space would suppress the long-term pricing of general-training licensing; second, AI companies cutting procurement and shifting to user data, internal data, and synthetic data would make one-off corpus contracts hard to renew; third, AI search diverting traffic faster than licensing compensation would squeeze small and mid-size publishers on both sides; fourth, the complex rights chain for music/likeness/voice means that even willingness to pay may not enable efficient rights clearance; fifth, provenance, watermarking, and detection technologies are not perfect, so infrastructure companies also carry technology-delivery risk; sixth, at high valuations the market will focus more on "revenue share" than on the "story."

The final conclusions can be condensed into the following ten points:

First, AI content copyright and data licensing is the "high-quality data supply layer" of the AI value chain, but not all content can be repriced; what can truly be priced is content that is rights-clear, rich in metadata, timely, industry-scarce, and verifiable in provenance.

Second, the five sub-tracks most worth watching are: professional-database AI workflows, UGC/API data licensing, rights-cleared visual data, academic/educational TDM licensing, and enterprise RAG data governance. If "selling shovels" is included, then Cloudflare/RSL/TollBit/compliance audit also merit high-priority tracking.

Third, the ten listed companies most worth deep research are: RELX, Thomson Reuters, Wolters Kluwer, News Corp, Reddit, Wiley, Informa, Pearson, Getty Images, and Adobe. For those who prefer music and the Chinese market, add Warner Music, UMG, and Visual China as elasticity supplements.

Fourth, among unlisted companies, the ten most worth tracking are: Stack Overflow, TollBit, ProRata, Created by Humans, Vermillio, Loti, Scale AI, Databricks, Rightsify, and RSL Collective. The first five bet more directly on the "copyright and licensing layer," while the latter five lean toward the "data and infrastructure layer."

Fifth, the five points the market most easily misreads are: (1) assuming all copyright will be charged for universally; in reality it is more likely that "high-value data gets charged first." (2) Assuming all AI licensing revenue will become large incremental gains; in reality much of it is merely defensive compensation. (3) Assuming media is the highest-quality beneficiary; in reality professional databases are often stronger. (4) Assuming music and likeness will immediately form a large market; in reality rights clearance and revenue splitting are the most complex. (5) Assuming copyright-tech platforms have already scaled; in reality most are still in early validation.

Sixth, the metrics most worth tracking over the next six to twelve months are: the share of AI licensing revenue, the licensing/other revenue split, the number of licensing contracts and renewal rate, the number of enterprise RAG customers, API call volume, AI-search citation traffic and revenue share, AI platforms' rev-share disclosure to publishers/creators, EU training-data summary enforcement, and the next round of rulings in key U.S. AI copyright cases.

Seventh, for so-called "AI copyright platform-type companies," I lean toward placing RELX, Thomson Reuters, Wolters Kluwer, Reddit, Stack Overflow, Getty, and Cloudflare at the core; "AI-native data-licensing challengers" are TollBit, ProRata, Created by Humans, Vermillio, Loti, and RSL Collective; "AI content-licensing shovel sellers" include the layer of Adobe, Cloudflare, Snowflake, Databricks, Elastic, and MongoDB.

Eighth, companies at higher risk of being hit by AI-generated content, AI search, or unlicensed training are not all content companies but those that lack originality, lack real-time data, have unclear rights chains, have no data-interface capability, and depend heavily on search-traffic redistribution. This risk is most evident in news, general internet content, and low-differentiation footage libraries.

Ninth, for narrower follow-up research directions worth digging into further, my suggested priority order is: professional-database licensing and AI workflows, UGC data licensing, news-content licensing and AI-search revenue sharing, image-library training data and safe generation, music AI licensing and likeness rights, enterprise RAG data governance, and content provenance and model compliance audit. These directions are closer to real profit pools than a vague "AI copyright."

Tenth, open questions and limitations also need to be made clear: a large number of licensing contracts still do not disclose amounts; many companies blend AI licensing revenue into "other revenue/licensing/subscription" without breaking it out; public financial evidence for the music, film/TV, and personality-rights tracks is clearly less than for news and professional databases; and the rules in jurisdictions such as China, South Korea, Japan, and Australia are still evolving. So at this stage the most important thing is not "who told an AI copyright story" but who has already proven: contracts, customers, revenue, renewals, and a workflow position.

This report is based on public information and does not constitute investment advice. Markets carry risk; invest with caution.

Mentioned Tickers

TRI (US) · NWSA (US) · RDDT (US) · WLY (US) · GETY (US) · SSTK (US) · PSO (US) · NYT (US) · ADBE (US) · WMG (US) · NET (US) · 000681 (Shenzhen)