AI Content Copyright and Data Licensing: Investment Research

Lead

AI content copyright and data licensing has become an upstream supply constraint on foundation models, AI search, and enterprise RAG, yet the real revenue concentrates in high-value data that is rights-clear, structured, and traceable rather than the entire open internet. Revenue certainty is highest among professional-database companies that upgrade their content libraries into AI workflow subscriptions; UGC/API deals, top news-archive licensing, and rights-cleared image licensing have already landed, while open web pages, the long tail of books, and music/personality-rights training remain stuck in litigation and gray zones. Rating Watch: prioritize RELX, TRI, WKL, News Corp, Reddit, and Wiley.

Core Conclusions

  • AI content copyright and data licensing has shifted from an "abstract compliance question" into one of the upstream supply constraints on foundation models, AI search, enterprise RAG, and multimodal generation. But the revenue that has actually become public and verifiable concentrates mainly in data that is high-value, structured, traceable, and rights-clear, not the entire open internet. What landed first was not "universal copyright payment" but top news-archive licensing, UGC/API licensing, professional-database/educational-content licensing, image-library safe generation, and enterprise private-data governance.

  • Scenarios that have already produced real revenue, with the strongest public evidence, include: AP–OpenAI news-archive licensing, FT–OpenAI, Axel Springer–OpenAI, News Corp–OpenAI, NYT–Amazon, Reuters–Meta, Reddit–Google/OpenAI, Stack Overflow–OpenAI/Google/Moveworks, Informa/Taylor & Francis–Microsoft, Wiley's AI licensing of research content, Shutterstock–OpenAI, and Getty's "commercially safe" generation and AI-platform partnerships. Most deal values are undisclosed, but at minimum they have crossed the stages of "contract signed" and "revenue landed."

  • The highest revenue certainty today does not lie with media broadly but with professional, workflow-type database companies: Thomson Reuters, RELX/LexisNexis, Wolters Kluwer, Pearson, S&P Global, Moody's, FactSet, and Bloomberg. These firms own content and data that is high-quality, continuously updated, rich in metadata, and embedded in critical decision flows. Their AI commercialization more often takes the form of AI-enhanced subscriptions / workflow products rather than "selling raw training corpora," which carries higher gross margin and retention and stronger defensibility.

  • Still clearly stuck in litigation, policy contention, or gray private deals are mainly: large-scale public web pretraining, general book corpora, unlicensed music training, film/anime/character training, general code scraping, low-transparency data brokerage, and most personality-rights/voiceprint/likeness training. The core problem in these areas is not "whether there is value" but that rights boundaries, proof of provenance, market-substitution harm, and cross-jurisdiction compliance remain unsettled.

  • Legal signals have moved from "is AI training inherently fair use" toward "is the data source lawful, was it acquired for payment, is it high-value structured content, and does it cause provable substitution harm." Several key 2025 rulings diverged sharply: Anthropic won a favorable ruling on "lawfully acquired books used for training" but still faces high risk over pirated book libraries; Meta prevailed in the authors' case; and Thomson Reuters v. Ross Intelligence rejected the fair-use defense for using Westlaw headnotes to train/compare. Taken together, these rulings suggest that general internet text will not necessarily command payment across the board, while use of professional databases and pirated sources carries markedly higher legal risk.

  • News publishers' AI licensing currently looks more like "defensive monetization" than a mature new core business. Top brands can sign large deals, but the vast majority of deal values are confidential, revenue is often blended into "licensing/other revenue," and AI search continues to erode traffic and summary attribution. So top news groups can win compensation while mid-tier publishers may not.

  • The music industry is moving from "suing generative platforms" toward "selective licensing + revenue sharing + artist-consent mechanisms." In 2024 the three major labels sued Suno/Udio; by the second half of 2025, licensing deals and partnerships began to appear, including between WMG and Suno/Udio and between the three majors and Klay. This indicates that music copyright will not simply be consumed by models for free but will more likely evolve into a combined model of licensed catalogs, controllable style/voice, subscription revenue sharing, copyright filtering, and royalty accounting. Even so, the publicly visible financial contribution remains clearly weaker than in news and professional databases.

  • The image-library and visual track already shows a clearer productization path than news: Getty explicitly turns "commercially safe," "indemnification," and "contributor compensation" into enterprise selling points; Shutterstock both supplies training data and extends the chain through a Contributor Fund and its OpenAI partnership; Visual China Group, in the Chinese market, emphasizes copyright transactions and AI creative customization built on "commercial use + traceability + platform service fees." The long-term profit pool in the visual track will more likely sit with content libraries and trading platforms that have releases, metadata, and commercial-safety guarantees, rather than one-off training licenses.

  • UGC and community data is among the earliest AI raw materials to be repriced. Reddit licensed its Data API to Google and OpenAI, with Google explicitly using the API to display, train on, and understand Reddit content; Stack Overflow packages its public Q&A corpus, API, and enterprise knowledge products together as "Knowledge Solutions / Data Licensing." The core value of this data lies beyond the text itself, in its freshness, structure, community validation, and question–answer graph.

  • AI-native challengers are moving in on traditional copyright-management firms' positions, but most are still at the strong-narrative, weak-proof-of-scale stage. Cloudflare has pushed default AI-crawler blocking, Pay Per Crawl, and content-signal tools into the mainstream; RSL introduced a machine-readable licensing standard; TollBit already has "transactions live"; ProRata offers a 50% revenue-share framework; Created by Humans modularizes book training/RAG rights; Vermillio and Loti focus on likeness/voice protection and licensing. The issue is that standardization capability has emerged, but sustainable large-scale revenue has not been fully disclosed.

  • The long-term profit pool will more likely settle with three kinds of companies: first, professional databases and workflow platforms; second, content platforms with clear rights and commercially safe output; third, data-governance/compliance/provenance infrastructure. By contrast, pure model companies may prefer to compress licensing costs onto a few key sources of content rather than pay broadly for the general internet.

  • From a valuation standpoint, the market already assigns a rich price to the "data-licensing option" on Reddit and some top AI-narrative platforms. Pricing remains divergent for News Corp's multi-LLM licensing capability, Wiley/Informa's AI content monetization, Getty/Shutterstock's compliant visual-asset revaluation, and the professional-information giants upgrading content libraries into AI workflow products. In relative terms, visual-asset platforms such as Getty/Shutterstock are priced markedly below professional-database companies, while Reddit, NYT, and similar names have already priced in a meaningful portion of AI expectations.

  • The biggest catalysts over the next twelve to twenty-four months are not the win/loss of any single lawsuit but three things: whether the EU's GPAI training-data summary and copyright-enforcement rules genuinely land, whether U.S. copyright and fair-use case law continues to diverge, and whether AI search forms a quantifiable publisher revenue-share / citation-traffic system. These will determine whether AI copyright licensing stays a handful of big deals or evolves into a long-term cost and infrastructure market.

Value-Chain Landscape and Commercialization Stages

The most important segmentation in this track is not "content industry vs. AI industry" but five stages: litigation claim, licensing negotiation, contract signed, revenue landed, and sustainable at-scale licensing. So far, very few segments have truly reached the fifth stage: the most mature is the professional-database subscription-type AI product, the second most mature is UGC/API data licensing, and only third comes the AI licensing of top news and image libraries. Music, film/TV IP, the long tail of book copyright, personality rights, and general web scraping remain broadly stuck in the earlier stages, between litigation claim and selective contract signing.
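The five-stage ladder above can be treated as an ordered scale. The following is a minimal Python sketch, not a scoring model: the `Stage` names follow the text, while the `SEGMENT_STAGE` placements are illustrative assumptions drawn from this report's qualitative labels (for example, music is placed at "licensing negotiation" as a rough midpoint of "litigation claim → selective signing").

```python
from enum import IntEnum


class Stage(IntEnum):
    """The five commercialization stages named in the text, in order."""
    LITIGATION_CLAIM = 1
    LICENSING_NEGOTIATION = 2
    CONTRACT_SIGNED = 3
    REVENUE_LANDED = 4
    SUSTAINABLE_AT_SCALE = 5


# Assumed placements for illustration only; they paraphrase the report's
# qualitative labels and are not measured data.
SEGMENT_STAGE = {
    "professional-database AI subscriptions": Stage.SUSTAINABLE_AT_SCALE,
    "UGC/API data licensing": Stage.REVENUE_LANDED,
    "top news licensing": Stage.REVENUE_LANDED,
    "music licensing": Stage.LICENSING_NEGOTIATION,
    "long-tail book licensing": Stage.LITIGATION_CLAIM,
}


def segments_at_or_beyond(stage):
    """Return the segments that have reached at least the given stage."""
    return sorted(name for name, s in SEGMENT_STAGE.items() if s >= stage)


if __name__ == "__main__":
    for stage in Stage:
        print(stage.name, segments_at_or_beyond(stage))
```

Using an ordered enum keeps comparisons such as "has this segment reached revenue landed?" explicit, and makes it easy to revise a placement as new deals or rulings arrive.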

Value-chain position | Sub-segment | Core products/services | AI demand driver | Main revenue model | Content/copyright/governance barrier | Regulatory/litigation risk | Commercialization stage | Margin profile | Representative companies | Benefit intensity | Investment elasticity
News publishing | News-archive and real-time news licensing | Archives, real-time feeds, summary/display licensing | LLM training, AI search, real-time Q&A | Multi-year fixed license fees, API fees, summary-display fees, partial revenue share | Brand trust, original reporting, attribution need, paywall | High: NYT/OpenAI, publisher–AI-search relations unsettled | Contract signed → revenue landed | High incremental margin, but uneven sustainability | AP, News Corp, NYT, FT, Reuters, Axel Springer | High | High
Academic publishing | TDM/corpus licensing | Journal full text, metadata, citation networks | Training, professional search, RAG | Data-access fees, enterprise licensing, one-off + deferred payment | Peer review, citation metadata, institutional relationships | Medium-high: author consent and contract boundaries | Contract signed → revenue landed | High margin, but large political/public-opinion friction | Informa/T&F, Wiley, Springer Nature | High | Medium-high
Professional databases | Legal/tax/risk/science/finance databases | Search libraries, citators, knowledge graphs, AI copilots | Enterprise agents, professional RAG, workflow automation | Subscription, seat, usage, workflow software | High update frequency, structured, embedded in processes, compliant | Medium: but professional-content rights are the strongest | Sustainable at-scale licensing | Best margin and retention | Thomson Reuters, RELX, Wolters Kluwer, S&P Global, Moody's, FactSet, Bloomberg | Very high | Medium-high
Music copyright | Recording/composition/voice/likeness licensing | Catalogs, styles, voice rights, filtering and revenue splits | AI music generation, voice cloning, remix | License fees, subscription revenue share, royalty allocation, style/likeness licensing | Complex rights chain but high concentration | Very high: litigation and personality rights run in parallel | Litigation claim → selective signing | High margin if standardization succeeds | UMG, WMG, Sony, Merlin, Klay, Suno, Udio | Medium-high | Very high
Image libraries and video footage | Rights-cleared visual data | Images/video/3D, releases, metadata | Image/video training, enterprise generation, brand-safe creation | Subscription, usage, training licensing, image-generation fees | Model/property release, metadata, copyright indemnification | High: Getty v. Stability and others | Revenue landed → scale exploration | Potential high-margin "safe generation" products | Getty, Shutterstock, Adobe Stock, Visual China Group | High | High
Books and authors | Book training/RAG | Full-text books, summaries, translation/audiobook rights | LLM training, writing assistants, knowledge Q&A | Single-book/bulk licensing, platform opt-in | Long-tail rights, fragmented contracts | Very high: active author rights enforcement | Litigation claim → early platformization | High license margin, but high rights-clearance cost | Authors Guild, Created by Humans, publisher alliances | Medium | High
UGC platforms | Forum/community/comment data | Data API, structured dialogue, real-time discussion | Training, search augmentation, RAG | API fees, annual fees, data licensing | Freshness, discussion context, user signals | Medium-high: user consent / platform terms | Revenue landed | High margin, low incremental cost | Reddit, Stack Overflow, Quora/RSL ecosystem | Very high | Very high
Enterprise private data | Enterprise RAG and permissioned data | Documents, support logs, code repositories, CRM | Enterprise agents, internal Q&A, automation | SaaS, usage, data-governance add-on fees | Permission systems, data lineage, privacy and audit | Medium: mainly privacy/security | Sustainable at-scale licensing | Strong SaaS margin | Snowflake, Databricks, MongoDB, Elastic | Very high | Medium-high
Data exchanges | Dataset marketplace / exchange | Data distribution, rights metadata, audit | Training, vertical models, agent memory | Platform commission, subscription, transaction fees | Supply-organizing capability, rights metadata | Medium-high: proof of provenance is core | Early revenue validation | Potential high platform margin | TollBit, ProRata, Hugging Face, DataCite | Medium-high | Very high
Annotation/RLHF | Human feedback and evaluation | Annotation, red-teaming, preference data, evaluation | Post-training alignment, fine-tuning, model evaluation | Project fees, long-term service contracts | Human network and quality control | Medium: price competition and automation substitution | Mature but more service-oriented | Medium margin | Scale AI, Appen, TELUS Digital, Defined.ai | Medium | Medium
Content provenance/watermarking/detection | Provenance & authenticity | C2PA, metadata, detection and forensics | Compliance, brand safety, infringement management | SaaS, enterprise editions, platform integration | Network effects and standard compatibility | Medium: technical effectiveness needs validation | Early commercialization | High software margin | CAI/C2PA, Truepic, Vermillio, Loti | Medium-high | High
Crawler control/licensing standards | Access control | robots/RSL/pay-per-crawl | AI search, training, agent scraping | Platform service fees, transaction take | Infrastructure coverage | Medium: needs AI-company cooperation | Early to mid stage | High software-infrastructure margin | Cloudflare, Fastly, Akamai, RSL Collective | Medium-high | High
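
The crawler-control row above rests on the oldest access signal, robots.txt. As a minimal sketch of what that layer expresses, the Python snippet below uses the standard-library `urllib.robotparser` to check whether a named crawler may fetch a page. The robots.txt content, the `example.com` URL, and the `is_allowed` helper are hypothetical. robots.txt is advisory and depends on crawler cooperation, and the snippet does not model RSL or pay-per-crawl terms.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary publisher: block one named
# AI crawler while leaving the site open to every other user agent.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"  # placeholder URL
    for agent in ("GPTBot", "Mozilla"):
        print(agent, is_allowed(ROBOTS_TXT, agent, page))
```

A signal like this can only say "allowed" or "not allowed"; pricing and licensing terms need the additional layers named in the row.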

Viewed across two dimensions—"already real revenue" and "still contested"—the split is more direct: revenue already realized is mainly top news brands, UGC APIs, institutional licensing of academic/educational content, professional-database AI subscriptions, rights-cleared visual generation, and enterprise data governance; still contested is mainly open-web pretraining, pirated or unsourced books, unlicensed music/film training, long-tail author revenue sharing, personality-rights training, and training-data transparency across multiple jurisdictions.

Business Models and Profit Pools

The core of AI content copyright and data licensing is not "how much content is worth" but which use case is willing to pay over the long term. Model pretraining values volume, diversity, and marginal cost; AI search values timeliness, authority, and summary/citation rights; enterprise RAG values permissions, update frequency, and audit trails; music and visual generation value commercial safety, person/voice consent, and downstream royalty accounting. So even though it is all "licensing," the pricing logic differs entirely.

Business model | Typical scenario | Pricing logic | Pros | Cons | Better-suited suppliers
One-off license fee | Archive training, historical corpora, bulk content access | Corpus scale, exclusivity, litigation deterrence | Fast to land, high margin | Not sustainable, customers push back on price | Top news, academic publishers
Annual/multi-year fixed fee | News libraries, UGC API, enterprise content access | Authority, update frequency, API availability | Predictable, fits budgets | Renewal price subject to negotiation | AP, News Corp, Reddit, T&F
Per-call/per-token/per-API | Real-time news, RAG, retrieval augmentation | Query volume, latency, SLA | Scales with usage | Volatile cost | Reuters Connect, UGC API, enterprise data platforms
Charge by training use | Foundation-model/multimodal training | Training rounds, scope of use, re-licensing limits | Easy to lock in large customers | Hard to sustain renewals after training is done | Shutterstock, Getty, some academic/news archives
Citation/summary/search revenue share | AI search, answer-page citations | Display volume, clicks, ad/subscription revenue share | Close to the traffic logic | Hard attribution, data black box | Publisher alliances, ProRata, Perplexity model
Output revenue share/royalties | Music, voice, characters, AI creation | Downloads, plays, subscriptions, generation counts | Can lock in creators long term | Complex rights clearance | UMG/WMG/Sony, Vermillio, collective management organizations
AI-enhanced subscription/workflow | Legal, tax, finance, education, medicine | Customer ROI, time saved, compliance value | Highest retention, best profit | Long build cycle | TRI, RELX, WKL, Pearson
Collective licensing/standardized licensing | Web scraping, long-tail creators | Coverage scope, standardized protocols, enforcement | Solves long-tail rights clearance | Needs network effects and a neutral enforcement layer | RSL, TollBit, Created by Humans, ProRata
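
To make the contrast between pricing logics in the table above concrete, here is a minimal Python sketch comparing a fixed annual fee, per-call pricing, and a revenue share from the buyer's side. All fee levels and volumes are hypothetical placeholders and do not come from any disclosed deal; only the 50% share mirrors the ProRata framework mentioned earlier in this report.

```python
def fixed_fee_cost(annual_fee, years):
    """Total cost of an annual/multi-year fixed-fee license."""
    return annual_fee * years


def usage_cost(calls, price_per_1k_calls):
    """Total cost of a per-call license, priced per thousand calls."""
    return calls / 1000 * price_per_1k_calls


def revenue_share_cost(attributed_revenue, share):
    """Payout under a citation/output revenue-share arrangement."""
    return attributed_revenue * share


def breakeven_calls(annual_fee, price_per_1k_calls):
    """Annual call volume at which per-call pricing equals the fixed fee."""
    return annual_fee / price_per_1k_calls * 1000


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to show the mechanics.
    annual_fee = 1_000_000   # fixed license fee per year
    price_per_1k = 2         # price per thousand API calls
    print("break-even calls/year:", breakeven_calls(annual_fee, price_per_1k))
    print("usage cost at 100m calls:", usage_cost(100_000_000, price_per_1k))
    print("50% share of 400k revenue:", revenue_share_cost(400_000, 0.5))
```

Below the break-even volume a buyer would prefer usage pricing, and above it a fixed fee. This is one reason the table pairs fixed fees with predictable budgets and per-call pricing with volatile cost.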

Where the profit pool ultimately lands depends on the scenario:

First, the training-data profit pool will not be distributed evenly across all rights holders. For general pretraining, model companies will use public data, existing paid agreements, user data, and synthetic data to lower cost as much as possible; what can truly be priced separately over the long term is high-value "gap-filling" data, not the entire web corpus. The outcomes of the Anthropic and Meta authors' cases reinforce this further.

Second, the AI-search and RAG profit pool will more likely land with a few high-authority content libraries and the interface layer. The reason is that search/Q&A needs fresh, traceable, citable, and correctable content; enterprise RAG further requires permissions and audit. So productized databases such as Reuters, Factiva, LexisNexis, Westlaw, Wolters Kluwer, and Pearson are easier to charge for over the long term than ordinary news pages and general web text.

Third, the profit pool for music, voice, characters, and likeness will more likely tilt toward "rights management + filtering + revenue splits" rather than the raw model, because the commercial-use risk on the user side is higher and personality-rights/consent mechanisms cannot simply be substituted. Large-scale financial landing is still hard to see in public markets, but the industry direction has already shifted from "whether to license" to "who licenses, who filters, who clears."

Fourth, the visual-content profit pool will most likely concentrate on platforms with "clear rights + complete metadata + indemnification capability." Getty directly productizes "uncapped indemnification" and contributor compensation; Shutterstock sells training data on one hand and operates generation tools and a contributor fund on the other. This business model is closer to long-term high-margin SaaS/subscription than to one-off data sales.

On the question of whether AI copyright licensing is a long-term cost for model companies, my judgment is scenario-dependent: general-pretraining licensing looks more like a transitional, strategic cost; high-value professional data, RAG permissioned data, AI-search real-time citation, and commercial-use music/visual/likeness licensing look more like long-term structural cost. The EU AI Act's requirements on training-content summaries and copyright policy, and the rise of access-control layers such as Cloudflare/RSL/TollBit, are both pushing "transparent provenance + conditional payment" to become the norm.

Dimension | Conservative | Base | Aggressive
Key assumption | U.S. fair use stays broad; transparency requirements limited | U.S. case law keeps diverging; EU transparency lands; top deals increase | Training-data transparency requirements tighten; platforms enforce licensing standards
Copyright-litigation trajectory | General training mostly protected, pirated sources excepted | Lawfully acquired / professional content better protected | Rights holders significantly strengthened
AI-company licensing willingness | Buy only the hardest-to-substitute content | Willing to pay for real-time, authoritative, compliant data | Training, search, and generation all more broadly licensed
Content-side bargaining power | Only top content libraries have leverage | Top brands and platform intermediaries strengthen | Collective licensing / standardized markets form
AI-search traffic impact | Negative-leaning for publishers | Needs partial revenue sharing to offset | Traffic decline partly offset by licensing revenue
Benefiting segments | Professional databases, enterprise RAG, UGC API | Top news licensing, professional databases, visual safe-gen, UGC, compliance infrastructure | Licensing standards, clearing, royalties, provenance, rights tech
Main beneficiary companies | TRI, RELX, WKL, Reddit, Stack Overflow | News Corp, NYT, Wiley, Informa, Getty, Cloudflare, TollBit | RSL/TollBit/ProRata/Created by Humans/Vermillio, plus large rights libraries
Main pressured companies | Free-traffic-dependent media, long-tail authors, low-differentiation image libraries | Mid-tail publishers, content sites without interface capability | Model companies unable to clear rights and gray data brokers

Among these three scenarios, the model most worth investing in over the long term is not the one-off big deal but upgrading traditional subscription/copyright revenue into AI-native workflow revenue. This is also why companies like RELX, Thomson Reuters, Wolters Kluwer, and Pearson—though their "AI licensing narrative" runs less hot than media's—often offer higher investment quality.

Track Depth and Competitive Landscape

Below, the thirty sub-tracks in scope are compressed by investable logic into fifteen-plus "profit-pool units." Scores indicate research priority, not buy/sell recommendations.

Track | Track logic | Current commercialization stage | Main customers | Pricing model | Margin trend | Copyright clarity | Regulatory/litigation risk | Future catalysts | Investment appeal
News content licensing | Top news brands supply authoritative content to model and search platforms | Signed, but sustainability to be proven | OpenAI, Amazon, Meta, Perplexity | Fixed fee + summary display | High incremental margin | Medium-high | High | AI-search revenue-share mechanisms, more LLM deals | 7/10
AI-search citation licensing | Citations, traffic, and revenue share become core | Early trials | AI search/answer engines | rev-share / citation fee | Undetermined | Medium | High | RSL, Pay-per-crawl, platform disclosure | 6/10
Academic TDM licensing | Institutional corpora and metadata are scarce | Revenue already generated | Microsoft, research-tool vendors, institutions | One-off + deferred | High | High | Medium-high | Author/publisher contract standardization | 8/10
Professional-database licensing | Use AI to enhance existing subscription workflows | At scale | Law firms, investment banks, tax, enterprises | High-price subscription + module fee | Best | Very high | Medium | Enterprise-agent landing | 10/10
Legal databases | Legal search, citator, drafting and review | At scale | Law firms / legal departments | seat + usage | Very high | Very high | Medium | Professional-agent adoption rate | 10/10
Medical databases | Clinical decision support and medical RAG | Early-mid | Hospitals, pharma, healthcare SaaS | Subscription/API | High | High | High | Medical regulation and liability frameworks | 8/10
Financial data licensing | Data + research + factors + workflow | At scale | Buy-side, sell-side, corporate finance | Terminal/license/API | Very high | Very high | Medium | Buy-side copilot penetration | 9/10
Music AI licensing | Catalogs, voice, style, royalties | From litigation to signing | AI music platforms, streaming, brands | License + revenue share | High if it works | Medium-high but complex | Very high | More deals from majors/independent catalogs | 8/10
Voice and likeness rights | likeness/voice become standalone assets | Early | Film/TV, advertising, AI audio | Licensing + monitoring + revenue share | High | Medium | Very high | NO FAKES / consent standards | 7/10
Image libraries and video footage | Rights-cleared, indemnified generation | Already landed | Brands, advertisers, creative tools | Subscription/usage/training | High | High | High | Enterprise safe-generation penetration | 9/10
Film/TV IP/game assets | Characters, scenes, performance rights | Mostly still early | Video models, game platforms, studios | franchise license | Potentially high | Medium | Very high | Hollywood/major-game-studio licensing templates | 6/10
Books and author copyright | Long-tail rights clearance sets the scale ceiling | Litigation + platformization coexist | Writing AI, model companies, publishers | Single-book/bulk licensing | High | Low to medium | Very high | Collective licensing or platformization | 6/10
UGC community data | High freshness, authentic expression, discussion chains | Already landed | Models, search, agents | API/annual fee | Very high | Medium | Medium-high | More communities adopting paid APIs | 9/10
Code data | High training value, but complex litigation and open-source licensing | Contested period | Copilot/coding-agent vendors | API/data license/enterprise knowledge base | High | Medium-low | High | Code-licensing case law | 6/10
Enterprise RAG data | Permissions, lineage, audit are core | At scale | Mid-to-large enterprises | SaaS/usage | High | Very high | Medium | Agent productionization | 10/10
Data exchanges | Standardized supply-demand matching | Early | Model vendors, publishers, enterprises | Platform take | Potentially high | Depends on metadata | High | Enforcement standards and network effects | 7/10
Annotation and RLHF | Still a training necessity, but more labor-like | Mature | Foundation and enterprise models | Project-based/long-term contracts | Medium | High | Medium | Higher-end evaluation/red-teaming | 6/10
Synthetic data | Reduce reliance on real copyrighted data | Maturity rising | Autonomous driving, industrial, AI training | Software/data packs | High | High | Low to medium | Expanding regulatory-permitted scope | 7/10
Content provenance/watermarking/detection | Not selling content directly, but selling trust | Early | Platforms, media, brands, governments | SaaS/API | High | N/A | Medium | C2PA adoption | 8/10
Copyright fingerprinting/clearing/royalty allocation | Property clearing after music/visual/character output | Early-mid | Platforms, labels, collective organizations | SaaS + revenue share | High | Depends on rights database | Medium-high | AI output monetization | 8/10
Model compliance audit/training transparency | New infrastructure driven by regulation | Early | Model companies, enterprises, regulation-bound industries | Audit fee/subscription | High | N/A | Low to medium | EU template enforcement, enterprise procurement rules | 8/10
AI copyright legal tech | Rights clearance, contract automation, discovery | Early | Publishers, entertainment, law firms, platforms | SaaS/case services | High | N/A | Medium | Continued large volume of AI copyright disputes | 7/10

On the competitive landscape, four main threads stand out. First, media groups' negotiating paths differ: News Corp takes a hybrid strategy of "sign + keep negotiating + sue when necessary"; NYT sued first then signed, choosing Amazon rather than OpenAI for its first deal; AP entered licensing cooperation earlier; Reuters chose to license trusted news content to tech platforms such as Meta; and on the AI side, Perplexity tries to win over publishers through rev-share. Top news brands have bargaining power, but that power is highly concentrated.

Second, professional-information companies prefer to turn content assets into AI workflow products rather than sell raw corpora to general models. Thomson Reuters has explicitly stated that third-party model partners may not use customer data to train models; its news business once recorded "generative AI related content licensing revenue," but its overall strategic focus is on professional products such as CoCounsel. RELX, Wolters Kluwer, and Pearson likewise embed AI into existing workflows and emphasize in their filings the value of trust, verification, evaluation, and embedded data.

Third, music and visual content follow two different copyright economics. The music rights chain is more complex but more concentrated, making it easy to form a "license–filter–revenue split" loop; the visual-content rights chain is relatively clear, and as long as there are releases, metadata, and indemnification capability, it is easier to form enterprise safe-generation products. The former's core is catalog control and royalty systems; the latter's core is commercial safety and metadata.

Fourth, AI companies' content strategies also clearly diverge. OpenAI is more active in signing top licensing deals and announcing them loudly; Google buys both content and community data; Meta is later and more selective on news and social content; Anthropic faces relatively heavy pressure in public copyright litigation; Perplexity, ProRata, and TollBit represent the new path that "AI search/AI agents must pay content owners directly."

Investment Targets and Company Tiers

The table below prioritizes coverage of high-credibility companies with existing public evidence; for companies that do not separately disclose AI licensing revenue, it clearly notes "not separately disclosed" or "mainly defensive."

Company | Code/status | Sub-segment | AI copyright/data benefit path | Public evidence | Current view
Thomson Reuters | TRI / US stock | Legal/tax/professional information | Uses high-value databases to build AI workflow subscriptions; occasional news-content AI licensing is only a side line | Q1 2025 Reuters News revenue declined partly due to a high prior-year AI content licensing base; CoCounsel/multi-model strategy keeps advancing | Tier A: platform-type winner
RELX | RELX / UK listed | Legal/science/risk databases | Converts proprietary content into AI-enhanced subscriptions and agent tools | 2025 annual revenue £9.59bn, up 7%; management says GenAI tools keep driving growth | Tier A: high moat
Wolters Kluwer | WKL / Netherlands listed | Legal/medical/tax databases | AI-enhanced professional software and content suites | 2025 annual report and full-year results emphasize AI innovation and continued margin improvement | Tier A: defensive + growth
News Corp | NWSA / US stock | News/professional news/books | Licenses top news and Factiva/WSJ/Dow Jones content to LLMs and platforms at multiple points | Signed a global multi-year agreement with OpenAI; annual report explicitly notes generative-AI platform content licensing; filings repeatedly show higher content licensing revenues | Tier A: direct beneficiary
Reddit | RDDT / US stock | UGC data | Charges Google and OpenAI for the Data API, turning community content into AI raw material | Google expanded cooperation and obtained the Data API; OpenAI integrated the Data API; the 10-K lists content licensing under other revenue | Tier A: high elasticity but high valuation
Wiley | WLY / US stock | Academic publishing | Research-content licensing + efficiency improvement | FY2025 AI licensing revenue about US$11 million, repeatedly cited by management as a growth driver | Tier A: small but real
Informa | INF / UK listed | Academic publishing/exhibition data | T&F content and data licensing, plus internal AI applications | Microsoft agreement 2024–2027, first year US$10 million, with subsequent deferred payments; the company says it highlights IP value | Tier A: undervalued academic licensing
Getty Images | GETY / US stock | Image library/visual data | Rights-cleared generation, training-set partnerships, AI-platform access | Contributors compensated as content is included in AI training sets; 2025 "Other revenue" up 35.2%, with mention of two important AI-platform partnerships | Tier B: high elasticity, high risk
Shutterstock | SSTK / US stock | Image library/training data | Multi-year OpenAI training-data partnership, contributor fund, generation tools | Signed a six-year agreement with OpenAI; the Contributor Fund exists publicly |
Pearson | PSO / UK and US | Education/assessment/corporate learning | Turns content, assessment, and corporate learning into AI-enhanced subscriptions and solutions | 2025 sales £3.577bn, adjusted operating profit £614m; deep cooperation with Microsoft/AWS/Google Cloud; expansion of enterprise customers and AI products | Tier B: more workflow than raw licensing
New York Times | NYT / US stock | Premium news/sports/cooking | Sued first then signed; monetizes through Amazon's first GenAI license | Signed a multi-year agreement with Amazon; continues to sue OpenAI/Microsoft and bears litigation costs | Tier B: strong brand, but heavier defensive attribute
Adobe | ADBE / US stock | Creative software/Stock ecosystem | Enhances Creative Cloud with safe training sets and content credentials, rather than selling corpora directly | Firefly is commercially usable for enterprises; Adobe Stock contributors receive Firefly bonus | Tier B: shovel seller
Warner Music Group | WMG / US stock | Music copyright | Shifts from litigation toward licensing, revenue sharing, and artist likeness | Participated in suing Suno/Udio; later showed licensing and platform cooperation with Suno/Udio/Klay and others | Tier B: high mid-to-long-term elasticity
Universal Music Group | UMG / Europe listed | Music copyright | Same as above, with stronger catalog and publishing rights | Participated in the Suno/Udio litigation and entered licensing arrangements such as Klay | Tier B: strong rights holder
Visual China Group | Visual China / A-share | Image library/copyright trading | "AI intelligence + content data + application scenarios," emphasizing commercial use, traceability, and platform service fees | Investor relations and annual-report summaries both emphasize AI-empowered copyright trading, creative customization, platform revenue sharing, and long-term agreements | Tier B: scarce China sample
Stack Overflow | Private | Developer UGC/enterprise knowledge | Data licensing + enterprise knowledge products + public-corpus API | Official data-licensing page, OpenAI partnership, Knowledge Solutions transformation | Tier A: high-quality private target
ProRata Private News citation/revenue-share platform Attributes AI answers and shares revenue with publishers News/Media Alliance framework agreement, 50% of revenue shared with publishers Tier B: new model, scale to be proven
TollBit Private Content-payment gateway/exchange AI agents pay websites directly At Series A claimed transactions live on product, with multiple publishers and AI companies integrated Tier B: infrastructure option
Cloudflare NET / US stock Crawler control/licensing infrastructure Default blocking of AI crawlers, Pay Per Crawl Already blocks AI crawlers by default and launched a paid-crawl pilot; supports RSL/content signals Tier B
Created by Humans Private Book rights platform Authors select training/RAG licensing by use The platform supports ISBN/upload claiming and AI-rights settings; partners with the Authors Guild Tier C: right direction, early validation
Vermillio Private Likeness/voice protection and licensing Provides monitoring, licensing, and protection for celebrities/IP Sony Music participated in the investment; TraceID is used for licensing and infringement identification Tier C: high potential, high uncertainty
Loti Private Likeness protection Face/voice monitoring and takedown Officially positioned as likeness protection for everyone Tier C: event-driven

Based on public evidence, companies can be split into five tiers:

  • Tier A (core direct beneficiaries of AI copyright/data licensing): Thomson Reuters, RELX, Wolters Kluwer, News Corp, Reddit, Wiley, Stack Overflow. The common thread: either they already have clear licensing revenue, or they upgrade high-barrier content directly into AI workflow subscriptions.

  • Tier B (clear beneficiaries, but with valuation, litigation, regulatory, or sustainability risk): NYT, Getty, Shutterstock, Pearson, Adobe, WMG, UMG, Visual China, Cloudflare, ProRata, TollBit. The common thread: the commercialization direction is clear, but the revenue share is still small, the market has already partly priced it in, or the thesis depends on adoption of new standards.

  • Tier C (AI licensing is mainly a defensive tool, with weak near-term profit elasticity): integrated education platforms, some large content platforms, and Created by Humans, Vermillio, and Loti, which lean more toward infrastructure and rights management. Their direction is right, but scale validation is early.

  • Tier D (strong narrative, but lacking verifiable licensing revenue in this public dataset): Most AI music/video generation startups, some "AI copyright concept" A-share software stocks, and companies that emphasize "AI content cooperation" but do not separately disclose contract amounts, revenue contribution, or customer expansion should stay on a watch list rather than be treated directly as beneficiaries. The most common problem in public materials is disclosing only "cooperation," "exploration," or "integration" without disclosing any financial contribution.

  • Tier E (may be hit by AI-generated content, AI search, or unlicensed substitution): Free-traffic-dependent mid-to-long-tail media, low-differentiation footage libraries, visual-content platforms lacking releases and metadata, and general data brokers without clear proof of provenance will face stronger price pressure and a compliance discount in the AI era. This risk is directly flagged in the risk disclosures of companies such as News Corp, Getty, and Stack Overflow.

Risk, Valuation, and Final Conclusions

First, valuation and market expectations. As of May 19, 2026, NYT's P/E was about 32.4x, Reddit's about 44.7x, Wiley's about 14.6x, Warner Music's about 20.5x, and S&P Global's about 26.5x. Market caps were about US$584 million for Shutterstock, US$404 million for Getty, US$15.1 billion for News Corp, and US$31.7 billion for Reddit. Judged purely on market pricing, the "AI content option" in Reddit and NYT is already not cheap. The revaluation room for Wiley, News Corp, and Getty/Shutterstock depends more on subsequent contract renewals and a rising revenue share. The professional-database leaders look more like high-quality compounders: their valuations are not cheap, but their logic is the most stable.

Using the user's weighting framework, my suggested positive scoring model is as follows: direct exposure to AI licensing revenue 20%; content-asset and copyright clarity 25%; customer quality and bargaining power 15%; data governance/metadata/API capability 10%; litigation and regulatory risk management 10%; financial quality and margins 10%; valuation reasonableness 10%. Under this model, the group with the highest current priority is usually not the "hottest" media stocks but companies that combine all three of content barriers, workflow, and compliance.

Rank | Company | Directional total | Core reason
1 | RELX | 84 | Legal/science content library + AI products + high margin + high update frequency
2 | Thomson Reuters | 83 | Professional database + CoCounsel + legal case-law environment friendlier to professional content
3 | Wolters Kluwer | 82 | Regulatory/medical/tax workflow content + AI embedding
4 | News Corp | 80 | Top news brands + multiple AI licensing deals signed + Dow Jones/Factiva assets
5 | Reddit | 78 | Real API licensing revenue + extremely strong content freshness, but high valuation
6 | Wiley | 77 | Disclosed AI licensing revenue; small in scale but the most solid evidence
7 | Pearson | 76 | AI + assessment + enterprise customers, leaning toward productized subscription rather than one-off licensing
8 | Getty Images | 75 | Rights-cleared visual library + AI cooperation, but high litigation/financial risk
9 | Adobe | 74 | Sells shovels through safe generation and content credentials, not a pure-licensing beneficiary
10 | Informa | 73 | Academic/exhibition data licensing is real, but disclosure remains insufficient

The corresponding reverse risk-scoring model can be set as: insufficient licensing-revenue sustainability 20%; copyright-litigation and regulatory uncertainty 20%; high content substitutability 20%; AI companies' bargaining power too strong 15%; generated content depressing original-content value 15%; valuation too high 10%. Under this model, the highest risk usually sits not with professional databases but with high-narrative news, long-tail books, and the music/film/likeness tracks that standardized licensing does not yet cover and that have not formed stable revenue-share mechanisms.
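The two weighting schemes above can be sketched in Python as follows. This is only a minimal sketch: the dictionary keys, the `weighted_score` helper, the 0-100 factor scale, and both example profiles are assumptions made for illustration, and they are not the inputs behind the directional totals in the ranking table.

```python
# Weights (in percent) copied from the positive and reverse-risk models in the text.
# The dictionary keys are shorthand labels invented for this sketch.
POSITIVE_WEIGHTS = {
    "ai_licensing_revenue_exposure": 20,
    "content_asset_and_copyright_clarity": 25,
    "customer_quality_and_bargaining_power": 15,
    "data_governance_metadata_api": 10,
    "litigation_and_regulatory_risk_management": 10,
    "financial_quality_and_margins": 10,
    "valuation_reasonableness": 10,
}

RISK_WEIGHTS = {
    "licensing_revenue_unsustainable": 20,
    "litigation_and_regulatory_uncertainty": 20,
    "content_substitutability": 20,
    "ai_buyer_bargaining_power": 15,
    "generated_content_pressure": 15,
    "valuation_too_high": 10,
}


def weighted_score(factor_scores, weights):
    """Weighted average of factor scores, each assumed to lie on a 0-100 scale."""
    if set(factor_scores) != set(weights):
        raise ValueError("factor_scores must cover exactly the weighted factors")
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100 percent")
    return sum(factor_scores[name] * weights[name] for name in weights) / 100


# Hypothetical factor scores for an imaginary company; illustration only.
example_positive = {
    "ai_licensing_revenue_exposure": 70,
    "content_asset_and_copyright_clarity": 90,
    "customer_quality_and_bargaining_power": 85,
    "data_governance_metadata_api": 80,
    "litigation_and_regulatory_risk_management": 80,
    "financial_quality_and_margins": 90,
    "valuation_reasonableness": 60,
}
example_risk = {
    "licensing_revenue_unsustainable": 20,
    "litigation_and_regulatory_uncertainty": 20,
    "content_substitutability": 15,
    "ai_buyer_bargaining_power": 30,
    "generated_content_pressure": 20,
    "valuation_too_high": 60,
}

positive_total = weighted_score(example_positive, POSITIVE_WEIGHTS)
risk_total = weighted_score(example_risk, RISK_WEIGHTS)
print(f"positive total: {positive_total:.2f}, risk total: {risk_total:.2f}")
```

Keeping the weights in plain dictionaries makes it easy to check that each scheme sums to 100% and to rerun the ranking under alternative weightings.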

There are six main categories of systemic risk. First, U.S. courts continuing to expand fair-use space would suppress the long-term pricing of general-training licensing; second, AI companies cutting procurement and shifting to user data, internal data, and synthetic data would make one-off corpus contracts hard to renew; third, AI search diverting traffic faster than licensing compensates for it would squeeze small and mid-size publishers from both sides; fourth, the complex rights chain for music/likeness/voice means that even willingness to pay may not enable efficient rights clearance; fifth, provenance, watermarking, and detection technologies are not perfect, so infrastructure companies also carry technology-delivery risk; sixth, at high valuations the market will focus more on "revenue share" than on the "story."

The final conclusions can be condensed into the following ten points:

First, AI content copyright and data licensing is the "high-quality data supply layer" of the AI value chain, but not all content can be repriced; what can truly be priced is content that is rights-clear, rich in metadata, timely, industry-scarce, and verifiable in provenance.

Second, the five sub-tracks most worth watching are: professional-database AI workflows, UGC/API data licensing, rights-cleared visual data, academic/educational TDM licensing, and enterprise RAG data governance. If "selling shovels" is included, then Cloudflare/RSL/TollBit/compliance audit also merit high-priority tracking.

Third, the ten listed companies most worth deep research are: RELX, Thomson Reuters, Wolters Kluwer, News Corp, Reddit, Wiley, Informa, Pearson, Getty Images, and Adobe. For those who prefer music and the Chinese market, add Warner Music, UMG, and Visual China as elasticity supplements.

Fourth, among unlisted companies, the ten most worth tracking are: Stack Overflow, TollBit, ProRata, Created by Humans, Vermillio, Loti, Scale AI, Databricks, Rightsify, and RSL Collective. The first five bet more directly on the "copyright and licensing layer," while the latter five lean toward the "data and infrastructure layer."

Fifth, the five points the market most easily misreads are: (1) Assuming all copyright will be charged for universally; in reality it is more likely that "high-value data gets charged first." (2) Assuming all AI licensing revenue will become large incremental gains; in reality much of it is merely defensive compensation. (3) Assuming media is the highest-quality beneficiary; in reality professional databases are often stronger. (4) Assuming music and likeness will immediately form a large market; in reality rights clearance and revenue splitting are the most complex there. (5) Assuming copyright-tech platforms have already scaled; in reality most are still in early validation.

Sixth, the metrics most worth tracking over the next six to twelve months are: the share of AI licensing revenue, the licensing/other revenue split, the number of licensing contracts and renewal rate, the number of enterprise RAG customers, API call volume, AI-search citation traffic and revenue share, AI platforms' rev-share disclosure to publishers/creators, EU training-data summary enforcement, and the next round of rulings in key U.S. AI copyright cases.
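Two of the metrics above, the share of AI licensing revenue and the contract renewal rate, reduce to simple ratios. The following is a minimal sketch, assuming a company discloses the inputs at all (many do not); the `LicensingSnapshot` class, its field names, and the sample figures are hypothetical and do not describe any company in this report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LicensingSnapshot:
    """One reporting period for a hypothetical content owner (amounts in one currency unit)."""
    total_revenue: float
    ai_licensing_revenue: float
    contracts_due_for_renewal: int
    contracts_renewed: int

    def ai_licensing_share(self) -> float:
        """AI licensing revenue as a fraction of total revenue."""
        if self.total_revenue <= 0:
            raise ValueError("total_revenue must be positive")
        return self.ai_licensing_revenue / self.total_revenue

    def renewal_rate(self) -> Optional[float]:
        """Fraction of expiring contracts that were renewed; None if none were due."""
        if self.contracts_due_for_renewal == 0:
            return None
        return self.contracts_renewed / self.contracts_due_for_renewal


# Invented figures for illustration only.
snapshot = LicensingSnapshot(
    total_revenue=1_000.0,
    ai_licensing_revenue=25.0,
    contracts_due_for_renewal=8,
    contracts_renewed=6,
)
print(f"AI licensing share: {snapshot.ai_licensing_share():.1%}")
print(f"Renewal rate: {snapshot.renewal_rate():.0%}")
```

Returning `None` when no contracts were due keeps a period without renewals from being misread as a 0% renewal rate.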

Seventh, among so-called "AI copyright platform-type companies," I lean toward placing RELX, Thomson Reuters, Wolters Kluwer, Reddit, Stack Overflow, Getty, and Cloudflare at the core. The "AI-native data-licensing challengers" are TollBit, ProRata, Created by Humans, Vermillio, Loti, and RSL Collective. The "AI content-licensing shovel sellers" layer includes Adobe, Cloudflare, Snowflake, Databricks, Elastic, and MongoDB.

Eighth, the companies at higher risk of being hit by AI-generated content, AI search, or unlicensed training are not all content companies but those that lack originality, lack real-time data, have unclear rights chains, have no data-interface capability, and depend heavily on search-traffic redistribution. This risk is most evident in news, general internet content, and low-differentiation footage libraries.

Ninth, for narrower follow-up research, my suggested priority order is: professional-database licensing and AI workflows, UGC data licensing, news-content licensing and AI-search revenue sharing, image-library training data and safe generation, music AI licensing and likeness rights, enterprise RAG data governance, and content provenance and model compliance audit. These directions are closer to real profit pools than a vague "AI copyright" theme.

Tenth, open questions and limitations need to be made clear: a large number of licensing contracts still do not disclose amounts; many companies blend AI licensing revenue into "other revenue/licensing/subscription" without breaking it out; public financial evidence for the music, film/TV, and personality-rights tracks is clearly thinner than for news and professional databases; and the rules in jurisdictions such as China, South Korea, Japan, and Australia are still evolving. So at this stage what matters most is not "who told an AI copyright story" but who has already proven contracts, customers, revenue, renewals, and a workflow position.

This report is based on public information and does not constitute investment advice. Markets carry risk; invest with caution.

AI Copyright · Data Licensing · Professional Databases · UGC Licensing · Enterprise RAG · News Licensing · Fair Use