Supported content types for text extraction and transcript generation – Docebo Help & Support

Introduction

When you upload content to your platform, the system analyzes the file to extract textual information. Depending on the format, the platform retrieves this information through text extraction (for documents, images, web files, and similar formats) or transcript generation (for audio, video, and supported learning packages).

Only content from which the system can successfully extract or generate text can be used by platform features that rely on textual analysis.

This article outlines all supported content types and the conditions required for successful text extraction and transcript generation.

Supported content types for content analysis

The following table lists all content types that the platform can analyze.

Category	Types	Extracted content	Training materials / Assets
Text files	.txt, .csv	Text	Training materials and assets
Document files	.doc, .docx, .odt, .ppt, .pptx, .pdf, .xls, .xlsx	Text	Training materials and assets
Image files	.bmp, .jpeg, .png, .tiff	Text in the image	Training materials and assets
Web files	.html, .htm (Note: when a web page URL is provided, text is extracted only from that specific page; content behind links embedded in the page is not extracted.)	Text	Training materials and assets
Audio files	.aac, .mpeg, .wav	Audio transcription	Training materials and assets
Video files	.mp4, .mov	Audio transcription	Training materials and assets
Google Workspace files✴	Docs, Sheets, Slides	Text	Training materials and assets
Linked online videos✴	YouTube, Vimeo, Wistia	Subtitles	Training materials and assets
E-learning packages✴	SCORM and xAPI/TinCan (Articulate Rise and Articulate Storyline)	Text and audio transcription	Training materials
Docebo files	Creator lessons	Text and audio transcription	Training materials

✴ Private content (content that requires authentication to access) is not supported.
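
As a rough illustration of how the table can be read, the sketch below maps a few representative file extensions to their category and to the kind of content the platform extracts. It is only a minimal example: the lookup covers a subset of the table, and the names `EXTRACTION_BY_EXTENSION` and `expected_extraction` are hypothetical helpers, not part of any Docebo API.

```python
from pathlib import PurePath
from typing import Optional

# Illustrative subset of the table above: extension -> (category, extracted content).
# These names are hypothetical and are not part of any Docebo API.
EXTRACTION_BY_EXTENSION = {
    ".txt": ("Text files", "Text"),
    ".csv": ("Text files", "Text"),
    ".docx": ("Document files", "Text"),
    ".pdf": ("Document files", "Text"),
    ".png": ("Image files", "Text in the image"),
    ".html": ("Web files", "Text"),
    ".wav": ("Audio files", "Audio transcription"),
    ".mp4": ("Video files", "Audio transcription"),
}


def expected_extraction(filename: str) -> Optional[tuple[str, str]]:
    """Return (category, extracted content) for a filename, or None if it is not in the lookup."""
    # Compare extensions case-insensitively, so "Slides.PDF" matches ".pdf".
    suffix = PurePath(filename).suffix.lower()
    return EXTRACTION_BY_EXTENSION.get(suffix)
```

A check like this could be run before upload to flag files that are unlikely to be analyzed. A match only means the file type is listed; it does not guarantee that extraction will succeed.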

Unsupported content and extraction limitations

Content types not listed in the table above are not supported for text extraction or transcript generation. These include assignment, Docebo Learning Impact (DLI), LTI, observation checklist, survey, test, Elucidat, archive, playlist, Shape, and AICC.

Being a supported content type is not sufficient on its own: the system must also be able to extract text or generate a transcript from the content. If text extraction or transcript generation fails, the content cannot be used by features that rely on textual analysis.

Text extraction or transcript generation may fail in the following cases:

Audio or video files that contain no speech (for example, background music only)
Transcripts shorter than 30 words, which are discarded
Private content that requires authentication to be accessed
Images compressed to a degree that prevents accurate Optical Character Recognition (OCR)
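
The 30-word minimum can be approximated locally before upload. The sketch below assumes words are counted by splitting on whitespace, which may differ from how the platform actually counts them. `MIN_TRANSCRIPT_WORDS` and `is_transcript_usable` are hypothetical names used only for illustration.

```python
# Threshold taken from the list above: transcripts shorter than 30 words are discarded.
MIN_TRANSCRIPT_WORDS = 30


def is_transcript_usable(transcript: str) -> bool:
    """Return True if the transcript has at least MIN_TRANSCRIPT_WORDS words.

    Assumption: words are whitespace-separated tokens. The platform's own
    word-counting rule is not documented here and may differ.
    """
    return len(transcript.split()) >= MIN_TRANSCRIPT_WORDS
```

For example, a short jingle with a ten-word voice-over would fall below the threshold, so its transcript would be expected to be discarded.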

Only content from which the platform can successfully extract text or generate a transcript can be used by features such as global search, Harmony, and auto tagging.