How to load Markdown
Markdown is a lightweight markup language for creating formatted text using a plain-text editor.
Here we cover how to load Markdown documents into LangChain Document objects that we can use downstream.
We will cover:
- Basic usage;
- Parsing of Markdown into elements such as titles, list items, and text.
LangChain implements an UnstructuredLoader class.
This guide assumes familiarity with the following concepts:
Installation
npm i @langchain/community
yarn add @langchain/community
pnpm add @langchain/community
Setup
Although Unstructured has an open source offering, you're still required to provide an API key to access the service. To get everything up and running, follow these two steps:
- Download & start the Docker container:
docker run -p 8000:8000 -d --rm --name unstructured-api quay.io/unstructured-io/unstructured-api:latest --port 8000 --host 0.0.0.0
- Get a free API key & API URL here, and set them in your environment (per the Unstructured website, it may take up to an hour to allocate your API key & URL):
export UNSTRUCTURED_API_KEY="..."
# Replace with your `Full URL` from the email
export UNSTRUCTURED_API_URL="https://<ORG_NAME>-<SECRET>.api.unstructuredapp.io/general/v0/general"
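Before constructing the loader, it can help to fail fast if either variable is missing. The sketch below is one way to do that. readUnstructuredConfig is a hypothetical helper name, not part of LangChain, and it takes the environment as a plain object so it can be exercised without real credentials.

```typescript
// Hypothetical helper (not part of LangChain): collects the two settings
// exported above and throws a descriptive error if either is missing.
interface UnstructuredConfig {
  apiKey: string;
  apiUrl: string;
}

function readUnstructuredConfig(
  env: Record<string, string | undefined>
): UnstructuredConfig {
  const apiKey = env.UNSTRUCTURED_API_KEY;
  const apiUrl = env.UNSTRUCTURED_API_URL;

  // Collect every missing name so the error reports all of them at once.
  const missing: string[] = [];
  if (!apiKey) missing.push("UNSTRUCTURED_API_KEY");
  if (!apiUrl) missing.push("UNSTRUCTURED_API_URL");

  if (!apiKey || !apiUrl) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return { apiKey, apiUrl };
}
```

In a Node.js script, one would call readUnstructuredConfig(process.env) and pass the resulting apiKey and apiUrl to the loader options.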
Basic usage loads a Markdown file into a list of Document objects, one for each element that Unstructured detects. Here we demonstrate on LangChain's README:
import { UnstructuredLoader } from "@langchain/community/document_loaders/fs/unstructured";

// Path to the Markdown file to load
const markdownPath = "../../../../README.md";

// Credentials come from the environment variables set during setup
const loader = new UnstructuredLoader(markdownPath, {
  apiKey: process.env.UNSTRUCTURED_API_KEY,
  apiUrl: process.env.UNSTRUCTURED_API_URL,
});

// load() resolves to an array of Document objects
const data = await loader.load();
console.log(data);
[
Document {
pageContent: '🦜️🔗 LangChain.js',
metadata: {
languages: [Array],
filename: 'README.md',
filetype: 'text/markdown',
category: 'Title'
}
},
Document {
pageContent: '⚡ Building applications with LLMs through composability ⚡',
metadata: {
languages: [Array],
filename: 'README.md',
filetype: 'text/markdown',
category: 'Title'
}
},
Document {
pageContent: 'Looking for the Python version? Check out LangChain.',
metadata: {
languages: [Array],
parent_id: '7ea17bcb17b10f303cbb93b4cb95de93',
filename: 'README.md',
filetype: 'text/markdown',
category: 'NarrativeText'
}
},
Document {
pageContent: 'To help you ship LangChain apps to production faster, check out LangSmith.\n' +
'LangSmith is a unified developer platform for building, testing, and monitoring LLM applications.\n' +
'Fill out this form to get on the waitlist or speak with our sales team.',
metadata: {
languages: [Array],
parent_id: '7ea17bcb17b10f303cbb93b4cb95de93',
filename: 'README.md',
filetype: 'text/markdown',
category: 'NarrativeText'
}
},
Document {
pageContent: '⚡️ Quick Install',
metadata: {
languages: [Array],
filename: 'README.md',
filetype: 'text/markdown',
category: 'Title'
}
},
Document {
pageContent: 'You can use npm, yarn, or pnpm to install LangChain.js',
metadata: {
languages: [Array],
parent_id: '8f698a6f3038c268bf6d65bc6065890b',
filename: 'README.md',
filetype: 'text/markdown',
category: 'NarrativeText'
}
},
Document {
pageContent: 'npm install -S langchain or yarn add langchain or pnpm add langchain',
metadata: {
languages: [Array],
parent_id: '8f698a6f3038c268bf6d65bc6065890b',
filename: 'README.md',
filetype: 'text/markdown',
category: 'NarrativeText'
}
},
Document {
pageContent: 'typescript\nimport { ChatOpenAI } from "langchain/chat_models/openai";',
metadata: {
languages: [Array],
filename: 'README.md',
filetype: 'text/markdown',
category: 'Title'
}
},
Document {
pageContent: '🌐 Supported Environments',
metadata: {
languages: [Array],
filename: 'README.md',
filetype: 'text/markdown',
category: 'Title'
}
},
Document {
pageContent: 'LangChain is written in TypeScript and can be used in:',
metadata: {
languages: [Array],
parent_id: '975643d774ab3b861962f9dc13588d84',
filename: 'README.md',
filetype: 'text/markdown',
category: 'NarrativeText'
}
},
Document {
pageContent: 'Node.js (ESM and CommonJS) - 18.x, 19.x, 20.x',
metadata: {
languages: [Array],
parent_id: '975643d774ab3b861962f9dc13588d84',
filename: 'README.md',
filetype: 'text/markdown',
category: 'ListItem'
}
},
Document {
pageContent: 'Cloudflare Workers',
metadata: {
languages: [Array],
parent_id: '975643d774ab3b861962f9dc13588d84',
filename: 'README.md',
filetype: 'text/markdown',
category: 'ListItem'
}
},
Document {
pageContent: 'Vercel / Next.js (Browser, Serverless and Edge functions)',
metadata: {
languages: [Array],
parent_id: '975643d774ab3b861962f9dc13588d84',
filename: 'README.md',
filetype: 'text/markdown',
category: 'ListItem'
}
},
Document {
pageContent: 'Supabase Edge Functions',
metadata: {
languages: [Array],
parent_id: '975643d774ab3b861962f9dc13588d84',
filename: 'README.md',
filetype: 'text/markdown',
category: 'ListItem'
}
},
Document {
pageContent: 'Browser',
metadata: {
languages: [Array],
parent_id: '975643d774ab3b861962f9dc13588d84',
filename: 'README.md',
filetype: 'text/markdown',
category: 'ListItem'
}
},
Document {
pageContent: 'Deno',
metadata: {
languages: [Array],
parent_id: '975643d774ab3b861962f9dc13588d84',
filename: 'README.md',
filetype: 'text/markdown',
category: 'ListItem'
}
},
Document {
pageContent: '🤔 What is LangChain?',
metadata: {
languages: [Array],
filename: 'README.md',
filetype: 'text/markdown',
category: 'Title'
}
},
Document {
pageContent: 'LangChain is a framework for developing applications powered by language models. It enables applications that:\n' +
'- Are context-aware: connect a language model to sources of context (prompt instructions, few shot examples, content to ground its response in, etc.)\n' +
'- Reason: rely on a language model to reason (about how to answer based on provided context, what actions to take, etc.)',
metadata: {
languages: [Array],
parent_id: 'e2396958560b4688b2a242fbe54cd832',
filename: 'README.md',
filetype: 'text/markdown',
category: 'NarrativeText'
}
},
Document {
pageContent: 'This framework consists of several parts.\n' +
'- LangChain Libraries: The Python and JavaScript libraries. Contains interfaces and integrations for a myriad of components, a basic runtime for combining these components into chains and agents, and off-the-shelf implementations of chains and agents.\n' +
'- LangChain Templates: (currently Python-only) A collection of easily deployable reference architectures for a wide variety of tasks.\n' +
'- LangServe: (currently Python-only) A library for deploying LangChain chains as a REST API.\n' +
'- LangSmith: A developer platform that lets you debug, test, evaluate, and monitor chains built on any LLM framework and seamlessly integrates with LangChain.',
metadata: {
languages: [Array],
parent_id: 'e2396958560b4688b2a242fbe54cd832',
filename: 'README.md',
filetype: 'text/markdown',
category: 'NarrativeText'
}
},
Document {
pageContent: 'The LangChain libraries themselves are made up of several different packages.\n' +
'- @langchain/core: Base abstractions and LangChain Expression Language.\n' +
'- @langchain/community: Third party integrations.\n' +
"- langchain: Chains, agents, and retrieval strategies that make up an application's cognitive architecture.",
metadata: {
languages: [Array],
parent_id: 'e2396958560b4688b2a242fbe54cd832',
filename: 'README.md',
filetype: 'text/markdown',
category: 'NarrativeText'
}
},
Document {
pageContent: 'Integrations may also be split into their own compatible packages.',
metadata: {
languages: [Array],
parent_id: 'e2396958560b4688b2a242fbe54cd832',
filename: 'README.md',
filetype: 'text/markdown',
category: 'NarrativeText'
}
},
Document {
pageContent: 'This library aims to assist in the development of those types of applications. Common examples of these applications include:',
metadata: {
languages: [Array],
parent_id: 'e2396958560b4688b2a242fbe54cd832',
filename: 'README.md',
filetype: 'text/markdown',
category: 'NarrativeText'
}
},
Document {
pageContent: '❓Question Answering over specific documents',
metadata: {
languages: [Array],
filename: 'README.md',
filetype: 'text/markdown',
category: 'Title'
}
},
Document {
pageContent: 'Documentation',
metadata: {
languages: [Array],
parent_id: '2321e263d4278955b49ae7185a2e7071',
filename: 'README.md',
filetype: 'text/markdown',
category: 'ListItem'
}
},
Document {
pageContent: 'End-to-end Example: Doc-Chatbot',
metadata: {
languages: [Array],
parent_id: '2321e263d4278955b49ae7185a2e7071',
filename: 'README.md',
filetype: 'text/markdown',
category: 'ListItem'
}
},
Document {
pageContent: '💬 Chatbots',
metadata: {
languages: [Array],
filename: 'README.md',
filetype: 'text/markdown',
category: 'Title'
}
},
Document {
pageContent: 'Documentation',
metadata: {
languages: [Array],
parent_id: '13bfe7de8241ff139f084c9528169836',
filename: 'README.md',
filetype: 'text/markdown',
category: 'ListItem'
}
},
Document {
pageContent: 'End-to-end Example: Chat-LangChain',
metadata: {
languages: [Array],
parent_id: '13bfe7de8241ff139f084c9528169836',
filename: 'README.md',
filetype: 'text/markdown',
category: 'ListItem'
}
},
Document {
pageContent: '🚀 How does LangChain help?',
metadata: {
languages: [Array],
filename: 'README.md',
filetype: 'text/markdown',
category: 'Title'
}
},
Document {
pageContent: 'The main value props of the LangChain libraries are:\n' +
'1. Components: composable tools and integrations for working with language models. Components are modular and easy-to-use, whether you are using the rest of the LangChain framework or not\n' +
'2. Off-the-shelf chains: built-in assemblages of components for accomplishing higher-level tasks',
metadata: {
languages: [Array],
parent_id: '1967058b7817d63c366c58df67e61178',
filename: 'README.md',
filetype: 'text/markdown',
category: 'NarrativeText'
}
},
Document {
pageContent: 'Off-the-shelf chains make it easy to get started. Components make it easy to customize existing chains and build new ones.',
metadata: {
languages: [Array],
parent_id: '1967058b7817d63c366c58df67e61178',
filename: 'README.md',
filetype: 'text/markdown',
category: 'NarrativeText'
}
},
Document {
pageContent: 'Components fall into the following modules:',
metadata: {
languages: [Array],
parent_id: '1967058b7817d63c366c58df67e61178',
filename: 'README.md',
filetype: 'text/markdown',
category: 'NarrativeText'
}
},
Document {
pageContent: '📃 Model I/O:',
metadata: {
languages: [Array],
filename: 'README.md',
filetype: 'text/markdown',
category: 'Title'
}
},
Document {
pageContent: 'This includes prompt management, prompt optimization, a generic interface for all LLMs, and common utilities for working with LLMs.',
metadata: {
languages: [Array],
parent_id: '7742f15be2acbf645543557b71bee56e',
filename: 'README.md',
filetype: 'text/markdown',
category: 'NarrativeText'
}
},
Document {
pageContent: '📚 Retrieval:',
metadata: {
languages: [Array],
filename: 'README.md',
filetype: 'text/markdown',
category: 'Title'
}
},
Document {
pageContent: 'Data Augmented Generation involves specific types of chains that first interact with an external data source to fetch data for use in the generation step. Examples include summarization of long pieces of text and question/answering over specific data sources.',
metadata: {
languages: [Array],
parent_id: '6a6b63610d2ca00f121f094a94d520be',
filename: 'README.md',
filetype: 'text/markdown',
category: 'NarrativeText'
}
},
Document {
pageContent: '🤖 Agents:',
metadata: {
languages: [Array],
filename: 'README.md',
filetype: 'text/markdown',
category: 'Title'
}
},
Document {
pageContent: 'Agents involve an LLM making decisions about which Actions to take, taking that Action, seeing an Observation, and repeating that until done. LangChain provides a standard interface for agents, a selection of agents to choose from, and examples of end-to-end agents.',
metadata: {
languages: [Array],
parent_id: 'cc022877b6536240ca7e38e6827c4dba',
filename: 'README.md',
filetype: 'text/markdown',
category: 'NarrativeText'
}
},
Document {
pageContent: '📖 Documentation',
metadata: {
languages: [Array],
filename: 'README.md',
filetype: 'text/markdown',
category: 'Title'
}
},
Document {
pageContent: 'Please see here for full documentation, which includes:',
metadata: {
languages: [Array],
parent_id: 'e38f3af90533af34e7e50debd571bfc1',
filename: 'README.md',
filetype: 'text/markdown',
category: 'NarrativeText'
}
},
Document {
pageContent: 'Getting started: installation, setting up the environment, simple examples',
metadata: {
languages: [Array],
parent_id: 'e38f3af90533af34e7e50debd571bfc1',
filename: 'README.md',
filetype: 'text/markdown',
category: 'ListItem'
}
},
Document {
pageContent: 'Overview of the interfaces, modules and integrations',
metadata: {
languages: [Array],
parent_id: 'e38f3af90533af34e7e50debd571bfc1',
filename: 'README.md',
filetype: 'text/markdown',
category: 'ListItem'
}
},
Document {
pageContent: 'Use case walkthroughs and best practice guides',
metadata: {
languages: [Array],
parent_id: 'e38f3af90533af34e7e50debd571bfc1',
filename: 'README.md',
filetype: 'text/markdown',
category: 'ListItem'
}
},
Document {
pageContent: 'Reference: full API docs',
metadata: {
languages: [Array],
parent_id: 'e38f3af90533af34e7e50debd571bfc1',
filename: 'README.md',
filetype: 'text/markdown',
category: 'ListItem'
}
},
Document {
pageContent: '💁 Contributing',
metadata: {
languages: [Array],
filename: 'README.md',
filetype: 'text/markdown',
category: 'Title'
}
},
Document {
pageContent: 'As an open-source project in a rapidly developing field, we are extremely open to contributions, whether it be in the form of a new feature, improved infrastructure, or better documentation.',
metadata: {
languages: [Array],
parent_id: '248eb0e90cb2116083e2351ddd5218b8',
filename: 'README.md',
filetype: 'text/markdown',
category: 'NarrativeText'
}
},
Document {
pageContent: 'For detailed information on how to contribute, see here.',
metadata: {
languages: [Array],
parent_id: '248eb0e90cb2116083e2351ddd5218b8',
filename: 'README.md',
filetype: 'text/markdown',
category: 'NarrativeText'
}
},
Document {
pageContent: 'Please report any security issues or concerns following our security guidelines.',
metadata: {
languages: [Array],
parent_id: '248eb0e90cb2116083e2351ddd5218b8',
filename: 'README.md',
filetype: 'text/markdown',
category: 'NarrativeText'
}
},
Document {
pageContent: '🖇️ Relationship with Python LangChain',
metadata: {
languages: [Array],
filename: 'README.md',
filetype: 'text/markdown',
category: 'Title'
}
},
Document {
pageContent: 'This is built to integrate as seamlessly as possible with the LangChain Python package. Specifically, this means all objects (prompts, LLMs, chains, etc) are designed in a way where they can be serialized and shared between languages.',
metadata: {
languages: [Array],
parent_id: '48411b9b9512447054ee50f01d3fd6ee',
filename: 'README.md',
filetype: 'text/markdown',
category: 'NarrativeText'
}
}
]
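Because each Document carries its element type in metadata.category and, for nested elements, a parent_id, the flat list can be summarized or regrouped after loading. The sketch below works on a minimal local stand-in for the Document shape (LoadedDoc, countByCategory, and groupByParent are illustrative names, not LangChain exports) so it runs without calling the API. The sample values are abbreviated from the output above.

```typescript
// Minimal stand-in for the fields used here; the Document objects in the
// output above have this same pageContent/metadata shape.
interface LoadedDoc {
  pageContent: string;
  metadata: { category?: string; parent_id?: string };
}

// Count how many documents fall into each element category.
function countByCategory(docs: LoadedDoc[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const doc of docs) {
    const category = doc.metadata.category ?? "Unknown";
    counts.set(category, (counts.get(category) ?? 0) + 1);
  }
  return counts;
}

// Group documents that share a parent_id. Elements without one
// (such as the titles in the output above) are skipped.
function groupByParent(docs: LoadedDoc[]): Map<string, LoadedDoc[]> {
  const groups = new Map<string, LoadedDoc[]>();
  for (const doc of docs) {
    const parentId = doc.metadata.parent_id;
    if (parentId === undefined) continue;
    const group = groups.get(parentId) ?? [];
    group.push(doc);
    groups.set(parentId, group);
  }
  return groups;
}

// Sample data abbreviated from the output above.
const sampleParentId = "975643d774ab3b861962f9dc13588d84";
const sampleDocs: LoadedDoc[] = [
  { pageContent: "Supported Environments", metadata: { category: "Title" } },
  {
    pageContent: "LangChain is written in TypeScript and can be used in:",
    metadata: { category: "NarrativeText", parent_id: sampleParentId },
  },
  {
    pageContent: "Cloudflare Workers",
    metadata: { category: "ListItem", parent_id: sampleParentId },
  },
  {
    pageContent: "Deno",
    metadata: { category: "ListItem", parent_id: sampleParentId },
  },
];

console.log(countByCategory(sampleDocs));
console.log(groupByParent(sampleDocs).get(sampleParentId)?.length);
```

With real loader output, passing data to either function in place of sampleDocs should work as long as the metadata fields match those shown above.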
Retain Elements
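Under the hood, Unstructured creates different "elements" for different chunks of text. As the output above shows, each element is returned as its own Document by default. Specifying chunkingStrategy: "by_title" instead combines the elements under each title into a single CompositeElement document, while the original elements remain available in encoded form in the orig_elements metadata field.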
const loader = new UnstructuredLoader(markdownPath, {
  chunkingStrategy: "by_title",
});
const data = await loader.load();
console.log(`Number of documents: ${data.length}\n`);
for (const doc of data.slice(0, 2)) {
  console.log(doc);
  console.log("\n");
}
Number of documents: 13
Document {
pageContent: '🦜️🔗 LangChain.js\n' +
'\n' +
'⚡ Building applications with LLMs through composability ⚡\n' +
'\n' +
'Looking for the Python version? Check out LangChain.\n' +
'\n' +
'To help you ship LangChain apps to production faster, check out LangSmith.\n' +
'LangSmith is a unified developer platform for building, testing, and monitoring LLM applications.\n' +
'Fill out this form to get on the waitlist or speak with our sales team.',
metadata: {
filename: 'README.md',
filetype: 'text/markdown',
languages: [ 'eng' ],
orig_elements: 'eJzNUtuO0zAQ/ZVRnquSS3PjBcGyPHURgr5tV2hijxNTJ45ip0u14t8Zp1y6CCF4ACFLlufuc+bcPkRkqKfBv9cyegpREWNZosxS0RRVzmeTCiFlnmRUFZmQ0QqinjxK9Mj5D5HShgbsKRS/vX7+8uZ63S9ZIeBP4xLw9NE/6XxvQsDg0M7YkuPIbURDG919Wp1zQu5+llVGfMta7GdFsVo8MniSErZcfdWhHtYfXOj2dcROe0MRN/oRUUmYlI1o+EpilcWZaJo6azaiqXNJdfYvEKUFJvBi1kbqoQUcR6MFem0HB/fad7Dd3jjw3WTntgNh+9E6bLTR/gTn4t9CmhHFTc1w80oKSUlTpFWaFKWsVR5nFf0dpOwdcfoDvi+p2Vp7CJQoOzF+gjcn39kBjjQ5ZucZXHUkDmBnf7H3Sy5e4zQxkUfahYY/4UQqVcZJpSpspKqSMslVllWJzDdMC6XVf8jJzkJHZoSTncF1evwOPSiHdWJhnKycRRAQKHSephWIR0y961lW6/3w7Q3aAcI8aKVJgqQjGTvSBKNBz+T3ywaaLwpdgSfnlwcOEno7aG+nsCcW6iP58ohX2phlru94xtKLf9iSB/5d2Ok9smC1Y3sCNxIezpq3M5toiAER9r/a6t1n6BJ/zg==',
category: 'CompositeElement'
}
}
Document {
pageContent: '⚡️ Quick Install\n' +
'\n' +
'You can use npm, yarn, or pnpm to install LangChain.js\n' +
'\n' +
'npm install -S langchain or yarn add langchain or pnpm add langchain\n' +
'\n' +
'typescript\n' +
'import { ChatOpenAI } from "langchain/chat_models/openai";\n' +
'\n' +
'🌐 Supported Environments\n' +
'\n' +
'LangChain is written in TypeScript and can be used in:\n' +
'\n' +
'Node.js (ESM and CommonJS) - 18.x, 19.x, 20.x\n' +
'\n' +
'Cloudflare Workers\n' +
'\n' +
'Vercel / Next.js (Browser, Serverless and Edge functions)\n' +
'\n' +
'Supabase Edge Functions\n' +
'\n' +
'Browser\n' +
'\n' +
'Deno',
metadata: {
filename: 'README.md',
filetype: 'text/markdown',
languages: [ 'eng' ],
orig_elements: 'eJzNlm1v2zYQx7/KQa9WwE1Iik/qXnWpB2RoM2wOOgx1URzJY6pVogyJTlME/e6j3KZIhgBzULjIG0Li3VH+/e/BfHNdUUc9pfyuDdUzqGzUjUUda1ZbL7R1UQetnNdMK9swVy2g6iljwIzF/7qKbUcJe5qD/1w+f/FqedSH2Ws25E+bnSHTVT5+n/tuNnSYLrZ4QVOxvKkoXVRvPy+++My+663QyNfbSCzCH9vWf4DTNGXsdsE3J563uaOqxP0XIDSxCdobSZIYd9w7JpQlLU3TaKf4YQDK7gbHB8h4m/jvYQseE2wngrTpF/AJx7SAYYRNeYU8QPtFAHhZvnzyHtt09M90W40zHEfM7SWdz0fep0otuUISLBqMjfNFjMYzI6SWFFWQj1CVGf2G++kK5uP9jD7rMgsEGMLd3Z1ad3YfpJHWsubSchGQeNRItUGPElF7wck2hy/9OWbyY7vJ69T2m2HMcA0l3/n3DaXnp/AZ4jj0sK6+AR6XNb/rh0DddDwUL2zX1c97NUpjVAEOxkh0tbOaN1qU1vG8VtYGe6CSuNvpwda+rJEzWG03MzAFWKbLdhzS/FOnvUhcdChlNC6iKBWuJVrCGMhxIaKMP6i4/1fP2+jfGhnaCT6Obc5UHhOcl4+vdhUAmMJuKjiaB0Mo1mcPKmdBvlFWK6ZMaXfNI2ojIvNORMsUHWiSf5cqZ6WOy2SDn5arVzv+k6Hvh/Tb6gk8BW6PrhbAm3kV7Ojqthgv2ymfZurvrQ4hvRLCSaUEj8YG77TzQTNriYv6B/0hPEiHk24oTdGVePhrGD/QOO0LyxRHKZivAxldS41akzXcxELPm/oxJv01jZ46OIazsrHL/i/j8HGicQErGi9p7GiadtWwDBcEcZt8boc0PdlXE9KlAoSkZh4PtUBZ5oRjTAbiSgd3oLn+XZqUYYgOy3Vgh/zrDfK+xA0rqY6GaQrGo5JM1azcgawzjeOa2CMk/przvXMayvXQEA8meEmCsxiDrkO54/iAVvtHSPiC0nA/3tt/AY+igwk=',
category: 'CompositeElement'
}
}
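The orig_elements value looks like Base64 text, and its leading characters (eJ) are what a zlib stream typically encodes to. Assuming it is Base64-encoded, zlib-compressed JSON (an assumption based on that prefix, not something this guide states), the original elements could be recovered in Node.js roughly as follows. decodeOrigElements is a hypothetical helper name, and the round trip at the end uses made-up sample data rather than the strings above.

```typescript
import { deflateSync, inflateSync } from "node:zlib";

// Hypothetical helper: assumes the value is Base64-encoded,
// zlib-compressed JSON. Throws if either assumption does not hold.
function decodeOrigElements(encoded: string): unknown {
  const compressed = Buffer.from(encoded, "base64");
  const json = inflateSync(compressed).toString("utf8");
  return JSON.parse(json);
}

// Round trip on made-up data to illustrate the assumed encoding.
const sampleElements = [{ type: "Title", text: "Quick Install" }];
const sampleEncoded = deflateSync(
  Buffer.from(JSON.stringify(sampleElements), "utf8")
).toString("base64");

console.log(decodeOrigElements(sampleEncoded));
```

If the assumption holds, calling decodeOrigElements(doc.metadata.orig_elements) on a chunked document would give back the individual elements that were merged into it.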
Note that in this case we recover just one distinct element type:
const categories = new Set(data.map((document) => document.metadata.category));
console.log(categories);
Set(1) { 'CompositeElement' }