GPT Crawler: How to Automatically Create ChatGPT Knowledge Files from URLs

Written By Alston Antony

Created a 10k+ community of business owners and helped with thousands of digital business transformations.

Introduction to GPT-Crawler

In this guide, we are going to take an in-depth look at GPT-Crawler, an open-source project available on GitHub right now. It is an easy-to-use solution for creating knowledge files in JSON format, which you can use for ChatGPT, a custom GPT, an AI assistant, or even in the Playground within your OpenAI account.

You just need to provide the URL parameters, and it will create a custom knowledge base file in JSON format that you can use right away.

In this video, I will show you how to download it, set it up, install it, and start using the files it creates. So without further ado, let me get directly into it.

First, this is the project I was talking about, called GPT Crawler. I will leave a link to it in the description. I discovered it just today after seeing so many tweets about it on Twitter.

Basically, what it does is crawl a site to generate a knowledge file so you can create your own custom GPT from one or multiple URLs. The project page provides a basic introduction, a video example, the steps needed to get started (which I will explain in just a moment), the configuration settings, and instructions on how to create a custom GPT or a custom assistant with these knowledge files.

If this sounds confusing, don't worry; I will explain each of these steps as I go through the video.

Links mentioned in the video:

Installation and Setup of GPT Crawler

First, let me show you how to download and install it so we can start using it. To make the whole process easy, first make sure that you have Node.js installed on your computer. I encountered an error while installing, and only then did I notice that I didn't have Node.js.

So the first step is to download Node.js; I will leave the link in the description. Choose the installer for your operating system. I'm a Windows user, so I used the Windows installer to install Node.js.

The installation wizard is pretty easy.

There are no custom options; just click Next until you reach the Finish button.

I'm not going to install it again because I have already installed it here.

Once you have Node.js installed, you don't need to worry; everything else is easy because you just need one more piece of software, which is also free: Visual Studio Code.

This is not mandatory; you can also use various other applications that help you run and maintain code effectively.

But I prefer Visual Studio Code because it is really helpful in this case.

Here too you have multiple options for Mac, Windows, or Linux; choose whichever one you need. This is also completely free.

So once you download it, open up Visual Studio Code. You will see something like this.

If you are opening it for the first time, you won't see exactly this; instead, you will see the welcome screens.

But here, the first thing we need to do is clone a Git repository. When we click on that option, it will ask us to provide a link.

To do that, first go to the project's GitHub page, click the button to copy the repository URL, paste it into Visual Studio Code, and press Enter. It will then ask you to select a folder where all the code and output files will be placed. Here you can see I already have a folder because I've been testing this application, so I'm going to create a new one called Video.

I'm naming it that since this is for the video demonstration, and then I select that folder.

Cloning and Configuring Files

Once I do that, it will automatically clone the source code for me. Now I'm going to click Open.

Once it opens, this is the screen I am greeted with, the same thing I have here. Apart from that, this is the part we are interested in.

First we need to wait for all the files to show up. The config file is the important one, where we set everything up.

But before we start using it, we need to install it. So first we are going to open up a new terminal.

You can see I have opened a new terminal here. Then type the command shown in the README to install all the dependencies you need.

You can see that as soon as I paste it, it starts downloading all the dependencies it needs and adding them to this folder. So let us wait until this finishes.

You can either copy and paste the command or type it out after reading it. You can also use the copy button to grab exactly what you need.

So let me wait until this is finished.

Okay, you can see it added 321 packages and all the information is there. Once that is done, we need one more command to copy, which is this one.

I copy that as well, paste it into the terminal, press Enter, and we are basically done. Because I have already installed this on my system, it didn't install everything again.

But we are good to go now; the setup is done, and we just need to configure it.

We go into the config file. Here we have various configuration options depending on what we need.

If you are just looking for a basic default setup, you don't need to tinker with these options. If you want something more advanced, you can obviously adjust them to make the crawl much more effective, because the file explains what each option does, such as the pattern to match against the links.

I’m not going to go into this in much depth and I’m just going to go with the basic options right now.

Creating Knowledge Files from URLs to ChatGPT

So first, in order to run the crawler, we have to set up the data source in the config, and that is the important part.

We are not going to change most of this; we are interested in this particular example, where you give the URL of the site that has all the documents or articles you want to scrape. The project is created by Builder.io, and they have used their own documentation as the example.

This is their documentation site, and if you look at the format, the URL has a docs/developers path, and all the pages within it are linked from that page itself. That is how they are structured.

So here, they have given the URL and a match parameter, meaning: scrape all the URLs that match this particular part of the URL.

For example, with a match like that, it will scrape all of these pages because their URLs contain the docs path, but it will not scrape a page like this one, because it does not match our criteria.

So that is what we are setting here: a match and a selector. The selector is important because without it, as soon as the crawler loads a page, it might try to scrape everything, or it will not know which content it needs to scrape.

So what we are going to do is use Inspect Element. For example, I open the inspect element panel and look through the page until I find a section like this…

What is this? It is a class or ID that helps me identify the content I want.

I only want the content within that particular part of the page. In this case it is the docs builder container class, which they are using.

Basically, when I hover over it, it highlights all of this content for me. Let me go into a different guide and show you so you can see it.

So I know this container holds all the information I need for this scrape. These class names and IDs will differ from website to website.

So this is just an example in this case.
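
Before running a crawl, a quick way to sanity-check a selector is to try it in the browser's devtools console on one of the pages you plan to crawl. This is just a sketch: the .docs-builder-container class is what Builder.io's docs used at the time of this demo, and it will be different on your own site.

```ts
// Paste into the devtools console on a page you intend to crawl.
// ".docs-builder-container" is the class from this Builder.io example;
// swap in whatever class or ID you found with Inspect Element.
const el = document.querySelector(".docs-builder-container");
console.log(el ? el.textContent?.trim().slice(0, 300) : "selector not found on this page");
```

If this prints the article text rather than navigation or nothing at all, the selector is a good candidate for the config.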

So I copy that particular class name and put it into the selector field. Next, we give the number of pages we want to crawl.

This is a personal preference; depending on how many pages you want to crawl, you set the number here. Finally, there is the output JSON file name.

You can edit the name or keep it as it is; it will create a new file with the data called output.json. Those are the only things you need to edit if you want.
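
To make this concrete, here is roughly what the configuration looked like for this Builder.io docs demo. This is a sketch based on the example in the project's README at the time of recording (the file is config.ts in the cloned repository); field names, defaults, and the import path may have changed since, so treat it as illustrative rather than definitive.

```ts
// config.ts — illustrative configuration for the Builder.io docs example.
// Setup, as I understand the README: clone https://github.com/BuilderIO/gpt-crawler,
// run "npm i" to install dependencies, edit this file, then run "npm start".
import { Config } from "./src/config"; // import path from memory; mirror the repo's own config.ts

export const defaultConfig: Config = {
  // Page where the crawl starts.
  url: "https://www.builder.io/c/docs/developers",
  // Glob-style pattern; only links matching this get crawled,
  // so pages outside the docs path are skipped.
  match: "https://www.builder.io/c/docs/**",
  // CSS selector for the content area found with Inspect Element.
  selector: ".docs-builder-container",
  // Stop after this many pages; adjust to taste.
  maxPagesToCrawl: 50,
  // Name of the JSON knowledge file that gets written.
  outputFileName: "output.json",
};
```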

In this case, just for demo purposes, I'm going to use the same information already provided. To run the crawl, we saw the run command in the README, so I'm just going to copy and paste it, or I could type it out.

I come into the terminal, run the start command, and you can see it kicks off automatically. It shows that it's launching a Playwright crawler and starting the crawl.

They are using a headless browser to scrape the data. I set the page limit to 50, so let me wait and show the entire workflow to give you an idea of what the speed might look like.

The speed will also be affected by your own internet connection, which is something to keep in mind.

And you can see it’s automatically crawling all the different pages for me.

Created Files to Upload to ChatGPT

I wanted to show you what it has done. The crawl has now finished.

A total of 57 requests were made. I'm not sure why it went above the limit of 50, but it shows 57 total requests, 57 succeeded, and zero failed.

So we can see everything is done and we are back at the prompt. Now let me open up the project folder.

In the Explorer, you can see the existing folder from my earlier testing and the one I created just now for this video, which I'm going to use.

Inside that, I go into the GPT Crawler folder, and you can see it has created a new output.json file.

I'm going to open it up to see what the content looks like. You can see it scraped everything we asked for and prepared a really clean JSON file with the title, URL, and the HTML content with line breaks and everything.

As I scroll, you can see the full content from each page. That is how easy it is to build your own knowledge files with this.
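
For reference, each entry in the output.json file generated in this demo has roughly the shape below. The interface and the small script are just a sketch for inspecting the file locally with Node.js; the field names come from the output I got, and could differ in newer versions of the crawler.

```ts
// inspect-output.ts — a small sketch for checking the generated knowledge file.
import { readFileSync } from "node:fs";

// Rough shape of each crawled page in output.json (title, URL, extracted content).
interface CrawledPage {
  title: string;
  url: string;
  html: string;
}

const pages: CrawledPage[] = JSON.parse(readFileSync("output.json", "utf8"));
console.log(`Pages crawled: ${pages.length}`);
console.log(`First page: ${pages[0]?.title} (${pages[0]?.url})`);
```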

So what is the next step? You can use this knowledge file in multiple ways.

For example, you can create a custom GPT like this; I have other videos on my channel showing how to create one.

Inside the Knowledge section of that GPT, you can upload this output.json file and instruct the GPT to answer your queries using the knowledge available in that file.

Or you can create your own custom assistant through the OpenAI Playground and attach these files to it.

Or you can simply use the file upload feature in a ChatGPT conversation, reference this data, and ask it to answer your queries. That is how easy it is to start using GPT Crawler.
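
If you prefer to upload the knowledge file for an assistant programmatically instead of through the Playground UI, a minimal sketch with the official openai Node package might look like the following. This assumes the file-upload endpoint and the "assistants" purpose as they existed around the time of this video; the Assistants API has been evolving, so check the current OpenAI documentation before relying on it.

```ts
// upload-knowledge.ts — a sketch, not a definitive integration; verify against current OpenAI docs.
import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function main() {
  // Upload the crawler's output as a knowledge file.
  const file = await openai.files.create({
    file: fs.createReadStream("output.json"),
    purpose: "assistants",
  });
  console.log(`Uploaded knowledge file: ${file.id}`);
  // From here, attach this file ID to your assistant in the Playground
  // (or via the Assistants API) so it can answer questions from the crawled docs.
}

main().catch(console.error);
```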

I will leave all the relevant GPT links here. If you have any other questions, do let me know.

Conclusion

I only discovered this project today, so I will be exploring it further. If you would like to see more videos on AI and other digital tech and deals, do subscribe to the channel, hit the bell notification, and leave a like.

It really helps a small content creator like me. Once again, thank you so much and I hope you have a great day.
