Node.js Web Crawler from Scratch | Full Tutorial

Published 2023-01-24
I'll walk you through building a web crawler in JavaScript using Node.js and a few minimal dependencies. We'll be parsing raw HTML and following hyperlinks. It will be a simplified version of what Google does when it ingests the internet.
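
For a feel for the shape of the project before diving into the full instructions, here is a minimal sketch of the idea described above. It assumes Node 18+ (for the built-in fetch) and the jsdom package, and the function names (normalizeURL, getURLsFromHTML, crawlPage) are my own simplification of what the tutorial builds, not its exact code.

    // Minimal crawler sketch, not the tutorial's exact code.
    // Assumes Node 18+ (built-in fetch) and jsdom: npm install jsdom
    const { JSDOM } = require('jsdom')

    // Normalize URLs so http://example.com/path/ and https://example.com/path
    // count as the same page.
    function normalizeURL(urlString) {
      const url = new URL(urlString)
      return `${url.hostname}${url.pathname}`.replace(/\/$/, '')
    }

    // Extract every <a href> from an HTML string, resolving relative links
    // against the page's base URL.
    function getURLsFromHTML(htmlBody, baseURL) {
      const dom = new JSDOM(htmlBody)
      const urls = []
      for (const anchor of dom.window.document.querySelectorAll('a')) {
        const href = anchor.getAttribute('href')
        if (!href) continue
        try {
          urls.push(new URL(href, baseURL).href)
        } catch (err) {
          console.log(`skipping invalid href: ${href}`)
        }
      }
      return urls
    }

    // Recursively crawl pages on the same domain, counting how often each
    // normalized page URL is seen.
    async function crawlPage(baseURL, currentURL, pages = {}) {
      if (new URL(currentURL).hostname !== new URL(baseURL).hostname) {
        return pages
      }
      const normalized = normalizeURL(currentURL)
      if (pages[normalized] > 0) {
        pages[normalized]++
        return pages
      }
      pages[normalized] = 1

      const resp = await fetch(currentURL)
      const contentType = resp.headers.get('content-type') || ''
      if (!resp.ok || !contentType.includes('text/html')) {
        return pages
      }
      const html = await resp.text()
      for (const nextURL of getURLsFromHTML(html, baseURL)) {
        pages = await crawlPage(baseURL, nextURL, pages)
      }
      return pages
    }

    // Example entry point:
    // crawlPage('https://example.com', 'https://example.com').then(console.log)

Each fetch here is awaited one at a time, which keeps the sketch simple but makes crawling latency-bound; one of the comments below runs into exactly that.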

Full project instructions: boot.dev/build/link-analyzer

Learn back-end development: boot.dev/

All Comments (21)
  • @CodewithAbhi03
    Just awesome 😁 You not only helped me create a crawler but also taught me how to use test cases and code. Thank you so much 🥳🥳
  • @leonss2356
    That was pretty great. It helped me get more familiar with the URL class and with string manipulation and parsing in general, and it finally got me learning testing, which I had been ignoring for quite a while now.
  • This was great, I learned more than just crawling the internet... I'm experimenting with TDD with Jest. Thanks a bunch.
  • Nice video, clear sound, lots of information, and very helpful. Thank you so much for working hard on this. We need more Node.js projects like this. DC from Sudan
  • @matiassomoza8207
    OK, it works. It's actually amazing seeing it work (since, in my experience, most code tutorials on YouTube stop working at some point). I learned some Node.js (mostly Express) to make REST apps (CRUD), but that was it: a server, some routes, some controllers, Sequelize to post stuff into a Postgres database, and that's it. This is another level. I was just able to follow the tutorial, but I would be lying if I said I understood everything you did. Yes, you import some modules, you install some packages from npm, you test some functions... and it works, and I don't know how. How can I learn what you do? I know you are a backend developer, but (at least with Node.js), how did you learn all that? It's awesome, it really is.
  • @bryanarycode3417
    I ran into an issue with Jest. Everything passes with just 2 pages when sorting for the report, but with any more than 2 pages I get an error saying the output didn't match the expected value. I feel this has something to do with the a/b hits function, but I cannot for the life of me figure out what. The project works flawlessly in production; it only fails when testing with more than 2 pages in Jest. Any ideas? (Edit!) I just figured it out: for some reason it required me to put the pages in the expected variable in the exact opposite order from the input pages, and then the test passed. A bug from a recent update, perhaps? Either way, thank you for the knowledge! (A sortPages sketch and matching test follow the comments.)
  • @hsider
    Subscribed! Your other videos seem interesting; I'm checking them out soon. Nice content 👍
  • @michaelpumo83
    Brilliant video and your teaching style is very clear! Is this code available in a GitHub repo or Gist somewhere that I can use for reference at all? Thank you
  • @vinhngotrung859
    Hello, do you know of any way to create a web crawler for multiple websites with different structures?
  • @amt.7rambo670
    Bro, can this crawl all websites, even complex ones like Amazon or other e-commerce sites? Please reply, bro!
  • Thank you very much. 😊 Watched the complete video. Please post more videos like this 🤗
  • @karthiksharma6752
    @bootdotdev, for the getURLsFromHTML test, you check whether the two arrays are equal using toEqual. I'm unable to get that to pass, and I don't know why. The other thing is that dom.window.querySelectorAll("a") isn't giving me the output array; when I debugged, I found the problem to be in dom.window.querySelectorAll("a").forEach(linkElement => { ... }). I tried resolving the test error multiple times using toMatch or new Set(actual), but nothing worked. Kindly provide me with a solution; I hope there will be a reply soon. (A getURLsFromHTML sketch follows the comments.)
  • @exe.m1dn1ght
    OK, so I created a spider, and I'm crawling this website. My spider goes page by page, but it's very slow, about half a second per page. Why is that? (A concurrency sketch follows the comments.)
  • In sortPages you create aHits and bHits but don't actually use them :P ... Great tutorial, thank you. (A sketch that uses them follows the comments.)
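
A couple of the comments above mention sortPages, its unused aHits/bHits variables, and the order Jest expects. Here is a minimal sketch of a sortPages that actually uses the hit counts; the { url: hits } shape of the pages object matches what the comments describe, but the rest is an assumption rather than the tutorial's exact code.

    // Minimal sortPages sketch (an assumption, not the tutorial's exact code).
    // `pages` maps a normalized URL to the number of times it was linked to.
    function sortPages(pages) {
      return Object.entries(pages).sort((a, b) => {
        const aHits = a[1]
        const bHits = b[1]
        return bHits - aHits // higher hit counts sort first (descending)
      })
    }

    module.exports = { sortPages }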
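
On the ordering issue reported above: Jest's toEqual compares arrays element by element, so the expected array has to be written in the order sortPages returns, descending by hits, no matter how the input object happens to list its keys. A hypothetical test (the file path and values are made up for illustration):

    const { sortPages } = require('./report.js') // hypothetical module path

    test('sortPages orders pages by hit count, highest first', () => {
      const input = {
        'example.com/path': 1,
        'example.com': 5,
        'example.com/about': 3,
      }
      const actual = sortPages(input)
      // The expected entries must already be in descending-hits order,
      // even though the input object lists them in a different order.
      const expected = [
        ['example.com', 5],
        ['example.com/about', 3],
        ['example.com/path', 1],
      ]
      expect(actual).toEqual(expected)
    })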
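
On the getURLsFromHTML question above: one likely cause (an educated guess, since the comment abbreviates the code) is calling querySelectorAll on dom.window instead of dom.window.document; querySelectorAll lives on the document, not the window, so the former is undefined. A hedged sketch of the pattern with a toEqual test:

    const { JSDOM } = require('jsdom')

    // querySelectorAll must be called on the document:
    // dom.window.document.querySelectorAll('a'), not dom.window.querySelectorAll('a').
    function getURLsFromHTML(htmlBody, baseURL) {
      const dom = new JSDOM(htmlBody)
      const urls = []
      dom.window.document.querySelectorAll('a').forEach((linkElement) => {
        const href = linkElement.getAttribute('href')
        if (href) {
          urls.push(new URL(href, baseURL).href)
        }
      })
      return urls
    }

    test('getURLsFromHTML resolves relative links', () => {
      const html = '<html><body><a href="/path">Link</a></body></html>'
      const actual = getURLsFromHTML(html, 'https://example.com')
      const expected = ['https://example.com/path']
      expect(actual).toEqual(expected) // plain arrays compare element by element
    })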
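
And on the "half a second per page" question: if each page is fetched with await inside a loop, as in the sketch near the top of this page, the crawler spends almost all of its time waiting on network round trips one at a time. A rough sketch of one common fix, firing the fetches for a page's links concurrently, assuming the normalizeURL and getURLsFromHTML shapes from that earlier sketch:

    // Hypothetical module containing the earlier sketch's helpers.
    const { normalizeURL, getURLsFromHTML } = require('./crawl.js')

    async function crawlPageConcurrently(baseURL, currentURL, pages = {}) {
      if (new URL(currentURL).hostname !== new URL(baseURL).hostname) {
        return pages
      }
      const normalized = normalizeURL(currentURL)
      if (pages[normalized] > 0) {
        pages[normalized]++
        return pages
      }
      pages[normalized] = 1 // marked before the first await, so concurrent calls skip it

      const resp = await fetch(currentURL)
      const contentType = resp.headers.get('content-type') || ''
      if (!resp.ok || !contentType.includes('text/html')) {
        return pages
      }
      const html = await resp.text()

      // Crawl all child pages at the same time instead of one by one.
      await Promise.all(
        getURLsFromHTML(html, baseURL).map((nextURL) =>
          crawlPageConcurrently(baseURL, nextURL, pages)
        )
      )
      return pages
    }

Unbounded concurrency can hammer a small site, so in practice you would cap it (for example with a small worker pool or a package like p-limit) and respect robots.txt.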