What I learned today
Today I came across a fantastic tool called pup.
From GitHub:
Pup is “a command line tool for processing HTML. It reads from stdin, prints to stdout, and allows the user to filter parts of the page using CSS selectors.”
It is basically an HTML parser, and it makes scraping websites dead simple.
What I did
I used pup in a simple script that fetches all the links on a website, shows them as a list to choose from, and opens the chosen one in a browser.
Tools used
curl, pup, grep, dmenu, and the Brave browser
Walkthrough
First, let’s get a website. I’ll use mine and fetch it with curl:
curl -sL https://kshitijaucharmal.github.io/main
Now pipe that into pup with the --color flag, which just makes the output look better:
curl -sL https://kshitijaucharmal.github.io/main | pup --color
We want all the links now. As you might know, links in HTML live in the <a> tag, with the target in its href attribute, so we select every a element and print its href:
curl -sL https://kshitijaucharmal.github.io/main | pup --color 'a attr{href}'
Output
#main-content
/
/main
/blog
/
/main/
/tags/personal/
#what-this-website-is-about
https://gohugo.io/
#my-youtube-channel
https://youtube.com/@artificialcode
#online-projects
https://kshitijaucharmal.github.io/gridworld
https://kshitijaucharmal.github.io/NEAT-JS
https://narutotheboss.itch.io/bishop-challenge
#passion-projects-on-github--gitlab
https://github.com/kshitijaucharmal/2048
https://github.com/kshitijaucharmal/WaveFunctionCollapse
https://github.com/kshitijaucharmal/gridworld
https://github.com/kshitijaucharmal/GridWorld-Processing
https://github.com/kshitijaucharmal/NEAT-JS
https://github.com/kshitijaucharmal/NEAT-Algorithm
https://github.com/kshitijaucharmal/bishop-challenge
https://github.com/kshitijaucharmal/Reverse-Shell
https://github.com/PlumPeach
https://github.com/kshitijaucharmal/Genetic-Sentences
https://github.com/kshitijaucharmal/KMeans-Visualization
https://github.com/kshitijaucharmal/Lorenz-Equation
https://github.com/kshitijaucharmal/Flocking
https://github.com/kshitijaucharmal/Boids
https://www.facebook.com/sharer/sharer.php?u=https://kshitijaucharmal.github.io/main/&quote=Kshitij%27s%20website
https://twitter.com/intent/tweet/?url=https://kshitijaucharmal.github.io/main/&text=Kshitij%27s%20website
https://pinterest.com/pin/create/bookmarklet/?url=https://kshitijaucharmal.github.io/main/&description=Kshitij%27s%20website
https://reddit.com/submit/?url=https://kshitijaucharmal.github.io/main/&resubmit=true&title=Kshitij%27s%20website
mailto:?body=https://kshitijaucharmal.github.io/main/&subject=Kshitij%27s%20website
#the-top
https://gohugo.io/
https://git.io/hugo-congo
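To see what the selector does in isolation, here is a minimal sketch that feeds pup a tiny hand-written page instead of a real site. The function name demo_hrefs and the sample HTML are made up for illustration; only pup and the 'a attr{href}' selector come from the walkthrough above.

```shell
# demo_hrefs: run the selector from the walkthrough on inline sample HTML.
# The function name and the sample markup are hypothetical.
demo_hrefs() {
    printf '%s\n' \
        '<p>No link here</p>' \
        '<a href="/blog">Blog</a>' \
        '<a href="https://gohugo.io/">Hugo</a>' |
        pup 'a attr{href}'   # for each <a> element, print its href value
}
```

Calling demo_hrefs should print one line per <a> element and nothing for the <p>, which is why the real output above mixes anchors (#...), relative paths (/blog) and full URLs.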
Let’s grep out the lines starting with http to keep just the absolute links, and pipe them to dmenu:
curl -sL https://kshitijaucharmal.github.io/main | pup --color 'a attr{href}' | grep '^http' | dmenu -i -l 10
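The output above lists https://gohugo.io/ twice, so the menu will show it twice as well. A possible refinement, sketched here under the assumption that duplicates are unwanted, is to drop repeated lines while keeping the original order. The function name filter_links is hypothetical; grep and awk are the only tools it needs.

```shell
# filter_links: read hrefs on stdin, keep absolute http(s) links only,
# and drop repeats while preserving first-seen order.
# The function name is hypothetical, not part of the original script.
filter_links() {
    grep '^http' |          # same filter as in the walkthrough
        awk '!seen[$0]++'   # print a line only the first time it appears
}
```

It would slot into the pipeline in place of the plain grep, for example `... | pup 'a attr{href}' | filter_links | dmenu -i -l 10`.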
Now you can wrap the whole thing in a command substitution and ask brave to open the selection!
brave $(curl -sL https://kshitijaucharmal.github.io/main | pup --color 'a attr{href}' | grep '^http' | dmenu -i -l 10)
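One thing the one-liner does not handle: if you press Escape in dmenu, nothing is selected and brave is started with no URL. A minimal sketch of a guarded version follows, assuming you want the script to do nothing in that case. The function names list_links and open_choice are hypothetical, and --color is left out because the output goes to grep rather than a terminal (my choice, not the original's).

```shell
#!/bin/sh
# Hypothetical helper names; the pipeline itself is the one from the post.

# list_links URL: print the absolute links found on the page at URL.
list_links() {
    curl -sL "$1" | pup 'a attr{href}' | grep '^http'
}

# open_choice CHOICE [BROWSER]: open CHOICE in BROWSER (default: brave),
# or return 1 without launching anything if CHOICE is empty.
open_choice() {
    choice=$1
    browser=${2:-brave}
    if [ -z "$choice" ]; then
        echo "no link selected" >&2
        return 1
    fi
    "$browser" "$choice"
}
```

Usage would look like `open_choice "$(list_links https://kshitijaucharmal.github.io/main | dmenu -i -l 10)"`. Quoting the substitution keeps URLs containing characters such as & or ? as a single argument.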
That’s it.